Bug 598818 – File loading: keep e.g. 5MB of data for better encoding detection




Summary:	File loading: keep e.g. 5MB of data for better encoding detection


Status:	RESOLVED FIXED

Product:	gtksourceview
Classification:	Platform
Component:	File loading and saving
Version:	unspecified
Hardware:	Other Linux

Importance:	Normal enhancement
Target Milestone:	---
Assigned To:	GTK Sourceview maintainers
QA Contact:	GTK Sourceview maintainers

URL:
Whiteboard:

Duplicates:	633391 (view as bug list)
Depends on:
Blocks:

Reported:	2009-10-18 01:11 UTC by Ilya Chernykh
Modified:	2016-12-07 12:19 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
The file (150.39 KB, application/zip) 2009-10-18 01:13 UTC, Ilya Chernykh	Details
Another file whose encoding is not detected (187.76 KB, text/plain) 2010-03-31 03:43 UTC, Jean-Philippe Fleury	Details

Description Ilya Chernykh 2009-10-18 01:11:15 UTC

The file is attached.

Comment 1 Ilya Chernykh 2009-10-18 01:13:02 UTC

Created attachment 145709 [details]
The file

Comment 2 Jean-Philippe Fleury 2010-03-31 03:43:59 UTC

Created attachment 157554 [details]
Another file whose encoding is not detected

Comment 3 Jean-Philippe Fleury 2010-03-31 03:45:16 UTC

Comment on attachment 157554 [details]
Another file whose encoding is not detected

On gedit 2.30.0

Comment 4 jessevdk@gmail.com 2010-03-31 07:27:54 UTC

To give an idea of what the problem is: the new encoding detection only tries to guess the encoding from the first block that is read from a file. At the moment, I think this is 8192 bytes. In these files the first block is ambiguous in terms of encodings (I think it's just ASCII), so the detection settles on the first encoding it tries (UTF-8). Later on it encounters other characters that are in ISO-8859-15 (or something similar), and errors out.

I think we should redo or continue the encoding detection at that point until we have exhausted the list of encodings again, rather than just erroring out.
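
The failure mode described above can be reproduced in a few lines. This is a minimal Python sketch, not gedit's actual code: the 8192-byte block size comes from the comment, while `guess_from_first_block` and the candidate list are hypothetical names chosen for illustration.

```python
BLOCK_SIZE = 8192  # first-block size mentioned in the comment above
CANDIDATES = ["utf-8", "iso-8859-15"]  # assumed candidate order, UTF-8 first


def guess_from_first_block(data: bytes) -> str:
    """Return the first candidate that decodes the first block cleanly."""
    first_block = data[:BLOCK_SIZE]
    for encoding in CANDIDATES:
        try:
            first_block.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding fits the first block")


# A file whose first block is pure ASCII, followed by an ISO-8859-15 byte.
data = b"a" * BLOCK_SIZE + b"T\xf6st\n"

guessed = guess_from_first_block(data)  # ASCII is valid UTF-8, so UTF-8 wins

try:
    data.decode(guessed)
    failed_later = False
except UnicodeDecodeError:
    failed_later = True  # the byte 0xf6 cannot appear in valid UTF-8
```

Here the guess is made before the only non-ASCII byte has been seen, so the later conversion error is unavoidable unless detection is allowed to restart.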

Comment 5 jessevdk@gmail.com 2010-05-14 07:45:56 UTC

Maybe we could store the first N MB of a file in memory and if a conversion error occurs in these first N MB, we go back to the list of encodings and try to reconvert these bytes. This way we do not have to seek or reopen the stream, and can more fairly detect the right encoding. For example, 2 or 5 MB of temporary memory when loading a file does not seem too bad.
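
The buffering idea above could look roughly like this. It is a sketch under stated assumptions rather than the loader's real design: `load_with_retry`, the candidate list, and the chunk size are hypothetical, and the stream is assumed to be a buffered binary stream whose `read(n)` returns `n` bytes unless the end of the file is reached.

```python
import codecs
import io

KEEP_BYTES = 5 * 1024 * 1024  # buffer size floated in the comment (2 or 5 MB)
CHUNK_SIZE = 8192  # assumed read size for the rest of the stream
CANDIDATES = ["utf-8", "iso-8859-15"]  # assumed candidate list


def load_with_retry(stream, candidates=CANDIDATES, keep_bytes=KEEP_BYTES):
    """Decode a stream, trying the next candidate on an early error.

    The first `keep_bytes` bytes stay in memory, so a retry re-decodes
    them instead of seeking or reopening the stream.
    """
    head = stream.read(keep_bytes)
    chunk = stream.read(CHUNK_SIZE)  # empty if the whole file fit in `head`
    for encoding in candidates:
        decoder = codecs.getincrementaldecoder(encoding)()
        try:
            parts = [decoder.decode(head, final=not chunk)]
        except UnicodeDecodeError:
            continue  # error inside the kept bytes: try the next candidate
        # Past the kept bytes there is no way back without seeking, so an
        # error from here on propagates to the caller.
        while chunk:
            parts.append(decoder.decode(chunk))
            chunk = stream.read(CHUNK_SIZE)
        parts.append(decoder.decode(b"", final=True))
        return encoding, "".join(parts)
    raise ValueError("no candidate encoding matched the buffered bytes")


data = b"a" * 10000 + b"T\xf6st\n"
encoding, text = load_with_retry(io.BytesIO(data))
```

With the default 5 MB window the sample falls back to the second candidate. With a window smaller than the offset of the first non-ASCII byte, the error surfaces after the retry window has closed, which is the trade-off in choosing N.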

Comment 6 jessevdk@gmail.com 2010-10-28 22:13:26 UTC

*** Bug 633391 has been marked as a duplicate of this bug. ***

Comment 7 Oliver Joos 2010-11-14 02:15:22 UTC

Since 2.30 I also have text files whose encoding is no longer detected automatically. One example is a Python source file of around 12 kB, correctly encoded in ISO-8859-15.

I was able to reproduce the problem with short files and found that only a few chars can make the difference:

python -c "f=file('iso-8859-15_ok.txt', 'w'); f.write('W\xf6rld\n\n'); f.close()"

python -c "f=file('iso-8859-15_failing.txt', 'w'); f.write('T\xf6st\n\n'); f.close()"

gedit 2.28.0-0ubuntu2 (Ubuntu 9.10) is able to automatically detect both files, whereas gedit 2.30.0git20100413-0ubuntu1 (Ubuntu 10.04.1) and 2.30.4-1.fc14.i686 (Fedora 14) are not.
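
The one-liners above rely on Python 2 (`file()` and byte-valued string literals). A rough Python 3 equivalent, offered as a sketch, writes the same bytes and checks how each file fares under UTF-8 and ISO-8859-15. The helper `is_valid` and the temporary directory are illustrative additions that keep the example self-contained.

```python
import tempfile
from pathlib import Path

# Same byte content as the two Python 2 one-liners.
SAMPLES = {
    "iso-8859-15_ok.txt": b"W\xf6rld\n\n",
    "iso-8859-15_failing.txt": b"T\xf6st\n\n",
}


def is_valid(data: bytes, encoding: str) -> bool:
    """Report whether `data` decodes without error in `encoding`."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


results = {}
with tempfile.TemporaryDirectory() as tmp:
    for name, payload in SAMPLES.items():
        path = Path(tmp) / name
        path.write_bytes(payload)  # binary write keeps 0xf6 as a single byte
        data = path.read_bytes()
        # Store (valid as UTF-8, valid as ISO-8859-15) for each file.
        results[name] = (is_valid(data, "utf-8"), is_valid(data, "iso-8859-15"))
```

Both files are invalid as UTF-8 and valid as ISO-8859-15, so whatever distinguishes them in gedit is not plain UTF-8 validity. The sketch does not attempt to model that part of the detection.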

Note that once you choose the encoding manually, gedit will remember this! You then have to rename the file to reproduce the problem.

@Jesse: please consider seeking again or reopening the stream, because the time wasted this way will be less than that of a user who has to choose the encoding manually. I agree that using a few MBytes of memory would not be a problem, but it's hard to decide how many MBytes would be optimal.

Comment 8 Oliver Joos 2011-10-26 21:49:21 UTC

I just checked gedit 3.20. It is still affected by this bug!

Comment 9 Ignacio Casal Quinteiro (nacho) 2011-10-27 07:39:35 UTC

Have you tried selecting the right encoding to see if it is opened?

Comment 10 Oliver Joos 2011-10-27 10:54:28 UTC

Yes. As said in the description, choosing it manually works and is even remembered for each filename. This report is about a bug in the auto-detection of encodings.

I have not yet looked at how the gedit code implements auto-detection. The library python-chardet looks like a good fit for that. Anyway, I think if "Wörld" is auto-detected correctly then "Töst" should be too!

Comment 11 Sébastien Wilmet 2014-08-20 21:21:16 UTC

Still present with the implementation in gtksourceview.

Comment 12 Sébastien Wilmet 2016-12-07 12:19:38 UTC

I'm implementing a new file loader in Gtef:
https://github.com/swilmet/gtef

Once it's finished and stable in Gtef, the code can be moved to GtkSourceView.

The new file loader keeps all the file content in a list of GBytes, and uses uchardet to determine the encoding, so it should normally provide much better results.