GNOME Bugzilla – Bug 598818
File loading: keep e.g. 5MB of data for better encoding detection
Last modified: 2016-12-07 12:19:38 UTC
The file is attached.
Created attachment 145709 [details] The file
Created attachment 157554 [details] Another file whose encoding is not detected
Comment on attachment 157554 [details] Another file whose encoding is not detected On gedit 2.30.0
To give an idea of what the problem is: the new encoding detection only tries to guess the encoding from the first block that is read from a file. At the moment, this is 8192 bytes, I think. This first block seems to be ambiguous in terms of encodings (I think it's just ASCII), so it settles on the first encoding it tries (UTF-8). Later on it encounters other characters which are in ISO-8859-15 (or something similar), and errors out. I think we should redo or continue the encoding detection at that point until we have exhausted the list of encodings again, rather than just erroring out.
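A minimal Python sketch of the failure mode described above, not the actual gedit code. The 8192-byte block size is the figure guessed in the comment; the candidate list, its order, and the function name `detect_from_first_block` are assumptions made for illustration.

```python
BLOCK_SIZE = 8192  # figure quoted in the comment above; not verified against gedit
CANDIDATES = ["utf-8", "iso-8859-15"]  # hypothetical candidate order


def detect_from_first_block(data: bytes) -> str:
    """Return the first candidate that decodes the first block without error."""
    first_block = data[:BLOCK_SIZE]
    for encoding in CANDIDATES:
        try:
            first_block.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    raise ValueError("no candidate encoding matches the first block")


# A first block of pure ASCII, followed by one ISO-8859-15 character (0xF6).
data = b"a" * BLOCK_SIZE + b"W\xf6rld\n"

chosen = detect_from_first_block(data)

# The guess was made on ambiguous data, so decoding the whole file fails later.
try:
    data.decode(chosen)
    failed_later = False
except UnicodeDecodeError:
    failed_later = True

print(chosen, failed_later)
```

The first block is valid in every candidate encoding, so the order of the list decides the result, and the error only surfaces once the non-ASCII byte is reached.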
Maybe we could store the first N MB of a file in memory, and if a conversion error occurs within these first N MB, we go back to the list of encodings and try to reconvert these bytes. This way we do not have to seek or reopen the stream, and can detect the right encoding more reliably. For example, 2 or 5 MB of temporary memory when loading a file does not seem too bad.
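The proposal above could look roughly like the following Python sketch. It is an illustration under stated assumptions, not a patch for gedit: the function name `load_text`, the parameters `keep_bytes` and `block_size`, and the candidate list are all hypothetical.

```python
import codecs
import io


def load_text(stream, candidates, keep_bytes=5 * 1024 * 1024, block_size=8192):
    """Decode a byte stream, falling back to the next candidate encoding
    if a conversion error occurs while the raw bytes are still retained."""
    kept = bytearray()  # raw bytes retained for re-decoding; None once dropped
    index = 0           # position of the encoding currently in use
    decoder = codecs.getincrementaldecoder(candidates[index])()
    parts = []
    while True:
        block = stream.read(block_size)
        at_eof = not block
        if kept is not None:
            kept.extend(block)
        try:
            parts.append(decoder.decode(block, final=at_eof))
        except UnicodeDecodeError:
            if kept is None:
                raise  # error past the retained window: nothing left to retry
            # Re-decode everything read so far with the remaining candidates.
            for index in range(index + 1, len(candidates)):
                decoder = codecs.getincrementaldecoder(candidates[index])()
                try:
                    parts = [decoder.decode(bytes(kept), final=at_eof)]
                    break
                except UnicodeDecodeError:
                    continue
            else:
                raise UnicodeError("no candidate decodes the retained bytes")
        if at_eof:
            return "".join(parts), candidates[index]
        if kept is not None and len(kept) > keep_bytes:
            kept = None  # stop retaining; later errors can no longer be retried


# The error occurs in the second block, well inside the retained window.
data = b"a" * 10000 + b"T\xf6st\n"
text, encoding = load_text(io.BytesIO(data), ["utf-8", "iso-8859-15"])
print(encoding)
```

No seek or reopen is needed because the retry works on the retained bytes only. The trade-off raised in the thread remains: an error beyond `keep_bytes` still fails, so the right window size is a judgment call.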
*** Bug 633391 has been marked as a duplicate of this bug. ***
Since 2.30 I also have text files that are no longer detected automatically. One example is a Python source file of around 12 kBytes, correctly encoded in ISO-8859-15. I was able to reproduce the problem with short files and found that only a few characters can make the difference:

    python -c "f=file('iso-8859-15_ok.txt', 'w'); f.write('W\xf6rld\n\n'); f.close()"
    python -c "f=file('iso-8859-15_failing.txt', 'w'); f.write('T\xf6st\n\n'); f.close()"

gedit 2.28.0-0ubuntu2 (Ubuntu 9.10) is able to automatically detect both files, whereas gedit 2.30.0git20100413-0ubuntu1 (Ubuntu 10.04.1) and 2.30.4-1.fc14.i686 (Fedora 14) are not. Note that once you choose the encoding manually, gedit will remember this! You then have to rename the file to reproduce the problem.

@Jesse: please consider seeking again or reopening the stream, because the time wasted this way will be less than that of a user who has to choose the encoding manually. I agree that wasting some MBytes would not be a problem, but it's hard to decide how many MBytes would be optimal.
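The commands above use the Python 2 `file()` builtin. A Python 3 equivalent for regenerating the two test files might look like this; the file names and byte contents come from the commands above, while the helper name `write_samples` and the use of a temporary directory are assumptions.

```python
import os
import tempfile

# Same names and payloads as in the commands above (0xF6 is "ö" in ISO-8859-15).
SAMPLES = {
    "iso-8859-15_ok.txt": b"W\xf6rld\n\n",
    "iso-8859-15_failing.txt": b"T\xf6st\n\n",
}


def write_samples(directory):
    """Write both sample files in binary mode and return their paths."""
    paths = []
    for name, payload in SAMPLES.items():
        path = os.path.join(directory, name)
        with open(path, "wb") as handle:
            handle.write(payload)
        paths.append(path)
    return paths


def is_valid_utf8(payload):
    """Report whether the bytes decode cleanly as UTF-8."""
    try:
        payload.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


with tempfile.TemporaryDirectory() as directory:
    for path in write_samples(directory):
        with open(path, "rb") as handle:
            payload = handle.read()
        print(os.path.basename(path), is_valid_utf8(payload),
              repr(payload.decode("iso-8859-15")))
```

Neither file is valid UTF-8 and both decode cleanly as ISO-8859-15, so a detector that rejects UTF-8 and moves on to the next candidate should treat them identically.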
I just checked gedit 3.20. It is still affected by this bug!
Have you tried selecting the right encoding to see if it is opened?
Yes. As said in the description, choosing it manually works and is even remembered for each filename. This report is about a bug in the auto-detection of encodings. I have not yet looked at how auto-detection is implemented in the gedit code. The python-chardet library looks like a good fit for that. Anyway, I think if "Wörld" is auto-detected correctly then "Töst" should be too!
Still present with the implementation in gtksourceview.
I'm implementing a new file loader in Gtef: https://github.com/swilmet/gtef Once it's finished and stable in Gtef, the code can be moved to GtkSourceView. The new file loader keeps all the file content in a list of GBytes and uses uchardet to determine the encoding, so it should normally provide much better results.
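The design described above (keep every chunk, then detect on the complete content) can be sketched in Python as follows. This is not Gtef's code: the function names are hypothetical, a plain list of `bytes` stands in for the list of GBytes, and `first_decodable` is a simple stand-in for a real detector such as uchardet.

```python
import io


def read_chunks(stream, block_size=8192):
    """Read the whole stream and keep every block, in order."""
    chunks = []
    while True:
        block = stream.read(block_size)
        if not block:
            return chunks
        chunks.append(block)


def first_decodable(data, candidates=("utf-8", "iso-8859-15")):
    """Hypothetical stand-in for a statistical detector such as uchardet."""
    for encoding in candidates:
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return None


def load(stream, detect=first_decodable):
    """Detect the encoding from the full content, then decode once."""
    data = b"".join(read_chunks(stream))
    encoding = detect(data)
    if encoding is None:
        raise ValueError("could not determine the encoding")
    return data.decode(encoding), encoding


# The non-ASCII byte sits far beyond the first block but is still seen.
content, detected = load(io.BytesIO(b"x" * 20000 + b"T\xf6st\n"))
print(detected)
```

Because detection runs on the whole file rather than the first block, the position of the first non-ASCII character no longer matters, at the cost of holding the raw content in memory.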