Bug 598818 - File loading: keep e.g. 5MB of data for better encoding detection
Status: RESOLVED FIXED
Product: gtksourceview
Classification: Platform
Component: File loading and saving
Version: unspecified
Hardware: Other Linux
Importance: Normal enhancement
Target Milestone: ---
Assigned To: GTK Sourceview maintainers
QA Contact: GTK Sourceview maintainers
Duplicates: 633391
Depends on:
Blocks:
 
 
Reported: 2009-10-18 01:11 UTC by Ilya Chernykh
Modified: 2016-12-07 12:19 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
The file (150.39 KB, application/zip)
2009-10-18 01:13 UTC, Ilya Chernykh
Another file whose encoding is not detected (187.76 KB, text/plain)
2010-03-31 03:43 UTC, Jean-Philippe Fleury

Description Ilya Chernykh 2009-10-18 01:11:15 UTC
The file is attached.
Comment 1 Ilya Chernykh 2009-10-18 01:13:02 UTC
Created attachment 145709 [details]
The file
Comment 2 Jean-Philippe Fleury 2010-03-31 03:43:59 UTC
Created attachment 157554 [details]
Another file whose encoding is not detected
Comment 3 Jean-Philippe Fleury 2010-03-31 03:45:16 UTC
Comment on attachment 157554 [details]
Another file whose encoding is not detected

On gedit 2.30.0
Comment 4 jessevdk@gmail.com 2010-03-31 07:27:54 UTC
Just to have an idea of what the problem is, the new encoding detection only tries to guess the encoding from the first block that is read from a file. At the moment, this is 8192 bytes I think. This first block seems to be ambiguous in terms of encodings (I think it's just ASCII), so it settles on the first encoding it tries (UTF-8). Then later on it gets some other characters which are in ISO-8859-15 (or something similar), and errors out.

I think we should redo or continue the encoding detection at that point, working through the rest of the list of encodings until it is exhausted, instead of just erroring out.
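The failure mode described in this comment can be reproduced in a few lines. This is a minimal Python sketch, not gedit's code: the 8192-byte block size is the figure guessed in the comment, while the function name `guess_from_first_block`, the candidate list, and the sample data are assumptions made up for illustration.

```python
BLOCK_SIZE = 8192  # block size guessed in the comment; not verified against gedit
CANDIDATES = ["utf-8", "iso-8859-15"]  # hypothetical candidate order


def guess_from_first_block(data: bytes) -> str:
    """Return the first candidate that decodes the first block without error."""
    first_block = data[:BLOCK_SIZE]
    for encoding in CANDIDATES:
        try:
            first_block.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding fits the first block")


# Pure ASCII for more than one block, then a single ISO-8859-15 byte (0xF6).
sample = b"a" * 10000 + b"W\xf6rld\n"

# The ASCII-only first block is accepted as UTF-8.
guessed = guess_from_first_block(sample)

try:
    sample.decode(guessed)
    later_error = False
except UnicodeDecodeError:
    # The byte after the first block is not valid UTF-8.
    later_error = True
```

The guess is made before any non-ASCII byte has been seen, so a later conversion error is the first sign that the guess was wrong.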
Comment 5 jessevdk@gmail.com 2010-05-14 07:45:56 UTC
Maybe we could store the first N MB of a file in memory, and if a conversion error occurs in these first N MB, we go back to the list of encodings and try to reconvert these bytes. This way we do not have to seek or reopen the stream, and can more reliably detect the right encoding. For example, 2 or 5 MB of temporary memory when loading a file does not seem too bad.
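A rough Python sketch of this proposal follows. It is an illustration under stated assumptions, not the gedit or GtkSourceView implementation: `load_with_retry`, `BUFFER_LIMIT`, and the candidate list are hypothetical names, and 5 MB is just one of the values floated above.

```python
import codecs
import io

BUFFER_LIMIT = 5 * 1024 * 1024  # the "N MB" from the comment; 5 MB is one suggestion
CANDIDATES = ["utf-8", "iso-8859-15"]  # hypothetical candidate order


def load_with_retry(stream, chunk_size=8192, buffer_limit=BUFFER_LIMIT):
    """Decode a byte stream, restarting with the next encoding on error.

    Raw bytes are kept only while they fit in buffer_limit. An error after
    that point is re-raised, since a retry would need a seek or a reopen.
    """
    kept = []               # raw chunks kept for a possible retry
    kept_size = 0
    buffer_complete = True  # False once the file outgrows the buffer
    candidates = iter(CANDIDATES)
    encoding = next(candidates)
    decoder = codecs.getincrementaldecoder(encoding)()
    text = []

    while True:
        chunk = stream.read(chunk_size)
        final = not chunk
        if chunk and buffer_complete:
            kept.append(chunk)
            kept_size += len(chunk)
            if kept_size > buffer_limit:
                kept.clear()
                buffer_complete = False
        pending = chunk
        while True:
            try:
                text.append(decoder.decode(pending, final))
                break
            except UnicodeDecodeError:
                if not buffer_complete:
                    raise
                encoding = next(candidates, None)
                if encoding is None:
                    raise
                # Start over: reconvert everything read so far.
                decoder = codecs.getincrementaldecoder(encoding)()
                text = []
                pending = b"".join(kept)
        if final:
            return "".join(text), encoding
```

For example, `load_with_retry(io.BytesIO(b"a" * 10000 + b"T\xf6st\n"))` first tries UTF-8, hits the invalid byte in the second chunk, and reconverts the buffered bytes as ISO-8859-15. The trade-off raised in the comment is visible in `buffer_limit`: an error past that point cannot be recovered without going back to the stream.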
Comment 6 jessevdk@gmail.com 2010-10-28 22:13:26 UTC
*** Bug 633391 has been marked as a duplicate of this bug. ***
Comment 7 Oliver Joos 2010-11-14 02:15:22 UTC
Since 2.30 I also have text files whose encoding is no longer detected automatically. One example is a Python source file of around 12 kB, correctly encoded in ISO-8859-15.

I was able to reproduce the problem with short files and found that only a few chars can make the difference:

python -c "f=open('iso-8859-15_ok.txt', 'wb'); f.write(b'W\xf6rld\n\n'); f.close()"

python -c "f=open('iso-8859-15_failing.txt', 'wb'); f.write(b'T\xf6st\n\n'); f.close()"

gedit 2.28.0-0ubuntu2 (Ubuntu 9.10) is able to automatically detect both files, whereas gedit 2.30.0git20100413-0ubuntu1 (Ubuntu 10.04.1) and 2.30.4-1.fc14.i686 (Fedora 14) are not.

Note that once you choose the encoding manually, gedit will remember this! You then have to rename the file to reproduce the problem.

@Jesse: please consider seeking again or reopening the stream, because the time wasted this way will be less than that of a user who has to choose the encoding manually. I agree that wasting some MBytes would not be a problem, but it's hard to decide how many MBytes would be optimal.
Comment 8 Oliver Joos 2011-10-26 21:49:21 UTC
I just checked gedit 3.20. It is still affected by this bug!
Comment 9 Ignacio Casal Quinteiro (nacho) 2011-10-27 07:39:35 UTC
Have you tried selecting the right encoding to see if it is opened?
Comment 10 Oliver Joos 2011-10-27 10:54:28 UTC
Yes. As said in the description, choosing it manually works and is even remembered for each filename. This report is about a bug in the auto-detection of encodings.

I have not yet looked at how the gedit code implements auto-detection. The library python-chardet looks like a good fit for that. Anyway, I think if "Wörld" is auto-detected correctly then "Töst" should be too!
Comment 11 Sébastien Wilmet 2014-08-20 21:21:16 UTC
Still present with the implementation in gtksourceview.
Comment 12 Sébastien Wilmet 2016-12-07 12:19:38 UTC
I'm implementing a new file loader in Gtef:
https://github.com/swilmet/gtef

Once it's finished and stable in Gtef, the code can be moved to GtkSourceView.

And the new file loader keeps all the file content in a list of GBytes, and uses uchardet to determine the encoding. So it will normally provide much better results.
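The design described here (keep all content as a list of chunks, then run a detector over it) can be sketched in Python as follows. This is not Gtef code: `load_file` and `trial_decode_detector` are hypothetical names, and the trial-decoding function is only a crude stand-in for a real detector such as uchardet, whose API is not used here.

```python
import io


def trial_decode_detector(data: bytes) -> str:
    """Crude stand-in for a real detector such as uchardet (hypothetical)."""
    for encoding in ("utf-8", "iso-8859-15"):
        try:
            data.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding fits")


def load_file(stream, detector=trial_decode_detector, chunk_size=8192):
    """Read everything into a list of chunks, then detect and decode once."""
    chunks = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        chunks.append(chunk)
    content = b"".join(chunks)
    encoding = detector(content)  # the detector sees the whole file
    return content.decode(encoding), encoding
```

Because detection runs only after the whole file has been read, a non-ASCII byte far into the file can no longer surprise a guess that was made from the first block alone.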