GNOME Bugzilla – Bug 565868
Encoding auto-detection does not work for Cyrillic
Last modified: 2010-04-05 11:38:05 UTC
Problematic files attached.
Created attachment 125430 [details] problematic texts
What is the correct encoding for these files? ISO-8859-5 or another? Cyrillic is not an encoding per se; it is a writing system. There are many ways to encode Cyrillic, including Unicode (UTF-8) and others.
This is Windows encoding. Gedit does not recognize texts written in Windows.
I am not a developer, but is it reasonable to ask gedit to support Windows text encodings? This text file contains no information that identifies the character encoding used. The command-line tool 'file' identifies it thus:
ReadMe.txt: Non-ISO extended-ASCII text, with very long lines, with CRLF line terminators
The vim text editor is also confused by this file and displays incorrect characters for the most part. Using Firefox, I must select the encoding manually, and the correct encoding appears to be Windows-1251. I think this displays the Cyrillic characters correctly.
My understanding is that this is one of the key limitations of raw text files: there is no way to tell the application which character encoding is being used. This is why more advanced formats like XML or HTML require the encoding to be identified.
My final thought is: does gedit support any non-UTF-8 encodings? If not, this bug may never be resolved.
Gedit should either ask for the proper encoding or determine it automatically. Right now you open a file, see garbage, and have no option to choose the encoding manually. Other editors allow that, but gedit says an encoding chooser is not necessary because it has auto-detection.
OK, I guess this needs some explanation, because I see some pretty wild guesses here about what gedit does or does not do. When opening a file, gedit tries to autodetect the file's encoding. What this means is that it has a list of encodings it tries to convert from, and the first one that results in valid UTF-8 is picked. The list of 'autodetected' encodings can be found in the /apps/gedit-2/preferences/encodings/auto_detected gconf key. As Jonathan points out, there is not really a better way to find out the encoding. We cannot simply try all possible encodings when opening a file, so we set the list to a sane default (which at the moment is: UTF-8, CURRENT, ISO-8859-15, UTF-16). So, to solve your problem, you'll have to select the proper encoding in the file load dialog.
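For illustration, the try-each-candidate approach described above can be sketched in Python. This is only a sketch under stated assumptions: the function name detect_encoding is hypothetical, Python's built-in codecs stand in for the conversion gedit performs, and CURRENT (the locale encoding) is simply left out of the candidate lists. The codecs may not accept or reject exactly the same byte sequences as gedit's conversion does.

```python
from typing import Optional, Sequence


def detect_encoding(data: bytes, candidates: Sequence[str]) -> Optional[str]:
    """Return the first candidate encoding that decodes `data` without error.

    Hypothetical helper: it mimics "try each encoding in order and keep the
    first one that converts cleanly", not gedit's actual code.
    """
    for name in candidates:
        try:
            data.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Invalid byte sequence for this encoding, or unknown codec name.
            continue
        return name
    return None


if __name__ == "__main__":
    # Sample Cyrillic text stored as Windows-1251 bytes.
    sample = "Привет, мир".encode("cp1251")

    # Default-style list from the comment above, without CURRENT.
    print(detect_encoding(sample, ["utf-8", "iso8859-15", "utf-16"]))

    # A list that puts Windows-1251 ahead of the other Cyrillic encodings.
    print(detect_encoding(sample, ["utf-8", "cp1251", "koi8-r", "iso8859-5"]))
```

With Python's codecs, every byte value is valid in ISO-8859-15, so a list that reaches it before any Cyrillic encoding accepts the file and shows the wrong characters. That is consistent with the garbage reported in this bug.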
This is an absolutely insane default, because almost all Russian-language documents on the Internet and anywhere else are encoded in cp1251. This means that people need to select the encoding manually when opening any text file. But as you know, there is no encoding-selection dialog when double-clicking a text file, and no way to choose the encoding after the file has been opened.
Please note that the list of 'autodetected' encodings can vary according to the user's locale. So, if you think that the current default is insane for users with a Russian locale, please contact the Russian i18n team and ask them to set a better default for the Russian locale. Thanks for the collaboration.
I guess the team's activity is part of the Gnome project?
(In reply to comment #9)
> I guess the team's activity is part of the Gnome project?
Yes, it is. This is the page for the teams: http://l10n.gnome.org/teams/
Apart from that, closing this as not a bug.
So Gedit can only recognize one encoding - that of the locale? In Russia most documents are in Windows encoding but it is reasonable to suppose that on a Linux computer most docs would be in the Linux encoding. So the user of Linux in Russia may equally frequently meet files in Windows and Linux encoding.
Personally I like the suggestion in bug 594410. You can test this by modifying the gconf values with gconf-editor:
/apps/gedit-2/preferences/encodings/auto_detected
and
/apps/gedit-2/preferences/encodings/shown_in_menu
I wonder if the following change fixes your problem.
--- a/po/ru.po
+++ b/po/ru.po
@@ -523,7 +523,7 @@ msgstr "[CP866,IBM855,ISO-8859-5,KOI8R,WINDOWS-1251]"
 #. a list of supported encodings
 #: ../data/gedit.schemas.in.h:105
 msgid "[UTF-8,CURRENT,ISO-8859-15,UTF-16]"
-msgstr "[UTF-8,CURRENT,KOI8R,WINDOWS-1251,ISO-8859-5]"
+msgstr "[CURRENT,WINDOWS-1251,ISO-8859-5,KOI8R,UTF-8]"
 #: ../gedit/dialogs/gedit-close-confirmation-dialog.c:140
 msgid "Logout _without Saving"
As this seems to be a translation bug, or something that can be changed in the gconf schema, I'm closing this as NOTABUG. Feel free to reopen if you think it is a bug and you can provide more info about it.
Even if this is a gconf schema problem, it is a very annoying bug that affects every Russian user. And suggesting that every user recompile gedit themselves is not a good idea.
(In reply to comment #14)
> And suggestion to recompile gedit for any user themselves is not a good idea.
Why don't you upstream the modified ru.po? The problem is that the Russian encodings use overlapping code points, so I think changing the encoding order is the realistic fix.
> You can try the test to modify the gconf value with gconf-editor.
> /apps/gedit-2/preferences/encodings/auto_detected and
> /apps/gedit-2/preferences/encodings/shown_in_menu
I just looked at these keys and found that Windows-1251 is already included in the auto-detection list. If so, I wonder why gedit fails to auto-detect it and stays with an incorrect encoding.
Also there is no encoding-chooser in the menu despite the abovementioned key.
(In reply to comment #16)
> I just looked to these keys and found that Windows-1251 already included in the
> auto-detection list. I wonder why if so, Gedit fails to auto-detect it and
> stays with incorrect encoding.
I think you need to change the encoding order, even though the encoding is already in the list.
> Also there is no encoding-chooser in the menu despite the abovementioned key.
Yes, currently there is no GUI for the auto-detect option.
> I think you need to modify the encoding order although the encoding already exists.
Probably gedit gets a false positive on some other encoding earlier in the list, such as KOI-8, and does not check anything later.
(In reply to comment #19)
> Probably Gedit gets false-positive on some other encoding earlier in the list,
> such as KOI-8 and does not check anything later.
AFAIK, gedit just checks the return value of iconv. You could check whether your text can be converted with iconv from another encoding to UTF-8. As I noted, some Russian encodings use overlapping code points, so if your text contains only a few characters, accuracy may be lost. That's why I think changing the encoding order is the realistic fix, and I guess most test cases with Russian text would then succeed. If iconv fails but gedit still shows the text using the failed encoding, I agree it's a gedit bug. Otherwise I think there is probably no way to fix it on the gedit side.
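The ambiguity described above can be checked with a short Python sketch. It is a rough stand-in for the iconv check, not the check itself: the helper name decodings is hypothetical, and Python's codecs are assumed to behave closely enough to iconv for this sample text.

```python
from typing import Dict, Iterable


def decodings(data: bytes, names: Iterable[str]) -> Dict[str, str]:
    """Map each encoding that accepts `data` to the text it produces.

    Hypothetical helper: encodings that reject the bytes are left out.
    """
    results: Dict[str, str] = {}
    for name in names:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding.
            pass
    return results


if __name__ == "__main__":
    # Sample Cyrillic text stored as Windows-1251 bytes.
    sample = "Привет".encode("cp1251")
    for name, text in decodings(
        sample, ["utf-8", "cp1251", "koi8-r", "iso8859-5"]
    ).items():
        print(name, text)
```

For this sample, more than one single-byte Cyrillic encoding converts without error, each producing different text, so a success-or-failure check alone cannot tell them apart. Only the order of the list decides which one is used.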
Since the following command fixes the problem:
/usr/bin/gconftool-2 --direct --config-source=xml:readwrite:/etc/gconf/gconf.xml.defaults -s -t list --list-type=string /apps/gedit-2/preferences/encodings/auto_detected "[UTF-8,CURRENT,WINDOWS-1251,KOI8R,ISO-8859-5]"
I changed the gedit translation in gnome-2-28 and HEAD.
Thanks Leonid, Closing this as FIXED. Feel free to reopen if there are still problems with this.