Bug 565868 – Encoding auto-detection does not work for Cyrillic




Summary:	Encoding auto-detection does not work for Cyrillic


Status:	RESOLVED FIXED

Product:	gedit
Classification:	Applications
Component:	general
Version:	2.24.x
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Gedit maintainers
QA Contact:	Gedit maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2008-12-28 15:09 UTC by Ilya Chernykh
Modified:	2010-04-05 11:38 UTC

See Also:
GNOME target:	---
GNOME version:	2.23/2.24

Attachments
problematic texts (7.83 KB, application/zip) 2008-12-28 15:09 UTC, Ilya Chernykh

Description Ilya Chernykh 2008-12-28 15:09:00 UTC

Please describe the problem:
Problematic files attached.

Steps to reproduce:
1. 
2. 
3. 


Actual results:


Expected results:


Does this happen every time?


Other information:

Comment 1 Ilya Chernykh 2008-12-28 15:09:42 UTC

Created attachment 125430
problematic texts

Comment 2 Jonathan Stewart 2009-03-09 20:46:29 UTC

What is the correct encoding for these files?  ISO-8859-5 or another?

Cyrillic is not an encoding per se; it's a writing system. There are many ways to encode Cyrillic, including Unicode (UTF-8) and others.

Comment 3 Ilya Chernykh 2009-03-09 23:54:56 UTC

This is the Windows encoding. Gedit does not recognize texts written on Windows.

Comment 4 Jonathan Stewart 2009-03-10 00:10:17 UTC

I am not a developer, but is it reasonable to ask gedit to support Windows text encoding?  This text file contains no information that identifies the character encoding used.

The command-line tool called 'file' identifies it thusly:
ReadMe.txt: Non-ISO extended-ASCII text, with very long lines, with CRLF line terminators

The vim text editor is also confused by this file, and displays incorrect characters for the most part.

Using Firefox, I must manually select the encoding, and it appears the correct encoding is Windows-1251. I think this displays the Cyrillic characters correctly.

My understanding is that this is one of the key limitations of raw text files; there is no way to indicate to the application which character encoding is being used.  Thus more advanced formats like XML or HTML require identifying the encoding.
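The limitation described above can be illustrated with a minimal Python sketch. This is not gedit code, and the helper name `sniff_encoding` is made up for illustration. It assumes that the only evidence a raw text file can carry is a byte-order mark or a byte sequence that happens to be strictly valid UTF-8; a Windows-1251 file offers neither.

```python
import codecs
from typing import Optional


def sniff_encoding(data: bytes) -> Optional[str]:
    """Return an encoding name only when the bytes themselves give evidence.

    Hypothetical helper for illustration; it is not part of gedit.
    """
    # A byte-order mark is the only explicit marker a plain text file can carry.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Pure 7-bit data reads the same in ASCII-compatible encodings.
    if all(byte < 0x80 for byte in data):
        return "ascii"
    # UTF-8 has a strict structure, so a clean decode is strong evidence.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Legacy 8-bit encodings (Windows-1251, KOI8-R, ...) leave no marker.
        return None
```

For bytes produced by encoding Russian text as Windows-1251, this sketch returns `None`, which is roughly the situation the `file` tool reports as "Non-ISO extended-ASCII text".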

My final thought is: Does gedit support any non-UTF-8 encodings?  If not, this bug may never be resolved.

Comment 5 Ilya Chernykh 2009-03-10 00:29:11 UTC

Gedit should either ask for the proper encoding or determine it automatically. Now you open a file, see garbage and have no option to choose the encoding manually. Other editors allow for that, but gedit says an encoding chooser is not necessary because it has auto-detection.

Comment 6 jessevdk@gmail.com 2009-03-10 09:03:17 UTC

Ok, I guess this needs some explanation because I see some pretty wild guesses here of what gedit does or does not do. When opening a file, gedit will try to autodetect the encoding of the file. What this means is, it has a list of encodings it will try to convert from, and whichever one results in valid utf-8 is picked. The list of 'autodetected' encodings can be found in the /apps/gedit-2/preferences/encodings/auto_detected gconf key.

As Jonathan points out, there is not really a better way to find out the encoding. We cannot simply try all possible encodings when opening a file, so we set the list to a sane default (which at the moment is: UTF-8, CURRENT, ISO-8859-15, UTF-16).

So, to solve your problem, you'll have to select the proper encoding in the file load dialog.
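The try-each-candidate approach described in this comment can be sketched in Python as follows. This is only an approximation under stated assumptions: gedit itself is not written in Python, the function name `first_decodable` is hypothetical, Python codec names stand in for the gconf names, and the CURRENT (locale) entry is left out. The sketch also shows why a Windows-1251 file never reaches a later candidate: ISO-8859-15 accepts any byte sequence, so it "succeeds" first.

```python
from typing import Optional, Sequence

# Default candidate list from the comment above, minus CURRENT (the locale
# encoding), written with Python codec names. This mapping is an assumption.
DEFAULT_CANDIDATES = ("utf-8", "iso8859_15", "utf-16")


def first_decodable(data: bytes, candidates: Sequence[str]) -> Optional[str]:
    """Return the first candidate encoding that decodes data without error."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # conversion failed, try the next candidate
        return name
    return None


if __name__ == "__main__":
    sample = "Привет, мир".encode("cp1251")  # stand-in for a Windows-1251 file
    # With the default list, ISO-8859-15 is reported (a false positive).
    print(first_decodable(sample, DEFAULT_CANDIDATES))
    # With Windows-1251 placed before ISO-8859-15, it is picked instead.
    print(first_decodable(sample, ("utf-8", "cp1251", "iso8859_15")))
```

A successful conversion only proves that the bytes are legal in that encoding, not that the result is readable text, which is why the order of the list matters so much.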

Comment 7 Ilya Chernykh 2009-03-10 09:21:45 UTC

This is an absolutely insane default, because almost all Russian-language documents on the Internet and anywhere else are encoded in cp1251. This means that when opening any text file, people need to select the encoding manually. But as you know, there is no encoding-selection dialog when double-clicking on a text file, and no possibility to choose the encoding after the file is opened.

Comment 8 Paolo Maggi 2009-03-10 10:05:30 UTC

Please note that the list of 'autodetected' encodings can vary according to the user's locale.
So, if you think that the current default is insane for users with a Russian locale, please contact the Russian i18n team and ask them to set a better default for the Russian locale.
Thanks for the collaboration.

Comment 9 Ilya Chernykh 2009-03-10 10:52:28 UTC

I guess the team's activity is part of the Gnome project?

Comment 10 Ignacio Casal Quinteiro (nacho) 2009-10-24 00:50:31 UTC

(In reply to comment #9)
> I guess the team's activity is part of the Gnome project?

yes it is. This is the page for the teams: http://l10n.gnome.org/teams/

Apart from that, closing this as not a bug.

Comment 11 Ilya Chernykh 2009-10-24 01:08:19 UTC

So Gedit can only recognize one encoding - that of the locale?

In Russia most documents are in Windows encoding but it is reasonable to suppose that on a Linux computer most docs would be in the Linux encoding. So the user of Linux in Russia may equally frequently meet files in Windows and Linux encoding.

Comment 12 Takao Fujiwara 2009-10-29 02:20:54 UTC

Personally I like the suggestion in bug 594410.

You can try the test to modify the gconf value with gconf-editor.
/apps/gedit-2/preferences/encodings/auto_detected and
/apps/gedit-2/preferences/encodings/shown_in_menu

I wonder if the following change fixes your problem.

--- a/po/ru.po
+++ b/po/ru.po
@@ -523,7 +523,7 @@ msgstr "[CP866,IBM855,ISO-8859-5,KOI8R,WINDOWS-1251]"
 #. a list of supported encodings
 #: ../data/gedit.schemas.in.h:105
 msgid "[UTF-8,CURRENT,ISO-8859-15,UTF-16]"
-msgstr "[UTF-8,CURRENT,KOI8R,WINDOWS-1251,ISO-8859-5]"
+msgstr "[CURRENT,WINDOWS-1251,ISO-8859-5,KOI8R,UTF-8]"
 
 #: ../gedit/dialogs/gedit-close-confirmation-dialog.c:140
 msgid "Logout _without Saving"

Comment 13 Ignacio Casal Quinteiro (nacho) 2010-04-01 18:57:24 UTC

As it seems like a translation bug, or something that can be changed in the gconf schema, I'm closing this as NOTABUG. Feel free to reopen if you think it is a bug and you can provide more info about it.

Comment 14 Ilya Chernykh 2010-04-01 19:23:01 UTC

Even if this is a problem of gconf schema it's a very annoying bug that touches every Russian user.

And suggestion to recompile gedit for any user themselves is not a good idea.

Comment 15 Takao Fujiwara 2010-04-02 01:01:42 UTC

(In reply to comment #14)
> And suggestion to recompile gedit for any user themselves is not a good idea.

Why don't you upstream the modified ru.po?
The problem is that the Russian encodings have duplicated code points (the same byte values are valid in several of them), so I think modifying the encoding order is the realistic fix.

Comment 16 Ilya Chernykh 2010-04-04 17:29:52 UTC

> You can try the test to modify the gconf value with gconf-editor.
> /apps/gedit-2/preferences/encodings/auto_detected and
> /apps/gedit-2/preferences/encodings/shown_in_menu

I just looked to these keys and found that Windows-1251 already included in the auto-detection list. I wonder why if so, Gedit fails to auto-detect it and stays with incorrect encoding.

Comment 17 Ilya Chernykh 2010-04-04 17:30:50 UTC

Also there is no encoding-chooser in the menu despite the abovementioned key.

Comment 18 Takao Fujiwara 2010-04-05 01:12:23 UTC

(In reply to comment #16)
> I just looked to these keys and found that Windows-1251 already included in the
> auto-detection list. I wonder why if so, Gedit fails to auto-detect it and
> stays with incorrect encoding.

I think you need to modify the encoding order although the encoding already exists.

> Also there is no encoding-chooser in the menu despite the abovementioned key.

Yes, currently there is no GUI for the auto-detect option.

Comment 19 Ilya Chernykh 2010-04-05 01:57:03 UTC

> I think you need to modify the encoding order although the encoding already
> exists.

Probably Gedit gets false-positive on some other encoding earlier in the list, such as KOI-8 and does not check anything later.

Comment 20 Takao Fujiwara 2010-04-05 02:52:30 UTC

(In reply to comment #19)
> Probably Gedit gets false-positive on some other encoding earlier in the list,
> such as KOI-8 and does not check anything later.

AFAIK, gedit just checks the return value of iconv.
You could check whether your text can be converted with iconv from another encoding to UTF-8.
As I noted, some Russian encodings have duplicated code points, so if your text includes only a few characters, accuracy might be lost.
That's why I think modifying the encoding order is the realistic fix, and I guess most test cases with Russian text would succeed.

If iconv fails but gedit still shows the text with the failed encoding, I agree it's a gedit bug.
Otherwise I think there is probably no way to fix it on the gedit side.
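The point about duplicated code points can be checked with a small Python sketch. It uses Python's built-in codecs rather than iconv, so it is an analogy and not a reproduction of what gedit does; the names `CYRILLIC_CODECS` and `decode_with_each` are made up for illustration. It decodes the same Windows-1251 bytes with three Cyrillic codecs and reports which conversions succeed.

```python
# Python codec names standing in for WINDOWS-1251, KOI8R and ISO-8859-5.
CYRILLIC_CODECS = ("cp1251", "koi8_r", "iso8859_5")


def decode_with_each(data: bytes) -> dict:
    """Decode data with every codec in CYRILLIC_CODECS that accepts it."""
    results = {}
    for name in CYRILLIC_CODECS:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            pass  # this codec rejected the bytes
    return results


if __name__ == "__main__":
    original = "Привет"
    for name, text in decode_with_each(original.encode("cp1251")).items():
        # All three conversions succeed; only cp1251 restores the original.
        print(name, repr(text), text == original)
```

Since every one of these conversions succeeds, a check based only on the conversion's return value cannot tell the encodings apart, and the first matching entry in the list wins.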

Comment 21 leon 2010-04-05 11:32:29 UTC

Since the following command fixes the problem:

/usr/bin/gconftool-2 --direct --config-source=xml:readwrite:/etc/gconf/gconf.xml.defaults -s -t list --list-type=string /apps/gedit-2/preferences/encodings/auto_detected "[UTF-8,CURRENT,WINDOWS-1251,KOI8R,ISO-8859-5]"

I changed the gedit translation in gnome-2-28 and HEAD.

Comment 22 Ignacio Casal Quinteiro (nacho) 2010-04-05 11:38:05 UTC

Thanks Leonid, closing this as FIXED. Feel free to reopen if there are still problems with this.