Bug 169943 - Encoding guessing functions
Status: RESOLVED WONTFIX
Product: glib
Classification: Platform
Component: i18n
Version: 2.6.x
Hardware: Other All
Importance: Normal enhancement
Target Milestone: ---
Assigned To: gtkdev
QA Contact: gtkdev
Depends on:
Blocks:
 
 
Reported: 2005-03-11 13:05 UTC by Young-Ho Cha
Modified: 2018-02-14 12:22 UTC
See Also:
GNOME target: ---
GNOME version: Unversioned Enhancement



Description Young-Ho Cha 2005-03-11 13:05:41 UTC
Many kinds of data carry no charset information. (I'll call this
_legacy_ data.)

Examples: plain text, localized man pages, ID3v1 tags, subtitles (SMI files), and some
network protocols (FTP, whois, etc.).

GLib currently provides g_locale_{to,from}_utf8(), which convert easily between the
locale's charset and UTF-8.

But the data above cannot be converted using the locale's charset, because that charset
describes how text is presented, not how the _legacy_ data is encoded.

I suggest new conversion functions that rely on the locale's country code rather than its charset.

They would be very useful for handling _legacy_ data.
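
A minimal sketch of the proposal above, in Python rather than as GLib API. The function name `decode_legacy` and the `LEGACY_CHARSETS` table are hypothetical, and the table entries are illustrative assumptions, not a list from GLib or from this report. The idea is to ignore the locale's charset suffix and try charsets that are plausible for the locale's language instead.

```python
# Hypothetical table: language code -> legacy charsets to try, in order.
# The entries are illustrative assumptions only.
LEGACY_CHARSETS = {
    "ko": ["cp949"],
    "ja": ["euc_jp", "shift_jis"],
}


def decode_legacy(data: bytes, locale_name: str) -> str:
    """Decode legacy bytes based on the locale's language, not its charset."""
    # "ko_KR.UTF-8" -> "ko"; the charset suffix is deliberately ignored.
    language = locale_name.split(".")[0].split("_")[0].lower()
    # Try UTF-8 first: legacy multi-byte text is rarely valid UTF-8 by accident.
    for charset in ["utf-8"] + LEGACY_CHARSETS.get(language, []):
        try:
            return data.decode(charset)
        except UnicodeDecodeError:
            continue
    # Last resort: keep the text readable instead of raising.
    return data.decode("utf-8", errors="replace")


print(decode_legacy("한국어".encode("cp949"), "ko_KR.UTF-8"))
```

Trial decoding like this only works when the candidates reject each other's byte sequences. Single-byte charsets accept almost any input, so they cannot be told apart this way.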
Comment 1 Young-Ho Cha 2005-03-11 13:16:57 UTC
I would also like these functions to replace g_locale_{to,from}_utf8(), because
the g_locale_{to,from}_utf8() functions effectively do nothing in a UTF-8 locale.
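
To illustrate the point about UTF-8 locales, here is a rough Python stand-in for a locale-charset conversion. `locale_to_utf8` is a hypothetical helper that mimics the idea, not GLib's actual implementation. When the locale charset is already UTF-8, the conversion can at most validate its input, so legacy EUC-KR bytes are rejected rather than converted.

```python
def locale_to_utf8(data: bytes, locale_charset: str):
    """Hypothetical stand-in: convert from the locale charset to UTF-8.

    Returns None when the bytes are not valid in the locale charset.
    """
    try:
        return data.decode(locale_charset).encode("utf-8")
    except UnicodeDecodeError:
        return None


legacy = "한국어".encode("euc_kr")  # legacy bytes with no charset label

# In a UTF-8 locale the conversion has nothing to work with and fails.
print(locale_to_utf8(legacy, "utf-8"))
# In an EUC-KR locale the same bytes convert correctly.
print(locale_to_utf8(legacy, "euc_kr").decode("utf-8"))
```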
Comment 2 Owen Taylor 2005-03-11 14:49:38 UTC
A file encoding *guessing* function that used an input string and
the current locale as inputs would be a reasonable addition to GLib,
though it isn't a particularly simple problem. (To distinguish the
various Japanese encodings, you have to look at character frequency
in the input string, for example.)
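
A much cruder stand-in for the character-frequency analysis mentioned above can be sketched in Python. It scores each candidate decoding by the fraction of characters that fall in kana, kanji, or CJK punctuation blocks. The candidate list, the function names, and the scoring rule are all assumptions made for illustration. A real detector would use per-encoding frequency models.

```python
CANDIDATES = ["euc_jp", "shift_jis"]  # assumed candidates for a "ja" locale


def japanese_ratio(text: str) -> float:
    """Fraction of characters in CJK punctuation, kana, or common kanji blocks."""
    if not text:
        return 0.0
    hits = sum(
        1 for ch in text
        if "\u3000" <= ch <= "\u30ff" or "\u4e00" <= ch <= "\u9fff"
    )
    return hits / len(text)


def guess_japanese_encoding(data: bytes):
    """Return the best-scoring candidate charset, or None if none decodes."""
    best, best_score = None, -1.0
    for charset in CANDIDATES:
        try:
            score = japanese_ratio(data.decode(charset))
        except UnicodeDecodeError:
            continue  # byte sequence is invalid in this charset
        if score > best_score:
            best, best_score = charset, score
    return best


sample = "こんにちは、世界。これは日本語のテキストです。"
print(guess_japanese_encoding(sample.encode("euc_jp")))
print(guess_japanese_encoding(sample.encode("shift_jis")))
```

The scoring matters because a byte string can decode without error under the wrong charset. EUC-JP kana bytes, for instance, fall in the range Shift_JIS uses for half-width katakana, so a successful decode alone is not enough evidence.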

g_locale_{to,from}_utf8 are well defined functions and will not be
changed.

I thought we already had a bug about that, but I don't see it.
Comment 3 Philip Withnall 2018-02-14 12:22:59 UTC
In the last 13 years, the world has standardised more and more on UTF-8, and other encodings are becoming less common. Where they are used, the encoding is explicitly stated more often than before.

There is still a need for a way to guess the encoding of an arbitrary byte stream, but I don’t believe it’s a common and general enough problem to warrant being in GLib — and I think the need for such guessing is only going to reduce over time, leaving GLib with unnecessary API.

https://cgit.freedesktop.org/uchardet/uchardet/ is one library which provides this functionality.

⇒ WONTFIX