GNOME Bugzilla – Bug 169943
Encoding guessing functions
Last modified: 2018-02-14 12:22:59 UTC
Many kinds of data carry no charset information (I'll call this _legacy_ data), for example: plain text, localized man pages, ID3v1 tags, subtitles (SMI files), and some network protocols (FTP, whois, etc.). GLib currently provides g_locale_{to,from}_utf8(), which convert between the locale's charset and UTF-8. But the data above cannot be converted using the locale's charset, because that charset is used for presentation and is not necessarily the charset of the _legacy_ data. I suggest new conversion functions that rely on the locale's country code rather than its charset. They would be very useful for handling _legacy_ data.
I would also like these functions to replace g_locale_{to,from}_utf8(), because the g_locale_{to,from}_utf8() functions have no effect in a UTF-8 locale.
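To make the problem above concrete, here is a minimal sketch in Python (chosen for brevity; it is not GLib code). The helper name `decode_with_locale_charset` is hypothetical and only mimics a conversion from the locale's charset. The sample bytes are assumed to be Korean text stored as EUC-KR, as might be found in an ID3v1 tag.

```python
# Korean text stored as EUC-KR bytes, with no charset label attached.
legacy = "한글".encode("euc_kr")


def decode_with_locale_charset(data, locale_charset="utf-8"):
    """Decode bytes using the given charset (hypothetical helper).

    Returns the decoded string, or None if the bytes are not valid
    in that charset.
    """
    try:
        return data.decode(locale_charset)
    except UnicodeDecodeError:
        return None


# In a UTF-8 locale the locale charset does not match the data.
as_locale = decode_with_locale_charset(legacy)

# The legacy charset usually associated with Korean does match.
as_legacy = decode_with_locale_charset(legacy, "euc_kr")
```

Under these assumptions the first call fails and the second succeeds, which is the gap the report describes: the locale's language hints at the right legacy charset, but the locale's own charset does not.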
A file encoding *guessing* function that takes an input string and the current locale as inputs would be a reasonable addition to GLib, though it isn't a particularly simple problem. (To distinguish the various Japanese encodings, you have to look at character frequency in the input string, for example.) g_locale_{to,from}_utf8 are well-defined functions and will not be changed. I thought we already had a bug about that, but I don't see it.
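A rough sketch of what such a guessing function might look like, written in Python for brevity rather than as a GLib API proposal. The names `LEGACY_CANDIDATES` and `guess_decode` are hypothetical, and the candidate table is an illustrative assumption, not an authoritative mapping. The caller is assumed to pass the language code taken from the current locale. A C version could presumably be built on g_utf8_validate() and g_convert().

```python
# Hypothetical table: locale language code -> legacy encodings to try, in order.
LEGACY_CANDIDATES = {
    "ko": ["cp949"],
    "ja": ["euc_jp", "shift_jis"],
    "ru": ["koi8_r", "cp1251"],
}


def guess_decode(data, lang):
    """Return (text, encoding) using trial decoding.

    UTF-8 is tried first because it rejects most non-UTF-8 input.
    The legacy encodings listed for the language are tried next.
    Latin-1 is the last resort because it accepts any byte sequence.
    """
    for encoding in ["utf-8"] + LEGACY_CANDIDATES.get(lang, []):
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return data.decode("latin-1"), "latin-1"
```

For example, `guess_decode("한글".encode("cp949"), "ko")` yields the original text with `cp949` as the encoding. Trial decoding like this simply picks the first candidate that decodes without error, so it cannot tell apart encodings that both accept the same bytes (single-byte charsets such as KOI8-R accept nearly anything). That limitation is why a real implementation would need the character frequency analysis mentioned above.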
In the last 13 years, the world has standardised more and more on UTF-8, and other encodings are becoming less common. Where they are used, the encoding is explicitly stated more often than before. There is still a need for a way to guess the encoding of an arbitrary byte stream, but I don’t believe it’s a common and general enough problem to warrant being in GLib — and I think the need for such guessing is only going to reduce over time, leaving GLib with unnecessary API. https://cgit.freedesktop.org/uchardet/uchardet/ is one library which provides this functionality. ⇒ WONTFIX