Bug 169943 – Encoding guessing functions




Summary:	Encoding guessing functions


Status:	RESOLVED WONTFIX

Product:	glib
Classification:	Platform
Component:	i18n
Version:	2.6.x
Hardware:	Other All

Importance:	Normal enhancement
Target Milestone:	---
Assigned To:	gtkdev
QA Contact:	gtkdev

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2005-03-11 13:05 UTC by Young-Ho Cha
Modified:	2018-02-14 12:22 UTC

See Also:
GNOME target:	---
GNOME version:	Unversioned Enhancement

Description Young-Ho Cha 2005-03-11 13:05:41 UTC

There are many kinds of data that carry no charset information (I'll call this
_legacy_ data).

Examples: plain text, localized man pages, ID3v1 tags, subtitles (SMI files), some
network protocols (FTP, whois, etc.).

Currently GLib provides g_locale_{to,from}_utf8(), so text can easily be
converted to and from UTF-8 using the locale's charset.

But the data above cannot be converted using the locale's charset, because the
locale's charset is used for presentation and is not necessarily the charset of
the _legacy_ data.

I suggest new conversion functions that rely on the locale's country code, not its charset.

It would be very useful for handling _legacy_ data.
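
The suggestion above could look roughly like the following minimal sketch. The
helper legacy_charset_for_locale() and its table are hypothetical (nothing like
this exists in GLib), and the charsets listed are only common conventions for
each locale, not an authoritative mapping:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table: locale prefix -> legacy charset often used there. */
struct legacy_entry {
    const char *prefix;
    const char *charset;
};

static const struct legacy_entry legacy_table[] = {
    { "ko_KR", "EUC-KR" },
    { "ja_JP", "EUC-JP" },   /* Shift_JIS and ISO-2022-JP are also common */
    { "zh_CN", "GB2312" },
    { "zh_TW", "BIG5" },
    { "ru_RU", "KOI8-R" },
};

/* Returns a legacy charset guess for a locale name such as "ko_KR.UTF-8".
 * Only the language/country prefix is compared, so the locale's own
 * charset suffix is ignored. Falls back to ISO-8859-1 (an assumption). */
const char *legacy_charset_for_locale(const char *locale)
{
    size_t i;

    assert(locale != NULL);
    for (i = 0; i < sizeof legacy_table / sizeof legacy_table[0]; i++) {
        size_t n = strlen(legacy_table[i].prefix);
        if (strncmp(locale, legacy_table[i].prefix, n) == 0)
            return legacy_table[i].charset;
    }
    return "ISO-8859-1";
}
```

With this table, a "ko_KR.UTF-8" locale yields EUC-KR as the candidate charset
for _legacy_ data, even though the locale's own charset is UTF-8.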

Comment 1 Young-Ho Cha 2005-03-11 13:16:57 UTC

I would also like these functions to replace g_locale_{to,from}_utf8(), because
the g_locale_{to,from}_utf8() functions do nothing in a UTF-8 locale.

Comment 2 Owen Taylor 2005-03-11 14:49:38 UTC

A file encoding *guessing* function that used an input string and
the current locale as inputs would be a reasonable addition to GLib,
though it isn't a particularly simple problem. (To distinguish the
various Japanese encodings, you have to look at character frequency
in the input string, for example.)
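
As a rough illustration of the "input string plus locale" shape only, a first
step might check whether the bytes are well-formed UTF-8 and otherwise fall
back to a charset derived from the locale. The names is_valid_utf8() and
guess_encoding() are hypothetical, not GLib API, and this sketch does none of
the character frequency analysis needed to tell the Japanese encodings apart:

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if buf[0..len) is well-formed UTF-8, otherwise 0. */
int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        unsigned long cp;
        size_t extra, k;

        if (c < 0x80) {                 /* plain ASCII byte */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            extra = 1; cp = c & 0x1F;
        } else if ((c & 0xF0) == 0xE0) {
            extra = 2; cp = c & 0x0F;
        } else if ((c & 0xF8) == 0xF0) {
            extra = 3; cp = c & 0x07;
        } else {
            return 0;                   /* invalid lead byte */
        }

        if (len - i <= extra)
            return 0;                   /* sequence is truncated */

        for (k = 1; k <= extra; k++) {
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;               /* not a continuation byte */
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }

        /* Reject overlong forms, surrogates, and values past U+10FFFF. */
        if ((extra == 1 && cp < 0x80) ||
            (extra == 2 && cp < 0x800) ||
            (extra == 3 && cp < 0x10000))
            return 0;
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;
        if (cp > 0x10FFFF)
            return 0;

        i += extra + 1;
    }
    return 1;
}

/* Hypothetical guesser: "UTF-8" if the input validates, otherwise the
 * caller-supplied charset derived from the current locale. */
const char *guess_encoding(const unsigned char *buf, size_t len,
                           const char *locale_fallback)
{
    assert(buf != NULL || len == 0);
    assert(locale_fallback != NULL);
    return is_valid_utf8(buf, len) ? "UTF-8" : locale_fallback;
}
```

Pure ASCII input is reported as UTF-8 here, which is harmless since ASCII is a
subset of UTF-8. Anything that fails validation is simply attributed to the
locale's fallback charset, so this is a heuristic rather than real detection.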

g_locale_{to,from}_utf8 are well defined functions and will not be
changed.

I thought we already had a bug about that, but I don't see it.

Comment 3 Philip Withnall 2018-02-14 12:22:59 UTC

In the last 13 years, the world has standardised more and more on UTF-8, and other encodings are becoming less common. Where they are used, the encoding is explicitly stated more often than before.

There is still a need for a way to guess the encoding of an arbitrary byte stream, but I don’t believe it’s a common and general enough problem to warrant being in GLib — and I think the need for such guessing is only going to reduce over time, leaving GLib with unnecessary API.

https://cgit.freedesktop.org/uchardet/uchardet/ is one library which provides this functionality.

⇒ WONTFIX