Bug 782578 - g_get_charset always returns 8-bit codepage on Windows, crippling UTF-8 output
Status: RESOLVED OBSOLETE
Product: glib
Classification: Platform
Component: general
Version: unspecified
Hardware: Other Windows
Importance: Normal normal
Target Milestone: ---
Assigned To: gtkdev
QA Contact: gtkdev
Reported: 2017-05-12 23:13 UTC by Eduard Braun
Modified: 2018-05-24 19:35 UTC
GNOME target: ---
GNOME version: ---



Description Eduard Braun 2017-05-12 23:13:48 UTC
From the documentation of g_get_charset():
> On Windows the character set returned by this function is the so-called system default ANSI code-page. That is the character set used by the "narrow" versions of C library and Win32 functions that handle file names. It might be different from the character set used by the C library's current locale.

The problem is not so much this definition but the implications of it:
Unfortunately, g_get_charset() is used by most (if not all) functions in glib (and also glibmm and gtk) that generate console output to determine the character set in which the output should be printed. (Notable examples are glib's plain g_print() [1] and the output stream operator (<<) of glibmm's Glib::ustring [2].) That means that if the console encoding does not match the encoding determined by g_get_charset(), output on the console will be mostly wrong!

On Windows this is often the case as modern consoles are not at all bound to Windows' archaic codepages and are likely to use an encoding that does not match the one returned by g_get_charset().

For example MSYS2's console uses UTF-8 by default, cmd.exe on my system uses code page 850 by default, while my system's locale as determined by g_get_charset() is 1252. And it doesn't stop there: The Windows console can be easily set to accept UTF-8 output, while glib will be unable to produce output in the proper encoding!
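To check this kind of mismatch on a given machine, the two Win32 code pages can be queried directly. The following is only a minimal sketch: describe_code_pages() is a hypothetical helper name, not a glib function, and on non-Windows builds it merely reports that there is nothing to compare.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: writes a one-line summary of the code pages in play
 * into buf and returns the value snprintf() returned. */
int
describe_code_pages (char *buf, size_t len)
{
  assert (buf != NULL && len > 0);
#ifdef _WIN32
  /* GetACP(): system default ANSI code page.
   * GetConsoleOutputCP(): output code page of the attached console,
   * or 0 if the call fails (for example when there is no console). */
  return snprintf (buf, len, "ANSI=%u console=%u",
                   (unsigned) GetACP (), (unsigned) GetConsoleOutputCP ());
#else
  /* Non-Windows build: there are no Win32 code pages to compare. */
  return snprintf (buf, len, "not a Windows build");
#endif
}
```

If the two numbers differ (such as 1252 versus 850 in the cmd.exe case described above), text converted to the ANSI code page will not display correctly on that console.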

I'd therefore suggest either of the following:
a) Rethink the usage of g_get_charset() when converting for console output, and potentially create a new g_get_console_charset() that suits this purpose better.
b) Add a way to disable glib's automatic character conversion when creating console output (or rather: let the developer set the encoding glib should choose). This could for example be implemented by adding a conditional in g_get_charset() that checks whether a pre-set encoding is desired.

Option a) is probably hard to implement (How would one, for example, determine whether the console application is running in an MSYS shell? And even then: how can the encoding of an MSYS shell be determined?) and might break backwards compatibility in unexpected ways, so something along the lines of b) would probably be the better approach.


[1] https://github.com/GNOME/glib/blob/e8487812b9782b6a01e8de9990593558394f4087/glib/gmessages.c#L3083
[2] https://github.com/GNOME/glibmm/blob/0797bf2954177f58b7ac6ebecce7264310481c55/glib/glibmm/ustring.cc#L1430
Comment 1 Christoph Reiter (lazka) 2017-05-13 08:30:16 UTC
g_get_charset() currently calls GetACP(). Using GetConsoleOutputCP() with a fallback in case there is no console (it returns 0 here..) sounds like a good idea to me.

If g_print is the only place where it's relevant maybe just add a Windows specific case there instead of adding new API?
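The GetConsoleOutputCP() fallback suggested above could look roughly like the sketch below. All three function names are hypothetical, and the "CP<number>" charset naming is an assumption about what the converter in use accepts; only GetConsoleOutputCP(), GetACP() and the code page number 65001 (UTF-8) are actual Win32 facts.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Prefer the console code page; fall back to the ANSI code page when the
 * console query failed and returned 0. */
unsigned
choose_output_code_page (unsigned console_cp, unsigned ansi_cp)
{
  return console_cp != 0 ? console_cp : ansi_cp;
}

/* Formats a code page number as a charset name in buf and returns buf.
 * 65001 is the Windows code page for UTF-8; "CP<number>" is an assumed
 * naming convention for everything else. */
const char *
code_page_to_charset (unsigned cp, char *buf, size_t len)
{
  assert (buf != NULL && len > 0);
  if (cp == 65001)
    snprintf (buf, len, "UTF-8");
  else
    snprintf (buf, len, "CP%u", cp);
  return buf;
}

#ifdef _WIN32
/* Windows-only wrapper that queries the real code pages. */
const char *
current_output_charset (char *buf, size_t len)
{
  unsigned cp = choose_output_code_page ((unsigned) GetConsoleOutputCP (),
                                         (unsigned) GetACP ());
  return code_page_to_charset (cp, buf, len);
}
#endif
```

Keeping the choice in a pure function makes the fallback policy easy to change, for example to prefer UTF-8 over the ANSI code page when no console is attached.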
Comment 2 Eduard Braun 2017-05-13 14:19:01 UTC
> g_get_charset() currently calls GetACP(). Using GetConsoleOutputCP() with a
> fallback in case there is no console (it returns 0 here..) sounds like a
> good idea to me.

That's more or less what I had in mind for option a) above. The reasons why I imagine it would be hard to implement:
1. You mention one yourself: in applications not attached to a console we have to find an alternative, but right now I have no good idea what that alternative might be. There's nothing that would always work...
2. GetConsoleOutputCP() only seems to be useful for applications launched from cmd.exe. That probably covers most bases, but it fails in other shells like MSYS/MSYS2 (and therefore probably Cygwin), not to speak of the many other custom implementations like debugger consoles etc. If we say "That's for those implementers to figure out" it may be fine though (although not automagically solved).
3. (Probably most important) g_get_charset() is called in places where we specifically want the system's ANSI codepage (GetACP), and *not* the console's codepage (GetConsoleOutputCP). One example is g_locale_to/from_utf8(). That's why I suggested a g_get_console_charset() that we can use in glib's output functions and that developers of other libraries/programs could start to use as they see fit.

> If g_print is the only place where it's relevant maybe just add a Windows
> specific case there instead of adding new API?

g_print / g_printerr are the most obvious (I'm unsure about g_log_writer_format_fields?). String utility functions in "glib/gprintf.h" don't seem to do character set conversions? Others need to be investigated.
The larger problem I saw initially was third-party libraries, but I'm starting to believe this might be out of scope for this bug now and, in most cases, probably not a glib issue after all.


In conclusion I currently think it would be the most universal solution to offer g_get_console_charset() and use that in glib whenever output on the console should be generated. If we offer an additional g_set_console_charset() that overrides whatever g_get_console_charset() would determine it would give people the possibility to implement their own conversions where our approach fails or they have custom needs. (For example I recently wrote a console wrapper for an executable that is compiled with -mwindows; I decided to use UTF-8 as output encoding for the main application and let the console wrapper decide if and how to convert the output).
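The proposed getter/setter pair could behave roughly as in the sketch below. The sketch_ prefix marks these as hypothetical stand-ins for the suggested g_get_console_charset() and g_set_console_charset(), which do not exist in glib at the time of this report; the detected charset is passed in by the caller to keep the example self-contained, and the static buffer is not thread-safe.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Override storage; an empty string means "no override set". */
static char override_charset[32];

/* Hypothetical setter: pass NULL to clear the override.
 * Returns 0 on success, -1 if the name does not fit the buffer. */
int
sketch_set_console_charset (const char *charset)
{
  if (charset == NULL)
    {
      override_charset[0] = '\0';
      return 0;
    }
  if (strlen (charset) >= sizeof override_charset)
    return -1;
  strcpy (override_charset, charset);
  return 0;
}

/* Hypothetical getter: an explicit override wins, otherwise the charset
 * detected by the caller is used. */
const char *
sketch_get_console_charset (const char *detected)
{
  assert (detected != NULL);
  return override_charset[0] != '\0' ? override_charset : detected;
}
```

A console wrapper like the one mentioned above would call the setter once with "UTF-8" at startup and handle any further conversion itself.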
Comment 3 Christoph Reiter (lazka) 2017-05-13 14:38:58 UTC
If it's really required by other apps it can easily be changed in the future. I prefer usecase oriented API. If you want to print text, use g_print and let it handle everything. There are enough encodings exposed already.

Also one could argue that g_print should use WriteConsoleW and not use the codepage in case there is a console attached, but that's another issue.
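For reference, the WriteConsoleW route might be sketched as follows. print_utf8() is a hypothetical helper, not how g_print is implemented; it assumes the input is valid UTF-8 and short enough for its length to fit in an int. When standard output is not a console (or the build is not for Windows), it passes the UTF-8 bytes through unchanged.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: prints a NUL-terminated UTF-8 string to standard
 * output. Returns 0 on success, -1 on failure. */
int
print_utf8 (const char *utf8)
{
  size_t len;

  assert (utf8 != NULL);
  len = strlen (utf8);
  if (len == 0)
    return 0;
#ifdef _WIN32
  {
    HANDLE out = GetStdHandle (STD_OUTPUT_HANDLE);
    DWORD mode;

    /* GetConsoleMode() only succeeds for a real console handle, not for a
     * pipe or a file, so it doubles as an "is this a console?" check. */
    if (out != NULL && out != INVALID_HANDLE_VALUE && GetConsoleMode (out, &mode))
      {
        int wlen = MultiByteToWideChar (CP_UTF8, 0, utf8, (int) len, NULL, 0);
        wchar_t *wbuf;
        DWORD written = 0;
        BOOL ok;

        if (wlen <= 0)
          return -1;
        wbuf = malloc ((size_t) wlen * sizeof (wchar_t));
        if (wbuf == NULL)
          return -1;
        MultiByteToWideChar (CP_UTF8, 0, utf8, (int) len, wbuf, wlen);
        fflush (stdout); /* keep ordering with earlier stdio output */
        ok = WriteConsoleW (out, wbuf, (DWORD) wlen, &written, NULL);
        free (wbuf);
        return ok ? 0 : -1;
      }
  }
#endif
  /* Redirected output or non-Windows build: pass the bytes through. */
  return fwrite (utf8, 1, len, stdout) == len ? 0 : -1;
}
```

Because WriteConsoleW takes UTF-16, this path does not depend on the console code page at all; the open question of which encoding to use only remains for redirected output.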
Comment 4 Eduard Braun 2017-05-13 15:04:14 UTC
(In reply to Christoph Reiter (lazka) from comment #3)
> I prefer usecase oriented API. If you want to print text, use
> g_print and let it handle everything.

I'd agree - if it works...
(That statement not being true is the whole reason I reported this bug.)

How do you suggest we handle the case where we're in an application that is not attached to a console? What encoding do we pick?

There are two possibilities I see (neither of them ideal):
1. Fall back to the system's locale as returned by g_get_charset() (basically the status quo). On Windows this would be an unnecessarily limited 8-bit codepage, and as the console is often not encoded in this locale but in something else (see initial report), it would almost never work properly (not very use case oriented...).
2. Pick UTF-8. That's an option I could live with, but it could cause other unwanted behavior, as UTF-8 is certainly not a universally useful encoding either as long as the receiving end does not expect it.

The problem with "use g_print and let it handle everything" is finding a solution that actually handles everything, and right now I don't see a solution that always automatically works in all cases...
Comment 5 Christoph Reiter (lazka) 2017-05-13 15:08:43 UTC
> How do you suggest we handle the case where we're in an application that is not attached to a console? What encoding do we pick?

I don't know.

If there is no good solution I'm +1 for just using utf-8 in that case.
Comment 6 Philip Withnall 2017-09-13 12:20:58 UTC
Given that there are other bugs caused by the same root cause here (for example, bug #772411), I think I’d be in favour of adding a new g_get_console_charset() rather than fixing just g_print().

Note that we should also update the documentation of g_get_charset() to make it clear that there’s potentially a difference between the file system and the console encodings.
Comment 7 GNOME Infrastructure Team 2018-05-24 19:35:02 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to GNOME's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/1270.