GNOME Bugzilla – Bug 150730
GLib should be double-byte charset aware for filenames on Windows
Last modified: 2011-02-18 16:09:33 UTC
See bug #141124, where this issue was brought to our attention, although the cause of the crash in that bug seems to have been another issue.

When going through strings that represent file names in the system encoding used for file names, GLib uses strchr() and strrchr() to search for directory separators (and maybe other characters). This is problematic on Windows in East Asian locales, or at least in Japan, where the file name encoding is codepage 932. This encoding includes double-byte characters where the second byte can be a backslash, the directory separator.

(This is no problem for Windows itself, as the Windows kernel knows what the machine's system codepage is, and thus knows which backslash bytes in a filename are directory separators and which are the trailing byte of a double-byte character. For NT-based Windows it is even simpler, as the kernel handles all file names in Unicode internally. On Unix, charsets are a userspace issue and the kernel sees just bytes, so all double-byte encodings in use are ones in which a slash byte cannot occur as part of a double-byte character.)

Such code in GLib needs to use _mbschr() and _mbsrchr() instead of strchr() and strrchr() on Windows, but can continue to use strchr() and strrchr() on Unix. Introducing suitable macros is the way to handle this; we definitely don't want to clutter the code with ifdefs. Code that steps through filename strings by incrementing an index or pointer, looking at each byte and checking for backslash, needs to use mblen() to step past double-byte characters. Etc. I haven't gone through GLib yet looking for all the ways double-byte characters in file names might cause problems.

The ultimate solution is to add new API to GLib that handles filenames in UTF-8 on Windows (and the on-disk encoding on Unix), see bug #101792; but that will help only applications which are rewritten to use said API. The current GLib code that handles filenames in the system encoding needs to be fixed. At least g_get_{base,dir}name().
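
The strrchr() problem described above can be illustrated with a small portable sketch. It is not GLib code and does not call _mbsrchr(), which is only available in the Microsoft C runtime. Instead it hard-codes the codepage 932 lead-byte ranges (0x81-0x9F and 0xE0-0xFC), which is an assumption made for illustration. The helper names cp932_is_lead_byte() and cp932_last_separator() and the sample path are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption for this sketch: lead bytes of codepage 932 are
 * 0x81-0x9F and 0xE0-0xFC. Real code on Windows would ask the
 * system (or use the _mbs* functions) instead of hard-coding this. */
int cp932_is_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Hypothetical helper: like strrchr(path, '\\'), but a backslash byte
 * that is the trail byte of a double-byte character is not treated
 * as a directory separator. */
const char *cp932_last_separator(const char *path)
{
    const unsigned char *p = (const unsigned char *) path;
    const char *last = NULL;

    assert(path != NULL);
    while (*p != '\0') {
        if (cp932_is_lead_byte(*p) && p[1] != '\0') {
            p += 2;             /* skip lead byte and trail byte together */
            continue;
        }
        if (*p == '\\')
            last = (const char *) p;
        p++;
    }
    return last;
}

/* "C:\dir\" followed by the bytes 0x95 0x5C, which in codepage 932
 * encode one character (U+8868). The trail byte 0x5C equals '\\'. */
const char sample_path[] = "C:\\dir\\" "\x95\x5c";
```

With this sample, strrchr(sample_path, '\\') points at offset 8, the trail byte inside the file name, so a basename computed from it would be empty. cp932_last_separator(sample_path) points at offset 6, the real directory separator.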
Argh, even fixing just g_get_{base,dir}name isn't as straightforward as I first thought. These functions obviously can be applied either to UTF-8 strings or to system codepage strings. (And on Unix, no problem with that.) I wonder whether it's possible to unambiguously distinguish between a UTF-8 string and a Windows double-byte (codepage 932, 936, 949 or 950) string? Presumably not. So, one has to fix the callers instead. The caller hopefully should know whether it is passing a UTF-8 or a system encoding filename to g_get_{base,dir}name in each case. Sigh. Maybe just ignore the issue (WONTFIX) in GLib 2.4, and instead concentrate on getting the new API related to bug #101792 as well thought-out and elegant as possible for 2.6.
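
To make the ambiguity concrete, here is a rough sketch of two purely structural checks, one for UTF-8 and one for codepage 932, applied to the same two bytes. Both checkers are simplified and hypothetical (the names looks_like_utf8() and looks_like_cp932() are invented here, and the byte ranges are assumptions for illustration); they only look at byte patterns and do not verify that a code point is actually assigned.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified structural UTF-8 check (lead-byte ranges plus
 * continuation bytes; no further overlong or surrogate checks). */
int looks_like_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;

    assert(s != NULL);
    while (i < n) {
        size_t extra;
        if (s[i] < 0x80)                        extra = 0;
        else if (s[i] >= 0xC2 && s[i] <= 0xDF)  extra = 1;
        else if (s[i] >= 0xE0 && s[i] <= 0xEF)  extra = 2;
        else if (s[i] >= 0xF0 && s[i] <= 0xF4)  extra = 3;
        else return 0;                          /* not a valid lead byte */
        if (extra > n - i - 1)
            return 0;                           /* truncated sequence */
        for (size_t k = 1; k <= extra; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;                       /* not a continuation byte */
        i += extra + 1;
    }
    return 1;
}

/* Simplified structural codepage 932 check. Assumed layout: single
 * bytes 0x00-0x7F and 0xA1-0xDF (half-width katakana), lead bytes
 * 0x81-0x9F and 0xE0-0xFC, trail bytes 0x40-0x7E and 0x80-0xFC. */
int looks_like_cp932(const unsigned char *s, size_t n)
{
    size_t i = 0;

    assert(s != NULL);
    while (i < n) {
        unsigned char c = s[i];
        if (c < 0x80 || (c >= 0xA1 && c <= 0xDF)) {
            i++;
        } else if ((c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC)) {
            if (i + 1 >= n)
                return 0;                       /* lead byte without trail */
            unsigned char t = s[i + 1];
            if (t < 0x40 || t == 0x7F || t > 0xFC)
                return 0;
            i += 2;
        } else {
            return 0;
        }
    }
    return 1;
}

/* 0xC3 0xA9 is U+00E9 in UTF-8, and two half-width katakana
 * characters when read as codepage 932. */
const unsigned char ambiguous[] = { 0xC3, 0xA9 };
```

Both checks accept the two bytes in ambiguous, so byte inspection alone cannot tell which encoding was meant, which is why the caller has to know.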
Tor, is this still relevant, now that we have the stdio wrappers and the filename encoding?
I guess this is now irrelevant, yes. Closing as WONTFIX, as the gstdio wrappers are more of a workaround and an API change than an actual fix.