Bug 150730 – GLib should be double-byte charset aware for filenames on Windows



Summary:	GLib should be double-byte charset aware for filenames on Windows


Status:	RESOLVED WONTFIX

Product:	glib
Classification:	Platform
Component:	win32
Version:	2.4.x
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gtk-win32 maintainers
QA Contact:	gtk-win32 maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2004-08-21 17:11 UTC by Tor Lillqvist
Modified:	2011-02-18 16:09 UTC

See Also:
GNOME target:	---
GNOME version:	---

Description Tor Lillqvist 2004-08-21 17:11:18 UTC

See bug #141124, where this issue was brought to our attention, although the 
cause for the crash in that bug seems to have been another issue.

When going through strings that represent file names in the system encoding 
used for file names, GLib uses strchr() and strrchr() to search for directory 
separators (and maybe other characters). This is problematic on Windows in East 
Asian locales, or at least in Japan, where the file name encoding is codepage 
932. This encoding includes double-byte characters whose second byte can be 
0x5C, the same byte value as a backslash, the directory separator.
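
To make the failure concrete, here is a minimal sketch (not GLib code). The 
path and the helper naive_basename() are hypothetical and exist only for 
illustration. It relies on the fact that in codepage 932 the katakana letter 
SO is encoded as the two bytes 0x83 0x5C, so a byte-wise strrchr() mistakes 
the trail byte for a directory separator.

```c
#include <string.h>

/* Hypothetical file name in codepage 932 bytes: C:\<SO>.txt
 * KATAKANA LETTER SO is the double-byte sequence 0x83 0x5C. */
static const char cp932_path[] = {
    'C', ':', '\\', (char) 0x83, 0x5C, '.', 't', 'x', 't', '\0'
};

/* Naive basename (illustration only): everything after the last
 * 0x5C byte, found with a plain byte-wise strrchr(). */
static const char *naive_basename(const char *path)
{
    const char *sep = strrchr(path, '\\');
    return sep ? sep + 1 : path;
}
```

Calling naive_basename(cp932_path) yields ".txt": the string is split in the 
middle of the double-byte character, and its lead byte 0x83 is left behind in 
the "directory" part.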

(This is no problem for Windows itself, as the Windows kernel is aware what the 
machine's system codepage is, and thus knows which backslash bytes in a 
filename are directory separators, and which are the trailing byte of a double-
byte character. For NT-based Windows, it is even simpler, as the kernel handles 
all file names in Unicode internally. On Unix charsets are a userspace issue 
and the kernel sees just bytes, so all double-byte encodings used are such 
where there can't be any slash bytes.)

Such code in GLib needs to use _mbschr() and _mbsrchr() instead of strchr() and 
strrchr() on Windows, but can continue to use strchr() and strrchr() on Unix. 
Introducing suitable macros is the way to handle this; we definitely don't want 
to clutter the code with ifdefs.
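
One way such macros could look is sketched below. The names FILENAME_STRCHR, 
FILENAME_STRRCHR, FILENAME_DIR_SEPARATOR and filename_basename() are made up 
for this example and are not existing GLib API. The sketch tests _WIN32 only to 
stay independent of GLib's own headers, and on Windows it assumes the 
multibyte code page used by _mbschr()/_mbsrchr() matches the file name 
encoding.

```c
#include <string.h>

#ifdef _WIN32
#include <mbstring.h>
/* _mbschr()/_mbsrchr() work on unsigned char strings and skip over
 * trail bytes of double-byte characters in the current multibyte
 * code page. */
#define FILENAME_STRCHR(s, c) \
    ((char *) _mbschr((const unsigned char *) (s), (c)))
#define FILENAME_STRRCHR(s, c) \
    ((char *) _mbsrchr((const unsigned char *) (s), (c)))
#define FILENAME_DIR_SEPARATOR '\\'
#else
/* On Unix the plain byte-wise functions are sufficient. */
#define FILENAME_STRCHR(s, c)  strchr((s), (c))
#define FILENAME_STRRCHR(s, c) strrchr((s), (c))
#define FILENAME_DIR_SEPARATOR '/'
#endif

/* Callers stay free of ifdefs: the platform difference is hidden
 * behind the macros above. */
static const char *filename_basename(const char *path)
{
    const char *sep = FILENAME_STRRCHR(path, FILENAME_DIR_SEPARATOR);
    return sep ? sep + 1 : path;
}
```

The single ifdef lives next to the macro definitions, and code such as 
filename_basename() reads the same on both platforms.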

Code that steps through filename strings by incrementing an index or pointer, 
looking at each byte and checking for backslash, needs to use mblen() to step 
past double-byte characters. Etc. I haven't gone through GLib yet looking for 
all the ways double-byte characters in file names might cause problems.
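
The stepping approach might look roughly like the sketch below. Note the 
substitution: instead of mblen(), whose result depends on the C locale being 
set up for the system codepage, this example hard-codes the codepage 932 lead 
byte ranges (0x81-0x9F and 0xE0-0xFC) so that it is self-contained. Both 
function names are hypothetical, and other codepages would need their own 
lead byte test.

```c
#include <stddef.h>

/* Lead bytes of double-byte characters in codepage 932. */
static int is_cp932_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Return a pointer to the last backslash that is a real directory
 * separator, or NULL if there is none. Trail bytes of double-byte
 * characters are never examined on their own. */
static const char *cp932_last_separator(const char *path)
{
    const unsigned char *p = (const unsigned char *) path;
    const char *last = NULL;

    while (*p != '\0') {
        if (is_cp932_lead_byte(*p) && p[1] != '\0') {
            p += 2;             /* skip lead byte and trail byte together */
            continue;
        }
        if (*p == '\\')
            last = (const char *) p;
        p++;
    }
    return last;
}
```

For the bytes 'C' ':' '\' 0x83 0x5C ".txt", this returns a pointer to the 
backslash after the drive letter rather than to the 0x5C trail byte.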

The ultimate solution is to add new API to GLib that handles filenames in UTF-8 
on Windows (and the on-disk encoding on Unix), see bug #101792; but that will 
help only applications which are rewritten to use said API. The current GLib 
code that handles filenames in the system encoding needs to be fixed, at least 
g_get_{base,dir}name().

Comment 1 Tor Lillqvist 2004-08-21 19:10:28 UTC

Argh, even fixing just g_get_{base,dir}name isn't as straightforward as I first 
thought. These functions obviously can be applied either to UTF-8 strings or to 
system codepage strings. (And on Unix, no problem with that.) I wonder whether 
it's possible to unambiguously distinguish between a UTF-8 and a Windows 
double-byte (codepage 932, 936, 949 or 950) string? Presumably not. So, one has 
to fix the callers instead. The caller hopefully should know whether it is 
passing a UTF-8 or a system encoding filename to g_get_{base,dir}name in each 
case. Sigh.

Maybe just ignore the issue (WONTFIX) in GLib 2.4, and instead concentrate on 
getting the new API related to bug #101792 as well thought-out and elegant as 
possible for 2.6.

Comment 2 Matthias Clasen 2004-11-03 07:07:43 UTC

Tor, is this still relevant, now that we have the stdio wrappers and the
filename encoding?

Comment 3 Tor Lillqvist 2004-11-03 07:53:48 UTC

I guess this is now irrelevant, yes. Closing as WONTFIX, as the gstdio wrappers 
are more of a workaround and API change than an actual fix.