Bug 101792 - g_dir_read_name: filename encoding on Win32
Status: RESOLVED FIXED
Product: glib
Classification: Platform
Component: win32
Version: 2.0.x
Hardware: Other All
Importance: Normal normal
Target Milestone: ---
Assigned To: gtk-win32 maintainers
Reported: 2002-12-22 08:55 UTC by Tor Lillqvist
Modified: 2011-02-18 16:09 UTC

Attachments
Update to the dirent code used when building GLib with MSVC (13.19 KB, patch)
2004-08-13 04:07 UTC, Tor Lillqvist
Suggested patch (23.61 KB, patch)
2004-08-13 04:11 UTC, Tor Lillqvist
Couple of new files, ZIP archive (2.27 KB, application/octet-stream)
2004-08-13 04:15 UTC, Tor Lillqvist
New try (37.27 KB, patch)
2004-08-21 21:07 UTC, Tor Lillqvist
Again new try (38.00 KB, patch)
2004-08-24 23:16 UTC, Tor Lillqvist
Updated patch (43.08 KB, patch)
2004-10-27 01:25 UTC, Tor Lillqvist

Description Tor Lillqvist 2002-12-22 08:55:52 UTC
We need to document the fact that g_dir_read_name() returns the 
name in the character encoding used by the system and C library. Adding a 
g_dir_read_name_utf8() would also be a good idea.

While at it, we should also document that the return value from 
g_dir_read_name() points to non-GLib static storage and should be copied 
unless used only immediately after the call (however one should word it).
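
A minimal sketch of the copying rule described above. It uses POSIX
opendir()/readdir() instead of GDir, on the assumption that the hazard is the
same: the returned name lives in storage owned by the directory stream, not by
the caller. sketch_first_entry_copy() is a hypothetical helper, not GLib API.

```c
#include <assert.h>
#include <dirent.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap copy of the first entry name found
 * in `path`, or NULL if the directory cannot be read. The caller frees it. */
char *sketch_first_entry_copy(const char *path)
{
    DIR *dir;
    struct dirent *entry;
    char *copy = NULL;

    assert(path != NULL);

    dir = opendir(path);
    if (dir == NULL)
        return NULL;

    entry = readdir(dir);
    if (entry != NULL) {
        /* Copy now: the next readdir() on this stream or the closedir()
         * below may reuse or release the storage behind entry->d_name. */
        size_t len = strlen(entry->d_name);
        copy = malloc(len + 1);
        if (copy != NULL)
            memcpy(copy, entry->d_name, len + 1);
    }

    closedir(dir);
    return copy;
}
```

With GDir the same pattern would presumably be a g_strdup() of the returned
name before the next g_dir_read_name() or g_dir_close() call.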
Comment 1 Tor Lillqvist 2002-12-22 09:42:12 UTC
Win32 issues: 

gdir.c should use the wide-char versions of the dirent functions if 
running on an NT-class system. (On NT/2k/XP, the file system is 
Unicode.) The newest mingw runtime contains wide-character versions 
of the dirent functions. For MSVC users the dirent implementation 
provided in build/win32/dirent would have to be updated. (It is 
directly copied from a previous version of the mingw runtime.)

g_dir_read_name() should then convert to the system codepage (single- 
or multiple-byte charset) in order to work as previously. Of course, 
this will fail if the name has characters not in the current system 
codepage. But that's not a regression; g_dir_read_name() doesn't work 
for such names currently, either. A g_dir_read_name_utf8() would of 
course have no problems.

Or perhaps g_dir_read_name() should return the "short" (8.3) version 
of the file name in these cases? 
Comment 2 Owen Taylor 2002-12-23 15:25:33 UTC
It's certainly a serious problem if the system filename encoding
doesn't handle all filenames on the system.... things like
gtk_filesel_get_filename() use the system filename encoding.

And how would you go about opening a file that couldn't be
named in the system filename encoding? You couldn't use
g_file_get_contents() or g_io_channel_new_file(), or..



Comment 3 Owen Taylor 2003-01-28 20:32:30 UTC
Not sure what the resolution here is going to be, but doesn't appear
that anything is going to happen for 2.2.1.
Comment 4 Matthias Clasen 2003-02-23 22:51:14 UTC
I have added a note about encoding to the API docs for
g_dir_read_name() now.
Comment 5 Owen Taylor 2003-05-22 20:06:31 UTC
Very unclear what the direction forward here is - it seems
like we'd have to wrap the entire runtime to accept
filenames that are 8-bit UTF-8 and use Unicode versions
internally.
Comment 6 Tor Lillqvist 2003-10-13 11:59:39 UTC
I think the best way to fix the Win32 problem is to indeed use wide-
char versions of the dirent functions. If the file name can be 
converted to the current code page, then return that. Otherwise 
g_dir_read_name() should return the "short" (8.3) name. Some testing 
indicates that the cAlternateFileName field of the WIN32_FIND_DATA 
(as returned by the FindFirstFile() API) always is in plain ASCII.

BTW, the GetShortPathName() API is not useful. We would need to use 
the wide-char version of it, but that doesn't return an ASCII-only 
8.3 name for short file names, even if they contain non-current-
codepage chars.

For instance, if I have a file whose name consists of a single 
Cyrillic letter (and my codepage is non-Cyrillic), the short name as 
shown by dir /x is 6140~1, but GetShortPathNameW() returns the same 
single Cyrillic letter. (This does make some sense, of course, as 
that indeed is a quite short name...)

FindFirstFileW(), on the other hand, does store "6140~1" in the 
cAlternateFileName field of its WIN32_FIND_DATA parameter.

If some day a g_dir_read_name_utf8() was added, then we indeed also 
would have to provide g_open_utf8(), g_fopen_utf8(), g_stat_utf8(), 
etc.

Changing summary as this now mainly is a Win32 issue.
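
The fallback proposed in this comment can be sketched roughly as follows. This
is an illustration under stated assumptions, not the code from any patch:
sketch_to_codepage() is a hypothetical stand-in that treats only ASCII as
representable, where real Win32 code would more likely call
WideCharToMultiByte() and check its lpUsedDefaultChar output. The long and
short names stand for the cFileName and cAlternateFileName fields.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in for a code page conversion: only ASCII counts as
 * representable. Returns 1 and fills `out` on success, 0 otherwise. */
static int sketch_to_codepage(const wchar_t *wide, char *out, size_t out_size)
{
    size_t i;

    for (i = 0; wide[i] != L'\0'; i++) {
        unsigned long c = (unsigned long) wide[i];
        if (c > 0x7f || i + 1 >= out_size)
            return 0;               /* not representable, or buffer too small */
        out[i] = (char) c;
    }
    if (i >= out_size)
        return 0;                   /* no room for the terminating NUL */
    out[i] = '\0';
    return 1;
}

/* Returns the converted long name if it fits the code page, otherwise the
 * short (8.3) name supplied by the caller. */
const char *sketch_pick_name(const wchar_t *long_name, const char *short_name,
                             char *buf, size_t buf_size)
{
    assert(long_name != NULL && short_name != NULL && buf != NULL);

    if (sketch_to_codepage(long_name, buf, buf_size))
        return buf;
    return short_name;
}
```

A name made of a single Cyrillic letter would fall through to the short name
here, matching the "6140~1" case described above.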
Comment 7 Owen Taylor 2003-10-13 16:26:22 UTC
I really don't think that using the 8.3 filenames is
workable - see discussion at length in:

http://mail.gnome.org/archives/gtk-devel-list/2003-October/msg00058.html

My basic concern is that we can't use this approach consistently
across GLib and GTK+ because it is impossible to create new
non-ASCII filenames that way.
Comment 8 Tor Lillqvist 2004-08-13 04:07:51 UTC
Created attachment 30500
Update to the dirent code used when building GLib with MSVC

So that also Hans can test the following patches...
Comment 9 Tor Lillqvist 2004-08-13 04:11:24 UTC
Created attachment 30501
Suggested patch

Patch that adds UTF-8 versions of the GDir functions, and some in gfileutils.c.
Rather early code, not really tested... just to get comments. Should the UTF-8
wrapper functions be called g_*_utf8 or g_utf8_*, for instance?
Comment 10 Tor Lillqvist 2004-08-13 04:15:01 UTC
Created attachment 30503
Couple of new files, ZIP archive

Two new files, with the UTF-8 wrappers for C library functions. Not sure about
what to call them, maybe gutf8wrappers.[ch] instead of gfileutf8.[ch]? Also,
probably need some more wrappers for other C library functions that take
filenames that I didn't think of.
Comment 11 Owen Taylor 2004-08-13 13:58:17 UTC
I don't think you quite understood my proposal. 

We keep the idea of a filename encoding, which is *literally* the
filename encoding on Unix. 

But on windows, for the filename encoding, we use something that is
not a native filename encoding, quite - the Unicode name converted
to UTF-8. So, g_filename_to/from_utf8() are no-ops on Windows, and
on Unix, the g_fopen(), etc, calls are identical to fopen(), etc
and do no filename conversion.

The approach you took doesn't work because we need to be able to
operate on "incorrectly" encoded filenames on Unix ... to refer
to every file on disk even if we don't know how to convert the
filename to UTF-8.
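
A rough sketch of what such a wrapper could look like, assuming the proposal
as stated in this comment: on Windows the name is UTF-8 and is converted to
UTF-16 for the wide-character C runtime, while on Unix the bytes are passed
through untouched. sketch_fopen() is a hypothetical name, and the fixed buffer
sizes are a simplification.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _WIN32
#include <windows.h>
#include <wchar.h>
#endif

/* Hypothetical fopen() wrapper. `filename` is UTF-8 on Windows and the
 * literal on-disk byte string on Unix. */
FILE *sketch_fopen(const char *filename, const char *mode)
{
    assert(filename != NULL && mode != NULL);

#ifdef _WIN32
    {
        wchar_t wname[MAX_PATH];
        wchar_t wmode[16];

        /* A length of -1 converts up to and including the terminating NUL.
         * A return value of 0 means the conversion failed. */
        if (MultiByteToWideChar(CP_UTF8, 0, filename, -1, wname, MAX_PATH) == 0)
            return NULL;
        if (MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, 16) == 0)
            return NULL;
        return _wfopen(wname, wmode);
    }
#else
    /* No conversion: even "incorrectly" encoded names stay reachable. */
    return fopen(filename, mode);
#endif
}
```

The Unix branch doing no conversion is the point of the proposal: every file
on disk can be named, whether or not its name converts to UTF-8.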
Comment 12 Tor Lillqvist 2004-08-14 01:29:32 UTC
Hmm, I see. I will cook up a patch for that approach.
Comment 13 Tor Lillqvist 2004-08-21 17:42:23 UTC
I don't think we can change g_filename_to/from_utf8 to be no-ops on Windows, 
that would break existing apps horribly. Instead we need new API and a concept 
of "GFilename" (or whatever we choose to call it), that would be UTF-8 on 
Windows and the literal on-disk file name on Unix (which can be whatever 
encoding, even different ones for different files in the same dir with 
really confused users...)

g_filename_to_gfilename()/from_gfilename() would convert from/to the system 
codepage on Windows and be a no-op on Unix. Although we should of course 
strongly recommend that apps shouldn't use those, but instead always handle 
GFilenames, and use the g_ wrappers for C library functions to open/stat/etc 
them.

The new GDir functions could be called g_dir_open_gfilename(), 
g_dir_read_gfilename(), etc. The C library wrappers would be g_open_gfilename() 
etc. How does this sound? Suggested implementation will follow in some hours.
Comment 14 Tor Lillqvist 2004-08-21 21:07:57 UTC
Created attachment 30816
New try

What about something like this... Includes two new files, now called
gfilewrappers.[ch].
Comment 15 Tor Lillqvist 2004-08-24 23:16:00 UTC
Created attachment 30908
Again new try

Drop the GFilename idea. Make GLib return and take UTF-8 filenames on Windows.
Keep old ABI versions for DLL ABI stability, though. Use different names for
the new-style UTF-8 versions. Hide this through a #define.

(The #defines for the above are now sprinkled through the headers in question,
should probably be collected into one place, with a comment "you are not
supposed to know about these #defines")

Exclude the binary-compatibility entry points from the import libraries
(keeping them just as DLL entry points) through the PRIVATE keyword in the .def
file. (This actually works only in the Microsoft linker and very new GNU
binutils, though.)
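
The renaming trick described in this comment can be illustrated with a small
sketch. All names here are hypothetical and the functions only report which
variant was reached. In the real patch the #define would sit in a public
header and apply on Windows only.

```c
#include <assert.h>

/* Old-style entry point. It keeps its original symbol name so that
 * binaries built against the old library still resolve it. */
const char *sketch_name_encoding(void)
{
    return "system codepage";
}

/* New-style entry point, exported under a different symbol name. */
const char *sketch_name_encoding_utf8(void)
{
    return "UTF-8";
}

/* Header-side redirection: source compiled from here on that calls
 * sketch_name_encoding() is silently bound to the UTF-8 variant. */
#define sketch_name_encoding sketch_name_encoding_utf8

/* A caller written against the header; the check passes only because the
 * macro above redirects the call. */
void sketch_caller(void)
{
    assert(sketch_name_encoding()[0] == 'U');
}
```

Already-built binaries keep calling the old symbol, while recompiled code
picks up the UTF-8 version without any source changes.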
Comment 16 Tor Lillqvist 2004-08-25 15:47:32 UTC
Applied the dirent.c patch as it is independent from this bug. Applied the 
g_win32_get_windows_version() and G_WIN32_HAVE_WIDECHAR_API() addition to 
gwin32.[ch] as it is used by code already in CVS (gutils.c).
Comment 17 Tor Lillqvist 2004-09-18 03:22:17 UTC
Owen, what do you think about the newest approach? Can I commit that (well, not 
exactly that patch any longer, but the same idea)?
Comment 18 Matthias Clasen 2004-10-26 14:16:35 UTC
Tor, do you have an up-to-date version of this patch incorporating last week's
discussion?
Comment 19 Tor Lillqvist 2004-10-27 01:25:12 UTC
Created attachment 33098
Updated patch

Didn't have as much time to look through it as I had planned (sleep, bah), but
some updates anyway. Still need to write more docs, and revise existing
documentation regarding file name charset convention. Would using the term
"GLib file name encoding" for the concept "on-disk encoding on Unix, UTF-8 on
Windows" be a good idea?
Comment 20 Matthias Clasen 2004-10-27 16:47:03 UTC
I added some minimal docs, more is needed.


2004-10-27  Matthias Clasen  <mclasen@redhat.com>

	Introduce the idea of a filename encoding, which is 
	*literally* the filename encoding on Unix. On windows, 
	use the Unicode name converted to UTF-8. (#156325,
	Tor Lillqvist, Owen Taylor)
	
	* glib/gdir.[hc]: 
	* glib/gconvert.[hc]: 
	* glib/gfileutils.[hc]: 
	* glib/gutils.[hc]: 
	* glib/giowin32.c: On Windows, keep old ABI versions 
	of GLib pathname api for DLL ABI stability. Use different 
	names for the new-style UTF-8 versions. Hide this through 
	a #define.

	* glib/gstdio.[hc]: New files containing wrappers for
	POSIX pathname api.

	* glib/glib.symbols: Add new symbols.

	* glib/makegalias.pl: Drop Win32 specific .def syntax,
	include gstdio.h