GNOME Bugzilla – Bug 101792
g_dir_read_name: filename encoding on Win32
Last modified: 2011-02-18 16:09:28 UTC
We need to document the fact that g_dir_read_name() returns the name in the character encoding used by the system and C library. Adding a g_dir_read_name_utf8() would also be a good idea. While at it, we should also document that the return value from g_dir_read_name() points to non-GLib static storage and should be copied unless used only immediately after the call (however one should word it).
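To illustrate the storage caveat, here is a minimal sketch that uses the POSIX dirent functions directly rather than GLib. The helper name first_entry_copy() is hypothetical, and the sketch assumes g_dir_read_name() hands back a pointer with the same lifetime as the d_name field it wraps.

```c
#include <dirent.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap copy of the first entry name in `path`
   that is not "." or "..", or NULL if the directory cannot be opened, is
   empty, or memory runs out. The caller frees the result. */
char *first_entry_copy(const char *path)
{
    DIR *dir = opendir(path);
    struct dirent *ent;
    char *copy = NULL;

    if (dir == NULL)
        return NULL;

    while ((ent = readdir(dir)) != NULL) {
        size_t len;

        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;

        /* ent->d_name is storage owned by the directory stream. It may be
           overwritten by the next readdir() call and is invalid after
           closedir(), so copy it before doing anything else. */
        len = strlen(ent->d_name) + 1;
        copy = malloc(len);
        if (copy != NULL)
            memcpy(copy, ent->d_name, len);
        break;
    }

    closedir(dir);
    return copy;
}
```

With GLib itself the copy would typically be made with g_strdup() on the value returned by g_dir_read_name().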
Win32 issues: gdir.c should use the wide-char versions of the dirent functions if running on an NT-class system. (On NT/2k/XP, the file system is Unicode.) The newest mingw runtime contains wide-character versions of the dirent functions. For MSVC users the dirent implementation provided in build/win32/dirent would have to be updated. (It is directly copied from a previous version of the mingw runtime.) g_dir_read_name() should then convert to the system codepage (single- or multiple-byte charset) in order to work as previously. Of course, this will fail if the name has characters not in the current system codepage. But that's not a regression, g_dir_read_name doesn't work for such names currently, either. A g_dir_read_name_utf8() would of course have no problems. Or perhaps g_dir_read_name() should return the "short" (8.3) version of the file name in these cases?
It's certainly a serious problem if the system filename encoding doesn't handle all filenames on the system... things like gtk_filesel_get_filename() use the system filename encoding. And how would you go about opening a file that couldn't be named in the system filename encoding? You couldn't use g_file_get_contents() or g_io_channel_new_file(), or...
Not sure what the resolution here is going to be, but doesn't appear that anything is going to happen for 2.2.1.
I have added a note about encoding to the API docs for g_dir_read_name() now.
Very unclear what the direction forward here is - it seems like we'd have to wrap the entire runtime to accept filenames that are 8-bit UTF-8 and use Unicode versions internally.
I think the best way to fix the Win32 problem is to indeed use wide-char versions of the dirent functions. If the file name can be converted to the current code page, then return that. Otherwise g_dir_read_name() should return the "short" (8.3) name. Some testing indicates that the cAlternateFileName field of the WIN32_FIND_DATA (as returned by the FindFirstFile() API) always is in plain ASCII. BTW, the GetShortPathName() API is not useful. We would need to use the wide-char version of it, but that doesn't return an ASCII-only 8.3 name for short file names, even if they contain non-current-codepage chars. For instance, if I have a file whose name consists of a single Cyrillic letter (and my codepage is non-Cyrillic), the short name as shown by dir /x is 6140~1, but GetShortPathNameW() returns the same single Cyrillic letter. (This does make some sense, of course, as that indeed is a quite short name...) FindFirstFileW(), on the other hand, does store "6140~1" in the cAlternateFileName field of its WIN32_FIND_DATA parameter. If some day a g_dir_read_name_utf8() was added, then we indeed also would have to provide g_open_utf8(), g_fopen_utf8(), g_stat_utf8(), etc. Changing summary as this now mainly is a Win32 issue.
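The fallback rule proposed in this comment can be sketched in portable C as follows. This is only a model of the decision, not Win32 code: choose_name() and fits_code_page() are hypothetical names, and the "current code page" is approximated as Latin-1 purely for illustration. A real implementation on Windows would more likely take the long and short names from FindFirstFileW() and test convertibility with WideCharToMultiByte() and its lpUsedDefaultChar output.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Stand-in for "representable in the current code page". ASSUMPTION: only
   code points below 0x100 count as representable (roughly Latin-1). */
static int fits_code_page(const wchar_t *name)
{
    for (; *name != L'\0'; name++)
        if ((unsigned long)*name >= 0x100UL)
            return 0;
    return 1;
}

/* Hypothetical helper: writes the name to hand back into `out` (size `n`).
   Returns 1 if the long name was used, 0 if the short (8.3) name was used,
   and -1 if `out` is too small. */
int choose_name(const wchar_t *long_name, const char *short_name,
                char *out, size_t n)
{
    size_t len, i;

    if (fits_code_page(long_name)) {
        len = wcslen(long_name);
        if (len + 1 > n)
            return -1;
        /* Narrow each character, including the terminating zero. */
        for (i = 0; i <= len; i++)
            out[i] = (char)(unsigned char)long_name[i];
        return 1;
    }

    /* Not representable: fall back to the ASCII-only short name. */
    len = strlen(short_name);
    if (len + 1 > n)
        return -1;
    memcpy(out, short_name, len + 1);
    return 0;
}
```

For the single-Cyrillic-letter case described above, choose_name() would be called with the long name L"\u0416" and the short name "6140~1", and it would pick the short name.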
I really don't think that using the 8.3 filenames is workable - see discussion at length in: http://mail.gnome.org/archives/gtk-devel-list/2003-October/msg00058.html My basic concern is that we can't use this approach consistently across GLib and GTK+ because it is impossible to create new non-ASCII filenames that way.
Created attachment 30500 [details] [review] Update to the dirent code used when building GLib with MSVC So that also Hans can test the following patches...
Created attachment 30501 [details] [review] Suggested patch Patch that adds UTF-8 versions of the GDir functions, and some in gfileutils.c. Rather early code, not really tested... just to get comments. Should the UTF-8 wrapper functions be called g_*_utf8 or g_utf8_*, for instance?
Created attachment 30503 [details] Couple of new files, ZIP archive Two new files, with the UTF-8 wrappers for C library functions. Not sure about what to call them, maybe gutf8wrappers.[ch] instead of gfileutf8.[ch]? Also, probably need some more wrappers for other C library functions that take filenames that I didn't think of.
I don't think you quite understood my proposal. We keep the idea of a filename encoding, which is *literally* the filename encoding on Unix. But on windows, for the filename encoding, we use something that is not a native filename encoding, quite - the Unicode name converted to UTF-8. So, g_filename_to/from_utf8() are no-ops on Windows, and on Unix, the g_fopen(), etc, calls are identical to fopen(), etc and do no filename conversion. The approach you took doesn't work because we need to be able to operate on "incorrectly" encoded filenames on Unix ... to refer to every file on disk even if we don't know how to convert the filename to UTF-8.
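A rough sketch of what such a wrapper could look like under this proposal. The name my_fopen() is hypothetical and this is not the GLib implementation; the fixed-size buffers are a simplification. On Unix the wrapper passes the bytes through untouched, and on Windows it treats them as UTF-8 and calls the wide-character C runtime function.

```c
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#include <wchar.h>
#endif

/* Hypothetical wrapper. `filename` is in the "filename encoding": the raw
   on-disk bytes on Unix, UTF-8 on Windows. */
FILE *my_fopen(const char *filename, const char *mode)
{
#ifdef _WIN32
    wchar_t wname[MAX_PATH];
    wchar_t wmode[16];

    /* Convert UTF-8 to UTF-16. A return value of 0 means the conversion
       failed (for example, the buffer is too small). */
    if (MultiByteToWideChar(CP_UTF8, 0, filename, -1, wname, MAX_PATH) == 0)
        return NULL;
    if (MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, 16) == 0)
        return NULL;
    return _wfopen(wname, wmode);
#else
    /* No conversion at all, so any byte sequence the file system accepts
       can still be opened, even if it is not valid in a known encoding. */
    return fopen(filename, mode);
#endif
}
```

The point of the design is that callers never convert filenames themselves: they pass around whatever the directory-reading function returned and hand it back to the wrappers.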
Hmm, I see. I will cook up a patch for that approach.
I don't think we can change g_filename_to/from_utf8 to be no-ops on Windows, that would break existing apps horribly. Instead we need new API and a concept of "GFilename" (or whatever we choose to call it), that would be UTF-8 on Windows and the literal on-disk file name on Unix (which can be whatever encoding, even different ones for different files in the same dir with really confused users...) g_filename_to_gfilename()/from_gfilename() would convert from/to the system codepage on Windows and be a no-op on Unix. Although we should of course strongly recommend that apps shouldn't use those, but instead always handle GFilenames, and use the g_ wrappers for C library functions to open/stat/etc them. The new GDir functions could be called g_dir_open_gfilename(), g_dir_read_gfilename(), etc. The C library wrappers would be g_open_gfilename() etc. How does this sound? Suggested implementation will follow in some hours.
Created attachment 30816 [details] [review] New try What about something like this... Includes two new files, now called gfilewrappers.[ch].
Created attachment 30908 [details] [review] Again new try Drop the GFilename idea. Make GLib return and take UTF-8 filenames on Windows. Keep old ABI versions for DLL ABI stability, though. Use different names for the new-style UTF-8 versions. Hide this through a #define. (The #defines for the above are now sprinkled through the headers in question, should probably be collected into one place, with a comment "you are not supposed to know about these #defines") Exclude the binary-compatibility entry points from the import libraries (keeping them just as DLL entry points) through the PRIVATE keyword in the .def file. (This actually works only in the Microsoft linker and very new GNU binutils, though.)
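The renaming trick described in this comment can be modelled in a few lines. All identifiers below (mylib_open, mylib_open_utf8, legacy_entry) are hypothetical and the function bodies are placeholders. In a real header the #define would presumably sit inside a Windows-only #ifdef; it is left unconditional here so the sketch behaves the same on every platform.

```c
/* New-style entry point: expects a UTF-8 filename.
   Returns 8 as a marker so the two versions can be told apart. */
int mylib_open_utf8(const char *utf8_name)
{
    (void)utf8_name;
    return 8;
}

/* Old entry point, kept in the library so binaries built against the
   previous ABI still find the symbol. Expects the system code page.
   Returns 1 as a marker. */
int mylib_open(const char *codepage_name)
{
    (void)codepage_name;
    return 1;
}

/* Pointer to the old symbol, taken before the macro below hides the name.
   It plays the role of an already-compiled caller. */
int (*const legacy_entry)(const char *) = mylib_open;

/* In the public header: newly compiled code that writes mylib_open()
   is silently redirected to the UTF-8 version. */
#define mylib_open mylib_open_utf8
```

Source code keeps using the familiar name, while the exported symbol it binds to changes, so old and new binaries can share one DLL.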
Applied the dirent.c patch as it is independent from this bug. Applied the g_win32_get_windows_version() and G_WIN32_HAVE_WIDECHAR_API() addition to gwin32.[ch] as it is used by code already in CVS (gutils.c).
Owen, what do you think about the newest approach? Can I commit that (well, not exactly that patch any longer, but the same idea)?
Tor, do you have an up-to-date version of this patch incorporating last week's discussion?
Created attachment 33098 [details] [review] Updated patch Didn't have as much time to look through it as I had planned (sleep, bah), but some updates anyway. Still need to write more docs, and revise existing documentation regarding file name charset convention. Would using the term "GLib file name encoding" for the concept "on-disk encoding on Unix, UTF-8 on Windows" be a good idea?
I added some minimal docs, more is needed.

2004-10-27  Matthias Clasen  <mclasen@redhat.com>

        Introduce the idea of a filename encoding, which is *literally* the filename encoding on Unix. On windows, use the Unicode name converted to UTF-8. (#156325, Tor Lillqvist, Owen Taylor)

        * glib/gdir.[hc]:
        * glib/gconvert.[hc]:
        * glib/gfileutils.[hc]:
        * glib/gutils.[hc]:
        * glib/giowin32.c: On Windows, keep old ABI versions of GLib pathname api for DLL ABI stability. Use different names for the new-style UTF-8 versions. Hide this through a #define.

        * glib/gstdio.[hc]: New files containing wrappers for POSIX pathname api.

        * glib/glib.symbols: Add new symbols.

        * glib/makegalias.pl: Drop Win32 specific .def syntax, include gstdio.h