GNOME Bugzilla – Bug 138412
[cygwin patch] URI conversion
Last modified: 2011-02-18 16:07:24 UTC
On Cygwin, URI conversion (tests/uri-test) fails without this patch. I'm not exactly sure why, but it does fix things up.
Created attachment 26064: rl-glib-CVSHEAD-cygwin-uri.patch
I don't understand this. Doesn't Cygwin's C library mostly ignore i18n issues, so that the filenames it provides and expects are the same ones visible to native Win32 programs, i.e. in the system codepage? Or does current Cygwin use the wide-character Win32 API, and then provide/expect all filenames in UTF-8? (Would be a good idea, IMHO.)
I'm not sure what's going on here. Whatever the default behaviour is for UNIX is what is required for Cygwin here, since the win32-specific functionality breaks URI conversion (within iconv(), AFAICS). I think Cygwin does use the wide-character API, since I've seen it call some functions ending with 'W' in kernel.dll. Are these all "wide" versions of the Windows API? uri-test fails without the patch, though.
Try creating a file with some non-ASCII characters in the name using Explorer (or Notepad, or whatever non-Cygwin tool). Then write a tiny Cygwin program that calls opendir() on that directory and readdir()s through it. Print out the found names, each character in hex (to avoid some output conversion that might also be going on). Did readdir() return the file name in UTF-8 or in your local codepage? I.e. does the non-ASCII character correspond to just one byte, or several? If Cygwin uses the current codepage for file names in its API, I don't see why using the Unix ifdef branch would be correct. You then would have to set the G_FILENAME_ENCODING or G_BROKEN_FILENAMES environment variable, and the end result would be the same as using the Win32 branch. I think whatever problem you see in the URI handling is another thing. What errors do you get?
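A minimal sketch of the test program described above, assuming a POSIX-style dirent API as Cygwin provides. The function names hex_dump_name and dump_directory are made up for illustration and do not come from the bug report or from GLib.

```c
#include <dirent.h>
#include <stdio.h>

/* Write each byte of `name` into `out` as two hex digits, separated by
   spaces. The cast to unsigned char keeps bytes >= 0x80 from being
   sign-extended when they are promoted for printing. */
void hex_dump_name(const char *name, char *out, size_t outsize)
{
    size_t used = 0;

    if (outsize == 0)
        return;
    out[0] = '\0';
    /* Each step writes at most 3 characters plus the terminating NUL. */
    for (const char *p = name; *p != '\0' && used + 4 <= outsize; p++) {
        used += (size_t)snprintf(out + used, outsize - used, "%s%02x",
                                 p == name ? "" : " ", (unsigned char)*p);
    }
}

/* Print the raw bytes of every entry name in `path`, one entry per line.
   Returns 0 on success, -1 if the directory cannot be opened. */
int dump_directory(const char *path)
{
    DIR *dir = opendir(path);
    struct dirent *entry;
    char hex[4096];

    if (dir == NULL) {
        perror(path);
        return -1;
    }
    while ((entry = readdir(dir)) != NULL) {
        hex_dump_name(entry->d_name, hex, sizeof hex);
        printf("%s\n", hex);
    }
    closedir(dir);
    return 0;
}
```

Calling dump_directory(".") from main() in the directory that holds the test file should be enough. A non-ASCII character that shows up as one byte suggests a single-byte codepage, while two or more bytes per character suggests UTF-8.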
I've tried opendir()/readdir() with a file with a pound (£), ae (æ) and cedilla (ç) in its name. The name came back encoded in iso-8859-1, which is not a Windows encoding. I called the file "Notepad.£æç.txt", but "ls" displayed it as "Notepad.???.txt", whereas my program returned it as:
Notepad.£æç.txt: 4e 6f 74 65 70 61 64 2e ffffffa3 ffffffe6 ffffffe7 2e 74 78 74
I'm not sure why the symbols are 32 bits wide for a single char! The last two numbers do correspond to the iso-8859-1 codes though. I've attached the file so you can see it "raw".
Created attachment 26240: dirtest.txt
This appears to be iso-8859-1 encoded.
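The ffffff prefix in the dump above is most likely sign extension rather than a 32-bit character: where plain char is signed, a byte such as 0xa3 is negative, and it is widened to int before printf sees it. A small sketch of the effect, assuming a 32-bit int; the two function names are hypothetical.

```c
#include <stdio.h>

/* Format a byte the way a naive printf("%x", c) does with a signed char:
   the negative value is widened first, so the high bits are all ones. */
void format_sign_extended(signed char c, char *out, size_t outsize)
{
    snprintf(out, outsize, "%x", (unsigned int)c);
}

/* Format the same byte after casting to unsigned char, which keeps the
   value in the range 0..255 before it is widened. */
void format_as_byte(signed char c, char *out, size_t outsize)
{
    snprintf(out, outsize, "%x", (unsigned int)(unsigned char)c);
}
```

With the byte for the pound sign, the first function yields ffffffa3 and the second yields a3, so the file name really does contain one byte per non-ASCII character.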
> It returned encoded in iso-8859-1, which is not a Windows encoding.
Wrong. ISO-8859-1 is a strict subset of codepage 1252, so for those characters you don't see any difference. (CP1252 has non-control characters in 0x80-0x9f, whereas those are control characters in ISO-8859-1.) To notice the difference, try creating a file in Explorer with the Euro character (for instance) in its name. It is 0x80 in CP1252 and nonexistent in ISO-8859-1.
As Cygwin's readdir() returns the file names in the current Windows code page, and presumably all other Cygwin APIs also take/return filenames in that encoding, it is not useful to use the UTF-8 oriented Unix ifdef branches, where you would have to explicitly tell GLib the encoding used for filenames with environment variables.
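To illustrate why £, æ and ç cannot tell the two encodings apart while the Euro sign can, here is a small self-contained sketch that decodes a single byte under each encoding. The function names are hypothetical, and the table covers only the 0x80-0x9f range where CP1252 and ISO-8859-1 differ; bytes that the published CP1252 table leaves undefined are marked 0 here.

```c
#include <stdint.h>

/* Unicode code points for CP1252 bytes 0x80..0x9f.
   0 marks the five bytes the CP1252 table leaves undefined. */
static const uint16_t cp1252_high[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

/* ISO-8859-1 maps every byte to the code point with the same value. */
uint32_t decode_latin1(unsigned char b)
{
    return b;
}

/* CP1252 agrees with ISO-8859-1 everywhere except 0x80..0x9f. */
uint32_t decode_cp1252(unsigned char b)
{
    if (b >= 0x80 && b <= 0x9f)
        return cp1252_high[b - 0x80];
    return b;
}
```

Bytes 0xa3, 0xe6 and 0xe7 decode identically under both functions, whereas 0x80 is U+20AC (Euro sign) in CP1252 and the control character U+0080 in ISO-8859-1.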
> > It returned encoded in iso-8859-1, which is not a Windows encoding.
> Wrong. ISO-8859-1 is a strict subset of codepage 1252, so for those characters,
> you don't see any difference.
I can now verify this. However, when I add a UCS char not in CP1252, I'm getting a single char back (0x3f, '?'). Presumably, this means "not representable in CP1252", though it's a nasty way of doing it. I'll see if I can get uri-test to work using the win32 code.
Regards,
Roger
Yes, that's how the dirent functions both in Cygwin and mingw seem to work. They return characters in file names that aren't representable in the system codepage as question marks. Presumably it's the underlying Win32 API FindFirstFile() and FindNextFile() that does this. (The 'A' version of those functions, that is.)
I have suggested in another bug report that GLib's dirent wrappers, g_dir_open() and g_dir_read_name(), should on Win32 use the wide-character API (FindFirstFileW() and FindNextFileW()) and then, for file names that aren't representable in the system codepage, return the 8.3 name instead, if present. (I don't know whether all NTFS and CIFS implementations always also keep an 8.3 format name for directory entries.)
In GLib 2.6 perhaps there could be wrappers for the common filename-related ANSI C and POSIX functions that would take and return UTF-8 names. On Unix these wrappers would just call g_filename_to_utf8() and g_filename_from_utf8(); on Windows they would call g_utf16_to_utf8() and g_utf8_to_utf16() and use the wide-character API.
Using the wide character functions to get a UTF-8 filename would be the best long-term solution, I think.
One reason uri-test was failing when compiled with the WIN32-specific code was that the codepage returned by _get_get_charset was US-ASCII. I've attached a patch to correct this. Note that this drops the config.charset fix from one of my previous patches, which should not be required.
An issue I still have is that many of the uri-tests still fail. This is due to filenames like c:\ not being seen as an absolute path, because G_DIR_SEPARATOR is only defined as '\' for G_OS_WIN32. Cygwin can use '/' or '\' interchangeably, e.g. cd c:/tmp\\build/current. The OS_WIN32 and PLATFORM_WIN32 defines can't cope with this. Cygwin is supposed to be a UNIX-like environment, so '/' is the default. I think sticking with the UNIX branches in gconvert.c might be the best/simplest solution for now.
Created attachment 26464: Use the WIN32 code in localcharset.c
I'll probably have a further patch to gconvert.c and possibly gutil.c.
Created attachment 26467: rl-glib-CVSHEAD-cygwin-uri3.patch
Disable UTF-8 filename checks on all Win32 platforms. With this and the second patch, uri-test passes. The original patch (assuming you've not yet applied it) is no longer required.
Patches to localcharset.c and uri-test.c applied.