Bug 138412 – [cygwin patch] URI conversion



Summary:	[cygwin patch] URI conversion


Status:	RESOLVED FIXED

Product:	glib
Classification:	Platform
Component:	win32
Version:	2.4.x
Hardware:	Other Windows

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gtk-win32 maintainers
QA Contact:	gtk-win32 maintainers

URL:
Whiteboard:

Depends on:
Blocks:	137591

Reported:	2004-03-29 14:21 UTC by Roger Leigh
Modified:	2011-02-18 16:07 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
rl-glib-CVSHEAD-cygwin-uri.patch (732 bytes, patch) 2004-03-29 14:22 UTC, Roger Leigh
dirtest.txt (99 bytes, application/octet-stream) 2004-04-02 09:05 UTC, Roger Leigh
Use the WIN32 code in localcharset.c (601 bytes, patch) 2004-04-08 11:24 UTC, Roger Leigh
rl-glib-CVSHEAD-cygwin-uri3.patch (608 bytes, patch) 2004-04-08 13:12 UTC, Roger Leigh

Description Roger Leigh 2004-03-29 14:21:42 UTC

On Cygwin, URI conversion (tests/uri-test) fails without this patch.  I'm not
exactly sure why, but it does fix things up.

Comment 1 Roger Leigh 2004-03-29 14:22:12 UTC

Created attachment 26064
rl-glib-CVSHEAD-cygwin-uri.patch

Comment 2 Tor Lillqvist 2004-03-30 01:47:46 UTC

I don't understand this. Doesn't Cygwin's C library mostly ignore i18n issues, 
and the filenames that it provides/expects are the same ones that are visible 
to native Win32 programs, i.e. in the system codepage? Or does current Cygwin 
use the wide-character Win32 API, and then provide/expect all filenames in 
UTF-8? (Would be a good idea, IMHO.)

Comment 3 Roger Leigh 2004-03-30 11:02:50 UTC

I'm not sure what's going on here.  Whatever the default behaviour is for UNIX
is what is required for Cygwin here, since the win32-specific functionality
breaks URI conversion (within iconv(), AFAICS).

I think Cygwin does use the wide-character API, since I've seen it call some
functions ending with 'W' in kernel32.dll.  Are these all "wide" versions of the
Windows API?

uri-test fails without the patch, though.

Comment 4 Tor Lillqvist 2004-03-30 22:53:16 UTC

Try creating a file with some non-ASCII characters in the name using Explorer 
(or Notepad, or whatever non-Cygwin tool). Then write a tiny Cygwin program 
that calls opendir() on that directory and readdir()s through it. Print out the 
found names, each character in hex (to avoid some output conversion that might 
also be going on). Did readdir() return the file name in UTF-8 or in your local 
codepage? I.e. does the non-ASCII character correspond to just one byte, or 
several?

If Cygwin uses the current codepage for file names in its API, I don't see why 
using the Unix ifdef branch would be correct. You then would have to set the 
G_FILENAME_ENCODING or G_BROKEN_FILENAMES environment variable, and the end 
result would be the same as using the Win32 branch. 

I think whatever problem you see in the URI handling is another thing. What 
errors do you get?
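
The directory test described in this comment could be sketched roughly as follows. This is an illustration only, not the program that was actually run for this bug; the helper names hex_dump_name() and list_dir_hex() are hypothetical, and only the standard opendir()/readdir()/closedir() calls are assumed to be available.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Format each byte of `name` as two hex digits separated by spaces.
 * The cast to unsigned char keeps bytes >= 0x80 from being sign-extended.
 * Returns the number of bytes of `name` that were formatted. */
static size_t hex_dump_name(const char *name, char *out, size_t outsz)
{
    size_t used = 0, i;

    if (outsz == 0)
        return 0;
    out[0] = '\0';
    for (i = 0; name[i] != '\0'; i++) {
        int n = snprintf(out + used, outsz - used, "%s%02x",
                         i ? " " : "", (unsigned int)(unsigned char)name[i]);
        if (n < 0 || (size_t)n >= outsz - used) {
            out[used] = '\0';   /* drop the partially written byte */
            break;
        }
        used += (size_t)n;
    }
    return i;
}

/* Print every entry of `path` followed by its bytes in hex.
 * Returns the number of entries seen, or -1 if the directory
 * cannot be opened. */
static int list_dir_hex(const char *path)
{
    DIR *dir = opendir(path);
    struct dirent *entry;
    char hex[3 * 256 + 1];
    int count = 0;

    if (dir == NULL)
        return -1;
    while ((entry = readdir(dir)) != NULL) {
        hex_dump_name(entry->d_name, hex, sizeof hex);
        printf("%s: %s\n", entry->d_name, hex);
        count++;
    }
    closedir(dir);
    return count;
}
```

Calling list_dir_hex(".") from main() in a directory that holds a file created with Explorer shows whether each non-ASCII character comes back as one byte (system codepage) or several (UTF-8).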

Comment 5 Roger Leigh 2004-04-02 09:04:00 UTC

I've tried opendir()/readdir() with a file with a pound (£), ae (æ) and cedilla
(ç) in it.  It returned encoded in iso-8859-1, which is not a Windows encoding.
I called the file "Notepad.£æç.txt", but "ls" displayed it as "Notepad.???.txt",
whereas my program returned it as:

Notepad.£æç.txt: 4e 6f 74 65 70 61 64 2e ffffffa3 ffffffe6 ffffffe7 2e 74 78 74

I'm not sure why the symbols are 32-bits wide for a single char!  The last two
hex digits of each do correspond to the iso-8859-1 codes though.  I've attached
the file so you can see it "raw".
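
A likely explanation for the "ffffffa3" output, assuming plain char is signed and int is 32 bits wide on the platform in question: a char holding 0xa3 is promoted to a negative int before printf() sees it, so %x prints the sign bits as well. The sketch below shows the effect and the usual fix; the function names are hypothetical.

```c
#include <stdio.h>

/* Format one byte the way a naive printf("%x", c) does when char is
 * signed: the value is promoted to int first, so 0xa3 becomes -93 and
 * prints with the upper bits set. */
static void format_sign_extended(signed char c, char *out, size_t outsz)
{
    int promoted = c;                       /* integer promotion */
    snprintf(out, outsz, "%x", (unsigned int)promoted);
}

/* Going through unsigned char first keeps the value in 0..255. */
static void format_as_byte(signed char c, char *out, size_t outsz)
{
    snprintf(out, outsz, "%x", (unsigned int)(unsigned char)c);
}
```

So the filename bytes themselves are still one byte each; only the printing widened them.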

Comment 6 Roger Leigh 2004-04-02 09:05:02 UTC

Created attachment 26240
dirtest.txt

This appears to be iso-8859-1 encoded.

Comment 7 Tor Lillqvist 2004-04-02 10:12:17 UTC

> It returned encoded in iso-8859-1, which is not a Windows encoding.

Wrong. ISO-8859-1 is a strict subset of codepage 1252, so for those characters, 
you don't see any difference. (CP1252 has non-control characters in 0x80-0x9f, 
whereas those are control characters in ISO-8859-1.) To notice the difference, 
try creating a file in Explorer with the Euro character (for instance) in it. 
It is 0x80 in CP1252, nonexistent in ISO-8859-1.

As Cygwin's readdir() returns the file names in the current Windows code page, 
and presumably all other Cygwin APIs also take/return filenames in that 
encoding, it is not useful to use the UTF-8 oriented Unix ifdef branches, 
where you would have to explicitly tell GLib the encoding used for filenames 
with environment variables.
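
The difference between the two encodings can be made concrete with a small lookup. This is a sketch built from the published CP1252 table, not code from GLib or Cygwin; the function names are hypothetical, and 0 marks the five bytes that the table leaves unassigned.

```c
#include <stdint.h>

/* CP1252 assignments for 0x80..0x9f; 0 marks the unassigned bytes. */
static const uint16_t cp1252_high[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

/* ISO-8859-1 maps every byte to the code point of the same value. */
static uint32_t latin1_to_unicode(unsigned char b)
{
    return b;
}

/* CP1252 differs from ISO-8859-1 only in the range 0x80..0x9f. */
static uint32_t cp1252_to_unicode(unsigned char b)
{
    if (b >= 0x80 && b <= 0x9f)
        return cp1252_high[b - 0x80];
    return b;
}

/* Return 1 if the byte decodes identically under both encodings. */
static int encodings_agree(unsigned char b)
{
    return latin1_to_unicode(b) == cp1252_to_unicode(b);
}
```

The bytes 0xa3, 0xe6 and 0xe7 from the earlier test decode identically under both, which is why that test could not tell the encodings apart; 0x80 (the Euro sign in CP1252) can.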

Comment 8 Roger Leigh 2004-04-06 12:17:43 UTC

> > It returned encoded in iso-8859-1, which is not a Windows encoding.

> Wrong. ISO-8859-1 is a strict subset of codepage 1252, so for those characters, 
> you don't see any difference.

I can now verify this.  However, when I add a UCS char not in CP1252, I'm
getting a single char back (0x3f, '?').  Presumably, this means "not
representable in CP1252", though it's a nasty way of doing it.

I'll see if I can get uri-test to work using the win32 code.

Regards,
Roger

Comment 9 Tor Lillqvist 2004-04-07 04:01:56 UTC

Yes, that's how the dirent functions both in Cygwin and mingw seem to work. 
They return characters in file names that aren't representable in the 
system codepage as question marks. Presumably it's the underlying Win32 API 
FindFirstFile() and FindNextFile() that does this. (The 'A' version of those 
functions, that is.)

I have suggested in another bug report that GLib's dirent wrappers, 
g_dir_open() and g_dir_read_name() should on Win32 use the wide character API 
(FindFirstFileW() and FindNextFileW()) and then for file names that aren't 
representable in the system codepage, return the 8.3 name instead, if present. 
(I don't know whether all NTFS and CIFS implementations always keep also an 8.3 
format name for directory entries.)

In GLib 2.6 perhaps there could be wrappers for the common filename-related 
ANSI C and POSIX functions that would take and return UTF-8 names. On Unix 
these wrappers would just call g_filename_to_utf8() and g_filename_from_utf8(), 
on Windows they would call g_utf16_to_utf8() and g_utf8_to_utf16() and use the 
wide-character API.
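
To illustrate the conversion step such wrappers would need on Windows, here is a portable sketch of a UTF-16 to UTF-8 encoder. It is a stand-in for what g_utf16_to_utf8() would do in the proposal, not GLib's implementation; the function name and its error convention are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode `len` UTF-16 code units from `in` as UTF-8 into `out`.
 * Returns the number of bytes written (excluding the terminating NUL),
 * or -1 on an unpaired surrogate or if `out` is too small. */
static long utf16_to_utf8_sketch(const uint16_t *in, size_t len,
                                 char *out, size_t outsz)
{
    size_t i = 0, o = 0;

    while (i < len) {
        uint32_t cp = in[i++];

        if (cp >= 0xD800 && cp <= 0xDBFF) {          /* high surrogate */
            if (i >= len || in[i] < 0xDC00 || in[i] > 0xDFFF)
                return -1;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i++] - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return -1;                               /* stray low surrogate */
        }

        if (o + 5 > outsz)        /* room for up to 4 bytes plus the NUL */
            return -1;
        if (cp < 0x80) {
            out[o++] = (char)cp;
        } else if (cp < 0x800) {
            out[o++] = (char)(0xC0 | (cp >> 6));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out[o++] = (char)(0xE0 | (cp >> 12));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (char)(0xF0 | (cp >> 18));
            out[o++] = (char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        }
    }
    if (o >= outsz)
        return -1;
    out[o] = '\0';
    return (long)o;
}
```

A wrapper built this way never passes through the system codepage, so names outside CP1252 would survive instead of collapsing to '?'.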

Comment 10 Roger Leigh 2004-04-08 11:23:11 UTC

Using the wide character functions to get a UTF-8 filename would be the best 
long-term solution, I think.

One reason uri-test was failing when compiling using the WIN32-specific code 
was the codepage returned by _get_get_charset being US-ASCII.  I've attached a 
patch to correct this.  Note this drops the config.charset fix from one of my 
previous patches, which should not be required.

An issue I still have is that many of the uri-tests still fail.  This is due 
to filenames like c:\ not being seen as an absolute path.  This is because
G_DIR_SEPARATOR is only defined as '\' for G_OS_WIN32.  Cygwin can use '/' 
or '\' interchangeably, e.g. cd c:/tmp\\build/current.  The OS_WIN32 and 
PLATFORM_WIN32 defines can't cope with this.  Cygwin is supposed to be a 
UNIX-like environment, so '/' is the default.  I think sticking with the UNIX 
branches in gconvert.c might be the best/simplest solution for now.
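
For illustration, an absolute-path check that tolerates both separators and a drive prefix might look like the sketch below. This is not GLib's g_path_is_absolute() and not part of any patch attached here; the function names are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* Treat both '/' and '\' as directory separators. */
static int is_dir_separator(char c)
{
    return c == '/' || c == '\\';
}

/* Return 1 if `path` is absolute when both separators and
 * "X:" drive prefixes are honoured, 0 otherwise. */
static int path_is_absolute_sketch(const char *path)
{
    if (path == NULL || path[0] == '\0')
        return 0;
    if (is_dir_separator(path[0]))
        return 1;
    /* Drive letter, colon, then a separator: "c:\" or "c:/". */
    if (isalpha((unsigned char)path[0]) && path[1] == ':' &&
        is_dir_separator(path[2]))
        return 1;
    return 0;
}
```

Under this check "c:\", "c:/tmp\\build/current" and "/usr" all count as absolute, while "c:tmp" (drive-relative) does not.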

Comment 11 Roger Leigh 2004-04-08 11:24:56 UTC

Created attachment 26464
Use the WIN32 code in localcharset.c

I'll probably have a further patch to gconvert.c and possibly gutil.c.

Comment 12 Roger Leigh 2004-04-08 13:12:59 UTC

Created attachment 26467
rl-glib-CVSHEAD-cygwin-uri3.patch

Disable UTF-8 filename checks on all Win32 platforms.

With this and the second patch, uri-test passes.  The original patch (assuming
you've not yet applied it) is no longer required.

Comment 13 Tor Lillqvist 2004-04-10 01:59:53 UTC

Patches to localcharset.c and uri-test.c applied.