GNOME Bugzilla – Bug 722025
cleanup/clarify command line argument encoding on Windows
Last modified: 2014-10-07 04:21:59 UTC
One would expect that such a function receives a command line argument passed to it, but it appears to expect its argument in the GLib filename encoding (which is UTF-8 on Windows). This is in contrast to GOptionContext which appears to expect that command line arguments on Windows machines will have been in the system codepage. I'd like some input from Windows people, but it seems like the correct thing to do would be to convert this in the same way GOptionContext does -- with g_locale_to_utf8().
Created attachment 266046 [details] [review] win32: fix g_file_new_for_commandline_arg encoding This function is intended to be used on commandline arguments (ie: argv from main()). On Windows, these will still be in the system codepage, not in UTF-8, so we will need to do a conversion first. Add a note to the documentation of GOptionContext warning against mixing this with G_OPTION_REMAINING which does its own conversion internally.
Note: this patch is slightly evil. This will "do the right thing" in the case that people write their code in a way that is oblivious to Windows, but we could probably do better for people who want to step up their game. In particular, if people are willing to write wmain() instead of main() or use GetCommandLineW() then this proposed change could reduce functionality for them. Both GApplication and GOptionContext get caught up in the middle of this in a very unfortunate way -- they both assume that their input as in the system codepage form (as from argv of main). We could modify GApplication to accept its input in utf8 instead (or add a variant of the _run() API that takes wchar_t, even) but this introduces interesting compatibility issues for people who assume that they can use GOptionContext with the arguments given to them by GApplication on Windows. Perhaps the solution here is along these lines: - add a wchar_t g_application_run() variant for windows and allow people who care about accepting arbitrary unicode filenames on windows to write a wmain() that calls this function - declare that all commandline arguments passed around by GApplication are not in the commandline encoding, but in the GLib filename encoding. On UNIX this is the same thing, but on Windows, this would be UTF8 instead of the system codepage - modify the paint-still-wet g_option_context_parse_strv() to add another difference: it expects its arguments to be in the filename encoding, not in the commandline argument encoding - document (the current reality) that g_file_new_for_commandline_arg() doesn't accept command line arguments directly from argv[] on Windows and that you must convert to utf8 first OR - fix g_file_new_for_commandline_arg() (as above) but add a new variant that takes its argument in the filename encoding instead of the commandline encoding. g_application_command_line_create_file_for_arg() would want to switch to using this variant (on account of GApplication now being strictly filename-encoding-based).
another possibility: have GApplication ignore the argc/argv given to g_application_run() on windows and call GetCommandLineW() for itself.
see also bug 547254. (I thought there was another bug which was an exact dup of this one, but I can't find it now...)
Created attachment 266070 [details] [review] g_option_context_parse_strv: use filename encoding Add another difference to the freshly-added g_option_context_parse_strv: it now expects its arguments to be in the GLib filename encoding instead of the argv[] encoding. The only place that this makes a difference is on Windows, where main's argv[] is in the system codepage (whereas the GLib filename encoding is UTF-8 there). The documentation teases g_win32_get_command_line() which is a new GLib-filename-encoding-based function that will be added in a following commit.
Created attachment 266071 [details] [review] win32: add g_win32_get_command_line() This returns the command line in GLib filename encoding format (ie: UTF-8) for use with g_option_context_parse_strv(). This will allow parsing of Unicode commandline arguments on Windows, even if the characters in those arguments fall outside of the range of the system codepage.
Created attachment 266072 [details] [review] g_file_new_for_commandline_arg: clarify encoding Add a note to the documentation for g_file_new_for_commandline_arg() that this function is intended to operate on strings already in the GLib filename encoding. This has been the case for a long time, but this documents the requirement.
Created attachment 266073 [details] [review] GApplication: change commandline encoding policy Clarify in the documentation that the commandline arguments passed around by GApplication (to local_command_line and returned via g_application_command_line_get_arguments()) are always in the GLib filename encoding. Fix the mismatch that would result from having argv passed to g_application_run() in main() on Windows (where it is in the system codepage) by ignoring argc/argv on Windows and calling g_win32_get_command_line() for ourselves. Document this. This might be a slight API break on Windows: we documented that it was possible to call g_application_run() with arguments other than argc/argv and now doing that will result in those arguments being ignored. It has always been recommended practice to only call g_application_run() from main() directly, however, and all of our code examples have shown only this. We will see if this causes any issues and consider reevaluating the situation if so.
Comment on attachment 266046 [details] [review] win32: fix g_file_new_for_commandline_arg encoding I think the more comprehensive revamp in the just-attached patchset is probably the best way forward here. The only thing I'm the least bit nervous about is the fact that we've changed behaviour on Windows in two ways: 1) programs that try to pass something other than argc/argv to g_application_run() will now be broken on Windows 2) programs that assumed that GApplication fed them system-codepage-encoded arguments on Windows will be broken there -- including ones that feed the arguments through g_option_context_parse(). The simple solution for them is to move over to g_option_context_parse_strv(), but strictly speaking, it's a break.
Created attachment 266075 [details] [review] GOptionContext: document Windows limitations Add a note to the overview documentation for GOptionContext about why you should avoid 'argv' on Windows and mention some possible alternative approaches.
*** Bug 547254 has been marked as a duplicate of this bug. ***
Review of attachment 266071 [details] [review]: ::: glib/gwin32.c @@ +589,3 @@ + * arguments if the filenames contain characters that fall outside of + * this codepage. If such filenames are passed, then arbitrary + * conversions will occur (such as replacing some characters with '?'). 'arbitrary' ? That is not really true. How about 'undesirable', instead ? @@ +601,3 @@ + * GetCommandLineW() and convert it to the GLib filename encoding (which + * is UTF-8 on Windows). + * Was there something to be said about wmain() here ?
Review of attachment 266073 [details] [review]: ::: gio/gapplicationcommandline.c @@ +356,1 @@ * This new requirement is certainly a compatibility break on non-Windows as well - we do this all over the place in glib and gtk+ examples (parse with g_option_context_parse). Might want to qualify this with 'if you care about portability to Windows'.
Review of attachment 266072 [details] [review]: ::: glib/tests/option-argv0.c @@ +98,3 @@ + } + + g_strfreev (argv); This looks like you subverted a test binary for some debugging or interactive testing ? Was this meant to be included here ?
Review of attachment 266075 [details] [review]: ::: glib/goption.c @@ +136,3 @@ + * arguments that only contain characters from the system codepage. + * This can cause problems when attempting to deal with filenames + * containing Unicode characters that fall outside of the codepage. A The TeX graybeard in me finds that A sticking out at the end of this line horrifying.
(In reply to comment #12) > 'arbitrary' ? That is not really true. How about 'undesirable', instead ? Sure. > @@ +601,3 @@ > Was there something to be said about wmain() here ? I don't think so. In so far as GLib is a tool that people use to write portable software, I think the best approach will be to stick with main(). (In reply to comment #13) > This new requirement is certainly a compatibility break on non-Windows as well > - we do this all over the place in glib and gtk+ examples (parse with > g_option_context_parse). Might want to qualify this with 'if you care about > portability to Windows'. It's not really a _requirement_, but if you're just naively using g_option_context_parse() with GApplication then you're already leaking memory... there are ways to get around that, so "must" is not strictly accurate (even still) but I prefer to keep the stronger language to ensure that we're pointing people in the correct direction. (In reply to comment #14) > This looks like you subverted a test binary for some debugging or interactive > testing ? > Was this meant to be included here ? Ya... was the easiest way to get something running in the correct mingw/wine environment :) Obviously not intended for inclusion (and already patched out from my local tree). (In reply to comment #15) > The TeX graybeard in me finds that A sticking out at the end of this line > horrifying. as in, "wrap the 'A'"? Would be nice to get some comments on the overall idea here. Is this the direction we want to go? More or less declaring that using 'argv' from main on Windows is something that you never want to do in modern applications?
(In reply to comment #16) > (In reply to comment #12) > > 'arbitrary' ? That is not really true. How about 'undesirable', instead ? > > Sure. > > > @@ +601,3 @@ > > Was there something to be said about wmain() here ? > > I don't think so. In so far as GLib is a tool that people use to write > portable software, I think the best approach will be to stick with main(). Then maybe 'best stick with main(), don't use wmain()' would be a good thing to say ? Or maybe just ignore it, dunno. > It's not really a _requirement_, but if you're just naively using > g_option_context_parse() with GApplication then you're already leaking > memory... there are ways to get around that, so "must" is not strictly accurate > (even still) but I prefer to keep the stronger language to ensure that we're > pointing people in the correct direction. Fair enough. But then we should also fix our own usage, over time. > (In reply to comment #15) > > The TeX graybeard in me finds that A sticking out at the end of this line > > horrifying. > > as in, "wrap the 'A'"? > Yes. Move the A to the next line
Review of attachment 266070 [details] [review]: I don't think this is quite right. We do allow having the filename encoding be different than the locale encoding. So, if you type, on a shell, with a utf8 locale, and a filename encoding of latin1 this: the-app --title=blåbär --file=bl* Then (given the existance of a file "blåbär" in latin1) the app will recieve argv[1] in utf8, and argv[2] in latin1. We should *not* try to convert argv[1] from latin1 to utf8, and we should not try to convert argv[2] from utf8 to latin1. Both passed in strings should be used as-is, except on windows then argv[2] should be converted from the locale to utf8. This is hairy shit, but i don't think this patch does the right thing.
Review of attachment 266072 [details] [review]: ::: gio/gfile.c @@ +6467,3 @@ + * arguments and the GLib filename encoding are the same thing, but on + * Windows there is a difference -- the filename encoding is UTF-8 but + * argv, as passed to main(), is in the system codepage. They are not the same, even on unix. the env var G_FILENAME_ENCODING controls the filename encoding.
(In reply to comment #20) > They are not the same, even on unix. the env var G_FILENAME_ENCODING controls > the filename encoding. I think we have a fundamental disagreement here. If passing argv[i] to open() is incorrect then every tool ever written for POSIX is broken. I don't buy the idea that by checking G_FILENAME_ENCODING we are somehow "more correct" than everyone else in the operating system. I think the only thing that makes sense is to assume that if the user has set G_FILENAME_ENCODING then we will receive filename commandline arguments in this form. This is the encoding that the shell will use to expand globs, for example (based on the on-disk filenames)...
I am not saying that passing argv[i] to open is incorrect. On the contrary, I'm very careful that filenames are *always* opaque byte arrays that you should never mess with, just pass on as-is. That doesn't mean that G_FILENAME_ENCODING is not useful though. It is useful for instance: * When you want to display the filename inside a GUI * When the user typed a new filename in a GtkEntry, which means we have a utf8 version of the string which is not what we want to create on disk. This is what e.g. g_filename_to_utf8() is for, and not what this patch is using it for.
And, no, files will not always be what G_FILENAME_ENCODING, which can even have multiple values to try several. We should always be able to handle *any* kind of byte-array filename, even if its not in any valid encoding. However, we will not be *creating* such files, and we may show weirdly in the UI.
What i mean is that if you have part of a command line like: --title=<string1> and --file=<string2> Then string1 should be converted from locale to utf8 for display, and string2 should be used as-is for i/o and converted via G_FILENAME_ENCODING for display.
(except on win32, there string2 should be converted from locale to utf8 for i/o)
So the point here is that a given commandline could have multiple encodings contained within different parts of it or possibly even within the same argument? Exciting.
Created attachment 266516 [details] [review] GOption: don't use "rand" in code example rand() is a function defined in the libc, so our code example gives warnings if you try to compile it. Fix that.
Created attachment 266517 [details] [review] g_option_context_parse_strv: use filename encoding Add another difference to the freshly-added g_option_context_parse_strv: now, on Windows, its arguments on to be in the GLib filename encoding instead of the argv[] encoding (from the system codepage). The documentation teases g_win32_get_command_line() which is a new GLib-filename-encoding-based function that will be added in a following commit.
Created attachment 266518 [details] [review] win32: add g_win32_get_command_line() This returns the command line in GLib filename encoding format (ie: UTF-8) for use with g_option_context_parse_strv(). This will allow parsing of Unicode commandline arguments on Windows, even if the characters in those arguments fall outside of the range of the system codepage.
Created attachment 266519 [details] [review] g_file_new_for_commandline_arg: clarify encoding Add a note to the documentation for g_file_new_for_commandline_arg() that this function is intended to operate on strings already in the GLib filename encoding on Windows. This has been the case for a long time, but this documents the requirement.
Created attachment 266520 [details] [review] GApplication: change commandline encoding policy Clarify in the documentation that the commandline arguments passed around by GApplication (to local_command_line and returned via g_application_command_line_get_arguments()) are in the GLib filename encoding (ie: UTF-8) on Windows, not the system code page. Fix the mismatch that would result from having argv passed to g_application_run() in main() on Windows (where it is in the system code page) by ignoring argc/argv on Windows and calling g_win32_get_command_line() for ourselves. Document this. This might be a slight API break on Windows: we documented that it was possible to call g_application_run() with arguments other than argc/argv and now doing that will result in those arguments being ignored. It has always been recommended practice to only call g_application_run() from main() directly, however, and all of our code examples have shown only this. We will see if this causes any issues and consider reevaluating the situation if so.
Created attachment 266521 [details] [review] GOptionContext: document Windows limitations Add a note to the overview documentation for GOptionContext about why you should avoid 'argv' on Windows and mention some possible alternative approaches, including a code example.
Review of attachment 266516 [details] [review]: sure
Review of attachment 266518 [details] [review]: ::: glib/gwin32.c @@ +606,3 @@ + * are not suitable for use with g_option_context_parse(), which assumes + * that its input will be in the system codepage. The return value is + * suitable for use with g_option_context_parse_strv(), however which is I think the comma should go after the however, not before. Or both ? @@ +610,3 @@ + * + * Unlike argv, the returned value is a normal strv and can (and should) + * be freed with g_strfreev() when you are done with it. I think the direct address ('you') here was criticized earlier - could just say 'when no longer needed'.
Review of attachment 266519 [details] [review]: ::: gio/gfile.c @@ +6465,3 @@ + * Note that on Windows, this function expects its argument to be in + * UTF-8 -- not the system code page. This means that you + * should not use this function with string from 'argv' as it is passed In other places, we use argv without quotes. We should be consistent: either argv or @argv, but not 'argv'
Review of attachment 266520 [details] [review]: ::: gio/gapplicationcommandline.c @@ +349,3 @@ * Gets the list of arguments that was passed on the command line. * + * The strings in the array may contain non-utf8 data on UNIX (such as You had UTF-8 in earlier patches, lets be consistent.
Review of attachment 266521 [details] [review]: ::: glib/goption.c @@ +134,3 @@ + * Note that on Windows, using #GOptionContext with the argv as passed + * to main() will result in a program that can only accept commandline + * arguments that only contain characters from the system codepage. Awkward repetition of only. Maybe: ...that don't contain characters outside the system codepage
Review of attachment 266517 [details] [review]: looks ok to me
Review of attachment 266517 [details] [review]: ::: glib/goption.c @@ +2510,3 @@ * + * Additionally, on Windows, arguments are assumed to be in the GLib + * filename encoding: UTF-8. This makes this function unsuitable for I would avoid using the name "glib filename encoding here", as this is somewhat confusing imho (as we're not using the filename encoding as-such on the unix side. I would rather say something like: The strings are in the same encoding as they where received in main(), which typically means that normal arguments are in the current locale, but parts of it may not, as on unix, valid filenames can be in another encoding (or even none). However, on win32 the strings must be in UTF-8, in order to allow the full set of filenames to be specified (as filenames *do* have an encoding on win32).
Review of attachment 266518 [details] [review]: ::: glib/gwin32.c @@ +597,3 @@ + * + * As such, most GLib programs should ignore the value of argv passed to + * their main() function and call g_win32_get_command_line() instead. I agree with this in general, but with this API that is not super easy, as you have to ifdef on win32. Would it not make more sense to define a "glib argv encoding" (which is same-as-main-argv on unix and utf8 on win32), then we could have a *portable* char *g_get_command_line_strv(int argc, char *argv[]); Which required you to pass in the argv from main and always returned a copy of it in the "glib argv encoding". This would allow ifdef-less support of win32, and means you can use g_option_context_parse_strv() always.
Review of attachment 266519 [details] [review]: With the above changes this could be simplified to say that the string has to be in the "glib argv encoding" with a reference to the other docs about that.
Attachment 266516 [details] pushed as 9592d40 - GOption: don't use "rand" in code example Attachment 266518 [details] pushed as 673ee54 - win32: add g_win32_get_command_line() Attachment 266519 [details] pushed as 643f2b3 - g_file_new_for_commandline_arg: clarify encoding Attachment 266520 [details] pushed as e5f9195 - GApplication: change commandline encoding policy Pushed all the things with cleaned up comments. I added a parapgrase of Alex's comments to the overview of GOptionContext in the same spot that we talk about how to deal with the issue on Windows.
*** Bug 522131 has been marked as a duplicate of this bug. ***
The implementation of g_win32_get_command_line is incorrect. GetCommandLineToArgvW() does not expand globs, which is inconsistent with what is passed to main(). To get correct functionality, you have to use the function __wgetmainargs, which is used in glib's spawn helper programs. http://msdn.microsoft.com/en-us/library/ff770599.aspx On the other hand, I'm glad that this has been finally fixed, after all those years.
The implementation of g_win32_get_command_line (it has a dirrefent name, but does the same thing) from bug 522131 is better: https://bug522131.bugzilla-attachments.gnome.org/attachment.cgi?id=238365
After doing some reading on this topic I'm not totally convinced that we _want_ to expand wildcards. It seems that a great many Windows programs don't do this and these days Microsoft's own PowerShell supports globbing in the shell itself...
I agree, expanding globs in the app is a different semantics, and is problematic wrt e.g. opening a file called "*" when you double click on it in the UI or whatnot. globbing is a shell/terminal operation, and should happen there.
OK, that makes sense, though I'm fairly sure that a file containing "*" or "?" can only be created with the native (NT) API, not the Win32 API. Thank you again for fixing this long-standing issue and finally making GOption actually usable on Windows.
Hi, Unfortunately this breaks scripts using GNOME bindings that make use of g_application_run(). On Windows, such scripts are run with things like "python gtk-demo.py" (as shebang lines are not supported on Windows normally, MSYS and cygwin would have it, but non-developers would probably not have them nor want to have them on their systems), so the arguments picked up with this would become: arg[0]: python arg[1]: gtk-demo.py Whereas the former method would pick up: arg[0]: gtk-demo.py So, those scripts would fail to run with the following error: (python.exe:5092): GLib-GIO-CRITICAL **: This application can not open files. As python would not be "recognized" by GApplication. I have opened bug 734095 for this-perhaps the easiest way would be to make the bindings set some envvar on init, and let a_application_run() check for it, and use the old method if such an envvar is found. With blessings, thank you!