GNOME Bugzilla – Bug 643479
Unicode support is not working correctly.
Last modified: 2011-06-09 21:06:27 UTC
gjs> "\u5E74"
t
gjs> "\xE5\xB9\xB4"
<code point 5E74 in UTF-8>

Not sure who or why.
Spidermonkey and thus gjs uses latin-1 encoded strings by default.
(In reply to comment #1)
> Spidermonkey and thus gjs uses latin-1 encoded strings by default.

I don't think this is an answer. Yes, the default => bytes conversion of Spidermonkey is & 0xff, but we are supposed to be taking care of converting strings appropriately in GJS so that this doesn't leak out.

But I guess what you are saying, then, is that everything is working fine until we get to the display layer in GJS ... that you'd get the same result from "\u00e5\u00b9\u00b4".

Jasper - do you want to come up with a patch for gjs_value_debug_string() to not strip high bytes, but instead convert the result to UTF-8 in the manner of gjs_try_string_to_utf8()? (It should always return a string, so it needs to do "best-effort" rather than just failing if the JS string contains stuff that isn't valid Unicode in GLib terms.)
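To make the "& 0xff" problem concrete, here is a minimal sketch in plain C that contrasts low-byte truncation with UTF-8 encoding of a single UTF-16 code unit. The helper names truncate_to_low_byte() and encode_bmp_utf8() are hypothetical and are not part of gjs or the JSAPI; the encoder only covers non-surrogate BMP code units.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Lossy conversion: keep only the low byte of a UTF-16 code unit,
 * i.e. the "& 0xff" behaviour discussed in this comment. */
static unsigned char truncate_to_low_byte(uint16_t unit)
{
    return (unsigned char)(unit & 0xff);
}

/* Encode one BMP code unit as UTF-8. Surrogates are not handled here,
 * so the caller must not pass them. Writes 1 to 3 bytes into out and
 * returns the number of bytes written. */
static size_t encode_bmp_utf8(uint16_t unit, unsigned char out[3])
{
    assert(unit < 0xD800 || unit > 0xDFFF);

    if (unit < 0x80) {
        out[0] = (unsigned char)unit;
        return 1;
    }
    if (unit < 0x800) {
        out[0] = (unsigned char)(0xC0 | (unit >> 6));
        out[1] = (unsigned char)(0x80 | (unit & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (unit >> 12));
    out[1] = (unsigned char)(0x80 | ((unit >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (unit & 0x3F));
    return 3;
}
```

For the code unit 0x5E74 from the original report, truncation yields 0x74 ('t'), while the encoder yields the three bytes E5 B9 B4.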
I have one, but at the time I filed this bug, I thought that print() would do the same thing.
Created attachment 189280 [details] [review]
console: Fix usage of bytes in gjs_value_debug_string.

This may have been confusing users, since the raw bytes the REPL emitted didn't match the output of something like "print".
Review of attachment 189280 [details] [review]:

::: gjs/jsapi-util.c
@@ -819,3 @@
- JS_EndRequest(context);
-
- JS_EncodeStringToBuffer(str, bytes, len);

You are losing the call to this function, which is important - it means we don't throw if we're trying to get a debug string from a JS string containing non-UTF8 data. Which is the whole point of gjs_value_debug_string() =)

::: modules/console.c
@@ +218,3 @@
+ char *display_str;
+ display_str = gjs_value_debug_string(context, result);

One thing to consider actually may be ensuring that for strings, we get valid JavaScript syntax. Thus, my expected output would be:

> "foo"
"foo"
> "\u263A"
"☺"
> "\u0000\0000"
"\0000\0000"
>

(compare with current gjs)

I'm not sure if there's a JSAPI function to do this - probably not.
> One thing to consider actually may be ensuring that for strings, we get valid
> JavaScript syntax. Thus, my expected output would be:
>
> > "foo"
> "foo"
> > "\u263A"
> "☺"
> > "\u0000\0000"
> "\0000\0000"
> >
>
> I'm not sure if there's a JSAPI function to do this - probably not.

I'm confused by the rules on this... you want random unicode code points to emit actual encoded bytes (UTF8 or whatever $LANG says?), "\u0000" is hardcoded to "\0000"?

> (compare with current gjs)

Right now, gjs just strips the high bytes.

> "\u263A"
:
> "\u003A"
:

I assume anything is better than this.
whoops, accidentally clicked one of the hotlink buttons that set severity and status.
(In reply to comment #6)
> I'm confused by the rules on this... you want random unicode code points to
> emit actual encoded bytes (UTF8 or whatever $LANG says?), "\u0000" is hardcoded
> to "\0000"?

For binary strings let's forget unicode; we should use the hex escape "\x00" for clarity. So basically do:

if (!g_utf8_validate (bytes_from_string, string_length)) {
    print_escaped (string_bytes)
} else {
    g_print ("%s\n", string_bytes);
}

Where print_escaped iterates over the sequence and checks g_ascii_is_print(); if printable, shows them literally, otherwise escapes as "\xAB".
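The print_escaped idea above could look roughly like the following plain-C sketch. It is an illustration only, not the patch that was attached to this bug: escape_bytes() and is_ascii_printable() are hypothetical names, the result goes into a caller-supplied buffer instead of being printed, and backslashes and quotes are passed through unescaped for brevity. In GLib the printable check is spelled g_ascii_isprint(); a hand-written range test is used here so the example has no dependencies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Locale-independent test for printable ASCII (space through '~'). */
static int is_ascii_printable(unsigned char b)
{
    return b >= 0x20 && b <= 0x7E;
}

/* Write an escaped, NUL-terminated copy of in[0..len) into out.
 * Printable ASCII is copied literally; every other byte becomes "\xAB".
 * Returns the number of characters written (excluding the NUL),
 * or -1 if out_size is too small. */
static int escape_bytes(const unsigned char *in, size_t len,
                        char *out, size_t out_size)
{
    size_t used = 0;

    assert(out != NULL);

    for (size_t i = 0; i < len; i++) {
        size_t need = is_ascii_printable(in[i]) ? 1 : 4;

        /* Reserve room for this item plus the terminating NUL. */
        if (used + need + 1 > out_size)
            return -1;

        if (need == 1)
            out[used] = (char)in[i];
        else
            snprintf(out + used, 5, "\\x%02X", (unsigned)in[i]);
        used += need;
    }

    if (used + 1 > out_size)
        return -1;
    out[used] = '\0';
    return (int)used;
}
```

A caller would pass the raw bytes and a buffer of at least 4 * len + 1 bytes, which is the worst case when every byte needs escaping.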
OK, I sat down to work on this again (check the commit date on the attachment)

(In reply to comment #5)
> Review of attachment 189280 [details] [review]:
>
> ::: gjs/jsapi-util.c
> @@ -819,3 @@
> - JS_EndRequest(context);
> -
> - JS_EncodeStringToBuffer(str, bytes, len);
>
> You are losing the call to this function, which is important - it means we
> don't throw if we're trying to get a debug string from a JS string containing
> non-UTF8 data. Which is the whole point of gjs_value_debug_string() =)

Well, unfortunately EncodeStringToBuffer throws away the high byte, so we have to use GetStringChars (which is what gjs_string_to_utf8 uses)... but if we can't return UTF8, should we just return UCS2?

> For binary strings let's forget unicode; we should use the hex escape "\x00"
> for clarity. So basically do:
>
> if (!g_utf8_validate (bytes_from_string, string_length)) {
>     print_escaped (string_bytes)
> } else {
>     g_print ("%s\n", string_bytes);
> }
>
> Where print_escaped iterates over the sequence and checks g_ascii_is_print();
> if printable, shows them literally, otherwise escapes as "\xAB".

It's harder than this. "ab\x02cd" is completely valid UTF8, and I doubt we want unprintable characters in our UTF8 output.
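The objection here is that UTF-8 validity alone does not decide whether a string is safe to print verbatim. A rough plain-C sketch of a stricter check follows. It assumes "printable" means well-formed UTF-8 with no C0 or C1 control characters; is_printable_utf8() is a hypothetical helper, not a GLib or gjs function, and GLib's g_utf8_validate() performs only the well-formedness part.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if s[0..len) is well-formed UTF-8 and contains no control
 * characters (U+0000..U+001F, U+007F..U+009F); returns 0 otherwise. */
static int is_printable_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);

    while (i < len) {
        unsigned char b = s[i];
        size_t extra;          /* number of continuation bytes expected */
        unsigned long cp, min; /* decoded code point, smallest legal value */

        if (b < 0x80) {
            if (b < 0x20 || b == 0x7F)
                return 0;      /* ASCII control character */
            i++;
            continue;
        } else if ((b & 0xE0) == 0xC0) {
            extra = 1; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            extra = 2; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            extra = 3; cp = b & 0x07; min = 0x10000;
        } else {
            return 0;          /* stray continuation or invalid lead byte */
        }

        if (len - i <= extra)
            return 0;          /* sequence truncated at end of input */

        for (size_t k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        /* Reject overlong forms, out-of-range values and surrogates. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
        if (cp <= 0x9F)
            return 0;          /* C1 control character */

        i += extra + 1;
    }
    return 1;
}
```

With this check, "ab\x02cd" is rejected even though it is valid UTF-8, so it would take the escaping path instead of being printed as-is.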
Created attachment 189574 [details] [review]
console: Handle both (valid) Unicode strings and binary correctly

When printing a value that is a string back to the terminal, check if it's valid Unicode; if so, print it. Otherwise, we print the whole string using escape sequences.
Created attachment 189576 [details] [review]
console: Handle both (valid) Unicode strings and binary correctly

Print ASCII if we can
Review of attachment 189576 [details] [review]:

This doesn't really support bytestrings completely, given that _make_valid_utf8 replaces all invalid UTF8 sequences with a replacement char. Otherwise, fine.

::: gjs/jsapi-util.c
@@ +767,3 @@
+ * JS strings that contain valid Unicode, we return a UTF-8 formatted
+ * string. Otherwise, we return one where non-ASCII-printable bytes
+ * are \x escaped.

They're not '\x'-escaped.
Attachment 189576 [details] pushed as 05bd1ae - console: Handle both (valid) Unicode strings and binary correctly