Bug 643479 – Unicode support is not working correctly.

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 643479 - Unicode support is not working correctly.


Summary:	Unicode support is not working correctly.


Status:	RESOLVED FIXED

Product:	gjs
Classification:	Bindings
Component:	general
Version:	unspecified
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gjs-maint
QA Contact:	gjs-maint

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2011-02-28 12:39 UTC by Jasper St. Pierre (not reading bugmail)
Modified:	2011-06-09 21:06 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
console: Fix usage of bytes in gjs_value_debug_string. (2.90 KB, patch) 2011-06-05 22:13 UTC, Jasper St. Pierre (not reading bugmail)	needs-work	Details \| Review
console: Handle both (valid) Unicode strings and binary correctly (2.21 KB, patch) 2011-06-09 19:03 UTC, Colin Walters	none	Details \| Review
console: Handle both (valid) Unicode strings and binary correctly (2.33 KB, patch) 2011-06-09 19:08 UTC, Colin Walters	committed	Details \| Review

Description Jasper St. Pierre (not reading bugmail) 2011-02-28 12:39:38 UTC

gjs> "\u5E74"
  t
  gjs> "\xE5\xB9\xB4"
  <code point 5E74 in UTF-8>

Not sure who or why.

Comment 1 Johan (not receiving bugmail) Dahlin 2011-03-18 14:11:07 UTC

Spidermonkey and thus gjs uses latin-1 encoded strings by default.

Comment 2 Owen Taylor 2011-03-18 15:47:50 UTC

(In reply to comment #1)
> Spidermonkey and thus gjs uses latin-1 encoded strings by default.

I don't think this is an answer. Yes, the default => bytes conversion of Spidermonkey is & 0xff, but we are supposed to be taking care of converting strings appropriately in GJS so that this doesn't leak out.

But I guess what you are saying then is that everything is working fine until we get to the display layer in GJS ... that you'd get the same result from
"\u00e5\u00b9\u00b4"

Jasper - do you want to come up with a patch for gjs_value_debug_string() to not strip high bytes, but instead convert the result to UTF-8 in the matter of gjs_try_string_to_utf8() (it should always return a string, so it needs to do "best-effort" rather than just failing if the JS string contains stuff that isn't valid Unicode in GLib terms.)

Comment 3 Jasper St. Pierre (not reading bugmail) 2011-03-18 17:51:17 UTC

I have one, but at the time I filed this bug, I thought that print() would do the same thing.

Comment 4 Jasper St. Pierre (not reading bugmail) 2011-06-05 22:13:00 UTC

Created attachment 189280 [details] [review]
console: Fix usage of bytes in gjs_value_debug_string.

This may have been confusing users when the output from the REPL
emitted raw bytes didn't match up with something like "print".

Comment 5 Colin Walters 2011-06-06 16:02:51 UTC

Review of attachment 189280 [details] [review]:

::: gjs/jsapi-util.c
@@ -819,3 @@
-    JS_EndRequest(context);
-
-        JS_EncodeStringToBuffer(str, bytes, len);

You are losing the call to this function, which is important - it means we don't throw if we're trying to get a debug string from a JS string containing non-UTF8 data.  Which is the whole point of gjs_value_debug_string() =)

::: modules/console.c
@@ +218,3 @@
 
+        char *display_str;
+        display_str = gjs_value_debug_string(context, result);

One thing to consider actually may be ensuring that for strings, we get valid JavaScript syntax.  Thus, my expected output would be:

> "foo"
"foo"
> "\u263A"
"☺"
> "\u0000\0000"
"\0000\0000"
>

(compare with current gjs)

I'm not sure if there's a JSAPI function to do this - probably not.

Comment 6 Jasper St. Pierre (not reading bugmail) 2011-06-07 08:18:35 UTC

> One thing to consider actually may be ensuring that for strings, we get valid
> JavaScript syntax.  Thus, my expected output would be:
> 
> > "foo"
> "foo"
> > "\u263A"
> "☺"
> > "\u0000\0000"
> "\0000\0000"
> >
> 
> I'm not sure if there's a JSAPI function to do this - probably not.

I'm confused by the rules on this... you want random unicode code points to emit actual encoded bytes (UTF8 or whatever $LANG says?), "\u0000" is hardcoded to "\0000"?

> (compare with current gjs)

Right now, gjs just strips the high bytes.

> "\u263A"
:
> "\u003A"
:

I assume anything is better than this.

Comment 7 Jasper St. Pierre (not reading bugmail) 2011-06-07 08:19:17 UTC

whoops, accidentally clicked one of the hotlink buttons that set severity and status.

Comment 8 Colin Walters 2011-06-07 16:19:54 UTC

(In reply to comment #6)

> I'm confused by the rules on this... you want random unicode code points to
> emit actual encoded bytes (UTF8 or whatever $LANG says?), "\u0000" is hardcoded
> to "\0000"?

For binary strings let's forget unicode; we should use the hex escape "\x00" for clarity.  So basically do:

if (!g_utf8_validate (bytes_from_string, string_length)) {
  print_escaped (string_bytes) 
} else { 
  g_print ("%s\n", string_bytes);
}

Where print_escaped iterates over the sequence and checks g_ascii_is_print(); if printable, shows them literally, otherwise escapes as "\xAB".

Comment 9 Jasper St. Pierre (not reading bugmail) 2011-06-09 00:35:48 UTC

OK, I sat down to work on this again (check the commit date on the attachment)

(In reply to comment #5)
> Review of attachment 189280 [details] [review]:
> 
> ::: gjs/jsapi-util.c
> @@ -819,3 @@
> -    JS_EndRequest(context);
> -
> -        JS_EncodeStringToBuffer(str, bytes, len);
> 
> You are losing the call to this function, which is important - it means we
> don't throw if we're trying to get a debug string from a JS string containing
> non-UTF8 data.  Which is the whole point of gjs_value_debug_string() =)

Well, unfortunately EncodeStringToBuffer throws away the high byte, so we have to use GetStringChars (which is what gjs_string_to_utf8 uses)... but if we can't return UTF8, should we just return UCS2?

> For binary strings let's forget unicode; we should use the hex escape "\x00"
> for clarity.  So basically do:
> 
> if (!g_utf8_validate (bytes_from_string, string_length)) {
>   print_escaped (string_bytes) 
> } else { 
>   g_print ("%s\n", string_bytes);
> }
> 
> Where print_escaped iterates over the sequence and checks g_ascii_is_print();
> if printable, shows them literally, otherwise escapes as "\xAB".

It's harder than this. "ab\x02cd" is completely valid UTF8, and I doubt we want unprintable characters in our UTF8 output.

Comment 10 Colin Walters 2011-06-09 19:03:44 UTC

Created attachment 189574 [details] [review]
console: Handle both (valid) Unicode strings and binary correctly

When printing a value that is a string back to the terminal, check if
it's valid Unicode; if so, print it.  Otherwise, we print the whole
string using escape sequences.

Comment 11 Colin Walters 2011-06-09 19:08:35 UTC

Created attachment 189576 [details] [review]
console: Handle both (valid) Unicode strings and binary correctly

Print ASCII if we can

Comment 12 Jasper St. Pierre (not reading bugmail) 2011-06-09 20:57:09 UTC

Review of attachment 189576 [details] [review]:

This doesn't really support bytestrings completely, given that _make_valid_utf8 replaces all invalid UTF8 sequences with a replacement char.

Otherwise, fine.

::: gjs/jsapi-util.c
@@ +767,3 @@
+ * JS strings that contain valid Unicode, we return a UTF-8 formatted
+ * string.  Otherwise, we return one where non-ASCII-printable bytes
+ * are \x escaped.

They're not '\x'-escaped.

Comment 13 Colin Walters 2011-06-09 21:06:25 UTC

Attachment 189576 [details] pushed as 05bd1ae - console: Handle both (valid) Unicode strings and binary correctly