Bug 680730 - String -> uint8 array coercion as byte array, not utf8 is unexpected

Status: RESOLVED FIXED
Product: gjs
Classification: Bindings
Component: general
Version: unspecified
Platform: Other Linux
Importance: Normal normal
Target Milestone: ---
Assigned To: gjs-maint
QA Contact: gjs-maint
Duplicates: 688768
Depends on:
Blocks:
Reported: 2012-07-28 02:56 UTC by Ray Strode [halfline]
Modified: 2013-01-04 00:55 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments:
  test case from thread (3.08 KB, text/plain) - 2012-07-28 02:58 UTC, Ray Strode [halfline]
  Switch String -> uint8 array coercion to be UTF8 (2.24 KB, patch) - 2012-11-29 19:58 UTC, Colin Walters - committed
  Remove obsolete handling of binary data and move to CStringsAreUTF8 (29.49 KB, patch) - 2013-01-02 20:30 UTC, Giovanni Campagna - committed
  Remove all calls to g_utf8_to_utf16 (8.14 KB, patch) - 2013-01-02 20:30 UTC, Giovanni Campagna - committed

Description Ray Strode [halfline] 2012-07-28 02:56:45 UTC
Someone pointed out on the gnome-shell list a problem they're seeing with gjs here:

https://mail.gnome.org/archives/gnome-shell-list/2012-July/msg00135.html

Basically you can't take the result of g_key_file_to_data and feed it to g_file_set_contents.
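For illustration, here is a minimal gjs sketch of that pattern (the real reproducer is in the test case attached below in comment 1; the GLib.KeyFile construction here is an assumption about the bindings, not code from the report):

  const GLib = imports.gi.GLib;

  let keyfile = new GLib.KeyFile();                      // assumed constructible from JS
  keyfile.set_string('Desktop Entry', 'Name', 'Éditeur de texte');   // non-ASCII value
  let [data, len] = keyfile.to_data();                   // UTF-8 from GLib, stored in a JS string
  GLib.file_set_contents('gedit.desktop', data, len);
  // g_file_set_contents() is annotated as taking a uint8 array, so gjs coerces the
  // JS string by keeping only the low byte of each UTF-16 unit, and the non-ASCII
  // content comes out corrupted on disk.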

I attached with gdb just to see what was going on, and what I found was that before g_file_set_contents is called, gjs_string_get_binary_data is called.  It has this code to extract the data from JavaScript:

 len = JS_GetStringEncodingLength(context, str);                   
                                                 
 if (data_p) {                                   
     bytes = g_malloc((len + 1) * sizeof(char)); 
     JS_EncodeStringToBuffer(str, bytes, len);   
     bytes[len] = '\0';                          
     *data_p = bytes;                            
 }                                               

JS_EncodeStringToBuffer is documented as having this quirk:

"Note that for non-ASCII strings, if JS_CStringsAreUTF8 is false, these functions can return a corrupted copy of the contents of the string."

and running g_utf8_validate in the debugger on the output bytes confirms they are getting corrupted by that function.  Furthermore, running

call JS_SetCStringsAreUTF8()
call JS_EncodeStringToBuffer(str, bytes, len)

magically makes g_utf8_validate start working.  Earlier in the function there's this code:

 /* JS_GetStringBytes() returns low byte of 
  * each schar unless JS_CStringsAreUTF8() 
  * in which case we're screwed
  */                   
 if (JS_CStringsAreUTF8()) {                                         
     gjs_throw(context,
               "If JS_CStringsAreUTF8(), gjs doesn't know how "
               "to do binary strings");
     return JS_TRUE;
 }              

which makes me think:

1) the code used to run JS_GetStringBytes instead of JS_EncodeStringToBuffer 
2) the code is/was trying to take advantage of a quirk in spidermonkey when !JS_CStringsAreUTF8() to stuff binary data in every other byte of an unvalidated utf-16 string.
Comment 1 Ray Strode [halfline] 2012-07-28 02:58:21 UTC
Created attachment 219759 [details]
test case from thread

This is the test case the reporter supplied on the mailing thread, with small modifications so I had time to attach with gdb before things went south.
Comment 2 Ray Strode [halfline] 2012-07-28 04:24:47 UTC
So looking in the log I see this hint to historical background:

commit c01ad5f14e878179592712fb2f0b5360bff63790
Date:   Tue Apr 28 17:07:34 2009 -0400

    Allow passing a JS string as a C array of int8/uint8/int16/uint16.
    
    We can now pass values into C functions which take arrays of small integers
    using the standard "one value per character" encoding of buffers as JavaScript
    strings.  This is primarily useful for methods which take binary buffers
    as gint8* or uint8* (instead of char *).
    
    Test cases against the 'Everything' module show how this works.

(with the test showing:

+    // implicit conversions from strings to int arrays
+    assertEquals(10, Everything.test_array_gint8_in(4, "\x01\x02\x03\x04"));
+    assertEquals(10, Everything.test_array_gint16_in(4, "\x01\x02\x03\x04"));
+    assertEquals(2560, Everything.test_array_gint16_in(4, "\u0100\u0200\u0300\u0400"));
 }
)

So that poses a bit of a conflict: either JSStrings always store encoded strings or they always store raw bytes.  I don't think it's possible to unambiguously have it both ways unless JSString has some sort of state that lets it get tagged with an optional encoding.  I'm guessing it doesn't, but I don't really know.  If that's right, then I suppose we could do some opportunistic heuristic at the time we're extracting the data to "guess" whether it's stored as UTF-16 or padded raw bytes, but that seems fragile.  It might be better to create a different kind of JavaScript value internally for strings coming from C, one that has a well-defined encoding, and try to make this new type look like a regular JavaScript string to JavaScript code.
Comment 3 Ray Strode [halfline] 2012-07-28 04:33:22 UTC
So the other approach would be to never convert utf8 coming from glib to utf16, but to always just stuff the utf8 into the jsstring as padded bytes. I don't think printed strings would show up right, then, without hacks, though.
Comment 4 Jasper St. Pierre (not reading bugmail) 2012-07-28 15:54:24 UTC
What do Unicode literals produce? '\u294A' ?
Comment 5 Colin Walters 2012-07-29 08:51:42 UTC
Is this a regression, or just unexpected behavior?  

This might be something weird with how g_key_file_to_data has a length, but is also saying it's UTF-8.  Using byte length with UTF-8... well, I wouldn't call it "broken", but it's a weird corner case.
Comment 6 Ray Strode [halfline] 2012-07-30 14:03:12 UTC
Jasper, I don't understand your question.  What do unicode literals produce in what scenarios?  Do you mean, "if you have a JSString of utf-16 unicode characters with codepoints fully filling two bytes, such as '\u294A', what is the result of JS_EncodeStringToBuffer when JS_CStringsAreUTF8() is true?"

Colin, even if it didn't return a length, you'd still have a problem, right? The fundamental problem (assuming I'm not misunderstanding it, which is possible) is that when we store data in JSStrings, we do it in different ways depending on where the data came from, we don't track which way we stored that data, and then when it comes time to get the data out, we use the introspection information from the function about to be called to guess how we stored the data.  E.g., we assume the data is stored as padded raw bytes because the function g_file_set_contents wants arbitrary binary data.  Of course, the function about to be called has no way of knowing how we encoded the data in the string originally; it only knows how it wants the data.
Comment 7 Jasper St. Pierre (not reading bugmail) 2012-07-30 14:35:27 UTC
There's no guarantee that '\u294A' has any specific encoding at the language level. It could be stored as UTF-8 for all we know. I'm saying, if I build a JS function in C that calls JS_EncodeStringToBuffer when JS_CStringsAreUTF8 is *false* (the gjs case), and pass it '\u294A', what happens?
Comment 8 Ray Strode [halfline] 2012-07-30 15:27:16 UTC
You can't pass a string like '\u294A' directly to EncodeStringToBuffer; you pass it a JSString object.  How that JSString encodes the data is entirely up to the C code that creates the JSString object. From the JavaScript side, the commit mentioned in comment 2 causes string literals like '\u294A' to get encoded as if they were utf-16 (without validation, though).  Namely, you'd end up with an array with one 2-byte element.  If instead you had written '\x29\x4a', then the commit mentioned in comment 2 causes a 2-element array of 2-byte elements to be allocated (with the high byte of each element ignored).  So I'm assuming your question is about the JSString case with one 2-byte element. So, to the question,

"if you pass a JSString object with ->chars[0] = 0x294a to JS_EncodeStringToBuffer when utf8 mode is disabled, what gets returned?"

I haven't tried it, but the docs say "a corrupted copy of the string", and I assume the answer is "the top byte gets dropped", and so my guess without trying it is the letter "J" (which is 0x4a).

I don't think the question is that relevant though, since the code path we're hitting is specifically the "we think this JSString has data in every other byte" case (gjs_string_to_intarray with element_type UINT8).
Comment 9 Ray Strode [halfline] 2012-08-08 18:43:59 UTC
So 

https://developer.mozilla.org/en-US/docs/SpiderMonkey/JSAPI_Reference/JS_NewExternalString

 and 

https://developer.mozilla.org/en-US/docs/SpiderMonkey/JSAPI_Reference/JS_GetExternalStringClosure

look interesting and may be exactly what we need to "tag" how the bytes in our custom strings are being laid down in JSString chars array.
Comment 10 Ray Strode [halfline] 2012-10-09 05:23:15 UTC
I started to take a crack at this:

http://git.gnome.org/browse/gjs/commit/?h=wip/binary-data

but then realized when trying to compile it, that those functions aren't available in the spidermonkey we use in gjs.

Anyway, I posted the incomplete code to the branch above for now, so it doesn't get lost.
Comment 11 Colin Walters 2012-11-28 23:19:38 UTC
*** Bug 688768 has been marked as a duplicate of this bug. ***
Comment 12 Colin Walters 2012-11-29 04:26:11 UTC
So the keyfile length parameter is a red herring; what this bug is all about is the history of gjs, spidermonkey, JavaScript the language, and GLib with respect to byte arrays.

When gjs was first written, JavaScript didn't have Uint8Array.  Now, of course, you can make an array full of doubles, but this is quite inefficient.  On old JavaScript engines (such as Spidermonkey when gjs was created), it was fantastically, shockingly inefficient, because each double had to be malloc()'d.  Since then Spidermonkey has switched to "fat" values, which helps a lot, but still.

As far as I understand it, in the pushing-JS-to-the-limits community, using strings as byte arrays was a well known hack, since JavaScript allowed you to put any data you want in, including invalid UCS-2.

Thus, someone (almost certainly Havoc) chose to have gjs interpret JS strings, when coerced to uint8 array, as an array of uint16 pairs.

Later, gjs gained a byteArray class.  Slightly later than that, JavaScript and Spidermonkey gained a language-native Uint8Array, which is the basis for a lot of high-performance JS work.

Then GLib gained GBytes.

Now, given the present state, interpreting JS String as uint16 pairs in the context of coercion to uint8 array is quite unexpected.  I myself have been confused by it.  

The problem is further compounded by the inconsistency of GLib with respect to utf8 versus uint8 array; some APIs like g_file_set_contents() you might expect to be utf8 since they say "gchar*", but they're annotated to be uint8 array.  Were these APIs written today, they'd ideally take a GBytes.

So what should we do?
Comment 13 Colin Walters 2012-11-29 04:27:47 UTC
There is a workaround for application authors available; use imports.byteArray.fromString(somestr, "UTF-8");

For example, the keyfile code would look like this:

let [data, datalen] = keyfile.to_data();
let byteData = imports.byteArray.fromString(data, "UTF-8");
GLib.file_set_contents('gedit.desktop', byteData, byteData.length);
Comment 14 Colin Walters 2012-11-29 04:32:37 UTC
But the big question - can we, now that we have better byte array handling all around, just drop the old historical hack of String-as-byte-array?  Many things could be improved by that.

But the problem, to be clear, is really that I have no idea just how much would break.

Oh and what makes this problem a *thousand* times worse is that it only matters when non-ASCII is involved...
Comment 15 Jasper St. Pierre (not reading bugmail) 2012-11-29 04:50:04 UTC
1) Does Uint8Array exist in SpiderMonkey? I thought it was a DOM binding thing.

2) Please, let's drop the String-as-byte-array thing. But JavaScript already has extremely poor Unicode support. I think we should also turn CStringsAreUTF8 back on, and ensure that all strings returned from bindings are valid UTF8. I'm not sure what the best solution to g_file_set_contents is; I don't think it would break much if we changed it to "gint8", unless we have some users compiling GNOME on Cray machines or other machines where chars are greater than 8 bits.

3) I'm not sure what the correct handling of binary data is. I don't think anybody cares about the Unicode-as-binary-data thing. I don't want to force users that use g_file_set_contents to build a ByteArray or Uint8Array or GBytes from their unicode string. Implicitly making a UTF8 bytestring from their unicode JSStrings should be fine, I think. If people need more, we can do something like imports.byteArray.fromString(data, "punycode"); or whatever. Python got it right with the encode/decode interfaces and bytes/str types.

4) GKeyFile is really an unfortunate API, as you have to pass this useless "-1" around. We used to have "(array length=length)" which made it require a byte array from my recollection, but removed that because it was unexpected, especially for an API that required UTF8. I don't know if it makes sense to special case this in any way, so that we can drop that -1 requirement. Methods like pango_parse_markup do this as well; it seems to save a strlen(), or avoid a NUL termination for various reasons. I wonder if we could / should special case this.
Comment 16 Colin Walters 2012-11-29 13:35:42 UTC
(In reply to comment #15)
> 1) Does Uint8Array exist in SpiderMonkey? I thought it was a DOM binding thing.

Yes, search for "TypedArray" in the js185 source.

> 2) Please, let's drop the String-as-byte-array thing.

Let's phrase this as "Switch the String->uint8 array coercion to UTF8".  

> I think we should also turn CStringsAreUTF8
> back on, and ensure that all strings returned from bindings are valid UTF8. 

Yes, we could do that *if* we switch the conversion.  But this is a separate discussion.

> 4) GKeyFile is really an unfortunate API, as you have to pass this useless "-1"

Irrelevant to this bug.

> I'm
> not sure what the best solution to g_file_set_contents is;

This bug is not about g_file_set_contents() specifically.  It's about what happens when you pass a JavaScript String to a C API expecting a uint8 array.

> 3) I'm not sure what the correct handling of binary data is. I don't think
> anybody cares about the Unicode-as-binary-data thing.

Today?  No, the clearly sane way to do byte arrays in JS is UInt8Array.  But the point is that there is *history* here.  There was a reason for the choice at the time.  It's now "technical debt".

What you are completely ignoring here is - what is the cost of switching? How many programs are passing non-UTF8 "byte-JS-String"s (for lack of a better term) to uint8 array APIs?
Comment 17 Ray Strode [halfline] 2012-11-29 14:57:43 UTC
(In reply to comment #12)
> Thus, someone (almost certainly Havoc) chose to have gjs interpret JS strings,
> when coerced to uint8 array, as an array of uint16 pairs.
The issue is that JS strings can have data encoded in several different ways depending on how the string was created initially.  Under the covers the string is an array of uint16 elements, but those elements can be filled in various ways: (1) tightly packed, 2 valid raw bytes in each element; (2) loosely packed, 1 raw byte + 1 invalid byte in each element; (3) as a utf-16 string.

> So what should we do?
So if I understand you, you're saying given we have UInt8Array the question is whether we should allow string literals like '\x4d\x2d\x3a\x4a' and '\u4d2d\u3a4a' to continue to implicitly be treated as binary data when passed to glib functions that take binary data, or should we error out if they aren't encoded as valid strings.  It seems useful to me to keep that working, but I don't know how widespread that capability is and it seems hard unless we have a way to tag strings (when they're created) as holding binary.
Comment 18 Ray Strode [halfline] 2012-11-29 14:58:20 UTC
(In reply to comment #13)
> There is a workaround for application authors available; use
> imports.byteArray.fromString(somestr, "UTF-8");
> 
> For example, the keyfile code would look like this:
> 
> let [data, datalen] = keyfile.to_data();
> let byteData = imports.byteArray.fromString(data, "UTF-8");
> GLib.file_set_contents('gedit.desktop', byteData, byteData.length);

Right, I mentioned that here:

https://mail.gnome.org/archives/gnome-shell-list/2012-July/msg00143.html
Comment 19 Colin Walters 2012-11-29 16:16:12 UTC
(In reply to comment #17)
> (In reply to comment #12)
> > Thus, someone (almost certainly Havoc) chose to have gjs interpret JS strings,
> > when coerced to uint8 array, as an array of uint16 pairs.
> The issue is that JS strings can have data encoded in several different ways
> depending on how the string was created initially.  Under the covers the string
> is an array of uint16 elements, but those elements can be filled in various
> ways: (1) tightly packed, 2 valid raw bytes in each element; (2) loosely packed,
> 1 raw byte + 1 invalid byte in each element; (3) as a utf-16 string.

I'm not sure what you're saying here, honestly.  

> > So what should we do?
> So if I understand you, you're saying given we have UInt8Array the question is
> whether we should allow string literals like '\x4d\x2d\x3a\x4a' and
> '\u4d2d\u3a4a' to continue to implicitly be treated as binary data when passed
> to glib functions that take binary data, or should we error out if they aren't
> encoded as valid strings.  

The proposal is to *convert* the utf16 data from JS String to utf8 - if that succeeds,
we pass the utf8 data + length-in-bytes as a uint8 array.  If conversion fails, we throw.

While this is likely what you meant by "error out if they aren't encoded as valid strings",
the phrasing here is really important.
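To illustrate, a hypothetical sketch of the post-change behaviour, reusing the Everything test functions quoted in comment 2 (the import line and the non-ASCII cases are assumptions, not existing tests):

  const Everything = imports.gi.Everything;   // assumed import, matching the tests quoted in comment 2

  // Pure ASCII is unaffected: the UTF-8 encoding of "\x01\x02\x03\x04" is the same four bytes.
  Everything.test_array_gint8_in(4, "\x01\x02\x03\x04");   // still sums to 10

  // A non-ASCII string would now contribute its UTF-8 bytes (0xc3 0xa9 for "\u00e9"),
  // not a single truncated low byte.
  Everything.test_array_gint8_in(2, "\u00e9");

  // A string with no valid UTF-8 encoding (e.g. a lone surrogate) would throw
  // instead of silently passing corrupted bytes:
  // Everything.test_array_gint8_in(1, "\ud800");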

> It seems useful to me to keep that working, but I
> don't know how widespread that capability is and it seems hard unless we have a
> way to tag strings (when they're created) as holding binary.

That tag effectively already exists now in that callers can use imports.byteArray.fromString()...but I doubt it's widely used.

So again though, let me attempt to turn the conversation to - if we make the change, what would break?  We would need to do some enumeration of commonly-used uint8 array APIs provided by the GTK+ stack, and then search for gjs users that are passing strings.  Using pure ASCII strings is OK - anything else is not.
Comment 20 Ray Strode [halfline] 2012-11-29 17:51:53 UTC
So I chatted with walters about this in meat space, since it's all pretty confusing.  To be clearer, right now gjs will do this conversion for interpreter-created strings:

"\x41\x2d\x3f" -> jsstr1 = { 0x0041, 0x002d, 0x003f }
"\u413d\u2d3f\u3c2a" -> jsstr2 = { 0x413d, 0x2d3f, 0x3c2a }

Also, under the covers gjs can return javascript strings from C functions that return utf8, like so:

 uint8_t out[] = { 0xe2, 0x84, 0x8b, 0x69, 0x21 } /* utf8 for U+210B, "i", "!" */ -> jsstr3 = { 0x210b, 0x0069, 0x0021 }

Now if jsstr1 gets passed to a C function that takes uint8_t * (binary data) as an argument, it correctly converts like so:

jsstr1 = { 0x0041, 0x002d, 0x003f } -> uint8_t str1[] = { 0x41, 0x2d, 0x3f }

That is to say gjs will discard the upper bytes of the string since it knows they were just filler.

The problem is that the above discarding behavior breaks the other two cases:

jsstr2 = { 0x413d, 0x2d3f, 0x3c2a } -> uint8_t str2[] = { 0x3d, 0x3f, 0x2a }
jsstr3 = { 0x210b, 0x0069, 0x0021 } -> uint8_t str3[] = { 0x0b, 0x69, 0x21 }

and we have no way to distinguish when to discard and when not to, barring heuristics.  The conclusion we came to is that case 1 isn't that interesting to support, so it makes a lot of sense to remove support for that case and clean up the code, even though it will break backward compatibility.
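The same distinction at a call site (a hypothetical illustration; file_set_contents stands in for any uint8-array API, using the 3-argument call shape from comment 13):

  const GLib = imports.gi.GLib;

  let jsstr1 = "\x41\x2d\x3f";                               // { 0x0041, 0x002d, 0x003f }
  GLib.file_set_contents('ok.bin', jsstr1, jsstr1.length);   // writes 41 2d 3f, as intended

  let jsstr2 = "\u413d\u2d3f\u3c2a";                         // { 0x413d, 0x2d3f, 0x3c2a }
  GLib.file_set_contents('bad.bin', jsstr2, jsstr2.length);  // writes 3d 3f 2a: upper bytes dropped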
Comment 21 Ray Strode [halfline] 2012-11-29 17:54:53 UTC
(my original proposal was to try to do the heuristics as early as possible to determine how the jsstring was encoded, tag the string with the encoding somehow, and then use the tag to determine how to decode the data to pass it to the C function)
Comment 22 Colin Walters 2012-11-29 19:58:14 UTC
Created attachment 230236 [details] [review]
Switch String -> uint8 array coercion to be UTF8

Previously for historical reasons, we treated JavaScript Strings as
byte array containers when in the context of a uint8 array conversion.
This is not what "most" users expect, particularly when taking UTF-8
data from one GLib function and passing it to a different GLib
function that accepts a uint8 array.
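
Assuming the patch behaves as described, the original pattern from the description should now just work, without the ByteArray workaround from comment 13 (a hypothetical sketch, not a test from the patch):

  const GLib = imports.gi.GLib;

  let [data, len] = keyfile.to_data();                  // keyfile populated as in the description
  GLib.file_set_contents('gedit.desktop', data, len);   // the string's UTF-8 bytes are written,
                                                        // so non-ASCII content survives intact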
Comment 23 Colin Walters 2012-11-29 21:06:26 UTC
The latest gnomeos-3.8-x86_64-devel boots in qemu with this patch, no additional errors thrown.
Comment 24 Ray Strode [halfline] 2012-11-30 21:46:20 UTC
Review of attachment 230236 [details] [review]:

Seems fine, though I do wonder if it's complete. Since we don't allow stuffing binary data in strings anymore, I'd assume get_binary_data should be removed.  But there are a bunch of places that still call it.  One of the interesting-looking ones is gjs_js_one_value_to_dbus, which has (paraphrased):

 if (JSVAL_IS_STRING(value)) {
    /* if out signature is "ay" */
    gjs_string_get_binary_data(context, value, &data, &len);
 }

So it's assuming the string is packed in the way we don't allow anymore.  That's probably wrong and needs to be changed too, I think. Another interesting case is gjs_string_get_ascii. It seemingly calls gjs_string_get_binary_data as a sort of fast path to avoid validation. If that's important, then I think it makes sense to move the guts of get_binary_data into get_ascii.  All the other callers look like they should be changed to gjs_string_to_utf8 or gjs_string_get_ascii.

::: gi/arg.c
@@ +526,3 @@
             return JS_FALSE;
         *arr_p = result;
+        *length = strlen(result);

sort of wish to_utf8 returned the length
Comment 25 Giovanni Campagna 2012-12-03 19:04:17 UTC
Just to point out, the plan in bug 669350 was to remove imports.dbus (and the associated libgjs-dbus) before 3.7.1, so to me the handling of "ay" there is not very important.
Comment 26 Jasper St. Pierre (not reading bugmail) 2012-12-03 19:18:26 UTC
We can do that now. There are no GNOME apps that use gjs that depend on imports.dbus. I ported the rest of them.
Comment 27 Ray Strode [halfline] 2012-12-04 22:39:57 UTC
reopening since we still have to drop gjs_string_get_binary_data eventually.
Comment 28 lamefun 2013-01-01 18:50:20 UTC
What about this: have two GJS binaries, gjs and gjs2, where gjs exhibits the old behaviour and gjs2 uses Uint8Array for binary data?
Comment 29 Ray Strode [halfline] 2013-01-02 14:37:49 UTC
why try to keep the old behavior? Is there something using it?
Comment 30 Giovanni Campagna 2013-01-02 20:30:11 UTC
Created attachment 232565 [details] [review]
Remove obsolete handling of binary data and move to CStringsAreUTF8

All C strings are UTF8 in the glib world, so that setting makes sense
for gjs, and the reason it was previously off was to allow a broken
way to represent binary data.
Now that we don't do that any more, we can remove a lot of useless code
and just let Spidermonkey do the right thing.

----

Time to enjoy the benefits of this change!
Comment 31 Giovanni Campagna 2013-01-02 20:30:26 UTC
Created attachment 232567 [details] [review]
Remove all calls to g_utf8_to_utf16

Let's have just one place where we do unicode conversion, and make that
the one in Spidermonkey.
Comment 32 Ray Strode [halfline] 2013-01-02 20:42:45 UTC
wheeee
Comment 33 Colin Walters 2013-01-04 00:32:48 UTC
Review of attachment 232565 [details] [review]:

This all looks pretty good to me.
Comment 34 Colin Walters 2013-01-04 00:34:13 UTC
Review of attachment 232567 [details] [review]:

Indeed, the code looks a lot better now.
Comment 35 Giovanni Campagna 2013-01-04 00:55:48 UTC
Attachment 232565 [details] pushed as 8220460 - Remove obsolete handling of binary data and move to CStringsAreUTF8
Attachment 232567 [details] pushed as 27389b7 - Remove all calls to g_utf8_to_utf16

The handling of binary data in gjs has changed radically, and now in theory matches developers' expectations, so I'm inclined to close this as fixed.