GNOME Bugzilla – Bug 599860
Wrong encoding for C strings: Latin1 instead of UTF8
Last modified: 2016-11-29 05:27:39 UTC
Created attachment 146385 [details] Call JS_SetCStringsAreUTF8() before the first call to JS_NewRuntime By default, JavaScript engine uses Latin1 character set when converting from C strings. This cause problem to me, because I want to use Cyrillic in the source code and incoming data. MDC says that JS_SetCStringsAreUTF8() must be called before the first call to JS_NewRuntime. See https://developer.mozilla.org/En/SpiderMonkey/JSAPI_Reference/JS_CStringsAreUTF8 See patch in attachment.
Created attachment 146389 [details] [review] Call JS_SetCStringsAreUTF8() before the first call to JS_NewRuntime
This will break gjs_string_get_binary_data() / from_binary_data() I don't think you should need this. gjs_string_to_utf8 / from_utf8 should work, and should be used in any places that you would provide UTF-8. What function are you passing utf8 to, that currently breaks?
> This will break gjs_string_get_binary_data() / from_binary_data() I will investigate this. > I don't think you should need this. I don't think you should think this way. > What function are you passing utf8 to, that currently breaks? The gjs itself. I cannot execute programs with Cyrillic characters in their body or with Cyrillic command line arguments. Type any Unicode text in the gjs console, for example, e.g. [code]"Some Unicode characters"[/code]. You should see exact same text in output but you will see set of accented characters and umlauts instead. You can get Unicode samples here: http://www.ltg.ed.ac.uk/~richard/unicode-sample.html
Created attachment 146662 [details] [review] Better handling of Unicode in GJS
Created attachment 146663 [details] Test case
I updated patch. All test cases are passed except for raising of error on wrong UTF-8 string.
Have not had a chance to look at the patch in detail yet, but one question we need to look into: we use gjs inside the xulrunner process. If xulrunner doesn't use the CStringsAreUTF8 mode then things may blow up.
(In reply to comment #7) > If xulrunner > doesn't use the CStringsAreUTF8 mode then things may blow up. Can you provide a test case for that? What I should run to check it?
if (js_context->runtime == NULL) { + JS_SetCStringsAreUTF8(); js_context->runtime = JS_NewRuntime(32*1024*1024 /* max bytes */); If it's possible to create more than one GjsContext, then this is wrong, since JS_SetCStringsAreUTF8 can only be called once before the first JS_NewRuntime; see http://mxr.mozilla.org/mozilla-central/source/js/src/jsapi.cpp#5539 (it asserts in debug mode otherwise). (In reply to comment #7) > we use gjs inside the xulrunner process. If xulrunner > doesn't use the CStringsAreUTF8 mode then things may blow up. AFAICT xulrunner does not use this mode. https://bugzilla.mozilla.org/show_bug.cgi?id=318402#c2 said "we are not yet turning on this macro by default. Mozilla code is not ready for it." http://mxr.mozilla.org/mozilla-central/ident?i=JS_SetCStringsAreUTF8 turns up no calls to this function in xulrunner code. (Just for reference, this function was added in https://bugzilla.mozilla.org/show_bug.cgi?id=318402 ). BTW: I can't seem to find any code on the web that uses gjs inside xulrunner; where is this app where you do so? Or do you mean some proprietary thing you're working on at litl?
I think it would be generally more productive to identify the places where C strings are being passed to SpiderMonkey without proper attention to their encoding.
Created attachment 146936 [details] [review] Better handling of Unicode in GJS
> If it's possible to create more than one GjsContext, then this is wrong, since > JS_SetCStringsAreUTF8 can only be called once before the first JS_NewRuntime; > see http://mxr.mozilla.org/mozilla-central/source/js/src/jsapi.cpp#5539 (it > asserts in debug mode otherwise). I updated my patch to check is UTF8 mode enabled. > I think it would be generally more productive to identify the places where C strings are being passed to SpiderMonkey without proper attention to their encoding. Such way is endless source of errors. English programmers tends to forgot about proper conversation or to filter out "dangerous" characters.
(In reply to comment #12) > > I think it would be generally more productive to identify the places where C > strings are being passed to SpiderMonkey without proper attention to their > encoding. > > Such way is endless source of errors. English programmers tends to forgot about > proper conversation or to filter out "dangerous" characters. There is a huge unsubstantiated leap you are making between: A) The 'gjs' command line is observed not to handle encoding correctly. B) There is an overall problem that will pop up in many places with encoding handling unless the SpiderMonkey global switch is changed Without having evidence for that leap, I'm not going to review your patches and I don't think anyone else will either.
I am non-English speaker, so it is hard to me to select proper words for discussions like that. I will describe my experience with gjs, so it will be easier to understand me. My goal is to create localized programming environment, like 1C (Russian programming environment for accounting purposed) or localized scripting in *Office. Although some keywords in JavaScript cannot be translated (now), JavaScript allows utf-8 characters in identifiers, so API can be translated. I tried gjs and found, that it is not handling UTF-8 encoded programs properly. I inserted call to JS_SetCStringsAreUTF8() and gjs start to work flawlessly, which mean that gjs developers forgot to use encoding routines in all essential places. I also tried to translate GIR API and use C routines. GJS does conversion from UTF16 to UTF8 strings properly. After removing one check for "dangerous" characters in g-ir-compiler, I am able to use Ukrainian translation of API from GJS without any problems.
(In reply to comment #13) > (In reply to comment #12) > > > > I think it would be generally more productive to identify the places where C > > strings are being passed to SpiderMonkey without proper attention to their > > encoding. > > > > Such way is endless source of errors. English programmers tends to forgot about > > proper conversation or to filter out "dangerous" characters. I mean: "English-spoken programmers use ASCII charset and do not notice problems with other encoding too often.", not "English programmers want to ignore internationalization problems". I am sorry if you understand me in an other way. > There is a huge unsubstantiated leap you are making between: > > A) The 'gjs' command line is observed not to handle encoding correctly. I saw command line version of gjs only, Did you have any other version to check? I checked GnomeShell, and it works fine when Unicode strings are passed through GIR API, and it does not work when Unicode characters used in sources. > B) There is an overall problem that will pop up in many places with encoding > handling unless the SpiderMonkey global switch is changed You can check that by redefining of non-Unicode functions in jsapi.h to produce warnings (or errors) during compilation.
B) There is an overall problem that will pop up in many places with encoding handling unless the SpiderMonkey global switch is changed I checked source and found that there is 118 calls to non-unicode functions from jsapi.h: -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- [vlisivka@apollo gjs]$ egrep -nr 'JS_DefineProperty|JS_GetPropertyAttributes|JS_GetPropertyAttrsGetterAndSetter|JS_SetPropertyAttributes|JS_DefinePropertyWithTinyId|JS_AlreadyHasOwnProperty|JS_HasProperty|JS_LookupProperty|JS_GetProperty|JS_SetProperty|JS_DeleteProperty2|JS_DefineFunction|JS_CompileScript|JS_CompileScriptForPrincipals|JS_CompileFunction|JS_CompileFunctionForPrincipals|JS_EvaluateScript|JS_EvaluateScriptForPrincipals|JS_NewString|JS_NewStringCopyN|JS_NewStringCopyZ|JS_InternStringN|JS_InternString|JS_ReportErrorNumber|JS_ReportErrorFlagsAndNumber|JS_NewRegExpObject' . ./modules/gettext-native.c:226: if (!JS_DefineFunction(context, module_obj, ./modules/gettext-native.c:232: if (!JS_DefineFunction(context, module_obj, ./modules/gettext-native.c:238: if (!JS_DefineFunction(context, module_obj, ./modules/gettext-native.c:244: if (!JS_DefineFunction(context, module_obj, ./modules/gettext-native.c:250: if (!JS_DefineFunction(context, module_obj, ./modules/gettext-native.c:256: if (!JS_DefineFunction(context, module_obj, ./modules/gettext-native.c:262: if (!JS_DefineFunction(context, module_obj, ./modules/gettext-native.c:268: if (!JS_DefineFunction(context, module_obj, ./modules/lang.c:52: if (!JS_DefineFunction(context, module_obj, ./modules/dbus.c:981: argv[0] = STRING_TO_JSVAL(JS_NewStringCopyZ(context, name)); ./modules/dbus.c:1016: argv[0] = STRING_TO_JSVAL(JS_NewStringCopyZ(context, name)); ./modules/dbus.c:1196: argv[0] = STRING_TO_JSVAL(JS_NewStringCopyZ(context, name)); ./modules/dbus.c:1197: argv[1] = STRING_TO_JSVAL(JS_NewStringCopyZ(context, owner_unique_name)); ./modules/dbus.c:1235: argv[0] = STRING_TO_JSVAL(JS_NewStringCopyZ(context, name)); ./modules/dbus.c:1236: argv[1] = STRING_TO_JSVAL(JS_NewStringCopyZ(context, owner_unique_name)); ./modules/dbus.c:1388: s = JS_NewStringCopyZ(context, unique_name); ./modules/dbus.c:1482: machine_id_string = JS_NewStringCopyZ(context, machine_id); ./modules/dbus.c:1514: if (!JS_DefineProperty(context, bus_proto_obj, ./modules/dbus.c:1522: if (!JS_DefineFunction(context, bus_proto_obj, ./modules/dbus.c:1528: if (!JS_DefineFunction(context, bus_proto_obj, ./modules/dbus.c:1534: if (!JS_DefineFunction(context, bus_proto_obj, ./modules/dbus.c:1540: if (!JS_DefineFunction(context, bus_proto_obj, ./modules/dbus.c:1547: if (!JS_DefineFunction(context, bus_proto_obj, ./modules/dbus.c:1553: if (!JS_DefineFunction(context, bus_proto_obj, ./modules/dbus.c:1559: if (!JS_DefineFunction(context, bus_proto_obj, ./modules/dbus.c:1565: if (!JS_DefineFunction(context, bus_proto_obj, ./modules/dbus.c:1571: if (!JS_DefineFunction(context, bus_proto_obj, ./modules/dbus.c:1577: if (!JS_DefineFunction(context, bus_proto_obj, ./modules/dbus.c:1583: if (!JS_DefineFunction(context, bus_proto_obj, ./modules/dbus.c:1590: if (!JS_DefineProperty(context, module_obj, ./modules/dbus.c:1638: if (!JS_DefineProperty(context, bus_obj, "_dbusBusType", ./modules/dbus.c:1652: if (!JS_DefineProperty(context, module_obj, ./modules/dbus.c:1675: if (!JS_DefineFunction(context, module_obj, ./modules/dbus.c:1681: if (!JS_DefineProperty(context, module_obj, ./modules/dbus.c:1688: if (!JS_DefineProperty(context, module_obj, ./modules/dbus.c:1695: if (!JS_DefineProperty(context, module_obj, ./modules/dbus.c:1702: if (!JS_DefineProperty(context, module_obj, ./modules/mainloop.c:335: if (!JS_DefineFunction(context, module_obj, ./modules/mainloop.c:341: if (!JS_DefineFunction(context, module_obj, ./modules/mainloop.c:347: if (!JS_DefineFunction(context, module_obj, ./modules/mainloop.c:353: if (!JS_DefineFunction(context, module_obj, ./modules/mainloop.c:359: if (!JS_DefineFunction(context, module_obj, ./modules/mainloop.c:365: if (!JS_DefineFunction(context, module_obj, ./modules/debugger.c:129: if (!JS_DefineFunction(context, module_obj, ./modules/console.c:201: script = JS_CompileScript(context, obj, buffer, strlen(buffer), "typein", ./modules/console.c:235: if (!JS_DefineFunction(context, module_obj, ./modules/dbus-values.c:126: if (!JS_DefineProperty(context, obj, ./modules/dbus-values.c:1029: if (!JS_DefineProperty(context, JSVAL_TO_OBJECT(value), ./modules/dbus-values.c:1031: STRING_TO_JSVAL(JS_NewStringCopyZ(context, sender)), ./modules/dbus-exports.c:557: sender_string = JS_NewStringCopyZ(context, dbus_message_get_sender(method_call)); ./modules/dbus-exports.c:563: if (!JS_DefineProperty(context, ./modules/dbus-exports.c:580: if (!JS_DefineProperty(context, ./modules/dbus-exports.c:590: if (!JS_DefineProperty(context, ./modules/dbus-exports.c:607: signature_string = JS_NewStringCopyZ(context, signature); ./modules/dbus-exports.c:613: if (!JS_DefineProperty(context, ./modules/dbus-exports.c:1312: JS_SetProperty(context, obj, prop_name, &value); ./modules/dbus-exports.c:1866: if (!JS_DefineProperty(context, in_object, ./gjs/jsapi-util-array.c:318: value = STRING_TO_JSVAL(JS_NewStringCopyZ(context, "abcdefghijk")); ./gjs/jsapi-util.c:218:/* Checks whether an object has a property; unlike JS_GetProperty(), ./gjs/jsapi-util.c:230:/* Checks whether an object has a property; unlike JS_GetProperty(), ./gjs/jsapi-util.c:247: JS_GetProperty(context, obj, property_name, &value); ./gjs/jsapi-util.c:259: * while JS_GetProperty() treats only "no such property" as an error. ./gjs/jsapi-util.c:273: JS_GetProperty(context, obj, property_name, &value); ./gjs/jsapi-util.c:279: JS_ClearPendingException(context); /* in case JS_GetProperty() was on crack */ ./gjs/jsapi-util.c:282: /* remember gjs_throw() is a no-op if JS_GetProperty() ./gjs/jsapi-util.c:393: if (!JS_DefineProperty(load_context, in_object, ./gjs/jsapi-util.c:568: element = STRING_TO_JSVAL(JS_NewStringCopyZ(context, array_values[i])); ./gjs/jsapi-util.c:576: if (!JS_DefineProperty(context, in_object, ./gjs/jsapi-util.c:603: str = JS_NewStringCopyZ(context, klass->name); ./gjs/jsapi-util.c:879: JS_SetProperty(dst_context, JSVAL_TO_OBJECT(src_exc), "stack", &new_stack); ./gjs/jsapi-util.c:1074: if (!JS_GetProperty(context, date_constructor, "prototype", &date_prototype)) ./gjs/context.c:494: if (!JS_DefineProperty(js_context->context, js_context->global, ./gjs/context.c:508: if (!JS_DefineFunction(js_context->context, js_context->global, ./gjs/context.c:514: if (!JS_DefineFunction(js_context->context, js_context->global, ./gjs/context.c:692: "Exception was set prior to JS_EvaluateScript()"); ./gjs/context.c:696: if (!JS_EvaluateScript(js_context->context, ./gjs/context.c:720: "JS_EvaluateScript() failed but no exception message?"); ./gjs/context.c:724: "JS_EvaluateScript() failed but no exception message?"); ./gjs/context.c:737: "Exception was set even though JS_EvaluateScript() returned true - did you gjs_throw() but not return false somewhere perhaps?"); ./gjs/jsapi-util-error.c:79: jstr = JS_NewStringCopyZ(context, s); ./gjs/jsapi-util-error.c:86: func = JS_CompileFunction(context, ./gjs/jsapi-util-error.c:239: JS_GetProperty(context, JSVAL_TO_OBJECT(exc), "message", ./gjs/stack.c:114: if (!JS_GetPropertyDescArray(cx, call_obj, &call_props)) ./gjs/stack.c:154: JS_GetProperty(cx, call_obj, "arguments", &val) && ./gjs/stack.c:159: if (JS_GetProperty(cx, args_obj, "length", &val) && ./gjs/stack.c:166: if (JS_GetProperty(cx, args_obj, number, &val)) { ./gjs/jsapi-util-string.c:189: s = JS_NewStringCopyN(context, ./gjs/jsapi-util-string.c:519: js_string = JS_NewStringCopyZ(context, ascii_string); ./gjs/importer.c:69: if (!JS_DefineProperty(context, module_obj, ./gjs/importer.c:72: STRING_TO_JSVAL(JS_NewStringCopyZ(context, module_name)) : ./gjs/importer.c:81: if (!JS_DefineProperty(context, module_obj, ./gjs/importer.c:139: if (!JS_DefineProperty(context, obj, ./gjs/importer.c:163: if (!JS_GetPropertyAttributes(context, obj, name, ./gjs/importer.c:173: if (!JS_SetPropertyAttributes(context, obj, name, ./gjs/importer.c:294: JS_DefineProperty(context, in_object, ./gjs/importer.c:310: if (!JS_EvaluateScript(context, ./gjs/importer.c:329: "JS_EvaluateScript() returned FALSE but did not set exception"); ./gjs/importer.c:423: if (!JS_EvaluateScript(context, ./gjs/importer.c:443: "JS_EvaluateScript() returned FALSE but did not set exception"); ./gjs/importer.c:556: JS_DefineProperty(context, obj, ./gjs/importer.c:1157: if (!JS_DefineProperty(context, in_object, ./gjs/importer.c:1216: if (!JS_DefineProperty(context, in_object, ./gi/repo.c:77: if (JS_GetProperty(load_context, versions, ns_name, &version_val) && ./gi/repo.c:282: JS_DefineProperty(context, repo, ./gi/repo.c:295: JS_GetProperty(context, repo, "GLib", &value); ./gi/repo.c:310: if (!JS_DefineProperty(context, module_obj, ./gi/repo.c:341: if (!JS_DefineProperty(context, in_object, ./gi/object.c:217: * value. We could also use JS_DefineProperty though and specify a ./gi/boxed.c:912: result = JS_DefinePropertyWithTinyId(context, proto, field_name, i, ./gi/arg.c:344: if (!JS_GetPropertyById(context, props, prop_id, &val_js)) ./gi/arg.c:347: if (!JS_GetProperty(context, props, key_arg.v_pointer, &val_js)) ./gi/arg.c:1265: if (!JS_DefineProperty(context, obj, keyutf8, valjs, ./gi/function.c:583: if (!JS_DefineProperty(context, in_object, ./gi/ns.c:293: if (!JS_DefineProperty(context, in_object, ./gi/param.c:79: *value_p = STRING_TO_JSVAL(JS_NewStringCopyZ(context, value_str)); ./gi/keep-alive.c:368: if (!JS_DefineProperty(context, global, ./gi/enumeration.c:87: if (!JS_DefineProperty(context, in_object, ./gi/enumeration.c:165: if (!JS_DefineProperty(context, in_object, -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- So yes, there is an overall problem that will pop up in many places with encoding handling. IMHO, too much work is needed to investigate each case manually. It is much easier to toggle single switch.
All of JS_DefineFunction should be safe, since the module code just have ASCII function names. Probably true of most of the DefineProperty, though I bet there are exceptions like the importer which maps filesystem directories to properties. To be concrete ones of this form should be safe: ./gjs/stack.c:154: JS_GetProperty(cx, call_obj, "arguments", &val) Exactly how many there are would need some analysis of each one, but I'm fairly certain it's a lot lower than 118.
> All of JS_DefineFunction should be safe, since the module code just have ASCII function names. They have English names in ASCII encoding. At some point, English words can be translated into foreign languages (I working on that). Logic like "It is English only, so it safe to use ASCII encoding" creates problems to foreign users. Is it will be hard to place all such strings into separate header in UCS2 encodding? For example #define UCS_ARGUMENTS "a\0r\0g\0u\0m\0e\0n\0t\0s\0" #define UCS_ARGUMENTS_LENGTH 9 Such header can be produced automatically. PS. [offtopic]Second place on regional sport-dancing competition :-( [/offtopic]
Quickly checking I think there are two separate bugs here. First one is that given a javascript file with unicode identifiers we fail with syntax error while browsers and rhino and probably others work just fine. For example: function ääkkösiä() { } ääkkösiä(); (needs <meta http-equiv="Content-Type" content="text/html;charset=utf-8"> for browsers) I believe the reason this fails both with interactive console and when executed from a script is that we're using JS_CompileScript (in modules/console.c) and JS_EvaluateScript (in gjs/context.c and gjs/importer.js) which expect ASCII/latin1 (or maybe utf-8) OTOH Spidermonkey also has UCS2 versions of those functions JS_CompileUCScript JS_EvaluateUCScript We could require utf-8 input, convert to UCS2 and use those functions. It should fix most of the console/eval issues. Second problem is that most of our C API implicitly assumes ASCII/latin1 (e.g. gjs_object_get_property is using JS_GetProperty instead of JS_GetUCProperty) - the encoding isn't explicitly specified. Currently everything works, just doesn't allow unicode, because all string literals we're using are ASCII and we can't get unicode from JS because of the first bug. Arguably the right fix for the C API would be to assume utf-8 identifiers, convert to UCS2 and always use the UC variants - except with string literals. That would let one write C modules exporting unicode API. I'm not entirely sure whether or not fixing the first bug would expose unicode to the C API - I think all identifier names should be JSStrings and need to be converted from UCS2 to utf-8/ascii anyway. I'm thinking dbus and some logging might be affected.
Any progress with that bug in GJS? Why it is still in UNCONFIRMED state?
We use JS::CompileOptions::setUTF8() on all executed JS code now. I think this bug can be closed.