GNOME Bugzilla – Bug 314453
Nautilus crashes in Solaris when browsing the attached file.
Last modified: 2008-06-11 19:39:24 UTC
Locale: All UTF-8 locales OS: Solaris 1. Unpack the attached tar file (crashName.tar) 2. Launch Nautilus and browse this directory Result: Nautilus crashes.
Created attachment 51320 [details] Attaching the tar file
Attached filenames in the directory are gb18030 encoded, created on zh_CN.GB18030 locale. When nautilus loads a directory, the files in the directory need to be sorted. This is done by creating a key from the display name nautilus_file_get_display_name_collation_key() which uses g_utf8_collate_key() to generate the key. Here the file name contains 'invalid unicode' in UTF-8 locales. The g_utf8_collate_key() function returns 'NULL' in this case which is then passed to strcmp(). This results in the crash in Solaris. We include an additional check to ensure that the keys are not 'NULL' before passing them to strcmp().
Created attachment 51493 [details] [review] Attached patch fixes the issue
This seems wrong. nautilus_file_get_display_name_nocopy() should *always* return a utf8 string even if the on-disk filename is not utf8, otherwise all sorts of things can break. We need to figure out why its not returning utf8 instead.
I can't reproduce this on linux, so i can't debug further. You need to figure out why nautilus_file_get_display_name_nocopy() doesn't return utf8.
In nautilus_file_get_display_name_nocopy() we finally call eel_make_valid_utf8() to get a valid utf-8 name. This function converts the invalid bytes in the file name to '?' and appends 'Invalid Unicode' to the file name. The unichar U+708B7 falls in the range of valid unichar and hence eel_make_valid_utf8() includes this byte in the file name. Changing the range of the valid unichar so that U+708B7 doesn't fall into the valid category, will solve the issue.
Created attachment 51866 [details] [review] Patch to fix the issue
Comment on attachment 51866 [details] [review] Patch to fix the issue This patch is wrong. The current macro defines valid Unicode code points according to the Unicode standard. Not all points in this range have assigned values but that isn't a factor in UTF-8 manipulation. You need to figure out where you are getting the NULL from - as far as I can tell g_utf8_collate_key() will never return NULL for any input other than NULL.
As i understand, the the root cause is below code in function g_utf8_collate_key() { .... xfrm_len = strxfrm (NULL, str_norm, 0); result = g_malloc (xfrm_len + 1); strxfrm (result, str_norm, xfrm_len + 1); ... } strxfrm(NULL, str_norm, 0) will return -1 for some special chinese characters if it is gb18030 encoded on utf-8 locale, and thus result will be NULL. strxfrm seems to depend on OS too, hence the crash doesn't occur on suse Linux for the same file.
> This patch is wrong. The current macro defines valid > Unicode code points according to the Unicode standard. > Not all points in this range have assigned values but that isn't a > factor in UTF-8 manipulation. Unicode 4.0 is defined below. http://www.unicode.org/Public/UNIDATA/NamesList.txt U+30000 - U+E0000 is out of the range, then strxfrm returns -1.
This range doesn't have assigned characters but neither do quite a few code points in the BMP. Does strxfrm return -1 for those as well? I don't have the POSIX spec at hand here so I don't know what it says for strxfrm returning -1, but it doesn't seem to me that this is something that GLib can predict - what if GLib is built with the Unicode-4.1 character database, but the underlying OS only has 4.0. We already have handling of the case where the string isn't representable in the current locale. The same handling could be used for strxform returning -1.
I understand your mention. It is difficult to conver all assignments exactly. I'm not sure which version of Unicode are supported in glib however the above range has not assigned chars since Unicode 4.0: http://www.unicode.org/Public/4.0-Update/NamesList-4.0.0.txt We would like to avoid crashes at least.
May be related to bug 314255, which is also about strxfrm returning unexpected values.
Can we integrate the patch? Yes, strxfrm() depends on LC_COLLATE. Solaris supports Unicode 4.0. # 7/18/2003: Added additional LC_MONETARY keywords and definitions defined # in the SUSv3/AGR/Unix200x spec. # # 2/19/2004: Added Unicode 4.0 characters (total 1,226 new characters). # Details can be found from the following: # # http://www.unicode.org/versions/Unicode4.0.0/ #
To reiterate, the patch is wrong. I can't see any justification for adding some particular ranges of unassigned code points to UNICODE_VALID because the Solaris strxfrm() happens to choke on them. (Does it work on all other unassigned codepoinst?)
Ah, I didn't understand your question. I eliminated the critical codepoints but yes, it still keeps unassgined codepoints in the macro. OK, I would like to compile this bug. I think there are three problems: #1. nautilus crashes with strcmp (0, 0). #2. UNICODE_VALID() is implemented in glib but strxfrm() is in libc. #3. UNICODE_VALID() does not reflect unicode map perfectly. #1. nautilus crashes with strcmp (0, 0). To begin with, I expected the attachement 51493 is a good way. Now we need to evaluate if it is expected that g_utf8_collate_key() returns NULL. if (g_get_charset (&charset)) { xfrm_len = strxfrm (NULL, str_norm, 0); result = g_malloc (xfrm_len + 1); strxfrm (result, str_norm, xfrm_len + 1); } This calculates the string length, i.e. xfrm_len, as the current locale but not UTF-8. I think it's an expected way. Could you confirm it? That means strxfrm() is possible to return -1 with UTF-8 string on the current locale. Then g_malloc(0) returns NULL and g_utf8_collate_key() also returns NULL. What do you think? #2. UNICODE_VALID() is implemented in glib but strxfrm() is in libc. I think the unicode implementation is in one of them. What concepts do you have in glib? #3. UNICODE_VALID() does not reflect unicode map perfectly. Personally I would like to fix this method, too. I checked Solaris source codes. Solaris have the map of the two dimension array of unicode group and plan for each code points. strxfrm uses the array internally. I cannot transfer the code but it would be a hit. If it is expected that g_utf8_collate_key() returns NULL, I think we can fix the root cause.
* glib/gunicollate.c (g_utf8_collate_key): Handle strfxrm returning -1 a little better. Problem pointed out by Takao Fujiwara Please reopen if that doesn't fix your problem.