GNOME Bugzilla – Bug 476356
Script containing invalid UTF-8 characters is not loaded
Last modified: 2007-09-21 23:53:48 UTC
Steps to reproduce: 1. get photoeffects script from http://users.telenet.be/ev1/gimpphotoeffects_en.html (I'll attach it, too) 2. install it 3. run gimp or refresh scripts 4. notice that you do not get the expected menu entires (<Image>->Script-Fu->Photo Effects) but 5. remove all japanese comments 6. notice that scripts starts to work Is this an encoding issue? Firefox claims that the script is in Shift_JIS, file says "Non-ISO extended ASCII English text".
Created attachment 95484 [details] script with japanese comments
The problem is an encoding issue. Comment lines containing characters that are not valid ASCII and are not valid UTF-8 will cause a problem. The inchar() routine of TinyScheme needs to be updated to ignore over bad characters.
If fixing the routing handling the comments is not simple, we should simply require scripts to be in UTF-8 encoding.
I thought that was a requirement anyway since we switched to tiny scheme?
Since we switched to tinyscheme, the scripts are expected to be in UTF-8. But we could still be tolerant and ignore invalid characters in comments (in non-functional parts of the script). Otherwise, some users might consider this as a regression because their scripts that were (incorrectly) encoded using a local character set were working with previous GIMP versions and are not working anymore. But I agree with Sven in comment #3: if it is too hard to be tolerant, then we could instead be more strict and require all scripts to be in UTF-8. We could validate the script once before parsing it and display a sensible error message if the script encoding is not UTF-8.
I was just reviewing how UTF-8 characters are coded. Based on what I have just read it should be simple enough to modify the basic_inchar() routine of TinyScheme to ignore invalid UTF-8 characters. basic_inchar() is called each time a new character is needed from the input source (either a file or string buffer). It should ignore any bytes until either the top bit is 0 (byte < 128) or the top two bits are set (byte >= 192). Both conditions indicate the start of a (possibly) valid character. Doing the above would let basic_inchar() return valid characters at the expense of being unable to have a script read a binary file. In order to read a binary file one would need to know when a script is being parsed and when a script is being run.
Changed summary line to provide a better indication of the problem.
Created attachment 95699 [details] [review] patch to implement proper handling of utf-8. This patch implements proper UTF8 parsing and discards invalid byte sequences silently. Note that this does not address enabling binary I/O.
Created attachment 95700 [details] [review] Fix of the previous patch Sorry, the previous patch had parts of an unrelated fix in it.
Since you are already doing all the checks in your code, you should probably use g_utf8_get_char() instead of g_utf8_get_char_validated().
Created attachment 95992 [details] [review] Simplified and bug-fixed version of previous patch. Updated version of previous patch. Fixes a couple of minor errors in the previous patch and slightly simplifies it.
2007-09-21 Kevin Cozens <kcozens@cvs.gnome.org> * plug-ins/script-fu/tinyscheme/scheme.c (basic_inchar): Applied modified patch from Simon Budig. Any bytes read from a file which are not valid UTF-8 characters will be ignored. Fixes bug #476356.