Bug 476356 – Script containing invalid UTF-8 characters is not loaded

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 476356 - Script containing invalid UTF-8 characters is not loaded


Summary:	Script containing invalid UTF-8 characters is not loaded


Status:	RESOLVED FIXED

Product:	GIMP
Classification:	Other
Component:	Script-Fu
Version:	git master
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	2.4
Assigned To:	GIMP Bugs
QA Contact:	GIMP Bugs

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2007-09-12 21:55 UTC by Michael Schumacher
Modified:	2007-09-21 23:53 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
script with japanese comments (101.96 KB, text/x-scheme) 2007-09-12 21:59 UTC, Michael Schumacher		Details
patch to implement proper handling of utf-8. (3.00 KB, patch) 2007-09-17 02:35 UTC, Simon Budig	none	Details \| Review
Fix of the previous patch (2.63 KB, patch) 2007-09-17 02:38 UTC, Simon Budig	none	Details \| Review
Simplified and bug-fixed version of previous patch. (2.56 KB, patch) 2007-09-21 22:18 UTC, Kevin Cozens	committed	Details \| Review

Description Michael Schumacher 2007-09-12 21:55:00 UTC

Steps to reproduce:

1. get photoeffects script from http://users.telenet.be/ev1/gimpphotoeffects_en.html (I'll attach it, too)
2. install it
3. run gimp or refresh scripts
4. notice that you do not get the expected menu entires (<Image>->Script-Fu->Photo Effects)

but

5. remove all japanese comments
6. notice that scripts starts to work


Is this an encoding issue? Firefox claims that the script is in Shift_JIS, file says "Non-ISO extended ASCII English text".

Comment 1 Michael Schumacher 2007-09-12 21:59:14 UTC

Created attachment 95484 [details]
script with japanese comments

Comment 2 Kevin Cozens 2007-09-12 23:28:33 UTC

The problem is an encoding issue. Comment lines containing characters that are not valid ASCII and are not valid UTF-8 will cause a problem. The inchar() routine of TinyScheme needs to be updated to ignore over bad characters.

Comment 3 Sven Neumann 2007-09-13 06:35:41 UTC

If fixing the routing handling the comments is not simple, we should simply require scripts to be in UTF-8 encoding.

Comment 4 Michael Natterer 2007-09-13 10:22:16 UTC

I thought that was a requirement anyway since we switched to tiny scheme?

Comment 5 Raphaël Quinet 2007-09-13 13:45:34 UTC

Since we switched to tinyscheme, the scripts are expected to be in UTF-8. But we could still be tolerant and ignore invalid characters in comments (in non-functional parts of the script). Otherwise, some users might consider this as a regression because their scripts that were (incorrectly) encoded using a local character set were working with previous GIMP versions and are not working anymore. 

But I agree with Sven in comment #3: if it is too hard to be tolerant, then we could instead be more strict and require all scripts to be in UTF-8. We could validate the script once before parsing it and display a sensible error message if the script encoding is not UTF-8.

Comment 6 Kevin Cozens 2007-09-13 18:02:01 UTC

I was just reviewing how UTF-8 characters are coded. Based on what I have just read it should be simple enough to modify the basic_inchar() routine of TinyScheme to ignore invalid UTF-8 characters.

basic_inchar() is called each time a new character is needed from the input source (either a file or string buffer). It should ignore any bytes until either the top bit is 0 (byte < 128) or the top two bits are set (byte >= 192). Both conditions indicate the start of a (possibly) valid character. 

Doing the above would let basic_inchar() return valid characters at the expense of being unable to have a script read a binary file. In order to read a binary file one would need to know when a script is being parsed and when a script is being run.

Comment 7 Kevin Cozens 2007-09-14 18:40:08 UTC

Changed summary line to provide a better indication of the problem.

Comment 8 Simon Budig 2007-09-17 02:35:44 UTC

Created attachment 95699 [details] [review]
patch to implement proper handling of utf-8.

This patch implements proper UTF8 parsing and discards invalid byte sequences silently.

Note that this does not address enabling binary I/O.

Comment 9 Simon Budig 2007-09-17 02:38:42 UTC

Created attachment 95700 [details] [review]
Fix of the previous patch

Sorry, the previous patch had parts of an unrelated fix in it.

Comment 10 Sven Neumann 2007-09-17 07:05:38 UTC

Since you are already doing all the checks in your code, you should probably use g_utf8_get_char() instead of g_utf8_get_char_validated().

Comment 11 Kevin Cozens 2007-09-21 22:18:21 UTC

Created attachment 95992 [details] [review]
Simplified and bug-fixed version of previous patch.

Updated version of previous patch. Fixes a couple of minor errors in the previous patch and slightly simplifies it.

Comment 12 Kevin Cozens 2007-09-21 23:53:48 UTC

2007-09-21  Kevin Cozens  <kcozens@cvs.gnome.org>

        * plug-ins/script-fu/tinyscheme/scheme.c (basic_inchar): Applied
        modified patch from Simon Budig. Any bytes read from a file which
        are not valid UTF-8 characters will be ignored. Fixes bug #476356.