GNOME Bugzilla – Bug 137029
non-interface strings from script-fu scripts get included in the script-fu po file
Last modified: 2004-12-22 21:47:04 UTC
Several strings that should not be translated ended up in the script-fu .po files. These strings consist of snippets of Scheme code. Example from mkbrush.scm: )))\n (filename (string-append data-dir\n \t\t\t The affected scripts are: mkbrush.scm, select-to-brush.scm and select-to-pattern.scm. The problem may be caused by the use of an underscore in a preceding line. From the same script: (data-dir (car (gimp-gimprc-query "gimp_dir"))) (filename (string-append data-dir I grepped the scripts directory using the /_[^"][a-z]+["]/ regexp, and only unearthed the above mentioned scripts. (The regexp found more scripts, but in these the underscore appeared in a translable string.)
This bug was originally reported against GIMP-2.0pre4. It seems to happen with intltool-0.30 (verified) and also with intltool-0.27.2 (I'd have to verify that again).
the problem is still there. Branko can you provice me with a reg ex that only matches the translatable strings?
I'm not really sure on the syntax here, but the following approach might give better results: sub type_scheme { while ($input =~ /(_)?"((?:[^"\\]+|\\.)*)"/sg) { if (defined($1) && ($1 eq "_")) { $messages{$2} = []; } } } The idea is to get one string (two quotes and what's between them) at a time, thereby eliminating chances of having '_"' detected inside a string. Then, we check if the string was preceded with a "_". OTOH, I dropped "\w*\(?" and "\)?" from re, since I don't know what it's for (and I cannot test it). It can be reinserted, and only check ($1 eq "_") would have to be adjusted to something like ($1 =~ /^_/). Since my knowledge of Scheme is non-existent, others will have to comment on the correctness of the patch.
Kenneth, an expression that would only match valid translation strings would be something like /_["][^"]+["]/. This means: _ Find an underscore ["] Followed by a double quote (could probably also be written as \") [^"]+ Followed by one or more non-quotes ["] Followed by a double quote (The reason the double quote needs to be escaped in the regex is because in some regular syntaxes the double quote character is meaningfull. However, this doesn't work if Scheme itself allows for double quote escaping. For instance, if this is a valid string in Scheme: _"This is a \"good\" translatable string." my expression matches '_"This is a \"'. If a double quote can be escaped as \", then the expression would have to be something like: _["]((\["])|[^"])+["] However, I did not test that last one. As far as I can tell, no Scheme scripts are shipped with the GIMP that have escaped quotes.
Branko, I feel your approach has one (big?) drawback. If there's a certain string preceding translatable one, you might run into problems. Eg. imagine list like this one: ("this_" _"no do") Your regex would match on _" _" here, which is clearly wrong (we want "no do" to be marked for translation), and would mess up all the other extractions as well I think. That's why I go through *all* the strings in my proposal (i.e. I assume that all strings are correctly given with both start/end quotes), and check if the current one is prepended with "_". I also treat any escaped characters as single characters (so using \" would be no problem). Let me explain my regex a bit, so you can tell if it's suitable, and see if I'm doing something wrong: /(_)?"((?:[^"\\]+|\\.)*)"/ ^^^^ if any string is prepended with "_" put that as $1 ^^^^^^^ any number of non-quote and non-backslash characters ^^^^ or backslash followed by single character ^ any number of repeats of this "?:" is simply not to allocate a register for parentheses match Arguably, this regex could have been simply /(_)?"((?:[^"\\]|\\.)*)"/, but I don't know why I didn't do it that way, and I remember I have tested the former suggestion. My approach would also fail if double quote is otherwise allowed to be standalone in Scheme (or if one can have something like ('blah\"blah'), with unmatched double quote). Also, in Elisp (Emacs Lisp), "\\\n" (backslash followed by newline) is ignored, so we might want to do that as well.
We have in the meantime changed all strings that used to confuse intltool. So this is at the moment not a problem for GIMP. But of course with new scripts being added, the problem could show up again and should not be considered fixed.
In comment #5 it states a problem with the regex near the end of comment #4. A check for whitespace before the _" sequence would seem to be a simple fix. In the case of Scheme scripts the addition of the whitespace check before _" should help. The current regexp in the extraction script is allowing some other characters to appear between the _ and " which may be needed for C code but not for Scheme.
The only strings marked for translation in Script-Fu scripts are found in the script-fu-register statement and take the form of _"string" surrounded by whitespace. Script-Fu does not support the translation of strings elsewhere in a script. On the other hand, Tiny-Fu supports translation of strings of the form _"string" in the register block as well as anywhere they may appear in a script. No brackets are involved in the marking of a string for translation in Scheme and no characters appear between the _ and " which mark the start of a string to be translated. I modified my local version of intltool-extract as follows: sub type_scheme { while ($input =~ /_"(((\\")|[^"])+)"/sg) { $messages{$1} = []; } } Preliminary tests indicate that this simpler regex is working properly and that it does not pick up stray bits of Scheme code.
Kevin, does this regex work with eg. ;; somewhere up in the code (if (< 1 0) ("blah _")) ... (script-fu-register "script-fu-slide" _"<Image>/Script-Fu/Blah..." "Makes image look like blah") I somehow doubt it (and my tests agree), and I don't think it's wise to introduce a requirement that no strings in *code* end with "_". Can someone craft a testcase which will catch all the cornerstone cases such as this one? If you instead do what I did above, keeping your regex (with slight changes): sub type_scheme { $input =~ s/^;.*//g; # see below for this one while ($input =~ /(_)?"(((\\")|[^"])*)"/sg) { if (defined($1) and $1 eq "_") { $messages{$2} = []; } } } (I just made "_" optional, and check if it's present, and use * instead of +, so to allow "" which happens in current code) Now it works fine for me with this testcase. I can still see a problem with this approach if unmatched double quotes are allowed anywhere in the code (except escaped in a string itself). For instance, the following example will still fail (without underscore in the comment, it would not fail for you, but will fail with my change): ;; somewhere up in the code we use _" (script-fu-register "script-fu-slide" _"<Image>/Script-Fu/Blah..." "Makes image look like blah") So, what should probably be done is to ignore comments (before while loop): $input =~ s/^;.*//g; Can quote stand by itself anywhere else? Here's a simple script that will test changes between two intltool-extract's (remove redirection to /dev/null if you want to see diffs, adjust NEWINTLTOOL and OLDINTLTOOL appropriately): #!/bin/sh NEWINTLTOOL=/brlja/bin/intltool-extract OLDINTLTOOL=/brlja/test/bin/intltool-extract for i in *.scm; do $NEWINTLTOOL --quiet --type=gettext/scheme $i cp $i.h new.$i.h $OLDINTLTOOL --quiet --type=gettext/scheme $i diff -u $i.h new.$i.h >/dev/null || echo $i done # --- cut here --- In Gimp HEAD, it finds no differences for all the scm files in plug-ins/script-fu/scripts/. In Gimp gimp-2-0, it prints out: alien-glow-button.scm alien-glow-logo.scm basic2-logo.scm gradient-bevel-logo.scm mkbrush.scm select-to-brush.scm select-to-pattern.scm I'll attach output with >/dev/null removed, so you can see that it only removes erroneous matches, and there're no regressions. It also catches one new message in gimp-2-0 which was previously ignored (see following attachment). I feel pretty safe to commit this. Does anyone have any objection? [I'd still like a testcase as well]
Created attachment 32545 [details] [review] Changes between two runs of intltool-extract
Also note that Kevin's approach indeed does work for all gimp-2-0 files exactly same as mine, but I'd still like all these corner cases to be handled correctly (_" in a comment, or if string ends with _: this is why I go to so much trouble here). Not doing this is just waiting for a problem to happen (imagine a programmer inserting a comment like ;;; here we use _" to mark strings for translation ), when we can solve it right away.
I'll commit this later today if nobody steps forward. As I explained above, I did a hefty stress (all files included) test on current Gimp 2.0 and Gimp HEAD .scm files and saw no regressions.
If it wokrs for GIMP, I imagine it's ok for now. The only other thing I can think of that might have a problem, would be gnucash. I don't know if they use intltool or not though.
The suggested changes to the regex do improve things but as pointed out in comment #9 do not cover all possibilities. The two situations mentioned in comment #9 can still cause problems but will probably be rare. We may have to live with those two cases. I don't see any single regex being able to deal with all possible scenarios. The only way to fully handle all cases would be to use a small C program that can keep track of the context of any _" sequences it finds. I don't think making the leading _ of _" optional as suggested in comment #9 is correct. That would mean all strings would be found and not just the ones marked for translation. I don't think a string of the form "" needs to be translated so the use of + is still preferable to *.
Kevin, read the code a bit more carefuly ;-) _ is optional in the regex, but I then check if it's present and add string for translation only if it is (that's what the if (defined...) is for). It's optional so regex would keep track of quote-matching for me, and that's why I allow "" as well (it won't be translated since it's not prepended with a _). So, my final version above does work for *all* obscure cases I could think of. I glanced through Scheme language specification I ran across, and I see no problems with it: quotes are used only to delimit strings, and cannot appear anywhere else (except in comments, which I also handle, or escaped in strings which are handled as well). So, by finding all the strings (even those not marked for translation), I'm certain to have quote-matching/pairing correct. As I said, I ran a stress-test on *all* the Gimp .scm files, and found no differences in HEAD (and differences in gimp-2-0 are all improvements, check the attached diff above). No erroneous messages were catched.
For reference, here's the final proposed version of the subroutine: sub type_scheme { $input =~ s/^;.*//g; while ($input =~ /(_)?"(((\\")|[^"])*)"/sg) { if (defined($1) and $1 eq "_") { $messages{$2} = []; } } }
Ok, I see it now. :-) I think the only other change that might be useful in your proposed version is to allow optional white space between start of line and the ; marking the start of a comment.
Indeed, you're quite right, and that opens another full can of worms ;-) Imagine: "blah; blah" ; this is nasty comment _" or "blah ; is this allowed?" (the latter form [string continuing after newline] seems to be *unspecified* in http://www.swiss.ai.mit.edu/~jaffer/r5rs_8.html#SEC62 [Scheme language reference]), so it's probably a non-issue, but the first one should be taken care of) So, I'd go the following route here: check for any pairs of quotes in a line, ignore them until first ; outside quotes is found. We, of course, have to be "ungreedy" with that, so that in cases such as '"blah;blah" ; "hey" ; this' everything after first "blah;blah" is treated as a comment. So, here's the nasty comment-stripping regex: $input =~ s/^(([^";]*"(((\\")|[^"])*)"[^";]*)*?);.*/defined($1)?$1:''/ge; But, this still fails in some cases, so it might be better to actually implement a simple finite-state parser. I've got that working fine as well (I'll attach that for comparison), but it's about 10% slower on 285 scm files I took from Gimp 2.0 tree (I can attach test results and scripts as well; it's 10% out of total 6-7 seconds). The upside would be that it's way easier to understand and modify, and more likely to be correct. Another note: I won't be committing anything without extensive test case (I've almost finished one; it helped me notice a few more problems in the regex handling as above, so that's why a parser might be better suited). Note that all the modifications I'm doing are tested not to break anything on *all* existing Gimp files.
Created attachment 32639 [details] Line-by-line, char-by-char Scheme parser Ok, here's the type_scheme which should work in all the cases we're aware of, without ugly regexes. It's slower, but more likely to work.
Created attachment 32640 [details] Test for checking Scheme strings extraction This is the test I have so far come up with.
Sorry, one elsif branch (the elseif (!$state) one) is completely unneccessary in the above version of the function (I've uploaded a wrong file).
I've committed a slightly different version of the above function (so, I've gone the state-parser way), and improved a test case. I also added support for translators' comments in Scheme code (we might need it in the future), which is also tested in the test case. Again, I did a test on entire Gimp collection of Scheme files, and found no breakages. Marking as RESOLVED/FIXED, reopen if you come up with another issue.