Bug 137029 – non-interface strings from script-fu scripts get included in the script-fu po file

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 137029 - non-interface strings from script-fu scripts get included in the script-fu po file


Summary:	non-interface strings from script-fu scripts get included in the script-fu po...


Status:	RESOLVED FIXED

Product:	intltool
Classification:	Deprecated
Component:	general
Version:	unspecified
Hardware:	Other other

Importance:	Normal minor
Target Milestone:	---
Assigned To:	intltool maintainers
QA Contact:	intltool maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2004-03-13 00:29 UTC by Branko Collin
Modified:	2004-12-22 21:47 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Changes between two runs of intltool-extract (2.96 KB, patch) 2004-10-13 07:57 UTC, Danilo Segan	none	Details \| Review
Line-by-line, char-by-char Scheme parser (1.24 KB, text/plain) 2004-10-15 12:17 UTC, Danilo Segan		Details
Test for checking Scheme strings extraction (308 bytes, text/plain) 2004-10-15 12:19 UTC, Danilo Segan		Details

Description Branko Collin 2004-03-13 00:29:02 UTC

Several strings that should not be translated ended up in the script-fu .po
files. These strings consist of snippets of Scheme code. 

Example from mkbrush.scm: 

)))\n
     (filename (string-append data-dir\n
\t\t\t      

The affected scripts are: mkbrush.scm, select-to-brush.scm and
select-to-pattern.scm.

The problem may be caused by the use of an underscore in a preceding line.
From the same script: 

     (data-dir (car (gimp-gimprc-query "gimp_dir")))
     (filename (string-append data-dir

I grepped the scripts directory using the /_[^"][a-z]+["]/ regexp, and only
unearthed the above mentioned scripts. (The regexp found more scripts, but
in these the underscore appeared in a translable string.)

Comment 1 Sven Neumann 2004-03-13 02:43:44 UTC

This bug was originally reported against GIMP-2.0pre4. It seems to
happen with intltool-0.30 (verified) and also with intltool-0.27.2
(I'd have to verify that again).

Comment 2 Kenneth Rohde Christiansen 2004-06-12 21:17:01 UTC

the problem is still there. Branko can you provice me with a reg ex that only
matches the translatable strings?

Comment 3 Danilo Segan 2004-06-14 12:42:26 UTC

I'm not really sure on the syntax here, but the following approach might give
better results:

sub type_scheme {
    while ($input =~ /(_)?"((?:[^"\\]+|\\.)*)"/sg) {
        if (defined($1) && ($1 eq "_")) {
	    $messages{$2} = [];
        }
    }
}

The idea is to get one string (two quotes and what's between them) at a time,
thereby eliminating chances of having '_"' detected inside a string.  Then, we
check if the string was preceded with a "_". 

OTOH, I dropped "\w*\(?" and "\)?" from re, since I don't know what it's for
(and I cannot test it).  It can be reinserted, and only check ($1 eq "_") would
have to be adjusted to something like ($1 =~ /^_/).

Since my knowledge of Scheme is non-existent, others will have to comment on the
correctness of the patch.

Comment 4 Branko Collin 2004-07-13 09:47:38 UTC

Kenneth, an expression that would only match valid translation strings would be
something like /_["][^"]+["]/. 

This means: 

_      Find an underscore
["]    Followed by a double quote (could probably also be written as \")
[^"]+  Followed by one or more non-quotes
["]    Followed by a double quote

(The reason the double quote needs to be escaped in the regex is because in some
regular syntaxes the double quote character is meaningfull.

However, this doesn't work if Scheme itself allows for double quote escaping.
For instance, if this is a valid string in Scheme: 

_"This is a \"good\" translatable string."

my expression matches '_"This is a \"'.

If a double quote can be escaped as \", then the expression would have to be
something like: 

_["]((\["])|[^"])+["]

However, I did not test that last one. As far as I can tell, no Scheme scripts
are shipped with the GIMP that have escaped quotes.

Comment 5 Danilo Segan 2004-10-12 14:46:41 UTC

Branko, I feel your approach has one (big?) drawback.

If there's a certain string preceding translatable one, you might run into problems.

Eg. imagine list like this one:
("this_" _"no do")

Your regex would match on _" _" here, which is clearly wrong (we want "no do" to
be marked for translation), and would mess up all the other extractions as well
I think.

That's why I go through *all* the strings in my proposal (i.e. I assume that all
strings are correctly given with both start/end quotes), and check if the
current one is prepended with "_". I also treat any escaped characters as single
characters (so using \" would be no problem).

Let me explain my regex a bit, so you can tell if it's suitable, and see if I'm
doing something wrong:
/(_)?"((?:[^"\\]+|\\.)*)"/
 ^^^^ if any string is prepended with "_" put that as $1
          ^^^^^^^ any number of non-quote and non-backslash characters
                 ^^^^ or backslash followed by single character
                      ^ any number of repeats of this

"?:" is simply not to allocate a register for parentheses match

Arguably, this regex could have been simply /(_)?"((?:[^"\\]|\\.)*)"/, but I
don't know why I didn't do it that way, and I remember I have tested the former
suggestion.

My approach would also fail if double quote is otherwise allowed to be
standalone in Scheme (or if one can have something like ('blah\"blah'), with
unmatched double quote). Also, in Elisp (Emacs Lisp), "\\\n" (backslash followed
by newline) is ignored, so we might want to do that as well.

Comment 6 Sven Neumann 2004-10-12 15:53:17 UTC

We have in the meantime changed all strings that used to confuse intltool. So
this is at the moment not a problem for GIMP. But of course with new scripts
being added, the problem could show up again and should not be considered fixed.

Comment 7 Kevin Cozens 2004-10-13 02:29:51 UTC

In comment #5 it states a problem with the regex near the end of comment #4. A
check for whitespace before the _" sequence would seem to be a simple fix. In
the case of Scheme scripts the addition of the whitespace check before _" should
help. The current regexp in the extraction script is allowing some other
characters to appear between the _ and " which may be needed for C code but not
for Scheme.

Comment 8 Kevin Cozens 2004-10-13 05:34:00 UTC

The only strings marked for translation in Script-Fu scripts are found in the
script-fu-register statement and take the form of _"string" surrounded by
whitespace. Script-Fu does not support the translation of strings elsewhere in a
script. On the other hand, Tiny-Fu supports translation of strings of the form
_"string" in the register block as well as anywhere they may appear in a script.

No brackets are involved in the marking of a string for translation in Scheme
and no characters appear between the _ and " which mark the start of a string to
be translated.

I modified my local version of intltool-extract as follows:

sub type_scheme {
    while ($input =~ /_"(((\\")|[^"])+)"/sg) {
    $messages{$1} = [];
    }
}

Preliminary tests indicate that this simpler regex is working properly and that
it does not pick up stray bits of Scheme code.

Comment 9 Danilo Segan 2004-10-13 07:55:29 UTC

Kevin, does this regex work with eg.

;; somewhere up in the code 
(if (< 1 0)
    ("blah _"))
...

(script-fu-register "script-fu-slide"
                    _"<Image>/Script-Fu/Blah..."
                    "Makes image look like blah")

I somehow doubt it (and my tests agree), and I don't think it's wise to
introduce a requirement that no strings in *code* end with "_". Can someone
craft a testcase which will catch all the cornerstone cases such as this one?

If you instead do what I did above, keeping your regex (with slight changes):
sub type_scheme {
    $input =~ s/^;.*//g; # see below for this one

    while ($input =~ /(_)?"(((\\")|[^"])*)"/sg) {
      if (defined($1) and $1 eq "_") {
        $messages{$2} = [];
      }
    }
}

(I just made "_" optional, and check if it's present, and use * instead of +, so
to allow "" which happens in current code)

Now it works fine for me with this testcase. I can still see a problem with this
approach if unmatched double quotes are allowed anywhere in the code (except
escaped in a string itself).  For instance, the following example will still
fail (without underscore in the comment, it would not fail for you, but will
fail with my change):
;; somewhere up in the code we use _"
(script-fu-register "script-fu-slide"
                    _"<Image>/Script-Fu/Blah..."
                    "Makes image look like blah")

So, what should probably be done is to ignore comments (before while loop):
    $input =~ s/^;.*//g;

Can quote stand by itself anywhere else?

Here's a simple script that will test changes between two intltool-extract's
(remove redirection to /dev/null if you want to see diffs, adjust NEWINTLTOOL
and OLDINTLTOOL appropriately):
#!/bin/sh
NEWINTLTOOL=/brlja/bin/intltool-extract
OLDINTLTOOL=/brlja/test/bin/intltool-extract
for i in *.scm;
do
  $NEWINTLTOOL --quiet --type=gettext/scheme $i
  cp $i.h new.$i.h
  $OLDINTLTOOL --quiet --type=gettext/scheme $i
  diff -u $i.h new.$i.h >/dev/null || echo $i
done
# --- cut here ---

In Gimp HEAD, it finds no differences for all the scm files in
plug-ins/script-fu/scripts/. In Gimp gimp-2-0, it prints out:

alien-glow-button.scm
alien-glow-logo.scm
basic2-logo.scm
gradient-bevel-logo.scm
mkbrush.scm
select-to-brush.scm
select-to-pattern.scm

I'll attach output with >/dev/null removed, so you can see that it only removes
erroneous matches, and there're no regressions. It also catches one new message
in gimp-2-0 which was previously ignored (see following attachment).

I feel pretty safe to commit this. Does anyone have any objection? [I'd still
like a testcase as well]

Comment 10 Danilo Segan 2004-10-13 07:57:42 UTC

Created attachment 32545 [details] [review]
Changes between two runs of intltool-extract

Comment 11 Danilo Segan 2004-10-13 08:09:59 UTC

Also note that Kevin's approach indeed does work for all gimp-2-0 files exactly
same as mine, but I'd still like all these corner cases to be handled correctly
(_" in a comment, or if string ends with _: this is why I go to so much trouble
here).  Not doing this is just waiting for a problem to happen (imagine a
programmer inserting a comment like
;;; here we use _" to mark strings for translation
), when we can solve it right away.

Comment 12 Danilo Segan 2004-10-14 10:45:54 UTC

I'll commit this later today if nobody steps forward. As I explained above, I
did a hefty stress (all files included) test on current Gimp 2.0 and Gimp HEAD
.scm files and saw no regressions.

Comment 13 Rodney Dawes 2004-10-14 16:39:02 UTC

If it wokrs for GIMP, I imagine it's ok for now. The only other thing I can
think of that might have a problem, would be gnucash. I don't know if they use
intltool or not though.

Comment 14 Kevin Cozens 2004-10-14 18:58:10 UTC

The suggested changes to the regex do improve things but as pointed out in
comment #9 do not cover all possibilities. The two situations mentioned in
comment #9 can still cause problems but will probably be rare. We may have to
live with those two cases. I don't see any single regex being able to deal with
all possible scenarios. The only way to fully handle all cases would be to use a
small C program that can keep track of the context of any _" sequences it finds.

I don't think making the leading _ of _" optional as suggested in comment #9 is
correct. That would mean all strings would be found and not just the ones marked
for translation. I don't think a string of the form "" needs to be translated so
the use of + is still preferable to *.

Comment 15 Danilo Segan 2004-10-14 20:02:18 UTC

Kevin, read the code a bit more carefuly ;-)

_ is optional in the regex, but I then check if it's present and add string for
translation only if it is (that's what the if (defined...) is for). It's
optional so regex would keep track of quote-matching for me, and that's why I
allow "" as well (it won't be translated since it's not prepended with a _).

So, my final version above does work for *all* obscure cases I could think of. I
glanced through Scheme language specification I ran across, and I see no
problems with it: quotes are used only to delimit strings, and cannot appear
anywhere else (except in comments, which I also handle, or escaped in strings
which are handled as well). So, by finding all the strings (even those not
marked for translation), I'm certain to have quote-matching/pairing correct.

As I said, I ran a stress-test on *all* the Gimp .scm files, and found no
differences in HEAD (and differences in gimp-2-0 are all improvements, check the
attached diff above). No erroneous messages were catched.

Comment 16 Danilo Segan 2004-10-14 20:04:25 UTC

For reference, here's the final proposed version of the subroutine:

sub type_scheme {
    $input =~ s/^;.*//g;

    while ($input =~ /(_)?"(((\\")|[^"])*)"/sg) {
        if (defined($1) and $1 eq "_") {
            $messages{$2} = [];
        }
    }
}

Comment 17 Kevin Cozens 2004-10-15 03:31:47 UTC

Ok, I see it now. :-)  I think the only other change that might be useful in
your proposed version is to allow optional white space between start of line and
the ; marking the start of a comment.

Comment 18 Danilo Segan 2004-10-15 12:04:15 UTC

Indeed, you're quite right, and that opens another full can of worms ;-)
Imagine:
"blah; blah" ; this is nasty comment _"
or
"blah
; is this allowed?"

(the latter form [string continuing after newline] seems to be *unspecified* in
http://www.swiss.ai.mit.edu/~jaffer/r5rs_8.html#SEC62 [Scheme language
reference]), so it's probably a non-issue, but the first one should be taken
care of)

So, I'd go the following route here: check for any pairs of quotes in a line,
ignore them until first ; outside quotes is found. We, of course, have to be
"ungreedy" with that, so that in cases such as '"blah;blah" ; "hey" ; this'
everything after first "blah;blah" is treated as a comment.

So, here's the nasty comment-stripping regex:
    $input =~ s/^(([^";]*"(((\\")|[^"])*)"[^";]*)*?);.*/defined($1)?$1:''/ge;

But, this still fails in some cases, so it might be better to actually implement
a simple finite-state parser. I've got that working fine as well (I'll attach
that for comparison), but it's about 10% slower on 285 scm files I took from
Gimp 2.0 tree (I can attach test results and scripts as well; it's 10% out of
total 6-7 seconds). The upside would be that it's way easier to understand and
modify, and more likely to be correct.

Another note: I won't be committing anything without extensive test case (I've
almost finished one; it helped me notice a few more problems in the regex
handling as above, so that's why a parser might be better suited). Note that all
the modifications I'm doing are tested not to break anything on *all* existing
Gimp files.

Comment 19 Danilo Segan 2004-10-15 12:17:02 UTC

Created attachment 32639 [details]
Line-by-line, char-by-char Scheme parser

Ok, here's the type_scheme which should work in all the cases we're aware of,
without ugly regexes. It's slower, but more likely to work.

Comment 20 Danilo Segan 2004-10-15 12:19:08 UTC

Created attachment 32640 [details]
Test for checking Scheme strings extraction

This is the test I have so far come up with.

Comment 21 Danilo Segan 2004-10-15 12:24:18 UTC

Sorry, one elsif branch (the elseif (!$state) one) is completely unneccessary in
the above version of the function (I've uploaded a wrong file).

Comment 22 Danilo Segan 2004-10-23 09:13:52 UTC

I've committed a slightly different version of the above function (so, I've gone
the state-parser way), and improved a test case.  I also added support for
translators' comments in Scheme code (we might need it in the future), which is
also tested in the test case.

Again, I did a test on entire Gimp collection of Scheme files, and found no
breakages.

Marking as RESOLVED/FIXED, reopen if you come up with another issue.