GNOME Bugzilla – Bug 48489
handle gettext escapes other than '\\' and '\"'
Last modified: 2004-12-22 21:47:04 UTC
This sub is a long way from handling all valid gettext string escapes. It currently handles " and tries to handle \n. I've been unable to find the list of valid .po escapes; I'm afraid the gettext source code is the place where they're defined (handling all valid C escapes would be a good start).
------- Bug moved to this database by unknown@bugzilla.gnome.org 2001-09-09 21:18 -------
There's a comment now in unescape_one_sequence in intltool-merge.in.in that lists the various other escape sequences supported by gettext. We could support them all with a little work. But I'd want test cases for all of them too, of course.
Adding keywords in hope that someone has time to add the other escapes
How does one test that intltool-merge is doing the right thing here? It seems like this style of escape sequences is only used in actual code, which doesn't get handled by intltool-merge. The bug says it is trying to handle \n correctly, but I don't understand how it needs to bother with it at all. I have a patch that adds more, but not all, of the escape sequences to the supported list, but I am not sure how to test it.
Well, as far as I remember and can see from looking at the code, it writes \n when there is an actual newline in the XML stream, and \" when there is a ". I don't know what else we need to handle. Can you post your list, Rodney?
This is the list from the comment in intltool-merge.in.in:

# gettext also handles \n, \t, \b, \r, \f, \v, \a, \xxx (octal),
# \xXX (hex) and has a comment saying they want to handle \u and \U.

This is what I've added in my tree:

+ return "\t" if $sequence eq "\\t";
+ return "\b" if $sequence eq "\\b";
+ return "\r" if $sequence eq "\\r";
+ return "\f" if $sequence eq "\\f";
+ return "\a" if $sequence eq "\\a";

Most of these don't really make sense in a standard text file as something that is not escaped. I'm not sure what the intent with this code is, aside from the newlines, quotes, and backslash. And the code seems to return an actual newline when there is a "\n" in the text, rather than the other way 'round, as you said.
From Sun's documentation (I didn't want to use GNU documentation on purpose) at http://docs.sun.com/db/doc/816-0210/6m6nb7mf2?a=view

> Message strings can contain the escape sequences
> \n for newline, \t for tab, \v for vertical tab,
> \b for backspace, \r for carriage return, \f for formfeed,
> \\ for backslash, \" for double quote, \a for alarm,
> \ddd for octal bit pattern, and \xDD for hexadecimal bit pattern.

Once someone implements this, it can be easily tested:

1. Create a PO file which uses this for either or both of msgid and msgstr, e.g. test.po:

msgid "\x41utomatic\t\102old"
msgstr "\104one\n\142efore"

2. Create a sample file to test this out, e.g. an XML file, sample.xml.in:

<blah>
<_test xml:space="preserve">Automatic[realtab]Bold</_test>
</blah>

3. Run "intltool-merge -x . sample.xml.in sample.xml", which should output sample.xml:

<blah>
<test xml:space="preserve">Automatic[realtab]Bold</test>
<test xml:space="preserve" xml:lang="test">Done before</test>
</blah>

This example of mine will also test space-preserving features, so you might want to simplify it a bit (i.e. don't use "\t" or "\n", but just "\ddd" or "\xDD" sequences).
So I added all of the escape sequences in my tree, except for \ddd and \xDD, since I'm not sure how to handle them yet really. However, \v seems to not work. When I run make check with it, I get a complaint about \v being an invalid escape sequence.
One could first do this in unescape_po_string to catch all of these:

    $string =~ s/(\\x[0-9a-fA-F]{2}|\\[0-7]{3}|\\.)/unescape_one_sequence($1)/eg;

And then add support in unescape_one_sequence for these (chr turns the parsed number into the actual character):

    return chr(hex($1)) if $sequence =~ /^\\x([0-9a-fA-F]{2})$/;
    return chr(oct($1)) if $sequence =~ /^\\([0-7]{3})$/;

I didn't test this though.
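For anyone who wants to try the idea outside of intltool, here is a minimal standalone sketch. It reuses the sub names unescape_po_string and unescape_one_sequence from this bug, but the bodies are an assumption for illustration and are not the code in intltool-merge.in.in. The %SIMPLE table is hypothetical, and \v is spelled "\x0B" on the assumption that Perl string literals do not offer a "\v" escape.

```perl
use strict;
use warnings;

# Hypothetical lookup table for single-character escapes (not from intltool).
# Vertical tab is written as "\x0B" rather than "\v".
my %SIMPLE = (
    'n'  => "\n", 't' => "\t", 'b' => "\b", 'r' => "\r",
    'f'  => "\f", 'a' => "\a", 'v' => "\x0B",
    '\\' => "\\", '"' => '"',
);

sub unescape_one_sequence {
    my ($sequence) = @_;

    # \xDD: two hex digits give the character code.
    return chr(hex($1)) if $sequence =~ /^\\x([0-9a-fA-F]{2})$/;

    # \ddd: three octal digits give the character code.
    return chr(oct($1)) if $sequence =~ /^\\([0-7]{3})$/;

    # Single-character escapes; anything unknown is passed through unchanged.
    my $char = substr($sequence, 1);
    return exists $SIMPLE{$char} ? $SIMPLE{$char} : $sequence;
}

sub unescape_po_string {
    my ($string) = @_;
    # Try the longer \xDD and \ddd forms before the generic backslash-plus-one-char form.
    $string =~ s/(\\x[0-9a-fA-F]{2}|\\[0-7]{3}|\\.)/unescape_one_sequence($1)/eg;
    return $string;
}

# The msgid from the test.po example in this bug.
print unescape_po_string('\x41utomatic\t\102old'), "\n";
```

The order of the alternatives in the substitution matters: if the generic `\\.` branch came first, it would consume `\x` or `\1` on its own and the hex and octal branches would never match.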
On the \v topic: msgfmt accepts it for me (GNU gettext 0.14.1), and the GNU gettext manual says that strings should be treated as C strings (so "\v" ought to be acceptable). I'm actually not sure I understand where you get it reported as "invalid escape sequence".
Unrecognized escape \v passed through at test.pl line 3.

test.pl here is:

print "\v\n";

I guess perl doesn't like the escape sequence? This is a problem, since intltool is a bunch of perl. I'm not sure how we can work around this. Can we just ignore all the escapes and pass them on through to gettext to deal with? It seems optimal for us to avoid creating extra implementations of things like this.
We certainly can: all we need to ensure is that any sequences which we decode are also encoded when they go out, so we don't end up with a mess. Perhaps the status quo is good enough (along with WONTFIX)?
I am not sure a status quo is good enough. It might be better to remove all of them, than to keep it half-assed. :) If there are technically valid reasons for why we would need to de/en-code the escapes though, we should probably implement everything. If we can avoid doing that, though, I would much rather fix the code to get rid of it entirely, and just pass everything straight through to gettext.
On second thought, I think this might actually be necessary, since intltool is reading PO files itself (no gettext in the process). Would it perhaps be wiser to create MO files (thus letting gettext's msgfmt take care of processing) and read them in? That would mean big changes in the intltool caching code (but the MO file format is simple enough, and I have already written readers in PHP and C#, so I'm probably up to the task). Otherwise it seems necessary to keep and complete this support (though we're unlikely to run into problems, since translators usually don't use escape sequences, especially with UTF-8 being standard in Gnome).
Created attachment 32598
Support all known escape sequences

2004-10-14 Danilo Šegan <dsegan@gmx.net>

* intltool-merge.in.in (unescape_po_string): Catch \xDD and \ddd sequences as well.
(unescape_one_sequence): Add support for \r, \t, \b, \f, \a, \v, \xDD, \ddd, \0.
Here's a patch which adds support for all known escape sequences. It works correctly, and I get the same results with merge6.po as msgfmt produces (i.e. the \0123 sequence is treated as \012 followed by a literal 3). If you agree with these changes, I will update the merge6.xml.in/merge6.po testcase to use \123 for "S", and correct results/merge6.xml, which currently has \0123 and is therefore wrong. Note that I'm not sure about "\0": msgfmt seems to parse it, but it breaks stuff (the produced MO files look like crap from glancing over them).
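To illustrate the \0123 case, here is a small sketch that decodes only \ddd escapes, assuming exactly three octal digits are consumed as in the pattern suggested earlier in this bug. The helper name decode_octal_escapes is made up for this example and is not part of the attached patch.

```perl
use strict;
use warnings;

# Hypothetical helper: decode \ddd escapes, taking exactly three octal digits.
sub decode_octal_escapes {
    my ($string) = @_;
    $string =~ s/\\([0-7]{3})/chr(oct($1))/eg;
    return $string;
}

# \012 is consumed as one escape (newline); the trailing "3" stays literal.
my $decoded = decode_octal_escapes('\0123');
print join(' ', map { ord } split //, $decoded), "\n";   # prints "10 51"

# \123 is octal for 83, which is "S".
print decode_octal_escapes('\123'), "\n";                # prints "S"
```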
For reference: perlop(1) contains the section "Quote and Quote-like Operators", where "\v" is not mentioned; that's why I used the ASCII number for vertical tab.
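A rough sketch of that workaround, assuming vertical tab is ASCII 11 (0x0B). The sub name unescape_vertical_tab and the $VERTICAL_TAB variable are hypothetical and only isolate this one case; the real patch handles it inside unescape_one_sequence.

```perl
use strict;
use warnings;

# Vertical tab built from its ASCII number instead of a "\v" literal.
my $VERTICAL_TAB = chr(11);

# Hypothetical helper: map the two-character PO sequence \v to the real character.
sub unescape_vertical_tab {
    my ($sequence) = @_;
    return $VERTICAL_TAB if $sequence eq "\\v";
    return $sequence;    # anything else is left untouched
}

printf "%d\n", ord(unescape_vertical_tab('\v'));    # prints "11"
```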
I don't think we need to parse \0 specially. If it's breaking things, we shouldn't do it for sure. It would be nice if your patch already included the changes for the tests, so I could easily test the changes by doing make check. :)
Created attachment 32613
Update merge6 test

Here you go: updated testcase as well.

On the \0 topic: I meant that \0 breaks things in msgfmt itself (the produced MO file seems incorrect; \0 is otherwise used as a delimiter for plural forms in GNU MO files, so that's probably why). It doesn't break anything in intltool-merge AFAICT. It might truncate the string at \0 as well (I don't know what Perl uses internally to end strings), but that's probably not bad.

It seems to me that \0 in a PO file string has sort of "unspecified behaviour", in that you don't know what to expect. It is transformed into a NUL byte by msgfmt (so it doesn't treat it as an error), but it seems to cause some problems as well. So, I don't mind what we do with it, since it sucks anyhow ;)

FWIW, the same crap happens in msgfmt if one uses the equivalent sequences \000 or \x00.
I am for the change without the \0 if you are reasonably sure that this doesn't break anything :)
Hum, how did this end up in "acme" product?
Comment on attachment 32598
Support all known escape sequences

Ok, I've committed this without the line

return "\0" if $sequence eq "\\0";