Bug 48489 – handle gettext escapes other than '\\' and '\"'




Summary:	handle gettext escapes other than '\\' and '\"'


Status:	RESOLVED FIXED

Product:	intltool
Classification:	Deprecated
Component:	general
Version:	unspecified
Hardware:	Other Linux

Importance:	Low minor
Target Milestone:	---
Assigned To:	intltool maintainers
QA Contact:	intltool maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2001-08-18 14:04 UTC by Cyrille Chépélov
Modified:	2004-12-22 21:47 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Support all known escape sequences (1.51 KB, patch) 2004-10-14 11:17 UTC, Danilo Segan	committed
Update merge6 test (1.08 KB, patch) 2004-10-14 19:01 UTC, Danilo Segan	committed

Description Cyrille Chépélov 2001-09-10 01:18:46 UTC

This sub is a long way from handling all valid gettext string escapes. 
It currently handles " and tries to handle \n.

I've been unable to find the list of valid .po escapes; I'm afraid the gettext
source code is the place where they're defined (handling all valid C escapes
would be a good start).



------- Bug moved to this database by unknown@bugzilla.gnome.org 2001-09-09 21:18 -------

Comment 1 Darin Adler 2002-01-16 19:00:34 UTC

There's a comment now in unescape_one_sequence in intltool-merge.in.in
that lists the various other escape sequences supported by gettext.

We could support them all with a little work. But I'd want test cases
for all of them too, of course.

Comment 2 Kenneth Rohde Christiansen 2004-01-07 19:30:39 UTC

Adding keywords in hope that someone has time to add the other escapes

Comment 3 Rodney Dawes 2004-05-29 17:51:51 UTC

How does one test that intltool-merge is doing the right thing here? It seems
like this style of escape sequences is only used in actual code, which doesn't
get handled by intltool-merge. The bug says it is trying to handle \n correctly,
but I don't understand how it needs to bother with it at all. I have a patch
that adds more, but not all, of the escape sequences to the supported list, but
I am not sure how to test it.

Comment 4 Kenneth Rohde Christiansen 2004-06-12 20:48:23 UTC

Well, as far as I remember and can see from looking at the code, it writes \n
when there is an actual newline in the XML stream, and \" when there is a ".

I don't know what else we need to handle. Can you post your list, Rodney?

Comment 5 Rodney Dawes 2004-06-13 00:36:09 UTC

This is the list from the comment in intltool-merge.in.in:

     # gettext also handles \n, \t, \b, \r, \f, \v, \a, \xxx (octal),
     # \xXX (hex) and has a comment saying they want to handle \u and \U.

This is what I've added in my tree:

+    return "\t" if $sequence eq "\\t";
+    return "\b" if $sequence eq "\\b";
+    return "\r" if $sequence eq "\\r";
+    return "\f" if $sequence eq "\\f";
+    return "\a" if $sequence eq "\\a";

Most of these don't really make sense in a standard text file as something that
is not escaped. I'm not sure what the intent with this code is, aside from the
newlines, quotes, and backslash. And the code seems to return an actual newline,
when there is an "\n" in the text, rather than the other way 'round, as you said.

Comment 6 Danilo Segan 2004-10-12 14:18:28 UTC

From Sun's documentation (I didn't want to use GNU documentation on purpose) at
http://docs.sun.com/db/doc/816-0210/6m6nb7mf2?a=view

> Message strings can contain the escape sequences 
> \n for newline, \t for tab, \v for vertical tab, 
> \b for backspace, \r for carriage return, \f for formfeed, 
> \\ for backslash, \" for double quote, \a for alarm, 
> \ddd for octal bit pattern, and \xDD for hexadecimal bit pattern.

Once someone implements this, it can be easily tested:
1. create a PO file which uses this for either or both of msgid and msgstr, e.g.
test.po:
  msgid "\x41utomatic\t\102old"
  msgstr "\104one\n\142efore"
2. create a sample file (e.g. XML) to test this out, sample.xml.in:
 <blah>
   <_test xml:space="preserve">Automatic[realtab]Bold</_test>
 </blah>
3. run "intltool-merge -x . sample.xml.in sample.xml", which should output
sample.xml:
 <blah>
   <test xml:space="preserve">Automatic[realtab]Bold</test>
   <test xml:space="preserve" xml:lang="test">Done
before</test>
 </blah>

This example of mine will also test space-preserving features, so you might want
to simplify it a bit (i.e. don't use "\t" or "\n", but just "\ddd" or "\xDD"
sequences).

Comment 7 Rodney Dawes 2004-10-12 16:53:48 UTC

So I added all of the escape sequences in my tree, except for \ddd and \xDD,
since I'm not sure how to handle them yet really. However, \v seems to not work.
When I run make check with it, I get a complaint about \v being an invalid
escape sequence.

Comment 8 Danilo Segan 2004-10-13 09:03:06 UTC

One could first do this in unescape_po_string to catch all of these:

    $string =~ s/(\\x[0-9a-fA-F]{2}|\\[0-7]{3}|\\.)/unescape_one_sequence($1)/eg;

And then add support in unescape_one_sequence for these:
    # hex/oct yield the numeric value; chr turns it into the character
    return chr(hex($1)) if $sequence =~ /\\x([0-9a-fA-F]{2})/;
    return chr(oct($1)) if $sequence =~ /\\([0-7]{3})/;

I didn't test this though.
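
For anyone who wants to poke at the matching logic outside intltool, here is a rough Python sketch of the same idea (it is not the intltool code, which is Perl). The function names mirror the ones mentioned above, but the table SIMPLE and the choice to pass unknown escapes through unchanged are assumptions made for the sketch.

```python
import re

# Single-character escapes named in this thread, mapped to their characters.
# "\x0b" is the vertical tab, written by code point.
SIMPLE = {"n": "\n", "t": "\t", "v": "\x0b", "b": "\b", "r": "\r",
          "f": "\f", "a": "\a", "\\": "\\", '"': '"'}

# Same alternation order as the substitution above:
# \xDD first, then \ddd, then backslash plus any single character.
ESCAPE = re.compile(r'\\x[0-9a-fA-F]{2}|\\[0-7]{3}|\\.')

def unescape_one_sequence(seq):
    body = seq[1:]                      # drop the leading backslash
    if len(body) == 3 and body[0] == "x":
        return chr(int(body[1:], 16))   # \xDD: two hex digits
    if len(body) == 3:
        return chr(int(body, 8))        # \ddd: three octal digits
    # Assumption for this sketch: unknown escapes are left untouched.
    return SIMPLE.get(body, seq)

def unescape_po_string(s):
    return ESCAPE.sub(lambda m: unescape_one_sequence(m.group(0)), s)
```

Feeding it the msgid and msgstr from the test.po example in comment 6 should give "Automatic", a real tab, "Bold", and "Done", a real newline, "before".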

Comment 9 Danilo Segan 2004-10-13 09:06:01 UTC

On the \v topic: msgfmt accepts it for me (GNU gettext 0.14.1), and the GNU
gettext manual says that strings should be treated as C strings (so "\v" ought
to be acceptable).

I'm actually not sure I understand where you get it reported as an "invalid
escape sequence".

Comment 10 Rodney Dawes 2004-10-13 14:35:26 UTC

Unrecognized escape \v passed through at test.pl line 3.

test.pl here is: print "\v\n";

I guess Perl doesn't like the escape sequence? This is a problem, since intltool
is a bunch of Perl. I'm not sure how we can work around this. Can we just ignore
all the escapes and pass them on through to gettext to deal with? It seems
optimal for us to avoid creating extra implementations of things like this.

Comment 11 Danilo Segan 2004-10-13 16:40:02 UTC

We certainly can: all we need to ensure is that any sequences which we decode
are also encoded when they go out, so we don't end up with a mess.
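
To make the "encoded when they go out" half concrete, here is a hedged Python sketch of an encoder. It is not what intltool does (per comment 4, intltool writes \n and \"); the name escape_po_string, the ENCODE table, and the fallback to three-digit octal for other control characters are all assumptions for illustration.

```python
# Hypothetical inverse table: character -> PO escape sequence.
ENCODE = {"\\": "\\\\", '"': '\\"', "\n": "\\n", "\t": "\\t",
          "\x0b": "\\v", "\b": "\\b", "\r": "\\r", "\f": "\\f", "\a": "\\a"}

def escape_po_string(s):
    out = []
    for ch in s:
        if ch in ENCODE:
            out.append(ENCODE[ch])
        elif ord(ch) < 0x20 or ord(ch) == 0x7F:
            # Other control characters: always three octal digits, so a
            # following digit cannot be absorbed into the sequence.
            out.append("\\%03o" % ord(ch))
        else:
            out.append(ch)
    return "".join(out)
```

The point of the fixed three-digit octal form is that whatever gets decoded from \ddd can be written back without changing how the next character is read.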

Perhaps the status quo is good enough (along with WONTFIX)?

Comment 12 Rodney Dawes 2004-10-14 00:38:38 UTC

I am not sure the status quo is good enough. It might be better to remove all of
them, than to keep it half-assed. :) If there are technically valid reasons for
why we would need to de/en-code the escapes though, we should probably implement
everything. If we can avoid doing that, though, I would much rather fix the code
to get rid of it entirely, and just pass everything straight through to gettext.

Comment 13 Danilo Segan 2004-10-14 10:53:01 UTC

On second thought, I think this might actually be necessary, since intltool
is reading PO files itself (no gettext in the process). Would it be wiser
perhaps to create MO files (thus letting gettext's msgfmt take care of
processing), and reading them in? But this means big changes in intltool caching
code (but MO file format is simple enough, and I have so far already written
readers in PHP and C#, so I'm probably up to task).

It seems to be necessary to keep and complete this support (but we're unlikely
to run into problems, since translators usually don't use escape sequences,
especially with UTF-8 being standard in GNOME).
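
As a rough illustration of how small a MO reader can be, here is a Python sketch (not one of the PHP or C# readers mentioned above, and not part of intltool). It follows the GNU MO layout: a magic number, revision, string count, then offsets to two tables of (length, offset) pairs. The name read_mo and the assumption that all strings are UTF-8 are mine; the hash table is simply ignored.

```python
import struct

MO_MAGIC = 0x950412DE          # magic number as read in the file's own byte order
MO_MAGIC_SWAPPED = 0xDE120495  # same magic seen with the opposite byte order

def read_mo(data):
    """Return a dict of msgid -> msgstr from the bytes of a MO file."""
    magic = struct.unpack_from("<I", data, 0)[0]
    if magic == MO_MAGIC:
        order = "<"
    elif magic == MO_MAGIC_SWAPPED:
        order = ">"
    else:
        raise ValueError("not a MO file")
    # revision, number of strings, offset of originals table,
    # offset of translations table
    revision, count, orig_off, trans_off = struct.unpack_from(
        order + "4I", data, 4)
    catalog = {}
    for i in range(count):
        o_len, o_pos = struct.unpack_from(order + "2I", data, orig_off + 8 * i)
        t_len, t_pos = struct.unpack_from(order + "2I", data, trans_off + 8 * i)
        # Assumption: strings are UTF-8. Plural forms stay joined by the
        # embedded NUL bytes.
        msgid = data[o_pos:o_pos + o_len].decode("utf-8")
        msgstr = data[t_pos:t_pos + t_len].decode("utf-8")
        catalog[msgid] = msgstr
    return catalog
```

Since msgfmt has already decoded the escapes by the time the MO file is written, a reader like this never sees a backslash sequence at all.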

Comment 14 Danilo Segan 2004-10-14 11:17:35 UTC

Created attachment 32598
Support all known escape sequences

2004-10-14  Danilo Šegan  <dsegan@gmx.net>

	* intltool-merge.in.in (unescape_po_string): Catch \xDD and \ddd
	sequences as well.
	(unescape_one_sequence): Add support for \r, \t, \b, \f, \a, \v,
	\xDD, \ddd, \0.

Comment 15 Danilo Segan 2004-10-14 11:24:17 UTC

Here's a patch which adds support for all known escape sequences. It works
correctly, and I get the same results with merge6.po as msgfmt produces (i.e.
\0123 sequence is treated as \012 3). I will update merge6.xml.in/merge6.po
testcase (to use \123 for "S", and correct results/merge6.xml which has \0123
now, which is wrong) if you agree with these changes.
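
The \0123 versus \123 point can be checked in isolation with a few lines of Python (a sketch only, not the patch itself; decode_octal is a made-up helper). Because the pattern takes exactly three octal digits, \0123 splits into \012 followed by a literal "3".

```python
import re

# Exactly three octal digits after the backslash, as in the \ddd rule.
OCTAL = re.compile(r'\\([0-7]{3})')

def decode_octal(s):
    # Replace each \ddd with the character whose code point is ddd (base 8).
    return OCTAL.sub(lambda m: chr(int(m.group(1), 8)), s)
```

With this, decode_octal applied to \0123 gives a newline followed by "3", while \123 gives "S".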

Note that I'm not sure about "\0": msgfmt seems to parse it, though it breaks
stuff (the MO files it produces look like crap at a glance).

Comment 16 Danilo Segan 2004-10-14 11:41:37 UTC

For reference: perlop(1) contains the section "Quote and Quote-like Operators",
where "\v" is not mentioned; that's why I used the ASCII number for vertical tab.

Comment 17 Rodney Dawes 2004-10-14 15:30:12 UTC

I don't think we need to parse \0 specially. If it's breaking things, we
shouldn't do it for sure. It would be nice if your patch already included the
changes for the tests, so I could easily test the changes by doing make check. :)

Comment 18 Danilo Segan 2004-10-14 19:01:53 UTC

Created attachment 32613
Update merge6 test

Here you go: updated testcase as well.

On the \0 topic: I meant that \0 breaks things in msgfmt itself (the produced MO
file seems incorrect; \0 is otherwise used as a delimiter for plural forms in
GNU MO files, so that's probably why). It doesn't break anything in
intltool-merge AFAICT. It might truncate the string at \0 as well (I don't know
what Perl uses internally to end strings), but that's probably not bad.

It seems to me that \0 in a PO file string has sort of "unspecified behaviour",
in that you don't know what to expect. It is transformed into a NUL byte in
msgfmt (so it doesn't treat it as an error), but it seems to cause some
problems as well.  So, I don't mind what we'll do with it, since it sucks
anyhow ;)

FWIW, the same crap happens in msgfmt if one uses other equivalent sequences of
\000 or \x00.

Comment 19 Kenneth Rohde Christiansen 2004-10-14 23:59:33 UTC

I am for the change without the \0 if you are reasonably sure that this doesn't
break anything :)

Comment 20 Danilo Segan 2004-10-23 09:18:52 UTC

Hum, how did this end up in "acme" product?

Comment 21 Danilo Segan 2004-10-23 09:31:08 UTC

Comment on attachment 32598
Support all known escape sequences

Ok, I've committed this without 
 return "\0" if $sequence eq "\\0";
line.