GNOME Bugzilla – Bug 644037
gmmproc: Distortion of non-ASCII characters in comments
Last modified: 2011-04-28 21:48:14 UTC
When gmmproc builds glibmm/glib/glibmm/regex.h from glibmm/glib/src/regex.hg, it issues the warning

  Wide character in print at ../../tools/pm/Output.pm line 662.

and in the description of class Regex the line

  * "\xc3\xa0" (i.e. "à") is two bytes long but it is treated as a single

is changed to

  * "\xc3\xa0" (i.e. "Ã ") is two bytes long but it is treated as a single

I'm not the only one affected by this unwanted replacement of "à" by "Ã ". See http://library.gnome.org/devel/glibmm/unstable/classGlib_1_1Regex.html. It's possible that there are similar distortions in other files; regex.h is the one I happened to notice.

I believe that this problem with regex.h started on 2010-12-13 with commit http://git.gnome.org/browse/glibmm/commit/?id=45ea6b0db7986a6bceca0f9494950434599ba072 "glibmm: Add a MatchInfo class for use with the existing Regex class." Note: I'm not saying there's anything wrong with that commit. I'm saying that a weakness in gmmproc showed up as a result of it.

I've spent some time studying descriptions of how Perl handles Unicode, and particularly UTF-8 encoding. The short story is that you have to tell Perl what character encoding is used in the files it reads and writes. Otherwise Perl assumes the native eight-bit character set of the platform (for example ISO-8859-1, a.k.a. Latin-1). If you let Perl read and write UTF-8 files while it believes they are Latin-1, the written files may or may not be correct. The more processing Perl does on the text between reading and writing, the more probable it is that non-ASCII characters will be distorted.

There is a long story telling why I suspect the error showed up as a result of the mentioned commit. I can include the long story on request.

There are a number of ways to tell Perl what character encoding is used:

1. The argument -Cio to the perl command should do it, but when I tried

     /usr/bin/perl -Cio -I../../tools/pm -- ../../tools/gmmproc ......

   it had no effect. I'm not sure, but perhaps it's because files are read and written in separate modules, like Output.pm, and not in the file gmmproc.

2. Add

     use open IO => ":utf8";

   near the beginning of WrapParser.pm and Output.pm (and perhaps other files).

3. Every time a file is opened, specify the character encoding.

Solutions 1 and 2 (if you can call 1 a solution) assume that gmmproc reads and writes only UTF-8 files. Is that true?

I use Perl v5.10.1 on Ubuntu 10.10.
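The effect of an I/O encoding layer can be sketched in a few lines of Perl (a minimal illustration, not from the report itself; in-memory filehandles stand in for the real files):

```perl
use strict;
use warnings;
use Encode qw(decode);

# "à" as Perl sees it after decoding UTF-8 input: one character, U+00E0.
my $str = decode('UTF-8', "\xc3\xa0");

# Without an encoding layer, Perl writes its internal characters out as
# Latin-1 where it can (and warns "Wide character in print" for code
# points above 0xFF, where it can't). So "à" comes out as one byte, 0xE0.
open my $raw_fh, '>', \my $raw_bytes or die $!;
print $raw_fh $str;
close $raw_fh;

# With the :utf8 layer -- which 'use open IO => ":utf8";' applies to
# every handle opened in its lexical scope -- "à" goes out as 0xC3 0xA0.
open my $utf8_fh, '>:utf8', \my $utf8_bytes or die $!;
print $utf8_fh $str;
close $utf8_fh;
```

The same applies on input: without a layer the two bytes 0xC3 0xA0 are read as two Latin-1 characters instead of one "à", which is exactly the "Ã " distortion described above.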
Thanks for this. I prefer 2.

> Solutions 1 and 2 (if you can call 1 a solution) assume that gmmproc reads and
> writes only UTF-8 files. Is that true?

Probably, yes. I have no problem with assuming that and fixing any files that are not UTF-8. Could you try creating a patch?
Created attachment 183120 [details] [review]
patch: Tell gmmproc that it reads and writes UTF-8 files.

I've added

  use open IO => ":utf8";

in GtkDefs.pm, Output.pm, and WrapParser.pm. I think that's enough. Those are the modules used by gmmproc that read and write files other than XML files. The other commands (generate_wrap_init.pl, defs_fixer.pl, enum.pl, generate_extra_defs) can remain ignorant of the encoding.

What makes gmmproc different is that it also reads XML files (xxx_docs.xml and xxx_docs_override.xml). XML files are treated differently: UTF-8 is the default encoding, and if another encoding is used, it's indicated in the file itself. When strings from the XML files, known to be UTF-8, are combined with strings from other files, wrongly believed to be Latin-1, the result can be wrong. It happens to be right only if all strings used from the XML files contain nothing but ASCII characters.
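The mixing problem described here can be reproduced in isolation (a hypothetical sketch; the variable names are illustrative, not from gmmproc):

```perl
use strict;
use warnings;
use Encode qw(decode);

# A string taken from an XML doc file, correctly decoded as UTF-8:
# one character, "à" (U+00E0).
my $from_xml = decode('UTF-8', "\xc3\xa0");

# The same text read from a .hg file with no encoding layer: Perl keeps
# the two raw bytes, and when forced to combine them with a decoded
# string it reinterprets them as Latin-1.
my $from_hg = "\xc3\xa0";

# The concatenation upgrades the byte string as Latin-1, so 0xC3 0xA0
# becomes the two characters "Ã" and no-break space -- the "Ã " seen
# in the generated documentation.
my $joined = $from_xml . $from_hg;
```

This is why the fix has to make the non-XML inputs UTF-8-aware too: once one decoded string enters the mix, every byte string it touches gets the Latin-1 interpretation.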
Thanks. Pushed. It seems to fix the generated HTML.
The added 'use open IO => ":utf8";' in GtkDefs.pm can cause an unacceptable increase in execution time, as pointed out in an email from Mike Gorse.

> On Thu, 2011-04-07 at 09:36 -0400, Mike Gorse wrote:
>> Hi Murray,
>>
>> Is it normal for gtkmm to take something like two days to build? I was
>> starting to build it in jhbuild the other day (on a Thinkpad T61 with a
>> 2.4ghz Core 2 Duo 7700), and gmmproc took around 5-25 minutes of CPU time
>> per invocation, so I was wondering if that was typical or if there's
>> something peculiar about my configuration.
>
> No, that's not normal, as I told you on twitter.

Fyi, I have the attached patch applied locally, but I'm hesitant to file a bug because I'm not sure that the underlying issue isn't a bug with the OpenSUSE build of Perl. When your tokenize function gets called, it iteratively calls substr() on the string read in from the file, and, since the string is internally represented as UTF-8, it calls perl_sv_len_utf8. For some reason the string's length doesn't get cached, so it winds up recalculating the length (reading through the whole string) every time substr() is called, i.e. for every character in the file. But if I build Perl 5.12.3 from source, rather than use the OpenSUSE Perl 5.12.3 package, this doesn't happen, so I'm guessing that it is caused by some patch or flag that OpenSUSE passes in.

------

The patch mentioned in the email is

diff --git a/tools/pm/GtkDefs.pm b/tools/pm/GtkDefs.pm
index f665099..0a5e286 100644
--- a/tools/pm/GtkDefs.pm
+++ b/tools/pm/GtkDefs.pm
@@ -171,6 +171,7 @@ sub read_defs($$;$)
 sub split_tokens($)
 {
   my ($token_string) = @_;
+  utf8::downgrade($token_string);

   my @tokens = ();

My first reaction was: "Let's remove the 'use open IO => ":utf8";' from GtkDefs.pm and keep it only in Output.pm and WrapParser.pm. The files that GtkDefs.pm reads contain only ASCII characters." But I may be wrong.
xxx_signals.defs contains documentation of properties, and IMHO non-ASCII characters shall be allowed in documentation. Since UTF-8 strings and Latin-1 strings are mixed in gmmproc, Perl must keep track of the encoding of each string, or else there can be a mess when all output is joined into one big string in Output::output_temp_g1().

The use of substr() in GtkDefs::split_tokens() can be terribly inefficient when used on a UTF-8 string, since the index used by substr() is a character index, not a byte index. To find character n in a UTF-8 string, you have to read from the beginning of the string; you can't just add n to the start address.

Unfortunately my version of Perl did not show any noticeable increase in execution time when I added 'use open IO => ":utf8";' in GtkDefs.pm. Probably the Perl system detects that $token_string in GtkDefs::split_tokens() contains only ASCII characters, although the utf8 flag is set; then it can use fast address arithmetic to find a requested substring. But when I added a few non-ASCII characters to gtk_signals.defs, the execution time of gmmproc increased from 4 s to 1 min.

After some tests with different Perl functions and pragmas, I think this is the way GtkDefs::split_tokens() shall be changed:

-- Near the beginning, add

  my $was_utf8 = utf8::is_utf8($token_string);
  # Turn off the utf8 flag. It may make substr() terribly slow.
  utf8::encode($token_string) if ($was_utf8);

-- Near the end, just before 'return @tokens;', add

  # Turn on utf8 flags in @tokens, if the utf8 flag was on in $token_string.
  if ($was_utf8)
  {
    foreach (@tokens)
    {
      utf8::decode($_);
    }
  }

The Perl documentation recommends utf8::downgrade() and utf8::upgrade() instead of utf8::encode() and utf8::decode(). (At least I think so; the documentation is voluminous and not very easy to understand in all detail.) But when I tried with a small test script, it did not work well: utf8::downgrade() fails if the string contains a non-Latin-1 character.
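Put together as a self-contained function, the proposed shape looks like this (a sketch under the assumptions above; the whitespace split stands in for the real, more involved tokenizer in GtkDefs.pm):

```perl
use strict;
use warnings;

# Sketch of the proposed change to GtkDefs::split_tokens(): drop to a
# byte string before the per-character work, restore the encoding on
# the way out.
sub split_tokens_sketch {
    my ($token_string) = @_;

    my $was_utf8 = utf8::is_utf8($token_string);
    # Turn off the utf8 flag; substr() on a UTF-8-flagged string may
    # rescan the string from the start on every call.
    utf8::encode($token_string) if $was_utf8;

    # Stand-in tokenizer: byte-oriented work is now cheap.
    my @tokens = split /\s+/, $token_string;

    # Turn the utf8 flag back on in each token, if it was on in the input.
    if ($was_utf8) {
        utf8::decode($_) for @tokens;
    }
    return @tokens;
}
```

The point of the round trip is that multi-byte characters survive intact as byte sequences through the tokenizing, as long as the tokenizer only splits at ASCII delimiters (every byte of a UTF-8 multi-byte character has the high bit set, so no UTF-8 continuation byte can be mistaken for an ASCII delimiter).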
I'll make some more tests, and let people have a chance to object to my proposed fix. Then I can supply a patch.
Reopening this bug.
Created attachment 186420 [details] [review]
patch: gmmproc: GtkDocs::split_tokens() splits a byte string.

Here's the patch. I've added more comments; otherwise it's exactly as suggested in comment 4. I don't know if anyone has tested it on an OpenSUSE system, but I would be surprised if the execution time of gmmproc is more than a few seconds with this fix. On Ubuntu 10.10 it's a few seconds, and there is no noticeable increase in execution time if a few non-ASCII or non-Latin-1 characters are added to gtk_signals.defs. And yes, those characters are correctly copied to gtkmm/xxx.h.
I was thinking about another way to fix it. We should probably decode all strings we read from files as UTF-8 by using decode() and encode them before writing to a file with encode().

And split_tokens() could be written better. Instead of repeatedly calling substr(), which seems to take a long time every time it is called, something like the code below should be used:

  my @characters = split(//, $token_string);
  for my $char (@characters)
  {
    ...
    # when the token's beginning and end are known, then
    my $token = join('', @characters[$begin_index .. $index]);
    # push $token into the tokens array.
  }

or something like this. Have you tried this approach?
(In reply to comment #7)
> I was thinking about another way to fix it. We should probably decode all
> strings we read from files as utf8 by using decode() and encode them before
> writing to file with encode().

This is exactly what is achieved by the pragma 'use open IO => ":utf8";', if I haven't misunderstood the Perl documentation. See e.g. perluniintro (Unicode I/O) and the open pragma.

> And split_tokens could be written better,

A typical execution time of the present gmmproc for a pair of files in gtkmm/gtk/src (.hg + .ccg) is about 4.5 s when there are only ASCII characters in the .defs files, and about 1 min when there are non-ASCII characters in gtk_signals.defs. With the patch in comment 6 implemented, the execution time is about 4.5 s both with and without non-ASCII characters. And although no one has confirmed it, I'm convinced that the execution time is in the order of seconds rather than minutes on OpenSUSE as well.

I have tested previously with

  my @characters = split(/(.)/, $token_string);

Execution time with and without non-ASCII characters: about 11 s.

Now I repeated the test with your better proposal

  my @characters = split(//, $token_string);

Execution time with and without non-ASCII characters: about 7 s.

Is an increase in execution time of about 50% an acceptable price for better Perl code? I'm inclined to say no. But if it's combined with my proposal in bug 646820, I say yes. If gmmproc almost always reads defs data from a binary file, the extra time for reading from the text files is negligible.
(In reply to comment #7)
> And split_tokens could be written better,

Here's an interesting alternative:

  # @characters is not a good name, since some elements contain more than one
  # character.
  my @characters = split(/([\\"'()])/, $token_string);

Execution time with and without non-ASCII characters: about 3.7 s. This is the fastest alternative except for the binary file of bug 646820. It's not necessary to split all of $token_string into single characters.

Enum::split_enum_tokens() contains a similar loop with

  my $char = substr($token_string, $index, 1);

but that function parses much shorter strings, strings which have already been split by GtkDefs::split_tokens(), and it parses only tokens from one of the defs files, typically a file of moderate size. I added some non-ASCII characters to each enum in gtk_enums.defs (in the first part of 'values', which is parsed by Enum::split_enum_tokens() but not used). It had no noticeable influence on the execution time. The loop in split_enum_tokens() is harmless, but of course it can be changed similarly to the loop in GtkDefs::split_tokens(), just to make those two functions similar to each other.
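A minimal illustration of this split() call on a defs-like string (the input here is made up; the real defs grammar is richer):

```perl
use strict;
use warnings;

# Instead of walking the string character by character with substr(),
# split on the delimiters the tokenizer cares about (backslash, quotes,
# parentheses), capturing them so they stay in the result. Runs of
# ordinary characters come back as single list elements, so most of the
# string is never visited one character at a time.
my $token_string = '(define-enum Foo "bar(baz)")';
my @substrings = split(/([\\"'()])/, $token_string);

# @substrings now alternates between (possibly empty) ordinary runs and
# single captured delimiters, e.g. '', '(', 'define-enum Foo ', '"', ...
```

This is why the approach is so much faster than per-character scanning: the regex engine does one pass over the bytes, and the Perl-level loop afterwards only has to look at delimiters and whole runs.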
Created attachment 186657 [details] [review]
gmmproc: GtkDefs::split_tokens() uses split() instead of substr().

Here's a new patch that replaces the patch in comment 6. This time I think I've even got the title right. (No, it wasn't right in the patch in comment 6.) GtkDefs::split_tokens() is changed as suggested in comment 9. Enum::split_enum_tokens() is not changed.

It's amazing how much GtkDefs::split_tokens() means for the execution time of gmmproc. I tried a version where

  my $index = -1;
  for my $substring (@substrings)
  {
    $index++;

was replaced by

  for (my $index = 0; $index <= $#substrings; $index++)
  {
    my $substring = $substrings[$index];

This seemingly innocent change increased the execution time of gmmproc in gtkmm/gtk/src by about 8%.

Let's give Krzesimir a chance to comment on this and the two previous comments, and give Mike Gorse or someone else a chance to test it on OpenSUSE, where the present version proved to be terribly slow (comment 4). Then, if no objections are raised, the patch can be committed.
Thanks; the execution time is reasonable on OpenSUSE for me with this patch (i.e. typically a few seconds rather than a few minutes).
Neat, splitting that way is a better idea than substringing, thanks. The patch could be pushed IMO, but that is Murray's call, I suppose.
No, you guys are best qualified to decide. Do what you think is best, please.
Ok, I checked just to be sure that the output of the new split_tokens() function is the same as the one produced by the old version. It was, so I pushed the patch. Thanks.