GNOME Bugzilla – Bug 644037
gmmproc: Distortion of non-ASCII characters in comments
Last modified: 2011-04-28 21:48:14 UTC
When gmmproc builds glibmm/glib/glibmm/regex.h from glibmm/glib/src/regex.hg, it issues the warning

  Wide character in print at ../../tools/pm/Output.pm line 662.

and in the description of class Regex the line

  * "\xc3\xa0" (i.e. "à") is two bytes long but it is treated as a single

is changed to

  * "\xc3\xa0" (i.e. "Ã ") is two bytes long but it is treated as a single

I'm not the only one affected by this unwanted replacement of "à" by "Ã ". See http://library.gnome.org/devel/glibmm/unstable/classGlib_1_1Regex.html. It's possible that there are similar distortions in other files; regex.h is the one I happened to notice.

I believe that this problem with regex.h started on 2010-12-13 with commit http://git.gnome.org/browse/glibmm/commit/?id=45ea6b0db7986a6bceca0f9494950434599ba072 "glibmm: Add a MatchInfo class for use with the existing Regex class." Note: I'm not saying there's anything wrong with that commit. I'm saying that a weakness in gmmproc showed up as a result of it.

I've spent some time studying descriptions of how Perl handles Unicode, and particularly UTF-8 encoding. The short story is that you have to tell Perl what character encoding is used in the files it reads and writes. Otherwise Perl assumes the native eight-bit character set of the platform (for example ISO-8859-1, a.k.a. Latin-1). If you let Perl read and write UTF-8 files while it believes they are Latin-1, the written files may or may not be correct. The more processing Perl does on the text between reading and writing, the more probable it is that non-ASCII characters will be distorted.

There is a long story telling why I suspect the error showed up as a result of the mentioned commit. I can include the long story on request.

There are a number of ways to tell Perl what character encoding is used:

1. The argument -Cio to the perl command should do it, but when I tried

     /usr/bin/perl -Cio -I../../tools/pm -- ../../tools/gmmproc ......

   it had no effect. I'm not sure, but perhaps it's because files are read and written in separate modules, like Output.pm, and not in the file gmmproc.

2. Add

     use open IO => ":utf8";

   near the beginning of WrapParser.pm and Output.pm (and perhaps other files).

3. Every time a file is opened, specify the character encoding.

Solutions 1 and 2 (if you can call 1 a solution) assume that gmmproc reads and writes only UTF-8 files. Is that true?

I use Perl v5.10.1 on Ubuntu 10.10.
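The effect of an I/O encoding layer can be sketched in a few lines of Perl (a minimal illustration, not from the report itself; in-memory filehandles stand in for the real files):

```perl
use strict;
use warnings;
use Encode qw(decode);

# "à" as Perl sees it after decoding UTF-8 input: one character, U+00E0.
my $str = decode('UTF-8', "\xc3\xa0");

# Without an encoding layer, Perl writes its internal characters out as
# Latin-1 where it can (and warns "Wide character in print" for code
# points above 0xFF, where it can't). So "à" comes out as one byte, 0xE0.
open my $raw_fh, '>', \my $raw_bytes or die $!;
print $raw_fh $str;
close $raw_fh;

# With the :utf8 layer -- which 'use open IO => ":utf8";' applies to
# every handle opened in its lexical scope -- "à" goes out as 0xC3 0xA0.
open my $utf8_fh, '>:utf8', \my $utf8_bytes or die $!;
print $utf8_fh $str;
close $utf8_fh;
```

The same applies on input: without a layer the two bytes 0xC3 0xA0 are read as two Latin-1 characters instead of one "à", which is exactly the "Ã " distortion described above.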
Thanks for this. I prefer 2.

> Solutions 1 and 2 (if you can call 1 a solution) assume that gmmproc reads and
> writes only UTF-8 files. Is that true?

Probably, yes. I have no problem with assuming that and fixing any files that are not UTF-8. Could you try creating a patch?
Created attachment 183120 [details] [review]
patch: Tell gmmproc that it reads and writes UTF-8 files.

I've added

  use open IO => ":utf8";

in GtkDefs.pm, Output.pm, and WrapParser.pm. I think that's enough. Those are the modules used by gmmproc that read and write files other than XML files. The other commands (generate_wrap_init.pl, defs_fixer.pl, enum.pl, generate_extra_defs) can remain ignorant of the encoding.

What makes gmmproc different is that it also reads XML files (xxx_docs.xml and xxx_docs_override.xml). XML files are treated differently: UTF-8 is the default encoding, and if another encoding is used, it's indicated in the file itself. When strings from the XML files, known to be UTF-8, are combined with strings from other files, wrongly believed to be Latin-1, the result can be wrong. It happens to be right only if all strings used from the XML files contain nothing but ASCII characters.
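The mixing problem described here can be reproduced in isolation (a hypothetical sketch; the variable names are illustrative, not from gmmproc):

```perl
use strict;
use warnings;
use Encode qw(decode);

# A string taken from an XML doc file, correctly decoded as UTF-8:
# one character, "à" (U+00E0).
my $from_xml = decode('UTF-8', "\xc3\xa0");

# The same text read from a .hg file with no encoding layer: Perl keeps
# the two raw bytes, and when forced to combine them with a decoded
# string it reinterprets them as Latin-1.
my $from_hg = "\xc3\xa0";

# The concatenation upgrades the byte string as Latin-1, so 0xC3 0xA0
# becomes the two characters "Ã" and no-break space -- the "Ã " seen
# in the generated documentation.
my $joined = $from_xml . $from_hg;
```

This is why the fix has to make the non-XML inputs UTF-8-aware too: once one decoded string enters the mix, every byte string it touches gets the Latin-1 interpretation.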
Thanks. Pushed. It seems to fix the generated HTML.
The added 'use open IO => ":utf8";' in GtkDefs.pm can cause an unacceptable increase in execution time, as pointed out in an email from Mike Gorse.

> On Thu, 2011-04-07 at 09:36 -0400, Mike Gorse wrote:
>> Hi Murray,
>>
>> Is it normal for gtkmm to take something like two days to build? I was
>> starting to build it in jhbuild the other day (on a Thinkpad T61 with a
>> 2.4ghz Core 2 Duo 7700), and gmmproc took around 5-25 minutes of CPU time
>> per invocation, so I was wondering if that was typical or if there's
>> something peculiar about my configuration.
>
> No, that's not normal, as I told you on twitter.

Fyi, I have the attached patch applied locally, but I'm hesitant to file a bug because I'm not sure that the underlying issue isn't a bug with the OpenSUSE build of Perl. When your tokenize function gets called, it iteratively calls substr() on the string read in from the file, and, since the string is internally represented as UTF-8, it calls perl_sv_len_utf8. For some reason the string's length doesn't get cached, so it winds up recalculating the length (reading through the whole string) every time substr() is called, i.e. for every character in the file. But if I build Perl 5.12.3 from source, rather than use the OpenSUSE Perl 5.12.3 package, this doesn't happen, so I'm guessing that it is caused by some patch or flag that OpenSUSE passes in.

------

The patch mentioned in the email is

diff --git a/tools/pm/GtkDefs.pm b/tools/pm/GtkDefs.pm
index f665099..0a5e286 100644
--- a/tools/pm/GtkDefs.pm
+++ b/tools/pm/GtkDefs.pm
@@ -171,6 +171,7 @@ sub read_defs($$;$)
 sub split_tokens($)
 {
   my ($token_string) = @_;
+  utf8::downgrade($token_string);

   my @tokens = ();

My first reaction was: "Let's remove the 'use open IO => ":utf8";' from GtkDefs.pm and keep it only in Output.pm and WrapParser.pm. The files that GtkDefs.pm reads contain only ASCII characters." But I may be wrong.
xxx_signals.defs contains documentation of properties, and IMHO non-ASCII characters shall be allowed in documentation. Since UTF-8 strings and Latin-1 strings are mixed in gmmproc, Perl must keep track of the encoding of each string, or else there can be a mess when all output is joined into one big string in Output::output_temp_g1().

The use of substr() in GtkDefs::split_tokens() can be terribly inefficient when used on a UTF-8 string, since the index used by substr() is a character index, not a byte index. To find character n in a UTF-8 string, you have to read from the beginning of the string; you can't just add n to the start address.

Unfortunately my version of Perl did not show any noticeable increase in execution time when I added 'use open IO => ":utf8";' in GtkDefs.pm. Probably the Perl system detects that $token_string in GtkDefs::split_tokens() contains only ASCII characters, although the utf8 flag is set; then it can use fast address arithmetic to find a requested substring. But when I added a few non-ASCII characters to gtk_signals.defs, the execution time of gmmproc increased from 4 s to 1 min.

After some tests with different Perl functions and pragmas, I think this is the way GtkDefs::split_tokens() shall be changed:

-- Near the beginning, add

  my $was_utf8 = utf8::is_utf8($token_string);
  # Turn off the utf8 flag. It may make substr() terribly slow.
  utf8::encode($token_string) if ($was_utf8);

-- Near the end, just before 'return @tokens;', add

  # Turn on utf8 flags in @tokens, if the utf8 flag was on in $token_string.
  if ($was_utf8)
  {
    foreach (@tokens)
    {
      utf8::decode($_);
    }
  }

The Perl documentation recommends utf8::downgrade() and utf8::upgrade() instead of utf8::encode() and utf8::decode(). (At least I think so; the documentation is voluminous and not very easy to understand in all detail.) But when I tried with a small test script, it did not work well: utf8::downgrade() fails if the string contains a non-Latin-1 character.
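Put together as a self-contained function, the proposed shape looks like this (a sketch under the assumptions above; the whitespace split stands in for the real, more involved tokenizer in GtkDefs.pm):

```perl
use strict;
use warnings;

# Sketch of the proposed change to GtkDefs::split_tokens(): drop to a
# byte string before the per-character work, restore the encoding on
# the way out.
sub split_tokens_sketch {
    my ($token_string) = @_;

    my $was_utf8 = utf8::is_utf8($token_string);
    # Turn off the utf8 flag; substr() on a UTF-8-flagged string may
    # rescan the string from the start on every call.
    utf8::encode($token_string) if $was_utf8;

    # Stand-in tokenizer: byte-oriented work is now cheap.
    my @tokens = split /\s+/, $token_string;

    # Turn the utf8 flag back on in each token, if it was on in the input.
    if ($was_utf8) {
        utf8::decode($_) for @tokens;
    }
    return @tokens;
}
```

The point of the round trip is that multi-byte characters survive intact as byte sequences through the tokenizing, as long as the tokenizer only splits at ASCII delimiters (every byte of a UTF-8 multi-byte character has the high bit set, so no UTF-8 continuation byte can be mistaken for an ASCII delimiter).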
I'll make some more tests, and let people have a chance to object to my proposed fix. Then I can supply a patch.
Reopening this bug.
Created attachment 186420 [details] [review]
patch: gmmproc: GtkDocs::split_tokens() splits a byte string.

Here's the patch. I've added more comments; otherwise it's exactly as suggested in comment 4. I don't know if anyone has tested it on an OpenSUSE system, but I would be surprised if the execution time of gmmproc is more than a few seconds with this fix. On Ubuntu 10.10 it's a few seconds, and there is no noticeable increase in execution time if a few non-ASCII or non-Latin-1 characters are added to gtk_signals.defs. And yes, those characters are correctly copied to gtkmm/xxx.h.
I was thinking about another way to fix it. We should probably decode all strings we read from files as UTF-8 by using decode() and encode them before writing to a file with encode().

And split_tokens() could be written better. Instead of repeatedly calling substr(), which seems to take a long time every time it is called, something like the code below should be used:

  my @characters = split(//, $token_string);
  for my $char (@characters)
  {
    ...
    # when the token's beginning and end are known, then
    my $token = join('', @characters[$begin_index .. $index]);
    # push $token into the tokens array.
  }

or something like this. Have you tried this approach?
(In reply to comment #7)
> I was thinking about another way to fix it. We should probably decode all
> strings we read from files as utf8 by using decode() and encode them before
> writing to file with encode().

This is exactly what is achieved by the pragma 'use open IO => ":utf8";', if I haven't misunderstood the Perl documentation. See e.g. perluniintro (Unicode I/O) and the open pragma.

> And split_tokens could be written better,

A typical execution time of the present gmmproc for a pair of files in gtkmm/gtk/src (.hg + .ccg) is about 4.5 s when there are only ASCII characters in the .defs files, and about 1 min when there are non-ASCII characters in gtk_signals.defs. With the patch in comment 6 implemented, the execution time is about 4.5 s both with and without non-ASCII characters. And although no one has confirmed it, I'm convinced that the execution time is in the order of seconds rather than minutes on OpenSUSE as well.

I have tested previously with

  my @characters = split(/(.)/, $token_string);

Execution time with and without non-ASCII characters: about 11 s.

Now I repeated the test with your better proposal

  my @characters = split(//, $token_string);

Execution time with and without non-ASCII characters: about 7 s.

Is an increase in execution time of about 50% an acceptable price for better Perl code? I'm inclined to say no. But if it's combined with my proposal in bug 646820, I say yes. If gmmproc almost always reads defs data from a binary file, the extra time for reading from the text files is negligible.
(In reply to comment #7)
> And split_tokens could be written better,

Here's an interesting alternative:

  # @characters is not a good name, since some elements contain more than one
  # character.
  my @characters = split(/([\\"'()])/, $token_string);

Execution time with and without non-ASCII characters: about 3.7 s. This is the fastest alternative except for the binary file of bug 646820. It's not necessary to split all of $token_string into single characters.

Enum::split_enum_tokens() contains a similar loop with

  my $char = substr($token_string, $index, 1);

but that function parses much shorter strings, strings which have already been split by GtkDefs::split_tokens(), and it parses only tokens from one of the defs files, typically a file of moderate size. I added some non-ASCII characters to each enum in gtk_enums.defs (in the first part of 'values', which is parsed by Enum::split_enum_tokens() but not used). It had no noticeable influence on the execution time. The loop in split_enum_tokens() is harmless, but of course it can be changed similarly to the loop in GtkDefs::split_tokens(), just to make those two functions similar to each other.
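A minimal illustration of this split() call on a defs-like string (the input here is made up; the real defs grammar is richer):

```perl
use strict;
use warnings;

# Instead of walking the string character by character with substr(),
# split on the delimiters the tokenizer cares about (backslash, quotes,
# parentheses), capturing them so they stay in the result. Runs of
# ordinary characters come back as single list elements, so most of the
# string is never visited one character at a time.
my $token_string = '(define-enum Foo "bar(baz)")';
my @substrings = split(/([\\"'()])/, $token_string);

# @substrings now alternates between (possibly empty) ordinary runs and
# single captured delimiters, e.g. '', '(', 'define-enum Foo ', '"', ...
```

This is why the approach is so much faster than per-character scanning: the regex engine does one pass over the bytes, and the Perl-level loop afterwards only has to look at delimiters and whole runs.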
Created attachment 186657 [details] [review]
gmmproc: GtkDefs::split_tokens() uses split() instead of substr().

Here's a new patch that replaces the patch in comment 6. This time I think I've even got the title right. (No, it wasn't right in the patch in comment 6.) GtkDefs::split_tokens() is changed as suggested in comment 9. Enum::split_enum_tokens() is not changed.

It's amazing how much GtkDefs::split_tokens() means for the execution time of gmmproc. I tried a version where

  my $index = -1;
  for my $substring (@substrings)
  {
    $index++;

was replaced by

  for (my $index = 0; $index <= $#substrings; $index++)
  {
    my $substring = $substrings[$index];

This seemingly innocent change increased the execution time of gmmproc in gtkmm/gtk/src by about 8%.

Let's give Krzesimir a chance to comment on this and the two previous comments, and give Mike Gorse or someone else a chance to test it on OpenSUSE, where the present version proved to be terribly slow (comment 4). Then, if no objections are raised, the patch can be committed.
Thanks; the execution time is reasonable on OpenSUSE for me with this patch (i.e. typically a few seconds rather than a few minutes).
Neat, splitting that way is a better idea than substringing, thanks. The patch could be pushed IMO, but that is Murray's call, I suppose.
No, you guys are best qualified to decide. Do what you think is best, please.
Ok, I checked just to be sure that the output of the new split_tokens() function is the same as the one produced by the old version. It was, so I pushed the patch. Thanks.