Bug 546648 – Patch to gutf8.c

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 546648 - Patch to gutf8.c


Summary:	Patch to gutf8.c


Status:	RESOLVED OBSOLETE

Product:	glib
Classification:	Platform
Component:	i18n
Version:	2.17.x
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gtkdev
QA Contact:	gtkdev

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2008-08-06 18:22 UTC by acfryx
Modified:	2015-09-05 17:18 UTC

See Also:
GNOME target:	---
GNOME version:	Unversioned Enhancement

Attachments
gutf8.c diff file (9.66 KB, patch) 2008-08-06 18:23 UTC, acfryx	none	Details \| Review
gutf8.c Patch update (9.97 KB, patch) 2008-08-07 03:21 UTC, acfryx	none	Details \| Review

Description acfryx 2008-08-06 18:22:25 UTC

Please describe the problem:
Small patches and bug fixes to gutf8.c source.

Steps to reproduce:
1. 
2. 
3. 


Actual results:


Expected results:


Does this happen every time?


Other information:

Comment 1 acfryx 2008-08-06 18:23:31 UTC

Created attachment 115997 [details] [review]
gutf8.c diff file

Comment 2 acfryx 2008-08-07 03:21:16 UTC

Created attachment 116019 [details] [review]
gutf8.c Patch update

Previous patch had some bugs (2+) and duo to some performance tests some code as been rolled-back.

Comment 3 Matthias Clasen 2008-08-09 04:35:53 UTC

You forgot to describe the bug that is being fixed here.

Comment 4 acfryx 2008-08-09 09:51:04 UTC

(In reply to comment #3)
> You forgot to describe the bug that is being fixed here. 
> 

OK... I thought that the diff file was enough.

I'll do that now. (Considering only the newest patch.)

Macro UTF8_COMPUTE: changed algorithm, hopefully a faster one (some basic test prove it).

Function g_utf8_find_next_char: when the supplied string is not nul-terminated (argument [end] != NULL) and when the next character is pointed by [end] it returns NULL - fixed, now returns [end].

Functions g_utf8_strlen, g_utf8_to_ucs4, g_ucs4_to_utf8, g_utf16_to_utf8: initialization off some variables (like result = NULL) removed.

Function g_utf8_pointer_to_offset: removed local variable [s] because it is always a copy of the argument [str]. Algorithm tune up.

Function g_utf8_get_char_extended: changed algorithm (based on UTF8_COMPUTE) and removed last error check:
    if (UTF8_LENGTH(wc) != len)
        return (gunichar)-1;
...because at that point the character is, without doubt, correct.

Macro SURROGATE_VALUE: small change - instead of "* 0x400" use of "<< 10".

Function g_utf16_to_utf8, g_utf16_to_ucs4: changed surrogate conversion/detection to a faster conversion method.

Function g_utf8_to_utf16, g_ucs4_to_utf16: surrogate values calculation changed.

Function fast_validate: removed some useless verifications changed algorithm (based on UTF8_COMPUTE)

Changed some documentation like:
- *       If @len < 0, then the string is terminated with a 0 character.
+ *       If @len < 0, then the string is nul-terminated.

Please comment...

Comment 5 Rob Staudinger 2008-08-09 16:13:00 UTC

(In reply to comment #4)

(...)

> Macro SURROGATE_VALUE: small change - instead of "* 0x400" use of "<< 10".

Optimisations like this can be left to the compiler these days.

Comment 6 Behdad Esfahbod 2008-08-10 20:23:44 UTC

Hi Adrian,

The common rule for taking optimizations is that there should be either something really stupid the current code is doing, or that there is measurable improvements by using the new code.

Comments inline.


> Macro UTF8_COMPUTE: changed algorithm, hopefully a faster one (some basic test
> prove it).

What kind of testing did you do?  I kinda like your changes but it's not obvious to me that it can result in any measurable speedup.


> Function g_utf8_find_next_char: when the supplied string is not nul-terminated
> (argument [end] != NULL) and when the next character is pointed by [end] it
> returns NULL - fixed, now returns [end].

No, [end] is out of the string.  It should return NULL as it currently does.


> Functions g_utf8_strlen, g_utf8_to_ucs4, g_ucs4_to_utf8, g_utf16_to_utf8:
> initialization off some variables (like result = NULL) removed.

Those are there to shut compiler warnings about possibly uninitialized values up.  They don't make any difference in any real case.

> Function g_utf8_pointer_to_offset: removed local variable [s] because it is
> always a copy of the argument [str]. Algorithm tune up.

I don't see what algorithm tune up you made.  The code looks identical.  I wouldn't worry about the local variable.  These kind of things make absolutely no difference with any decent compiler.


> Function g_utf8_get_char_extended: changed algorithm (based on UTF8_COMPUTE)
> and removed last error check:
>     if (UTF8_LENGTH(wc) != len)
>         return (gunichar)-1;
> ...because at that point the character is, without doubt, correct.>

> Macro SURROGATE_VALUE: small change - instead of "* 0x400" use of "<< 10".
> 
> Function g_utf16_to_utf8, g_utf16_to_ucs4: changed surrogate
> conversion/detection to a faster conversion method.
> 
> Function g_utf8_to_utf16, g_ucs4_to_utf16: surrogate values calculation
> changed.
> 
> Function fast_validate: removed some useless verifications changed algorithm
> (based on UTF8_COMPUTE)

I'm reluctant to make these changes to a very well tested code, unless they provide significant speedup in a real use case.


> Changed some documentation like:
> - *       If @len < 0, then the string is terminated with a 0 character.
> + *       If @len < 0, then the string is nul-terminated.

Thanks.  Committed these.

> Please comment...
>

Comment 7 acfryx 2008-08-10 21:46:09 UTC

(In reply to comment #6)

> The common rule for taking optimizations is that there should be either
> something really stupid the current code is doing, or that there is measurable
> improvements by using the new code.

Thanks for the advice... I'm entering this new world and I'm slowly realizing that but because my expertise is embedded systems I have the tendency to squeeze everything to the maximum but I'm currently reading some guides to compiler optimization and I hope to loose some of this tendency.

> > Macro UTF8_COMPUTE: changed algorithm, hopefully a faster one (some basic test
> > prove it).
> 
> What kind of testing did you do?  I kinda like your changes but it's not
> obvious to me that it can result in any measurable speedup.

I test the time it takes to execute the old macro from 0 to 255 several times and store the mean, then I make the same thing to the new macro and then I compare both times by making time_new/time_old.
The results I get were: no optimization 1.3, O2 ~1.1, O3 impossible to tell.
Although the results are minimal I think its intuitive that in the old mode the PC as to make to operation and in the new mode it as to make only one. Although this seems fine there's a catch because normally text is fine, so the test << if (Char < 192) << is executed all the time even knowing almost all the time it is not needed. Probably it is better to leave as it is...

> > Function g_utf8_find_next_char: when the supplied string is not nul-terminated
> > (argument [end] != NULL) and when the next character is pointed by [end] it
> > returns NULL - fixed, now returns [end].
> 
> No, [end] is out of the string.  It should return NULL as it currently does.

I'm not shore I understand... if the string is, say "abc" (lol), the parameter [end] points to the memory address past 'c' and not to 'c'? If yes you are right but the documentation isn't clear about that.

> > Functions g_utf8_strlen, g_utf8_to_ucs4, g_ucs4_to_utf8, g_utf16_to_utf8:
> > initialization off some variables (like result = NULL) removed.
> 
> Those are there to shut compiler warnings about possibly uninitialized values
> up.  They don't make any difference in any real case.

OK... didn't know... (again)

> > Function g_utf8_pointer_to_offset: removed local variable [s] because it is
> > always a copy of the argument [str]. Algorithm tune up.
> 
> I don't see what algorithm tune up you made.  The code looks identical

Obviously I chose the wrong words to describe the change...

> I wouldn't worry about the local variable.  These kind of things make > absolutely no difference with any decent compiler.

As I read the optimization guides I realize that my change can sometimes even be worse...

> > Function g_utf8_get_char_extended: changed algorithm (based on UTF8_COMPUTE)
> > and removed last error check:
> >     if (UTF8_LENGTH(wc) != len)
> >         return (gunichar)-1;
> > ...because at that point the character is, without doubt, correct.>
> 
> > Macro SURROGATE_VALUE: small change - instead of "* 0x400" use of "<< 10".
> > 
> > Function g_utf16_to_utf8, g_utf16_to_ucs4: changed surrogate
> > conversion/detection to a faster conversion method.
> > 
> > Function g_utf8_to_utf16, g_ucs4_to_utf16: surrogate values calculation
> > changed.
> > 
> > Function fast_validate: removed some useless verifications changed algorithm
> > (based on UTF8_COMPUTE)
> 
> I'm reluctant to make these changes to a very well tested code, unless they
> provide significant speedup in a real use case.

My objective was to make some clean up while gaining minimal performance and not to make things much much faster.
I can make a post explaining my changes and why they don't change what's already done.

Comment 8 Behdad Esfahbod 2008-08-10 23:17:12 UTC

(In reply to comment #7)
> (In reply to comment #6)
> 
> > The common rule for taking optimizations is that there should be either
> > something really stupid the current code is doing, or that there is measurable
> > improvements by using the new code.
> 
> Thanks for the advice... I'm entering this new world and I'm slowly realizing
> that but because my expertise is embedded systems I have the tendency to
> squeeze everything to the maximum but I'm currently reading some guides to
> compiler optimization and I hope to loose some of this tendency.

It's easy to try to optimize everything.  I know, because I used to be just like that.  But with well-tested working code, it's better to learn to not touch things if you don't need to.

> > > Macro UTF8_COMPUTE: changed algorithm, hopefully a faster one (some basic test
> > > prove it).
> > 
> > What kind of testing did you do?  I kinda like your changes but it's not
> > obvious to me that it can result in any measurable speedup.
> 
> I test the time it takes to execute the old macro from 0 to 255 several times
> and store the mean, then I make the same thing to the new macro and then I
> compare both times by making time_new/time_old.
> The results I get were: no optimization 1.3, O2 ~1.1, O3 impossible to tell.
> Although the results are minimal I think its intuitive that in the old mode the
> PC as to make to operation and in the new mode it as to make only one. Although
> this seems fine there's a catch because normally text is fine, so the test <<
> if (Char < 192) << is executed all the time even knowing almost all the time it
> is not needed. Probably it is better to leave as it is...

Well, your test does not necessarily reflect real-world distribution on the inputs.  And yours penalizes non-ASCII cases by having that Char<192 case as you pointed out.  Moreover, in any real world scenario, one or two operations is totally hidden in things like memory stalls and floating point operations.  It just is not going to show in *any* profiling any day.

 
> > > Function g_utf8_find_next_char: when the supplied string is not nul-terminated
> > > (argument [end] != NULL) and when the next character is pointed by [end] it
> > > returns NULL - fixed, now returns [end].
> > 
> > No, [end] is out of the string.  It should return NULL as it currently does.
> 
> I'm not shore I understand... if the string is, say "abc" (lol), the parameter
> [end] points to the memory address past 'c' and not to 'c'? If yes you are
> right but the documentation isn't clear about that.

Yes, the docs can be improved.  See bug 547200.

> > I'm reluctant to make these changes to a very well tested code, unless they
> > provide significant speedup in a real use case.
> 
> My objective was to make some clean up while gaining minimal performance and
> not to make things much much faster.
> I can make a post explaining my changes and why they don't change what's
> already done.

I know each individual change you propose may be faster than what's in there, and that they do the same thing.  The devil however is in the details.  Bugs are never introduced intentionally.  They just happen as sideeffects of other changes.  When no change is needed, don't make one.  That's the rule I'm trying to follow.

Cheers,

Comment 9 acfryx 2008-08-11 00:05:42 UTC

(In reply to comment #8)
> (In reply to comment #7)
> > (In reply to comment #6)

> > > > Function g_utf8_find_next_char: when the supplied string is not nul-terminated
> > > > (argument [end] != NULL) and when the next character is pointed by [end] it
> > > > returns NULL - fixed, now returns [end].
> > > 
> > > No, [end] is out of the string.  It should return NULL as it currently does.
> > 
> > I'm not shore I understand... if the string is, say "abc" (lol), the parameter
> > [end] points to the memory address past 'c' and not to 'c'? If yes you are
> > right but the documentation isn't clear about that.
> 
> Yes, the docs can be improved.  See bug 547200.

"shore" - my mistake... lol

> > > I'm reluctant to make these changes to a very well tested code, unless they
> > > provide significant speedup in a real use case.
> > 
> > My objective was to make some clean up while gaining minimal performance and
> > not to make things much much faster.
> > I can make a post explaining my changes and why they don't change what's
> > already done.
> 
> I know each individual change you propose may be faster than what's in there,
> and that they do the same thing.  The devil however is in the details.  Bugs
> are never introduced intentionally.  They just happen as sideeffects of other
> changes.  When no change is needed, don't make one.  That's the rule I'm trying
> to follow.

Honestly it's hard to understand (again that feeling that everything must be squeezed)... but I just got here, so...

As I refer in the gtk-devel mailing list there is a lot of inconsistency in this file and other files (related to strings). I would like to know if they are to stay as they are or if it is better to change them (at least for the next major release)?

Best regards...

Comment 10 Matthias Clasen 2015-09-05 17:18:29 UTC

I've pushed another optimization patchset for gutf8.c recently, so this is will need to be revisited, if any part still applies.