Bug 795991 – gmarkup: Optimize g_markup_escape_text()

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 795991 - gmarkup: Optimize g_markup_escape_text()


Summary:	gmarkup: Optimize g_markup_escape_text()


Status:	RESOLVED OBSOLETE

Product:	glib
Classification:	Platform
Component:	general
Version:	unspecified
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gtkdev
QA Contact:	gtkdev

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2018-05-10 06:17 UTC by Mohammed Sadiq
Modified:	2018-05-24 20:32 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
gmarkup: Optimize g_markup_escape_text() (3.11 KB, patch) 2018-05-10 06:17 UTC, Mohammed Sadiq	none	Details \| Review
gmarkup: Optimize g_markup_escape_text() (3.13 KB, patch) 2018-05-10 06:21 UTC, Mohammed Sadiq	needs-work	Details \| Review

Description Mohammed Sadiq 2018-05-10 06:17:39 UTC

I had to parse several big ASCII xml files. So I thought of
optimizing it a bit.

Comment 1 Mohammed Sadiq 2018-05-10 06:17:49 UTC

Created attachment 371875 [details] [review]
gmarkup: Optimize g_markup_escape_text()

The string @text is supposed to be only UTF-8.
And as we require to escape only ASCII characters, we
can optimize this by simply reading the next byte (as opposed to next unichar)

* In best case (All are print chars), the function is ~2.5 times faster.
* When every char is an XML escape char (<, >, ', ", &) the speed is +~20%.
* And in worst case (when every char is an UTF-8 control char), it is about
~10% faster than the previous version.

Comment 2 Mohammed Sadiq 2018-05-10 06:21:31 UTC

Created attachment 371876 [details] [review]
gmarkup: Optimize g_markup_escape_text()

The string @text is supposed to be only UTF-8.  Instead of reading
every unichar, parse every byte, and escape if required. If there
comes a UTF-8 control char, then read the unichar and escape it.

* In best case (All are print chars), the function is ~2.5 times faster.
* When every char is an XML escape char (<, >, ', ", &) the speed is +~20%.
* And in worst case (when every char is an UTF-8 control char), it is about
~10% faster than the previous version.

Comment 3 Philip Withnall 2018-05-18 11:12:27 UTC

Review of attachment 371876 [details] [review]:

This looks good, but I think the code needs a fair bit more documentation, otherwise it’s going to be hard to read in future.

Also, can you please expand the tests in glib/tests/markup-escape.c to cover some additional cases:
 • Some three-byte UTF-8 sequences
 • Some four-byte UTF-8 sequences
 • Various other two-byte sequences above U+009F
 • A variant of unichar_test() which embeds the given UTF-8 sequence several times in a longer ASCII string (for example, with a template `ascii %{utf8} more ascii %{utf8}%{utf8}`) to check the transitions between handling single bytes and handling a multibyte UTF-8 sequence (and vice-versa), and to check handling two consecutive multibyte UTF-8 sequences

Thanks!

::: glib/gmarkup.c
@@ +2138,3 @@
 }
 
+

Nitpick: Spurious blank line introduced.

@@ +2148,1 @@
   const gchar *end;

Please add a comment at the top of the function giving an overview of how it works, why we operate on bytes rather than gunichars, and why we only need to consider the two-byte UTF-8 sequences rather than also needing to think about three- and four-byte sequences.

@@ +2150,3 @@
   end = text + length;
 
+  while (text < end)

Why change the iteration variable from p to text? It’s a bit clearer to keep p and text separate, so that text could be referred to again in future if the function needed to be refactored.

@@ +2157,3 @@
         {
         case '&':
+          g_string_append_len (str, "&amp;", 5);

Do these changes from using g_string_append() to g_string_append_len() really make a difference to the performance? They make it a lot less readable.

If they do make a significant difference, I’d prefer to do it a different way: in a separate patch, add a static inline version of g_string_append() in gstring.h, which has the same implementation as in gstring.c. Any reasonable optimising compiler should be able to optimise out the g_string_append() function call, precalculate the strlen() on the static string, and turn it all directly into a g_string_insert_len() call.

Note that we’d need to keep the current g_string_append() implementation in gstring.c (as well as introducing the static inline in the header), so that we continue to export an address for that symbol.

@@ +2191,3 @@
+                {
+                  g_string_append_printf (str, "&#x%x;", u);
+                  text++;

This `text++` line deserves a comment. As I understand it, the text++ here skips over the 0xC2 byte, then the text++ just before the end of the loop skips over the second byte in the UTF-8 sequence here. That’s not really obvious unless you read through the code a few times though.

@@ +2194,3 @@
+                }
+              else
+                g_string_append_c (str, c);

Similarly, what happens in the case of (for example) U+0085 (bytes C2 85) is not particularly obvious and deserves a comment. As I understand it, the C2 byte will be detected, the gunichar u (0x85) won’t match anything in the if-statement above, so this else case will be hit, which will append byte 0xC2 to str. On the next iteration of the loop, we’ll have (c == 0x85) which is part-way through a UTF-8 sequence. It won’t match any of the cases above, though, and will fall through to the g_string_append_c() case below, which will append it (byte 0x85) to str, and then iterate on to the start of the next UTF-8 sequence.

That does all work, but is definitely not obvious from a first reading of the code.

Comment 4 GNOME Infrastructure Team 2018-05-24 20:32:21 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to GNOME's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/1376.