GNOME Bugzilla – Bug 572508
gmarkup speedup ...
Last modified: 2009-05-01 14:26:07 UTC
It seems that the time GMarkup takes to parse XML is quite important to login-time - particularly when we get CPU starvation due to lots of parallelism, there are two sets of chunky GMarkup based parsing that happen on Gnome login: the %gconf-tree.xml goodness, and the bonobo activation .server files [ needed before the panel can get into full flow ]. The attached patch re-works GMarkup's parser in several ways, to be enumerated; the overall punch line is that it gives a ~40+% speed win with ~no feature loss: before: ==22104== Events : Ir Dr Dw I1mr D1mr D1mw I2mr D2mr D2mw ==22104== Collected : 607701851 230493464 106708957 10470 74674 1276 1463 73690 586 after: ==13918== Events : Ir Dr Dw I1mr D1mr D1mw I2mr D2mr D2mw ==13918== Collected : 349936177 122295042 61872399 2055 75729 2191 1404 73667 601 The ways in which this is achieved are: * do local, not global utf8 validation - this is not needed for the ASCII delimiters we look for when parsing, or the spaces we hit, or the element close tags [ which we already check vs. open tags ]. => defer utf8 validation until necessary, and try to do built-in ASCII detection for free, on names. => this makes tests/markups/fail-2 and fail-3 give slightly different results: we fail appropriately but at slightly later stage, and for some different reasons eg. "no body tag" for fail-2. * do less allocation - a ton of the CPU cycles were in g_slice_ and g_realloc/free around strings - since we use a rather limited number of these at any given time - I store a local / no-locking-required pool of both GSList nodes and GStrings on the parser context: this saves a lot of time. * do in-place unescaping - the previous unescaping logic appeared baroque and over-complicated, it also used an always-duplicating algorithm - which, since used for every attribute & text imposed significant overhead. It now performs in-place un-escaping, again for a pleasant speed win. * do slightly less precise parsing, and validate later - the previous code would warn giving accurate character info as it hit a mis-behaving character; post re-write, we delay some of this validation until rather later, and while giving the correct errors, may be slightly less accurate / helpful with the error messages. * be less tolerant of oddness such as '\0' inside text elements in the XML stream - apparently libxml2 doesn't, and it's unclear why this is necessary, aside from it being some sort of curio: this could be fixed in the in-place unescaping code if anyone cares. Finally, thus far the code changes are: +440 -626 - so a net LOC reduction too. I'll polish this shortly to fix the regression tests, but comments much appreciated.
Created attachment 129115 [details] [review] patch
I'm all in favour of speeding things up, but is there any chance to split this into a patch series along the lines outlined in the initial comment ? That would make review much easier.
Created attachment 129117 [details] [review] updated patch, fixing tests.
> I'm all in favour of speeding things up, but is there any chance to split this > into a patch series along the lines outlined in the initial comment ? > That would make review much easier. Oh ! - well, some of the bits are easy enough to extract, others are more tangled - eg. the in-place escaping is trivial to extract, perhaps the allocation reduction stuff, but the char-by-char parsing & delayed validation optimisations are relatively more tricky to extract cleanly. Certainly it is possible, I'll try to do it - but sadly no time today ...
I started looking anyway, here are some first impressions: gchar **attr_names; gchar **attr_values; GString **attr_names_gstr; GString **attr_values_gstr; I wonder why we are maintaining these pairs of arrays in parallel now. Would it not be enough to maintain the GString arrays while parsing, and just create the gchar arrays at the time when we are calling the start_element callback ? static inline void string_blank (GString *string) Does this really buy us anything over calling g_string_set_size (s, 0) ? else context->partial_chunk = g_string_sized_new (MIN (28, text_end - text_start)); } if (text_start != text_end) g_string_insert_len (context->partial_chunk, -1, text_start, text_end - text_start); Don't you want to ensure that the newly allocated chunk is large enough so that the insertion works without reallocating ? Ie should that not be MAX (28, text_end - text_start) ? I notice in passing that truncate_partial and current_element may want to be inline.
> I started looking anyway Thanks; much appreciated. > I wonder why we are maintaining these pairs of arrays in parallel now. We certainly don't need to - we can g_alloca the space and knock out the stringv's as we call the start_element - as you say; will fix. > static inline void string_blank (GString *string) > > Does this really buy us anything over calling g_string_set_size (s, 0) ? Well, string_blank evaporates away to a nothing-ness when inlined, whereas g_string_set_size has two extra compare/branches, and more code. I wouldn't mind of course - but we call that beastie ~ 360k times during login inside gconfd-2 alone; it's only a few million cycles of win - but seemed un-controversial at the time ? ;-) > Don't you want to ensure that the newly allocated chunk is large enough > so that the insertion works without reallocating ? yes - quite right, should be a MAX; thanks :-) > I notice in passing that truncate_partial and current_element may > want to be inline. sure; I assumed the compiler would do that, but no harm in asking too. Also - FWIW, I just noticed in gconf that a -ton- of text_element callbacks occur, almost all of them containing the indentation 'fluff' between elements - and we're spending a load of time re-visiting them to do slow utf-8 validation, when we could do that for free during the unescaping (as in name_validate) - will add a trivial tweak for that & update the patch by tomorrow.
> but seemed un-controversial at the time ? ;-) Yeah, no biggie. Leave it in. It is not as if GString internals were a no-touch area...
Created attachment 129573 [details] [review] updated patch ... This fixes the issues you mention Matthias, and adds some 'free' is_ascii code to avoid the cost of utf8 validation for many common cases.
A few more things: There is an #if 0 ed out block that should probably be removed if it is not needed anymore + if (context->parser->start_element && + name_validate (context, start_name, error)) We should probably move that up a bit so that all the name/value copying only happens if necessary.
Oh, and do you have any profiling numbers for the final patch ?
Oh, sorry yes the #if 0 block is redundant and I believe we can remove current_text_len as well when that goes. wrt. the start_element thing - sure, we can do that easily enough, but it's a very un-common case to have either no start_element callback, or an invalid name - almost negligable I suspect. Wrt. profiling numbers; I attach my sample XML file - it is essentially (modulo an additional bogus parent tag) /etc/gconf/gconf.xml.defaults/%gconf-tree.xml /etc/gconf/gconf.xml.defaults/%gconf-tree-zh_CN.xml # or other chinese variant /usr/lib/bonobo/servers/*.server all concatenated - essentially the XML we read on start; of course with zh_CN I expect the worst case, wrt. UTF-8 re-validation, for en-GB I expect something yet faster ;-) * Before * ==5769== Events : Ir Dr Dw I1mr D1mr D1mw I2mr D2mr D2mw ==5769== Collected : 607701826 230493575 106709308 8997 74606 1260 1420 73661 577 ==5769== ==5769== I refs: 607,701,826 ==5769== I1 misses: 8,997 ==5769== L2i misses: 1,420 ==5769== I1 miss rate: 0.0% ==5769== L2i miss rate: 0.0% ==5769== ==5769== D refs: 337,202,883 (230,493,575 rd + 106,709,308 wr) ==5769== D1 misses: 75,866 ( 74,606 rd + 1,260 wr) ==5769== L2d misses: 74,238 ( 73,661 rd + 577 wr) ==5769== D1 miss rate: 0.0% ( 0.0% + 0.0% ) ==5769== L2d miss rate: 0.0% ( 0.0% + 0.0% ) ==5769== ==5769== L2 refs: 84,863 ( 83,603 rd + 1,260 wr) ==5769== L2 misses: 75,658 ( 75,081 rd + 577 wr) ==5769== L2 miss rate: 0.0% ( 0.0% + 0.0% ) * After * ==5783== Events : Ir Dr Dw I1mr D1mr D1mw I2mr D2mr D2mw ==5783== Collected : 324251801 113330927 62288073 1710 75875 3331 1410 73659 687 ==5783== ==5783== I refs: 324,251,801 ==5783== I1 misses: 1,710 ==5783== L2i misses: 1,410 ==5783== I1 miss rate: 0.0% ==5783== L2i miss rate: 0.0% ==5783== ==5783== D refs: 175,619,000 (113,330,927 rd + 62,288,073 wr) ==5783== D1 misses: 79,206 ( 75,875 rd + 3,331 wr) ==5783== L2d misses: 74,346 ( 73,659 rd + 687 wr) ==5783== D1 miss rate: 0.0% ( 0.0% + 0.0% ) ==5783== L2d miss rate: 0.0% ( 0.0% + 0.0% ) ==5783== ==5783== L2 refs: 80,916 ( 77,585 rd + 3,331 wr) ==5783== L2 misses: 75,756 ( 75,069 rd + 687 wr) ==5783== L2 miss rate: 0.0% ( 0.0% + 0.0% ) I attach some kcachegrind screenshots to show the general state of affairs - which essentially mirrors the code changes, as you would expect :-) A ~45% reduction in I refs at least.
Created attachment 129637 [details] sample xml, tested in glib/tests/markup-test [ with only the 4k chunk section in place ]
Created attachment 129638 [details] before
Created attachment 129639 [details] after
Michael, did you measure warm/cold login time before/after?
oh - well, if by measure you mean re-vectoring the gconf binary, wrapping it with callgrind, and measuring that the win is real there too - then yes ;-) I havn't done wall-clock measurements; after-all it took only a second or so at the best of times. I call-grinded things beforehand for both gconfd and b-a-s, the latter spending ~98% of it's time in gmarkup, and gconf spending 76% of it's time underneath g_markup_parse_context - though of course some small part of that is from gconf specific callbacks. After the work, there is less grief there. I have 1.7bn cycles 'before' and 0.7bn cycles 'after' (though this prolly includes some other previously published gconf wins). Of course, there is still a load more fat in gconf itself, and I have a good number of other ideas around speeding that up if you're interested :-) - but gmarkup was the dominating factor. HTH.
> wrt. the start_element thing - sure, we can do that easily enough, but it's a > very un-common case to have either no start_element callback, or an invalid > name - almost negligable I suspect. Yeah, I didn't expect any significant win from it, but it just seems like the right thing to do... Anyway, we should probably get this in with this last set of fixes (ie with the #if 0 block removed and the start_element if moved up). Michael, can you commit it with those changes ?
* glib/gmarkup.c: Various optimizations: do less allocations by keeping a pool of GStrings, do in-place unescaping, avoid redundant utf-8 validation. Patch by Michael Meeks
Hi Matthias, Are you certain you actually committed the code ? :-) I mean, I see the ChangeLog entry & all that, but my patch still applies without changes, and gmarkup looks as inefficient as ever in HEAD git ;-> I attach the patch we apply in SLED11. HTH, Michael.
Created attachment 133666 [details] [review] patch from SLE11 work
Hmm, maybe some git accident, or something. Could you commit it yourself this time ?
Committed, and pushed (I hope). Thanks.