GNOME Bugzilla – Bug 315365
Improve layout performance by shortcircuiting fribidi on unidirectional text
Last modified: 2005-11-22 10:57:20 UTC
Here's a patch to improve the performance of PangoLayout by about 15%. (It is noticeable if you resize a large document.) It skips all the bidi code if it isn't needed. I'm assuming that if there are no strong RTL characters we can assume the entire string is level 0 LTR, and if there are no strong LTR characters we can assume it is all level 1 RTL. It might need some tweaking to check that the paragraph level matches the text level.
Created attachment 51874 [details] [review] Patch to speedup layout
Hi Damon, Thanks for the patch, but it seriously breaks things. Your assumption is wrong. To be safe, you need to check for FRIBIDI_MASK_EXPLICIT too. Moreover, I really don't like touching fribidi.c, since it's not upstream code and should and will get updated from upstream every once in a while. The other question is whether it really has 15% improvements. I doubt that. How did you profile? Which symbols show up? I have profiled Pango several times and the bidi code never showed up that much, and when it does, most of the time is spent in _pango_fribidi_get_type. The reason is pretty simple, fribidi run-length-encodes the types and operates on the that.
Yes, I thought there might be other things to check for. But is it possible to get it to work? I'm using clock() to time a simple benchmark app. For short pieces of text I was seeing improvements around 15%. I think the main problem with the bidi code is that it does loads of mallocs, about one for each word in the text (in new_type_link()).
About the malloc problem, original fribidi code is not like that, it keeps unused type_links around. In pango they replaced it with g_malloc, but I *guess* glib already does smart things about allocating lots of small pieces of memory. Sorry if it sounds silly, but are you running ~10 times and taking the average? The only way that the bidi code can be short-circuited is when the text doesn't have any characters passing FRIBIDI_IS_RTL (please don't use masks directly), then you can safely assign level 0 to all chars. The levels are not necessarily all 0 if you run the algorithm (you get level 2 for digits), but that shouldn't make any difference (other than all in a sudden your digits can kern with neighbouring text.)
g_malloc() normally just calls malloc(). It isn't optimised. Using something like a static GArray would be an easy way to avoid all the mallocs. (It is actually more like 2 mallocs per word at present - one for the word and one for the space.) (And yes I do run the benchmark several times!)
ok then maybe reverting back to the MemChunk code that has been in fribidi itself, or using a Glib MemChunk. You can see what it used to be in the fribidi.patch file in there.
With the current GMemChunk that won't help
Matthias, can you elaborate please? With the MemChunk code that I hacked up in FriBidi, it allocates a page at a time, which is far better than 16 bytes at a time. What's wrong with GMemChunk then?
Damon, the patches in bug 317192 make mini-fribidi use GMemChunk for TypeLinks. Can you test performance-wise?
GMemChunk is has consistently been measured to be a bad idea; free lists can help in some cases, but backing those free lists off of GMemChunk rather than malloc generally doesn't make sense. (see bug 118439 for lots of related discussion)
Behdad, GMemChunk is actually slower than malloc. See bug 118439 for my latest efforts to come up with a chunked allocator that can allocate small chunks without any space overhead and is faster than malloc.
Damon, other than moving FriBidi to use the new allocator that Matthias is developing in bug 118439, do you think there's anything else in this bug to do?
It might still be worth skipping the bidi code if possible, though we can't say without benchmarking it. (In my test, GMemChunk is over twice as fast as malloc, if you use G_ALLOC_ONLY (which the fribidi patch does). I think it is only G_ALLOC_AND_FREE that has too much overhead.)
You are right about GMemChunk, FriBidi keeps a trash stack on top of GMemChunk, so that's an improvement anyway. Depending on the "update mini-fribidi" bug, since all changes should go after that update. You may want to just bitwise-or all FriBidiCharType's and if the RTL bit is not set, simply return a zero vector as bidi-levels.
It is a bit ugly to see all the conversions that go on between fribidi and pango here... 1) convert from utf8 to ucs4 because fribidi expects ucs4 2) fribidi mallocs an array of character types, only to convert it to a home-grown linked list and free it again 3) pango takes the linked list, and converts it back into an array
Humm, good idea Matthias. It can be easily fixed actually. In the new FriBidi codebase I've split the main analyze function into parts such that you can shortcircuit several parts. It's easier to fix later, so I'd say wait until it's in upstream.
All Matthias's concerns in comment 15 are resolved now. We just need to short-circuit all-ltr case. Doing now.
Created attachment 54297 [details] [review] mini-fribidi shortcircuit patch Attaching a patch for mini-fribidi to simply return a zero-filled embedding_levels array if all chars are left-to-right. Shows some improvement for ltr locales.
how about the case of all rtl ? also, looking at the pango side, all we ever seem to use from the embedding levels is the information where the embedding level changes occur. Maybe we can save something by only computing that information ?
all rtl very very rare: space is not rtl. Not worth it. We can export the internal FriBidi TypeLink thing which contains run-length encoded information including the embedding levels. Then Pango can ask fribidi to free it after it's done with the data, which is used (only) during itemization. I may work on that, but not quite right now. We can apply the patch for now I guess.
Created attachment 54511 [details] [review] new patch for both ltr and rtl Ok, attaching a solid patch for both ltr and rtl. Will apply after review (Matthias?) Remains to get rid of the embedding_levels all together, and use a run-length-encoded structure.
Patch committed. I'm working on exposing mini-fribidi run-list to pango directly, so we don't have to populate the embeddings_list array.
I decided that at this stage it's probably not worth it to blur the line between pango and mini-fribidi by exposing the run list to pango. Feel free to reopen if you want to submit a patch.