GNOME Bugzilla – Bug 104683
Gnumeric's test-pango is slow
Last modified: 2006-02-26 23:29:51 UTC
Currently, gnumeric uses a pangolayout per cell, say 400 on a screen. It's painfully slow. pango_layout_check_lines eats about 40% of the time to draw things. The layouts typically contain short strings, have no width set, have no white-space (and thus no newlines), and are all ascii. It doesn't get simpler. Since pango_layout_set_text is already walking over the string, it could very well notice this and somehow tell the very general machinery that it might be worth it to take a shortcut somewhere.
- "Painfully slow" is a matter of opinion - It's normally that text layout takes as much time or more time than drawing. - pango_layout_check_lines() is the "do the paragraph layout" call. Everything except actual drawing happens in there. - I'd very much rather not special case, and I've seen no real evidence that special casing will help.
Gnumeric redraw is painfully slow. Unusably so. I'm getting alot of complaints and we're attempting to do whatever we can. Owen, we mean no insult here, but something needs to be done if Gnumeric is to use pango. Would you accept a patch if we can find a short circuit ?
"It's too slow" is simply not an acceptable bug report. A bug report should point to a concrete action. Without knowing what gnumeric is doing; without seeing a profile of where it's taking the time, I can't comment on it being too slow for you. In my experience, laying out a screen full of text does not take significant amounts of time with the current pango performance levels. If you have to lay out the _entire_ worksheet before displaying, then yes, that's probably going to be too slow in some cases. But you'd have to make pango *much* faster before that wasn't true. (Layout speed is ~100k character/sec on a 700mhz machine) I very much doubt that useful short-circuiting is possible. Short-circuiting means adding more layers of abstraction and more complexity, and if the Pango code has a problem, its too many layers of complexity and too much complexity. There may be optimizations that can be done. I'm happy to take bug reports about them -- e.g. bug 104495. But a "it's too slow bug" does me no good.
We're only doing layout for the visible region. On the order of 600 cells per area. Doing the entire spreadsheet would be suicidal. Our general approach seems sane. We cache 1 pangolayout per cell, and generate it the first time the cell is displayed. This is not just pulling it out of the air. Quantify is clearly pointing out something in there. Here's a cummulative break down of the redraw loop. 83.17% scg_make_cell_visible 83.17% gtk_layout_adjustment_changed 83.17% gtk_window_key_press_event 53.90% cell_draw -- I expect this to be the bottleneck 51.53% gdk_draw_layout 51.11% gdk_draw_layout_with_colors 46.10% cell_render_value 46.10% row_calc_spans 46.10% rendered_value_new 45.11% pango_layout_get_extents_internal 41.97% pango_layout_get_extents 40.86% gdk_draw_layout_line_with_colors 39.50% pango_layout_get_pixel_size 37.79% pango_layout_check_lines -- this is painful 21.81% pango_itemize 18.46% gdk_draw_glyphs 18.08% gdk_window_draw_glyphs 16.83% gdk_window_process_all_updates
I can't spend more time closing this bug... Performance improvement in Pango is always a good thing, sure. If laying one screenful of text is hideously slow, though, there is a gnumeric bug.
> "It's too slow" is simply not an acceptable bug report. Owen, you are acting overly defensive. > A bug report should point to a concrete action. What? Rubbish. A bug report should point out of concrete problem. "Actions" are in the solution department, not in the reporting department. Owen, I can understand that you like to keep beautiful, general code. Special-casing should not be done without cause. I think your resistance is rooting in only testing with small cases. Maybe only a few hundred layouts, on a local display, with state- of-the-art cpu. And I bet most of those layouts are static and don't need to be redrawn in the common case. Gnumeric has 400-600 layouts *per* *screen*. Where you see a response time of (say) 0.05s, we see (say) 0.2s, just because we have more layouts. Further, we get hit again after a scroll-down press. That's where "painful" comes in. (Alternatively we could take the whole hit during load of a spreadsheet. That adds tens of seconds or minutes to loading big sheets.) Here'd s general optimization you might consider. If the final fragment fits completely, don't go around and look at breakpoints. Cut-and-pasted patch below. I have no idea how much it saves. Index: pango/pango-layout.c =================================================================== RCS file: /cvs/gnome/pango/pango/pango-layout.c,v retrieving revision 1.97 diff -u -r1.97 pango-layout.c --- pango/pango-layout.c 20 Dec 2002 17:18:51 -0000 1.97 +++ pango/pango-layout.c 5 Feb 2003 17:53:50 -0000 @@ -2734,7 +2734,8 @@ switch (result) { case BREAK_ALL_FIT: - if (can_break_in (layout, state->start_offset, old_num_chars, first_item_in_line)) + if (state->items->next && + can_break_in (layout, state->start_offset, old_num_chars, first_item_in_line)) { have_break = TRUE; break_remaining_width = old_remaining_width;
I think you are missing something here: pango_layout_check_lines *IS* pango for all practical purposes. It's the "Do the layout" function. So, your bug is saying "Pango is slow", a bug that can never be closed because it will always be too slow for someone. It's not a sufficiciently detailed analysis of the problem. (And for that reason, please file your last patch as a separate bug, I'll just miss it on this bug.)
Created attachment 14687 [details] [review] Reduce pango cpu consumption by 70%+ (conservatively)
With this patch, layouts that are very simple -- all ascii and no whitespace -- become as fast as you would expect. Notably that means layouts containing numbers. To activate, set PANGO_TRIVIAL=1. (There are lots of other improvements in there that are not affected by PANGO_TRIVIAL.)
You obviously have spent a lot of time on this patch. So, I feel bad at repeating what I said above: I very much doubt that useful short-circuiting is possible. Short-circuiting means adding more layers of abstraction and more complexity, and if the Pango code has a problem, its too many layers of complexity and too much complexity. OK, perhaps you've proved that short-circuiting that improves performance is possible, but you've done much violence to the Pango code base in the process. (Also, if you are benchmarking with the pango-x backend, that really is pretty irrelevant at this point... to respond to a question in some other bug report, if you turn off antialiasing, the remote-X-server performance of Xft without RENDER should be a lot more competitive with the core font protocol.) The assumption that "if text is ascii-only has no spaces, there is no interaction between characters" is not particular well justified either. It's very likely that we'll support the TrueType kerning table in basic-xft in the future, for example.
The improvements in pangox are only a few percent, cpu-wise. Why don't use start by cherry-picking from the patch? There are plenty of fixes and improvement that don't depent on PANGO_TRIVIAL and thus doesn't change a pixel on the screen.
My point about the X backend is that the primary virtue of the trivial stuff seems to be avoiding the full shaper machinery, and the shaper machinery is entirely different (and faster!) for the fontconfig based backends.
I get your point, but I am telling you that it is orthogonal to about half the improvements in that patch. Really, the caching of the first run's extents is a huge win because the current code ends up calculating the same thing over and over again. (And it's not as easy as it sounds to stop it from doing that.) Likewise, glyphstrings are being realloc'ed umteen times because they start out that size 1 -- that's totally pointless. And pango_layout_set_text walks over the string a gazillion times. And font descriptions get copied every single time a layout is laid out. (And they contain a string, so that's expensive.) You really have to shake yourself out of that "pango is fast enough" mindset. It isn't, and there are lots of low hanging fruit not related to short-circuiting the shaper.
"Here's a big batch of changes to Pango, some of which I've posted elsewhere, some not. You might want to go through and take some stuff from it." "I don't any information about how much the individual changes speed up Pango, but trust me, it makes things faster" Great. Thanks. I've never seen any of the functions that you are optimizing show up significantly on profiles, so frankly, I'm skeptical that they are more than 1% or so speedups. Yes, 1% + 1% + 1%... eventually gets you to 10%, but it's an excercise for a rainy day, rather than anything that's going to solve major performance problems. I'll attach the benchmark that I used when testing Soeren's approach of caching fontsets for the fontconfig backend. Some things like increasing the initial glyph string allocation are in fact probably obvious; if there are separate bugs for such issues, then I can actually close the bug when they are applied. Other changes in your patch are ???: - new_string->glyphs = g_memdup (string->glyphs, - string->space * sizeof (PangoGlyphInfo)); - new_string->log_clusters = g_memdup (string->log_clusters, - string->space * sizeof (gint)); + new_string->glyphs = g_new (PangoGlyphInfo, string->space); + memcpy (new_string->glyphs, string->glyphs, + string->num_glyphs * sizeof (PangoGlyphInfo)); + + new_string->log_clusters = g_new (gint, string->space); + memcpy (new_string->log_clusters, string->log_clusters, + string->num_glyphs * sizeof (gint));
The latter is under the heading of general bug fixing. You are copying too much memory. Besides being wasteful, it is undefined behaviour. (Spotted visually, not by profile.)
Let me see... Methodology: running gnumeric with a certain sheet that has =rand() in A1:K3000. After start, hit PageDown 16 times fast, wait for redraw, then hit Ctrl-Q. Timing only inside functions rendered_value_new and cell_draw, thus identifying problem areas for sheet browsing, an important area. Gnumeric is being run remotely and a fast LAN with PANGO_TRIVIAL=0 and GDK_USE_XFT=0. Base run: pango + above patch + force all is_first_run to FALSE and comment out ensure_first_run_rects call in pango_layout_get_iter. New run: pango + above patch. Thus we're timing only the effect of caching the first run's extents. The tool in question is Rational Quantify 2003 with -use-machine=UltraSparc:750MHz. (That's not accurate, but good enough.) Timing of syscalls is OFF. Since Gnumeric is a gui program, it's very hard to create 100% exact identical runs, but the diffs below show we are extremely close. CONSLUSION: caching first run's extents alone speeds up the run by 11%. (The 0.04s does not seem accurate, but that's what it says.) Since this includes whatever time Gnumeric spends on its own behalf, the pango speedup is somewhat higher. Owen: all this just to show you that I am not doing voodoo optimization. Give me a little more trust that those "rainy day" excuses above. Differences between: program /home/welinder/private/gnome/gnumeric/src/gnumeric (pid 26827) collected on UltraSparc (750 MHz) by Quantify version 2003.06.00 and program /home/welinder/private/gnome/gnumeric/src/gnumeric (pid 18090) collected on UltraSparc (750 MHz) by Quantify version 2003.06.00 . Function name Calls Cycles % change ! pango_x_find_glyph -253392 -7094976 52.38% faster ! pango_x_font_get_glyph_extents -253392 -6841584 52.38% faster ! pango_glyph_string_extents_range -31674 -6081408 63.17% faster ! pango_x_get_per_char -253392 -5067840 44.75% faster ! pango_font_get_glyph_extents -253392 -3040704 52.38% faster ! pango_layout_run_get_extents -31674 -1050296 74.96% faster ! unknown_static_function[libpangox-1.0.so.0/1] -506784 -1013568 44.43% faster ! Letext[libpango-1.0.so.0/5] -253392 -506784 41.01% faster ! pango_layout_get_item_properties -31674 -316740 60.55% faster ! pango_glyph_string_extents -31674 -190044 63.17% faster ! pango_layout_get_iter 0 -122808 26.40% faster ! unknown_static_function[libpango-1.0.so.0/61] -31674 -63348 52.38% faster ! frexp -65 -4130 0.55% faster ! ldexp 0 -3055 0.81% faster ! _XSend -5 -400 55.56% faster ! thr_main -6 -72 25.00% faster ! writev -5 -70 55.56% faster ! ___errno -6 -54 25.00% faster ! _XFlushInt -1 -50 6.67% faster ! _writev -5 -25 55.56% faster ! _X11TransWritev -5 -20 55.56% faster ! _X11TransSocketWritev -5 -15 55.56% faster ! write -1 -14 6.67% faster ! XDrawString16 0 -6 0.00% faster ! _write -1 -5 6.67% faster ! _X11TransWrite -1 -4 6.67% faster ! _XFlush -1 -3 6.67% faster ! _X11TransSocketWrite -1 -3 6.67% faster ! _XSetClipRectangles 0 22 0.00% slower ! cleanfree 1 39 0.00% slower ! memcpy 5 150 0.02% slower ! modf 0 585 0.12% slower ! realfree 32 608 0.01% slower ! t_delete 70 1647 0.09% slower ! _malloc_unlocked 0 3652 0.04% slower ! t_splay 329 10818 0.62% slower ! update_run 0 89650 23.40% slower ! pango_layout_line_get_extents 0 633960 49.61% slower - gtk_entry_motion_notify -0 -0 39 differences; -30656895 cycles (-0.041 secs at 750 MHz) 11.25% faster overall (including system calls).
Re. filing separate bugs, what is that good for when your attitude is so stronly against even thinking of the possiblity of performance problems? But hey, here are some old ones to start with. http://bugzilla.gnome.org/show_bug.cgi?id=105500 http://bugzilla.gnome.org/show_bug.cgi?id=106892 http://bugzilla.gnome.org/show_bug.cgi?id=107138 http://bugzilla.gnome.org/show_bug.cgi?id=106925 http://bugzilla.gnome.org/show_bug.cgi?id=101997 http://bugzilla.gnome.org/show_bug.cgi?id=106973 http://bugzilla.gnome.org/show_bug.cgi?id=94718 http://bugzilla.gnome.org/show_bug.cgi?id=106942
I don't know ... any benchmark that involves pressing keys, remote displays, network latency, has a definite tinge of vodoo about it to me. My experience with profiling tools is that they are useful, but seldom should be trusted without a solid "does this optimization make the wall clock time for this reproducible operation faster" But anyways the reason I want separate patches is precisely _because_ I don't necessarily want to context switch and handle your patches right away. I'm not in patching Pango/ patching GTK+ mode right now ... I'm trying to get a start on GTK+-2.4 API feature work. About my "attitude" about performance problems, I'm perfectly willing to accept that there may be performance improvements possible, my objection to this bug is the blatant procedural stupidity of such a bug report: - When are you going to close this bug? Should we title it "Pango too slow for Morten"? - How the heck am I going to keep track whether fixes you propose in here are applied or not? It's a recipe for lost work.
*** Bug 158991 has been marked as a duplicate of this bug. ***
Jody told me to run gnumeric/src/test-pango. Here are the results as a sysprof file, and the sysprof screenshot.
Created attachment 54487 [details] gnumeric-test-pango.sysprof.gz
Created attachment 54490 [details] gnumeric-test-pango.png Note that [heap] and the X server in general take up more than 50% of the time. The uppermost pango function in the profile is pango_renderer_draw_layout(), at 14.70% of the time. So, the X server is the biggest culprit here. It's probably the sucky implementation of the RENDER extension. I'd like to close this bug as INVALID or NOTGNOME, and have some more updated bugs for Pango performance. What do you think?
I'm having flashbacks... about 8 months ago I spend a fortnight optimising Xft/RENDER/kdrive. I wonder if Xorg has any of the performance problems kdrive had. Has anyone profiled this with a modular X?
Before you close it, could you try taking the X server out of the loop by displaying remotely?
Created attachment 54506 [details] gnumeric-test-pango-remote-x.sysprof.gz Sysprof log from using a remote X server.
Created attachment 54507 [details] gnumeric-test-pango-remote-x.png Fix Cairo!
Since this has become a bug about Gnumeric's specific use of Pango, I'll retitle the bug.
About comment #23 - Ross, do you have those patches around? I'd love to test them on my X server.
Federico: the Xft patches were committed, but the kdrive patches were rather specific to kdrive so won't apply to xorg. Basically the composite glyphs implementation created a small picture for each glyph, composited those into another picture, which was then composited onto the destination. Mallum rewrote this to composite directly onto the destination, leading to (IIRC) a 30% speed increase.
Time to close this I believe... Thanks to all for work on various improvements that started from this bug.