GNOME Bugzilla – Bug 525875
Optimize evaluation involving large ranges
Last modified: 2008-04-11 21:54:46 UTC
Functions like VLOOKUP are often used a large number of times for the same very large area. When the linear- search form of VLOOKUP is used, that is pretty horrible. 1. In collect.c, cache common, large areas. Patch in hand for gnm_float; still needs strings. 2. Add collect flag to use NaN for would-be ignored items. (NULL for strings.) 3. Optimize VLOOKUP's and HLOOKUP's linear modes to evaluate the whole range -- now cached -- and search in the result. 4. If that is not fast enough, we might have to cache a stable sort of the ranges. Then we can binary search in there.
the strip_missing routine in src/tools/analysis_tools.c has serious performance issues too, to see that, just use a whole column.
My efforts is a bit of a bust. The solution might have to become local to [VH]LOOKUP. Are you sure that strip_missing is to blame? I would guess any performance problems here would be because the collect callback is called for each and every empty cell.
I was wrong, the slow one is union_of_int_sets.
Oh. That's horrible. I can make things linear easily.
Created attachment 108834 [details] [review] Massive [VH]LOOKUP speed improvement This speeds up the linear string equality case of VLOOKUP and HLOOKUP, and possibly some MATCH case too. Complexity is changed from linear search with piles of g_utd8_casefold calls to a simple hash lookup. This is achieved by caching the result of converting the range the first time it is observed. (This patch will thus only give a speedup if the same range is used multiple times. That is quite common, though.) Recalc of OO's i81336_vlookup.ods sample goes from ~300s to <1s on my machine.
Created attachment 108835 [details] [review] Patch to try Jean, this untested patch ought to make union_of_int_sets as fast as it is ever going to be. strip_missing also looked like it was O(n^2), so I fixed that too. The code assumes that things are sorted and keeps them that way.
That's the assumption I made when I fixed the union_of_int_sets in collect.c last year (as far as I remember, I also changed the name), but I did not touch the analysis-tool.c instance because I was not certain that data were always sorted there.
Created attachment 108860 [details] [review] Improved patch to try There is no need for two copies of that code. This uses the best versions of the two functions.
This problem has been fixed in our software repository. The fix will go into the next software release. Thank you for your bug report.
*** Bug 311816 has been marked as a duplicate of this bug. ***