GNOME Bugzilla – Bug 567389
1. performance in comparision to openoffice_3.1_m38 / 2. problem of "VLOOKUP" function / 3. export problem to the native *.gnumeric file format
Last modified: 2009-02-01 17:32:59 UTC
Hello friends of gnumeric, 1. PERORMANCE ISSUE I found gnumeric when I looked for a faster alternative to openoffice. But my openoffice seems to be faster than gnumeric-1.9.3 compiled on sabayon 4.0 64bit kernel 2.6.27. For performance comparison I took one of my large spreadsheats ( *.sxc 1,7MB / saved as *.xls 22,9MB) saved it in the *.sxc file format (because the odf 1.2 version of *.ods is not yet fully supported by gnumeric) The following comparison cannot be generalized but it might correspond to your experiences. notebook celeron M550 @2000MHz, 2,5GB CHANGING INPUT VALUE FOR CALCULATION: GNUMERIC 85seconds, memory consumption 386MB OPENOFFICE 9seconds, memory consumption 1006MB RECALCULATION F9: GNUMERIC 21seconds OPENOFFICE 9seconds This shows that the time for recalculation (F9) and the time for changing a input value is the same for openoffice (in this case). Gnumeric needs 21s for Recalculation but 85seconds for changing the input value. What does Gnumeric do after the first 21seconds? CPU-usage is always 100% 2. PROBLEM OF "VLOOKUP" FUNCTION Almost everything was calculated correctly but there was a mysteric phenomenon concerning the vlookup function. It was not possible to define a higher value for the matrix legth then 2076. If I would increase the Value >=2077 the result would be always "0" =vlookup(max(Q14:Q2076);Q14:R2076;2;1) 3. EXPORT PROBLEM TO NATIVE *.GNUMERIC FILE FORMAT The import filter seems to work quiet good for *.sxc files. The following function works correctly after loading the *.sxc file with GNUMERIC. =if(or(K$10=0;K$13=0);J16;if(or(and(J15<>0;J15<>" ");L15=1;L15=0);if(I16>$G16;" ";if(and((($G16/(indirect("g"&row(J16)-(row(J16)-max(K$14:K16)))))-1)*100>K$10;sum(L15:indirect("L"&(row(J15)-(row(J15)-max(K$14:K15)))))>(K$13-2));$G16;if((($G16/(indirect("g"&(row(J16)-(row(J16)-max(K$14:K16))))))-1)*100>K$10;1;0)));" ")) After saving the file in the "MS Excel (tm) 97/2000/XP u. 5.0/95" file format the function has got a lot of additional brackets but is still working correctly! =if(or(K$10=0;K$13=0);J17;if(or(and(J16<>0;J16<>" ");L16=1;L16=0);if(I17>$G17;" ";if(and((((((((((((((((($G17/((((((((indirect("g"&row(J17)-((((((((row(J17)-max(K$14:K17))))))))))))))))))))))))))-1))))))))*100>K$10;sum(L16:indirect("L"&((((((((row(J16)-((((((((row(J16)-max(K$14:K16)))))))))))))))))))>((((((((K$13-2)))))))));$G17;if((((((((((((((((($G17/((((((((indirect("g"&((((((((row(J17)-((((((((row(J17)-max(K$14:K17))))))))))))))))))))))))))))))))))-1))))))))*100>K$10;1;0)));" ")) This bug was already discribed in http://bugzilla.gnome.org/show_bug.cgi?id=567380 After saving the file in the *.GNUMERIC file format the function looks like the original function of the *.sxc file but shows the error "#BEZUG!" =if(or(K$10=0;K$13=0);J16;if(or(and(J15<>0;J15<>" ");L15=1;L15=0);if(I16>$G16;" ";if(and((($G16/(indirect("g"&row(J16)-(row(J16)-max(K$14:K16)))))-1)*100>K$10;sum(L15:indirect("L"&(row(J15)-(row(J15)-max(K$14:K15)))))>(K$13-2));$G16;if((($G16/(indirect("g"&(row(J16)-(row(J16)-max(K$14:K16))))))-1)*100>K$10;1;0)));" ")) NICE GREETINGS
Created attachment 126226 [details] Openoffice *.sxc sample file for gnumetric / cell "R9" with vlookup fails if range value is higher 2076 / column "L" fails after saving as *.gnumetric file format Openoffice *.sxc sample file for gnumetric / cell "R9" with vlookup fails if range value is higher 2076 / column "L" fails after saving as *.gnumetric file format
Re 2: The formula in cell R9 appears to be invalid. When the fourth argument to VLOOKUP is true (you use "1") then the range to be searched in must be sorted. It's not: the Q column contains a few numbers with lots of strings in-between. I'm fairly certain that invokes undefined behaviour.
(In reply to comment #2) > Re 2: > > The formula in cell R9 appears to be invalid. When the fourth argument > to VLOOKUP is true (you use "1") then the range to be searched in must > be sorted. It's not: the Q column contains a few numbers with lots of > strings in-between. > > I'm fairly certain that invokes undefined behaviour. > Re 3: The numbers in the "Q" column are sorted. The "Q" column is my "counter" column. Therefore I decided to take the "1" as fourth argument instead of the "0" for an unsorted column. I hoped the performance would be better. The cells with the " " strings should be ignored by VLOOLUP. BUT THE FOLLOWING BEHAVIOUR MAKE ME WONDER: My spreadsheet contains 1676 rows. If the range of the VLOOKUP function is up to 2076 everything works good. But if the range of VLOOKUP is exceeding my spreadsheed rows by more then 400 rows it fails. This is strange isn't it? There must be a systematic error in the VLOOKUP function itself. Before the VLOOKUP range was up to 10000 rows and everyting was working when I used openoffice. I used this large range in order to be able to increase the number of rows in future without the need of adjusting the function in cell "R9" But as a workarount for gnumeric I could integrate a dynamic "last-row- adjustment" in the VLOOKUP function. This should solve the problem in my case.
> The cells with the " " strings should be ignored by VLOOLUP. The docs say the range should be sorted. That includes the strings: http://office.microsoft.com/en-us/excel/HP052093351033.aspx The values in the first column of table_array must be placed in ascending sort order; otherwise, VLOOKUP may not give the correct value. You can put the values in ascending order by choosing the Sort command from the Data menu and selecting Ascending. For more information, see [Default sort orders] "Default sort orders" is defined as all strings being sorted after any number. (It's a javascript link, so you'll have to go there yourself.) > There must be a systematic error in the VLOOKUP function itself. I doubt it. We are simply using bisection search under the assumption that the range is sorted. And using bisection search is the only reason for demanding a sorted range in the first place. As the Excel documentation states, if you violate that assumption you should expect bogus results. Now, it would be useful if you can come up with a reference that claims that vlookup ought to work on a range like you have, I am interested.
"HLOOKUP, VLOOKUP, LOOKUP return incorrect values in Excel": http://support.microsoft.com/kb/181201
Note, that we actually hit a problem in libxml2 when reading this file. If you see, "Detected an entity reference loop" then you are probably seeing the problem described at https://bugzilla.redhat.com/show_bug.cgi?id=460396
Performance-wise two things are going on here. 1. Gnumeric always evaluates both the "then" and "else" braches of an IF. That causes trouble in your R column. (There are reasons why we do it. They have to do with semantics of implicit iteration. They are not awfully good reasons.) 2. Gnumeric, Excel, and OpenOffice would all benefit if you did you use formulas like MAX(Q$14:Q20) copied into a tall column. Instead, make a column, say S, where S14=Q14 and S15=MAX(S14,Q15) and copy S15 down. NOTE: you might need to use ";" instead of "," in your locale. Do the same thing for the max over the R column data. Spreadsheets are too dumb to realize what you are doing here and the work involved is of size 1+2+3+....+n, i.e., quadratic. Gnumeric works fairly well with INDIRECT, but Excel and OO do not, so you might also want to consider using INDEX instead.
The both-branches-of-if is bug 326595.
Preliminary tests show that when the "if" problem is solved, recalc time drops to 2-3s.
(In reply to comment #9) > Preliminary tests show that when the "if" problem is solved, recalc time > drops to 2-3s. > I just tested your patch Created an attachment (id=126603) http://bugzilla.gnome.org/show_bug.cgi?id=326595 The recalc time (F9) is now too short for a manual time measurement with my stop watch. I guess it is less than 1 second. Before it took 9 seconds. WOW! The recalc time after changing an input value is shorter, too. But it still takes 75 seconds (85 seconds without the patch) Is there a possibility disable the automatic recalculation after changing an input value? I know this option from Openoffice.
You can disable automatic recalculation in Format->Workbook, callculation tab
When I change the "tolerance" field, I am seeing a long time spent tracking which cells must be recalculated as a consequence. It is not surprising that there are cases that make our tracking take a long time, but a bit worrisome that it happens in practice.
Created attachment 127030 [details] [review] Patch to try This seems to help somewhat. Numbers for you setup would be interesting, so please try it.
For the record, on my machine, changing tolerance from 3 to 2 used to take ~40s. It now takes ~4s.
That's it! Gnumeric is more then 10 times faster and even faster then Openoffice on my machine. It took just 7,5seconds for the calculation after changing the value for "GLD langfristig" (85seconds before) The performance item seems to be solved. This result is based on gnumeric-1.9.4 SVN (2009-01-24) with the patched ("patch to try") dependend.c from gnumeric 1.9.3. The original dependend.c of gnumeric-1.9.4 was as slow as before. Obviously the patch is not yet included. Further I saw that the problem with the "Extra parentheses introduced when writing xls" is solved, too. http://bugzilla.gnome.org/show_bug.cgi?id=567380 When I tried to insert a X-Y-chart I noticed a new option in the data section of the chart tool "Show in Legend". This option seems not to work yet. I guess the legend, the discription of X and Y-axis, the grid of X and Y-axis and the titel of the diagramm is planned for future versions. (Or did I not found these options?)
Ok, performance patch committed. Please file graph issue as a separate bug -- it'll probably need to be handled by different people and this bug is getting a bit big as it is.
Created attachment 127360 [details] [review] Patch to cache vlookup/hlookup/match ranges Patch to try. This caches pre-processed vlookup ranges and, as a side effect, probably moves vlookup closer to the semantics you desire: strings are kicked out early.
A variant of the cache patch committed. The .gnumeric problem is bug 569730. What's left here?
I have some trouble while patching the functions.c file. I am neither able to patch the latest function.c file from svn nor the function.c file from version 1.9.3 And the manual implementation of the changes would take a very long time. I used the patch command without any options "patch original-functions.c patch-functions.c"
patch is a strange, old program... cd gnumeric-top-level patch -p0 <patchfile However, all the patches are -- in one form or another -- in svn, so it might be easier to simply try that.
Re "2. PROBLEM OF "VLOOKUP" FUNCTION" Up to now I have not recognized any different behaviour of the VLOOKUP function. I still get these "bogus results" if I extend the range above 2076 while using the "1" as fourth parameter of the VLOOKUP function But in In my special I can use "0" as fourth parameter of the VLOOKUP function (unsorted list, search for excact value) as a work around. With "0" it is possible to extend the range up to the last line of the spreadsheet. Before I used "1" as fourth parameter in the hope for a better performance. But there is no difference concerning performance I used the functions.c (2009-01-30) file from http://svn.gnome.org/svn/gnumeric/trunk/plugins/fn-lookup/ Re "3. EXPORT PROBLEM TO NATIVE *.GNUMERIC FILE FORMAT" seems to be solved! No problems detected anymore I THINK THERE IS NOTHING LEFT HERE. ALL THE PROBLEMS HAVE BEEN SOLVED. GOOD JOB!
Retry the vlookup, please -- empty cells were being treated as zero. Fixed in svn. This problem has been fixed in the development version. The fix will be available in the next major software release. Thank you for your bug report.
Note, that due to the pre-processings, the forget-about-sorting case ("FALSE" as third argument) is likely to be the fastest, but that very, very little time is spent in calculations, so it will not matter anyway.
Re: #Retry the vlookup, please -- empty cells were being treated as zero. #Fixed in svn. Now the VLOOKUP function works like the VLOOKUP function of Oo. The range can be extended above 2076 rows. First I thought if empty cells were being treated as zero it would not be possible to search for a zero-"0"-value in between of empty cells with the VLOOKUP function. But nevertheless it works even for this special case.