After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 567389 - 1. performance in comparision to openoffice_3.1_m38 / 2. problem of "VLOOKUP" function / 3. export problem to the native *.gnumeric file format
1. performance in comparision to openoffice_3.1_m38 / 2. problem of "VLOOKUP"...
Status: RESOLVED FIXED
Product: Gnumeric
Classification: Applications
Component: General
1.9.x
Other All
: Normal normal
: ---
Assigned To: Morten Welinder
Jody Goldberg
Depends on: 326595 569730
Blocks:
 
 
Reported: 2009-01-11 17:55 UTC by Nihil Baxter
Modified: 2009-02-01 17:32 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Openoffice *.sxc sample file for gnumetric / cell "R9" with vlookup fails if range value is higher 2076 / column "L" fails after saving as *.gnumetric file format (739.74 KB, application/octet-stream)
2009-01-11 18:50 UTC, Nihil Baxter
  Details
Patch to try (1.83 KB, patch)
2009-01-22 19:33 UTC, Morten Welinder
committed Details | Review
Patch to cache vlookup/hlookup/match ranges (19.53 KB, patch)
2009-01-27 21:32 UTC, Morten Welinder
none Details | Review

Description Nihil Baxter 2009-01-11 17:55:04 UTC
Hello friends of gnumeric,

1. PERORMANCE ISSUE

I found gnumeric when I looked for a faster alternative to openoffice. But my openoffice seems to be faster than gnumeric-1.9.3 compiled on sabayon 4.0 64bit kernel 2.6.27.

For performance comparison I took one of my large spreadsheats ( *.sxc 1,7MB / saved as *.xls 22,9MB) saved it in the *.sxc file format (because the odf 1.2 version of *.ods is not yet fully supported by gnumeric)

The following comparison cannot be generalized but it might correspond to your experiences.

notebook celeron M550 @2000MHz, 2,5GB


CHANGING  INPUT VALUE FOR CALCULATION:

GNUMERIC
85seconds, memory consumption 386MB

OPENOFFICE
9seconds, memory consumption 1006MB


RECALCULATION F9:

GNUMERIC
21seconds

OPENOFFICE
9seconds


This shows that the time for recalculation (F9) and the time for changing a input value is the same for openoffice (in this case). Gnumeric needs 21s for Recalculation but 85seconds for changing the input value.

What does Gnumeric do after the first 21seconds? CPU-usage is always 100%


2. PROBLEM OF "VLOOKUP" FUNCTION

Almost everything was calculated correctly but there was a mysteric phenomenon concerning the vlookup function.

It was not possible to define a higher value for the matrix legth then 2076. If I would increase the Value >=2077 the result would be always "0"

=vlookup(max(Q14:Q2076);Q14:R2076;2;1)



3. EXPORT PROBLEM TO NATIVE *.GNUMERIC FILE FORMAT

The import filter seems to work quiet good for *.sxc files.

The following function works correctly after loading the *.sxc file with GNUMERIC.

=if(or(K$10=0;K$13=0);J16;if(or(and(J15<>0;J15<>" ");L15=1;L15=0);if(I16>$G16;" ";if(and((($G16/(indirect("g"&row(J16)-(row(J16)-max(K$14:K16)))))-1)*100>K$10;sum(L15:indirect("L"&(row(J15)-(row(J15)-max(K$14:K15)))))>(K$13-2));$G16;if((($G16/(indirect("g"&(row(J16)-(row(J16)-max(K$14:K16))))))-1)*100>K$10;1;0)));" "))


After saving the file in the "MS Excel (tm) 97/2000/XP u. 5.0/95" file format the function has got a lot of additional brackets but is still working correctly!

=if(or(K$10=0;K$13=0);J17;if(or(and(J16<>0;J16<>" ");L16=1;L16=0);if(I17>$G17;" ";if(and((((((((((((((((($G17/((((((((indirect("g"&row(J17)-((((((((row(J17)-max(K$14:K17))))))))))))))))))))))))))-1))))))))*100>K$10;sum(L16:indirect("L"&((((((((row(J16)-((((((((row(J16)-max(K$14:K16)))))))))))))))))))>((((((((K$13-2)))))))));$G17;if((((((((((((((((($G17/((((((((indirect("g"&((((((((row(J17)-((((((((row(J17)-max(K$14:K17))))))))))))))))))))))))))))))))))-1))))))))*100>K$10;1;0)));" "))
This bug was already discribed in 
http://bugzilla.gnome.org/show_bug.cgi?id=567380 

After saving the file in the *.GNUMERIC file format the function looks like the original function of the *.sxc file but shows the error "#BEZUG!"

=if(or(K$10=0;K$13=0);J16;if(or(and(J15<>0;J15<>" ");L15=1;L15=0);if(I16>$G16;" ";if(and((($G16/(indirect("g"&row(J16)-(row(J16)-max(K$14:K16)))))-1)*100>K$10;sum(L15:indirect("L"&(row(J15)-(row(J15)-max(K$14:K15)))))>(K$13-2));$G16;if((($G16/(indirect("g"&(row(J16)-(row(J16)-max(K$14:K16))))))-1)*100>K$10;1;0)));" "))


NICE GREETINGS
Comment 1 Nihil Baxter 2009-01-11 18:50:38 UTC
Created attachment 126226 [details]
Openoffice *.sxc sample file for gnumetric / cell "R9" with vlookup fails if range value is higher 2076 / column "L" fails after saving as *.gnumetric file format

Openoffice *.sxc sample file for gnumetric / cell "R9" with vlookup fails if range value is higher 2076 / column "L" fails after saving as *.gnumetric file format
Comment 2 Morten Welinder 2009-01-11 21:18:27 UTC
Re 2:

The formula in cell R9 appears to be invalid.  When the fourth argument
to VLOOKUP is true (you use "1") then the range to be searched in must
be sorted.  It's not: the Q column contains a few numbers with lots of
strings in-between.

I'm fairly certain that invokes undefined behaviour.
Comment 3 Nihil Baxter 2009-01-12 08:29:41 UTC
(In reply to comment #2)
> Re 2:
> 
> The formula in cell R9 appears to be invalid.  When the fourth argument
> to VLOOKUP is true (you use "1") then the range to be searched in must
> be sorted.  It's not: the Q column contains a few numbers with lots of
> strings in-between.
> 
> I'm fairly certain that invokes undefined behaviour.
> 

Re 3:

The numbers in the "Q" column are sorted. The "Q" column is my "counter" column. Therefore I decided to take the "1" as fourth argument instead of the "0" for an unsorted column. I hoped the performance would be better. The cells with the " " strings should be ignored by VLOOLUP. 


BUT THE FOLLOWING BEHAVIOUR MAKE ME WONDER:

My spreadsheet contains 1676 rows. If the range of the VLOOKUP function is up to 2076 everything works good. But if the range of VLOOKUP is exceeding my spreadsheed rows by more then 400 rows it fails. 
This is strange isn't it? There must be a systematic error in the VLOOKUP function itself.

Before the VLOOKUP range was up to 10000 rows and everyting was working when I used openoffice. I used this large range in order to be able to increase the number of rows in future without the need of adjusting the function in cell "R9" 
But as a workarount for gnumeric I could integrate a dynamic "last-row- adjustment" in the VLOOKUP function. This should solve the problem in my case.
Comment 4 Morten Welinder 2009-01-12 14:02:18 UTC
> The cells with the " " strings should be ignored by VLOOLUP. 

The docs say the range should be sorted.  That includes the strings:

http://office.microsoft.com/en-us/excel/HP052093351033.aspx

    The values in the first column of table_array must be placed in
    ascending sort order; otherwise, VLOOKUP may not give the correct
    value. You can put the values in ascending order by choosing the
    Sort command from the Data menu and selecting Ascending. For more
    information, see [Default sort orders]

"Default sort orders" is defined as all strings being sorted after any
number.  (It's a javascript link, so you'll have to go there yourself.)

> There must be a systematic error in the VLOOKUP function itself.

I doubt it.  We are simply using bisection search under the assumption that
the range is sorted.  And using bisection search is the only reason for
demanding a sorted range in the first place.  As the Excel documentation
states, if you violate that assumption you should expect bogus results.

Now, it would be useful if you can come up with a reference that claims that
vlookup ought to work on a range like you have, I am interested.
Comment 5 Morten Welinder 2009-01-12 14:08:01 UTC
"HLOOKUP, VLOOKUP, LOOKUP return incorrect values in Excel":
http://support.microsoft.com/kb/181201
Comment 6 Morten Welinder 2009-01-12 19:45:57 UTC
Note, that we actually hit a problem in libxml2 when reading this file.

If you see, "Detected an entity reference loop" then you are probably seeing
the problem described at https://bugzilla.redhat.com/show_bug.cgi?id=460396
Comment 7 Morten Welinder 2009-01-15 16:08:40 UTC
Performance-wise two things are going on here.

1. Gnumeric always evaluates both the "then" and "else" braches of an IF.
   That causes trouble in your R column.

   (There are reasons why we do it.  They have to do with semantics of
   implicit iteration.  They are not awfully good reasons.)

2. Gnumeric, Excel, and OpenOffice would all benefit if you did you use
   formulas like MAX(Q$14:Q20) copied into a tall column.  Instead, make
   a column, say S, where S14=Q14 and S15=MAX(S14,Q15) and copy S15 down.
   NOTE: you might need to use ";" instead of "," in your locale.
   Do the same thing for the max over the R column data.

   Spreadsheets are too dumb to realize what you are doing here and the
   work involved is of size 1+2+3+....+n, i.e., quadratic.

Gnumeric works fairly well with INDIRECT, but Excel and OO do not, so you
might also want to consider using INDEX instead.
Comment 8 Morten Welinder 2009-01-15 20:57:55 UTC
The both-branches-of-if is bug 326595.
Comment 9 Morten Welinder 2009-01-16 18:17:53 UTC
Preliminary tests show that when the "if" problem is solved, recalc time
drops to 2-3s.
Comment 10 Nihil Baxter 2009-01-17 17:09:30 UTC
(In reply to comment #9)
> Preliminary tests show that when the "if" problem is solved, recalc time
> drops to 2-3s.
> 

I just tested your patch 

Created an attachment (id=126603)

http://bugzilla.gnome.org/show_bug.cgi?id=326595

The recalc time (F9) is now too short for a manual time measurement with my stop watch. I guess it is less than 1 second. Before it took 9 seconds.

WOW!

The recalc time after changing an input value is shorter, too. But it still takes 75 seconds (85 seconds without the patch) 

Is there a possibility disable the automatic recalculation after changing an input value? 
I know this option from Openoffice.

Comment 11 Andreas J. Guelzow 2009-01-19 05:35:35 UTC
You can disable automatic recalculation in Format->Workbook, callculation tab
Comment 12 Morten Welinder 2009-01-20 16:37:29 UTC
When I change the "tolerance" field, I am seeing a long time spent tracking
which cells must be recalculated as a consequence.  It is not surprising that
there are cases that make our tracking take a long time, but a bit worrisome
that it happens in practice.
Comment 13 Morten Welinder 2009-01-22 19:33:21 UTC
Created attachment 127030 [details] [review]
Patch to try

This seems to help somewhat.  Numbers for you setup would be interesting,
so please try it.
Comment 14 Morten Welinder 2009-01-23 00:19:52 UTC
For the record, on my machine, changing tolerance from 3 to 2 used to take
~40s.  It now takes ~4s.
Comment 15 Nihil Baxter 2009-01-25 21:40:37 UTC
That's it! Gnumeric is more then 10 times faster and even faster then Openoffice on my machine. It took just 7,5seconds for the calculation after changing the value for "GLD langfristig" (85seconds before) The performance item seems to be solved. 

This result is based on gnumeric-1.9.4 SVN (2009-01-24) with the patched ("patch to try") dependend.c from gnumeric 1.9.3. 

The original dependend.c of gnumeric-1.9.4 was as slow as before. Obviously the patch is not yet included.

Further I saw that the problem with the "Extra parentheses introduced when writing xls" is solved, too.
http://bugzilla.gnome.org/show_bug.cgi?id=567380

When I tried to insert a X-Y-chart I noticed a new option in the data section of the chart tool "Show in Legend". This option seems not to work yet. I guess the legend, the discription of X and Y-axis, the grid of X and Y-axis and the titel of the diagramm is planned for future versions. (Or did I not found these options?)
Comment 16 Morten Welinder 2009-01-26 02:03:20 UTC
Ok, performance patch committed.

Please file graph issue as a separate bug -- it'll probably need to be
handled by different people and this bug is getting a bit big as it is.
Comment 17 Morten Welinder 2009-01-27 21:32:51 UTC
Created attachment 127360 [details] [review]
Patch to cache vlookup/hlookup/match ranges

Patch to try.  This caches pre-processed vlookup ranges and, as a side effect,
probably moves vlookup closer to the semantics you desire: strings are kicked
out early.
Comment 18 Morten Welinder 2009-01-29 21:38:38 UTC
A variant of the cache patch committed.

The .gnumeric problem is bug 569730.

What's left here?
Comment 19 Nihil Baxter 2009-01-31 13:04:10 UTC
I have some trouble while patching the functions.c file. I am neither able to patch the latest function.c file from svn nor the function.c file from version 1.9.3

And the manual implementation of the changes would take a very long time.                                                                     
I used the patch command without any options 
"patch original-functions.c patch-functions.c"  
Comment 20 Morten Welinder 2009-01-31 14:42:04 UTC
patch is a strange, old program...

cd gnumeric-top-level
patch -p0 <patchfile

However, all the patches are -- in one form or another -- in svn, so it might
be easier to simply try that.
Comment 21 Nihil Baxter 2009-02-01 14:14:54 UTC
Re
"2. PROBLEM OF "VLOOKUP" FUNCTION"

Up to now I have not recognized any different behaviour of the VLOOKUP function. 

I still get these "bogus results" if I extend the range above 2076 while using the "1" as fourth parameter of the VLOOKUP function

But in In my special I can use "0" as fourth parameter of the VLOOKUP function (unsorted list, search for excact value) as a work around. With "0" it is possible to extend the range up to the last line of the spreadsheet.

Before I used "1" as fourth parameter in the hope for a better performance. But there is no difference concerning performance

I used the functions.c (2009-01-30) file from 
http://svn.gnome.org/svn/gnumeric/trunk/plugins/fn-lookup/


Re
"3. EXPORT PROBLEM TO NATIVE *.GNUMERIC FILE FORMAT"

seems to be solved! No problems detected anymore


I THINK THERE IS NOTHING LEFT HERE. ALL THE PROBLEMS HAVE BEEN SOLVED. GOOD JOB!
Comment 22 Morten Welinder 2009-02-01 15:53:19 UTC
Retry the vlookup, please -- empty cells were being treated as zero.
Fixed in svn.

This problem has been fixed in the development version. The fix will be available in the next major software release. Thank you for your bug report.
Comment 23 Morten Welinder 2009-02-01 16:00:28 UTC
Note, that due to the pre-processings, the forget-about-sorting case
("FALSE" as third argument) is likely to be the fastest, but that very,
very little time is spent in calculations, so it will not matter anyway.
Comment 24 Nihil Baxter 2009-02-01 17:32:59 UTC
Re:
#Retry the vlookup, please -- empty cells were being treated as zero.
#Fixed in svn.

Now the VLOOKUP function works like the VLOOKUP function of Oo. The range can be extended above 2076 rows. 

First I thought if empty cells were being treated as zero it would not be possible to search for a zero-"0"-value in between of empty cells with the VLOOKUP function. 

But nevertheless it works even for this special case.