GNOME Bugzilla – Bug 703381
Bug with linear regression - Error for small numbers (floating point ?)
Last modified: 2013-07-15 20:46:07 UTC
Created attachment 248126 [details] Calc sheet demonstrating the problem.

I generated data with the following python script:

    import numpy
    x = numpy.arange(1e26, 20e26, 1e26)
    y = 60e-68*x + 5e-41*numpy.random.randn(*x.shape) + 30e-41
    for f, g in zip(x, y):
        print(f, g)

I pasted the data into gnumeric and ran a linear regression on them. The output is not calculated; I suppose the order of magnitude of the coefficients is too extreme. In particular, on regression (1) the intercept is not shown (assumed to be zero) although it is obviously nonzero. I believe this is a bug; maybe gnumeric's floating-point precision is not enough to handle these values.
You can try generating data with a smaller "a" coefficient (here it is 60e-68); I believe this makes things worse and even prevents the program from computing the slope.
For comparison, the results of regression 1 in R are:

    Call:
    lm(formula = data$B ~ data$A)

    Residuals:
           Min         1Q     Median         3Q        Max
    -4.449e-41 -2.537e-41 -1.074e-41  1.235e-41  1.059e-40

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) 3.130e-40  1.858e-41   16.84 4.86e-12 ***
    data$A      5.812e-67  1.630e-68   35.66  < 2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 3.891e-41 on 17 degrees of freedom
    Multiple R-squared: 0.9868,  Adjusted R-squared: 0.986
    F-statistic: 1272 on 1 and 17 DF,  p-value: < 2.2e-16
This example unnecessarily triggers the degenerate-dimension code because we end up with an ill-conditioned matrix with eigenvalues

    e[0] = -4.3588989435406739759
    e[1] = 2.3874672772626644741e+27

There are a number of reasonable things we can do to avoid this. Top of the list is to normalize the columns, either all the way to norm 1 or just to bring each norm into the range [0.5;1[ using a power of 2.
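The effect of the proposed normalization can be sketched in the reporter's own language (numpy). This is an illustration of the idea, not the goffice code; the variable names and the exact power-of-2 choice are mine:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.arange(1e26, 20e26, 1e26)
y = 60e-68 * x + 5e-41 * rng.standard_normal(x.shape) + 30e-41

# Design matrix for y = b0 + b1*x; its two columns differ in norm by
# roughly 27 orders of magnitude, which is what trips the
# degenerate-dimension check.
A = np.column_stack([np.ones_like(x), x])

# Divide each column by a power of 2 so its norm falls in [0.5, 1).
# Powers of 2 are exact in binary floating point, so no mantissa
# bits are lost in the scaling.
scale = np.exp2(np.ceil(np.log2(np.linalg.norm(A, axis=0))))
coef, *_ = np.linalg.lstsq(A / scale, y, rcond=None)
intercept, slope = coef / scale
```

On the scaled matrix the condition number drops from about 1e27 to single digits, and the recovered intercept and slope come out near the true 3e-40 and 6e-67, up to the random noise.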
Created attachment 248651 [details] [review] Tentative patch

This patch introduces pre-scaling by a power of 2.
Created attachment 248656 [details] sample gnumeric file

With or without the proposed patch there seem to be strange things going on. Regression (3) in the attached file does not look right to me. (Regressions 1 and 2 have random data, so their analyses will change.)

R yields, in the regression 3 case:

    Call:
    lm(formula = data$Y ~ data$X)

    Residuals:
         Min       1Q   Median       3Q      Max
    -0.05921 -0.02198 -0.01139  0.03156  0.06413

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) 0.020510   0.030999   0.662    0.527
    data$X      0.988296   0.004996 197.820 4.77e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.04538 on 8 degrees of freedom
    Multiple R-squared: 0.9998,  Adjusted R-squared: 0.9998
    F-statistic: 3.913e+04 on 1 and 8 DF,  p-value: 4.772e-16
With the changed code:

Coefficients B17:B18 agree with R.
Standard errors C17:C18 agree.
t-statistics D17:D18 agree.
p-values E17:E18 agree.
Residual dof B8 agrees.
I'm not sure about the R-squared values. (Too close to 1 with R's rounding.)
F statistic E12 and p-value F12 agree.

Andreas: where do we have problems?
Created attachment 248669 [details] screen shot of regression 3

I have double-checked: I applied the patch to goffice, recompiled goffice, and even recompiled gnumeric (although that should not be necessary). I opened the file and asked for recalculation. I still trigger the ill-conditioned-matrix situation for regression 3. See the attachment.
If I enable debug_scale I get the following block, 16 times in a row:

    Scale 0: 1
    Scale 1: 4
    Scale 2: 4
    Scale 3: 4
    Scale 4: 4

followed by this block, 20 times in a row:

    Scale 0: 1
    Scale 1: 8
I get the very same scales and they are reasonable given the input.
In the interesting part:

    627             stat_->se = g_new0 (DOUBLE, n);
    (gdb)
    628             for (k = 0; k < n; k++) {
    (gdb) p k
    $4 = -1073752528
    (gdb) p n
    $5 = 2
    (gdb) n
    632                     for (i = 0; i < n; i++)
    (gdb)
    633                             SUFFIX(go_quad_init) (&inv[i], i == k ? 1 : 0);
    (gdb)
    632                     for (i = 0; i < n; i++)
    (gdb)
    633                             SUFFIX(go_quad_init) (&inv[i], i == k ? 1 : 0);
    (gdb)
    632                     for (i = 0; i < n; i++)
    (gdb)
    636                     if (SUFFIX(go_quad_matrix_fwd_solve) (R, inv, inv, TRUE)) {
    (gdb)
    641                             SUFFIX(go_quad_dot_product) (&N, inv, inv, n);
    (gdb) n
    642                             SUFFIX(go_quad_mul) (&p, &N2, &N);
    (gdb)
    643                             SUFFIX(go_quad_sqrt) (&p, &p);
    (gdb)
    644                             stat_->se[k] = SUFFIX(go_quad_value) (&p) / xscale[k];
    (gdb) p p
    $6 = {h = 0, l = 0}
    (gdb) p k
    $7 = 0
    (gdb) n
    628             for (k = 0; k < n; k++) {
    (gdb)
    632                     for (i = 0; i < n; i++)
    (gdb)
    633                             SUFFIX(go_quad_init) (&inv[i], i == k ? 1 : 0);
    (gdb) p inv[i]
    $8 = {h = -0.31622776601683794, l = 7.9765867244650389e-18}
    (gdb) n
    632                     for (i = 0; i < n; i++)
    (gdb)
    633                             SUFFIX(go_quad_init) (&inv[i], i == k ? 1 : 0);
    (gdb) p inv[i]
    $9 = {h = -0.60553007081949828, l = -5.4695103194615878e-17}
    (gdb) n
    632                     for (i = 0; i < n; i++)
    (gdb)
    636                     if (SUFFIX(go_quad_matrix_fwd_solve) (R, inv, inv, TRUE)) {
    (gdb)
    641                             SUFFIX(go_quad_dot_product) (&N, inv, inv, n);
    (gdb) p N
    $10 = {h = 0.46666666666666667, l = -7.4014868308343876e-18}
    (gdb) p n
    $11 = 2
    (gdb) n
    642                             SUFFIX(go_quad_mul) (&p, &N2, &N);
    (gdb) p N
    $12 = {h = 0.77575757575757576, l = 6.7286243916676587e-19}
    (gdb) p N2
    $13 = {h = -1.0918017190279363e-43, l = 0}
    (gdb) p p
    $14 = {h = 0, l = 0}
    (gdb) n
    643                             SUFFIX(go_quad_sqrt) (&p, &p);
    (gdb) p p
    $15 = {h = -8.4697345476106574e-44, l = 2.9568751380867482e-60}
    (gdb) n
    644                             stat_->se[k] = SUFFIX(go_quad_value) (&p) / xscale[k];
    (gdb) p p
    $16 = {h = 0, l = 0}

We should not get 0...
I should note that I deleted the first 4 sheets, so this is the first calculation of regression 3.
If I am reading this right -- notably $15 vs $16 -- then go_quad_sqrt doesn't work for you.
Scratch that: $15.h is negative, so you get zero as the sqrt. p is negative because N2 is negative. Line 600 (unpatched) doesn't look right:

    SUFFIX(go_quad_div) (&N2, &N2, &d);

N2 isn't defined.
Looking at the old code, I think this is right. Please try:

    diff --git a/goffice/math/go-regression.c b/goffice/math/go-regression.c
    index 824975b..68e4637 100644
    --- a/goffice/math/go-regression.c
    +++ b/goffice/math/go-regression.c
    @@ -595,6 +595,7 @@ SUFFIX(general_linear_regression) (CONSTMATRIX xssT, int n,
     	else {
     		QUAD d;
     		SUFFIX(go_quad_init) (&d, df_resid);
    +		SUFFIX(go_quad_init) (&N2, stat_->ss_resid);
     		SUFFIX(go_quad_div) (&N2, &N2, &d);
     	}
     	stat_->var = SUFFIX(go_quad_value) (&N2);
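In scalar terms, the code around the fix computes the residual variance var = ss_resid / df_resid and then se[k] = sqrt(var * [(A^T A)^{-1}]_kk); without initializing N2 from ss_resid, the product is built from garbage and can go negative. A standalone numpy sketch of the same math, via a thin QR factorization (this is illustrative, not the goffice implementation):

```python
import numpy as np

def ols_se(A, y):
    """Least-squares coefficients and standard errors for y ~ A.
    var = ss_resid / df_resid; se[k] = sqrt(var * [(A^T A)^{-1}]_kk),
    using (A^T A)^{-1} = R^{-1} R^{-T} from the thin QR of A."""
    n_obs, n_par = A.shape
    Q, R = np.linalg.qr(A)
    coef = np.linalg.solve(R, Q.T @ y)
    resid = y - A @ coef
    var = resid @ resid / (n_obs - n_par)  # this is N2 after the fix
    Rinv = np.linalg.inv(R)
    # diag(R^{-1} R^{-T}) = squared row norms of R^{-1}
    se = np.sqrt(var * np.sum(Rinv * Rinv, axis=1))
    return coef, se
```

For an exact fit the residual sum of squares, and hence every se[k], is zero; for noisy data the result matches the textbook formula sqrt(var * diag((A^T A)^{-1})) computed directly.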
With only the added line of comment #14 (i.e. without the patch of comment #4), the regressions in my file of comment #5 work just fine. In this case there are still issues with the regressions in the original reporter's file. With both the patch of comment #4 and the patch of comment #14, the regressions in my file of comment #5 are okay, as are the regressions in the original reporter's file. There are still problems with the leverage function in the original reporter's file.
Fixes from comment #4 and comment #14 committed. The leverage function will need prescaling too.
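For reference, leverage is the diagonal of the hat matrix H = A (A^T A)^{-1} A^T, which equals the squared row norms of Q from a thin QR of A. A numpy sketch (again illustrative, not the goffice code) shows why prescaling is safe here: power-of-2 column scaling changes R but not the column space spanned by Q, so the leverages themselves are unchanged.

```python
import numpy as np

def leverage(A):
    # h_i = ||Q[i, :]||^2, the hat-matrix diagonal, via thin QR.
    Q, _ = np.linalg.qr(A)
    return np.sum(Q * Q, axis=1)

x = np.arange(1e26, 20e26, 1e26)
A = np.column_stack([np.ones_like(x), x])

# Prescale columns by powers of 2, as for the regression itself.
scale = np.exp2(np.ceil(np.log2(np.linalg.norm(A, axis=0))))
h = leverage(A / scale)
# Leverages lie in [0, 1] and sum to the number of parameters (2).
```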
Leverage now prescales. This problem has been fixed in our software repository. The fix will go into the next software release. Thank you for your bug report.