GNOME Bugzilla – Bug 703381
Bug with linear regression - Error for small numbers (floating point ?)
Last modified: 2013-07-15 20:46:07 UTC
Created attachment 248126 [details] Calc sheet demonstrating the problem.

I generated data with the following python script:

    import numpy
    x = numpy.arange(1e26, 20e26, 1e26)
    y = 60e-68*x + 5e-41*numpy.random.randn(*x.shape) + 30e-41
    for f, g in zip(x, y):
        print(f, g)

I pasted the data into gnumeric and ran a linear regression on them. The output is not calculated; I suppose the order of magnitude of the coefficients is too extreme. In particular, on regression (1) the intercept is not shown (assumed to be zero) although it is obviously nonzero. I believe this is a bug; maybe gnumeric's floating-point precision is not enough to handle these values.
You can try generating data with a smaller "a" coefficient (here it is 60e-68); I believe this makes things worse and even prevents the program from computing the slope.
For comparison, the results of regression 1 in R are:

    Call:
    lm(formula = data$B ~ data$A)

    Residuals:
           Min         1Q     Median         3Q        Max
    -4.449e-41 -2.537e-41 -1.074e-41  1.235e-41  1.059e-40

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) 3.130e-40  1.858e-41   16.84 4.86e-12 ***
    data$A      5.812e-67  1.630e-68   35.66  < 2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 3.891e-41 on 17 degrees of freedom
    Multiple R-squared: 0.9868,  Adjusted R-squared: 0.986
    F-statistic: 1272 on 1 and 17 DF,  p-value: < 2.2e-16
This example unnecessarily triggers the degenerate-dimension code because we end up with an ill-conditioned matrix with eigenvalues

    e[0] = -4.3588989435406739759
    e[1] = 2.3874672772626644741e+27

There are a number of reasonable things we can do to avoid this. Top of the list is to normalize the columns, either all the way to norm 1 or just to bring each norm into the range [0.5;1[ using a power of 2.
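The effect of the proposed normalization can be sketched in the reporter's own language (numpy). This is an illustration of the idea, not the goffice code; the variable names and the exact power-of-2 choice are mine:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.arange(1e26, 20e26, 1e26)
y = 60e-68 * x + 5e-41 * rng.standard_normal(x.shape) + 30e-41

# Design matrix for y = b0 + b1*x; its two columns differ in norm by
# roughly 27 orders of magnitude, which is what trips the
# degenerate-dimension check.
A = np.column_stack([np.ones_like(x), x])

# Divide each column by a power of 2 so its norm falls in [0.5, 1).
# Powers of 2 are exact in binary floating point, so no mantissa
# bits are lost in the scaling.
scale = np.exp2(np.ceil(np.log2(np.linalg.norm(A, axis=0))))
coef, *_ = np.linalg.lstsq(A / scale, y, rcond=None)
intercept, slope = coef / scale
```

On the scaled matrix the condition number drops from about 1e27 to single digits, and the recovered intercept and slope come out near the true 3e-40 and 6e-67, up to the random noise.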
Created attachment 248651 [details] [review] Tentative patch

This patch introduces pre-scaling by a power of 2.
Created attachment 248656 [details] sample gnumeric file

With or without the proposed patch there seem to be strange things going on. Regression (3) in the attached file does not look right to me. (Regressions 1 and 2 have random data, so their analyses will change.)

R yields, in the regression 3 case:

    Call:
    lm(formula = data$Y ~ data$X)

    Residuals:
         Min       1Q   Median       3Q      Max
    -0.05921 -0.02198 -0.01139  0.03156  0.06413

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) 0.020510   0.030999   0.662    0.527
    data$X      0.988296   0.004996 197.820 4.77e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.04538 on 8 degrees of freedom
    Multiple R-squared: 0.9998,  Adjusted R-squared: 0.9998
    F-statistic: 3.913e+04 on 1 and 8 DF,  p-value: 4.772e-16
With the changed code:

Coefficients B17:B18 agree with R.
Standard errors C17:C18 agree.
t-statistics D17:D18 agree.
p-values E17:E18 agree.
Residual dof B8 agrees.
I'm not sure about the R-squared values. (Too close to 1 with R's rounding.)
F statistic E12 and p-value F12 agree.

Andreas: where do we have problems?
Created attachment 248669 [details] screen shot of regression 3

I have double-checked: I applied the patch to goffice, recompiled goffice, and even recompiled gnumeric (although that should not be necessary). I opened the file and asked for recalculation. I still trigger the ill-conditioned-matrix situation for regression 3. See the attachment.
If I enable debug_scale I get the following block, 16 times in a row:

    Scale 0: 1
    Scale 1: 4
    Scale 2: 4
    Scale 3: 4
    Scale 4: 4

followed by this block, 20 times in a row:

    Scale 0: 1
    Scale 1: 8
I get the very same scales and they are reasonable given the input.
In the interesting part:

    627             stat_->se = g_new0 (DOUBLE, n);
    (gdb)
    628             for (k = 0; k < n; k++) {
    (gdb) p k
    $4 = -1073752528
    (gdb) p n
    $5 = 2
    (gdb) n
    632                     for (i = 0; i < n; i++)
    (gdb)
    633                             SUFFIX(go_quad_init) (&inv[i], i == k ? 1 : 0);
    (gdb)
    632                     for (i = 0; i < n; i++)
    (gdb)
    633                             SUFFIX(go_quad_init) (&inv[i], i == k ? 1 : 0);
    (gdb)
    632                     for (i = 0; i < n; i++)
    (gdb)
    636                     if (SUFFIX(go_quad_matrix_fwd_solve) (R, inv, inv, TRUE)) {
    (gdb)
    641                             SUFFIX(go_quad_dot_product) (&N, inv, inv, n);
    (gdb) n
    642                             SUFFIX(go_quad_mul) (&p, &N2, &N);
    (gdb)
    643                             SUFFIX(go_quad_sqrt) (&p, &p);
    (gdb)
    644                             stat_->se[k] = SUFFIX(go_quad_value) (&p) / xscale[k];
    (gdb) p p
    $6 = {h = 0, l = 0}
    (gdb) p k
    $7 = 0
    (gdb) n
    628             for (k = 0; k < n; k++) {
    (gdb)
    632                     for (i = 0; i < n; i++)
    (gdb)
    633                             SUFFIX(go_quad_init) (&inv[i], i == k ? 1 : 0);
    (gdb) p inv[i]
    $8 = {h = -0.31622776601683794, l = 7.9765867244650389e-18}
    (gdb) n
    632                     for (i = 0; i < n; i++)
    (gdb)
    633                             SUFFIX(go_quad_init) (&inv[i], i == k ? 1 : 0);
    (gdb) p inv[i]
    $9 = {h = -0.60553007081949828, l = -5.4695103194615878e-17}
    (gdb) n
    632                     for (i = 0; i < n; i++)
    (gdb)
    636                     if (SUFFIX(go_quad_matrix_fwd_solve) (R, inv, inv, TRUE)) {
    (gdb)
    641                             SUFFIX(go_quad_dot_product) (&N, inv, inv, n);
    (gdb) p N
    $10 = {h = 0.46666666666666667, l = -7.4014868308343876e-18}
    (gdb) p n
    $11 = 2
    (gdb) n
    642                             SUFFIX(go_quad_mul) (&p, &N2, &N);
    (gdb) p N
    $12 = {h = 0.77575757575757576, l = 6.7286243916676587e-19}
    (gdb) p N2
    $13 = {h = -1.0918017190279363e-43, l = 0}
    (gdb) p p
    $14 = {h = 0, l = 0}
    (gdb) n
    643                             SUFFIX(go_quad_sqrt) (&p, &p);
    (gdb) p p
    $15 = {h = -8.4697345476106574e-44, l = 2.9568751380867482e-60}
    (gdb) n
    644                             stat_->se[k] = SUFFIX(go_quad_value) (&p) / xscale[k];
    (gdb) p p
    $16 = {h = 0, l = 0}

We should not get 0...
I should note that I deleted the first 4 sheets, so this is the first calculation of regression 3.
If I am reading this right -- notably $15 vs $16 -- then go_quad_sqrt doesn't work for you.
Scratch that: $15.h is negative, so you get zero as the sqrt. p is negative because N2 is negative. Line 600 (unpatched) doesn't look right:

    SUFFIX(go_quad_div) (&N2, &N2, &d);

N2 isn't defined.
Looking at the old code, I think this is right. Please try:

    diff --git a/goffice/math/go-regression.c b/goffice/math/go-regression.c
    index 824975b..68e4637 100644
    --- a/goffice/math/go-regression.c
    +++ b/goffice/math/go-regression.c
    @@ -595,6 +595,7 @@ SUFFIX(general_linear_regression) (CONSTMATRIX xssT, int n,
     	else {
     		QUAD d;
     		SUFFIX(go_quad_init) (&d, df_resid);
    +		SUFFIX(go_quad_init) (&N2, stat_->ss_resid);
     		SUFFIX(go_quad_div) (&N2, &N2, &d);
     	}
     	stat_->var = SUFFIX(go_quad_value) (&N2);
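In scalar terms, the code around the fix computes the residual variance var = ss_resid / df_resid and then se[k] = sqrt(var * [(A^T A)^{-1}]_kk); without initializing N2 from ss_resid, the product is built from garbage and can go negative. A standalone numpy sketch of the same math, via a thin QR factorization (this is illustrative, not the goffice implementation):

```python
import numpy as np

def ols_se(A, y):
    """Least-squares coefficients and standard errors for y ~ A.
    var = ss_resid / df_resid; se[k] = sqrt(var * [(A^T A)^{-1}]_kk),
    using (A^T A)^{-1} = R^{-1} R^{-T} from the thin QR of A."""
    n_obs, n_par = A.shape
    Q, R = np.linalg.qr(A)
    coef = np.linalg.solve(R, Q.T @ y)
    resid = y - A @ coef
    var = resid @ resid / (n_obs - n_par)  # this is N2 after the fix
    Rinv = np.linalg.inv(R)
    # diag(R^{-1} R^{-T}) = squared row norms of R^{-1}
    se = np.sqrt(var * np.sum(Rinv * Rinv, axis=1))
    return coef, se
```

For an exact fit the residual sum of squares, and hence every se[k], is zero; for noisy data the result matches the textbook formula sqrt(var * diag((A^T A)^{-1})) computed directly.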
With only the added line of comment #14 (i.e. without the patch of comment #4), the regressions in my file of comment #5 work just fine. In this case there are still issues with the regressions in the original reporter's file. With both the patch of comment #4 and the patch of comment #14, the regressions in my file of comment #5 are okay, as are the regressions in the original reporter's file. There are still problems with the leverage function in the original reporter's file.
Fixes from comment #4 and comment #14 committed. The leverage function will need prescaling too.
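For reference, leverage is the diagonal of the hat matrix H = A (A^T A)^{-1} A^T, which equals the squared row norms of Q from a thin QR of A. A numpy sketch (again illustrative, not the goffice code) shows why prescaling is safe here: power-of-2 column scaling changes R but not the column space spanned by Q, so the leverages themselves are unchanged.

```python
import numpy as np

def leverage(A):
    # h_i = ||Q[i, :]||^2, the hat-matrix diagonal, via thin QR.
    Q, _ = np.linalg.qr(A)
    return np.sum(Q * Q, axis=1)

x = np.arange(1e26, 20e26, 1e26)
A = np.column_stack([np.ones_like(x), x])

# Prescale columns by powers of 2, as for the regression itself.
scale = np.exp2(np.ceil(np.log2(np.linalg.norm(A, axis=0))))
h = leverage(A / scale)
# Leverages lie in [0, 1] and sum to the number of parameters (2).
```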
Leverage now prescales. This problem has been fixed in our software repository. The fix will go into the next software release. Thank you for your bug report.