GNOME Bugzilla – Bug 534659
linear regression coefficient R^2 requires documentation
Last modified: 2008-09-25 16:07:33 UTC
The linear regression R squared is not calculated correctly when the line is forced to go through 0 (affine). In particular, the R^2 of the affine line appears to usually be higher than that of a non-affine line, which makes little sense. I've confirmed the incorrect R^2 calculation with Excel and MATLAB, which agree with each other and disagree with gnumeric. This problem appears to exist for pretty much any data set I make up.
Please supply a concrete example and state (a) what you get, (b) where to find that number, and (c) what you think you should get. There are many different places you can do R^2 calculations and many different versions of Gnumeric.
Created attachment 111481 [details] example problem
This is with gnumeric 1.7.11. Calculating R^2 either by graphing or via LINEST gives the same result (a higher R^2 when forcing the line through zero). Example attached.
Graphing and LINEST use the same code to evaluate the regression. I tested that file with OOoCalc; it gives the same R^2 for both regressions. There is no Excel around here, so I could not test with it. KSpread gives the same result as gnumeric. Applying R^2 = SSR/TSS gives 0.4454 when the line is forced through 0. So it seems that all the tested spreadsheets are wrong.
Created attachment 111488 [details] [review] fixes the issue, imho. I really do not understand why the sum of Y^2 was used in the non-affine case. Of course, I'm not a specialist and I might be wrong...
I was wrong with KSpread. It just imports the value, not the formula, apparently. Btw, in the sample file, Y and X values are permuted in the linest function.
Made the same calculation with R and it gives the same result as gnumeric, so there might be no bug there.
R^2 is the square of the linear correlation coefficient (Pearson product-moment), so it should be independent of the regression model used. Now, for the normal case (the regression line is not forced through the origin; that would be the affine case if we want to use that term), this square of the linear correlation coefficient is the fraction of the variation explained by the regression line. Some people may think that this should also be true for a best-fit line forced through the origin. That is obviously not the case.
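The distinction can be made concrete with a small sketch (illustrative only, using made-up data; this is not Gnumeric's or R's actual code): the squared Pearson correlation ignores the fitted model entirely, while the "explained variance" ratio for a through-origin fit depends on which total sum of squares you divide by.

```python
# Sketch (not Gnumeric's code): contrast the squared Pearson
# correlation with the "explained variance" ratio for a best-fit
# line forced through the origin.

def pearson_r2(xs, ys):
    """Squared Pearson correlation -- model-independent."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

def origin_fit_ratios(xs, ys):
    """Fit y = b*x (no intercept) and return 1 - SSE/TSS for
    two choices of TSS: centered (about the mean) and
    uncentered (about zero, i.e. sum of Y^2)."""
    b = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    my = sum(ys) / len(ys)
    sse = sum((y - b * x) ** 2 for x, y in zip(xs, ys))
    tss_centered = sum((y - my) ** 2 for y in ys)
    tss_uncentered = sum(y * y for y in ys)
    return 1 - sse / tss_centered, 1 - sse / tss_uncentered

xs, ys = [1, 2, 3, 4], [3, 3.5, 5, 6.5]
r2 = pearson_r2(xs, ys)                 # 0.96
centered, uncentered = origin_fit_ratios(xs, ys)
# centered ~ 0.76, uncentered ~ 0.98: the uncentered ratio exceeds
# the squared correlation, matching the "too high R^2" complaint.
```

The uncentered TSS is always at least as large as the SSE of the through-origin fit, so that ratio can never go negative, whereas the centered ratio can.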
Thanks Andreas. So my patch was just stupid, and both R and gnumeric are wrong.
Hmm, Excel gives the explained-variance fraction. And all spreadsheets give only one R^2 value for multilinear regressions.
re comment #9: I am not sure I said that gnumeric and R are wrong. Any program that gives different values for R^2 depending on the model would be wrong. If Excel gives the ratio of explained variation to total variation, then it should not be calling that R^2 in the case where the regression line is forced through the origin.
According to wikipedia (not saying it's a trustworthy source), http://en.wikipedia.org/wiki/Coefficient_of_determination, "the coefficient of determination R^2 is the proportion of variability in a data set that is accounted for by a statistical model." I've not found a source that says R^2 is independent of the model. Regardless, gnumeric is definitely wrong. The R^2 number gnumeric is calculating for the case of line through origin is higher than the R^2 without the constraint and is not correct however you interpret R^2. Also, as a practical note, I'm pretty sure the explained variance statistic (whatever you want to call it) is more intuitive when you're plotting a line on a graph and showing the number on the graph. Especially when you're displaying R^2 right under the equation, having the R^2 be independent of the equation is misleading. I would also note that gnumeric for non-linear fits (e.g. log fit) gives different R^2 numbers, presumably the explained variance.
There appears to be confusion about two things: R is the Pearson product-moment correlation coefficient. It is independent of the regression model used. R^2 should be the square of R. (That's why it is called R^2.) The coefficient of determination is the proportion of variation accounted for by the statistical model. For your everyday linear regression, the coefficient of determination happens to be R^2 (the square of R). Some people may then use R^2 as the symbol for the coefficient of determination, but it is clearly ridiculous to say that R^2 is anything but the square of R. I do not make any statements about the correctness of gnumeric's functions or graphs. The functions are frequently simply designed to emulate Excel's result (as of some version).
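For the everyday case (intercept included) the coincidence mentioned above can be checked numerically. This is an illustrative sketch with made-up data, not Gnumeric's implementation:

```python
# Sketch: for ordinary least squares *with* an intercept,
# 1 - SSE/TSS equals the squared Pearson correlation coefficient.

def pearson_r_squared(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

def determination(xs, ys):
    """Coefficient of determination, 1 - SSE/TSS, for the
    ordinary least-squares line y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx  # fitted intercept
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    tss = sum((y - my) ** 2 for y in ys)  # variation about the mean
    return 1 - sse / tss

xs, ys = [1, 2, 3, 4], [3, 3.5, 5, 6.5]
# pearson_r_squared(xs, ys) and determination(xs, ys) agree here,
# which is exactly why the same symbol R^2 gets used for both.
```

As soon as the intercept is dropped (or the model is otherwise changed), this equality breaks down, which is the heart of the disagreement in this bug.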
I do make a statement about the correctness of our regression functions. What we calculate as R^2 is neither R^2 nor the coefficient of determination in the general case. It is something else which I just do not understand. Excel uses R^2 as the symbol for the coefficient of determination. What should we do? The current situation is just wrong. For the sheet function we can just return the coefficient of determination as Excel does, and change the documentation to state that it is the coefficient of determination, not R squared. But what symbol should we use in graphs?
In the end I asked the r-devel list for clarification and got two equivalent responses. See https://stat.ethz.ch/pipermail/r-devel/2008-June/049927.html It seems we do not have any bug there. Maybe we should just be clearer in the documentation about no-intercept regressions.
This seems to come down primarily to a documentation issue, so I am changing the component. Note that statements of the form "Excel yields..." about these linear regression coefficients, as far as models that force the regression line through the origin are concerned, are meaningless, since Excel's answer changed greatly with Excel 2003. So at a minimum a statement giving the Excel version would be required.
I have added some explanation in the function description. This problem has been fixed in the development version. The fix will be available in the next major software release. Thank you for your bug report.