GNOME Bugzilla – Bug 534659
linear regression coefficient R^2 requires documentation
Last modified: 2008-09-25 16:07:33 UTC
The linear regression R squared is not calculated correctly when the line is forced to go through 0 (affine). In particular, the R^2 of the affine line appears to usually be higher than that of a non-affine line, which makes little sense. I've confirmed the incorrect R^2 calculation with Excel and MATLAB, which agree with each other and disagree with gnumeric. This problem appears to exist for pretty much any data set I make up.
Please supply a concrete example and state (a) what you get, (b) where to find that number, and (c) what you think you should get. There are many different places you can do R^2 calculations and many different versions of Gnumeric.
Created attachment 111481 [details] example problem
This is with gnumeric 1.7.11. Calculating R^2 either by graphing or via LINEST gives the same result (a higher R^2 when forcing the line through zero). Example attached.
Graphing and LINEST use the same code to evaluate the regression. I tested that file with OOoCalc; it gives the same R^2 for both regressions. There is no Excel around here, so I could not test with it. KSpread gives the same result as gnumeric. Applying R^2 = SSR/TSS gives 0.4454 when the line is forced through 0. So it seems that all the tested spreadsheets are wrong.
Created attachment 111488 [details] [review] fixes the issue, imho. I really do not understand why the sum of Y^2 was used in the non-affine case. Of course, I'm not a specialist and I might be wrong...
I was wrong with KSpread. It just imports the value, not the formula, apparently. Btw, in the sample file, Y and X values are permuted in the linest function.
Made the same calculation with R and it gives the same result as gnumeric, so there might be no bug there.
R^2 is the square of the linear correlation coefficient (Pearson product-moment), so it should be independent of the regression model used. Now, for the normal case (the regression line is not forced through the origin; that would be the affine case if we want to use that term), this square of the linear correlation coefficient is the fraction of the variation explained by the regression line. Some people may think that this should also be true for a best-fit line forced through the origin. That is obviously not the case.
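The distinction can be made concrete with a small sketch (illustrative only, using made-up data; this is not Gnumeric's or R's actual code): the squared Pearson correlation ignores the fitted model entirely, while the "explained variance" ratio for a through-origin fit depends on which total sum of squares you divide by.

```python
# Sketch (not Gnumeric's code): contrast the squared Pearson
# correlation with the "explained variance" ratio for a best-fit
# line forced through the origin.

def pearson_r2(xs, ys):
    """Squared Pearson correlation -- model-independent."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

def origin_fit_ratios(xs, ys):
    """Fit y = b*x (no intercept) and return 1 - SSE/TSS for
    two choices of TSS: centered (about the mean) and
    uncentered (about zero, i.e. sum of Y^2)."""
    b = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    my = sum(ys) / len(ys)
    sse = sum((y - b * x) ** 2 for x, y in zip(xs, ys))
    tss_centered = sum((y - my) ** 2 for y in ys)
    tss_uncentered = sum(y * y for y in ys)
    return 1 - sse / tss_centered, 1 - sse / tss_uncentered

xs, ys = [1, 2, 3, 4], [3, 3.5, 5, 6.5]
r2 = pearson_r2(xs, ys)                 # 0.96
centered, uncentered = origin_fit_ratios(xs, ys)
# centered ~ 0.76, uncentered ~ 0.98: the uncentered ratio exceeds
# the squared correlation, matching the "too high R^2" complaint.
```

The uncentered TSS is always at least as large as the SSE of the through-origin fit, so that ratio can never go negative, whereas the centered ratio can.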
Thanks Andreas. So my patch was just stupid, and both R and gnumeric are wrong.
Hmm, Excel gives the explained-variance fraction. And all spreadsheets give only one R^2 value for multilinear regressions.
re comment #9: I am not sure I said that gnumeric and R are wrong. Any program that gives different values for R^2 depending on the model would be wrong. If Excel gives the ratio of explained variation to total variation, then it should not be calling that R^2 in the case where the regression line is forced through the origin.
According to wikipedia (not saying it's a trustworthy source), http://en.wikipedia.org/wiki/Coefficient_of_determination, "the coefficient of determination R^2 is the proportion of variability in a data set that is accounted for by a statistical model." I've not found a source that says R^2 is independent of the model. Regardless, gnumeric is definitely wrong. The R^2 number gnumeric is calculating for the case of line through origin is higher than the R^2 without the constraint and is not correct however you interpret R^2. Also, as a practical note, I'm pretty sure the explained variance statistic (whatever you want to call it) is more intuitive when you're plotting a line on a graph and showing the number on the graph. Especially when you're displaying R^2 right under the equation, having the R^2 be independent of the equation is misleading. I would also note that gnumeric for non-linear fits (e.g. log fit) gives different R^2 numbers, presumably the explained variance.
There appears to be confusion about two things: R is the Pearson product-moment correlation coefficient. It is independent of the regression model used. R^2 should be the square of R. (That's why it is called R^2.) The coefficient of determination is the proportion of variation accounted for by the statistical model. For your everyday linear regression, the coefficient of determination happens to be R^2 (the square of R). Some people may then use R^2 as the symbol for the coefficient of determination, but it is clearly ridiculous to say that R^2 is anything but the square of R. I do not make any statements about the correctness of gnumeric's functions or graphs. The functions are frequently simply designed to emulate Excel's result (as of some version).
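For the everyday case (intercept included) the coincidence mentioned above can be checked numerically. This is an illustrative sketch with made-up data, not Gnumeric's implementation:

```python
# Sketch: for ordinary least squares *with* an intercept,
# 1 - SSE/TSS equals the squared Pearson correlation coefficient.

def pearson_r_squared(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

def determination(xs, ys):
    """Coefficient of determination, 1 - SSE/TSS, for the
    ordinary least-squares line y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx  # fitted intercept
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    tss = sum((y - my) ** 2 for y in ys)  # variation about the mean
    return 1 - sse / tss

xs, ys = [1, 2, 3, 4], [3, 3.5, 5, 6.5]
# pearson_r_squared(xs, ys) and determination(xs, ys) agree here,
# which is exactly why the same symbol R^2 gets used for both.
```

As soon as the intercept is dropped (or the model is otherwise changed), this equality breaks down, which is the heart of the disagreement in this bug.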
I do make a statement about the correctness of our regression functions. What we calculate as R^2 is neither R^2 nor the coefficient of determination in the general case. It is something else which I just do not understand. Excel uses R^2 as the symbol for the coefficient of determination. What should we do? The current situation is just wrong. For the sheet function we can just return the coefficient of determination as Excel does, and change the documentation to state that it is the coefficient of determination, not R squared. But what symbol should we use in graphs?
In the end I asked the r-devel list for clarification and got two equivalent responses. See https://stat.ethz.ch/pipermail/r-devel/2008-June/049927.html It seems we do not have any bug there. Maybe we should just be clearer in the documentation about no-intercept regressions.
This seems to come down primarily to a documentation issue, so I am changing the component. Note that statements of the form "Excel yields..." about these linear regression coefficients, as far as models that force the regression line through the origin are concerned, are meaningless, since Excel's answer changed greatly with Excel 2003. So at a minimum a statement giving the Excel version would be required.
I have added some explanation in the function description. This problem has been fixed in the development version. The fix will be available in the next major software release. Thank you for your bug report.