GNOME Bugzilla – Bug 700132
Implement r.ptukey and r.qtukey.
Last modified: 2013-05-30 14:15:06 UTC
R offers a cumulative distribution function (ptukey) and a quantile function (qtukey) for the studentized range statistic q. It would be nice if they were available in Gnumeric.
I keep reading that as turkey.
We can definitely import R's implementations. They aren't great, though, in the sense that:

1. They can't handle all possible parameters. df==1, for example. See http://code.google.com/p/qsturng-py/
2. They're not very fast.
3. We don't get a dtukey function.
While that implementation may not be great, it is usually better than depending on tables, although I note that the table values differ from R's implementation for df=2, r=2, alpha=0.05 already in the third decimal. They match for all other values as far as I can tell. In any case, the primary use of these functions should be for df >> 1. I noticed that when I used R to create tables of critical values ... In real life, though, it would only be used for a few individual values. There really is no use for dtukey anyway.
I'm working on ptukey now.

> There really is no use for dtukey anyway.

qtukey is slow and doesn't even try to get more than 4 correct digits. If we had dtukey, we would be only 2 Newton steps away from full precision. (Full precision as the inverse of ptukey as implemented.)
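To make that concrete, here is a minimal sketch of the kind of refinement meant here. The prototypes are hypothetical (R has no dtukey, and the real functions also take nranges/lower_tail/log_p arguments):

/* Hypothetical prototypes, for illustration only. */
double my_ptukey (double q, double nmeans, double df);
double my_dtukey (double q, double nmeans, double df);

/* Polish a ~4-digit qtukey estimate q0 toward the full-precision
 * inverse of ptukey.  Each Newton step roughly doubles the number
 * of correct digits, so two steps suffice from 4 digits. */
static double
qtukey_refine (double p, double nmeans, double df, double q0)
{
	double q = q0;
	int i;
	for (i = 0; i < 2; i++)
		q -= (my_ptukey (q, nmeans, df) - p) / my_dtukey (q, nmeans, df);
	return q;
}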
Created attachment 244593 [details] [review]
Tentative patch

Initial implementation of r.ptukey
Created attachment 244594 [details]
Test spreadsheet

Test file generated from the data at http://qsturng-py.googlecode.com/hg/make_tbls.py
Issues:

1. Description of arguments "cc" and "rr". I don't know what I am doing here.
2. Tables seem to assume cc=1, so that's all being tested.
3. Some values are way off. E49, for example. Is the test right?
The test file http://qsturng-py.googlecode.com/hg/make_tbls.py says:

# The values for p in [.5, .75, .9, .95, .975, .99, .995, .999]
# were pulled from:
# http://www.stata.com/stb/stb46/dm64/sturng.pdf
#
# Values for p in [.1, .675, .8, .85] were calculated using R's qtukey function

So they are mixing two tables, where the second should match what the R code calculates.
Reference 7 in http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?view=body&id=pdf_1&handle=euclid.aoms/1177705684 supposedly describes how the integrals in question can be approximated.
In Gleason's 1998 paper, which gives the primary numbers for the test data, he writes:

"The values of q_p(r, ν) were computed using the Copenhaver & Holland (1988) algorithm. For p ≥ 0.90, values were checked against Harter's (1960) table; discrepancies were rare and generally resolved by accepting the value from Harter's table."

So in other words, some of the values are not what Copenhaver & Holland gave but are taken from the older table by Harter. It is therefore not surprising that there are some blips.

Gleason's paper nicely describes how this distribution arises. There are only 2 variables of interest: the number of independent normal random variables contributing to the range, and the degrees of freedom of the independent chi^2-distributed random variable. So I have no idea what the cc in the R implementation is really about.
In R we have:

> qtukey
function (p, nmeans, df, nranges = 1, lower.tail = TRUE, log.p = FALSE)
.Internal(qtukey(p, nranges, nmeans, df, lower.tail, log.p))
<environment: namespace:stats>
> qtukey(0.9, nmeans=5, df=7)
[1] 4.280326
> qtukey(0.9, 5, 7)
[1] 4.280326

So the order of arguments in the interface is not the same as the order of arguments in the internal function. Your current implementation uses R's internal order of arguments rather than the interface one. It would surely be less confusing to users to use the interface version. I see that the Copenhaver & Holland discussion allows for nranges > 1, but I am not quite sure where that would be useful.
I would probably say something like:

r: number of underlying normal random variables per range
df: degrees of freedom of the underlying chi-squared distributed random variable
nr: number of ranges, defaults to 1
Created attachment 244638 [details]
Tentative patch

Updated patch
Created attachment 244639 [details]
Test spreadsheet
Arguments are now x, nmeans, df, nranges=1, lower_tail=TRUE, log_p=FALSE. (Due to the way we generate stuff, we have to match the prototype in mathfunc.h, so I just added a trivial wrapper for reordering; see the sketch below.)
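Roughly, the wrapper amounts to this sketch (names are illustrative, not the actual patch; the imported ptukey keeps R's internal argument order, as seen in the .Internal call quoted above):

/* R's internal order: (q, nranges, nmeans, df, lower_tail, log_p). */
double ptukey (double q, double rr, double cc, double df,
               int lower_tail, int log_p);

/* Hypothetical wrapper exposing the user-visible order. */
static double
user_ptukey (double x, double nmeans, double df, double nranges,
             int lower_tail, int log_p)
{
	return ptukey (x, nranges, nmeans, df, lower_tail, log_p);
}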
In the documentation of the arguments we should probably mention that nranges defaults to 1 and lower_tail to TRUE.

The test file is missing a few formulas in a stretch in column G. There are clearly some stretches where the difference between the test values and the calculated values is quite large, but this could also be due to the test values being off.

I would think you can commit those patches, and if we ever find out more details on how to calculate it more precisely we can modify the core then.
Some further reference to the CDF: ublib.boulder.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2Falg_cdf-pdf-rand_continuous_studentized_range.htm
Oops, the reference should have been http://publib.boulder.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2Falg_cdf-pdf-rand_continuous_studentized_range.htm
http://onlinestatbook.com/2/calculators/studentized_range_dist.html

If I enter the values Q=3.2215, n=100, df=2, I get 90%.
If I enter that Q in A27 (and E27; need to fix that), then I get a value of ~10% with accuracy 3.9.

This suggests that the test sheet is wrong (or that the above code has the same flaws as ptukey).
The Java class used for http://onlinestatbook.com/2/calculators/studentized_range_dist.html was provided by Russell V. Lenth. If you check his website, you will find links to two R packages he wrote. It is not that far a stretch to assume that, since the author of the Java class uses R, the code for ptukey and the Java class may be very similar.
I thought that we could perhaps use something like Mathematica to also create test data. But it seems that even Mathematica 9 does not have access to the studentized range distribution. (They do use it inside some of their stats tests but seem not to provide access to the pdf or cdf (or tail probabilities).)
r.ptukey is in (with local fixes). The nmeans=2 case has an explicit formula which allows us to create some limited test data for that.
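For reference, my reading of that explicit formula: for nmeans=2 the studentized range is |X1-X2|/S, and (X1-X2)/(S*sqrt(2)) has a t distribution with df degrees of freedom, so the CDF reduces to Student's t. A sketch using R's pt (the prototype below matches R's nmath, which Gnumeric's mathfunc also carries):

#include <math.h>

double pt (double x, double n, int lower_tail, int log_p);

/* ptukey for nmeans=2: P(Q <= q) = P(|T| <= q/sqrt(2))
 *                               = 2*pt(q/sqrt(2), df) - 1. */
static double
ptukey_two_means (double q, double df)
{
	return 2 * pt (q / sqrt (2.0), df, 1, 0) - 1;
}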
A49 is 3.009265469. If I change that to 3.090265469, I get 4+ correct digits. I suspect some of these numbers were typed in from an old table.
> I thought that we could perhaps use something like Mathematica to also create
> test data.

Can Mathematica do numerical integration? It might be absurdly inefficient, yet still useful for test cases.
R_ptukey is starting to look like a real C function now. One thing *really* bugs me: if I reduce the size of the intervals on which numerical integration is done, i.e., use more sub-intervals, then the accuracy goes _down_. (As measured on the nmeans=2 cases for which we have an explicit formula.)
More is now known.

1. The Copenhaver-Holland algorithm used is based on numerical integration:

   G(q, c, df, r=1) = Integrate(HH(q, u, c, r, df), u = 0 ... Inf)
   where HH(q, u, c, r, df) = H(q*sqrt(u), c)^r * f(u, df)

   H is also based on numerical integration and produces values in [0;1].

2. H is good. Every value I have compared against Mathematica results has 13+ digits of precision. It does disagree somewhat with 50-year-old tables after about 6 digits. For this I trust Mathematica more.

3. G is based on Gauss-Legendre quadrature using 16 nodes. This is applied to intervals [0;L], [L;2L], [2L;3L], ... for some L which is based on df. For these purposes assume L=1.

4. HH(q, 0, c, r, df) = HH(q, Inf, c, r, df) = 0, i.e., the integrand vanishes at both ends of the u range. HH is non-negative and C^Infinity. I think there is a u0 such that HH is increasing for u < u0 and decreasing for u > u0. That u0 depends on c, r, df.

5. The sequence of intervals from (3) is terminated when the values get too small. There is some hackish (and fixable, I think) code to avoid terminating too early in case HH is very flat in the first few intervals.

6. The numerical integration from (3) requires that HH can be approximated well by a polynomial over each interval. This is almost always true, but *NOT* always for the first interval starting at 0. For q=0.2, c=2, df=2, r=1, for example, dHH/du -> Infinity as u -> 0+. (I.e., HH is vertical at the left end-point.) That is very bad for accuracy with Gauss-Legendre, and just reducing L doesn't solve it.

7. If the situation can be detected, a change of variables u = 1 - cos(s) seems to work wonders on the shape of HH. (See the standalone sketch after this list.)

8. Alternatively, the first interval can be replaced by a sequence of intervals [10^{-n}L; 10^{1-n}L], each covering 90% of the left-over interval. Both the drop in interval length and in HH's magnitude should make that sequence tend to 0 in a hurry.

9. The literature has lots of unsubstantiated claims of the need to use high-accuracy floating point for this code. R explicitly asks for "long double" in a few cases. Apart from a part of f(_,_) implemented using the log-gamma function, I believe this is bogus.

10. I think lots of ptukey.gnumeric's test values are inaccurate. I have Mathematica burning up cycles producing test values.
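To illustrate (6) and (7) outside Gnumeric, here is a small standalone demo of my own construction (not project code): 8-node Gauss-Legendre applied directly to an integrand that behaves like sqrt(u) near 0, versus the same integral after the substitution u = 1 - cos(s). The exact value is 2/3; with the same 8 function evaluations, the substituted form should come out accurate to near machine precision while the direct form loses most of its digits.

#include <stdio.h>
#include <math.h>

/* 8-point Gauss-Legendre nodes and weights on [-1,1]. */
static const double X[8] = {
	-0.9602898564975363, -0.7966664774136267,
	-0.5255324099163290, -0.1834346424956498,
	 0.1834346424956498,  0.5255324099163290,
	 0.7966664774136267,  0.9602898564975363
};
static const double W[8] = {
	0.1012285362903763, 0.2223810344533745,
	0.3137066458778873, 0.3626837833783620,
	0.3626837833783620, 0.3137066458778873,
	0.2223810344533745, 0.1012285362903763
};

static double
gl8 (double (*f) (double), double a, double b)
{
	double mid = 0.5 * (a + b), h = 0.5 * (b - a), sum = 0;
	int i;
	for (i = 0; i < 8; i++)
		sum += W[i] * f (mid + h * X[i]);
	return h * sum;
}

/* Vertical at u=0, like HH for the bad parameter combinations. */
static double f_direct (double u) { return sqrt (u); }

/* Same integral after u = 1 - cos(s): du = sin(s) ds, and near 0
 * sqrt(1 - cos s) ~ s/sqrt(2), so the integrand is now smooth. */
static double f_subst (double s) { return sqrt (1 - cos (s)) * sin (s); }

int
main (void)
{
	double exact = 2.0 / 3.0;
	printf ("direct:      error %.2e\n", fabs (gl8 (f_direct, 0, 1) - exact));
	printf ("substituted: error %.2e\n", fabs (gl8 (f_subst, 0, acos (0.0)) - exact));
	return 0;
}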
A variant of (8) was done: instead of a constant factor of 10, the factors 2, 3, 4, ... are used.

Item (10) was done, but needs redoing with more accuracy and, perhaps, more carefully chosen data points.

Also:

11. The Copenhaver-Holland implementation truncates the integration after 50 steps. That sometimes isn't enough. The solution is to increase the interval size when we detect that we're in the tail: when contributions drop below 0.1%, we now double the interval length (a sketch follows below).

Also: R.QTUKEY is in, based on the generic pfuncinverter. Having dtukey would make me happier.
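The tail handling in (11) amounts to something like this sketch; panel_integral and the cut-off constants are illustrative, only the 0.1% doubling rule is from the actual change:

/* Hypothetical 16-node Gauss-Legendre integral of HH over [a,b]. */
double panel_integral (double a, double b);

static double
outer_integral (double L0)
{
	double a = 0, L = L0, total = 0;
	while (1) {
		double part = panel_integral (a, a + L);
		total += part;
		a += L;
		if (part <= 1e-14 * total)
			break;   /* contribution negligible: stop */
		if (part < 1e-3 * total)
			L *= 2;  /* in the tail: double the panel length */
	}
	return total;
}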
I need to revise (2). H isn't as good as I thought, particularly for a large number of means (say c > 40). The problem is the same as for the outer integral: the integrand isn't behaving like a polynomial.
H fixed. We now use interval size 3/log(1+nmeans) and as many intervals as needed. On average that's not much more than we used before.

I now know of only one case where we get less than 10 correct digits: R.PTUKEY(4.99971386817809,100,80), where we get 0.49999992701795 but should get 0.49999992707724. That's 9.93 correct digits.

This problem has been fixed in our software repository. The fix will go into the next software release. Thank you for your bug report.
For future work: if a very large cc is used, then the calculation pow(erf(w/sqrt(8)), cc) isn't the right way to go. The erf value is squeezed up against 1, with no way for the power to bring it down again. Something like pow1p(-erfc(w/sqrt(8)), cc) should work.
pr_w = qsqz > 1
	/* Upper-tail form: (1 - 2*P(Z > qsqz))^cc keeps the tiny
	 * tail instead of rounding it into 1 before the power. */
	? pow1p (-2.0 * pnorm (qsqz, 0.0, 1.0, FALSE, FALSE), cc)
	/* Small qsqz: erf(qsqz/sqrt(2)) = P(|Z| <= qsqz) is safe here. */
	: gnm_pow (gnm_erf (qsqz / M_SQRT2gnum), cc);
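For anyone unfamiliar with pow1p: it computes (1+x)^y without first rounding 1+x to 1, which is what saves the tiny tail here. A minimal sketch (Gnumeric's actual pow1p likely handles more corner cases):

#include <math.h>

/* Sketch only: (1+x)^y via log1p, so a tiny x is not absorbed
 * into 1.0 before the power is applied. */
static double
my_pow1p (double x, double y)
{
	return exp (y * log1p (x));
}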