Bug 700132 - Implement r.ptukey and r.qtukey.
Status: RESOLVED FIXED
Product: Gnumeric
Classification: Applications
Component: Analytics
Version: git master
Hardware: Other
OS: Linux
Importance: Normal enhancement
Target Milestone: ---
Assigned To: Morten Welinder
QA Contact: Jody Goldberg
Depends on:
Blocks:

Reported: 2013-05-11 19:05 UTC by Andreas J. Guelzow
Modified: 2013-05-30 14:15 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Tentative patch (16.34 KB, patch) - 2013-05-18 01:51 UTC, Morten Welinder
Test spreadsheet (281.75 KB, application/x-gnumeric) - 2013-05-18 01:53 UTC, Morten Welinder
Tentative patch (17.14 KB, application/octet-stream) - 2013-05-18 18:07 UTC, Morten Welinder
Test spreadsheet (279.13 KB, application/x-gnumeric) - 2013-05-18 18:07 UTC, Morten Welinder

Description Andreas J. Guelzow 2013-05-11 19:05:08 UTC
R offers a cumulative distribution function (ptukey) and a quantile function (qtukey) for the studentized range statistic q. It would be nice if they were available in Gnumeric.
Comment 1 Morten Welinder 2013-05-12 15:59:23 UTC
I keep reading that as turkey.
Comment 2 Morten Welinder 2013-05-17 20:01:02 UTC
We can definitely import R's implementations.  They aren't great, though,
in the sense that

1. They can't handle all possible parameters.  df==1, for example.
   See http://code.google.com/p/qsturng-py/

2. They're not very fast.

3. We don't get a dtukey function.
Comment 3 Andreas J. Guelzow 2013-05-17 20:38:24 UTC
While that implementation may not be great, it is usually better than depending on tables, although I note that the table values differ from R's implementation for df=2, r=2, alpha=0.05 already in the third decimal place. They match for all other values as far as I can tell.

In any case, the primary use of these functions should be for df >> 1.

I noticed that when I used R to create tables of critical values ... In real life though it would only be used for a few individual values.

There really is no use for dtukey anyways.
Comment 4 Morten Welinder 2013-05-18 00:51:00 UTC
I'm working on ptukey now.

> There really is no use for dtukey anyways.

qtukey is slow and doesn't even try to get more than 4 correct digits.
If we had dtukey, we would be only two Newton steps away from full precision.
(Full precision as the inverse of ptukey as implemented.)
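A sketch of what those Newton steps would look like, assuming hypothetical
C functions ptukey and dtukey in R's internal argument order (the real
prototypes would live in mathfunc.h; all names here are illustrative):

    /* Hypothetical prototypes, for illustration only. */
    double ptukey (double q, double nranges, double nmeans, double df);
    double dtukey (double q, double nranges, double nmeans, double df);

    /* Refine a ~4-digit qtukey estimate q0 toward the exact root of
     * ptukey(q) - p = 0 with two Newton steps. */
    static double
    refine_qtukey (double p, double q0,
                   double nranges, double nmeans, double df)
    {
        int i;
        for (i = 0; i < 2; i++) {
            double F = ptukey (q0, nranges, nmeans, df);  /* CDF at q0 */
            double f = dtukey (q0, nranges, nmeans, df);  /* density at q0 */
            q0 -= (F - p) / f;                            /* Newton step */
        }
        return q0;
    }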
Comment 5 Morten Welinder 2013-05-18 01:51:41 UTC
Created attachment 244593
Tentative patch

Initial implementation of r.ptukey
Comment 6 Morten Welinder 2013-05-18 01:53:27 UTC
Created attachment 244594
Test spreadsheet

Test file generated from the data at http://qsturng-py.googlecode.com/hg/make_tbls.py
Comment 7 Morten Welinder 2013-05-18 01:58:04 UTC
Issues:

1. Description of arguments "cc" and "rr".  I don't know what I am doing
   here.

2. Tables seem to assume cc=1 so that's all being tested.

3. Some values are way off.  E49, for example.  Is the test right?
Comment 8 Andreas J. Guelzow 2013-05-18 03:10:09 UTC
The script http://qsturng-py.googlecode.com/hg/make_tbls.py that generates the test data says:
# The values for p in [.5, .75, .9, .95, .975, .99, .995, .999]
# were pulled from:
#    http://www.stata.com/stb/stb46/dm64/sturng.pdf
#
# Values for p in [.1, .675, .8, .85] were calculated using R's qtukey function
#

So they are mixing two tables, where the second set should match what the R code calculates.
Comment 9 Andreas J. Guelzow 2013-05-18 03:33:12 UTC
Reference 7 in http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?view=body&id=pdf_1&handle=euclid.aoms/1177705684 supposedly describes how the integrals in question can be approximated.
Comment 10 Andreas J. Guelzow 2013-05-18 03:46:44 UTC
In Gleason's 1998 paper that gives the primary numbers for the test data, he writes:
"The values of qp (r, ν) were computed using the Copenhaver & Holland (1988) algorithm. For p ≥ 0.90, values were checked against Harter’s (1960) table;
discrepancies were rare and generally resolved by accepting the value from Harter’s table."

So in other words some of the values are not what Copenhaver & Holland gave but are taken from the older table by Harter. So it is not surprising that there are some blips.

Gleason's paper nicely describes how this distribution arises. And there are only 2 variables of interest: the number of independent normal random variables contributing to the range, and the degrees of freedom of the independent chi^2 distributed random variable. So I have no idea what the cc in the R implementation is really about.
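For reference, the standard setup (this is my reading of Gleason's
description, stated in the notation used later in this report): with c
independent normals and an independent variance estimate,

    q = (max_i X_i - min_i X_i) / s

    X_1, ..., X_c  iid N(mu, sigma^2)
    df * s^2 / sigma^2  ~  chi^2_df,  independent of the X_i

so c and df are the only parameters of the distribution of q.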
Comment 11 Andreas J. Guelzow 2013-05-18 04:06:05 UTC
In R we have:

> qtukey
function (p, nmeans, df, nranges = 1, lower.tail = TRUE, log.p = FALSE) 
.Internal(qtukey(p, nranges, nmeans, df, lower.tail, log.p))
<environment: namespace:stats>
> qtukey(0.9,nmeans=5,df=7)
[1] 4.280326
> qtukey(0.9,5,7)
[1] 4.280326

so the order of arguments in the interface is not the same as the order of arguments in the internal function. Your current implementation uses R's internal order of arguments rather than the interface one. It would surely be less confusing to users to use the interface version.

I see that the Copenhaver & Holland discussion allows for nranges > 1, but I am not quite sure where that would be useful.
Comment 12 Andreas J. Guelzow 2013-05-18 04:11:59 UTC
I would probably say something like:

r:  number of underlying normal random variables per range
df: degrees of freedom of the underlying chi-squared distributed random variable
nr: number of ranges, defaults to 1
Comment 13 Morten Welinder 2013-05-18 18:07:19 UTC
Created attachment 244638
Tentative patch

Updated patch
Comment 14 Morten Welinder 2013-05-18 18:07:56 UTC
Created attachment 244639
Test spreadsheet
Comment 15 Morten Welinder 2013-05-18 18:13:37 UTC
Arguments are now x, nmeans, df, nranges=1, lower_tail=TRUE, log_p=FALSE.
(Due to the way we generate stuff, we have to match the prototype in mathfunc.h
so I just added a trivial wrapper for reordering.)
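Roughly like this (a sketch; the actual identifiers in mathfunc.h may
differ):

    /* R's internal argument order, as in the imported ptukey code. */
    double ptukey (double q, double nranges, double nmeans, double df,
                   int lower_tail, int log_p);

    /* User-visible order (x, nmeans, df, nranges, ...): a trivial
     * reordering wrapper so the generated glue can keep matching the
     * mathfunc.h prototype.  gnm_ptukey is an illustrative name. */
    double
    gnm_ptukey (double x, double nmeans, double df, double nranges,
                int lower_tail, int log_p)
    {
        return ptukey (x, nranges, nmeans, df, lower_tail, log_p);
    }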
Comment 16 Andreas J. Guelzow 2013-05-18 19:10:14 UTC
In the documentation of the arguments we should probably mention that nranges defaults to 1, lower_tail to TRUE, and log_p to FALSE.

The test file is missing a few formulas in a stretch in column G.

There are clearly some stretches where the differences between the test values and the calculated values are quite large, but this could also be due to the test values being off.

I would think you can commit those patches, and if we ever find out more details on how to calculate it more precisely we can modify the core then.
Comment 17 Andreas J. Guelzow 2013-05-18 19:24:56 UTC
A further reference for the CDF:
http://publib.boulder.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2Falg_cdf-pdf-rand_continuous_studentized_range.htm
Comment 19 Morten Welinder 2013-05-18 20:05:40 UTC
http://onlinestatbook.com/2/calculators/studentized_range_dist.html

If I enter the values Q=3.2215, n=100, df=2, I get 90%.

If I enter that Q in A27 (and E27; need to fix that), then I get a value
of ~10% with accuracy 3.9.

This suggests that the test sheet is wrong (or that the above code has the
same flaws as ptukey).
Comment 20 Andreas J. Guelzow 2013-05-18 20:27:09 UTC
The Java class used for http://onlinestatbook.com/2/calculators/studentized_range_dist.html was provided by Russell V. Lenth. If you check his website you find links to two R packages he wrote. It is not that far a stretch to assume that, since the author of the Java class uses R, the code for ptukey and the Java class may be very similar.
Comment 21 Andreas J. Guelzow 2013-05-18 20:33:05 UTC
I thought that we could perhaps use something like Mathematica to also create test data. But it seems that even Mathematica 9 does not have access to the Studentized range distribution. (They do use it inside some of their stats tests but seem not to provide access to the pdf or cdf (or tail probabilities).)
Comment 22 Morten Welinder 2013-05-19 00:50:48 UTC
r.ptukey is in (with local fixes).

The nmeans=2 case has an explicit formula which allows us to create some
limited test data for that.
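For two means, the studentized range is sqrt(2) times the absolute value
of a t statistic with the same df, so reference values can come straight
from the t CDF.  A minimal sketch, assuming R's standalone math library
(Rmath.h) for pt; ptukey2_reference is an illustrative name:

    #include <math.h>
    #define MATHLIB_STANDALONE
    #include <Rmath.h>

    /* ptukey(q, nmeans=2, df) = P(|T_df| <= q/sqrt(2))
     *                         = 2*pt(q/sqrt(2), df) - 1 */
    static double
    ptukey2_reference (double q, double df)
    {
        return 2 * pt (q / M_SQRT2, df, /*lower_tail*/ 1, /*log_p*/ 0) - 1;
    }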
Comment 23 Morten Welinder 2013-05-19 01:42:19 UTC
A49       is          3.009265469   
if I change that to   3.090265469   I get 4+ correct digits

I suspect some of these numbers were typed from an old table.
Comment 24 Morten Welinder 2013-05-19 20:05:12 UTC
> I thought that we could perhaps use something like Mathematica to also create
> test-data.

Can Mathematica do numerical integration?

It might be absurdly inefficient, yet still useful for test cases.
Comment 25 Morten Welinder 2013-05-22 00:38:26 UTC
R_ptukey is starting to look like a real C function now.

One thing *really* bugs me: if I reduce the size of the intervals on which
numerical integration is done, i.e., have more sub-intervals, then the
accuracy goes _down_.  (As measured on the nmeans=2 cases for which we
have an explicit formula.)
Comment 26 Morten Welinder 2013-05-22 23:34:07 UTC
More is now known.

1. The Copenhaver-Holland algorithm used is based on numerical integration

       G(q,c,df,r=1) = Integrate (HH(q,u,c,r,df),u = 0 ... Inf)

   where

       HH(q,u,c,r,df) = H(q*sqrt(u),c)^r * f(u,df)

   H is also based on numerical integration and produces values in [0;1].

2. H is good.  Every value I have compared against Mathematica results has
   13+ digits of precision.  It does disagree somewhat with 50-year-old
   tables after about 6 digits.  For this I trust Mathematica more.

3. G is based on Gauss-Legendre quadrature using 16 nodes.  This is applied
   to intervals [0;L], [L;2L], [2L;3L], ...  for some L which is based on
   df.  For these purposes assume L=1.

4. HH(q,0,c,r,df) = HH(q,Inf,c,r,df) = 0.  HH is non-negative and C^Infinity.

   I think there is a u0 so that HH is increasing for u < u0 and decreasing
   for u > u0.  That u0 depends on c, r, df.

5. The sequence of intervals from (3) is terminated when the values get too
   small.  There is some hackish (and fixable, I think) code to avoid
   terminating too early in case HH is very flat in the first few intervals.

6. The numerical integration from (3) requires that HH can be approximated
   well by a polynomial over the interval.  This is almost always true,
   but *NOT* always for the first interval starting at 0.

   For q=0.2;c=2;df=2;r=1, for example, dHH/du -> Infinity for u->0+.  (I.e.,
   HH is vertical at the left end-point.)  That is very bad for accuracy with
   Gauss-Legendre.  Just reducing L doesn't solve it.

7. If the situation can be detected, a change of variables u=1-cos(s) seems
   to work wonders on the shape of HH.  (See the sketch after this list.)

8. Alternatively, the first interval can be replaced by a sequence of intervals
   [10^{-n}L; 10^{1-n}L] each covering 90% of the left-over interval.  Both
   the drop in interval length and in HH's magnitude should make that sequence
   tend to 0 in a hurry.

9. The literature has lots of unsubstantiated claims of the need to use high
   accuracy floating-point for this code.  R explicitly asks for "long double"
   in a few cases.  Apart from a part of f(_,_) implemented using the
   log-gamma function, I believe this is bogus.

10. I think lots of ptukey.gnumeric's test values are inaccurate.
    I have Mathematica burning up cycles producing test values.
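
A sketch of the substitution from (7); gauss_legendre_16 and hh are
illustrative stand-ins for the real quadrature routine and integrand:

    #include <math.h>

    /* Hypothetical 16-node Gauss-Legendre rule on [a;b]. */
    double gauss_legendre_16 (double (*g) (double, void *), void *data,
                              double a, double b);
    /* The integrand HH from (1), with q, c, r, df packed into data. */
    double hh (double u, void *data);

    /* With u = 1-cos(s), du = sin(s) ds, so for L <= 2
     *   Integral(HH(u), u = 0..L)
     *     = Integral(HH(1-cos(s)) * sin(s), s = 0..acos(1-L)).
     * The sin(s) factor vanishes at s=0, flattening the integrand
     * exactly where dHH/du blows up. */
    static double
    hh_cos (double s, void *data)
    {
        return hh (1 - cos (s), data) * sin (s);
    }

    static double
    first_interval (void *data, double L)
    {
        return gauss_legendre_16 (hh_cos, data, 0, acos (1 - L));
    }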
Comment 27 Morten Welinder 2013-05-24 15:02:28 UTC
A variant of 8 was done: instead of a constant factor 10, use 2,3,4,...

10 was done, but needs redoing with more accuracy and, perhaps, more carefully
chosen data points.

Also:

11. The Copenhaver-Holland implementation truncates the integration after
    50 steps.  That sometimes isn't enough.  The solution is to increase
    the interval size when we detect that we're in the tail.  When
    contributions drop below 0.1% we now double the interval length
    (see the sketch below).
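
A sketch of that tail handling (the termination threshold and all names
are illustrative; gauss_legendre_16 is the hypothetical rule from above):

    /* Hypothetical 16-node Gauss-Legendre rule on [a;b], as in (3). */
    double gauss_legendre_16 (double (*g) (double, void *), void *data,
                              double a, double b);

    /* Composite quadrature over [0;Inf): march in steps of length len,
     * stop when an interval no longer contributes, and double len once
     * we are clearly in the tail so 50 steps reach far enough out. */
    static double
    outer_integral (double (*g) (double, void *), void *data, double L)
    {
        double total = 0, lo = 0, len = L;
        int step;

        for (step = 0; step < 50; step++) {
            double part = gauss_legendre_16 (g, data, lo, lo + len);
            total += part;
            lo += len;
            if (part < 1e-10 * total)
                break;          /* tail exhausted */
            if (part < 1e-3 * total)
                len *= 2;       /* in the tail: stride faster */
        }
        return total;
    }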

Also:

R.QTUKEY is in, based on the generic pfuncinverter.  Having dtukey would
make me happier.
Comment 28 Morten Welinder 2013-05-25 02:26:05 UTC
I need to revise (2).  H isn't as good as I thought, particularly for a large
number of means (say c>40).

The problem is the same as for the outer integral: the integrand isn't
behaving like a polynomial.
Comment 29 Morten Welinder 2013-05-26 23:23:34 UTC
H fixed.  We now use interval size 3/log(1+nmeans) and as many intervals
as needed.  On average that's not much more than we used.


I now know of only one case where we get less than 10 correct digits:

    R.PTUKEY(4.99971386817809,100,80)

where we get 0.49999992701795 but should get 0.49999992707724.

That's 9.93 correct digits.


This problem has been fixed in our software repository. The fix will go into the next software release. Thank you for your bug report.
Comment 30 Morten Welinder 2013-05-29 18:22:57 UTC
For future work:

If a very large cc is used, then the calculation

    pow(erf(w/sqrt(8)),cc)

isn't the right way to go.  The erf value is squeezed up against 1 with no
way for the power to bring it down again.

Something like

    pow1p(-erfc(w/sqrt(8)),cc)

should work.
Comment 31 Morten Welinder 2013-05-30 14:15:06 UTC
    pr_w = qsqz > 1
	    ? pow1p (-2.0 * pnorm (qsqz, 0, 1, FALSE, FALSE), cc)
	    : gnm_pow (gnm_erf (qsqz / M_SQRT2gnum), cc);
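
pow1p(x,y) computes (1+x)^y without first rounding 1+x.  A minimal sketch
of the idea, assuming C99 log1p (the real implementation presumably
handles overflow and edge cases more carefully):

    #include <math.h>

    /* (1+x)^y for small |x|: pow(1+x, y) would lose x when forming
     * 1+x, whereas log1p keeps it. */
    static double
    pow1p (double x, double y)
    {
        return exp (y * log1p (x));
    }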