GNOME Bugzilla – Bug 700132
Implement r.ptukey and r.qtukey.
Last modified: 2013-05-30 14:15:06 UTC
R offers a cumulative distribution function (ptukey) and a quantile function (qtukey) for the studentized range statistic q. It would be nice if they were available in Gnumeric.
I keep reading that as turkey.
We can definitely import R's implementations. They aren't great, though, in the sense that:

1. They can't handle all possible parameters. df==1, for example. See http://code.google.com/p/qsturng-py/
2. They're not very fast.
3. We don't get a dtukey function.
While that implementation may not be great, it is usually better than depending on tables, although I note that the table values differ from R's implementation for df=2, r=2, alpha=0.05 already in the third decimal. They match for all other values as far as I can tell. In any case, the primary use of these functions should be for df >> 1. I noticed that when I used R to create tables of critical values ... In real life, though, it would only be used for a few individual values. There really is no use for dtukey anyway.
I'm working on ptukey now.

> There really is no use for dtukey anyway.

qtukey is slow and doesn't even try to get more than 4 correct digits. If we had dtukey, we would be only 2 Newton steps away from full precision. (Full precision as the inverse of ptukey as implemented.)
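To make that concrete, here is a minimal sketch of the kind of refinement meant here. The prototypes are hypothetical (R has no dtukey, and the real functions also take nranges/lower_tail/log_p arguments):

/* Hypothetical prototypes, for illustration only. */
double my_ptukey (double q, double nmeans, double df);
double my_dtukey (double q, double nmeans, double df);

/* Polish a ~4-digit qtukey estimate q0 toward the full-precision
 * inverse of ptukey.  Each Newton step roughly doubles the number
 * of correct digits, so two steps suffice from 4 digits. */
static double
qtukey_refine (double p, double nmeans, double df, double q0)
{
	double q = q0;
	int i;
	for (i = 0; i < 2; i++)
		q -= (my_ptukey (q, nmeans, df) - p) / my_dtukey (q, nmeans, df);
	return q;
}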
Created attachment 244593 [details] [review]
Tentative patch

Initial implementation of r.ptukey
Created attachment 244594 [details]
Test spreadsheet

Test file generated from the data at http://qsturng-py.googlecode.com/hg/make_tbls.py
Issues:

1. Description of arguments "cc" and "rr". I don't know what I am doing here.
2. Tables seem to assume cc=1, so that's all being tested.
3. Some values are way off. E49, for example. Is the test right?
The test file http://qsturng-py.googlecode.com/hg/make_tbls.py says:

# The values for p in [.5, .75, .9, .95, .975, .99, .995, .999]
# were pulled from:
# http://www.stata.com/stb/stb46/dm64/sturng.pdf
#
# Values for p in [.1, .675, .8, .85] were calculated using R's qtukey function

So they are mixing two tables, where the second should match what the R code calculates.
Reference 7 in http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?view=body&id=pdf_1&handle=euclid.aoms/1177705684 supposedly describes how the integrals in question can be approximated.
In Gleason's 1998 paper, which gives the primary numbers for the test data, he writes:

"The values of q_p(r, ν) were computed using the Copenhaver & Holland (1988) algorithm. For p ≥ 0.90, values were checked against Harter's (1960) table; discrepancies were rare and generally resolved by accepting the value from Harter's table."

So in other words, some of the values are not what Copenhaver & Holland gave but are taken from the older table by Harter. It is therefore not surprising that there are some blips.

Gleason's paper nicely describes how this distribution arises. There are only 2 variables of interest: the number of independent normal random variables contributing to the range, and the degrees of freedom of the independent chi^2-distributed random variable. So I have no idea what the cc in the R implementation is really about.
In R we have:

> qtukey
function (p, nmeans, df, nranges = 1, lower.tail = TRUE, log.p = FALSE)
.Internal(qtukey(p, nranges, nmeans, df, lower.tail, log.p))
<environment: namespace:stats>
> qtukey(0.9, nmeans=5, df=7)
[1] 4.280326
> qtukey(0.9, 5, 7)
[1] 4.280326

So the order of arguments in the interface is not the same as the order of arguments in the internal function. Your current implementation uses R's internal order of arguments rather than the interface one. It would surely be less confusing to users to use the interface version. I see that the Copenhaver & Holland discussion allows for nranges > 1, but I am not quite sure where that would be useful.
I would probably say something like:

r: number of underlying normal random variables per range
df: degrees of freedom of the underlying chi-squared distributed random variable
nr: number of ranges, defaults to 1
Created attachment 244638 [details]
Tentative patch

Updated patch
Created attachment 244639 [details]
Test spreadsheet
Arguments are now x, nmeans, df, nranges=1, lower_tail=TRUE, log_p=FALSE. (Due to the way we generate stuff, we have to match the prototype in mathfunc.h, so I just added a trivial wrapper for reordering; see the sketch below.)
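Roughly, the wrapper amounts to this sketch (names are illustrative, not the actual patch; the imported ptukey keeps R's internal argument order, as seen in the .Internal call quoted above):

/* R's internal order: (q, nranges, nmeans, df, lower_tail, log_p). */
double ptukey (double q, double rr, double cc, double df,
               int lower_tail, int log_p);

/* Hypothetical wrapper exposing the user-visible order. */
static double
user_ptukey (double x, double nmeans, double df, double nranges,
             int lower_tail, int log_p)
{
	return ptukey (x, nranges, nmeans, df, lower_tail, log_p);
}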
In the documentation of the arguments we should probably mention that nranges defaults to 1 and lower_tail to TRUE.

The test file is missing a few formulas in a stretch in column G. There are clearly some stretches where the difference between the test values and the calculated values is quite large, but this could also be due to the test values being off.

I would think you can commit those patches, and if we ever find out more details on how to calculate it more precisely we can modify the core then.
Some further reference to the CDF: ublib.boulder.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2Falg_cdf-pdf-rand_continuous_studentized_range.htm
Oops, the reference should have been http://publib.boulder.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2Falg_cdf-pdf-rand_continuous_studentized_range.htm
http://onlinestatbook.com/2/calculators/studentized_range_dist.html

If I enter the values Q=3.2215, n=100, df=2, I get 90%.
If I enter that Q in A27 (and E27; need to fix that), then I get a value of ~10% with accuracy 3.9.

This suggests that the test sheet is wrong (or that the above code has the same flaws as ptukey).
The Java class used for http://onlinestatbook.com/2/calculators/studentized_range_dist.html was provided by Russell V. Lenth. If you check his website, you will find links to two R packages he wrote. It is not that far a stretch to assume that, since the author of the Java class uses R, the code for ptukey and the Java class may be very similar.
I thought that we could perhaps use something like Mathematica to also create test data. But it seems that even Mathematica 9 does not have access to the studentized range distribution. (They do use it inside some of their stats tests but seem not to provide access to the pdf or cdf (or tail probabilities).)
r.ptukey is in (with local fixes). The nmeans=2 case has an explicit formula which allows us to create some limited test data for that.
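For reference, my reading of that explicit formula: for nmeans=2 the studentized range is |X1-X2|/S, and (X1-X2)/(S*sqrt(2)) has a t distribution with df degrees of freedom, so the CDF reduces to Student's t. A sketch using R's pt (the prototype below matches R's nmath, which Gnumeric's mathfunc also carries):

#include <math.h>

double pt (double x, double n, int lower_tail, int log_p);

/* ptukey for nmeans=2: P(Q <= q) = P(|T| <= q/sqrt(2))
 *                               = 2*pt(q/sqrt(2), df) - 1. */
static double
ptukey_two_means (double q, double df)
{
	return 2 * pt (q / sqrt (2.0), df, 1, 0) - 1;
}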
A49 is 3.009265469. If I change that to 3.090265469, I get 4+ correct digits. I suspect some of these numbers were typed in from an old table.
> I thought that we could perhaps use something like Mathematica to also create
> test data.

Can Mathematica do numerical integration? It might be absurdly inefficient, yet still useful for test cases.
R_ptukey is starting to look like a real C function now. One thing *really* bugs me: if I reduce the size of the intervals on which numerical integration is done, i.e., use more sub-intervals, then the accuracy goes _down_. (As measured on the nmeans=2 cases for which we have an explicit formula.)
More is now known.

1. The Copenhaver-Holland algorithm used is based on numerical integration:

   G(q, c, df, r=1) = Integrate(HH(q, u, c, r, df), u = 0 ... Inf)
   where HH(q, u, c, r, df) = H(q*sqrt(u), c)^r * f(u, df)

   H is also based on numerical integration and produces values in [0;1].

2. H is good. Every value I have compared against Mathematica results has 13+ digits of precision. It does disagree somewhat with 50-year-old tables after about 6 digits. For this I trust Mathematica more.

3. G is based on Gauss-Legendre quadrature using 16 nodes. This is applied to intervals [0;L], [L;2L], [2L;3L], ... for some L which is based on df. For these purposes assume L=1.

4. HH(q, 0, c, r, df) = HH(q, Inf, c, r, df) = 0, i.e., the integrand vanishes at both ends of the u range. HH is non-negative and C^Infinity. I think there is a u0 such that HH is increasing for u < u0 and decreasing for u > u0. That u0 depends on c, r, df.

5. The sequence of intervals from (3) is terminated when the values get too small. There is some hackish (and fixable, I think) code to avoid terminating too early in case HH is very flat in the first few intervals.

6. The numerical integration from (3) requires that HH can be approximated well by a polynomial over each interval. This is almost always true, but *NOT* always for the first interval starting at 0. For q=0.2, c=2, df=2, r=1, for example, dHH/du -> Infinity as u -> 0+. (I.e., HH is vertical at the left end-point.) That is very bad for accuracy with Gauss-Legendre, and just reducing L doesn't solve it.

7. If the situation can be detected, a change of variables u = 1 - cos(s) seems to work wonders on the shape of HH. (See the standalone sketch after this list.)

8. Alternatively, the first interval can be replaced by a sequence of intervals [10^{-n}L; 10^{1-n}L], each covering 90% of the left-over interval. Both the drop in interval length and in HH's magnitude should make that sequence tend to 0 in a hurry.

9. The literature has lots of unsubstantiated claims of the need to use high-accuracy floating point for this code. R explicitly asks for "long double" in a few cases. Apart from a part of f(_,_) implemented using the log-gamma function, I believe this is bogus.

10. I think lots of ptukey.gnumeric's test values are inaccurate. I have Mathematica burning up cycles producing test values.
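To illustrate (6) and (7) outside Gnumeric, here is a small standalone demo of my own construction (not project code): 8-node Gauss-Legendre applied directly to an integrand that behaves like sqrt(u) near 0, versus the same integral after the substitution u = 1 - cos(s). The exact value is 2/3; with the same 8 function evaluations, the substituted form should come out accurate to near machine precision while the direct form loses most of its digits.

#include <stdio.h>
#include <math.h>

/* 8-point Gauss-Legendre nodes and weights on [-1,1]. */
static const double X[8] = {
	-0.9602898564975363, -0.7966664774136267,
	-0.5255324099163290, -0.1834346424956498,
	 0.1834346424956498,  0.5255324099163290,
	 0.7966664774136267,  0.9602898564975363
};
static const double W[8] = {
	0.1012285362903763, 0.2223810344533745,
	0.3137066458778873, 0.3626837833783620,
	0.3626837833783620, 0.3137066458778873,
	0.2223810344533745, 0.1012285362903763
};

static double
gl8 (double (*f) (double), double a, double b)
{
	double mid = 0.5 * (a + b), h = 0.5 * (b - a), sum = 0;
	int i;
	for (i = 0; i < 8; i++)
		sum += W[i] * f (mid + h * X[i]);
	return h * sum;
}

/* Vertical at u=0, like HH for the bad parameter combinations. */
static double f_direct (double u) { return sqrt (u); }

/* Same integral after u = 1 - cos(s): du = sin(s) ds, and near 0
 * sqrt(1 - cos s) ~ s/sqrt(2), so the integrand is now smooth. */
static double f_subst (double s) { return sqrt (1 - cos (s)) * sin (s); }

int
main (void)
{
	double exact = 2.0 / 3.0;
	printf ("direct:      error %.2e\n", fabs (gl8 (f_direct, 0, 1) - exact));
	printf ("substituted: error %.2e\n", fabs (gl8 (f_subst, 0, acos (0.0)) - exact));
	return 0;
}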
A variant of (8) was done: instead of a constant factor of 10, the factors 2, 3, 4, ... are used.

Item (10) was done, but needs redoing with more accuracy and, perhaps, more carefully chosen data points.

Also:

11. The Copenhaver-Holland implementation truncates the integration after 50 steps. That sometimes isn't enough. The solution is to increase the interval size when we detect that we're in the tail: when contributions drop below 0.1%, we now double the interval length (a sketch follows below).

Also: R.QTUKEY is in, based on the generic pfuncinverter. Having dtukey would make me happier.
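The tail handling in (11) amounts to something like this sketch; panel_integral and the cut-off constants are illustrative, only the 0.1% doubling rule is from the actual change:

/* Hypothetical 16-node Gauss-Legendre integral of HH over [a,b]. */
double panel_integral (double a, double b);

static double
outer_integral (double L0)
{
	double a = 0, L = L0, total = 0;
	while (1) {
		double part = panel_integral (a, a + L);
		total += part;
		a += L;
		if (part <= 1e-14 * total)
			break;   /* contribution negligible: stop */
		if (part < 1e-3 * total)
			L *= 2;  /* in the tail: double the panel length */
	}
	return total;
}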
I need to revise (2). H isn't as good as I thought, particularly for a large number of means (say c > 40). The problem is the same as for the outer integral: the integrand isn't behaving like a polynomial.
H fixed. We now use interval size 3/log(1+nmeans) and as many intervals as needed. On average that's not much more than we used before.

I now know of only one case where we get less than 10 correct digits: R.PTUKEY(4.99971386817809,100,80), where we get 0.49999992701795 but should get 0.49999992707724. That's 9.93 correct digits.

This problem has been fixed in our software repository. The fix will go into the next software release. Thank you for your bug report.
For future work: if a very large cc is used, then the calculation pow(erf(w/sqrt(8)), cc) isn't the right way to go. The erf value is squeezed up against 1, with no way for the power to bring it down again. Something like pow1p(-erfc(w/sqrt(8)), cc) should work.
pr_w = qsqz > 1
	/* Upper-tail form: (1 - 2*P(Z > qsqz))^cc keeps the tiny
	 * tail instead of rounding it into 1 before the power. */
	? pow1p (-2.0 * pnorm (qsqz, 0.0, 1.0, FALSE, FALSE), cc)
	/* Small qsqz: erf(qsqz/sqrt(2)) = P(|Z| <= qsqz) is safe here. */
	: gnm_pow (gnm_erf (qsqz / M_SQRT2gnum), cc);
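For anyone unfamiliar with pow1p: it computes (1+x)^y without first rounding 1+x to 1, which is what saves the tiny tail here. A minimal sketch (Gnumeric's actual pow1p likely handles more corner cases):

#include <math.h>

/* Sketch only: (1+x)^y via log1p, so a tiny x is not absorbed
 * into 1.0 before the power is applied. */
static double
my_pow1p (double x, double y)
{
	return exp (y * log1p (x));
}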