GNOME Bugzilla – Bug 134166
statistical analysis frequency table desired
Last modified: 2008-10-12 23:49:45 UTC
Hi - I'm really happy with gnumeric so far. I took my first crack at using the statistics features today and started with the histogram. A couple of observations... not sure how stable this stuff is supposed to be.

#1) Manual setting of the bins doesn't seem to work - or at least it is not obvious how to set it up. I have a simple column of 9 datapoints and can get it to work fine with calculated bins:

          A
     1    Result
     2    1
     3    2
     4    3
     5    1
     6    1
     7    2
     8    2
     9    2
    10    2

I want to get a histogram of three bins:

I set the Input range to A2:A10
I set the bins to: Calculated, min=1, max=3, N=3

Output sheet is as follows:

    Bin                 Frequency
    <1                  0
    1.66666666666667    3
    2.33333333333333    5
    3                   1
    >3                  0

The frequency numbers are correct and the bins are logical:

    1 to 1.666
    1.666 to 2.333
    2.333 to 3

With outliers also shown.

** So that works but if I use pre-assigned bins it doesn't. I add the following column:

          B
     1    Bins
     2    <1
     3    1.666
     4    2.333
     5    3
     6    >3

These are the same as what the calculated attempt used. Then I go back in, use the same input data and select pre-assigned bins with range B2:B6. However the output sheet is empty. Nothing. I tried B3:B5 too in case the < or > were causing problems but same thing - no output. How should this be specified to get the desired output?

#2) What I'd really like to do is create a histogram for text entries. Is this possible as well? e.g. Say a data set like:

    Bill
    Bill
    Bob
    Bob
    Chris
    Paul

Can I set bins up
a) on explicit names?
b) using regular expressions like B* etc.?

thx!
#2: Unfortunately you can't do that yet. The values have to be numerical and the preassigned bins are given by the cutoffs.

#1: To get a table like the calculated bins, your bin values should be:

    1
    1.66666
    2.33333
    3

This will yield 5 intervals. This should work. Which exact version of gnumeric are you using?
Yes that does yield output and this is good enough for me that I can set up bins and get the answers I want. However the actual results are different. Note that the first bin is <=1 rather than <1 so distribution is 3,0,5,1,0 - not 0,3,5,1,0. I don't really care as long as I can predict what it'll do - but you probably do want it to be consistent. On the upper bound I don't care if it says More or >3 -- all the same to me.

    Bin       Frequency
    1         3
    1.6666    0
    2.3333    5
    3         1
    More      0

So case closed. Pity about #2 not being there - that would be very nice. thx!
I guess we should check the documentation and see that this gets correctly documented. There is a point to the slightly different behaviour: the calculated bins assume a finite interval and have two extra overflow/underflow bins, while the predetermined bins act as cutoffs, with the same behaviour at each cutoff. Ideally, of course, this could be specified by the user.
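As a rough illustration of the two behaviours described in this thread (not gnumeric's actual code), the following Python sketch reproduces both frequency tables for the 9 datapoints from the original report. The function names are made up for this example. How a value lying exactly on an interior edge of a calculated bin is assigned is an assumption, since the data here never hits one.

```python
from bisect import bisect_left

# The 9 datapoints from A2:A10 in the original report.
data = [1, 2, 3, 1, 1, 2, 2, 2, 2]


def cutoff_counts(values, cutoffs):
    """Pre-assigned bins: each cutoff c collects values with previous < x <= c.

    One extra trailing bin ("More") collects everything above the last cutoff.
    """
    counts = [0] * (len(cutoffs) + 1)
    for x in values:
        # bisect_left returns the index of the first cutoff >= x,
        # or len(cutoffs) when x exceeds every cutoff.
        counts[bisect_left(cutoffs, x)] += 1
    return counts


def calculated_counts(values, lo, hi, n):
    """Calculated bins: n equal-width bins on [lo, hi], plus an underflow
    bin (< lo) in front and an overflow bin (> hi) at the end.

    Assumption: interior edges belong to the bin on their right.
    """
    width = (hi - lo) / n
    counts = [0] * (n + 2)
    for x in values:
        if x < lo:
            counts[0] += 1
        elif x > hi:
            counts[-1] += 1
        else:
            # Clamp so that x == hi lands in the last regular bin.
            i = min(int((x - lo) / width), n - 1)
            counts[i + 1] += 1
    return counts


print(cutoff_counts(data, [1, 1.66666, 2.33333, 3]))
print(calculated_counts(data, 1, 3, 3))
```

With the cutoffs 1, 1.66666, 2.33333, 3 the first function gives the 3,0,5,1,0 distribution, and the second gives 0,3,5,1,0 for min=1, max=3, N=3, which are the two results reported above.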
I have just made some changes to the histogram tool. I hope the cutoffs/bins make more sense now. I am leaving this report open with a new subject line to remind us of the need to support #2.
I have written a new tool to handle #2. Adding it to the histogram tool would have created a rather complicated-looking dialog with many items not applicable to the situation at hand. This problem has been fixed in the development version. The fix will be available in the next major software release. Thank you for your bug report.
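For reference, the kind of table asked for in #2 can be sketched in a few lines of Python. This only illustrates the request. It is not the new tool's implementation, and this report does not say whether the tool supports pattern bins such as B*. The helper name count_matching is hypothetical, and the pattern is treated as a shell-style wildcard (via fnmatch) rather than a true regular expression.

```python
from collections import Counter
from fnmatch import fnmatchcase

# The example text entries from the original report.
names = ["Bill", "Bill", "Bob", "Bob", "Chris", "Paul"]

# a) One bin per explicit name.
freq = Counter(names)


def count_matching(values, pattern):
    """b) Count the values that match a wildcard pattern such as 'B*'."""
    return sum(1 for v in values if fnmatchcase(v, pattern))


for name, n in sorted(freq.items()):
    print(f"{name}\t{n}")
print(f"B*\t{count_matching(names, 'B*')}")
```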