- Proof for author
- 0.2
Quick calculation of Kendall’s Rank Correlation Distribution
Gordon Sutcliffe (gsutcliffe@talk21.com)
The nub of the paper is the km2
function with
krfd
as a user-friendly cover function for quick calculation
of Kendall’s rank correlation frequency distribution. The remainder is
intended to provide a useful, but by no means exhaustive, background.
Introduction
Kendall’s Rank Correlation Distribution provides the statistical probability for the assertion that two sets of n ranks are significantly different, rather than just random fluctuations. The cumulative probability used for their solution can be produced from the integer frequencies calculated here.
Currently the probabilities for n>10 ranks are managed by using approximations to the normal distribution. However, it is not easy to determine the degree of approximation.
The problem from the calculation point of view is that Kendall’s original method takes an enormous amount of computer time to calculate when the number of ranks is more than 12. On a PC [1] the frequency distribution for n=13 ranks would take days, with the time increasing as factorial n.
Quick calculation is based on Kendall’s theory, but uses a recurrence relationship which enables the frequency distribution to be calculated using floating point arithmetic in a fraction of a second up to n=170. (Factorial 170 is the limit for standard floating point arithmetic on a PC.)
The integer floating-point frequencies are exact up to n=18, but at higher n, precision drops to about 13 to 14 significant digits [2]. J extended-integer arithmetic produces exact frequencies up to n=170 in 2000 seconds, or up to n=300 in 14 hours on a PC.
Methods of calculation
Two calculation methods are considered. Method 1 is based on first principles, as in Kendall’s work. Runtime usually limits calculation to about n=12 ranks at most. The second method is based on a recurrence relation. It is practical up to n=170 ranks using floating point arithmetic and can provide exact frequencies to over n=300 using J extended-integer arithmetic.
Both methods agree with Biometrika Table 45, [3] up to its limit of 10 ranks. Basic properties of the Kendall’s rank correlation frequency distribution against Q for n>0 are:
Distribution symmetry about: | (max. Q)/2 | Q=0 frequencies: | 1 |
Sum of the frequencies: | !n | Q=1 frequencies: | n-1 |
Number of frequencies, Q: | 1+n×(n-1)/2 | Q=2 frequencies: | (-1)+n×(n-1)/2 [for n>1] |
Q and n are discussed further under Method 1.
Method 1: First principles
Taking the case for n=3, the 3 ranks have 6 = 3 × 2 × 1 permutations.
Assuming that a rank order a b c
corresponds to perfect correlation, the
(minimum) number of neighbour interchanges required to transform each
permutation to a b c
is ascertained:
Permutations for n=3 | a b c | a c b | b a c | b c a | c a b | c b a |
Required neighbour interchanges, Q | 0 | 1 | 1 | 2 | 2 | 3 |
Frequency total for each unique Q value above | 1 | — | 2 | — | 2 | 1 |
The frequencies for each value of Q = 0, 1, 2, 3 are repeated in the table below in the row for n=3.
No. of ranks |
Number of neighbour interchanges, Q for the frequencies in the column below them |
Sum of Freqs. |
Count of Qs |
|||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
n | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | !n | 1+n× (n-1)/2 |
0 | 1 | (-1) |
1 | 1 | ||||||||||||||
1 | 1 | (-1) |
1 | 1 | ||||||||||||||
2 | 1 | 1 | -1 |
(-1) |
2 | 2 | ||||||||||||
3 | 1 | 2 | 2 | 1 | -1 |
-2 |
-2 |
(-1) |
6 | 4 | ||||||||
4 | 1 | 3 | 5 | 6 | 5 | 3 -1 |
1 -3 |
-5 |
-6 |
-5 |
-3 |
(-1) |
24 | 7 | ||||
5 | 1 | 4 | 9 | 15 | 20 | 22 | 20 -1 |
15 -4 |
9 -9 |
4 -15 |
1 -20 |
-22 |
-20 |
-15 |
-9 |
-4 |
120 (-1) |
11 |
6 | 1 | 5 | 14 | 29 | 49 | 71 | 90 | 101 | 101 | 90 | 71 | 49 | 29 | 14 | 5 | 1 | 720 | 16 |
Method 2: Recurrence relation
Row r+1 is based [4] on the previous row r. Rows are calculated from row n=1 initially. To calculate the frequencies for the next row, first negate a copy of the current row and insert as shown by the italics at the bottom of the row. The row is then summed progressively from column 0 to the right to produce the next row, stopping before the (_1) column, which would otherwise produce an unwanted zero.
Notes and references
- PC system specification s/w: Windows XP home ed. 5.1.2600, Service pack 2; J v6.01c; H/W: Processor: x86 Family 15 Model 2 Stepping 9 GenuineIntel ~2605 MHz; RAM: 512 MB.
- More precisely the proportional error is in the range: (1 - 6e-15) to (1 + 4.5e-15).
- E.S. Pearson and H.O. Hartley, ed., Biometrika Tables for Statisticians Volume 1, 2nd Edition, 1958, Table 45, 84-5, 85-6, 88-9
- M.G. Kendall & A. Stewart, The Advanced Theory of Statistics, Volume 2, 3rd Edition, Inference and Relationship, Robust and Distribution-Free Procedures, Ch. 31.25 Eqn (31.34)
Appendix: J code
Summary
Verbs grouped according to their main application: in the last column,
n
is the number of ranks, Q
the number of
interchanges.
Kendall’s Rank Correlation Distribution | ||
---|---|---|
km1 | Original method | km1 n |
km2 | Recurrence method | km2^:(<:n) 1 |
krfd | Cover function for km2 , Frequency Distribution |
krfd n |
Kendall’s Rank Probability Distribution | ( % +/) krfd n | |
Kendall’s Rank Cumulative Probabilities; used for Biometrika [3] check | (] +/\@:% +/) krfd n | |
Probability of random variation in two sets of ranks | ||
r2q | Ranks to Q values | A_ranks r2q B_ranks |
q2p | Q values to Probabilities | n q2p Q |
q2t | Q values to Kendall’s tau | n q2t Q |
Single- and double-tail comparison of rank sets | see Listing | |
Miscellaneous | ||
dsp | Display of results and inline debugging aid | … dsp right_arg … |
ktcv | Kendall’s tau Critical Values | n ktcv nominal_probability |
Listing
dsp =: 1!:2&2 NB. Utility monad: display and return argument n =. 3 NB. n indicates the number of ranks
NB. === Kendall's Rank Correlation Distribution ==================== NB. Both km1 (Original method) and km2 (Recurrence method) create NB. Kendall's Rank Correlation Frequency Distribution for n ranks. km1 =: [: +/"1 [: = [: +/"1 (i.@! ({. +/@:> ])\."1@:A. i.) NB. Monad Syntax: km1 n Note: n>10 may fail on memory limit
km2 =: 3 : 0 NB. Monad SYNTAX: km2^:(<:n) y=.1 Note: n>0 ; y=1 NB. n-1 recurrences update the starting value of y n =. +/ _1 x: 2{. y NB. l is the no. of Q values for h =. >.-: l=. >: -: ( * >:) n NB. the new distribution (n+1) ( , |.`(|.@}:)@.(2|l)) +/\ (h&{. - [: (-h)&{. (0>.h-n+1)&{.) y )
krfd =: 3 : 0 NB. Syntax: [initial_distrbn] krfd iterations_reqd. 1 krfd y NB. Monad default x=1 sets floating point arithmetic. NB. x can be a previous result and/or have an 'x' NB. suffix for extended integer arithmetic : km2^:(y-1=#x) x NB. Reduce effective y by 1 when x=1 (or 1x) ) (] % +/) krfd n NB. Probability distribution ( +/\@:% +/) krfd n NB. Cumulative probability NB. For Biometrika Table 45 Ref [1] check
NB. === Probability of Random Variations of two sets of ranks ====== r2q =: 3 : 0 NB. Ranks to Q NB. SYNTAX: [A_ranks] r2q B_ranks Monad assumes y pre-ordered if. (] -&# ~.) y do. 'r2q: y ranks indiscrete' return. end. Q =. +/ ({. +/@:> ])\. y NB. Exit with Q value : if. (] -&# ~.) x do. 'r2q: x ranks indiscrete' return. end. r2q (/:x) { y NB. Reorder y as ascending x )
NB. Convert Q to probability in single- or double-tail test q2p =: 4 : 0 NB. SYNTAX: no._of_ranks qp2 Q_value if. -.y e. i.>: l=.-:(* <:) x do. 'q2p: incompat. args' return. end. st =. (!x)%~ +/ (y+1){. d=:krfd x NB. Single-tail probability z =. y <. l - y NB. Double-tail: complement y dt =. 1 <. +: (!x)%~ +/ (z+1){. d NB. Double-tail probability hd =. ' n Q Single-tail Double-tail', LF NB. Headings hd, 4 8 10j6 12j6": x, y, st, dt NB. Results )
NB. Convert Q to Kendall's tau q2t =: 1: - 4: * ] % ( * <:)@[ NB. SYNTAX: n q2t Q
NB. Rank comparison to single-, double-tail probability 'r0 r1 r2' =: 0 1 2 3 4 5 ; 5 4 3 2 1 0 ; 6?.6 NB. Rank data 6 q2p r0 r2q r0 NB. Both results very significant 6 q2p r0 r2q r1 NB. Double-tail result very significant 6 q2p r0 r2q r2 NB. Neither result significant
NB. === Miscellaneous ============================================== NB. Critical values of Kendall's tau in cumulative probability ktcv =: 4 : 0 NB. SYNTAX: n ktcv Nominal_s-tail_signif._prob. d =. (+/\@krfd % !) x NB. Cumulative probability x q2t (d <: y) i: 1 NB. Critical value of Kendall's tau )