Current issue

Vol.26 No.4

Vol.26 No.4

Volumes

© 1984-2017
British APL Association
All rights reserved.

Archive articles posted online on request: ask the archivist.

archive/24/1

Volume 24, No.1

  • Proof for author
  • 0.2

Quick calculation of Kendall’s Rank Correlation Distribution

Gordon Sutcliffe (gsutcliffe@talk21.com)

The nub of the paper is the km2 function with krfd as a user-friendly cover function for quick calculation of Kendall’s rank correlation frequency distribution. The remainder is intended to provide a useful, but by no means exhaustive, background.

Introduction

Kendall’s Rank Correlation Distribution provides the statistical probability for the assertion that two sets of n ranks are significantly different, rather than just random fluctuations. The cumulative probability used for their solution can be produced from the integer frequencies calculated here.

Currently the probabilities for n>10 ranks are managed by using approximations to the normal distribution. However, it is not easy to determine the degree of approximation.

The problem from the calculation point of view is that Kendall’s original method takes an enormous amount of computer time to calculate when the number of ranks is more than 12. On a PC [1] the frequency distribution for n=13 ranks would take days, with the time increasing as factorial n.

Quick calculation is based on Kendall’s theory, but uses a recurrence relationship which enables the frequency distribution to be calculated using floating point arithmetic in a fraction of a second up to n=170. (Factorial 170 is the limit for standard floating point arithmetic on a PC.)

The integer floating-point frequencies are exact up to n=18, but at higher n, precision drops to about 13 to 14 significant digits [2]. J extended-integer arithmetic produces exact frequencies up to n=170 in 2000 seconds, or up to n=300 in 14 hours on a PC.

Methods of calculation

Two calculation methods are considered. Method 1 is based on first principles, as in Kendall’s work. Runtime usually limits calculation to about n=12 ranks at most. The second method is based on a recurrence relation. It is practical up to n=170 ranks using floating point arithmetic and can provide exact frequencies to over n=300 using J extended-integer arithmetic.

Both methods agree with Biometrika Table 45, [3] up to its limit of 10 ranks. Basic properties of the Kendall’s rank correlation frequency distribution against Q for n>0 are:

Distribution symmetry about: (max. Q)/2 Q=0 frequencies: 1
Sum of the frequencies: !n Q=1 frequencies: n-1
Number of frequencies, Q: 1+n×(n-1)/2 Q=2 frequencies: (-1)+n×(n-1)/2 [for n>1]

Q and n are discussed further under Method 1.

Method 1: First principles

Taking the case for n=3, the 3 ranks have 6 = 3 × 2 × 1 permutations. Assuming that a rank order a b c corresponds to perfect correlation, the (minimum) number of neighbour interchanges required to transform each permutation to a b c is ascertained:

Permutations for n=3 a b c a c b b a c b c a c a b c b a
Required neighbour interchanges, Q 0 1 1 2 2 3
Frequency total for each unique Q value above 1 2 2 1

The frequencies for each value of Q = 0, 1, 2, 3 are repeated in the table below in the row for n=3.

Table of Frequency Distributions for n=1 to 6 Ranks
The italic figures are temporary additions to aid explanation of the recurrence method
No.
of
ranks
Number of neighbour interchanges, Q
for the frequencies in the column below them
Sum
of
Freqs.
Count
of
Qs
n 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 !n 1+n×
(n-1)/2
0 1
(-1)
1 1
1 1
(-1)
1 1
2 1 1
-1

(-1)
2 2
3 1 2 2 1
-1

-2

-2

(-1)
6 4
4 1 3 5 6 5 3
-1
1
-3

-5

-6

-5

-3

(-1)
24 7
5 1 4 9 15 20 22 20
-1
15
-4
9
-9
4
-15
1
-20

-22

-20

-15

-9

-4
120
(-1)
11
6 1 5 14 29 49 71 90 101 101 90 71 49 29 14 5 1 720 16

Method 2: Recurrence relation

Row r+1 is based [4] on the previous row r. Rows are calculated from row n=1 initially. To calculate the frequencies for the next row, first negate a copy of the current row and insert as shown by the italics at the bottom of the row. The row is then summed progressively from column 0 to the right to produce the next row, stopping before the (_1) column, which would otherwise produce an unwanted zero.

Notes and references

  1. PC system specification s/w: Windows XP home ed. 5.1.2600, Service pack 2; J v6.01c; H/W: Processor: x86 Family 15 Model 2 Stepping 9 GenuineIntel ~2605 MHz; RAM: 512 MB.
  2. More precisely the proportional error is in the range: (1 - 6e-15) to (1 + 4.5e-15).
  3. E.S. Pearson and H.O. Hartley, ed., Biometrika Tables for Statisticians Volume 1, 2nd Edition, 1958, Table 45, 84-5, 85-6, 88-9
  4. M.G. Kendall & A. Stewart, The Advanced Theory of Statistics, Volume 2, 3rd Edition, Inference and Relationship, Robust and Distribution-Free Procedures, Ch. 31.25 Eqn (31.34)

Appendix: J code

Summary

Verbs grouped according to their main application: in the last column, n is the number of ranks, Q the number of interchanges.

Kendall’s Rank Correlation Distribution
km1 Original method km1 n
km2 Recurrence method km2^:(<:n) 1
krfd Cover function for km2, Frequency Distribution krfd n
Kendall’s Rank Probability Distribution ( % +/) krfd n
Kendall’s Rank Cumulative Probabilities; used for Biometrika [3] check (] +/\@:% +/) krfd n
Probability of random variation in two sets of ranks
r2q Ranks to Q values A_ranks r2q B_ranks
q2p Q values to Probabilities n q2p Q
q2t Q values to Kendall’s tau n q2t Q
Single- and double-tail comparison of rank sets see Listing
Miscellaneous
dsp Display of results and inline debugging aid … dsp right_arg …
ktcv Kendall’s tau Critical Values n ktcv nominal_probability

Listing

dsp =: 1!:2&2  NB. Utility monad: display and return argument
n =. 3         NB. n indicates the number of ranks
NB. === Kendall's Rank Correlation Distribution ====================

  NB. Both km1 (Original method) and km2 (Recurrence method) create
  NB. Kendall's Rank Correlation Frequency Distribution for n ranks.

km1 =: [: +/"1 [: = [: +/"1 (i.@! ({. +/@:> ])\."1@:A. i.)
  NB. Monad Syntax:  km1  n   Note: n>10 may fail on memory limit
km2 =: 3 : 0  NB. Monad SYNTAX:  km2^:(<:n) y=.1  Note: n>0 ; y=1
        NB. n-1 recurrences update the starting value of y
 n =. +/ _1 x: 2{. y              NB. l is the no. of Q values for
 h =. >.-: l=. >: -: ( * >:) n    NB.  the new distribution (n+1)
 ( , |.`(|.@}:)@.(2|l)) +/\  (h&{. - [: (-h)&{. (0>.h-n+1)&{.) y
)
krfd =: 3 : 0  NB. Syntax: [initial_distrbn]  krfd  iterations_reqd.
 1 krfd y      NB. Monad default x=1 sets floating point arithmetic.
               NB. x can be a previous result and/or have an 'x'
               NB.  suffix for extended integer arithmetic
:
 km2^:(y-1=#x) x  NB. Reduce effective y by 1 when x=1 (or 1x)
)

(] % +/) krfd n      NB. Probability distribution
( +/\@:% +/) krfd n  NB. Cumulative probability
                     NB. For Biometrika Table 45 Ref [1] check
NB. === Probability of Random Variations of two sets of ranks ======

r2q =: 3 : 0                      NB. Ranks to Q
NB. SYNTAX:  [A_ranks]  r2q  B_ranks   Monad assumes y pre-ordered
 if.  (] -&# ~.) y do. 'r2q: y ranks indiscrete' return. end.
 Q =. +/ ({. +/@:> ])\. y         NB. Exit with Q value
:
 if.  (] -&# ~.) x do. 'r2q: x ranks indiscrete' return. end.
 r2q (/:x) { y                    NB. Reorder y as ascending x
)
  NB. Convert Q to probability in single- or double-tail test
q2p =: 4 : 0  NB. SYNTAX:  no._of_ranks  qp2  Q_value
 if. -.y e. i.>: l=.-:(* <:) x do. 'q2p: incompat. args' return. end.
 st =. (!x)%~ +/ (y+1){. d=:krfd x  NB. Single-tail probability
 z =. y <. l - y                    NB. Double-tail: complement y
 dt =. 1 <. +: (!x)%~ +/ (z+1){. d  NB. Double-tail probability
 hd =. '   n       Q  Single-tail Double-tail', LF  NB. Headings
 hd, 4 8 10j6 12j6": x, y, st, dt   NB. Results
)
  NB. Convert Q to Kendall's tau
q2t =: 1: - 4: * ] % ( * <:)@[  NB. SYNTAX:  n  q2t  Q
  NB. Rank comparison to single-, double-tail probability
'r0 r1 r2' =: 0 1 2 3 4 5  ;  5 4 3 2 1 0 ; 6?.6  NB. Rank data
6 q2p r0 r2q r0    NB. Both results very significant
6 q2p r0 r2q r1    NB. Double-tail result very significant
6 q2p r0 r2q r2    NB. Neither result significant
NB. === Miscellaneous ==============================================

  NB. Critical values of Kendall's tau in cumulative probability
ktcv =: 4 : 0  NB. SYNTAX:  n  ktcv  Nominal_s-tail_signif._prob.
 d =. (+/\@krfd % !) x      NB. Cumulative probability
 x q2t (d <: y) i: 1        NB. Critical value of Kendall's tau
)

 

script began 10:17:57
caching off
debug mode off
cache time 3600 sec
indmtime not found in cache
cached index is fresh
recompiling index.xml
index compiled in 0.2595 secs
read index
read issues/index.xml
identified 26 volumes, 101 issues
array (
  'id' => '10500140',
)
regenerated static HTML
article source is 'XHTML'
completed in 0.2898 secs