Prove the equivalence of the following two formulas for Spearman correlation


14

According to Wikipedia, Spearman's rank correlation is computed by converting the variables $X_i$ and $Y_i$ into ranked variables $x_i$ and $y_i$, and then computing the Pearson correlation between the ranked variables:

$$\rho = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$$

However, the article goes on to state that if there are no ties among the variables $X_i$ and $Y_i$, the formula above is equivalent to

$$\rho = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)}$$

where $d_i = y_i - x_i$ is the difference in ranks.

Can anyone give a proof of this, please? I don't have access to the textbooks referenced by the Wikipedia article.

Answers:


14

$$\rho = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$$

Since there are no ties, the $x$'s and $y$'s both consist of the integers from $1$ to $n$ inclusive.

Hence the two sums of squares under the square root are equal, and we can rewrite the denominator:

$$\frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$$

But the denominator is just a function of $n$:

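$$\begin{aligned}
\sum_i (x_i - \bar{x})^2 &= \sum_i x_i^2 - n\bar{x}^2 \\
&= \frac{n(n+1)(2n+1)}{6} - n\left(\frac{n+1}{2}\right)^2 \\
&= n(n+1)\left(\frac{2n+1}{6} - \frac{n+1}{4}\right) \\
&= n(n+1)\left(\frac{8n + 4 - 6n - 6}{24}\right) \\
&= n(n+1)\left(\frac{n-1}{12}\right) \\
&= \frac{n(n^2-1)}{12}
\end{aligned}$$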
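As a sanity check on the closed form, here is a minimal Python sketch that compares the directly computed sum of squared deviations of the ranks $1,\dots,n$ with $n(n^2-1)/12$. The function names `sum_sq_dev` and `closed_form` are illustrative only; exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction

def sum_sq_dev(n):
    """Sum of squared deviations of the ranks 1..n from their mean, computed directly."""
    mean = Fraction(n + 1, 2)
    return sum((Fraction(k) - mean) ** 2 for k in range(1, n + 1))

def closed_form(n):
    """The closed form n(n^2 - 1)/12 obtained in the derivation."""
    return Fraction(n * (n * n - 1), 12)

# Print both values side by side for a few small n.
for n in range(1, 8):
    print(n, sum_sq_dev(n), closed_form(n))
```

The two columns should agree for every $n$, since the identity is exact.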

Now let's look at the numerator:

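$$\begin{aligned}
\sum_i (x_i - \bar{x})(y_i - \bar{y}) &= \sum_i x_i (y_i - \bar{y}) - \sum_i \bar{x}(y_i - \bar{y}) \\
&= \sum_i x_i y_i - \bar{y}\sum_i x_i - \bar{x}\sum_i y_i + n\bar{x}\bar{y} \\
&= \sum_i x_i y_i - n\bar{x}\bar{y} \\
&= \sum_i x_i y_i - n\left(\frac{n+1}{2}\right)^2 \\
&= \sum_i x_i y_i - \frac{n(n+1)}{12}\, 3(n+1) \\
&= \frac{n(n+1)}{12}\cdot\bigl(-3(n+1)\bigr) + \sum_i x_i y_i \\
&= \frac{n(n+1)}{12}\cdot\bigl[(n-1) - (4n+2)\bigr] + \sum_i x_i y_i \\
&= \frac{n(n+1)(n-1)}{12} - \frac{n(n+1)(2n+1)}{6} + \sum_i x_i y_i \\
&= \frac{n(n+1)(n-1)}{12} - \sum_i x_i^2 + \sum_i x_i y_i \\
&= \frac{n(n+1)(n-1)}{12} - \sum_i \frac{x_i^2 + y_i^2}{2} + \sum_i x_i y_i \\
&= \frac{n(n+1)(n-1)}{12} - \sum_i \frac{x_i^2 - 2x_i y_i + y_i^2}{2} \\
&= \frac{n(n+1)(n-1)}{12} - \sum_i \frac{(x_i - y_i)^2}{2} \\
&= \frac{n(n^2-1)}{12} - \frac{\sum_i d_i^2}{2}
\end{aligned}$$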

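Numerator/Denominator

$$= \frac{n(n+1)(n-1)/12 - \sum_i d_i^2/2}{n(n^2-1)/12} = \frac{n(n^2-1)/12 - \sum_i d_i^2/2}{n(n^2-1)/12} = 1 - \frac{6\sum_i d_i^2}{n(n^2-1)}.$$

Hence

$$\rho = 1 - \frac{6\sum_i d_i^2}{n(n^2-1)}.$$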
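The result can also be checked numerically. The following is a minimal sketch, not part of the proof: it compares the Pearson correlation of the ranks with the shortcut formula over every tie-free ranking of a small size. The helper names `pearson`, `spearman_shortcut` and `max_gap` are made up for illustration.

```python
import math
from itertools import permutations

def pearson(x, y):
    """Pearson correlation computed from sums of centred products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_shortcut(x, y):
    """1 - 6 * sum(d_i^2) / (n (n^2 - 1)), valid only when there are no ties."""
    n = len(x)
    d2 = sum((b - a) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d2 / (n * (n * n - 1))

def max_gap(n):
    """Largest absolute disagreement over all tie-free rankings of size n."""
    # Fixing x = (1, ..., n) loses no generality: only the pairing of ranks matters.
    x = tuple(range(1, n + 1))
    return max(abs(pearson(x, y) - spearman_shortcut(x, y)) for y in permutations(x))

for n in range(2, 7):
    print(n, max_gap(n))
```

Any gap printed should be at the level of floating-point rounding error.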


5
You could eliminate the last 80% of this work by starting with the observation that $\rho$ is invariant under location and scale changes, thereby reducing the problem to expressing $\sum x_i y_i$ in terms of $\sum (x_i - y_i)^2$ when $\sum x_i^2 = \sum y_i^2 = 1$; the formula obviously is $\frac{1}{2}\sum d_i^2 = \frac{1}{2}\sum (x_i - y_i)^2 = 1 - \sum x_i y_i$. Then the only real work to be done is accomplished by your calculation of the denominator.
whuber

@whuber +1, that's a good bit neater. But I think I'll leave it in the longer, less neat, bull-at-a-gate form.
Glen_b -Reinstate Monica

thanks, both answers are good but I have accepted this one as it is the one I started attempting myself.
Alex

I should explain my reasons for going the more prosaic route -- the other answers are neat, illuminating and clever, but require insights that are unlikely to be generated by any but the better students on their own. The advantage of showing it's entirely amenable to straightforward if uninspired manipulation is that it should be within the grasp of even the moderately able if uninspired-to-insight student. Sometimes knowing you don't need any insightful tricks is helpful (to those who don't see them).
Glen_b -Reinstate Monica

I guess it depends on your view of what constitutes a "trick," "manipulation," and "insight." Long batteries of involved algebraic calculations, as you intimate, provide little or no insight (as well as offering many opportunities for mistakes)--and I fear that students may view them as being formidable for their very bulk alone, as well as unmotivated. Other operations, such as a preliminary standardization (which is so helpful here), may initially be viewed as "tricks" but after a few applications should come to be seen as insightful and fundamental tools.
whuber

10

We see that in the second formula there appears the squared Euclidean distance between the two (ranked) variables: $D^2 = \Sigma d_i^2$. The decisive intuition at the start will be how $D^2$ might be related to $r$. It is clearly related via the cosine theorem. If we have the two variables centered, then the cosine in the linked theorem's formula is equal to $r$ (it can be easily proved, we'll take here as granted). And $h^2$ (the squared Euclidean norm) is $N\sigma^2$, sum-of-squares in a centered variable. So the theorem's formula looks like this: $D_{xy}^2 = N\sigma_x^2 + N\sigma_y^2 - 2\sqrt{N}\sigma_x \sqrt{N}\sigma_y\, r$. Please note also another important thing (which might have to be proved separately): When data are ranks, $D^2$ is the same for centered and not centered data.

Further, since the two variables were ranked, their variances are the same, $\sigma_x = \sigma_y = \sigma$, so $D^2 = 2N\sigma^2 - 2N\sigma^2 r$.

$r = 1 - \dfrac{D^2}{2N\sigma^2}$. Recall that ranked data are from a discrete uniform distribution having variance $(N^2-1)/12$. Substituting it into the formula leaves $r = 1 - \dfrac{6D^2}{N(N^2-1)}$.
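The relation $D^2 = 2N\sigma^2(1 - r)$ and the variance $(N^2-1)/12$ can be illustrated with a short Python sketch. It assumes tie-free ranks and $N \ge 2$; the function name `check_cosine_relation` and the `seed` parameter are illustrative, not taken from the answer.

```python
import math
import random

def check_cosine_relation(n, seed=0):
    """Return (D^2, 2*N*var*(1 - r), var, (N^2 - 1)/12) for a random tie-free ranking.

    Assumes n >= 2 so that the variance of the ranks is nonzero.
    """
    rng = random.Random(seed)
    x = list(range(1, n + 1))
    y = x[:]
    rng.shuffle(y)  # y is a random permutation of the same ranks
    mean = (n + 1) / 2
    var = sum((v - mean) ** 2 for v in x) / n  # population variance (divisor N)
    r = sum((a - mean) * (b - mean) for a, b in zip(x, y)) / (n * var)
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return d2, 2 * n * var * (1 - r), var, (n * n - 1) / 12

print(check_cosine_relation(10))
```

The first two entries of the returned tuple should match, as should the last two, up to rounding.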


8

The algebra is simpler than it might first appear.

IMHO, there is little profit or insight achieved by belaboring the algebraic manipulations. Instead, a truly simple identity shows why squared differences can be used to express (the usual Pearson) correlation coefficient. Applying this to the special case where the data are ranks produces the result. It exhibits the heretofore mysterious coefficient

$$\frac{6}{n(n^2-1)}$$

as being half the reciprocal of the variance of the ranks $1, 2, \ldots, n$. (When ties are present, this coefficient acquires a more complicated formula, but will still be one-half the reciprocal of the variance of the ranks assigned to the data.)

Once you have seen and understood this, the formula becomes memorable. Comparable (but more complex) formulas that handle ties, that show up in nonparametric statistical tests like the Wilcoxon rank sum test, or that appear in spatial statistics (like Moran's I, Geary's C, and others) become instantly understandable.


Consider any set of paired data $(X_i, Y_i)$ with means $\bar{X}$ and $\bar{Y}$ and variances $s_X^2$ and $s_Y^2$. By recentering the variables at their means $\bar{X}$ and $\bar{Y}$ and using their standard deviations $s_X$ and $s_Y$ as units of measurement, the data will be re-expressed in terms of the standardized values

$$(x_i, y_i) = \left(\frac{X_i - \bar{X}}{s_X}, \frac{Y_i - \bar{Y}}{s_Y}\right).$$

By definition, the Pearson correlation coefficient of the original data is the average product of the standardized values,

$$\rho = \frac{1}{n}\sum_{i=1}^n x_i y_i.$$

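The Polarization Identity relates products to squares. For two numbers $x$ and $y$ it asserts

$$xy = \frac{1}{2}\left(x^2 + y^2 - (x - y)^2\right),$$

which is easily verified. Applying this to each term in the sum gives

$$\rho = \frac{1}{n}\sum_{i=1}^n \frac{1}{2}\left(x_i^2 + y_i^2 - (x_i - y_i)^2\right).$$

Because the $x_i$ and $y_i$ have been standardized, their average squares are both unity, whence

$$\rho = \frac{1}{2}\left(1 + 1 - \frac{1}{n}\sum_{i=1}^n (x_i - y_i)^2\right) = 1 - \frac{1}{2}\left(\frac{1}{n}\sum_{i=1}^n (x_i - y_i)^2\right). \tag{1}$$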

The correlation coefficient differs from its maximum possible value, 1, by one-half the mean squared difference of the standardized data.

This is a universal formula for correlation, valid no matter what the original data were (provided only that both variables have nonzero standard deviations). (Faithful readers of this site will recognize this as being closely related to the geometric characterization of covariance described and illustrated at How would you explain covariance to someone who understands only the mean?.)
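To see that the identity holds for arbitrary data and not only for ranks, here is a minimal Python sketch on simulated values. It assumes standard deviations computed with divisor $n$ (not $n-1$), which is what makes the average squared standardized value equal to one. The function names and the simulated data are illustrative only.

```python
import math
import random

def standardize(values):
    """Centre at the mean and scale by the standard deviation (divisor n, not n - 1)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def pearson_as_mean_product(X, Y):
    """Correlation as the average product of standardized values."""
    x, y = standardize(X), standardize(Y)
    return sum(a * b for a, b in zip(x, y)) / len(x)

def pearson_from_squared_differences(X, Y):
    """Correlation as 1 minus half the mean squared difference of standardized values."""
    x, y = standardize(X), standardize(Y)
    msd = sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)
    return 1 - msd / 2

# Arbitrary real-valued data, not ranks.
rng = random.Random(1)
X = [rng.gauss(0, 1) for _ in range(50)]
Y = [0.6 * a + rng.gauss(0, 1) for a in X]
print(pearson_as_mean_product(X, Y), pearson_from_squared_differences(X, Y))
```

The two printed values should coincide up to floating-point rounding.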


In the special case where the $X_i$ and $Y_i$ are distinct ranks, each is a permutation of the same sequence of numbers $1, 2, \ldots, n$. Thus $\bar{X} = \bar{Y} = (n+1)/2$ and, with a tiny bit of calculation we find

$$s_X^2 = s_Y^2 = \frac{1}{n}\sum_{i=1}^n \left(i - (n+1)/2\right)^2 = \frac{n^2-1}{12}$$

(which, happily, is nonzero whenever $n > 1$). Therefore

$$(x_i - y_i)^2 = \frac{\left((X_i - (n+1)/2) - (Y_i - (n+1)/2)\right)^2}{(n^2-1)/12} = \frac{12(X_i - Y_i)^2}{n^2-1}.$$

This nice simplification occurred because the $X_i$ and $Y_i$ have the same means and standard deviations: the difference of their means therefore disappeared and the product $s_X s_Y$ became $s_X^2$ which involves no square roots.

Plugging this into the formula (1) for $\rho$ gives

$$\rho = 1 - \frac{6}{n(n^2-1)}\sum_{i=1}^n (X_i - Y_i)^2.$$

2
(+1) The geometric interpretation in terms of your famous "rectangles for covariance" answer is very neat but I wonder if casual readers will see it - perhaps a sketch diagram might help (I was tempted to add one myself!). For the curious: the formula $r = 1 - s_{x-y}^2/2$ is number 9 in the list of Thirteen Ways to Look at the Correlation Coefficient, by Joseph Lee Rodgers and W. Alan Nicewander in The American Statistician, Vol. 42, No. 1. (Feb., 1988), pp. 59-66. stat.berkeley.edu/~rabbee/correlation.pdf
Silverfish

2
@Silver Thank you for the helpful comments. The Rodgers and Nicewander article is summarized on our site at stats.stackexchange.com/a/104577. Someday I might draw the diagram you describe... .
whuber

5

High school students may see the PMCC and Spearman correlation formulae years before they have the algebra skills to manipulate sigma notation, though they may well know the method of finite differences for deducing the polynomial equation for a sequence. So I have tried to write a "high school proof" for the equivalence: finding the denominator using finite differences, and minimising the algebraic manipulation of sums in the numerator. Depending on the students the proof is presented to, you may prefer this approach to the numerator, but combine it with a more conventional method for the denominator.

Denominator, $\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}$

With no ties, the data are the ranks $\{1, 2, \ldots, n\}$ in some order, so it is easy to show $\bar{x} = \frac{n+1}{2}$. We can reorder the sum $S_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2 = \sum_{k=1}^n \left(k - \frac{n+1}{2}\right)^2$, though with lower grade students I'd likely write this sum out explicitly rather than in sigma notation. The sum of a quadratic in $k$ will be cubic in $n$, a fact that students familiar with the finite difference method may grasp intuitively: differencing a cubic produces a quadratic, so summing a quadratic produces a cubic. Determining the coefficients of the cubic $f(n)$ is straightforward if students are comfortable manipulating $\Sigma$ notation and know (and remember!) the formulae for $\sum_{k=1}^n k$ and $\sum_{k=1}^n k^2$. But they can also be deduced using finite differences, as follows.

When $n = 1$, the data set is just $\{1\}$, $\bar{x} = 1$, so $f(1) = (1-1)^2 = 0$.

For $n = 2$, the data are $\{1, 2\}$, $\bar{x} = 1.5$, so $f(2) = (1-1.5)^2 + (2-1.5)^2 = 0.5$.

For $n = 3$, the data are $\{1, 2, 3\}$, $\bar{x} = 2$, so $f(3) = (1-2)^2 + (2-2)^2 + (3-2)^2 = 2$.

These computations are fairly brief, and help reinforce what the notation $\sum_{i=1}^n (x_i - \bar{x})^2$ means, and in short order we produce the finite difference table.

[Figure: finite difference table for $S_{xx}$]
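Since the table itself is an image, a minimal Python sketch can regenerate its contents. The helper names `f` and `difference_table` are illustrative; exact fractions are used so that the half-integer values are shown without rounding.

```python
from fractions import Fraction

def f(n):
    """S_xx for the ranks 1..n, computed directly from the definition."""
    mean = Fraction(n + 1, 2)
    return sum((Fraction(k) - mean) ** 2 for k in range(1, n + 1))

def difference_table(values):
    """Rows of successive differences; row 0 is the sequence itself."""
    rows = [list(values)]
    while len(rows[-1]) > 1:
        prev = rows[-1]
        rows.append([b - a for a, b in zip(prev, prev[1:])])
    return rows

# Sequence values for n = 1..6, followed by first, second and third differences.
table = difference_table([f(n) for n in range(1, 7)])
for order, row in enumerate(table[:4]):
    print(order, [str(v) for v in row])
```

Row 3 of the output holds the third differences, which should all be equal if the sequence is cubic in $n$.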

We can obtain the coefficients of $f(n)$ by cranking out the finite difference method as outlined in the links above. For instance, the constant third differences indicate our polynomial is indeed cubic, with leading coefficient $\frac{0.5}{3!} = \frac{1}{12}$. There are a few tricks to minimise drudgery: a well-known one is to use the common differences to extend the sequence back to $n = 0$, as knowing $f(0)$ immediately gives away the constant coefficient. Another is to try extending the sequence to see if $f(n)$ is zero for an integer $n$ - e.g. if the sequence had been positive but decreasing, it would be worth extending rightwards to see if we could "catch a root", as this makes factorisation easier later. In our case, the function seems to hover around low values when $n$ is small, so let's extend even further leftwards.

[Figure: extended finite difference table for $S_{xx}$]
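The leftward extension can be sketched in Python as follows. This is one possible way to do it, assuming the third difference is constant: the relation $f(n+3) - 3f(n+2) + 3f(n+1) - f(n) = c$ is solved for $f(n)$. The function name `extend_left` is hypothetical.

```python
from fractions import Fraction

def extend_left(known, third_difference, steps):
    """Prepend `steps` terms to a sequence whose third difference is constant.

    Uses f(n) = f(n+3) - 3 f(n+2) + 3 f(n+1) - c, where c is the third difference.
    """
    seq = list(known)
    for _ in range(steps):
        # seq[0], seq[1], seq[2] currently hold f(n+1), f(n+2), f(n+3).
        seq.insert(0, seq[2] - 3 * seq[1] + 3 * seq[0] - third_difference)
    return seq

# f(1), f(2), f(3), f(4) computed directly from the definition of S_xx.
known = [Fraction(0), Fraction(1, 2), Fraction(2), Fraction(5)]
extended = extend_left(known, Fraction(1, 2), steps=3)  # now covers n = -2 .. 4
for n, value in zip(range(-2, 5), extended):
    print(n, value)
```

Zeros in the printed column mark integer roots of $f$.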

Aha! It turns out we have caught all three roots: $f(-1) = f(0) = f(1) = 0$. So the polynomial has factors of $(n+1)$, $n$, and $(n-1)$. Since it was cubic it must be of the form:

$$f(n) = a\,n(n+1)(n-1)$$

We can see that $a$ must be the coefficient of $n^3$ which we already determined to be $\frac{1}{12}$. Alternatively, since $f(2) = 0.5$ we have $a(2)(3)(1) = 0.5$ which leads to the same conclusion. Expanding the difference of two squares gives:

$$S_{xx} = \frac{n(n^2-1)}{12}$$

Since the same argument applies to $S_{yy}$, the denominator is $\sqrt{S_{xx} S_{yy}} = \sqrt{S_{xx}^2} = S_{xx}$ and we are done. Ignoring my exposition, this method is surprisingly short. If one can spot that the polynomial is cubic, it is necessary only to calculate $S_{xx}$ for the cases $n \in \{1, 2, 3, 4\}$ to establish the third difference is 0.5. Root-hunters need only extend the sequence leftwards to $n = 0$ and $n = -1$, by when all three roots are found. It took me a couple of minutes to find $S_{xx}$ this way.

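Numerator, $\sum_i (x_i - \bar{x})(y_i - \bar{y})$

I note the identity $(b-a)^2 \equiv b^2 - 2ab + a^2$ which can be rearranged to:

$$ab \equiv \frac{1}{2}\left(a^2 + b^2 - (b-a)^2\right)$$

If we let $a = x_i - \bar{x} = x_i - \frac{n+1}{2}$ and $b = y_i - \bar{y} = y_i - \frac{n+1}{2}$ we have the useful result that $b - a = y_i - x_i = d_i$ because the means, being identical, cancel out. That was my intuition for writing the identity in the first place; I wanted to switch from working with the product of the moments to the square of their differences. We now have:

$$(x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{2}\left((x_i - \bar{x})^2 + (y_i - \bar{y})^2 - d_i^2\right)$$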

Hopefully even students unsure how to manipulate Σ notation can see how summing over the data set yields:

$$S_{xy} = \frac{1}{2}\left(S_{xx} + S_{yy} - \sum_{i=1}^n d_i^2\right)$$

We have already established, by reordering the sums, that $S_{yy} = S_{xx}$, leaving us with:

$$S_{xy} = S_{xx} - \frac{1}{2}\sum_{i=1}^n d_i^2$$

The formula for Spearman's correlation coefficient is within our grasp!

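$$r_S = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{S_{xx} - \frac{1}{2}\sum_i d_i^2}{S_{xx}} = 1 - \frac{\sum_i d_i^2}{2 S_{xx}}$$

Substituting the earlier result that $S_{xx} = \frac{1}{12} n(n^2-1)$ will finish the job.

$$r_S = 1 - \frac{\sum_i d_i^2}{2 \cdot \frac{1}{12} n(n^2-1)} = 1 - \frac{6\sum_i d_i^2}{n(n^2-1)}$$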
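A small worked example may help make the two routes concrete. The data below are made up for illustration: $n = 4$, with ranks $x = (1, 2, 3, 4)$ and $y = (2, 1, 4, 3)$, so $\bar{x} = \bar{y} = 2.5$ and $d = (1, -1, 1, -1)$.

```latex
\begin{aligned}
S_{xx} &= \frac{n(n^2-1)}{12} = \frac{4 \cdot 15}{12} = 5, \qquad \sum_i d_i^2 = 4, \\
S_{xy} &= (-1.5)(-0.5) + (-0.5)(-1.5) + (0.5)(1.5) + (1.5)(0.5) = 3
        = S_{xx} - \tfrac{1}{2}\sum_i d_i^2, \\
r_S &= \frac{S_{xy}}{S_{xx}} = \frac{3}{5} = 0.6, \qquad
1 - \frac{6\sum_i d_i^2}{n(n^2-1)} = 1 - \frac{24}{60} = 0.6 .
\end{aligned}
```

Both the direct computation and the shortcut formula give $0.6$ for this data set.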
Licensed under cc by-sa 3.0 with attribution required.