Can you explain Simpson's paradox with equations instead of contingency tables?

14

I probably don't have a clear understanding of Simpson's paradox. Informally, I know that the mean of response $Y_1$, pooled over all possible levels of factor $A$, can be higher than the mean of response $Y_2$ pooled over all levels of $A$, even if the mean of $Y_1$ within each level of $A$ (each group) is always lower than the corresponding mean of $Y_2$. I have read examples, but I am still surprised every time I see it, maybe because I don't learn well from specific examples: I have trouble generalizing them. I learn best from, and would prefer to see, an explanation in formulas. Can you explain the paradox using equations rather than count tables?

Also, I think the reason for my surprise is that I may be unconsciously making some assumptions about the means involved in the paradox that are not true in general. Perhaps I forget to weight by the number of samples in each group? But then I would like to see an equation showing that the estimate of the overall mean is more accurate if I weight each group mean by the number of samples in that group, because (if this is true) it is not obvious to me in general. Naively, I would think that the estimate of $\mathbf{E}[Y_1]$ has a lower standard error when I have more samples, regardless of the weighting.

mathematical-statistics simpsons-paradox

— DeltaIV
source

1

I have a related post here with simulations. The simulations may help you understand Simpson's paradox.

— Haitao Du

Here is a machine that produces Simpson's paradoxes on demand!

— kjetil b halvorsen

11

Here is a general approach to understanding Simpson's paradox algebraically for count data.

Suppose we have survival data for an exposure and we create a 2x2 contingency table. To keep things simple, we will have the same count in every cell. We could relax this, but it would make the algebra quite messy.

$\begin{array}{|c|c|c|c|} \hline & \text{Died} & \text{Survived} & \text{Death Rate} \\ \hline \text{Exposed} & X & X & 0.5 \\ \hline \text{Unexposed}& X & X & 0.5\\ \hline \end{array}$

In this case, the death rate is the same in both the exposed and the unexposed groups.

Now, if we split the data, say into one group for females and another for males, we obtain 2 tables with the following counts:

Males:

$\begin{array}{|c|c|c|c|} \hline & \text{Died} & \text{Survived} & \text{Death Rate} \\ \hline \text{Exposed} & Xa & Xb & \frac{a}{a+b} \\ \hline \text{Unexposed}& Xc & Xd & \frac{c}{c+d}\\ \hline \end{array}$

and for females:

$\begin{array}{|c|c|c|c|} \hline & \text{Died} & \text{Survived} & \text{Death Rate} \\ \hline \text{Exposed} & X(1-a) & X(1-b) & \frac{1-a}{2-a-b} \\ \hline \text{Unexposed}& X(1-c) & X(1-d) & \frac{1-c}{2-c-d}\\ \hline \end{array}$

where $a,b,c,d \in [0,1]$ are the proportions of each cell of the aggregated table that are male. The female death rates are written below in the equivalent form $\frac{a-1}{a+b-2}$ and $\frac{c-1}{c+d-2}$ (numerator and denominator both multiplied by $-1$).

Simpson's paradox will occur when the death rate for exposed males is greater than the death rate for unexposed males AND the death rate for exposed females is greater than the death rate for unexposed females. Alternatively, it will also occur when the death rate for exposed males is less than the death rate for unexposed males AND the death rate for exposed females is less than the death rate for unexposed females. That is, when

$\left(\frac{a}{a+b} < \frac{c}{c+d}\right) \text{ and } \left(\frac{a-1}{a+b-2} < \frac{c-1}{c+d-2}\right)$

or

$\left(\frac{a}{a+b} > \frac{c}{c+d}\right) \text{ and } \left(\frac{a-1}{a+b-2} > \frac{c-1}{c+d-2}\right)$
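One way to see why both stratum inequalities can point the same way while the aggregated rates stay equal is to write each aggregated death rate as a weighted average of the male and female rates. This is only a sketch: the weights $w_E$ and $w_U$ (the male share of the exposed and unexposed rows) are notation introduced here, and it assumes $0 < a+b < 2$ and $0 < c+d < 2$ so that all four rates are defined.

$$\frac{1}{2} = w_E\,\frac{a}{a+b} + (1-w_E)\,\frac{a-1}{a+b-2}, \qquad w_E = \frac{a+b}{2}$$

$$\frac{1}{2} = w_U\,\frac{c}{c+d} + (1-w_U)\,\frac{c-1}{c+d-2}, \qquad w_U = \frac{c+d}{2}$$

Each identity can be checked directly: the first term equals $a/2$ (or $c/2$) and the second equals $(1-a)/2$ (or $(1-c)/2$). When $w_E \neq w_U$, the two rows mix the sex-specific rates with different weights, so one row can have the lower rate in both sexes and still reach the same aggregated value, because it puts more weight on the sex with the higher death rate.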

As a concrete example, let $X=100$ and $a=0.5, b=0.8, c=0.9$. Then we will have Simpson's paradox when:

$\left(\frac{0.5}{0.5+0.8} < \frac{0.9}{0.9+d}\right) \text{ and } \left(\frac{0.5-1}{0.5+0.8-2} < \frac{0.9-1}{0.9+d-2}\right)$

$(-0.9 < d < 1.44) \text{ and } (0.96 < d < 1.1)$

From which we conclude that $d$ must lie in $(0.96,1]$.

The second set of inequalities gives:

$\left(\frac{0.5}{0.5+0.8} > \frac{0.9}{0.9+d}\right) \text{ and } \left(\frac{0.5-1}{0.5+0.8-2} > \frac{0.9-1}{0.9+d-2}\right)$

$(d < -0.9 \text{ or } d > 1.44) \text{ and } (d < 0.96 \text{ or } d > 1.1)$

which has no solution for $d \in [0,1]$
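The two sets of inequalities can also be checked numerically. The following is a minimal sketch in R, not part of the original answer: the variable names (`p_a`, `p_d`, `first_set`, and so on) and the grid step of 0.001 are choices made here for illustration.

```r
# Proportions from the worked example; p_d is scanned over a grid in [0, 1].
p_a <- 0.5; p_b <- 0.8; p_c <- 0.9
p_d <- seq(0, 1, by = 0.001)

# Death rates by sex and exposure. The female rates (1 - a) / (2 - a - b)
# are the same as (a - 1) / (a + b - 2) used in the inequalities.
male_exposed     <- p_a / (p_a + p_b)
male_unexposed   <- p_c / (p_c + p_d)
female_exposed   <- (1 - p_a) / (2 - p_a - p_b)
female_unexposed <- (1 - p_c) / (2 - p_c - p_d)

# First set: exposed rate lower in both sexes.
first_set  <- male_exposed < male_unexposed & female_exposed < female_unexposed
# Second set: exposed rate higher in both sexes.
second_set <- male_exposed > male_unexposed & female_exposed > female_unexposed

range(p_d[first_set])  # grid values of d where the first set holds
any(second_set)        # whether the second set ever holds on the grid
```

On this grid the first set holds only for values of $d$ above roughly 0.96, and the second set never holds, which agrees with the algebra.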

So for the three values that we chose for $a,b,$ and $c$ , to invoke Simpson's paradox, $d$ must be greater than 0.96. In the case where the value was $0.99$ then we would obtain a Death Rate for Males of

$0.5/(0.5+0.8) = 38\% \text{ in the exposed group}$

$0.9/(0.9+0.99) = 48\% \text{ in the unexposed group}$

and for Females:

$(0.5-1)/(0.5+0.8-2) = 71\% \text{ in the exposed group}$

$(0.9-1)/(0.9+0.99-2) = 91\% \text{ in the unexposed group}$

So, males have a higher death rate in the unexposed group than in the exposed group, and females also have a higher death rate in the unexposed group than the exposed group, yet the death rates in the aggregated data are the same for exposed and unexposed.
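The same numbers can be rebuilt as count tables. This is a small sketch in R, not part of the original answer: it takes $X=100$, $a=0.5$, $b=0.8$, $c=0.9$, $d=0.99$, and the helper `death_rate()` is a hypothetical name introduced here.

```r
X <- 100

# Male share of each cell of the aggregated table (a, b, c, d).
p <- matrix(c(0.5, 0.8,
              0.9, 0.99),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("Exposed", "Unexposed"),
                            c("Died", "Survived")))

males   <- X * p          # male counts per cell
females <- X * (1 - p)    # female counts per cell
pooled  <- males + females

# Death rate per exposure row: deaths divided by the row total.
death_rate <- function(tab) tab[, "Died"] / rowSums(tab)

rates <- rbind(males   = death_rate(males),
               females = death_rate(females),
               pooled  = death_rate(pooled))
round(rates, 2)
```

The `males` and `females` rows show a lower death rate for the exposed group, while the `pooled` row is 0.5 for both exposure groups.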

— Robert Long
source

16

Suppose we have data on 2 variables, $x$ and $y$ , for 2 groups, A and B.

Data in group A are such that the fitted regression line is

$y = 11 - x$

with mean values of $2$ and $9$ for $x$ and $y$ respectively.

Data in group B are such that the fitted regression line is

$y = 25 - x$

with mean values of $11$ and $14$ for $x$ and $y$ respectively.

So the regression coefficient for $x$ is $-1$ in both groups.

Further, let there be equal numbers of observations in each group, with both $x$ and $y$ distributed symmetrically. We now wish to compute the overall regression line. To keep matters simple we will assume that the overall regression line passes through the means of each group, that is $(2,9)$ for group A and $(11,14)$ for group B. Then it is easy to see that the overall regression line slope must be $(14-9)/(11-2) = 5/9 \approx 0.56$, which under this assumption is the overall regression coefficient for $x$. (The assumption holds only approximately: the slope that `lm` fits to the pooled data below is about 0.51.) Thus we see Simpson's paradox in action: we have a negative association of $x$ with $y$ in each group individually, but a positive association overall when the data are aggregated. We can demonstrate this easily in R as follows:

```r
# Group A: slope -1, means (2, 9)
Xa <- c(1, 2, 3)
Ya <- c(10, 9, 8)
m0 <- lm(Ya ~ Xa)
plot(Xa, Ya, xlim = c(0, 20), ylim = c(5, 20), col = "red")
abline(m0, col = "red")

# Group B: slope -1, means (11, 14)
Xb <- c(10, 11, 12)
Yb <- c(15, 14, 13)
m1 <- lm(Yb ~ Xb)
points(Xb, Yb, col = "blue")
abline(m1, col = "blue")

# Pooled data: the fitted slope is positive
X <- c(Xa, Xb)
Y <- c(Ya, Yb)
m2 <- lm(Y ~ X)
abline(m2, col = "black")
```

The red points and regression line are group A, the blue points and regression line are group B and the black line is the overall regression line.
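For readers who prefer a formula to the plot, the pooled least-squares slope can be written as a weighted average of a within-group slope and a between-group slope. This is a sketch of the standard sum-of-squares decomposition; the symbols $S_W$, $S_B$, $\beta_W$ and $\beta_B$ are notation introduced here, not the original author's.

$$\hat\beta_{\text{pooled}} = \frac{S_W\,\beta_W + S_B\,\beta_B}{S_W + S_B}, \qquad S_W = \sum_g \sum_i (x_{gi} - \bar x_g)^2, \qquad S_B = \sum_g n_g (\bar x_g - \bar x)^2$$

Here $\beta_W$ is the slope computed from deviations about each group's own mean, and $\beta_B$ is the slope through the group means. For the data in the R example, $\beta_W = -1$, $\beta_B = 5/9$, $S_W = 2 + 2 = 4$ and $S_B = 2 \cdot 3 \cdot 4.5^2 = 121.5$, so

$$\hat\beta_{\text{pooled}} = \frac{4 \cdot (-1) + 121.5 \cdot \tfrac{5}{9}}{125.5} = \frac{63.5}{125.5} \approx 0.51$$

The sign of the pooled slope is therefore decided by which term dominates. When the groups are far apart in $x$ relative to their internal spread ($S_B \gg S_W$), the between-group slope wins, whatever the sign of the within-group slopes.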

— Robert Long
source

Hi, thanks for the answer, but this is yet another specific example of Simpson's paradox. I specifically asked for something in the form of a theorem or a set of equations, a more abstract and general approach. Anyway, since there are no other answers, I'll study your example and if I feel that it helps me to generalize the concept, I'll accept the answer.

— DeltaIV

3

@DeltaIV I have written a new answer using purely algebraic arguments.

— Robert Long