Prediction interval for a binomial random variable

What is the formula (approximate or exact) for a prediction interval for a binomial random variable?

Suppose $Y \sim \mathsf{Binom}(n, p)$, and we observe $y$ (drawn from $Y$). The $n$ is known.

Our goal is to obtain a 95% prediction interval for a new draw from $Y$.

The point estimate is $n\hat{p}$, where $\hat{p}=\frac{y}{n}$. A confidence interval for $\hat{p}$ is straightforward, but I can't find a formula for a prediction interval for $Y$. If we knew $p$ (rather than $\hat{p}$), then a 95% prediction interval would just involve finding the quantiles of a binomial. Is there something obvious I'm overlooking?
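For reference, the known-$p$ case really is just a pair of binomial quantiles. A minimal sketch in base R, where the values $n = 100$ and $p = 0.7$ are assumptions chosen purely for illustration:

```r
# Illustrative values only (assumptions, not data from the question).
n <- 100
p <- 0.7

# Equal-tailed 95% interval from the binomial quantile function.
limits <- qbinom(c(0.025, 0.975), size = n, prob = p)

# Y is discrete, so the probability content is at least (not exactly) 95%.
content <- pbinom(limits[2], n, p) - pbinom(limits[1] - 1, n, p)
```

This ignores the uncertainty in $p$ entirely, which is exactly what goes wrong once $p$ is replaced by $\hat{p}$.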

confidence-interval binomial prediction-interval

— Statseeker

See "What non-Bayesian methods are there for predictive inference?". In this case the method using pivots isn't available (I don't think), but you could use one of the predictive likelihoods. Or, of course, a Bayesian approach.

— Scortchi - Reinstate Monica

Hi guys, I'd like to take a moment to address the concerns raised. - Regarding the confidence interval for p: I'm not interested in that. - Regarding predictions being 95% of the distribution: yes, that is exactly what prediction intervals are, regardless of context (in regression you have to assume normal errors, whereas confidence intervals rely on the CLT). - Yes, the example of predicting the number of heads in coin flips is correct. What makes this problem difficult is that we don't know "p", we only have an estimate of it.

— Statseeker

@Addison Read the book Statistical Intervals by G. Hahn and W. Meeker. They explain the difference between confidence intervals, prediction intervals, tolerance intervals, and Bayesian credible intervals. A 95% prediction interval does not contain 95% of the distribution. It does what most frequentist intervals do. If you repeatedly sample from B(n, p) and use the same method each time to produce a 95% prediction interval for p, then 95% of the prediction intervals will contain the true value of p. If you want to cover 95% of the distribution, construct a tolerance interval.

— Michael R. Chernick

Tolerance intervals cover a percentage of the distribution. For a 95% tolerance interval for 90% of the distribution, you again repeat the process many times and use the same method to generate the interval each time; then in approximately 95% of the cases at least 90% of the distribution will fall in the interval, and 5% of the time less than 90% of the distribution will be contained in the interval.

— Michael R. Chernick

Lawless & Fredette (2005), "Frequentist Prediction Intervals and Predictive Distributions", Biometrika, 92, 3 is another good reference, in addition to those at the link I gave.

— Scortchi - Reinstate Monica

Ok, let's try this. I'll give two answers: the Bayesian one, which in my opinion is simple and natural, and one of the possible frequentist ones.

Bayesian solution

We assume a Beta prior on $p$, i.e., $p \sim Beta(\alpha,\beta)$, because the Beta-Binomial model is conjugate, which means that the posterior distribution is also a Beta distribution, with parameters $\hat{\alpha}=\alpha+k,\hat{\beta}=\beta+n-k$ (I'm using $k$ to denote the number of successes in $n$ trials, instead of $y$). Thus, inference is greatly simplified. Now, if you have some prior knowledge on the likely values of $p$, you could use it to set the values of $\alpha$ and $\beta$, i.e., to define your Beta prior; otherwise you could assume a uniform (noninformative) prior, with $\alpha=\beta=1$, or another noninformative prior. In any case, your posterior is

$Pr(p|n,k)=Beta(\alpha+k,\beta+n-k)$
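The conjugate update can be checked numerically. The sketch below is only an illustration in base R: it assumes a uniform prior and the counts used in the practical example further down (70 successes in 100 trials), and compares the closed-form posterior density with a brute-force normalisation of prior times likelihood at one arbitrary point `p0`.

```r
# Values are assumptions for illustration: uniform prior, 70 successes in 100 trials.
alpha <- 1; beta <- 1
n <- 100; k <- 70

alpha_post <- alpha + k       # 71
beta_post  <- beta + n - k    # 31

# Numerical check of conjugacy at one point p0: prior * likelihood, normalised
# by its integral over (0, 1), should match the Beta(alpha_post, beta_post) density.
unnorm <- function(p) dbinom(k, n, p) * dbeta(p, alpha, beta)
p0 <- 0.65
numeric_post <- unnorm(p0) / integrate(unnorm, 0, 1)$value
closed_form  <- dbeta(p0, alpha_post, beta_post)
```

The two densities should agree up to the tolerance of the numerical integration.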

In Bayesian inference, all that matters is the posterior probability, in the sense that once you know it, you can make inferences for all other quantities in your model. You want to make inference on the observables $y$: in particular, on a vector of new results $\mathbf{y}=y_1,\dots,y_m$, where $m$ is not necessarily equal to $n$. Specifically, for each $j=0,\dots,m$, we want to compute the probability of having exactly $j$ successes in the next $m$ trials, given that we got $k$ successes in the preceding $n$ trials; this is the posterior predictive mass function:

$Pr(j|m,n,k)=\int_0^1 Pr(j,p|m,n,k)dp=\int_0^1 Pr(j|p,m,n,k)Pr(p|n,k)dp$

However, our Binomial model for $Y$ means that, conditionally on $p$ having a certain value, the probability of having $j$ successes in $m$ trials doesn't depend on past results: it's simply

$f(j|m,p)=\binom{m}{j} p^j(1-p)^{m-j}$

Thus the expression becomes

$Pr(j|m,n,k)=\int_0^1 \binom{m}{j} p^j(1-p)^{m-j} Pr(p|n,k)dp=\int_0^1 \binom{m}{j} p^j(1-p)^{m-j} Beta(\alpha+k,\beta+n-k)dp$

The result of this integral is a well-known distribution called the Beta-Binomial distribution: skipping the intermediate steps, we get the horrible expression

$Pr(j|m,n,k)=\frac{m!}{j!(m-j)!}\frac{\Gamma(\alpha+\beta+n)}{\Gamma(\alpha+k)\Gamma(\beta+n-k)}\frac{\Gamma(\alpha+k+j)\Gamma(\beta+n+m-k-j)}{\Gamma(\alpha+\beta+n+m)}$
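The Gamma-function expression above is easier to evaluate on the log scale, since it equals $\binom{m}{j}B(\alpha+k+j,\beta+n-k+m-j)/B(\alpha+k,\beta+n-k)$. A minimal sketch in base R; `dpred` is a hypothetical helper name, not a function from any package, and the numbers in the call are those of the practical example further down.

```r
# Hypothetical helper (not from any package): Beta-Binomial predictive pmf
# Pr(j | m, n, k) under a Beta(alpha, beta) prior, computed on the log scale.
dpred <- function(j, m, n, k, alpha = 1, beta = 1) {
  a <- alpha + k       # posterior first shape parameter
  b <- beta + n - k    # posterior second shape parameter
  exp(lchoose(m, j) + lbeta(a + j, b + m - j) - lbeta(a, b))
}

pmf <- dpred(0:20, m = 20, n = 100, k = 70)
sum(pmf)   # should be 1 up to floating-point error
```

Working with `lbeta` and `lchoose` avoids overflow in the Gamma functions when $n$ or $m$ is large.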

Our point estimate for $j$ , given quadratic loss, is of course the mean of this distribution, i.e.,

$\mu=\frac{m(\alpha+k)}{(\alpha+\beta+n)}$
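One way to see where this mean comes from, without touching the Beta-Binomial mass function, is the law of total expectation: conditionally on $p$ the count $j$ is $Binom(m,p)$ with mean $mp$, and the posterior mean of $p$ under $Beta(\alpha+k,\beta+n-k)$ is $\frac{\alpha+k}{\alpha+\beta+n}$. As a sketch of the steps:

$E[j|m,n,k]=E\big[E[j|p,m]\,\big|\,n,k\big]=E[mp|n,k]=m\,E[p|n,k]=\frac{m(\alpha+k)}{\alpha+\beta+n}$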

Now, let's look for a prediction interval. Since this is a discrete distribution, we don't have a closed form expression for $[j_1,j_2]$ , such that $Pr(j_1\leq j \leq j_2)= 0.95$ . The reason is that, depending on how you define a quantile, for a discrete distribution the quantile function is either not a function or is a discontinuous function. But this is not a big problem: for small $m$ , you can just write down the $m$ probabilities $Pr(j=0|m,n,k),Pr(j\leq 1|m,n,k),\dots,Pr(j \leq m-1|m,n,k)$ and from here find $j_1,j_2$ such that

$Pr(j_1\leq j \leq j_2)=Pr(j\leq j_2|m,n,k)-Pr(j < j_1|m,n,k)\geq 0.95$

Of course you would find more than one such pair, so you would ideally look for the shortest $[j_1,j_2]$ such that the above is satisfied. Note that

$Pr(j=0|m,n,k)=p_0,Pr(j\leq 1|m,n,k)=p_1,\dots,Pr(j \leq m-1|m,n,k)=p_{m-1}$

are just the values of the CDF (cumulative distribution function) of the Beta-Binomial distribution, and as such there is a closed form expression, but this is in terms of the generalized hypergeometric function and thus is quite complicated. I'd rather just install the R package extraDistr and call pbbinom to compute the CDF of the Beta-Binomial distribution. Specifically, if you want to compute all the probabilities $p_0,\dots,p_{m-1}$ in one go, just write:

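library(extraDistr)
# m, n, k, alpha and beta must already be defined (see the example below)
jvec <- seq(0, m - 1, by = 1)
# cumulative predictive probabilities Pr(j <= 0), ..., Pr(j <= m - 1)
probs <- pbbinom(jvec, m, alpha = alpha + k, beta = beta + n - k)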

where alpha and beta are the values of the parameters of your Beta prior, i.e., $\alpha$ and $\beta$ (thus 1 if you're using a uniform prior over $p$ ). Of course it would all be much simpler if R provided a quantile function for the Beta-Binomial distribution, but unfortunately it doesn't.
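In the absence of a ready-made quantile function, one can be improvised from the cumulative probabilities. The following is only a sketch in base R: `qpred` is a hypothetical helper (it is not part of extraDistr or base R) that returns the smallest $j$ whose cumulative predictive probability reaches a given level.

```r
# Hypothetical helper (an assumption, not part of extraDistr or base R):
# smallest j in 0..m whose cumulative predictive probability reaches `prob`.
qpred <- function(prob, m, n, k, alpha = 1, beta = 1) {
  a <- alpha + k
  b <- beta + n - k
  j <- 0:m
  pmf <- exp(lchoose(m, j) + lbeta(a + j, b + m - j) - lbeta(a, b))
  cdf <- cumsum(pmf)
  cdf[m + 1] <- 1  # guard against round-off in the last cumulative sum
  sapply(prob, function(q) j[which(cdf >= q)[1]])
}

qpred(c(0.025, 0.975), m = 20, n = 100, k = 70)
```

This follows the usual "generalized inverse" convention for discrete quantiles; other conventions would shift the result by one at ties.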

Practical example with the Bayesian solution

Let $n=100$ , $k=70$ (thus we initially observed 70 successes in 100 trials). We want a point estimate and a 95%-prediction interval for the number of successes $j$ in the next $m=20$ trials. Then

n <- 100
k <- 70
m <- 20
alpha <- 1
beta  <- 1

where I assumed a uniform prior on $p$ : depending on the prior knowledge for your specific application, this may or may not be a good prior. Thus

bayesian_point_estimate <- m * (alpha + k)/(alpha + beta + n) #13.92157

Clearly a non-integer estimate for $j$ doesn't make sense, so we could just round to the nearest integer (14). Then, for the prediction interval:

jvec <- seq(0, m-1, by = 1)
library(extraDistr)
probabilities <- pbbinom(jvec, m, alpha = alpha + k, beta = beta + n - k)

The probabilities are

> probabilities
 [1] 1.335244e-09 3.925617e-08 5.686014e-07 5.398876e-06
 [5] 3.772061e-05 2.063557e-04 9.183707e-04 3.410423e-03
 [9] 1.075618e-02 2.917888e-02 6.872028e-02 1.415124e-01
[13] 2.563000e-01 4.105894e-01 5.857286e-01 7.511380e-01
[17] 8.781487e-01 9.546188e-01 9.886056e-01 9.985556e-01

For an equal-tail probabilities interval, we want the smallest $j_2$ such that $Pr(j\leq j_2|m,n,k)\ge 0.975$ and the largest $j_1$ such that $Pr(j < j_1|m,n,k)=Pr(j \le j_1-1|m,n,k)\le 0.025$ . This way, we will have

$Pr(j_1\leq j \leq j_2|m,n,k)=Pr(j\leq j_2|m,n,k)-Pr(j < j_1|m,n,k)\ge 0.975-0.025=0.95$

Thus, by looking at the above probabilities, we see that $j_2=18$ and $j_1=9$ . The probability of this Bayesian prediction interval is 0.9778494, which is larger than 0.95. We could find shorter intervals such that $Pr(j_1\leq j \leq j_2|m,n,k)\ge 0.95$ , but in that case at least one of the two inequalities for the tail probabilities wouldn't be satisfied.
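Reading the limits off the printed table can be automated. The sketch below uses base R only (so it recomputes the predictive probabilities instead of calling pbbinom), applies the two tail inequalities, and then searches by brute force for the shortest interval with probability at least 0.95; the variable names are illustrative.

```r
n <- 100; k <- 70; m <- 20
alpha <- 1; beta <- 1
a <- alpha + k
b <- beta + n - k

j   <- 0:m
pmf <- exp(lchoose(m, j) + lbeta(a + j, b + m - j) - lbeta(a, b))
cdf <- cumsum(pmf)            # cdf[i] = Pr(j <= i - 1)

# Equal-tail limits, following the two inequalities above
j2 <- min(j[cdf >= 0.975])                  # smallest j2 with Pr(j <= j2) >= 0.975
j1 <- max(j[c(0, cdf[-(m + 1)]) <= 0.025])  # largest j1 with Pr(j < j1) <= 0.025
equal_tail_prob <- sum(pmf[j >= j1 & j <= j2])

# Shortest interval with probability >= 0.95, by brute force over all pairs
best <- NULL
for (lo in j) for (hi in j[j >= lo]) {
  pr <- sum(pmf[j >= lo & j <= hi])
  if (pr >= 0.95 && (is.null(best) || hi - lo < best$hi - best$lo)) {
    best <- list(lo = lo, hi = hi, prob = pr)
  }
}
```

With the numbers above this should reproduce $j_1=9$, $j_2=18$ and a probability of 0.9778494. The brute-force search should return $[10,18]$ with probability of about 0.959, at the price of a lower tail probability above 0.025.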

Frequentist solution

I'll follow the treatment of Krishnamoorthy and Peng, 2011. Let $Y\sim Binom(m,p)$ and $X\sim Binom(n,p)$ be independent binomial random variables. We want a $(1-2\alpha)$ prediction interval for $Y$, based on an observation of $X$. In other words, we look for $I=[L(X;n,m,\alpha),U(X;n,m,\alpha)]$ such that:

$Pr_{X,Y}(Y\in I)=Pr_{X,Y}(L(X;n,m,\alpha)\leq Y\leq U(X;n,m,\alpha))\geq 1-2\alpha$

The "$\geq 1-2\alpha$" is due to the fact that we are dealing with a discrete random variable, so we cannot expect exact coverage; we can, however, look for an interval whose coverage is always at least the nominal level, i.e., a conservative interval. Now, it can be proved that the conditional distribution of $X$ given $X+Y=k+j=s$ is hypergeometric with sample size $s$, number of successes in the population $n$ and population size $n+m$. Thus the conditional pmf is

$Pr(X=k|X+Y=s,n,n+m)=\frac{\binom{n}{k}\binom{m}{s-k}}{\binom{m+n}{s}}$
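For completeness, here is a sketch of why this holds. Since $X$ and $Y$ are independent with the same $p$, their sum is $X+Y\sim Binom(n+m,p)$, and all powers of $p$ and $1-p$ cancel in the ratio:

$Pr(X=k|X+Y=s)=\frac{Pr(X=k)Pr(Y=s-k)}{Pr(X+Y=s)}=\frac{\binom{n}{k}p^k(1-p)^{n-k}\binom{m}{s-k}p^{s-k}(1-p)^{m-s+k}}{\binom{m+n}{s}p^s(1-p)^{n+m-s}}=\frac{\binom{n}{k}\binom{m}{s-k}}{\binom{m+n}{s}}$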

The conditional CDF of $X$ given $X+Y=s$ is thus

$Pr(X\leq k|s,n,n+m)=H(k;s,n,n+m)=\sum_{i=0}^k\frac{\binom{n}{i}\binom{m}{s-i}}{\binom{m+n}{s}}$

The first great thing about this CDF is that it doesn't depend on $p$, which we don't know. The second great thing is that it allows us to easily find our PI: as a matter of fact, if we observed a value $k$ of $X$, then the $1-\alpha$ lower prediction limit is the smallest integer $L$ such that

$Pr(X\geq k|k+L,n,n+m)=1-H(k-1;k+L,n,n+m)>\alpha$

Correspondingly, the $1-\alpha$ upper prediction limit is the largest integer $U$ such that

$Pr(X\leq k|k+U,n,n+m)=H(k;k+U,n,n+m)>\alpha$

Thus, $[L,U]$ is a prediction interval for $Y$ with coverage at least $1-2\alpha$. Note that when $p$ is close to 0 or 1, this interval is conservative even for large $n$, $m$, i.e., its coverage is considerably larger than $1-2\alpha$.
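The two defining inequalities translate almost literally into base R via phyper. The sketch below is one possible implementation, not code from the paper: `exact_binom_pi` and `coverage` are hypothetical helper names, and `alpha` is the per-tail error rate. The second function computes the exact coverage for a given true $p$ by summing over all possible observations, which is a way to check the "at least $1-2\alpha$" claim.

```r
# Hypothetical helper (name and interface are assumptions): exact two-sided
# prediction interval for Y ~ Binom(m, p) after observing k successes in n trials.
# `alpha` is the per-tail error rate, so nominal coverage is 1 - 2 * alpha.
exact_binom_pi <- function(k, n, m, alpha = 0.025) {
  y <- 0:m
  lower_ok <- 1 - phyper(k - 1, n, m, k + y) > alpha  # Pr(X >= k | s = k + y)
  upper_ok <- phyper(k, n, m, k + y) > alpha          # Pr(X <= k | s = k + y)
  c(L = min(y[lower_ok]), U = max(y[upper_ok]))
}

# Exact coverage for a given true p: sum over all possible observations x
coverage <- function(p, n, m, alpha = 0.025) {
  sum(sapply(0:n, function(x) {
    lim <- exact_binom_pi(x, n, m, alpha)
    dbinom(x, n, p) * (pbinom(lim[["U"]], m, p) - pbinom(lim[["L"]] - 1, m, p))
  }))
}

sapply(c(0.05, 0.5, 0.7), coverage, n = 100, m = 20)
```

Comparing the coverage at values of $p$ near 0 or 1 with values near 0.5 gives a feel for how conservative the interval is.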

Practical example with the Frequentist solution

Same setting as before, but we don't need to specify $\alpha$ and $\beta$ (there are no priors in the Frequentist framework):

n <- 100
k <- 70
m <- 20

The point estimate is now obtained using the maximum likelihood estimate of the success probability, $\hat{p}=\frac{k}{n}$, which in turn leads to the following estimate for the number of successes in $m$ trials:

frequentist_point_estimate <- m * k/n #14

For the prediction interval, the procedure is a bit different. We look for the largest $U$ such that $Pr(X\leq k|k+U,n,n+m)=H(k;k+U,n,n+m)>\alpha$ , thus let's compute the above expression for all $U$ in $[0,m]$ :

jvec <- seq(0, m, by = 1)
probabilities <- phyper(k,n,m,k+jvec)

We can see that the largest $U$ such that the probability is still larger than 0.025 is

jvec[which.min(probabilities > 0.025) - 1] # 18

Same as for the Bayesian approach. The lower prediction bound $L$ is the smallest integer such that $Pr(X\geq k|k+L,n,n+m)=1-H(k-1;k+L,n,n+m)>\alpha$, thus

probabilities <- 1-phyper(k-1,n,m,k+jvec)
jvec[which.max(probabilities > 0.025)] # 9 (which.max returns the first TRUE)

Thus our frequentist "exact" prediction interval is $[L,U]=[9,18]$, which here coincides with the Bayesian interval.

— DeltaIV