Understanding the bias-variance tradeoff derivation



I'm reading the chapter on the bias-variance tradeoff in The Elements of Statistical Learning and I have doubts about the formula on page 29. Let the data come from a model such that

$$Y = f(x) + \epsilon$$
where $\epsilon$ is a random variable with expected value $\hat{\epsilon} = E[\epsilon] = 0$ and variance $E[(\epsilon - \hat{\epsilon})^2] = E[\epsilon^2] = \sigma^2$. Let the expected error of the model be
$$E[(Y - f_k(x))^2]$$
where $f_k(x)$ is our learner's prediction for $x$. According to the book, the error is
$$E[(Y - f_k(x))^2] = \sigma^2 + \mathrm{Bias}(f_k)^2 + \mathrm{Var}(f_k(x)).$$

My question is: why isn't the bias term 0? Expanding the error formula, I see

$$\begin{aligned}
E[(Y - f_k(x))^2] &= E[(f(x) + \epsilon - f_k(x))^2] \\
&= E[(f(x) - f_k(x))^2] + 2E[(f(x) - f_k(x))\epsilon] + E[\epsilon^2] \\
&= \mathrm{Var}(f_k(x)) + 2E[(f(x) - f_k(x))\epsilon] + \sigma^2
\end{aligned}$$

and since $\epsilon$ is an independent random variable, $2E[(f(x) - f_k(x))\epsilon] = 2E[(f(x) - f_k(x))]\,E[\epsilon] = 0$.

Where am I going wrong?

Answers:



You are not wrong, but you made an error in one step, since $E[(f(x) - f_k(x))^2] \neq \mathrm{Var}(f_k(x))$. Rather, $E[(f(x) - f_k(x))^2] = \mathrm{MSE}(f_k(x)) = \mathrm{Var}(f_k(x)) + \mathrm{Bias}^2(f_k(x))$.

$$\begin{aligned}
E[(Y - f_k(x))^2] &= E[(f(x) + \epsilon - f_k(x))^2] \\
&= E[(f(x) - f_k(x))^2] + 2E[(f(x) - f_k(x))\epsilon] + E[\epsilon^2] \\
&= E\big[(f(x) - E[f_k(x)] + E[f_k(x)] - f_k(x))^2\big] + 2E[(f(x) - f_k(x))\epsilon] + \sigma^2 \\
&= \mathrm{Var}(f_k(x)) + \mathrm{Bias}^2(f_k(x)) + \sigma^2.
\end{aligned}$$

Note: $E\big[(f_k(x) - E[f_k(x)])(f(x) - E[f_k(x)])\big] = E\big[f_k(x) - E[f_k(x)]\big]\,(f(x) - E[f_k(x)]) = 0.$
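
To spell out that last step explicitly (one way to expand it, in the same notation; the cross term is exactly the one the note refers to):

$$\begin{aligned}
E\big[(f(x) - E[f_k(x)] + E[f_k(x)] - f_k(x))^2\big]
&= E\big[(f(x) - E[f_k(x)])^2\big] + 2E\big[(f(x) - E[f_k(x)])(E[f_k(x)] - f_k(x))\big] + E\big[(E[f_k(x)] - f_k(x))^2\big] \\
&= (f(x) - E[f_k(x)])^2 + 2\,(f(x) - E[f_k(x)])\,E\big[E[f_k(x)] - f_k(x)\big] + \mathrm{Var}(f_k(x)) \\
&= \mathrm{Bias}^2(f_k(x)) + 0 + \mathrm{Var}(f_k(x)),
\end{aligned}$$

because $f(x) - E[f_k(x)]$ is a constant and $E\big[E[f_k(x)] - f_k(x)\big] = 0$.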


In the case of binary outcomes, is there an equivalent proof with cross-entropy as the error measure?
emanuele

It doesn't work out quite so well with a binary response. See Ex 7.2 in the second edition of "The Elements of Statistical Learning".
Matthew Drury

Could you explain how you go from $E\big[(f(x) - E[f_k(x)] + E[f_k(x)] - f_k(x))^2\big] + 2E[(f(x) - f_k(x))\epsilon] + \sigma^2$ to $\mathrm{Var}(f_k(x)) + \mathrm{Bias}^2(f_k(x)) + \sigma^2$?
Antoine


A few more steps of the Bias-Variance decomposition

Indeed, the full derivation is rarely given in textbooks, as it involves a lot of uninspiring algebra. Here is a more complete derivation using the notation of "The Elements of Statistical Learning", page 223.


If we assume that $Y = f(X) + \epsilon$ with $E[\epsilon] = 0$ and $\mathrm{Var}(\epsilon) = \sigma^2_\epsilon$, then we can derive the expression for the expected prediction error of a regression fit $\hat{f}(X)$ at an input point $X = x_0$, using squared-error loss:

$$\mathrm{Err}(x_0) = E\big[(Y - \hat{f}(x_0))^2 \mid X = x_0\big]$$

For notational simplicity let $\hat{f}(x_0) = \hat{f}$, $f(x_0) = f$, and recall that $E[f] = f$ and $E[Y] = f$.

$$\begin{aligned}
E[(Y - \hat{f})^2] &= E[(Y - f + f - \hat{f})^2] \\
&= E[(Y - f)^2] + E[(f - \hat{f})^2] + 2E[(f - \hat{f})(Y - f)] \\
&= E[(f + \epsilon - f)^2] + E[(f - \hat{f})^2] + 2E[fY - f^2 - \hat{f}Y + \hat{f}f] \\
&= E[\epsilon^2] + E[(f - \hat{f})^2] + 2\big(f^2 - f^2 - fE[\hat{f}] + fE[\hat{f}]\big) \\
&= \sigma^2_\epsilon + E[(f - \hat{f})^2] + 0
\end{aligned}$$

For the term $E[(f - \hat{f})^2]$ we can use a similar trick as above, adding and subtracting $E[\hat{f}]$, to get

$$\begin{aligned}
E[(f - \hat{f})^2] &= E\big[(f - E[\hat{f}] + E[\hat{f}] - \hat{f})^2\big] \\
&= E\big[(f - E[\hat{f}])^2\big] + E\big[(\hat{f} - E[\hat{f}])^2\big] \\
&= \big(f - E[\hat{f}]\big)^2 + E\big[(\hat{f} - E[\hat{f}])^2\big] \\
&= \mathrm{Bias}^2[\hat{f}] + \mathrm{Var}[\hat{f}]
\end{aligned}$$

(the cross term $2E\big[(f - E[\hat{f}])(E[\hat{f}] - \hat{f})\big]$ drops out because $f - E[\hat{f}]$ is a constant and $E\big[E[\hat{f}] - \hat{f}\big] = 0$).

Putting it together

$$E[(Y - \hat{f})^2] = \sigma^2_\epsilon + \mathrm{Bias}^2[\hat{f}] + \mathrm{Var}[\hat{f}]$$
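
To make the decomposition concrete, here is a minimal Monte Carlo sketch (an illustration added here, not part of the original answer): it repeatedly draws training sets from an arbitrary true function with Gaussian noise, fits a fixed-degree polynomial, and checks that the simulated $\mathrm{Err}(x_0)$ is close to $\sigma^2_\epsilon + \mathrm{Bias}^2[\hat{f}(x_0)] + \mathrm{Var}[\hat{f}(x_0)]$. The choice of $f$, the noise level, the sample size and the degree are assumptions made only for this example.

import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # true regression function (an arbitrary choice for the illustration)
    return np.sin(2 * np.pi * x)

sigma_eps = 0.3                  # noise standard deviation
n_train, degree, x0 = 30, 3, 0.8
n_sims = 20_000

preds = np.empty(n_sims)         # \hat f(x0), one per simulated training set
sq_errors = np.empty(n_sims)     # (Y - \hat f(x0))^2, one per simulation
for s in range(n_sims):
    # draw a fresh training set and fit a polynomial of the chosen degree
    x = rng.uniform(0.0, 1.0, n_train)
    y = f(x) + rng.normal(0.0, sigma_eps, n_train)
    coefs = np.polyfit(x, y, degree)
    preds[s] = np.polyval(coefs, x0)
    # draw an independent new response Y at x0 and record the squared error
    y0 = f(x0) + rng.normal(0.0, sigma_eps)
    sq_errors[s] = (y0 - preds[s]) ** 2

bias2 = (preds.mean() - f(x0)) ** 2   # Bias^2[\hat f(x0)]
var = preds.var()                     # Var[\hat f(x0)]
print("Err(x0), simulated:        ", sq_errors.mean())
print("sigma_eps^2 + Bias^2 + Var:", sigma_eps**2 + bias2 + var)

Up to Monte Carlo error, the two printed values agree, which is exactly the identity above.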


Some comments on why $E[\hat{f}Y] = fE[\hat{f}]$

Taken from Alecos Papadopoulos here

Recall that $\hat{f}$ is the predictor we have constructed based on the $m$ data points $\{(x^{(1)},y^{(1)}),\ldots,(x^{(m)},y^{(m)})\}$, so we can write $\hat{f} = \hat{f}_m$ to remember that.

On the other hand, $Y$ is the response at a new data point $(x^{(m+1)},y^{(m+1)})$ that we predict using the model constructed on the $m$ data points above. So the mean squared error can be written as

$$E\big[(\hat{f}_m(x^{(m+1)}) - y^{(m+1)})^2\big]$$

Expanding the equation from the previous section

$$E[\hat{f}_m Y] = E[\hat{f}_m(f + \epsilon)] = E[\hat{f}_m f + \hat{f}_m\epsilon] = E[\hat{f}_m f] + E[\hat{f}_m\epsilon]$$

The last part of the equation can be viewed as

$$E\big[\hat{f}_m(x^{(m+1)})\,\epsilon^{(m+1)}\big] = 0$$

Since we make the following assumptions about the point $x^{(m+1)}$:

  • It was not used when constructing $\hat{f}_m$
  • It is independent of all other observations $\{(x^{(1)},y^{(1)}),\ldots,(x^{(m)},y^{(m)})\}$
  • It is independent of $\epsilon^{(m+1)}$
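
To illustrate why these assumptions matter (this is also the in-sample vs out-of-sample point raised in the comments below), here is a small simulation sketch of my own, with an arbitrary linear true function: it compares the average of $\hat{f}_m(x^{(i)})\,\epsilon^{(i)}$ over the training points with $\hat{f}_m(x^{(m+1)})\,\epsilon^{(m+1)}$ at a fresh point. Only the latter averages to zero.

import numpy as np

rng = np.random.default_rng(1)
m, sigma_eps, n_sims = 20, 1.0, 20_000

in_sample = np.empty(n_sims)
out_of_sample = np.empty(n_sims)
for s in range(n_sims):
    # training data from an arbitrary true function f(x) = 2x
    x = rng.uniform(-1.0, 1.0, m)
    eps = rng.normal(0.0, sigma_eps, m)
    y = 2 * x + eps
    slope, intercept = np.polyfit(x, y, 1)    # \hat f_m fitted on the m points
    # in sample: average of \hat f_m(x^(i)) * eps^(i) over the training points
    in_sample[s] = np.mean((slope * x + intercept) * eps)
    # out of sample: \hat f_m(x^(m+1)) * eps^(m+1) at a fresh, independent point
    x_new = rng.uniform(-1.0, 1.0)
    eps_new = rng.normal(0.0, sigma_eps)
    out_of_sample[s] = (slope * x_new + intercept) * eps_new

print("in sample,  mean of f_hat(x_i) * eps_i    ~", in_sample.mean())      # clearly positive
print("new point,  mean of f_hat(x_new) * eps_new ~", out_of_sample.mean()) # approximately 0

In sample, $\hat{f}_m$ has been fitted to the very noise terms it is multiplied by, so the average is positive; at the fresh point the independence assumptions above make the expectation zero, which is what the derivation uses.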



Why $E[\hat{f}Y] = fE[\hat{f}]$? I don't think $Y$ and $\hat{f}$ are independent, since $\hat{f}$ is essentially constructed using $Y$.
Felipe Pérez

But the question is essentially the same: why is $E[\hat{f}\epsilon] = 0$? The randomness of $\hat{f}$ comes from the error $\epsilon$, so I don't see why $\hat{f}$ and $\epsilon$ would be independent, and hence why $E[\hat{f}\epsilon] = 0$.
Felipe Pérez

From your clarification it seems that the in-sample vs out-of-sample perspective is crucial. Is that so? If we work only in sample and therefore treat $\epsilon$ as a residual, does the bias-variance tradeoff disappear?
markowitz

@FelipePérez as far as I understand, the randomness of $\hat{f}$ comes from the train-test split (which points ended up in the training set and gave $\hat{f}$ as the trained predictor). In other words, the variance of $\hat{f}$ comes from all the possible subsets of a given fixed dataset that we can take as the training set. Because the dataset is fixed, there is no randomness coming from $\epsilon$, and therefore $\hat{f}$ and $\epsilon$ are independent.
Alberto Santini