Understanding the bias-variance tradeoff derivation


I am reading the chapter on the bias-variance tradeoff in The Elements of Statistical Learning and I have a doubt about the formula on page 29. Let the data arise from a model such that

$$Y = f(x) + \epsilon$$

where $\epsilon$ is a random number with expected value $\hat{\epsilon} = E[\epsilon] = 0$ and variance $E[(\epsilon - \hat{\epsilon})^2] = E[\epsilon^2] = \sigma^2$. Let the expected error of the model be

$$E[(Y - f_k(x))^2]$$

where $f_k(x)$ is the prediction at $x$ of our learner. According to the book, the error is

$$E[(Y - f_k(x))^2] = \sigma^2 + \mathrm{Bias}^2(f_k(x)) + \mathrm{Var}(f_k(x)).$$

My question is: why is the bias term not 0? Expanding the formula for the error, I get

$$\begin{aligned}
E[(Y - f_k(x))^2] &= E[(f(x) + \epsilon - f_k(x))^2] \\
&= E[(f(x) - f_k(x))^2] + 2E[(f(x) - f_k(x))\epsilon] + E[\epsilon^2] \\
&= \mathrm{Var}(f_k(x)) + 2E[(f(x) - f_k(x))\epsilon] + \sigma^2
\end{aligned}$$

and, since $\epsilon$ is an independent random number, $2E[(f(x) - f_k(x))\epsilon] = 2E[f(x) - f_k(x)]\,E[\epsilon] = 0$.

Where am I wrong?



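You are not wrong, but you made an error in one step, since $E[(f(x) - f_k(x))^2] \ne \mathrm{Var}(f_k(x))$. Instead, $E[(f(x) - f_k(x))^2]$ is $\mathrm{MSE}(f_k(x)) = \mathrm{Var}(f_k(x)) + \mathrm{Bias}^2(f_k(x))$.

$$\begin{aligned}
E[(Y - f_k(x))^2] &= E[(f(x) + \epsilon - f_k(x))^2] \\
&= E[(f(x) - f_k(x))^2] + 2E[(f(x) - f_k(x))\epsilon] + E[\epsilon^2] \\
&= E[(f(x) - E(f_k(x)) + E(f_k(x)) - f_k(x))^2] + 2E[(f(x) - f_k(x))\epsilon] + \sigma^2 \\
&= \mathrm{Var}(f_k(x)) + \mathrm{Bias}^2(f_k(x)) + \sigma^2.
\end{aligned}$$

Note: $E[(f_k(x) - E(f_k(x)))(f(x) - E(f_k(x)))] = E[f_k(x) - E(f_k(x))]\,(f(x) - E(f_k(x))) = 0$, because $f(x) - E(f_k(x))$ is a constant and $E[f_k(x) - E(f_k(x))] = 0$.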
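
If it helps to see the identity numerically, here is a minimal Monte Carlo sketch. Everything concrete in it is my own assumption and not from the book: the true function `f(x) = sin(2*pi*x)`, a straight-line least-squares fit standing in for the learner $f_k$, the noise level `sigma`, and the evaluation point `x0`. It only illustrates that the squared bias is clearly nonzero while the expected squared error still splits into variance, squared bias and noise.

```python
import numpy as np

def decompose_at_point(n_trials=20000, n_train=20, sigma=0.5, x0=0.25, seed=0):
    """Monte Carlo estimate of MSE, variance and squared bias of a line fit at x0."""
    rng = np.random.default_rng(seed)
    f = lambda x: np.sin(2 * np.pi * x)        # hypothetical true regression function

    x = np.linspace(0.0, 1.0, n_train)         # fixed training inputs
    X = np.column_stack([np.ones_like(x), x])  # design matrix for a straight line
    # Least squares is linear in y, so the prediction at x0 is h @ y.
    h = np.array([1.0, x0]) @ np.linalg.pinv(X)

    # One row per simulated training set: Y = f(x) + eps.
    y_train = f(x) + sigma * rng.standard_normal((n_trials, n_train))
    preds = y_train @ h                        # f_k(x0), one value per training set

    # A fresh response at x0, independent of every training set.
    y_new = f(x0) + sigma * rng.standard_normal(n_trials)

    mse = np.mean((y_new - preds) ** 2)        # estimate of E[(Y - f_k(x0))^2]
    var = np.var(preds)                        # estimate of Var(f_k(x0))
    bias2 = (np.mean(preds) - f(x0)) ** 2      # estimate of Bias^2(f_k(x0))
    return mse, var, bias2, sigma ** 2

mse, var, bias2, noise = decompose_at_point()
print(f"MSE                    = {mse:.4f}")
print(f"Var + Bias^2 + sigma^2 = {var + bias2 + noise:.4f}")
print(f"Bias^2 alone           = {bias2:.4f}")
```

The two printed totals should agree up to Monte Carlo error, and the squared bias is far from zero because a straight line cannot follow the sine curve at `x0`.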

In the case of binary outcomes, is there an equivalent proof with cross-entropy as the error measure?

It doesn't work out quite so well with a binary response. See Ex 7.2 in the second edition of "The Elements of Statistical Learning".
Matthew Drury

Could you explain how you go from $E[(f(x) - E(f_k(x)) + E(f_k(x)) - f_k(x))^2] + 2E[(f(x) - f_k(x))\epsilon] + \sigma^2$ to $\mathrm{Var}(f_k(x)) + \mathrm{Bias}^2(f_k(x)) + \sigma^2$?


A few more steps of the bias-variance decomposition

Indeed, the full derivation is rarely given in textbooks as it involves a lot of uninspiring algebra. Here is a more complete derivation using notation from the book "Elements of Statistical Learning" on page 223

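If we assume that $Y = f(X) + \epsilon$ with $E[\epsilon] = 0$ and $\mathrm{Var}(\epsilon) = \sigma_\epsilon^2$, then we can derive the expression for the expected prediction error of a regression fit $\hat f(X)$ at an input point $X = x_0$ using squared-error loss:

$$\mathrm{Err}(x_0) = E[(Y - \hat f(x_0))^2 \mid X = x_0]$$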

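For notational simplicity let $\hat f(x_0) = \hat f$, $f(x_0) = f$, and recall that $E[f] = f$ and $E[Y] = f$. Adding and subtracting $f$ gives

$$\begin{aligned}
E[(Y - \hat f)^2] &= E[(Y - f + f - \hat f)^2] \\
&= E[(Y - f)^2] + E[(f - \hat f)^2] + 2E[(f - \hat f)(Y - f)] \\
&= E[(f + \epsilon - f)^2] + E[(f - \hat f)^2] + 2E[fY - f^2 - \hat f Y + \hat f f] \\
&= E[\epsilon^2] + E[(f - \hat f)^2] + 2\left(f^2 - f^2 - fE[\hat f] + fE[\hat f]\right) \\
&= \sigma_\epsilon^2 + E[(f - \hat f)^2] + 0
\end{aligned}$$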

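For the term $E[(f - \hat f)^2]$ we can use a similar trick as above, adding and subtracting $E[\hat f]$, to get

$$\begin{aligned}
E[(f - \hat f)^2] &= E[(f - E[\hat f] + E[\hat f] - \hat f)^2] \\
&= E[(f - E[\hat f])^2] + E[(\hat f - E[\hat f])^2] - 2E[(f - E[\hat f])(\hat f - E[\hat f])] \\
&= (f - E[\hat f])^2 + E[(\hat f - E[\hat f])^2] - 2(f - E[\hat f])\,E[\hat f - E[\hat f]] \\
&= \mathrm{Bias}^2[\hat f] + \mathrm{Var}[\hat f] - 0
\end{aligned}$$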

Putting it together

$$E[(Y - \hat f)^2] = \sigma_\epsilon^2 + \mathrm{Bias}^2[\hat f] + \mathrm{Var}[\hat f]$$

Some comments on why $E[\hat f Y] = f\,E[\hat f]$

Taken from Alecos Papadopoulos here

Recall that $\hat f$ is the predictor we have constructed based on the $m$ data points $\{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}$, so we can write $\hat f = \hat f_m$ to remember that.

On the other hand, $Y$ is the response at a new data point $(x^{(m+1)}, y^{(m+1)})$, which we predict using the model constructed on the $m$ data points above. So the mean squared error can be written as

$$E[(\hat f_m(x^{(m+1)}) - y^{(m+1)})^2]$$

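Expanding the cross term $E[\hat f Y]$ from the previous section

$$E[\hat f_m Y] = E[\hat f_m (f + \epsilon)] = E[\hat f_m f + \hat f_m \epsilon] = E[\hat f_m f] + E[\hat f_m \epsilon]$$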

The last part of the equation can be viewed as

$$E[\hat f_m(x^{(m+1)}) \cdot \epsilon^{(m+1)}] = 0$$

Since we make the following assumptions about the point $x^{(m+1)}$:

  • It was not used when constructing $\hat f_m$
  • It is independent of all other observations $\{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}$
  • It is independent of $\epsilon^{(m+1)}$
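
A small simulation can make these assumptions concrete. The sketch below is only an illustration under assumptions of my own (a true function `sin(2*pi*x)`, a straight-line least-squares fit as $\hat f_m$, fixed training inputs, Gaussian noise, and a hypothetical new input `x_new`). It estimates $E[\hat f_m \epsilon]$ twice: once with the noise of a training point that was used to build $\hat f_m$, and once with the noise of a fresh point that was not.

```python
import numpy as np

def noise_correlation(n_trials=50000, m=20, sigma=0.5, x_new=0.1, seed=0):
    """Estimate E[f_hat * eps] at a training point and at a fresh point."""
    rng = np.random.default_rng(seed)
    f = lambda x: np.sin(2 * np.pi * x)          # hypothetical true function

    x = np.linspace(0.0, 1.0, m)                 # training inputs x^(1), ..., x^(m)
    X = np.column_stack([np.ones_like(x), x])    # straight-line design matrix
    P = np.linalg.pinv(X)                        # maps y to least-squares coefficients
    h_train = X[0] @ P                           # weights giving f_hat_m(x^(1))
    h_new = np.array([1.0, x_new]) @ P           # weights giving f_hat_m(x^(m+1))

    eps = sigma * rng.standard_normal((n_trials, m))   # training noise, one row per data set
    y = f(x) + eps
    eps_new = sigma * rng.standard_normal(n_trials)    # noise eps^(m+1) of the new point

    in_sample = np.mean((y @ h_train) * eps[:, 0])     # f_hat_m was built using eps^(1)
    out_of_sample = np.mean((y @ h_new) * eps_new)     # f_hat_m never saw eps^(m+1)
    expected_in = sigma ** 2 * h_train[0]              # theory: sigma^2 times the leverage H_11
    return in_sample, out_of_sample, expected_in

in_sample, out_of_sample, expected_in = noise_correlation()
print(f"in-sample     E[f_hat * eps] ~ {in_sample:.4f} (theory {expected_in:.4f})")
print(f"out-of-sample E[f_hat * eps] ~ {out_of_sample:.4f} (theory 0)")
```

The in-sample average is positive because $\hat f_m$ depends on the training noise, while the out-of-sample average is close to zero, which is the case the derivation relies on.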


Why is $E[\hat f Y] = f\,E[\hat f]$? I don't think $Y$ and $\hat f$ are independent, since $\hat f$ is essentially constructed using $Y$.
Felipe Pérez

But the question is essentially the same: why is $E[\hat f \epsilon] = 0$? The randomness of $\hat f$ comes from the error $\epsilon$, so I don't see why $\hat f$ and $\epsilon$ would be independent, and hence why $E(\hat f \epsilon) = 0$.
Felipe Pérez

From your clarification, it seems that the in-sample vs. out-of-sample perspective is crucial. Is that so? If we work only in sample and then treat $\epsilon$ as a residual, does the bias-variance tradeoff disappear?

@FelipePérez As far as I understand, the randomness of $\hat f$ comes from the train-test split (which points ended up in the training set and gave $\hat f$ as the trained predictor). In other words, the variance of $\hat f$ comes from all the possible subsets of a given fixed data set that we can take as the training set. Because the data set is fixed, there is no randomness coming from $\epsilon$, and therefore $\hat f$ and $\epsilon$ are independent.
Alberto Santini
