A few more steps of the Bias-Variance decomposition
Indeed, the full derivation is rarely given in textbooks as it involves a lot of uninspiring algebra. Here is a more complete derivation, using the notation from the book "Elements of Statistical Learning" (page 223).
If we assume that $Y = f(X) + \epsilon$ with $E[\epsilon] = 0$ and $\operatorname{Var}(\epsilon) = \sigma_\epsilon^2$, then we can derive the expression for the expected prediction error of a regression fit $\hat{f}(X)$ at an input point $X = x_0$, using squared error loss:
$$\mathrm{Err}(x_0) = E\left[(Y - \hat{f}(x_0))^2 \mid X = x_0\right]$$
For notational simplicity let $\hat{f}(x_0) = \hat{f}$ and $f(x_0) = f$, and recall that $E[f] = f$ (since $f$ is deterministic) and $E[Y] = f$ (since $E[\epsilon] = 0$).
$$\begin{aligned}
E[(Y - \hat{f})^2] &= E[(Y - f + f - \hat{f})^2] \\
&= E[(Y - f)^2] + E[(f - \hat{f})^2] + 2E[(f - \hat{f})(Y - f)] \\
&= E[(f + \epsilon - f)^2] + E[(f - \hat{f})^2] + 2E[fY - f^2 - \hat{f}Y + \hat{f}f] \\
&= E[\epsilon^2] + E[(f - \hat{f})^2] + 2\left(f^2 - f^2 - fE[\hat{f}] + fE[\hat{f}]\right) \\
&= \sigma_\epsilon^2 + E[(f - \hat{f})^2] + 0
\end{aligned}$$
For the term $E[(f - \hat{f})^2]$ we can use a similar trick as above, adding and subtracting $E[\hat{f}]$ to get
$$\begin{aligned}
E[(f - \hat{f})^2] &= E[(f - E[\hat{f}] + E[\hat{f}] - \hat{f})^2] \\
&= E[(f - E[\hat{f}])^2] + E[(\hat{f} - E[\hat{f}])^2] + 2E[(f - E[\hat{f}])(E[\hat{f}] - \hat{f})] \\
&= (f - E[\hat{f}])^2 + E[(\hat{f} - E[\hat{f}])^2] + 0 \\
&= \mathrm{Bias}^2[\hat{f}] + \mathrm{Var}[\hat{f}]
\end{aligned}$$

where the cross term vanishes because $f - E[\hat{f}]$ is a constant and $E[E[\hat{f}] - \hat{f}] = 0$.
Putting it together
$$E[(Y - \hat{f})^2] = \sigma_\epsilon^2 + \mathrm{Bias}^2[\hat{f}] + \mathrm{Var}[\hat{f}]$$
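As a sanity check, here is a minimal Monte Carlo sketch of the decomposition. It is not part of the original derivation: the true function $f(x) = \sin(x)$, the noise level, the fixed training inputs, the query point $x_0$ and the deliberately underfit degree-1 polynomial are all assumptions chosen only for illustration. The simulation repeatedly redraws the training noise, refits the model, and compares the simulated $\mathrm{Err}(x_0)$ with $\sigma_\epsilon^2 + \mathrm{Bias}^2[\hat{f}] + \mathrm{Var}[\hat{f}]$.

```python
# Illustrative sketch only: f, sigma_eps, x_train, x0 and the degree-1 fit
# are arbitrary choices, not part of the derivation above.
import numpy as np

rng = np.random.default_rng(0)

def f(x):                          # the true (unknown) regression function
    return np.sin(x)

sigma_eps = 0.5                    # noise standard deviation
x_train = np.linspace(0, 3, 20)    # fixed training inputs
x0 = 1.5                           # the query point x_0
n_sims = 20_000

preds = np.empty(n_sims)           # f_hat(x0) across simulated training sets
errs = np.empty(n_sims)            # (Y - f_hat(x0))^2 for a fresh Y at x0

for i in range(n_sims):
    y_train = f(x_train) + rng.normal(0, sigma_eps, x_train.size)
    coefs = np.polyfit(x_train, y_train, deg=1)   # underfit linear model
    preds[i] = np.polyval(coefs, x0)
    y_new = f(x0) + rng.normal(0, sigma_eps)      # independent observation at x0
    errs[i] = (y_new - preds[i]) ** 2

bias2 = (preds.mean() - f(x0)) ** 2
var = preds.var()
print("Err(x0)                  :", errs.mean())
print("sigma_eps^2 + Bias^2 + Var:", sigma_eps**2 + bias2 + var)
```

The two printed numbers should agree up to Monte Carlo error, with the bias term dominating here because a straight line cannot capture the curvature of $\sin(x)$ at $x_0$.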
Some comments on why $E[\hat{f}Y] = fE[\hat{f}]$
The following argument is taken from Alecos Papadopoulos.
Recall that $\hat{f}$ is the predictor we have constructed based on the $m$ data points $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$, so we can write $\hat{f} = \hat{f}_m$ to remember that.
On the other hand, $Y$ is the response at a new data point $(x^{(m+1)}, y^{(m+1)})$, which we predict using the model constructed on the $m$ data points above. So the mean squared error can be written as
$$E\left[\left(\hat{f}_m(x^{(m+1)}) - y^{(m+1)}\right)^2\right]$$
Expanding the equation from the previous section
$$E[\hat{f}_m Y] = E[\hat{f}_m(f + \epsilon)] = E[\hat{f}_m f + \hat{f}_m \epsilon] = E[\hat{f}_m f] + E[\hat{f}_m \epsilon] = fE[\hat{f}_m] + E[\hat{f}_m \epsilon]$$

where $E[\hat{f}_m f] = fE[\hat{f}_m]$ because $f = f(x^{(m+1)})$ is not random. The last term of the equation can be viewed as
$$E\left[\hat{f}_m(x^{(m+1)}) \cdot \epsilon^{(m+1)}\right] = 0$$
Since we make the following assumptions about the point $x^{(m+1)}$:
- It was not used when constructing $\hat{f}_m$
- It is independent of all other observations $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$
- It is independent of $\epsilon^{(m+1)}$

Under these assumptions $\hat{f}_m(x^{(m+1)})$ and $\epsilon^{(m+1)}$ are independent, so the expectation factors as $E[\hat{f}_m(x^{(m+1)})] \cdot E[\epsilon^{(m+1)}] = 0$ (see the numerical sketch below).
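A tiny numerical sketch of this independence argument, under the same illustrative assumptions as the snippet above (sinusoidal $f$, Gaussian noise, a degree-1 polynomial fit; none of these come from the original argument): the fitted value at the new point is built only from the $m$ training observations, while $\epsilon^{(m+1)}$ is drawn independently, so their product should average to roughly zero.

```python
# Illustrative sketch only: same arbitrary choices of f, noise level and
# model as in the earlier snippet.
import numpy as np

rng = np.random.default_rng(1)
f = np.sin
sigma_eps = 0.5
x_train = np.linspace(0, 3, 20)   # the m training inputs
x_new = 1.5                       # x^(m+1), not used in the fit

products = np.empty(20_000)
for i in range(products.size):
    # build f_hat_m from the m training observations only
    y_train = f(x_train) + rng.normal(0, sigma_eps, x_train.size)
    f_hat_new = np.polyval(np.polyfit(x_train, y_train, deg=1), x_new)
    # eps^(m+1) is drawn independently of everything used in the fit
    eps_new = rng.normal(0, sigma_eps)
    products[i] = f_hat_new * eps_new

print("E[f_hat_m(x_new) * eps_new] ~", products.mean())  # close to 0
```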
Other sources with full derivations