A few more steps of the Bias-Variance decomposition
Indeed, the full derivation is rarely given in textbooks as it involves a lot of uninspiring algebra. Here is a more complete derivation, using the notation from the book "Elements of Statistical Learning" (page 223).
If we assume that $Y = f(X) + \epsilon$ with $E[\epsilon] = 0$ and $\operatorname{Var}(\epsilon) = \sigma_\epsilon^2$, then we can derive the expression for the expected prediction error of a regression fit $\hat{f}(X)$ at an input point $X = x_0$, using squared error loss:
$$\mathrm{Err}(x_0) = E\left[(Y - \hat{f}(x_0))^2 \mid X = x_0\right]$$
For notational simplicity let $\hat{f}(x_0) = \hat{f}$ and $f(x_0) = f$, and recall that $E[f] = f$ (since $f$ is deterministic) and $E[Y] = f$ (since $E[\epsilon] = 0$).
$$\begin{aligned}
E[(Y - \hat{f})^2] &= E[(Y - f + f - \hat{f})^2] \\
&= E[(Y - f)^2] + E[(f - \hat{f})^2] + 2E[(f - \hat{f})(Y - f)] \\
&= E[(f + \epsilon - f)^2] + E[(f - \hat{f})^2] + 2E[fY - f^2 - \hat{f}Y + \hat{f}f] \\
&= E[\epsilon^2] + E[(f - \hat{f})^2] + 2\left(f^2 - f^2 - fE[\hat{f}] + fE[\hat{f}]\right) \\
&= \sigma_\epsilon^2 + E[(f - \hat{f})^2] + 0
\end{aligned}$$
For the term $E[(f - \hat{f})^2]$ we can use a similar trick as above, adding and subtracting $E[\hat{f}]$ to get
$$\begin{aligned}
E[(f - \hat{f})^2] &= E[(f - E[\hat{f}] + E[\hat{f}] - \hat{f})^2] \\
&= E[(f - E[\hat{f}])^2] + E[(\hat{f} - E[\hat{f}])^2] + 2E[(f - E[\hat{f}])(E[\hat{f}] - \hat{f})] \\
&= (f - E[\hat{f}])^2 + E[(\hat{f} - E[\hat{f}])^2] + 0 \\
&= \mathrm{Bias}^2[\hat{f}] + \mathrm{Var}[\hat{f}]
\end{aligned}$$

where the cross term vanishes because $f - E[\hat{f}]$ is a constant and $E[E[\hat{f}] - \hat{f}] = 0$.
Putting it together
$$E[(Y - \hat{f})^2] = \sigma_\epsilon^2 + \mathrm{Bias}^2[\hat{f}] + \mathrm{Var}[\hat{f}]$$
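As a sanity check, here is a minimal Monte Carlo sketch of the decomposition. It is not part of the original derivation: the true function $f(x) = \sin(x)$, the noise level, the fixed training inputs, the query point $x_0$ and the deliberately underfit degree-1 polynomial are all assumptions chosen only for illustration. The simulation repeatedly redraws the training noise, refits the model, and compares the simulated $\mathrm{Err}(x_0)$ with $\sigma_\epsilon^2 + \mathrm{Bias}^2[\hat{f}] + \mathrm{Var}[\hat{f}]$.

```python
# Illustrative sketch only: f, sigma_eps, x_train, x0 and the degree-1 fit
# are arbitrary choices, not part of the derivation above.
import numpy as np

rng = np.random.default_rng(0)

def f(x):                          # the true (unknown) regression function
    return np.sin(x)

sigma_eps = 0.5                    # noise standard deviation
x_train = np.linspace(0, 3, 20)    # fixed training inputs
x0 = 1.5                           # the query point x_0
n_sims = 20_000

preds = np.empty(n_sims)           # f_hat(x0) across simulated training sets
errs = np.empty(n_sims)            # (Y - f_hat(x0))^2 for a fresh Y at x0

for i in range(n_sims):
    y_train = f(x_train) + rng.normal(0, sigma_eps, x_train.size)
    coefs = np.polyfit(x_train, y_train, deg=1)   # underfit linear model
    preds[i] = np.polyval(coefs, x0)
    y_new = f(x0) + rng.normal(0, sigma_eps)      # independent observation at x0
    errs[i] = (y_new - preds[i]) ** 2

bias2 = (preds.mean() - f(x0)) ** 2
var = preds.var()
print("Err(x0)                  :", errs.mean())
print("sigma_eps^2 + Bias^2 + Var:", sigma_eps**2 + bias2 + var)
```

The two printed numbers should agree up to Monte Carlo error, with the bias term dominating here because a straight line cannot capture the curvature of $\sin(x)$ at $x_0$.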
Some comments on why $E[\hat{f}Y] = fE[\hat{f}]$
The following argument is taken from Alecos Papadopoulos.
Recall that $\hat{f}$ is the predictor we have constructed based on the $m$ data points $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$, so we can write $\hat{f} = \hat{f}_m$ to remember that.
On the other hand, $Y$ is the response at a new data point $(x^{(m+1)}, y^{(m+1)})$, which we predict using the model constructed on the $m$ data points above. So the mean squared error can be written as
$$E\left[\left(\hat{f}_m(x^{(m+1)}) - y^{(m+1)}\right)^2\right]$$
Expanding the equation from the previous section
$$E[\hat{f}_m Y] = E[\hat{f}_m(f + \epsilon)] = E[\hat{f}_m f + \hat{f}_m \epsilon] = E[\hat{f}_m f] + E[\hat{f}_m \epsilon] = fE[\hat{f}_m] + E[\hat{f}_m \epsilon]$$

where $E[\hat{f}_m f] = fE[\hat{f}_m]$ because $f = f(x^{(m+1)})$ is not random. The last term of the equation can be viewed as
$$E\left[\hat{f}_m(x^{(m+1)}) \cdot \epsilon^{(m+1)}\right] = 0$$
Since we make the following assumptions about the point $x^{(m+1)}$:
- It was not used when constructing $\hat{f}_m$
- It is independent of all other observations $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$
- It is independent of $\epsilon^{(m+1)}$

Under these assumptions $\hat{f}_m(x^{(m+1)})$ and $\epsilon^{(m+1)}$ are independent, so the expectation factors as $E[\hat{f}_m(x^{(m+1)})] \cdot E[\epsilon^{(m+1)}] = 0$ (see the numerical sketch below).
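A tiny numerical sketch of this independence argument, under the same illustrative assumptions as the snippet above (sinusoidal $f$, Gaussian noise, a degree-1 polynomial fit; none of these come from the original argument): the fitted value at the new point is built only from the $m$ training observations, while $\epsilon^{(m+1)}$ is drawn independently, so their product should average to roughly zero.

```python
# Illustrative sketch only: same arbitrary choices of f, noise level and
# model as in the earlier snippet.
import numpy as np

rng = np.random.default_rng(1)
f = np.sin
sigma_eps = 0.5
x_train = np.linspace(0, 3, 20)   # the m training inputs
x_new = 1.5                       # x^(m+1), not used in the fit

products = np.empty(20_000)
for i in range(products.size):
    # build f_hat_m from the m training observations only
    y_train = f(x_train) + rng.normal(0, sigma_eps, x_train.size)
    f_hat_new = np.polyval(np.polyfit(x_train, y_train, deg=1), x_new)
    # eps^(m+1) is drawn independently of everything used in the fit
    eps_new = rng.normal(0, sigma_eps)
    products[i] = f_hat_new * eps_new

print("E[f_hat_m(x_new) * eps_new] ~", products.mean())  # close to 0
```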
Other sources with full derivations