Is the order of explanatory variables important when calculating their regression coefficients?

At first I thought the order didn't matter, but then I read about the Gram-Schmidt orthogonalization process for calculating multiple regression coefficients, and now I'm having second thoughts.

According to the Gram-Schmidt process, the later an explanatory variable is indexed among the other variables, the smaller its residual vector is, because the residual vectors of the preceding variables are subtracted from it. As a result, the regression coefficient of that explanatory variable is smaller as well.

If that's true, then the residual vector of the variable in question would be larger if it were indexed earlier, since fewer residual vectors would be subtracted from it. This means that its regression coefficient would be larger too.

Ok, so I was asked to clarify my question. I've posted screenshots from the text that confused me in the first place. Here goes.

My understanding is that there are at least two options for calculating the regression coefficients. The first option is labeled (3.6) in the screenshot below.

[Screenshot: the first way]

Here is the second option (I had to use multiple screenshots).

[Screenshots: the second way]

Unless I'm misreading something (which is definitely possible), it seems that order matters in the second option. Does it matter in the first option? Why or why not? Or is my frame of reference so messed up that this isn't even a valid question? Also, is all of this somehow related to Type I sums of squares versus Type II sums of squares?

Thanks so much in advance, I'm so confused!

regression multiple-regression regression-coefficients

— Ryan Zotti

Could you outline the exact procedure by which the coefficients are calculated? From what I know about Gram-Schmidt orthogonalization and how it can be applied to the regression problem, I would assume that the GS procedure gives you the regression fit, but not the original coefficients. Note that the regression fit is the projection onto the column space. If you orthogonalize the columns you get an orthogonal basis of the space spanned by the columns, so the fit will be a linear combination of this basis, and also a linear combination of the original columns. It will be the same ...

— mpiktas

but the coefficients will be different. This is perfectly normal.

— mpiktas

I guess I'm confused because I thought I had read in "The Elements of Statistical Learning" that the coefficients calculated using the Gram-Schmidt process would be the same as those calculated using the traditional process: B = (X'X)^-1 X'y.

— Ryan Zotti

Here is the excerpt from the book that describes the procedure: "We can view the estimate [of the coefficients] as the result of two applications of the simple regression. The steps are: 1. regress x on 1 to produce the residual z = x − x̄1; 2. regress y on the residual z to give the coefficient βˆ1. This recipe generalizes to the case of p inputs, as shown in Algorithm 3.1. Note that the inputs z0, ..., zj−1 in step 2 are orthogonal, hence the simple regression coefficients computed there are in fact also the multiple regression coefficients."

— Ryan Zotti

It gets a bit garbled when I copy and paste into the comments section here, so it's probably best to look at the source directly. Pages 53 to 54 of "The Elements of Statistical Learning" are freely available for download on the Stanford website: www-stat.stanford.edu/~tibs/ElemStatLearn .

— Ryan Zotti

Answers:

I believe the confusion may be arising from something a bit simpler, but it provides a nice opportunity to review some related matters.

Note that the text is not claiming that all of the regression coefficients $\newcommand{\bhat}{\hat{\beta}}\newcommand{\m}{\mathbf}\newcommand{\z}{\m{z}}\bhat_i$ can be calculated via the successive residual vectors as

$$\bhat_i \stackrel{?}{=} \frac{\langle \m y, \z_i \rangle}{\|\z_i\|^2},$$

but rather that only the last one, $\bhat_p$, can be calculated this way!
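The claim can be checked numerically. The following is a minimal sketch, assuming NumPy and a made-up data set with correlated predictors; the function name `successive_orthogonalization` and all variable names are illustrative and not taken from the book. It computes the unnormalized residual vectors in the spirit of ESL's Algorithm 3.1, applies the questioned formula to every column, and compares the result with an ordinary least-squares fit.

```python
import numpy as np

def successive_orthogonalization(X):
    """Return Z whose columns are the unnormalized Gram-Schmidt residuals of X's columns."""
    n, p = X.shape
    Z = np.zeros((n, p))
    for j in range(p):
        z = X[:, j].copy()
        for k in range(j):
            # coefficient from regressing x_j on the earlier residual z_k
            gamma = Z[:, k] @ X[:, j] / (Z[:, k] @ Z[:, k])
            z -= gamma * Z[:, k]
        Z[:, j] = z
    return Z

# Hypothetical data: an intercept column plus three correlated predictors.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
x3 = 0.5 * x1 - 0.5 * x2 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2, x3])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

Z = successive_orthogonalization(X)
naive = (Z.T @ y) / np.sum(Z**2, axis=0)     # <y, z_i> / ||z_i||^2 for every i
ols = np.linalg.lstsq(X, y, rcond=None)[0]   # ordinary least-squares coefficients

print("naive formula:", naive)
print("least squares:", ols)
```

With correlated columns, only the last entry of `naive` should agree with the least-squares fit; the earlier entries are coefficients on the orthogonalized variables, not on the original ones.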

The successive orthogonalization scheme (a form of Gram–Schmidt orthogonalization) is (almost) producing a pair of matrices $\newcommand{\Z}{\m{Z}}\newcommand{\G}{\m{G}}\Z$ and $\G$ such that

$$\m X = \Z \G,$$

where $\Z$ is $n \times p$ with orthonormal columns and $\G = (g_{ij})$ is $p \times p$ upper triangular. I say "almost" since the algorithm only specifies $\Z$ up to the norms of its columns, which in general will not be one, but they can be made to have unit norm by normalizing the columns and making a corresponding simple adjustment to the coordinate matrix $\G$.

Assuming, of course, that $\m X \in \mathbb R^{n \times p}$ has rank $p \leq n$, the unique least-squares solution is the vector $\bhat$ that solves the system

$$\m X^T \m X \bhat = \m X^T \m y.$$

Substituting $\m X = \Z \G$ and using $\Z^T \Z = \m I$ (by construction), we get

$$\G^T \G \bhat = \G^T \Z^T \m y,$$

which, since $\G$ has full rank and $\G^T$ is therefore invertible, is equivalent to

$$\G \bhat = \Z^T \m y.$$

Now, concentrate on the last row of the linear system. The only nonzero element of $\G$ in the last row is $g_{pp}$. So, we get that

$$g_{pp} \bhat_p = \langle \m y, \z_p \rangle.$$

It is not hard to see (verify this as a check of understanding!) that $g_{pp} = \|\z_p\|$, and so this yields the solution. (Caveat lector: I've used the $\z_i$ already normalized to have unit norm, whereas in the book they have not. This accounts for the fact that the book has a squared norm in the denominator, whereas I only have the norm.)

To find all of the regression coefficients, one needs a simple back-substitution step to solve for the individual $\bhat_i$. For example, for row $(p-1)$,

$$g_{p-1,p-1} \bhat_{p-1} + g_{p-1,p} \bhat_p = \langle \z_{p-1}, \m y \rangle,$$

and so

$$\bhat_{p-1} = g_{p-1,p-1}^{-1} \langle \z_{p-1}, \m y \rangle - g_{p-1,p-1}^{-1} g_{p-1,p} \bhat_p.$$

One can continue this procedure, working backwards from the last row of the system up to the first, subtracting out weighted sums of the coefficients already computed and then dividing by the leading term $g_{ii}$ to get $\bhat_i$.
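The factorization and the back-substitution can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the book's algorithm verbatim: it uses classical Gram-Schmidt with normalized columns (so the factor matrix carries the norms on its diagonal), the helper names `gram_schmidt` and `back_substitute` are hypothetical, and the data are simulated.

```python
import numpy as np

def gram_schmidt(X):
    """Classical Gram-Schmidt: X = Z @ G with orthonormal columns in Z and upper-triangular G."""
    n, p = X.shape
    Z = np.zeros((n, p))
    G = np.zeros((p, p))
    for j in range(p):
        v = X[:, j].copy()
        for i in range(j):
            G[i, j] = Z[:, i] @ X[:, j]   # coordinate of x_j along z_i
            v -= G[i, j] * Z[:, i]
        G[j, j] = np.linalg.norm(v)       # norm of the unnormalized residual
        Z[:, j] = v / G[j, j]
    return Z, G

def back_substitute(G, b):
    """Solve G @ beta = b for upper-triangular G, from the last row upward."""
    p = len(b)
    beta = np.zeros(p)
    for i in range(p - 1, -1, -1):
        # subtract the already-computed coefficients, then divide by the leading term
        beta[i] = (b[i] - G[i, i + 1:] @ beta[i + 1:]) / G[i, i]
    return beta

# Hypothetical data with two correlated columns.
rng = np.random.default_rng(0)
n, p = 100, 4
X = rng.normal(size=(n, p))
X[:, 3] += 0.6 * X[:, 1]
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(size=n)

Z, G = gram_schmidt(X)
beta_gs = back_substitute(G, Z.T @ y)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print("Gram-Schmidt + back-substitution:", beta_gs)
print("least squares:                   ", beta_ols)
```

All coefficients agree with the direct least-squares fit, but only because of the back-substitution step; the raw inner products `Z.T @ y` on their own are not the regression coefficients.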

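The point of the section in ESL is that we could reorder the columns of $\m X$ to get a new matrix $\m X^{(r)}$ in which the $r$th original column is now the last one. Applying the Gram–Schmidt procedure to the new matrix then yields the original coefficient $\bhat_r$ through the simple last-row formula above. This gives an interpretation of $\bhat_r$: it is the coefficient of a univariate regression of $\m y$ on the residual vector obtained by regressing $\m x_r$ on the remaining columns of the design matrix.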
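This interpretation is easy to verify numerically. The sketch below takes a shortcut that should be read as an assumption: instead of rerunning Gram-Schmidt with each column moved to the end, it obtains the same final residual vector by regressing the chosen column on all the others with `numpy.linalg.lstsq`. The function name `coef_via_last_residual` and the data are made up for illustration.

```python
import numpy as np

# Hypothetical data: intercept plus three predictors, two of them correlated.
rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
X[:, 2] += 0.7 * X[:, 1]
y = X @ np.array([0.5, 1.0, -2.0, 3.0]) + rng.normal(size=n)

ols = np.linalg.lstsq(X, y, rcond=None)[0]

def coef_via_last_residual(X, y, r):
    """Coefficient of column r, computed as if that column had been placed last."""
    others = np.delete(X, r, axis=1)
    # residual of x_r after regressing it on every other column
    fit = others @ np.linalg.lstsq(others, X[:, r], rcond=None)[0]
    z = X[:, r] - fit
    # univariate regression of y on that residual
    return (z @ y) / (z @ z)

recovered = np.array([coef_via_last_residual(X, y, r) for r in range(X.shape[1])])

print("one column at a time:", recovered)
print("least squares:       ", ols)
```

Every coefficient is recovered this way, which is why the order of the columns changes the intermediate residual vectors but not the final multiple-regression coefficients.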

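General QR decompositions

The Gram–Schmidt procedure is one way of producing a QR decomposition of $\m X$. The development above does not change for a general QR decomposition. Namely, let

$$\m X = \m Q \m R,$$

be any QR decomposition of $\m X$. Then, by exactly the same reasoning and algebraic manipulations as above, the least-squares solution $\bhat$ satisfies

$$\m R^T \m R \bhat = \m R^T \m Q^T \m y,$$

which simplifies to

$$\m R \bhat = \m Q^T \m y.$$

Since $\m R$ is upper triangular, the same back-substitution technique works: we first solve for $\bhat_p$ and then work our way backwards from bottom to top.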
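As a rough illustration with a library QR routine rather than hand-written Gram-Schmidt, the sketch below uses `numpy.linalg.qr` on simulated data and solves the triangular system with the general-purpose `numpy.linalg.solve`. The data and variable names are assumptions made for the example.

```python
import numpy as np

# Hypothetical data.
rng = np.random.default_rng(2)
n, p = 80, 3
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=n)

# Reduced QR: Q is n x p with orthonormal columns, R is p x p upper triangular.
Q, R = np.linalg.qr(X)

# Solve R beta = Q^T y (a general solver is used here for brevity).
beta_qr = np.linalg.solve(R, Q.T @ y)

# Reference: solve the normal equations directly.
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# The last row of the triangular system gives the last coefficient on its own.
beta_last = (Q[:, -1] @ y) / R[-1, -1]

print("via QR:              ", beta_qr)
print("via normal equations:", beta_ne)
print("last row only:       ", beta_last)
```

A dedicated triangular solver such as `scipy.linalg.solve_triangular` would exploit the structure of the upper-triangular factor; the general solver is used only to keep the example dependent on NumPy alone.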

$\m X$ $\hat{\m y}$

— cardinal

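Exercise 3.4 in ESL is useful for understanding how Gram-Schmidt can be used to find all of the regression coefficients $\beta_j$, not just the final coefficient $\beta_p$.

Exercise 3.4 in ESL

Show how the vector of least squares coefficients can be obtained from a single pass of the Gram-Schmidt procedure. Represent your solution in terms of the QR decomposition of $X$.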

Solution

Recall that a single pass of the Gram-Schmidt procedure lets us write the matrix $X$ as

$$X = Z \Gamma,$$

where $Z$ contains the orthogonal columns $z_j$, and $\Gamma$ is an upper triangular matrix with ones on the diagonal and entries $\gamma_{ij} = \frac{\langle z_i, x_j \rangle}{\| z_i \|^2}$ above it. This reflects the fact that, by construction,

$$x_j = z_j + \sum_{k=0}^{j-1} \gamma_{kj} z_k.$$

Now, by the $QR$ decomposition, we can write $X = QR$, where $Q$ has orthonormal columns and $R$ is upper triangular. We have $Q = Z D^{-1}$ and $R = D\Gamma$, where $D$ is a diagonal matrix with $D_{jj} = \| z_j \|$.

By definition of $\hat \beta$, we have

$$(X^T X) \hat \beta = X^T y.$$

Using the $QR$ decomposition, this becomes

$$\begin{aligned} (R^T Q^T) (QR) \hat \beta &= R^T Q^T y \\ R \hat \beta &= Q^T y \end{aligned}$$

Since $R$ is upper triangular, the last row gives

$$\begin{aligned} R_{pp} \hat \beta_p &= \langle q_p, y \rangle \\ \| z_p \| \hat \beta_p &= \| z_p \|^{-1} \langle z_p, y \rangle \\ \hat \beta_p &= \frac{\langle z_p, y \rangle}{\| z_p \|^2} \end{aligned}$$

in agreement with the earlier result. By back substitution, we can then obtain the remaining regression coefficients $\hat \beta_j$. For example, to calculate $\hat \beta_{p-1}$, we write

$$\begin{aligned} R_{p-1, p-1} \hat \beta_{p-1} + R_{p-1,p} \hat \beta_p &= \langle q_{p-1}, y \rangle \\ \| z_{p-1} \| \hat \beta_{p-1} + \| z_{p-1} \| \gamma_{p-1,p} \hat \beta_p &= \| z_{p-1} \|^{-1} \langle z_{p-1}, y \rangle \end{aligned}$$

and then solve for $\hat \beta_{p-1}$. This process can be repeated for all $\beta_j$, thus obtaining the regression coefficients in one pass of the Gram-Schmidt procedure.

— Andrew Tulloch

Why not try it and compare? Fit a set of regression coefficients, then change the order and fit them again and see if they differ (other than possible round-off error).
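The experiment suggested here takes only a few lines. This is a minimal sketch, assuming NumPy and simulated data; the particular permutation and coefficients are arbitrary choices made for illustration.

```python
import numpy as np

# Hypothetical data with some correlation between the first two columns.
rng = np.random.default_rng(5)
n = 120
X = rng.normal(size=(n, 4))
X[:, 1] += 0.5 * X[:, 0]
y = X @ np.array([1.0, 2.0, -3.0, 0.5]) + rng.normal(size=n)

perm = np.array([2, 0, 3, 1])                 # an arbitrary reordering of the columns

beta = np.linalg.lstsq(X, y, rcond=None)[0]
beta_perm = np.linalg.lstsq(X[:, perm], y, rcond=None)[0]

# beta_perm[k] is the coefficient of the original column perm[k].
print("original order, rearranged:", beta[perm])
print("fitted in permuted order:  ", beta_perm)
```

The two printed vectors should agree up to round-off: each variable keeps its coefficient, and only its position in the vector moves.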

As @mpiktas points out it is not exactly clear what you are doing.

I can see using GS to solve for $B$ in the least squares equation $(x'x)B=(x'y)$. But then you would be doing the GS on the $(x'x)$ matrix, not the original data. In this case the coefficients should be the same (other than possible rounding error).

Another approach of GS in regression is to apply GS to the predictor variables to eliminate collinearity between them. Then the orthogonalized variables are used as the predictors. In this case order matters and the coefficients will be different because the interpretation of the coefficients depends on the order. Consider 2 predictors $x_1$ and $x_2$ and do GS on them in that order, then use the results as predictors. In that case the first coefficient (after the intercept) shows the effect of $x_1$ on $y$ by itself and the second coefficient is the effect of $x_2$ on $y$ after adjusting for $x_1$. Now if you reverse the order of the x's then the first coefficient shows the effect of $x_2$ on $y$ by itself (ignoring $x_1$ rather than adjusting for it) and the second is the effect of $x_1$ adjusting for $x_2$.
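The two-predictor example can be sketched as follows. The assumptions are explicit: the data are simulated, the helper name `orthogonalized_coefs` is hypothetical, and the orthogonalized predictors are obtained from `numpy.linalg.qr` (rescaling each orthonormal column by the matching diagonal entry of the triangular factor recovers the unnormalized Gram-Schmidt residual) rather than from a hand-written loop.

```python
import numpy as np

def orthogonalized_coefs(X, y):
    """Regress y on the successively orthogonalized columns of X, in the given order."""
    Q, R = np.linalg.qr(X)
    Z = Q * np.diag(R)                 # unnormalized residuals: z_j = R[j, j] * q_j
    return (Z.T @ y) / np.sum(Z**2, axis=0)

# Hypothetical data: x2 is correlated with x1.
rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)
ones = np.ones(n)

c12 = orthogonalized_coefs(np.column_stack([ones, x1, x2]), y)   # order: x1, then x2
c21 = orthogonalized_coefs(np.column_stack([ones, x2, x1]), y)   # order: x2, then x1
ols = np.linalg.lstsq(np.column_stack([ones, x1, x2]), y, rcond=None)[0]

print("order x1, x2:", c12)
print("order x2, x1:", c21)
print("multiple regression on [1, x1, x2]:", ols)
```

In each ordering, the coefficient of the variable entered last matches its multiple-regression coefficient, while the coefficient of the variable entered first is its simple-regression slope. That is the sense in which order matters when the orthogonalized variables themselves are used as predictors.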

— Greg Snow

I think your last paragraph is probably closest to the source of my confusion -- GS does make the order matter. That's what I thought. I'm still a bit confused, though, because the book I'm reading, called: "The Elements of Statistical Learning" (a Stanford publication that's freely available: www-stat.stanford.edu/~tibs/ElemStatLearn) seems to suggest that GS is equivalent to the standard approach for calculating the coefficients; that is, B = (X'X)^-1 X'y.

— Ryan Zotti

And part of what you say confuses me a bit too: "I can see using GS to solve for B in the least squares equation (x′x)B=(x′y). But then you would be doing the GS on the (x′x) matrix, not the original data." I thought the x'x matrix contained the original data?... At least that's what Elements of Statistical Learning says. It says the x in the x'x is an N by p matrix where N is the number of inputs (observations) and p is the number of dimensions.

— Ryan Zotti

If GS is not the standard procedure for calculating the coefficients, then how is collinearity typically treated? How is redundancy (collinearity) typically distributed among the x's? Doesn't collinearity traditionally make the coefficients unstable? Then wouldn't that suggest that the GS process is the standard process? Because the GS process also makes the coefficients unstable -- a smaller residual vector makes the coefficient unstable.

— Ryan Zotti

At least that's what the text says, "If xp is highly correlated with some of the other xk’s, the residual vector zp will be close to zero, and from (3.28) the coefficient βˆp will be very unstable."

— Ryan Zotti
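The instability described in that passage can be made concrete. Under the usual linear-model assumptions, the variance of the last coefficient is proportional to $1/\|z_p\|^2$, so a residual vector close to zero inflates it. The sketch below uses simulated data (the names and noise scales are illustrative assumptions) to show how that factor grows as the last predictor becomes more nearly collinear with an earlier one.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x1 = rng.normal(size=n)
e = rng.normal(size=n)
base = np.column_stack([np.ones(n), x1])     # the columns that come before x_p

inv_sq_norms = []
for scale in [1.0, 0.1, 0.01]:
    xp = x1 + scale * e                      # smaller scale -> x_p closer to x1
    fit = base @ np.linalg.lstsq(base, xp, rcond=None)[0]
    zp = xp - fit                            # residual of x_p on the other columns
    inv_sq_norms.append(1.0 / (zp @ zp))     # factor multiplying the noise variance
    print(f"scale={scale:<5} 1/||z_p||^2 = {inv_sq_norms[-1]:.4g}")
```

Each tenfold reduction in the independent part of the last predictor multiplies the factor by about one hundred. This happens regardless of whether the coefficients are computed by Gram-Schmidt or by any other least-squares method, since the coefficient itself is the same.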

Note that GS is a form of QR decomposition.

— cardinal