Step-by-step implementation of PCA in R using Lindsay Smith's tutorial



I'm working in R through an excellent PCA tutorial by Lindsay I Smith and I'm getting stuck at the last stage. The R script below takes us up to the stage (on p. 19) where the original data are reconstructed from the (in this case single) principal component, which should produce a straight-line plot along the PCA1 axis (given that the data have only 2 dimensions, the second of which is intentionally dropped).

d = data.frame(x=c(2.5,0.5,2.2,1.9,3.1,2.3,2.0,1.0,1.5,1.1),
               y=c(2.4,0.7,2.9,2.2,3.0,2.7,1.6,1.1,1.6,0.9))

# mean-adjusted values 
d$x_adj = d$x - mean(d$x)
d$y_adj = d$y - mean(d$y)

# calculate covariance matrix and eigenvectors/values
(cm = cov(d[,1:2]))

#### outputs #############
#          x         y
# x 0.6165556 0.6154444
# y 0.6154444 0.7165556
##########################

(e = eigen(cm))

##### outputs ##############
# $values
# [1] 1.2840277 0.0490834
#
# $vectors
#          [,1]       [,2]
# [1,] 0.6778734 -0.7351787
# [2,] 0.7351787  0.6778734
###########################


# principal component vector slopes
# (NB: as the comments and answers below point out, these are inverted --
# the slope should be y/x, e.g. e$vectors[2,1] / e$vectors[1,1] for PC1)
s1 = e$vectors[1,1] / e$vectors[2,1] # PC1
s2 = e$vectors[1,2] / e$vectors[2,2] # PC2

plot(d$x_adj, d$y_adj, asp=T, pch=16, xlab='x', ylab='y')
abline(a=0, b=s1, col='red')
abline(a=0, b=s2)

[plot: mean-adjusted data with the two component lines, PC1 in red]

# PCA data = rowFeatureVector (transposed eigenvectors) * RowDataAdjust (mean adjusted, also transposed)
feat_vec = t(e$vectors)
row_data_adj = t(d[,3:4])
final_data = data.frame(t(feat_vec %*% row_data_adj)) # ?matmult for details
names(final_data) = c('x','y')

#### outputs ###############
# final_data
#              x           y
# 1   0.82797019 -0.17511531
# 2  -1.77758033  0.14285723
# 3   0.99219749  0.38437499
# 4   0.27421042  0.13041721
# 5   1.67580142 -0.20949846
# 6   0.91294910  0.17528244
# 7  -0.09910944 -0.34982470
# 8  -1.14457216  0.04641726
# 9  -0.43804614  0.01776463
# 10 -1.22382056 -0.16267529
############################

# final_data[[1]] = -final_data[[1]] # for some reason the x-axis data is negative the tutorial's result

plot(final_data, asp=T, xlab='PCA 1', ylab='PCA 2', pch=16)

[plot: final_data on the PCA 1 / PCA 2 axes]

That's all I have, and so far so good. But I can't work out how to obtain the data for the final plot (the variance attributable to PCA 1), which Smith plots as:

[plot: Smith's figure, the data reconstructed using only PCA 1]

This is what I tried (which ignores adding back the original means):

trans_data = final_data
trans_data[,2] = 0
row_orig_data = t(t(feat_vec[1,]) %*% t(trans_data))
plot(row_orig_data, asp=T, pch=16)

.. and got an error:

[image: the erroneous output]

.. because somewhere in the matrix multiplication I've lost a dimension of the data. I'd be very grateful for any insight into what's going wrong here.


*Edit*

I wonder if this is the right formula:

row_orig_data = t(t(feat_vec) %*% t(trans_data))
plot(row_orig_data, asp=T, pch=16, cex=.5)
abline(a=0, b=s1, col='red')

But I'm a bit confused if that's the case, because (a) I understand the rowFeatureVector needs to be reduced to the desired dimensionality (the eigenvector for PCA1), and (b) it doesn't line up with the PCA1 line:

[plot: the reconstructed points, not aligned with the PCA1 line]

Any insight would be much appreciated.


Just a quick note (already mentioned in the answers below, but potentially confusing for someone looking at your question): your s1 slope was calculated with an error (it should be y/x, not x/y), which is why the red line is not perfectly aligned with the data in the first figure, nor with the reconstruction in the last one.
amoeba says Reinstate Monica

Regarding reconstructing the original data from the leading principal components, see this newer thread: stats.stackexchange.com/questions/229092 .
amoeba says Reinstate Monica

Answers:



You were very close, but got caught by a subtle issue in working with matrices in R. I worked through from your final_data and obtained the correct results independently. Then I had a closer look at your code. To cut a long story short, where you wrote

row_orig_data = t(t(feat_vec[1,]) %*% t(trans_data))

you would have been fine if you had written

row_orig_data = t(t(feat_vec) %*% t(trans_data))

The issue is that t(trans_data) is a 2×10 matrix and t(feat_vec[1,]) is a 1×2 matrix, so what R computes is not the multiplication you intended. R would have given a non-conformable arguments error if you had kept feat_vec[1,] as a 2×1 column, i.e.

row_orig_data = t(as.matrix(feat_vec[1,],ncol=1,nrow=2) %*% t(trans_data))

whereas

row_orig_data = t(as.matrix(feat_vec[1,],ncol=1,nrow=2) %*% t(trans_data)[1,])

would have been fine, because there you multiply a 2×1 matrix by a 1×10 matrix (the first row of the transposed final_data), building the 20 = 2×10 values in row_orig_data from the 12 = 2×1 + 1×10 values on the right-hand side.
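As a quick sanity check (mine, not part of the original answer), you can verify those dimensions directly using the objects defined in the question:

e1 = matrix(feat_vec[1,], nrow=2)               # 2x1: first eigenvector as a column
y1 = t(as.matrix(trans_data))[1, , drop=FALSE]  # 1x10: first row of the transposed scores
dim(e1 %*% y1)                                  # 2 10 -- twenty values from 12 inputs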

I've left my original answer below, as someone might find it useful, and it does demonstrate getting the required plots. It also shows that the code can be a bit simpler by getting rid of some unnecessary transposes: (XY)ᵀ = YᵀXᵀ, so t(t(p) %*% t(q)) = q %*% p.
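A one-line illustration of that identity, with arbitrary conformable matrices:

p = matrix(1:4, nrow=2)               # 2x2
q = matrix(5:10, nrow=3)              # 3x2
all.equal(t(t(p) %*% t(q)), q %*% p)  # TRUE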

Re your edit, I've added the principal component line in green to my plot below. In your question you had the slope as x/y, not y/x.


Write

d_in_new_basis = as.matrix(final_data)

then to get your data back in its original basis you need

d_in_original_basis = d_in_new_basis %*% feat_vec
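(An optional check, not in the original answer: because the eigenvector matrix is orthogonal, the round trip recovers the mean-adjusted data up to floating point.)

all.equal(unname(d_in_original_basis), unname(as.matrix(d[,3:4])))  # TRUE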

You can zero out the parts of your data that are projected along the second component using

d_in_new_basis_approx = d_in_new_basis
d_in_new_basis_approx[,2] = 0

and you can then transform as before

d_in_original_basis_approx = d_in_new_basis_approx %*% feat_vec
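(Another optional check: every point of this rank-1 reconstruction lies exactly on the PC1 direction, so the ratio of its coordinates is constant and equals the PC1 slope.)

slope_pc1 = e$vectors[2,1] / e$vectors[1,1]
all.equal(d_in_original_basis_approx[,2] / d_in_original_basis_approx[,1],
          rep(slope_pc1, nrow(d)), check.attributes=FALSE)  # TRUE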

Plotting these on the same plot, together with the principal component line in green, shows you how the approximation worked.

plot(x=d_in_original_basis[,1]+mean(d$x),
     y=d_in_original_basis[,2]+mean(d$y),
     pch=16, xlab="x", ylab="y", xlim=c(0,3.5),ylim=c(0,3.5),
     main="black=original data\nred=original data restored using only a single eigenvector")
points(x=d_in_original_basis_approx[,1]+mean(d$x),
       y=d_in_original_basis_approx[,2]+mean(d$y),
       pch=16,col="red")
points(x=c(mean(d$x)-e$vectors[1,1]*10, mean(d$x)+e$vectors[1,1]*10),
       y=c(mean(d$y)-e$vectors[2,1]*10, mean(d$y)+e$vectors[2,1]*10),
       type="l", col="green")

[plot: black = original data, red = reconstruction from a single eigenvector, green = PC1 line]

Let's rewind to what you had. This line was ok

final_data = data.frame(t(feat_vec %*% row_data_adj))

The crucial bit here is feat_vec %*% row_data_adj, which is equivalent to Y = SᵀX, where S is the matrix of eigenvectors and X is your mean-adjusted data matrix with one data point per column, so that Y is the data in the new basis. What this is saying is that the first row of Y is the sum of the rows of X weighted by the first eigenvector, and the second row of Y is the sum of the rows of X weighted by the second eigenvector.
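(A quick way to see this with the question's objects: the first row of Y matches the projections of the points onto the first eigenvector.)

Y = feat_vec %*% row_data_adj
all.equal(as.numeric(Y[1,]),
          as.numeric(as.matrix(d[,3:4]) %*% e$vectors[,1]))  # TRUE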

Then you had

trans_data = final_data
trans_data[,2] = 0

This is ok: you're just zeroing out the parts of your data that are projected along the second component. Where it goes wrong is

row_orig_data = t(t(feat_vec[1,]) %*% t(trans_data))

Writing Ŷ for the matrix of data Y in the new basis, with zeros in its second row, and e1 for the first eigenvector, the business end of this code, t(feat_vec[1,]) %*% t(trans_data), comes down to e1Ŷ.

As explained above (this is where I realised the subtle R problem and wrote the first part of my answer), mathematically you are trying to multiply a 2×1 vector by a 2×10 matrix, which doesn't work. What you should do instead is take the first row of Ŷ (which equals the first row of Y): call this y1. Then multiply e1 and y1 together. The ith column of the result e1y1 is the eigenvector e1 weighted by only the first coordinate of the ith point in the new basis, which is what you want.
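Putting that into runnable form (a minimal sketch using the question's objects, not code from the original answer):

e1 = e$vectors[, 1, drop=FALSE]                 # 2x1: first eigenvector
y1 = t(as.matrix(final_data))[1, , drop=FALSE]  # 1x10: scores on PC1
row_orig_data = t(e1 %*% y1)                    # 10x2: points on the PC1 line
plot(row_orig_data, asp=TRUE, pch=16, xlab='x', ylab='y')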


Thanks TooTone, this is very comprehensive and resolves the ambiguities in my understanding of the matrix calculation and the role of featureVector in the final stage.
geotheory

Great :). I answered this question because I am studying the theory of SVD / PCA at the moment and wanted to get to grips with how it works with an example: your question was good timing. After working through all the matrix calculations I was a bit surprised that it turned out to be an R problem -- so I'm glad you appreciated the matrices aspect of it too.
TooTone


I think you have the right idea but stumbled over a nasty feature of R. Here again the relevant code piece as you've stated it:

trans_data = final_data
trans_data[,2] = 0
row_orig_data = t(t(feat_vec[1,]) %*% t(trans_data))
plot(row_orig_data, asp=T, pch=16)

Essentially final_data contains the coordinates of the original points with respect to the coordinate system defined by the eigenvectors of the covariance matrix. To reconstruct the original points one therefore has to multiply each eigenvector with the associated transformed coordinate, e.g.

(1) final_data[1,1]*t(feat_vec[1,]) + final_data[1,2]*t(feat_vec[2,])

which would yield the original coordinates of the first point. In your question you set the second component correctly to zero, trans_data[,2] = 0. If you then (as you already edited) calculate

(2) row_orig_data = t(t(feat_vec) %*% t(trans_data))

you calculate formula (1) for all points simultaneously. Your first approach

row_orig_data = t(t(feat_vec[1,]) %*% t(trans_data))

calculates something different, and only works because R automatically drops the dimension attribute for feat_vec[1,], so it is no longer a row vector but a plain vector. The subsequent transpose makes it a row vector again, which is why the calculation at least runs without an error; but if you go through the math you will see that it computes something different from (1). In general it is a good idea in matrix multiplications to suppress the dropping of the dimension attribute, which can be achieved with the drop parameter, e.g. feat_vec[1,,drop=FALSE].
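A tiny illustration of that pitfall (mine, not from the original answer):

m = matrix(1:4, nrow=2)
dim(m[1,])               # NULL -- dims dropped, now a plain vector
dim(m[1,,drop=FALSE])    # 1 2  -- stays a proper 1x2 row matrix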

Your edited solution seems correct, but you calculated the slope of PCA1 wrongly. The slope is given by Δy/Δx, hence

s1 = e$vectors[2,1] / e$vectors[1,1] # PC1
s2 = e$vectors[2,2] / e$vectors[1,2] # PC2

Thanks very much Georg. You're right about the PCA1 slope. Very useful tip also about the drop=F argument.
geotheory

4

After exploring this exercise you can try the easier ways in R. There are two popular functions for doing PCA: princomp and prcomp. The princomp function does the eigenvalue decomposition as done in your exercise. The prcomp function uses singular value decomposition. Both methods will give the same results almost all of the time: this answer explains the differences in R, whereas this answer explains the math. (Thanks to TooTone for comments now integrated into this post.)
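(As a quick check, assuming the two-column d from the exercise is in scope, the two functions agree up to sign:)

all.equal(abs(unclass(princomp(d)$loadings)), abs(prcomp(d)$rotation),
          check.attributes=FALSE)  # TRUE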

Here we use both to reproduce the exercise in R. First using princomp:

d = data.frame(x=c(2.5,0.5,2.2,1.9,3.1,2.3,2.0,1.0,1.5,1.1), 
               y=c(2.4,0.7,2.9,2.2,3.0,2.7,1.6,1.1,1.6,0.9))

# compute PCs
p = princomp(d)  # princomp always centers; center/retx are prcomp arguments and would be ignored here

# use loadings and scores to reproduce with only first PC
loadings = t(p$loadings[,1]) 
scores = p$scores[,1] 

# add the column means back to each row (a bare "+ colMeans(d)" would recycle column-wise and garble the result)
reproduce = sweep(scores %*% loadings, 2, colMeans(d), "+")

# plots
plot(reproduce,pch=3,ylim=c(-1,4),xlim=c(-1,4))
abline(h=0,v=0,lty=3)
mtext("Original data restored using only a single eigenvector",side=3,cex=0.7)

biplot(p)

[plots: data reconstructed using only the first PC; biplot of the princomp fit]

Second using prcomp:

d = data.frame(x=c(2.5,0.5,2.2,1.9,3.1,2.3,2.0,1.0,1.5,1.1), 
               y=c(2.4,0.7,2.9,2.2,3.0,2.7,1.6,1.1,1.6,0.9))

# compute PCs
p = prcomp(d,center=TRUE,retx=TRUE)

# use loadings and scores to reproduce with only first PC
loadings = t(p$rotation[,1])
scores = p$x[,1]

# add the column means back to each row (same recycling caveat as above)
reproduce = sweep(scores %*% loadings, 2, colMeans(d), "+")

# plots
plot(reproduce,pch=3,ylim=c(-1,4),xlim=c(-1,4))
abline(h=0,v=0,lty=3)
mtext("Original data restored using only a single eigenvector",side=3,cex=0.7)

biplot(p)

[plots: data reconstructed using only the first PC; biplot of the prcomp fit]

Clearly the signs are flipped, but the explained variation is equivalent.
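(The flipped signs cancel in the reconstruction, which you can check directly:)

all.equal((-scores) %*% (-loadings), scores %*% loadings)  # TRUE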


Thanks mrbcuda. Your biplot looks identical to Lindsay Smith's so I presume he/she used the same method 12 years ago! I'm also aware of some other higher level methods, but as you rightly point out this is an exercise to make the underlying PCA maths explicit.
geotheory