Choosing which variables to include in a multiple linear regression model


35

I am currently working on building a model using multiple linear regression. After fiddling around with my model, I am unsure how best to determine which variables to keep and which to remove.

My model started with 10 predictors for the DV. When using all 10 predictors, four were deemed significant. If I remove only some of the obviously bad predictors, some of the predictors that were not initially significant become significant. This leads me to my question: how does one go about determining which predictors to include in their model? It seemed to me that you should run the model once with all predictors, remove those that are not significant, and then rerun. But if removing only some of those predictors makes others significant, I wonder whether I am taking the wrong approach to all of this.

I believe this thread is similar to my question, but I am not sure I am interpreting the discussion correctly. Perhaps this is more of an experimental design topic, but maybe someone has some experience they can share.


The answer to this depends greatly on your goals and requirements: are you looking for simple association, or are you aiming for prediction; how much do you care about interpretability; do you have information about the variables from other publications that could influence the process; what about interactions or transformed versions of the variables: can you include those; etc. You need to give more detail about what you are trying to do to get a good answer.
Nick Sabbe

Based on what you asked, this will be for prediction. Influence on other variables only offers a possible association. There are no interactions between them. Only one value needed to be transformed, and that has been done.
cryptic_star

1
Is there theory that says which predictors you should include? If you have many variables that you have measured and no theory, I would recommend holding back a set of observations so that you can test your model on data that were not used to build it. It is not OK to test and validate a model on the same data.
Michelle

Cross-validation (as discussed by Nick Sabbe), penalized methods (Dikran Marsupial), or choosing variables based on prior theory (Michelle) are all options. But note that variable selection is intrinsically a very difficult task. To understand why it is so potentially fraught, it may help to read my answer here: Algorithms for automatic model selection. Lastly, it is worth recognizing that the problem is with the logical structure of this activity, not whether the computer does it for you automatically or whether you do it manually yourself.
gung - Reinstate Monica

Also check out the answers to this post: stats.stackexchange.com/questions/34769/…
jokel

Answers:


19

Based on your reaction to my comment:

You are looking for prediction. Thus, you should not really rely on the (in)significance of the coefficients. You would do better to

  • Pick a criterion that describes your prediction needs best (e.g. misclassification rate, AUC of the ROC, some weighted form of these, ...)
  • For each model of interest, evaluate this criterion. This can be done e.g. by setting aside a validation set (if you're lucky or rich), through cross-validation (typically tenfold), or by whatever other options your criterion of interest allows. If possible, also find an estimate of the SE of the criterion for each model (e.g. by using the values over the different folds in cross-validation).
  • Now you can pick the model with the best value of the criterion, though it is typically advised to pick the most parsimonious model (least variables) that is within one SE of the best value.

With regard to "each model of interest": herein lies quite a catch. With 10 potential predictors, that is a truckload of potential models. If you've got the time or the processors for this (or if your data is small enough so that models get fit and evaluated fast enough): have a ball. If not, you can go about this by educated guesses, forward or backward modelling (but using the criterion instead of significance), or better yet: use some algorithm that picks a reasonable set of models. One algorithm that does this is penalized regression, in particular lasso regression. If you're using R, just plug in the package glmnet and you're about ready to go.
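To make this concrete, here is a minimal sketch of the glmnet route, assuming a hypothetical data frame dat with the outcome in a column y and the 10 candidate predictors in the remaining columns (the names are illustrative, not from the question):

library(glmnet)

x <- as.matrix(dat[, setdiff(names(dat), "y")])   # predictor matrix
y <- dat$y                                        # continuous response

# Ten-fold cross-validation over the lasso path (alpha = 1 is the lasso)
set.seed(42)
cvfit <- cv.glmnet(x, y, alpha = 1, nfolds = 10)

# lambda.min minimises the CV error; lambda.1se applies the one-SE rule above,
# trading a little fit for a sparser (more parsimonious) model
coef(cvfit, s = "lambda.1se")   # predictors with non-zero coefficients are kept
plot(cvfit)                     # CV error with SE bars across the lambda path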


+1, but could you explain why exactly you would "pick the most parsimonious model (least variables) that is within one SE of the best value"?
rolando2

Parsimony is, for most situations, a wanted property: it heightens interpretability, and reduces the number of measurements you need to make for a new subject to use the model. The other side of the story is that what you get for your criterion is but an estimate, with matching SE: I've seen quite a few plots showing the criterion estimates against some tuning parameter, where the 'best' value was just an exceptional peak. As such, the 1 SE-rule (which is arbitrary, but an accepted practice) protects you from this with the added value of providing more parsimony.
Nick Sabbe

13

There is no simple answer to this. When you remove some of the non-significant explanatory variables, others that are correlated with those may become significant. There is nothing wrong with this, but it makes model selection at least partly art rather than science. This is why experiments aim to keep explanatory variables orthogonal to each other, to avoid this problem.

Traditionally, analysts did stepwise adding and subtracting of variables to the model one at a time (similar to what you have done), testing them individually or in small groups with t or F tests. The problem with this is that you may miss some combination of variables to subtract (or add) whose combined effect (or non-effect) is hidden by the collinearity.

With modern computing power it is feasible to fit all 2^10 = 1024 possible combinations of explanatory variables and choose the best model by one of a number of possible criteria, e.g. AIC, BIC, or predictive power (for example, the ability to predict the values of a test subset of the data that you have separated from the set you use to fit your model). However, if you are going to be testing (implicitly or explicitly) 1024 models, you will need to rethink your p-values from the classical approach - treat with caution...
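As a rough sketch of what "fit all 1024 combinations and compare by AIC" could look like in R (assuming a hypothetical data frame dat whose column y is the response and whose other columns are the candidate predictors; names are illustrative):

predictors <- setdiff(names(dat), "y")
combos <- expand.grid(rep(list(c(FALSE, TRUE)), length(predictors)))  # 2^10 rows, one per subset

fits <- lapply(seq_len(nrow(combos)), function(i) {
  keep <- unlist(combos[i, ])
  rhs  <- if (!any(keep)) "1" else paste(predictors[keep], collapse = " + ")
  lm(as.formula(paste("y ~", rhs)), data = dat)   # fit this subset
})

aics <- sapply(fits, AIC)
formula(fits[[which.min(aics)]])   # the predictor subset with the lowest AIC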


Thanks for the high level walk through of the pluses and minuses of both sides. It confirmed a lot of what I suspected.
cryptic_star

11

If you are only interested in predictive performance, then it is probably better to use all of the features and use ridge regression to avoid over-fitting the training sample. This is essentially the advice given in the appendix of Miller's monograph on "Subset Selection in Regression", so it comes with a reasonable pedigree!

The reason for this is that if you choose a subset based on a performance estimate based on a fixed sample of data (e.g. AIC, BIC, cross-validation etc.), the selection criterion will have a finite variance and so it is possible to over-fit the selection criterion itself. In other words, to begin with, as you minimise the selection criterion, generalisation performance will improve; however, there will come a point where the more you reduce the selection criterion, the worse generalisation becomes. If you are unlucky, you can easily end up with a regression model that performs worse than the one you started with (i.e. a model with all of the attributes).

This is especially likely when the dataset is small (so the selection criterion has a high variance) and when there are many possible choices of model (e.g. choosing combinations of features). Regularisation seems to be less prone to over-fitting, as only a single scalar parameter needs to be tuned, and this gives a more constrained view of the complexity of the model, i.e. fewer effective degrees of freedom with which to over-fit the selection criterion.
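For comparison with the subset-selection answers, a sketch of this "keep all features, tune a single ridge penalty" approach might look like the following in R (dat is again a hypothetical data frame with response y and the candidate predictors in the other columns; glmnet with alpha = 0 fits ridge regression):

library(glmnet)

x <- as.matrix(dat[, setdiff(names(dat), "y")])
y <- dat$y

# alpha = 0 gives ridge regression: every predictor stays in the model and only
# the penalty parameter lambda is tuned, here by ten-fold cross-validation
set.seed(1)
ridge_cv <- cv.glmnet(x, y, alpha = 0, nfolds = 10)

coef(ridge_cv, s = "lambda.min")   # all coefficients retained, shrunk toward zero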


0

Use the leaps package. When you plot the result, the y-axis shows adjusted R^2. Look at the row with the highest adjusted R^2: the variables marked with black boxes in that row are the ones to use in your multiple linear regression.

Wine example below:

library(leaps)
regsubsets.out <-
  regsubsets(Price ~ Year + WinterRain + AGST + HarvestRain + Age + FrancePop,
         data = wine,
         nbest = 1,       # 1 best model for each number of predictors
         nvmax = NULL,    # NULL for no limit on number of variables
         force.in = NULL, force.out = NULL,
         method = "exhaustive")
regsubsets.out

#----In the plot, look for the black boxes at the highest adjusted R^2;
#in this case that selects AGST + HarvestRain + WinterRain + Age, with Price as the dependent variable----#
summary.out <- summary(regsubsets.out)
as.data.frame(summary.out$outmat)
plot(regsubsets.out, scale = "adjr2", main = "Adjusted R^2")

This doesn't sound very distinct from so-called 'best subsets' selection, which has known problems.
gung - Reinstate Monica

leaps explicitly computes the 'best subsets', although it doesn't advise you how to select among subsets of different size. (That being a matter between you and your statistical clergy.)
steveo'america

Funny enough, leaps is based on "FORTRAN77 code by Alan Miller [...] which is described in more detail in his book 'Subset Selection in Regression'", a book which is mentioned by Dikran in another answer to this question :-)
jorijnsmit


-2

Why not do a correlation analysis first, and then include in the regression only those variables that correlate with the DV?


2
This is generally a poor way of choosing which variables to select, see e.g. Is using correlation matrix to select predictors for regression correct? A correlation analysis is quite different to multiple regression, because in the latter case we need to think about "partialling out" (regression slopes show the relationship once other variables are taken into account), but a correlation matrix doesn't show this.
Silverfish
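A small simulated illustration of this point (all numbers invented for the example): a predictor can have a near-zero marginal correlation with the DV yet be clearly important once the other predictor is partialled out, so screening on the correlation matrix would wrongly drop it.

set.seed(123)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = sqrt(1 - 0.9^2))   # x2 strongly correlated with x1
y  <- x1 - 0.8 * x2 + rnorm(n)

cor(y, x2)                                 # close to zero: correlation screening drops x2
summary(lm(y ~ x1 + x2))$coefficients      # yet x2 has a large, highly significant slope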

This does not provide an answer to the question. Once you have sufficient reputation you will be able to comment on any post; instead, provide answers that don't require clarification from the asker. - From Review
Sycorax says Reinstate Monica

1
@GeneralAbrial It strikes me that this is an answer to the question, albeit a brief one. It isn't a good solution to the problem, but that's what up/downvotes are for. (I think the "why not" is intended as a rhetorical question, rather than a request for clarification from the author.)
Silverfish

-4

My advisor offered another possible way to go about this. Run all of your variables once, and then remove those that fail to meet some threshold (we set our threshold at p < .25). Continue iterating that way until all variables fall below that .25 value, then report the variables which are significant.


1
Hi allie, that is what @Peter Ellis mentioned in the second paragraph of his answer. His second sentence there covers the problem that this technique introduces. Do you have a theory, which is telling you what predictors to put into your model?
Michelle

Yes, @Michelle is right to underscore the liability of this approach. It can produce very arbitrary results.
rolando2

Yes, there is a theory behind it all, which we are hoping to expand on. In particular, we are looking at how certain social cues (such as speech) interact. We are aware which ones do or don't already have clout. However, we are attempting to provide finer-grained versions. So, speech may be broken down into question, opinion, assessment, etc.
cryptic_star

2
Okay, so you're doing exploratory analysis. :) You can try different combinations, but you'll need to test the model you end up with on new data. By definition, with what you're doing, you'll have the "best" model for your data, but it may not work if you collect another set of data.
Michelle