Independent variable = random variable?



I am a bit confused about whether an independent variable (also called a predictor or feature) in a statistical model, for example the X in the linear regression Y = β0 + β1X, is a random variable?


The linear model is conditional on X, so it does not matter whether X is random.
Xi'an

Check this. Nice question, by the way.
Antoni Parellada

@Xi'an, in the fixed design the assumptions of the linear model are not conditional on X; see my answer. So it matters a lot. This is why experiments are much easier to interpret than observational-study results.
Aksakal

Answers:



There are two common formulations of linear regression. To focus on the concepts, I will abstract them somewhat. The mathematical description is a bit more involved than the English description, so let's begin with the latter:

Linear regression is a model in which a response Y is assumed to be random, with a distribution determined by the regressors X via a linear map β(X) and, possibly, by other parameters θ.

In most cases the set of possible distributions is a location family with parameters α and θ, and β(X) gives the parameter α. The archetypal example is ordinary regression, in which the set of distributions is the Normal family N(μ, σ) and μ = β(X) is a linear function of the regressors.

Because I have not yet described this mathematically, it is still an open question what kinds of mathematical objects X, Y, β, and θ refer to - and I believe that is the main issue in this thread. Although one can make various (equivalent) choices, most will be equivalent to, or special cases of, the following description.


  1. Fixed regressors. The regressors are represented as real vectors X ∈ R^p. The response is a random variable Y: Ω → R (where Ω is endowed with a sigma field and a probability). The model is a function f: R × Θ → M^d (or, if you like, a set of functions R → M^d parameterized by Θ). M^d is a finite-dimensional topological (usually second-differentiable) submanifold (or submanifold-with-boundary) of dimension d of the space of probability distributions. f is usually taken to be continuous (or sufficiently differentiable). Θ ⊂ R^(d−1) are the "nuisance parameters." The distribution of Y is assumed to be f(β(X), θ) for some unknown dual vector β ∈ (R^p)* (the "regression coefficients") and some unknown θ ∈ Θ. We may write this

    Y ~ f(β(X), θ).

  2. Random regressors. The regressors and the response are a (p+1)-dimensional vector-valued random variable Z = (X, Y): Ω → R^p × R. The model f is the same kind of object as before, but now it gives the conditional probability

    Y | X ~ f(β(X), θ).

The mathematical description is useless without some prescription telling how it is intended to apply to data. In the fixed-regressor case we conceive of X as being specified by the experimenter. Thus it might help to view Ω as a product R^p × Ω′ endowed with a product sigma algebra. The experimenter determines X and Nature determines some (unknown, abstract) ω ∈ Ω′. In the random-regressor case, Nature determines ω ∈ Ω, the X-component of the random variable π_X(Z(ω)) determines X (which is "observed"), and we now have an ordered pair (X(ω), ω) ∈ Ω exactly as in the fixed-regressor case.
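To make the two prescriptions concrete, here is a minimal simulation sketch (my own illustration, not part of the original answer; the coefficient and noise values are assumptions) contrasting the two cases in Python: in the fixed-regressor case the analyst specifies X, while in the random-regressor case X is drawn along with Y, yet the conditional model for Y given X is the same.

    import numpy as np

    rng = np.random.default_rng(0)
    beta = np.array([2.0, -1.0])   # assumed "true" coefficients (intercept, slope)
    sigma = 0.5                    # assumed noise scale

    # Case 1: fixed regressors -- the experimenter specifies X.
    X_fixed = np.column_stack([np.ones(100), np.linspace(0.0, 1.0, 100)])
    y_fixed = X_fixed @ beta + rng.normal(0.0, sigma, size=100)  # only Y is random

    # Case 2: random regressors -- nature draws (X, Y) jointly.
    x_rand = rng.uniform(0.0, 1.0, size=100)                     # X is itself random
    X_rand = np.column_stack([np.ones(100), x_rand])
    y_rand = X_rand @ beta + rng.normal(0.0, sigma, size=100)    # Y | X has the same form

    # In both cases the conditional distribution of Y given X is N(beta(X), sigma).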


The archetypical example of multiple linear regression (which I will express using standard notation for the objects rather than this more general one) is that

    f(β(x), σ) = N(β(x), σ)

for some constant σ ∈ Θ = R^+. As x varies throughout R^p, its image differentiably traces out a one-dimensional subset--a curve--in the two-dimensional manifold of Normal distributions.

When--in any fashion whatsoever--β is estimated as β^ and σ as σ^, the value of β^(x) is the predicted value of Y associated with x--whether x is controlled by the experimenter (case 1) or is only observed (case 2). If we either set a value (case 1) or observe a realization (case 2) x of X, then the response Y associated with that X is a random variable whose distribution is N(β(x),σ), which is unknown but estimated to be N(β^(x),σ^).
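As an illustration of that last paragraph, here is a hedged sketch (my own, using numpy's least-squares routine; the data-generating values are assumptions for the demonstration) of estimating β^ and σ^ and forming the estimated distribution N(β^(x), σ^) at a new design point x:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 200
    X = np.column_stack([np.ones(n), rng.uniform(0.0, 1.0, n)])
    y = X @ np.array([2.0, -1.0]) + rng.normal(0.0, 0.5, n)  # assumed true model

    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)         # estimate beta
    resid = y - X @ beta_hat
    sigma_hat = np.sqrt(resid @ resid / (n - X.shape[1]))    # residual standard error

    x_new = np.array([1.0, 0.3])                             # a new x (set or observed)
    mu_hat = x_new @ beta_hat
    print(f"Estimated distribution of Y at x_new: N({mu_hat:.3f}, {sigma_hat:.3f})")

The same computation applies whether x was controlled by the experimenter (case 1) or merely observed (case 2).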


Let me just mention that this is a fantastic answer (but probably not for everybody).
l7ll7

P.S. Do you know of any book where these foundational questions are explained as precisely as you did here? As a mathematician, all the books I found reflect the other answers here, which are much less precise from a mathematical point of view. (This doesn't make them bad, of course; it's just that those books are not for me - I would love a book that is more precise, like this answer.)
l7ll7

In the first sentence of the last paragraph, isn't β^(x) the predicted value for y (a realization of the random variable Y), not the predicted value for x? Or have I misunderstood your language, and "predicted value for x" means "predicted value when x is the set (observed) value of X"?
Chad

@Chad Thank you for pointing out the ambiguous language. I have edited that sentence to clarify the meaning, which is consistent with your understanding.
whuber


First of all, @whuber gave an excellent answer. I'll give it a different take, maybe simpler in some sense, also with a reference to a text.

MOTIVATION

X can be random or fixed in the regression formulation. This depends on your problem. For so-called observational studies it has to be random, and for experiments it usually is fixed.

Example one. I'm studying the impact of exposure to electron radiation on the hardness of a metal part. So, I take a few samples of the metal part and expose them to varying levels of radiation. My exposure level is X, and it's fixed, because I set it to the levels that I chose. I fully control the conditions of the experiment, or at least try to. I can do the same with other parameters, such as temperature and humidity.

Example two. You're studying the impact of the economy on the frequency of fraud occurrences in credit card applications. So, you regress the fraud event counts on GDP. You do not control GDP; you can't set it to a desired level. Moreover, you probably want to look at multivariate regressions, so you have other variables such as unemployment, and now you have a combination of values in X, which you observe but do not control. In this case X is random.

Example three. You are studying the efficacy of a new pesticide in the field, i.e. not in lab conditions but on an actual experimental farm. In this case you can control some things, e.g. you can control the amount of pesticide to apply. However, you do not control everything, e.g. the weather or soil conditions. Ok, you can control the soil to some extent, but not completely. This is an in-between case, where some conditions are observed and some are controlled. There's an entire field of study called experimental design that is focused on this third case, and agricultural research is one of its biggest applications.

MATH

Here goes the mathematical part of the answer. There's a set of assumptions that is usually presented when studying linear regression, called the Gauss-Markov conditions. They are very theoretical, and nobody bothers to prove that they hold in any practical setup. However, they are very useful in understanding the limitations of the ordinary least squares (OLS) method.

So, the set of assumptions is different for random and fixed X, which roughly corresponds to observational vs. experimental studies. Roughly, because, as I showed in the third example, sometimes we're really in between the extremes. I find the "Gauss-Markov theorem" section in the Encyclopedia of Research Design by Salkind a good place to start; it's available in Google Books.

The model is

    Y = Xβ + ε

with the following assumptions in the fixed design:

  • E[ε] = 0
  • E[ε²] = σ²
  • E[εiεj] = 0 for i ≠ j

vs. the same assumptions in the random design:

  • E[ε|X] = 0
  • Homoscedasticity: E[ε²|X] = σ²
  • No serial correlation: E[εiεj|X] = 0 for i ≠ j

As you can see, the difference is in conditioning the assumptions on the design matrix for the random design. Conditioning makes these stronger assumptions. For instance, we are not just saying, as in the fixed design, that the errors have zero mean; in the random design we also say that their mean does not depend on X, the covariates.
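To see why the conditional versions are genuinely stronger, here is a small sketch (my own construction, not from the answer; the 0.8 coefficient is an arbitrary assumption) in which the errors have zero unconditional mean, E[ε] = 0, yet E[ε|X] ≠ 0, so OLS is biased in the random design:

    import numpy as np

    rng = np.random.default_rng(2)
    n = 100_000
    x = rng.normal(size=n)
    u = rng.normal(size=n)
    eps = 0.8 * x + u          # E[eps] = 0 overall, but E[eps | X] = 0.8 * x != 0
    y = 1.0 + 2.0 * x + eps    # true slope is 2

    X = np.column_stack([np.ones(n), x])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(beta_hat)            # slope comes out near 2.8, not 2: OLS is biased here

The fixed-design statement E[ε] = 0 alone would not rule this out; the conditional statement E[ε|X] = 0 does.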



In statistics a random variable is a quantity that varies randomly in some way. You can find a good discussion in this excellent CV thread: What is meant by a "random variable"?

In a regression model, the predictor variables (X-variables, explanatory variables, covariates, etc.) are assumed to be fixed and known. They are not assumed to be random. All of the randomness in the model is assumed to be in the error term. Consider a simple linear regression model as standardly formulated:

    Y = β0 + β1X + ε, where ε ~ N(0, σ²)
The error term, ε, is a random variable and is the source of the randomness in the model. As a result of the error term, Y is a random variable as well. But X is not assumed to be a random variable. (Of course, it might be a random variable in reality, but that is not assumed or reflected in the model.)
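A tiny simulation sketch of this view (my own; the parameter values are illustrative assumptions): X is held fixed across repeated samples, and all the variability in Y comes from ε.

    import numpy as np

    rng = np.random.default_rng(3)
    beta0, beta1, sigma = 1.0, 0.5, 0.2   # assumed parameter values
    X = np.array([1.0, 2.0, 3.0, 4.0])    # fixed and known; never re-drawn

    for rep in range(3):                  # repeated sampling: only eps (hence Y) changes
        eps = rng.normal(0.0, sigma, size=X.size)
        Y = beta0 + beta1 * X + eps
        print(Y)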

So you mean X is a constant? Because that is the only other way to make sense of X from a mathematical point of view, since ε is a random variable and addition is only defined between two random variables, not between "something else" and a random variable. Though one of the two random variables could be constant, which is the case I'm referring to.
l7ll7

P.S. I looked at all the explanations from said link and none are very illuminating. Why? Because none make the connection between random variables as probabilists understand them and as statisticians understand them. So some answers restate the standard, precise probability-theory definition, while others restate the (still unclear to me) vague statistical definition. But none really explain the connection between these two concepts. (The only exception is the long ticket-in-a-box model answer, which may show some promise, but even so [...]
l7ll7

the difference wasn't fleshed out clearly enough to be strikingly illuminating; I'll have to meditate on this specific answer to see if there's any value to it)
l7ll7

@user10324, if you like, you can think of X as a set of constants. You could also think of it as a non-random variable.
gung - Reinstate Monica

No, the non-random-variable way of thinking about it does not work, for two reasons: one, as I argued in the comments above, there is no such thing as a "variable" in mathematics; and two, even if there were, addition in that case is not defined, as I argued in the comments above.
l7ll7


Not sure if I understand the question, but if you're just asking, "must an independent variable always be a random variable", then the answer is no.

An independent variable is a variable which is hypothesised to be correlated with the dependent variable. You then test whether this is the case through modelling (presumably regression analysis).

There are a lot of complications and "ifs, buts and maybes" here, so I would suggest getting a copy of a basic econometrics or statistics book covering regression analysis and reading it thoroughly, or else getting the class notes from a basic statistics/econometrics course online if possible.


Ok, but what is it, if it is not a random variable? Just a (therefore deterministic) function? I'm confused about the mathematical nature of the object "X". Actually, I found in the meantime a textbook, Probability and Statistics by Papoulis, where on page 149 he says "given two random variables X and Y [...]" and then goes on to explain how to regress X on Y. So he seems to understand X as a random variable?
l7ll7

P.S. I want to add that there is no such thing as a "variable" in mathematics when you look at it as a "standalone" object (my background is maths). Variables in mathematics are just parts of standalone objects (e.g. arguments of a function) but have no standalone meaning. If I just wrote "x" in mathematics, it could mean the identity function x ↦ x, or it could be a specific number if x had been assigned a value previously, but we don't have just "x". And since linear regression is a mathematical model, I'm interested in the mathematical meaning of X.
l7ll7

It sounds as though you have a much greater understanding of maths than I do. I'm just giving you the standard university undergraduate econometrics/statistics answer. I wonder if perhaps you might be overthinking it a bit, at least from the perspective of practical analysis. Regarding the quote from that book, my interpretation is that the specific x and y to which he is referring are random - but that doesn't mean that any x or any y are random.
Statsanalyst

e.g. the dependent variable in a model for voting trends in UK politics might be the number of votes received by the Conservative candidate in each constituency (Riding to Canadians, District to Americans), and the independent variable might be average house prices (a proxy for wealth/income in the UK). Neither of these is a "random" variable as I understand it, but this would be a perfectly reasonable thing to model.
Statsanalyst

Ok, it's good to know what kind of answers I can expect / what's standard in econometrics/statistics departments, and I appreciate that feedback very much (I would upvote again, but I can't since I already did). The problem with mathematics is "once you go black you never go back": yearlong training in mathematical precision will induce a feeling of uneasiness if something is not crystal-clearly fleshed out, until one achieves clarity [...]
l7ll7