What is the intuition behind conditional Gaussian distributions?



Suppose that $X \sim N_2(\mu, \Sigma)$. Then the conditional distribution of $X_1$ given $X_2 = x_2$ is normally distributed with mean

$$E[X_1 \mid X_2 = x_2] = \mu_1 + \frac{\sigma_{12}}{\sigma_{22}}(x_2 - \mu_2)$$

and variance

$$\operatorname{Var}[X_1 \mid X_2 = x_2] = \sigma_{11} - \frac{\sigma_{12}^2}{\sigma_{22}}.$$

It makes sense that the variance would decrease since we have more information. But what is the intuition behind the mean formula? How does the covariance between X1 and X2 factor into the conditional mean?
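(For anyone who wants to check these formulas numerically, here is a minimal R simulation sketch; the values of $\mu$ and $\Sigma$ are arbitrary examples, not from any particular problem.)

```r
# A minimal simulation check of both formulas; mu and Sigma are arbitrary
# example values.
set.seed(1)
library(MASS)                                # for mvrnorm()
mu    <- c(1, 2)
Sigma <- matrix(c(2.0, 0.8,
                  0.8, 1.0), 2, 2)
X     <- mvrnorm(1e6, mu, Sigma)             # draws from N_2(mu, Sigma)

x2    <- 2.5                                 # the value we condition on
keep  <- abs(X[, 2] - x2) < 0.01             # a thin slice where X_2 ~ x2
mean(X[keep, 1])                             # empirical conditional mean
mu[1] + Sigma[1, 2] / Sigma[2, 2] * (x2 - mu[2])  # theory: 1.4
var(X[keep, 1])                              # empirical conditional variance
Sigma[1, 1] - Sigma[1, 2]^2 / Sigma[2, 2]    # theory: 1.36
```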


Is your question simply 'why isn't the mean of the conditional distribution = μ1'?
gung - Reinstate Monica

@gung: This is true if x2=μ2. But why are σ11 and σ22 involved?
eroeijr

In natural ("standardized") units we write $X_i = \mu_i + \sigma_i Z_i$ where $\sigma_i = \sqrt{\sigma_{ii}}$. In these terms the conditional distribution is Normal with $E(Z_1 \mid Z_2) = \rho Z_2$ and $\rho = \sigma_{12}/(\sigma_1 \sigma_2)$. The fact that $|\rho| \le 1$ is called "mean reversion" or "regression to the mean": there is an extensive technical and popular literature on this going back 130 years.
whuber

Say, eroeijr, is this post yours? (Aside from the 'guest' at the start there's a distinct similarity in the names.) If it is yours you should ask to merge the two accounts and take that big bonus in points you'd have.
Glen_b

As @Glen_b suggested, if you have multiple (unregistered) accounts, please complete the form at stats.stackexchange.com/contact and request that they be merged.
chl

Answers:



Synopsis

Every statement in the question can be understood as a property of ellipses. The only property particular to the bivariate Normal distribution that is needed is the fact that in a standard bivariate Normal distribution of X,Y--for which X and Y are uncorrelated--the conditional variance of Y does not depend on X. (This in turn is an immediate consequence of the fact that lack of correlation implies independence for jointly Normal variables.)

The following analysis shows precisely what property of ellipses is involved and derives all the equations of the question using elementary ideas and the simplest possible arithmetic, in a way intended to be easily remembered.


Circularly symmetric distributions

The distribution of the question is a member of the family of bivariate Normal distributions. They are all derived from a basic member, the standard bivariate Normal, which describes two uncorrelated standard Normal distributions (forming its two coordinates).

Figure 1: the standard bivariate normal distribution

The left side is a relief plot of the standard bivariate normal density. The right side shows the same in pseudo-3D, with the front part sliced away.

This is an example of a circularly symmetric distribution: the density varies with distance from a central point but not with the direction away from that point. Thus, the contours of its graph (at the right) are circles.

Most other bivariate Normal distributions are not circularly symmetric, however: their cross-sections are ellipses. These ellipses model the characteristic shape of many bivariate point clouds.

Figure 2: another bivariate normal distribution, plotted

These are portraits of the bivariate Normal distribution with covariance matrix $\Sigma = \begin{pmatrix} 1 & \frac{2}{3} \\ \frac{2}{3} & 1 \end{pmatrix}$. It is a model for data with correlation coefficient $2/3$.
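(A quick R sketch confirms the stated correlation by simulation; the sample size is an arbitrary choice.)

```r
# Sketch: draw from this Sigma and confirm the stated correlation of 2/3.
set.seed(1)
library(MASS)
Sigma <- matrix(c(1, 2/3,
                  2/3, 1), 2, 2)
X <- mvrnorm(1e5, mu = c(0, 0), Sigma = Sigma)
cor(X[, 1], X[, 2])   # ~ 0.667
```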


How to Create Ellipses

An ellipse--according to its oldest definition--is a conic section, which is a circle distorted by a projection onto another plane. By considering the nature of projection, just as visual artists do, we may decompose it into a sequence of distortions that are easy to understand and calculate with.

First, stretch (or, if necessary, squeeze) the circle along what will become the long axis of the ellipse until it is the correct length:

Step 1: stretch

Next, squeeze (or stretch) this ellipse along its minor axis:

Step 2: squeeze

Third, rotate it around its center into its final orientation:

Step 3: rotate

Finally, shift it to the desired location:

Step 4: shift

These are all affine transformations. (In fact, the first three are linear transformations; the final shift makes it affine.) Because a composition of affine transformations is (by definition) still affine, the net distortion from the circle to the final ellipse is an affine transformation. But it can be somewhat complicated:

Composite transformation

Notice what happened to the ellipse's (natural) axes: after they were created by the stretch and squeeze, they (of course) rotated and shifted along with the ellipse itself. We easily see these axes even when they are not drawn, because they are axes of symmetry of the ellipse itself.
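In code, the composition is just a product of matrices plus a translation. Here is a minimal R sketch; the particular stretch, squeeze, angle, and shift are invented for illustration.

```r
# Sketch of the stretch -> squeeze -> rotate -> shift pipeline applied to the
# unit circle; the particular amounts and angle below are invented.
theta  <- seq(0, 2 * pi, length.out = 200)
circle <- rbind(cos(theta), sin(theta))   # 2 x 200 matrix of points

stretch <- diag(c(2, 1))                  # 1. stretch along x
squeeze <- diag(c(1, 1/3))                # 2. squeeze along y
phi     <- pi / 6                         # 3. rotate by 30 degrees
rotate  <- matrix(c(cos(phi), sin(phi),
                   -sin(phi), cos(phi)), 2, 2)
shift   <- c(1, 1/2)                      # 4. translate

# Composing the linear parts gives one matrix; adding the shift makes it affine.
M       <- rotate %*% squeeze %*% stretch
ellipse <- M %*% circle + shift           # shift recycles down each column
plot(t(ellipse), type = "l", asp = 1, xlab = "x", ylab = "y")
```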

We would like to apply our understanding of ellipses to understanding distorted circularly symmetric distributions, like the bivariate Normal family. Unfortunately, there is a problem with these distortions: they do not respect the distinction between the x and y axes. The rotation at step 3 ruins that. Look at the faint coordinate grids in the backgrounds: these show what happens to a grid (of mesh 1/2 in both directions) when it is distorted. In the first image the spacing between the original vertical lines (shown solid) is doubled. In the second image the spacing between the original horizontal lines (shown dashed) is shrunk by a third. In the third image the grid spacings are not changed, but all the lines are rotated. They shift up and to the right in the fourth image. The final image, showing the net result, displays this stretched, squeezed, rotated, shifted grid. The original solid lines of constant x coordinate no longer are vertical.

The key idea--one might venture to say it is the crux of regression--is that there is a way in which the circle can be distorted into an ellipse without rotating the vertical lines. Because the rotation was the culprit, let's cut to the chase and show how to create a rotated ellipse without actually appearing to rotate anything!

Skewed ellipse

This is a skew transformation. It actually does two things at once:

  • It squeezes in the y direction (by an amount λ, say). This leaves the x-axis alone.

  • It lifts any resulting point (x,y) by an amount directly proportional to x. Writing that constant of proportionality as ρ, this sends (x,y) to (x,y+ρx).

The second step lifts the x-axis into the line y=ρx, shown in the previous figure. As shown in that figure, I want to work with a special skew transformation, one that effectively rotates the ellipse by 45 degrees and inscribes it into the unit square. The major axis of this ellipse is the line y=x. It is visually evident that $|\rho| \le 1$. (Negative values of ρ tilt the ellipse down to the right rather than up to the right.) This is the geometric explanation of "regression to the mean."

Choosing an angle of 45 degrees makes the ellipse symmetric around the square's diagonal (part of the line y=x). To figure out the parameters of this skew transformation, observe:

  • The lifting by ρx moves the point (1,0) to (1,ρ).

  • The symmetry around the main diagonal then implies the point (ρ,1) also lies on the ellipse.

Where did this point start out?

  • The original (upper) point on the unit circle (having implicit equation $x^2 + y^2 = 1$) with x coordinate ρ was $(\rho, \sqrt{1 - \rho^2})$.

  • Any point of the form (ρ,y) first got squeezed to (ρ,λy) and then lifted to (ρ,λy+ρ×ρ).

The unique solution to the equation $(\rho, \lambda\sqrt{1 - \rho^2} + \rho^2) = (\rho, 1)$ is $\lambda = \sqrt{1 - \rho^2}$. That is the amount by which all distances in the vertical direction must be squeezed in order to create an ellipse at a 45 degree angle when it is skewed vertically by ρ.
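(A quick numerical check of this construction, with an arbitrary ρ, confirms that the skewed circle is inscribed in the unit square and passes through (1,ρ) and (ρ,1).)

```r
# Numerical check with an arbitrary rho: the map (x, y) -> (x, lambda*y + rho*x)
# with lambda = sqrt(1 - rho^2) sends the unit circle to an ellipse inscribed
# in the unit square and passing through (1, rho) and (rho, 1).
rho    <- 0.6
lambda <- sqrt(1 - rho^2)
theta  <- seq(0, 2 * pi, length.out = 1e4)
x      <- cos(theta); y <- sin(theta)     # the unit circle
ex     <- x                               # x-coordinates are untouched
ey     <- lambda * y + rho * x            # squeeze, then lift in proportion to x

range(ey)                                 # ~ [-1, 1]: inscribed in the square
min(abs(ex - rho) + abs(ey - 1))          # ~ 0: (rho, 1) lies on the ellipse
```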

To firm up these ideas, here is a tableau showing how a circularly symmetric distribution is distorted into distributions with elliptical contours by means of these skew transformations. The panels show values of ρ equal to 0, 3/10, 6/10, and 9/10, from left to right.

Tableau

The leftmost figure shows a set of starting points around one of the circular contours as well as part of the horizontal axis. Subsequent figures use arrows to show how those points are moved. The image of the horizontal axis appears as a slanted line segment (with slope ρ). (The colors represent different amounts of density in the different figures.)


Application

We are ready to do regression. A standard, elegant (yet simple) method to perform regression is first to express the original variables in new units of measurement: we center them at their means and use their standard deviations as the units. This moves the center of the distribution to the origin and makes all its elliptical contours slant 45 degrees (up or down).

When these standardized data form a circular point cloud, the regression is easy: the means conditional on x are all 0, forming a line passing through the origin. (Circular symmetry implies symmetry with respect to the x axis, showing that all conditional distributions are symmetric, whence they have 0 means.) As we have seen, we may view the standardized distribution as arising from this basic simple situation in two steps: first, all the (standardized) y values are multiplied by $\sqrt{1 - \rho^2}$ for some value of $\rho$; next, all the values at horizontal coordinate x are vertically skewed by $\rho x$. What did these distortions do to the regression line (which plots the conditional means against x)?

  • The shrinking of y coordinates multiplied all vertical deviations by a constant. This merely changed the vertical scale and left all conditional means unaltered at 0.

  • The vertical skew transformation added ρx to all conditional values at x, thereby adding ρx to their conditional mean: the curve y=ρx is the regression curve, which turns out to be a line.

Similarly, we may verify that because the x-axis is the least squares fit to the circularly symmetric distribution, the least squares fit to the transformed distribution also is the line y=ρx: the least-squares line coincides with the regression line.

These beautiful results are a consequence of the fact that the vertical skew transformation does not change any x coordinates.
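For readers who like to verify such claims by simulation, here is a small R sketch (the covariance matrix is an arbitrary example) showing that the least-squares slope of the standardized variables is the correlation coefficient:

```r
# Simulation sketch: after standardizing, the least-squares slope equals the
# correlation coefficient. The covariance matrix here is an arbitrary example.
set.seed(1)
library(MASS)
Sigma <- matrix(c(4, 3,
                  3, 9), 2, 2)            # implies rho = 3 / sqrt(4 * 9) = 0.5
X  <- mvrnorm(1e5, c(5, -2), Sigma)
zx <- as.numeric(scale(X[, 1]))           # center, then divide by the SD
zy <- as.numeric(scale(X[, 2]))
coef(lm(zy ~ zx))[["zx"]]                 # least-squares slope ...
cor(X[, 1], X[, 2])                       # ... agrees with rho ~ 0.5
```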

We can easily say more:

  • The first bullet (about shrinking) shows that when (X,Y) has any circularly symmetric distribution, the conditional variance of Y|X was multiplied by $\left(\sqrt{1 - \rho^2}\right)^2 = 1 - \rho^2$.

  • More generally: the vertical skew transformation rescales each conditional distribution by $\sqrt{1 - \rho^2}$ and then recenters it by $\rho x$.

For the standard bivariate Normal distribution, the conditional variance is a constant (equal to 1), independent of x. We immediately conclude that after applying this skew transformation, the conditional variance of the vertical deviations is still a constant and equals $1 - \rho^2$. Because the conditional distributions of a bivariate Normal are themselves Normal, now that we know their means and variances, we have full information about them.

Finally, we need to relate ρ to the original covariance matrix Σ. For this, recall that the (nicest) definition of the correlation coefficient between two standardized variables X and Y is the expectation of their product XY. (The correlation of X and Y is simply declared to be the correlation of their standardized versions.) Therefore, when (X,Y) follows any circularly symmetric distribution and we apply the skew transformation to the variables, we may write

$$\varepsilon = Y - \rho X$$

for the vertical deviations from the regression line and notice that ε must have a symmetric distribution around 0. Why? Because before the skew transformation was applied, Y had a symmetric distribution around 0 and then we (a) squeezed it and (b) lifted it by ρX. The former did not change its symmetry while the latter recentered it at ρX, QED. The next figure illustrates this.

3D plot showing conditional distributions and the least-squares line

The black lines trace out heights proportional to the conditional densities at various regularly-spaced values of x. The thick white line is the regression line, which passes through the center of symmetry of each conditional curve. This plot shows the case ρ=1/2 in standardized coordinates.

Consequently

$$E(XY) = E(X(\rho X + \varepsilon)) = \rho E(X^2) + E(X\varepsilon) = \rho(1) + 0 = \rho.$$

The final equality is due to two facts: (1) because X has been standardized, the expectation of its square is its variance, equal to 1 by construction; and (2) the expectation of $X\varepsilon$ equals the expectation of $X(-\varepsilon)$ by virtue of the symmetry of $\varepsilon$. Because the latter is the negative of the former, both must equal 0: this term drops out.

We have identified the parameter of the skew transformation, ρ, as being the correlation coefficient of X and Y.
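Notice that this argument never used normality. A small R sketch illustrates the point with a circularly symmetric but decidedly non-Normal starting distribution (uniform on the unit disk); the value of ρ is arbitrary.

```r
# The argument needs no normality: start from a circularly symmetric but
# non-Normal cloud (uniform on the unit disk), apply the skew, and the
# correlation of the result is the skew parameter. rho = 0.7 is arbitrary.
set.seed(1)
n   <- 1e6
r   <- sqrt(runif(n))                 # radius for a uniform draw on the disk
a   <- runif(n, 0, 2 * pi)
x   <- r * cos(a); y0 <- r * sin(a)   # circularly symmetric pair
rho <- 0.7
y   <- sqrt(1 - rho^2) * y0 + rho * x # squeeze, then lift by rho * x
cor(x, y)                             # ~ 0.7
```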


Conclusions

By observing that any ellipse may be produced by distorting a circle with a vertical skew transformation that preserves the x coordinate, we have arrived at an understanding of the contours of any distribution of random variables (X,Y) that is obtained from a circularly symmetric one by means of stretches, squeezes, rotations, and shifts (that is, any affine transformation). By re-expressing the results in terms of the original units of x and y--which amounts to adding back their means, μx and μy, after multiplying by their standard deviations σx and σy--we find that:

  • The least-squares line and the regression curve both pass through the origin of the standardized variables, which corresponds to the "point of averages" (μx,μy) in original coordinates.

  • The regression curve, which is defined to be the locus of conditional means, {(x,ρx)}, coincides with the least-squares line.

  • The slope of the regression line in standardized coordinates is the correlation coefficient ρ; in the original units it therefore equals σyρ/σx.

Consequently the equation of the regression line is

$$y = \frac{\sigma_y \rho}{\sigma_x}(x - \mu_x) + \mu_y.$$

  • The conditional variance of Y|X is $\sigma_y^2(1 - \rho^2)$ times the conditional variance of $Y^* \mid X^*$ where $(X^*, Y^*)$ has a standard distribution (circularly symmetric with unit variances in both coordinates), $X^* = (X - \mu_X)/\sigma_x$, and $Y^* = (Y - \mu_Y)/\sigma_y$.

None of these results is a particular property of bivariate Normal distributions! For the bivariate Normal family, the conditional variance of $Y^* \mid X^*$ is constant (and equal to 1): this fact makes that family particularly simple to work with. In particular:

  • Because in the covariance matrix Σ the coefficients are $\sigma_{11} = \sigma_x^2$, $\sigma_{12} = \sigma_{21} = \rho\sigma_x\sigma_y$, and $\sigma_{22} = \sigma_y^2$, the conditional variance of Y|X for a bivariate Normal distribution is

$$\sigma_y^2(1 - \rho^2) = \sigma_{22}\left(1 - \left(\frac{\sigma_{12}}{\sqrt{\sigma_{11}\sigma_{22}}}\right)^2\right) = \sigma_{22} - \frac{\sigma_{12}^2}{\sigma_{11}}.$$
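(As a quick arithmetic check of this identity, with made-up Σ entries:)

```r
# Arithmetic check of the identity; the Sigma entries are made up.
s11 <- 4; s22 <- 9; s12 <- 3
rho <- s12 / sqrt(s11 * s22)      # 0.5
s22 * (1 - rho^2)                 # sigma_y^2 (1 - rho^2)          -> 6.75
s22 - s12^2 / s11                 # sigma_22 - sigma_12^2/sigma_11 -> 6.75
```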

Technical Notes

The key idea can be stated in terms of matrices describing the linear transformations. It comes down to finding a suitable "square root" of the correlation matrix for which the y direction is an eigenvector. Thus:

$$\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} = A A'$$

where

$$A = \begin{pmatrix} 1 & 0 \\ \rho & \sqrt{1 - \rho^2} \end{pmatrix}.$$

A much better known square root is the one initially described (involving a rotation instead of a skew transformation); it is the one produced by a singular value decomposition and it plays a prominent role in principal components analysis (PCA):

$$\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} = B B';$$

$$B = Q \begin{pmatrix} \sqrt{\rho + 1} & 0 \\ 0 & \sqrt{1 - \rho} \end{pmatrix} Q'$$

where $Q = \begin{pmatrix} \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{pmatrix}$ is the rotation matrix for a 45 degree rotation.

Thus, the distinction between PCA and regression comes down to the difference between two special square roots of the correlation matrix.
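To make the comparison concrete, here is a brief R sketch (ρ is an arbitrary example value) computing both square roots and confirming that each reproduces the correlation matrix:

```r
# The two square roots side by side for an arbitrary rho: A is the skew
# (lower-triangular) root, B the symmetric root built from the 45-degree
# rotation. Each reproduces the correlation matrix.
rho <- 0.6
R   <- matrix(c(1, rho, rho, 1), 2, 2)

A <- matrix(c(1,   0,
              rho, sqrt(1 - rho^2)), 2, 2, byrow = TRUE)
A %*% t(A)                        # = R

Q <- matrix(c(1, -1,
              1,  1), 2, 2, byrow = TRUE) / sqrt(2)
B <- Q %*% diag(c(sqrt(1 + rho), sqrt(1 - rho))) %*% t(Q)
B %*% t(B)                        # = R as well
```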


Beautiful pictures and great descriptions. There were a few sentences in the update that were left incomplete (like you knew basically what you were going to say, but hadn't settled on the final wording).
cardinal

@Cardinal Thanks. I will be re-reading this and looking for such things, as well as for the inevitable typos. You are too kind to point out other things which you surely noticed, such as some gaps in the exposition. The biggest is that I did not actually show that these ellipses are at 45 degree angles (equivalently, inscribed in the unit square); I simply assumed that. I am still looking for a simple demonstration. The other is that one might worry the skew transformation could produce a different distribution than the original stretch-squeeze-rotate-shift--but it's easy to show it doesn't.
whuber

That's really interesting. Thanks for taking the time to write it up.
Bill

In the 1st paragraph of Application it's written that "we center them at their means and use their standard deviations as the units. This moves the center of the distribution to the origin and makes all its elliptical contours slant 45 degrees," but I don't understand how centering the variables at their means moves the center of the distribution to the origin and aligns the contours at 45 degrees.
Kaushal28

@whuber when you start with the unit circle (standardized sample set), you say the correlation is 0, so I imagine we get a circle, something like $f(x, y) = e^{-\frac{1}{2}(x^2 + y^2)}$. But how does 0 correlation imply independence? (Because $f(X, Y)$ is obtained as $f(X)f(Y)$, as we see.) That is usually not true, right? Even dependent variables can produce 0 correlation.
Parthiban Rajendran


This is essentially linear (OLS) regression. In that case, you are finding the conditional distribution of Y given that X=xi. (Strictly speaking, OLS regression does not make assumptions about the distribution of X, whereas your example is a multivariate normal, but we will ignore these things.) Now, if the covariance between X1 and X2 is not 0, then the mean of the conditional distribution of X2 has to shift as you change the value of x1 where you are 'slicing through' the multivariate distribution. Consider the figure below:

Figure: a bivariate normal point cloud with the conditional distributions of X2 drawn at X1 = 25 and X1 = 45.

Here we see that the marginal distributions are both normal, with a positive correlation between X1 and X2. If you look at the conditional distribution of X2 at any point on X1, the distribution is a univariate normal. However, because of the positive correlation (i.e., the non-zero covariance), the mean of those conditional distributions shifts up as you move from left to right. For example, the figure shows that $\mu_{X_2 \mid X_1 = 25} < \mu_{X_2 \mid X_1 = 45}$.

(For any future readers who might be confused by the symbols, I want to state that, e.g., σ22 is an element of the covariance matrix Σ. Thus, it is the variance of X2, even though people will typically think of a variance as σ2, and σ as a standard deviation.)

Your equation for the mean is directly connected to the equation for estimating the slope in OLS regression (and remember that in regression $\hat{y}_i$ is the conditional mean):

$$\hat{\beta}_1 = \frac{\operatorname{Cov}(x, y)}{\operatorname{Var}(x)}$$
In your equation, σ12/σ22 is the covariance over the variance; that is, it is the slope, just as above. Thus, your equation for the mean is just sliding your conditional mean, $\mu_{X_2 \mid X_1 = x_i}$, up or down from its unconditional mean, $\mu_{X_2}$, based on how far $x_{1i}$ is from $\mu_{X_1}$ and on the slope of the relationship between X1 and X2.
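A two-line R check of this connection, with simulated data and a made-up true slope of 0.5:

```r
# A short check that the OLS slope is Cov(x, y) / Var(x); the data and the
# true slope of 0.5 are made up.
set.seed(1)
x <- rnorm(1e4)
y <- 2 + 0.5 * x + rnorm(1e4)
cov(x, y) / var(x)                # ~ 0.5
coef(lm(y ~ x))[["x"]]            # identical (the same formula underneath)
```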

What happens if you condition on more variables? You would just add and subtract extra terms from the mean and variance?

@kerkejnrke, if you model the distribution of Y conditional on some specific level of a set of variables X, you are doing multiple regression. This is a little more complicated, but ultimately the same thing. The mean would be: $\hat{y}_i = X_i\hat{\beta}$, where $\hat{\beta} = (X^T X)^{-1} X^T Y$.
gung - Reinstate Monica
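(To illustrate the formula in the comment above: a minimal R sketch with made-up data, showing that the normal equations reproduce lm()'s coefficients.)

```r
# Minimal sketch of beta-hat = (X'X)^{-1} X'Y with made-up data.
set.seed(1)
n  <- 1000
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)
X  <- cbind(1, x1, x2)                 # design matrix with an intercept column
solve(t(X) %*% X, t(X) %*% y)          # the normal-equations solution
coef(lm(y ~ x1 + x2))                  # matches
```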

What did you use to produce the graph? Mathematica?
mpiktas

@mpiktas, my graph or whuber's? I believe his are Mathematica, but I made the one above w/ R. (Ugly code though...)
gung - Reinstate Monica

@mpiktas, I can't imagine my code should ever be described as "awesome"... The normal curves are drawn w/ dnorm(y). I simply add the output to 25 & 45, & use as x.
gung - Reinstate Monica


Gung's answer is good (+1). There is another way of looking at it, though. Imagine that the covariance between X1 and X2 were to be positive. What does it mean for σ1,2>0? Well, it means that when X2 is above X2's mean, X1 tends to be above X1's mean, and vice versa.

Now suppose I told you that X2=x2>μ2. That is, suppose I told you that X2 is above its mean. Wouldn't you conclude that X1 is likely above its mean (since you know σ1,2>0 and you know what covariance means)? So, now, if you take the mean of X1, knowing that X2 is above X2's mean, you are going to get a number above X1's mean. That is what the formula says:

$$E\{X_1 \mid X_2 = x_2\} = \mu_1 + \frac{\sigma_{1,2}}{\sigma_{2,2}}(x_2 - \mu_2)$$
If the covariance is positive and X2 is above its mean, then E{X1|X2=x2}>μ1.

The conditional expectation takes the form above for the normal distribution, not for all distributions. This seems a little strange given that the reasoning in the paragraph above seems pretty compelling. However, (almost) no matter what the distributions of X1 and X2 this formula is right:

$$\operatorname{BLP}\{X_1 \mid X_2 = x_2\} = \mu_1 + \frac{\sigma_{1,2}}{\sigma_{2,2}}(x_2 - \mu_2)$$
where BLP means best linear predictor. The normal distribution is special in that the conditional expectation and the best linear predictor are the same thing.
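Here is a small R sketch, with deliberately non-Normal, made-up variables, showing that the moment formula and the least-squares fit agree even when the joint distribution is far from Normal:

```r
# Sketch with deliberately non-Normal, made-up variables: the moment formula
# and the least-squares fit give the same linear prediction, even though the
# joint distribution is far from Normal.
set.seed(1)
x2 <- rexp(1e5)                        # skewed, non-Normal
x1 <- x2^2 + rexp(1e5)                 # dependent on x2
mean(x1) + cov(x1, x2) / var(x2) * (1.5 - mean(x2))    # BLP at x2 = 1.5
predict(lm(x1 ~ x2), newdata = data.frame(x2 = 1.5))   # same number
```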

There does not seem to be any element of this argument which actually shows the coefficient of $x_2 - \mu_2$ should equal the ratio $\sigma_{12}/\sigma_{22}$ of the covariance to the variance. Why not the cube of that ratio? Or its sine? Or some other measure of association, such as the KL divergence (which has little to do with covariance)? Such formulas would qualitatively reproduce the behavior you describe. Given such vagueness in the reasoning, it should be no surprise that your formula applies only to a particular form of bivariate distribution and not to just any distribution.
whuber

@whuber Yeah, and it's even worse than that. It's not particularly hard to cook up an example with non-normal distributions where, for some value of x2>μ2, E(X1|X2=x2)<μ1 even though σ1,2>0. The "tends to be" and "likely to be" parts of my discussion are slushy. Perhaps one could lead with the BLP formula (maybe deriving it?), but the question asked for intuition rather than proof.
Bill

"Intuitive" does not imply "non-quantitative": the two can go together. It is often difficult to find an intuitive argument that gives quantitative results, but frequently it can be done and the process of finding such an argument is always illuminating.
whuber

Re the last paragraph: I have found out that the normal distribution is not so special: families created by affine transformations of circularly symmetric distributions are the special ones (of which there are very many).
whuber

@whuber That's pretty interesting. Do you have a link or cite?
Bill
Licensed under cc by-sa 3.0 with attribution required.