What is the intuition behind conditional Gaussian distributions?



Suppose that $X \sim N_2(\mu, \Sigma)$. Then the conditional distribution of $X_1$ given $X_2 = x_2$ is normally distributed with mean

$$E[X_1 \mid X_2 = x_2] = \mu_1 + \frac{\sigma_{12}}{\sigma_{22}}(x_2 - \mu_2)$$

and variance

$$\operatorname{Var}[X_1 \mid X_2 = x_2] = \sigma_{11} - \frac{\sigma_{12}^2}{\sigma_{22}}.$$

It makes sense that the variance would decrease since we have more information. But what is the intuition behind the mean formula? How does the covariance between X1 and X2 factor into the conditional mean?
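(For anyone who wants to check these formulas numerically, here is a minimal R simulation sketch; the values of $\mu$ and $\Sigma$ are arbitrary examples, not from any particular problem.)

```r
# A minimal simulation check of both formulas; mu and Sigma are arbitrary
# example values.
set.seed(1)
library(MASS)                                # for mvrnorm()
mu    <- c(1, 2)
Sigma <- matrix(c(2.0, 0.8,
                  0.8, 1.0), 2, 2)
X     <- mvrnorm(1e6, mu, Sigma)             # draws from N_2(mu, Sigma)

x2    <- 2.5                                 # the value we condition on
keep  <- abs(X[, 2] - x2) < 0.01             # a thin slice where X_2 ~ x2
mean(X[keep, 1])                             # empirical conditional mean
mu[1] + Sigma[1, 2] / Sigma[2, 2] * (x2 - mu[2])  # theory: 1.4
var(X[keep, 1])                              # empirical conditional variance
Sigma[1, 1] - Sigma[1, 2]^2 / Sigma[2, 2]    # theory: 1.36
```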


Is your question simply 'why isn't the mean of the conditional distribution = μ1'?
gung - Reinstate Monica

@gung: This is true if x2=μ2. But why are σ11 and σ22 involved?
eroeijr

In natural ("standardized") units we write $X_i = \mu_i + \sigma_i Z_i$ where $\sigma_i = \sqrt{\sigma_{ii}}$. In these terms the conditional distribution is Normal with $E(Z_1 \mid Z_2) = \rho Z_2$ and $\rho = \sigma_{12}/(\sigma_1 \sigma_2)$. The fact that $|\rho| \le 1$ is called "mean reversion" or "regression to the mean": there is an extensive technical and popular literature on this going back 130 years.
whuber

Say, eroeijr, is this post yours? (Aside from the 'guest' at the start there's a distinct similarity in the names.) If it is yours you should ask to merge the two accounts and take that big bonus in points you'd have.
Glen_b

As @Glen_b suggested, if you have multiple (unregistered) accounts, please complete the form at stats.stackexchange.com/contact and request that they be merged.
chl

Answers:



Synopsis

Every statement in the question can be understood as a property of ellipses. The only property particular to the bivariate Normal distribution that is needed is the fact that in a standard bivariate Normal distribution of X,Y--for which X and Y are uncorrelated--the conditional variance of Y does not depend on X. (This in turn is an immediate consequence of the fact that lack of correlation implies independence for jointly Normal variables.)

The following analysis shows precisely what property of ellipses is involved and derives all the equations of the question using elementary ideas and the simplest possible arithmetic, in a way intended to be easily remembered.


Circularly symmetric distributions

The distribution of the question is a member of the family of bivariate Normal distributions. They are all derived from a basic member, the standard bivariate Normal, which describes two uncorrelated standard Normal distributions (forming its two coordinates).

Figure 1: the standard bivariate normal distribution

The left side is a relief plot of the standard bivariate normal density. The right side shows the same in pseudo-3D, with the front part sliced away.

This is an example of a circularly symmetric distribution: the density varies with distance from a central point but not with the direction away from that point. Thus, the contours of its graph (at the right) are circles.

Most other bivariate Normal distributions are not circularly symmetric, however: their cross-sections are ellipses. These ellipses model the characteristic shape of many bivariate point clouds.

Figure 2: another bivariate normal distribution, plotted

These are portraits of the bivariate Normal distribution with covariance matrix $\Sigma = \begin{pmatrix} 1 & \frac{2}{3} \\ \frac{2}{3} & 1 \end{pmatrix}$. It is a model for data with correlation coefficient $2/3$.
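(A quick R sketch confirms the stated correlation by simulation; the sample size is an arbitrary choice.)

```r
# Sketch: draw from this Sigma and confirm the stated correlation of 2/3.
set.seed(1)
library(MASS)
Sigma <- matrix(c(1, 2/3,
                  2/3, 1), 2, 2)
X <- mvrnorm(1e5, mu = c(0, 0), Sigma = Sigma)
cor(X[, 1], X[, 2])   # ~ 0.667
```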


How to Create Ellipses

An ellipse--according to its oldest definition--is a conic section, which is a circle distorted by a projection onto another plane. By considering the nature of projection, just as visual artists do, we may decompose it into a sequence of distortions that are easy to understand and calculate with.

First, stretch (or, if necessary, squeeze) the circle along what will become the long axis of the ellipse until it is the correct length:

Step 1: stretch

Next, squeeze (or stretch) this ellipse along its minor axis:

Step 2: squeeze

Third, rotate it around its center into its final orientation:

Step 3: rotate

Finally, shift it to the desired location:

Step 4: shift

These are all affine transformations. (In fact, the first three are linear transformations; the final shift makes it affine.) Because a composition of affine transformations is (by definition) still affine, the net distortion from the circle to the final ellipse is an affine transformation. But it can be somewhat complicated:

Composite transformation

Notice what happened to the ellipse's (natural) axes: after they were created by the stretch and squeeze, they (of course) rotated and shifted along with the ellipse itself. We easily see these axes even when they are not drawn, because they are axes of symmetry of the ellipse itself.
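In code, the composition is just a product of matrices plus a translation. Here is a minimal R sketch; the particular stretch, squeeze, angle, and shift are invented for illustration.

```r
# Sketch of the stretch -> squeeze -> rotate -> shift pipeline applied to the
# unit circle; the particular amounts and angle below are invented.
theta  <- seq(0, 2 * pi, length.out = 200)
circle <- rbind(cos(theta), sin(theta))   # 2 x 200 matrix of points

stretch <- diag(c(2, 1))                  # 1. stretch along x
squeeze <- diag(c(1, 1/3))                # 2. squeeze along y
phi     <- pi / 6                         # 3. rotate by 30 degrees
rotate  <- matrix(c(cos(phi), sin(phi),
                   -sin(phi), cos(phi)), 2, 2)
shift   <- c(1, 1/2)                      # 4. translate

# Composing the linear parts gives one matrix; adding the shift makes it affine.
M       <- rotate %*% squeeze %*% stretch
ellipse <- M %*% circle + shift           # shift recycles down each column
plot(t(ellipse), type = "l", asp = 1, xlab = "x", ylab = "y")
```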

We would like to apply our understanding of ellipses to understanding distorted circularly symmetric distributions, like the bivariate Normal family. Unfortunately, there is a problem with these distortions: they do not respect the distinction between the x and y axes. The rotation at step 3 ruins that. Look at the faint coordinate grids in the backgrounds: these show what happens to a grid (of mesh 1/2 in both directions) when it is distorted. In the first image the spacing between the original vertical lines (shown solid) is doubled. In the second image the spacing between the original horizontal lines (shown dashed) is shrunk by a third. In the third image the grid spacings are not changed, but all the lines are rotated. They shift up and to the right in the fourth image. The final image, showing the net result, displays this stretched, squeezed, rotated, shifted grid. The original solid lines of constant x coordinate no longer are vertical.

The key idea--one might venture to say it is the crux of regression--is that there is a way in which the circle can be distorted into an ellipse without rotating the vertical lines. Because the rotation was the culprit, let's cut to the chase and show how to create a rotated ellipse without actually appearing to rotate anything!

Skewed ellipse

This is a skew transformation. It actually does two things at once:

  • It squeezes in the y direction (by an amount λ, say). This leaves the x-axis alone.

  • It lifts any resulting point (x,y) by an amount directly proportional to x. Writing that constant of proportionality as ρ, this sends (x,y) to (x,y+ρx).

The second step lifts the x-axis into the line y=ρx, shown in the previous figure. As shown in that figure, I want to work with a special skew transformation, one that effectively rotates the ellipse by 45 degrees and inscribes it into the unit square. The major axis of this ellipse is the line y=x. It is visually evident that $|\rho| \le 1$. (Negative values of ρ tilt the ellipse down to the right rather than up to the right.) This is the geometric explanation of "regression to the mean."

Choosing an angle of 45 degrees makes the ellipse symmetric around the square's diagonal (part of the line y=x). To figure out the parameters of this skew transformation, observe:

  • The lifting by ρx moves the point (1,0) to (1,ρ).

  • The symmetry around the main diagonal then implies the point (ρ,1) also lies on the ellipse.

Where did this point start out?

  • The original (upper) point on the unit circle (having implicit equation $x^2 + y^2 = 1$) with x coordinate ρ was $(\rho, \sqrt{1 - \rho^2})$.

  • Any point of the form (ρ,y) first got squeezed to (ρ,λy) and then lifted to (ρ,λy+ρ×ρ).

The unique solution to the equation $(\rho, \lambda\sqrt{1 - \rho^2} + \rho^2) = (\rho, 1)$ is $\lambda = \sqrt{1 - \rho^2}$. That is the amount by which all distances in the vertical direction must be squeezed in order to create an ellipse at a 45 degree angle when it is skewed vertically by ρ.
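(A quick numerical check of this construction, with an arbitrary ρ, confirms that the skewed circle is inscribed in the unit square and passes through (1,ρ) and (ρ,1).)

```r
# Numerical check with an arbitrary rho: the map (x, y) -> (x, lambda*y + rho*x)
# with lambda = sqrt(1 - rho^2) sends the unit circle to an ellipse inscribed
# in the unit square and passing through (1, rho) and (rho, 1).
rho    <- 0.6
lambda <- sqrt(1 - rho^2)
theta  <- seq(0, 2 * pi, length.out = 1e4)
x      <- cos(theta); y <- sin(theta)     # the unit circle
ex     <- x                               # x-coordinates are untouched
ey     <- lambda * y + rho * x            # squeeze, then lift in proportion to x

range(ey)                                 # ~ [-1, 1]: inscribed in the square
min(abs(ex - rho) + abs(ey - 1))          # ~ 0: (rho, 1) lies on the ellipse
```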

To firm up these ideas, here is a tableau showing how a circularly symmetric distribution is distorted into distributions with elliptical contours by means of these skew transformations. The panels show values of ρ equal to 0, 3/10, 6/10, and 9/10, from left to right.

Tableau

The leftmost figure shows a set of starting points around one of the circular contours as well as part of the horizontal axis. Subsequent figures use arrows to show how those points are moved. The image of the horizontal axis appears as a slanted line segment (with slope ρ). (The colors represent different amounts of density in the different figures.)


Application

We are ready to do regression. A standard, elegant (yet simple) method to perform regression is first to express the original variables in new units of measurement: we center them at their means and use their standard deviations as the units. This moves the center of the distribution to the origin and makes all its elliptical contours slant 45 degrees (up or down).

When these standardized data form a circular point cloud, the regression is easy: the means conditional on x are all 0, forming a line passing through the origin. (Circular symmetry implies symmetry with respect to the x axis, showing that all conditional distributions are symmetric, whence they have 0 means.) As we have seen, we may view the standardized distribution as arising from this basic simple situation in two steps: first, all the (standardized) y values are multiplied by $\sqrt{1 - \rho^2}$ for some value of $\rho$; next, all the values at horizontal coordinate x are vertically skewed by $\rho x$. What did these distortions do to the regression line (which plots the conditional means against x)?

  • The shrinking of y coordinates multiplied all vertical deviations by a constant. This merely changed the vertical scale and left all conditional means unaltered at 0.

  • The vertical skew transformation added ρx to all conditional values at x, thereby adding ρx to their conditional mean: the curve y=ρx is the regression curve, which turns out to be a line.

Similarly, we may verify that because the x-axis is the least squares fit to the circularly symmetric distribution, the least squares fit to the transformed distribution also is the line y=ρx: the least-squares line coincides with the regression line.

These beautiful results are a consequence of the fact that the vertical skew transformation does not change any x coordinates.
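For readers who like to verify such claims by simulation, here is a small R sketch (the covariance matrix is an arbitrary example) showing that the least-squares slope of the standardized variables is the correlation coefficient:

```r
# Simulation sketch: after standardizing, the least-squares slope equals the
# correlation coefficient. The covariance matrix here is an arbitrary example.
set.seed(1)
library(MASS)
Sigma <- matrix(c(4, 3,
                  3, 9), 2, 2)            # implies rho = 3 / sqrt(4 * 9) = 0.5
X  <- mvrnorm(1e5, c(5, -2), Sigma)
zx <- as.numeric(scale(X[, 1]))           # center, then divide by the SD
zy <- as.numeric(scale(X[, 2]))
coef(lm(zy ~ zx))[["zx"]]                 # least-squares slope ...
cor(X[, 1], X[, 2])                       # ... agrees with rho ~ 0.5
```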

We can easily say more:

  • The first bullet (about shrinking) shows that when (X,Y) has any circularly symmetric distribution, the conditional variance of Y|X was multiplied by $\left(\sqrt{1 - \rho^2}\right)^2 = 1 - \rho^2$.

  • More generally: the vertical skew transformation rescales each conditional distribution by $\sqrt{1 - \rho^2}$ and then recenters it by $\rho x$.

For the standard bivariate Normal distribution, the conditional variance is a constant (equal to 1), independent of x. We immediately conclude that after applying this skew transformation, the conditional variance of the vertical deviations is still a constant and equals $1 - \rho^2$. Because the conditional distributions of a bivariate Normal are themselves Normal, now that we know their means and variances, we have full information about them.

Finally, we need to relate ρ to the original covariance matrix Σ. For this, recall that the (nicest) definition of the correlation coefficient between two standardized variables X and Y is the expectation of their product XY. (The correlation of X and Y is simply declared to be the correlation of their standardized versions.) Therefore, when (X,Y) follows any circularly symmetric distribution and we apply the skew transformation to the variables, we may write

$$\varepsilon = Y - \rho X$$

for the vertical deviations from the regression line and notice that ε must have a symmetric distribution around 0. Why? Because before the skew transformation was applied, Y had a symmetric distribution around 0 and then we (a) squeezed it and (b) lifted it by ρX. The former did not change its symmetry while the latter recentered it at ρX, QED. The next figure illustrates this.

3D plot showing conditional distributions and the least-squares line

The black lines trace out heights proportional to the conditional densities at various regularly-spaced values of x. The thick white line is the regression line, which passes through the center of symmetry of each conditional curve. This plot shows the case ρ=1/2 in standardized coordinates.

Consequently

$$E(XY) = E(X(\rho X + \varepsilon)) = \rho E(X^2) + E(X\varepsilon) = \rho(1) + 0 = \rho.$$

The final equality is due to two facts: (1) because X has been standardized, the expectation of its square is its variance, equal to 1 by construction; and (2) the expectation of $X\varepsilon$ equals the expectation of $X(-\varepsilon)$ by virtue of the symmetry of $\varepsilon$. Because the latter is the negative of the former, both must equal 0: this term drops out.

We have identified the parameter of the skew transformation, ρ, as being the correlation coefficient of X and Y.
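Notice that this argument never used normality. A small R sketch illustrates the point with a circularly symmetric but decidedly non-Normal starting distribution (uniform on the unit disk); the value of ρ is arbitrary.

```r
# The argument needs no normality: start from a circularly symmetric but
# non-Normal cloud (uniform on the unit disk), apply the skew, and the
# correlation of the result is the skew parameter. rho = 0.7 is arbitrary.
set.seed(1)
n   <- 1e6
r   <- sqrt(runif(n))                 # radius for a uniform draw on the disk
a   <- runif(n, 0, 2 * pi)
x   <- r * cos(a); y0 <- r * sin(a)   # circularly symmetric pair
rho <- 0.7
y   <- sqrt(1 - rho^2) * y0 + rho * x # squeeze, then lift by rho * x
cor(x, y)                             # ~ 0.7
```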


Conclusions

By observing that any ellipse may be produced by distorting a circle with a vertical skew transformation that preserves the x coordinate, we have arrived at an understanding of the contours of any distribution of random variables (X,Y) that is obtained from a circularly symmetric one by means of stretches, squeezes, rotations, and shifts (that is, any affine transformation). By re-expressing the results in terms of the original units of x and y--which amounts to adding back their means, μx and μy, after multiplying by their standard deviations σx and σy--we find that:

  • The least-squares line and the regression curve both pass through the origin of the standardized variables, which corresponds to the "point of averages" (μx,μy) in original coordinates.

  • The regression curve, which is defined to be the locus of conditional means, {(x,ρx)}, coincides with the least-squares line.

  • The slope of the regression line in standardized coordinates is the correlation coefficient ρ; in the original units it therefore equals σyρ/σx.

Consequently the equation of the regression line is

$$y = \frac{\sigma_y \rho}{\sigma_x}(x - \mu_x) + \mu_y.$$

  • The conditional variance of Y|X is $\sigma_y^2(1 - \rho^2)$ times the conditional variance of $Y^* \mid X^*$ where $(X^*, Y^*)$ has a standard distribution (circularly symmetric with unit variances in both coordinates), $X^* = (X - \mu_X)/\sigma_x$, and $Y^* = (Y - \mu_Y)/\sigma_y$.

None of these results is a particular property of bivariate Normal distributions! For the bivariate Normal family, the conditional variance of $Y^* \mid X^*$ is constant (and equal to 1): this fact makes that family particularly simple to work with. In particular:

  • Because in the covariance matrix Σ the coefficients are $\sigma_{11} = \sigma_x^2$, $\sigma_{12} = \sigma_{21} = \rho\sigma_x\sigma_y$, and $\sigma_{22} = \sigma_y^2$, the conditional variance of Y|X for a bivariate Normal distribution is

$$\sigma_y^2(1 - \rho^2) = \sigma_{22}\left(1 - \left(\frac{\sigma_{12}}{\sqrt{\sigma_{11}\sigma_{22}}}\right)^2\right) = \sigma_{22} - \frac{\sigma_{12}^2}{\sigma_{11}}.$$
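(As a quick arithmetic check of this identity, with made-up Σ entries:)

```r
# Arithmetic check of the identity; the Sigma entries are made up.
s11 <- 4; s22 <- 9; s12 <- 3
rho <- s12 / sqrt(s11 * s22)      # 0.5
s22 * (1 - rho^2)                 # sigma_y^2 (1 - rho^2)          -> 6.75
s22 - s12^2 / s11                 # sigma_22 - sigma_12^2/sigma_11 -> 6.75
```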

Technical Notes

The key idea can be stated in terms of matrices describing the linear transformations. It comes down to finding a suitable "square root" of the correlation matrix for which the y direction is an eigenvector. Thus:

$$\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} = A A'$$

where

$$A = \begin{pmatrix} 1 & 0 \\ \rho & \sqrt{1 - \rho^2} \end{pmatrix}.$$

A much better known square root is the one initially described (involving a rotation instead of a skew transformation); it is the one produced by a singular value decomposition and it plays a prominent role in principal components analysis (PCA):

$$\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} = B B';$$

$$B = Q \begin{pmatrix} \sqrt{\rho + 1} & 0 \\ 0 & \sqrt{1 - \rho} \end{pmatrix} Q'$$

where $Q = \begin{pmatrix} \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{pmatrix}$ is the rotation matrix for a 45 degree rotation.

Thus, the distinction between PCA and regression comes down to the difference between two special square roots of the correlation matrix.
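To make the comparison concrete, here is a brief R sketch (ρ is an arbitrary example value) computing both square roots and confirming that each reproduces the correlation matrix:

```r
# The two square roots side by side for an arbitrary rho: A is the skew
# (lower-triangular) root, B the symmetric root built from the 45-degree
# rotation. Each reproduces the correlation matrix.
rho <- 0.6
R   <- matrix(c(1, rho, rho, 1), 2, 2)

A <- matrix(c(1,   0,
              rho, sqrt(1 - rho^2)), 2, 2, byrow = TRUE)
A %*% t(A)                        # = R

Q <- matrix(c(1, -1,
              1,  1), 2, 2, byrow = TRUE) / sqrt(2)
B <- Q %*% diag(c(sqrt(1 + rho), sqrt(1 - rho))) %*% t(Q)
B %*% t(B)                        # = R as well
```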


Beautiful pictures and great descriptions. There were a few sentences in the update that were left incomplete (like you knew basically what you were going to say, but hadn't settled on the final wording).
cardinal

@Cardinal Thanks. I will be re-reading this and looking for such things, as well as for the inevitable typos. You are too kind to point out other things which you surely noticed, such as some gaps in the exposition. The biggest is that I did not actually show that these ellipses are at 45 degree angles (equivalently, inscribed in the unit square); I simply assumed that. I am still looking for a simple demonstration. The other is that one might worry the skew transformation could produce a different distribution than the original stretch-squeeze-rotate-shift--but it's easy to show it doesn't.
whuber

That's really interesting. Thanks for taking the time to write it up.
Bill

In the 1st paragraph of Application it's written that "we center them at their means and use their standard deviations as the units. This moves the center of the distribution to the origin and makes all its elliptical contours slant 45 degrees," but I don't understand how centering the variables at their means moves the center of the distribution to the origin and aligns the contours at 45 degrees.
Kaushal28

@whuber when you start with the unit circle (standardized sample set), you say the correlation is 0, so I imagine we get a circle, something like $f(x, y) = e^{-\frac{1}{2}(x^2 + y^2)}$. But how does 0 correlation imply independence? (Because $f(X, Y)$ is obtained as $f(X)f(Y)$, as we see.) That is usually not true, right? Even dependent variables can produce 0 correlation.
Parthiban Rajendran


This is essentially linear (OLS) regression. In that case, you are finding the conditional distribution of Y given that X=xi. (Strictly speaking, OLS regression does not make assumptions about the distribution of X, whereas your example is a multivariate normal, but we will ignore these things.) Now, if the covariance between X1 and X2 is not 0, then the mean of the conditional distribution of X2 has to shift as you change the value of x1 where you are 'slicing through' the multivariate distribution. Consider the figure below:

Figure: a bivariate normal point cloud with the conditional distributions of X2 drawn at X1 = 25 and X1 = 45.

Here we see that the marginal distributions are both normal, with a positive correlation between X1 and X2. If you look at the conditional distribution of X2 at any point on X1, the distribution is a univariate normal. However, because of the positive correlation (i.e., the non-zero covariance), the mean of those conditional distributions shifts up as you move from left to right. For example, the figure shows that $\mu_{X_2 \mid X_1 = 25} < \mu_{X_2 \mid X_1 = 45}$.

(For any future readers who might be confused by the symbols, I want to state that, e.g., σ22 is an element of the covariance matrix Σ. Thus, it is the variance of X2, even though people will typically think of a variance as σ2, and σ as a standard deviation.)

Your equation for the mean is directly connected to the equation for estimating the slope in OLS regression (and remember that in regression $\hat{y}_i$ is the conditional mean):

$$\hat{\beta}_1 = \frac{\operatorname{Cov}(x, y)}{\operatorname{Var}(x)}$$
In your equation, σ12/σ22 is the covariance over the variance; that is, it is the slope, just as above. Thus, your equation for the mean is just sliding your conditional mean, $\mu_{X_2 \mid X_1 = x_i}$, up or down from its unconditional mean, $\mu_{X_2}$, based on how far $x_{1i}$ is from $\mu_{X_1}$ and on the slope of the relationship between X1 and X2.
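A two-line R check of this connection, with simulated data and a made-up true slope of 0.5:

```r
# A short check that the OLS slope is Cov(x, y) / Var(x); the data and the
# true slope of 0.5 are made up.
set.seed(1)
x <- rnorm(1e4)
y <- 2 + 0.5 * x + rnorm(1e4)
cov(x, y) / var(x)                # ~ 0.5
coef(lm(y ~ x))[["x"]]            # identical (the same formula underneath)
```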

What happens if you condition on more variables? You would just add and subtract extra terms from the mean and variance?

@kerkejnrke, if you model the distribution of Y conditional on some specific level of a set of variables X, you are doing multiple regression. This is a little more complicated, but ultimately the same thing. The mean would be: $\hat{y}_i = X_i\hat{\beta}$, where $\hat{\beta} = (X^T X)^{-1} X^T Y$.
gung - Reinstate Monica
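(To illustrate the formula in the comment above: a minimal R sketch with made-up data, showing that the normal equations reproduce lm()'s coefficients.)

```r
# Minimal sketch of beta-hat = (X'X)^{-1} X'Y with made-up data.
set.seed(1)
n  <- 1000
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)
X  <- cbind(1, x1, x2)                 # design matrix with an intercept column
solve(t(X) %*% X, t(X) %*% y)          # the normal-equations solution
coef(lm(y ~ x1 + x2))                  # matches
```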

What did you use to produce the graph? Mathematica?
mpiktas

@mpiktas, my graph or whuber's? I believe his are Mathematica, but I made the one above w/ R. (Ugly code though...)
gung - Reinstate Monica

@mpiktas, I can't imagine my code should ever be described as "awesome"... The normal curves are drawn w/ dnorm(y). I simply add the output to 25 & 45, & use as x.
gung - Reinstate Monica


Gung's answer is good (+1). There is another way of looking at it, though. Imagine that the covariance between X1 and X2 were to be positive. What does it mean for σ1,2>0? Well, it means that when X2 is above X2's mean, X1 tends to be above X1's mean, and vice versa.

Now suppose I told you that X2=x2>μ2. That is, suppose I told you that X2 is above its mean. Wouldn't you conclude that X1 is likely above its mean (since you know σ1,2>0 and you know what covariance means)? So, now, if you take the mean of X1, knowing that X2 is above X2's mean, you are going to get a number above X1's mean. That is what the formula says:

$$E\{X_1 \mid X_2 = x_2\} = \mu_1 + \frac{\sigma_{1,2}}{\sigma_{2,2}}(x_2 - \mu_2)$$
If the covariance is positive and X2 is above its mean, then E{X1|X2=x2}>μ1.

The conditional expectation takes the form above for the normal distribution, not for all distributions. This seems a little strange given that the reasoning in the paragraph above seems pretty compelling. However, (almost) no matter what the distributions of X1 and X2 this formula is right:

$$\operatorname{BLP}\{X_1 \mid X_2 = x_2\} = \mu_1 + \frac{\sigma_{1,2}}{\sigma_{2,2}}(x_2 - \mu_2)$$
where BLP means best linear predictor. The normal distribution is special in that the conditional expectation and the best linear predictor are the same thing.
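Here is a small R sketch, with deliberately non-Normal, made-up variables, showing that the moment formula and the least-squares fit agree even when the joint distribution is far from Normal:

```r
# Sketch with deliberately non-Normal, made-up variables: the moment formula
# and the least-squares fit give the same linear prediction, even though the
# joint distribution is far from Normal.
set.seed(1)
x2 <- rexp(1e5)                        # skewed, non-Normal
x1 <- x2^2 + rexp(1e5)                 # dependent on x2
mean(x1) + cov(x1, x2) / var(x2) * (1.5 - mean(x2))    # BLP at x2 = 1.5
predict(lm(x1 ~ x2), newdata = data.frame(x2 = 1.5))   # same number
```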

There does not seem to be any element of this argument which actually shows the coefficient of $x_2 - \mu_2$ should equal the ratio $\sigma_{12}/\sigma_{22}$ of the covariance to the variance. Why not the cube of that ratio? Or its sine? Or some other measure of association, such as the KL divergence (which has little to do with covariance)? Such formulas would qualitatively reproduce the behavior you describe. Given such vagueness in the reasoning, it should be no surprise that your formula applies only to a particular form of bivariate distribution and not to just any distribution.
whuber

@whuber Yeah, and it's even worse than that. It's not particularly hard to cook up an example with non-normal distributions where, for some value of x2>μ2, E(X1|X2=x2)<μ1 even though σ1,2>0. The "tends to be" and "likely to be" parts of my discussion are slushy. Perhaps one could lead with the BLP formula (maybe deriving it?), but the question asked for intuition rather than proof.
Bill

"Intuitive" does not imply "non-quantitative": the two can go together. It is often difficult to find an intuitive argument that gives quantitative results, but frequently it can be done and the process of finding such an argument is always illuminating.
whuber

Re the last paragraph: I have found out that the normal distribution is not so special: families created by affine transformations of circularly symmetric distributions are the special ones (of which there are very many).
whuber

@whuber That's pretty interesting. Do you have a link or cite?
Bill
Licensed under cc by-sa 3.0 with attribution required.