Multivariate normal distribution


The multivariate normal distribution is a real, multivariate probability distribution that is a generalization of the normal distribution to multiple dimensions. For NN Gaussian independent random variables X1,,XNX_{1},\ldots,X_{N}, the joint distribution function is:

f(x1,,xN)=f1(x1)fN(xN)=1(2π)N/2σ1σNei=1N(xiμi)2/2σi2f(x_{1},\ldots,x_{N})=f_{1}(x_{1})\ldots f_{N}(x_{N})=\frac{1}{(2\pi)^{N/2}\sigma_{1}\ldots\sigma_{N}}e^{-\sum_{i=1}^{N} (x_{i}-\mu_{i})^{2}/2\sigma_{i}^{2}}

μi\mu_{i} and σi2\sigma_{i}^{2} are the mean and variance of the ii-th distribution.

If the variables are not independent, it can be written in terms of the covariance matrix Σ\Sigma:

f(x)=1(2π)N/2detΣe(xμ)TΣ1(xμ)/2f(\mathbf{x})=\frac{1}{(2\pi)^{N/2}\sqrt{ \det \Sigma }}e^{-(\mathbf{x}-\boldsymbol{\mu})^{T}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})/2}

where μ=(μ1,,μN)\boldsymbol{\mu}=(\mu_{1},\ldots,\mu_{N}). Most of the exponent is often put in its own variable:

Q2=(xμ)TΣ1(xμ)Q^{2}=(\mathbf{x}-\boldsymbol{\mu})^{T}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})

This quantity has interesting properties that can be used to analyze the dispersion of the distribution.

As a shorthand, a random vector X\mathbf{X} can be said to follow a multivariate normal of mean vector μ\boldsymbol{\mu} and covariance matrix Σ\Sigma with the notation

XN(μ,Σ)\mathbf{X}\sim \mathcal{N}(\boldsymbol{\mu},\Sigma)

Moments

The moment-generating function of the multivariate normal is known in closed form for the independent variable case and is a direct extension of the univariate normal:

MX(t)=exp(i=1Nσi2ti22),MX(t)=exp(i=1Nσi2ti22+μiti)M_{\mathbf{X}}(\mathbf{t})=\exp\left( \sum_{i=1}^{N} \frac{\sigma_{i}^{2}t_{i}^{2}}{2} \right),\qquad M_{\mathbf{X}}^{*}(\mathbf{t})=\exp\left( \sum_{i=1}^{N} \frac{\sigma_{i}^{2}t_{i}^{2}}{2}+\mu_{i}t_{i} \right)

Dispersion and Q2Q^{2} ellipses

The multivariate normal distribution is, like its univariate sibling, possibly the most important multivariate distribution, owing to the commonness of Gaussian random variables. Both the cases with independent and dependent random variables come up in practice, and it's therefore useful to discuss both.

Independent variables

Since the variables are independent, the covariance matrix is diagonal:

Σ=(σ12000σ22000σN2)\Sigma=\begin{pmatrix} \sigma_{1}^{2} & 0 & \ldots & \ldots & 0 \\ 0 & \sigma_{2}^{2} & 0 & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & \ldots & \ldots & \ldots & \sigma_{N}^{2} \end{pmatrix}

and its inverse is

Σ1=(1σ120001σ220001σN2)\Sigma^{-1}=\begin{pmatrix} \frac{1}{\sigma_{1}^{2}} & 0 & \ldots & \ldots & 0 \\ 0 & \frac{1}{\sigma_{2}^{2}} & 0 & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & \ldots & \ldots & \ldots & \frac{1}{\sigma_{N}^{2}} \end{pmatrix}

This effectively reduces the covariance matrix down to a variance vector.

The exponent Q2Q^{2} mentioned above is useful because its values trace hypersurfaces of equal probability. If you set a value of Q2Q^{2} and find the locus of (x1,,xN)(x_{1},\ldots,x_{N}) points that realize that value, you get a closed and quite well-behaved hypersurface that represents all points with equal probability. If this sounds confusing, just know that in the simplest case of N=2N=2, they're just regular ellipses. To see it, just write down the equation for Q2Q^{2}:

Q2=(x1μ1)2σ12+(x2μ2)2σ22Q^{2}=\frac{(x_{1}-\mu_{1})^{2}}{\sigma_{1} ^{2}}+ \frac{(x_{2}-\mu_{2})^{2}}{\sigma_{2}^{2}}

This is the equation of an ellipse with semiaxes σ1\sigma_{1} and σ2\sigma_{2}, centered in (μ1,μ2)(\mu_{1},\mu_{2}), scaled by a factor QQ.1 For higher NN, they are hyperellipsoids. When they can be represented visually (mostly just N=2N=2, possibly N=3N=3), they serve as a useful way to visualize the dispersion of the multivariate normal. In fact, exactly because the standard deviations are the semiaxes, Q2Q^{2} is essentially the NN-dimensional analog of the variance and QQ that of the standard deviation. Of course, the real variances are still σ12,,σN2\sigma_{1}^{2},\ldots,\sigma_{N}^{2}, but Q2Q^{2} is an efficient way to package all that information in a single variable. In the N=2N=2 case, since the semiaxes are scaled by QQ, setting QQ lets us draw the ellipse corresponding to whatever multiple of σ\sigma we want. For Q2=1Q^{2}=1, we get the ellipse with (σ1,σ2)(\sigma_{1},\sigma_{2}), for Q2=4Q^{2}=4, we get the ellipse with (2σ1,2σ2)(2\sigma_{1},2\sigma_{2}), and so on.

The area inside the hyperellipsoid is used to measure probability in the same way a cumulative distribution function does. The probability of a value being inside of a Q2=1Q^{2}=1 hyperellipsoid is P(Q21)P(Q^{2}\leq1). To properly evaluate this, we need the CDF of Q2Q^{2}. As it happens, Q2Q^{2} is actually chi-square distributed, specifically with NN degrees of freedom. This is because Q2Q^{2} is the sum of squares of x1,,xNx_{1},\ldots,x_{N}, and the sum of squares of Gaussians follows a χN2\chi ^{2}_{N} distribution. Thus, we can find the probability just fine, even analytically. For an N=2N=2 ellipse, the χ22\chi^{2}_{2} is just

fX(x;2)=12ex/2f_{X}(x;2)=\frac{1}{2}e^{-x/2}

and so the integral is

P(Q21)=P(χ221)=0112ex/2 dx=ex/201=11e0.39P(Q^{2}\leq 1)=P(\chi ^{2}_{2}\leq 1)=\int_{0}^{1} \frac{1}{2}e^{-x/2}\ dx=-e^{-x/2}|_{0}^{1}=1- \frac{1}{\sqrt{ e }}\simeq0.39

For Q2=4Q^{2}=4 we instead get

P(Q24)=P(χ224)=0412ex/2 dx0.86P(Q^{2}\leq 4)=P(\chi_{2}^{2}\leq 4)=\int_{0}^{4} \frac{1}{2}e^{-x/2}\ dx\simeq0.86

These numbers are the multivariate analog of the classic 1-2-3σ\sigma rule of the Gaussian. Where 1σ68%1\sigma\Rightarrow68\% and 2σ95%2\sigma\Rightarrow95\% in a univariate Gaussian, Q2=139%Q^{2}=1\Rightarrow39\% and Q2=486%Q^{2}=4\Rightarrow 86\% in a multivariate Gaussian.

Dependent variables

If the variables are not independent, things gets a lot more verbose as covariance can no longer be ignored. Let's consider N=2N=2 variables only. We can no longer use the easy formula for Q2Q^{2} and need the full one:

Q2=(xμ)TΣ1(xμ)Q^{2}=(\mathbf{x}-\boldsymbol{\mu})^{T}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})

In other words, we need the covariance matrix and its inverse:

Σ=(σ12ρσ1σ2ρσ1σ2σ22),detΣ=σ12σ22(1ρ2),Σ1=1detΣ(σ22ρσ1σ2ρσ1σ2σ12)\Sigma=\begin{pmatrix} \sigma_{1}^{2} & \rho \sigma_{1}\sigma_{2} \\ \rho \sigma_{1}\sigma_{2} & \sigma_{2}^{2} \end{pmatrix},\quad \det \Sigma=\sigma_{1}^{2}\sigma_{2}^{2}(1-\rho ^{2}),\quad \Sigma^{-1}=\frac{1}{\det \Sigma}\begin{pmatrix} \sigma_{2}^{2} & -\rho \sigma_{1}\sigma_{2} \\ -\rho \sigma_{1}\sigma_{2} & \sigma_{1}^{2} \end{pmatrix}

The second half of the formula is

Σ1(xμ)=1detΣ((x1μ1)σ22ρσ1σ2(σ2μ2)ρσ1σ2(x1μ1)+σ12(x2μ2))\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})=\frac{1}{\det \Sigma}\begin{pmatrix} (x_{1}-\mu_{1})\sigma_{2}^{2} -\rho \sigma_{1}\sigma_{2}(\sigma_{2}-\mu_{2}) \\ \rho \sigma_{1}\sigma_{2}(x_{1}-\mu_{1}) +\sigma_{1}^{2}(x_{2}-\mu_{2}) \end{pmatrix}

and so putting it all together and doing the matrix multiplications:

Q2=(xμ)TΣ1(xμ)=1detΣ[(x1μ1)2σ222ρσ1σ2(x1μ1)(x2μ2)+σ12(x2μ2)2]=11ρ2[(x1μ1)2σ122ρx1μ1σ1x2μ2σ2+(x2μ2)2σ22]\begin{align} Q^{2}&=(\mathbf{x}-\boldsymbol{\mu})^{T}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}) \\ &=\frac{1}{\det \Sigma}[(x_{1}-\mu_{1})^{2}\sigma_{2}^{2}-2\rho \sigma_{1}\sigma_{2}(x_{1}-\mu_{1})(x_{2}-\mu_{2})+\sigma_{1}^{2}(x_{2}-\mu_{2})^{2}] \\ &=\frac{1}{1-\rho ^{2}}\left[ \frac{(x_{1}-\mu_{1})^{2}}{\sigma_{1}^{2}}-2\rho \frac{x_{1}-\mu_{1}}{\sigma_{1}}\frac{x_{2}-\mu_{2}}{\sigma_{2}}+ \frac{(x_{2}-\mu_{2})^{2}}{\sigma_{2}^{2}} \right] \end{align}

If we find the draw the ellipses traced by this Q2Q^{2}, we find that they are now angled. This angle is quite important, as it visually represents correlation between x1x_{1} and x2x_{2}. Indeed, this angle is proportional to the correlation coefficient ρ\rho and is the major difference from the independent case above (where the ellipse was axis-aligned). When ρ=0\rho=0, you get an axis-aligned ellipse.

Graph Multinormal equiprobability.svg|80%|center

A general bivariate ellipse showing important points and lines. The red diagonals (passing through segments CC\overline{C C'} and DD\overline{DD'}) are known as the regression lines of the distribution.

We can get extract a lot of useful information with some geometry. The x1=μ1x_{1}=\mu_{1} and x2=μ2x_{2}=\mu_{2} lines intersect with the ellipse at points

A=(μ21ρ2σ2,μ2),A=(μ2+1ρ2σ2,μ2)A=(\mu_{2}-\sqrt{ 1-\rho ^{2} }\sigma_{2}, \mu_{2}),\quad A'=(\mu_{2}+\sqrt{ 1-\rho ^{2} }\sigma_{2}, \mu_{2}) B=(μ1,μ11ρ2σ1),B=(μ1,μ1+1ρ2σ1)B=(\mu_{1},\mu_{1}-\sqrt{ 1-\rho ^{2} }\sigma_{1}),\quad B'=(\mu_{1},\mu_{1}+\sqrt{ 1-\rho ^{2} }\sigma_{1})

and the segments inside the ellipse are

AA=2σ21ρ2,BB=2σ11ρ2\overline{AA'}=2\sigma_{2}\sqrt{ 1-\rho ^{2} },\qquad \overline{BB'}=2\sigma_{1}\sqrt{ 1-\rho ^{2} }

idk lol

Let's consider the variable y\mathbf{y} defined as

y=Ax,μy=Aμ\mathbf{y}=\mathrm{A}\mathbf{x},\qquad \boldsymbol{\mu}_{y}=\mathrm{A}\boldsymbol{\mu}

where A\mathrm{A} is a matrix. The covariance matrix is

Vy=AVAT\mathrm{V}_{y}=\mathrm{A}\mathrm{V}\mathrm{A}^{T}

Let's also consider a vector z\mathbf{z} such that its components are standard-normal distributed:

ziN(0,1)i=1nzi2χn2z_{i}\sim N(0,1)\quad\Rightarrow \quad\sum_{i=1}^{n} z_{i}^{2}\sim \chi_{n}^{2}

z\mathbf{z} can be written as

z=V1/2(xμ)\mathbf{z}=\mathrm{V}^{-1/2}(\mathbf{x}-\boldsymbol{\mu})

z\mathbf{z} follows a multinormal distribution with covariance matrix Vz\mathrm{V}_{z}, so N(0,Vz)N(0,\mathrm{V}_{z}), where Vz\mathrm{V}_{z} is

Vz=V1/2VV1/2=I\mathrm{V}_{z}=\mathrm{V}^{-1/2}\mathrm{V}\mathrm{V}^{-1/2}=\mathrm{I}

since the ziz_{i} are independent of each other, cov(zi,zj)=0\text{cov}(z_{i},z_{j})=0 for iji\neq j. Thus, Q2Q^{2} in terms of z\mathbf{z} is

Q2=(xμ)TV1/2zTV1/2(xμ)z=zTz=i=1nzi2Q^{2}=\underbrace{ (\mathbf{x}-\boldsymbol{\mu})^{T}\mathrm{V}^{-1/2} }_{ \mathbf{z}^{T} }\underbrace{ \mathrm{V}^{-1/2}(\mathbf{x}-\boldsymbol{\mu}) }_{ \mathbf{z} }=\mathbf{z}^{T}\mathbf{z}=\sum_{i=1}^{n} z_{i}^{2}

which is indeed chi-square-distributed.

Footnotes

  1. Alternatively, an unscaled ellipse with semiaxes Qσ1Q\sigma_{1} and Qσ2Q\sigma_{2}.