
Multivariate Normal Distributions – I – Basics and a random vector of independent Gaussians

This post series is about mathematical aspects of so-called “Multivariate Normal Distributions“. In the literature two abbreviations are common: MNDs or MVNs. I will use both synonymously. For an easy access I want to introduce MNDs as the result of a linear transformation applied to random vectors whose components can be described by independent 1-dimensional normal distributions. Afterward I will discuss a more general definition of MNDs and derive their most important properties. As I am a physicist and not a mathematician, I will try not to become too formalistic. Readers with a mathematical education may forgive a certain lack of rigor.

My perspective is that of a person who saw MNDs appear in different contexts of Machine Learning [ML]. Already the data samples used for the training of an ML algorithm may show properties which indicate an underlying MND. But also the analysis of data samples in the latent spaces of modern generative algorithms may reveal properties of MNDs. As we work with finite samples, we have to apply numerical methods for the analysis of our data distributions and compare the results with theoretical predictions.

One helpful fact in this respect is that the probability densities of MNDs have contour hyper-surfaces defined by quadratic forms. We, therefore, have to deal with multidimensional ellipsoids and their 2-dimensional projections, namely ellipses. We have to treat these geometrical figures with both analytical and numerical methods.

Illustration 1: Ellipsoidal contours of a MND projected into the ℝ3

The detection of MNDs and their analysis in an ML context can on the lowest level be done by investigating the variance-covariance matrix of multidimensional data distributions. We will have a look at methods for a geometrical reconstruction of MND contours with the help of coefficients of a respective matrix derived from available sample data by numerical methods. The theoretically predicted contour ellipses based on these coefficients can be compared with contours produced by a direct numerical interpolation of the sample’s vector data. I will also discuss the relation to PCA transformations and how to use them and their inverse for contour and vector reconstruction from selected data in two-dimensional projections. Further mathematical methods, which check the MND compatibility of sample data more thoroughly, will be mentioned and discussed briefly.

In this first post I start our journey through the land of MNDs by describing a random vector composed of independent Gaussians. On our way I will briefly touch some basic topics such as populations, samples, marginal distributions and probability densities in an intuitive and therefore incomplete way.

The posts in this series require some knowledge of Linear Algebra and a bit of vector analysis.

Basics: Samples of objects in Machine Learning and underlying populations

In Machine Learning we typically work with (finite) samples of discrete objects which have well defined properties or “features”. When these features can be characterized by numerical values, the sample objects can mathematically be represented by points in the ℝn or corresponding position vectors. Such a position vector would reach out from the origin of a Euclidean coordinate system to a specific point representing a selected data object of our sample. This means that we deal with discrete distributions of data points and respective vectors in multi-dimensional spaces.

Samples and statistics

We assume that the properties of the objects in an ML sample follow some common underlying patterns (in the hope that an ML algorithm will detect them during training). These properties would be encoded in the components of the related vectors. In what way does statistics enter this context? The first point is that we assume that the concrete sample of vectors we use for the training of an ML algorithm is representative of other samples of objects (and related vectors) showing alike properties. One such prominent sample would be the one we apply our trained algorithm to at user inference. In other words:

We assume that the elements of our samples are taken via a statistical process from a general underlying population providing many more if not all possible objects of the same kind.

The “population” is abstract in several aspects: 1) In most practical scenarios we have no direct access to it. 2) It refers to objects and not necessarily to the vectors we in the end work with. 3) A “population” may also be the theoretical result of some kind of machinery which, at least in principle, is able to produce all (even infinitely many) objects of the same kind. Take a genetic code as an example.

Let us therefore assume that the elements of our population reside in an abstract space Ω – which is accompanied 1) by a kind of defined event or process ΨS for the “picking” of objects and 2) by probabilities PΩ for picking certain types of objects, i.e. of objects having specific properties in common. If you want to, you may assume that the population shows an object distribution with respect to certain properties. The distribution then could be used to define related probabilities PΩ. We assume that the process which picks individual elements of the population and fills our sample is based on chance and guided by PΩ with respect to the element selection from the population. I.e. we assume that our samples are generated without any bias compared to the properties of the underlying population and its object distribution.

What we in our case need is a mapping S from Ω to the ℝn :

\[ \pmb{ S }[\pmb{\operatorname{\Omega}}, \, P_{\Omega},\, \operatorname{\Psi}_S] \,:\, \pmb{\operatorname{\Omega}} \, \rightarrow \mathbb{R}^n \]

As soon as we have a map S and a description of a random picking process ΨS , we can repeat ΨS multiple times to create one or more samples of vectors. The production of different samples would then (due to the underlying probabilities) follow statistical laws.

The distributions of vectors in our samples with respect to certain properties of these vectors and their components should statistically reflect the properties of the underlying population objects and related probabilities PΩ. For proper statistical processes behind the sample creation we would assume that the probabilities derived from many and/or growing samples would approach the probabilities describing the (assumed) distribution of the objects in the underlying (potentially infinite) population.

How do we get and measure probabilities of the sample-based distribution of vectors in the ℝn? One option is that we sum up the number of vectors pointing into certain discrete sub-volumes ΔV of the ℝn and set the results in relation to the number of all available vectors (and thus the covered part V of the ℝn). This gives us the probability PV of finding a vector with component values in ranges defined by the borders of ΔV. We assume that the probability that S takes on a value s in a measurable sub-space ΔV ⊂ ℝn is defined by a probability to pick a respective element in the population space:

\[ P_V \left(\pmb{s} \in \Delta V \right) \,=\, P_\Omega \left( \pmb{\omega} \in \pmb{\Omega} \,|\, \pmb{S} (\pmb{\omega}) = \pmb{s} \in \Delta V \right) \]

Meaning: The probability of finding vectors pointing to a volume ΔV of the ℝn is defined by a probability PΩ of picking elements from the population with fitting properties.
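As a simple numerical illustration of this counting idea, here is a minimal sketch in Python. It assumes, purely for demonstration, a 2-dimensional standard normal population as a stand-in for the abstract picking process, and estimates PV for a rectangular sub-volume ΔV by counting the fraction of sample vectors whose endpoints fall into it:

```python
import numpy as np

# Hypothetical example: a 2-dim sample of N vectors drawn from a
# standard normal population (stand-in for the abstract picking process).
rng = np.random.default_rng(42)
N = 100_000
s = rng.normal(loc=0.0, scale=1.0, size=(N, 2))    # sample vectors s^k in R^2

# A rectangular sub-volume Delta V = [a1, b1] x [a2, b2]
a = np.array([0.0, -0.5])
b = np.array([1.0, 0.5])

# Count the vectors whose endpoints lie inside Delta V
inside = np.all((s >= a) & (s <= b), axis=1)
P_V = inside.mean()                                 # relative frequency ~ P_V

print(f"Estimated P_V for Delta V: {P_V:.4f}")
```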

Dense distributions of data points and vectors

The end points of vectors pointing to neighboring small volumes ΔV of the ℝn may fill the surrounding space relatively densely. The population may be based on a distribution of many objects with smoothly varying properties. Our map S might then lead to a dense distribution of points (and related vectors) in certain volumes of the ℝ or the ℝn. By dividing the probabilities for small finite volumes ΔV by the respective measured volume we would get discrete values of a quantity we name a “probability density“. We learn by the way that the target spaces of S functions should be measurable.

In many cases we may find profound indications from the analysis of large or many different ML samples – or from theories about our object population – that the individual points or vectors are distributed according to a certain well defined continuous “probability density function” [pdf]. This corresponds to a transition from a finite population to an infinite one with a dense object distribution and to a limit process for ever growing samples picked from such a distribution. Note that a continuous probability density in the ℝn requires three things:

  1. S must cover certain volumes of the ℝn more and more densely by the endpoints of vectors put into growing statistical samples.
  2. A volume measure and a vector norm.
  3. The number of points (end-points of vectors) per unit volume derived from the samples must vary relatively smoothly between neighboring volume elements and approach a continuous variation in the limit process to infinitely large samples and infinitesimal volume elements.

If we believe that an available finite statistical sample of vectors indeed represents an underlying continuous vector distribution, we may want to approximate the latter by (multidimensional) numerical interpolations of the discrete numbers of points or vectors per volume element retrieved from our samples. We can then compare the resulting curves for the probability density with theoretical curves of pdfs based on the properties of the continuous distributions of specific vector populations.
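A minimal sketch of this idea in one dimension, under the assumption of a Gaussian population chosen just for illustration: we bin a finite sample into volume (here: interval) elements, divide the counts by the sample size and the bin width, and compare the result with the theoretical pdf.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# finite sample from an assumed Gaussian population (mu = 1, sigma = 2)
sample = rng.normal(loc=1.0, scale=2.0, size=50_000)

# Empirical density: counts per bin, divided by N and the bin width (Delta V)
counts, edges = np.histogram(sample, bins=80)
widths = np.diff(edges)
centers = 0.5 * (edges[:-1] + edges[1:])
density_empirical = counts / (sample.size * widths)

# Theoretical pdf of the assumed continuous distribution
density_theory = norm.pdf(centers, loc=1.0, scale=2.0)

# The maximal deviation shrinks with growing sample size and finer bins
print("max |empirical - theory| =", np.max(np.abs(density_empirical - density_theory)))
```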

In the following sections we want to define some specific vector distributions and their probability density functions in mathematical terms.

Random vectors and probability density functions

We take a multivariate random vector as a multidimensional “vector” S having 1-dim statistical random variables Sj as components.

\[ \pmb{S} \: = \: \left( \, S_1, \, S_2, \, ….\, S_n\, \right)^T \]

A random variable Sj in our case is actually given by a function Sj: Ω→ℝ. Sj maps a set of objects taken from an underlying population to a statistical sample of real numbers. If applied multiple times Sj creates discrete distributions of points in ℝ. In a sense Sj therefore represents statistical sample distributions in ℝ.

A random vector S maps elements of Ω to vectors. It is a vector valued function. It indirectly represents resulting statistical distributions of vectors in (potentially infinite) samples of (position) vectors defined on the ℝn and respective distributions of data points. For a finite sample of vectors we can in principle index all the comprised vectors and list them:

\[ \pmb{S} (\pmb{\omega}^1, \pmb{\omega}^2, …, \, \pmb{\omega}^m ) \,\sim \, \left\{ \pmb{s}^1, \, \pmb{s}^2, …, \, \pmb{s}^k, …, \, \pmb{s}^m \right\}, \quad \mbox{with} \,\, \pmb{s}^k \,=\, \left( s^k_1, \, s^k_2, \, …, s^k_j, …, \, s^k_n \right)^T \, . \]

T symbolizes the transpose operation.

Correlated components of the random vector: Note that the components of a vector sk of a specific set may not be independent of each other. Meaning: For some sets and populations the components of the comprised vectors in the ℝn do not vary individually (aside from statistical fluctuations). If we have chosen a component value skj, then the value ski of another component may have to fulfill certain conditions depending on the values of the first or other components. In other words: The individual distributions represented by the various Sj can e.g. be (pairwise)

  • independent or dependent of each other,
  • they can be uncorrelated or correlated.

This holds for finite sets with discrete vectors or potentially infinite sets of vectors whose endpoints cover a sub-volume of the ℝn densely. For a distinction of (un-) correlation from (in-) dependence see below.

Continuous distributions and probability densities in one dimension

A concrete sample vector sk ∈ ℝn of a vector distribution can be described by specific values sjc assumed by variables sj, which we use below to flexibly describe vector components. The sj determine the coordinates of respective data points. In the discrete case the sj can only take on certain well defined values in ℝ (see above). For discrete distributions a concrete value sjc is assumed with a certain probability P(sjc). The factual probability given by a sample depends of course on the number of those objects in the set that reproduce sjc and on the total number of objects in the sample. If such a definition is too fine grained we may use probabilities defined on intervals a ≤ sj ≤ b.

In case of an infinite population with a dense distribution of objects with smoothly varying properties, we can approach a continuous distribution of points in target regions of ℝ by creating larger and larger samples. In such a kind of limit process Sj may map properties of the objects in Ω onto a dense distribution of points in ℝ having a continuous 1-dimensional probability density – in the sense that the number of points in neighboring small intervals varies continuously.

Now let us look at a dense population with a distribution of objects with smoothly varying properties – and at extremely huge or infinite sets derived from it. If sj in a limit process can take all values in ℝ and cover selected finite intervals [a, b] of ℝ densely we may define a continuous probability density pj(sj) for infinitely small intervals within the finite intervals such that the probability for sj to assume values in an interval [a, b] is given by

\[ \operatorname{P}(a \le s_j \le b) \,=\, \int_a^b p_j(s_j)\, ds_j \, . \]

pj(sj) characterizes a 1-dimensional distribution of a dense population represented by Sj. In some interesting cases the map of such a distribution and derived potentially infinite samples may cover large parts of ℝ or all of ℝ densely.
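A quick numerical cross-check of this interval probability, assuming for illustration that pj is a Gaussian density with μ = 0 and σ = 1 (the multivariate case follows below): the integral of the density over [a, b] and the relative frequency obtained from a finite sample should nearly coincide.

```python
import numpy as np
from scipy.stats import norm

a, b = -1.0, 2.0

# Probability from the (assumed) continuous density: integral of p_j over [a, b]
P_exact = norm.cdf(b, loc=0.0, scale=1.0) - norm.cdf(a, loc=0.0, scale=1.0)

# The same probability estimated from a finite sample, as a relative frequency
rng = np.random.default_rng(1)
s_j = rng.normal(0.0, 1.0, size=200_000)
P_sample = np.mean((s_j >= a) & (s_j <= b))

print(f"integral of p_j over [a,b]: {P_exact:.4f},  sample frequency: {P_sample:.4f}")
```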

Probability density of a random vector distribution

By making a limit transition to infinitely small volumes in the ℝn, we can extend the idea of a probability density to a random vector S representing some object population. The more elements a statistical sample of vectors contains (based on picking from the population), the better the differences in the numbers of vectors pointing into neighboring small volumes of equal size will be described by the differences of the population’s probability density at the volumes’ centers.

In a limit process for an infinite population the endpoints of the sample’s position vectors would fill the ℝn or a sub-space V of it more and more densely. In the end we would get an infinitely dense distribution in the sense that every infinitesimal volume dV in the covered region V would have a vector pointing to it. For huge samples we could use fine-grained, but finite volume elements ΔV and count the number of vectors ΔnV pointing to each of these volume elements to define discrete data points gΔV = ΔnV / (N · ΔV) for an approximation of a continuous density function via interpolation.

Note that a precise definition of a “dense” vector population would require a measure for the distance between vectors a and b. We can formally achieve this by defining a norm for the difference vector ||b - a||. But I think it is clear what we mean: Two position vectors are close neighbors if and when their endpoints are close neighbors in the ℝn in the sense that the distance between these points becomes very small or negligible.

If each of the components sj of the vectors s can take all values in ℝ we may describe the probability that the population represented by S contains vectors s pointing to a volume element ΔV by a continuous multidimensional probability density function [pdf] for infinitesimal volume elements dVS = ds1 ds2 … dsn. Let us call the pdf for a continuous random vector pS(s1, s2, …) and define the probability P of finding or creating vectors pointing into a finite volume:

\[ p_S(\pmb{s}) \: = \: p_S \left( \, s_1, \, s_2, \, ….\, s_n\, \right)\,, \quad \mbox{with} \]
\[ \begin{align} P(\pmb{s}, \Delta V) \, &= \, \int_{\Delta V} p_S(\pmb{s}) \, ds_1ds_2…ds_n \\ &= \, \int\int…\int_{\Delta V} p_S \left( \, s_1, \, s_2, \, ….\, s_n\, \right) \, ds_1ds_2 \cdots ds_n \end{align} \]

Note that pS(): ℝn→ℝ is a continuous function mapping vectors via their components to real values. For pS to play its role as a probability density it must fulfill a normalization condition for the covered space V in the ℝn (which in some cases may be the infinite ℝn itself):

\[ \int\int…\int_{V} p_S \left( \, s_1, \, s_2, \, ….\, s_n\, \right) \, ds_1ds_2 \cdots ds_n \,=\, 1 \, . \]

In terms of finite discrete distributions: Summing up the number of vectors pointing to discrete volume elements over all (hit) volume elements in the ℝn must give us the total number of all available vectors.

Note that pS maps the ℝn to ℝ. A continuous pS thereby defines a hyper-surface in the ℝn + 1.
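As a sanity check of the normalization condition, here is a small sketch restricted to n = 2 and a Gaussian pdf (both choices are assumptions made purely for illustration) that integrates a pdf numerically over a sufficiently large grid:

```python
import numpy as np
from scipy.stats import multivariate_normal

# An example pdf p_S in n = 2 dimensions (a Gaussian, as discussed below)
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
p_S = multivariate_normal(mean=mu, cov=Sigma)

# Integrate p_S over a large finite region V by a simple Riemann sum
grid = np.linspace(-8.0, 8.0, 400)
ds = grid[1] - grid[0]
S1, S2 = np.meshgrid(grid, grid)
points = np.column_stack([S1.ravel(), S2.ravel()])
integral = np.sum(p_S.pdf(points)) * ds * ds

print("numerical normalization integral:", integral)   # close to 1.0
```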

Dependencies and correlations of the components of a random vector

An interesting question is whether and, if so, how pS(s) depends on the 1-dimensional probability density functions pj(sj) or on the parameters [params] of these density functions:

\[ p_S(\pmb{s}) \: = \: p_S \, \pmb{\Large [} \, params\{p_1(s_1)\}, \, params\{p_2(s_2)\}, \, ….\,, \, params\{p_n(s_n)\} \, \pmb{\Large ]} \,. \]

The brackets [] denote a very complex relation in the general case; a direct analytical decomposition of pS(s) into the pj(sj) may in general not be possible or only after some sophisticated transformations. For MNDs we will at least be able to show how the parameters describing the densities pj(sj) constitute the functional form of pS(s). And we will find transformations which allow for a functional decomposition.

Let us reverse our line of thought and assume that we have a given probability density function pS(s) for a vector distribution (or derived such a distribution by numerical interpolations between elements of a discrete set of sample vectors). Then we may be interested in the 1-dimensional densities of constituting random variables. We introduce the probability density of a “marginal distribution” as

\[ p_j(s_j) \,=\, \int\int…\int_V p_S \left( \, s_1, \, s_2, \, …,\, s_n\, \right) \, ds_1ds_2 \cdots ds_{j-1} ds_{j+1} \cdots ds_n \]

I.e., we integrate pS(s) completely over n-1 coordinate directions of the ℝn, but leave out the 1-dimensional ℝ-subspace for sj. Note that the integral covers all dependencies of the components of the vectors in the set we work with.
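A small numerical illustration of this marginalization, again for n = 2 and a Gaussian pdf as an assumption for demonstration: integrating pS over s2 on a grid should reproduce the 1-dimensional marginal density p1(s1).

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
p_S = multivariate_normal(mean=mu, cov=Sigma)

s1 = np.linspace(-4.0, 4.0, 9)          # a few points where we evaluate p_1(s_1)
s2 = np.linspace(-12.0, 12.0, 2000)     # integration grid for the s_2 direction
ds2 = s2[1] - s2[0]

# Marginal density: integrate p_S(s_1, s_2) over s_2 (simple Riemann sum)
p1_numeric = np.array(
    [np.sum(p_S.pdf(np.column_stack([np.full_like(s2, x), s2]))) * ds2 for x in s1]
)

# For this diagonal Sigma the marginal is a 1-dim Gaussian with sigma_1 = 1
p1_theory = norm.pdf(s1, loc=0.0, scale=1.0)

print(np.max(np.abs(p1_numeric - p1_theory)))   # close to 0
```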

Note also that evaluating the integral does not mean that we can decompose pS(s) easily into some analytical relation between the pj(sj). But for some special distributions we should at least be able to provide a relation of pj(sj) to pS(s) mediated by a conditional probability density for the other component values p|sj() in the logical sense of a factorization of probabilities:

\[ p_S(\pmb{s}) \,=\, p|_{s_j}\left(s_1, \, s_2, \, …, \, s_{j-1}, \, s_{j+1}, \, …, \, s_n \right) \, \cdot \, p_j(s_j) \]

Random vector composed of independent univariate normal distributions

Let us now leave behind all the subtleties involved in the statistical mappings mediated by S onto potentially continuous data point distributions in the ℝn. We achieve this by discussing probability density functions only. Let us simplify further and focus on a bunch of n independent 1-dimensional normal distributions with probability densities gj(wj) (knowing that we derive these distributions via n mapping functions Wj). For a 1-dimensional normal distribution the probability density is given by a Gaussian:

\[ g_j(w_j, \, \mu_j, \, \sigma_j) \: = \: {1 \over {\sigma_j \, \sqrt{2\pi} } } \, {\large e}^{ - \, {\Large 1 \over \Large 2} \left( {\Large w_j \, - \, \Large \mu_j \over \Large \sigma_j} \right)^2 } \]

Here μj is the mean value and σj is the square root of the variance (i.e. the standard deviation) of the Wj distribution. We use the following notation to express that the result of Wj belongs to the set of 1-dimensional normal distributions:

\[ W_j \,\sim\, \mathcal{N}_1 \left(\mu_j,\,\sigma_j^{2} \right) \]

Regarding the parameters: \( \mathcal{N}\left(\mu_j, \sigma_j^2 \right) \) denotes a 1-dimensional normal distribution with a mean value \(\mu_j\) and a variance \(\sigma_j^2 \).

We use our bunch of n distributions Wj to compose a random vector W with n components

\[ \pmb{W} \,=\, \left( W_1, W_2, \cdots W_j, \cdots W_n \right) \,.\ \]

Due to the independence of the 1-dimensional components Wj, the resulting probability density function gw(w) is just the product of the pdfs for the individual components:

\[ g_{\small W}(\pmb{w}, \pmb{\mu}_{\small W}, \pmb{\Sigma}_{\small W}) \, = \, \prod\limits_{j=1}^n g_j(w_j,\, \mu_j,\, \sigma_j) \]

Σw symbolizes a diagonal matrix with all the σj2-values on the diagonal and zero otherwise (see the next section for the appearance of this matrix). μw is a vector comprising all the μj-elements. We get a compact expression for gw(w):

\[ g_{\small W}(\pmb{w}, \pmb{\mu}_{\small W}, \pmb{\operatorname{\Sigma}}_{\small W}) \, = \, {1 \over \sqrt{(2\pi)^n} \, \prod\limits_{j=1}^n{\sigma_j}} \, {\large e}^{ - \, {\Large 1 \over \Large 2} {\Large \sum\limits_{j=1}^n} {\left( {\Large w_j \, - \, \Large \mu_j \over \Large \sigma_j \phantom{(} } \right)^2} } \]

For reasons that will become clear in the next post, we formally rewrite this function with the help of a vector notation to become:

\[ g_{\small W} (\pmb{w}, \pmb{\mu}_{\small W}, \pmb{\operatorname{\Sigma}}_{\small W}) \, = \, {1 \over \sqrt{(2\pi)^n} \, \sqrt{ \operatorname{det} \left( \pmb{\Sigma}_{\small W} \right) } } \, {\large e}^{ - \, {\Large 1 \over \Large 2} \left( \pmb{\Large w} \, - \, \pmb{\Large \mu} \right)^T \,\bullet \, \pmb{\Large \Sigma}_{\small W}^{-1} \, \bullet \, \left( \pmb{\Large w} \, - \, \pmb{ \Large \mu} \right) } \]
\[ \mbox{with} \quad \pmb{\operatorname{\Sigma}}_{\small W} \, = \, {\begin{pmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n^2 \end{pmatrix}} \]

The bullets in the exponent mark matrix multiplications (in the sense of Linear Algebra). The inverse of Σw is diagonal, too, and has the reciprocals 1/σj2 as coefficients.
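To make the equivalence of the product form and the compact matrix form tangible, here is a small numerical sketch; the values of the μj and σj are arbitrarily chosen assumptions for the demonstration:

```python
import numpy as np

# Arbitrarily chosen parameters of n = 3 independent Gaussians (demo assumptions)
mu = np.array([1.0, -0.5, 2.0])
sigma = np.array([0.8, 1.5, 0.4])
Sigma_W = np.diag(sigma**2)                     # diagonal (co)variance matrix

def g_product(w):
    """Product of the univariate Gaussian densities g_j(w_j)."""
    return np.prod(np.exp(-0.5 * ((w - mu) / sigma)**2) / (sigma * np.sqrt(2.0 * np.pi)))

def g_matrix(w):
    """Compact vector/matrix form with Sigma_W^{-1} in the exponent."""
    d = w - mu
    norm_factor = 1.0 / np.sqrt((2.0 * np.pi)**len(mu) * np.linalg.det(Sigma_W))
    return norm_factor * np.exp(-0.5 * d @ np.linalg.inv(Sigma_W) @ d)

w = np.array([0.7, 0.2, 1.9])
print(g_product(w), g_matrix(w))                # both values agree
```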

It is easy to prove by proper integration of gw(w) that the marginal distributions Wj of W are just described by the probability densities gj(wj). It is also simple to show that the probability density gw(w) fulfills the required normalization condition

\[ {1 \over \sqrt{(2\pi)^n} \, \sqrt{ \operatorname{det} \left( \pmb{\Sigma}_{\small W} \right) } } \, \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} {\large e}^{- \, {\Large 1 \over \Large 2} \left( \pmb{\Large w} \, - \, \pmb{\Large \mu} \right)^T \,\bullet \, \pmb{\Large \Sigma}_{\small W}^{-1} \, \bullet \, \left( \pmb{\Large w} \, - \, \pmb{ \Large \mu} \right) }\, dw_1dw_2 \cdots dw_n \,=\, 1 \]

We introduce the notation of a multivariate normal distribution of independent Gaussians in the ℝn as a distribution with a pdf gw(w) and thus having two parameters given by a mean vector μw and a diagonal matrix Σw as

\[ \pmb{W} \,\sim\, \pmb{\mathcal{N}}_n \left(\pmb{\mu}_{\small W},\, \pmb{\Sigma}_{\small W} \right) \, . \]
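Numerically, drawing vectors from such a distribution with a diagonal Σw simply means drawing each component from its own univariate Gaussian. A minimal sketch, with parameter values chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

mu_W = np.array([1.0, -0.5, 2.0])               # mean vector mu_W (assumed values)
sigma = np.array([0.8, 1.5, 0.4])               # standard deviations of the independent W_j

# Independent components: draw each W_j separately and stack them into vectors w^k
N = 100_000
W_samples = rng.normal(loc=mu_W, scale=sigma, size=(N, 3))

# The sample means and standard deviations approach mu_j and sigma_j
print(W_samples.mean(axis=0))
print(W_samples.std(axis=0))
```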

Standardized multivariate normal distributions composed of independent 1-dimensional Gaussians

Let us now shift the origin of the Cartesian coordinate system of our ℝn to the mean vector, such that we have

\[ \pmb{\mu}_{\small W} \: = \: \left\{ \, \mu_1, \, \mu_2, \, ….\, \mu_n\, \right\}^T = \boldsymbol{0} \, . \]

Let us further scale the individual coordinates by

\[ z_j \, =\, {w_j \over \sigma_j }, \quad \pmb{z} = \left(\,z_1, \, z_2, \, \cdots, \, z_n\,\right)^T \, . \]

Then we get a centered and standardized normal vector distribution [SNVD] of independent Gaussians:

\[ \pmb{Z} \: = \: \left( \, Z_1, \, Z_2, \, ….\, Z_n\, \right)^T \,, \, \quad \pmb{Z} \, \sim \pmb{\mathcal{N}}_n \left( \pmb{0}, \, \pmb{\operatorname{I}} \right) \]

\( \pmb{\mathcal{N}} \left( \pmb{\mu}, \, \pmb{\operatorname{I}} \right) \) denotes a random normal vector with a mean vector \(\pmb{\mu}\) and an (n x n) identity matrix \( \pmb{\operatorname{I}} \) as the so-called covariance matrix (see the next section). Its probability density has a simpler form, namely

\[ \begin{align} g_{Z}(\pmb{z}) \, &= \, {1 \over \sqrt{(2\pi)^n} } \, {\large e}^{ - \, {\Large 1 \over \Large 2} {\Large \sum\limits_{j=1}^n} \, {\Large z_j}^2 } \\ &=\, {1 \over \sqrt{(2\pi)^n} } \, {\large e}^{ - \, {\Large 1 \over \Large 2} \left( {\Large \pmb{z}^T \, \bullet \, \pmb{z}} \right) } \end{align} \]

I denotes the identity matrix with dimension (n x n). The exponent includes a scalar product between the vectors. In the future we will not write an extra dot between the horizontal and the vertical vectors. Two matrices written next to each other symbolize a matrix multiplication in the sense of Linear Algebra – if the numbers of columns and rows fit.

Why did the leading factor get so simple? Well, the density function re-scales with the coordinates because on the infinitesimal scale we must fulfill:

\[ g_{\pmb{Z}}(\pmb{z})\, dV_z \: = \: g_{\small W} (\pmb{w}) \, dV_w \]

Because of dzj = (1/σj) dwj , the new volume element dVz in Z-space becomes

\[ dV_z \: = \: \prod\limits_{j=1}^n {1\over \sigma_j}\,dw_j \:=\: {1 \over \prod\limits_{j=1}^n \sigma_j } dV_w \]

The attentive reader recognizes the Jacobian determinant of the coordinate transformation. It eliminates the product of the σj in the denominator of the leading factor of gw(w) during the transition to gz(z).

For Z we obviously have:

\[ \sigma^z_j \,=\, 1, \: \forall j \, \Rightarrow \, \pmb{\Sigma}_{\small Z} \,=\, \pmb{\operatorname{I}} \]

It is easy to show that the pdf of Z remains normalized:

\[ {1 \over \sqrt{(2\pi)^n} } \, \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} {\large e}^{ - \, {\Large 1 \over \Large 2} {\Large \boldsymbol{z}^T \boldsymbol{z}} }\, dz_1dz_2 \cdots dz_n \,=\, 1 \]

Expectation value and covariance of random vectors

We want to understand the role of the matrix Σw a bit better. Actually, this matrix is closely related to a formal extension of the “covariance” of two 1-dimensional random distributions to random vector distributions. But let us start with the expectation value of the random vector first.

The expectation vector of a random vector is just the vector composed of the expectation values of its (marginal) component distributions:

\[ \pmb{\mu}\left(\pmb{S}\right) \,=\, \pmb{\mu}_S \: = \: \operatorname{\mathbb{E}}\left(\pmb{S} \right) \, = \, \left( \operatorname{\mathbb{E}}(S_1), \, \operatorname{\mathbb{E}}(S_2), …, \, \operatorname{\mathbb{E}}(S_n) \right)^T \]

The covariance of a random vector results in a natural way from a generalization of the covariance of two 1-dim distributions X and Y:

\[ \begin{align} \operatorname{Cov}(X, Y) &= \operatorname{\mathbb{E}}\left[\,\left(\,X \,-\, \operatorname{\mathbb{E}}\left(X\right)\,\right) \, \left(\,Y \,-\, \operatorname{\mathbb{E}}\left(Y\right)\, \right)\,\right] \\ &= \operatorname{\mathbb{E}}\left(X Y\right) \,-\, \operatorname{\mathbb{E}}\left(X\right) \operatorname{\mathbb{E}}\left(Y\right) \end{align} \]

A multidimensional generalization has to take into account all possible combinations (Sk, Sj). This indicates already that we need a matrix. A simple formal way to get expectation values of all pairwise combinations of components from a random vector S is to use a matrix product between S and ST. This leads almost naturally to a matrix of expectation values:

\[ \begin{align} \operatorname{Cov}\left(\pmb{S}\right) \: &:= \: \operatorname{\mathbb{E}}\left[\left(\pmb{S} – \operatorname{\mathbb{E}}(\pmb{S}) \right)\, \left(\pmb{S} – \operatorname{\mathbb{E}}(\pmb{S}) \right)^T \right] \end{align} \]

Note the order of transposition in the definition of Cov! A (vertical) vector is combined with a transposed (horizontal) vector. The rules of a matrix-multiplication then give you a matrix as the result! The expectation value has to be determined for every element of the matrix. Thus, the interpretation of the notation given above is:

Pick all pairwise combinations (Sj, Sk) of the component distributions. Calculate the covariance of the pair cov(Sj, Sk) and put it at the (j,k)-place inside the matrix.

Meaning:

\[ \begin{align} \operatorname{Cov}\left(\pmb{S}\right)\:&=\:\operatorname{\mathbb{E}} {\begin{pmatrix} (S_{1}-\mu_{1})^{2}&(S_{1}-\mu_{1})(S_{2}-\mu_{2})&\cdots &(S_{1}-\mu_{1})(S_{n}-\mu_{n})\\(S_{2}-\mu_{2})(S_{1}-\mu_{1})&(S_{2}-\mu_{2})^{2}&\cdots &(S_{2}-\mu_{2})(S_{n}-\mu_{n})\\\vdots &\vdots &\ddots &\vdots \\(S_{n}-\mu_{n})(S_{1}-\mu_{1})&(S_{n}-\mu_{n})(S_{2}-\mu_{2})&\cdots &(S_{n}-\mu_{n})^{2} \end{pmatrix}} \\\\ &=\quad {\begin{pmatrix} \operatorname{Var} (S_{1})&\operatorname{Cov} (S_{1},S_{2})&\cdots &\operatorname{Cov} (S_{1},S_{n})\\ \operatorname{Cov} (S_{2},S_{1})&\operatorname{Var} (S_{2})&\cdots &\operatorname{Cov} (S_{2},S_{n})\\\vdots &\vdots &\ddots &\vdots \\ \operatorname{Cov} (S_{n},S_{1})&\operatorname{Cov} (S_{n},S_{2})&\cdots &\operatorname {Var} (S_{n}) \end{pmatrix}} \end{align} \]

The above matrix is the (variance-) covariance matrix of S, which we also abbreviate with ΣS.

\[ \pmb{\Sigma}_S \,:=\, \operatorname{Cov} \left( \pmb{S} \right) \]
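As a practical note: from a finite sample, this matrix is estimated by averaging the outer products of the centered sample vectors. A minimal sketch, reusing the independent-Gaussian example sample assumed above, and cross-checked against numpy’s built-in estimator:

```python
import numpy as np

rng = np.random.default_rng(3)
mu_W = np.array([1.0, -0.5, 2.0])
sigma = np.array([0.8, 1.5, 0.4])
W_samples = rng.normal(loc=mu_W, scale=sigma, size=(100_000, 3))

# Covariance matrix via the definition: E[(S - E[S]) (S - E[S])^T]
centered = W_samples - W_samples.mean(axis=0)
Sigma_est = (centered.T @ centered) / (len(W_samples) - 1)

# Cross-check with numpy's built-in estimator
print(np.allclose(Sigma_est, np.cov(W_samples, rowvar=False)))   # True
print(Sigma_est.round(3))   # approx. diag(sigma**2), off-diagonals close to 0
```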

As Wikipedia tells you, some people call it a bit differently. I will refer to it later on just as the covariance matrix (of a random vector). Because the covariance of two independent distributions X and Y is zero

\[ X, \,Y \,\, \mbox{independent} \,\, \Rightarrow \,\, \operatorname{Cov}(X, Y) \,=\, 0 \,, \]

we directly find

\[ \pmb{\Sigma}_W \,=\, \operatorname{Cov} \left( \pmb{W} \right) \,. \]

We have found an interpretation of the matrix appearing in the probability density gw(w) of \(\pmb{W} \,\sim\, \pmb{\mathcal{N}}_n \left(\pmb{\mu}_{\small W},\, \pmb{\Sigma}_{\small W} \right) \):

The matrix Σw-1 appearing in the exponential function of gw(w) for a normal random vector with independent Gaussians as components is the inverse of the variance-covariance matrix Σw = Cov(W).

We expect that correlations between the 1-dimensional component distributions will leave their fingerprints in the coefficients of these matrices.

An open question: The attentive reader may now pose the following question: The matrix Σw = Cov(W) may be well defined for a general random vector. But how can we be sure that the inverse matrix always exists? This is a very good question – and we will have to deal with it in forthcoming posts on MNDs.

Correlation and independence

Equipped with the above definitions we can now better distinguish the independence of the individual component distributions from their being uncorrelated. Independence means that the probability density function of a random vector factorizes into a product of the probability density functions of the component distributions:

\[ g(\pmb{s}) \,=\, \, \prod\limits_{j=1}^n g_j (s_j) \]

Correlation instead refers to the covariance in the sense that some coefficients off the diagonal of the covariance matrix are not zero:

\[ \operatorname{Cov} (S_{j},S_{k}) \, \ne \, 0, \quad \mbox{for some } \, j, k, \,\, \mbox{with} \, j \ne k \]
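Note that uncorrelated does not imply independent. A classic counter-example, sketched numerically under the assumption X standard normal and Y = X²: the covariance is (approximately) zero, although Y is fully determined by X.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(0.0, 1.0, size=500_000)
Y = X**2                                   # fully dependent on X, but ...

# ... the covariance (and hence the correlation) is close to zero:
cov_XY = np.mean((X - X.mean()) * (Y - Y.mean()))
print(f"Cov(X, Y) ~ {cov_XY:.4f}")         # approx. 0: uncorrelated, yet not independent
```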

Expectation value and covariance of standardized normal random vectors with independent components

For our centered and standardized normal distribution Z we find:

\[ \begin{align} \pmb{\mu}_Z \,=\, \pmb{\mu}\left(\pmb{Z}\right) \, &= \, \operatorname{\mathbb{E}}\left(\pmb{Z}\right) = \pmb{0} \\ \pmb{\Sigma}_Z \, =\, \operatorname{Cov}\left(\pmb{Z}\right) \, &= \, \operatorname{\mathbb{E}}\left[\pmb{Z}\, \pmb{Z}^T\right] \, = \, \pmb{\operatorname{I}} \end{align} \]
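A quick numerical confirmation, again with the assumed example sample of independent Gaussians: after centering and scaling each component, the estimated mean vector is close to 0 and the estimated covariance matrix close to the identity.

```python
import numpy as np

rng = np.random.default_rng(11)
mu_W = np.array([1.0, -0.5, 2.0])
sigma = np.array([0.8, 1.5, 0.4])
W_samples = rng.normal(loc=mu_W, scale=sigma, size=(200_000, 3))

# Standardization: z_j = (w_j - mu_j) / sigma_j for every sample vector
Z_samples = (W_samples - mu_W) / sigma

print(Z_samples.mean(axis=0).round(3))                     # ~ (0, 0, 0)
print(np.cov(Z_samples, rowvar=False).round(3))            # ~ identity matrix I
```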

Conclusion and outlook

Objects with many quantifiable features can be described by vectors. Multivariate random vectors extend the idea of 1-dimensional, i.e. univariate random variables into the ℝn. A random vector represents a map of an object population as well as derived statistical samples and related statistical probabilities for the appearance of vectors with certain properties. Samples are created from the population by the repeated application of a defined process of statistically picking individual elements. By counting vectors with certain properties we get probabilities. A random vector thus describes a probabilistic distribution of vectors and related data points. The component values of these vectors follow univariate probability distributions. The probabilities of the univariate distributions can be understood as marginal probabilities of the random vector.

A limit transition from finite sets of sample vectors to huge or even infinite sets fed from an underlying continuous population may lead to distributions of (position) vectors whose endpoints densely fill the ℝn. A limit process allows for the definition of a probability density function for random vector distributions. The probability density functions describing the distribution of vector endpoints in the ℝn can only in simple cases be decomposed into a simple combination of the 1-dimensional pdfs for the components. In general the marginal distributions may not be independent of each other or uncorrelated.

By using the expectation values of the marginal distributions we can define mean vectors for such vector distributions in a straightforward way. The quantity corresponding to the variance of univariate distributions becomes, for a random vector distribution, a matrix – the variance-covariance matrix.

When the random vector is derived from independent univariate normal distributions, the covariance matrix becomes diagonal and the probability density of the vector distribution is given by a product of the univariate Gaussians for the marginal distributions. The probability density of a multivariate normal distribution Z composed of independent and standardized marginal Gaussians takes a particularly simple form, and the covariance matrix reduces to an (n x n) identity matrix.

In the next post of this series

we will apply a linear transformation to our special distribution Z. This will be done by applying a matrix A to the z-vectors. We will try to find out how this impacts the probability density of the transformed distribution.