This post series is about mathematical aspects of so-called “Multivariate Normal Distributions”. In the respective literature two abbreviations are common: MNDs or MVNs. I will use both synonymously. To provide easy access, I want to introduce an MND as the result of a linear transformation applied to random vectors whose components can be described by independent 1-dimensional normal distributions.
Afterward I will discuss a more general definition of MNDs and derive their most important properties. As I am a physicist and not a mathematician, I will try not to become too formalistic. Readers with a mathematical education may forgive a certain lack of rigor.
My perspective is that of a person who saw MNDs appear in different contexts of Machine Learning [ML]. Already the data samples used for the training of an ML algorithm may show properties which indicate an underlying MND. But also the analysis of data samples in the latent spaces of modern generative algorithms may reveal properties of MNDs. As we work with finite samples, we have to apply numerical methods for the analysis of our data distributions and compare the results with theoretical predictions.
One helpful fact in this respect is that the probability densities of MNDs have contour hyper-surfaces defined by quadratic forms. We therefore have to deal with multidimensional ellipsoids and their 2-dimensional projections, namely ellipses. We have to cover these geometrical figures with both analytical and numerical methods.

The detection of MNDs and their analysis in an ML context can on the lowest level be done by investigating the variance-covariance matrix of multidimensional data distributions. We will have a look at methods for a geometrical reconstruction of MND contours with the help of the coefficients of a respective matrix derived from available sample data by numerical methods. The theoretically predicted contour ellipses based on these coefficients can be compared with contours produced by a direct numerical interpolation of the sample’s vector data. I will also discuss the relation to PCA transformations and how to use them and their inverse for contour and vector reconstruction from selected data in 2-dimensional projections. Further mathematical methods, which check the MND compatibility of sample data more thoroughly, will be mentioned and discussed briefly.
In this first post I start our journey through the land of MNDs by describing a random vector composed of independent Gaussians. On our way I will briefly touch upon such basic topics as populations, samples, marginal distributions and probability densities in an intuitive and therefore incomplete way.
The posts in this series require some knowledge of Linear Algebra and a bit of vector analysis.
Basics: Samples of objects in Machine Learning and underlying populations
In Machine Learning we typically work with finite samples of discrete objects which have well defined properties or “features“.
When these features can be characterized by numerical values, the sample objects can mathematically be represented by points in the ℝn or corresponding position vectors. Such a position vector would reach out from the origin of a Cartesian coordinate system (spanning the ℝn) to a specific point representing a selected data object of our sample.
The coordinates of the data point or respective vector components are given by the feature values. This means that we deal with discrete distributions of data points and respective vectors in multi-dimensional Euclidean spaces.
Samples and statistics
We assume that the properties of the objects in an available ML sample follow some common underlying patterns (in the hope that an ML algorithm will detect such patterns during training). These properties would be encoded in the components of related vectors in the feature space. (And later on in the weights of a neural network trained on the sample’s object distribution in feature space …)
In what way does statistics enter this context?
The first point is that we hope that the concrete sample of vectors we use for the training of an ML algorithm is representative of other samples of similar objects (and related vectors). We further assume that the object distribution in the feature space represents the “real” distribution of all possible objects across the common spectrum of features reasonably well. In other words:
We assume that the elements of our samples are taken via an unbiased statistical process from an underlying general population providing many more if not all possible objects of the same kind.
This is the reason why we think that the results of an ML training will reflect real patterns, typical for yet unseen samples, too.
Abstract population and the distribution of its elements in the feature space
The “population” is abstract with respect to several of its aspects:
- In most practical ML scenarios we have no direct access to it. We have neither a huge sample comprising all possible objects of the same kind nor infinite samples.
- Most samples in ML refer to concrete objects and not necessarily to abstract vectors in a Euclidean space like the ℝn we in the end work with.
- A “population” may also be the theoretical result of some kind of machinery or set of creation rules which at least in principle is able to produce all objects of the same kind – very often a potentially infinite set. Take a genetic code as an example. However, in very many cases we do not even know the rules of this machinery when we start with a statistical analysis of the properties of one or multiple samples.
- If we have many samples and all show the same mean values of certain features and/or the same correlation of certain features we have reason to expect that some common underlying principle guided the creation of the objects in question.
Abstract space for an underlying population
Let us therefore assume that the elements of our population reside in an abstract space Ω – which is accompanied
- by a kind of defined unbiased process ΨS for the “picking” of objects from it and
- by probabilities PΩ for picking certain types of objects, i.e. of objects which have certain specific properties, i.e. feature values, in common.
In many cases we have reasons to assume that the whole population shows a well defined object distribution with respect to certain properties. This distribution could then be used to define related probabilities PΩ. While we know the rules behind such distributions for artificial, e.g. industrial, object creation, it is sometimes not so easy to find and explain the origin of natural distributions. For an introduction to the origin of some elementary statistical distributions see here. Note that all statistical distributions allow for variations of feature values. In natural processes these variations correspond to fluctuations. The spectrum of such variations or fluctuations may follow some laws, e.g. guided by entropy maximization in nature.
Think about this twice – statistics enters the treatment of samples potentially on two levels:
(1) On a very basic level we have natural fluctuation processes accompanying the object creation of a population – and not all fluctuations around some mean values occur on this level with the same frequency; their frequencies may follow some mathematical rules.
(2) The second level is that of creating samples from an underlying population. The statistical distribution of the sample objects will only reflect the object distribution in the full population if we follow an unbiased way of picking objects from the population. I.e. the picking process must not prefer certain objects of the population; the picking must occur blindly with respect to feature values. Each object must be “pickable” with the same probability from the population – by whatever means we guarantee this. Only then will the statistical distribution of the finite sample across the feature space approximately reveal the real distribution of population objects in feature space.
What analogies could provide visual examples of picking objects from a population? Think of little spherical balls representing our real objects, each ball with a defined color or other marks representing a certain set of properties. Now, we take all the spheres of the population, put them in a volume (like a cube) and mix them thoroughly. We sub-divide the cube virtually into many discrete, neighboring sub-cubes. We then pick objects by guiding a picking arm to one of the sub-volumes with a probability equal for all volumes. I.e. we pick blindly with respect to properties, but with equal probability regarding the mixture. In between picks we may mix thoroughly again. (To guarantee equal picking with respect to defined sub-volumes in a cube is sometimes more difficult than one might expect; see e.g. here.) In such a case the abstract space Ω would get a real representative counterpart.
Picking objects for a sample
We assume that the process which picks individual elements of the population and fills our sample is based on (blind) chance (with respect to object properties). If the sample is big enough it will thus approximate PΩ with respect to the distribution of population objects in feature space. I.e. we assume that our sample objects are picked without any bias regarding the properties of the objects of the underlying population.
What we in our case get is a mapping S from Ω to the ℝn :
As soon as we have such a map S and a description of a random picking process ΨS , we can repeat ΨS multiple times to create one or more samples of vectors. The production of different samples would then (due to the underlying probabilities PΩ) follow statistical laws:
For such proper statistical processes behind the sample creation we would assume that the probabilities derived from many and/or growing samples approach the probabilities describing the (assumed) distribution of the objects in the underlying population (with potentially infinitely many members) in the feature space. The distributions of vectors in our samples with respect to certain properties of these vectors (and their components) should statistically approximate the properties of the underlying population and the related probabilities PΩ. The more samples or the bigger the samples we work with, the better the approximation.
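As a minimal numerical sketch of this convergence (with an assumed, artificial population given by a 1-dimensional Gaussian; all parameter values below are purely illustrative), we can draw growing samples and compare the relative frequency for a fixed feature interval with the theoretical probability:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-dim population: a feature following a normal distribution
# with assumed mean mu and standard deviation sigma (values just illustrative)
mu, sigma = 2.0, 0.5
rng = np.random.default_rng(42)

# Theoretical probability of finding the feature value in the interval [a, b]
a, b = 1.5, 2.5
p_theory = norm.cdf(b, loc=mu, scale=sigma) - norm.cdf(a, loc=mu, scale=sigma)

# Unbiased "picking" of growing samples and the resulting relative frequencies
for n in (100, 10_000, 1_000_000):
    sample = rng.normal(mu, sigma, size=n)
    p_sample = np.mean((sample >= a) & (sample <= b))
    print(f"n = {n:>9d}: relative frequency = {p_sample:.4f} (theory: {p_theory:.4f})")
```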
How do we get and measure probabilities of the sample-based distribution of vectors in the ℝn ? One option is that we sum up the number of vectors pointing into a certain discrete sub-volume ΔV of the ℝn and set the result in relation to the number of all available vectors (and thus to the covered part V of the ℝn). This gives us the probability PV of finding a vector with component values in the ranges defined by the borders of ΔV.
For large samples (or averaging over many samples), we assume that the probability that S takes on a value s with an end-point in a certain measurable sub-space ΔV ⊂ ℝn is defined by the probability to pick a respective element ω of the population space with the corresponding property – namely that its image points into the discrete ΔV of the feature space. If we express the end-point lying in ΔV by the somewhat sloppy notation s ∈ ΔV, we arrive at:
Meaning: The probability of finding vectors pointing to a volume ΔV of the ℝn is defined by a probability PΩ of picking elements with fitting properties from the population.
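A small numerical sketch of the counting procedure described above (the 2-dimensional sample below is drawn from an assumed population of two independent Gaussian feature components, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Assumed 2-dim population: two independent Gaussian feature components
sample = np.column_stack((rng.normal(0.0, 1.0, N),
                          rng.normal(3.0, 2.0, N)))

# Divide a rectangular region of the R^2 into sub-volumes (bins) of equal size
counts, edges = np.histogramdd(sample, bins=(40, 40))

# Probability P_V of a vector pointing into a given sub-volume Delta V
P_V = counts / N            # relative frequency per bin
print(P_V.sum())            # = 1.0 here, as the bins cover all sample points
print(P_V.max())            # probability of the most densely hit sub-volume
```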
Dense distributions of data points and vectors
The end points of vectors pointing to neighboring small volumes ΔV of the ℝn may fill the surrounding space relatively densely. The population may be based on a distribution of many objects with smoothly varying properties. Our map S will then lead to a dense distribution of points (and related vectors) in certain volumes of the ℝ or the ℝn.
By dividing the probability of finding an object in a small finite volume ΔV by this volume we would get discrete values of a quantity we name a “probability density“. The probability density values would of course depend on the coordinates of the mid-point of ΔV. Such a process requires, by the way, that the target space of S should be measurable by virtue of some norm. In our case with the target space being the ℝn we, of course, use the standard Euclidean norm.
In many cases we may find profound indications from the analysis of large or many different ML samples – or from theories about our object population – that the individual points or vectors are distributed according to a certain well defined continuous “probability density function” [pdf]. This corresponds to a transition of a finite population to an infinite one with a dense object distribution and to a limit process for ever growing samples picked from such a distribution. Note that a continuous probability density in the ℝn requires three pre-requisites:
- A volume measure and a vector norm of the target space of S for object representation.
- S must cover certain volumes of the ℝn more and more densely with the endpoints of vectors as more and more objects are added to continuously growing statistical samples.
- The number of points (end-points of vectors) per unit volume derived from the samples must vary relatively smoothly between neighboring volume elements and approach a continuous variation in the limit process to infinitely large samples and infinitesimal volume elements.
In case we believe that an available finite statistical sample of vectors indeed represents an underlying continuous vector distribution, we may want to approximate the latter by (multidimensional) numerical interpolations of the discrete numbers of points or vectors per volume element retrieved from our samples. We can then compare the resulting curves for the probability density with theoretical curves of probability density functions based on the properties of the continuous distributions of specific vector populations.
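A hedged sketch of such an approximation, again with an assumed population of two independent Gaussian components: we estimate the density from bin counts divided by (N · ΔV), interpolate between the bin centers and compare with the theoretical density at a test point.

```python
import numpy as np
from scipy.stats import norm
from scipy.interpolate import RegularGridInterpolator

rng = np.random.default_rng(1)
N = 200_000

# Assumed 2-dim population: independent Gaussians N(0, 1) and N(0, 2^2)
sample = np.column_stack((rng.normal(0.0, 1.0, N),
                          rng.normal(0.0, 2.0, N)))

# Discrete density estimate: counts per bin divided by (N * bin volume)
counts, (xe, ye) = np.histogramdd(sample, bins=(60, 60))
dV = (xe[1] - xe[0]) * (ye[1] - ye[0])
g  = counts / (N * dV)

# Interpolate the discrete density values between the bin centers
xc, yc = 0.5 * (xe[:-1] + xe[1:]), 0.5 * (ye[:-1] + ye[1:])
g_interp = RegularGridInterpolator((xc, yc), g)

# Compare the interpolated density with the theoretical pdf at a test point
pt = np.array([[0.5, -1.0]])
p_theory = norm.pdf(0.5, 0.0, 1.0) * norm.pdf(-1.0, 0.0, 2.0)
print(g_interp(pt)[0], p_theory)    # both values should be reasonably close
```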
In the following sections we want to define some very special vector distributions and their probability density functions in mathematical terms.
Random vectors and probability density functions
We take a multivariate random vector as a multidimensional “vector” S having 1-dim statistical random variables Sj as components.
A random variable Sj in our case is actually given by a function Sj: Ω→ℝ. Sj maps a set of objects taken from an underlying population to a statistical sample of real numbers. If applied multiple times Sj creates discrete distributions of points in ℝ. In a sense Sj therefore represents statistical sample distributions in ℝ.
A random vector S maps elements of Ω to vectors in the ℝn. It is a vector valued function. It indirectly represents resulting statistical distributions of vectors in (potentially infinite) samples of (position) vectors defined on the ℝn and respective distributions of data points. For a finite sample of vectors we can in principle index all the comprised vectors and list them:
T symbolizes the transpose operation.
Correlated components of the random vector: Note that the components of the vectors sk of a specific set may not be independent of each other. Meaning: For some sets and populations the components of the comprised vectors in the ℝn do not vary individually (aside from statistical fluctuations):
If we choose a component value skj, then the value ski of another component may have to fulfill certain conditions depending on the values of the first and/or other components. In other words: The individual distributions represented by the various Sj can e.g. be (pairwise)
- independent or dependent of each other,
- they can be uncorrelated or correlated.
This holds for finite sets with discrete vectors or potentially infinite sets of vectors whose endpoints cover a sub-volume of the ℝn densely. For a distinction of (un-) correlation from (in-) dependence see below.
Continuous distributions and probability densities in one dimension
A concrete sample vector sk ∈ ℝn of a vector distribution can be described by specific values skj assumed by variables sj, which we use below to flexibly describe vector components. The sj determine the coordinates of respective data points. In the discrete case the sj can only take on certain well defined values in ℝ (see above). For discrete distributions a concrete value skj is assumed with a certain probability P(skj). The factual probability given by a sample depends of course on the number of those objects in a set that reproduce this value of sj and on the total number of objects in the sample. If such a definition is too fine-grained we may use probabilities defined on intervals a ≤ sj ≤ b.
In case of an infinite population with a dense distribution of objects with smoothly varying properties, we can approach a continuous distribution of points in target regions of ℝ by creating larger and larger samples. In such a kind of limit process Sj may map properties of the objects in Ω onto a dense distribution of points in ℝ having a continuous 1-dimensional probability density – in the sense that the number of points in neighboring small intervals varies continuously.
Now let us look at a dense population with a distribution of objects with smoothly varying properties – and at extremely huge or infinite sets derived from it. If sj in a limit process can take all values in ℝ and cover selected finite intervals [a, b] of ℝ densely we may define a continuous probability density pj(sj) for infinitely small intervals within the finite intervals such that the probability for sj to assume values in an interval [a, b] is given by
pj(sj) characterizes a 1-dimensional distribution of a dense population represented by Sj. In some interesting cases the map of such a distribution and derived potentially infinite samples may cover large parts of ℝ or all of ℝ densely.
When we have a large sample of M objects, of which a number ΔMa,b is located in [a, b] (and other objects in e.g. neighboring intervals), then we would assume:
Probability density of a random vector distribution
By making a limit transition to infinitely small volumes in the ℝn , we can extend the idea of a probability density to a random vector S representing some object population with the ℝn as feature space. The more elements we get in a statistical sample of vectors (based on picking from the population), the better will the differences in the numbers of vectors pointing into neighboring small volumes of equal size be described by differences of the population’s probability density evaluated at the volumes’ centers.
In a limit process for an infinite population the endpoints of the sample’s position vectors would fill the ℝn (or a defined sub-space V of it; V ⊂ ℝn) more and more densely. In the end we would get an infinitely dense distribution in the sense that every infinitesimal volume dV in the covered region V would have vectors pointing to it. For huge samples with N elements we could use fine-grained, but finite volume elements ΔV and count the number of vectors ΔnV pointing to each of these volume elements to define discrete data points gΔV = ΔnV / (N · ΔV) for an approximation of a continuous density function via interpolation.
Note that a precise definition of a “dense” vector population would require a measure for the distance between vectors a and b. We can formally achieve this by defining a norm for the difference vector ||b–a||. But I think it is clear what we mean: Two position vectors are close neighbors if and when their endpoints are close neighbors in the ℝn in the sense that the distance between these points becomes very small or negligible.
If each of the components sj of the vectors s can take all values in ℝ we may describe the probability that the population represented by S contains vectors s pointing to a volume element ΔV by a continuous multidimensional probability density function [pdf] for infinitesimal volume elements dVS = ds1 ds2 … dsn.
Let us call the pdf for a continuous random vector pS (s1,s2, …) and define the probability P of finding or creating vectors pointing into a finite volume:
Note that pS(): ℝn→ℝ is a continuous function mapping vectors via their components to real values pS ∈ ℝ. For pS to play its role as a probability density it must fulfill a normalization condition for the covered space V in the ℝn (which in some cases may be the infinite ℝn itself):
In terms of finite discrete distributions: Summing up the number of vectors pointing to discrete volume elements over all (hit) volume elements in the ℝn must give us the total number of all available vectors.
Note that because pS maps the ℝn to ℝ, a continuous pS thereby defines a hyper-surface in some ℝn + 1, and a constant pS = const defines a hyper-surface in the ℝn.
Dependencies and correlations of the components of a random vector
An interesting question is whether and, if so, how pS(s) depends on the 1-dimensional probability density functions pj(sj) or on the parameters [params] of these density functions:
The brackets [ ] denote a very complex relation in the general case; a direct analytical decomposition of pS(s) into the pj(sj) may in general not be possible or only after some sophisticated transformations.
For MNDs, however, we will be able to show how the parameters describing the densities pj(sj) constitute the functional form of pS(s). And we will find transformations which allow for a functional decomposition.
Let us reverse our line of thought and assume that we have a given probability density function pS(s) for a vector distribution (or derived such a distribution by numerical interpolations between elements of a discrete set of sample vectors). Then we may be interested in the 1-dimensional densities of constituting random variables. We introduce the probability density of a “marginal distribution” as
I.e., we integrate pS(s) completely over n-1 coordinate directions of the ℝn, but leave out the 1-dimensional ℝ-subspace for sj. Note that the integral covers all dependencies of the components of the vectors in the set we work with.
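The following sketch performs this marginalization numerically on a grid. SciPy's bivariate normal serves merely as a convenient, assumed test density pS; its analytically known Gaussian marginal is used for comparison.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Convenient, assumed test density p_S on a 2-dim grid: a correlated
# bivariate normal taken from SciPy (parameter values purely illustrative)
mu    = np.array([1.0, -0.5])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

s1 = np.linspace(-8.0, 10.0, 601)
s2 = np.linspace(-8.0,  8.0, 601)
S1, S2 = np.meshgrid(s1, s2, indexing="ij")
p_S = multivariate_normal(mu, Sigma).pdf(np.dstack((S1, S2)))

# Marginal density p_1(s_1): integrate p_S numerically over the s_2 direction
ds2 = s2[1] - s2[0]
p1_numeric = p_S.sum(axis=1) * ds2      # simple Riemann sum over s_2

# For a bivariate normal the marginal is again Gaussian: N(mu_1, Sigma_11)
p1_theory = norm.pdf(s1, loc=mu[0], scale=np.sqrt(Sigma[0, 0]))
print(np.max(np.abs(p1_numeric - p1_theory)))   # only a small deviation expected
```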
Note also that evaluating the integral does not mean that we can decompose pS(s) easily into some analytical relation between the pj(sj). But for some special distributions we should at least be able to provide a relation of pj(sj) to pS(s) mediated by the conditional probability density for the other component values p|sj() in the logical sense of a factorization of probabilities
Random vector composed of independent univariate normal distributions
Let us leave behind all the subtleties involved in the statistical mappings mediated by S onto potentially continuous data point distributions in the ℝn. We achieve this by discussing concrete and well defined probability density functions.
Let us focus on a bunch of n independent 1-dimensional normal distributions with probability densities gj(wj). (Knowing that we derive the distributions via n mapping functions Wj ). For a 1-dimensional normal distribution the probability density is given by a Gaussian
μj is the mean value and σj is the standard deviation, i.e. the square root of the variance σj2, of the Wj distribution. We use the following notation to express that the result of Wj belongs to the set of 1-dimensional normal distributions:
Regarding the parameters: \( \mathcal{N}\left(\mu_j, \sigma_j^2 \right) \) denotes a 1-dimensional normal distribution with a mean value \(\mu_j\) and a variance \(\sigma_j^2 \).
Now, let us use a bunch of n distributions Wj to compose a random vector W with n components
Due to the assumed independence of the 1-dimensional components Wj, the resulting probability density function gw(w) is just the product of the pdfs for the individual components:
Σw symbolizes a diagonal matrix with the variances σj2 on the diagonal and zeros otherwise (see below for the appearance of this matrix). μw is a vector comprising all the μj-elements. We get a compact expression for gw(w):
For reasons that will become clear in the next post, we formally rewrite this function with the help of a vector notation to become:
The bullets in the exponent mark matrix multiplications (in the sense of Linear Algebra). The inverse of Σw is diagonal, too, and has the reciprocals 1/σj2 as coefficients.
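The equivalence of the product form and the compact vector/matrix form can be checked numerically. A minimal sketch with assumed parameter values (SciPy's multivariate_normal is used only as an independent cross-check):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Assumed parameters of n = 3 independent 1-dimensional Gaussians
mu_w    = np.array([1.0, -2.0, 0.5])
sigma_w = np.array([0.8,  2.0, 1.5])

Sigma_w     = np.diag(sigma_w**2)         # diagonal (co)variance matrix
Sigma_w_inv = np.diag(1.0 / sigma_w**2)   # inverse: reciprocals 1/sigma_j^2

w = np.array([0.3, -1.0, 2.0])            # an arbitrary test vector

# 1) Product of the individual 1-dim Gaussian densities g_j(w_j)
g_prod = np.prod(norm.pdf(w, loc=mu_w, scale=sigma_w))

# 2) Compact vector/matrix form with the diagonal Sigma_w
d = w - mu_w
lead = 1.0 / np.sqrt((2.0 * np.pi)**3 * np.linalg.det(Sigma_w))
g_vec = lead * np.exp(-0.5 * d @ Sigma_w_inv @ d)

# 3) SciPy's multivariate normal pdf as an independent cross-check
g_scipy = multivariate_normal(mu_w, Sigma_w).pdf(w)

print(g_prod, g_vec, g_scipy)             # all three values should agree
```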
It is easy to prove by proper integration of gw(w) that the marginal distributions Wj of W are just described by the probability densities gj(wj). It is also simple to show that the probability density gw(w) fulfills the required normalization condition
We introduce the notation of a multivariate normal distribution of independent Gaussians in the ℝn as a distribution with a pdf gw(w) and thus having two parameters given by a mean vector μw and a diagonal matrix Σw as
Σw is also called the variance-covariance matrix or shorter covariance matrix of a MND. See below.
Standardized multivariate normal distributions composed of independent 1-dimensional Gaussians
Let us now shift the Cartesian coordinate system of our ℝn such that we have
Let us further scale the individual coordinates by
Then we get a centered and standardized normal vector distribution [SNVD] of independent Gaussians :
\( \pmb{\mathcal{N}} \left( \pmb{0}, \, \pmb{\operatorname{I}} \right) \) denotes a random normal vector with a mean vector \(\pmb{\mu} = \pmb{0} \) and an (nxn) identity matrix \( \pmb{\operatorname{I}} \) as its covariance matrix (see the next section). Its probability density has a simpler form, namely
Again, I denotes the identity matrix with dimension (n x n).
The exponent includes a scalar product between the vectors. In the future we will not write an extra dot between the horizontal and the vertical vectors. Two matrices written next to each other symbolize a matrix multiplication in the sense of Linear Algebra – if the numbers of columns and rows fit.
Why did the leading factor get so simple? Well, the density function re-scales with the coordinates because on the infinitesimal scale we must fulfill:
Because of dzj = (1/σj) dwj , the new volume element dVz in Z-space becomes
The attentive reader recognizes the Jacobian determinant of the coordinate transformation. It eliminates the product of the σj-factors in the denominator of the leading term of gw(w) during the transition to gz(z).
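A small numerical sketch of this density transformation, with assumed parameter values: we evaluate gw(w) and gz(z) at corresponding points and verify the Jacobian relation gz(z) = gw(w) · ∏ σj.

```python
import numpy as np
from scipy.stats import norm

# Assumed parameters of three independent Gaussian components of W
mu_w    = np.array([1.0, -2.0, 0.5])
sigma_w = np.array([0.8,  2.0, 1.5])

w = np.array([0.3, -1.0, 2.0])         # a test vector in W-space
z = (w - mu_w) / sigma_w               # the corresponding centered, scaled vector

# Densities before and after the coordinate transformation
g_w = np.prod(norm.pdf(w, loc=mu_w, scale=sigma_w))
g_z = np.prod(norm.pdf(z))             # standard normal density in Z-space

# The Jacobian determinant of w -> z is prod(1/sigma_j); to keep g * dV
# invariant the densities must fulfill g_z(z) = g_w(w) * prod(sigma_j).
print(g_z, g_w * np.prod(sigma_w))     # both numbers should agree
```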
For Z we obviously have:
It is easy to show that the pdf of Z remains normalized:
Expectation value and covariance of random vectors
We want to understand the role of the matrix Σw a bit better. Actually, this matrix is closely related to a formal extension of the “covariance” of two 1-dimensional random distributions to random vector distributions. But let us start with the expectation value of the random vector first.
The expectation vector of a random vector is just the vector composed of the expectation values of its (marginal) component distributions:
The covariance of a random vector results in a natural way from a generalization of the covariance of two 1-dim distributions X and Y:
A multidimensional generalization has to take into account all possible combinations (Sk, Sj). This indicates already that we need a matrix. A simple formal way to get expectation values of all pairwise combinations of components from a random vector S is to use a matrix product between S and ST. This leads almost naturally to a matrix of expectation values:
Note the order of transposition in the definition of Cov! A (vertical) vector is combined with a transposed (horizontal) vector. The rules of a matrix-multiplication then give you a matrix as the result! The expectation value has to be determined for every element of the matrix. Thus, the interpretation of the notation given above is:
Pick all pairwise combinations (Sj, Sk) of the component distributions. Calculate the covariance of the pair cov(Sj, Sk) and put it at the (j,k)-place inside the matrix.
Meaning:
The above matrix is the (variance-) covariance matrix of S, which we also abbreviate with ΣS.
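A minimal sketch of this recipe with an assumed, partly correlated 3-dimensional sample: we estimate the covariance matrix via the averaged outer products of the centered vectors and compare with NumPy's own estimator.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200_000

# Assumed sample: three components, two of them deliberately correlated
s1 = rng.normal(0.0, 1.0, N)
s2 = 0.6 * s1 + rng.normal(0.0, 0.5, N)   # correlated with s1
s3 = rng.normal(2.0, 1.5, N)              # independent of the others
S  = np.column_stack((s1, s2, s3))

# Covariance matrix as the averaged outer products of the centered vectors,
# i.e. a sample estimate of E[ (S - E[S]) (S - E[S])^T ]
D = S - S.mean(axis=0)
Cov_outer = (D.T @ D) / N

print(Cov_outer)                          # off-diagonal (0,1) element ~ 0.6
print(np.cov(S, rowvar=False))            # NumPy's estimate (1/(N-1) convention)
```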
As Wikipedia tells you, some people use slightly different names for it. I will refer to it later on just as the covariance matrix (of a random vector). Because the covariance of two independent distributions X and Y is zero
we directly find
We have found an interpretation of the matrix appearing in the probability density gw(w) of \(\pmb{W} \,\sim\, \pmb{\mathcal{N}}_n \left(\pmb{\mu}_{\small W},\, \pmb{\Sigma}_{\small W} \right) \):
The matrix Σw-1 appearing in the exponential function of gw(w) for a normal random vector with independent Gaussians as components is the inverse of the variance-covariance matrix Σw = Cov(W).
We expect that correlations between the 1-dimensional component distributions will leave their fingerprints in the coefficients of these matrices.
An open question: The attentive reader may now ask: The matrix Σw = Cov(W) may be well defined for a general random vector. But how can we be sure that the inverse matrix always exists? This is a very good question – and we will have to deal with it in forthcoming posts on MNDs.
Correlation and independence
Equipped with the above definitions we can now better distinguish the mutual independence of the individual distributions from their being merely uncorrelated. Independence means that the probability density function of a random vector factorizes into a product of the probability density functions for the component distributions.
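Independence always implies vanishing covariance (see above), but the converse does not hold. A minimal sketch of a standard counterexample with assumed sample data: a variable and its square are (almost) uncorrelated for a symmetric distribution, yet clearly dependent.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(0.0, 1.0, 1_000_000)
y = x**2          # fully determined by x, hence strongly dependent on it

# Sample covariance of (x, y): close to zero although the variables are
# clearly not independent. Zero correlation does not imply independence.
print(np.cov(x, y)[0, 1])
```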
Correlation instead refers to the covariance in the sense that some coefficients off the diagonal of the covariance matrix are not zero:
Expectation value and covariance of standardized normal random vectors with independent components
For our centered and standardized normal distribution Z we find:
Conclusion and outlook
Objects with many quantifiable features can be described by vectors. Multivariate random vectors extend the idea of 1-dimensional, i.e. univariate random variables into the ℝn. A random vector represents a map of an object population as well as derived statistical samples and related statistical probabilities for the appearance of vectors with certain properties.
Samples are created from a population by the repeated application of a defined process of statistically picking individual elements. By counting vectors with certain properties we arrive at probabilities. A random vector thus describes a probabilistic distribution of vectors and related data points. The component values of these vectors follow univariate probability distributions. The probabilities of the univariate distributions can be understood as marginal probabilities of the random vector.
A limit transition from finite sets of sample vectors to huge or even infinite sets fed from an underlying continuous population may lead to distributions of (position) vectors whose endpoints densely fill the ℝn. A limit process allows for the definition of a probability density function for random vector distributions. The probability density functions describing the distribution of vector endpoints in the ℝn can only in simple cases be decomposed into a simple combination of the 1-dimensional pdfs for the components. In general the marginal distributions may not be independent of each other or uncorrelated.
By using the expectation values of the marginal distributions we can define mean vectors for such vector distributions in a straightforward way. The quantity corresponding to the variance of univariate distributions becomes, for a random vector distribution, a matrix – the variance-covariance matrix.
When the random vector is derived from independent univariate normal distributions, the covariance matrix becomes diagonal and the probability density of the vector distribution is given by a product of the univariate Gaussians for the marginal distributions. The probability density of a multivariate normal distribution Z composed of independent and standardized marginal Gaussians gets a particularly simple form and its covariance matrix reduces to an (n x n) identity matrix.
In the next post of this series
we will apply a linear transformation to our special distribution Z. This will be done by applying a matrix A to the z-vectors. We will try to find out how this impacts the probability density of the transformed distribution.