This post series is about mathematical aspects of so-called “**Multivariate Normal Distributions**“. In the literature two abbreviations are common: **MND**s or **MVN**s. I will use both synonymously. As an easy entry point I want to introduce MNDs as the result of a linear transformation applied to *random vectors* whose components can be described by *independent* 1-dimensional normal distributions. Afterward I will discuss a more general definition of MNDs and derive their most important properties. As I am a physicist and not a mathematician, I will try not to become too formalistic. Readers with a mathematical education may forgive a lack of rigor.

My perspective is that of a person who saw MNDs appear in different contexts of Machine Learning [ML]. Already the data samples used for the training of an ML algorithm may show properties which indicate an underlying MND. But also the analysis of data samples in the **latent spaces** of modern generative algorithms may reveal properties of MNDs. As we work with finite samples, we have to apply numerical methods for the analysis of our data distributions and compare the results with theoretical predictions.

One helpful fact in this respect is that the probability densities of MNDs have contour hyper-surfaces defined by quadratic forms. We, therefore, have to deal with multidimensional ellipsoids and their 2-dimensional projections, namely ellipses. We have to cover these geometrical figures with both analytical and numerical methods.

The detection of MNDs and their analysis in an ML context can on the *lowest* level be done by investigating the *variance-covariance matrix* of multidimensional data distributions. We will have a look at methods for a geometrical reconstruction of MND contours with the help of the coefficients of a respective matrix derived from available sample data by numerical methods. The theoretically predicted contour ellipses based on these coefficients can be compared with contours produced by a direct numerical *interpolation* of the sample’s vector data. I will also discuss the relation to PCA transformations and how to use them and their inverse for contour and vector reconstruction from selected data in two-dimensional projections. Further mathematical methods, which check the MND compatibility of sample data more thoroughly, will be mentioned and discussed briefly.

In this first post I start our journey through the land of MNDs by describing a random vector composed of independent Gaussians. On our way I will briefly touch on some basic topics such as populations, samples, marginal distributions and probability densities in an intuitive and therefore incomplete way.

The posts in this series require knowledge of Linear Algebra and a bit of vector analysis.

## Basics: Samples of **objects in Machine Learning** and underlying populations

In Machine Learning we typically work with (finite) **samples** of discrete objects which have well defined properties or “features”. When these features can be characterized by numerical values, the sample objects can mathematically be represented by points in the ℝ^{n} or corresponding *position vectors*. Such a position vector would reach out from the origin of a Euclidean coordinate system to a specific point representing a selected data object of our sample. This means that we deal with discrete distributions of data points and respective vectors in multi-dimensional spaces.

### Samples and statistics

We assume that the properties of the objects in an ML sample follow some common underlying *patterns* (in the hope that an ML algorithm will detect them during training). These properties would be encoded in the components of the related vectors. In what way does *statistics* enter this context? The first point is that we assume that the concrete sample of vectors we use for the training of an ML algorithm is representative of other samples of objects (and related vectors) showing alike properties. One such prominent sample would be the one we apply our trained algorithm to at inference time. In other words:

We assume that the elements of our samples are taken via a *statistical* process from a general underlying **population** providing many more if not all possible objects of the same kind.

The “*population*” is *abstract* in several aspects: 1) In most practical scenarios we have no direct access to it. 2) It refers to objects and not necessarily to the vectors we in the end work with. 3) A “population” may also be the theoretical result of some kind of *machinery* which at least in principle is able to *produce* a complete (even infinite) set of objects of the same kind. Take a genetic code as an example.

Let us therefore assume that the elements of our population reside in an *abstract* space **Ω** – which is accompanied 1) by a kind of defined event or process **Ψ**_{S} for the “picking” of objects and 2) by probabilities *P*_{Ω} for picking certain types of objects, i.e. of objects having specific properties in common. If you want to, you may assume that the population shows an object *distribution* with respect to certain properties. The distribution then could be used to define related probabilities *P*_{Ω}. We assume that the process which picks individual elements of the population and fills our sample is based **on chance** and guided by *P*_{Ω} with respect to the element selection from the population. I.e. we assume that our samples are generated without any bias compared to the properties of the underlying population and its object distribution.

What we in our case need is a mapping **S** from **Ω** to the ℝ^{n}:

As soon as we have a map **S** and a description of a random picking process **Ψ**_{S}, we can repeat **Ψ**_{S} multiple times to create one or more samples of vectors. The production of different samples would then (due to the underlying probabilities) follow statistical laws.

The distributions of vectors in our samples with respect to certain properties of these vectors and their components should statistically reflect the properties of the underlying population objects and related probabilities *P*_{Ω}. For proper statistical processes behind the sample creation we would assume that the probabilities derived from many and/or *growing* samples would approach the probabilities describing the (assumed) distribution of the objects in the underlying (potentially infinite) population.

How do we get and measure probabilities of the sample-based distribution of vectors in the ℝ^{n}? One option is that we count the number of vectors pointing into certain discrete sub-volumes ΔV of the ℝ^{n} and set the results in relation to the number of all available vectors (and thus the covered part V of the ℝ^{n}). This gives us the **probability** *P*_{V} of finding a vector with component values in ranges defined by the borders of ΔV. We assume that the probability that **S** takes on a value **s** in a measurable sub-space ΔV ⊂ ℝ^{n} is defined by a probability to pick a respective element in the population space:

Meaning: The probability of finding vectors pointing to a volume ΔV of the ℝ^{n} is defined by a probability *P_{Ω}* of picking elements from the population with fitting properties.
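A minimal numerical sketch of this counting approach, assuming (purely for illustration) a 2-dimensional standard normal sample as a stand-in for vectors produced by **S**:

```python
import numpy as np

# Sketch (assumed example, not from the post): estimate P_V for a box-shaped
# sub-volume ΔV by counting which sample vectors point into it.
rng = np.random.default_rng(42)
sample = rng.standard_normal((100_000, 2))   # stand-in for vectors created by S

# Sub-volume ΔV: the axis-aligned box [0, 1] x [0, 1]
lo = np.array([0.0, 0.0])
hi = np.array([1.0, 1.0])
inside = np.all((sample >= lo) & (sample <= hi), axis=1)

P_V = inside.mean()     # relative frequency of vectors pointing into ΔV
print(P_V)
```

The relative frequency approaches the population probability for growing samples, which is exactly the statistical reasoning of the paragraph above.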

### Dense distributions of data points and vectors

The end points of vectors pointing to neighboring small volumes ΔV of the ℝ^{n} may fill the surrounding space relatively densely. The population may be based on a distribution of many objects with smoothly varying properties. Our map **S** might then lead to a dense distribution of points (and related vectors) in certain volumes of the ℝ^{n}. By dividing the probabilities for small finite volumes ΔV by the respective measured volume we would get discrete values of a quantity we name a “**probability density**”. We learn by the way that the target spaces of **S**-functions should be measurable.

In many cases we may find profound indications from the analysis of large or many different ML samples – or from theories about our object population – that the individual points or vectors are distributed according to a certain well defined *continuous* “**probability density function**” [**pdf**] – at least in the sense of a *limit process* for ever growing samples picked from such a distribution. Note that a *continuous probability density* in the ℝ^{n} requires three things:

- The map **S** must cover certain volumes of the ℝ^{n} more and more densely by the endpoints of vectors put into growing statistical samples.
- A volume measure and a vector norm.
- The number of points (end-points of vectors) per unit volume derived from the samples must vary relatively smoothly between neighboring volume elements and approach a continuous variation in the limit process to infinitely large samples and infinitesimal volume elements.

If we believe that an available *finite* statistical sample of vectors indeed represents an underlying continuous vector distribution, we may want to approximate the latter by (multidimensional) numerical interpolations of the discrete numbers of points or vectors per volume element retrieved from our samples. We can then compare the resulting curves for the probability density with theoretical curves of pdfs based on the properties of the continuous distributions of specific vector populations.

In the following sections we want to define some specific vector distributions and their probability density functions in mathematical terms.

## Random vectors and probability density functions

We take a *multivariate random vector* to be a multidimensional “vector” **S** having 1-dim statistical *random variables* *S*_{j} as components. A random variable *S*_{j} in our case is actually given by a function *S*_{j}: **Ω** → ℝ. *S*_{j} maps a set of objects taken from an underlying population to a statistical sample of *real numbers*. If applied multiple times, *S*_{j} creates *discrete* distributions of points in ℝ. In a sense *S*_{j} therefore *represents* statistical sample distributions in ℝ.

A random vector **S** maps elements of **Ω** to vectors. It is a *vector valued* function. It indirectly represents resulting statistical distributions of vectors in (potentially infinite) samples of (position) vectors defined on the ℝ^{n} and respective distributions of data points. For a finite sample of vectors we can in principle index all the comprised vectors and list them up:

*T* symbolizes the transpose operation.

**Correlated components of the random vector:** Note that the components of a vector **s**^{k} of a specific set may **not** be independent of each other. Meaning: For some sets and populations the components of the comprised vectors in the ℝ^{n} do **not** vary individually (aside from statistical fluctuations). If we have chosen a component value *s*^{k}_{j}, then the value *s*^{k}_{i} of another component may have to fulfill certain conditions depending on the values of the first or other components. In other words: The individual distributions represented by the various *S*_{j} can e.g. be (pairwise)

- independent of or dependent on each other,
- uncorrelated or correlated.

This holds for finite sets with discrete vectors or potentially infinite sets of vectors whose endpoints cover a sub-volume of the ℝ^{n} densely. For a distinction of (un-) *correlation* from (in-) *dependence* see below.

**Continuous distributions and probability densities in one dimension**

A concrete sample vector **s**^{k} ∈ ℝ^{n} of a vector distribution can be described by specific values *s*_{j}^{c} assumed by *variables* *s*_{j}, which we use below to flexibly describe vector components. The *s*_{j} determine the coordinates of respective data points. In the discrete case the *s*_{j} can only take on certain well defined values in ℝ (see above). For *discrete* distributions a concrete value *s*_{j}^{c} is assumed with a certain probability P(*s*_{j}^{c}). The factual probability given by a sample depends of course on the number of those objects in a set that reproduce *s*_{j}^{c} and on the total number of sample elements. If such a definition is too fine grained we may use probabilities defined on intervals a ≤ *s*_{j} ≤ b.

In case of an infinite population with a dense distribution of objects with smoothly varying properties, we can approach a continuous distribution of points in target regions of ℝ by creating larger and larger samples. In such a kind of limit process *S*_{j} may map properties of the objects in **Ω** onto a dense distribution of points in ℝ having a continuous 1-dimensional probability density – in the sense that the number of points in neighboring small intervals varies continuously.

Now let us look at a dense population with a distribution of objects with smoothly varying properties – and at extremely huge or infinite sets derived from it. If *s*_{j} in a limit process can take all values in ℝ and cover selected finite intervals [a, b] of ℝ densely, we may define a continuous **probability density** *p*_{j}(*s*_{j}) for infinitely small intervals within the finite intervals, such that the probability for *s*_{j} to assume values in an interval [a, b] is given by

\[ P(a \,\le\, s_j \,\le\, b) \:=\: \int_a^b p_j(s_j)\, ds_j \,. \]

*p*_{j}(*s*_{j}) characterizes a 1-dimensional distribution of a dense population represented by *S*_{j}. In some interesting cases the map of such a distribution and derived potentially infinite samples may cover large parts of ℝ or all of ℝ densely.

## Probability density of a random vector distribution

By making a limit transition to infinitely small volumes in the ℝ^{n}, we can extend the idea of a probability density to a random *vector* **S** representing some object population. The more elements we get into a statistical sample of vectors (based on picking from the population), the better will the differences in the numbers of vectors pointing into neighboring small volumes of equal size be described by the differences of the population’s probability density at the volumes’ centers.

In a limit process for an infinite population the endpoints of the sample’s position vectors would fill the ℝ^{n} or a sub-space V of it more and more densely. In the end we would get an infinitely dense distribution in the sense that *every* infinitesimal volume d*V* in the covered region V would have a vector pointing to it. For huge samples we could use fine-grained, but finite volume elements ΔV and count the number of vectors Δn_{V} pointing to each of these volume elements to define discrete data points *g*_{ΔV} = Δn_{V}/(N·ΔV) for an approximation of a continuous density function via interpolation.
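The recipe *g*_{ΔV} = Δn_{V}/(N·ΔV) can be sketched with a histogram over a regular grid of volume elements; the 2-dimensional standard normal sample below is just an assumed stand-in for real data:

```python
import numpy as np

# Numerical sketch of g_ΔV = Δn_V / (N · ΔV), assuming a 2-dim standard
# normal sample; bin counts over a regular grid approximate the density.
rng = np.random.default_rng(0)
N = 200_000
sample = rng.standard_normal((N, 2))

bins, lo, hi = 40, -4.0, 4.0
counts, edges = np.histogramdd(sample, bins=(bins, bins),
                               range=[(lo, hi), (lo, hi)])
dV = ((hi - lo) / bins) ** 2          # ΔV = Δs_1 · Δs_2
g = counts / N / dV                   # discrete density values g_ΔV

# central bin value vs. the theoretical density near the origin: 1 / (2π)
print(g[bins // 2, bins // 2], 1.0 / (2.0 * np.pi))
```

Interpolating such grid values is exactly the numerical approximation of a continuous pdf described above.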

Note that a precise definition of a “dense” vector population would require a measure for the distance between vectors **a** and **b**. We can formally achieve this by defining a norm for the difference vector ||**b** – **a**||. But I think it is clear what we mean: Two position vectors are close neighbors if and when their endpoints are close neighbors in the ℝ^{n}, in the sense that the distance between these points becomes very small or negligible.

If each of the components *s*_{j} of the vectors **s** can take all values in ℝ, we may describe the probability that the population represented by **S** contains vectors **s** pointing to a volume element ΔV by a *continuous* multidimensional **probability density function** [**pdf**] defined with respect to infinitesimal volume elements *dV*_{S} = *ds*_{1} *ds*_{2} … *ds*_{n}. Let us call the pdf for a continuous random vector *p*_{S}(*s*_{1}, *s*_{2}, …) and define the probability *P* of finding or creating vectors pointing into a finite volume:

Note that *p*_{S}(): ℝ^{n} → ℝ is a continuous function mapping vectors via their components to real values. For *p*_{S} to play its role as a probability density it must fulfill a *normalization condition* for the covered space *V* in the ℝ^{n} (which in some cases may be the infinite ℝ^{n} itself):

In terms of finite discrete distributions: Summing up the number of vectors pointing to discrete volume elements over all (hit) volume elements in the ℝ^{n} must give us the total number of all available vectors.

Note that *p*_{S} maps the ℝ^{n} to ℝ. A continuous *p*_{S} thereby defines a **hyper-surface** in the ℝ^{n+1}.

### Dependencies and correlations of the components of a random vector

An interesting question is whether and, if so, *how* *p*_{S}(**s**) depends on the 1-dimensional probability density functions *p*_{j}(*s*_{j}) **or** on the parameters [*params*] of these density functions:

The brackets **[]** denote a very complex relation in the general case; a direct analytical decomposition of *p*_{S}(**s**) into the *p*_{j}(*s*_{j}) may in general not be possible, or only after some sophisticated transformations. For MNDs we will at least be able to show how the parameters describing the densities *p*_{j}(*s*_{j}) constitute the functional form of *p*_{S}(**s**). And we will find transformations which allow for a functional decomposition.

Let us reverse our line of thought and assume that we have a given probability density function *p*_{S}(**s**) for a vector distribution (or have derived such a distribution by numerical interpolations between elements of a discrete set of sample vectors). Then we may be interested in the 1-dimensional densities of constituting random variables. We introduce the **probability density of a “marginal distribution”** as

I.e., we integrate *p*_{S}(**s**) completely over *n–1* coordinate directions of the ℝ^{n}, but leave out the 1-dimensional ℝ-subspace for *s*_{j}. Note that the integral covers all dependencies of the components of the vectors in the set we work with.

Note also that evaluating the integral does not mean that we can decompose *p*_{S}(**s**) easily into some analytical relation between the *p*_{j}(*s*_{j}). But for some special distributions we should at least be able to provide a relation of *p*_{j}(*s*_{j}) to *p*_{S}(**s**) mediated by a *conditional probability density* *p*_{|sj}() for the other component values, in the logical sense of a factorization of probabilities.

## Random vector composed of *independent* univariate normal distributions

Let us leave behind all the subtleties involved in the statistical mappings mediated by **S** onto potentially continuous data point distributions in the ℝ^{n}. We achieve this by discussing probability density functions. Let us simplify and focus on a bunch of *n* **independent** *1-dimensional* **normal** distributions with probability densities *g*_{j}(*w*_{j}). (Knowing that we derive the distributions via *n* mapping functions *W*_{j}.) For a 1-dimensional normal distribution the probability density is given by a Gaussian:

\[ g_j(w_j) \:=\: {1 \over \sigma_j\, \sqrt{2\pi}}\, \exp\left( -\, {\left(w_j \,-\, \mu_j\right)^2 \over 2\,\sigma_j^2} \right) \,. \]

_{j}The *μ _{j} *is the mean value and

*σ*is the square root of the variance of the

_{j}*W*distribution. We use the following notation to express that the result of

_{j}*W*belongs to the set of 1-dimensional normal distributions:

_{j}Regarding the parameters: \( \mathcal{N}\left(\mu_j, \sigma_j^2 \right) \) denotes a 1-dimensional normal distribution with a *mean value* \(\mu_j\) and a *variance* \(\sigma_j^2 \).

We use our bunch of *n* distributions *W*_{j} to compose a random vector **W** with *n* components

Due to the independence of the 1-dimensional components *W*_{j}, the resulting probability density function *g*_{w}(**w**) is just the *product* of the pdfs for the individual components:

**Σ**_{w} symbolizes a diagonal matrix with all the *σ*_{j}^{2}-values on the diagonal and zeroes elsewhere (see the next section for the appearance of this matrix). *μ*_{w} is a vector comprising all the *μ*_{j}-elements. We get a compact expression for *g*_{w}(**w**):

*w*For reasons that will become clear in the next post, we formally rewrite this function with the help of a *vector notation* to become:

The bullets in the exponent mark matrix multiplications (in the sense of Linear Algebra). The inverse of **Σ**_{w} is diagonal, too, and has the reciprocals 1/*σ*_{j}^{2} as coefficients.
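We can verify this numerically for assumed parameter values: the product of the univariate Gaussian densities coincides with the compact vector expression using the diagonal matrix **Σ**_{w} and its inverse:

```python
import numpy as np

# Sketch with assumed values: the product of the univariate Gaussian densities
# g_j(w_j) equals the compact vector form with a diagonal covariance matrix.
mu = np.array([1.0, -2.0, 0.5])          # assumed mean vector μ_w
sig = np.array([0.5, 1.5, 2.0])          # assumed standard deviations σ_j
Sigma = np.diag(sig**2)                  # diagonal matrix Σ_w

w = np.array([0.3, -1.0, 2.0])           # an arbitrary test vector

# product of the 1-dim Gaussian densities
g_prod = np.prod(np.exp(-0.5 * ((w - mu) / sig) ** 2) / (sig * np.sqrt(2.0 * np.pi)))

# vector notation: (2π)^(-n/2) |Σ_w|^(-1/2) exp(-1/2 (w-μ)ᵀ Σ_w⁻¹ (w-μ))
n = len(mu)
d = w - mu
g_vec = np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d) \
        / np.sqrt((2.0 * np.pi) ** n * np.linalg.det(Sigma))

print(g_prod, g_vec)    # identical up to floating point accuracy
```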

It is easy to prove by proper integration of *g*_{w}(**w**) that the marginal distributions *W*_{j} of **W** are just described by the probability densities *g*_{j}(*w*_{j}). It is also simple to show that the probability density *g*_{w}(**w**) fulfills the required normalization condition

*w*We introduce the notation of a **multivariate normal distribution of ****independent*** Gaussians* in the ℝ

^{n}as a distribution with a pdf

*g*

_{w}(

**) and thus having two parameters given by a mean vector**

*w*

*μ*_{w}and a diagonal matrix

**Σ**

_{w}as

## Standardized multivariate normal distributions composed of independent 1-dimensional Gaussians

Let us now move our Cartesian coordinate system of our ℝ^{n} such that we have \( \pmb{\mu}_w \,=\, \pmb{0} \).

Let us further scale the individual (centered) coordinates by the respective standard deviations, \( z_j \,=\, w_j / \sigma_j \).

Then we get a *centered* and *standardized* normal vector distribution [**SNVD**] **of independent Gaussians**:

\( \pmb{\mathcal{N}} \left( \pmb{0}, \, \pmb{\operatorname{I}} \right) \) denotes a random normal vector with a vanishing *mean vector* \(\pmb{\mu} = \pmb{0}\) and an (n x n) identity matrix \( \pmb{\operatorname{I}} \) as the so-called *covariance matrix* (see the next section). Its probability density has a simpler form, namely

**I** denotes the identity matrix of dimension (*n* x *n*). The exponent includes a scalar product between the vectors. In the future we will not write an extra dot between the horizontal and the vertical vectors: two matrices written next to each other symbolize a matrix multiplication in the sense of Linear Algebra – if the numbers of columns and rows fit.
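A numerical sketch of the centering and scaling step (parameter values assumed for illustration): standardizing a sample of independent Gaussian components yields a distribution with zero mean and a covariance matrix close to the identity.

```python
import numpy as np

# Sketch with assumed parameters: standardize a sample of independent
# Gaussian components and check mean ≈ 0 and covariance ≈ identity.
rng = np.random.default_rng(3)
mu = np.array([1.0, -2.0])
sig = np.array([0.5, 3.0])
w = rng.normal(mu, sig, size=(200_000, 2))

z = (w - mu) / sig       # centering and scaling per component

print(z.mean(axis=0))    # ≈ (0, 0)
print(np.cov(z.T))       # ≈ 2x2 identity matrix
```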

Why did the leading factor get so simple? Well, the density function re-scales with the coordinates because on the infinitesimal scale we must fulfill:

Because of *dz*_{j} = (1/σ_{j}) *dw*_{j}, the new volume element *dV*_{z} in *Z*-space becomes

The attentive reader recognizes the *Jacobi determinant* of the coordinate transformation. It eliminates one of the factors in the denominator of the leading term of *g*_{w}(**w**) during the transition to *g*_{z}(**z**).

For **Z** we obviously have:

It is easy to show that the pdf of ** Z** remains normalized:
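A quick numerical sketch of this normalization check for n = 2 (grid size and integration range are my own choices for the example):

```python
import numpy as np

# Numerical sketch: integrate the standardized 2-dim pdf g_z over a grid
# covering practically all of the probability mass; the result should be 1.
s = np.linspace(-8.0, 8.0, 801)
h = s[1] - s[0]
X, Y = np.meshgrid(s, s)
g_z = np.exp(-0.5 * (X**2 + Y**2)) / (2.0 * np.pi)

integral = g_z.sum() * h * h     # simple Riemann sum over the grid
print(integral)                  # ≈ 1
```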

## Expectation value and covariance of random vectors

We want to understand the role of the matrix **Σ**_{w} a bit better. Actually, this matrix is closely related to a formal extension of the “covariance” of two 1-dimensional random distributions to random vector distributions. But let us start with the expectation value of the random vector first.

The expectation vector of a random vector is just the vector composed of the expectation values of its (marginal) component distributions:

The covariance of a random vector results in a natural way by a generalization of the covariance of two 1-dim distributions *X* and *Y*:

A multidimensional generalization has to take into account all possible combinations (*S*_{k}, *S*_{j}). This indicates already that we need a matrix. A simple formal way to get expectation values of all pairwise combinations of components from a random vector **S** is to use a matrix product between **S** and **S**^{T}. This leads almost naturally to a matrix of expectation values:

Note the order of transposition in the definition of Cov! A (vertical) vector is combined with a transposed (horizontal) vector. The rules of a matrix multiplication then give you a *matrix* as the result! The expectation value has to be determined for every element of the matrix. Thus, the interpretation of the notation given above is:

Pick all pairwise combinations (*S*_{j}, *S*_{k}) of the component distributions. Calculate the covariance cov(*S*_{j}, *S*_{k}) of each pair and put it at the (j,k)-place inside the matrix.

Meaning:

The above matrix is the **(variance-) covariance matrix** of **S**, which we also abbreviate with **Σ**_{S}.
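As an illustrative sketch (the mixing matrix and the sample below are assumed, not taken from the post), the matrix of pairwise covariances can be built directly from centered outer products and compared with NumPy's estimator:

```python
import numpy as np

# Assumed example: build Cov(S) = E[(S - E[S])(S - E[S])^T] from a sample
# of a 2-dim random vector with correlated components.
rng = np.random.default_rng(11)
N = 100_000
A = np.array([[1.0, 0.0], [0.8, 0.6]])                 # mixing matrix → correlations
S = rng.standard_normal((N, 2)) @ A.T + np.array([1.0, 2.0])

mean = S.mean(axis=0)                                  # E[S], componentwise
D = S - mean
Cov = D.T @ D / (N - 1)      # matrix of cov(S_j, S_k) from outer products

print(Cov)
print(np.cov(S.T))           # NumPy's estimator gives the same matrix
```

The off-diagonal entries are clearly non-zero here – a first glimpse of the fingerprints that correlations leave in this matrix.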

As Wikipedia tells you, some people call it a bit differently. I refer to it later on just as the **covariance matrix** (of a random vector). Because the covariance of two independent distributions *X* and *Y* is zero,

we directly find

We have found an interpretation of the matrix appearing in the probability density *g*_{w}(**w**) of \(\pmb{W} \,\sim\, \pmb{\mathcal{N}}_n \left(\pmb{\mu}_{\small W},\, \pmb{\Sigma}_{\small W} \right) \):

The matrix **Σ**_{w}^{-1} appearing in the exponential function of *g*_{w}(**w**) for a normal random vector with independent Gaussians as components is the inverse of the variance-covariance matrix **Σ**_{w} = Cov(**W**).

*W*We expect that correlations between the 1-dimensional component distributions will leave their fingerprints in the coefficients of these matrices.

**An open question:** The attentive reader may now pose the following question: The matrix **Σ**_{w} = Cov(** W**) may be well defined for a general random vector. But how can we be sure that the inverse matrix always exists? This is a very good question – and we will have to deal with it in forthcoming posts on MNDs.

## Correlation and independence

Equipped with the above definitions we can now better distinguish the *independence* of individual distributions from their merely being *uncorrelated*. *Independence* means that the probability density function of a random vector factorizes into a product of the probability density functions for the component distributions.

Correlation instead refers to the covariance in the sense that some coefficients off the diagonal of the covariance matrix are not zero:
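A classic numerical illustration of the difference (my own example, not from the post): a component can be strictly dependent on another one and yet be uncorrelated with it:

```python
import numpy as np

# Sketch of "uncorrelated but dependent": y = x**2 is a deterministic function
# of x, yet cov(x, y) vanishes for a symmetric distribution of x.
rng = np.random.default_rng(5)
x = rng.standard_normal(500_000)
y = x**2                           # fully dependent on x

cov_xy = np.cov(x, y)[0, 1]
print(cov_xy)                      # ≈ 0 despite the strict dependence
```

So zero off-diagonal coefficients in the covariance matrix do not by themselves guarantee independence; the converse, however, holds: independence implies zero covariance.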

## Expectation value and covariance of standardized normal random vectors with *independent* components

For our centered and standardized normal distribution ** Z** we find:

## Conclusion and outlook

Objects with many quantifiable features can be described by vectors. *Multivariate random vectors* extend the idea of 1-dimensional, i.e. univariate *random* variables into the ℝ^{n}. A *random vector* represents a map of an object *population* as well as derived statistical samples and related statistical probabilities for the appearance of vectors with certain properties. Samples are created from the population by the repeated application of a defined process of statistically picking individual elements. By counting vectors with certain properties we arrive at probabilities. A random vector thus describes a probabilistic distribution of vectors and related data points. The component values of these vectors follow univariate probability distributions. The probabilities of the univariate distributions can be understood as marginal probabilities of the random vector.

A limit transition from finite sets of sample vectors to huge or even infinite sets fed from an underlying continuous population may lead to distributions of (position) vectors whose endpoints densely fill the ℝ^{n}. A limit process allows for the definition of a *probability density function* for random vector distributions. The probability density functions describing the distribution of vector endpoints in the ℝ^{n} can only in simple cases be decomposed into a simple combination of the 1-dimensional pdfs for the components. In general the marginal distributions may *not* be independent of each other or uncorrelated.

By using the expectation values of marginal distributions we can define *mean vectors* for such vector distributions in a straightforward way. The quantity corresponding to the variance of univariate distributions becomes, for a random vector distribution, a matrix – the *variance-covariance matrix*.

When the random vector is derived from **independent** *univariate normal distributions*, the covariance matrix becomes diagonal and the probability density of the vector distribution is given by a product of the univariate Gaussians for the marginal distributions. The probability density of a multivariate normal distribution **Z** composed of *independent* **and** *standardized* marginal Gaussians gets a particularly simple form, and the covariance matrix reduces to an (n x n) identity matrix.

In the next post of this series we will apply a linear transformation to our special distribution **Z**. This will be done by applying a matrix **A** to the **z**-vectors. We will try to find out how this impacts the probability density of the transformed distribution.