
Multivariate Normal Distributions – IV – Spectral decomposition of the covariance matrix and rotation of the coordinate system

In the preceding posts of this series we have considered a comprehensible definition and basic properties of a non-degenerate “Multivariate Normal Distribution” of vectors in the ℝn [N-MND]. In this post we make a step toward a numerical analysis of a given finite vector distribution with properties that indicate an underlying N-MND. We want to find an optimal Euclidean coordinate system [ECS] which allows for a simple representation and handling of the distribution’s probability density function [pdf].


Steps and results so far

In “post I” we represented a vector distribution by a “random vector“. We then described the probability density of a continuous vector distribution and considered random vectors based on independent Gaussians. In “post II” we defined an MND as the result of a linear transformation M applied to a special distribution of vectors whose component values varied according to independent and standardized Gaussian functions. We derived the functional form of an MND’s continuous pdf in a Euclidean coordinate system [ECS]. In the preceding “post III” we showed that the contour hyper-surfaces of the probability density are surfaces of multidimensional ellipsoids. For a general MND the main axes of these ellipsoids are rotated against the ECS-axes. We have understood that such a rotation reflects a correlation of the components of the random vector. In general the off-diagonal elements of an MND’s covariance matrix are not zero.

Objective of this post: Choose an optimal ECS for a given MND-like vector distribution

In this post we will look at MND features from a point of view which is relevant for the practical numerical analysis of assumedly normal distributions given in an ML-context. Our key question is: Can we find a special coordinate system in which the main axes of the (hopefully) ellipsoidal contour surfaces coincide with the ECS-axes? I.e., an ECS built on (abstract) coordinates in which the distributions of the component values de-correlate? Such an ECS would make our analysis significantly easier – in particular with respect to numerical methods.

The answer to the posed question is: Yes, we can. And we will see that finding a suitable ECS corresponds to solving an eigenvalue problem. We start by considering the algebraic representation of ellipsoids whose main axes are oriented in parallel to the axes of an ECS. Afterward we discuss a suitable decomposition (= factorization) of the symmetric covariance matrix Σ of an N-MND and of its inverse Σ-1. The combination will give us a method to determine the desired ECS.

We abbreviate the expression “Multivariate Normal Distribution” by either MND or, synonymously, MVN. Both abbreviations appear in the literature. We refer to the “variance-covariance matrix” of a random vector simply as its “covariance matrix”.

Main axes of a normal W-distribution of independent Gaussians

We work with vector distributions and related point distributions in the ℝn. Remember that we constructed a non-degenerate n-dimensional MND by applying an invertible linear transformation (plus a shift vector) onto a much simpler distribution with independent Gaussian distributions of the vector component values. We have symbolized such a basic distribution by a random vector W – and its centered standardized variant by Z. See posts I and II of this series.

\[ \pmb{W} \,\sim\, \pmb{\mathcal{N}}_n \left(\pmb{\mu}_{\small W},\, \pmb{\Sigma}_{\small W} \right), \, \quad \pmb{Z} \,\sim\, \pmb{\mathcal{N}}_n \left(\pmb{0},\, \pmb{\operatorname{I}} \right) \,, \]
\[ \mbox{with} \quad \pmb{\operatorname{\Sigma}}_{\small W} \, = \, diag \left(\, \sigma_1^2,\, \sigma_2^2, \cdots, \, \sigma_n^2 \, \right) ,\quad \mbox{and} \quad \pmb{\operatorname{\Sigma}}_{\small Z} \, = \, \pmb{\operatorname{I}} \,. \]

“diag” indicates a diagonal matrix and the σi2 represent the variances of the component distributions. Remember that the contour surfaces of the pdf of Z are surfaces of multidimensional spheres. The inverse of ΣW, ΣW-1, is a diagonal matrix, too, with the reciprocals of the variances, 1/σi2, as elements along its diagonal.

An important property of a W-distribution is that a constant Mahalanobis distance (see post III) for its vectors w defines the surface of an ellipsoid whose main axes indeed are oriented in parallel to the ECS axes. How can we conclude this from our basic formulas? Well, the standard definition of an ellipsoidal surface in n dimensions with the main axes of the ellipsoid oriented in parallel to the axes of the chosen ECS is given by an expression of the form

\[ \sum_{i=1}^n \, \left( {x_i \over a_i} \right)^2 \,=\, C = const. \]

with constant factors ai. When we “move” the value of C into the ai, the factors give us the lengths of the main half axes of the ellipsoids. Now compare this to the square of the Mahalanobis distance in an ECS centered with respect to the W-related MND, i.e. in an ECS where μW = 0:

\[ D_W (\pmb{w}) \,=\, \pmb{w}^T {\small \bullet} \, \pmb{\operatorname{\Sigma}}_{\small W}^{-1} \, {\small \bullet} \, \pmb{w} \, = \, \sum_{i=1}^n \, \left( {w_i \over \sigma_i} \right)^2 \,=\, C = const. \]

This is exactly the algebraic form required. What helped us is the fact that the inverse of the covariance matrix of W is a diagonal matrix. A W-distribution can easily be transformed into a standardized distribution Z with the help of a scaling diagonal matrix. So, we have good reason to believe that a given general non-degenerate MND is linearly related to a W-distribution with contours given by axis-parallel ellipsoids. But we apparently need a transition to an ECS in which the respective Σ-matrix and its inverse become diagonal.
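As a quick numerical illustration, here is a minimal numpy sketch (the σi values and the vector w are arbitrary example numbers) verifying that the quadratic form with a diagonal ΣW-1 reduces to the component-wise sum of squares:

```python
import numpy as np

# standard deviations of the independent Gaussian components (arbitrary example values)
sigmas = np.array([1.0, 2.0, 0.5])
Sigma_W = np.diag(sigmas**2)             # diagonal covariance matrix of W
Sigma_W_inv = np.diag(1.0 / sigmas**2)   # its inverse is diagonal, too

w = np.array([0.7, -1.2, 0.3])           # an arbitrary centered vector of the W-distribution

D_quadratic = w @ Sigma_W_inv @ w        # w^T Sigma_W^{-1} w
D_components = np.sum((w / sigmas)**2)   # sum_i (w_i / sigma_i)^2
print(np.isclose(D_quadratic, D_components))   # True
```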

From a given covariance matrix of an N-MND to a normal random vector with de-correlated components

So, let us try to reverse the considerations of previous posts. Let us assume that someone has given us a non-degenerate MND-distribution Y of vectors (assumedly) having a probability density like the g(y) we derived in post II (with μ being the mean vector):

\[ g(\pmb{y}) \,=\, {1 \over (2\pi)^{n/2} \, \left(\operatorname{det} \pmb{\operatorname{\Sigma}}\right)^{1/2} } \, \exp \left[ \, - \, {1 \over 2} \, \left( \pmb{y} - \pmb{\mu} \right)^T \, \pmb{\operatorname{\Sigma}}^{-1} \, \left( \pmb{y} - \pmb{\mu} \right) \right] \,. \]
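If you want to check the formula numerically, here is a minimal sketch (with an arbitrary example μ and Σ; scipy is used only for comparison) that evaluates g(y) directly:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -0.5])               # example mean vector (assumption)
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])           # example symmetric, positive-definite covariance matrix

def g_pdf(y, mu, Sigma):
    # direct evaluation of the MND density g(y)
    n = len(mu)
    diff = y - mu
    norm = (2.0 * np.pi)**(n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm

y = np.array([0.3, 0.2])
print(np.isclose(g_pdf(y, mu, Sigma),
                 multivariate_normal(mean=mu, cov=Sigma).pdf(y)))  # True
```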

By some magic we have also got the distribution’s covariance matrix Σ (or a numerical approximation of it). As we work in the ℝn, Σ is a symmetric, positive-definite (n x n)-matrix. We know from our construction of N-MNDs that Σ should factorize like

\[ \pmb{\operatorname{\Sigma}} = \pmb{\operatorname{M}} \, {\small \bullet} \, \pmb{\operatorname{M}}^T \,, \quad \mbox{for some} \,\, (n \, \operatorname{x} \, n) \mbox{-matrix} \,\,\pmb{\operatorname{M}}. \]

Can we find a well-defined and invertible matrix M leading us back to an underlying Z-like distribution based on independent Gaussians in all coordinate directions? More precisely: Is there a (numerical) method to derive the elements of such a matrix M from Σ? Obviously, we must find some well-defined factorization of Σ.

A problem you should be aware of is that due to our construction (see post II) M is not unique without further restrictions. Actually, it is unique only up to a multidimensional rotation, i.e. an orthogonal matrix. The reason is that a chosen Z-distribution can be rotated by any angle without changing any of our basic conditions for a non-degenerate MND. In other words: We can choose any rotated ECS with respect to Z to start with. This means that a well-defined method must refer to a specific ECS, which we must select by imposing some condition on the factorization of Σ. In the best case this restriction should have something to do with the de-correlation of the vectors’ component distributions. To achieve this, let us refer to the geometry of the pdf’s contour hyper-surfaces.

From a geometrical point of view a special ECS would be the one in which the orthogonal axes of the multi-dimensional ellipsoids, which define the pdf-contours of a N-MND, would be aligned with the coordinate axes of the ECS.

In such an ECS our MND-distribution would appear like a W-distribution composed of independent Gaussians for the distributions of the vector component values.

Spectral decomposition of the covariance matrix of a non-degenerate MND

We simplify our problem by moving the origin of our ECS to the center of the distribution of our MND vectors y, such that the MND’s mean vector μ becomes μ = 0.

\[ \pmb{Y} \,\sim\, \pmb{\mathcal{N}}_n \left(\pmb{0},\, \pmb{\operatorname{\Sigma}} \right) \,. \]

Let us call this specific ECS in which we describe the vectors y (given by the random vector Y) “ECSY“. We now use some theorems of Linear Algebra regarding matrix decomposition. A factorization of a given matrix is often possible in multiple ways.

Cholesky-decomposition?
In the case of a symmetric, positive-definite and real-valued matrix Σ it is tempting to pick the so-called “Cholesky decomposition” (see [3]). It tells us that such a matrix Σ can always be decomposed into a product K • KT of invertible triangular matrices with positive elements along the diagonal:

\[ \pmb{\operatorname{\Sigma}} \, = \, \pmb{\operatorname{K}} \, {\small \bullet } \, \pmb{\operatorname{K}}^T \, \quad \mbox{with} \,\, \pmb{\operatorname{K}} \,\, \mbox{being an upper or lower triangular matrix} . \]

This would give us the desired form. However, we cannot see any directly interpretable relation to a specific ECS and a diagonalization of Σ. We need to find a better suited decomposition.
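For completeness, a short numpy sketch of the Cholesky factorization (np.linalg.cholesky returns the lower-triangular factor; the example Σ is an arbitrary positive-definite matrix):

```python
import numpy as np

Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])           # arbitrary symmetric, positive-definite example matrix

K = np.linalg.cholesky(Sigma)            # lower-triangular factor with positive diagonal
print(np.allclose(K @ K.T, Sigma))       # True: Sigma = K K^T
```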

Spectral decomposition
Another decomposition, which is of more interest here, is the so-called “spectral decomposition“. You can read all about it in [3] (page 149). A short summary is: A symmetric matrix such as Σ can always be factorized and written as

\[ \pmb{\operatorname{\Sigma}} \: = \: \pmb{\operatorname{V}} \pmb{\operatorname{\Lambda}} \pmb{\operatorname{V}}^T \,=\, \pmb{\operatorname{V}} \pmb{\operatorname{\Lambda}}^{1/2} {\small \bullet} \,\, \pmb{\operatorname{\Lambda}}^{1/2} \, \pmb{\operatorname{V}}^T \,=\, \pmb{\operatorname{V}} \pmb{\operatorname{\Lambda}}^{1/2} \, {\small \bullet} \,\, \pmb{\operatorname{\Lambda}}^{1/2} \, \pmb{\operatorname{V}}^{-1} \]
\[ \mbox{with} \quad \pmb{\operatorname{\Lambda}}, \, \pmb{\operatorname{\Lambda}}^{1/2} \,\, \mbox{diagonal}, \quad \pmb{\operatorname{V}} \,\, \mbox{orthogonal} \,. \]

V is an orthogonal matrix whose columns are n orthogonal (or, after normalization, orthonormal) eigenvectors of Σ. Λ is a diagonal matrix with real values.

\[ \pmb{\operatorname{V}} \,=\, \left( \pmb{v}_1, \, \pmb{v}_2, \, \dots, \, \pmb{v}_n \right) , \quad \mbox{with} \,\, \pmb{\operatorname{V}} ^T = \pmb{\operatorname{V}} ^{-1} \,\,\, \mbox{and} \,\,\, \pmb{v}_i \, {\small \bullet } \, \pmb{v}_j \,=\, \delta_{i,j} \, || \pmb{v}_i ||^2 \, , \]
\[ \pmb{\operatorname{\Lambda}} \,=\, diag \left( \lambda_1, \, \lambda_2, \, …, \, \lambda_n \right) \,=\, \pmb{\operatorname{\Lambda}}^{1/2} {\small \bullet } \, \pmb{\operatorname{\Lambda}}^{1/2}, \,\, \mbox{with}\,\, \lambda_i \gt 0, \, \forall \,i \in [1, n] \,. \]

It follows that

\[ \pmb{\operatorname{\Sigma}}^{-1} \: = \: \left[ \, \pmb{\operatorname{V}} \pmb{\operatorname{\Lambda}} \pmb{\operatorname{V}}^T \, \right]^{-1} \,=\, \pmb{\operatorname{V}} \pmb{\operatorname{\Lambda}}^{-1} \pmb{\operatorname{V}}^T \,=\, \pmb{\operatorname{V}} \pmb{\operatorname{\Lambda}}^{-1} \pmb{\operatorname{V}}^{-1} \,=\, \pmb{\operatorname{V}} \pmb{\operatorname{\Lambda}}^{-1/2}\, {\small \bullet} \,\, \pmb{\operatorname{\Lambda}}^{-1/2} \, \pmb{\operatorname{V}}^{-1} \,. \]

The first positive point with respect to our objective is that V‘s column vectors are orthogonal eigenvectors of Σ. Such vectors can be found for any real symmetric matrix by well-established numerical methods. The other positive point is that Λ is diagonal and contains the respective positive eigenvalues λi. From Linear Algebra we know that all eigenvalues λi of a real symmetric and positive-definite matrix are real and that λi > 0 (see e.g. [4]). Λ1/2 contains the square roots of the eigenvalues on its diagonal. Λ-1 contains the values 1/λi on its diagonal. Λ actually represents Σ in a rotated coordinate system (see below).

Note that we can always normalize the eigenvectors; the eigenvalues are not affected by such a rescaling. So, V can be chosen to be an orthonormal matrix (with ||vi|| = 1). Then the eigenvectors of Σ can be regarded as unit vectors of a special Euclidean coordinate system. In addition the sign of one of the eigenvectors can always be chosen such that the determinant of V becomes +1. This is good, too, because then we can interpret V and its inverse as rotation matrices (see below).
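Numerically, the spectral decomposition of a symmetric matrix can be obtained e.g. with numpy’s eigh-routine, which returns real eigenvalues in ascending order and orthonormal eigenvectors as the columns of V. The following minimal sketch (with an arbitrary example Σ) also flips the sign of one eigenvector, if necessary, to enforce det V = +1:

```python
import numpy as np

Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])                # arbitrary example covariance matrix

lambdas, V = np.linalg.eigh(Sigma)            # real eigenvalues (ascending), orthonormal eigenvectors as columns
if np.linalg.det(V) < 0:
    V[:, 0] = -V[:, 0]                        # flip one eigenvector to make V a proper rotation (det = +1)

Lambda = np.diag(lambdas)
print(np.allclose(V @ Lambda @ V.T, Sigma))   # True: Sigma = V Lambda V^T
print(np.allclose(V.T @ V, np.eye(2)))        # True: V is orthonormal, V^T = V^{-1}
print(np.all(lambdas > 0))                    # True for a positive-definite Sigma
```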

It is easy to show that an eigenvector ye of Σ with eigenvalue λe is also an eigenvector of Σ-1, but for the eigenvalue 1/λe:

\[ \pmb{\operatorname{\Sigma}} \, \pmb{y}_e \,=\, \lambda_e \, \pmb{y}_e \quad \Rightarrow \quad \pmb{\operatorname{\Sigma}}^{-1} \, \pmb{\operatorname{\Sigma}} \, \pmb{y}_e \,=\, \lambda_e \, \pmb{\operatorname{\Sigma}}^{-1} \, \pmb{y}_e \quad \Rightarrow \quad \pmb{y}_e \,=\, \lambda_e \, \pmb{\operatorname{\Sigma}}^{-1} \, \pmb{y}_e \\ \Rightarrow \,\, \pmb{\operatorname{\Sigma}}^{-1} \, \pmb{y}_e \,=\, {1\over \lambda_e} \,\pmb{y}_e \]

As Σ has full rank n, V has full rank, too. However, V is not symmetric. (M isn’t either!) A spectral decomposition is a special case of a so-called eigendecomposition.

Orthonormal matrices represent rotations

Note: Angles and scalar products between vectors w1, w2 transformed by V are preserved due to the properties of orthogonal matrices.

\[ \mbox{ECS}_{\small Y} : \,\, \pmb{y}_1^T \, {\small \bullet } \, \pmb{y}_2 \,=\, \left( \pmb{\operatorname{V}} \pmb{w}_1 \right)^T\, \pmb{\operatorname{V}} \pmb{w}_2 \,=\, \pmb{w}_1^T \, \pmb{\operatorname{V}}^{-1} \, {\small \bullet } \, \pmb{\operatorname{V}} \pmb{w}_2 \,=\, \pmb{w}_1^T \, {\small \bullet } \, \pmb{w}_2 \,. \]

And for a matrix By = VBVT we find

\[ \mbox{ECS}_{\small Y} : \,\, \pmb{y}^T \pmb{\operatorname{B}}_y \, \pmb{y} \,=\, \left( \pmb{\operatorname{V}} \pmb{w} \right)^T \, {\small \bullet } \, \pmb{\operatorname{V}} \pmb{\operatorname{B}} \pmb{\operatorname{V}}^T \, {\small \bullet } \, \pmb{\operatorname{V}} \pmb{w} \,=\, \pmb{w}^T \pmb{\operatorname{V}}^{-1} {\small \bullet } \, \pmb{\operatorname{V}} \pmb{\operatorname{B}} \pmb{\operatorname{V}}^{-1} \, {\small \bullet } \, \pmb{\operatorname{V}} \pmb{w} \,=\, \pmb{w}^T \pmb{\operatorname{B}} \pmb{w} \,. \]
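A quick numerical check of these invariances (a 2-dimensional rotation matrix serves as an example of an orthonormal V; the vectors and the matrix B are arbitrary):

```python
import numpy as np

phi = 0.3                                     # arbitrary rotation angle
V = np.array([[np.cos(phi), -np.sin(phi)],
              [np.sin(phi),  np.cos(phi)]])   # orthonormal 2D rotation matrix, det = +1

w1 = np.array([1.0, 2.0])                     # arbitrary test vectors
w2 = np.array([-0.5, 0.7])
B = np.array([[1.5, 0.2],
              [0.2, 0.9]])                    # arbitrary test matrix

print(np.isclose((V @ w1) @ (V @ w2), w1 @ w2))              # True: scalar product is preserved
B_y = V @ B @ V.T
print(np.isclose((V @ w1) @ B_y @ (V @ w1), w1 @ B @ w1))    # True: quadratic form is preserved
```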

The geometrical meaning is that an orthogonal matrix represents a rotation of vectors in an ECS (in three dimensions: by an angle φ around some axis given by a vector r).

However, an orthonormal matrix O with determinant +1 can also be interpreted such that it gives us the components of a vector in a new coordinate system ECSW rotated in the opposite direction (-φ) against the original coordinate system ECSY. The elements of a matrix B transform during a transition from ECSY to ECSW as O B OT. The other way round, O-1 can be interpreted to give the coordinates of a given vector y in an ECSW rotated by +φ.

V, in particular, represents a rotation of an ECSW, whose axes are aligned with the orthogonal eigenvectors of Σ, onto ECSY. The inverse matrix V-1 thus determines the component values of vectors y in an ECSW with axes parallel to these eigenvectors. Therefore, our matrix Σ = V Λ VT is a representation of Λ in the rotated ECSY. Or, if you like to see it the other way round, Λ represents our Σ in ECSW.

\[ \mbox{ECS}_W \sim \pmb{\operatorname{V}} {\small \bullet} \mbox{ ECS}_Y \]

The Mahalanobis distance in terms of spectral decomposition matrices

Let us combine our insights. We decide to choose a special M = MS as indicated by the spectral decomposition

\[ \pmb{\operatorname{M}} = \pmb{\operatorname{M}}_S \,=\, \pmb{\operatorname{V}} \, \pmb{\operatorname{\Lambda}}^{1/2} \, \quad \Rightarrow \, \quad \pmb{\operatorname{M}}_S^{-1} \,=\, \pmb{\operatorname{\Lambda}}^{-1/2} \, \pmb{\operatorname{V}}^{-1} \, , \]

and find out, how far we get with this. First we have

\[ \pmb{\operatorname{\Sigma}} \,=\, \pmb{\operatorname{M}}_S \, {\small \bullet} \, \pmb{\operatorname{M}}_S^T \quad \Rightarrow \quad \pmb{\operatorname{\Sigma}}^{-1} \,=\, \left(\pmb{\operatorname{M}}_S^T\right)^{-1} {\small \bullet} \,\, \pmb{\operatorname{M}}_S^{-1} \,=\, \pmb{\operatorname{V}} \pmb{\operatorname{\Lambda}}^{-1/2}\, {\small \bullet} \,\, \pmb{\operatorname{\Lambda}}^{-1/2} \, \pmb{\operatorname{V}}^{-1} \,. \]

Let us write down the squared Mahalanobis distance for a vector y of a non-degenerate Y-MND:

\[ \pmb{y}^T \, {\small \bullet} \, \pmb{\operatorname{\Sigma}}^{-1} \, {\small \bullet} \, \pmb{y} \,=\, \pmb{y}^T \, {\small \bullet} \, \left(\pmb{\operatorname{M}}_S^T\right)^{-1} {\small \bullet} \,\, \pmb{\operatorname{M}}_S^{-1} \, {\small \bullet} \, \pmb{y} \,=\, \pmb{y}^T \, {\small \bullet} \, \left[ \pmb{\operatorname{V}} \pmb{\operatorname{\Lambda}}^{-1/2}\, {\small \bullet} \,\, \pmb{\operatorname{\Lambda}}^{-1/2} \, \pmb{\operatorname{V}}^{-1} \right] \, {\small \bullet} \,\, \pmb{y} \, . \]

Thus, with y = MS z, we can fulfill an essential condition of our construction of a non-degenerate MND:

\[ \pmb{y}^T \, {\small \bullet} \, \pmb{\operatorname{\Sigma}}^{-1} \, {\small \bullet} \, \pmb{y} \,=\, \pmb{y}^T \, {\small \bullet} \, \left[ \pmb{\operatorname{V}} \pmb{\operatorname{\Lambda}}^{-1/2}\, {\small \bullet} \,\, \pmb{\operatorname{\Lambda}}^{-1/2} \, \pmb{\operatorname{V}}^{-1} \right] \, {\small \bullet} \,\, \pmb{y} \, = \, \pmb{z}^T \, {\small \bullet} \, \pmb{z} \,, \\ \quad \mbox{with} \,\, \pmb{z} \, =\, \pmb{\operatorname{\Lambda}}^{-1/2} \, \pmb{\operatorname{V}}^{-1} \, {\small \bullet} \,\, \pmb{y} . \]
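A minimal numpy sketch of these relations (with an arbitrary example Σ): we build MS = V Λ1/2 from the spectral decomposition, verify Σ = MS MST, and check that z = Λ-1/2 V-1 y indeed yields zT z = yT Σ-1 y:

```python
import numpy as np

Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])                  # arbitrary example covariance matrix
lambdas, V = np.linalg.eigh(Sigma)
Lam_sqrt = np.diag(np.sqrt(lambdas))            # Lambda^{1/2}
Lam_inv_sqrt = np.diag(1.0 / np.sqrt(lambdas))  # Lambda^{-1/2}

M_S = V @ Lam_sqrt                              # M_S = V Lambda^{1/2}
print(np.allclose(M_S @ M_S.T, Sigma))          # True: Sigma = M_S M_S^T

y = np.array([0.4, -1.1])                       # an arbitrary centered vector
z = Lam_inv_sqrt @ V.T @ y                      # z = Lambda^{-1/2} V^{-1} y   (V^{-1} = V^T)
print(np.isclose(z @ z, y @ np.linalg.inv(Sigma) @ y))   # True: z^T z equals y^T Sigma^{-1} y
```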

What does this all mean geometrically?

Recreation of the given N-MND from a Z-distribution

Let us first describe the creation of Y in ECSY for given Σ, Λ and V. The elementary operation MS z used to construct an MND obviously consists of two steps:

Step 1: Pick a (spherically symmetric) Z-distribution of vectors and stretch all vector components by the square roots of the respective positive eigenvalues of Σ, i.e. transform our z-vectors by

\[ \pmb{w} \,=\, \pmb{\operatorname{\Lambda}}^{1/2} \, {\small \bullet} \, \, \pmb{z}, \, \quad \mbox{with} \,\, \pmb{w} \,\,\mbox{given by} \,\, \pmb{W} \,\sim\, \pmb{\mathcal{N}}_n \left(\pmb{0},\, \pmb{\operatorname{\Lambda}} \right) \]

This obviously transforms the spheres of equal probability density of Z into ellipsoidal surfaces with the main axes of the ellipsoids being oriented along the ECSY-axes. These w-vectors can be interpreted as elements of a distribution given by a centered normal random vector W with independent Gaussian components and variances λi.

Step 2: Pick the vectors w of W and rotate them via an orthonormal V (defined by the eigenvectors of Σ, arranged in the same order as the respective eigenvalues in Λ). Choose the sign of the eigenvectors such that V becomes a rotation (det V = +1).

\[ \pmb{y} \,=\, \pmb{\operatorname{V}} \, {\small \bullet} \, \pmb{w}, \, \quad \mbox{with} \,\, \pmb{y} \,\,\mbox{given by} \,\, \pmb{Y} \,\sim\, \pmb{\mathcal{N}}_n \left(\pmb{0},\, \pmb{\operatorname{\Sigma}} \right) \,\, . \]
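The two steps translate directly into a small sampling sketch (numpy; the example Σ and the sample size are arbitrary): draw z-vectors from a standard normal distribution, stretch them with Λ1/2 and rotate them with V. The sample covariance of the resulting y-vectors should then approximate Σ:

```python
import numpy as np

rng = np.random.default_rng(42)
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])                   # arbitrary example target covariance
lambdas, V = np.linalg.eigh(Sigma)

Z = rng.standard_normal((100_000, 2))            # samples of Z ~ N(0, I), one vector per row
W = Z * np.sqrt(lambdas)                         # step 1: w = Lambda^{1/2} z (component-wise stretching)
Y = W @ V.T                                      # step 2: y = V w (row-vector convention)

print(np.round(np.cov(Y, rowvar=False), 2))      # approximately Sigma
```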

Reverse order: From Σ to V-matrices and the respective W- and Z-distributions

We now revert the whole process for the analysis of a given Y which seems to have properties of a N-MND in ECSY. We follow three steps:

Step A – determination of eigenvectors: In ECSY we first determine the variance-covariance matrix Σ and its inverse (e.g. by numerical methods). We then calculate the n orthonormal eigenvectors of Σ (which are also eigenvectors of Σ-1). Afterward we build a matrix V by using the eigenvectors as columns of this matrix. We choose the signs of the eigenvectors such that V defines a rotation (det V = +1). The eigenvectors define unit vectors along the axes of a new Euclidean coordinate system ECSW. We organize the respective eigenvalues λ1, λ2, …, λn in a matrix Λ in the same order as we positioned the eigenvectors as columns in the matrix V.

Step B – Rotation of the coordinate system: We now rotate ECSY by V such that it coincides with a new ECSW. The components of a vector y in ECSW are equal to the components of the following vector w in ECSY (!):

\[ \pmb{w} \,=\, \pmb{\operatorname{V}}^{-1} \, {\small \bullet} \, \pmb{y} \]

I.e., the inverse matrix V-1 gives us the components of y in ECSW.

If the distribution Y really were an N-MND, then V-1 would transform the contour ellipsoids of equal probability density into axis-parallel ellipsoids in ECSW. We can see this via a transformation of the square of the Mahalanobis distance by V-1:

\[ \begin{align} \pmb{y}^T \, {\small \bullet} \, \pmb{\operatorname{\Sigma}}^{-1} \, {\small \bullet} \, \pmb{y} \,&=\, \pmb{y}^T \, {\small \bullet} \, \left[ \pmb{\operatorname{V}} \pmb{\operatorname{\Lambda}}^{-1/2}\, {\small \bullet} \,\, \pmb{\operatorname{\Lambda}}^{-1/2} \, \pmb{\operatorname{V}}^{-1} \right] \, {\small \bullet} \,\, \pmb{y} \, \\ &=\, \pmb{y}^T \, {\small \bullet} \, \left( \pmb{\operatorname{V}} \pmb{\operatorname{V}}^{-1} \right) \, {\small \bullet} \, \left[ \pmb{\operatorname{V}} \pmb{\operatorname{\Lambda}}^{-1/2}\, {\small \bullet} \,\, \pmb{\operatorname{\Lambda}}^{-1/2} \, \pmb{\operatorname{V}}^{-1} \right] \, {\small \bullet} \left( \pmb{\operatorname{V}} \pmb{\operatorname{V}}^{-1} \right) \, {\small \bullet} \,\, \pmb{y} \, \\ &=\, \left( \pmb{y}^T \, \pmb{\operatorname{V}} \right) \, {\small \bullet} \, \pmb{\operatorname{I}} \, \, {\small \bullet} \, \pmb{\operatorname{\Lambda}}^{-1} \, {\small \bullet} \,\pmb{\operatorname{I}} \, \, {\small \bullet} \, \left( \pmb{\operatorname{V}}^{-1} \, \pmb{y} \right) \, \\ &=\, \left( \pmb{\operatorname{V}}^{-1} \, \pmb{y} \right)^T \, {\small \bullet} \,\, \pmb{\operatorname{\Lambda}}^{-1} \, {\small \bullet} \,\, \left( \pmb{\operatorname{V}}^{-1} \, \pmb{y} \right) \\ &=\, \pmb{w}^T {\small \bullet} \,\, \pmb{\operatorname{\Lambda}}^{-1} \, {\small \bullet} \,\, \pmb{w} . \end{align} \]

So, Λ-1 indeed represents Σ-1 in ECSW. If Y really were an N-MND, the diagonal form would guarantee that the main axes of the transformed ellipsoidal hyper-surfaces were aligned with ECSW‘s coordinate axes. Furthermore, the variances of the “de-correlated” Gaussians of the components would just be given by the eigenvalues λ1, λ2, …, λn.

However, if Y did not have the properties of an MND, then we would not get axis-parallel ellipsoids in ECSW. The difference from the theoretical ellipsoids is something we can investigate numerically.

Step C – Scale to get (or not get) a spherically symmetric distribution: As soon as we have our W-distribution we can rescale it by applying Λ-1/2. In the case of an original N-MND Y this would now give us a spherically symmetric distribution Z.
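Putting the three steps together, here is a minimal analysis sketch in numpy. The sample data for Y are generated artificially, just to have something to work on; in a real analysis they would be given to us:

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma_true = np.array([[2.0, 0.8],
                       [0.8, 1.0]])              # "unknown" covariance behind the artificial data
Y = rng.multivariate_normal(mean=np.zeros(2), cov=Sigma_true, size=100_000)

# Step A: estimate Sigma and determine orthonormal eigenvectors / eigenvalues
Sigma_est = np.cov(Y, rowvar=False)
lambdas, V = np.linalg.eigh(Sigma_est)
if np.linalg.det(V) < 0:
    V[:, 0] = -V[:, 0]                           # enforce det V = +1

# Step B: rotate into ECS_W -> the components de-correlate
W = Y @ V                                        # each row: w = V^{-1} y  (V^{-1} = V^T)
print(np.round(np.cov(W, rowvar=False), 2))      # approximately diag(lambda_1, lambda_2)

# Step C: rescale with Lambda^{-1/2} -> approximately spherical Z-distribution
Z = W / np.sqrt(lambdas)
print(np.round(np.cov(Z, rowvar=False), 2))      # approximately the identity matrix
```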

A comment on the meaning of a rotated coordinate system in ML-contexts

The coordinate system we choose to work with in an ML context is typically given by some predefined set of variables – either corresponding directly to the properties of objects we work with or to already abstract orthogonal coordinates of the latent space of an ML algorithm (like e.g. an Autoencoder). When you move to a rotated ECS you should be very clear about one thing:

The new coordinates will be abstract ones. They (most often) have no direct interpretation in terms of the original properties of the objects we apply an ML-algorithm to.

The correlation of the original (natural) properties does not disappear by some magic when we go over to abstract coordinates via a rotation of the ECS. And: Even in a coordinate system with axis-parallel ellipsoidal contours of an N-MND the ratios of the lengths of the main axes of the ellipsoids have fixed values. These ratios do not disappear via a rotation.

Conclusion

In this post we have seen that we can use the variance-covariance matrix of an N-MND to determine a coordinate system in which the main axes of the ellipsoidal contour hyper-surfaces align with the coordinate axes. Did this remind you of a method often used in the context of classic ML-methods? Probably it did: You may have thought of PCA. However, before we get there, I want to present a more general definition of an MND in the next post. This will also bring us closer to the topic of how to include and justify degenerate MNDs.