For a better understanding of ML experiments regarding a generator of human faces based on a convolutional autoencoder, we need an understanding of multivariate and bivariate normal distributions and their probability densities.
This post is about the probability density function [pdf] of a bivariate normal distribution of two correlated Gaussian random variables X and Y. Most derivations of the mathematical form of this two-dimensional function start from a random vector composed of independent 1-dimensional Gaussian (= normal) distributions and apply a linear transformation to such a vector. With this post, I want to motivate the general functional form of the probability density in a different way, namely by symmetry arguments and a factorization approach.
In a later post I will derive the general form by applying a linear transformation to the random vector (X, Y)^T of two independent Gaussian distributions.
Assumption – the marginal distributions are normalized 1-dimensional normal distributions
Let us name the probability density function of a bivariate normal distribution g2(x, y) – and that of a centered one g2c(x, y). x and y are concrete values which the components X and Y of a random vector (X, Y)^T may assume. We want to derive the form of g2c(x, y) and g2(x, y). X and Y may be correlated.
Centered distributions: We work in a 2-dimensional Cartesian coordinate system [CCS]. Without losing much generality, we center the density distributions such that their mean values are zero:
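\[ \mu_x \,=\, \mu_y \,=\, 0 \,. \]
In addition, and in line with the heading above, we demand that the marginals of g2c(x, y) are normalized 1-dimensional Gaussians:
\[ g_{xc}(x) \,=\, \int_{-\infty}^{\infty} g_{2c}(x, y) \, dy \,=\, \frac{1}{\sqrt{2\pi} \, \sigma_x} \, \exp\left( -\, \frac{x^2}{2 \sigma_x^2} \right) \,, \quad g_{yc}(y) \,=\, \int_{-\infty}^{\infty} g_{2c}(x, y) \, dx \,=\, \frac{1}{\sqrt{2\pi} \, \sigma_y} \, \exp\left( -\, \frac{y^2}{2 \sigma_y^2} \right) \,. \tag{1} \]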
These assumptions can at this point be regarded as conditions imposed on g2c(x, y). The motivation is 2-fold:
We want our bivariate normal distribution to be composed of correlated 1-dimensional (= univariate) normal Gaussian distributions. At some point we have to impose respective conditions on g2c(x, y). We take it that the marginals represent the basic Gaussians X and Y. Note, however, that we do not assume that X and Y are independent. How the correlation can be expressed will be investigated in further posts. See e.g. here and here.
For the case of independent X and Y we need to reproduce g2c(x, y) = gxc(x) * gyc(y). See below.
To be on the safe side, we will later check that the imposed conditions will be reproduced by our final combined pdf g2c(x, y).
Side remark: In other posts of this blog we will later show that it is a general property of (regular) Multivariate Normal (= Gaussian) distributions [MVDs] that all of their marginals are Multivariate Normal distributions of lower dimension. In the extreme case of a 1-dimensional marginal of a general MVD we arrive at a simple Gaussian probability density function.
Equs. (1) already indicate some symmetry of g2c(x, y) regarding the dependencies on x and y.
Probability density of conditional distributions
We can look at the whole thing from the perspective of conditional probabilities. Let us denote the conditional probability density for Y taking a value y under the condition that X has the value x as cy(y|x). For the conditional probability density for X taking a value x under the condition Y = y we analogously write cx(x|y).
Due to the meaning of conditional probabilities, we have:
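\[ g_{2c}(x, y) \,=\, c_y(y \mid x) \cdot g_{xc}(x) \,=\, c_x(x \mid y) \cdot g_{yc}(y) \,. \]
This suggests a factorization approach with a yet unknown coupling function f(x, y):
\[ g_{2c}(x, y) \,=\, g_{xc}(x) \cdot g_{yc}(y) \cdot f(x, y) \,, \quad \mbox{i.e.} \quad c_y(y \mid x) \,=\, g_{yc}(y) \, f(x, y) \,, \quad c_x(x \mid y) \,=\, g_{xc}(x) \, f(x, y) \,. \tag{6} \]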
Due to symmetry reasons we could already now assume a symmetry of f(x, y) in the sense that f(x, y) = f(y, x). But let us wait a bit until we get further indications from other relations.
Why does the factorization in (6) make sense? Well, in the case of independent distributions X and Y we would have to fulfill:
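\[ g_{2c}(x, y) \,=\, g_{xc}(x) \cdot g_{yc}(y) \,, \quad \mbox{i.e.} \quad f(x, y) \,=\, 1 \,. \]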
So, if we can make f(x, y) dependent on the correlation coefficient ρ or, equivalently, on the covariance cov(X, Y) such that it becomes 1 for ρ = 0, we would be able to reproduce independence in a simple manner.
Guessing the form of f(x, y) from further conditions
The marginal distributions must fulfill (1) and should in addition result from (6) by integration:
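\[ \int_{-\infty}^{\infty} g_{xc}(x) \, g_{yc}(y) \, f(x, y) \, dy \,=\, g_{xc}(x) \quad \Longrightarrow \quad \int_{-\infty}^{\infty} g_{yc}(y) \, f(x, y) \, dy \,=\, \int_{-\infty}^{\infty} c_y(y \mid x) \, dy \,=\, 1 \,. \]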
This means nothing else than that the density function cy(y|x) of the conditional distribution Y|X must be normalized, too. The same holds for X|Y and cx(x|y):
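\[ \int_{-\infty}^{\infty} c_x(x \mid y) \, dx \,=\, \int_{-\infty}^{\infty} g_{xc}(x) \, f(x, y) \, dx \,=\, 1 \,. \]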
How could we make this come true? Well, if the conditional distributions were shifted Gaussians themselves, we could get this to work. The reason is the following:
If we could bring e.g. cy(y|x) into a fully quadratic form like
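\[ c_y(y \mid x) \,=\, \frac{1}{\sqrt{2\pi} \, \sigma_{yx}} \, \exp\left( -\, \frac{\left( y \,-\, m(x) \right)^2}{2 \, \sigma_{yx}^2} \right) \,, \]
with some shift m(x) depending only on x (the name m(x) is just a placeholder here), and analogously cx(x|y) with a width σxy, then the normalization would be fulfilled automatically: a shifted, normalized Gaussian integrates to 1 for any value of the shift.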
Note that σxy and σyx must be constants – independent of the respective x and y values!
Our approach to fulfill normalization means that f(x, y) must provide fitting terms to complete the exponents to fully quadratic expressions. What does this in turn mean for our yet unknown function f(x, y)?
f(x, y) must provide squares in x and y as well as some term containing x*y in the exponent. We also must get some symmetry in f(x, y) regarding x and y. Taking all this into account, we try the simplest approach, namely restricting ourselves to quadratic terms only:
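\[ f(x, y) \,=\, C \cdot \exp\left( a \, x^2 \,+\, b \, x y \,+\, c \, y^2 \right) \,, \]
with constants C, a, b, c still to be determined. Fixing these constants by the marginal conditions (1) and the normalization of the conditional densities leads to the well-known form of the probability density of a centered bivariate normal distribution:
\[ g_{2c}(x, y) \,=\, \frac{1}{2\pi \, \sigma_x \sigma_y \, \sqrt{1 - \rho^2}} \, \exp\left( -\, \frac{1}{2 \, (1 - \rho^2)} \left[ \frac{x^2}{\sigma_x^2} \,-\, 2 \rho \, \frac{x y}{\sigma_x \sigma_y} \,+\, \frac{y^2}{\sigma_y^2} \right] \right) \,. \]
For ρ = 0 the coupling factor f(x, y) indeed becomes 1 and g2c(x, y) factorizes into the product of its two marginal Gaussians.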
Vector form and relation to the inverse of the variance-covariance matrix
For those of my readers who are used to vector distributions and respective matrix operations, let us define a random vector V containing X and Y and further vectors:
\[ \pmb{V} = \begin{pmatrix} X \\ Y \end{pmatrix}, \quad \mbox{concrete values}: \, \pmb{v} = \begin{pmatrix} x \\ y \end{pmatrix}, \quad \pmb{\mu} = \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \quad \pmb{v}_{\mu} = \begin{pmatrix} x - \mu_x \\ y - \mu_y \end{pmatrix} \,.
\]
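With these definitions the density derived above can be written in a compact vector form:
\[ g_2(\pmb{v}) \,=\, \frac{1}{2\pi \, \sqrt{\det \pmb{\Sigma}}} \, \exp\left( -\, \frac{1}{2} \, \pmb{v}_{\mu}^{\,T} \, \pmb{\Sigma}^{-1} \, \pmb{v}_{\mu} \right) \,, \]
where for our centered case g2c we simply have vμ = v.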
T symbolizes the transposition operation. The reader may recognize Σ^-1 as the inverse of the variance-covariance matrix Σ, and ρ as the correlation coefficient coupling X and Y:
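\[ \pmb{\Sigma} \,=\, \begin{pmatrix} \sigma_x^2 & \operatorname{cov}(X, Y) \\ \operatorname{cov}(X, Y) & \sigma_y^2 \end{pmatrix} \,=\, \begin{pmatrix} \sigma_x^2 & \rho \, \sigma_x \sigma_y \\ \rho \, \sigma_x \sigma_y & \sigma_y^2 \end{pmatrix} \,, \quad \rho \,=\, \frac{\operatorname{cov}(X, Y)}{\sigma_x \, \sigma_y} \,, \]
\[ \pmb{\Sigma}^{-1} \,=\, \frac{1}{\sigma_x^2 \, \sigma_y^2 \, (1 - \rho^2)} \, \begin{pmatrix} \sigma_y^2 & -\rho \, \sigma_x \sigma_y \\ -\rho \, \sigma_x \sigma_y & \sigma_x^2 \end{pmatrix} \,. \]
Writing out the quadratic form in the exponent with these matrix elements indeed reproduces the exponent of g2c(x, y) given above.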
As I am a bit lazy, let us check the marginal conditions (1) by performing the integration for the special case σx = σy = 1. The essential steps of the integration become obvious also for this simplification. I leave it to the reader to perform the integration for the general case. The trick is a kind of substitution similar to the one which we have performed already for the conditional probability densities. We rewrite the exponent as follows:
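\[ -\, \frac{1}{2\,(1 - \rho^2)} \left( x^2 \,-\, 2 \rho \, x y \,+\, y^2 \right) \,=\, -\, \frac{x^2}{2} \,-\, \frac{\left( y \,-\, \rho \, x \right)^2}{2\,(1 - \rho^2)} \,. \]
The integration over y then reduces to the integral over a shifted Gaussian:
\[ \int_{-\infty}^{\infty} g_{2c}(x, y) \, dy \,=\, \frac{1}{2\pi \sqrt{1 - \rho^2}} \, e^{-x^2/2} \int_{-\infty}^{\infty} e^{ -\, (y - \rho x)^2 / \left( 2 (1 - \rho^2) \right) } \, dy \,=\, \frac{1}{\sqrt{2\pi}} \, e^{-x^2/2} \,=\, g_{xc}(x) \,, \]
in accordance with condition (1). For readers who prefer a numerical cross-check over the analytical integration, here is a minimal sketch using NumPy; the grid limits, the resolution and the example value ρ = 0.6 are arbitrary choices of mine and not part of the derivation:

```python
import numpy as np

# Guessed centered bivariate normal density for sigma_x = sigma_y = 1
# and an example correlation coefficient rho (arbitrary choice).
rho = 0.6
x = np.linspace(-8.0, 8.0, 801)   # wide grid, step 0.02
y = np.linspace(-8.0, 8.0, 801)
X, Y = np.meshgrid(x, y, indexing="ij")

norm = 1.0 / (2.0 * np.pi * np.sqrt(1.0 - rho**2))
g2c = norm * np.exp(-(X**2 - 2.0 * rho * X * Y + Y**2) / (2.0 * (1.0 - rho**2)))

dx = x[1] - x[0]
dy = y[1] - y[0]

# Check 1: the density should integrate to ~1 over the plane.
total = g2c.sum() * dx * dy

# Check 2: integrating out y should reproduce the standard normal marginal.
marginal_x = g2c.sum(axis=1) * dy
gauss_x = np.exp(-x**2 / 2.0) / np.sqrt(2.0 * np.pi)

print(f"total integral     : {total:.6f}")
print(f"max marginal error : {np.abs(marginal_x - gauss_x).max():.2e}")
```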
By some simple reasoning we have guessed the functional form of a bivariate normal distribution. We have made the assumption that the marginal distributions of a bivariate normal distribution should be one-dimensional normal distributions whose probability density functions are described by normalized Gaussians.
By looking at conditional probabilities we found that a normalization of the respective probability densities could be achieved by symmetry arguments and a factorization approach. This led us to the assumption that the conditional distributions could be normal Gaussian distributions themselves.
We shall have a look at properties of the bivariate normal distribution, its marginals and conditional sub-distributions in later posts.