
Probability density function of a Bivariate Normal Distribution – derived from assumptions on marginal distributions and functional factorization

To better understand ML experiments on a generator of human faces based on a convolutional autoencoder, we need a solid grasp of multivariate and, in particular, bivariate normal distributions and their probability densities.

This post is about the probability density function of a bivariate normal distribution depending on two correlated random variables X and Y. Most derivations of the mathematical form of this two-dimensional function start from a general definition involving the variance-covariance matrix of two-dimensional distributions plus further assumptions. With this post I want to motivate the general functional form of the probability density by symmetry arguments and a factorization approach.

Assumption – the marginal distributions are normalized 1-dimensional normal distributions

Let us name the function for the probability density of the bivariate normal distribution g(x, y). x and y are concrete values X and Y may assume. We want to derive the form of g(x, y). The basic assumption we make is that the two marginal distributions are 1-dimensional normal distributions.

\[ \begin{align} \mbox{marg. probability density for x}\, : \quad g_x(x) \,&=\, \int g(x, y) \,dy \, = \, {1 \over \sqrt{2 \pi}} \, {1 \over \sigma_x } \operatorname{exp} \left( - {1\over2} {1\over {\sigma_x}^2} x^2 \right) \\ \mbox{marg. probability density for y}\, : \quad g_y(y) \,&=\, \int g(x, y)\,dx \, = \, {1 \over \sqrt{2 \pi}} \, {1 \over \sigma_y } \operatorname{exp} \left( - {1\over2} {1\over {\sigma_y}^2} y^2 \right) \end{align} \tag{1}\]

We work in a 2-dimensional Euclidean coordinate system [ECS] in which the mean values of the density distributions for X and Y are zero:

\[ \mu_x \,=\, E(X) \,=\,\, \lt x \gt_{g_x} \,=\, 0 \, \quad \mbox{and} \quad \, \mu_y \,=\, E(Y) \,=\,\, \lt y \gt_{g_y} \,=\, 0 \tag{2}\]

Equations (1) already indicate a symmetry of g(x, y) regarding its dependencies on x and y.

Probability density of conditional distributions

We can look at the whole thing from the perspective of conditional probabilities. Let us denote the conditional probability density for Y taking a value y under the condition that X has the value x as cy(y|x).

For the conditional probability density of X taking the value x under the condition Y = y we analogously write cx(x|y). Then we have:

\[ \begin{align} g(x, y) \,&=\, c_y(y|x) * {1 \over \sqrt{2 \pi}} \, {1 \over \sigma_x } \operatorname{exp} \left( - {1\over2} {1\over {\sigma_x}^2} x^2 \right) \\ g(x, y) \,&=\, c_x(x|y) * {1 \over \sqrt{2 \pi}} \, {1 \over \sigma_y } \operatorname{exp} \left( - {1\over2} {1\over {\sigma_y}^2} y^2 \right) \end{align} \tag{3} \]

Due to the symmetry indicated, we try a factorization of the conditional functions:

\[ \begin{align} c_y(y|x) \,&=\, f_y(x,y) * {1 \over \sqrt{2 \pi}} \, {1 \over \sigma_y } \operatorname{exp} \left( - {1\over2} {1\over {\sigma_y}^2} y^2 \right) \\ c_x(x|y) \,&=\, f_x(x,y) * {1 \over \sqrt{2 \pi}} \, {1 \over \sigma_x } \operatorname{exp} \left( - {1\over2} {1\over {\sigma_x}^2} x^2 \right) \end{align} \tag{4} \]

From

\[ \begin{align} g(x, y) \,&=\, {1 \over \sqrt{2 \pi}} \, {1 \over \sigma_x } \operatorname{exp} \left( - {1\over2} {1\over {\sigma_x}^2} x^2 \right) * f_y(x,y) * {1 \over \sqrt{2 \pi}} \, {1 \over \sigma_y } \operatorname{exp} \left( - {1\over2} {1\over {\sigma_y}^2} y^2 \right) \\ \,&=\, {1 \over \sqrt{2 \pi}} \, {1 \over \sigma_x } \operatorname{exp} \left( - {1\over2} {1\over {\sigma_x}^2} x^2 \right) * f_x(x,y) * {1 \over \sqrt{2 \pi}} \, {1 \over \sigma_y } \operatorname{exp} \left( - {1\over2} {1\over {\sigma_y}^2} y^2 \right) \end{align} \]

we conclude that

\[ f_x(x, y) \,=\, f_y(x, y) := f(x, y) \tag{5} \]
\[ g(x, y) \,=\, {1 \over \sqrt{2 \pi}} \, {1 \over \sigma_x } \operatorname{exp} \left( - {1\over2} {1\over {\sigma_x}^2} x^2 \right) * f(x,y) * {1 \over \sqrt{2 \pi}} \, {1 \over \sigma_y } \operatorname{exp} \left( - {1\over2} {1\over {\sigma_y}^2} y^2 \right) \tag{6} \]

For symmetry reasons we could already assume that f is symmetric in the sense that f(x, y) = f(y, x). But we wait a bit until we get further indications from other relations.

Why does the factorization in (6) make sense? Well, in the case of independent random variables X and Y we must fulfill

\[ f_x(x,y) \,=\, f_y(x,y) \,=\, f(x,y) \,=\; 1 \]

to recover the factorized distribution of this special case

\[ g_{ind}(x, y) \,=\, {1 \over \sqrt{2 \pi}} \, {1 \over \sigma_x } \operatorname{exp} \left( - {1\over2} {1\over {\sigma_x}^2} x^2 \right) * {1 \over \sqrt{2 \pi}} \, {1 \over \sigma_y } \operatorname{exp} \left( - {1\over2} {1\over {\sigma_y}^2} y^2 \right) \tag{7} \]

So, if we can make f(x, y) dependent on the correlation coefficient ρ, or equivalently on the covariance cov(X, Y), such that it becomes 1 for ρ = 0, we can reproduce independence in a simple manner.
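
The following minimal Python sketch checks this special case numerically (assuming NumPy and SciPy are available; all parameter values are arbitrary test choices): for ρ = 0 the library density of a bivariate normal coincides with the product of the two marginal Gaussians.

```python
# Minimal sketch: for rho = 0 the bivariate density factorizes into the
# product of its marginals. Parameter values are arbitrary test choices.
import numpy as np
from scipy.stats import multivariate_normal, norm

sigma_x, sigma_y, rho = 1.5, 0.8, 0.0
cov = np.array([[sigma_x**2, rho * sigma_x * sigma_y],
                [rho * sigma_x * sigma_y, sigma_y**2]])

x, y = 0.7, -1.2
g_xy = multivariate_normal(mean=[0.0, 0.0], cov=cov).pdf([x, y])
g_prod = norm(0.0, sigma_x).pdf(x) * norm(0.0, sigma_y).pdf(y)

print(np.isclose(g_xy, g_prod))   # True: f(x, y) = 1 for rho = 0
```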

Guessing the form of f(x, y) from further conditions

The marginal distributions must fulfill (1) and should in addition result from (6) by integration:

\[ \begin{align} g_x(x) \,&=\, {1 \over \sqrt{2 \pi}} \, {1 \over \sigma_x} \operatorname{exp} \left( - {1\over2} {1\over {\sigma_x}^2} x^2 \right) \\ \,&=\, \int_{-\infty}^{\infty} \, g(x,y) \, dy \\ \,&=\, {1 \over \sqrt{2 \pi}} \, {1 \over \sigma_x} \operatorname{exp} \left( - {1\over2} {1\over {\sigma_x}^2} x^2 \right) * \int_{-\infty}^{\infty} \, f(x,y) * {1 \over \sqrt{2 \pi}} \, {1 \over \sigma_y } \operatorname{exp} \left( - {1\over2} {1\over {\sigma_y}^2} y^2 \right) \, dy \end{align} \]

This simply means that the probability density cy(y|x) of the conditional distribution Y|X must be normalized. The same holds for X|Y and cx(x|y):

\[ \begin{align} \int_{-\infty}^{\infty} \, c_y(y|x) \, dy \,\,&=\,\, \int_{-\infty}^{\infty} \, f(x,y) * {1 \over \sqrt{2 \pi}} \, {1 \over \sigma_y } \operatorname{exp} \left( - {1\over2} {1\over {\sigma_y}^2} y^2 \right) \, dy \, = \, 1 \\ \int_{-\infty}^{\infty} \, c_x(x|y) \, dx \,\,&=\,\, \int_{-\infty}^{\infty} \, f(x,y) * {1 \over \sqrt{2 \pi}} \, {1 \over \sigma_x } \operatorname{exp} \left( - {1\over2} {1\over {\sigma_x}^2} x^2 \right) \, dx \, = \, 1 \end{align} \tag{8} \]

How could we get this to be true? Well, if the conditional distributions were (shifted) Gaussians themselves we could get this to work. The reason is that if we could bring e.g. cy(y|x) into a fully quadratic form like

\[ \int_{-\infty}^{\infty} \, c_y(y|x) \, dy \, =\, \int_{-\infty}^{\infty} \, {1 \over \sqrt{2 \pi}} \, {1 \over \sigma_{xy} } {1 \over \sigma_y} \operatorname{exp} \left( - {1\over2} {1\over {\sigma_{xy}^2}} \left( {y\over \sigma_y} \,-\, \rho_y {x \over \sigma_x} \right)^2 \right) \, dy \tag{9} \]

then we could use a variable transformation

\[ u_y \, = \, {y\over \sigma_y} \,-\, \rho_y {x \over \sigma_x} \,, \quad \, du_y = {1 \over \sigma_y } \, dy \]

to achieve

\[ \int_{-\infty}^{\infty} \, c_y(y|x) \, dy \, =\, \int_{-\infty}^{\infty} \, {1 \over \sqrt{2 \pi}} \, {1 \over \sigma_{xy} } \, \operatorname{exp} \left( - {1\over2} {1\over {\sigma_{xy}^2}} u_y^2 \right) \, du_y \, = \, 1 \tag{10} \]

Analogously, we would like f(x, y) to provide

\[ \int_{-\infty}^{\infty} \, c_x(x|y) \, dx \, =\, \int_{-\infty}^{\infty} \, {1 \over \sqrt{2 \pi}} \, {1 \over \sigma_{yx} } {1 \over \sigma_x} \operatorname{exp} \left( - {1\over2} {1\over {\sigma_{yx}^2}} \left( {x \over \sigma_x} \,-\, \rho_x {y \over \sigma_y} \right)^2 \right) \, dx \\ \, =\, \int_{-\infty}^{\infty} \, {1 \over \sqrt{2 \pi}} \, {1 \over \sigma_{yx} } \, \operatorname{exp} \left( - {1\over2} {1\over {\sigma_{yx}^2}} u_x^2 \right) \, du_x \, = \, 1 \tag{11} \]
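
Here is a short numerical sketch of this normalization argument, assuming SciPy's quad is available; σxy and ρy are arbitrary positive test constants, since at this stage they are still free parameters. The integral of the completed-square form (9) over y should give 1 for any fixed x:

```python
# Sketch: the completed-square form (9) of c_y(y|x) integrates to 1 over y
# for any fixed x, as eq. (10) claims. sigma_xy, rho_y are free test values.
import numpy as np
from scipy.integrate import quad

sigma_x, sigma_y = 1.5, 0.8
sigma_xy, rho_y = 0.6, 0.75   # arbitrary positive test constants

def c_y(y, x):
    u = y / sigma_y - rho_y * x / sigma_x
    return (np.exp(-0.5 * (u / sigma_xy)**2)
            / (np.sqrt(2.0 * np.pi) * sigma_xy * sigma_y))

for x in (-2.0, 0.0, 3.1):
    val, _ = quad(c_y, -np.inf, np.inf, args=(x,))
    print(np.isclose(val, 1.0))   # True for every x
```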

Note that σxy and σyx must be constants, i.e. independent of the respective x and y values! Our normalization approach means that f(x, y) must provide fitting terms which complete the terms in the exponentials to fully quadratic expressions. f(x, y) must therefore contribute squares in x and y as well as a term containing x*y inside an exponential. In addition we need some symmetry in f(x, y) regarding x and y. Taking all this into account, we try a very simple ansatz restricted to quadratic terms:

\[ f(x, y) = {1 \over \epsilon} * \operatorname{exp} \left( - \, {1 \over 2 } \left[ \, \gamma_x {x^2 \over \sigma_x^2} \, + \, \xi {x \over \sigma_x} {y \over \sigma_y} \, +\, \gamma_y {y^2 \over \sigma_y^2} \right] \, \right) \tag{12} \]

\(\epsilon\), \(\gamma_x \), \(\xi\) and \(\gamma_y\) are constants. We could assume \(\gamma_x \,=\, \gamma_y \) already at this point, but let us wait and see.

Combining (9) and (12) for cy(y|x) we get:

\[ {1 \over \sqrt{2 \pi}} \, {1 \over \sigma_{xy} } {1 \over \sigma_y} \, \int_{-\infty}^{\infty} \, \operatorname{exp} \left( - {1\over2} {1\over {\sigma_{xy}^2}} \left[ {y^2 \over \sigma_y^2} \,-\, 2 \rho_y {x \over \sigma_x} {y \over \sigma_y} \,+\, \rho_y^2 {x^2 \over \sigma_x^2} \right] \right) \, dy \\ \,=\, {1 \over \sqrt{2 \pi}} {1 \over \epsilon} {1 \over \sigma_y} \, \int_{-\infty}^{\infty} \, \operatorname{exp} \left( - \, {1 \over 2 } \left[ \, (1 \,+\, \gamma_y ) {y^2 \over \sigma_y^2} \, + \, \xi {x \over \sigma_x} {y \over \sigma_y} \, +\, \gamma_x {x^2 \over \sigma_x^2} \right] \right) \, dy \, \tag{13} \]

and analogous terms for cx(x|y). Thus:

\[ {1\over \epsilon } \,=\, {1 \over \sigma_{xy} } \,=\, {1 \over \sigma_{yx} } \Rightarrow \sigma_{xy} = \sigma_{yx}, \quad (1 \,+\, \gamma_y ) \,=\, {1 \over \sigma_{xy}^2} \,=\, (1 + \gamma_x) \, \Rightarrow \, \gamma_x = \gamma_y \,=\, \gamma ,\\ \xi \,=\, -2 {\rho_y \over \sigma_{xy}^2} \,=\, -2 {\rho_x \over \sigma_{xy}^2}, \quad \gamma_x \,=\, {\rho_y^2 \over \sigma_{xy}^2}, \,\, \gamma_y \,=\, {\rho_x^2 \over \sigma_{xy}^2} \, \Rightarrow \, \rho_x \,=\, \rho_y \,=\, \rho \]

Here, ρ is a key parameter. σxy must have a positive value to make sense in the definition of a Gaussian. This means:

\[ \gamma \,=\, \rho^2 \,( 1\,+\, \gamma) \,\Rightarrow \, \gamma \,=\, {\rho^2 \over 1\,-\, \rho^2} \\ \,\Rightarrow \, \epsilon \,=\, \sigma_{xy} \,=\, \sqrt{ 1 \,-\, \rho^2 } ,\quad \xi = -2 {\rho \over 1\,-\,\rho^2} \tag{14} \]

This defines our f(x, y) completely.
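
For readers who prefer a symbolic cross-check: the following SymPy sketch (the symbol names are my own choices, not part of the derivation above) solves the coefficient-matching conditions from (13) and confirms the values of γ, σxy and ξ given in (14).

```python
# SymPy sketch reproducing the coefficient matching of eq. (13):
# 1/sigma_xy^2 = 1 + gamma  and  gamma = rho^2 / sigma_xy^2.
import sympy as sp

rho, gamma, s = sp.symbols('rho gamma sigma_xy', positive=True)

sol = sp.solve([sp.Eq(1 / s**2, 1 + gamma),
                sp.Eq(gamma, rho**2 / s**2)], [gamma, s], dict=True)[0]

# results of eq. (14), verified symbolically
assert sp.simplify(sol[gamma] - rho**2 / (1 - rho**2)) == 0
assert sp.simplify(sol[s]**2 - (1 - rho**2)) == 0
assert sp.simplify(-2 * rho / sol[s]**2 + 2 * rho / (1 - rho**2)) == 0
print("eq. (14) confirmed")
```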

The final form of the probability density of a bivariate normal distribution

Inserting the results of (14) into (12) we get:

\[ \begin{align} g(x, y) \,&=\, {1 \over \sqrt{2 \pi}} \, {1 \over \sigma_x } \operatorname{exp} \left( - {1\over2} {1\over {\sigma_x}^2} x^2 \right) \\ &\, * \, {1 \over { \sqrt{\, 1\,-\, \rho^2}} } \,\operatorname{exp} \left( - {1\over2} \left[ {\rho^2 \over 1\,-\, \rho^2 } {x^2 \over \sigma_x^2} \,-\, 2 {\rho \over 1\,-\, \rho^2} {x \over \sigma_x} {y \over \sigma_y} \,+\, {\rho^2 \over 1\,-\, \rho^2 } {y^2 \over \sigma_y^2} \, \right] \right) \\ &\, * \, {1 \over \sqrt{2 \pi}} \, {1 \over \sigma_y } \operatorname{exp} \left( - {1\over2} {1\over {\sigma_y}^2} y^2 \right) \end{align} \tag{15} \]

Combining the exponentials gives:

\[ g(x, y) \,=\, {1 \over 2 \pi \, \sigma_x \, \sigma_y } {1 \over { \sqrt{\, 1\,-\, \rho^2}} } \operatorname{exp} \left( - {1\over2} {1 \over 1\,-\, \rho^2 } \left[ {x^2 \over \sigma_x^2} \,-\, 2 \rho \, {x \over \sigma_x} {y \over \sigma_y} \,+\, {y^2 \over \sigma_y^2} \, \right] \right) \tag{16} \]
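
As a sanity check, this small Python sketch (assuming SciPy; parameter values and test points are arbitrary) compares a direct implementation of (16) with scipy.stats.multivariate_normal, which implements the standard bivariate normal density:

```python
# Sketch: a direct implementation of eq. (16) agrees with SciPy's
# bivariate normal density. Parameter values are arbitrary test choices.
import numpy as np
from scipy.stats import multivariate_normal

sigma_x, sigma_y, rho = 1.5, 0.8, 0.6
cov = np.array([[sigma_x**2, rho * sigma_x * sigma_y],
                [rho * sigma_x * sigma_y, sigma_y**2]])

def g(x, y):
    q = (x**2 / sigma_x**2
         - 2.0 * rho * x * y / (sigma_x * sigma_y)
         + y**2 / sigma_y**2)
    norm_const = 2.0 * np.pi * sigma_x * sigma_y * np.sqrt(1.0 - rho**2)
    return np.exp(-0.5 * q / (1.0 - rho**2)) / norm_const

rng = np.random.default_rng(0)
pts = rng.normal(size=(5, 2))            # arbitrary test points
ref = multivariate_normal(mean=[0.0, 0.0], cov=cov)
print(np.allclose([g(x, y) for x, y in pts], ref.pdf(pts)))   # True
```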

Going over to a general Euclidean coordinate system, in which the mean values μx and μy may differ from zero, we have

\[ \begin{align} g(x, y) \,=\, &{1 \over 2 \pi \, \sigma_x \, \sigma_y } {1 \over { \sqrt{\, 1\,-\, \rho^2}} } \, * \\ &\, * \operatorname{exp} \left( - {1\over2} {1 \over 1\,-\, \rho^2 } \left[ { \left(x - \mu_x\right)^2 \over \sigma_x^2} \,-\, 2\, \rho { \left(x - \mu_x \right) \left(y - \mu_y \right) \over \sigma_x \, \sigma_y } \,+\, { \left(y - \mu_y \right)^2 \over \sigma_y^2} \, \right] \right) \end{align} \tag{17} \]

Vector form and relation to the inverse of the variance-covariance matrix

For those of my readers who are used to vector notation and the respective matrix representations we define a random vector V containing X and Y, plus some further vectors:

\[ \pmb{V} = \begin{pmatrix} X \\ Y \end{pmatrix}, \quad \mbox{concrete values}: \, \pmb{v} = \begin{pmatrix} x \\ y \end{pmatrix}, \quad \pmb{\mu} = \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \quad \pmb{v}_{\mu} = \begin{pmatrix} x - \mu_x \\ y - \mu_y \end{pmatrix} \]

We use the following matrix

\[ \pmb{\Sigma}^{-1} \,=\, {1 \over \sigma_x^2\, \sigma_y^2\, \left( 1\,-\, \rho^2\right) } \, \begin{pmatrix} \sigma_y^2 &-\rho\, \sigma_x\sigma_y \\ -\rho\, \sigma_x\sigma_y & \sigma_x^2 \end{pmatrix}, \tag{18} \]

and thus are able to re-write (16) and (17) in vector form

\[ \mbox{centered ECS :} \quad g(x, y) \,=\, {1 \over 2 \pi \, \sigma_x \, \sigma_y } {1 \over { \sqrt{\, 1\,-\, \rho^2}} } \operatorname{exp} \left( - {1\over2} \, {\pmb{v}}^T \bullet \, \pmb{\Sigma}^{-1} \bullet {\pmb{v}} \, \right) \tag{19} \]
\[ \mbox{general ECS :} \quad g(x, y) \,=\, {1 \over 2 \pi \, \sigma_x \, \sigma_y } {1 \over { \sqrt{\, 1\,-\, \rho^2}} } \operatorname{exp} \left( - {1\over2} \, {\pmb{v}_{\mu}}^T \bullet \, \pmb{\Sigma}^{-1} \bullet {\pmb{v}_{\mu}} \, \right) \tag{20} \]

T symbolizes the transposition operation. The reader may recognize the inverse of Σ-1 as the variance-covariance matrix, and ρ as the correlation coefficient coupling X and Y:

\[ \pmb{\Sigma} \,=\, \begin{pmatrix} \sigma_x^2 &\rho\, \sigma_x\sigma_y \\ \rho\, \sigma_x\sigma_y & \sigma_y^2 \end{pmatrix}, \quad \pmb{\Sigma} \bullet \pmb{\Sigma}^{-1} = \mathbf{I}_2 \tag{21} \]
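
A quick NumPy sketch (test values arbitrary) can confirm both (21) and the equivalence of the quadratic form in (19) with the bracket expression in (16):

```python
# Sketch: the stated inverse (18) times Sigma gives the identity (21), and
# the quadratic form v^T Sigma^{-1} v reproduces the bracket in eq. (16).
import numpy as np

sigma_x, sigma_y, rho = 1.5, 0.8, 0.6
Sigma = np.array([[sigma_x**2, rho * sigma_x * sigma_y],
                  [rho * sigma_x * sigma_y, sigma_y**2]])
Sigma_inv = (np.array([[sigma_y**2, -rho * sigma_x * sigma_y],
                       [-rho * sigma_x * sigma_y, sigma_x**2]])
             / (sigma_x**2 * sigma_y**2 * (1.0 - rho**2)))

print(np.allclose(Sigma @ Sigma_inv, np.eye(2)))   # True

v = np.array([0.7, -1.2])                          # arbitrary test vector
quad_form = v @ Sigma_inv @ v
bracket = (v[0]**2 / sigma_x**2
           - 2.0 * rho * v[0] * v[1] / (sigma_x * sigma_y)
           + v[1]**2 / sigma_y**2) / (1.0 - rho**2)
print(np.isclose(quad_form, bracket))              # True
```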

Note that Σ is symmetric. The notation often used to describe a two-dimensional normal distribution is

\[ \pmb{V}_B \,=\, \mathcal{N} \left[ \pmb{\mu},\, \pmb{\Sigma} \right] \,=\, \mathcal{N} \left[ \begin{pmatrix} \mu_x \\ \mu_y\end{pmatrix}, \, \begin{pmatrix} \sigma_x^2 &\rho\, \sigma_x\sigma_y \\ \rho\, \sigma_x\sigma_y & \sigma_y^2 \end{pmatrix} \right] \tag{22} \]
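
In practice one draws samples from such a distribution with a library routine. A minimal NumPy sketch (the values for μ, σx, σy and ρ are arbitrary test choices) illustrates the notation in (22) and checks the empirical means, standard deviations and correlation:

```python
# Sketch: draw from N(mu, Sigma) and check empirical statistics against
# the chosen parameters. All parameter values are arbitrary test choices.
import numpy as np

mu = np.array([1.0, -2.0])
sigma_x, sigma_y, rho = 1.5, 0.8, 0.6
Sigma = np.array([[sigma_x**2, rho * sigma_x * sigma_y],
                  [rho * sigma_x * sigma_y, sigma_y**2]])

rng = np.random.default_rng(42)
samples = rng.multivariate_normal(mu, Sigma, size=200_000)

print(samples.mean(axis=0))          # close to [1.0, -2.0]
print(samples.std(axis=0))           # close to [1.5, 0.8]
print(np.corrcoef(samples.T)[0, 1])  # close to 0.6
```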

Conclusion

By some simple reasoning we have guessed the functional form of the probability density of a bivariate normal distribution. We made the assumption that the marginal distributions of a bivariate normal distribution are one-dimensional normal distributions whose probability density functions are normalized Gaussians.

By looking at conditional probabilities we found that a normalization of the respective probability densities could be achieved by following symmetry arguments and a factorization approach. This led us to the assumption that the conditional distributions are themselves (shifted) normal distributions.

We shall have a look at properties of the bivariate normal distribution, its marginals and conditional sub-distributions in a later post.