

2.3.1 Gaussian mixture models

Gaussian mixture models assume that the patterns $\mathbf{x}_i$ originate from a probability density $p(\mathbf{x})$ (Bishop, 1995). This density is a linear combination of Gaussian functions $p(\mathbf{x} \mid j)$,

$p(\mathbf{x} \mid j) = \frac{1}{N_j} \exp\left( -\frac{1}{2} (\mathbf{x} - \mathbf{c}_j)^T \mathbf{A}_j (\mathbf{x} - \mathbf{c}_j) \right)$ .   (2.16)

The normalization constant $N_j$ is chosen such that the integral of $p(\mathbf{x} \mid j)$ over $\mathbb{R}^d$ equals one (a necessary condition for a probability density). The negative exponent is a weighted squared distance (the squared Mahalanobis distance) between $\mathbf{x}$ and the center $\mathbf{c}_j$; the weights are given by the symmetric matrix $\mathbf{A}_j$. The set of points whose Mahalanobis distance to the center $\mathbf{c}_j$ equals one is a hyper-ellipsoid.
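For illustration only, the following NumPy sketch evaluates the local density of (2.16) for a single unit. The function name component_density and the explicit form of the normalization constant, $N_j = \sqrt{(2\pi)^d / \det \mathbf{A}_j}$ (valid when $\mathbf{A}_j$ is the inverse covariance, as stated below), are choices made here for concreteness, not part of the original text.

    import numpy as np

    def component_density(x, c_j, A_j):
        """Evaluate the local Gaussian density p(x|j) of Eq. (2.16).

        x   : pattern, shape (d,)
        c_j : center of unit j, shape (d,)
        A_j : symmetric weight matrix (inverse covariance), shape (d, d)
        """
        d = x.shape[0]
        diff = x - c_j
        # squared Mahalanobis distance between x and the center c_j
        maha_sq = diff @ A_j @ diff
        # normalization constant N_j so that p(x|j) integrates to one over R^d
        N_j = np.sqrt((2.0 * np.pi) ** d / np.linalg.det(A_j))
        return np.exp(-0.5 * maha_sq) / N_j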

The density $p(\mathbf{x})$ is a weighted sum of the local densities $p(\mathbf{x} \mid j)$,

$p(\mathbf{x}) = \sum_{j=1}^{m} P(j)\, p(\mathbf{x} \mid j)$ .   (2.17)

For $p(\mathbf{x})$ to be normalized, the weights $P(j)$ must sum to one, $\sum_{j=1}^{m} P(j) = 1$. Therefore, $P(j)$ can be interpreted as the probability that a pattern originates from unit $j$; it is called the prior probability.
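Building on the sketch above, the mixture density (2.17) is then the prior-weighted sum over the $m$ units; mixture_density is again a hypothetical helper name chosen here.

    def mixture_density(x, centers, weight_matrices, priors):
        """Evaluate p(x) of Eq. (2.17) as the prior-weighted sum of p(x|j).

        centers         : list of m center vectors c_j
        weight_matrices : list of m symmetric matrices A_j
        priors          : sequence of m prior probabilities P(j), summing to one
        """
        return sum(P_j * component_density(x, c_j, A_j)
                   for P_j, c_j, A_j in zip(priors, centers, weight_matrices))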

The goal of the mixture model is to find the unknown parameters $\mathbf{c}_j$, $\mathbf{A}_j$, and the priors $P(j)$ for each unit $j$ such that the likelihood $L = \prod_{i=1}^{n} p(\mathbf{x}_i)$ of obtaining the data set $\{\mathbf{x}_i\}$ given the density $p(\mathbf{x})$ is maximal (Bishop, 1995). To solve this optimization problem, it is common to use a variant of the expectation-maximization (EM) algorithm (Bishop, 1995). It consists of two steps that iterate until convergence is reached. In the expectation step, the soft-assignment $P(j \mid \mathbf{x}_i)$ for all $j$ and $i$ is computed based on a given estimate of the parameters $\mathbf{c}_j$, $\mathbf{A}_j$, and $P(j)$. $P(j \mid \mathbf{x}_i)$ is called the posterior probability. It is computed using Bayes' theorem (see appendix A.1),

$P(j \mid \mathbf{x}_i) = \frac{p(\mathbf{x}_i \mid j)\, P(j)}{p(\mathbf{x}_i)}$ .   (2.18)

In the special case of uniform Gaussians, which all have the same width and the same weight $P(j)$, (2.18) coincides with the Gibbs distribution (2.14).
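As a sketch of the expectation step, the posterior matrix $P(j \mid \mathbf{x}_i)$ can be computed row by row from Bayes' theorem (2.18). The function name e_step is a choice made here, and the code reuses the component_density helper and NumPy import from above.

    def e_step(X, centers, weight_matrices, priors):
        """Expectation step: posteriors P(j|x_i) from Bayes' theorem, Eq. (2.18).

        X : data matrix of n patterns, shape (n, d)
        Returns an (n, m) matrix of posterior probabilities (rows sum to one).
        """
        n, m = X.shape[0], len(priors)
        post = np.empty((n, m))
        for i in range(n):
            # numerator of Eq. (2.18): p(x_i|j) P(j) for every unit j
            for j in range(m):
                post[i, j] = priors[j] * component_density(X[i], centers[j],
                                                           weight_matrices[j])
            # dividing by p(x_i) = sum_j p(x_i|j) P(j) normalizes the row
            post[i] /= post[i].sum()
        return post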

In the maximization step, the Gaussians' parameters $\mathbf{c}_j$, $\mathbf{A}_j$, and $P(j)$ that maximize the likelihood given all $P(j \mid \mathbf{x}_i)$ can be computed directly (Bishop, 1995). The result is that the center $\mathbf{c}_j$ is the weighted mean of the set $\{\mathbf{x}_i\}$,

$\mathbf{c}_j = \frac{\sum_{i=1}^{n} P(j \mid \mathbf{x}_i)\, \mathbf{x}_i}{\sum_{i=1}^{n} P(j \mid \mathbf{x}_i)}$ ,   (2.19)

and the matrix $\mathbf{A}_j$ is the inverse of the weighted covariance matrix $\mathbf{C}_j$,

$\mathbf{C}_j = \frac{\sum_{i=1}^{n} P(j \mid \mathbf{x}_i)\, (\mathbf{x}_i - \mathbf{c}_j)(\mathbf{x}_i - \mathbf{c}_j)^T}{\sum_{i=1}^{n} P(j \mid \mathbf{x}_i)}$ .   (2.20)

The inverse can be computed by extracting all eigenvectors of $\mathbf{C}_j$. Thus, the axes of the above hyper-ellipsoid are the principal components of the local data distribution. The size of this hyper-ellipsoid is given by the eigenvalues $\lambda_j^l$ from the PCA; the semi-axis length of unit $j$ in direction $l$ equals $\sqrt{\lambda_j^l}$.
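As a concrete illustration of this relation, the following sketch obtains the principal axes, the semi-axis lengths $\sqrt{\lambda_j^l}$, and the inverse $\mathbf{A}_j$ from an eigendecomposition of $\mathbf{C}_j$; the function name and the use of numpy.linalg.eigh are choices made here.

    def ellipsoid_from_covariance(C_j):
        """Eigendecomposition of the local covariance C_j from Eq. (2.20).

        Returns the principal axes (eigenvectors), the semi-axis lengths
        sqrt(lambda_j^l) of the unit's hyper-ellipsoid, and A_j = C_j^{-1}.
        """
        eigvals, eigvecs = np.linalg.eigh(C_j)   # PCA of the local distribution
        semi_axes = np.sqrt(eigvals)             # semi-axis length in direction l
        A_j = eigvecs @ np.diag(1.0 / eigvals) @ eigvecs.T   # inverse covariance
        return eigvecs, semi_axes, A_j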

Finally, the result for the prior probabilities is

$P(j) = \frac{1}{n} \sum_{i=1}^{n} P(j \mid \mathbf{x}_i)$ .   (2.21)
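The maximization step of (2.19)-(2.21) can be sketched as follows; the name m_step and the direct use of numpy.linalg.inv are choices made here, and the posterior matrix is assumed to come from the e_step sketch above.

    def m_step(X, post):
        """Maximization step: re-estimate c_j, C_j (hence A_j), and P(j)
        according to Eqs. (2.19)-(2.21).

        X    : data matrix, shape (n, d)
        post : posterior matrix P(j|x_i) from the expectation step, shape (n, m)
        """
        n, m = post.shape
        weights = post.sum(axis=0)                  # sum_i P(j|x_i) for each unit
        centers, weight_matrices = [], []
        for j in range(m):
            c_j = post[:, j] @ X / weights[j]       # Eq. (2.19): weighted mean
            diff = X - c_j
            C_j = (post[:, j, None] * diff).T @ diff / weights[j]   # Eq. (2.20)
            centers.append(c_j)
            weight_matrices.append(np.linalg.inv(C_j))   # A_j is the inverse of C_j
        priors = weights / n                        # Eq. (2.21)
        return centers, weight_matrices, priors

Alternating e_step and m_step until the likelihood stops improving reproduces the EM iteration described in the text.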

It can be shown that alternating these expectation and maximization steps increases the likelihood $L$ in each iteration (Bishop, 1995). However, local maxima are not avoided. Furthermore, single isolated data points (outliers) can make the algorithm unstable (Archambeau et al., 2003): if just one pattern is assigned to a unit (that is, all other patterns have almost zero $P(j \mid \mathbf{x}_i)$), the variance of the local Gaussian collapses to zero. As a remedy for the problem of local optima, annealing schemes (as discussed for vector quantization, section 2.2) were suggested (Albrecht et al., 2000; Meinicke and Ritter, 2001; Meinicke, 2000). Here, a global variance linked to the width of each Gaussian is gradually reduced during annealing.

The Gaussian mixture model as presented here has the disadvantage that all eigenvectors of the local covariance matrix need to be extracted. This problem can be overcome, however, if probabilistic PCA is used instead of standard PCA.

