2.2 Vector quantization
Vector quantization describes a pattern set using a reduced number of so-called `codebook' vectors. We assume again that $n$ training patterns $\vec{x}_i \in \mathbb{R}^d$ are given. Let $m < n$ be the number of codebook vectors $\vec{w}_j \in \mathbb{R}^d$. In the final state, each training pattern is assigned to one codebook vector. The optimal positions of the codebook vectors are usually obtained by finding the minimum of the sum $E$ of squared distances between each codebook vector and its assigned patterns,

$E = \sum_{j=1}^{m} \sum_{i=1}^{n} P(j|\vec{x}_i) \, \|\vec{x}_i - \vec{w}_j\|^2 \;. \qquad (2.12)$

Here, $P(j|\vec{x}_i)$ is the probability that $\vec{x}_i$ belongs to $\vec{w}_j$. For the final state, $P(j|\vec{x}_i) = 1$ if $\vec{x}_i$ is assigned to $\vec{w}_j$, and $P(j|\vec{x}_i) = 0$ otherwise. A pattern is assigned to the codebook vector that has the smallest Euclidean distance to that pattern. Thus, the codebook vectors induce a Voronoi tessellation of space (figure 2.2). In each of the separated regions, the position of the codebook vector is the center of mass of the local pattern distribution (otherwise, (2.12) cannot be minimal).
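As an illustration of these two properties, the following sketch performs one hard-clustering step in the sense of (2.12): each pattern is assigned to its nearest codebook vector, and each codebook vector is then moved to the center of mass of its assigned patterns. The function name and the toy data are invented for this example.

```python
import numpy as np

def assign_and_update(patterns, codebook):
    """One hard-clustering step: assign each pattern to its nearest
    codebook vector (Euclidean distance), then move each codebook
    vector to the center of mass of its assigned patterns."""
    # squared distances between all patterns and all codebook vectors
    d2 = ((patterns[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    assignment = d2.argmin(axis=1)        # winner j has P(j|x_i) = 1
    new_codebook = codebook.copy()
    for j in range(len(codebook)):
        members = patterns[assignment == j]
        if len(members) > 0:              # leave empty Voronoi cells unchanged
            new_codebook[j] = members.mean(axis=0)
    return assignment, new_codebook

# toy data: two well-separated clusters in R^2
rng = np.random.default_rng(0)
patterns = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
                      rng.normal(5.0, 0.1, (20, 2))])
codebook = np.array([[0.5, 0.5], [4.5, 4.5]])
assignment, codebook = assign_and_update(patterns, codebook)
```

After this step, each codebook vector sits exactly at the mean of its Voronoi cell, which is the condition for a (local) minimum of $E$.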
Figure 2.2:
Codebook vectors (crosses) and the resulting Voronoi tessellation (dashed lines are boundaries, equidistant from two respective codebook vectors). Training patterns are drawn as gray dots.

The difficulty in finding the optimal $\{\vec{w}_j\}$ is that $E$ has many local minima. No general solution exists. Instead, various iterative algorithms exist that find approximate solutions. The algorithms can be divided into those that use hard-clustering and those that use soft-clustering. In hard-clustering, $P(j|\vec{x}_i)$ is binary (throughout the iteration), and each $\vec{x}_i$ can be assigned to only one codebook vector. In soft-clustering, $P(j|\vec{x}_i)$ can take any value in the interval $[0, 1]$.
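The distinction can be made concrete with a soft-assignment sketch. The Gibbs/softmax form below, with an inverse-temperature parameter `beta`, is one common choice (it reappears in deterministic annealing); the function name and parameters are assumptions for this illustration. For large `beta` the assignments approach the binary hard-clustering case.

```python
import numpy as np

def soft_assignments(patterns, codebook, beta):
    """Soft-clustering responsibilities P(j|x_i) in [0, 1], using a
    Gibbs/softmax form over squared distances; beta acts as an
    inverse temperature.  beta -> infinity recovers hard assignments."""
    d2 = ((patterns[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    # subtract the row minimum before exponentiating, for numerical stability
    logits = -beta * (d2 - d2.min(axis=1, keepdims=True))
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)   # each row sums to 1

patterns = np.array([[0.0, 0.0], [1.0, 1.0]])
codebook = np.array([[0.0, 0.0], [2.0, 2.0]])
p_soft = soft_assignments(patterns, codebook, beta=1.0)
p_hard = soft_assignments(patterns, codebook, beta=100.0)
```

The second pattern is equidistant from both codebook vectors, so its soft responsibilities are split evenly; with a large `beta`, the first pattern is assigned almost entirely to its nearest codebook vector.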
The algorithms can be further divided into online and batch versions. Online versions update the codebook vectors in each iteration step based on just one (randomly drawn) training pattern (as for the self-organizing map, section 1.5.5). The update rule is usually written as

$\vec{w}_j(t+1) = \vec{w}_j(t) + \varepsilon(t) \, P(j|\vec{x}) \, [\vec{x} - \vec{w}_j(t)] \;. \qquad (2.13)$

The change of $\vec{w}_j$ is in the direction of the negative gradient of (2.12). $\varepsilon(t)$ is a learning rate, which can depend on time. In contrast, batch versions use all training patterns for each step. Here, the algorithm alternates between computing all $P(j|\vec{x}_i)$ based on a given distribution $\{\vec{w}_j\}$, and optimizing all $\vec{w}_j$ given $P(j|\vec{x}_i)$. This algorithm is a variant of the expectation-maximization algorithm (Dempster et al., 1977). The `maximization' step is a minimization of the error $E$ (which maximizes the likelihood, see section 2.3.1).
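The online update rule (2.13) can be sketched as follows for the hard-clustering case, where $P(j|\vec{x}) = 1$ only for the winning codebook vector. The function name and the constant learning rate are choices made for this illustration; in practice $\varepsilon(t)$ typically decays over time.

```python
import numpy as np

def online_step(pattern, codebook, lr):
    """One online update following (2.13) with hard assignments:
    w_j(t+1) = w_j(t) + eps(t) * P(j|x) * (x - w_j(t)).
    Only the winning codebook vector (P = 1) moves a fraction lr
    toward the randomly drawn pattern; all others stay put."""
    d2 = ((codebook - pattern) ** 2).sum(axis=1)
    j = d2.argmin()                    # hard assignment: P(j|x) = 1
    codebook[j] += lr * (pattern - codebook[j])
    return codebook

codebook = np.array([[0.0, 0.0], [10.0, 10.0]])
codebook = online_step(np.array([1.0, 1.0]), codebook, lr=0.5)
```

With `lr=0.5`, the winning vector moves halfway toward the presented pattern, while the losing vector is untouched; iterating over randomly drawn patterns performs stochastic gradient descent on (2.12).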
The following sections describe the hard-clustering algorithm `k-means' and provide more details about soft-clustering algorithms. Two examples are described: `deterministic annealing' and `Neural Gas'.
Heiko Hoffmann
2005-03-22