8.2 Approximation of the data distribution

Two strategies were pursued: either the distribution of sensorimotor data was approximated by a mixture of ellipsoids (chapter 3), or its mapping into a single higher-dimensional space was approximated by a hyper-plane (chapter 5). The first strategy was realized by a mixture of local principal component analyzers (PCA); here, each PCA operates on one region of the distribution. For each ellipsoid, the resulting principal components are the directions of its axes, and the corresponding eigenvalues are the squared axis lengths. To obtain the free parameters of the mixture model (the ellipsoids' centers and the directions and lengths of their axes), three new methods were presented:
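
The relation between an ellipsoid and a local PCA can be sketched as follows: the eigenvectors of the covariance matrix of a data region give the axis directions, and the square roots of the eigenvalues give the axis lengths (the data cloud and its axis scales below are illustrative, not taken from the thesis).

```python
import numpy as np

# Illustration: fit one local PCA unit to a cluster of points.
# The eigenvectors give the ellipsoid's axis directions; the eigenvalues
# are the squared axis lengths (variances along each axis).
rng = np.random.default_rng(0)
points = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.2])  # elongated cloud

center = points.mean(axis=0)
cov = np.cov(points - center, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)        # ascending eigenvalue order
order = np.argsort(eigvals)[::-1]
axis_lengths = np.sqrt(eigvals[order])        # ellipsoid half-axis lengths
axis_directions = eigvecs[:, order]           # columns are unit axis vectors

print(axis_lengths)  # approximately [3, 1, 0.2]
```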

**NGPCA** is an extension of the vector quantizer Neural Gas (Martinetz et al., 1993) to local PCA (Möller and Hoffmann, 2004). For each presented data point, not only the centers but also the principal components are updated. Here, an error measure (the normalized Mahalanobis distance plus the reconstruction error) regulates the competition between the units of the mixture (section 3.2.1).

**NGPCA-constV** is a variant of NGPCA in which the error measure is replaced by one that removes the dependence on the volume of an ellipsoid (section 3.2.2).

**MPPCA-ext** is an extension of the mixture of probabilistic PCA (Tipping and Bishop, 1999), a mixture of Gaussian probability densities that models the data distribution (section 2.3.1). The algorithm was modified in three parts: first, Neural Gas was used to initialize the centers; second, units with almost zero weight were moved to denser regions of the distribution; and third, an on-line method for PCA was used, which allows noise to be added to each presented data point (section 3.3). The last two modifications made the algorithm usable for the sensorimotor distributions from the robot experiments; the sparseness of these distributions would otherwise lead to eigenvalues of zero (section 6.2.5).
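
The error measure that regulates the competition between units might look roughly like the sketch below: a Mahalanobis term for the deviation inside the principal subspace plus a reconstruction term for the part orthogonal to it. The exact normalization used in the thesis may differ; this is an assumed form for illustration only.

```python
import numpy as np

def unit_error(x, center, W, eigvals, sigma2):
    """Sketch of an NGPCA-style error measure (assumed form):
    Mahalanobis distance within the principal subspace spanned by the
    columns of W, plus the reconstruction error of the orthogonal
    residual, scaled by the residual variance sigma2."""
    dev = x - center
    y = W.T @ dev                         # coordinates in the principal subspace
    mahalanobis = np.sum(y**2 / eigvals)  # deviation scaled by eigenvalues
    residual = dev - W @ y                # component orthogonal to the subspace
    reconstruction = np.dot(residual, residual) / sigma2
    return mahalanobis + reconstruction
```

In a Neural-Gas-style scheme, the units would be ranked by this error for each presented point, and the update step size would decay with the rank.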

These algorithms were demonstrated to work on synthetic data and on the classification of hand-written digits (section 3.4). For classification tasks, however, mixture models are not the most suitable choice; here, feed-forward networks and support-vector machines (Cortes and Vapnik, 1995) perform better (LeCun et al., 1998).

On most tasks, the NGPCA variants gave about the same performance as MPPCA-ext, though some differences were visible. The NGPCA variants were more sensitive than MPPCA-ext to the training parameters (sections 3.5 and 6.3). Further, the dependence on these parameters is not yet understood.

On the other hand, the NGPCA variants could cope with arbitrarily many dimensions, whereas MPPCA-ext failed on high-dimensional data (observed for 676 and 784 dimensions, see sections 4.4.2 and 3.4). This problem arises from the Gaussian function, which vanishes at the large distances typical of high-dimensional spaces (section 3.4.2). Thus, the same problem can probably also be found in other Gaussian mixture models (Albrecht et al., 2000; Meinicke and Ritter, 2001; Tipping and Bishop, 1999); these studies did not report tests on such high-dimensional data.
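
The underflow behind this failure is easy to reproduce numerically: for unit-variance data, the typical squared distance between two points grows like twice the dimension, and the Gaussian of that distance becomes exactly zero in double precision once the exponent drops below roughly -745.

```python
import numpy as np

# Why Gaussian mixtures fail in high dimensions: the expected squared
# distance between two points with unit-variance components is about
# 2 * dim, and exp(-dim) underflows to exactly 0.0 in double precision
# for large dim. Every component then reports zero likelihood.
for dim in (2, 100, 784):
    typical_sq_dist = 2.0 * dim                # expected ||a - b||^2 for N(0, I) data
    print(dim, np.exp(-0.5 * typical_sq_dist)) # underflows to 0.0 at dim = 784
```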

In the robot experiments, NGPCA-constV performed better than NGPCA (sections 6.3 and 7.3.2). This performance difference is possibly linked to the assignment of the data points among the units: for NGPCA-constV, this assignment was more balanced. In particular, more training patterns were assigned to the unit with the fewest patterns (sections 6.3 and 7.3.2).

Two parameters of the mixture model, the number of units and the number of principal components, were defined before training. The first was set by trial and error: using more units improved the performance up to the point where the number of training patterns per unit was no longer sufficient for robust training (sections 4.5.2 and 5.3.2). The number of principal components was estimated from the local dimensionality of the distribution, which was obtained experimentally by computing a PCA in the neighborhood of each data point (sections 4.5.2, 6.2.5, and 7.2.5).
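
Such a local-dimensionality estimate could be sketched as below: a PCA on the nearest neighbors of each point, counting the eigenvalues needed to explain most of the local variance. The neighborhood size and the variance threshold are assumptions for illustration, not values taken from the thesis.

```python
import numpy as np

def local_dimensionality(data, k=20, variance_fraction=0.95):
    """Sketch: estimate the local dimensionality of a distribution by
    running a PCA on the k nearest neighbours of each point and counting
    how many eigenvalues explain the given fraction of local variance.
    Returns the median estimate over all points."""
    dims = []
    for x in data:
        sq_dist = np.sum((data - x) ** 2, axis=1)
        neighbors = data[np.argsort(sq_dist)[:k]]
        cov = np.cov(neighbors - neighbors.mean(axis=0), rowvar=False)
        eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]
        cum = np.cumsum(eigvals) / np.sum(eigvals)
        dims.append(int(np.searchsorted(cum, variance_fraction) + 1))
    return int(np.median(dims))
```

For a two-dimensional plane embedded in a five-dimensional space, this estimate returns 2.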

In the second strategy, which uses a hyper-plane to approximate the data distribution, the data were mapped into a higher-dimensional space, a so-called feature space. The algorithm kernel PCA (Schölkopf et al., 1998b) extracted the principal components of the data's mapping in this feature space (section 2.4). This algorithm does not require that the mapping be actually carried out; instead, all computation is done in the original space. The principal components in feature space spanned the hyper-plane that approximated the data.
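
A minimal sketch of this idea, following the standard kernel PCA recipe: all computation uses the kernel matrix of the original data, and the feature-space mapping never appears explicitly. The Gaussian kernel and its width are illustrative choices, not necessarily those used in the thesis.

```python
import numpy as np

def kernel_pca(data, n_components=2, gamma=1.0):
    """Sketch of kernel PCA with a Gaussian kernel: eigendecompose the
    centered kernel matrix and return the projections of the training
    data onto the leading principal components in feature space."""
    sq = np.sum(data**2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * data @ data.T))
    n = len(data)
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one      # center in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)
    idx = np.argsort(eigvals)[::-1][:n_components]  # leading components
    alphas = eigvecs[:, idx] / np.sqrt(eigvals[idx])  # normalized coefficients
    return Kc @ alphas                              # feature-space projections
```

The computation stays in the original space throughout: only inner products, collected in the kernel matrix, are ever evaluated.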

2005-03-22