The results for the classification of the 28×28 digits are shown in table 3.1. The error rates for the two NGPCA variants were averaged over three separate training cycles (the difference between the best and worst cycle was around 0.2% for both variants). Both variants are better than a model using only a single PCA, and also better than Neural Gas with the same number of free parameters. MPPCAext could not be tested on this set because the large distances between digits lead to numerically zero probabilities (the maximum distance in a 784-dimensional cube of side length one is √784 = 28, which is large compared to a of around 0.1).
In the following, the ellipsoids of the NGPCA model are visualized (Möller and Hoffmann, 2004). Figure 3.10 shows the centers of the ten ellipsoids for each digit. Each center represents the local average over a subgroup of digits. Different ways to write a digit become visible, for example, digit `7' with or without a crossbar.
The ellipsoid axes (eigenvectors) for one digit are visualized in figure 3.11. The eigenvectors represent variations around a center. This can be illustrated by adding multiples of an eigenvector to a center (figure 3.12). In the presented example, different sizes of the digit `2' are covered by the local PCA.


Figure 3.13 shows a sample of misclassified digits. Some of the misclassified digits resemble the center they were assigned to (for example, the digit `9'). These digits seem to be extremes that lie close to representatives of another class.

The training set with digits of size 8×8 was used for a comparison with MPPCAext, and also for a comparison with local PCA mixture models from the literature (Hinton et al., 1997; Tipping and Bishop, 1999). These models worked on a different data set (CEDAR, which is commercial); however, the size of the images (8×8) and the number of training patterns (1000 per digit) were the same. Moreover, these models had the same complexity as our models, namely ten units with ten principal components each. Tipping and Bishop (1999) used the discussed MPPCA model, and Hinton et al. (1997) used a mixture model that minimized the reconstruction error (as mentioned in section 2.3). Other mixture models that were tested on handwritten digits have a different complexity; for example, Meinicke and Ritter (2001) used a variable number of principal components. These models were excluded because their results cannot be compared directly. Table 3.2 shows the result of the comparison. The error rates of our models were averaged over three separate training cycles (the difference between the best and worst cycle was around 0.2%). Tipping and Bishop (1999) presented the result of the best training cycle.
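As a rough sketch of how such local PCA mixtures can be used for classification, following the reconstruction-error criterion attributed to Hinton et al. (1997) above, each test pattern is assigned to the class containing the unit that reconstructs it best. The function names and model layout here are hypothetical, not the author's implementation:

```python
import numpy as np

def reconstruction_error(x, center, eigvecs):
    """Squared distance between x and its projection onto a unit's PCA subspace."""
    d = x - center
    coeffs = eigvecs @ d                 # coordinates in the principal subspace
    residual = d - eigvecs.T @ coeffs    # part of d outside the subspace
    return float(residual @ residual)

def classify(x, model):
    """Assign x to the class of the unit with the smallest reconstruction error.

    `model` maps a class label to a list of (center, eigvecs) units,
    e.g. ten units with ten principal components each, as in the comparison.
    """
    best_label, best_err = None, np.inf
    for label, units in model.items():
        for center, eigvecs in units:
            err = reconstruction_error(x, center, eigvecs)
            if err < best_err:
                best_label, best_err = label, err
    return best_label
```

A toy usage: with one unit per class in four dimensions, a pattern near a class center is assigned to that class because its residual outside the local subspace is smallest there.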
