2.4 Kernel PCA

Different from the mixture models, kernel PCA (Schölkopf et al., 1998b) works with only a single PCA. It is an extension of PCA to non-linear distributions: instead of performing PCA directly, the *n* data points $\mathbf{x}_1,\dots,\mathbf{x}_n \in \mathbb{R}^d$ are first mapped into a higher-dimensional (possibly infinite-dimensional) feature space $F$,

$$\Phi : \mathbb{R}^d \to F, \qquad \mathbf{x} \mapsto \Phi(\mathbf{x}). \tag{2.22}$$

As will turn out below, this mapping never needs to be computed explicitly. In the feature space, principal components are extracted. That is, the following eigenvalue problem needs to be solved (here, we first assume that $\{\Phi(\mathbf{x}_i)\}$ has zero mean, see section 2.4.2):

$$\lambda \mathbf{v} = \bar{C} \mathbf{v}, \qquad \lambda \geq 0, \; \mathbf{v} \in F \setminus \{\mathbf{0}\}, \tag{2.23}$$

with the covariance matrix

$$\bar{C} = \frac{1}{n} \sum_{j=1}^{n} \Phi(\mathbf{x}_j)\,\Phi(\mathbf{x}_j)^{\top}. \tag{2.24}$$

Combining (2.23) and (2.24) gives

$$\lambda \mathbf{v} = \frac{1}{n} \sum_{j=1}^{n} \bigl(\Phi(\mathbf{x}_j) \cdot \mathbf{v}\bigr)\,\Phi(\mathbf{x}_j), \tag{2.25}$$

which is equivalent to the set of equations

$$\lambda \bigl(\Phi(\mathbf{x}_k) \cdot \mathbf{v}\bigr) = \bigl(\Phi(\mathbf{x}_k) \cdot \bar{C}\mathbf{v}\bigr), \qquad k = 1,\dots,n. \tag{2.26}$$

The direction from (2.26) back to (2.25) is fulfilled because the left side of (2.25) lies in the span of $\Phi(\mathbf{x}_1),\dots,\Phi(\mathbf{x}_n)$, and (2.26) determines all projections onto this span. Consequently, every solution $\mathbf{v}$ can be expanded in terms of the mapped data points,

$$\mathbf{v} = \sum_{i=1}^{n} \alpha_i\,\Phi(\mathbf{x}_i). \tag{2.27}$$

Substituting the expansion $\mathbf{v} = \sum_i \alpha_i\,\Phi(\mathbf{x}_i)$ into (2.26) and introducing the $n \times n$ kernel matrix $K_{ij} := \bigl(\Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)\bigr)$ yields

$$n\lambda\,K\boldsymbol{\alpha} = K^2 \boldsymbol{\alpha} \tag{2.28}$$

with $\boldsymbol{\alpha} = (\alpha_1,\dots,\alpha_n)^{\top}$, whose relevant solutions are obtained from the eigenvalue problem

$$n\lambda\,\boldsymbol{\alpha} = K\boldsymbol{\alpha}. \tag{2.29}$$

Thus, the vector $\boldsymbol{\alpha}^k$ for each principal component $\mathbf{v}^k$ can be obtained by extracting the eigenvectors of $K$. For further processing, the principal components need to be normalized to unit length in feature space. This can also be established by working solely with the kernel:

$$1 = \bigl(\mathbf{v}^k \cdot \mathbf{v}^k\bigr) = \sum_{i,j=1}^{n} \alpha_i^k \alpha_j^k K_{ij} = \bigl(\boldsymbol{\alpha}^k \cdot K\boldsymbol{\alpha}^k\bigr) = n\lambda_k\,\bigl(\boldsymbol{\alpha}^k \cdot \boldsymbol{\alpha}^k\bigr), \tag{2.30}$$

which results in the normalization rule $\|\boldsymbol{\alpha}^k\| = 1/\sqrt{n\lambda_k}$ for $\boldsymbol{\alpha}^k$.
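As a concrete illustration, the eigendecomposition of $K$ and the subsequent normalization of the $\boldsymbol{\alpha}^k$ can be sketched as follows. This is a minimal sketch, not the reference implementation: function and variable names are illustrative, and the zero-mean assumption on the mapped data is taken for granted (centering is deferred to section 2.4.2).

```python
import numpy as np

def fit_kernel_pca(X, kernel, n_components):
    """Extract normalized expansion coefficients alpha^k for kernel PCA.

    X: (n, d) array of data points; kernel: callable k(x, y).
    Assumes the mapped data {Phi(x_i)} already has zero mean in feature space.
    """
    n = X.shape[0]
    # Kernel matrix K_ij = (Phi(x_i) . Phi(x_j)) = k(x_i, x_j)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    # Solve n*lambda*alpha = K*alpha; eigh handles the symmetric matrix K
    # and returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(K)
    # Keep the leading components (largest eigenvalues of K, i.e. n*lambda_k)
    idx = np.argsort(eigvals)[::-1][:n_components]
    nl, alphas = eigvals[idx], eigvecs[:, idx]
    # Enforce unit-length components: n*lambda_k * (alpha^k . alpha^k) = 1
    alphas = alphas / np.sqrt(nl)
    return nl, alphas
```

The eigenvectors returned by `eigh` have unit Euclidean norm, so dividing each column by the square root of its eigenvalue of $K$ implements the normalization rule directly.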

To apply kernel PCA, a data point's features (its projections onto the principal components) need to be extracted, and the formalism needs to be adjusted to distributions that do not have zero mean in feature space. These two points are addressed in the following sections. Furthermore, a short list of common kernel functions is given.

2005-03-22