Lecture Notes on Data Science: Kernel k-Means Clustering (Part 1)
Christian Bauckhage
B-IT, University of Bonn
In this note, we show that the objective function for k-means clustering
can be cast in a form that allows for invoking the kernel trick.
Introduction
Previously¹, we saw that the problem of k-means clustering of a data set $X = \{x_1, x_2, \dots, x_n\} \subset \mathbb{R}^m$ into $k$ clusters $C_1, \dots, C_k$ is, at its heart, equivalent to the problem of finding $k$ appropriate cluster centroids $\mu_1, \mu_2, \dots, \mu_k$.

¹ C. Bauckhage. Lecture Notes on Data Science: k-Means Clustering, 2015b. DOI: 10.13140/RG.2.1.2829.4886
² C. Bauckhage. Lecture Notes on Data Science: k-Means Clustering Is Gaussian Mixture Modeling, 2015a. DOI: 10.13140/RG.2.1.3033.2646

In a later note², we then saw that the problem of searching for appropriate centroids can be cast as the problem of minimizing the following objective function
E(k) = \sum_{i=1}^{k} \sum_{j=1}^{n} z_{ij} \, \bigl\lVert x_j - \mu_i \bigr\rVert^2    (1)
over all possible choices of the $\mu_i$, where the $z_{ij}$ are so-called latent or indicator variables. They indicate for any data point $x_j$ whether or not it belongs to cluster $C_i$. In other words,

z_{ij} = \begin{cases} 1, & \text{if } x_j \in C_i \\ 0, & \text{otherwise.} \end{cases}    (2)
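To make the role of the indicator variables concrete, here is a minimal sketch (in Python with NumPy; the function and variable names are illustrative and not part of the original notes) that evaluates the objective in (1) for given data, centroids, and indicators.

```python
import numpy as np

def kmeans_objective(X, mu, Z):
    """Evaluate E(k) as in (1): sum_i sum_j z_ij * ||x_j - mu_i||^2.

    X  : (n, m) array whose rows are the data points x_1, ..., x_n
    mu : (k, m) array whose rows are the centroids mu_1, ..., mu_k
    Z  : (k, n) binary array with Z[i, j] = z_ij as defined in (2)
    """
    E = 0.0
    for i in range(mu.shape[0]):        # clusters C_1, ..., C_k
        for j in range(X.shape[0]):     # data points x_1, ..., x_n
            E += Z[i, j] * np.sum((X[j] - mu[i]) ** 2)
    return E
```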
Regarding our topic in this note, it is interesting to observe that these indicator variables provide us with alternative expressions for each cluster centroid $\mu_i$. Up until now, we always considered

\mu_i = \frac{1}{n_i} \sum_{x_j \in C_i} x_j    (3)

where $n_i = \lvert C_i \rvert$ denotes the size of cluster $C_i$. However, using the $z_{ij}$, we may just as well write

\mu_i = \frac{1}{n_i} \sum_{j=1}^{n} z_{ij} \, x_j .    (4)

(Exercise: convince yourself that (3) and (4) are indeed equivalent.)
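The marginal exercise can also be checked numerically. The following sketch (again Python/NumPy, with made-up toy data) confirms that the cluster means computed as in (3) coincide with the indicator-weighted sums in (4).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))                 # six toy points in R^2
Z = np.array([[1, 1, 1, 0, 0, 0],           # z_ij: points 1-3 belong to C_1,
              [0, 0, 0, 1, 1, 1]])          #       points 4-6 belong to C_2

for i in range(Z.shape[0]):
    n_i = Z[i].sum()                                    # n_i = |C_i|
    mu_eq3 = X[Z[i] == 1].mean(axis=0)                  # eq. (3): mean over x_j in C_i
    mu_eq4 = (Z[i, :, None] * X).sum(axis=0) / n_i      # eq. (4): indicator-weighted sum
    assert np.allclose(mu_eq3, mu_eq4)
```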
This result will play a key role in the following. In particular, we
will draw on it to show that the k-means objective function can be
kernelized.
Kernelizing the k-Means Objective Function
So far, our study of the k-means algorithm and its properties was (more or less implicitly) confined to settings involving Euclidean data vectors. Of course, these are very common in our daily practice, but, sometimes, we need to cluster data which are not additive so that the notion of a mean is ill defined³. While it may seem that k-means clustering does not apply to situations like these, it is indeed possible to generalize the approach to basically any kind of data using kernel k-means clustering. In this section, we will have a first brief look at what this means.

³ Consider, e.g., categorical data, textual data, or relational data.
To begin with, we recall the following elementary identity

\lVert x_j - \mu_i \rVert^2 = \bigl( x_j - \mu_i \bigr)^T \bigl( x_j - \mu_i \bigr) = x_j^T x_j - 2\, \mu_i^T x_j + \mu_i^T \mu_i    (5)
which allows us to cast the k-means objective function in (1) as
E(k) = \sum_{i=1}^{k} \sum_{j=1}^{n} z_{ij} \left( x_j^T x_j - 2\, \mu_i^T x_j + \mu_i^T \mu_i \right) .    (6)
Given what we worked out in (4), we next note that

\mu_i^T x_j = \frac{1}{n_i} \sum_{p=1}^{n} z_{ip} \, x_p^T x_j    (7)
as well as
\mu_i^T \mu_i = \frac{1}{n_i^2} \sum_{p=1}^{n} \sum_{q=1}^{n} z_{ip} z_{iq} \, x_p^T x_q    (8)
so that the k-means objective function can also be expressed in the expanded form

E(k) = \sum_{i=1}^{k} \sum_{j=1}^{n} z_{ij} \left( x_j^T x_j - \frac{2}{n_i} \sum_{p=1}^{n} z_{ip} \, x_p^T x_j + \frac{1}{n_i^2} \sum_{p=1}^{n} \sum_{q=1}^{n} z_{ip} z_{iq} \, x_p^T x_q \right) .    (9)
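The two substitutions (7) and (8) that lead from (6) to (9) can be sanity-checked numerically. The sketch below (Python/NumPy; the toy data and variable names are chosen purely for illustration) compares both sides of (7) and (8) for one fixed cluster.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))                 # five toy points in R^3
z = np.array([1, 0, 1, 1, 0])               # indicators z_ip for one fixed cluster C_i
n_i = z.sum()
G = X @ X.T                                 # matrix of inner products G[p, q] = x_p^T x_q

mu_i = (z[:, None] * X).sum(axis=0) / n_i   # centroid mu_i via (4)

j = 2                                       # any data point index
lhs7, rhs7 = mu_i @ X[j], (z * G[:, j]).sum() / n_i             # both sides of (7)
lhs8, rhs8 = mu_i @ mu_i, (np.outer(z, z) * G).sum() / n_i**2   # both sides of (8)
assert np.isclose(lhs7, rhs7) and np.isclose(lhs8, rhs8)
```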
At this point, we can basically conclude our discussion. Looking at (9), we recognize that the k-means objective function in (1) can be written entirely in terms of inner products between data vectors. This allows for invoking the kernel trick, where we replace inner products $x_p^T x_q$ by non-linear kernel functions $k(x_p, x_q)$.

This trick has become a staple in areas such as data mining or pattern recognition because it allows for applying linear techniques in order to tackle nonlinear problems⁴,⁵,⁶.

⁴ C. Bauckhage. Lecture Notes on the Kernel Trick (I), 2015c. DOI: 10.13140/2.1.4524.8806
⁵ J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004
⁶ B. Schölkopf and A. Smola. Learning with Kernels – Support Vector Machines, Optimization and Beyond. MIT Press, 2002
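To illustrate what the kernelized form buys us, here is a minimal sketch (Python/NumPy; the function name and interface are my own, not from these notes) that evaluates (9) using nothing but a kernel matrix K with entries K[p, q] = k(x_p, x_q). With the linear kernel K = X @ X.T it reproduces the Euclidean objective in (1); plugging in any other positive-definite kernel, e.g. a Gaussian/RBF kernel, yields the kernelized objective instead.

```python
import numpy as np

def kernel_kmeans_objective(K, Z):
    """Evaluate E(k) as in (9) from a kernel matrix alone.

    K : (n, n) kernel matrix, K[p, q] = k(x_p, x_q)
    Z : (k, n) binary indicator matrix with Z[i, j] = z_ij
    """
    E = 0.0
    for i in range(Z.shape[0]):
        z = Z[i].astype(float)
        n_i = z.sum()
        if n_i == 0:                                    # empty cluster contributes nothing
            continue
        quad = (np.outer(z, z) * K).sum() / n_i**2      # third term of (9), identical for all j
        for j in np.flatnonzero(z):                     # only points with z_ij = 1 contribute
            E += K[j, j] - 2.0 / n_i * (z * K[:, j]).sum() + quad
    return E
```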
In the context of k-means clustering, the kernel trick may help us to obtain reasonable clusters even for highly non-Gaussian data. Furthermore, kernel functions may be defined for a wide range of data types so that k-means clustering is no longer confined to Euclidean vectors. Looking at (9), we realize that kernel k-means clustering is basically tantamount to determining suitable values of the indicator variables $z_{ij}$.

However, the latter is usually a rather daunting problem and applying the kernel trick typically increases computation times. It also requires experience as to appropriate kernel functions and necessitates especially careful initializations of the algorithm⁷. For the time being, we therefore postpone solution strategies for all of these problems to later notes.

⁷ Note that these aspects of kernel k-means clustering are often passed over in the literature.
References
C. Bauckhage. Lecture Notes on Data Science: k-Means
Clustering Is Gaussian Mixture Modeling, 2015a. DOI:
10.13140/RG.2.1.3033.2646.
C. Bauckhage. Lecture Notes on Data Science: k-Means Clustering,
2015b. DOI: 10.13140/RG.2.1.2829.4886.
C. Bauckhage. Lecture Notes on the Kernel Trick (I), 2015c. DOI:
10.13140/2.1.4524.8806.
B. Schölkopf and A. Smola. Learning with Kernels – Support Vector Machines, Optimization and Beyond. MIT Press, 2002.
J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.