Lecture Notes on Machine Learning
Kernel k-Means Clustering (Part 2)
Christian Bauckhage
B-IT, University of Bonn
Earlier, we saw that k-means clustering allows for invoking the kernel
trick. Here, we discuss the problem we have to solve in kernel k-means
clustering and how it differs from the conventional k-means problem.
Setting the Stage
Recall (Bauckhage and Cremers, 2019) that k-means clustering aims at partitioning a given set of $n$
data points $x_j \in \mathbb{R}^m$ into $k$ clusters $C_i$ which are defined in terms of
prototypes $\mu_i$. Hence, the basic problem is to find a set $\{\mu_1^*, \ldots, \mu_k^*\}$
of optimal cluster prototypes. This can be formalized as the problem
of finding those $\mu_i$ that minimize
k-means objective
$$
E = \sum_{i=1}^{k} \sum_{j=1}^{n} z_{ij} \, \bigl\lVert x_j - \mu_i \bigr\rVert^2 \tag{1}
$$
where the $z_{ij}$ are binary indicator variables (Bauckhage and Speicher, 2019) given by

$$
z_{ij} =
\begin{cases}
1, & \text{if } x_j \in C_i \\
0, & \text{otherwise.}
\end{cases} \tag{2}
$$
In other and more abstract words, common algorithms for k-means
clustering try to solve the following minimization problem
k-means problem
$$
\mu_i^* = \operatorname*{argmin}_{\{\mu_i\}} \; \sum_{i=1}^{k} \sum_{j=1}^{n} z_{ij} \, \bigl\lVert x_j - \mu_i \bigr\rVert^2 . \tag{3}
$$
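For concreteness, here is a minimal NumPy sketch of how one might evaluate the objective in (1) and (3) for given prototypes and indicators; the function name and the convention of storing the indicators in a $(k, n)$ matrix are our own and not part of the original notes.

```python
import numpy as np

def kmeans_objective(X, M, Z):
    """Evaluate the k-means objective E in (1).

    X : (n, m) array of data points x_j (stored row-wise)
    M : (k, m) array of cluster prototypes mu_i
    Z : (k, n) binary indicator matrix with Z[i, j] = z_ij
    """
    # squared Euclidean distances ||x_j - mu_i||^2 for all pairs (i, j)
    D = ((M[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # shape (k, n)
    # weight the distances by the indicators and sum over clusters and points
    return float(np.sum(Z * D))
```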
Previously (Bauckhage, 2019), we saw that we can kernelize the k-means objective in
(1). To recap, this is to say that we can rewrite it such that data points
only occur in the form of inner products with other data points and that
we can replace these inner products by kernel functions. Following
this recipe, we found that the minimization objective in (1) becomes
kernel k-means objective
$$
E_K = \sum_{j=1}^{n} K(x_j, x_j) \;-\; \sum_{i=1}^{k} \frac{1}{n_i} \sum_{p=1}^{n} \sum_{q=1}^{n} z_{ip} \, z_{iq} \, K(x_p, x_q) \tag{4}
$$
where

$$
n_i = \sum_{j=1}^{n} z_{ij} \tag{5}
$$

and $K : \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}$ is a Mercer kernel.
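As a quick illustration, the following minimal NumPy sketch (our own naming and array conventions, not taken from the original notes) evaluates (4) and (5) given a precomputed kernel matrix and an indicator matrix.

```python
import numpy as np

def kernel_kmeans_objective(K, Z):
    """Evaluate the kernel k-means objective E_K in (4).

    K : (n, n) Mercer kernel matrix with K[p, q] = K(x_p, x_q)
    Z : (k, n) binary indicator matrix with Z[i, j] = z_ij
    """
    # cluster sizes n_i as in (5)
    n_i = Z.sum(axis=1)
    # first term: sum_j K(x_j, x_j) is the trace of the kernel matrix
    first = np.trace(K)
    # second term: sum_i (1/n_i) sum_{p,q} z_ip z_iq K(x_p, x_q)
    second = sum((z @ K @ z) / n for z, n in zip(Z, n_i) if n > 0)
    return float(first - second)
```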
But what exactly is the problem we have to solve with respect to
this kernelized k-means objective?
After all, in k-means clustering, we treat the cluster prototypes as
problem variables and assume that suitable prototypes will minimize
(1). But equation (4) does not involve cluster prototypes anymore!
What then is the kernel k-means problem? With respect to which
variables do we have to minimize (4)? And what will we find when
we minimize (4)?
The Kernel k-Means Clustering Problem
Indeed, as a consequence of kernelizing (1), the cluster prototypes
$\mu_i$ have vanished from the minimization objective in (4). Rather, the
only problem variables left are the indicator variables $z_{ij}$.

Recall that k-means clustering poses a chicken-and-egg problem: if we know the
prototypes $\mu_i$, we can compute clusters $C_i$ or indicators $z_{ij}$; if we know the
clusters $C_i$ or indicators $z_{ij}$, we can compute prototypes $\mu_i$. In (1), we treat
the $z_{ij}$ according to the first view (as dependent variables); in (4), we treat them
according to the second view (as independent variables).
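This chicken-and-egg relation can be spelled out in two tiny helper functions. The following is only an illustrative NumPy sketch with our own naming, assuming data stored row-wise and no empty clusters.

```python
import numpy as np

def indicators_from_prototypes(X, M):
    """First view: given prototypes mu_i, compute indicators z_ij."""
    # assign each point to its nearest prototype
    labels = np.argmin(((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2), axis=1)
    Z = np.zeros((M.shape[0], X.shape[0]))
    Z[labels, np.arange(X.shape[0])] = 1.0
    return Z

def prototypes_from_indicators(X, Z):
    """Second view: given indicators z_ij, compute prototypes mu_i."""
    # cluster-wise means; assumes every cluster contains at least one point
    return (Z @ X) / Z.sum(axis=1, keepdims=True)
```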
The basic problem of kernel k-means clustering must therefore be
to find an optimal set $\{z_{ij}^*\}$ of cluster membership indicators.
However, if we set out to minimize (4) over all admissible choices
of the $z_{ij}$, we should note that the first term in (4) does not depend
on the indicator variables. Accordingly, we only need to consider the
second term, so that the kernel k-means problem becomes that of solving
the kernel k-means problem
$$
\begin{aligned}
z_{ij}^* = \operatorname*{argmin}_{\{z_{ij}\}} \;& -\sum_{i=1}^{k} \frac{1}{n_i} \sum_{p=1}^{n} \sum_{q=1}^{n} z_{ip} \, z_{iq} \, K(x_p, x_q) \\
\text{s.t.} \;\;& z_{ij} \in \{0, 1\} \\
& \sum_{i=1}^{k} z_{ij} = 1
\end{aligned} \tag{6}
$$
kernel k-means clustering is a constrained optimization problem
Note that (6) is a constrained optimization problem! This is because
we must ensure that the minimization variables $z_{ij}$ (and hence the
results $z_{ij}^*$) are proper cluster membership indicators.
This means that, first of all, the minimization variables must be
binary variables,

$$
z_{ij} \in \{0, 1\}, \tag{7}
$$

because, in k-means clustering, data point $x_j$ either belongs to cluster
$C_i$ or not. Second of all, because each data point can belong to only
one cluster, the minimization variables have to obey

$$
\sum_{i=1}^{k} z_{ij} = 1 . \tag{8}
$$

In other words, minimizing the kernel k-means objective in (4) has to
happen such that (s.t.) the two constraints in (7) and (8) are met.
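To make the combinatorial nature of this problem tangible, the following purely illustrative NumPy sketch (our own construction, not part of the original notes) enumerates every admissible indicator matrix for a tiny data set and returns the best one; this brute-force search is of course only feasible for very small $n$ and $k$.

```python
import itertools
import numpy as np

def brute_force_kernel_kmeans(K, k):
    """Exhaustively solve (6) for a tiny kernel matrix K and k clusters.

    Every admissible indicator matrix assigns each of the n points to
    exactly one cluster, so constraints (7) and (8) hold by construction
    and there are k**n candidate matrices to test.
    """
    n = K.shape[0]
    best_Z, best_val = None, np.inf
    for labels in itertools.product(range(k), repeat=n):
        Z = np.zeros((k, n))
        Z[np.array(labels), np.arange(n)] = 1.0
        n_i = Z.sum(axis=1)
        if np.any(n_i == 0):          # skip assignments with empty clusters
            continue
        # objective of (6): the negated second term of (4)
        val = -sum((z @ K @ z) / m for z, m in zip(Z, n_i))
        if val < best_val:
            best_Z, best_val = Z, val
    return best_Z
```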
Pros and Cons of Kernel k-Means Clustering
If we compare (3) and (6), it seems that conventional k-means and
kernel k-means clustering pose rather different problems. To see how
fundamentally different they actually are, we next look into some
of the (dis)advantages of kernel k-means over conventional k-means
clustering.
An important practical issue in kernel k-means clustering is
due to the nature of the problem we have to solve.
Figure 1: Example of a data clustering problem where the data do not form convex clusters; panels show (a) a sample of two-dimensional data points, (b) the kernel k-means clustering result for $k=2$, (c) the result for $k=3$, and (d) the same as in (c) with cluster means computed using (10). In situations like these, kernel k-means may lead to reasonable results. However, the quality of results depends on the choice of kernel function and, if computable at all, cluster means might be of little practical use.

integer programming problem

− Since its optimization variables $z_{ij}$ are integers, the kernel k-means
problem is an integer programming problem and thus generally
difficult to solve. Contrary to optimization problems over real-valued
variables, we can, for instance, not resort to methods based
on calculus. Rather, to be guaranteed to find the optimal solution,
we would have to test every admissible instantiation of the
discrete problem variables. On a digital computer, this quickly
becomes prohibitively expensive even for moderately many data
points and clusters; in fact, integer programming is NP-complete.
(On a quantum computer, however, it might be feasible; see Bauckhage et al., 2017.)
This is reminiscent of conventional k-means clustering which,
we recall, is a hard problem, too (Garey et al., 1982; Aloise et al., 2009).
Just as in conventional k-means
clustering, algorithms for solving the kernel k-means clustering
problem are therefore typically heuristics for which there is no
guarantee that they will find the optimal solution.
Indeed, in practice, we either relax the kernel k-means problem,
i.e. approximate it as a continuous optimization problem, or tackle
it using greedy algorithms. Both ideas usually work well and we
will study them in detail later on.
A considerable advantage of kernel k-means clustering is that
it reaps the benefits of invoking the kernel trick.
+ Kernel k-means may produce reasonable results even for data sets
that contain non-convex clusters that are not linearly separable
(see Fig. 1).
Moreover, kernel k-means is not confined to numerical data but
also applies to data where the notion of a mean does not make
sense; common examples include relational, categorical, or textual data.
In other words, since kernel functions can be defined for a
wide range of data types, kernel k-means generalizes to basically
any kind of data (see Fig. 2).

Figure 2: A data set $X$ of ten strings and two clusters $C_1$ and $C_2$ found through kernel k-means clustering:
$X$ = {rod flanders, abe simpson, lisa simpson, todd flanders, ned flanders, marge simpson, bart simpson, homer simpson, maude flanders, maggie simpson}
$C_1$ = {rod flanders, todd flanders, ned flanders, maude flanders}
$C_2$ = {abe simpson, lisa simpson, marge simpson, bart simpson, homer simpson, maggie simpson}
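The notes do not state which kernel produced the clusters in Fig. 2, but to illustrate that Mercer kernels can indeed be defined on strings, here is a minimal sketch of a 2-spectrum (character bigram) kernel; the function name and the choice of kernel are our own assumptions, chosen purely for illustration.

```python
from collections import Counter

def bigram_string_kernel(s, t):
    """Toy string kernel: inner product of character bigram counts.

    This is a simple instance of a p-spectrum kernel (here p = 2) and is
    a valid Mercer kernel, since it is an inner product of explicit
    feature vectors (bigram histograms).
    """
    bigrams_s = Counter(s[i:i + 2] for i in range(len(s) - 1))
    bigrams_t = Counter(t[i:i + 2] for i in range(len(t) - 1))
    return sum(count * bigrams_t[b] for b, count in bigrams_s.items())

# e.g. bigram_string_kernel("ned flanders", "rod flanders") counts the
# bigrams the two strings share, weighted by their frequencies
```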
Another practical issue in kernel k-means clustering often glossed
over in the literature is the following.
− Kernel k-means requires experience as to what kind of kernel
functions work well in what kind of settings.
When in doubt, most practitioners resort to Gaussian kernels since
these are usually believed to work well. However, care is needed!
Even minute differences with respect to the task to be solved may
require careful tuning of the parameters of the kernel function.
For example, to produce the results in Figs. 1(b) and 1(c), we used
a Gaussian kernel

$$
K(x_p, x_q) = \exp\left( - \frac{\lVert x_p - x_q \rVert^2}{2 \sigma^2} \right) \tag{9}
$$

but actually had to choose different scale parameters $\sigma$ in order to
obtain reasonable results for $k=2$ and $k=3$, respectively.
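For reference, here is a short NumPy sketch that computes the Gaussian kernel matrix in (9) for a whole data set; the function name and the vectorized computation are our own, and the scale parameter sigma still has to be chosen by the user, as discussed above.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """Gaussian kernel matrix K[p, q] = exp(-||x_p - x_q||^2 / (2 sigma^2)) as in (9).

    X     : (n, m) array of data points
    sigma : scale parameter, which typically needs tuning per task
    """
    sq_norms = np.sum(X ** 2, axis=1)
    # ||x_p - x_q||^2 = ||x_p||^2 + ||x_q||^2 - 2 x_p^T x_q
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    # clip tiny negative values caused by floating point round-off
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma ** 2))
```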
In other words, there is no one-size-fits-all solution when using
kernel k-means clustering. Even in very simple settings such as
the one in Fig. 1, the quality or usefulness of clustering results
will crucially depend on the parametrization of the method.
To paraphrase once more, if kernel k-means clustering does not
seem to work well in practice, we should not blame the method
but the way we use it. Since careful parameter tuning is usually
time consuming even when done automatically, kernel k-means should
never be used without consideration; often the problem at hand is not
worth the hassle but can be solved using other methods.
Another crucial characteristic of kernel k-means clustering
may or may not be a drawback depending on the application context.
± Kernel k-means does not yield cluster means! All we obtain from
solving the kernel k-means clustering problem are assignments of
data points to clusters.
In applications where we are only interested in data clustering,
this is good enough. In applications where we actually need clus-
ter prototypes in downstream computations, kernel k-means will
generally be of little use.
Practitioners are sometimes tempted to run kernel k-means so as
to determine cluster membership indicators $z_{ij}^*$ and then to use
them to compute

$$
\mu_i^* = \frac{1}{n_i^*} \sum_{j=1}^{n} z_{ij}^* \, x_j . \tag{10}
$$
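Purely to illustrate what computing (10) amounts to in code, here is a minimal NumPy sketch (our own naming, not from the original notes); the caveats discussed next apply in full to its output.

```python
import numpy as np

def cluster_means_from_indicators(X, Z):
    """Compute means according to (10) from kernel k-means indicators.

    X : (n, m) array of data points
    Z : (k, n) binary indicator matrix with Z[i, j] = z*_ij
        (assumed to contain no empty clusters)

    Caveat: as discussed below, these means live in the original data
    space and may be poor prototypes for non-convex clusters.
    """
    n_i = Z.sum(axis=1)            # cluster sizes n*_i
    return (Z @ X) / n_i[:, None]  # row i holds mu*_i
```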
But this is generally a bad idea with spurious outcomes because
there are two things we must keep in mind.
On the one hand, using kernel functions $K(x_p, x_q) = \phi(x_p)^\top \phi(x_q)$,
kernel k-means implicitly operates in abstract feature spaces. As
we just saw, say in Fig. 1(c), this can reveal non-convex structures
in the original data space. But what good is the mean of a non-
convex cluster as a prototype? For instance, using (10), we found
the means shown in Fig. 1(d). While they certainly reside in the
center of their clusters, we observe that the data in the circular
cluster are far from their mean and that the data in the two moon
shaped clusters are actually closer to this mean. Whether or not
behavior like this is a bug or a feature will depend on the appli-
cation context. Note, however, that, in this simple example, we
could recognize this potential problem through visual inspection.
For much higher dimensional data, which we can not visualize as
easily, we may not be able to see how means computed according
to (10) behave and may thus not be able to recognize potentially
misleading results.
On the other hand, computing (10) may not just be futile but make
no sense at all. For example, consider again the data and results
in Fig. 2. What would be the mean of a cluster of strings? If we
are in need of identifying prototypical strings, kernel k-means
will be of little help. Rather, we should resort to (closely) related
methods such as, for instance, k-medoids clustering (Bauckhage, 2015).
Summary and Outlook
In this note, we saw that the problem at the heart of kernel k-means
clustering is to solve
$$
\begin{aligned}
z_{ij}^* = \operatorname*{argmin}_{\{z_{ij}\}} \;& -\sum_{i=1}^{k} \frac{1}{n_i} \sum_{p=1}^{n} \sum_{q=1}^{n} z_{ip} \, z_{iq} \, K(x_p, x_q) \\
\text{s.t.} \;\;& z_{ij} \in \{0, 1\} \\
& \sum_{i=1}^{k} z_{ij} = 1 .
\end{aligned} \tag{11}
$$
We discussed the mathematical nature of this problem as well as
some of the (dis)advantages of its solution (general pros and cons of
kernel k-means clustering).
The crucial open questions of what kind of algorithms to use in
order to actually solve this problem and how to practically compute
results as shown in Figs. 1 and 2 will be answered later on.
Acknowledgments
This material was prepared within project P3ML which is funded by
the Ministry of Education and Research of Germany (BMBF) under
grant number 01/S17064. The authors gratefully acknowledge this
support.
References
D. Aloise, A. Deshpande, P. Hansen, and P. Popat. NP-Hardness of Euclidean Sum-of-Squares Clustering. Machine Learning, 75(2), 2009.

C. Bauckhage. NumPy / SciPy Recipes for Data Science: k-Medoids Clustering. researchgate.net, Feb. 2015.

C. Bauckhage. Lecture Notes on Machine Learning: Kernel k-Means Clustering (Part 1). B-IT, University of Bonn, 2019.

C. Bauckhage and O. Cremers. Lecture Notes on Machine Learning: k-Means Clustering. B-IT, University of Bonn, 2019.

C. Bauckhage and D. Speicher. Lecture Notes on Machine Learning: Rewriting the k-Means Objective. B-IT, University of Bonn, 2019.

C. Bauckhage, E. Brito, K. Cvejoski, C. Ojeda, R. Sifa, and S. Wrobel. Ising Models for Binary Clustering via Adiabatic Quantum Computing. In Proc. EMMCVPR, 2017.

M. Garey, D. Johnson, and H. Witsenhausen. The Complexity of the Generalized Lloyd-Max Problem (Corresp.). IEEE Trans. on Information Theory, 28(2), 1982.