Lecture Notes on Data Science: Kernel k-Means Clustering (Part 1)
Christian Bauckhage
B-IT, University of Bonn
In this note, we show that the objective function for k-means clustering can be cast in a form that allows for invoking the kernel trick.
Introduction
Previously¹, we saw that the problem of k-means clustering of a data set $X = \{x_1, x_2, \ldots, x_n\} \subset \mathbb{R}^m$ into $k$ clusters $C_1, \ldots, C_k$ is, at its heart, equivalent to the problem of finding $k$ appropriate cluster centroids $\mu_1, \mu_2, \ldots, \mu_k$.

¹ C. Bauckhage. Lecture Notes on Data Science: k-Means Clustering, 2015b. DOI: 10.13140/RG.2.1.2829.4886
In a later note², we then saw that the problem of searching for appropriate centroids can be cast as the problem of minimizing the following objective function

$$E(k) = \sum_{i=1}^{k} \sum_{j=1}^{n} z_{ij} \, \| x_j - \mu_i \|^2 \tag{1}$$
over all possible choices of the $\mu_i$, where the $z_{ij}$ are so-called latent or indicator variables. They indicate for any data point $x_j$ whether or not it belongs to cluster $C_i$. In other words,

$$z_{ij} = \begin{cases} 1, & \text{if } x_j \in C_i \\ 0, & \text{otherwise.} \end{cases} \tag{2}$$

² C. Bauckhage. Lecture Notes on Data Science: k-Means Clustering Is Gaussian Mixture Modeling, 2015a. DOI: 10.13140/RG.2.1.3033.2646
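To make the notation concrete, here is a minimal sketch (not part of the original note) of how the objective in (1) could be evaluated with NumPy. The names `X` (an n × m data matrix), `mu` (a k × m matrix of centroids), and `Z` (a k × n matrix of indicator variables as in (2)) are hypothetical.

```python
import numpy as np

def kmeans_objective(X, mu, Z):
    """Evaluate E(k) = sum_i sum_j z_ij * ||x_j - mu_i||^2 as in (1)."""
    E = 0.0
    for i in range(mu.shape[0]):                      # loop over clusters C_i
        sq_dists = np.sum((X - mu[i])**2, axis=1)     # ||x_j - mu_i||^2 for all j
        E += Z[i] @ sq_dists                          # only points with z_ij = 1 contribute
    return E
```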
Regarding our topic in this note, it is interesting to observe that these indicator variables provide us with alternative expressions for each cluster centroid $\mu_i$. Up until now, we always considered

$$\mu_i = \frac{1}{n_i} \sum_{x_j \in C_i} x_j \tag{3}$$

where $n_i = |C_i|$ denotes the size of cluster $C_i$. However, using the $z_{ij}$, we may just as well write

$$\mu_i = \frac{1}{n_i} \sum_{j=1}^{n} z_{ij} \, x_j. \tag{4}$$

Exercise: convince yourself that (3) and (4) are indeed equivalent.
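As a quick check of the exercise, the following sketch (not from the note; it assumes NumPy and hypothetical data and memberships) computes each centroid once per (3), by averaging over the cluster, and once per (4), from the indicator variables, and confirms that the two expressions agree.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 12, 2, 3
X = rng.normal(size=(n, m))                 # hypothetical data set of n points in R^m
labels = np.repeat(np.arange(k), n // k)    # hypothetical cluster memberships

Z = np.zeros((k, n))                        # indicator variables z_ij as in (2)
Z[labels, np.arange(n)] = 1.0

for i in range(k):
    n_i = Z[i].sum()                        # cluster size n_i = |C_i|
    mu_eq3 = X[labels == i].mean(axis=0)    # centroid per (3): average over C_i
    mu_eq4 = (Z[i] @ X) / n_i               # centroid per (4): weighted sum over all points
    assert np.allclose(mu_eq3, mu_eq4)
```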
This result will play a key role in the following. In particular, we
will draw on it to show that the k-means objective function can be
kernelized.
Kernelizing the k-Means Objective Function
So far, our study of the k-means algorithm and its properties was (more or less implicitly) confined to settings involving Euclidean data vectors. Of course, these are very common in our daily practice, but, sometimes, we need to cluster data which are not additive so that the notion of a mean is ill defined³. While it may seem that k-means clustering does not apply to situations like these, it is indeed possible to generalize the approach to basically any kind of data using kernel k-means clustering. In this section, we will have a first brief look at what this means.

³ Consider, e.g., categorical data, textual data, or relational data.
To begin with, we recall the following elementary identity
$$\| x_j - \mu_i \|^2 = (x_j - \mu_i)^T (x_j - \mu_i) = x_j^T x_j - 2\, \mu_i^T x_j + \mu_i^T \mu_i \tag{5}$$
which allows us to cast the k-means objective function in (1) as
$$E(k) = \sum_{i=1}^{k} \sum_{j=1}^{n} z_{ij} \left( x_j^T x_j - 2\, \mu_i^T x_j + \mu_i^T \mu_i \right). \tag{6}$$
Given what we worked out in (4), we next note that
$$\mu_i^T x_j = \frac{1}{n_i} \sum_{p=1}^{n} z_{ip} \, x_p^T x_j \tag{7}$$
as well as
$$\mu_i^T \mu_i = \frac{1}{n_i^2} \sum_{p=1}^{n} \sum_{q=1}^{n} z_{ip} z_{iq} \, x_p^T x_q \tag{8}$$
so that the k-means objective function can also be expressed as
$$E(k) = \sum_{i=1}^{k} \sum_{j=1}^{n} z_{ij} \left( x_j^T x_j - \frac{2}{n_i} \sum_{p=1}^{n} z_{ip} \, x_p^T x_j + \frac{1}{n_i^2} \sum_{p=1}^{n} \sum_{q=1}^{n} z_{ip} z_{iq} \, x_p^T x_q \right). \tag{9}$$
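As a sanity check on the algebra leading to (9), the following sketch (an illustration, not part of the note; NumPy and hypothetical data assumed) evaluates (1) directly from explicit centroids and compares it with (9) evaluated purely from the Gram matrix of inner products $x_p^T x_q$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k = 15, 3, 3
X = rng.normal(size=(n, m))                   # hypothetical data
labels = np.repeat(np.arange(k), n // k)      # hypothetical memberships
Z = np.zeros((k, n))
Z[labels, np.arange(n)] = 1.0

# objective (1): explicit centroids and squared Euclidean distances
E1 = 0.0
for i in range(k):
    mu_i = (Z[i] @ X) / Z[i].sum()            # centroid per (4)
    E1 += Z[i] @ np.sum((X - mu_i)**2, axis=1)

# objective (9): only inner products x_p^T x_q enter
G = X @ X.T                                   # Gram matrix of inner products
E9 = 0.0
for i in range(k):
    z, n_i = Z[i], Z[i].sum()
    E9 += z @ np.diag(G) - (2.0 / n_i) * (z @ G @ z) + (n_i / n_i**2) * (z @ G @ z)

assert np.isclose(E1, E9)
```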
At this point, we can basically conclude our discussion. Looking
at (9), we recognize that the k-means objective function in (1) can be
written entirely in terms of inner products between data vectors. This
allows for invoking the kernel trick where we replace inner products
$x_p^T x_q$ by non-linear kernel functions $k(x_p, x_q)$.
This trick has become a staple in areas such as data mining or pattern recognition because it allows for applying linear techniques in order to tackle nonlinear problems⁴,⁵,⁶.

⁴ C. Bauckhage. Lecture Notes on the Kernel Trick (I), 2015c. DOI: 10.13140/2.1.4524.8806
⁵ J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004
⁶ B. Schölkopf and A. Smola. Learning with Kernels: Support Vector Machines, Optimization and Beyond. MIT Press, 2002
In the context of k-means clustering, the kernel trick may
help us to obtain reasonable clusters even for highly non-Gaussian
data. Furthermore, kernel functions may be defined for a wide range
of data types so that k-means clustering is no longer confined to
Euclidean vectors. Looking at (9), we realize that kernel k-means
clustering is basically tantamount to determining suitable values of
the indicator variables $z_{ij}$.
However, the latter is usually a rather daunting problem and applying the kernel trick typically increases computation times. It also requires experience as to appropriate kernel functions and necessitates especially careful initializations of the algorithm⁷. For the time being, we therefore postpone solution strategies for all of these problems to later notes.

⁷ Note that these aspects of kernel k-means clustering are often passed over in the literature.
References
C. Bauckhage. Lecture Notes on Data Science: k-Means Clustering Is Gaussian Mixture Modeling, 2015a. DOI: 10.13140/RG.2.1.3033.2646.

C. Bauckhage. Lecture Notes on Data Science: k-Means Clustering, 2015b. DOI: 10.13140/RG.2.1.2829.4886.

C. Bauckhage. Lecture Notes on the Kernel Trick (I), 2015c. DOI: 10.13140/2.1.4524.8806.

B. Schölkopf and A. Smola. Learning with Kernels: Support Vector Machines, Optimization and Beyond. MIT Press, 2002.

J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.