Lecture Notes on Data Science: k-Means Clustering Is Matrix Factorization
Christian Bauckhage
B-IT, University of Bonn
In this note, we show that k-means clustering can be understood as a
constrained matrix factorization problem. This insight will later allow
us to recognize that k-means clustering is but a specific latent factor
model and closely related to techniques such as non-negative matrix
factorization or archetypal analysis.
Introduction
Previously, we discussed [1] that hard k-means clustering of a data set $X = \{x_1, x_2, \dots, x_n\} \subset \mathbb{R}^m$ into $k$ clusters $C_1, \dots, C_k$ boils down to the problem of finding appropriate cluster centroids $\mu_1, \dots, \mu_k$ and that these will minimize the following objective function

E(k) = \sum_{i=1}^{k} \sum_{j=1}^{n} z_{ij} \, \| x_j - \mu_i \|^2    (1)

where

z_{ij} = \begin{cases} 1, & \text{if } x_j \in C_i \\ 0, & \text{otherwise.} \end{cases}    (2)

[1] C. Bauckhage. Lecture Notes on Data Science: k-Means Clustering, 2015b. DOI: 10.13140/RG.2.1.2829.4886; and C. Bauckhage. Lecture Notes on Data Science: k-Means Clustering Is Gaussian Mixture Modeling, 2015a. DOI: 10.13140/RG.2.1.3033.2646.
Our purpose in this note is to show that there is yet another way of formalizing the k-means objective in (1).
To this end, we note that we may understand the binary indicator variables $z_{ij}$ in (2) as the elements of an indicator matrix $Z \in \mathbb{R}^{k \times n}$. We also observe that we may think of the given data points $x_j$ as the columns of a data matrix

X = \bigl[\, x_1 \; x_2 \; \dots \; x_n \,\bigr] \in \mathbb{R}^{m \times n}    (3)

and that we may furthermore introduce a centroid matrix

M = \bigl[\, \mu_1 \; \mu_2 \; \dots \; \mu_k \,\bigr] \in \mathbb{R}^{m \times k}    (4)

whose columns correspond to the cluster centroids that are to be determined.
Given the matrices defined in (2), (3), and (4), we will show that the k-means objective function in (1) can indeed be written as

\sum_{i=1}^{k} \sum_{j=1}^{n} z_{ij} \, \| x_j - \mu_i \|^2 = \| X - M Z \|_F^2    (5)

where $\| \cdot \|_F$ denotes the matrix Frobenius norm.
In other words, we will show that k-means clustering is a matrix factorization problem! If there were two appropriate matrices $M$ and $Z$ that would minimize the right-hand side of (5), the data matrix $X$ could be approximated as $X \approx M Z$.

Exercise: convince yourself that $M Z$ is an $m \times n$ matrix.
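Before proving this, a quick numerical sanity check of (5) may be reassuring. The following sketch is our own addition and assumes NumPy is available; all variable names are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, k = 3, 10, 4

    X = rng.normal(size=(m, n))   # data matrix, one data point per column
    M = rng.normal(size=(m, k))   # centroid matrix, one centroid per column

    # valid indicator matrix: exactly one 1 per column, cf. (2)
    labels = rng.integers(0, k, size=n)
    Z = np.zeros((k, n))
    Z[labels, np.arange(n)] = 1

    # left-hand side of (5): sum of squared distances to assigned centroids
    lhs = sum(np.sum((X[:, j] - M[:, labels[j]])**2) for j in range(n))

    # right-hand side of (5): squared Frobenius norm of X - MZ
    rhs = np.linalg.norm(X - M @ Z, 'fro')**2

    print(np.isclose(lhs, rhs))   # True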
Proving Equation (5)
In this section, we will prove that our claim in (5) does indeed hold. The basic idea is to expand both sides of the equation into several more elementary terms and to show that the expressions we obtain for the left- and right-hand sides are indeed equivalent.
Yet, before we set out to do so, we will remind ourselves of general
properties of the Frobenius norm and point out some of the peculiar
features of the binary indicator matrix Z.
General Properties of the Squared Frobenius Norm of a Matrix
Let $A \in \mathbb{R}^{m \times n}$ be any real-valued matrix of $m$ rows and $n$ columns. To denote individual elements of such a matrix, we either write $a_{ij}$ or $(A)_{ij}$, and to refer to the $j$-th column vector of $A$, we write $a_j$.

The squared Frobenius norm of $A$ is defined as

\| A \|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2    (6)
and we recall the following properties

\| A \|_F^2 = \sum_{j=1}^{n} \| a_j \|^2 = \sum_{j=1}^{n} a_j^T a_j = \sum_{j=1}^{n} \bigl( A^T A \bigr)_{jj} = \operatorname{tr}\bigl( A^T A \bigr).    (7)

Since our derivation below will frequently allude to the identities in (7), readers are encouraged to verify (7) for themselves.

Exercise: convince yourself that all the equalities in (7) do hold.
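As a concrete aid for this exercise, here is a short sketch (our own, NumPy assumed) that checks all equalities in (7) on a random matrix:

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.normal(size=(5, 7))

    frob2 = np.sum(A**2)                                   # definition (6)
    col_norms = sum(np.sum(A[:, j]**2) for j in range(7))  # sum of squared column norms
    col_inner = sum(A[:, j] @ A[:, j] for j in range(7))   # sum of inner products a_j^T a_j
    diag_sum = np.sum(np.diag(A.T @ A))                    # sum of diagonal entries of A^T A
    trace = np.trace(A.T @ A)                              # trace of A^T A

    print(np.allclose([col_norms, col_inner, diag_sum, trace], frob2))  # True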
Peculiar Properties of the Indicator Matrix Z
If the clusters $C_1, \dots, C_k$ have distinct cluster centroids $\mu_1, \dots, \mu_k$, each of the $n$ columns of $Z$ will contain a single element that is 1 and $k-1$ elements that are 0. Accordingly, each column $j$ of $Z$ will sum to one

\sum_{i=1}^{k} z_{ij} = 1    (8)

and the $k$ different row sums will indicate the number of elements per cluster, that is, for each row $i$ of $Z$, we have

\sum_{j=1}^{n} z_{ij} = | C_i | = n_i.    (9)
Moreover, since $z_{ij} \in \{0, 1\}$ and each column of $Z$ only contains a single 1, the rows of $Z$ are pairwise perpendicular because

z_{ij} \, z_{i'j} = \begin{cases} z_{ij}, & \text{if } i = i' \\ 0, & \text{otherwise} \end{cases}    (10)

which is then to say that the matrix $Z Z^T$ is a diagonal matrix where

\bigl( Z Z^T \bigr)_{ii'} = \sum_{j} Z_{ij} \bigl( Z^T \bigr)_{ji'} = \sum_{j} z_{ij} \, z_{i'j} = \begin{cases} n_i, & \text{if } i = i' \\ 0, & \text{otherwise.} \end{cases}    (11)
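These properties, too, are easy to confirm numerically; the following sketch (our own, NumPy assumed) checks (8), (9), and (11) for a randomly built indicator matrix:

    import numpy as np

    rng = np.random.default_rng(2)
    k, n = 4, 12
    labels = rng.integers(0, k, size=n)

    Z = np.zeros((k, n))
    Z[labels, np.arange(n)] = 1              # exactly one 1 per column

    print(np.allclose(Z.sum(axis=0), 1))     # (8): each column sums to one
    print(Z.sum(axis=1))                     # (9): row sums give cluster sizes n_i
    print(np.allclose(Z @ Z.T, np.diag(Z.sum(axis=1))))  # (11): Z Z^T = diag(n_1, ..., n_k)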
Having familiarized ourselves with these properties of the indicator matrix, we are now in a position to establish the equality in (5), which we will do in a step-by-step manner.
Step 1: Expanding the expression on the left of (5)
We begin by expanding the traditional k-means objective on the left of (5). For this expression, we have

\sum_{i,j} z_{ij} \, \| x_j - \mu_i \|^2 = \sum_{i,j} z_{ij} \bigl( x_j^T x_j - 2\, x_j^T \mu_i + \mu_i^T \mu_i \bigr)
  = \underbrace{\sum_{i,j} z_{ij} \, x_j^T x_j}_{T_1} \; - \; 2 \underbrace{\sum_{i,j} z_{ij} \, x_j^T \mu_i}_{T_2} \; + \; \underbrace{\sum_{i,j} z_{ij} \, \mu_i^T \mu_i}_{T_3}.    (12)
This expansion leads to further insights if we examine the three terms $T_1$, $T_2$, and $T_3$ one by one.
First of all, we find

T_1 = \sum_{i,j} z_{ij} \, x_j^T x_j = \sum_{i,j} z_{ij} \, \| x_j \|^2    (13)
    = \sum_{j} \| x_j \|^2    (14)
    = \operatorname{tr}\bigl( X^T X \bigr)    (15)

where we made use of (8) and (7).
Second of all, we observe

T_2 = \sum_{i,j} z_{ij} \, x_j^T \mu_i = \sum_{i,j} z_{ij} \sum_{l} x_{lj} \, \mu_{li}    (16)
    = \sum_{j,l} x_{lj} \sum_{i} \mu_{li} \, z_{ij}    (17)
    = \sum_{j,l} x_{lj} \, \bigl( M Z \bigr)_{lj}    (18)
    = \sum_{j} \sum_{l} \bigl( X^T \bigr)_{jl} \bigl( M Z \bigr)_{lj}    (19)
    = \sum_{j} \bigl( X^T M Z \bigr)_{jj}    (20)
    = \operatorname{tr}\bigl( X^T M Z \bigr).    (21)
Third of all, we note that

T_3 = \sum_{i,j} z_{ij} \, \mu_i^T \mu_i = \sum_{i,j} z_{ij} \, \| \mu_i \|^2    (22)
    = \sum_{i} \| \mu_i \|^2 \, n_i    (23)

where we applied (9).
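Readers who wish to double-check the algebra in (12) through (23) can do so numerically. The following sketch (ours, NumPy assumed, with the same illustrative matrices as before) compares the elementwise sums with their closed forms:

    import numpy as np

    rng = np.random.default_rng(3)
    m, n, k = 3, 10, 4
    X = rng.normal(size=(m, n))
    M = rng.normal(size=(m, k))
    labels = rng.integers(0, k, size=n)
    Z = np.zeros((k, n))
    Z[labels, np.arange(n)] = 1
    n_i = Z.sum(axis=1)                      # cluster sizes, cf. (9)

    T1 = sum(Z[i, j] * X[:, j] @ X[:, j] for i in range(k) for j in range(n))
    T2 = sum(Z[i, j] * X[:, j] @ M[:, i] for i in range(k) for j in range(n))
    T3 = sum(Z[i, j] * M[:, i] @ M[:, i] for i in range(k) for j in range(n))

    print(np.isclose(T1, np.trace(X.T @ X)))                     # (15)
    print(np.isclose(T2, np.trace(X.T @ M @ Z)))                 # (21)
    print(np.isclose(T3, np.sum(n_i * np.sum(M**2, axis=0))))    # (23)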
Step 2: Expanding the expression on the right of (5)
Next, we look at the expression on the right-hand side of (5). As a squared Frobenius norm of a matrix difference, it can be written as

\| X - M Z \|_F^2 = \operatorname{tr}\Bigl[ \bigl( X - M Z \bigr)^T \bigl( X - M Z \bigr) \Bigr]
  = \underbrace{\operatorname{tr}\bigl( X^T X \bigr)}_{T_4} \; - \; 2 \underbrace{\operatorname{tr}\bigl( X^T M Z \bigr)}_{T_5} \; + \; \underbrace{\operatorname{tr}\bigl( Z^T M^T M Z \bigr)}_{T_6}.    (24)
Given our results in (15) and (21), we immediately recognize that $T_1 = T_4$ and $T_2 = T_5$. Thus, to establish that (12) and (24) are indeed equivalent, it remains to verify that $T_3 = T_6$.
Regarding term $T_6$, we note that, due to the cyclic permutation invariance of the trace operator, we have

\operatorname{tr}\bigl( Z^T M^T M Z \bigr) = \operatorname{tr}\bigl( M^T M Z Z^T \bigr).    (25)
We also note that

\operatorname{tr}\bigl( M^T M Z Z^T \bigr) = \sum_{i} \bigl( M^T M Z Z^T \bigr)_{ii}    (26)
  = \sum_{i} \sum_{l} \bigl( M^T M \bigr)_{il} \bigl( Z Z^T \bigr)_{li}    (27)
  = \sum_{i} \bigl( M^T M \bigr)_{ii} \bigl( Z Z^T \bigr)_{ii}    (28)
  = \sum_{i} \| \mu_i \|^2 \, n_i    (29)

where we used the fact that $Z Z^T$ is diagonal. This result, however, shows that $T_3 = T_6$ and, consequently, that (12) and (24) really are equivalent.
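Again, a quick numerical check is possible; this sketch (ours, NumPy assumed) confirms the cyclic invariance in (25) and the final identity (29):

    import numpy as np

    rng = np.random.default_rng(4)
    m, n, k = 3, 10, 4
    M = rng.normal(size=(m, k))
    labels = rng.integers(0, k, size=n)
    Z = np.zeros((k, n))
    Z[labels, np.arange(n)] = 1
    n_i = Z.sum(axis=1)

    T6 = np.trace(Z.T @ M.T @ M @ Z)
    print(np.isclose(T6, np.trace(M.T @ M @ Z @ Z.T)))           # (25): cyclic invariance
    print(np.isclose(T6, np.sum(n_i * np.sum(M**2, axis=0))))    # (29): T6 equals T3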
Summary and Outlook
Using rather tedious yet straightforward algebra, we have shown that the problem of hard k-means clustering can be understood as the following constrained matrix factorization problem

\underset{M,\,Z}{\operatorname{argmin}} \; \| X - M Z \|_F^2 \quad \text{s.t.} \quad z_{ij} \in \{0, 1\}, \quad \sum_{i} z_{ij} = 1    (30)

where

$X \in \mathbb{R}^{m \times n}$ is a matrix of data vectors $x_j \in \mathbb{R}^m$    (31)
$M \in \mathbb{R}^{m \times k}$ is a matrix of cluster centroids $\mu_i \in \mathbb{R}^m$    (32)
$Z \in \mathbb{R}^{k \times n}$ is a matrix of binary indicator variables such that

z_{ij} = \begin{cases} 1, & \text{if } x_j \in C_i \\ 0, & \text{otherwise.} \end{cases}    (33)
At this point, readers who are not accustomed to the idea of matrix factorization for data analysis might be wondering what we could possibly gain from this insight.
Admittedly, the formulation of the k-means clustering problem in (30) appears to be more complicated and less intuitive than those found in the textbooks. However, in later notes, we will see that the expression in (30) allows for seamless insights into several important properties of the k-means clustering problem that are otherwise more difficult to uncover [2].

[2] C. Bauckhage. k-Means Clustering via the Frank-Wolfe Algorithm. In Proc. KDML-LWDA, 2016.
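To give a first taste of the algorithmic payoff, note that the classic k-means iteration can be read as alternating minimization of (30): with $M$ fixed, the optimal $Z$ assigns each point to its nearest centroid, and with $Z$ fixed, the optimal $M$ is $M = X Z^T (Z Z^T)^{-1}$, which by (11) is just the matrix of cluster means. The following is a minimal sketch of this reading (our own illustration in NumPy, not code from the original notes):

    import numpy as np

    def kmeans_factorization(X, k, iters=100, seed=0):
        """Minimize ||X - M Z||_F^2 over M and Z by alternating updates."""
        rng = np.random.default_rng(seed)
        m, n = X.shape
        M = X[:, rng.choice(n, size=k, replace=False)]  # initial centroids drawn from the data
        Z = np.zeros((k, n))
        for _ in range(iters):
            # Z-step: with M fixed, each column of Z selects the nearest centroid
            d2 = np.sum((X[:, None, :] - M[:, :, None])**2, axis=0)  # (k, n) squared distances
            labels = np.argmin(d2, axis=0)
            Z[:] = 0
            Z[labels, np.arange(n)] = 1
            # M-step: with Z fixed, the optimum is M = X Z^T (Z Z^T)^{-1};
            # by (11), Z Z^T = diag(n_1, ..., n_k), so this is the cluster means
            sizes = np.maximum(Z.sum(axis=1), 1)        # guard against empty clusters
            M = (X @ Z.T) / sizes
        return M, Z

Since each step minimizes (30) exactly in one factor while the other is held fixed, the objective never increases, which is the matrix factorization view of why k-means converges to a local optimum.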
References
C. Bauckhage. Lecture Notes on Data Science: k-Means
Clustering Is Gaussian Mixture Modeling, 2015a. DOI:
10.13140/RG.2.1.3033.2646.
C. Bauckhage. Lecture Notes on Data Science: k-Means Clustering,
2015b. DOI: 10.13140/RG.2.1.2829.4886.
C. Bauckhage. k-Means Clustering via the Frank-Wolfe Algorithm. In Proc. KDML-LWDA, 2016.