Lecture Notes on Data Science: k-Means Clustering Is Matrix Factorization

Christian Bauckhage
B-IT, University of Bonn

In this note, we show that k-means clustering can be understood as a constrained matrix factorization problem. This insight will later allow us to recognize that k-means clustering is but a specific latent factor model and closely related to techniques such as non-negative matrix factorization or archetypal analysis.
Introduction
Previously, we discussed¹ that hard k-means clustering of a data set $X = \{x_1, x_2, \ldots, x_n\} \subset \mathbb{R}^m$ into $k$ clusters $C_1, \ldots, C_k$ boils down to the problem of finding appropriate cluster centroids $\mu_1, \ldots, \mu_k$ and that these will minimize the following objective function

$$E(k) = \sum_{i=1}^{k} \sum_{j=1}^{n} z_{ij} \, \bigl\| x_j - \mu_i \bigr\|^2 \qquad (1)$$

where

$$z_{ij} = \begin{cases} 1, & \text{if } x_j \in C_i \\ 0, & \text{otherwise.} \end{cases} \qquad (2)$$

¹ C. Bauckhage. Lecture Notes on Data Science: k-Means Clustering, 2015b. DOI: 10.13140/RG.2.1.2829.4886; and C. Bauckhage. Lecture Notes on Data Science: k-Means Clustering Is Gaussian Mixture Modeling, 2015a. DOI: 10.13140/RG.2.1.3033.2646.
Our purpose in this note is to show that there is yet another way to formalize the k-means objective in (1).
To this end, we note that we may understand the binary indicator variables $z_{ij}$ in (2) as the elements of an indicator matrix $Z \in \mathbb{R}^{k \times n}$. We also observe that we may think of the given data points $x_j$ as the columns of a data matrix

$$X = \bigl[\, x_1 \; x_2 \; \ldots \; x_n \,\bigr] \in \mathbb{R}^{m \times n} \qquad (3)$$

and that we may furthermore introduce a centroid matrix

$$M = \bigl[\, \mu_1 \; \mu_2 \; \ldots \; \mu_k \,\bigr] \in \mathbb{R}^{m \times k} \qquad (4)$$

whose columns correspond to the cluster centroids that are to be determined.
Given the matrices defined in (2), (3), and (4), we will show that the k-means objective function in (1) can indeed be written in the alternative form

$$\sum_{i=1}^{k} \sum_{j=1}^{n} z_{ij} \, \bigl\| x_j - \mu_i \bigr\|^2 = \bigl\| X - MZ \bigr\|_F^2 \qquad (5)$$

where $\|\cdot\|_F$ denotes the matrix Frobenius norm.

In other words, we will show that k-means clustering is a matrix factorization problem! If there were two appropriate matrices $M$ and $Z$ that would minimize the right hand side of (5), the data matrix $X$ could be approximated as $X \approx MZ$. (Exercise: convince yourself that $MZ$ is an $m \times n$ matrix.)
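Before proving (5), it may help to see the claim on a concrete toy example. The following is a minimal, illustrative numpy sketch (the data and variable names are our own choice, not part of the note); it evaluates both sides of (5) for a fixed assignment and the corresponding cluster means:

```python
import numpy as np

# Toy data: n = 6 points in m = 2 dimensions, k = 2 clusters.
X = np.array([[0.0, 0.1, 0.2, 4.0, 4.1, 4.2],
              [0.0, 0.2, 0.1, 4.0, 4.2, 4.1]])          # shape (m, n)

# Hard assignments: first three points to cluster 1, last three to cluster 2.
Z = np.array([[1, 1, 1, 0, 0, 0],
              [0, 0, 0, 1, 1, 1]])                      # shape (k, n)

# Centroids as the means of the assigned points.
n_i = Z.sum(axis=1)                                     # cluster sizes
M = (X @ Z.T) / n_i                                     # shape (m, k)

# Left-hand side of (5): classical k-means objective.
lhs = sum(Z[i, j] * np.sum((X[:, j] - M[:, i]) ** 2)
          for i in range(Z.shape[0]) for j in range(Z.shape[1]))

# Right-hand side of (5): squared Frobenius norm of X - MZ.
rhs = np.linalg.norm(X - M @ Z, 'fro') ** 2

print(np.isclose(lhs, rhs))                             # True
```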
Proving Equation (5)
In this section, we will prove that our claim in (5) does indeed hold.
The basic idea is to expand both sides of the equation into several more elementary terms and to show that the expressions we obtain for the left- and right-hand sides are indeed equivalent.
Yet, before we set out to do so, we will remind ourselves of general
properties of the Frobenius norm and point out some of the peculiar
features of the binary indicator matrix Z.
General Properties of the Squared Frobenius Norm of a Matrix
Let $A \in \mathbb{R}^{m \times n}$ be any real valued matrix of $m$ rows and $n$ columns. To denote individual elements of such a matrix, we either write $a_{ij}$ or $(A)_{ij}$, and to refer to the $j$-th column vector of $A$, we write $a_j$.

The squared Frobenius norm of $A$ is defined as

$$\bigl\| A \bigr\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2 \qquad (6)$$

and we recall the following properties

$$\bigl\| A \bigr\|_F^2 = \sum_{j=1}^{n} \bigl\| a_j \bigr\|^2 = \sum_{j=1}^{n} a_j^T a_j = \sum_{j=1}^{n} \bigl( A^T A \bigr)_{jj} = \mathrm{tr}\bigl( A^T A \bigr). \qquad (7)$$

Since our derivation below will frequently allude to the identities in (7), readers are encouraged to verify them for themselves. (Exercise: convince yourself that all the equalities in (7) do hold.)
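One way to convince oneself of (7) is to check the four expressions numerically; a short illustrative numpy sketch (the random matrix is our own choice):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))                        # an arbitrary m x n matrix

frob_sq    = np.sum(A ** 2)                        # definition (6)
col_norms  = sum(np.sum(A[:, j] ** 2) for j in range(A.shape[1]))
diag_sum   = np.sum(np.diag(A.T @ A))              # sum_j (A^T A)_jj
trace_form = np.trace(A.T @ A)                     # tr(A^T A)

print(np.allclose([col_norms, diag_sum, trace_form], frob_sq))   # True
```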
Peculiar Properties of the Indicator Matrix Z
If the clusters $C_1, \ldots, C_k$ have distinct cluster centroids $\mu_1, \ldots, \mu_k$, each of the $n$ columns of $Z$ will contain a single element that is 1 and $k-1$ elements that are 0. Accordingly, each column $j$ of $Z$ will sum to one

$$\sum_{i=1}^{k} z_{ij} = 1 \qquad (8)$$

and the $k$ different row sums will indicate the number of elements per cluster, that is, for each row $i$ of $Z$, we have

$$\sum_{j=1}^{n} z_{ij} = |C_i| = n_i. \qquad (9)$$

Moreover, since $z_{ij} \in \{0, 1\}$ and each column of $Z$ only contains a single 1, we have

$$z_{ij} \, z_{i'j} = \begin{cases} z_{ij}, & \text{if } i = i' \\ 0, & \text{otherwise} \end{cases} \qquad (10)$$

so that the rows of $Z$ are pairwise perpendicular. This is then to say that the matrix $ZZ^T$ is a diagonal matrix where

$$\bigl( ZZ^T \bigr)_{ii'} = \sum_{j} Z_{ij} \bigl( Z^T \bigr)_{ji'} = \sum_{j} z_{ij} \, z_{i'j} = \begin{cases} n_i, & \text{if } i = i' \\ 0, & \text{otherwise.} \end{cases} \qquad (11)$$
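For concreteness, here is a tiny illustrative example of such an indicator matrix and the properties (8), (9), and (11); the particular assignments are our own choice:

```python
import numpy as np

# Illustrative indicator matrix for k = 3 clusters and n = 5 points.
Z = np.array([[1, 0, 0, 1, 0],
              [0, 1, 0, 0, 0],
              [0, 0, 1, 0, 1]])

print(Z.sum(axis=0))            # column sums, eq. (8):  [1 1 1 1 1]
print(Z.sum(axis=1))            # row sums n_i, eq. (9): [2 1 2]
print(Z @ Z.T)                  # eq. (11): diagonal matrix diag(2, 1, 2)
```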
Having familiarized ourselves with these properties of the indicator matrix, we are now in a position to establish the equality in (5), which we will do in a step-by-step manner.
Step 1: Expanding the expression on the left of (5)
We begin by expanding the traditional k-means objective on the left
of (5). For this expression, we have
$$\sum_{i,j} z_{ij} \bigl\| x_j - \mu_i \bigr\|^2 = \sum_{i,j} z_{ij} \Bigl( x_j^T x_j - 2\, x_j^T \mu_i + \mu_i^T \mu_i \Bigr) = \underbrace{\sum_{i,j} z_{ij}\, x_j^T x_j}_{T_1} \;-\; 2 \underbrace{\sum_{i,j} z_{ij}\, x_j^T \mu_i}_{T_2} \;+\; \underbrace{\sum_{i,j} z_{ij}\, \mu_i^T \mu_i}_{T_3}. \qquad (12)$$
This expansion leads to further insights if we examine the three terms $T_1$, $T_2$, and $T_3$ one by one.
First of all, we find

$$T_1 = \sum_{i,j} z_{ij}\, x_j^T x_j = \sum_{i,j} z_{ij} \bigl\| x_j \bigr\|^2 \qquad (13)$$
$$= \sum_{j} \bigl\| x_j \bigr\|^2 \qquad (14)$$
$$= \mathrm{tr}\bigl( X^T X \bigr) \qquad (15)$$

where we made use of (8) and (7).
Second of all, we observe

$$T_2 = \sum_{i,j} z_{ij}\, x_j^T \mu_i = \sum_{i,j} z_{ij} \sum_{l} x_{lj}\, \mu_{li} \qquad (16)$$
$$= \sum_{j,l} x_{lj} \sum_{i} \mu_{li}\, z_{ij} \qquad (17)$$
$$= \sum_{j,l} x_{lj} \bigl( MZ \bigr)_{lj} \qquad (18)$$
$$= \sum_{j} \sum_{l} \bigl( X^T \bigr)_{jl} \bigl( MZ \bigr)_{lj} \qquad (19)$$
$$= \sum_{j} \bigl( X^T MZ \bigr)_{jj} \qquad (20)$$
$$= \mathrm{tr}\bigl( X^T MZ \bigr). \qquad (21)$$
Third of all, we note that

$$T_3 = \sum_{i,j} z_{ij}\, \mu_i^T \mu_i = \sum_{i,j} z_{ij} \bigl\| \mu_i \bigr\|^2 \qquad (22)$$
$$= \sum_{i} \bigl\| \mu_i \bigr\|^2 n_i \qquad (23)$$

where we applied (9).
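The three closed-form expressions (15), (21), and (23) can be checked numerically on any data matrix with hard assignments; the following is an illustrative numpy sketch with a data matrix and assignment of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 3, 8, 2
X = rng.normal(size=(m, n))                        # arbitrary data matrix

Z = np.zeros((k, n))                               # hard assignments: 3 and 5 points
Z[0, :3] = 1
Z[1, 3:] = 1
n_i = Z.sum(axis=1)
M = (X @ Z.T) / n_i                                # centroids = cluster means

T1 = sum(Z[i, j] * X[:, j] @ X[:, j] for i in range(k) for j in range(n))
T2 = sum(Z[i, j] * X[:, j] @ M[:, i] for i in range(k) for j in range(n))
T3 = sum(Z[i, j] * M[:, i] @ M[:, i] for i in range(k) for j in range(n))

print(np.isclose(T1, np.trace(X.T @ X)))                                   # eq. (15)
print(np.isclose(T2, np.trace(X.T @ M @ Z)))                               # eq. (21)
print(np.isclose(T3, sum(n_i[i] * M[:, i] @ M[:, i] for i in range(k))))   # eq. (23)
```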
Step 2: Expanding the expression on the right of (5)
Next, we look at the expression on the right hand side of (5). As a squared Frobenius norm of a matrix difference, it can be written as

$$\bigl\| X - MZ \bigr\|_F^2 = \mathrm{tr}\Bigl[ \bigl( X - MZ \bigr)^T \bigl( X - MZ \bigr) \Bigr] = \underbrace{\mathrm{tr}\bigl( X^T X \bigr)}_{T_4} \;-\; 2\, \underbrace{\mathrm{tr}\bigl( X^T MZ \bigr)}_{T_5} \;+\; \underbrace{\mathrm{tr}\bigl( Z^T M^T MZ \bigr)}_{T_6}. \qquad (24)$$
Given our results in (15) and (21), we immediately recognize that $T_1 = T_4$ and $T_2 = T_5$. Thus, to establish that (12) and (24) are indeed equivalent, it remains to verify that $T_3 = T_6$.
Regarding term $T_6$, we note that, due to the cyclic permutation invariance of the trace operator, we have

$$\mathrm{tr}\bigl( Z^T M^T MZ \bigr) = \mathrm{tr}\bigl( M^T M Z Z^T \bigr). \qquad (25)$$

We also note that

$$\mathrm{tr}\bigl( M^T M Z Z^T \bigr) = \sum_{i} \bigl( M^T M Z Z^T \bigr)_{ii} \qquad (26)$$
$$= \sum_{i} \sum_{l} \bigl( M^T M \bigr)_{il} \bigl( Z Z^T \bigr)_{li} \qquad (27)$$
$$= \sum_{i} \bigl( M^T M \bigr)_{ii} \bigl( Z Z^T \bigr)_{ii} \qquad (28)$$
$$= \sum_{i} \bigl\| \mu_i \bigr\|^2 n_i \qquad (29)$$

where we used the fact that $Z Z^T$ is diagonal. This result, however, shows that $T_3 = T_6$ and, consequently, that (12) and (24) really are equivalent.
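The cyclic trace identity (25) and its reduction to (29) are likewise easy to confirm numerically; a short illustrative numpy sketch with matrices of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k = 3, 8, 2
M = rng.normal(size=(m, k))                        # arbitrary centroid matrix
Z = np.zeros((k, n))                               # arbitrary hard assignments
Z[0, :3] = 1
Z[1, 3:] = 1
n_i = Z.sum(axis=1)

t6     = np.trace(Z.T @ M.T @ M @ Z)               # T6 as it appears in (24)
cyclic = np.trace(M.T @ M @ Z @ Z.T)               # after cyclic permutation, eq. (25)
t3     = sum(n_i[i] * M[:, i] @ M[:, i] for i in range(k))   # eq. (29), i.e. T3

print(np.allclose([cyclic, t3], t6))               # True
```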
Summary and Outlook
Using rather tedious yet straightforward algebra, we have shown that the problem of hard k-means clustering can be understood as the following constrained matrix factorization problem

$$\operatorname*{arg\,min}_{M,\,Z} \; \bigl\| X - MZ \bigr\|_F^2 \quad \text{s.t.} \quad z_{ij} \in \{0, 1\}, \quad \sum_{i} z_{ij} = 1 \qquad (30)$$

where

$X \in \mathbb{R}^{m \times n}$ is a matrix of data vectors $x_j \in \mathbb{R}^m$, (31)

$M \in \mathbb{R}^{m \times k}$ is a matrix of cluster centroids $\mu_i \in \mathbb{R}^m$, (32)

$Z \in \mathbb{R}^{k \times n}$ is a matrix of binary indicator variables such that

$$z_{ij} = \begin{cases} 1, & \text{if } x_j \in C_i \\ 0, & \text{otherwise.} \end{cases} \qquad (33)$$
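To connect the factorization view in (30) back to the familiar procedure, here is a minimal sketch of alternating minimization over $M$ and $Z$, which is just Lloyd's k-means algorithm written in matrix form; the function name and toy data are our own choices, and the sketch omits refinements such as restarts or convergence tests:

```python
import numpy as np

def kmeans_factorization(X, k, n_iter=100, seed=0):
    """Alternately minimize ||X - M Z||_F^2 over M and Z (Lloyd's algorithm in matrix form)."""
    m, n = X.shape
    rng = np.random.default_rng(seed)
    M = X[:, rng.choice(n, size=k, replace=False)]            # init centroids from data columns

    for _ in range(n_iter):
        # Z-step: with M fixed, assign each x_j to its nearest centroid.
        d = ((X[:, None, :] - M[:, :, None]) ** 2).sum(axis=0)   # (k, n) squared distances
        Z = np.zeros((k, n))
        Z[d.argmin(axis=0), np.arange(n)] = 1
        # M-step: with Z fixed, the optimal centroids are the cluster means.
        sizes = np.maximum(Z.sum(axis=1), 1)                  # guard against empty clusters
        M = (X @ Z.T) / sizes
    return M, Z

# Toy usage: two well-separated blobs.
X = np.hstack([np.random.randn(2, 20), np.random.randn(2, 20) + 5.0])
M, Z = kmeans_factorization(X, k=2)
print(np.linalg.norm(X - M @ Z, 'fro') ** 2)                  # final k-means objective, cf. (5)
```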
At this point, readers who are not accustomed to the idea of matrix factorization for data analysis might be wondering what we could possibly gain from this insight.

Admittedly, the formulation of the k-means clustering problem in (30) appears to be more complicated and less intuitive than those found in textbooks. However, in later notes, we will see that the expression in (30) allows for seamless insights into several important properties of the k-means clustering problem that are otherwise more difficult to uncover².

² C. Bauckhage. k-Means Clustering via the Frank-Wolfe Algorithm. In Proc. KDML-LWDA, 2016.
References
C. Bauckhage. Lecture Notes on Data Science: k-Means
Clustering Is Gaussian Mixture Modeling, 2015a. DOI:
10.13140/RG.2.1.3033.2646.
C. Bauckhage. Lecture Notes on Data Science: k-Means Clustering,
2015b. DOI: 10.13140/RG.2.1.2829.4886.
C. Bauckhage. k-Means Clustering via the Frank-Wolfe Algorithm. In Proc. KDML-LWDA, 2016.