An empirical comparison of fast and efficient tools for mining textual data
ABSTRACT In order to effectively manage and retrieve the information contained in vast amounts of text documents,
powerful text mining tools and techniques are essential. In this paper we evaluate and compare two state-of-the-art data mining tools for clustering high-dimensional text data, Cluto and Gmeans. Several experiments were conducted on three benchmark datasets, and the results are analysed in terms of clustering quality, memory consumption, and CPU time. We empirically show that Gmeans offers high scalability at the cost of clustering quality, while Cluto delivers better clustering quality at the expense of memory and CPU time.
Volkan Tunalı1, A. Yılmaz Çamurcu2, T. Tugay Bilgin3
1,2Marmara University, Faculty of Technical Education, Göztepe, İstanbul, Turkey
3Maltepe University, Faculty of Engineering, Maltepe, İstanbul, Turkey
volkan.tunali@gmail.com, camurcu@marmara.edu.tr, ttbilgin@maltepe.edu.tr
I. INTRODUCTION
With the development and proliferation of the Internet,
there has been a significant increase in the amount of
information produced, particularly in the form of text
documents such as news articles, research papers, books,
digital libraries, e-mail messages, and Web pages [1].
85% of business-related information is stored
electronically in the form of text databases [2]. In order
to effectively handle this “information explosion” problem
and utilise the huge number of text databases,
efficient and scalable text mining and information
retrieval techniques are essential.
Document clustering is an important text mining
method, which provides users with effective means of
navigation, summarization, and organization of text
documents. It can be defined as an unsupervised and
automatic grouping of a given document collection into
clusters, in such a way that documents belonging to the
same cluster are as similar to each other as possible,
whereas documents from two different clusters are
dissimilar to the maximum extent [3]–[6].
The rest of this paper is organised as follows: Section 2
introduces the efficient k-means partitioning clustering
algorithm and its variants used for clustering high-
dimensional document datasets. Section 3 describes the
two state-of-the-art clustering tools that we evaluate and
compare for mining textual data. Section 4 describes
experiments we performed with these tools and gives
detailed evaluation results. Section 5 contains some
conclusions.
II. EFFICIENT CLUSTERING ALGORITHMS FOR MINING TEXTUAL DATA
Generally, clustering algorithms are divided into two
categories: hierarchical and partitioning. Hierarchical
clustering algorithms initially treat each document as a
cluster and then iteratively merge the closest pair of clusters
until the desired number of clusters is
obtained. Partitioning algorithms organise the documents
into the desired number of partitions, where each partition
represents a cluster. The clusters are formed to optimise
an objective partitioning criterion, so that the documents
within a cluster are “similar”, whereas the documents of
different clusters are “dissimilar” [1, 7, 8]. In this paper,
we are interested in partitioning clustering algorithms.
A. Document Representation Model
First designed for indexing and information retrieval
purposes [9], the Vector Space Model (VSM) has been a
widely used model in data and text mining operations.
Let $D = (d_1, d_2, \dots, d_n)$ denote the collection of
documents, where n is the number of documents in the
collection. Let $T = (t_1, t_2, \dots, t_m)$ represent all the terms
occurring in the document collection D. Here m is the
number of unique terms in the document collection. Each
document to be clustered is represented using the VSM as
a vector in the term-space. In this model, each document
Keywords: text mining, document clustering, spherical k-means, bisecting k-means
$d_i$ is located as a point in an m-dimensional vector space,
$d_i = (w_{i1}, w_{i2}, \dots, w_{im})$, $i = 1, \dots, n$, where the dimension is the
same as the number of terms in the document collection.
Each component of this vector reflects the degree of
relationship between its associated term and the
respective document, which is called the weight of the
term [4, 10].
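The representation above can be illustrated with a minimal Python sketch. It assumes whitespace tokenization and raw term counts as weights; the function name `build_vsm` and the toy documents are hypothetical and not taken from Cluto or Gmeans.

```python
from collections import Counter


def build_vsm(documents):
    """Return (terms, vectors): the sorted vocabulary T and one count vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    # T: all unique terms occurring in the collection (m = len(terms)).
    terms = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for terms absent from this document.
        vectors.append([counts[term] for term in terms])
    return terms, vectors


docs = ["text mining tools", "clustering text documents", "mining text text"]
terms, vectors = build_vsm(docs)
```

Each row of `vectors` is one document d_i in the m-dimensional term space; in practice the raw counts would be replaced by a weighting scheme.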
B. Determining Similarity
The similarity measure is a key aspect of clustering: a
good choice of similarity measure is directly related to
the performance of clustering. The similarity between two
documents is computed from their corresponding feature
vectors with one of several similarity measures,
e.g. the cosine measure, the Jaccard measure, or the Euclidean
distance measure. The cosine measure has been one of the
most popular document similarity measures due to its
sensitivity to the document vector and its performance.
The cosine measure computes the cosine of the angle
between two feature vectors and is used frequently in text
mining and information retrieval applications where
vectors are usually very large but very sparse. The cosine
similarity between two documents di and dj is calculated
as
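$$\mathrm{sim}(d_i, d_j) = \cos(d_i, d_j) = \frac{\sum_{k=1}^{m} w_{ik}\, w_{jk}}{\sqrt{\sum_{k=1}^{m} w_{ik}^{2}}\; \sqrt{\sum_{k=1}^{m} w_{jk}^{2}}}, \quad i, j = 1, \dots, n \tag{1}$$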
The larger the cosine value, the larger the similarity
between two documents [10].
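As a small illustration of equation (1), the following Python sketch computes the cosine similarity of two dense weight vectors. Returning 0.0 for an all-zero vector is a convention assumed here, not something stated in the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for empty documents
    return dot / (norm_u * norm_v)
```

Vectors pointing in the same direction give a value of 1 regardless of their length, while vectors sharing no terms give 0.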
C. K-means Algorithm
The k-means is a commonly used partitioning
clustering algorithm, which takes the input parameter, k,
and partitions a set of n documents into k clusters so that
a global criterion function is either maximised or
minimised [1]. The global criterion function is defined as
$$\sum_{j=1}^{k} \sum_{d_i \in C_j} f(d_i, c_j) \tag{2}$$
cj represents the centroid of cluster Cj, for j = 1,...,k,
and f(di, cj) is the clustering criterion function for a
document di and centroid cj. When the cosine similarity
measure is used as the clustering criterion function,
document di is assigned to the cluster with the most
similar centroid cj, and the global criterion function is
maximised as a result [5]. The steps of the k-means
algorithm are as follows [1]:
1. Arbitrarily choose k documents from D as the
initial cluster centroids.
2. Assign each document to the cluster for which the
clustering criterion function is maximised.
3. Recalculate the k centroids based on the documents
assigned to them.
4. Repeat steps 2 and 3 until convergence (i.e. no
change in the value of the global criterion
function).
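The four steps can be sketched in Python as follows. This is a minimal illustration with cosine similarity as the criterion function, not the implementation used by either tool; the function names, the `seed` and `max_iter` parameters, and the tolerance of 1e-12 are assumptions made for the example.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def kmeans_cosine(vectors, k, seed=0, max_iter=100):
    """Return (labels, centroids) following steps 1-4 with cosine similarity."""
    rng = random.Random(seed)
    # Step 1: arbitrarily choose k documents as the initial centroids.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    previous = None
    labels = []
    for _ in range(max_iter):
        # Step 2: assign each document to its most similar centroid.
        labels = [max(range(k), key=lambda j: cosine(v, centroids[j]))
                  for v in vectors]
        # Step 3: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        # Step 4: stop when the global criterion value no longer changes.
        objective = sum(cosine(v, centroids[lab]) for v, lab in zip(vectors, labels))
        if previous is not None and abs(objective - previous) < 1e-12:
            break
        previous = objective
    return labels, centroids
```

For example, `kmeans_cosine([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]], 2)` groups the first two vectors together and the last two together.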
D. K-means Variants
There are many variants of the k-means algorithm in
the literature. Some variants address the weaknesses of
the k-means like initial cluster assignment and
convergence to local optima, while some offer various
other improvements and extensions. In this paper we are
interested particularly in two variants known to be
effective in clustering textual data: spherical and bisecting
k-means.
Spherical K-means
There are several weighting schemes for the weights of
terms in the document vector. The best known is
term frequency-inverse document frequency (TFIDF),
where the importance of a term
increases proportionally to the number of times the term
appears in the document but is offset by the frequency of
the term in the whole corpus. The TFIDF weight $w_{ij}$ of term j
in document i is usually calculated as
$$w_{ij} = f_{ij} \log\left(\frac{d}{f_j}\right) \tag{3}$$
where $f_{ij}$ is the frequency of term j in document i, d is
the total number of documents in D, and $f_j$ is the number of
documents that contain term j.
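Equation (3) can be illustrated with a short Python sketch operating on raw term-frequency vectors. The natural logarithm is an assumption here, since the paper does not state the log base, and the function name `tfidf` is hypothetical.

```python
import math


def tfidf(tf_vectors):
    """Apply w_ij = f_ij * log(d / f_j) to a list of term-frequency vectors."""
    d = len(tf_vectors)      # total number of documents
    m = len(tf_vectors[0])   # number of unique terms
    # f_j: number of documents that contain term j.
    doc_freq = [sum(1 for row in tf_vectors if row[j] > 0) for j in range(m)]
    return [
        [row[j] * math.log(d / doc_freq[j]) if doc_freq[j] else 0.0
         for j in range(m)]
        for row in tf_vectors
    ]


weights = tfidf([[2, 1, 0], [0, 1, 3]])
```

A term that occurs in every document, like the middle term of this toy input, receives weight 0 because log(d / f_j) = log(1) = 0.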
A weakness of this scheme is that long documents are
favored over short ones since they contain more terms.
For this reason, a normalization factor is used in order to
discount the contribution of long documents [10]. One
possible normalization factor is:
$$\sqrt{\sum_{j=1}^{m} w_{ij}^{2}} \tag{4}$$
This normalization implies that $\|d_i\| = 1$ for all $d_i \in D$; that is,
each document vector lies on the surface of the unit
sphere in $\mathbb{R}^m$. Thus, using normalization, only the
proportion of terms occurring in a document is retained.
Like document vectors, centroid vectors of the clusters
also lie on the surface of the high-dimensional sphere.
The variant of the k-means algorithm that uses document
vectors of unit length along with the cosine similarity
measure is called the spherical k-means algorithm [11]. An
important computational advantage of using vectors of
unit length is that the cosine similarity
reduces to the dot product of the two vectors. That is,
the division is eliminated from (1) since the denominator
is always 1.
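The computational shortcut can be checked with a small Python sketch: once vectors are scaled to unit length, their dot product equals their cosine similarity. The helper names and the example vectors are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale v to unit Euclidean length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(u, v):
    """Plain dot product; for unit vectors this is the cosine similarity."""
    return sum(a * b for a, b in zip(u, v))


u = l2_normalize([3.0, 4.0, 0.0])
v = l2_normalize([4.0, 3.0, 0.0])
similarity = dot(u, v)  # (3*4 + 4*3) / (5*5) = 0.96
```

Normalizing once up front removes two square roots and a division from every similarity evaluation in the clustering loop.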
Bisecting K-means
The bisecting k-means is another variant of the k-
means algorithm [12]. Starting with the whole dataset as a
single cluster, the algorithm essentially splits one cluster
into two subclusters at each iteration as follows:
1. Select a cluster Cj to split, based on a heuristic
function.
2. Obtain 2 subclusters of Cj using the classical k-
means algorithm.
3. Repeat step 2 l times, and select the split that
produces the clustering that best satisfies the global
criterion function.
4. Repeat steps 1, 2, and 3 until k clusters are
obtained.
l is the number of iterations for each bisecting step,
which is usually specified in advance.
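The bisecting procedure can be sketched in Python as below. Several choices are assumptions made only for illustration: the heuristic in step 1 is taken to be "split the largest cluster", the split quality is the sum of cosine similarities to the sub-cluster centroids, and `trials` plays the role of l. None of this is claimed to match Cluto's internals.

```python
import math
import random


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def two_means(vectors, rng, max_iter=50):
    """Split vectors in two with basic k-means (k = 2); return (score, left, right)."""
    centroids = [list(v) for v in rng.sample(vectors, 2)]
    labels = None
    for _ in range(max_iter):
        new_labels = [0 if cosine(v, centroids[0]) >= cosine(v, centroids[1]) else 1
                      for v in vectors]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for j in (0, 1):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    score = sum(cosine(v, centroids[lab]) for v, lab in zip(vectors, labels))
    left = [v for v, lab in zip(vectors, labels) if lab == 0]
    right = [v for v, lab in zip(vectors, labels) if lab == 1]
    return score, left, right


def bisecting_kmeans(vectors, k, trials=5, seed=0):
    """Return a list of clusters (lists of vectors) obtained by repeated bisection."""
    rng = random.Random(seed)
    clusters = [list(vectors)]
    while len(clusters) < k:
        # Step 1: pick the cluster to split (assumed heuristic: the largest one).
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        if len(clusters[idx]) < 2:
            break  # nothing left that can be split
        target = clusters.pop(idx)
        # Steps 2-3: bisect `trials` times and keep the best-scoring split.
        _, left, right = max((two_means(target, rng) for _ in range(trials)),
                             key=lambda result: result[0])
        if not left or not right:  # degenerate split, e.g. duplicate vectors
            left, right = target[:1], target[1:]
        # Step 4: continue until k clusters exist.
        clusters += [left, right]
    return clusters
```

Because each step only runs 2-means on a single cluster, the cost of one bisection depends on the size of that cluster rather than on the whole collection.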
III. TOOLS FOR CLUSTERING TEXTUAL DATA
In this paper, we are interested in two state-of-the-art
clustering tools for mining text documents: Cluto and
Gmeans. The source code of Gmeans is distributed under the
GNU General Public License (GPL) [13], whereas Cluto is
distributed as both stand-alone programs and a library
without source code [14].
A. Cluto
Written in ANSI C by George Karypis, CLUTO
(CLUstering TOolkit) is a software package for clustering
low- and high-dimensional datasets and for analyzing the
characteristics of the various clusters [12].
Cluto contains partitional, agglomerative, and graph-
partitioning based clustering algorithms. Bisecting k-
means, the default option in the partitional class of
algorithms, is the one considered in this paper. In
addition, Cluto offers multiple distance (similarity)
functions such as cosine, Euclidean, correlation coefficient,
and extended Jaccard, where cosine is the default option.
Cluto also has an option to select one of several clustering
criterion functions from four categories: internal,
external, hybrid, and graph-based. The performance of all
these criterion functions is evaluated and analysed
extensively with several datasets in [15].
B. Gmeans
Gmeans is a C++ program for clustering, developed by
Yuqiang Guan as part of his PhD thesis [11], [16].
The program employs four different k-means type
clustering algorithms with four different distance
(similarity) measures: cosine, Euclidean, diametric
distance, and Kullback-Leibler divergence. Cosine
is the default similarity measure and is applied for spherical k-
means, with each document vector L2-normalised.
Moreover, a local search strategy called first variation is
included to overcome the local optima problem.
The program generates a one-way, hard clustering of a
given dataset.
The k-means algorithm is very sensitive to the selection
of the initial cluster centroids; that is, its performance and
clustering quality are directly affected by this selection.
The Gmeans program offers several initialization methods:
randomly assigning each data item a cluster index;
generating the centroid for the whole data and then
perturbing around it; reading the cluster index of each data
item from a file; and randomly picking data items as the initial
cluster centroids. The default initialization method of Gmeans
is a rather deterministic one, which first picks the data
item ‘farthest’ from the centroid of the whole dataset, and
then repeatedly picks an item which is ‘farthest’ from all the
previously picked cluster centroids until all the cluster
centroids are picked.
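A farthest-first initialization of this kind can be sketched in Python as follows. This is an interpretation, not the Gmeans source: "farthest" is taken to mean lowest cosine similarity, and "farthest from all previously picked centroids" is taken to mean the item whose highest similarity to any picked centroid is smallest. The function name is hypothetical, and k is assumed not to exceed the number of items.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def farthest_first_centroids(vectors, k):
    """Pick k initial centroids deterministically (assumes k <= len(vectors))."""
    # Centroid of the whole dataset.
    mean = [sum(col) / len(vectors) for col in zip(*vectors)]
    # First pick: the item least similar to the global centroid.
    chosen = [min(range(len(vectors)), key=lambda i: cosine(vectors[i], mean))]
    while len(chosen) < k:
        candidates = [i for i in range(len(vectors)) if i not in chosen]
        # Next pick: the item whose closest already-picked centroid is farthest away.
        nxt = min(candidates,
                  key=lambda i: max(cosine(vectors[i], vectors[c]) for c in chosen))
        chosen.append(nxt)
    return [list(vectors[i]) for i in chosen]


seeds = farthest_first_centroids([[1, 0], [1, 0.1], [0.9, 0.2], [0, 1]], 2)
```

Since no random numbers are involved, repeated runs on the same data start from the same centroids, which makes results reproducible.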
IV. EVALUATION OF CLUTO AND GMEANS
In order to evaluate the performance of Cluto and
Gmeans, three common benchmark datasets were used.
Two of them were chosen from among the well-known text
mining benchmark datasets, while the third is a
synthetic dataset generated by a waveform database generator.
A. Classic3 Dataset
Classic3 is a very popular document dataset with 3891
documents, used for benchmarking in much text mining
research. It comprises the Medline, Cisi and
Cranfield subcollections from the SMART collection
[17]. Medline consists of 1033 medical journal abstracts,
Cisi consists of 1460 abstracts from information retrieval
papers, and Cranfield consists of 1398 abstracts on
aerodynamic systems. After preprocessing with filters
such as stop-word removal, the dataset has 4467 dimensions
and is 99% sparse.
B. 20 News Groups (20NG) Dataset
The 20NG dataset consists of 18,821 messages
compiled from 20 Usenet groups and is also a
commonly referenced benchmark dataset in text mining
research. The number of dimensions of this dataset is 70,241
[18].
C. Waveform Dataset
The Waveform dataset is composed of 5000 data points
generated from 3 different classes with 21 attributes (i.e.
dimensions). There are 1657 points from class 0, 1647
points from class 1, and 1696 points from class 2 [19].
D. Clustering Performance Metrics
There are two aspects of evaluating the performance of
a clustering tool. First, we evaluate the scalability of the
clustering tool in terms of memory and CPU time consumption.
Second, we evaluate the quality of clustering
using several validity indices. In this paper, we use external
validity indices that are based on comparing the clustering
results with the known true class labels of documents:
Purity, Entropy, F-measure, and Normalised Mutual
Information (NMI).
Suppose there are c categories (or classes) and k clusters.
Let $n_l$ be the number of objects in cluster $C_l$, and $n_l^{(h)}$
be the number of objects in cluster $C_l$ that belong to
class $K_h$, where h = 1,…,c.
Purity is the ratio of the dominant class size in the cluster
to the cluster size itself. A high purity implies that the
cluster is a pure subset of the dominant class. Purity of
cluster $C_l$ is defined as
$$\mathrm{purity}(C_l) = \frac{1}{n_l} \max_{h} \left( n_l^{(h)} \right) \tag{5}$$
Purity of the entire collection of clusters is evaluated as
a weighted sum of the individual cluster purities and is
defined as
$$\mathrm{purity}(C) = \sum_{l=1}^{k} \frac{n_l}{n}\, \mathrm{purity}(C_l) = \frac{1}{n} \sum_{l=1}^{k} \max_{h} \left( n_l^{(h)} \right) \tag{6}$$
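A minimal Python sketch of overall purity, computed from two parallel label lists, is given below. The function name and the toy labels are hypothetical.

```python
from collections import Counter


def purity(cluster_labels, class_labels):
    """Overall purity: (1/n) * sum over clusters of the dominant class count."""
    n = len(class_labels)
    by_cluster = {}
    for cluster, cls in zip(cluster_labels, class_labels):
        by_cluster.setdefault(cluster, Counter())[cls] += 1
    return sum(max(counts.values()) for counts in by_cluster.values()) / n


example = purity([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "c"])
```

In this example each cluster has a dominant class with 2 of its 3 objects, so the purity is 4/6.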
Entropy is a more comprehensive measure than purity.
It considers the distribution of classes in a cluster. Note
that we use normalised entropy, which takes values between
0 and 1. An entropy value of 0 means the cluster is
comprised entirely of one class, while an entropy value
close to 1 is considered bad because it implies that the
cluster contains a uniform mixture of all classes. Entropy
of cluster $C_l$ is defined as
$$\mathrm{entropy}(C_l) = -\frac{1}{\log(c)} \sum_{h=1}^{c} \frac{n_l^{(h)}}{n_l} \log\left( \frac{n_l^{(h)}}{n_l} \right) \tag{7}$$
Entropy of the entire collection of clusters is evaluated
as a weighted sum of the individual cluster entropies and
is defined as
$$\mathrm{entropy}(C) = \sum_{l=1}^{k} \frac{n_l}{n}\, \mathrm{entropy}(C_l) \tag{8}$$
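The weighted, normalised entropy can be sketched in Python as follows. The number of classes c is taken from the distinct class labels present, and skipping the normalisation when c = 1 is an assumption made to avoid dividing by log(1) = 0.

```python
import math
from collections import Counter


def clustering_entropy(cluster_labels, class_labels):
    """Weighted sum of per-cluster entropies, each normalised by log(c)."""
    n = len(class_labels)
    c = len(set(class_labels))
    by_cluster = {}
    for cluster, cls in zip(cluster_labels, class_labels):
        by_cluster.setdefault(cluster, Counter())[cls] += 1
    total = 0.0
    for counts in by_cluster.values():
        n_l = sum(counts.values())
        # Only classes present in the cluster contribute (0 * log 0 is taken as 0).
        h = -sum((x / n_l) * math.log(x / n_l) for x in counts.values())
        if c > 1:
            h /= math.log(c)
        total += (n_l / n) * h
    return total
```

A clustering that mixes two classes evenly in every cluster scores 1, and a clustering that reproduces the classes exactly scores 0.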
F-Measure is another frequently used external validation
measure, commonly referred to as clustering
accuracy. It requires the calculation of Precision and
Recall values. Precision is the fraction of the objects in cluster $C_l$
that belong to class $K_h$, defined as
$$\mathrm{precision}(C_l, K_h) = \frac{n_l^{(h)}}{n_l} \tag{9}$$
Recall is the fraction of the objects from class $K_h$ that are
present in cluster $C_l$, defined as
$$\mathrm{recall}(C_l, K_h) = \frac{n_l^{(h)}}{|K_h|} \tag{10}$$
where $|K_h|$ is the number of objects in class $K_h$. The F value
of cluster $C_l$ with respect to class $K_h$ is then the harmonic
mean of the precision and the recall:
$$F(C_l, K_h) = \frac{2 \cdot \mathrm{precision}(C_l, K_h) \cdot \mathrm{recall}(C_l, K_h)}{\mathrm{precision}(C_l, K_h) + \mathrm{recall}(C_l, K_h)} \tag{11}$$
The F-Measure of cluster $C_l$ is the maximum F value
obtained over all classes:
$$F(C_l) = \max_{h} F(C_l, K_h) \tag{12}$$
F-Measure of the entire collection of clusters is evaluated
as a weighted sum of the individual cluster F-
Measure values and is defined as
$$F(C) = \sum_{l=1}^{k} \frac{n_l}{n}\, F(C_l) \tag{13}$$
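The cluster-weighted F-Measure can be sketched in Python as below, combining precision, recall, the per-cluster maximum, and the weighted sum. The function name and the toy labels are hypothetical.

```python
from collections import Counter


def f_measure(cluster_labels, class_labels):
    """Weighted sum over clusters of the best harmonic mean of precision and recall."""
    n = len(class_labels)
    class_sizes = Counter(class_labels)  # |K_h| for each class
    by_cluster = {}
    for cluster, cls in zip(cluster_labels, class_labels):
        by_cluster.setdefault(cluster, Counter())[cls] += 1
    total = 0.0
    for counts in by_cluster.values():
        n_l = sum(counts.values())
        best = 0.0
        for cls, n_lh in counts.items():
            precision = n_lh / n_l
            recall = n_lh / class_sizes[cls]
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_l / n) * best
    return total


example = f_measure([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "c"])
```

In this example the first cluster scores 0.8 against class "a" and the second scores 2/3 against class "b", so the weighted result is 0.5 × 0.8 + 0.5 × 2/3.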
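Normalised Mutual Information (NMI) is an information-based
clustering validity measure. A high NMI value
implies that clustering and true class label match well.
NMI is defined as
$$\mathrm{NMI}(C) = \frac{2}{n} \sum_{l=1}^{k} \sum_{h=1}^{c} n_l^{(h)} \log_{c \cdot k} \left( \frac{n_l^{(h)}\, n}{n^{(h)}\, n_l} \right) \tag{14}$$
where $n^{(h)} = |K_h|$ is the number of objects in class $K_h$.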
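The following Python sketch computes NMI under the assumption that equation (14) has the form (2/n) Σ_l Σ_h n_l^(h) log_{c·k}(n_l^(h) n / (n^(h) n_l)), with k and c taken as the numbers of distinct clusters and classes present. Returning 0.0 when c·k < 2 is an added convention, since the logarithm base would otherwise be 1.

```python
import math
from collections import Counter


def nmi(cluster_labels, class_labels):
    """NMI with logarithm base c*k, summed over non-empty cluster/class pairs."""
    n = len(class_labels)
    cluster_sizes = Counter(cluster_labels)            # n_l
    class_sizes = Counter(class_labels)                # n^(h)
    joint = Counter(zip(cluster_labels, class_labels)) # n_l^(h)
    base = len(cluster_sizes) * len(class_sizes)
    if base < 2:
        return 0.0  # assumed convention: a single cluster and class carries no information
    total = 0.0
    for (cluster, cls), n_lh in joint.items():
        ratio = n_lh * n / (class_sizes[cls] * cluster_sizes[cluster])
        total += n_lh * math.log(ratio, base)
    return 2.0 * total / n
```

For two balanced classes, a clustering that reproduces the classes yields 1, and a clustering that is statistically independent of them yields 0.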
E. Experimental Results
Experiments were performed on an Ubuntu Linux 9.1
platform installed on a PC with a 2.8 GHz Pentium 4 CPU
and 512 MB of main memory. The results of the experiments are
shown in Tables 1–6 and Figures 1–6.
The maximum memory consumption of Cluto and Gmeans
for each benchmark dataset is compared in Table 1 and
Figure 1, in kilobytes. These values are
not the values reported by the programs themselves. Instead, we used
the pidstat system tool to objectively measure the actual
maximum amount of memory consumed by each program.
Table 1 – Memory consumption in KB.
DataSet     Cluto     Gmeans
Classic3    4,560     4,480
20NG        36,452    27,092
Waveform    4,504     3,512
Figure 1 – Memory consumption in KB.
Comparison of CPU time consumption of the programs
is given in Table 2 and Figure 2. We ran the programs 10
times for each dataset and measured the average CPU
time consumption.
Table 2 – CPU Time consumption in seconds.
DataSet     Cluto     Gmeans
Classic3    0.578     0.331
20NG        16.270    4.166
Waveform    0.434     0.178
Figure 2 – CPU Time consumption in seconds.
Comparison of Purity values obtained from clustering
results is shown in Table 3 and Figure 3.
Table 3 – Comparison of Purity values.
DataSet     Cluto     Gmeans
Classic3    0.986     0.989
20NG        0.614     0.498
Waveform    0.644     0.546
Figure 3 – Comparison of Purity values.
Entropy values of clustering of each dataset are given
comparatively in Table 4 and Figure 4.
Table 4 – Comparison of Entropy values.
DataSet     Cluto     Gmeans
Classic3    0.071     0.059
20NG        0.405     0.471
Waveform    0.603     0.641
Figure 4 – Comparison of Entropy values.
Table 5 and Figure 5 display the comparison of F-
Measure values.
Table 5 – Comparison of F-Measure values.
DataSet     Cluto     Gmeans
Classic3    0.986     0.989
20NG        0.609     0.522
Waveform    0.633     0.538
Figure 5 – Comparison of F-Measure values.
Finally, comparison of NMI values is shown in Table 6
and Figure 6.
Table 6 – Comparison of NMI values.
DataSet     Cluto     Gmeans
Classic3    0.930     0.941
20NG        0.606     0.561
Waveform    0.410     0.360
Figure 6 – Comparison of NMI values.
The Classic3 document dataset is composed of documents
from highly distinct classes, and both programs nearly
recovered the three classes, as shown by the low entropy and
high purity, F-Measure and NMI values. Although the purity
and F-Measure values suggest reasonably good clustering results,
the entropy and NMI values indicate that the clusters obtained
for the 20NG and Waveform datasets contain documents and points
from diverse classes; these datasets are known to
comprise documents and points from highly overlapping
and fuzzy classes.
F. Comparison: Cluto vs. Gmeans
Our experiments suggest that Cluto with the bisecting k-
means algorithm performs well on both text
data and synthetic waveform data in terms of clustering
quality. However, this quality comes at a
cost in memory and CPU time consumption. The
experimental results show that Gmeans with the spherical k-
means algorithm is very efficient in terms of
memory and CPU time consumption; that is, Gmeans is
more scalable than Cluto. Moreover, Gmeans also performs
marginally better on the Classic3 document dataset, which is
composed of documents from highly distinct classes. For the
20NG and Waveform datasets, on the other hand, the clustering
quality of Cluto is clearly better than that of Gmeans.
V. CONCLUSIONS
In this paper we empirically evaluated and compared
two high-performance data mining tools for clustering
high-dimensional text data, Cluto and Gmeans. We
performed experiments on three benchmark datasets, and
analysed the results in terms of several performance
metrics: purity, entropy, F-measure, and Normalised Mutual
Information, as well as memory and CPU time
consumption. Our experiments showed that Gmeans with
spherical k-means offers high scalability at the cost of
clustering quality, while Cluto with bisecting k-means
achieves better clustering quality at the expense of memory
and CPU time. Therefore, the selection of a clustering
tool is a trade-off between quality and speed.
In short, if there is textual data that needs to be clustered
precisely and powerful hardware with a large amount
of main memory is available, then Cluto is the best option.
On the other hand, if good clustering quality in reasonable
time with a modest hardware configuration is expected,
then Gmeans is the preferable tool.
REFERENCES
[1] Han, J.; Kamber, M. 2006. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers.
[2] Hotho, A.; Nürnberger, A.; Paaß, G. 2005. A brief survey of text mining.
[3] Feldman, R.; Sanger, J. 2007. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press.
[4] Aliguliyev, R.M. 2009. Clustering of document collection – a weighting approach. Expert Systems with Applications 36: 7904-7916.
[5] Li, Y. 2007. High performance text document clustering. PhD Thesis. Wright State University.
[6] Bilgin, T.T. 2007. Çok boyutlu uzayda görsel veri madenciliği için üç yeni çatı tasarımı ve uygulamaları. PhD Thesis. Marmara Üniversitesi Fen Bilimleri Enstitüsü.
[7] Jain, A.K.; Dubes, R.C. 1988. Algorithms for Clustering Data. Prentice Hall.
[8] Tan, P.; Steinbach, M.; Kumar, V. 2006. Introduction to Data Mining. Addison-Wesley.
[9] Salton, G.; Wong, A.; Yang, C.S. 1975. A vector space model for automatic indexing. Communications of the ACM, 18(11): 613-620.
[10] Witten, I.H.; Moffat, A.; Bell, T.C. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers.
[11] Guan, Y. 2006. Large-scale clustering: algorithms and applications. PhD Thesis. The University of Texas at Austin.
[12] Steinbach, M.; Karypis, G.; Kumar, V. 2000. A comparison of document clustering techniques (Technical Report #00-034). University of Minnesota.
[13] Gmeans software. Retrieved November 06, 2009 from World Wide Web: http://userweb.cs.utexas.edu/users/dml/Software/gmeans.html.
[14] Cluto software. Retrieved July 12, 2009 from World Wide Web: http://glaros.dtc.umn.edu/gkhome/cluto/cluto/download.
[15] Zhao, Y.; Karypis, G. 2001. Criterion functions for document clustering: experiments and analysis (Technical Report #01-40). University of Minnesota.
[16] Dhillon, I.; Fan, J.; Guan, Y. 2001. Efficient clustering of very large document collections. Invited book chapter in Data Mining for Scientific and Engineering Applications, Kluwer: 357-381.
[17] Classic3 dataset. Retrieved November 29, 2009 from World Wide Web: ftp://ftp.cs.cornell.edu/pub/smart.
[18] 20 News Groups dataset. Retrieved January 20, 2010 from World Wide Web: http://web.ist.utl.pt/~acardoso/datasets.
[19] Waveform dataset. Retrieved March 18, 2010 from World Wide Web: http://archive.ics.uci.edu/ml/datasets/Waveform+Database+Generator+(Version+2).
BIOGRAPHIES
Volkan Tunalı – Received the BSc degree in Computer Engineering
from Marmara University, Faculty of Engineering, Istanbul in 2001, and
the MSc degree in Computer Engineering from Marmara University, Institute
for Graduate Studies in Pure and Applied Sciences, Istanbul in
2005.
He is currently a PhD student at Marmara University, Institute for
Graduate Studies in Pure and Applied Sciences. His research interests
include data mining and knowledge discovery, text mining, pattern
recognition, and speech recognition.
A. Yılmaz Çamurcu – Born in Nazilli, Turkey, in 1953. Received the
PhD degree in Computer Education from Marmara University, Istanbul
in 1996.
His current research interests are data mining, intelligent tutoring
systems and medical image processing.
Prof. Dr. Çamurcu is a professor of computer systems in the Faculty
of Technical Education at Marmara University. He is a member of
ACM.
T. Tugay Bilgin – Received the BSc and PhD degrees in Computer and
Control Education from Marmara University, Istanbul in 2001 and 2007
respectively.
His doctoral thesis was on the mining of high dimensional datasets.
He proposed three new frameworks for effective mining of high dimensional
datasets. He became Assistant Professor of the Software Engineering
Department at Maltepe University in 2008. His research interests
are high dimensional data mining, web mining, service oriented
architecture and web services.
Dr. Bilgin is currently Assistant Head of the Software Engineering
Department and Director of the Advanced Computing Laboratory of Maltepe
University. He is a member of ACM.