Query-based Topic Detection Using Concepts and Named
Entities
Ilias Gialampoukidis, Dimitris Liparas, Stefanos Vrochidis, and Ioannis Kompatsiaris
Information Technologies Institute, CERTH, Thessaloniki, Greece
email: {heliasgj, dliparas, stefanos, ikom}@iti.gr

Abstract.
In this paper, we present a framework for topic
detection in news articles. The framework receives as input the
results retrieved from a query-based search and clusters them by
topic. To this end, the recently introduced “DBSCAN-Martingale”
method for automatically estimating the number of topics and the
well-established Latent Dirichlet Allocation topic modelling
approach for the assignment of news articles into topics of interest,
are utilized. Furthermore, the proposed query-based topic detection
framework works on high-level textual features (such as concepts
and named entities) that are extracted from news articles. Topic
detection is tackled as a text clustering task, without knowing the
number of clusters, and our approach compares favorably to several
text clustering approaches on a public dataset of retrieved results,
with respect to four representative queries.
1 INTRODUCTION
The need by both journalists and media monitoring companies to
master large amounts of news articles produced on a daily basis, in
order to identify and detect interesting topics and events, has
highlighted the importance of the topic detection task. In general,
topic detection aims at grouping together stories (documents) that
discuss the same topic or event. Formally, a topic is defined in
[1] as “a specific thing that happens at a specific time and place
along with all necessary preconditions and unavoidable
consequences”. It is clarified in [1] that the notion of “topic” is not
as general as “accidents” but is limited to a specific collection of
related events of the type “accident”, such as “cable car crash”. We
shall refer to topics as news clusters, or simply clusters.
The two main challenges involved in the topic detection
problem are the following: one needs to (1) estimate the correct
number of topics/news clusters and (2) assign the most similar
news articles into clusters. In addition, the following assumptions
must be made: Firstly, real data is highly noisy and the number of
clusters is not known a priori. Secondly, there is a lower bound for
the minimum number of documents per news cluster.
In this context, we present and describe the hybrid clustering
framework for topic detection, which has been developed within
the FP7 MULTISENSOR project (http://www.multisensorproject.eu/).
For a given query-based search, the main idea is to efficiently
cluster the retrieved results, without the need for a pre-specified
number of topics. To this end, the framework, recently introduced
in [2], combines automatic estimation of the number of clusters and
assignment of news articles into topics of interest, on the results of
a text query. The estimation of the number of clusters is done by the
novel “DBSCAN-Martingale” method [2], which can deal with the
aforementioned assumptions. All clusters are progressively
extracted (by a density-based algorithm) by applying Doob’s
martingale and then Latent Dirichlet Allocation is applied for the
assignment of news articles to topics. In contrast to [2], the
contribution of this paper is that the overall framework relies on
high-level textual features (concepts and named entities) that are
extracted from the retrieved results of a textual query, and can
therefore assist any search engine.
The rest of the paper is organized as follows: Section 2 provides
related work with respect to topic detection, news clustering and
density-based clustering. In Section 3, our framework for topic
detection is presented and described. Section 4 discusses the
experimental results from the application of our framework and
several other clustering methods to four collections of text
documents, related to four given queries, respectively. Finally,
some concluding remarks are provided in Section 5.
2 RELATED WORK
Topic detection is traditionally considered as a clustering problem
[3], due to the absence of training sets. The clustering task usually
involves feature selection [4], spectral clustering [5] and k-means
oriented [3] techniques, assuming mainly that the number of topics
to be discovered is known a priori and there is no noise, i.e. news
items that do not belong to any of the news clusters. Latent
Dirichlet Allocation (LDA) is a popular approach for topic
modelling for a given number of topics k [6]. LDA has been
generalized to nonparametric Bayesian approaches, such as the
hierarchical Dirichlet process [7] and DP-means [8], which predict
the number of topics k. The extraction of the correct number of
topics is equivalent to the estimation of the correct number of
clusters in a dataset. The majority vote among 30 clustering indices
has been proposed in [9] as an indicator for the number of clusters
in a dataset. In contrast, we propose an alternative majority vote
among 10 realizations of the “DBSCAN-Martingale”, which is a
modification of the DBSCAN algorithm [10] with parameters the
density level and a lower bound for the minimum number of
points per cluster. However, the DBSCAN-Martingale [2] regards
the density level as a random variable and the clusters are
progressively extracted. We consider the general case, where the
number of topics to be discovered is unknown and it is possible to
have news articles which are not assigned to any topic.
Graph-based methods for event detection and multimodal
clustering in social media streams have appeared in [11], where a
graph clustering algorithm is applied on the graph of items. The
decision on whether to link two items is based on the output of
a classifier, which decides whether the candidate items belong to the same
cluster. Contrary to this graph-based approach, we cluster news
items in an unsupervised way.
Density-based clustering does not require as input the number of
topics. OPTICS [12] is very useful for the visualization of the
cluster structure and for the optimal selection of the density level ε.
The OPTICS-ξ algorithm [12] requires an extra parameter ξ, which
has to be manually set in order to find “dents” in the OPTICS
reachability plot. The automatic extraction of clusters from the
OPTICS reachability plot, as an extension of the OPTICS-ξ
algorithm, has been presented in [13] and has been outperformed
by HDBSCAN [14] on several datasets of diverse nature. In the context
of news clustering, however, we examine whether some of
these density-based algorithms perform well on the topic detection
problem, by comparing them with our DBSCAN-Martingale in
terms of the number of estimated topics. All the aforementioned
methods, which do not require the number of topics to be known a
priori, are combined with LDA in order to examine whether the use
of DBSCAN-Martingale (combined with LDA) provides the most
efficient assignment of news articles to topics.
3 TOPIC DETECTION USING CONCEPTS
AND NAMED ENTITIES
The MULTISENSOR framework for topic detection, which is
presented in Figure 1, is approached as a news clustering problem,
where the number of topics needs to be estimated. The overall
framework is based on textual features, namely concepts and
named entities. The number of topics k is estimated by DBSCAN-
Martingale and the assignment of news articles to topics is done
using Latent Dirichlet Allocation (LDA).
LDA has shown great performance in text clustering, given the
number of topics. However, in realistic applications, the number of
topics is unknown to the system. On the other hand, DBSCAN
does not require as input the number of clusters, but its
performance in text clustering is weak, because it assigns too many
news articles of the collection to noise, which results in very limited
performance [2]. Moreover, it is difficult to find a unique density
level that can output all clusters. Thus, we keep only the number of
clusters estimated by density-based clustering, and the assignment
of documents to topics is done by the well-performing LDA.
Figure 1. The MULTISENSOR topic detection framework using
DBSCAN-Martingale and LDA
In our approach, the constructed DBSCAN-Martingale
combines several density levels and is applied on high-level
concepts and named entities. In the following, the construction of
DBSCAN-Martingale is briefly reported.
3.1 The DBSCAN-Martingale
Given a collection of news articles, density-based clustering
algorithms output a clustering vector C, whose values C[j] are the
cluster IDs assigned to the news items j = 1, 2, ..., n, where C[j]
denotes the j-th element of the vector C. In case the j-th document
is not assigned to any of the clusters, the j-th cluster ID is zero.
Assuming that C_ε is the clustering vector provided by the
DBSCAN [10] algorithm for the density level ε, the problem is to
combine the results for several values of ε into one unique
clustering result. To that end, a martingale construction has been
presented in [2], where the density level ε is a random variable,
uniformly sampled in a pre-defined interval.
Figure 2. One realization of the DBSCAN-Martingale with T iterations
and 3 topics detected [2]
The DBSCAN-Martingale progressively updates the estimation
of the number of clusters (topics), as shown in Figure 2, where 3
topics are detected in 2 iterations of the process. Due to the
randomness in the selection of the density levels, it is likely that
each realization of the DBSCAN-Martingale will output a random
variable k as an estimation of the number of clusters. Hence, we
allow 10 realizations k_1, k_2, ..., k_10, and the final estimation of the
number of clusters is the majority vote over them. An illustrative
example of 5 clusters in the 2-dimensional plane is demonstrated in
Figure 3.
Figure 3. Example in the 2-dimensional plane and the histogram of results
after 100 realizations of the DBSCAN-Martingale
In brief, the DBSCAN-Martingale is mathematically formulated
as follows. Firstly, a sample of T density levels ε_1, ε_2, ..., ε_T is randomly
generated in (0, ε_max], where ε_max is an upper bound for the
density levels. The sample ε_1, ε_2, ..., ε_T is then sorted in
increasing order. For each density level ε_t we find the
corresponding clustering vector C_{ε_t}, for all stages t = 1, 2, ..., T.
In the first stage (t = 1), all clusters detected by C_{ε_1} are
kept, corresponding to the lowest density level. In the second
stage (t = 2), some of the clusters detected by C_{ε_2} are new
and some of them have also been detected by C_{ε_1}. In order
to keep only the newly detected clusters, we keep only groups of
points with the same cluster ID whose size is greater than minPts.
Finally, the cluster IDs are relabelled and the maximum value of the final
clustering vector provides the number of clusters.
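To make the above construction concrete, the following is a minimal sketch in R (the language of our released implementation), assuming a numeric feature matrix x with one row per news article (concept/named-entity features). The function names and the simplified extraction of newly detected clusters are illustrative and do not reproduce the released code.

```r
# Sketch of one DBSCAN-Martingale realization, assuming `x` is a numeric
# matrix of concept/entity features (one row per news article).
library(dbscan)

dbscan_martingale <- function(x, eps_max, n_iter = 5, minPts = 10) {
  eps_levels <- sort(runif(n_iter, min = 0, max = eps_max))  # random density levels
  labels <- rep(0L, nrow(x))   # 0 = not yet assigned to any cluster (noise so far)
  k <- 0L                      # number of clusters extracted so far
  for (eps in eps_levels) {
    c_eps <- dbscan(x, eps = eps, minPts = minPts)$cluster
    for (cl in setdiff(unique(c_eps), 0L)) {
      members <- which(c_eps == cl & labels == 0L)  # keep only newly detected points
      if (length(members) >= minPts) {              # respect the lower bound per cluster
        k <- k + 1L
        labels[members] <- k
      }
    }
  }
  list(n_clusters = k, labels = labels)
}

# Majority vote over 10 realizations gives the final estimate of k
estimate_topics <- function(x, eps_max, runs = 10) {
  ks <- replicate(runs, dbscan_martingale(x, eps_max)$n_clusters)
  as.integer(names(which.max(table(ks))))
}
```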
Complexity: The DBSCAN-Martingale requires T iterations of the
DBSCAN algorithm, which runs in O(n log n) if a tree-based
spatial index can be used and in O(n^2) without tree-based spatial
indexing [12]. Therefore, the DBSCAN-Martingale runs
in O(T n log n) for tree-based indexed datasets and in O(T n^2)
without tree-based indexing. Our code
(https://github.com/MKLab-ITI/topic-detection) is written in R
(https://www.r-project.org/), using the dbscan package
(https://cran.r-project.org/web/packages/dbscan/index.html), which
runs DBSCAN in O(n log n) with kd-tree data structures for fast
nearest neighbor search.
3.2 Latent Dirichlet Allocation (LDA)
LDA assumes a Bag-of-Words (BoW) representation of the
collection of documents and each topic is a distribution over terms
in a fixed vocabulary. LDA assigns probabilities to words and
assumes that documents exhibit multiple topics, in order to assign a
probability distribution on the set of documents. Finally, LDA
assumes that the order of words does not matter and, therefore,
LDA is not applicable to word n-grams for n > 1, but can be
applied to named entities and concepts. This input allows topic
detection even in multilingual corpora, where n-grams are not
available in a common language.
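As an illustration of this step, the following sketch uses the topicmodels R package to fit LDA and read off the topic assignments and cluster labels; the inputs dtm (a document-term matrix over the extracted concepts and named entities) and k (the number of topics estimated by the DBSCAN-Martingale) are assumed given.

```r
# Sketch of the LDA assignment step, assuming `dtm` is a document-term matrix
# over concepts/named entities and `k` is the estimated number of topics.
library(topicmodels)

assign_topics <- function(dtm, k, n_label_terms = 3) {
  lda_model <- LDA(dtm, k = k)
  list(
    assignment = topics(lda_model),               # most probable topic per article
    labels     = terms(lda_model, n_label_terms)  # top concepts/entities per topic
  )
}
```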
4 EXPERIMENTS
In this Section, we describe our dataset and evaluate our method.
4.1 Dataset description
A part of the present MULTISENSOR database (in which articles
crawled from international news websites are stored) was used for
the evaluation of our query-based topic detection framework. We
use the retrieved results for a given query in order to cluster them
into labelled clusters (topics) without knowing the number of
clusters. The concepts and named entities are extracted using the
DBpedia Spotlight online tool (https://dbpedia-spotlight.github.io/demo/)
and the final concepts and named entities replaced the raw text of each
news article. The final collection of text documents is available online
(http://mklab2.iti.gr/project/query-based-topic-detection-dataset).
The queries that were used for the experiments are the following:
energy crisis
energy policy
home appliances
solar energy
It should be noted that the aforementioned queries are
considered representative, with respect to the use cases addressed
by the MULTISENSOR project. The output of our topic detection
framework can be visualized in Figure 4 for the query “home
appliances”, where the retrieved results are clustered into 9 topics.
The font size of the clusters’ labels depends on the corresponding word
probabilities within each cluster.
4.2 Evaluation results
In order to evaluate the clustering of the retrieved news articles, we
use the average precision (AP), broadly used in the context of
information retrieval, clustering and classification. A document d
of a cluster c is considered relevant to c (true positive) if at least
one concept associated with document d also appears in the label
of cluster c. It should be noted that the labels of the clusters
(topics) are provided by the concepts or named entities that have
the highest probability (provided by LDA) within each topic.
Precision is the fraction of relevant documents in a
cluster and average precision (AP) is the average over all clusters of a
query. Finally, we average the AP scores over all considered queries
to obtain the Mean Average Precision (MAP).
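Under this definition, the AP of a query can be computed as in the following sketch, where doc_concepts is an assumed list of the concepts/entities of each article and assignment/labels are the outputs of the LDA step sketched in Section 3.2.

```r
# Sketch of the evaluation metric. `doc_concepts` is a list of character
# vectors (concepts/entities per document), `assignment` maps documents to
# clusters and `labels` is the matrix of top terms (labels) per cluster.
average_precision <- function(doc_concepts, assignment, labels) {
  per_cluster <- sapply(sort(unique(assignment)), function(cl) {
    docs <- which(assignment == cl)
    relevant <- sapply(docs, function(d) any(doc_concepts[[d]] %in% labels[, cl]))
    mean(relevant)   # precision: fraction of relevant documents in the cluster
  })
  mean(per_cluster)  # average precision over the clusters of one query
}

# MAP: average the AP scores over all considered queries, e.g.
# map_score <- mean(c(ap_q1, ap_q2, ap_q3, ap_q4))
```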
We compared the clustering performance of the proposed topic
detection framework, in which the DBSCAN-Martingale algorithm
(for estimating the number of topics) and LDA (for assigning news
articles to topics) are employed, against a variety of well-known
clustering approaches, which were also combined with LDA for a
fair comparison. DP-means is a Dirichlet process approach and we used its
implementation in R (https://github.com/johnmyleswhite/bayesian_nonparametrics).
HDBSCAN is a hierarchical DBSCAN approach, which uses the
“excess-of-mass” (EOM) criterion to find the optimal cut. NbClust is a
majority vote over the first 16 indices (the first 16 rows of Tables 1 and 2),
which are all described in detail in [9].
Figure 4. Demonstration of the MULTISENSOR topic detection framework
Table 1. Average Precision (± standard deviation) and Mean Average Precision over 10 runs of LDA using the estimated number of topics

Index + LDA          energy crisis      energy policy      solar energy       MAP
CH                   0.5786±0.0425      0.5371±0.0357      0.5961±0.0347      0.5765
Duda                 0.4498±0.0671      0.5534±0.0457      0.4484±0.0067      0.4703
Pseudo t^2           0.4498±0.0671      0.5534±0.0457      0.4484±0.0067      0.4703
C-index              0.5786±0.0425      0.5371±0.0357      0.5961±0.0347      0.5765
Ptbiserial           0.5786±0.0425      0.5371±0.0357      0.5961±0.0347      0.5765
DB                   0.5786±0.0425      0.5371±0.0357      0.5961±0.0347      0.5765
Frey                 0.3541±0.0181      0.3911±0.0033      0.4484±0.0067      0.3920
Hartigan             0.5938±0.0502      0.5336±0.0375      0.5961±0.0347      0.5794
Ratkowsky            0.5357±0.0151      0.5371±0.0357      0.5375±0.0446      0.5266
Ball                 0.4207±0.0093      0.4501±0.0021      0.4464±0.0614      0.4536
McClain              0.5786±0.0425      0.5371±0.0357      0.5961±0.0347      0.5215
KL                   0.5786±0.0425      0.5371±0.0357      0.5961±0.0347      0.5704
Silhouette           0.5786±0.0425      0.5371±0.0357      0.5961±0.0347      0.5765
Dunn                 0.5786±0.0425      0.5371±0.0357      0.5961±0.0347      0.5765
SDindex              0.3541±0.0181      0.3911±0.0033      0.4484±0.0067      0.4469
SDbw                 0.5786±0.0425      0.5371±0.0357      0.5961±0.0347      0.5765
NbClust              0.5786±0.0425      0.5371±0.0357      0.5961±0.0347      0.5765
DP-means             0.3541±0.0181      0.3911±0.0033      0.4484±0.0067      0.3920
HDBSCAN-EOM          0.4498±0.0671      0.3911±0.0033      0.5375±0.0446      0.4933
DBSCAN-Martingale    0.7691±0.0328      0.5534±0.0457      0.6073±0.0303      0.6353
Table 2. Estimation of the number of topics in the MULTISENSOR queries

Index                energy crisis   energy policy   home appliances   solar energy
CH                   12              8               15                15
Duda                 4               4               3                 2
Pseudo t^2           4               4               3                 2
C-index              12              8               15                15
Ptbiserial           12              8               15                15
DB                   12              8               15                15
Frey                 2               2               2                 2
Hartigan             11              7               15                15
Ratkowsky            7               8               5                 5
Ball                 3               3               3                 3
McClain              12              8               2                 15
KL                   12              8               11                15
Silhouette           12              8               15                15
Dunn                 12              8               15                15
SDindex              2               2               15                2
SDbw                 12              8               15                15
NbClust              12              8               15                15
DP-means             2               2               2                 2
HDBSCAN-EOM          4               2               10                5
DBSCAN-Martingale    6               4               9                 10
The AP scores per query and the MAP scores per method over
10 runs of LDA are displayed in Table 1, for each estimation of the
number of topics combined with LDA. In addition, the numbers of
news clusters estimated by the considered clustering indices for
each query are presented in Table 2. Looking at Table 1, we
observe a relative increase of 9.65% in MAP, when our topic
detection framework is compared to the second highest MAP score
(by Hartigan+LDA) and a relative increase of 10.20%, when
compared to the most recent approach (NbClust+LDA).
In general, the proposed topic detection framework outperforms
all the considered clustering approaches both in terms of AP
(within each query) and in terms of MAP (overall performance for
all queries), with the exception of the “energy policy” query, where
the performance of our framework is matched by that of the Duda
and Pseudo t^2 clustering indices.
Finally, we evaluated the time performance of the DBSCAN-
Martingale method and we selected several baseline approaches in
order to compare their processing time with that of our approach.
In Figure 5, the number of news clusters is estimated for T
iterations of the DBSCAN-Martingale and for a maximum number
of clusters set to 15 for the indices Duda, Pseudo t^2, Silhouette,
Dunn and SDindex. We observe that DBSCAN-Martingale is faster
than all other methods. Even when it is applied to 500 documents,
it is able to reach a decision about the number of clusters in
approximately 0.4 seconds.
Figure 5. Time performance of DBSCAN-Martingale and several baseline
approaches to estimate the number of news clusters
5 CONCLUSIONS
In this paper, we have presented a hybrid topic detection
framework, developed for the purposes of the MULTISENSOR
project. Given a query-based search, the framework clusters the
retrieved results by topic, without the need to know the number of
topics a priori. The framework employs the recently introduced
DBSCAN-Martingale method for efficiently estimating the number
of news clusters, coupled with Latent Dirichlet Allocation for
assigning the news articles to topics. Our topic detection
framework relies on high-level textual features that are extracted
from the news articles, namely textual concepts and named entities.
In addition, it is multimodal, since it fuses more than one source
of information from the same multimedia object. The query-based
topic detection experiments have shown that our framework
outperforms several well-known clustering methods, both in terms
of Average Precision and Mean Average Precision. A direct
comparison in terms of time performance has shown that our
approach is faster than several well-performing methods at
estimating the number of clusters, given the same number of
query-based retrieved news articles as input.
As future work, we plan to investigate the behavior of our
framework by introducing additional modalities/features, examine
the application of alternative (other than LDA) text clustering
approaches, as well as investigate the extraction of language-
agnostic concepts and named entities, something that could provide
multilingual capabilities to our topic detection framework.
ACKNOWLEDGEMENTS
This work was supported by the projects MULTISENSOR (FP7-
610411) and KRISTINA (H2020-645012), funded by the European
Commission.
REFERENCES
[1] J. Allan (Ed.), Topic Detection and Tracking: Event-based Information Organization, vol. 12, Springer Science & Business Media, (2012).
[2] I. Gialampoukidis, S. Vrochidis and I. Kompatsiaris, ‘A hybrid framework for news clustering based on the DBSCAN-Martingale and LDA’, In: Perner, P. (Ed.) Machine Learning and Data Mining in Pattern Recognition, LNAI 9729, pp. 170-184, (2016).
[3] C. C. Aggarwal and C. Zhai, ‘A survey of text clustering algorithms’, In Mining Text Data, pp. 77-128, Springer US, (2012).
[4] M. Qian and C. Zhai, ‘Unsupervised feature selection for multi-view clustering on text-image web news data’, In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pp. 1963-1966, ACM, (2014).
[5] A. Kumar and H. Daumé, ‘A co-training approach for multi-view spectral clustering’, In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 393-400, (2011).
[6] D. M. Blei, A. Y. Ng and M. I. Jordan, ‘Latent Dirichlet allocation’, Journal of Machine Learning Research, vol. 3, pp. 993-1022, (2003).
[7] Y. W. Teh, M. I. Jordan, M. J. Beal and D. M. Blei, ‘Hierarchical Dirichlet processes’, Journal of the American Statistical Association, 101(476), (2006).
[8] B. Kulis and M. I. Jordan, ‘Revisiting k-means: New algorithms via Bayesian nonparametrics’, arXiv preprint arXiv:1111.0352, (2012).
[9] M. Charrad, N. Ghazzali, V. Boiteau and A. Niknafs, ‘NbClust: an R package for determining the relevant number of clusters in a data set’, Journal of Statistical Software, 61(6), pp. 1-36, (2014).
[10] M. Ester, H. P. Kriegel, J. Sander and X. Xu, ‘A density-based algorithm for discovering clusters in large spatial databases with noise’, In KDD, 96(34), pp. 226-231, (1996).
[11] G. Petkos, M. Schinas, S. Papadopoulos and Y. Kompatsiaris, ‘Graph-based multimodal clustering for social multimedia’, Multimedia Tools and Applications, pp. 1-23, (2016).
[12] M. Ankerst, M. M. Breunig, H. P. Kriegel and J. Sander, ‘OPTICS: ordering points to identify the clustering structure’, In ACM SIGMOD Record, 28(2), pp. 49-60, ACM, (1999).
[13] J. Sander, X. Qin, Z. Lu, N. Niu and A. Kovarsky, ‘Automatic extraction of clusters from hierarchical clustering representations’, In Advances in Knowledge Discovery and Data Mining, pp. 75-87, Springer Berlin Heidelberg, (2003).
[14] R. J. Campello, D. Moulavi and J. Sander, ‘Density-based clustering based on hierarchical density estimates’, In Advances in Knowledge Discovery and Data Mining, pp. 160-172, Springer Berlin Heidelberg, (2013).
We propose a theoretically and practically improved density-based, hierarchical clustering method, providing a clustering hierarchy from which a simplified tree of significant clusters can be constructed. For obtaining a “flat” partition consisting of only the most significant clusters (possibly corresponding to different density thresholds), we propose a novel cluster stability measure, formalize the problem of maximizing the overall stability of selected clusters, and formulate an algorithm that computes an optimal solution to this problem. We demonstrate that our approach outperforms the current, state-of-the-art, density-based clustering methods on a wide variety of real world data.