Content available from International Journal of Data Science and Analytics
International Journal of Data Science and Analytics (2024) 18:239–258
https://doi.org/10.1007/s41060-024-00540-x
REVIEW
A comprehensive and analytical review of text clustering techniques
Vivek Mehta¹ · Mohit Agarwal¹ · Rohit Kumar Kaliyar¹
Received: 6 October 2023 / Accepted: 18 March 2024 / Published online: 8 April 2024
© The Author(s), under exclusive licence to Springer Nature Switzerland AG 2024
Abstract
Document clustering groups documents so that similar documents fall in the same cluster and dissimilar
documents fall in different clusters. It is considered a fundamental problem in the field of text mining.
The sharp rise in textual content on the Internet in recent years makes this problem increasingly
challenging. For example, Springer Nature alone has published more than 67,000 articles in the last few
years on the topic of COVID-19. Such volume leads to the challenge of very high dimensionality when
analyzing textual datasets. In this review paper, several text clustering techniques are reviewed and
analyzed both theoretically and experimentally. The reviewed techniques range from traditional
non-semantic to state-of-the-art semantic text clustering techniques. Their individual performances are
experimentally compared and analyzed on several datasets using performance measures such as purity,
silhouette coefficient, and adjusted Rand index. Additionally, significant research gaps are presented to
give readers a direction for future research.
Keywords Text clustering · Semantic text clustering · Word embeddings · Curse of dimensionality
1 Introduction
Textual data in digital form has grown exponentially in
recent years. Organizations need a proper system in place
to extract meaningful insights from the available data.
Whether the task is social media analytics or cybercrime
investigation, some form of textual data analytics is required.
For example, more than 67,000 new research articles on
COVID-19 alone have been published by Springer Nature [16].
More than 7800 peer-reviewed articles appeared in the Web
of Science database for the keyword search “coronavirus”
alone. Additionally, the expansion rate doubles every
20 days [15]. Textual data are thus accumulating in digital
form at a remarkable pace. With so much data available, it is
practically impossible for a health official or a medical
researcher to go through all of it in order to make better
decisions to halt the spread of disease.
✉ Vivek Mehta
vivekmehta27@gmail.com
¹ Bennett University, Greater Noida, Uttar Pradesh 201310, India
Document clustering has become increasingly significant
in the analysis of large numbers of documents spread across
multiple sites. The difficult part is organizing the documents
in a way that allows for better searching without adding much
extra cost and complexity. The Cluster Hypothesis is central
to the discussion of increased efficiency. It states that
relevant documents are more similar to one another than to
non-relevant documents, and hence tend to appear in the same
clusters [45]. Relevant documents will be well separated from
non-relevant documents if the cluster hypothesis holds true
for a given document collection. Clustering can help here
because a relevant document may be ranked low in a best-match
search simply because it lacks some of the query terms.
An increasingly important application of clustering is text
summarization, which in turn has applications in domains such
as journalism (news summarization), book summarization, and
legal data summarization. For example, in [44] text clustering
has been successfully used to improve the summarization of
lengthy documents by enhancing the document-level scoring of
each sentence.
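The Cluster Hypothesis discussed above can be illustrated with a minimal sketch. The toy corpus, the group labels, and the helper names (tfidf_vectors, cosine, mean_similarities) are all hypothetical and chosen only for illustration; the TF-IDF weighting with smoothed IDF is one common choice, not necessarily the one used in the works reviewed here. The sketch compares the average cosine similarity of document pairs within the same topic against pairs drawn from different topics.

```python
import math
from collections import Counter

# Hypothetical toy corpus: two documents on a health topic, two on a sport topic.
docs = [
    "virus spread infection hospital patients",
    "infection patients vaccine virus hospital",
    "football match goal team league",
    "team league goal players football",
]
labels = ["health", "health", "sport", "sport"]  # assumed relevance groups


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (dict: term -> weight) per text."""
    tokenized = [t.split() for t in texts]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF keeps weights positive even for terms in every document.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_similarities(vectors, labels):
    """Average pairwise similarity within the same group and across groups."""
    within, across = [], []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            sim = cosine(vectors[i], vectors[j])
            (within if labels[i] == labels[j] else across).append(sim)
    return sum(within) / len(within), sum(across) / len(across)


vectors = tfidf_vectors(docs)
within_sim, across_sim = mean_similarities(vectors, labels)
print(f"mean within-group similarity: {within_sim:.3f}")
print(f"mean across-group similarity: {across_sim:.3f}")
```

On this toy corpus the two topics share no terms, so same-topic pairs score higher than cross-topic pairs, which is the behavior the hypothesis predicts. Real collections are far noisier, and whether the hypothesis holds has to be checked per collection.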
The five major contributions of this review paper are as
follows.
1. Various text representation methods are reviewed,
such as the vector space model, lexical chains,
and word embeddings.
2. Based on each representation method, techniques for
clustering the text are reviewed such as partitioning-