International Journal of Data Science and Analytics (2024) 18:239–258
https://doi.org/10.1007/s41060-024-00540-x
REVIEW
A comprehensive and analytical review of text clustering techniques
Vivek Mehta · Mohit Agarwal · Rohit Kumar Kaliyar
Bennett University, Greater Noida, Uttar Pradesh 201310, India
Correspondence: vivekmehta27@gmail.com
Received: 6 October 2023 / Accepted: 18 March 2024 / Published online: 8 April 2024
© The Author(s), under exclusive licence to Springer Nature Switzerland AG 2024
Abstract
Document clustering involves grouping documents so that similar documents fall into the same cluster and dissimilar documents into different clusters. Clustering of documents is considered a fundamental problem in the field of text mining. The sharp rise in textual content on the Internet in recent years has made this problem increasingly challenging. For example, Springer Nature alone has published more than 67,000 articles in the last few years just on the topic of COVID-19. This high volume leads to the challenge of very high dimensionality in analyzing textual datasets. In this review paper, several text clustering techniques are reviewed and analyzed both theoretically and experimentally. The reviewed techniques range from traditional non-semantic to state-of-the-art semantic text clustering techniques. The individual performances of these techniques are experimentally compared and analyzed on several datasets using performance measures such as purity, Silhouette coefficient, and adjusted Rand index. Additionally, significant research gaps are presented to give readers a direction for future research.
Keywords Text clustering · Semantic text clustering · Word embeddings · Curse of dimensionality
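As a rough illustration of the three performance measures named in the abstract, the sketch below computes purity, adjusted Rand index, and the Silhouette coefficient for a clustering. It is not from the paper: scikit-learn and 20 Newsgroups are illustrative stand-ins, and purity is implemented by hand since scikit-learn has no built-in for it.

```python
# Illustrative evaluation of a clustering against gold labels. Purity is
# computed from the contingency matrix; ARI and Silhouette come from
# scikit-learn. Dataset and parameters are stand-ins, not the paper's setup.
from sklearn.cluster import KMeans
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.metrics.cluster import contingency_matrix

def purity(labels_true, labels_pred):
    # Each cluster counts toward its majority true class.
    cm = contingency_matrix(labels_true, labels_pred)
    return cm.max(axis=0).sum() / cm.sum()

news = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))
X = TfidfVectorizer(max_features=5000, stop_words="english").fit_transform(news.data)
labels = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(X)

print("purity    :", purity(news.target, labels))
print("ARI       :", adjusted_rand_score(news.target, labels))
print("silhouette:", silhouette_score(X, labels, metric="cosine"))
```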
1 Introduction
Textual data in digital form has grown exponentially in recent years. Organizations need a proper system in place to draw meaningful insights from the available data. Whether the task is social media analytics or cybercrime investigation, it requires some form of textual data analysis. For example, more than 67,000 new research articles on COVID-19 alone have been published by Springer Nature [16]. More than 7800 peer-reviewed articles appeared in the Web of Science database just for the keyword search “coronavirus.” Additionally, this volume doubles roughly every 20 days [15]. It is easy to imagine, then, how fast textual data are accumulating in digital form. With so much available data, it is practically impossible for a health official or a medical researcher to read through all of it in order to make better decisions to halt the spread of disease.
In the analysis of large collections of documents spread across multiple sites, document clustering has become increasingly significant. The difficult part is organizing the documents in a way that allows for better searching
without adding a lot of extra cost and complexity. The Cluster Hypothesis is essential to the discussion of increased efficiency. It states that relevant documents are more similar to one another than to non-relevant documents, and hence tend to appear in the same clusters [45]. Relevant documents will be well differentiated from non-relevant documents if the Cluster Hypothesis holds true for a given document collection. A relevant document may be ranked low in a best-match search simply because it lacks some of the query terms; clustering can help surface such documents. Especially nowadays, an important application of clustering lies in text summarization, which in turn has applications in various domains such as journalism (news summarization), book summarization, and legal data summarization. For example, in [44] text clustering has been successfully used to improve the summarization of lengthy documents by enhancing the document-level scoring of each sentence.
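To make the clustering-for-summarization idea concrete, here is a minimal unsupervised sketch: cluster the sentences of a document and keep the most central sentence of each cluster. This is not the DCESumm method of [44], which additionally uses supervised sentence scoring; everything below is illustrative.

```python
# Cluster a document's sentences with TF-IDF + k-means and keep the sentence
# nearest each centroid. Illustrative only; DCESumm [44] adds supervised
# sentence-level relevance scores on top of such clustering signals.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances

def cluster_summary(sentences, n_sentences=2):
    X = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    km = KMeans(n_clusters=n_sentences, n_init=10, random_state=0).fit(X)
    picks = []
    for c in range(n_sentences):
        members = np.where(km.labels_ == c)[0]
        d = euclidean_distances(X[members], km.cluster_centers_[c].reshape(1, -1))
        picks.append(members[d.argmin()])         # most central sentence
    return [sentences[i] for i in sorted(picks)]  # preserve original order

doc = ["The court heard the appeal on Monday.",
       "Judges questioned the scope of the statute.",
       "Meanwhile, the budget bill passed the senate.",
       "Lawmakers praised the new spending plan."]
print(cluster_summary(doc))
```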
The five major contributions of this review paper are as
follows.
1. Various representation methods for text are reviewed, such as the vector space model, lexical chains, and word embeddings.
2. Based on each representation method, techniques for clustering the text are reviewed, such as partitioning-based methods.
... Clustering document data is a fundamental task in various applications, typically involving feature extraction followed by the application of clustering techniques [24,46,27,34]. A noteworthy state-of-the-art framework, as exemplified by [11], incorporates a pre-trained text encoder (like Embeddings from Language Models (ELMO)) for feature extraction and employs K-means to generate clusters. ...
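A minimal sketch of this encode-then-cluster pipeline follows. The framework in [11] uses ELMo; the Sentence-Transformers model below is an assumed stand-in chosen only to keep the sketch short and runnable.

```python
# Encode documents with a pre-trained text encoder, then cluster with k-means.
# The cited framework uses ELMo; "all-MiniLM-L6-v2" is an illustrative
# substitute, not the encoder from [11].
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

documents = ["stocks rallied after the rate decision",
             "the striker scored twice in the final",
             "the central bank signalled further hikes",
             "the home team lifted the trophy"]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # pre-trained encoder
embeddings = encoder.encode(documents)             # one dense vector per document
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings))
```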
Thesis
This research introduces the Conceptual Document Clustering Explanation Model (CDCEM), a novel explanation model for explaining unsupervised textual clustering. CDCEM explains the discovered clusters and document assignments. Furthermore, it ensures faithfulness, meaning it accurately reflects the decision-making process, by using the core elements of black-box textual clustering, such as document embeddings and centroids from k-means. This faithfulness and comprehensiveness boost user trust and understanding and help debug clustering. Using Wikipedia, CDCEM first performs wikification, which extracts real-world concepts from the text. It then evaluates these concepts' significance for cluster assignment to produce concept-based explanations. CDCEM determines the importance of each concept within a cluster by measuring the cosine similarity between the concept's embedding (representing its contextual meaning) and the cluster centres (representing the cluster's theme), both of which it derives from a black-box model (using ELMo for embeddings and K-means for clustering). The importance of each concept for a cluster facilitates generating concept-based explanations at two levels: cluster-level explanations, which describe the concepts that best represent the clusters, and document-level explanations, which clarify why the black-box model assigns a document to a particular cluster. We quantitatively evaluate the faithfulness of CDCEM using the AG News, DBpedia, and Reuters-21578 datasets, comparing it with explainable classification methods (Decision Tree, Logistic Regression, and Naive Bayes) by treating clusters as classes and computing the agreement between the black-box model's predictions and the explanations. Additionally, a user study was conducted to compare CDCEM with the best baseline in terms of comprehensiveness, accuracy, usefulness, user satisfaction, and usability of the explanation visualization tool on the AG News dataset. CDCEM showed higher faithfulness than the baseline model in quantitative evaluations, indicating accurate explanations of unsupervised clustering decisions. Qualitative evaluations revealed that users preferred CDCEM's cluster-level and document-level explanations for accuracy, clarity, logic, and comprehensibility.
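The scoring step at the heart of CDCEM, as described in this abstract, can be sketched as follows; the concept embeddings and centroids below are random stand-ins for the ELMo embeddings and k-means centroids the model actually derives.

```python
# Concept importance as cosine similarity between a concept's embedding and
# each cluster centroid. Vectors here are random stand-ins; CDCEM derives
# them from ELMo embeddings and k-means centroids of the black-box model.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
concepts = {c: rng.normal(size=256) for c in ("inflation", "playoffs", "senate")}
centroids = rng.normal(size=(3, 256))    # cluster centres from the black box

for name, vec in concepts.items():
    sims = cosine_similarity(vec.reshape(1, -1), centroids)[0]
    print(f"{name}: best cluster {sims.argmax()} (importance {sims.max():.2f})")
```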
Article
In recent years, quantum-inspired metaheuristic algorithms have emerged as promising due to their efficiency, robustness, and faster computational capability. In this paper, a novel Quantum Inspired Differential Evolution (QIDE) algorithm is presented for the automatic clustering of unlabeled datasets. In automatic clustering, datasets are clustered into an optimal number of groups on the run, without any a priori knowledge of the datasets. In this work, the proposed algorithm is compared with two other quantum-inspired algorithms, viz., the Fast Quantum Inspired Evolutionary Clustering Algorithm (FQEA) and the Quantum Evolutionary Algorithm for Data Clustering (QEAC), a Classical Differential Evolution (CDE) algorithm with different mutation probabilities, and an Improved Differential Evolution (IDE) algorithm. Experiments were conducted on six real-life, publicly available datasets to identify the optimal number of clusters. By introducing some concepts of quantum gates, the proposed algorithm not only achieves good convergence speed but also provides better results than the other competitive algorithms. In addition, Sobol's sensitivity analysis was conducted for tuning the parameters of the proposed algorithm.
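As a rough illustration of differential-evolution-based clustering, the sketch below uses SciPy's classical DE optimizer to place centroids by maximizing the Silhouette coefficient. The quantum-inspired operators and the automatic selection of the number of clusters from the paper are omitted; k is fixed here.

```python
# Differential Evolution searching for k*d centroid coordinates that maximise
# the Silhouette coefficient; points are assigned to their nearest candidate
# centroid. Classical DE only; k is fixed rather than found automatically.
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X = load_iris().data
k, d = 3, X.shape[1]

def fitness(flat):
    centroids = flat.reshape(k, d)
    labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
    if len(np.unique(labels)) < 2:       # degenerate partition, penalise
        return 1.0
    return -silhouette_score(X, labels)  # DE minimises, so negate

bounds = [(X[:, j].min(), X[:, j].max()) for j in range(d)] * k
result = differential_evolution(fitness, bounds, popsize=10, maxiter=40, seed=0)
print("best silhouette:", -result.fun)
```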
Article
Document clustering is a well-established technique used to segregate voluminous text corpora into distinct categories. In this paper we present an improved algorithm for clustering large text corpora. The proposed algorithm tries to overcome the challenges of clustering large corpora while maintaining high "goodness" values for the proposed clusters. The algorithm proceeds by optimizing a fitness function using Differential Evolution to form the initial clusters. The clusters obtained after the initial phase are then "refined" by re-evaluating the points that fall at the fringes of the clusters and reassigning them to other clusters, if necessary. Two different approaches, namely Nearest Cluster Based Re-evaluation (N-CBR) and Multiple Cluster Based Re-evaluation (M-CBR), have been proposed to select candidates during the reassignment phase, and their performances have been evaluated. The effect of this post-processing phase has been demonstrated on a number of standard benchmark text corpora, and the algorithm is found to be quite accurate and efficient. The results obtained by the proposed method have also been compared with other evolutionary strategies, e.g., Genetic Algorithm (GA), Particle Swarm Optimization (PSO), and Harmony Search (HS), and have been found to be quite satisfactory.
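The fringe re-evaluation idea can be sketched as follows; the initial clustering, the margin threshold, and the reassignment test below are illustrative stand-ins rather than the paper's exact N-CBR/M-CBR criteria, and the initial phase uses k-means instead of Differential Evolution.

```python
# After an initial clustering, re-evaluate "fringe" points, i.e. points whose
# nearest and second-nearest centroids are almost equally far, and move them
# to the neighbouring cluster if its members are closer on average. The
# initial phase uses k-means here, and the 10% margin cut-off is arbitrary.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.5, random_state=0)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
labels = km.labels_.copy()

dists = np.linalg.norm(X[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
ordered = np.sort(dists, axis=1)
margin = ordered[:, 1] - ordered[:, 0]     # small margin = fringe point
second = np.argsort(dists, axis=1)[:, 1]   # second-nearest cluster

for i in np.where(margin < np.quantile(margin, 0.10))[0]:
    own = labels == labels[i]
    own[i] = False                         # exclude the point itself
    d_own = np.linalg.norm(X[own] - X[i], axis=1).mean()
    d_alt = np.linalg.norm(X[labels == second[i]] - X[i], axis=1).mean()
    if d_alt < d_own:
        labels[i] = second[i]              # reassign the fringe point
```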
Article
The appropriate understanding and fast processing of lengthy legal documents are computationally challenging problems. Designing efficient automatic summarization techniques can potentially be the key to dealing with such issues. Extractive summarization is one of the most popular approaches for forming summaries out of such lengthy documents, via the process of summary-relevant sentence selection. An efficient application of this approach involves appropriate scoring of sentences, which helps in the identification of more informative and essential sentences from the document. In this work, a novel sentence scoring approach, DCESumm, is proposed, which consists of supervised sentence-level summary relevance prediction as well as unsupervised clustering-based document-level score enhancement. Experimental results on two legal document summarization datasets, BillSum and Forum of Information Retrieval Evaluation (FIRE), reveal that the proposed approach achieves significant improvements over the current state-of-the-art approaches. More specifically, it achieves ROUGE F1-score improvements of 1-6% and 6-12% for the BillSum and FIRE test sets respectively. Such impressive summarization results suggest the usefulness of the proposed approach in finding the gist of a lengthy legal document, thereby providing crucial assistance to legal practitioners.
Article
Domain-driven data mining of health care data poses unique challenges. The aim of this paper is to explore the advantages and the challenges of a ‘domain-led approach’ versus a data-driven approach to a k-means clustering experiment. For the purpose of this experiment, clinical experts in heart failure selected variables to be used during the k-means clustering, whilst in the ‘data-driven approach’ feature selection was performed by applying principal component analysis to the multidimensional dataset. Six out of seven features selected by physicians were amongst the 26 features that contributed most to the significant principal components within the k-means algorithm. The data-driven approach showed an advantage over the domain-led approach for feature selection by removing the risk of bias that can be introduced by domain experts. Whilst the ‘domain-led approach’ may potentially prohibit knowledge discovery that can be hidden behind variables not routinely taken into consideration as clinically important features, the domain knowledge played an important role at the interpretation stage of the clustering experiment, providing insight into the context and preventing far-fetched conclusions. The ‘data-driven approach’ was accurate in identifying clusters with distinct features at the physiological level. To promote the domain-led data mining approach, as a result of this experiment we developed a practical checklist guiding how to enable the integration of domain knowledge into a data mining project.
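A compact sketch of the two feature-selection routes described above follows, on a stand-in tabular dataset; the clinical heart-failure data is not public, and the 'expert' column indices are hypothetical.

```python
# k-means on expert-chosen columns versus k-means on principal components.
# The dataset and the "expert" column indices are hypothetical stand-ins for
# the clinical heart-failure variables, which are not public.
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_breast_cancer().data)
km = lambda F: KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(F)

expert_cols = [0, 2, 3, 7]                      # hypothetical domain selection
X_dom = X[:, expert_cols]
X_pca = PCA(n_components=0.9).fit_transform(X)  # keep 90% of the variance

print("domain-led :", silhouette_score(X_dom, km(X_dom)))
print("data-driven:", silhouette_score(X_pca, km(X_pca)))
```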
Article
Text document clustering is used to separate a collection of documents into several clusters such that the documents within a cluster are substantially similar, while the documents in one cluster are distinct from documents in other clusters. The high-dimensional sparse document-term matrix reduces the efficiency of the clustering process. This study proposes a new way of clustering documents using a domain ontology and the WordNet ontology. The main objective of this work is to increase cluster output quality. This work aims to investigate and examine the method of selecting feature dimensions to minimize the features of the document-term matrix. The sports documents are clustered using conventional K-Means with a dimension-reduction feature selection process and density-based clustering. A novel approach named ontology-based document clustering is proposed for grouping the text documents. Three critical steps were used to develop this technique. The first step of the ontology-based clustering approach starts with data pre-processing, and the feature dimensions are reduced with Info-Gain selection. The documents are clustered using two clustering methods: K-Means and density-based clustering with the dimension-reduction feature selection process. These methods validate the findings of ontology-based clustering, and this study compared them using measurement metrics. The second step of this study examines the development of a sports-domain ontology and describes the principles and relationships of the terms using sports-related documents. The semantic web rational process is used to test the ontology for validation purposes. An algorithm for synonym retrieval of the sports domain ontology terms has been proposed and implemented. The terms retrieved from the documents and the sport ontology concepts are mapped to the synonym-set words retrieved from the WordNet ontology. The suggested technique is based on synonyms of mapped concepts. The proposed ontology approach employs the reduced feature set to cluster the text documents. The results are compared with two traditional approaches on two datasets. The proposed ontology-based clustering approach is found to be effective in clustering the documents with high precision, recall, and accuracy. In addition, this study also compared different RDF serialization formats for the sports ontology.
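The synonym-retrieval step against WordNet can be sketched with NLTK as follows; the sports terms are illustrative.

```python
# Retrieve WordNet synonym sets for domain terms with NLTK. The sports terms
# are illustrative; the paper maps sports-ontology concepts to these synsets.
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

def synonyms(term):
    # Union of lemma names over every synset of the term.
    return {lemma.name().replace("_", " ")
            for synset in wn.synsets(term)
            for lemma in synset.lemmas()}

for term in ("football", "referee", "tournament"):
    print(term, "->", sorted(synonyms(term))[:5])
```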
Article
Text data is a type of unstructured information that is easily processed by a human but hard for a computer to understand. Text mining techniques effectively discover meaningful information from text and have received a great deal of attention in recent years. The aim of this study is to evaluate and analyze the comments and suggestions presented by the Barez Iran Company. Barez is an unlabeled dataset, and manually extracting useful information from large unlabeled textual data is very difficult and time consuming. Therefore, in this paper we analyze suggestions presented in Persian using BERTopic modeling for cluster analysis of the dataset. In BERTopic, each document belongs to a topic with a probability distribution. As a result, seven latent topics are found, covering a broad range of issues such as installation, manufacture, correction, and device. We then propose a novel deep text clustering method, based on a hybrid of a stacked autoencoder and k-means clustering, to organize text documents into meaningful groups for mining information from the Barez data in an unsupervised manner. Our data clustering has three main steps: 1) text representation with a new pre-trained BERT model for language understanding called ParsBERT; 2) text feature extraction based on a new stacked autoencoder architecture to reduce the dimension of the data and provide robust features for clustering; 3) clustering the data by k-means. We employ the Barez dataset to verify our work's effectiveness; the Silhouette score is used to evaluate the resulting clusters, with a best value of 0.60 for a grouping into 3 clusters. Experimental evaluations demonstrate that the proposed algorithm clearly outperforms other clustering methods.
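A minimal sketch of the stacked-autoencoder-plus-k-means pipeline follows. TF-IDF vectors of an English corpus stand in for the ParsBERT representations of the Barez data, which are not public, so only the reduce-then-cluster pattern is illustrated.

```python
# Train a small stacked autoencoder on document vectors, then run k-means on
# the bottleneck codes. TF-IDF of an English corpus stands in for ParsBERT
# representations of the Barez data.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

docs = fetch_20newsgroups(subset="test").data[:2000]
X = torch.tensor(TfidfVectorizer(max_features=2000).fit_transform(docs).toarray(),
                 dtype=torch.float32)
n = X.shape[1]

encoder = nn.Sequential(nn.Linear(n, 256), nn.ReLU(), nn.Linear(256, 32))
decoder = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, n))
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)

for _ in range(20):                       # reconstruction training, full batch
    opt.zero_grad()
    loss = nn.functional.mse_loss(decoder(encoder(X)), X)
    loss.backward()
    opt.step()

codes = encoder(X).detach().numpy()       # robust low-dimensional features
print(KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(codes)[:20])
```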
Article
A massive amount of textual data now exists in digital repositories in the form of research articles, news articles, reviews, Wikipedia articles, books, etc. Text clustering is a fundamental data mining technique used to perform categorization, topic extraction, and information retrieval. Textual datasets, especially those which contain a large number of documents, are sparse and have high dimensionality. Hence, traditional clustering techniques such as K-means, agglomerative clustering, and DBSCAN cannot perform well. In this paper, a clustering technique especially suitable for large text datasets is proposed that overcomes these limitations. The proposed technique is based on word embeddings derived from a recent deep learning model named "Bidirectional Encoder Representations from Transformers" (BERT). The proposed technique is named WEClustering. It deals with the problem of high dimensionality in an effective manner, and hence more accurate clusters are formed. The technique is validated on several datasets of varying sizes and its performance is compared with other widely used and state-of-the-art clustering techniques. The experimental comparison shows that the proposed clustering technique gives a significant improvement over other techniques as measured by metrics such as purity and adjusted Rand index.
Article
The numerous tourism attractions, along with the huge amount of information about them on web and social platforms, have made the decision-making process for selecting and visiting them complicated. In this regard, tourism recommendation systems have become interesting for tourists but challenging for designers, because they should be able to provide personalized services. This paper introduces a tourism recommendation system that extracts users' preferences in order to provide personalized recommendations. To this end, users' reviews on tourism social networks are used as a rich source of information to extract their preferences. The comments are preprocessed, semantically clustered, and sentimentally analyzed to detect a tourist's preferences. Similarly, all users' aggregated reviews about an attraction are utilized to extract the features of these points of interest. Finally, the proposed recommendation system semantically compares the preferences of a user with the features of attractions to suggest the points of interest that best match the user. In addition, the system utilizes the vital contextual information of time, location, and weather to filter unsuitable items and increase the quality of suggestions for the current situation. The proposed recommendation system is developed in Python and evaluated on a dataset gathered from the TripAdvisor platform. The evaluation results show that the proposed system improves the F-measure criterion in comparison with previous systems.
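The final matching step, comparing a user-preference vector against attraction feature vectors, might look like the following sketch; all names and vectors are toy stand-ins.

```python
# Rank attractions by cosine similarity between a user-preference vector and
# attraction feature vectors. All names and vectors are toy stand-ins.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

user_pref = np.array([[0.9, 0.1, 0.7]])   # e.g. nature, nightlife, history
attractions = {"museum":   np.array([[0.2, 0.0, 1.0]]),
               "beach":    np.array([[1.0, 0.3, 0.1]]),
               "old town": np.array([[0.5, 0.2, 0.9]])}

scores = {name: cosine_similarity(user_pref, vec)[0, 0]
          for name, vec in attractions.items()}
print(sorted(scores, key=scores.get, reverse=True))   # best matches first
```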
Chapter
WordNet is an on-line lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory; version 1.6 is the most up-to-date version of the system. WordNet, an electronic lexical database, is considered to be the most important resource available to researchers in computational linguistics, text analysis, and many related areas. Its design is inspired by current psycholinguistic and computational theories of human lexical memory. English nouns, verbs, adjectives, and adverbs are organized into synonym sets, each representing one underlying lexicalized concept. Different relations link the synonym sets. The purpose of this volume is twofold. First, it discusses the design of WordNet and the theoretical motivations behind it. Second, it provides a survey of representative applications, including word sense identification, information retrieval, selectional preferences of verbs, and lexical chains. Contributors: Reem Al-Halimi, Robert C. Berwick, J. F. M. Burg, Martin Chodorow, Christiane Fellbaum, Joachim Grabowski, Sanda Harabagiu, Marti A. Hearst, Graeme Hirst, Douglas A. Jones, Rick Kazman, Karen T. Kohl, Shari Landes, Claudia Leacock, George A. Miller, Katherine J. Miller, Dan Moldovan, Naoyuki Nomura, Uta Priss, Philip Resnik, David St-Onge, Randee Tengi, Reind P. van de Riet, Ellen Voorhees. (A Bradford Book.)
Article
Document clustering in text mining is a problem that is heavily researched upon. It is observed that individual approaches based on statistical features and semantic features have been extensively used to solve this problem. However, techniques combining the advantages of both types of features have not been frequently researched upon. Specifically, when the growth in the size of textual data is immense, there is a need for such an approach that combines the advantages of both types of features to give more accurate results within an acceptable range of time. In this paper, a document clustering technique is proposed that combines the effectiveness of the statistical features (using TF-IDF) and semantic features (using lexical chains). It is designed to use a fewer number of features while maintaining a comparable and even better accuracy for the task of document clustering.