Academic Editor: Shmuel Tomi Klein
Received: 26 November 2024
Revised: 20 January 2025
Accepted: 4 February 2025
Published: 10 February 2025
Citation: Allam, H.; Makubvure, L.; Gyamfi, B.; Graham, K.N.; Akinwolere, K. Text Classification: How Machine Learning Is Revolutionizing Text Categorization. Information 2025, 16, 130. https://doi.org/10.3390/info16020130
Copyright: © 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Article
Text Classification: How Machine Learning Is Revolutionizing
Text Categorization
Hesham Allam *, Lisa Makubvure, Benjamin Gyamfi, Kwadwo Nyarko Graham and Kehinde Akinwolere
Center for Information & Communication Sciences (CICS), Ball State University, Muncie, IN 47306, USA; benjamin.gyamfi@bsu.edu (B.G.); kwadwo.nyarkograham@bsu.edu (K.N.G.); kehinde.akinwolere@bsu.edu (K.A.)
* Correspondence: hesham.allam@bsu.edu
Abstract: The automated classification of texts into predefined categories has become
increasingly prominent, driven by the exponential growth of digital documents and the
demand for efficient organization. This paper serves as an in-depth survey of text clas-
sification and machine learning, consolidating diverse aspects of the field into a single,
comprehensive resource—a rarity in the current body of literature. Few studies have
achieved such breadth, and this work aims to provide a unified perspective, offering a
significant contribution to researchers and the academic community. The survey examines
the evolution of machine learning in text categorization (TC), highlighting its transforma-
tive advantages over manual classification, such as enhanced accuracy, reduced labor, and
adaptability across domains. It delves into various TC tasks and contrasts machine learning
methodologies with knowledge engineering approaches, demonstrating the strengths and
flexibility of data-driven techniques. Key applications of TC are explored, alongside an anal-
ysis of critical machine learning methods, including document representation techniques
and dimensionality reduction strategies. Moreover, this study evaluates a range of text
categorization models, identifies persistent challenges like class imbalance and overfitting,
and investigates emerging trends shaping the future of the field. It discusses essential
components such as document representation, classifier construction, and performance
evaluation, offering a well-rounded understanding of the current state of TC. Importantly,
this paper also provides clear research directions, emphasizing areas requiring further in-
novation, such as hybrid methodologies, explainable AI (XAI), and scalable approaches for
low-resource languages. By bridging gaps in existing knowledge and suggesting actionable
paths forward, this work positions itself as a vital resource for academics and industry
practitioners, fostering deeper exploration and development in text classification.
Keywords: text categorization (TC); machine learning; automation; document representation;
dimension reduction; classifier evaluation; emerging trends
1. Introduction
The history of text categorization (TC) is a narrative of continuous evolution, driven
by the growing need to efficiently manage and organize ever-increasing volumes of text
data. Initially a manual process rooted in text and corpus linguistics, TC involved categorizing texts into predefined topics or genres [1–3]. However, the digital revolution and
exponential growth of textual data rendered manual methods impractical, necessitating the
development of automated systems. Early approaches relied on handcrafted features and rule-based systems, which, while foundational, were limited by their rigidity and inability
to adapt to new data. Subsequent advancements introduced statistical techniques, nature-
inspired algorithms, and graph-based methods to enhance the flexibility and accuracy of
text categorization [4].
The introduction of machine learning (ML) marked a turning point, with algorithms
like k-nearest neighbors (KNN) and support vector machines (SVMs) offering improved
scalability and accuracy by learning directly from data. These methods utilize feature
selection techniques to address challenges such as high-dimensional feature spaces and
scalability. This shift represented a significant improvement in classification performance
and adaptability [1,5,6]. More recently, deep learning has revolutionized TC, enabling
the development of models capable of capturing intricate semantic relationships in text.
Techniques like recurrent neural networks (RNNs), convolutional neural networks (CNNs),
and transformer-based models (e.g., BERT) have dramatically improved performance.
Additionally, semantic methods such as ontology-based classification and latent semantic
indexing have enhanced contextual understanding of text data [7].
Despite its progress, TC remains a field at the crossroads of ML and information
retrieval (IR), sharing features with related areas like text mining and knowledge extraction.
This overlap has led to fragmented literature, inconsistent terminology, and a lack of
standardized frameworks [1–3]. Challenges include ambiguous definitions of terms like
“automatic text classification”, which variously refers to assigning predefined categories,
creating new categories, or clustering texts [8,9]. Furthermore, the field lacks comprehensive
resources such as dedicated textbooks or journals, hindering the consolidation of knowledge
and impeding newcomers [10].
However, these gaps present opportunities for advancement. By developing system-
atic methodologies, standardizing terminologies, and centralizing resources, researchers
can unify the field and enhance its applicability. The absence of structured guidance also
underscores the potential for innovative contributions, such as creating frameworks that
bridge theory and practice or addressing evolving challenges like multilingual classification,
noisy data handling, and explainability in models.
This research paper presents an extensive survey of text classification and machine
learning, offering a unified framework that consolidates best practices from ML, natural
language processing (NLP), and information retrieval (IR). It introduces a comprehensive
taxonomy of text classification techniques, encompassing traditional algorithms, modern
ML approaches, and emerging trends in deep learning. Furthermore, the paper provides
a detailed evaluation of methods using standardized datasets and metrics, making it a
foundational resource for researchers and a practical guide for industry professionals.
Building on these advancements, recent years have ushered in transformative trends
that further define the state of the art in text categorization. Post-2023, the field has
seen significant progress in the development of fine-tuned transformer architectures and
advanced pretraining techniques, enabling models to excel in few-shot and zero-shot
classification scenarios. These approaches address the persistent challenge of limited
labeled data by leveraging vast unlabeled corpora and contextual knowledge encoded
during pretraining.
Furthermore, the rise of domain-specific language models has reshaped applications
of text categorization in specialized industries such as healthcare, legal systems, and e-
commerce. These tailored models enhance performance by integrating domain-relevant
semantics and terminology, enabling more precise and context-aware classification. An-
other notable trend is the optimization of lightweight and efficient transformer models
designed for deployment in resource-constrained environments, such as mobile devices
and IoT platforms. These advancements are critical as the demand for on-device text
categorization continues to grow, particularly in applications like real-time content filtering
and personalized recommendations.
The field has also witnessed renewed emphasis on multilingual and cross-lingual
text classification techniques. Innovations in transfer learning and adaptive fine-tuning
have enabled models to process diverse languages within unified frameworks, making
significant strides toward global accessibility and inclusivity. In parallel, researchers
are addressing persistent challenges such as model interpretability, bias mitigation, and
handling noisy or imbalanced datasets. Ethical considerations have gained prominence,
with a focus on ensuring fairness, transparency, and accountability in deploying TC systems
for high-impact applications like automated moderation and misinformation detection.
These recent advancements and ongoing efforts underscore the dynamic nature of text
categorization, highlighting its expanding relevance across disciplines and industries. By
consolidating these trends and providing a robust evaluation of emerging techniques, this
paper aims to bridge the gap between foundational knowledge and cutting-edge research,
offering a resource that is both comprehensive and forward-looking.
The paper is divided into eleven sections to help readers navigate this comprehensive survey. Section 1 provides an overview of the historical and technological backdrop for text categorization (TC), as well as its problems and objectives. Section 2 examines the scope and role of TC, distinguishing it from comparable tasks and summarizing previous research. Section 3 describes the study approach, whereas Sections 4 and 5 cover significant applications and machine learning techniques in TC. Section 6 delves into foundational and advanced document representation strategies, followed by Section 7's exploration of evaluation measures. Section 8 addresses TC challenges, while Section 9 discusses recent breakthroughs in deep learning. Finally, Section 10 discusses future directions, and Section 11 provides a summary of the study's contributions.
2. Background
Section 2 conducts a thorough review of text categorization (TC), separating it from related topics and describing recent advances in the field. This section examines the fundamental principles of TC, highlights noteworthy studies from 2019 to 2024, and identifies key issues and developing trends. By breaking down the subject into logical subsections, it aims to provide a unified narrative that connects theoretical and practical aspects of TC.
Overview of Text Categorization (TC) and Recent Research
Text categorization (TC), also known as text classification, is a core task in text min-
ing and natural language processing (NLP). Text categorization plays an integral part in
managing and organizing unstructured data, employing machine learning techniques to
assign predefined categories to text documents. This process enables efficient information
retrieval and analysis, with applications in sentiment analysis, spam detection, and topic
classification. By streamlining activities such as content filtering and subject identification,
TC enhances productivity and supports decision making.
Historically, TC was performed manually, which was suitable for small datasets but
lacked scalability, consistency, and speed, especially with dynamic data like social media.
The advent of automated TC, driven by machine learning (ML) and NLP, revolutionized
the field by increasing speed, accuracy, and scalability while reducing human bias. Auto-
mated TC is now widely adopted in industries that process extensive and rapidly growing
data streams [10].
The domain of text categorization has experienced remarkable advancements, pro-
pelled by continuous innovations in machine learning (ML) and natural language process-
ing (NLP). Over the years, researchers have developed sophisticated techniques to enhance
classification accuracy, scalability, and adaptability across diverse datasets and applications.
These developments span traditional machine learning approaches, deep learning archi-
tectures, and hybrid models that blend the strengths of both paradigms. This evolution
has not only improved the precision of text categorization systems but also expanded their
relevance to areas such as sentiment analysis, spam detection, topic identification, and
domain-specific classification.
One of the most impactful trends in recent research is the integration of transfer learning techniques, which allow models to leverage knowledge from pretrained language representations. This approach has significantly boosted the effectiveness of text categorization, particularly in handling low-resource languages and niche domains. Additionally,
studies have explored the use of hybrid methodologies, combining rule-based systems with
advanced machine-learning practices to address challenges in multilingual and domain-
specific contexts. These approaches underscore the growing importance of adaptability
and context-awareness in modern text categorization systems.
Recent studies in 2024 have placed particular emphasis on leveraging pre-trained language models, domain-specific adaptations, and innovative clustering techniques. These advancements have proven especially effective in managing complex, multidimensional datasets, enabling more nuanced and accurate classifications. Table 1 highlights key contributions from studies published between 2019 and 2024, providing an overview of cutting-edge methodologies and findings.
Table 1. Key Contributions in Text Categorization Research: Recent Studies on Text Classification from 2019 to 2024.

• [11] Journal Article, 2022: "Research on Intelligent Natural Language Texts Classification".
Objectives: summarize and compare text classification methods; explore the development direction of text classification research.
Insights: the paper summarizes previous studies on text classification, highlighting the rapid development of machine learning technologies and the diversification of research methods, and compares classification methods based on technical routes, text vectorization, and classification information processing.
Practical Implications: intelligent classification enhances efficient use of natural language texts; provides references for further research in text classification methods.

• [12] Journal Article, 2022: "The Research Trends of Text Classification Studies (2000–2020): A Bibliometric Analysis".
Objectives: evaluate the state of the art of TC studies; identify publication trends and important contributors in TC research.
Insights: analyzes 3121 text classification publications from 2000 to 2020, highlighting trends, contributors, and disciplines; reveals increased interest in advanced classification algorithms, performance evaluation methods, and practical applications, indicating a growing interdisciplinary focus.
Practical Implications: recognizes recent trends in text classification research; highlights the importance of advanced algorithms and applications.

• [13] Journal Article, 2020: "A survey on text classification and its applications".
Objectives: provide an overview of existing text classification technologies; propose research directions for text mining challenges.
Insights: previous studies have proposed various feature selection methods and classification algorithms, addressing challenges such as scalability caused by the massive increase in text data, and highlight the importance of effective information organization and management in diverse research fields.
Practical Implications: important applications in real-world text classification; addresses challenges in text mining and scalability.

• [14] Journal Article, 2022: "A Survey on Text Classification: From Traditional to Deep Learning".
Objectives: review state-of-the-art approaches from 1961 to 2021; create a taxonomy of text classification methods.
Insights: reviews approaches from 1961 to 2021, highlighting traditional models and deep learning advancements; discusses technical developments and benchmark datasets and provides a comprehensive comparison of techniques and evaluation metrics used in previous studies.
Practical Implications: summarizes key implications for text classification research; identifies future research directions and challenges.

• [15] Book Chapter, 2023: "Case Studies of Several Popular Text Classification Methods".
Objectives: evaluate automatic language processing techniques for text classification; analyze and compare the performance of various text classification algorithms.
Insights: deep learning models, particularly distributed word representations like Word2Vec and GloVe, outperform traditional methods such as bag-of-words (BOW); contextual embeddings like BERT also show significant performance improvements.
Practical Implications: improved text classification methods for massive data analysis; enhanced performance using advanced feature extraction techniques.

• [16] Journal Article, 2023: "Text Classification Using Deep Learning Models: A Comparative Review".
Objectives: analyze deep learning models for text classification tasks; address gaps, limitations, and future research directions.
Insights: conducts a literature review of deep learning models for text classification, analyzing their gaps and limitations; highlights comparative results from previous studies and discusses classification applications, guiding future research directions.
Practical Implications: guidance for future research in text classification; highlights challenges and potential directions in the field.

• [17] Journal Article, 2020: "Survey on Text Classification".
Objectives: classify documents into predefined classes effectively; compare various text representation schemes and classifiers.
Insights: previous studies have utilized supervised learning with labeled training documents, naive Bayes, and decision tree algorithms; challenges include the difficulty of creating labeled datasets and the limited applicability of individual classifiers across domains.
Practical Implications: detailed information on text classification concepts and algorithms; evaluation of algorithms using common performance metrics.

• [18] Journal Article, 2024: "The Text Classification Method Based on BiLSTM and Multi-Scale CNN".
Objectives: provide an overview of deep learning in text classification; analyze research progress and technical approaches.
Insights: text classification has transitioned from traditional machine learning methods to deep learning models, including attention mechanisms and pretrained language models, with significant progress and challenges in enhancing model performance and dataset quality across domains.
Practical Implications: overview of deep learning text classification methods; analysis of labeled datasets for research support.

• [19] Journal Article, 2023: "Research on Text Classification Method Based on NLP".
Objectives: describe text classification concepts and processes; explore deep learning models for text classification.
Insights: previous studies have explored LSTM-based multitask learning architectures, capsule networks, and hybrid models like RCNNs, demonstrating advancements in feature extraction and improved performance in tasks such as sentiment analysis and spam recognition.
Practical Implications: text classification methods are important for effectively classifying text-based data; word embedding and pretraining models have made great progress in text classification.

• [20] Book Chapter, 2020: "A Comparative Study on Various Text Classification Methods".
Objectives: analyze methods for efficient text classification; examine featurization techniques and their performance.
Insights: rather than reviewing previous studies, the paper analyzes various text classification methods and featurization techniques, such as bag-of-words, Tf-Idf vectorization, and Word2Vec approaches.
Practical Implications: analyzes efficient text classification methods for decision making; discusses various featurization techniques for improved performance.

• [21] Journal Article, 2024: "Evaluating text classification: A benchmark study".
Objectives: investigate the necessity of complex models versus simple methods; assess performance across various classification tasks and datasets.
Insights: highlights a gap in the existing literature, noting that previous research primarily compares similar types of methods without a comprehensive benchmark; provides an extensive evaluation across tasks, datasets, and model architectures.
Practical Implications: simple methods can outperform complex models in certain tasks; a negative correlation exists between F1 performance and complexity for small datasets.

• [22] Proceedings Article, 2020: "Comparative Performance of Machine Learning Methods for Text Classification".
Objectives: compare the performance of machine learning and deep learning algorithms; explore scalability with larger data instances.
Insights: previous studies primarily tested machine learning and deep learning methods with relatively small datasets; this paper compares their performance and scalability using a larger dataset of 6000 instances across six classes.
Practical Implications: deep learning outperforms traditional methods in text classification; scalability for larger data instances is explored.

• [23] Journal Article, 2019: "A Survey on Text Classification using Machine Learning Algorithms".
Objectives: explore algorithms for automated text document classification; select the best features and classification algorithms for accuracy.
Insights: previous studies have explored feature selection techniques, like document frequency thresholding and information gain, and classification algorithms such as K-nearest neighbors and support vector machines, highlighting the importance of efficient keyword prioritization for accurate categorization.
Practical Implications: automated text classification improves efficiency in document handling; reduces reliance on expert classification for large text documents.

• [24] Dataset, 2023: "Text Classification Data from 15 Drug Class Review SLR Studies".
Objectives: automate citation classification in systematic reviews; reduce workload in systematic review preparation.
Insights: references a study by Cohen et al. (2006) on reducing workload in systematic review preparation through automated citation classification, which provides a foundation for the datasets used in the current text classification research on drug class reviews.
Practical Implications: automates citation classification in systematic reviews; reduces workload for researchers in drug class studies.

• [25] Proceedings Article, 2023: "An Exploration of the Effectiveness of Machine Learning Algorithms for Text Classification".
Objectives: explore the effectiveness of machine learning algorithms for text classification; compare the performance of algorithms such as SVM, KNN, CNN, and RNN.
Insights: rather than reviewing previous studies, the paper evaluates and compares the performance of machine learning algorithms, including decision trees, SVM, KNN, CNN, and RNN, for text classification tasks.
Practical Implications: machine learning improves text classification accuracy and efficiency; the algorithms can handle complex and large datasets effectively.

• [26] Proceedings Article, 2022: "A Comparative Text Classification Study with Deep Learning-Based Algorithms".
Objectives: compare deep learning algorithms for text classification; optimize hyperparameters and evaluate the effectiveness of word embeddings.
Insights: compares its results with previous studies in the literature, highlighting significant improvements in classification performance using deep learning algorithms and word embeddings; uses an open-source Turkish News benchmarking dataset for the comparative analysis.
Practical Implications: improved text classification performance using deep learning algorithms; effective hyperparameter tuning enhances classification accuracy.

• [27] Proceedings Article, 2021: "Classification Models of Text: A Comparative Study".
Objectives: provide an overview of the stages of the classification process; survey and compare popular classification algorithms.
Insights: focuses on the classification process, including preprocessing, feature engineering, dimension decomposition, model selection, and evaluation, while surveying and comparing popular classification algorithms.
Practical Implications: text classification has implications in education, politics, and finance; the paper provides a comparative study of popular classification algorithms.

• [28] Journal Article, 2020: "Trends and patterns of text classification techniques: a systematic mapping study".
Objectives: provide an overview of text classification research trends and gaps; analyze research patterns, problems, and problem-solving methods.
Insights: systematically reviews ninety-six studies on text classification from 2006 to 2017, identifying nine main problems and analyzing research patterns, data sources, language choices, and applied techniques.
Practical Implications: highlights trends and gaps in text classification research; identifies nine main problems in the text classification area.

• [4] Journal Article, 2022: "Research On Text Classification Based On Deep Neural Network".
Objectives: design text representation and classification models using deep networks; improve text feature representation and classification accuracy.
Insights: traditional text classification methods, such as the bag-of-words model and vector space model, face challenges like loss of context, high dimensionality, and sparsity, prompting a shift toward deep learning techniques for improved performance.
Practical Implications: deep learning models improve text classification performance compared to traditional methods; the proposed BRCNN and ACNN models show better text feature representation and classification accuracy.
3. Method
3.1. Search Strategy and Databases
To comprehensively cover the body of literature on text classification, we utilized
multiple academic databases, including PubMed, Web of Science, IEEE Xplore, Scopus,
Google Scholar, ACM Digital Library, ScienceDirect, JSTOR, ProQuest, SpringerLink, and
EBSCOhost. These databases were chosen for their extensive coverage of scientific and
scholarly publications in technology, computer science, and machine learning.
Our search strategy was systematic, combining relevant keywords and Boolean opera-
tors to ensure a comprehensive collection of articles. Keywords included “text classifica-
tion”, “machine learning in text analysis”, “document categorization”, “natural language
processing”, “text mining”, “feature selection for classification”, and “supervised learn-
ing for text”. The search string was refined iteratively, guided by recent reviews on text
classification and related fields (e.g., [29]). This rigorous approach ensured the retrieval of
relevant and high-quality research articles for our analysis.
3.2. Inclusion and Exclusion Criteria
To streamline the review process and ensure the quality and relevance of the selected
studies, we established explicit inclusion and exclusion criteria.
Inclusion Criteria:
1. Published peer-reviewed articles focusing on text classification using machine learning techniques.
2. Studies presenting experimental results on various classification algorithms.
3. Articles published in English.
4. Papers discussing challenges and advancements in text classification, including preprocessing, feature selection, and evaluation metrics.
5. Conference proceedings, book chapters, and review articles relevant to text classification.
Exclusion Criteria:
1. Articles only tangentially related to text classification or focusing on unrelated machine learning domains.
2. Papers lacking experimental results or substantive analysis.
3. Secondary sources not published in English.
4. Studies addressing text classification superficially without detailed methodological or algorithmic discussion.
3.3. Data Extraction and Analysis
Once the final selection of articles was made based on the inclusion and exclusion
criteria, data extraction focused on critical aspects of the studies. Extracted data included
authors, publication year, study design, classification techniques used, datasets employed,
preprocessing methods, feature selection approaches, performance metrics, primary find-
ings, and conclusions.
The data analysis followed a narrative synthesis approach to accommodate the diversity of studies [30]. Descriptive analysis highlighted bibliometric characteristics such as the number of studies, publication trends over time, countries of origin, and frequently used datasets.
Thematic analysis categorized findings into recurring themes, such as:
1. Preprocessing and feature engineering techniques in text classification.
2. Evaluation of machine learning algorithms for classification tasks.
3. Performance metrics and benchmarks in text classification.
4. Challenges in real-world applications, such as scalability and bias.
5. Emerging trends and innovations, including deep learning and transformer-based models.
This systematic approach ensured that the study rigorously examined the state of
research on text classification, offering insights into current advancements, limitations, and
future directions.
4. Approaches to Text Categorization
TC employs a variety of machine learning practices, broadly categorized into super-
vised, unsupervised, and deep learning methods:
Supervised Learning
• In supervised learning, models are trained using labeled datasets to classify new documents into predefined categories. Popular algorithms for this purpose include logistic regression, naive Bayes, random forest, support vector machines (SVMs), and AdaBoost. For example, naive Bayes has demonstrated impressive accuracy, reaching up to 96.86% in certain applications [23].
• Deep learning techniques, such as convolutional neural networks (CNNs) and long short-term memory (LSTM) networks, achieve high accuracy while requiring minimal feature engineering. For instance, LSTMs have demonstrated accuracy rates of up to 92% in specific tasks [31].
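As a minimal illustration of the supervised setting, the sketch below trains a multinomial naive Bayes classifier in pure Python. The corpus, labels, and function names are invented toy examples (not from any study cited here); the spam/ham task mirrors the spam-detection application mentioned above.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Fit a multinomial naive Bayes model: class priors and
    Laplace-smoothed word likelihoods."""
    class_words = defaultdict(list)
    for doc, label in zip(docs, labels):
        class_words[label].extend(doc.lower().split())
    vocab = {w for words in class_words.values() for w in words}
    priors = {c: labels.count(c) / len(labels) for c in class_words}
    likelihoods = {}
    for c, words in class_words.items():
        counts = Counter(words)
        total = len(words) + len(vocab)  # Laplace smoothing denominator
        likelihoods[c] = {w: (counts[w] + 1) / total for w in vocab}
    return priors, likelihoods, vocab

def predict_nb(model, doc):
    """Assign the class with the highest log-posterior score."""
    priors, likelihoods, vocab = model
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for w in doc.lower().split():
            if w in vocab:  # ignore words unseen during training
                score += math.log(likelihoods[c][w])
        scores[c] = score
    return max(scores, key=scores.get)

# Toy corpus for a spam-detection task (invented data).
docs = ["win a free prize now", "meeting agenda for monday",
        "free cash prize claim now", "project report due monday"]
labels = ["spam", "ham", "spam", "ham"]
model = train_nb(docs, labels)
print(predict_nb(model, "claim your free prize"))  # → spam
```

Real systems would pair such a classifier with TF-IDF weighting and proper train/test splits, but the core probabilistic mechanism is the same.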
Unsupervised Learning
• Unsupervised learning techniques, including hierarchical clustering, k-means clustering, and probabilistic clustering, are used to group documents based on content similarity in cases where labeled data are not available [32].
• These techniques uncover inherent data structures and are instrumental in analyzing unlabeled datasets [4].
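To make the unsupervised case concrete, here is a small spherical k-means sketch in pure Python: documents are represented as term-frequency vectors and grouped by cosine similarity. The four documents, the seed choice, and all function names are invented toy assumptions, not material from the surveyed studies.

```python
import math
from collections import Counter

def tf_vector(doc):
    """Bag-of-words term-frequency vector as a Counter."""
    return Counter(doc.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def kmeans_docs(docs, seeds, iters=10):
    """Spherical k-means: centroids are summed term vectors; each
    document joins the most cosine-similar centroid. `seeds` gives
    the indices of the documents used as initial centroids."""
    vecs = [tf_vector(d) for d in docs]
    centroids = [Counter(vecs[i]) for i in seeds]
    assign = [0] * len(docs)
    for _ in range(iters):
        assign = [max(range(len(centroids)),
                      key=lambda k: cosine(v, centroids[k]))
                  for v in vecs]
        for k in range(len(centroids)):
            merged = Counter()
            for v, a in zip(vecs, assign):
                if a == k:
                    merged.update(v)
            if merged:  # keep the old centroid if a cluster empties
                centroids[k] = merged
    return assign

# Toy corpus: two finance documents and two sports documents.
docs = ["stock market prices fall", "market shares and stock trading",
        "team wins the football match", "football season match results"]
print(kmeans_docs(docs, seeds=[0, 2]))  # → [0, 0, 1, 1]
```

With no labels supplied, the shared vocabulary (stock/market vs. football/match) alone is enough to separate the two topics.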
Advancements in TC leverage feature extraction and dimensionality reduction techniques like PCA and LDA, enhancing model performance. Deep learning models further refine TC by capturing linguistic subtleties such as tone and context, making them invaluable for tasks like sentiment analysis. These developments position TC as a vital tool across industries for deriving insights from text data and managing digital information environments [31].
Despite its benefits, TC faces challenges such as handling ambiguous or overlapping
categories and requiring large labeled datasets for supervised learning. Algorithm selection
also influences outcomes, with models like naive Bayes and SVM performing differently
across datasets and applications [33].
4.1. The Rise of Machine Learning in TC
Machine learning (ML) has significantly advanced TC by transitioning from rule-based
systems to adaptive algorithms that learn from tagged input. Early ML models like naive
Bayes and SVM laid the groundwork for TC.
• Naive Bayes: Effective for large vocabularies due to its probabilistic approach.
• SVM: Achieves precision by mapping text into high-dimensional spaces, helping identify closely related themes.
The emergence of deep learning further transformed TC with models like CNNs and
RNNs, capable of capturing local word dependencies and long-term correlations. Trans-
former models, such as BERT, have set new benchmarks by understanding bidirectional,
long-distance interactions in text. These innovations enable tasks like sarcasm and emotion
detection with minimal fine-tuning. Additionally, metrics like burstiness and perplexity
enhance TC by identifying significant phrases and quantifying prediction uncertainty.
Applications of ML-driven TC span customer service, healthcare, finance, and more,
enabling rapid and precise classification. This supports innovations in content moderation,
personalized recommendations, and trend analysis [1].
4.2. Benefits of Automated TC over Manual Classification
Automated TC offers numerous advantages over manual processes:
• Scalability and Efficiency: Handles large datasets rapidly and consistently, unlike manual methods that are time-intensive and impractical for extensive collections.
• Objectivity: Applies standard criteria uniformly, eliminating human bias and ensuring reliable outcomes, crucial for domains like legal document classification.
• Real-Time Processing: Facilitates immediate classification, essential in industries like finance and journalism where timely decisions are critical [34].
Advanced ML approaches like burstiness and perplexity improve TC by addressing
dynamic settings. Burstiness measures fluctuation in word occurrence, allowing for im-
proved detection of significant terms, whereas perplexity evaluates uncertainty in text
predictions, enhancing adaptation to changing datasets. These metrics improve model
performance in complicated, dynamic situations [35].
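The text does not give formal definitions of burstiness or perplexity, so the sketch below uses common formulations as one plausible reading: the variance-to-mean ratio of a term's per-document counts for burstiness, and the exponentiated average negative log probability for perplexity. Both the formulas and the numbers are illustrative assumptions, not the authors' exact definitions.

```python
import math

def burstiness(counts):
    """Variance-to-mean ratio of a term's per-document counts.
    Values above 1 suggest the term clusters ("bursts") in a few documents."""
    mean = sum(counts) / len(counts)
    var = sum((c - mean) ** 2 for c in counts) / len(counts)
    return var / mean if mean else 0.0

def perplexity(probs):
    """Exponentiated average negative log probability a model assigned
    to each observed word; lower values mean less uncertainty."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

# A term appearing 9 times in one document and rarely elsewhere is bursty.
print(burstiness([9, 0, 1, 0, 0]))           # ratio well above 1
# A model assigning each word probability 0.25 has perplexity 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
```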
4.3. Types of TC Tasks
Text categorization assigns predefined categories to free-text documents, organizing them conceptually for efficient retrieval and management [5]. Applications include email filtering, topic labeling, and content organization for digital libraries.
TC tasks vary depending on the nature of the classification problem. Common types
include:
• Binary Classification: This involves two classes, such as spam and non-spam emails, where each document belongs to one of the two categories [5].
• Multiclass Classification: More than two classes are involved, and each document is assigned to only one class, such as classifying news articles into topics like politics, sports, or entertainment [5].
• Single-Label Classification: Often approached using binary classification methods, where documents are classified into distinct categories without overlap [36].
• Multilabel Classification: In this case, each document may belong to multiple categories simultaneously. For example, an academic paper may be categorized under multiple disciplines like biology and technology [5].
• Hierarchical Classification: Documents are classified into categories organized in a hierarchical format. This type is beneficial for large datasets with numerous categories [5].
4.4. Document-Pivoted vs. Category-Pivoted TC
Document-pivoted and category-pivoted text categorization represent two method-
ologies for organizing the classification process.
(a) Document-Pivoted Categorization (DPC): This approach focuses on classifying a document by searching across all possible categories. It is generally simpler to implement and more efficient for practical applications [37].
(b) Category-Pivoted Categorization (CPC): In contrast, CPC classifies documents by first identifying the relevant category. This method is more complex, as it requires re-evaluating document classifications when new categories are added.
4.5. Hard Categorization vs. Ranking
Hard categorization and ranking are two distinct approaches to classifying documents.
(a) Hard Categorization: This method assigns each document to a single category, resulting in a binary decision about whether the text belongs to the category or not.
(b) Ranking: In contrast, ranking categorization generates a list of categories ranked by their relevance to the document. This approach provides a more nuanced view of a document’s classification, allowing further decision-making based on category probability. Figures 1 and 2 represent hard categorization and ranking categorization.
Figure 1. Comparison of hard categorization (binary decision) and ranking categorization (multiple categories ranked) [38].
Figure 2. Ranking categorization [38]. (a) represents hard categorization while (b) represents ranking categorization.
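The distinction can be sketched in a few lines, assuming hypothetical per-category scores produced by some classifier (the categories and numbers below are invented for illustration):

```python
# Hypothetical per-category scores a classifier might output for one document.
scores = {"Politics": 0.55, "Economy": 0.30, "Sports": 0.10, "Lifestyle": 0.05}

# Hard categorization: a single binary decision, here the top-scoring category.
hard_label = max(scores, key=scores.get)
print(hard_label)  # Politics

# Ranking categorization: every category ordered by relevance, leaving the
# final decision to a downstream process or human reviewer.
ranking = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranking)  # Politics first, Lifestyle last
```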
4.6. Machine Learning vs. Knowledge Engineering in TC
Both machine learning and knowledge engineering play significant roles in the devel-
opment of text categorization systems.
4.6.1. Machine Learning
Machine learning algorithms automatically learn from data without explicit programming, enabling systems to adapt and make predictions or decisions according to patterns of interest in the data [39]. These algorithms span supervised, unsupervised, and reinforcement learning methods that improve categorization accuracy by finding and analyzing complex patterns, trends, and relationships in large datasets. This is important and beneficial when working with large, complex data to achieve a more granular and efficient categorization process.
Machine learning methods have been extensively employed in the field of text categorization for predictive modeling. Essentially, historical or prelabeled data are used to train algorithms which are later applied to new, unseen data to categorize them effectively [40].
These techniques allow a TC system to extend beyond manual or rule-based approaches to text categorization, remaining scalable and adaptable to large volumes of text [39]. In addition, iterative machine learning makes systems more adaptable by allowing models to learn from feedback loops and updated datasets. Each iteration improves the model’s performance, allowing it to adapt to new patterns, eliminate errors, and deal with dynamic and changing data more efficiently. This process enables continual improvement and relevance across a wide range of applications.
4.6.2. Knowledge Engineering
Knowledge engineering focuses on emulating human expert decision-making processes in specific domains [40]. It captures and represents expert knowledge, such as rules and reasoning, to develop systems that can analyze problems and provide accurate solutions. These expert systems are widely used in specialized domains, where they utilize rules and data to facilitate complex problem solving. In the context of TC, knowledge engineering systems integrate human expertise with machine learning outputs to enhance decision-making capabilities and ensure accuracy in categorizations [41].
5. Applications of Text Categorization
This section delves into the numerous uses of text categorization (TC) across disci-
plines, emphasizing its importance in improving information management, operational
efficiency, and customer engagement. It highlights the practical impact of TC techniques
by focusing on specific use cases such as document indexing, content customization, and
hierarchical web content classification. The detailed subsections explain how TC helps
organizations explore and extract value from vast amounts of text data. Text categorization
is integral to many sectors, bringing with it a host of benefits that include better information
management, enhanced customer engagement, and operational efficiency [42].
5.1. Document Indexing for Information Retrieval Systems
Document indexing involves tagging documents with keywords or key phrases to facilitate retrieval in Boolean IR systems. To avoid inconsistencies in the tags assigned to documents, controlled dictionaries or thematic thesauri, such as the MeSH thesaurus for medicine, are used [43]. Though manual indexing has been considerably replaced by automated indexing, it still helps manage large databases efficiently in research and library systems.
Role of Controlled Vocabulary and Thesauri
Controlled vocabulary helps standardize the terminology in certain fields and thus
supports the consistent categorization of documents. The thesauri give a hierarchical and
relational context to the terms, thereby making the search and retrieval processes in systems
using TC more effective, particularly in large-scale document databases [44].
5.2. Automated Document Organization and Archiving
For large document bases, TC automates the filing system for corporate records, patent
filings, and other institutional archives. The tools can classify patents or group news stories
by theme, which can cut down on manual classification workload [45].
Use in Corporate and News Media
Corporations use TC to filter incoming information, such as routing relevant documents to specific departments. News agencies use TC to precategorize articles before publication, for example, placing content under “Politics” or “Lifestyle” [46]. In high-volume environments, this is particularly important for facilitating streamlined operations and maintaining organized archives.
5.3. Text Filtering and Content Personalization
Content personalization in TC assists in tailoring content to user preferences through
the classification of information that is stored according to user profiles. Applications such
as personalized news feeds, customized email filtering, and targeted advertisements are
performed with systems trained on filtering or promoting content based on precise thematic
categories [
47
]. Content personalization in text categorization (TC) delivers user-specific
content by classifying information based on individual profiles. Systems trained in thematic
categories enable applications like personalized news feeds, email filtering, and targeted
ads. This ensures users receive relevant and tailored experiences.
Newsfeeds, Email Filtering, and Spam Detection
A good example of text categorization is spam detection, which classifies email content into spam or non-spam categories based on keywords and sender patterns. Filters analyze text content to block unsolicited emails, relieving users of irrelevant content [48].
5.4. Word Sense Disambiguation (WSD)
WSD disambiguates polysemous words to recognize the true sense of a term in context. It is a fundamental task in natural language processing, supporting applications such as search engines and machine translation. Categorizing word senses with WSD enables more accurate keyword searching and indexing [49].
5.5. Hierarchical Categorization of Web Content
Content categorization taxonomy organizes online information into a hierarchical
structure, similar to those used in digital libraries or internet directories. Text categorization
(TC) techniques classify websites into nested levels, such as “Technology” > “Artificial
Intelligence” > “Machine Learning”, enabling users to navigate vast online repositories
with ease and efficiency [50].
6. Machine Learning Techniques in Text Categorization (TC)
Machine learning (ML) has proven essential to the creation of and progress in automated text categorization (TC) systems, which categorize text documents based on their content. ML techniques improve TC by automatically learning from large datasets and refining categorization models, resulting in increased efficiency, accuracy, and flexibility for varied data sources [51].
This section provides a comprehensive overview of machine learning techniques
for text classification, focusing on several key areas. It focuses on supervised learning
approaches because they are widely used and proven effective in text categorization
problems. While unsupervised approaches have advantages, they are outside the focus of
this paper because they are often used for exploratory or clustering tasks rather than preset
categorization. The section also delves into classifier construction, offering insights into
various types of algorithms and their design for effective text categorization. Additionally,
it highlights the importance of feature selection and engineering, discussing techniques to
identify and refine features that enhance model performance. Lastly, it examines advanced
approaches to text categorization, showcasing cutting-edge machine learning methods that
improve accuracy and adaptability in classification tasks.
6.1. Supervised Learning Techniques
Supervised learning is the most common strategy in TC, in which labeled data
are used to “teach” models how to effectively categorize texts. This method divides
datasets into three subsets: training, test, and validation sets. The training set is uti-
lized to build the model; the validation set fine-tunes hyperparameters and evaluates
model performance during development; and the test set is reserved for final evaluation to
ensure generalizability [23].
Using different datasets reduces the risk of overfitting, which occurs when a model performs well on training data but poorly on unseen data [52]. Furthermore, effective partitioning ensures balanced and representative data, which is critical for applications such as sentiment analysis and document categorization, where certain terms and contexts must be learned consistently [43].
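The three-way split described above can be sketched as follows. The 70/15/15 proportions are a common convention assumed here for illustration, not prescribed by the cited sources.

```python
import random

def split_dataset(data, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle once, then carve out training, validation, and test subsets."""
    data = list(data)
    random.Random(seed).shuffle(data)  # fixed seed for reproducibility
    n = len(data)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]

# Hypothetical corpus of 100 documents.
docs = [f"document {i}" for i in range(100)]
train, val, test = split_dataset(docs)
print(len(train), len(val), len(test))  # 70 15 15
```

Shuffling before splitting keeps each subset representative of the whole corpus, which is the balance-and-representativeness point made above.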
6.2. Classifier Construction and Types of Algorithms
The choice of algorithm greatly influences the building of classifiers for TC, as it dictates the model’s learning strategy, interpretability, and processing needs [53]. Rule-based systems, decision trees, naive Bayes, and neural networks are among the most used TC algorithms, each with strengths and shortcomings when dealing with text input.
(a) Rule-based Systems: These classifiers use handcrafted rules, which are highly interpretable but less flexible for complicated or huge datasets [43]. Rule-based systems nevertheless remain useful in situations where plain, transparent decision making is required.
(b) Decision Trees: Decision trees divide data based on certain criteria, making them intuitive and interpretable but susceptible to overfitting. They are effective for small to medium-sized text corpora but may struggle with scalability and feature depth [51].
(c) Naive Bayes: Naive Bayes is frequently used in TC due to its simplicity, efficiency, and resilience, especially in document categorization and spam filtering [54]. However, while the assumption of feature independence simplifies calculation, it can reduce efficiency when features are highly correlated [53]. Figure 3 shows how naive Bayes works.
Nodes:
• The topmost node “C” represents the class label of the text document.
• The nodes labeled F1, F2, . . ., Fn represent features (words, phrases, or attributes) extracted from the text.
Arrows:
• The black arrows from C → {F1, F2, . . ., Fn} indicate that the classification decision directly influences the features. This reflects the naive Bayes assumption that features are conditionally independent given the class.
• The blue arrows between features represent feature dependencies or correlations (e.g., word co-occurrence relationships). These indicate that some features depend on each other, making the model more complex than naive Bayes.
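The conditional-independence assumption illustrated in Figure 3 lets the posterior factorize as P(C) · ∏ P(Fi | C). A sketch with made-up priors and word likelihoods (the probabilities below are invented, not learned from any dataset in the paper):

```python
import math

# Hypothetical learned parameters: class priors and per-class word likelihoods.
prior = {"spam": 0.4, "ham": 0.6}
likelihood = {
    "spam": {"free": 0.30, "meeting": 0.05, "prize": 0.25},
    "ham":  {"free": 0.05, "meeting": 0.30, "prize": 0.02},
}

def posterior_scores(features):
    """Naive Bayes: score(c) = log P(c) + sum_i log P(f_i | c).
    The sum over features is exactly the independence assumption."""
    return {
        c: math.log(prior[c]) + sum(math.log(likelihood[c][f]) for f in features)
        for c in prior
    }

scores = posterior_scores(["free", "prize"])
print(max(scores, key=scores.get))  # spam
```

Working in log space avoids floating-point underflow when many feature likelihoods are multiplied.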
Figure 3. Naive-Bayes-Based Classification [55].
(d) Neural Networks: “Neural networks, particularly deep learning models, have transformed TC by allowing them to learn sophisticated, hierarchical text representations. Although neural networks often need big datasets and significant computer resources, they provide unrivaled accuracy in capturing semantic meaning and contextual nuances” [56].
Each of these techniques is used depending on the use case, dataset features, and
resource availability, emphasizing the importance of personalized ML approaches in TC.
6.3. Feature Selection and Engineering
Feature selection and engineering are critical in TC because they determine the data qualities the model learns from, which influences its overall performance [57]. Text classification features are often words, sentences, or semantic representations; therefore, their selection is critical for increasing model accuracy and interpretability [52]. By focusing solely on important features, effective feature selection eliminates unnecessary data and computational expense. Term frequency–inverse document frequency (TF-IDF) and word embeddings are popular techniques for capturing the textual structure, context, and importance of words in texts. Feature engineering approaches like stemming, lemmatization, and n-gram analysis improve feature representation, further enhancing classifier performance [43]. High-quality feature selection also frequently results in improved generalization across domains and datasets, which is crucial for applications that require models to work in various languages or specialized disciplines [51].
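As one concrete example, TF-IDF can be computed from scratch as below. The toy corpus and the exact normalization (term frequency divided by document length, idf as log(N/df)) are illustrative choices; real systems typically rely on library implementations with more elaborate smoothing.

```python
from collections import Counter
import math

# Toy corpus (invented). TF-IDF weighs a term by how frequent it is in a
# document and how rare it is across the corpus.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "quantum entanglement experiment",
]

df = Counter()  # document frequency: how many docs contain each term
for d in docs:
    df.update(set(d.split()))

def tfidf(doc):
    words = doc.split()
    tf = Counter(words)
    n = len(docs)
    # tf normalized by document length, idf as log(N / df).
    return {w: (tf[w] / len(words)) * math.log(n / df[w]) for w in tf}

weights = tfidf(docs[0])
# "the" appears in 2 of 3 documents while "cat" appears only here,
# so "cat" receives the higher weight despite "the" being more frequent.
print(weights["cat"] > weights["the"])  # True
```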
6.4. Advanced Machine Learning Approaches to Text Categorization
The evolution of ML has seen a blend of traditional and advanced models enhancing
TC accuracy and efficiency.
6.4.1. Traditional ML Techniques
• Naive Bayes and Logistic Regression: Offer simplicity and effectiveness in text classification, with naive Bayes achieving up to 96.86% accuracy in specific datasets [23].
• Support Vector Machines (SVMs): Efficiently handle high-dimensional data and demonstrate strong performance with word embeddings.
• Random Forest (RF): Achieves a mean accuracy of 99.98% when combined with Word2Vec embeddings [58].
• K-Nearest Neighbors (KNN) and Decision Trees: Useful for smaller datasets but less effective compared to SVM and RF [59].
6.4.2. Deep Learning Approaches
• Convolutional Neural Networks (CNNs): Capture spatial patterns in text, ideal for classification tasks.
• Recurrent Neural Networks (RNNs): RNNs, such as long short-term memory (LSTM) and gated recurrent unit (GRU) architectures, are especially useful for modeling sequential dependencies in text. They excel at tasks that need contextual comprehension, such as sentiment analysis and time-series forecasting.
• Transformer-based Models: Transformer-based models, like BERT, have transformed text classification by exploiting self-attention mechanisms to detect global dependencies in text. Their ability to construct contextual embeddings has established new standards for several natural language processing tasks, achieving over 97% accuracy in some applications [58].
Figure 4 illustrates a convolutional neural network (CNN) architecture designed for classifying handwritten digits, such as those found in the MNIST dataset. For grayscale images, the CNN design starts with a multidimensional array of 28 × 28 × 1 pixels. The first convolutional layer generates feature maps of 24 × 24 × n1, where n1 is the number of filters used. Subsequent layers lower spatial dimensions while increasing depth, depending on the number of filters. This approach collects hierarchical features that are useful for text categorization.
Figure 4. CNN Sequence to Classify Digits [60].
6.4.3. Hybrid and Ensemble Methods
• Model Combinations: Traditional classifiers paired with similarity measures like cosine similarity enhance performance [24].
• Ensemble Learning: Combines diverse models to boost robustness and accuracy in TC tasks [61].
7. Document Representation Techniques
Document representation techniques translate text documents into structured formats that machine learning models can understand while retaining as much semantic information as feasible. Effective representation strategies support correct text classification, enhancing models’ ability to recognize and analyze key patterns in textual data [62].
This section explores document processing techniques, providing a detailed examina-
tion of foundational and advanced methods. Topics include the vector space model (VSM)
and its applications, the evolution from bag-of-words to more sophisticated approaches,
and techniques in lexical semantics and text tokenization for understanding textual content.
It also highlights word stemming and stop word removal as essential preprocessing steps,
along with a discussion on weighting schemes such as term frequency–inverse document
frequency (TF-IDF) and other innovative weighting strategies to enhance text analysis
and classification.
7.1. Vector Space Model (VSM)
The vector space model (VSM) serves as a fundamental tool for document representation in text categorization. It represents documents as vectors within a multidimensional space, where each dimension corresponds to a distinct term from the corpus [63]. VSM enables the calculation of document similarity using metrics such as cosine similarity, which is useful in applications like clustering, search, and categorization. Figure 5 illustrates how the vector space model operates.
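A minimal sketch of VSM similarity, using raw term-count vectors and cosine similarity on invented sentences:

```python
import math
from collections import Counter

# In the VSM, each dimension is a vocabulary term; documents become count vectors.
def to_vector(doc, vocab):
    counts = Counter(doc.split())
    return [counts[t] for t in vocab]

def cosine(a, b):
    """cos(theta) = (a . b) / (|a| |b|); 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

d1 = "machine learning improves text classification"
d2 = "text classification with machine learning"
d3 = "the recipe needs flour and sugar"
vocab = sorted(set((d1 + " " + d2 + " " + d3).split()))

v1, v2, v3 = (to_vector(d, vocab) for d in (d1, d2, d3))
print(cosine(v1, v2) > cosine(v1, v3))  # True: d1 is far closer to d2
```

Because cosine similarity depends only on vector direction, it is insensitive to document length, which is one reason it is the standard VSM similarity measure.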
Figure 5. How the Vector Space Model Works [64].
7.2. Bag-of-Words and Beyond
The bag-of-words (BoW) approach in VSM is a simple but effective strategy that represents each text as a collection of individual terms, disregarding word order but capturing the frequency of each term. BoW is computationally efficient and successful for many classification applications, but it has drawbacks, such as neglecting word order and semantic nuances [43]. To address these limitations, BoW extensions such as n-grams and distributed representations have arisen, which better capture word context and relationships by taking term sequences into account or employing embeddings [65]. These methods increase the semantic depth of document representation, making them appropriate for more complicated text analysis tasks. Figure 6 shows how the bag-of-words model works.
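The contrast between plain BoW and an n-gram extension can be sketched as follows, using an invented example sentence:

```python
from collections import Counter

def bag_of_words(text):
    """Unordered term counts: word order is discarded entirely."""
    return Counter(text.split())

def ngrams(text, n=2):
    """Adjacent word tuples (bigrams by default) recover some local order."""
    words = text.split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

text = "the movie was not good"
print(bag_of_words(text))
# BoW cannot distinguish "not good" from "good, not ..."; bigrams can,
# since the pair ('not', 'good') survives as a single feature:
print(ngrams(text))
```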
Figure 6. How Bag-of-Words Model Works [66].
7.3. Lexical Semantics and Text Tokenization
Lexical semantics, when paired with tokenization, divides text into meaningful units
while preserving the document’s fundamental information. Tokenization breaks down text
into smaller components, typically words or phrases, allowing algorithms to handle text as
discrete tokens rather than continuous strings [62].
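A minimal tokenizer can be sketched with a regular expression; the pattern below is an illustrative assumption (production systems handle punctuation, hyphenation, and multiword units far more carefully):

```python
import re

def tokenize(text):
    """Split raw text into word tokens, so algorithms handle discrete
    units rather than one continuous string of characters."""
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = tokenize("Tokenization breaks down text into smaller components.")
# tokens[0] == "tokenization"
```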
7.4. Word Stemming and Stop Word Removal
Many TC applications rely heavily on stemming and stop word removal to improve
document representation. Stemming reduces words to their base forms, grouping variations
of the same term to prevent repetition in representations. For example, “running”, “ran”,
and “runner” are all derived from “run”. This simplification enables models to concentrate
on key meanings, increasing efficiency and relevance in text analysis [67].
Stop word deletion entails removing common terms like “the”, “is”, and “and” which
often add little to document classification. By omitting these keywords, models reduce
computational complexity while increasing accuracy by focusing on more informative
words. These preprocessing techniques are especially beneficial in fields where separating
important phrases from popular ones is critical to accurate categorization [43].
Preprocessing techniques like stemming, lemmatization, and stop word removal are
essential for accurate text categorization because they remove noise and simplify textual
data. These strategies ensure that models concentrate on relevant patterns by standardizing
word forms (e.g., “run”, “running”, and “ran” into “run”) and removing non-informative
terms (e.g., “the”, “is”). This simplified form improves both computing efficiency and
feature relevance for classification.
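The two steps can be sketched together in Python. The suffix rules and the stop list below are crude illustrative assumptions, not a real Porter or Snowball stemmer; note that naive suffix stripping cannot handle irregular forms such as "ran":

```python
STOP_WORDS = {"the", "is", "and", "a", "an", "of", "to"}  # assumed minimal list

def crude_stem(word):
    """Naive suffix stripping (illustration only; real systems use
    Porter/Snowball stemmers or lemmatizers)."""
    for suffix in ("ning", "ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop stop words, then stem the remaining tokens."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return [crude_stem(t) for t in tokens]
```

With these rules, preprocess("the running dogs") yields ["run", "dog"]: the stop word is removed and both content words are reduced toward their base forms.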
7.5. Weighting Schemes and Alternatives
Weighting schemes lend importance to terms in a document, which helps models
discover the most relevant aspects for categorization. While TF-IDF is the most extensively
used approach, other schemes, such as entropy weighting and BM25, have advantages
in some settings, such as controlling term importance across many datasets. Accurate
weighting distinguishes phrases that carry significant information from those that do not,
which improves classification results [68].
Term Frequency–Inverse Document Frequency (TF-IDF)
One of the most widely used weighting methods in text categorization (TC) is term
frequency–inverse document frequency (TF-IDF). This metric evaluates a word’s signifi-
cance within a text by considering its frequency within the document and its distribution
across the entire corpus. The term frequency (TF) component highlights terms that occur
frequently in a single document, while the inverse document frequency (IDF) compo-
nent downscales the weight of terms that are common across multiple documents. This
approach ensures a more balanced representation of term importance [69]. TF-IDF has proven effective in various applications, such as document retrieval and categorization, by prioritizing unique and contextually significant terms [62].
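The computation can be sketched as follows. This uses raw term frequency normalized by document length and an unsmoothed IDF, one of several common TF-IDF variants (the choice of variant is an assumption, not prescribed by the text):

```python
import math

def tf_idf(corpus):
    """Weight each term by TF (frequency within its document) times IDF
    (log of total documents over documents containing the term)."""
    n = len(corpus)
    docs = [doc.lower().split() for doc in corpus]
    df = {}
    for tokens in docs:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    weights = []
    for tokens in docs:
        tf = {t: tokens.count(t) / len(tokens) for t in set(tokens)}
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

corpus = ["the cat sat", "the dog barked", "the cat purred"]
w = tf_idf(corpus)
# "the" occurs in every document, so its IDF (and thus its weight) is zero,
# while rarer, more discriminative terms like "cat" receive positive weight.
```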
In addition to TF-IDF, various weighting techniques such as entropy weighting and BM25 have been investigated to better capture word significance across different contexts [70]. Entropy weighting, for example, assesses each term's informational contribution across categories, minimizing the impact of highly predictable terms [71]. The BM25 technique, an extension of TF-IDF, provides an improved strategy for document retrieval tasks by integrating parameters that account for document length and frequency saturation, hence improving performance in large text corpora [72].
These weighting techniques address a wide range of text processing needs while also
improving document representation flexibility, making them useful for TC applications
that must deal with heterogeneous datasets and complex language patterns.
8. Dimension Reduction in Text Categorization
Dimensionality reduction (DR) is a vital process in text categorization, aimed at
addressing the challenge of high-dimensional feature spaces that often characterize text
datasets. These datasets can consist of thousands or even millions of unique words,
making the feature space complex and computationally intensive. DR techniques help by eliminating noisy or irrelevant terms, thereby enhancing training efficiency and model interpretability without compromising critical information [73]. Additionally, DR mitigates overfitting, a common issue where models are excessively tailored to the training data, hindering their ability to generalize to new, unseen data [1].
Methods such as principal component analysis (PCA) and latent semantic analysis
(LSA) are commonly employed to lower dimensionality while retaining the structural and
relational integrity of the text data. This streamlined representation facilitates faster pro-
cessing and more accurate predictions, making dimensionality reduction an indispensable
component of the machine learning pipeline.
8.1. Importance of Dimensionality Reduction
Reducing dimensionality offers several key advantages:
1. Improved Efficiency: Streamlines computational demands, particularly during training and testing phases.
2. Enhanced Interpretability: Simplifies understanding by focusing on the most significant features.
3. Reduced Overfitting: Ensures the model learns generalizable patterns rather than noise specific to the training dataset.
These benefits collectively enable the creation of robust and reliable machine learning
models, empowering practitioners to derive meaningful insights from complex datasets.
Practices such as PCA and t-distributed stochastic neighbor embedding (t-SNE) have
proven effective in maintaining essential information while reducing dimensions, thereby
improving model performance and the extraction of insights.
Dimensionality reduction techniques focus on identifying discriminative features,
which are then weighted and fed into classifiers to construct models. During the testing
phase, test documents are preprocessed and represented using the same methods applied
during training. This ensures consistency in data handling, leading to more reliable
predictions and deeper insights into underlying patterns within the dataset.
8.2. Dimensionality Reduction in Support Vector Machines (SVMs)
Support vector machines (SVMs) benefit significantly from dimensionality reduction,
particularly when handling high-dimensional data. The optimization process in SVMs relies
on the dual formulation of soft margin SVMs, which transforms the primal optimization
problem into a dual problem. This approach leverages kernel functions to efficiently handle
non-linear classifications.
Kernel Functions in SVM
1. Linear Kernel
The linear kernel is represented as:
K(x, x_i) = x · x_i
This calculates the dot product of two feature vectors, x and x_i. This straightforward kernel works well for linearly separable data, where a linear decision boundary can effectively separate the classes.
2. Polynomial Kernel
The polynomial kernel is expressed as:
K(x, x_i) = [(x · x_i) + β]^d
where d is the degree of the polynomial and β is a constant. This kernel enables the SVM to capture more complex relationships between data points by considering polynomial interactions of features. The degree d determines the level of complexity in the model, with higher degrees capturing more intricate patterns.
3. Gaussian RBF Kernel
The Gaussian RBF kernel is given by:
K(x, x_i) = exp(−γ‖x − x_i‖²)
where γ is a parameter that controls the kernel's flexibility and sensitivity to differences between data points. It maps the data into an infinite-dimensional space, allowing the SVM to create highly non-linear decision boundaries. The parameter γ determines how closely the model fits the data, with larger values resulting in tighter fits around individual data points and smaller values producing smoother decision boundaries [74].
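The three kernel functions above can be written directly; the default values of β, d, and γ below are illustrative assumptions:

```python
import math

def linear_kernel(x, y):
    """K(x, x_i) = x · x_i: the plain dot product."""
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, beta=1.0, d=2):
    """K(x, x_i) = [(x · x_i) + beta]^d: polynomial feature interactions."""
    return (linear_kernel(x, y) + beta) ** d

def rbf_kernel(x, y, gamma=0.5):
    """K(x, x_i) = exp(-gamma * ||x - x_i||^2): similarity decays with distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)
```

Note how the RBF kernel returns exactly 1 for identical vectors and decays toward 0 as points move apart, with γ controlling the rate of decay.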
While kernel functions mitigate the impact of high feature space dimensions on
computational complexity, the dimensionality of the input space still influences kernel
evaluations, especially for large datasets. The optimal hyperplane is derived using the decision function:
f(x, α*, b) = sign(∑_i α_i* y_i K(x_i, x) + b)
This hyperplane is calculated using support vectors, kernel functions, and bias terms, enabling precise classification. Dimensionality reduction enhances SVM efficiency by reducing the computational overhead required for training and testing [74].
8.3. Text Representation in Dimensionality Reduction
In text categorization, documents are typically represented as a term–document matrix A = (a_ij), where:
• Rows: Represent terms.
• Columns: Represent documents.
• Entries (a_ij): Indicate the frequency or presence of term i in document j.
This matrix serves as the basis for clustering and classification tasks, with dimensionality reduction techniques applied to enhance efficiency and accuracy. By leveraging this representation, models can focus on essential features, enabling better performance in high-dimensional spaces [24].
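A toy construction of such a matrix (whitespace tokenization and raw frequency counts are simplifying assumptions):

```python
def term_document_matrix(corpus):
    """Build A = (a_ij): rows are terms, columns are documents,
    and a_ij is the raw frequency of term i in document j."""
    docs = [doc.lower().split() for doc in corpus]
    terms = sorted({t for tokens in docs for t in tokens})
    matrix = [[tokens.count(term) for tokens in docs] for term in terms]
    return terms, matrix

terms, A = term_document_matrix(["the cat sat", "the dog sat"])
# Row for "sat" is [1, 1]: the term occurs once in each document.
```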
8.4. Common Methods for Term Selection
8.4.1. Document Frequency
This approach selects terms that appear frequently across documents, as frequent
terms may have more importance for classification. However, common terms across all
documents (like stop words) are typically excluded. The calculation involves determining
the amount of text within a collection that contains a specific feature, which may include
words, phrases, n-grams, or custom-derived attributes. The counting approach employs
a binary method: each time a feature is present in a document, its document frequency
(DF) is incremented by one. However, this conventional DF metric focuses solely on
the presence or absence of a feature in a text without accounting for the significance or
relevance of that feature within the document itself [75]. While the document frequency
(DF) metric effectively quantifies the presence of features across a collection of documents,
its binary nature overlooks the contextual importance of those features within individual
documents. DF’s simple presence/absence counting overlooks feature frequency and
relevance variations within a document, leading to an incomplete representation of feature
importance in tasks like text classification.
To address these limitations, the term frequency–inverse document frequency (TF-IDF)
metric is frequently employed, as it evaluates both the occurrence of a feature within a
document and its distribution across the dataset. This results in a more accurate assessment
of a feature's significance. TF-IDF is particularly useful for minimizing irrelevant terms in tasks like text summarization and classification [76]. As a commonly used feature weighting
method in the vector space model, TF-IDF is widely applied in text mining and information
retrieval. It effectively emphasizes the importance of a term within a document collection,
treating all documents equally in its computation [77].
8.4.2. Chi-Square Test
This method evaluates the independence of a term from the document class, selecting
terms that show a significant relationship with the target labels. Chi-square tests are
used to assess many classes of comparison such as tests of independence and tests of
homogeneity [78,79]. Tao and Chang also use the chi-square test to cluster web query schemas [80]. The chi-square test, initially introduced by Pearson, has become a widely used
statistical tool for assessing relationships between categorical variables, such as testing
independence or homogeneity. Its application in clustering tasks, like grouping web query
schemas, demonstrates the versatility of the chi-square test beyond traditional statistical
analysis. By comparing observed and expected frequencies, the test helps uncover patterns
or associations that might not be immediately apparent in raw data. In the context of web
queries, chi-square tests can be used to cluster web query schemas based on their content.
For example, if a search engine wants to categorize search queries into topics like sports,
technology, or health, the chi-square test can assess the relationship between query words
and the topics, helping to improve search accuracy. Experiments show that the proposed method improves the performance of text categorization techniques using chi-square (χ²) for feature selection, achieving an F-measure of 92.20% [81].
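For a 2×2 term/category contingency table, the chi-square statistic has the standard closed form χ²(t, c) = N(AD − BC)² / ((A+B)(C+D)(A+C)(B+D)); a minimal sketch (the variable names and example counts are illustrative):

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/category contingency table:
    a = documents in the category containing the term
    b = documents outside the category containing the term
    c = documents in the category without the term
    d = documents outside the category without the term"""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom
```

When term and category are independent (equal counts in every cell) the statistic is 0; a term that occurs in exactly the documents of one category scores highly, marking it as a strong feature for that category.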
8.4.3. Mutual Information
This approach evaluates how much information a term contributes to predicting the
class label, prioritizing terms that offer the greatest value for classification. Terms are
ranked based on their predictive significance, which can be assessed using techniques like
document frequency, information gain, mutual information, or the χ² test [81]. The core idea is that the most effective terms are those that exhibit the greatest variation in distribution between positive and negative examples across different categories. These techniques evaluate a term's ability to distinguish between categories effectively [82]. Document
frequency evaluates how frequently a term appears, while information gain measures
its significance in predicting a category. Mutual information quantifies the relationship
between a term and a category, assessing each term’s capacity to differentiate between
categories effectively, while the chi-square test determines their independence. These
methods aid in identifying the most important terms, enhancing support vector machine
(SVM) training by concentrating on critical features and patterns.
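A pointwise document-count estimate of MI, in the style of the classic feature-selection formula MI(t, c) ≈ log(A·N / ((A+C)(A+B))), can be sketched as follows (the count notation mirrors the chi-square table and is an assumption for illustration):

```python
import math

def mutual_information(a, b, c_, n):
    """Pointwise mutual information between term t and category cat:
    a  = documents in cat containing t
    b  = documents outside cat containing t
    c_ = documents in cat without t
    n  = total number of documents
    MI(t, cat) ~ log( (a * n) / ((a + c_) * (a + b)) )"""
    if a == 0:
        return float("-inf")  # term never co-occurs with the category
    return math.log((a * n) / ((a + c_) * (a + b)))
```

When term and category are statistically independent the estimate is 0; positive values indicate that the term's presence makes the category more likely, which is exactly the discriminative signal feature selection looks for.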
8.4.4. Term Clustering
This technique groups terms that are semantically similar, reducing redundancy in
the feature set. By clustering similar terms together, the model can focus on clusters rather
than individual terms, improving efficiency. Term clusters derived from syntactic meta-features and indexed based on document or document group co-occurrence are typically of higher quality compared to indexing methods that rely solely on individual syntactic phrases, single indexing words, or word clusters [83]. Term clustering differs from term selection in that it focuses on grouping terms that are synonymous or nearly synonymous, whereas term selection primarily aims to eliminate non-informative terms [43]. The relationships identified within clusters are often incidental rather than the intended systematic connections originally sought [83]. Optimization techniques have a wide range of applications, including clustering and categorizing text documents, engineering, image processing, speech recognition, pattern recognition, weather forecasting, route optimization, wireless sensor networks, and job scheduling, among others [84]. Grouping terms by
analyzing syntactic relationships and co-occurrence patterns improves document indexing
by capturing contextual meaning while minimizing redundancy. This approach reflects
how words work together, rather than treating them as isolated terms. For example, in legal
document retrieval, clustering terms like “contract terms” or “legal agreement” enhances
search relevance and accuracy.
8.4.5. Principal Component Analysis (PCA)
Principal component analysis (PCA) is a linear dimensionality reduction method that projects data onto the most significant axes, known as principal components. It is a statistical technique designed to reduce dimensionality while minimizing the loss of variance from the original dataset. PCA identifies the directions of maximum variance within the term–document matrix, allowing for a reduction in the number of features while preserving the majority of the data's variance. This approach is especially valuable for managing sparse or high-dimensional datasets. It achieves this by transforming the initial correlated quantitative variables into new, uncorrelated variables known as principal components [85]. PCA reduces dimensionality by computing the covariance matrix and identifying the eigenvectors (principal components) that capture the greatest variance in the data. These components transform correlated features into uncorrelated ones, simplifying analysis and eliminating
redundancy. PCA is widely used for visualization, improving machine learning perfor-
mance, and handling high-dimensional datasets.
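For two-dimensional data the covariance matrix is 2×2, so its leading eigenvector (the first principal component) can be computed in closed form; a minimal sketch (real pipelines use library eigensolvers on high-dimensional term matrices):

```python
import math

def first_principal_component(points):
    """PCA sketch for 2-D data: compute the sample covariance matrix,
    then its leading eigenvector, the direction of maximum variance."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    # Largest eigenvalue of a symmetric 2x2 matrix, in closed form
    lam = (sxx + syy) / 2 + math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    # Corresponding eigenvector (handle the diagonal case sxy == 0)
    vx, vy = (sxy, lam - sxx) if sxy != 0 else ((1.0, 0.0) if sxx >= syy else (0.0, 1.0))
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm)

pc1 = first_principal_component([(0, 0), (1, 1), (2, 2), (3, 3)])
# The data lie on the line y = x, so pc1 points along that diagonal.
```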
8.5. Comparison of Dimensionality Reduction Methods
Term Extraction Techniques
Document frequency (DF) measures how often a term appears in a document collection. Terms that appear in very many or very few documents may not provide useful distinguishing information. This technique is widely used for filtering out common terms (for example, stop words) or rare terms. Widely used weighting schemes like term frequency–inverse document frequency (TF-IDF) are used to convert a document into a structured format [86].
The chi-square test measures the dependence between two categorical variables, such
as the occurrence of a term and its corresponding document category. It identifies terms
that are strongly associated with specific categories, aiding in feature selection. A higher χ² value signifies a stronger relationship between the term and the category. This test is
computationally efficient and is commonly used to examine the independence of categorical
variables or assess how well a sample aligns with the distribution of a known population
(goodness of fit) [87].
Mutual information (MI) quantifies the dependency between two variables (terms). In text analysis, it measures how much information one term provides about another, capturing both frequency and context. High mutual information values suggest that the term is informative and relevant to the target classification task.
(MI) accurately is a complex task, and using it as an objective in representation learning
often leads to highly entangled representations because of its invariance under arbitrary
invertible transformations. However, despite these difficulties, MI-based methods have
repeatedly proven to be highly effective in practical scenarios [88].
9. Evaluation of Text Categorization Models
Text categorization is a key task in natural language processing (NLP). The aim of
text categorization methods is to associate one (or more) of a given set of categories to
a particular document [89]. Evaluating the performance of text categorization models
is crucial for understanding their effectiveness and ensuring they perform well in real-
world applications. Evaluating the performance of a text categorization model involves
the use of various metrics. This section discusses the evaluation of text categorization
models, focusing on performance metrics, the F-measure, and challenges associated with
model evaluation.
9.1. Metrics for Performance Evaluation
These metrics are used to assess the model’s effectiveness in accurately classifying
text into the appropriate categories. Various evaluation measures are commonly em-
ployed, such as recall, precision, accuracy, error rate, F-measure or break-even point,
micro-average and macro-average for binary classification, and 11-point average precision
for ranking categories [90].
9.1.1. Accuracy
Accuracy is the ratio of correctly predicted instances to the total instances in the dataset. While simple and widely used, it can be misleading in imbalanced datasets.
Accuracy = (True Positives + True Negatives) / Total Samples
In text categorization (such as classifying documents into multiple categories or topics),
we evaluate model performance using metrics like accuracy (how often the model predicts
the correct category) or error rate (how often the model is wrong). However, Yang [90] points out key issues when applying these metrics to certain datasets. For example, a simplistic algorithm that rejects all documents for every category would achieve a global average error rate of 1.3% and a global average accuracy of 98.7%, whether measured on a micro- or macro-scale, as these values would be identical [91]. This does not imply that a trivial rejector classifier is effective; rather, it highlights that accuracy or error alone may not be reliable metrics for evaluating the performance or utility of a classifier in text categorization, especially when the number of categories is large and each document is associated with only a small number of categories on average [90]. A trivial classifier refers to a model that generates basic, non-informative predictions. Selecting an appropriate performance evaluation metric becomes especially critical when dealing with advanced machine learning methods, such as neural networks, to ensure meaningful and accurate assessments of their predictive capabilities [92]. In this context, the trivial approach refers to a classifier that rejects all documents for every category. Alternatively, a predictor–rejector formulation involves learning both a predictor and a rejector, each derived from distinct families of functions, while explicitly considering the cost of abstaining from making a prediction [93].
In simpler terms, this model consistently predicts that no categories are assigned to any
document, earning it the label of a “rejector classifier”. Despite failing to perform any
meaningful classification, the rejector classifier could achieve a global accuracy of 98.7%,
primarily because many documents in the dataset have very few assigned categories,
making them irrelevant to most. While accuracy can be a reliable metric when positive
and negative examples are balanced, it becomes misleading in imbalanced scenarios. For
instance, if negative examples significantly outnumber positive ones, a system that assigns
no documents to any category can still achieve an accuracy value close to 1, even though it
provides no useful information for classification, as it fails to differentiate between relevant
and irrelevant categories [93].
9.1.2. Precision
Precision (also called positive predictive value) measures the accuracy of positive
predictions. It is the ratio of true positives to the total predicted positives [94].
Precision = True Positives / (True Positives + False Positives) × 100
Precision plays a critical role in scenarios where the cost of false positives is significant,
such as spam detection, where misclassifying a legitimate email as spam can lead to unde-
sirable outcomes. In the field of information retrieval, precision refers to the percentage of
retrieved documents that are relevant, while recall represents the percentage of relevant
documents successfully retrieved from the total set of relevant documents [95]. Studies have reported impressive results, with recall and precision averaging around 90% on a small subset (3%) of a specific corpus [43]. It is noted that micro-averaged scores (recall, precision, and F1) are predominantly influenced by the classifier's performance on frequently occurring categories, whereas macro-averaged scores are more impacted by performance on less common categories [90]. Precision becomes especially important in high-cost error
cases, such as spam detection, where the misclassification of non-spam emails as spam can
have significant repercussions.
9.1.3. Recall (Sensitivity)
Recall (also called sensitivity or true positive rate) shows how well the model identifies
all relevant instances. It is the ratio of true positives to the total actual positives [94].
Recall = True Positives / (True Positives + False Negatives) × 100
Recall is crucial in situations where the cost of false negatives is high, such as in med-
ical diagnostics, where failing to detect a positive case could have serious consequences.
Recall is defined as the ratio of correctly identified positive cases to the total number of
actual positives. This measure evaluates the system’s ability to identify true positives, with
average performance sometimes assessed across different recall thresholds for all test documents [90]. It is particularly significant in cases where missing a positive diagnosis could result in severe outcomes, emphasizing the importance of capturing all relevant instances.
F-Measure
The F1-measure serves as the harmonic mean of precision and recall [94], providing a balanced evaluation of both metrics. It is especially valuable in scenarios with an uneven class distribution, where balancing false positives and false negatives is critical, such as in text classification tasks. The F1-measure is calculated as follows:
F-Measure = 2 × (Precision × Recall) / (Precision + Recall)
The F-measure is commonly employed when achieving a balance between precision
and recall is important, such as in text classification tasks where it is necessary to minimize
both false positives and false negatives. However, designing an appropriate significance
test can be challenging, as the method’s performance is often summarized into a single
metric, like the break-even point or the optimized F1-score [90]. Additionally, optimizing
predictions to maximize the F1-measure is not always feasible by merely ranking labels
based on their relevance and selecting the highest-ranked ones [96].
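The four metrics, and the trivial-rejector pitfall discussed earlier, can be reproduced from confusion-matrix counts; the 987/13 split below is an illustrative imbalance chosen for this sketch, not data from the cited studies:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# A trivial "rejector" on an imbalanced set: accuracy looks excellent,
# yet precision, recall, and F1 reveal that nothing useful was classified.
acc, p, r, f1 = classification_metrics(tp=0, tn=987, fp=0, fn=13)
```

Here accuracy is 0.987 while recall and F1 are both 0, illustrating why accuracy alone is unreliable on imbalanced category distributions.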
Table 2summarizes the mean number of assigned (Assign. Mean) keywords and
correct (Corr. Mean) keywords per document, as well as the precision (P), recall (R), and
F-measure (F) achieved when extracting 312 keywords per document [97].
Table 2. Keyword Statistics.
Assign. Mean Corr. Mean P R F
8.6 3.6 41.5 46.9 44.0
9.1.4. Break-Even Point (BEP)
The break-even point represents the point where precision and recall are equal, provid-
ing insight into the trade-off between these metrics. Typically, BEP values are interpolated
because exact matches of precision and recall are rare. Recall and precision are critical mea-
sures in text classification. Recall assesses the capacity to recognize all relevant documents,
whereas precision assesses the accuracy of recovered documents. Together, they provide a
fair evaluation of the model’s performance. Additionally, the point where precision equals
recall is not always meaningful or desirable from the user's perspective [93]. This also means that the BEP score of a system is always equal to or less than the optimal value of F1 of that system [90]. The BEP score is a more lenient metric than F1, meaning it cannot
exceed the optimal F1 score, which balances precision and recall.
9.2. Validation Techniques
Effective validation techniques are critical to evaluate how well a model performs on
unseen data. The rapid growth of digital text data has necessitated the development of new
methods for text processing and classification [84].
9.2.1. K-Fold Cross-Validation
K-fold cross-validation divides a dataset into “k” subsets. Each subset is used once as a validation set, while the remaining k−1 subsets are used for training. This ensures that each data point is utilized for both training and validation [98]. As “k” increases, the evaluation becomes more stable by averaging the results over more models. However, increasing “k” also requires training more models, making it important to choose an appropriate “k” value [99]. This method is especially useful in fields like healthcare, where it helps assess classification model performance with limited datasets [100].
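The index bookkeeping behind k-fold splitting can be sketched as follows (no shuffling or stratification, which practical implementations usually add):

```python
def k_fold_indices(n_samples, k):
    """Yield (train, validation) index lists: each of the k folds serves once
    as the validation set while the remaining k-1 folds form the training set."""
    indices = list(range(n_samples))
    # Distribute samples as evenly as possible across the k folds
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(10, 5))
# 5 folds of 2 validation samples each; every index appears in exactly one
# validation fold, so each data point is used for both training and validation.
```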
9.2.2. Train–Test Split
A simpler validation approach is the train–test split, which divides the dataset into two
parts: a training set for developing the model and a test set for evaluating its performance.
A common split ratio is 80% training and 20% testing, although this may vary. The train–test
split is often used in meta-learning, where models are adapted to specific tasks using one
subset of data and evaluated on another [101]. While the train–test split trains a single
model, cross-validation improves generalization by training multiple models on different
data subsets [102].
9.3. Challenges in Model Evaluation
a. Lexical databases exist for only a small number of languages, so knowledge-based systems can be developed only for those languages. Such systems are mostly specific to particular languages and subjects, so they cannot easily be transferred to others; they are also costly to maintain, since languages keep changing, and are simply unavailable for some subjects [84]. Researchers are therefore urged to develop these resources for underrepresented languages to expand system usability.
b. Building and implementing a deep learning-based system can be highly resource-intensive, as training such systems requires expensive hardware and significant computational power, which must be accounted for [84].
c. The semantic relationships among words in a text document complicate text categorization, making such systems hard to build. Extracting these meaning relationships from unsupervised text data is a particularly difficult task [84].
9.4. Challenges in Machine-Learning-Based Text Classification
This section examines the challenges in machine-learning-based text classification,
addressing critical issues such as overfitting and underfitting, which impact model general-
ization; class imbalance, which skews classification results; feature space complexity, which
complicates model training and interpretation; and linguistic challenges like ambiguity
and polysemy, which hinder accurate text understanding and categorization.
9.5. Overfitting and Underfitting in TC
Overfitting and underfitting pose major challenges to the quality of classification
models. Overfitting occurs when a model learns excessively, including noise, leading
to excellent performance on training data but poor generalization to unseen data. Both
overfitting and underfitting can cause training errors that significantly impact the reliability of deep-learning-based communication systems [103]. Regularization, dropout layers, and data augmentation are techniques that help to prevent overfitting by balancing model complexity and lowering sensitivity to certain parameters. The ability of a model to perform well beyond its training data is called generalization, and improving generalization is primarily what mitigates overfitting [104].
Underfitting happens when a model is overly simplistic in capturing major patterns in the
data, resulting in poor performance on both the training and test sets. Underfitting in text
classification (TC) can occur as a result of utilizing basic algorithms or insufficient feature
extraction approaches, which limits the model’s capacity to recognize linguistic complexity
and thematic nuances. To solve underfitting, consider enhancing the model’s complexity by
using advanced architectures such as transformers or pretrained models, as well as including a
diverse variety of data points. Furthermore, approaches such as regularization and dropout
can help to prevent overfitting, whereas the inclusion of additional layers or pretrained
models helps to reduce underfitting. These changes increase TC models’ generalization
power, allowing them to reliably categorize a wider range of text types [105].
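To make the regularization idea concrete, inverted dropout can be sketched in a few lines of plain Python (the layer values and drop rate below are illustrative, not taken from any cited system):

```python
import random

def inverted_dropout(activations, rate, rng):
    """Zero each activation with probability `rate` and rescale the
    survivors by 1/(1 - rate) so the expected layer output is unchanged."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
layer = [0.5, 1.2, -0.3, 0.8, 2.0, -1.1]
dropped = inverted_dropout(layer, rate=0.5, rng=rng)
# Roughly half the units are zeroed on each forward pass, so the
# network cannot come to rely on any single feature.
```

At inference time dropout is disabled; the 1/(1 − rate) rescaling during training keeps the expected activations consistent between the two phases.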
9.6. Class Imbalance in TC
The class imbalance problem in text categorization (TC) occurs when certain categories
dominate a dataset, while others are underrepresented. This imbalance might cause
machine learning models to favor majority classes, resulting in biased predictions. This
is especially troublesome in applications such as spam detection or sentiment analysis,
where minority classes are important. Addressing class imbalance is critical to ensuring TC
models’ robustness and fairness. A representative case is multiclass imbalance: the rapidly
intensifying (RI) and extraordinarily intensifying (EI) classes have significantly fewer
training samples than the neutral and weakening classes [106].
The terms rapidly intensifying (RI) and extraordinarily intensifying (EI) refer to system
intensity changes, with RI characterizing a rapid and large increase in strength over a short
period of time and EI referring to an unusual, rare escalation in intensity beyond regular
patterns. Neutral denotes systems with little or no intensity change, whereas weakening
denotes systems losing strength owing to unfavorable conditions. These classifications aid
in understanding system behavior, allowing for more accurate analysis and prediction.
Class imbalance is frequently the result of natural data distribution. Sports and politics,
for example, may have significantly more data than specialty fields such as environmental
news, particularly in user-generated content or real-time applications. An imbalance
in class distribution skews models toward the majority class, reducing their ability to
generalize effectively across different scenarios. To address this, data-level methods like
the synthetic minority oversampling technique (SMOTE) are used to balance the dataset by
oversampling minority classes and undersampling majority classes. However, classifiers
trained and evaluated on increasingly imbalanced datasets often exhibit artificially inflated
classification accuracy, which can be misleading [106]. Algorithmic approaches modify
the learning process by allocating higher weights to minority classes, with techniques
such as boosting and bagging being useful. Advanced models such as BERT and GPT,
through fine-tuning and cost-sensitive learning, aid in minority class recognition in highly
skewed datasets [107].
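As an illustrative sketch of the data-level remedy, the interpolation step at the heart of SMOTE can be imitated in plain Python. Unlike real SMOTE, which interpolates toward k-nearest neighbours, this toy version pairs random minority points; all data here are synthetic:

```python
import random

def smote_like_oversample(minority, n_new, rng):
    """Create synthetic minority samples by interpolating between two
    randomly chosen minority points -- a simplified stand-in for SMOTE,
    which interpolates toward k-nearest neighbours instead."""
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        gap = rng.random()  # position along the segment between a and b
        synthetic.append([ai + gap * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic

rng = random.Random(42)
# A 90-to-10 split: the minority class is heavily underrepresented.
majority = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(90)]
minority = [[rng.gauss(3, 0.5), rng.gauss(3, 0.5)] for _ in range(10)]
new_points = smote_like_oversample(minority, n_new=80, rng=rng)
balanced_minority = minority + new_points  # now 90 vs. 90
```

Because each synthetic point lies on a segment between two real minority samples, the method fills in the minority region of feature space rather than merely duplicating examples.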
9.7. Complexity in Feature Space
Gaining a deeper understanding of the distribution of patterns within the feature
space can provide valuable insights into the difficulty and complexity of various classification tasks [108]. The feature space in text categorization (TC) refers to the structured
dimensions or variables used to process text input for machine learning. Text is called
unstructured since it consists of words, phrases, and syntax that do not follow a predefined
format or numerical representation. Unlike structured data, such as tables or spreadsheets,
text requires processing techniques such as tokenization and embedding to numerically
represent its semantic and grammatical qualities, resulting in a multidimensional feature
space. This intricacy can make it difficult for models to train successfully, resulting in sig-
nificant computing costs and the danger of overfitting. To address these difficulties, feature
selection and dimensionality reduction approaches can manage feature space complexity
while retaining critical information [39].
Classifiers trained on datasets with increasing levels of class imbalance and evaluated
under the same conditions often exhibit an artificially inflated classification accuracy, which
can be misleading [108]. The high complexity of text data raises computational demands
and makes it difficult to differentiate relevant aspects. Feature space, like the physical
universe, is very sparsely populated [108]. Sparse data points in a large feature space can
make generalization difficult and increase training time. Simple representations, such as
bag-of-words, may fail to express linguistic nuances, especially when dealing with poly-
semy and synonyms. A higher-dimensional feature space is required to cope with this more
complex situation [109]. Feature engineering is critical to make text data more manageable
and understandable. Methods like term frequency–inverse document frequency (TF-IDF)
and n-grams help models identify important terms and phrase structures. Word embed-
dings, including Word2Vec, GloVe, and fastText, provide compact, dense representations
that enhance generalization across related concepts. More advanced embeddings, such as
BERT and GPT, go further by generating contextualized representations that capture the
meanings of words based on their surrounding context [110].
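A minimal TF-IDF computation illustrates how the weighting highlights discriminative terms (the three toy documents below are illustrative):

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute TF-IDF weights for a list of tokenised documents.
    tf = term count / document length; idf = log(N / document frequency)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc)
        weights.append({t: (c / length) * math.log(n / df[t])
                        for t, c in counts.items()})
    return weights

docs = [["cheap", "loans", "now"],
        ["match", "report", "late", "goal"],
        ["loans", "for", "students"]]
w = tfidf(docs)
```

In the first document, "cheap" (appearing in one of three documents) receives a higher weight than "loans" (appearing in two), which is exactly the discriminative behaviour a TC model needs.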
Dimensionality reduction techniques such as principal component analysis (PCA),
singular value decomposition (SVD), and autoencoders condense the feature space by pre-
serving only the most significant features. This not only enhances model interpretability but
also reduces training time, making the models more efficient. Modern embedding models
such as BERT and GPT improve TC by incorporating contextual nuances, increasing model
accuracy for complex languages. While these developments improve TC, they also raise
interpretability concerns. Deep learning models and embeddings are frequently viewed as
“black boxes”, which is especially troublesome in industries requiring explanation, such as
healthcare or finance. Attention mechanisms and explainable AI (XAI) tools help to empha-
size significant elements while balancing feature complexity and interpretability, allowing
practitioners to make educated decisions in complicated language processing tasks [111].
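PCA, SVD, and autoencoders require dedicated numerical machinery; as a deliberately lightweight alternative sometimes used in TC pipelines, the feature-hashing trick caps the dimensionality of the space directly. A sketch (the 16-bucket size is arbitrary, and collisions trade some information for a bounded feature space):

```python
import hashlib

def hash_features(tokens, n_dims=16):
    """Map an arbitrarily large vocabulary into a fixed-size count vector
    by hashing each token to one of n_dims buckets (the 'hashing trick')."""
    vec = [0] * n_dims
    for tok in tokens:
        # md5 gives a stable hash across runs, unlike Python's builtin hash()
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        vec[h % n_dims] += 1
    return vec

v = hash_features(["the", "bank", "raised", "funds", "the", "bank"])
# The vector always has 16 dimensions, regardless of vocabulary size.
```

The design choice here is a trade-off: no vocabulary needs to be stored or fitted, at the cost of occasional bucket collisions between unrelated terms.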
9.8. Ambiguity and Polysemy in Language
Ambiguity and polysemy pose substantial challenges in natural language processing
(NLP), particularly in tasks such as text categorization. Ambiguity occurs when a term
or phrase has many meanings, such as “bank” referring to a financial organization or a
riverbank. Polysemy is a type of ambiguity in which words have multiple related meanings,
such as “run” for physical exercise or executing a program. These phenomena hamper
model performance because they require context for accurate interpretation, which standard
models struggle with. Ambiguity creates confusion in TC, where context is critical for
accurate classification [112]. For example, a headline like “local bank raises funds” requires
contextual expertise to discern between financial and non-financial issues. Simple models
frequently misclassify such scenarios, and even neural models such as transformers can
fail when contextual cues are not apparent or need cultural knowledge, emphasizing the
importance of advanced context handling strategies [113].
Polysemy is especially difficult since static word embeddings cannot capture multiple
meanings across contexts. Words like “light” can refer to either brightness or weight,
depending on the context. Contextual embeddings, such as those used in BERT and GPT,
address this by dynamically modifying meanings based on surrounding words, although
complex phrases and nuanced interpretations continue to pose issues. Multilingual NLP
complicates TC by varying ambiguity and polysemy across languages. Some languages
use morphology to resolve ambiguity, while others rely significantly on context, which
complicates operations like machine translation. To deal with these challenges, multilingual
models such as mBERT are trained on a variety of datasets, although linguistic diversity
still presents limits [114].
There are several ways to deal with ambiguity and polysemy. Domain-specific models
improve context and reduce misclassification, while auxiliary tasks such as part-of-speech
tagging help clarify meaning. Ensemble models, which incorporate predictions from
many models, improve overall performance. Although effective, these techniques are
computationally expensive, demonstrating that ambiguity and polysemy remain key issues
in NLP [115].
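A classic, if simplistic, remedy for lexical ambiguity is Lesk-style gloss overlap: choose the sense whose definition shares the most words with the surrounding context. A toy sketch using the “bank” example above (the sense glosses are illustrative):

```python
def lesk_like(word, context, sense_glosses):
    """Pick the sense whose gloss overlaps most with the context words --
    a simplified version of the classic Lesk disambiguation algorithm."""
    context = set(context)
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        overlap = len(context & set(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

senses = {
    "financial": ["money", "funds", "deposit", "loan", "institution"],
    "river":     ["water", "shore", "slope", "edge", "river"],
}
sense = lesk_like("bank", ["local", "bank", "raises", "funds"], senses)
# The context word "funds" overlaps the financial gloss, so the
# financial sense wins.
```

Contextual embeddings such as those in BERT solve the same problem statistically rather than through hand-written glosses, but the overlap idea makes the role of context explicit.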
Advancements and Emerging Trends in Text Categorization (TC)
Recent advances in TC reflect a paradigm shift away from traditional machine learning
methods and toward deep learning and hybrid methodologies. These advancements enable
better feature extraction, contextual comprehension, and flexibility across languages and
domains, broadening the scope of TC’s practical applications. This section explores deep
learning approaches for text categorization, focusing on the application of convolutional
neural networks (CNNs) and recurrent neural networks (RNNs) for various text classifica-
tion tasks. It also highlights the transformative impact of transfer learning and pretrained
language models, such as BERT and GPT, in advancing text categorization with contextual
understanding and reduced training requirements.
9.9. Deep Learning for Text Categorization
With its capacity to identify complex patterns in high-dimensional data, deep learning
has transformed text classification by allowing models to learn directly from raw text
with minimal feature engineering. Deep learning algorithms, particularly convolutional
neural networks (CNNs) and recurrent neural networks (RNNs), have exhibited significant
promise in text classification tasks, excelling at capturing local and sequential relationships.
9.10. CNNs and RNNs for TC Tasks
CNNs and RNNs are two of the most popular TC architectures due to their dis-
tinct ability to process and comprehend textual data. CNNs, which have typically been
employed in image processing, have been adapted for text classification by applying
convolutional filters on word embeddings or n-gram representations. This technique recog-
nizes local word patterns and is especially beneficial for short text categorization, such as
sentence-level sentiment analysis [116]. CNNs’ hierarchical feature extraction technique
finds relevant phrases and concepts, making them ideal for context-dependent document
classification tasks [117].
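The convolution-over-embeddings idea can be reduced to a toy example: one filter of width two slides over a four-word sentence, followed by ReLU and max-over-time pooling (the embeddings and filter weights below are invented purely for illustration):

```python
def conv1d_text(embeddings, filt, bias=0.0):
    """Slide a filter of width k over a sequence of word embeddings and
    return the strongest activation (valid convolution + ReLU +
    max-over-time pooling), mirroring how a text CNN detects one
    local n-gram pattern."""
    k = len(filt)
    outputs = []
    for i in range(len(embeddings) - k + 1):
        window = embeddings[i:i + k]
        score = bias + sum(w * x
                           for wrow, xrow in zip(filt, window)
                           for w, x in zip(wrow, xrow))
        outputs.append(max(0.0, score))  # ReLU
    return max(outputs)                  # max-over-time pooling

# Toy 2-dimensional embeddings for a 4-word sentence and one width-2 filter.
sentence = [[0.1, 0.3], [0.9, 0.2], [0.8, 0.7], [0.0, 0.1]]
filt = [[1.0, 0.0], [1.0, 0.0]]  # responds to bigrams with high first dims
feature = conv1d_text(sentence, filt)
```

A real text CNN applies many such filters of several widths in parallel and feeds the pooled features to a classifier; this sketch shows why each filter acts as an n-gram detector.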
“RNNs, notably Long Short-Term Memory (LSTM) networks, have also proven useful
for TC because of their sequential character, allowing them to effectively model dependencies across phrases and paragraphs” [118]. Recurrent neural networks (RNNs) are a type
of neural network architecture mainly used to detect patterns in sequential data [119]. The
sequential learning capabilities of these models are especially useful in TC
tasks involving long documents with complicated language structures.
9.11. Transfer Learning and Pretrained Language Models
Transfer learning, particularly through pretrained language models, represents a
significant advancement in text classification (TC). By leveraging knowledge from vast and
diverse text corpora, it reduces the reliance on extensive labeled datasets, thereby enhancing
the accessibility of text classification for low-resource languages and niche domains.
9.11.1. Use of BERT, GPT, and Similar Models
Pretrained language models like Bidirectional Encoder Representations from Trans-
formers (BERT), Generative Pre-trained Transformer (GPT), and related architectures have
revolutionized text classification (TC). These models are pretrained on extensive corpora
and can be fine-tuned with minimal additional training for specific tasks, setting new
benchmarks in performance and efficiency. “BERT, for example, uses a bidirectional atten-
tion mechanism to record the context of words from both left and right contexts, resulting
in more nuanced understanding in TC applications” [120]. BERT’s deep bidirectional
methodology makes it particularly successful for context-dependent tasks like sentiment
analysis and topic classification.
GPT, on the other hand, employs a unidirectional transformer architecture, excelling
at producing coherent, contextually relevant text and performing well on tasks requiring
text generation or completion [121]. For TC, GPT and its descendants, such as GPT-3, have
demonstrated exceptional performance in few-shot and zero-shot classification scenar-
ios, decreasing reliance on labeled data and facilitating fast knowledge transfer between
languages and domains [122].
The introduction of these models significantly improved TC capabilities, allowing
classifiers to function with minimum task-specific input while maintaining high levels of
accuracy. Their efficacy across a variety of TC applications demonstrates transfer learning’s
promise for dealing with complicated and developing text collections.
9.11.2. Hybrid Approaches Combining Knowledge Engineering and ML
Hybrid approaches that integrate knowledge engineering with machine learning are
gaining traction, effectively bridging the gap between rule-based systems and data-driven
methods. A SWOT analysis of the ten most frequently cited algorithms from a curated
collection of peer-reviewed studies and research publications reveals the strengths and
weaknesses of traditional algorithms while uncovering the opportunities and challenges
that hybrid methods aim to address [123]. These methods incorporate human-defined rules
and domain expertise into machine learning models, enhancing the interpretability and
robustness of text classification (TC) systems.
In recent years, other hybrid physics–ML models have been developed, extending
beyond residual modeling. A simple method to integrate physics-based and ML models
involves using the output of a physics-based model as input for an ML algorithm [124].
Within hybrid TC systems, knowledge engineering is often applied to create initial feature
sets or rules that feed into machine learning algorithms. For instance, domain-specific
ontologies or taxonomies can guide feature selection, enabling the model to capture critical
semantic details relevant to the categorization task. This approach is particularly effective
in specialized fields such as healthcare or legal document categorization, where domain
expertise is crucial for achieving accurate classification [35].
10. Future Directions and Research Opportunities
This section describes both broad future directions and specific research oppor-
tunities in text categorization (TC), focusing on developing trends and challenges in
practical applications.
10.1. Multilanguage and Cross-Cultural Text Classification
This subsection discusses the challenges and advancements in creating inclusive TC
systems that address linguistic and cultural diversity.
10.1.1. Importance of Cross-Language Communication
In today’s interconnected global landscape, seamless cross-language communication is
essential. As language diversity persists as a barrier, domains like multilingual translation
and text summarization are reaching a critical juncture, requiring innovative automated solutions [125]. Text classification models, which often rely on large-scale labeled datasets, are
typically tailored for specific languages and cultural contexts. This limitation underscores
the growing demand for systems capable of addressing linguistic and cultural diversity in
an increasingly interconnected world [126].
10.1.2. Advancements in Multilingual NLP
New multilingual datasets featuring conversations in Chinese, English, Korean, and
Japanese provide a robust foundation for developing powerful conversational AI systems [126]. Pretrained models like BERT have expanded their capabilities to include
multilingual versions such as mBERT and XLM-R. These models enable simultaneous
processing of diverse linguistic inputs, enhancing cross-language text classification [127].
10.1.3. Cross-Lingual Transfer Learning
Cross-lingual transfer learning, facilitated by both social and machine translation,
plays a pivotal role in multilingual text classification. Many multilingual datasets are gen-
erated through professional translations, while machine translation is frequently employed
to translate training or test sets. Despite these advancements, challenges remain, such
as the lack of standardized multilingual datasets annotated under consistent guidelines,
particularly for intent detection and slot filling tasks [128,129].
10.1.4. Cultural Sensitivity in Text Classification
Text classification systems must navigate cultural nuances, including idiomatic expres-
sions, societal norms, and sentiment variations across regions. For example, positive or
neutral sentiment expressions can differ significantly between cultures, affecting sentiment
analysis accuracy. Translators must ensure cultural appropriateness, preserving the natural
tone and relevance for the target audience.
10.1.5. Future Research Directions
Universal Multilingual Models
Developing generalized models capable of learning across multiple languages with
minimal reliance on labeled data is a critical research direction. Universal multilingual
models such as XLM-R and mBERT have laid the groundwork, but further advancements
are needed to enhance their adaptability to low-resource languages and diverse linguistic
contexts. By leveraging transfer learning, cross-lingual embeddings, and domain adapta-
tion, these models can facilitate effective communication and analysis across linguistic and
cultural barriers.
Low-Resource Languages
Addressing data scarcity in low-resource languages remains a significant challenge.
Techniques such as unsupervised learning, self-supervised approaches, and domain-specific
transfer learning can mitigate these limitations. For instance, multilingual pretrained
models can be fine-tuned for specific low-resource languages, enabling their inclusion
in broader applications and ensuring global inclusivity. Integrating machine transla-
tion and text classification duties could also enhance the usability of these models in
multilingual environments.
Enhanced Language Identification
Future text categorization systems must incorporate advanced language identifica-
tion techniques to process user-generated content that often includes multiple languages.
Methods such as combining deep learning with linguistic rules can improve accuracy in
detecting and processing code-switching and mixed-language texts. This capability is
essential for applications in social media monitoring, global marketing, and multilingual
customer support, where accurate language identification is critical [130].
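A lightweight baseline for language identification, character n-gram profile matching, can be sketched as follows (the two pangram-based profiles below are illustrative; production systems train on far larger corpora and handle many more languages):

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Pad the text and collect overlapping character trigrams."""
    text = f"  {text.lower()}  "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def identify(text, profiles):
    """Score a text against per-language trigram profiles by counting
    shared trigrams and return the best-matching language."""
    grams = char_ngrams(text)
    scores = {lang: sum(min(c, prof[g]) for g, c in grams.items())
              for lang, prof in profiles.items()}
    return max(scores, key=scores.get)

profiles = {
    "en": char_ngrams("the quick brown fox jumps over the lazy dog"),
    "es": char_ngrams("el veloz zorro marrón salta sobre el perro perezoso"),
}
lang = identify("the dog jumps", profiles)
```

Because the scoring works per character window, the same machinery can flag code-switching by scoring sliding segments of a mixed-language text rather than the whole document.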
Cultural Awareness in Models
Embedding cultural sensitivity into text categorization models is vital for improving
their classification accuracy and relevance in diverse contexts. Cultural nuances, idiomatic
expressions, and societal norms influence language usage and sentiment expression, which
models must understand to perform effectively. Incorporating cultural awareness into
training data and leveraging cross-cultural embeddings can enhance the adaptability and
inclusivity of these systems.
Integration with Real-Time and Multimodal Systems
The integration of text categorization with real-time processing and multimodal sys-
tems is another promising research avenue. Real-time categorization systems must handle
dynamic data streams with minimal latency while maintaining accuracy. Combining text
with visual and audio inputs, such as in social media content analysis, could provide
richer contextual understanding and enhance classification outcomes. Edge computing and
incremental learning techniques can support this shift toward dynamic, real-time systems.
Ethical AI, Transparency, and Bias Mitigation
Ensuring ethical AI methods and transparency is especially important in sensitive
applications such as recruitment, healthcare, legal analytics, and public safety. Address-
ing biases in training data and algorithms necessitates frameworks for bias identification,
mitigation, and explainable AI (XAI) strategies that promote trust and responsibility. Incor-
porating ethical considerations into model design ensures fair, transparent, and impartial
results across varied demographic and cultural contexts, boosting user trust in real-time
text classification systems.
Hybrid and Explainable Models
Combining machine learning with rule-based systems offers a promising avenue
for creating interpretable and robust text categorization models. Hybrid models can
balance precision and transparency, making them more suitable for high-stakes applications.
Explainable AI approaches will play a crucial role in enabling users to understand and trust
the decision-making processes of these models.
By addressing these research directions, the next generation of text categorization
systems can achieve greater inclusivity, adaptability, and ethical integrity. These advance-
ments will not only refine technical performance but also ensure that text categoriza-
tion technologies remain relevant and impactful in an increasingly interconnected and
data-driven world.
10.2. Addressing Emerging Challenges in Multilingual Text Classification
To enhance the inclusivity and adaptability of text classification (TC) systems, address-
ing emerging challenges in multilingual and cross-cultural contexts has become a pressing
need. While advancements in pretrained multilingual models and cross-lingual transfer
learning have set a strong foundation, several critical areas demand focused research.
10.2.1. Low-Resource Language Challenges
Despite progress, low-resource languages continue to pose significant challenges. The
scarcity of labeled datasets and the diversity in linguistic structures impede the develop-
ment of robust TC systems for these languages. Techniques such as unsupervised learning,
few-shot learning, and synthetic data generation can help bridge this gap. For example,
using generative AI models to create synthetic training data for low-resource languages
could expand their application scope in multilingual environments.
10.2.2. Adaptive Multilingual Systems
Dynamic multilingual systems that can adapt to real-time user needs and cultural
contexts represent a promising direction. Innovations in adaptive embeddings, context-
aware processing, and reinforcement learning for linguistic and cultural adaptation are
critical to enabling seamless multilingual applications. These systems must also address
code switching, mixed-language text processing, and evolving regional dialects to ensure
relevance and accuracy.
By tackling these challenges, text classification systems can better support global communication needs, fostering inclusivity and equity in an increasingly interconnected world.
10.3. Real-Time Text Categorization Applications
This subsection explores the demands and opportunities of deploying TC systems in
real-time environments where immediate decision making is critical.
10.3.1. The Need for Real-Time Classification
Real-time text categorization enables immediate processing and classification of newly
generated content, bypassing the need for batch operations. This capability is critical for
applications such as social media monitoring, content filtering, and customer support,
where real-time decision making is essential [13].
10.3.2. Scalability and Speed in Real-Time Systems
The high volume and rapid generation of content on social media and news platforms
demand systems that are both fast and scalable. For instance, integrating report texts with
tweets containing relevant links has been shown to improve classification outcomes in
real-time environments [131].
10.3.3. Incremental Learning for Dynamic Content
Real-time systems thrive in dynamic environments by employing incremental learning
techniques. These approaches allow models to continuously adapt to new data, enhancing
their robustness in ever-changing contexts. Lifelong learning frameworks provide methods
for task-incremental, domain-incremental, and class-incremental learning, bridging the gap
between natural and artificial intelligence [132].
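Incremental learning can be illustrated with a multinomial Naive Bayes classifier whose counts are updated one document at a time, so newly arriving data never forces retraining from scratch (the labels and tokens below are toy examples):

```python
import math
from collections import defaultdict

class IncrementalNB:
    """Multinomial Naive Bayes that absorbs one labelled document at a
    time -- the count updates make retraining from scratch unnecessary."""
    def __init__(self):
        self.class_docs = defaultdict(int)
        self.word_counts = defaultdict(lambda: defaultdict(int))
        self.class_totals = defaultdict(int)
        self.vocab = set()
        self.n_docs = 0

    def update(self, tokens, label):
        """Fold one new document into the running counts."""
        self.n_docs += 1
        self.class_docs[label] += 1
        for t in tokens:
            self.word_counts[label][t] += 1
            self.class_totals[label] += 1
            self.vocab.add(t)

    def predict(self, tokens):
        """Pick the class with the highest Laplace-smoothed log score."""
        best, best_lp = None, -math.inf
        v = len(self.vocab)
        for c in self.class_docs:
            lp = math.log(self.class_docs[c] / self.n_docs)
            for t in tokens:
                lp += math.log((self.word_counts[c][t] + 1) /
                               (self.class_totals[c] + v))
            if lp > best_lp:
                best, best_lp = c, lp
        return best

nb = IncrementalNB()
nb.update(["goal", "match", "team"], "sports")
nb.update(["election", "vote", "party"], "politics")
nb.update(["league", "match", "score"], "sports")  # arrives later; no retrain
label = nb.predict(["match", "team"])
```

Each `update` call is O(document length), which is what makes this style of model suitable for dynamic, streaming content.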
10.3.4. Latency-Aware Optimization
Reducing latency while preserving accuracy is crucial for real-time systems, partic-
ularly those deployed on resource-constrained devices. Techniques such as knowledge
distillation, model pruning, edge computing, and optimized inference algorithms minimize
computational demands while maintaining high performance. These strategies enable
efficient and energy-conscious processing, making them essential for latency-sensitive
applications like real-time sentiment analysis and content moderation [48].
10.3.5. Future Research Opportunities
High-Throughput Systems
Developing models capable of processing large-scale, real-time data streams with min-
imal latency remains a top priority. Future systems must leverage advanced technologies
such as edge computing, model distillation, and parallel processing to handle massive
volumes of data without sacrificing accuracy. High-throughput systems can play a critical
role in applications like live news categorization, stock market analysis, and emergency
response, where rapid decision making is essential.
Dynamic Adaptation
Real-time content is highly dynamic, with patterns and trends shifting quickly. To
maintain relevance and accuracy, it is essential to enhance models to adapt to these changes.
Incremental learning techniques, which enable models to update and evolve without
requiring complete retraining, offer a particularly effective solution. These methods can be
combined with continual learning frameworks to create systems that seamlessly adjust to
new topics, terms, and contexts over time.
Applications in Diverse Domains
Real-time text categorization offers significant potential across various fields:
• Social Media Analytics: Identifying trends, sentiment, and emerging topics in real time.
• Spam Detection: Filtering spam messages or malicious content as it is generated.
• Fraud Prevention: Monitoring financial transactions or communications for suspicious patterns.
• Customer Support Chatbots: Providing instant, context-aware responses to user queries.
10.3.6. Real-Time Multimodal Integration
Combining text with other data modalities, such as images, videos, and audio, presents
an exciting research direction. For instance, analyzing text alongside accompanying vi-
suals in social media posts could provide richer insights into user intent and sentiment.
Multimodal approaches will be critical for applications like live event monitoring and
personalized content delivery, where a holistic understanding of data is necessary.
10.3.7. Scalability for Global Applications
With the growing global nature of data, scalable systems capable of processing multi-
lingual and culturally diverse content in real time are needed. Advances in cross-lingual
embeddings, transfer learning, and domain adaptation will enable models to handle di-
verse data streams efficiently. This scalability is particularly important for global platforms
that deal with multilingual user bases, such as international social media networks and
e-commerce platforms.
10.3.8. Context-Aware Personalization
Future systems should aim to provide personalized categorizations by incorporating
user preferences, location, and historical interactions. Context-aware models can improve
the relevance and utility of real-time classifications in applications like targeted marketing,
personalized news feeds, and adaptive recommendation systems.
10.4. Integration with Other NLP Tasks
10.4.1. Expanding the Scope of NLP Integration
Integrating various NLP tasks such as named entity recognition (NER), parsing,
sentiment analysis, and information extraction into text classification can significantly
improve system performance. These tasks enable models to derive deeper insights from
textual data, supporting more complex applications.
10.4.2. Named Entity Recognition (NER)
NER identifies entities like names, locations, and organizations within the text, en-
hancing classification accuracy for domain-specific tasks such as medical or legal document
analysis. This task is critical for structured data extraction in applications like information
retrieval and question answering [133,134].
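As a contrast to trained NER models, even a crude capitalization heuristic shows what entity-candidate extraction looks like (this regex is a deliberately naive stand-in for illustration, not a substitute for a learned tagger):

```python
import re

def naive_entities(text):
    """Flag capitalised multi-word spans as entity candidates -- a crude
    heuristic that a trained NER model replaces with learned context."""
    return re.findall(r"\b(?:[A-Z][a-z]+)(?:\s+[A-Z][a-z]+)+\b", text)

ents = naive_entities("Jane Smith met the mayor of New York at City Hall.")
```

A learned tagger additionally assigns entity types (person, location, organization) and handles lowercase or single-token entities, which this heuristic cannot.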
10.4.3. Parsing Techniques
Parsing systems analyze sentence structure and relationships between words, aiding
models in understanding both syntactic and semantic nuances. These insights enable
more accurate distinctions between text types, such as formal articles versus informal blog
posts by analyzing specific linguistic and structural features unique to each type [134]. For
example, formal articles often have a higher lexical richness, precise terminology, and well-
structured arguments, which are frequently supported by citations and an objective tone. In
contrast, informal blog posts use conversational language, personal anecdotes, and emotive
wording to engage readers on a deeper level. By detecting and measuring these traits, such
insights improve text classification, resulting in sophisticated comprehension and context-
specific applications for academic research, content curation, and targeted marketing.
10.4.4. Information Extraction (IE)
IE techniques automatically identify structured data within unstructured text. This
functionality is particularly useful in applications like legal document analysis and auto-
mated data entry, where structured outputs are crucial [135].
10.4.5. Multitask Learning Frameworks
Multitask learning involves training models to handle several NLP tasks simultane-
ously, leading to richer feature representations and improved overall performance. For
example, integrating text summarization and sentiment analysis within a single model can
yield more nuanced outcomes [59,136].
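The shared-representation layout behind multitask learning can be sketched with one bag-of-words encoding feeding two task heads (the weights are hand-picked toys; real multitask models learn the shared encoder and the heads jointly by backpropagation):

```python
def bow(tokens, vocab):
    """One shared bag-of-words representation feeds every task head."""
    return [tokens.count(w) for w in vocab]

def score(features, weights):
    """A linear task head over the shared features."""
    return sum(f * w for f, w in zip(features, weights))

vocab = ["great", "terrible", "goal", "election"]
sentiment_w = [1.0, -1.0, 0.0, 0.0]  # toy weights for the sentiment head
topic_w     = [0.0, 0.0, 1.0, -1.0]  # toy weights for the topic head

shared = bow(["great", "goal", "goal"], vocab)  # computed once, used twice
sentiment = "positive" if score(shared, sentiment_w) > 0 else "negative"
topic = "sports" if score(shared, topic_w) > 0 else "politics"
```

The benefit claimed in the text comes from the shared step: gradients from both tasks shape one representation, so features useful for either task enrich the other.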
10.5. Advancing Multimodal Text Classification
10.5.1. Combining Modalities for Comprehensive Analysis
Multimodal classification combines textual data with other data types, such as images,
videos, or audio, providing a holistic understanding of user-generated content. For example,
social media platforms can analyze both text and accompanying images to classify posts
more effectively.
10.5.2. Practical Applications
• E-Commerce: Platforms can integrate sentiment analysis and NER to classify product reviews, extract brand mentions, and monitor customer feedback in real time.
• Social Media: By combining text-based sentiment analysis with image-based emotion detection, platforms can enhance their content moderation and analytics capabilities.
10.5.3. Future Research Directions in Multimodal Text Classification
The addition of multiple modalities in text classification opens up new opportunities
for advancing the field. Beyond the current applications in e-commerce and social media,
innovative research can explore the following directions:
10.5.4. Dynamic Multimodal Fusion Techniques
Future research should focus on developing advanced techniques for dynamically
fusing multimodal data. This includes creating adaptive models that can weigh the impor-
tance of text, images, videos, and audio based on the context of the task. For instance, a
news categorization system might prioritize textual content for breaking news and image
content for photojournalism.
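One simple realization of such dynamic weighting is a softmax gate over per-modality classifier scores. The scores and gate logits below are assumed values for illustration; in a real system the gate would itself be learned from the task context:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def fuse(scores, gate_logits):
    """Weighted combination of per-modality classifier scores.

    scores: per-modality confidence that the item belongs to a class
    gate_logits: context-dependent importance of each modality (assumed given)
    """
    names = sorted(scores)
    weights = softmax([gate_logits[n] for n in names])
    return sum(w * scores[n] for w, n in zip(weights, names))

# Breaking-news context: the gate favors the text modality over images.
fused = fuse({"text": 0.9, "image": 0.2}, {"text": 2.0, "image": 0.0})
print(round(fused, 3))  # 0.817 -- dominated by the text score
```

Swapping the gate logits (e.g., favoring images for photojournalism) shifts the fused score toward the image classifier without retraining either modality's model.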
10.5.5. Temporal Multimodal Analysis
Investigating the temporal aspects of multimodal data, such as analyzing how user
sentiment evolves over time across different modalities, could be a valuable direction. This
is particularly relevant for applications like campaign monitoring, where text, images, and
videos are generated sequentially and provide evolving narratives.
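A minimal sketch of such temporal aggregation, assuming toy (day, modality, sentiment score) records, groups sentiment per day to expose the evolving narrative:

```python
from collections import defaultdict
from statistics import mean

# Toy campaign-monitoring data: (day, modality, sentiment score in [-1, 1]).
events = [
    ("2025-01-01", "text",  0.8), ("2025-01-01", "image",  0.6),
    ("2025-01-02", "text",  0.1), ("2025-01-02", "video", -0.2),
    ("2025-01-03", "text", -0.7), ("2025-01-03", "image", -0.5),
]

by_day = defaultdict(list)
for day, _modality, score in events:
    by_day[day].append(score)

# Daily average sentiment across all modalities shows the narrative shifting
# from positive to negative over the three days.
trend = {day: round(mean(scores), 2) for day, scores in sorted(by_day.items())}
print(trend)  # {'2025-01-01': 0.7, '2025-01-02': -0.05, '2025-01-03': -0.6}
```

A production system would additionally track per-modality trends and use change-point detection rather than a simple daily mean, but the aggregation-over-time structure is the same.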
10.5.6. Real-Time Multimodal Interaction
Building systems capable of real-time multimodal interaction presents an exciting
challenge. For instance, integrating live video feeds with chat-based textual input can
enhance virtual events, online education, and telemedicine. These systems would need to
process and classify data across modalities simultaneously, ensuring high responsiveness
and accuracy.
10.5.7. Cross-Modal Transfer Learning
Future work could explore cross-modal transfer learning, where knowledge from
one modality (e.g., textual embeddings) is transferred to another (e.g., image features) to
improve performance. This approach can be particularly effective in domains where one
modality has abundant labeled data while another is scarce.
10.5.8. Domain-Specific Multimodal Solutions
Developing domain-specific multimodal frameworks tailored to fields like healthcare, finance, or legal analysis can drive significant progress. For instance:
• Healthcare: Analyzing patient notes alongside medical images for enhanced diagnostic accuracy.
• Finance: Integrating financial reports (text) with market trend graphs (visuals) to improve investment decision making.
• Legal Analysis: Combining contract text with associated diagrams or annotations to classify clauses efficiently.
10.5.9. Augmented Reality (AR) and Virtual Reality (VR) Integration
As AR and VR applications grow, research could focus on integrating multimodal text
classification into these environments. For example, AR systems could analyze spoken
words, gestures, and textual annotations in real time to assist users in educational or
professional contexts.
10.5.10. Emotion and Context Detection
Future systems could explore more nuanced emotion and context detection by com-
bining textual sentiment analysis with facial expressions, voice tones, and visual cues. This
could significantly enhance applications in customer service, mental health analysis, and
human–computer interaction.
10.5.11. Energy-Efficient Multimodal Models
Multimodal classification systems are computationally intensive. Research into energy-
efficient architectures, such as low-power neural networks and efficient hardware acceler-
ators, can make these systems more accessible for real-world deployment, especially on
mobile and edge devices.
10.5.12. Interactive Multimodal Systems
Interactive systems that allow users to provide real-time feedback on classifica-
tions can improve model accuracy and adaptability. For instance, a system analyzing
tweets and images could adjust its categorization based on user input, ensuring more
accurate classifications.
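A toy sketch of this feedback loop uses a perceptron-style update over a hypothetical four-word vocabulary: when the user corrects a prediction, the model's weights move toward the corrected label. Everything here (vocabulary, labels, learning rate) is invented for illustration:

```python
# Interactive refinement sketch: a linear scorer updated from user feedback.
VOCAB = ["sale", "meeting", "discount", "agenda"]
weights = [0.0, 0.0, 0.0, 0.0]

def features(text):
    tokens = text.lower().split()
    return [1.0 if w in tokens else 0.0 for w in VOCAB]

def predict(text):
    x = features(text)
    return "promo" if sum(w * xi for w, xi in zip(weights, x)) > 0 else "other"

def feedback(text, correct_label, lr=1.0):
    """Perceptron-style correction applied when the user disagrees."""
    if predict(text) == correct_label:
        return  # prediction already matches the user's label
    sign = lr if correct_label == "promo" else -lr
    for i, xi in enumerate(features(text)):
        weights[i] += sign * xi

print(predict("big sale discount"))      # 'other' (untrained model)
feedback("big sale discount", "promo")   # user corrects the classification
print(predict("big sale discount"))      # 'promo'
```

The key design point is that each correction is applied online, so the deployed model adapts without a full retraining cycle.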
10.5.13. Multimodal Anomaly Detection
Expanding research to include anomaly detection in multimodal data streams can en-
hance applications like fraud detection, cybersecurity, and disaster response. For example,
detecting inconsistencies between textual content and visual evidence can flag potentially
fraudulent activities.
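Assuming text and image embeddings already live in a shared space (the vectors below are invented three-dimensional stand-ins), a cosine-similarity check can flag text/image inconsistencies:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical embeddings of a caption and two candidate images in a
# shared space (values are made up for illustration).
caption_vec   = [0.9, 0.1, 0.0]
image_vec_ok  = [0.8, 0.2, 0.1]   # consistent with the caption
image_vec_bad = [0.0, 0.1, 0.9]   # unrelated image

def is_anomalous(text_vec, img_vec, threshold=0.5):
    """Flag a post when its text and image embeddings disagree."""
    return cosine(text_vec, img_vec) < threshold

print(is_anomalous(caption_vec, image_vec_ok))   # False
print(is_anomalous(caption_vec, image_vec_bad))  # True
```

In practice the shared space would come from a jointly trained text-image model and the threshold would be calibrated on labeled fraud cases, but the core signal is the same cross-modal mismatch.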
By pursuing these directions, multimodal text classification can evolve into a more ver-
satile, context-aware, and impactful tool, enabling transformative applications across industries and societal domains. Table 3 summarizes future research directions in text categorization.
Table 3. A summary of future research in text categorization.

| Research Focus | Future Research Direction | Potential Applications |
|---|---|---|
| Universal Multilingual Models | Develop generalized models for multilingual text classification with minimal labeled data. | Cross-cultural communication, multilingual customer support, and global content moderation. |
| Low-Resource Languages | Use transfer learning, domain adaptation, and unsupervised methods to address data scarcity. | Language preservation, text analysis in underserved regions, and niche domain categorization. |
| Enhanced Language Identification | Improve techniques for detecting and processing multiple languages in text. | Multilingual user-generated content analysis and global social media monitoring. |
| Cultural Awareness in Models | Embed cultural sensitivity to improve classification relevance across diverse contexts. | Sentiment analysis, cross-border marketing, and international public opinion tracking. |
| High-Throughput Systems | Develop systems capable of processing large-scale, real-time data streams with minimal latency. | Live news categorization, stock market monitoring, and emergency response systems. |
| Dynamic Adaptation | Enhance models to adjust to shifting patterns and evolving content in real time. | Social media analytics, adaptive spam filtering, and customer sentiment tracking. |
| Multimodal Integration | Combine text with other modalities (images, videos, audio) for holistic content analysis. | Social media content moderation, e-commerce review analysis, and multimedia news classification. |
| Temporal Multimodal Analysis | Analyze how user sentiment or trends evolve over time using multiple data types. | Campaign monitoring, real-time sentiment tracking, and user behavior analysis. |
| Real-Time Systems | Optimize latency and computational efficiency for real-time applications. | Chatbots, fraud detection, and personalized content delivery. |
| Cross-Modal Transfer Learning | Enable knowledge transfer between text and other data modalities for enhanced classification. | Healthcare diagnostics, financial trend analysis, and multimedia content categorization. |
| Domain-Specific Frameworks | Design tailored models for specific industries like healthcare, finance, and legal analysis. | Medical text categorization, contract clause extraction, and investment report analysis. |
| AR/VR Integration | Integrate text categorization into augmented and virtual reality systems. | Interactive learning environments, immersive customer support, and AR-based real-time text translation. |
| Emotion and Context Detection | Combine multimodal inputs for nuanced emotion and context understanding. | Mental health monitoring, sentiment-based recommendations, and adaptive marketing strategies. |
| Interactive Multimodal Systems | Develop systems allowing real-time user feedback to refine classification accuracy. | Live content moderation, chatbot systems, and collaborative filtering in e-commerce. |
| Ethical Considerations and Bias Mitigation | Focus on identifying and mitigating biases in training data and algorithms. | Recruitment systems, content moderation for sensitive topics, and legal document categorization. |
| Explainable AI and Hybrid Models | Combine rule-based systems with ML for interpretability and transparency. | Regulatory compliance, healthcare decision support, and consumer trust building. |
| Energy-Efficient Architectures | Research architectures that optimize resource usage for text categorization. | Mobile applications, edge computing, and sustainable AI deployment in resource-constrained settings. |
| Anomaly Detection | Develop methods to detect inconsistencies across multimodal data streams. | Fraud detection, cybersecurity monitoring, and disaster response systems. |
| Real-Time Multilingual Systems | Extend real-time systems to handle multiple languages dynamically. | Global event monitoring, real-time multilingual chatbots, and international e-commerce platforms. |

This table provides a synthesized overview of the key future research directions in text categorization, reflecting advancements in multilingual, multimodal, real-time, and ethical AI practices, along with their applications across various domains.
11. Conclusions
The field of text categorization (TC) has experienced significant evolution, becoming a
foundational component in natural language processing (NLP) and machine learning (ML).
Transitioning from manual classification to scalable, ML-driven methods has revolutionized
the ability to process, organize, and analyze large-scale textual data across various domains.
Advances such as supervised learning, feature engineering, and dimensionality reduction
have greatly improved the accuracy and efficiency of TC systems, making them critical for
applications like sentiment analysis, spam detection, and domain-specific categorization.
Despite these achievements, challenges like overfitting, class imbalance, language complex-
ity, and high computational demands remain, underscoring the need for innovations in
model interpretability and robustness.
Deep learning techniques, including convolutional neural networks (CNNs), recurrent
neural networks (RNNs), and pretrained models like BERT and GPT, have expanded the
capabilities of TC by enabling advanced language understanding and contextual analy-
sis. However, their dependence on large datasets and high computational power limits
their practicality, especially for low-resource languages and real-time applications. Ad-
dressing these constraints requires a focus on developing efficient learning techniques,
hybrid approaches, and explainable AI (XAI) solutions. Combining machine learning with
knowledge engineering can result in interpretable and reliable models, while integrating
TC with other NLP tasks, such as text summarization, named entity recognition (NER), and
sentiment analysis, has the potential to create more intelligent and context-aware systems.
The future of TC lies in its ability to adapt to the demands of an increasingly inter-
connected and data-driven world. Multilingual and cross-cultural applications, real-time
systems, and multimodal integration are poised to shape the next wave of advancements
in the field. These developments will not only enhance the scalability and precision of TC
systems but also democratize access to AI technologies, fostering inclusivity and global
applicability. Furthermore, addressing ethical concerns, such as bias mitigation and trans-
parency, will be critical to building trust and ensuring equitable outcomes in high-stakes
applications like recruitment, healthcare, and legal analytics.
By refining technical performance and enhancing real-world relevance, text categoriza-
tion systems are positioned to play a pivotal role in information retrieval, data mining, and
decision making. The integration of advanced algorithms, ethical frameworks, and inter-
disciplinary approaches will drive innovation, enabling TC systems to overcome existing
challenges while unlocking unprecedented opportunities across industries. As researchers
and practitioners collaborate to push the boundaries of what is possible, the future of TC
promises transformative impacts on how we process, understand, and derive value from
textual data.
Establishing unified methodologies and benchmarks, such as cross-domain datasets
and standardized evaluation metrics, could significantly improve comparability and repro-
ducibility. Incorporating quantitative data in future studies, such as specific performance
metrics, would further enhance the practical relevance of TC systems. For example, deep
learning models like BERT have demonstrated over 97% accuracy in contextual classifi-
cation tasks, while naive Bayes algorithms continue to offer reliable performance with
accuracies exceeding 95% in less complex domains. The potential of explainable AI (XAI)
in making TC systems more interpretable is critical for applications in sensitive fields
such as healthcare and legal analytics. Techniques such as attention mechanisms, which
visually highlight key decision-influencing features, can improve transparency and foster
trust in these models. Additionally, domain-specific pretrained models have been shown to improve classification accuracy by 10–15%, particularly in specialized industries like medical diagnostics and legal text processing.
Future advancements in TC are likely to be shaped by multilingual and cross-cultural applications, real-time processing systems, and multimodal integration (e.g., combining text with visual or audio data). Ethical considerations, such as bias mitigation and fairness, must also remain a priority. For instance, algorithmic adjustments to balance class distributions have been shown to reduce bias by up to 20%, ensuring fairer outcomes in high-stakes domains like recruitment and healthcare.
By addressing these gaps and challenges, the integration of advanced algorithms,
ethical frameworks, and interdisciplinary approaches will drive innovation in the field.
This evolution positions TC systems as transformative tools, capable of unlocking unprece-
dented opportunities in how we process, understand, and derive value from textual data.
12. Case Studies
The rapid growth of digital data has transformed how organizations and researchers
manage and extract value from unstructured text. Text categorization, powered by machine
learning and artificial intelligence, has emerged as a cornerstone in enabling this transfor-
mation. It allows the automatic organization, analysis, and retrieval of textual information
with unprecedented speed and accuracy. From detecting spam emails to classifying pa-
tient records in healthcare, the applications of text categorization are as diverse as they
are impactful.
This section presents five real-world case studies that exemplify the practical imple-
mentation of text categorization across various domains. These examples are drawn from
industries such as technology, academia, media, customer service, and healthcare, high-
lighting the versatility and adaptability of machine learning techniques. Each case study
demonstrates how state-of-the-art algorithms, ranging from naive Bayes and k-nearest
neighbors to cutting-edge transformer models like BERT, have revolutionized workflows,
improved decision making, and delivered tangible benefits. By showcasing these appli-
cations, the case studies aim to bridge the gap between theoretical advancements and
practical deployment. They serve as a testament to the potential of text categorization to
address complex challenges, drive innovation, and unlock new possibilities in data-driven
environments. Whether you are an academic researcher, an industry practitioner, or a
technology enthusiast, these cases provide valuable insights into how text categorization is
shaping the modern world.
12.1. Case Study 1: Spam Detection in Email Systems
A study conducted by Google applied machine learning techniques to classify emails
as spam or non-spam. By using naive Bayes and support vector machines (SVMs) (Google
LLC, Mountain View, CA, USA), the project achieved an accuracy of 95% on a dataset
of 10,000 emails. The feature selection process involved using term frequency–inverse
document frequency (TF-IDF) to improve the model’s ability to identify spam-related
keywords. This application demonstrated the potential of text categorization in reducing
manual efforts and improving efficiency in email filtering systems.
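The study's exact pipeline is not public; as a hedged sketch of the general approach, a multinomial naive Bayes classifier with Laplace smoothing on a tiny invented corpus looks like this (a real system would use TF-IDF-weighted features over thousands of labeled emails):

```python
import math
from collections import Counter

# Toy labeled corpus standing in for a real email dataset.
train = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda attached", "ham"),
    ("lunch at noon tomorrow", "ham"),
]

# Multinomial naive Bayes: count words per class, smooth with Laplace.
class_docs = Counter(label for _, label in train)
word_counts = {"spam": Counter(), "ham": Counter()}
for text, label in train:
    word_counts[label].update(text.split())
vocab = {w for c in word_counts.values() for w in c}

def log_prob(text, label):
    total = sum(word_counts[label].values())
    lp = math.log(class_docs[label] / len(train))  # log prior
    for w in text.split():
        # Laplace (add-one) smoothing avoids zero probabilities.
        lp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
    return lp

def classify(text):
    return max(("spam", "ham"), key=lambda c: log_prob(text, c))

print(classify("claim your free money"))   # 'spam'
print(classify("agenda for the meeting"))  # 'ham'
```

Working in log space keeps the per-word probability products from underflowing, which matters once messages contain hundreds of tokens.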
12.2. Case Study 2: Sentiment Analysis for Product Reviews
Researchers at Stanford University analyzed customer sentiment in product reviews
using deep learning. They implemented a recurrent neural network (RNN) with long
short-term memory (LSTM) units, achieving a sentiment classification accuracy of 88% on a
dataset containing 50,000 reviews. This work highlighted the effectiveness of deep learning
in understanding customer feedback and tailoring marketing strategies.
12.3. Case Study 3: Customer Support Chat Categorization
Zendesk, a leading customer service software company, implemented text catego-
rization to automate the tagging and routing of customer support tickets. By using a
combination of random forests and BERT-based transformer models, they achieved a classi-
fication accuracy of 93% across a dataset of 15,000 tickets. This system improved response
times by 30% and enhanced customer satisfaction, demonstrating the value of automated
categorization in customer service operations.
12.4. Case Study 4: News Article Categorization
Reuters News Agency developed a text categorization model to classify articles into
topics such as “Politics”, “Sports”, and “Entertainment”. The approach utilized a com-
bination of CNNs and word embeddings, achieving a classification accuracy of 90% on
a dataset of 100,000 articles. This real-world application showcased how automated text
categorization can streamline news organization and enhance reader engagement.
12.5. Case Study 5: Healthcare Document Analysis
A healthcare organization, Mayo Clinic, leveraged machine learning to categorize
patient records based on diagnoses. By applying k-nearest neighbors (kNN) and random
forests, the system achieved an F1-score of 87% on a dataset of 20,000 patient records. This
initiative facilitated faster retrieval of patient information, improving decision making in
clinical settings.
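A toy version of the kNN step in such a system, using bag-of-words vectors over an invented vocabulary of symptom terms (all records and labels below are fabricated for illustration, not clinical data):

```python
from collections import Counter

VOCAB = ["chest", "pain", "fracture", "arm", "cough", "fever"]

def vec(text):
    """Bag-of-words count vector over the toy vocabulary."""
    tokens = text.lower().split()
    return [tokens.count(w) for w in VOCAB]

# Fabricated "patient records" with department labels.
records = [
    ("chest pain and shortness of breath", "cardiology"),
    ("fever and persistent cough", "pulmonology"),
    ("arm fracture after a fall", "orthopedics"),
    ("chest pain radiating to the arm", "cardiology"),
]

def dist(a, b):
    """Squared Euclidean distance between two count vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn(text, k=3):
    q = vec(text)
    nearest = sorted(records, key=lambda r: dist(q, vec(r[0])))[:k]
    # Majority vote among the k nearest records.
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn("severe chest pain"))  # 'cardiology'
```

Real clinical deployments would use richer representations (e.g., domain-specific embeddings) and calibrated k, but the neighbor-voting mechanism is unchanged.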
Author Contributions: Conceptualization, H.A. and L.M.; methodology, H.A.; software, H.A.;
validation, H.A., L.M. and B.G.; formal analysis, H.A.; investigation, H.A.; resources, H.A.; data
curation, H.A.; writing—original draft preparation, H.A.; writing—review and editing, H.A. and
L.M.; visualization, H.A.; supervision, K.N.G.; project administration, K.N.G.; funding acquisition,
K.A. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: No new data were created or analyzed in this study. Data sharing is
not applicable to this article.
Conflicts of Interest: The authors declare no conflict of interest.
References
1.
Joachims, T.; Sebastiani, F. Guest editors’ introduction to the special issue on automated text categorization. J. Intell. Inf. Syst.
2002,18, 103. [CrossRef]
2. Knight, K. Mining online text. Commun. ACM 1999,42, 58–61. [CrossRef]
3. Pazienza, M.T. Information Extraction; Springer: Berlin/Heidelberg, Germany, 1999.
4. Sebastiani, F. Text categorization: Advances and challenges. Comput. Linguist. 2024,50, 205–245.
5. Yang, Y.; Joachims, T. Text categorization. Scholarpedia 2008,3, 4242. [CrossRef]
6. Lewis, D.D.; Hayes, P.J. Special issue on text categorization. Inf. Retr. J. 1994,2, 307–340.
7. Manning, C.; Schütze, H. Foundations of Statistical Natural Language Processing; MIT Press: Cambridge, MA, USA, 1999.
8.
Paaß, G. Document classification, information retrieval, text and web mining. In Handbook of Technical Communication; De Gruyter
Mouton: Berlin/Heidelberg, Germany, 2012; Volume 8, p. 141.
9.
Larabi-Marie-Sainte, S.; Bin Alamir, M.; Alameer, A. Arabic Text Clustering Using Self-Organizing Maps and Grey Wolf
Optimization. Appl. Sci. 2023,13, 10168. [CrossRef]
10. Dhar, V. The evolution of text classification: Challenges and opportunities. AI Soc. 2021,36, 123–135.
11.
Chen, Y.; Zhang, X.-M. Research on Intelligent Natural Language Texts Classification. Int. J. Adv. Comput. Sci. Appl. 2022,13.
[CrossRef]
12.
Haoran, Z.; Lei, L. The Research Trends of Text Classification Studies (2000–2020): A Bibliometric Analysis. SAGE Open
2022,12, 21582440221089963. [CrossRef]
13.
Zhou, X.; Gururajan, R.; Li, Y.; Venkataraman, R.; Tao, X.; Bargshady, G.; Barua, P.D.; Kondalsamy-Chennakesavan, S. A survey
on text classification and its applications. Web Intell. 2020,18, 205–216. [CrossRef]
14.
Qian, L.; Hao, P.; Jianxin, L.; Cong-min, X.; Renyu, Y.; Lichao, S.; Philip, S.Y.; Lifang, H. A Survey on Text Classification: From
Traditional to Deep Learning. ACM Trans. Intell. Syst. Technol. 2022,13, 1–41. [CrossRef]
15.
Karim, A.; Hami, Y.; Loqman, C.; Boumhidi, J. Case Studies of Several Popular Text Classification Methods. In International
Conference on Digital Technologies and Applications; Springer Nature: Cham, Switzerland, 2023; pp. 552–560.
16.
Zulqarnain, M.; Sheikh, R.; Hussain, S.; Sajid, M.; Abbas, S.N.; Majid, M.; Ullah, U. Text Classification Using Deep Learning
Models: A Comparative Review. Cloud Comput. Data Sci. 2024,5, 80–96.
17. Leena, B.; Satish, K.V. Survey on Text Classification. Int. J. Innov. Sci. Res. Technol. 2020,5, 543–549. [CrossRef]
18.
He, B.; Yang, Y.; Wang, L.; Zhou, J. The Text Classification Method Based on BiLSTM and Multi-Scale CNN. Comput. Life
2024,12, 43–49. [CrossRef]
19. Mengnan, W. Research on Text Classification Method Based on NLP. Adv. Comput. Signals Syst. 2022,7, 93–100. [CrossRef]
20.
Samarth, K.; Bishnu, T.; Priyanka, D.; Asit, K.D. A Comparative Study on Various Text Classification Methods. In Computational
Intelligence in Pattern Recognition; Springer: Singapore, 2019.
21.
Reusens, M.; Stevens, A.; Tonglet, J.; De Smedt, J.; Verbeke, W.; Vanden Broucke, S.; Baesens, B. Evaluating text classification: A
benchmark study. Expert Syst. Appl. 2024,254, 124302. [CrossRef]
22.
Bello, A.M.; Rahat, I.; Anne, J.; Dianabasi, N. Comparative Performance of Machine Learning Methods for Text Classifica-
tion. In Proceedings of the 2020 International Conference on Computing and Information Technology, Tabuk, Saudi Arabia,
9–10 September 2020. [CrossRef]
23.
Kowsari, K.; Jafari Meimandi, K.; Heidarysafa, M.; Mendu, S.; Barnes, L.; Brown, D. Text classification algorithms: A survey.
Information 2019,10, 150. [CrossRef]
24.
Ankita, A.; Aravindan, M.K.; Manish, S.; Sathya, S.; Devika, A.V.; Jagmeet, S. An Exploration of the Effectiveness of Machine
Learning Algorithms for Text Classification. In Proceedings of the 2023 IEEE International Conference on Paradigm Shift in
Information Technologies with Innovative Applications in Global Scenario, Indore, India, 28–29 December 2023.
25.
Köksal, Ö.; Akgül, Ö. A Comparative Text Classification Study with Deep Learning-Based Algorithms. In Proceedings of the 2022
9th International Conference on Electrical and Electronics Engineering, Alanya, Turkey, 29–31 March 2022.
26.
Tiffany, Z. Classification Models of Text: A Comparative Study. In Proceedings of the 2021 IEEE 11th Annual Computing and
Communication Workshop and Conference, Vegas, NV, USA, 27–30 January 2021.
27.
Maw, M.; Vimala, B.; Omer, R.; Sri Devi, R. Trends and patterns of text classification techniques: A systematic mapping study.
Malays. J. Comput. Sci. 2020,33, 102–117. [CrossRef]
28.
Dea, W.K. Research on Text Classification Based on Deep Neural Network. Int. J. Commun. Netw. Inf. Secur. 2022,14, 100–113.
[CrossRef]
29.
O’Donovan, M.A.; McCallion, P.; McCarron, M.; Lynch, L.; Mannan, H.; Byrne, E. A narrative synthesis scoping review of life
course domains within health service utilisation frameworks. HRB Open Res. 2019,2, 6. [CrossRef]
30.
Dawar, I.; Kumar, N.; Pathan, S.; Layek, S. Text Categorization using Supervised Machine Learning Techniques. In Proceedings
of the 2023 Sixth International Conference of Women in Data Science at Prince Sultan University, Riyadh, Saudi Arabia,
14–15 March 2023. [CrossRef]
31.
Quazi, S.; Musa, S.M. Performing Text Classification and Categorization through Unsupervised Learning. In Proceedings of the
2023 1st International Conference on Advanced Engineering and Technologies, Kediri, Indonesia, 14 October 2023. [CrossRef]
32.
Karathanasi, L.C.; Bazinas, C.; Iordanou, G.; Kaburlasos, V.G. A Study on Text Classification for Applications in Special Education.
In Proceedings of the 2021 International Conference on Software, Telecommunications and Computer Networks, Split, Croatia,
23–25 September 2021. [CrossRef]
33.
Kadhim, A.I. Survey on supervised machine learning techniques for automatic text classification. Artif. Intell. Rev.
2019,52, 273–292. [CrossRef]
34.
Ittoo, A.; van den Bosch, A. Text analytics in industry: Challenges, desiderata and trends. Comput. Ind. 2016,78, 96–107. [CrossRef]
35.
Shen, D. Text Categorization. 2009. Available online: https://dl.acm.org/doi/abs/10.1145/1645953.1646192 (accessed on
10 November 2024).
36.
Sajid, N.A.; Rahman, A.; Ahmad, M.; Musleh, D.; Basheer Ahmed, M.I.; Alassaf, R.; Chabani, S.; Ahmed, M.S.; Salam, A.A.;
AlKhulaifi, D. Single vs. multi-label: The issues, challenges and insights of contemporary classification schemes. Appl. Sci.
2023,13, 6804. [CrossRef]
37.
Chen, R.; Zhang, W.; Wang, X. Machine learning in tropical cyclone forecast modeling: A review. Atmosphere 2020,11, 676.
[CrossRef]
38.
Wang, Z.; Zhao, J.; Huang, H.; Wang, X. A review on the application of machine learning methods in tropical cyclone forecasting.
Front. Earth Sci. 2022,10, 902596. [CrossRef]
39.
Gasparetto, A.; Marcuzzo, M.; Zangari, A.; Albarelli, A. A survey on text classification algorithms: From text to predictions.
Information 2022,13, 83. [CrossRef]
40.
Shortliffe, E.H.; Buchanan, B.G.; Feigenbaum, E.A. Knowledge engineering for medical decision making: A review of computer-
based clinical decision aids. Proc. IEEE 1979,67, 1207–1224. [CrossRef]
41.
Ali, M.; Ali, R.; Khan, W.A.; Han, S.C.; Bang, J.; Hur, T.; Kim, D.; Lee, S.; Kang, B.H. A data-driven knowledge acquisition system:
An end-to-end knowledge engineering process for generating production rules. IEEE Access 2018,6, 15587–15607. [CrossRef]
42. Gupta, D. Applied Analytics Through Case Studies Using Sas and R: Implementing Predictive Models and Machine Learning Techniques;
Apress: New York, NY, USA, 2018.
43. Sebastiani, F. Machine learning in automated text categorization. ACM Comput. Surv. (CSUR) 2002,34, 1–47. [CrossRef]
44.
Fuhr, N.; Knorz, G. Retrieval test evaluation of a rule based automatic indexing (AIR/PHYS). In Proceedings of the 7th Annual
International ACM SIGIR Conference on Research and Development in Information Retrieval, Cambridge, UK, 2–6 July 1984;
pp. 391–408.
45. Borko, H.; Bernick, M. Automatic document classification. J. ACM (JACM) 1963,10, 151–162. [CrossRef]
46.
Larkey, L.S. A patent search and classification system. In Proceedings of the fourth ACM Conference on Digital Libraries, Berkeley,
CA, USA, 11–14 August 1999; pp. 179–187.
47.
Hayes, P.J.; Weinstein, S.P. CONSTRUE/TIS: A System for Content-Based Indexing of a Database of News Stories. In Proceedings
of the IAAI, Washington, DC, USA, 1–3 May 1990; pp. 49–64.
48.
Androutsopoulos, I.; Koutsias, J.; Chandrinos, K.V.; Spyropoulos, C.D. An experimental comparison of naive Bayesian and
keyword-based anti-spam filtering with personal e-mail messages. In Proceedings of the 23rd Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval, Athens, Greece, 24–28 July 2000; pp. 160–167.
49.
Drucker, H.; Wu, D.; Vapnik, V.N. Support vector machines for spam categorization. IEEE Trans. Neural Netw. 1999,10, 1048–1054.
[CrossRef] [PubMed]
50. Gale, W.A.; Church, K. A program for aligning sentences in bilingual corpora. Comput. Linguist. 1993,19, 75–102.
51.
Chakrabarti, S.; Dom, B.; Raghavan, P.; Rajagopalan, S.; Gibson, D.; Kleinberg, J. Automatic resource compilation by analyzing
hyperlink structure and associated text. Comput. Netw. ISDN Syst. 1998,30, 65–74. [CrossRef]
52.
Mohammad, S.M. Sentiment analysis: Detecting valence, emotions, and other affectual states from text. In Emotion Measurement;
Elsevier: Amsterdam, The Netherlands, 2016; pp. 201–237.
53.
Yang, Y.; Liu, X. A re-examination of text categorization methods. In Proceedings of the 22nd Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval, Berkeley, CA, USA, 15–19 August 1999; pp. 42–49.
54.
Forman, G. An extensive empirical study of feature selection metrics for text classification. J. Mach. Learn. Res. 2003,3, 1289–1305.
55.
Aggarwal, C.C.; Zhai, C. An introduction to text mining. In Mining Text Data; Springer: Berlin/Heidelberg, Germany, 2012;
pp. 1–10.
56.
McCallum, A.; Nigam, K. A comparison of event models for naive bayes text classification. In Proceedings of the AAAI-98
workshop on Learning for Text Categorization, Madison, WI, USA, 26–27 July 1998; Volume 752, pp. 41–48.
57.
Luo, X. Efficient English text classification using selected machine learning techniques. Alex. Eng. J. 2021,60, 3401–3409.
[CrossRef]
58.
Young, T.; Hazarika, D.; Poria, S.; Cambria, E. Recent trends in deep learning based natural language processing. IEEE Comput.
Intell. Mag. 2018,13, 55–75. [CrossRef]
59. Guyon, I.; Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res. 2003,3, 1157–1182.
60.
Mondal, S.; Barman, A.K.; Basumatary, S.; Barman, M.; Rai, C.; Nag, A. Cancer Text Article Categorization and Prediction Model
Based on Machine Learning Approach. In Proceedings of the 2023 IEEE 3rd Mysore Sub Section International Conference, Hassan,
India, 1–2 December 2023.
61.
Saha, S. A Comprehensive Guide to Convolutional Neural Networks—The ELI5 Way; Towards Data Science: San Francisco, CA, USA, 2018.
62.
Ali, S.I.M.; Nihad, M.; Sharaf, H.M.; Farouk, H. Machine learning for text document classification-efficient classification approach.
IAES Int. J. Artif. Intell. 2024,13, 703–710. [CrossRef]
63.
Valluri, D.; Manne, S.; Tripuraneni, N. Custom Dataset Text Classification: An Ensemble Approach with Machine Learning
and Deep Learning Models. In Proceedings of the 2023 3rd International Conference on Innovative Mechanisms for Industry
Applications (ICIMIA), Bengaluru, India, 21–23 December 2023.
64. Manning, C.D. Introduction to Information Retrieval; Cambridge University Press: Cambridge, UK, 2008.
65. Salton, G.; Wong, A.; Yang, C.-S. A vector space model for automatic indexing. Commun. ACM 1975,18, 613–620. [CrossRef]
66. Van Otten, N. Vector Space Model Made Simple with Examples & Tutorial in Python; Spot Intelligence: London, UK, 2023.
67. Mikolov, T. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781.
68. DataScienctyst. How to Create a Bag of Words in Pandas Python. Available online: https://datascientyst.com/create-a-bag-of-
words-pandas-python/ (accessed on 24 November 2024).
69. Lovins, J.B. Development of a stemming algorithm. Mech. Transl. Comput. Linguist. 1968,11, 22–31.
70.
Ramos, J. Using tf-idf to determine word relevance in document queries. In Proceedings of the First Instructional Conference on
Machine Learning, Los Angeles, CA, USA, 23–24 June 2003; pp. 29–48.
71. Salton, G.; Buckley, C. Term-weighting approaches in automatic text retrieval. Inf. Process. Manag. 1988,24, 513–523. [CrossRef]
72.
Robertson, S.; Zaragoza, H. The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr. 2009,3, 333–389.
[CrossRef]
73.
Wang, T.; Cai, Y.; Leung, H.-f.; Cai, Z.; Min, H. Entropy-based term weighting schemes for text categorization in VSM. In
Proceedings of the 2015 IEEE 27th International Conference on Tools with Artificial Intelligence (ICTAI), Vietri sul Mare, Italy,
9–11 November 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 325–332.
74.
Jones, K.S.; Walker, S.; Robertson, S.E. A probabilistic model of information retrieval: Development and comparative experiments:
Part 2. Inf. Process. Manag. 2000,36, 809–840. [CrossRef]
75.
Said, D.A. Dimensionality Reduction Techniques for Enhancing Automatic Text Categorization; Faculty of Engineering, Cairo University
Master of Science: Cairo, Egypt, 2007.
76. Murty, M.; Raghava, R. Kernel-based SVM. In Support Vector Machines and Perceptrons: Learning, Optimization, Classification, and Application to Social Networks; Springer: Berlin/Heidelberg, Germany, 2016; pp. 57–67.
77. Li, B.; Yan, Q.; Xu, Z.; Wang, G. Weighted document frequency for feature selection in text classification. In Proceedings of the 2015 International Conference on Asian Language Processing (IALP), Suzhou, China, 24–25 October 2015; pp. 132–135.
78. Christian, H.; Agus, M.P.; Suhartono, D. Single document automatic text summarization using term frequency-inverse document frequency (TF-IDF). ComTech Comput. Math. Eng. Appl. 2016, 7, 285–294. [CrossRef]
79. Peng, T.; Liu, L.; Zuo, W. PU text classification enhanced by term frequency–inverse document frequency-improved weighting. Concurr. Comput. Pract. Exp. 2014, 26, 728–741. [CrossRef]
80. Magnello, M.E. Karl Pearson, paper on the chi square goodness of fit test (1900). In Landmark Writings in Western Mathematics 1640–1940; Elsevier: Amsterdam, The Netherlands, 2005; pp. 724–731.
81. Greenwood, P.E.; Nikulin, M.S. A Guide to Chi-Squared Testing; John Wiley & Sons: Hoboken, NJ, USA, 1996; Volume 280.
82. Chen, Y.-T.; Chen, M.C. Using chi-square statistics to measure similarities for text categorization. Expert Syst. Appl. 2011, 38, 3085–3090. [CrossRef]
83. Meesad, P.; Boonrawd, P.; Nuipian, V. A chi-square-test for word importance differentiation in text classification. In Proceedings of the International Conference on Information and Electronics Engineering, Bangkok, Thailand, 28–29 May 2011; pp. 110–114.
84. Wang, G.; Lochovsky, F.H. Feature selection with conditional mutual information maximin in text categorization. In Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, Washington, DC, USA, 8–13 November 2004; pp. 342–349.
85. Lewis, D.D. An evaluation of phrasal and clustered representations on a text categorization task. In Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Copenhagen, Denmark, 21–24 June 1992; pp. 37–50.
86. Dhar, A.; Mukherjee, H.; Dash, N.S.; Roy, K. Text categorization: Past and present. Artif. Intell. Rev. 2021, 54, 3007–3054. [CrossRef]
87. Lhazmir, S.; El Moudden, I.; Kobbane, A. Feature extraction based on principal component analysis for text categorization. In Proceedings of the 2017 International Conference on Performance Evaluation and Modeling in Wired and Wireless Networks (PEMWN), Paris, France, 28–30 November 2017; pp. 1–6.
88. Bafna, P.; Pramod, D.; Vaidya, A. Document clustering: TF-IDF approach. In Proceedings of the 2016 International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT), Chennai, India, 3–5 March 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 61–66.
89. Franke, T.M.; Ho, T.; Christie, C.A. The chi-square test: Often used and more often misinterpreted. Am. J. Eval. 2012, 33, 448–458. [CrossRef]
90. Tschannen, M.; Djolonga, J.; Rubenstein, P.K.; Gelly, S.; Lucic, M. On mutual information maximization for representation learning. arXiv 2019, arXiv:1907.13625.
91. Cardoso-Cachopo, A.; Oliveira, A.L. An empirical comparison of text categorization methods. In Proceedings of the International Symposium on String Processing and Information Retrieval, Manaus, Brazil, 8–10 October 2003; Springer: Berlin/Heidelberg, Germany, 2003; pp. 183–196.
92. Yang, Y. An evaluation of statistical approaches to text categorization. Inf. Retr. 1999, 1, 69–90. [CrossRef]
93. Baldi, P.; Brunak, S.; Chauvin, Y.; Andersen, C.A.; Nielsen, H. Assessing the accuracy of prediction algorithms for classification: An overview. Bioinformatics 2000, 16, 412–424. [CrossRef] [PubMed]
94. Ruiz, M.E.; Srinivasan, P. Hierarchical text categorization using neural networks. Inf. Retr. 2002, 5, 87–118. [CrossRef]
95. Guo, G.; Wang, H.; Bell, D.; Bi, Y.; Greer, K. Using kNN model for automatic text categorization. Soft Comput. 2006, 10, 423–430. [CrossRef]
96. Lewis, D.D. Evaluating text categorization I. In Speech and Natural Language: Proceedings of a Workshop Held at Pacific Grove, California; Morgan Kaufmann Publishers: Burlington, MA, USA, 1991.
97. Wang, B.; Li, C.; Pavlu, V.; Aslam, J. A pipeline for optimizing f1-measure in multi-label text classification. In Proceedings of the 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), Orlando, FL, USA, 17–20 December 2018.
98. Hulth, A.; Megyesi, B. A study on automatically extracted keywords in text categorization. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia, 17–21 July 2006; pp. 537–544.
99. Wong, T.-T. Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation. Pattern Recognit. 2015, 48, 2839–2846. [CrossRef]
100. Moss, H.B.; Leslie, D.S.; Rayson, P. Using JK fold cross validation to reduce variance when tuning NLP models. arXiv 2018, arXiv:1806.07139.
101. Marcot, B.G.; Hanea, A.M. What is an optimal value of k in k-fold cross-validation in discrete Bayesian network analysis? Comput. Stat. 2021, 36, 2009–2031. [CrossRef]
102. Bai, Y.; Chen, M.; Zhou, P.; Zhao, T.; Lee, J.; Kakade, S.; Wang, H.; Xiong, C. How important is the train-validation split in meta-learning? In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 543–553.
103. Vabalas, A.; Gowen, E.; Poliakoff, E.; Casson, A.J. Machine learning algorithm validation with a limited sample size. PLoS ONE 2019, 14, e0224365. [CrossRef] [PubMed]
104. Zhang, H.; Zhang, L.; Jiang, Y. Overfitting and underfitting analysis for deep learning based end-to-end communication systems. In Proceedings of the 2019 11th International Conference on Wireless Communications and Signal Processing (WCSP), Xi’an, China, 23–25 October 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–6.
105. Bu, C.; Zhang, Z. Research on overfitting problem and correction in machine learning. J. Phys. Conf. Ser. 2020, 1693, 012100. [CrossRef]
106. Dogra, V.; Verma, S.; Kavita; Chatterjee, P.; Shafi, J.; Choi, J.; Ijaz, M.F. A Complete Process of Text Classification System Using State-of-the-Art NLP Models. Comput. Intell. Neurosci. 2022, 2022, 1883698. [CrossRef] [PubMed]
107. Hachiya, H.; Yoshida, H.; Shimada, U.; Ueda, N. Multi-class AUC maximization for imbalanced ordinal multi-stage tropical cyclone intensity change forecast. Mach. Learn. Appl. 2024, 17, 100569. [CrossRef]
108. Liu, Y.; Loh, H.T.; Sun, A. Imbalanced text classification: A term weighting approach. Expert Syst. Appl. 2009, 36, 690–701. [CrossRef]
109. Nagy, G.; Zhang, X. Simple statistics for complex feature spaces. In Data Complexity in Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 2006; pp. 173–195.
110. Le, P.Q.; Iliyasu, A.M.; Garcia, J.; Dong, F.; Hirota, K. Representing visual complexity of images using a 3d feature space based on structure, noise, and diversity. J. Adv. Comput. Intell. Intell. Inform. 2012, 16, 631–640. [CrossRef]
111. Mars, M. From word embeddings to pre-trained language models: A state-of-the-art walkthrough. Appl. Sci. 2022, 12, 8805. [CrossRef]
112. Sinjanka, Y.; Musa, U.I.; Malate, F.M. Text Analytics and Natural Language Processing for Business Insights: A Comprehensive Review. Int. J. Res. Appl. Sci. Eng. Technol. 2023, 11. [CrossRef]
113. Bashiri, H.; Naderi, H. Comprehensive review and comparative analysis of transformer models in sentiment analysis. Knowl. Inf. Syst. 2024, 66, 7305–7361. [CrossRef]
114. Yadav, A.; Patel, A.; Shah, M. A comprehensive review on resolving ambiguities in natural language processing. AI Open 2021, 2, 85–92. [CrossRef]
115. Seneviratne, I.S. Text Simplification Using Natural Language Processing and Machine Learning for Better Language Understandability. Ph.D. Thesis, The Australian National University, Canberra, Australia, 2024.
116. Garg, R.; Kiwelekar, A.W.; Netak, L.D.; Bhate, S.S. Potential use-cases of natural language processing for a logistics organization. In Modern Approaches in Machine Learning and Cognitive Science: A Walkthrough: Latest Trends in AI; Springer: Berlin/Heidelberg, Germany, 2021; Volume 2, pp. 157–191.
117. Kim, Y. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1746–1751.
118. Johnson, R.; Zhang, T. Effective use of word order for text categorization with convolutional neural networks. arXiv 2014, arXiv:1412.1058.
119. Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A.; Hovy, E. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; pp. 1480–1489.
120. Schmidt, R.M. Recurrent neural networks (rnns): A gentle introduction and overview. arXiv 2019, arXiv:1912.05911.
121. Devlin, J. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
122. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9.
123. Brown, T.B. Language models are few-shot learners. arXiv 2020, arXiv:2005.14165.
124. Azevedo, B.F.; Rocha, A.M.A.; Pereira, A.I. Hybrid approaches to optimization and machine learning methods: A systematic literature review. Mach. Learn. 2024, 113, 4055–4097. [CrossRef]
125. Willard, J.; Jia, X.; Xu, S.; Steinbach, M.; Kumar, V. Integrating scientific knowledge with machine learning for engineering and environmental systems. ACM Comput. Surv. 2022, 55, 1–37. [CrossRef]
126. Banu, S.; Ummayhani, S. Text summarisation and translation across multiple languages. J. Sci. Res. Technol. 2023, 1, 242–247.
127. Orosoo, M.; Goswami, I.; Alphonse, F.R.; Fatma, G.; Rengarajan, M.; Bala, B.K. Enhancing Natural Language Processing in Multilingual Chatbots for Cross-Cultural Communication. In Proceedings of the 2024 5th International Conference on Intelligent Communication Technologies and Virtual Mobile Networks (ICICV), Tirunelveli, India, 11–12 March 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 127–133.
128. Liang, L.; Wang, S. Spanish Emotion Recognition Method Based on Cross-Cultural Perspective. Front. Psychol. 2022, 13, 849083. [CrossRef] [PubMed]
129. Artetxe, M.; Labaka, G.; Agirre, E. Translation artifacts in cross-lingual transfer learning. arXiv 2020, arXiv:2004.04721.
130. Schuster, S.; Gupta, S.; Shah, R.; Lewis, M. Cross-lingual transfer learning for multilingual task oriented dialog. arXiv 2018, arXiv:1810.13327.
131. Yu, M.; Huang, Q.; Qin, H.; Scheele, C.; Yang, C. Deep learning for real-time social media text classification for situation awareness–using Hurricanes Sandy, Harvey, and Irma as case studies. In Social Sensing and Big Data Computing for Disaster Management; Routledge: London, UK, 2020; pp. 33–50.
132. Demirsoz, O.; Ozcan, R. Classification of news-related tweets. J. Inf. Sci. 2017, 43, 509–524. [CrossRef]
133. Van de Ven, G.M.; Tuytelaars, T.; Tolias, A.S. Three types of incremental learning. Nat. Mach. Intell. 2022, 4, 1185–1197. [CrossRef] [PubMed]
134. Yan, H.; Gui, T.; Dai, J.; Guo, Q.; Zhang, Z.; Qiu, X. A unified generative framework for various NER subtasks. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Online, 1–6 August 2021.
135. Mohit, B. Named entity recognition. In Natural Language Processing of Semitic Languages; Springer: Berlin/Heidelberg, Germany, 2014; pp. 221–245.
136. Bui, D.D.A.; Del Fiol, G.; Jonnalagadda, S. PDF text classification to leverage information extraction from publication reports. J. Biomed. Inform. 2016, 61, 141–148. [CrossRef] [PubMed]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.