Text Classification Using Machine Learning Techniques
M. IKONOMAKIS
Department of Mathematics
University of Patras, GREECE
ikonomakis@mailbox.gr
S. KOTSIANTIS
Department of Mathematics
University of Patras, GREECE
sotos@math.upatras.gr
V. TAMPAKAS
Technological Educational
Institute of Patras, GREECE
tampakas@teipat.gr
Abstract: Automated text classification has been considered a vital method to manage and process the vast amount of documents in digital form that are widespread and continuously increasing. In general, text classification plays an important role in information extraction and summarization, text retrieval, and question-answering. This paper illustrates the text classification process using machine learning techniques. The references cited cover the major theoretical issues and guide the researcher to interesting research directions.
Key-Words: text mining, learning algorithms, feature selection, text representation
1 Introduction
Automatic text classification has always been an
important application and research topic since the
inception of digital documents. Today, text
classification is a necessity due to the very large
amount of text documents that we have to deal with
daily.
In general, text classification includes topic based
text classification and text genre-based
classification. Topic-based text categorization
classifies documents according to their topics [33].
Texts can also be written in many genres, for
instance: scientific articles, news reports, movie
reviews, and advertisements. Genre is defined by the way a text was created, the way it was edited, the register of language it uses, and the kind of audience to whom it is addressed. Previous work on genre classification recognized that this task differs from topic-based categorization [13].
Typically, most data for genre classification are
collected from the web, through newsgroups,
bulletin boards, and broadcast or printed news.
They are multi-source, and consequently have
different formats, different preferred vocabularies
and often significantly different writing styles even
for documents within one genre. Namely, the data are heterogeneous.
Intuitively, text classification is the task of classifying a document under a predefined category. More formally, if $d_i$ is a document of the entire set of documents $D$ and $\{c_1, c_2, \ldots, c_n\}$ is the set of all the categories, then text classification assigns one category $c_j$ to a document $d_i$.
As in every supervised machine learning task, an
initial dataset is needed. A document may be
assigned to more than one category (Ranking Classification), but in this paper only research on Hard Categorization (assigning a single category to each document) is taken into consideration.
Moreover, approaches, that take into consideration
other information besides the pure text, such as
hierarchical structure of the texts or date of
publication, are not presented. This is because the main aim of this paper is to present techniques that exploit the text of each document to the fullest and perform best under this condition.
Sebastiani gave an excellent review of the text classification domain [25]. Thus, in this work, apart from a brief description of text classification, we refer to some works more recent than those in Sebastiani's article, as well as a few articles not referred to by Sebastiani. Figure 1 gives a graphical representation of the text classification process.
Fig. 1. Text Classification Process (Read Document → Tokenize Text → Stemming → Delete Stopwords → Vector Representation of Text → Feature Selection and/or Feature Transformation → Learning Algorithm)
The task of constructing a classifier for
documents does not differ a lot from other tasks of
Machine Learning. The main issue is the
representation of a document [16]. In Section 2 the
document representation is presented. One
particularity of the text categorization problem is
that the number of features (unique words or phrases) can easily reach tens of thousands. This raises significant obstacles in applying many sophisticated learning algorithms to text categorization.
*The Project is co-funded by the European Social Fund and National Resources.
WSEAS TRANSACTIONS on COMPUTERS, Issue 8, Volume 4, August 2005, pp. 966-974
Thus dimension reduction methods are called for.
Two possibilities exist, either selecting a subset of
the original features [3], or transforming the
features into new ones, that is, computing new
features as some functions of the old ones [10]. We
examine both in turn in Section 3 and Section 4.
After the previous steps, a machine learning algorithm can be applied. Some algorithms have been proven to perform better in text classification tasks and are more often used, such as Support Vector Machines. A brief description of recent modifications of learning algorithms for application to text classification is given in Section 5.
There are a number of methods to evaluate the performance of a machine learning algorithm in text classification. Most of these methods are described in Section 6. Some open problems are mentioned in the last section.
2 Vector space document
representations
A document is a sequence of words [16], so each document is usually represented by an array of words. The set of all the words of a training set is called the vocabulary, or feature set. A document can thus be represented by a binary vector, assigning the value 1 if the document contains a feature-word and 0 if it does not. This amounts to positioning the document in the space $\mathbb{R}^{|V|}$, where $|V|$ denotes the size of the vocabulary $V$.
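The binary representation just described can be sketched in a few lines; the toy corpus and helper names below are illustrative, not part of the paper:

```python
# Build a binary bag-of-words vector for a document over a fixed vocabulary.

def build_vocabulary(training_docs):
    """Collect the set of all words seen in the training set (the feature set V)."""
    return sorted({word for doc in training_docs for word in doc.split()})

def binary_vector(doc, vocab):
    """Position a document in R^|V|: 1 if the feature-word occurs, else 0."""
    words = set(doc.split())
    return [1 if w in words else 0 for w in vocab]

train = ["the cat sat", "the dog barked"]
vocab = build_vocabulary(train)            # ['barked', 'cat', 'dog', 'sat', 'the']
print(binary_vector("the cat barked", vocab))  # [1, 1, 0, 0, 1]
```

In practice the vector is sparse, since a document contains only a tiny fraction of the vocabulary.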
Not all of the words present in a document can be used to train the classifier [19]. There are uninformative words such as auxiliary verbs, conjunctions and articles. These words are called stopwords. Many lists of such words exist, and they are removed as a preprocessing task because these words appear in most of the documents.
Stemming is another common preprocessing step. One way to reduce the size of the initial feature set is to merge misspelled words and words sharing the same stem. A stemmer (an algorithm which performs stemming) removes words with the same stem and keeps the stem, or the most common of the words, as the feature. For example, the words “train”, “training”, “trainer” and “trains” can be replaced with “train”. Although stemming is considered by the text classification community to improve classifier performance, there are some doubts about the actual importance of aggressive stemming, such as that performed by the Porter stemmer [25].
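A minimal sketch of these two preprocessing steps, assuming a toy stopword list and a crude suffix-stripping rule in place of a real stemmer such as Porter's:

```python
# Tokenization, stopword removal, and a toy suffix-stripping stemmer.
# The stopword list and suffix rules are illustrative; real systems use
# full stopword lists and a proper stemming algorithm.

STOPWORDS = {"the", "a", "an", "and", "or", "is", "are", "was", "to", "of"}

def toy_stem(word):
    """Strip a few common English suffixes; far cruder than the Porter stemmer."""
    for suffix in ("ing", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = text.lower().split()
    return [toy_stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("The trainer is training the trains"))
# ['train', 'train', 'train']
```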
An ancillary feature engineering choice is the
representation of the feature value [16]. Often a
Boolean indicator of whether the word occurred in
the document is sufficient. Other possibilities
include the count of the number of times the word
occurred in the document, the frequency of its
occurrence normalized by the length of the
document, the count normalized by the inverse
document frequency of the word. In situations
where the document length varies widely, it may be
important to normalize the counts. Further, in short
documents words are unlikely to repeat, making
Boolean word indicators nearly as informative as
counts. This yields great savings in training resources and in the search space of the induction algorithm, which might otherwise try to discretize each feature optimally, searching over the number of bins and each bin's threshold.
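The feature-value choices listed above can be illustrated side by side; the small corpus and function below are hypothetical:

```python
# Sketch of the feature-value options: Boolean indicator, raw count,
# length-normalized frequency, and tf-idf. The corpus is illustrative.
import math

docs = [["cat", "sat", "cat"], ["dog", "sat"], ["cat", "dog", "dog"]]

def feature_values(term, doc, corpus):
    count = doc.count(term)
    boolean = 1 if count > 0 else 0
    norm_freq = count / len(doc)                      # normalized by document length
    df = sum(1 for d in corpus if term in d)          # document frequency
    idf = math.log(len(corpus) / df) if df else 0.0   # inverse document frequency
    return boolean, count, norm_freq, count * idf     # last entry is tf-idf

print(feature_values("cat", docs[0], docs))
```

For "cat" in the first document this yields indicator 1, count 2, normalized frequency 2/3, and tf-idf 2·log(3/2).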
Most of the text categorization algorithms in the
literature represent documents as collections of
words. An alternative which has not been
sufficiently explored is the use of word meanings,
also known as senses. Kehagias et al., using several algorithms, compared the categorization accuracy of classifiers based on words to that of classifiers based on senses [12]. The document
collection on which this comparison took place is a
subset of the annotated Brown Corpus semantic
concordance. A series of experiments indicated that
the use of senses does not result in any significant
categorization improvement.
3 Feature Selection
The aim of feature-selection methods is the
reduction of the dimensionality of the dataset by
removing features that are considered irrelevant for
the classification [6]. This transformation
procedure has been shown to present a number of
advantages, including smaller dataset size, smaller
computational requirements for the text
categorization algorithms (especially those that do
not scale well with the feature set size) and
considerable shrinking of the search space. The
goal is the reduction of the curse of dimensionality
to yield improved classification accuracy. Another
benefit of feature selection is its tendency to reduce
overfitting, i.e. the phenomenon by which a
classifier is tuned also to the contingent
characteristics of the training data rather than the
constitutive characteristics of the categories, and
therefore, to increase generalization.
Methods for feature subset selection for text
document classification task use an evaluation
function that is applied to a single word [27].
Scoring of individual words (Best Individual Features) can be performed using measures such as document frequency, term frequency, mutual information, information gain, odds ratio, the χ² statistic and term strength [3], [30], [6], [28], [27]. What is common to all of these feature-scoring methods is that they conclude by ranking the features by their independently determined scores, and then select the top-scoring features. The most common metrics are presented in Table 1, using the notation described in Table 2.
In contrast to Best Individual Features (BIF) methods, sequential forward selection (SFS) methods first select the best single word according to a given criterion [20]; then they add one word at a time until the number of selected words reaches the desired k words. SFS methods do not yield the optimal word subset, but, unlike BIF methods, they take account of dependencies between words. Therefore SFS often gives better results than BIF. However, SFS is not usually used in text classification because of its computational cost, due to the large vocabulary size.
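The BIF scheme described above amounts to scoring, ranking, and truncating; a sketch with document frequency standing in for any of the scoring metrics of Table 1:

```python
# Best Individual Features (BIF) in outline: score every word independently,
# rank, and keep the top k. Document frequency is used as the score here for
# brevity; any metric (information gain, chi-square, ...) would fit.

def best_individual_features(docs, k, score):
    vocab = {w for doc in docs for w in doc}
    ranked = sorted(vocab, key=lambda w: score(w, docs), reverse=True)
    return ranked[:k]

def document_frequency(word, docs):
    return sum(1 for doc in docs if word in doc)

corpus = [{"cat", "sat"}, {"cat", "dog"}, {"cat", "mat"}]
print(best_individual_features(corpus, 2, document_frequency))
```

Note that each word is scored in isolation, which is exactly why BIF ignores the word dependencies that SFS captures.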
Forman has presented a benchmark comparison of 12 metrics on well-known training sets [6]. According to Forman, BNS performed best by a wide margin when using 500 to 1000 features, while Information Gain outperformed the other metrics when the number of features varied between 20 and 50. Accuracy2 performed as well as Information Gain. Chi-square was consistently worse than Information Gain. Since no metric performs consistently better than all others, researchers often combine two metrics in order to benefit from both [6].
Novovicova et al. used SFS taking into account not only the mutual information between a class and a word but also that between a class and two words [22]. The results were slightly better.
Although machine-learning-based text classification performs well, it is inefficient when handling a very large training corpus. Thus, apart from feature selection, instance selection is often needed as well.
Information Gain: $IG(t) = -\sum_{i=1}^{m} P(c_i)\log P(c_i) + P(t)\sum_{i=1}^{m} P(c_i|t)\log P(c_i|t) + P(\bar{t})\sum_{i=1}^{m} P(c_i|\bar{t})\log P(c_i|\bar{t})$
Gain Ratio: $GR(t_k, c_i) = \dfrac{\sum_{c\in\{c_i,\bar{c}_i\}} \sum_{t\in\{t_k,\bar{t}_k\}} P(t,c)\log\frac{P(t,c)}{P(t)P(c)}}{-\sum_{c\in\{c_i,\bar{c}_i\}} P(c)\log P(c)}$
Conditional Mutual Information: $CMI(C|S) = H(C) - H(C|S_1, S_2, \ldots, S_n)$
Document Frequency: $DF(t_k) = P(t_k)$
Term Frequency: $tf(f_i, d_j) = \dfrac{freq_{ij}}{\max_k freq_{kj}}$
Inverse Document Frequency: $idf_i = \log\dfrac{|D|}{\#(f_i)}$
Chi-square: $\chi^2(f_i, c_j) = \dfrac{|D|\left(\#(f_i,c_j)\,\#(\bar{f}_i,\bar{c}_j) - \#(f_i,\bar{c}_j)\,\#(\bar{f}_i,c_j)\right)^2}{\left(\#(f_i,c_j)+\#(f_i,\bar{c}_j)\right)\left(\#(\bar{f}_i,c_j)+\#(\bar{f}_i,\bar{c}_j)\right)\left(\#(f_i,c_j)+\#(\bar{f}_i,c_j)\right)\left(\#(f_i,\bar{c}_j)+\#(\bar{f}_i,\bar{c}_j)\right)}$
Term Strength: $s(t) = P(t \in y \mid t \in x)$, for pairs of related documents $x$, $y$
WOddsRatio: $WOddsRatio(w) = P(w)\cdot OddsRatio(w)$
OddsRatio: $OddsRatio(f_i, c_j) = \log\dfrac{P(f_i|c_j)\left(1 - P(f_i|\bar{c}_j)\right)}{\left(1 - P(f_i|c_j)\right)P(f_i|\bar{c}_j)}$
Logarithmic Probability Ratio: $LogProbRatio(w) = \log\dfrac{P(w|c)}{P(w|\bar{c})}$
Pointwise Mutual Information: $I(x, y) = \log\dfrac{P(x,y)}{P(x)P(y)}$
Category Relevance Factor (CRF): $CRF(f_i, c_j) = \log\dfrac{\#(f_i,c_j)/\#(c_j)}{\#(f_i,\bar{c}_j)/\#(\bar{c}_j)}$
Odds Numerator: $OddsNum(w, c) = P(w|c)\left(1 - P(w|\bar{c})\right)$
Probability Ratio: $PrR(w|c) = \dfrac{P(w|c)}{P(w|\bar{c})}$
Bi-Normal Separation: $F^{-1}\left(P(w|c)\right) - F^{-1}\left(P(w|\bar{c})\right)$, where $F$ is the standard Normal cumulative distribution function
Pow: $\left(1 - P(w|\bar{c})\right)^k - \left(1 - P(w|c)\right)^k$
Topic Relevance using Relative Word Position: $M(w,c) = \log\dfrac{DF^n(w,c)}{DF^n(w,db)} + \log\dfrac{|db|}{|c|}$
Topic Relevance using Document Frequency: $M(w,c) = \log\dfrac{DF(w,c)}{DF(w,db)} + \log\dfrac{|db|}{|c|}$
Topic Relevance using Modified Document Frequency: $M(w,c) = \log\dfrac{\widetilde{DF}(w,c)}{\widetilde{DF}(w,db)} + \log\dfrac{|db|}{|c|}$
Topic Relevance using Term Frequency: $M(w,c) = \log\dfrac{TF(w,c)}{TF(w,db)} + \log\dfrac{|db|}{|c|}$
Weight of Evidence for Text: $Weight(w) = \sum_i P(c_i)P(w)\left|\log\dfrac{P(c_i|w)\left(1 - P(c_i)\right)}{P(c_i)\left(1 - P(c_i|w)\right)}\right|$
Table 1. Feature Selection metrics
$c$: a class of the training set
$C$: the set of classes of the training set
$d$: a document of the training set
$D$ or $db$: the set of documents of the training set
$t$ or $w$: a term or word
$P(c)$ or $P(c_i)$: the probability of class $c$ or $c_i$ respectively, i.e. how often the class appears in the training set
$P(\bar{c})$: the probability of the class not occurring
$P(c|t)$: the probability of class $c$ given that term $t$ appears; respectively, $P(\bar{c}|t)$ denotes the probability of class $c$ not occurring, given that term $t$ appears
$P(c,t)$: the probability of class $c$ and term $t$ occurring simultaneously
$H(C)$: the entropy of the set $C$
$DF(t_k)$: the document frequency of term $t_k$
$DF^n(t)$: the frequency of term $t$ in documents containing $t$ in every one of their $n$ splits
$\widetilde{DF}(t)$: the document frequency, taking into consideration only documents in which $t$ appears more than once
$\#(c)$ or $\#(t)$: the number of documents which belong to class $c$ or which contain term $t$, respectively
$\#(c,t)$: the number of documents containing term $t$ and belonging to class $c$
Table 2. Symbolisms
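As one worked instance of the metrics in Table 1, the chi-square score can be computed directly from contingency counts in the notation of Table 2; the counts below are illustrative:

```python
# The chi-square metric from Table 1, computed from document counts:
# n_fc = #(f, c), n_fc_bar = #(f, not-c), and so on.

def chi_square(n_fc, n_fc_bar, n_f_bar_c, n_f_bar_c_bar):
    """chi^2 for a feature f and class c from the 2x2 contingency counts."""
    n = n_fc + n_fc_bar + n_f_bar_c + n_f_bar_c_bar
    num = n * (n_fc * n_f_bar_c_bar - n_fc_bar * n_f_bar_c) ** 2
    den = ((n_fc + n_fc_bar) * (n_f_bar_c + n_f_bar_c_bar)
           * (n_fc + n_f_bar_c) * (n_fc_bar + n_f_bar_c_bar))
    return num / den if den else 0.0

# feature present in 40 of 50 positive docs and 10 of 50 negative docs
print(chi_square(40, 10, 10, 40))  # 36.0
```

A high score indicates strong dependence between the feature's presence and the class.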
Guan and Zhou proposed a training-corpus pruning based approach to speed up the process [8]. According to their experiments, this approach can reduce the size of the training corpus significantly while keeping classification performance at a level close to that obtained without pruning.
Fragoudis et al. [7] integrated Feature and
Instance Selection for Text Classification with even
better results.
Their method works in two steps. In
the first step, their method sequentially selects
features that have high precision in predicting the
target class. All documents that do not contain at
least one such feature are dropped from the training
set. In the second step, their method searches within this subset of the initial dataset for a set of features that tend to predict the complement of the target class, and these features are also selected. The union of the features selected during these two steps is the new feature set, and the documents selected in the first step comprise the training set.
4 Feature Transformation
Feature transformation differs significantly from feature selection approaches, but like them its purpose is to reduce the feature set size [10]. Rather than weighting terms and discarding the lower-weighted ones, this approach compacts the vocabulary based on feature co-occurrences.
Principal Component Analysis (PCA) is a well-known method for feature transformation [38]. Its aim is to learn a discriminative transformation matrix that reduces the initial feature space to a lower-dimensional feature space, so as to reduce the complexity of the classification task without sacrificing accuracy. The transform is derived from the eigenvectors of the covariance matrix, which in PCA corresponds to the document-term matrix multiplied by its transpose. Entries in the covariance matrix represent co-occurring terms in the documents. Eigenvectors of this matrix corresponding to the dominant eigenvalues are directions of dominant term combinations, which can be called "topics" or "semantic concepts". A transform matrix constructed from these eigenvectors projects a document onto these "latent semantic concepts", and the new low-dimensional representation consists of the magnitudes of these projections. The eigenanalysis can be computed efficiently by a sparse variant of singular value decomposition of the document-term matrix [11].
In the information retrieval community this
method has been named Latent Semantic Indexing
(LSI) [23]. The resulting representation is not intuitively interpretable by a human, but it offers good performance.
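The projection onto a dominant "latent semantic concept" can be illustrated in pure Python with power iteration; a real LSI implementation uses a sparse truncated SVD and keeps many dimensions, so this one-dimensional sketch with an invented matrix is only indicative:

```python
# Find the dominant eigenvector of the term co-occurrence matrix A^T A
# (A being the document-term matrix) by power iteration, then represent
# each document by its projection onto that single "latent concept".

def top_concept(doc_term, iters=100):
    m = len(doc_term[0])
    # Build A^T A, whose entries count co-occurring terms across documents.
    ata = [[sum(row[i] * row[j] for row in doc_term) for j in range(m)]
           for i in range(m)]
    v = [1.0] * m
    for _ in range(iters):  # power iteration toward the dominant eigenvector
        v = [sum(ata[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    return v

doc_term = [[2, 1, 0], [1, 1, 0], [0, 0, 3]]  # rows: documents, columns: terms
v = top_concept(doc_term)
# New one-dimensional representation: each document's projection magnitude.
projections = [sum(d[j] * v[j] for j in range(len(v))) for d in doc_term]
print([round(p, 2) for p in projections])  # [0.0, 0.0, 3.0]
```

Here the third document dominates the leading concept, so it alone projects with a large magnitude.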
Qiang et al. [37] performed experiments using k-NN LSI, a new combination of the standard k-NN method on top of LSI, applying a new matrix decomposition algorithm, Semi-Discrete Matrix Decomposition, to decompose the vector matrix. The experimental results showed that text categorization in this space was more effective and also computationally less costly, because it required a lower-dimensional space.
The authors of [4] present a comparison of the
performance of a number of text categorization
methods in two different data sets. In particular,
they evaluate the Vector and LSI methods, a
classifier based on Support Vector Machines
(SVM) and the k-Nearest Neighbor variations of
the Vector and LSI models. Their results show that
overall, SVMs and k-NN LSI perform better than
the other methods, in a statistically significant way.
5 Machine learning algorithms
After feature selection and transformation the
documents can be easily represented in a form that
can be used by a ML algorithm. Many text
classifiers have been proposed in the literature
using machine learning techniques, probabilistic
models, etc. They often differ in the approach
adopted: decision trees, naıve-Bayes, rule
induction, neural networks, nearest neighbors, and
lately, support vector machines. Although many
approaches have been proposed, automated text
classification is still a major area of research
primarily because the effectiveness of current
automated text classifiers is not faultless and still
needs improvement.
Naive Bayes is often used in text classification
applications and experiments because of its
simplicity and effectiveness [14]. However, its
performance is often degraded because it does not
model text well. Schneider addressed these problems and showed that they can be solved by some simple corrections [24]. Klopotek and Woch presented
results of empirical evaluation of a Bayesian
multinet classifier based on a new method of
learning very large tree-like Bayesian networks
[15]. The study suggests that tree-like Bayesian
networks are able to handle a text classification
task in one hundred thousand variables with
sufficient speed and accuracy.
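As a reference point for the discussion of naive Bayes, a minimal multinomial naive Bayes classifier with Laplace smoothing, the standard formulation; the training examples are invented:

```python
# Multinomial naive Bayes: class prior times per-word likelihoods, with
# add-one (Laplace) smoothing, computed in log space for stability.
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    class_counts = Counter(label for _, label in labeled_docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in labeled_docs:
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def classify(words, model):
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for c, n_c in class_counts.items():
        lp = math.log(n_c / total)                         # log prior
        denom = sum(word_counts[c].values()) + len(vocab)  # Laplace smoothing
        for w in words:
            lp += math.log((word_counts[c][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = c, lp
    return best

docs = [(["free", "offer", "click"], "spam"),
        (["meeting", "agenda"], "ham"),
        (["free", "prize"], "spam")]
model = train_nb(docs)
print(classify(["free", "click"], model))  # spam
```

The independence assumption between words is what makes naive Bayes fast but also what limits how well it models text.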
Support vector machines (SVM), when applied to text classification, provide excellent precision but poor recall. One means of customizing SVMs to improve recall is to adjust the threshold associated with an SVM. Shanahan and Roma described an automatic process for adjusting the thresholds of generic SVMs [26], with better results.
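Threshold adjustment of this kind can be sketched as a search over decision scores on a validation set; the scores and labels below are invented, and this is not Shanahan and Roma's exact procedure:

```python
# Pick the decision threshold that maximizes F1 on a validation set.
# The scores would come from an SVM's decision function; here they are made up.

def best_threshold(scores, labels):
    best_t, best_f1 = 0.0, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

scores = [-1.2, -0.4, -0.1, 0.3, 0.8, 1.5]
labels = [0, 0, 1, 1, 0, 1]
print(best_threshold(scores, labels))
```

Lowering the threshold below the default of zero trades some precision for the recall that SVMs tend to lack.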
Johnson et al. described a fast decision tree
construction algorithm that takes advantage of the
sparsity of text data, and a rule simplification
method that converts the decision tree into a
logically equivalent rule set [9].
Lim proposed a method which improves
performance of kNN based text classification by
using well estimated parameters [18]. Some
variants of the kNN method with different decision
functions, k values, and feature sets were proposed
and evaluated to find out adequate parameters.
The Corner Classification (CC) network is a kind of feed-forward neural network for instant document classification. A training algorithm, named TextCC, is presented in [34].
The level of difficulty of text classification tasks
naturally varies. As the number of distinct classes
increases, so does the difficulty, and therefore the
size of the training set needed. In any multi-class
text classification task, inevitably some classes will
be more difficult than others to classify. Reasons
for this may be: (1) very few positive training
examples for the class, and/or (2) lack of good
predictive features for that class.
When training a binary classifier per category in
text categorization, we use all the documents in the
training corpus that belong to that category as
relevant training data and all the documents in the
training corpus that belong to all the other
categories as non-relevant training data. It is often the case that there is an overwhelming number of non-relevant training documents, especially when there is a large collection of categories with each assigned to a small number of documents; this is typically an “imbalanced data problem".
problem presents a particular challenge to
classification algorithms, which can achieve high
accuracy by simply classifying every example as
negative. To overcome this problem, cost sensitive
learning is needed [5].
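A toy calculation shows why plain accuracy rewards the trivial "always negative" classifier on skewed data (the counts are illustrative):

```python
# A classifier that labels every document "not in category" scores high
# accuracy on imbalanced data but has zero recall: 990 negatives, 10 positives.

n_pos, n_neg = 10, 990
tp, fp = 0, 0          # the trivial "always negative" classifier finds nothing
fn, tn = n_pos, n_neg

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
print(accuracy, recall)  # 0.99 0.0
```

This is why cost-sensitive learning, or precision/recall-based evaluation, is preferred on skewed category distributions.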
A scalability analysis of a number of classifiers
in text categorization is given in [32]. Vinciarelli
presents categorization experiments performed over
noisy texts [31]. By noisy it is meant any text
obtained through an extraction process (affected by
errors) from media other than digital texts (e.g.
transcriptions of speech recordings extracted with a
recognition system). The performance of the
categorization system over the clean and noisy
(Word Error Rate between ~10 and ~50 percent)
versions of the same documents is compared. The
noisy texts are obtained through Handwriting
Recognition and simulation of Optical Character
Recognition. The results show that the performance
loss is acceptable.
Other authors [36] also proposed to parallelize
and distribute the process of text classification.
With such a procedure, the performance of
classifiers can be improved in both accuracy and
time complexity.
Recently, in the area of machine learning, the concept of combining classifiers has been proposed as a new direction for improving the performance of individual classifiers. Numerous methods have been suggested for the creation of ensembles of classifiers.
to build ensemble of classifiers include: i) Using
different subset of training data with a single
learning method, ii) Using different training
parameters with a single training method (e.g. using
different initial weights for each neural network in
an ensemble), iii) Using different learning methods.
In the context of combining multiple classifiers
for text categorization, a number of researchers
have shown that combining different classifiers can
improve classification accuracy [1], [29].
Comparing the best individual classifier with the combined method, it is observed that the performance of the combined method is superior [2]. Nardiello et al. [21] also proposed algorithms
in the family of "boosting"-based learners for
automated text classification with good results.
6 Evaluation
There are various methods to determine
effectiveness; however, precision, recall, and
accuracy are most often used. To determine these,
one must first begin by understanding if the
classification of a document was a true positive
(TP), false positive (FP), true negative (TN), or
false negative (FN) (see Table 3).
TP Determined as a document being classified
correctly as relating to a category.
FP Determined as a document that is said to be
related to the category incorrectly.
FN Determined as a document that is not marked
as related to a category but should be.
TN Documents that should not be marked as being
in a particular category and are not.
Table 3. Classification of a document
Precision ($\pi_i$) is defined as the conditional probability that, if a random document $d$ is classified under $c_i$, this decision is correct. It represents the classifier's ability to place a document under the correct category, as opposed to all documents placed in that category, both correct and incorrect:
$\pi_i = \dfrac{TP_i}{TP_i + FP_i}$
Recall ($\rho_i$) is defined as the probability that, if a random document $d_x$ should be classified under category $c_i$, this decision is taken:
$\rho_i = \dfrac{TP_i}{TP_i + FN_i}$
Accuracy is commonly used as a measure for categorization techniques. Accuracy values, however, are much less sensitive to variations in the number of correct decisions than precision and recall:
$A_i = \dfrac{TP_i + TN_i}{TP_i + TN_i + FP_i + FN_i}$
Many times there are very few instances of the
interesting category in text categorization. This
overrepresentation of the negative class in
information retrieval problems can cause problems
in evaluating classifiers' performances using
accuracy. Since accuracy is not a good metric for
skewed datasets, the classification performance of
algorithms in this case is measured by precision
and recall [5].
Furthermore, precision and recall are often combined in order to get a better picture of the performance of the classifier, using the following formula:
$F_\beta = \dfrac{(\beta^2 + 1)\,\pi\rho}{\beta^2 \pi + \rho}$,
where $\pi$ and $\rho$ denote precision and recall respectively. $\beta$ is a positive parameter which represents the goal of the evaluation task: if precision is considered more important than recall, the value of $\beta$ converges to zero; if recall is more important than precision, $\beta$ converges to infinity. Usually $\beta$ is set to 1, giving equal importance to precision and recall.
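The measures above can be computed directly from a confusion matrix; the counts below are an illustrative example:

```python
# Precision, recall, and the F_beta combination from TP/FP/FN counts.

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_beta(p, r, beta=1.0):
    return (beta**2 + 1) * p * r / (beta**2 * p + r)

tp, fp, fn, tn = 80, 20, 40, 860
p, r = precision(tp, fp), recall(tp, fn)
print(round(p, 3), round(r, 3), round(f_beta(p, r), 3))  # 0.8 0.667 0.727
```

With beta = 1 the formula reduces to the familiar harmonic mean 2πρ/(π + ρ).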
Reuters Corpus Volume I (RCV1) is an archive
of over 800,000 manually categorized newswire
stories recently made available by Reuters, Ltd. for
research purposes [17]. Using this collection, we
can compare the learning algorithms.
Although research in past years has shown that the training corpus can impact classification performance, little work has been done to explore the underlying causes. The authors of [35] propose an approach to build high-quality training corpora semi-automatically for better classification performance, by first exploring the properties of training corpora and then giving an algorithm for constructing them semi-automatically.
7 Conclusion
The text classification problem is an Artificial
Intelligence research topic, especially given the
vast number of documents available in the form of
web pages and other electronic texts like emails,
discussion forum postings and other electronic
documents.
It has been observed that, even for a specified classification method, the classification performance of classifiers trained on different text corpora differs, and in some cases such differences are quite substantial. This observation implies that a) classifier performance depends to some degree on its training corpus, and b) good or high-quality training corpora may yield classifiers of good performance. Unfortunately, up to now little research in the literature has addressed how to exploit training text corpora to improve a classifier's performance.
Some important conclusions have not been reached yet, including the following open questions:
- Which feature selection methods are both computationally scalable and high-performing across classifiers and collections? Given the high variability of text collections, do such methods even exist?
- Would combining uncorrelated but well-performing methods yield a performance increase?
- Can the word-frequency-based vector space be replaced by a concept-based vector space, and does feature selection over concepts help text categorization?
- How can dimensionality reduction be made more efficient over large corpora?
Moreover, there are two other open problems in text mining: polysemy and synonymy. Polysemy refers
to the fact that a word can have multiple meanings.
Distinguishing between different meanings of a
word (called word sense disambiguation) is not
easy, often requiring the context in which the word
appears. Synonymy means that different words can
have the same or similar meaning.
References:
[1] Bao Y. and Ishii N., “Combining Multiple kNN
Classifiers for Text Categorization by
Reducts”, LNCS 2534, 2002, pp. 340-347
[2] Bi Y., Bell D., Wang H., Guo G., Greer K.,
”Combining Multiple Classifiers Using
Dempster's Rule of Combination for Text
Categorization”, MDAI, 2004, 127-138.
[3] Brank J., Grobelnik M., Milic-Frayling N.,
Mladenic D., “Interaction of Feature Selection
Methods and Linear Classification Models”,
Proc. of the 19th International Conference on
Machine Learning, Australia, 2002.
[4] Ana Cardoso-Cachopo, Arlindo L. Oliveira, An
Empirical Comparison of Text Categorization
Methods, Lecture Notes in Computer Science,
Volume 2857, Jan 2003, Pages 183 - 196
[5] Chawla, N. V., Bowyer, K. W., Hall, L. O.,
Kegelmeyer, W. P., “SMOTE: Synthetic
Minority Over-sampling Technique,” Journal
of AI Research, 16 2002, pp. 321-357.
[6] Forman, G., An Experimental Study of Feature
Selection Metrics for Text Categorization.
Journal of Machine Learning Research, 3 2003,
pp. 1289-1305
[7] Fragoudis D., Meretakis D., Likothanassis S.,
“Integrating Feature and Instance Selection for
Text Classification”, SIGKDD ’02, July 23-26,
2002, Edmonton, Alberta, Canada.
[8] Guan J., Zhou S., “Pruning Training Corpus to
Speedup Text Classification”, DEXA 2002, pp.
831-840
[9] D. E. Johnson, F. J. Oles, T. Zhang, T. Goetz,
“A decision-tree-based symbolic rule induction
system for text categorization”, IBM Systems
Journal, September 2002.
[10] Han X., Zu G., Ohyama W., Wakabayashi
T., Kimura F., Accuracy Improvement of
Automatic Text Classification Based on
Feature Transformation and Multi-classifier
Combination, LNCS, Volume 3309, Jan 2004,
pp. 463-468
[11] Ke H., Shaoping M., “Text categorization
based on Concept indexing and principal
component analysis”, Proc. TENCON 2002
Conference on Computers, Communications,
Control and Power Engineering, 2002, pp. 51-
56.
[12] Kehagias A., Petridis V., Kaburlasos V.,
Fragkou P., “A Comparison of Word- and
Sense-Based Text Categorization Using
Several Classification Algorithms”, JIIS,
Volume 21, Issue 3, 2003, pp. 227-247.
[13] B. Kessler, G. Nunberg, and H. Schutze.
Automatic detection of text genre. In
Proceedings of the Thirty-Fifth ACL and
EACL, pages 32–38, 1997.
[14] Kim S. B., Rim H. C., Yook D. S. and Lim
H. S., “Effective Methods for Improving Naive
Bayes Text Classifiers”, LNAI 2417, 2002, pp.
414-423
[15] Klopotek M. and Woch M., “Very Large
Bayesian Networks in Text Classification”,
ICCS 2003, LNCS 2657, 2003, pp. 397-406
[16] Leopold, Edda & Kindermann, Jörg, “Text
Categorization with Support Vector Machines.
How to Represent Texts in Input Space?”,
Machine Learning 46, 2002, pp. 423 - 444.
[17] Lewis D., Yang Y., Rose T., Li F., “RCV1:
A New Benchmark Collection for Text
Categorization Research”, Journal of Machine
Learning Research 5, 2004, pp. 361-397.
[18] Heui Lim, Improving kNN Based Text
Classification with Well Estimated Parameters,
LNCS, Vol. 3316, Oct 2004, Pages 516 - 523.
[19] Madsen R. E., Sigurdsson S., Hansen L. K.
and Lansen J., “Pruning the Vocabulary for
Better Context Recognition”, 7th International Conference on Pattern Recognition, 2004
[20] Montanes E., Quevedo J. R. and Diaz I.,
“A Wrapper Approach with Support Vector
Machines for Text Categorization”, LNCS
2686, 2003, pp. 230-237
[21] Nardiello P., Sebastiani F., Sperduti A.,
“Discretizing Continuous Attributes in
AdaBoost for Text Categorization”, LNCS,
Volume 2633, Jan 2003, pp. 320-334
[22] Novovicova J., Malik A., and Pudil P.,
“Feature Selection Using Improved Mutual
Information for Text Classification”,
SSPR&SPR 2004, LNCS 3138, pp. 1010–
1017, 2004
[23] Qiang W., XiaoLong W., Yi G., “A Study
of Semi-discrete Matrix Decomposition for LSI
in Automated Text Categorization”, LNCS,
Volume 3248, Jan 2005, pp. 606-615.
[24] Schneider K., “Techniques for Improving
the Performance of Naive Bayes for Text
Classification”, LNCS, Vol. 3406, 2005, pp.
682-693.
[25] Sebastiani F., “Machine Learning in
Automated Text Categorization”, ACM
Computing Surveys, vol. 34 (1), 2002, pp. 1-47.
[26] Shanahan J. and Roma N., “Improving SVM
Text Classification Performance through
Threshold Adjustment”, LNAI 2837, 2003, pp.
361-372.
[27] Soucy P. and Mineau G., “Feature
Selection Strategies for Text Categorization”,
AI 2003, LNAI 2671, 2003, pp. 505-509
[28] Sousa P., Pimentao J. P., Santos B. R. and
Moura-Pires F., “Feature Selection Algorithms
to Improve Documents Classification
Performance”, LNAI 2663, 2003, pp. 288-296
[29] Sung-Bae Cho, Jee-Haeng Lee, “Learning
Neural Network Ensemble for Practical Text
Classification”, LNCS, Volume 2690, Aug
2003, pp. 1032-1036.
[30] Torkkola K., “Discriminative Features for
Text Document Classification”, Proc.
International Conference on Pattern
Recognition, Canada, 2002.
[31] Vinciarelli A., “Noisy Text Categorization”,
Proc. 17th International Conference on Pattern
Recognition (ICPR'04), 2004, pp. 554-557.
[32] Yang Y., Zhang J. and Kisiel B., “A
scalability analysis of classifiers in text
categorization”, ACM SIGIR'03, 2003, pp. 96-
103.
[33] Yang Y., “An evaluation of statistical
approaches to text categorization”, Journal of
Information Retrieval, 1(1/2), 1999, pp. 67-88.
[34] Zhenya Zhang, Shuguang Zhang, Enhong
Chen, Xufa Wang, Hongmei Cheng, “TextCC:
New Feed Forward Neural Network for
Classifying Documents Instantly”, LNCS,
Volume 3497, Jan 2005, pp. 232-237.
[35] Shuigeng Zhou, Jihong Guan, “Evaluation
and Construction of Training Corpuses for Text
Classification: A Preliminary Study”, LNCS,
Volume 2553, Jan 2002, pp. 97-108.
[36] Verayuth Lertnattee, Thanaruk
Theeramunkong, “Parallel Text Categorization
for Multi-dimensional Data”, LNCS, Volume
3320, Jan 2004, pp. 38-41.
[37] Wang Qiang, Wang XiaoLong, Guan Yi, “A
Study of Semi-discrete Matrix Decomposition
for LSI in Automated Text Categorization”,
LNCS, Volume 3248, Jan 2005, pp. 606-615.
[38] Zu G., Ohyama W., Wakabayashi T.,
Kimura F., “Accuracy improvement of
automatic text classification based on feature
transformation”, Proc. of the 2003 ACM
Symposium on Document Engineering,
November 20-22, 2003, pp. 118-120.