A systematic analysis of performance measures for classification tasks
Marina Sokolova (a,*), Guy Lapalme (b)

(a) Electronic Health Information Lab, Children's Hospital of Eastern Ontario, Ottawa, Canada
(b) Département d'informatique et de recherche opérationnelle, Université de Montréal, Montréal, Canada

* Corresponding author. E-mail addresses: msokolova@ehealthinformation.ca (M. Sokolova), lapalme@iro.umontreal.ca (G. Lapalme).

Information Processing and Management 45 (2009) 427-437. doi:10.1016/j.ipm.2009.03.002

Article history: Received 14 February 2008; received in revised form 21 November 2008; accepted 6 March 2009; available online 8 May 2009.

Keywords: Performance evaluation; Machine Learning; Text classification

Abstract
This paper presents a systematic analysis of twenty-four performance measures used in the complete spectrum of Machine Learning classification tasks, i.e., binary, multi-class, multi-labelled, and hierarchical. For each classification task, the study relates a set of changes in a confusion matrix to specific characteristics of data. The analysis then concentrates on the types of changes to a confusion matrix that do not change a measure and therefore preserve a classifier's evaluation (measure invariance). The result is the measure invariance taxonomy with respect to all relevant label distribution changes in a classification problem. This formal analysis is supported by examples of applications where invariance properties of measures lead to a more reliable evaluation of classifiers. Text classification supplements the discussion with several case studies.

© 2009 Elsevier Ltd. All rights reserved.
1. Motivation
Machine Learning (ML) divides classification into binary, multi-class, multi-labelled, and hierarchical tasks. In this work we present a systematic analysis of twenty-four performance measures used in these classification subfields. We focus on how well classes are identified, without reference to computational cost or time. We consider a set of changes in a confusion matrix that correspond to specific characteristics of data. We then analyze the types of changes that do not change a measure's value and therefore preserve a classifier's evaluation. This is what we call measure invariance. As a result, we build the measure invariance taxonomy with respect to all relevant label distribution changes in a classification problem. We supplement the formal analysis with examples of applications where invariance properties of measures lead to a more reliable evaluation of classifiers; the examples are taken from text classification. Note that we focus on recent ML developments; more details on ML measures can be found, for example, in Sokolova, Japkowicz, and Szpakowicz (2006), which looks into relations between the measures and the assessment of medical trials. To the best of our knowledge, our current study is the first review of ML measures that comprehensively evaluates their invariance properties. Preliminary results on binary classification appear in Sokolova and Lapalme (2007). This study expands those results two-fold: it discusses new invariant properties (in some cases also considering monotonicity) and it covers multi-class, multi-labelled, and hierarchical measures.
Empirical evaluation remains the most widely used approach to algorithm assessment, although ML algorithms can be evaluated through empirical assessment, theory, or both, e.g., derived generalization bounds and empirical results (Marchand & Shawe-Taylor, 2002). Evaluation techniques based on multiple experiments are considered in Dietterich (1998), one of the most cited works on the empirical evaluation of ML algorithms. An extensive critique of ML evaluation practice can be found in Salzberg (1999). The author analyzes the currently used methods and their statistical validity. The paper distinguishes two goals of evaluation: a comparison of algorithms, and the feasibility of algorithms on a specific domain. Demsar (2006) surveys how classifiers are compared over multiple data sets. Empirical comparison is most often done by applying algorithms to various data sets and then evaluating the performance of the classifiers that the algorithms have produced; accuracy
being the most often used measure. In all these assessment approaches, the algorithm and the output classifiers take the
central stage.
We take an alternative route, looking at how data characteristics affect the objectivity of measures. Our formal discussion of ML performance measures complements popular statistical and empirical comparisons such as the ones presented in Goutte and Gaussier (2005). We show that, in some learning settings, the correct identification of positive examples may be important, whereas in others the correct identification of negative examples, or the disagreement between data and classifier labels, may be more significant. Thus, standard performance measures should be re-evaluated with respect to those scenarios. Previously, ML studies of performance measures have primarily focused on binary classification. For a complete review, we add multi-class, multi-topic and hierarchical classification measures. The current study can also be useful for measure design. So far, the ML community has not considered measures' invariance when new measures were introduced (Bengio, Mariéthoz, & Keller, 2005; Huang & Ling, 2007) or suggested for adoption from other disciplines (Sokolova et al., 2006).
2. Overview of classification tasks
Supervised ML allows access to the data labels during the algorithm's training and testing stages. Consider categorical labels where data entries x_1, ..., x_n have to be assigned to predefined classes C_1, ..., C_l. Then classification falls into one of the following tasks:

Binary: the input is to be classified into one, and only one, of two non-overlapping classes (C_1, C_2). Binary classification is the most popular classification task. Assigned categories can be objective, independent of manual evaluation (e.g., republican or democrat in the votes data of the UCI repository (Asuncion & Newman, 2007)), or subjective, dependent on manual evaluation (e.g., positive or negative reviews on Amazon.com (Blitzer et al., 2007)). Classes can be well-defined (e.g., the votes labels), ambiguous (e.g., the review opinion labels), or both (e.g., medical vs. other texts in the Newsgroups collection, http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html).

Multi-class: the input is to be classified into one, and only one, of l non-overlapping classes. Multi-class problems include the identification of the iris type in a three-class data set popular in pattern recognition (Duda & Hart, 1973), learning the original 135 categories in the benchmark Reuters collection (http://www.daviddlewis.com/resources/testcollections/reuters21578), or tagging utterances as objective, subjective, or neutral (Wilson, Wiebe, & Hwa, 2006). As in the binary case, multi-class categorization can be objective or subjective, well-defined or ambiguous.

Multi-labelled: the input is to be classified into several of the l non-overlapping classes C_j. Examples include the classification of functions of yeast genes (Mewes, Albermann, Heumann, Lieb, & Pfeiffer, 1997), identifying scenes from image data (Li, Zhang, & Zhu, 2006), or text-database alignment and word alignment in machine translation (Snyder & Barzilay, 2007). In text mining of medical information, multi-label classification methods are often evaluated on OHSUMED, a collection of medical references (Hersh, Buckley, Leone, & Hickam, 1997). When the learning task is document topic classification, multi-labelling is often referred to as multi-topic classification, as for clinical texts that are assigned multiple disease codes from ICD-9-CM (Sasaki, Rea, & Ananiadou, 2007). Binary, multi-class, and multi-labelled problems form flat classification (Yang, 1999), in which categories are isolated and their relations are not considered important. The next, hierarchical, problem addresses relations among categories and includes their structure in the learning targets.

Hierarchical: the input is to be classified into one, and only one, of the classes C_j, which can be divided into subclasses or grouped into superclasses. The hierarchy is defined in advance and cannot be changed during classification. Text classification and bioinformatics supply many examples, e.g., protein function prediction (Eisner, Poulin, Szafron, Lu, & Greiner, 2005). Hierarchical classification can be transformed into flat classification. For example, the Reuters collection can be treated as a multi-class (Bobicev & Sokolova, 2008), multi-labelled (Tikk & Biró, 2003), or hierarchical (Sun, Lim, & Ng, 2003) classification problem.
A frequent appearance of language and text problems among the examples listed above can be explained by the special role Natural Language Processing (NLP) holds in ML applications. The richness of language characteristics and the fast-increasing volume of readily available digital texts make texts not only a nearly inexhaustible research area, but also one of the most important data formats for ML applications (Shawe-Taylor & Christianini, 2004). Text Classification has achieved a prominent place among ML applications to NLP problems. It is dedicated to finding texts according to given criteria (Sebastiani, 2002) and it includes the classification of documents (research papers, technical reports, magazine articles, etc.). For topic classification (e.g., identification of documents about a given city, or documents about bands and artists), documents are simply classified as being relevant to the topic or not; hence, classes are built as positive vs. "everything else". Retrieval of relevant documents being the more important task, the focus in this case is on true positive classification. The first comprehensive books on Machine Learning were published in the late 1990s (Langley, 1996; Mitchell, 1997). As a discipline, ML borrowed measures from an assortment of disciplines that traditionally relied on empirical evidence, e.g., medical trials (Isselbacher & Braunwald, 1994), behavioural research (Cohen, 1988), and information retrieval (IR) (van Rijsbergen, 1979; Salton & McGill, 1983). In some ways, text classification borrows from Information Extraction (IE), which preceded the use of Machine
Learning in automated text processing and understanding, e.g., the automated analysis and generation of synonymous texts (Boyer & Lapalme, 1985). The use of IE and IR metrics in the evaluation of ML algorithms is an example of such borrowing. The evaluation metrics commonly used in Text Classification (Precision, Recall, Fscore) have their origin in IE. The formulas for these measures neglect the correct classification of negative examples; instead, they reflect the importance of the retrieval of positive examples in text/document classification:

Precision: the number of correctly classified positive examples divided by the number of examples labeled by the system as positive;
Recall: the number of correctly classified positive examples divided by the number of positive examples in the data;
Fscore: a combination of the two.
In recent years, the NLP and ML communities have turned their attention to the study of opinions, subjective statements, and sentiments. The corresponding empirical problems are represented by the classification of political debates, web postings or phone calls, in which the main task is non-topic classification, e.g., vote classification, gender classification, sentiment classification, etc. Data for these studies are gathered from chat-boards, blogs, product and movie reviews, email, records of phone conversations and political debates, electronic negotiation transcripts, etc. Chat-boards, blogs and movie reviews are often used in sentiment analysis to find whether texts reflect a positive or negative opinion of the author on certain products or events. In this case, texts are classified according to opinion/sentiment labels (Pang, Lee, & Vaithyanathan, 2002). Email discussions, records of phone conversations and electronic negotiation transcripts are used in studies of individual behavior. The aim of such studies is to find what factors influence the behavior of a person in a specific situation. The classification of texts depends on the problem statement. Transcripts of the US Congress debates are used in social network analysis, a new area of Artificial Intelligence research. Here, a common task is to define important influence factors and predict the future behavior of members of a social group. In this case, records are classified according to the actions of speakers (Thomas, Pang, & Lee, 2006).
These sources represent records of human communication that convey meanings sent by a speaker and received by a hearer. These meanings can be complex, subtly expressed, and constituted from both what is said and what is implied. So far, there is no consensus on the choice of measures used to evaluate the performance of classifiers in opinion, subjectivity and sentiment analysis. Performance measures other than those above include Accuracy, used in Pang et al. (2002) and Thomas et al. (2006), and the correspondence between Precision and Recall, used in Gamon, Aue, Corston-Oliver, and Ringger (2005). When going from document classification to the classification of human communication, it is important to know how different performance measures relate to each other in order to help resolve disagreements among performance evaluations. Such disagreements happen quite often in experimental studies.
3. Performance measures for classification
The correctness of a classification can be evaluated by computing the number of correctly recognized class examples (true positives), the number of correctly recognized examples that do not belong to the class (true negatives), and the numbers of examples that either were incorrectly assigned to the class (false positives) or were not recognized as class examples (false negatives). These four counts constitute a confusion matrix, shown in Table 1 for the case of binary classification.
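To make the four counts concrete, the following Python sketch (the function name and example data are our own illustration) tallies tp, fp, tn and fn from a list of data labels and a list of classifier labels:

    def confusion_counts(data_labels, classifier_labels, positive="pos"):
        # Tally the four cells of a binary confusion matrix (Table 1).
        # data_labels: labels provided by the data; classifier_labels: labels
        # assigned by the classifier; positive: the value of the positive class.
        tp = fp = tn = fn = 0
        for d, c in zip(data_labels, classifier_labels):
            if c == positive and d == positive:
                tp += 1   # correctly recognized class example
            elif c == positive:
                fp += 1   # incorrectly assigned to the class
            elif d == positive:
                fn += 1   # class example that was not recognized
            else:
                tn += 1   # correctly recognized non-class example
        return tp, fp, tn, fn

    # Six examples, four of them positive in the data.
    data       = ["pos", "pos", "pos", "pos", "neg", "neg"]
    classified = ["pos", "pos", "neg", "pos", "pos", "neg"]
    print(confusion_counts(data, classified))   # (3, 1, 1, 1)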
Table 2 presents the most often used measures for binary classification, based on the values of the confusion matrix. AUC (Area Under the Curve, sometimes referred to as Balanced Accuracy) captures a single point on the Receiver Operating Characteristic curve; it can also be viewed as a linear transformation of the Youden Index (Youden, 1950). We omit measures such as BreakEvenPoint, the point at which Precision = Recall (Goutte & Gaussier, 2005), and the combination of AUC and Accuracy (Huang & Ling, 2007), because their properties can be derived from the basic measures. However, we present Fscore's properties because of its extensive use in text classification.
Table 3 presents the measures for multi-class classification. For an individual class C_i, the assessment is defined by tp_i, fn_i, tn_i and fp_i: Accuracy_i, Precision_i and Recall_i are calculated from the counts for C_i. The quality of the overall classification is usually
Table 1
Confusion matrix for binary classification and the corresponding array representation used in this paper.

                       Classified as positive    Classified as negative
    Data positive              tp                        fn
    Data negative              fp                        tn

The array representation of the matrix is [tp, fp, tn, fn].
assessed in two ways: either a measure is the average of the same measure calculated for C_1, ..., C_l (macro-averaging, shown with an M index), or the counts are summed to obtain cumulative tp, fn, tn, fp from which a performance measure is then calculated (micro-averaging, shown with a μ index). Macro-averaging treats all classes equally, while micro-averaging favors bigger classes. As there is yet no well-developed multi-class Receiver Operating Characteristic analysis (Lachiche & Flach, 2003), we do not include AUC in the list of multi-class classification measures.
The quality of multi-topic classification (Table 4) is assessed through either partial or complete class label matching (Kazawa, Izumitani, Taira, & Maeda, 2005); the latter is often referred to as exact matching. We consider all classes and their labels as being equivalent. These measures thus count correct or incorrect label identification independently of their order or rank. We do not include measures such as One-error, which counts how many times the top-ranked label was not a member of the predicted label set (Li et al., 2006). Some authors refer to the Exact Match Ratio as Accuracy (Zhu, Ji, Xu, & Gong, 2005). In Section 4, we show that these two measures are not interchangeable with respect to confusion matrix transformations; thus, they may not be equally applicable to similar settings.

For hierarchical classification measures (Table 5), we consider measures that incorporate the problem's hierarchy. These measures evaluate either descendant or ancestor performance (Kiritchenko, Matwin, Nock, & Famili, 2006).
Table 2
Measures for binary classification using the notation of Table 1.

Accuracy: (tp + tn) / (tp + fn + fp + tn). Overall effectiveness of a classifier.
Precision: tp / (tp + fp). Class agreement of the data labels with the positive labels given by the classifier.
Recall (Sensitivity): tp / (tp + fn). Effectiveness of a classifier to identify positive labels.
Fscore: (β^2 + 1) tp / ((β^2 + 1) tp + β^2 fn + fp). Relations between data's positive labels and those given by a classifier.
Specificity: tn / (fp + tn). How effectively a classifier identifies negative labels.
AUC: (1/2) (tp / (tp + fn) + tn / (tn + fp)). Classifier's ability to avoid false classification.
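A minimal Python sketch of the measures in Table 2 (our illustration; β defaults to 1, which gives the familiar balanced F1 measure, and all denominators are assumed to be non-zero):

    def binary_measures(tp, fp, tn, fn, beta=1.0):
        # Binary classification measures of Table 2; beta weighs Recall against Precision.
        b2 = beta ** 2
        return {
            "Accuracy":    (tp + tn) / (tp + fn + fp + tn),
            "Precision":   tp / (tp + fp),
            "Recall":      tp / (tp + fn),                      # Sensitivity
            "Fscore":      (b2 + 1) * tp / ((b2 + 1) * tp + b2 * fn + fp),
            "Specificity": tn / (fp + tn),
            "AUC":         0.5 * (tp / (tp + fn) + tn / (tn + fp)),
        }

    # Using the counts from the earlier sketch: tp=3, fp=1, tn=1, fn=1.
    print(binary_measures(3, 1, 1, 1))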
Table 3
Measures for multi-class classification, based on a generalization of the measures of Table 1 for many classes C_i: tp_i are the true positive counts for C_i, and fp_i, fn_i, tn_i the false positive, false negative, and true negative counts respectively. μ and M indices represent micro- and macro-averaging.

Average Accuracy: (1/l) Σ_{i=1..l} (tp_i + tn_i) / (tp_i + fn_i + fp_i + tn_i). The average per-class effectiveness of a classifier.
Error Rate: (1/l) Σ_{i=1..l} (fp_i + fn_i) / (tp_i + fn_i + fp_i + tn_i). The average per-class classification error.
Precision_μ: Σ_{i=1..l} tp_i / Σ_{i=1..l} (tp_i + fp_i). Agreement of the data class labels with those of a classifier, calculated from sums of per-text decisions.
Recall_μ: Σ_{i=1..l} tp_i / Σ_{i=1..l} (tp_i + fn_i). Effectiveness of a classifier to identify class labels, calculated from sums of per-text decisions.
Fscore_μ: (β^2 + 1) Precision_μ Recall_μ / (β^2 Precision_μ + Recall_μ). Relations between data's positive labels and those given by a classifier, based on sums of per-text decisions.
Precision_M: (1/l) Σ_{i=1..l} tp_i / (tp_i + fp_i). An average per-class agreement of the data class labels with those of a classifier.
Recall_M: (1/l) Σ_{i=1..l} tp_i / (tp_i + fn_i). An average per-class effectiveness of a classifier to identify class labels.
Fscore_M: (β^2 + 1) Precision_M Recall_M / (β^2 Precision_M + Recall_M). Relations between data's positive labels and those given by a classifier, based on a per-class average.
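The micro- and macro-averaged measures of Table 3 can be sketched as follows (our illustration; each class is represented by its (tp_i, fp_i, fn_i) counts and non-zero denominators are assumed):

    def multiclass_pr_f(per_class_counts, beta=1.0):
        # per_class_counts: list of (tp_i, fp_i, fn_i) tuples, one per class C_i.
        b2 = beta ** 2

        def fscore(p, r):
            return (b2 + 1) * p * r / (b2 * p + r)

        # Micro-averaging: sum the counts first, then compute the measure.
        TP = sum(tp for tp, fp, fn in per_class_counts)
        FP = sum(fp for tp, fp, fn in per_class_counts)
        FN = sum(fn for tp, fp, fn in per_class_counts)
        p_micro, r_micro = TP / (TP + FP), TP / (TP + FN)

        # Macro-averaging: compute the measure per class, then average.
        l = len(per_class_counts)
        p_macro = sum(tp / (tp + fp) for tp, fp, fn in per_class_counts) / l
        r_macro = sum(tp / (tp + fn) for tp, fp, fn in per_class_counts) / l

        return {"Precision_mu": p_micro, "Recall_mu": r_micro, "Fscore_mu": fscore(p_micro, r_micro),
                "Precision_M": p_macro, "Recall_M": r_macro, "Fscore_M": fscore(p_macro, r_macro)}

    # A large class and a small class: micro-averaging favors the large class.
    print(multiclass_pr_f([(90, 10, 5), (2, 3, 8)]))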
Table 4
Measures for multi-topic classification. I is the indicator function; L_i = (L_i[1], ..., L_i[l]) denotes the set of class labels for x_i, with L_i[j] = 1 if C_j is present among the labels and 0 otherwise; L^c_i are the labels given by a classifier, L^d_i are the data labels.

Exact Match Ratio: (1/n) Σ_{i=1..n} I(L^c_i = L^d_i). The average per-text exact classification.
Labelling Fscore: (1/n) Σ_{i=1..n} [ 2 Σ_{j=1..l} L^c_i[j] L^d_i[j] / Σ_{j=1..l} (L^c_i[j] + L^d_i[j]) ]. The average per-text classification with partial matches.
Retrieval Fscore: (1/l) Σ_{j=1..l} [ 2 Σ_{i=1..n} L^c_i[j] L^d_i[j] / Σ_{i=1..n} (L^c_i[j] + L^d_i[j]) ]. The average per-class classification with partial matches.
Hamming Loss: (1/(n l)) Σ_{i=1..n} Σ_{j=1..l} I(L^c_i[j] ≠ L^d_i[j]). The average per-example per-class total error.
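A sketch of the multi-topic measures of Table 4 (our illustration; L^c and L^d are n x l binary matrices, and every text and every topic is assumed to carry at least one label so that no denominator vanishes):

    def multilabel_measures(Lc, Ld):
        # Lc: labels given by the classifier; Ld: data labels (lists of 0/1 lists).
        n, l = len(Ld), len(Ld[0])

        exact = sum(Lc[i] == Ld[i] for i in range(n)) / n                  # Exact Match Ratio

        labelling_f = sum(                                                 # per-text partial matches
            2 * sum(Lc[i][j] * Ld[i][j] for j in range(l)) /
            sum(Lc[i][j] + Ld[i][j] for j in range(l))
            for i in range(n)) / n

        retrieval_f = sum(                                                 # per-class partial matches
            2 * sum(Lc[i][j] * Ld[i][j] for i in range(n)) /
            sum(Lc[i][j] + Ld[i][j] for i in range(n))
            for j in range(l)) / l

        hamming = sum(Lc[i][j] != Ld[i][j]                                 # per-example per-class error
                      for i in range(n) for j in range(l)) / (n * l)

        return {"ExactMatchRatio": exact, "LabellingFscore": labelling_f,
                "RetrievalFscore": retrieval_f, "HammingLoss": hamming}

    Ld = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]    # data labels for three texts, three topics
    Lc = [[1, 0, 1], [0, 1, 1], [1, 0, 0]]    # classifier labels
    print(multilabel_measures(Lc, Ld))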
We omit the distance- and semantics-based measures suggested for hierarchical classification (Blockeel, Bruynooghe, Dzeroski, Ramon, & Struyf, 2002; Sun et al., 2003). These measures extend flat, non-hierarchical, measures by estimating differences and similarities between classes. However, in these measures, acceptable differences and similarities are often specified by users (Costa, Lorena, Carvalho, & Freitas, 2007). Thus, the obtained results may be subjective and user-specific. A similar restriction applies to depth-dependent measures, which relate classes by imposing vertical distances (Blockeel et al., 2002).
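As one way to aggregate the superclass-based Precision and Recall of Table 5 over a set of examples, the following sketch (our illustration; the toy hierarchy and all names are assumptions) compares the superclass sets induced by the classifier's label and by the data label:

    def ancestors(label, parent):
        # The set containing a class and all of its ancestors; parent maps a
        # class to its parent (None for a root of the hierarchy).
        out = set()
        while label is not None:
            out.add(label)
            label = parent.get(label)
        return out

    def hierarchical_pr(classified, data, parent, beta=1.0):
        # Superclass-based Precision, Recall and Fscore (Table 5), summed over examples.
        b2 = beta ** 2
        inter = pred = gold = 0
        for c, d in zip(classified, data):
            Cc, Cd = ancestors(c, parent), ancestors(d, parent)
            inter += len(Cc & Cd)
            pred += len(Cc)
            gold += len(Cd)
        p, r = inter / pred, inter / gold
        return p, r, (b2 + 1) * p * r / (b2 * p + r)

    # A toy hierarchy: sports -> {hockey, soccer}; politics -> {elections}.
    parent = {"hockey": "sports", "soccer": "sports", "elections": "politics",
              "sports": None, "politics": None}
    print(hierarchical_pr(["hockey", "elections"], ["soccer", "elections"], parent))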
Data Mining has successfully exploited the invariant properties of interestingness measures for the comparison of association and classification rules (Tan, Kumar, & Srivastava, 2004). Some invariant properties of binary classification measures have been discussed within broader studies of the classification of communication records (Sokolova & Lapalme, 2007). In the current study, we consider new invariant properties and expand the set of discussed measures by including multi-class, multi-topic and hierarchical classification measures. Although the latter three types of classification are quite popular, their measures have not been studied to the same extent as binary classification measures.
4. Invariance properties of measures
We focus on the ability of a measure to preserve its value under a change in the confusion matrix. A measure is invariant if its value does not change when a confusion matrix changes, i.e., invariance indicates that the measure does not detect the change in the confusion matrix. This inability can be beneficial or adverse, depending on the goals.

Let us first consider a case when invariance to a change of tn is beneficial. Text classification extensively uses Precision and Recall (Sensitivity), which do not detect changes in tn when all other matrix entries remain the same. In document classification, a large number of unrelated documents constitute a negative class without a single unifying characteristic. The criterion for the performance of a classifier is its performance on relevant documents, a well-defined unimodal positive class, independently of its performance on the irrelevant documents. Precision and Recall do not depend on tn, but only on the correct labelling of positive examples (tp) and the incorrect labelling of examples (fp and fn). These measures provide the best perspective on a classifier's performance for document classification.

In contrast, the same invariance to the tn change can be a drawback. Consider the classification of human communication, where negative classes are also important. In those problems, classes often have distinct features (male or female) for which both positive and negative classes are well-defined. The retrieval of a positive class, the discrimination between classes, or the balance between retrieval from both classes are problem-dependent tasks. Thus, an appropriate evaluation measure should take into account the classification of negative examples and reflect the changes in tn when the other matrix elements stay the same.
We now examine eight invariance properties (I_k, 1 ≤ k ≤ 8) with respect to changes of the elements in a confusion matrix. All eight changes are the results of elementary operations on matrices: addition, scalar multiplication, and interchange of rows or columns. This set covers all relevant label distribution changes in a classification problem: interchange of the positive and negative labels provided by the data, interchange of those labels output by a classifier, change of a single segment (e.g., true positives), and a uniform increase in the number of examples. Henceforth, Ī_k denotes the non-invariance of a transformation. We discuss binary classification in detail because the other evaluation measures are derived from the binary confusion matrix and its performance measures. In several parts of the discussion, we refer to data quality. By this we understand how well examples represent the underlying notion (especially ease of understanding and interpretability), how accurate the data are, including the labels, and the amount of noise (based on Wang & Strong (1996)). Hereinafter, f([tp, fp, tn, fn]) denotes a measure's value. Our claim is that the following invariance properties affect the applicability and trustworthiness of a measure.
4.1. (I1) Exchange of positives and negatives

A measure f([tp, fp, tn, fn]) is invariant under the exchange of positives and negatives if f([tp, fp, tn, fn]) = f([tn, fp, tp, fn]).

    [ tp  fn ]        [ tn  fp ]
    [ fp  tn ]   →    [ fn  tp ]
Table 5
Measures for hierarchical classification: C↓ denotes the subclasses of class C; C^c↓ denotes the subclasses assigned by a classifier; C^d↓ the data subclass labels; similar notations apply to superclasses, which are denoted by C↑.

Precision↓: |C^c↓ ∩ C^d↓| / |C^c↓|. Positive agreement on subclass labels w.r.t. the subclass labels given by a classifier.
Recall↓: |C^c↓ ∩ C^d↓| / |C^d↓|. Positive agreement on subclass labels w.r.t. the subclass labels given by data.
Fscore↓: (β^2 + 1) Precision↓ Recall↓ / (β^2 Precision↓ + Recall↓). Relations between data's positive subclass labels and those given by a classifier.
Precision↑: |C^c↑ ∩ C^d↑| / |C^c↑|. Positive agreement on superclass labels w.r.t. the superclass labels given by a classifier.
Recall↑: |C^c↑ ∩ C^d↑| / |C^d↑|. Positive agreement on superclass labels w.r.t. the superclass labels given by data.
Fscore↑: (β^2 + 1) Precision↑ Recall↑ / (β^2 Precision↑ + Recall↑). Relations between data's positive superclass labels and those given by a classifier.
A measure invariant under this transformation does not distinguish tp from tn, nor fn from fp, i.e., it is insensitive to the distribution of classification results, and it may not recognize the asymmetry of classification results. Thus it may not be trustworthy when classifiers are compared on data sets with different and/or unbalanced class distributions. For example, invariant measures may be more appropriate for assessing the classification of consumer reviews than for document classification.
4.2. (I2) Change of true negative counts

A measure f([tp, fp, tn, fn]) is invariant under a change of tn, all other matrix counts remaining the same, if f([tp, fp, tn, fn]) = f([tp, fp, tn', fn]).

    [ tp  fn ]        [ tp  fn  ]
    [ fp  tn ]   →    [ fp  tn' ]
An invariant measure does not recognize the specifying ability of classifiers. Such evaluation may be more applicable to domains with a multi-modal negative class taken as "everything not positive". In the case of text classification, these invariant measures are suitable for the evaluation of document classification. If a measure is non-invariant, then it acknowledges the ability of classifiers to correctly identify negative examples. In this case, it may be reliable for comparison in domains with a well-defined, unimodal, negative class. Non-invariant measures are preferable for evaluating communications in which there are criteria for both positive and negative results.
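The effect of I2 can be checked numerically with the binary_measures sketch given after Table 2 (the counts below are illustrative): increasing tn while tp, fp and fn stay fixed leaves Precision, Recall and Fscore unchanged, while Accuracy, Specificity and AUC move.

    # Invariance I2: change tn, keep tp, fp, fn fixed.
    before = binary_measures(tp=60, fp=20, tn=10, fn=30)
    after  = binary_measures(tp=60, fp=20, tn=500, fn=30)

    for m in before:
        status = "invariant" if abs(before[m] - after[m]) < 1e-12 else "changed"
        print(f"{m:12s} {before[m]:.3f} -> {after[m]:.3f}  ({status})")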
4.3. (I3) Change of true positive counts

A measure f([tp, fp, tn, fn]) is invariant under a change of tp, all other matrix counts remaining the same, if f([tp, fp, tn, fn]) = f([tp', fp, tn, fn]).

    [ tp  fn ]        [ tp'  fn ]
    [ fp  tn ]   →    [ fp   tn ]
An invariant measure does not recognize a classifier's sensitivity. Such evaluation can complement other measures, but can hardly stand on its own. It may be reliable for comparison in domains with a well-defined, unimodal, negative class. As opposed to I2, these invariant measures are not suitable for the evaluation of document classification. Non-invariant measures can be used on their own for evaluating classification with a strong positive class.
4.4. (I4) Change of false negative counts

A measure f([tp, fp, tn, fn]) is invariant under a change of fn, all other matrix counts remaining the same, if f([tp, fp, tn, fn]) = f([tp, fp, tn, fn']).

    [ tp  fn ]        [ tp  fn' ]
    [ fp  tn ]   →    [ fp  tn  ]
Invariance indicates measure stability under disagreement between the data and the negative labels assigned by a classifier. This is especially important for problems involving manual labelling. If a negative class has unreliable labels (Nigam & Hurst (2004) argue that humans can agree on only 74% of labels for negative opinion), an invariant measure may give misleading results. For a non-invariant measure, the monotonicity of its value is important. If the classifier evaluation improves when fn increases, the measure may favor a classifier prone to false negatives. The choice between invariant and non-invariant measures should be made based on the problem and data characteristics.
4.5. (I5) Change of false positive counts

A measure f([tp, fp, tn, fn]) is invariant under a change of fp, all other matrix counts remaining the same, if f([tp, fp, tn, fn]) = f([tp, fp', tn, fn]).

    [ tp  fn ]        [ tp   fn ]
    [ fp  tn ]   →    [ fp'  tn ]
An invariant measure may provide reliable results when some of the positive data labels are counter-intuitive. This can happen when the positive examples have outliers that cannot be explained by the mainstream data. We call such outliers counterexamples.

A non-invariant measure may not be suitable for data with many counterexamples. If the classifier evaluation improves when fp increases, the measure may favor a classifier prone to false positives. This is especially important for problems involving subjective labelling. Some data entries may not have consistent labels because of the difficulty of imposing
rigorous labelling rules. This can occur in the classification of records of long-term communications in which the data may
contain a substantial number of counterexamples.
4.6. (I6) Uniform change of positives and negatives

A measure f([tp, fp, tn, fn]) is invariant under a uniform change of positives and negatives if f([tp, fp, tn, fn]) = f([k1·tp, k1·fp, k1·tn, k1·fn]), k1 ≠ 1.

    [ tp  fn ]        [ k1·tp  k1·fn ]
    [ fp  tn ]   →    [ k1·fp  k1·tn ]
An invariant measure is stable with respect to the uniform increase of data size, i.e., scalar multiplication of the confusion
matrix. If we expect that for different data sizes the same proportion of examples will exhibit positive and negative charac-
teristics, then the invariant measure may be a better choice for the evaluation of classifiers.
When a measure is non-invariant, then its applicability may depend on data sizes. The non-invariant measures may be
more reliable if we do not know how representative the data sample is in terms of the proportion of positive/negative
examples.
4.7. (I7) Change of positive and negative columns

A measure f([tp, fp, tn, fn]) is invariant under a change of the columns if f([tp, fp, tn, fn]) = f([k1·tp, k1·fp, k2·tn, k2·fn]), k1 ≠ k2.

    [ tp  fn ]        [ k1·tp  k2·fn ]
    [ fp  tn ]   →    [ k1·fp  k2·tn ]
Suppose that data samples of different sizes have the same proportion of positive and negative examples. This change in the confusion matrix is caused by changes in the proportion of positive and negative labels issued by an algorithm, i.e., the columns are multiplied by different scalars. This may happen when the quality of additional data substantially differs from the initial data sample (e.g., the information inflow can add more noise). However, an invariant measure does not show the performance change. Thus, it requires the support of other measures to assess a classifier's performance on different classes.

A non-invariant measure reflects the performance of a classifier on different classes. It is more appropriate if we can expect a change in the algorithm's performance across classes.
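Again using the binary_measures sketch and illustrative counts, the column scaling of I7 can be verified numerically; among the measures of Table 2, only Precision keeps its value.

    # Invariance I7: multiply the column of positive classifier labels (tp, fp)
    # by k1 and the column of negative classifier labels (fn, tn) by k2, k1 != k2.
    k1, k2 = 2, 5
    before = binary_measures(tp=60, fp=20, tn=10, fn=30)
    after  = binary_measures(tp=k1 * 60, fp=k1 * 20, tn=k2 * 10, fn=k2 * 30)

    unchanged = [m for m in before if abs(before[m] - after[m]) < 1e-12]
    print(unchanged)   # ['Precision']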
4.8. (I8) Change of positive and negative rows

A measure f([tp, fp, tn, fn]) is invariant under a change of the rows if f([tp, fp, tn, fn]) = f([k1·tp, k2·fp, k2·tn, k1·fn]), k1 ≠ k2.

    [ tp  fn ]        [ k1·tp  k1·fn ]
    [ fp  tn ]   →    [ k2·fp  k2·tn ]
We again expect that data samples of different sizes have the same proportion of positive and negative examples. The change in the confusion matrix then corresponds to changes in an algorithm's performance within the positive (negative) class, i.e., the rows are multiplied by different scalars. For example, this may happen when a positive (negative) class is better represented in the new data. If we expect that data samples of different sizes exhibit the same quality of positive (negative) characteristics, then an invariant measure may be a better choice for the evaluation of classifiers.

When a measure is non-invariant, its applicability may depend on the quality of the data classes. The non-invariant measures may be more reliable if we do not know how representative the data sample is in terms of the quality of the positive and negative classes, which might be the case in web-posted consumer reviews.
For multi-class classification, we consider transformations of the confusion matrix for each class C_j. As expected, the measures retain their invariance properties regardless of micro- or macro-averaging.

For multi-topic classification, Exact Match Ratio and Accuracy have different invariance properties. Thus, referring to the Exact Match Ratio as Accuracy may be misleading.
Measures used in hierarchical classification have a somewhat limited reliability because they evaluate the performance of a classifier either on subclasses or on superclasses, but not on both. Thus, invariance properties should be assessed with respect to the classification of subclasses (for Precision↓ and Recall↓) and of superclasses (for Precision↑ and Recall↑).
Table 6 displays the invariance properties of the measures described in Tables 2-5. By assessing the invariance properties of commonly used measures, we show that Precision, Precision_μ, Precision_M, Precision↓ and Precision↑ exhibit the same invariance characteristics. Thus, we group them as Precision_G, for general. Similarly, we group Recall, Recall_μ, Recall_M, Recall↓ and Recall↑ as Recall_G; Fscore, Fscore_μ, Fscore_M, Fscore↓ and Fscore↑ as Fscore_G; and, finally, Accuracy, Average Accuracy and Error Rate (essentially 1 - Average Accuracy) as Accuracy_G.
As a result, we further consider only those performance measures that vary in their invariance properties. Table 7 lists the
measures and their properties. Our next step is to associate the invariant properties with particular settings.
5. Analysis of invariant properties
To identify similarities among the measures, we compare them according to their invariance and non-invariance properties, shown in Table 7. First, we present measure outliers, whose properties set them remarkably apart from the others. Two measures hold unique invariance properties: Precision_G is the only measure invariant under vertical scaling (I7), and Exact Match Ratio is the only measure non-invariant under uniform scaling (I6). Another exception is Retrieval Fscore, which is sensitive to all the changes in the confusion matrix except uniform scaling.
Next we generalize on the properties:
The invariance I1 has been much discussed in the Machine Learning community, albeit from a negative point of view (Japkowicz, 2006). But we want to emphasize that this invariance makes Accuracy_G and Hamming Loss robust measures for an algorithm's overall performance and insensitive to performance on a specific class. The corresponding non-invariance Ī1 means that the measures are sensitive to asymmetry of classification. This is a well-known characteristic for Precision, Recall, Fscore and Specificity, but not for AUC, which has been introduced only recently in text classification.
The invariance I2 is a well-known property of Precision, Recall, and Fscore, and a less known property of Labelling Fscore and Hamming Loss. Invariance under the change of tn has made them a tool of choice for the evaluation of document classification. The non-invariance Ī2 signifies that the use of non-invariant measures is more appropriate
Table 6
Invariance properties of the performance measures (I_k) for the different types of classification tasks; + denotes invariance and - non-invariance of the measure.

                            I1   I2   I3   I4   I5   I6   I7   I8
Binary classification (Table 2)
  Accuracy                   +    -    -    -    -    +    -    -
  Precision                  -    +    -    +    -    +    +    -
  Recall (Sensitivity)       -    +    -    -    +    +    -    +
  Fscore                     -    +    -    -    -    +    -    -
  Specificity                -    -    +    +    -    +    -    +
  AUC                        -    -    -    -    -    +    -    +
Multi-class classification (Table 3)
  Average Accuracy           +    -    -    -    -    +    -    -
  Error Rate                 +    -    -    -    -    +    -    -
  Precision_μ                -    +    -    +    -    +    +    -
  Recall_μ                   -    +    -    -    +    +    -    +
  Fscore_μ                   -    +    -    -    -    +    -    -
  Precision_M                -    +    -    +    -    +    +    -
  Recall_M                   -    +    -    -    +    +    -    +
  Fscore_M                   -    +    -    -    -    +    -    -
Multi-topic classification (Table 4)
  Exact Match Ratio          -    -    -    +    +    -    -    -
  Labelling Fscore           -    +    -    -    -    +    -    -
  Retrieval Fscore           -    -    -    -    -    +    -    -
  Hamming Loss               +    +    +    -    -    +    -    -
Hierarchical classification (Table 5)
  Precision↓                 -    +    -    +    -    +    +    -
  Recall↓                    -    +    -    -    +    +    -    +
  Fscore↓                    -    +    -    -    -    +    -    -
  Precision↑                 -    +    -    +    -    +    +    -
  Recall↑                    -    +    -    -    +    +    -    +
  Fscore↑                    -    +    -    -    -    +    -    -
Table 7
Performance measures that exhibit different invariance properties; + denotes invariance and - non-invariance of the measure.

                            I1   I2   I3   I4   I5   I6   I7   I8
  Accuracy_G                 +    -    -    -    -    +    -    -
  Precision_G                -    +    -    +    -    +    +    -
  Recall_G (Sensitivity)     -    +    -    -    +    +    -    +
  Fscore_G                   -    +    -    -    -    +    -    -
  Specificity                -    -    +    +    -    +    -    +
  AUC                        -    -    -    -    -    +    -    +
  Exact Match Ratio          -    -    -    +    +    -    -    -
  Labelling Fscore           -    +    -    -    -    +    -    -
  Retrieval Fscore           -    -    -    -    -    +    -    -
  Hamming Loss               +    +    +    -    -    +    -    -
on data with a unimodal negative class than with a multi-modal one. This implication is more important
for AUC than for Specificity. The latter is usually used in combination with other measures, whereas the
former might be applied separately.
The invariance I3 has so far eluded thorough study. Measures are expected to be non-invariant under the change of tp. The non-invariant measures are used for evaluating classification with a strong positive class, such as in the evaluation of document classification. Only Specificity and Hamming Loss do not detect the tp change; Specificity was purposefully designed to avoid tp. The invariance of Specificity and Hamming Loss suggests that they may be used in combination with other measures. These two measures may be reliable for comparison in domains with a well-defined, unimodal, negative class.
The invariance I4 under a change in fn indicates that Precision, Specificity, and Exact Match Ratio may be more reliable when manual labelling follows rigorous rules for a negative class. In the absence of such rules, disagreement between the data labels and the negative labels assigned by a classifier can depend on subjective factors and fluctuate. Under such conditions, an invariant measure may give misleading results. All the Ī4 measures discussed above are monotone decreasing when fn increases; hence, they will not favor a classifier prone to false negatives.
The invariance I5 under an fp change indicates that Recall and Exact Match Ratio may provide a reasonably conservative estimate when a positive class has counterexamples, i.e., outliers not explained by the mainstream positive examples. The other eight measures are non-invariant. However, they are monotone decreasing when fp increases; hence, they will not favor a classifier prone to false positives.
The invariance I6 under uniform scaling holds for all the measures except Exact Match Ratio. The nine invariant measures adapt to different sizes of data. The non-invariance of Exact Match Ratio indicates that its results may not be comparable when obtained on data of widely different sizes.

The invariance I7 under the scalar column change holds only for Precision. This supports the common practice of combining Precision with other measures when assessing classifier performance. The combination ensures that the evaluation is less dependent on the data quality. All the other measures are non-invariant under the scalar column change. Thus, they are more reliable if an algorithm's performance is expected to change across classes with new data.

The invariance I8 under the scalar row change indicates that Recall, Specificity, and AUC may be a better choice for the evaluation of classifiers if data samples of different sizes exhibit the same quality of positive (negative) characteristics. Examples are simulated data or data generated under the same distribution. The other measures are non-invariant. They may be more reliable if the representative power of the positive and negative classes is uncertain.
Invariance with respect to the matrix transformations is especially important because it connects evaluation measures to particular learning settings. We now summarize the applicability of these measures to two subfields of text classification: document classification and the classification of human communications. One might be tempted to apply the Fscore measures to any text classification evaluation. However, various classification problems exhibit different characteristics which may require different evaluation measures. Based on our analysis, we propose the following.

Document classification data is often highly imbalanced: relevant documents constitute a small, well-defined positive class, while the rest form a heterogeneous negative class built from non-relevant documents as "everything non-positive". The presence of a negative class that complements the positive class favors the use of the Fscore measures. In many such problems, the examples of the positive class remain the same and the class keeps its modality, whereas the examples of the negative class change. Since the Fscore measures are invariant under the change of correctly classified negative examples (I2), they will be less sensitive to changes in the negative class.
The classification of human communications is most often represented by sentiment classification applied to collections of free-form texts containing product evaluations. The number and ratio of positive and negative examples depend on the popularity of a particular product. If reviewers have strong likes and dislikes, then both classes have well-defined characteristics. In this case, Area Under the Curve (AUC) may provide a more reliable classifier evaluation than Precision and Recall. Since AUC is non-invariant under the change of correctly classified negative examples (I2), it will detect possible changes in the negative class better than the Fscore measures.
For other types of classification of communications in social activities, other measure combinations might be more suitable. Political debates and electronic negotiations are examples of such communications. Their data can exhibit a unimodal negative class and a large number of counterexamples. In political debates, counterexamples are records that praise the discussed matter but vote against it in the end, either because of a hidden motive or because of randomness of behavior (Sokolova & Lapalme, 2007). In such cases, which are difficult even for human classification, Accuracy, with its invariance under the exchange of positives and negatives (I1), and Precision, with its invariance under the change of false negative counts (I4), may be used for a reliable evaluation of classifiers.
6. Conclusion and future work
In this study, we have analyzed twenty-four performance measures used in the complete spectrum of Machine Learning classification tasks: binary, multi-class, multi-labelled, and hierarchical. The effects of changes in the confusion matrix on several
well-known measures have been studied. In all cases, we have shown that the evaluation of classification results can depend on the invariance properties of the measures. A few cases required that we additionally consider the monotonicity of the measure. These properties have allowed us to make fine distinctions in the relations between the measures. One way to ensure a reliable evaluation is to employ a measure corresponding to the expected quality of the data, e.g., the representativeness of the class distribution, the reliability of the class labels, and the uni- or multi-modality of the classes. To match measures with the data characteristics, we constructed the measure invariance taxonomy with respect to all relevant label distribution changes in a classification problem.
We supplemented the formal discussion by analyzing the applicability of performance measures to different subfields of text classification. We have shown that the classification of human communications differs from document classification, and thus that these two types of text classification may require different performance measures.

Our study has dealt with measures used in text classification, but it could be extended to other language applications of Machine Learning. The next step would be to study the measures used in Machine Translation. This will considerably expand the measure list. The applicability of the measures to traditional Natural Language Processing tasks, e.g., word sense disambiguation and parsing, is another topic of considerable interest. It would also be useful to analyze a measure's monotonicity in more detail, especially its behavior with respect to extreme classification results, such as when the labels provided by the data and by a classifier are independent. Person authentication, in which the appropriate measures are the false acceptance rate and the false rejection rate (Bengio et al., 2005), is another example of a possible application.
Acknowledgments
This work has been funded by the Natural Sciences and Engineering Research Council of Canada and the Ontario Centres
of Excellence. We thank Elliott Macklovitch for fruitful suggestions on an early draft. We thank anonymous reviewers for
helpful comments.
References
Asuncion, A., & Newman, D. (2007). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.
<http://www.ics.uci.edu/mlearn/MLRepository.html>.
Bengio, S., Mariéthoz, J., & Keller, M. (2005). The expected performance curve. In Proceedings of the ICML’05 workshop on ROC analysis in machine learning (pp.
43–50).
Blitzer, J., Dredze, M., & Pereira, F. (2007). Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings
of the 45th annual meeting of the association of computational linguistics (pp. 440–447). Association for Computational Linguistics.
Blockeel, H., Bruynooghe, M., Dzeroski, S., Ramon, J., & Struyf, J. (2002). Hierarchical multi-classification. In KDD-2002: Workshop on multi-relational data
mining (pp. 21–35).
Bobicev, V., & Sokolova, M. (2008). An effective and robust method for short text classification. In Proceedings of the association for the advancement of
artificial intelligence (AAAI-2008) (pp. 1444–1445). AAAI Press.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum.
Costa, E., Lorena, A., Carvalho, A., & Freitas, A. (2007). A review of performance evaluation measures for hierarchical classifiers. In Proceedings of the AAAI
2007 workshop ‘‘Evaluation methods for machine learning” (pp. 1–6).
Demsar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1–30.
Dietterich, T. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10, 1895–1923.
Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. John Wiley & Sons.
Eisner, R., Poulin, B., Szafron, D., Lu, P., & Greiner, R. (2005). Improving protein function prediction using the hierarchical structure of the gene ontology. In
Proceedings of IEEE symposium on computational intelligence in bioinformatics and computational biology (pp. 1–10).
Gamon, M., Aue, A., Corston-Oliver, S., & Ringger, E. (2005). Pulse: Mining customer opinions from free text. In Proceedings of the 6th international symposium
on intelligent data analysis (IDA 2005) (pp. 121–132).
Goutte, C., & Gaussier, E. (2005). A probabilistic interpretation of precision, recall and f-score, with implication for evaluation. In Proceedings of 27th European
conference on IR research (ECIR 2005) (pp. 345–359).
Hersh, W., Buckley, C., Leone, T., & Hickam, D. (1997). OHSUMED: An interactive retrieval evaluation and new large test collection for research. In
Proceedings of the 17th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR-97) (pp. 192–201).
Huang, J., & Ling, C. (2007). Constructing new and better evaluation measures for machine learning. In Proceedings of the 20th international joint conference on
artificial intelligence (IJCAI’2007) (pp. 859–864).
Isselbacher, K., & Braunwald, E. (1994). Harrison’s principles of internal medicine. McGraw-Hill.
Japkowicz, N. (2006). Why question machine learning evaluation methods? In Proceedings of the AAAI’06 workshop on evaluation methods for machine
learning (pp. 6–11).
Kazawa, H., Izumitani, T., Taira, H., & Maeda, E. (2005). Maximal margin labeling for multi-topic text categorization. In Advances in neural information
processing systems (NIPS’04), (Vol. 17, pp. 649–656).
Kiritchenko, S., Matwin, S., Nock, R., & Famili, A. F. (2006). Learning and evaluation in the presence of class hierarchies: Application to text categorization. In
Proceedings of the 19th Canadian conference on AI (AI’2006) (pp. 395–406).
Lachiche, N., & Flach, P. A. (2003). Improving accuracy and cost of two-class and multi-class probabilistic classifiers using ROC curves. In Proceedings of
ICML’2003 (pp. 416–423).
Langley, P. (1996). Elements of machine learning. San Francisco, Calif: Morgan Kaufmann.
Li, T., Zhang, C., & Zhu, S. (2006). Empirical studies on multi-label classification. In Proceedings of the 18th IEEE international conference on tools with artificial
intelligence (pp. 86–92).
Marchand, M., & Shawe-Taylor, J. (2002). The set covering machine. Journal of Machine Learning Research, 3, 723–746.
Mewes, H.-W., Albermann, K., Heumann, K., Lieb, S., & Pfeiffer, F. (1997). MIPS: A database for protein sequences, homology data and yeast genome
information. Nucleic Acids Research, 25(1), 28–30.
Mitchell, T. (1997). Machine learning. McGraw-Hill.
Nigam, K., & Hurst, M. (2004). Towards a robust metric of opinion. In Proceedings of the AAAI spring symposium on exploring attitude and affect in text
(pp. 98–105). AAAI Press.
Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of empirical methods of
natural language processing (EMNLP’02) (pp. 79–86).
van Rijsbergen, C. (1979). Information retrieval (2nd ed.). London: Butterworths.
Salton, G., & McGill, M. (1983). Introduction to modern information retrieval. New York: McGraw-Hill.
Salzberg, S. L. (1999). On comparing classifiers: A critique of current research and methods. Data Mining and Knowledge Discovery, 1, 1–12.
Sasaki, Y., Rea, B., & Ananiadou, S. (2007). Multi-topic aspects in clinical text classification. In Proceedings of the 2007 IEEE international conference on
bioinformatics and biomedicine (pp. 62–70). IEEE Computer Society.
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1–47.
Shawe-Taylor, J., & Christianini, N. (2004). Kernel methods for pattern analysis. Cambridge University Press.
Snyder, B., & Barzilay, R. (2007). Database-text alignment via structured multilabel classification. In Proceedings of the international joint conference on
artificial intelligence (IJCAI-2007) (pp. 1713–1718).
Sokolova, M., Japkowicz, N., & Szpakowicz, S. (2006). Beyond accuracy, F-score and ROC: A family of discriminant measures for performance evaluation. In
Proceedings of the ACS Australian joint conference on artificial intelligence (pp. 1015–1021).
Sokolova, M., & Lapalme, G. (2007). Performance measures in classification of human communication. In Proceedings of the 20th Canadian conference on
artificial intelligence ( AI’2007) (pp. 159–170).
Sun, A., Lim, E.-P., & Ng, W.-K. (2003). Performance measurement framework for hierarchical text classification. Journal of the American Society for Information Science and Technology, 54(11), 1014–1028.
Tan, P., Kumar, V., & Srivastava, J. (2004). Selecting the right objective measure for association analysis. Information Systems, 29(4), 293–313.
Tikk, D., & Biró, G. (2003). Experiments with multi-label text classifier on the Reuters collection. In Proceedings of the international conference on
computational cybernetics (ICCC 03) (pp. 33–38).
Thomas, M., Pang, B., & Lee, L. (2006). Get out the vote: Determining support or opposition from congressional floor-debate transcripts. In: Proceedings of the
2006 conference on empirical methods in natural language processing (pp. 327–335).
Wang, R., & Strong, D. (1996). Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems, 12(4), 5–34.
Wilson, T., Wiebe, J., & Hwa, R. (2006). Recognizing strong and weak opinion clauses. Computational Intelligence, 22(2), 73–99.
Yang, Y. (1999). An evaluation of statistical approaches to text categorization. Information Retrieval, 1, 69–90.
Youden, W. (1950). Index for rating diagnostic tests. Cancer, 3, 32–35.
Zhu, S., Ji, X., Xu, W., & Gong, Y. (2005). Multi-labelled classification using maximum entropy method. In Proceedings of the 28th annual international ACM
SIGIR conference on Research and development in information retrieval (pp. 274–281).
... Only measuring the training and testing accuracy is not enough to validate the proposed methodology. Therefore, other performance measures, i.e., sensitivity, specificity, accuracy, F-Score, balanced accuracy (BA), informedness (BM), and markedness (MK) [48,49], were used to evaluate the efficiency of the proposed method. ...
Article
Full-text available
Class imbalance is one of the significant challenges in classification problems. The uneven distribution of data samples in different classes may occur due to human error, improper/unguided collection of data samples, etc. The uneven distribution of class samples among classes may affect the classification accuracy of the developed model. The main motivation behind this study is the design and development of methodologies for handling class imbalance problems. In this study, a new variant of the synthetic minority oversampling technique (SMOTE) has been proposed with the hybridization of particle swarm optimization (PSO) and Egyptian vulture (EV). The proposed method has been termed SMOTE-PSOEV in this study. The proposed method generates an optimized set of synthetic samples from traditional SMOTE and augments the five datasets for verification and validation. The SMOTE-PSOEV is then compared with existing SMOTE variants, i.e., Tomek Link, Borderline SMOTE1, Borderline SMOTE2, Distance SMOTE, and ADASYN. After data augmentation to the minority classes, the performance of SMOTE-PSOEV has been evaluated using support vector machine (SVM), Naïve Bayes (NB), and k-nearest-neighbor (k-NN) classifiers. The results illustrate that the proposed models achieved higher accuracy than existing SMOTE variants.
... This situation makes it difficult to evaluate the model correctly. In order to avoid this problem, we chose the verification methods suggested in the literature to evaluate our model [61]. ...
Article
The analysis and grading of software vulnerabilities is an important process that is done manually by experts today. For this reason, there are time delays, human errors, and excessive costs involved with the process. The final result of these software vulnerability reports created by experts is the calculation of a severity score and a severity rating. The severity rating is the first and foremost value of the software’s vulnerability. The vulnerabilities that can be exploited are only 20% of the total vulnerabilities. The vast majority of exploitations take place within the first two weeks. It is therefore imperative to determine the severity rating without time delays. Our proposed model uses statistical methods and deep learning-based word embedding methods from natural language processing techniques, and machine learning algorithms that perform multi-class classification. Bag of Words, Term Frequency Inverse Document Frequency and Ngram methods, which are statistical methods, were used for feature extraction. Word2Vec, Doc2Vec and Fasttext algorithms are included in the study for deep learning based Word embedding. In the classification stage, Naive Bayes, Decision Tree, K-Nearest Neighbors, Multi-Layer Perceptron, and Random Forest algorithms that can make multi-class classification were preferred. With this aspect, our model proposes a hybrid method. The database used is open to the public and is the most reliable data set in the field. The results obtained in our study are quite promising. By helping experts in this field, procedures will speed up. In addition, our study is one of the first studies containing the latest version of the data size and scoring systems it covers.
... are calculated [43]. tp_i is the true positive count for class i, tn_i is the true negative count for class i, fp_i is the false positive count for class i, fn_i is the false negative count for class i, and l is the number of classes. ...
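A short Python sketch of the per-class quantities named in this excerpt: for each of the l classes, tp_i, fp_i, fn_i and tn_i can be read off a multi-class confusion matrix and combined into macro-averaged precision and recall. The example matrix is illustrative.

import numpy as np

def per_class_counts(cm):
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp      # predicted as class i but actually another class
    fn = cm.sum(axis=1) - tp      # actually class i but predicted as another class
    tn = cm.sum() - (tp + fp + fn)
    return tp, fp, fn, tn

cm = np.array([[50, 3, 2],
               [4, 45, 1],
               [2, 2, 41]])       # rows: true class, columns: predicted class
tp, fp, fn, tn = per_class_counts(cm)
print("macro precision:", np.mean(tp / (tp + fp)))
print("macro recall   :", np.mean(tp / (tp + fn)))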
Article
Full-text available
We present a convolutional neural network (CNN) that identifies drone models in real-life videos. The neural network is trained on synthetic images and tested on a real-life dataset of drone videos. To create the training and validation datasets, we show a method of generating synthetic drone images. Domain randomization is used to vary simulation parameters such as model textures, background images, and orientation. Three common drone models are classified: DJI Phantom, DJI Mavic, and DJI Inspire. To test the performance of the neural network model, Anti-UAV, a real-life dataset of flying drones, is used. The proposed method reduces the time cost associated with manually labelling drones, and we show that it transfers to real-life videos. The CNN achieves an overall accuracy of 92.4%, a precision of 88.8%, a recall of 88.6%, and an F1 score of 88.7% when tested on the real-life dataset.
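The domain-randomization step described above can be illustrated with a toy compositing routine: paste a foreground object onto a randomly chosen background at a random scale, rotation, and position. The Python sketch below uses placeholder file names and is not the rendering pipeline of the cited study.

import random
from PIL import Image

def randomized_composite(fg_path, bg_paths, size=(224, 224)):
    fg = Image.open(fg_path).convert("RGBA")
    bg = Image.open(random.choice(bg_paths)).convert("RGBA").resize(size)
    fg = fg.rotate(random.uniform(0, 360), expand=True)           # random orientation
    scale = random.uniform(0.3, 0.7)                              # random object size
    fg = fg.resize((int(size[0] * scale), int(size[1] * scale)))
    x = random.randint(0, size[0] - fg.width)                     # random placement
    y = random.randint(0, size[1] - fg.height)
    bg.alpha_composite(fg, (x, y))                                 # paste with alpha mask
    return bg.convert("RGB")

# randomized_composite("drone.png", ["sky1.jpg", "sky2.jpg"]).save("synthetic_000.jpg")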
... To better evaluate the proposed approach, three key metrics were adopted, namely precision, recall, and F1-score. These indices are standard measures of model performance [35]. The results are presented in Table 1. ...
Article
Maintenance activities to replace, repair, and revamp equipment in the industrial plant sector are increasingly needed for sustainability during the plant’s life cycle. To carry out these revamping activities, plant owners exchange many purchase orders (POs) with equipment suppliers, including technical and specification documents and commercial procurement content. As POs are written in various formats and are large and complex, reviewing them is often time-consuming for the owner’s engineer and may lead to errors and omissions. This study proposes the purchase order recognition and analysis system (PORAS), which automatically detects and compares risk clauses between plant owners’ and suppliers’ POs by utilizing artificial intelligence (AI). The PORAS is a comprehensive framework consisting of two independent modules and four model components that embody its added value. The table recognition and comparison (TRC) module handles risk clauses written in tables in POs through its two components, the table comparison (TRC-C) and table recognition (TRC-R) models. The critical terms in general conditions (CTGC) module analyzes the patterns of risk clauses in general text, extracts them with a rule-based algorithm, and compares them through entity matching. In the TRC-C model, which uses machine learning (the Ditto model), a few errors occurred due to insufficient training data, resulting in an accuracy of 87.8%, whereas in the rule-based TRC-R model errors occurred only in some exceptional cases, so its F1 score was evaluated as 96.9%. The CTGC module’s F2 score for automatic extraction performance was evaluated as 79.1%, owing to bias in some of the data. Overall, the validation study shows that while a manual review of the risk clauses in a PO took hours, it took only an average of 10 min with the PORAS. This time saving can significantly reduce the owner engineer’s PO workload. In essence, this study contributes to achieving sustainable engineering processes through the intelligence and automation of document and risk management in the plant industry.
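For readers less familiar with the F2 score quoted above: both F1 and F2 are instances of the general F_β measure, which weights recall β times as much as precision,

F_β = (1 + β²) · precision · recall / (β² · precision + recall),

so F1 balances precision and recall equally, while F2 (used here for the CTGC extraction result) favours recall over precision.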
Article
Areas of cutaneous field cancerization (CFC) occur in sun-damaged skin and are prone to skin cancer development. Actinic keratosis (AK) is the pathognomonic lesion of CFC. Therefore, reliable and non-invasive assessment of the AK burden is essential to assist clinicians in delivering patient-tailored therapeutic interventions and to support the objective evaluation of emerging therapeutic modalities. Herein, we introduce a system for automated AK detection in CFC areas. For the differentiation of AK from healthy skin areas and co-localized benign growths (Seborrheic Keratosis/Lentigo Solaris; SK/LS), cross-polarized digital photographs of afflicted skin surfaces were taken, and a convolutional neural network (AKCNN), whose convolution part was optimally transferred from a pre-trained VGG16, was implemented. For the detection of multifocal AK in wide skin regions, superpixels were employed to generate region patches for subsequent evaluation. AKCNN was implemented and evaluated on 19,739, 43,067, and 12,205 image patches of AK, SK/LS, and healthy skin, respectively, originating from 46 patients. AKCNN performance was assessed in two ways: (a) patch classification using the macro-averaged F1 score and (b) AK burden evaluation in broad skin areas using an adapted region-based F1 (aF1) score. Using raw clinical images, AKCNN exhibited a macro F1 of 0.78 at patch level and a region-based aF1 of 0.81, with good tolerance to image scaling. The proposed system efficiently uses cross-polarized clinical photography to assess the AK burden within CFC areas.
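A minimal transfer-learning sketch in Python (Keras) in the spirit of the AKCNN setup described above: reuse the convolutional part of a pre-trained VGG16 and train a small classification head for the three patch classes (AK, SK/LS, healthy skin). The input size and head layout here are assumptions, not the exact architecture of the cited study.

from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                      # keep the transferred convolution part frozen

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dense(3, activation="softmax"),  # AK vs. SK/LS vs. healthy skin
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()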
Article
The performance of a computer vision system depends on its design and parameter settings. In systems that use grayscale conversion, converting RGB images to grayscale influences performance in terms of both result quality and computational cost. Appropriately setting the weights of the weighted-means grayscale conversion, co-estimated with the other parameters of the system, helps approach the desired performance of a system or its subsystem at the cost of a negligible or no increase in time complexity. However, the parameter space of the system and subsystem, as extended by the grayscale conversion weights, can contain substandard settings; such settings make the system and subsystem strongly sensitive to small changes in the distribution of data in the color space of the processed images. We developed a methodology for Tuning of Grayscale computer Vision systems (TGV) that exploits the advantages of the weighted-means grayscale conversion while compensating for its disadvantages. We show that TGV tuning improves computer vision system performance by up to 16% in the tested case studies. The methodology provides a universally applicable solution that merges the utility of a fine-tuned computer vision system with the robustness of its performance against variable input data.
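The weighted-means grayscale conversion that TGV tunes can be written in a few lines of Python: each gray value is a convex combination of the R, G, and B channels. The weights in the example call are the common Rec. 601 luma coefficients, shown only for illustration; they are not values produced by the TGV methodology.

import numpy as np

def weighted_grayscale(rgb, w=(0.299, 0.587, 0.114)):
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                        # keep the weights a convex combination
    return rgb[..., :3] @ w                # H x W x 3  ->  H x W

img = np.random.randint(0, 256, size=(4, 4, 3)).astype(float)   # toy RGB image
print(weighted_grayscale(img).shape)                             # (4, 4)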
Chapter
Oil spills are considered one of the biggest threats to marine and coastal environments. Effective surveillance, ship detection, and accurate oil spill detection are crucial for the relevant agencies to respond adequately, minimize environmental damage, and avoid further disruption. Satellites deployed to capture such data feed huge amounts of remote sensing data into these systems, but analyzing that data through human effort alone is a tedious and extensive task. Hence, the modern literature suggests the use of machine learning in paradigms such as image segmentation, image recognition, and object detection as a substitute for traditional techniques. This research applies contemporary deep learning methods to the dataset available from the European Space Agency (ESA). The paper proposes the use of the volumetric convolution net (V-Net) architecture in addition to image augmentation methods such as horizontal flipping, vertical flipping, and image rotation. The proposed computational pipeline resulted in a net mean IoU of 88.29 and an accuracy of 90.65%. Keywords: Marine pollution, Remote sensing, Oil spill detection, SAR, V-Net.
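The mean IoU figure quoted above can be illustrated with a small Python routine that computes per-class intersection-over-union on segmentation masks and averages over classes; the toy masks below stand in for the ESA oil-spill imagery.

import numpy as np

def mean_iou(y_true, y_pred, n_classes):
    ious = []
    for c in range(n_classes):
        inter = np.logical_and(y_true == c, y_pred == c).sum()
        union = np.logical_or(y_true == c, y_pred == c).sum()
        if union:                          # ignore classes absent from both masks
            ious.append(inter / union)
    return float(np.mean(ious))

y_true = np.array([[0, 0, 1], [0, 1, 1], [1, 1, 1]])   # toy ground-truth mask
y_pred = np.array([[0, 1, 1], [0, 1, 1], [0, 1, 1]])   # toy predicted mask
print(mean_iou(y_true, y_pred, n_classes=2))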
Article
Machine learning classification algorithms play a major role in diagnosing faults in industrial equipment. In this paper, we investigate the use of ensembles of deep neural networks for improving electrical submersible pump fault diagnosis. The proposed method composes an ensemble of multiple convolutional neural networks trained with a metric function to extract relevant features directly from the raw data. The final classification is given by a standard voting scheme after a Random Forest is trained on each feature set generated by each deep metric network; the approach is then compared with two methods previously used for electrical submersible pump fault diagnosis. The experiments were carried out using five different metric functions: Proxy-Anchor loss, CosFace loss, Triplet loss, Lifted Structured loss, and Contrastive loss. Results show statistical evidence that the new ensemble-based approach achieves better performance than the previous solutions. Moreover, results indicate that an ensemble combining multiple distinct metric losses achieves a high macro F-measure with low variance compared to an ensemble in which all neural networks are trained with the same metric loss.
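The final voting stage described above can be sketched in Python as follows: one Random Forest is trained per feature set (random features stand in here for the embeddings produced by the metric-trained networks), and the per-forest predictions are combined by majority vote. The metric-learning stage itself is not reproduced, and all data are illustrative.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=200)                                  # toy fault labels
feature_sets = [rng.normal(size=(200, 16)) for _ in range(5)]     # one set per metric network

forests = [RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
           for X in feature_sets]
votes = np.stack([f.predict(X) for f, X in zip(forests, feature_sets)])
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)   # majority vote
print("vote agreement with labels:", (majority == y).mean())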
Article
Full-text available
Text categorization is the task of assigning a text document to an appropriate category from a predefined set of categories. We present an approach to hierarchical text categorization, a recently emerged subfield of the main topic, in which documents are assigned to leaf-level categories of a category tree (called a taxonomy). The algorithm applies an iterative learning module that gradually builds a classifier through a weight-adjustment method. We experimented on the well-known Reuters-21578 document corpus with different taxonomies. Results show that our approach outperforms existing ones by up to 10%.
Article
Full-text available
Criteria for evaluating the performance of a classifier are an important part of its design. They make it possible to estimate the behavior of the generated classifier on unseen data and can also be used to compare its performance against that of classifiers generated by other classification algorithms. There are currently several performance measures for binary and flat classification problems. For hierarchical classification problems, where multiple classes are hierarchically related, the evaluation step is more complex. This paper reviews the main evaluation metrics proposed in the literature to evaluate hierarchical classification models.
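One widely used family of measures surveyed in work of this kind is hierarchical precision and recall, in which the predicted and true classes are first extended with all of their ancestors in the taxonomy, and the overlap of the two ancestor sets is then measured. A minimal Python sketch with an invented toy taxonomy:

parents = {"news": None, "sport": "news", "football": "sport", "politics": "news"}

def ancestors(label):
    out = set()
    while label is not None:
        out.add(label)
        label = parents[label]
    return out

def hierarchical_prf(true_label, pred_label):
    t, p = ancestors(true_label), ancestors(pred_label)
    hp = len(t & p) / len(p)               # hierarchical precision
    hr = len(t & p) / len(t)               # hierarchical recall
    hf = 2 * hp * hr / (hp + hr) if hp + hr else 0.0
    return hp, hr, hf

print(hierarchical_prf("football", "politics"))   # shares only the root category "news"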
Chapter
Semantic Information Theory (SIT) is concerned with studies in Logic and Philosophy on the use of the term information, “in the sense in which it is used of whatever it is that meaningful sentences and other comparable combinations of symbols convey to one who understands them” (Hintikka, 1970). Notwithstanding the large scope of this description, SIT is primarily concerned with the question of how to weight sentences according to their informative content. The main difference from conventional information theory is that information is not conveyed by an ordered sequence of binary symbols, but by means of a formal language in which logical statements are defined and explained by a semantics. The investigation of SIT concerns two research directions: the axiomatisation of the logical principles for assigning probabilities or similar weighting functions to logical sentences, and the relationship between the information content of a sentence and its probability.
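One classical formalization of that last relationship, due to Bar-Hillel and Carnap, defines two measures of the semantic information carried by a sentence s with probability p(s):

cont(s) = 1 − p(s)   and   inf(s) = −log₂ p(s),

so a logically true sentence (p(s) = 1) carries no information, and the less probable a sentence is, the more informative it is taken to be.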
Conference Paper
We investigate whether one can determine from the transcripts of U.S. Congressional floor debates whether the speeches represent support of or opposition to proposed legislation. To address this problem, we exploit the fact that these speeches occur as part of a discussion; this allows us to use sources of information regarding relationships between discourse segments, such as whether a given utterance indicates agreement with the opinion expressed by another. We find that the incorporation of such information yields substantial improvements over classifying speeches in isolation.