A Strategy on Selecting Performance Metrics for Classifier Evaluation

Yangguang Liu, Ningbo Institute of Technology, Zhejiang University, Ningbo, China
Yangming Zhou, Ningbo Institute of Technology, Zhejiang University, Ningbo, China
Shiting Wen, Ningbo Institute of Technology, Zhejiang University, Ningbo, China
Chaogang Tang, China University of Mining and Technology, Xuzhou, China

DOI: 10.4018/IJMCMC.2014100102

ABSTRACT

The evaluation of classifiers' performances plays a critical role in the construction and selection of classification models. Although many performance metrics have been proposed in the machine learning community, no general guidelines are available among practitioners regarding which metric to select for evaluating a classifier's performance. In this paper, the authors attempt to provide practitioners with a strategy for selecting performance metrics for classifier evaluation. Firstly, the authors investigate seven widely used performance metrics, namely classification accuracy, F-measure, kappa statistic, root mean square error, mean absolute error, the area under the receiver operating characteristic curve, and the area under the precision-recall curve. Secondly, the authors use Pearson linear correlation and Spearman rank correlation to analyse the potential relationships among these seven metrics. Experimental results show that these commonly used metrics can be divided into three groups, and all metrics within a given group are highly correlated but less correlated with metrics from different groups.

Keywords: Classifiers, Classifiers' Performances, Correlation, Machine Learning Community, Performance Metrics
1. INTRODUCTION
The correct selection of performance metrics is one of the key issues in evaluating classifiers' performances. A number of performance metrics have been proposed for different application scenarios. For example, accuracy is typically used to measure the percentage of correctly classified test instances and remains the primary metric for assessing classifier performance (Ben et al. 2007; Huang et al. 2005); precision and recall are widely applied in information retrieval (Baeza-Yates 1999); and the medical decision-making community prefers the area under the receiver operating characteristic (ROC) curve, i.e., the AUC (Lasko et al. 2005). It is very common for a classifier to perform well on one performance metric but badly on others.
For example, boosted trees and SVM classifiers achieve good performances on classification
accuracy, while they yield poor performances on root mean square error (Caruana et al. 2004).
In general, the widely accepted consensus is to choose performance metrics according to the practical requirements of the specific application. For example, neural networks typically optimize squared error, so root mean square error reflects the actual performance of such a classifier better than other metrics. However, in some cases the specific criteria are unknown in advance, and practitioners tend to select several measures from the widely adopted ones, such as classification accuracy, kappa statistic, F-measure, and AUC, for evaluating a new classifier (Sokolova et al. 2006; Sokolova et al. 2009). Additionally, most metrics are derived from the confusion matrix of the classifier, so it is reasonable to expect that some of these performance metrics are closely related, which may introduce redundancy in measuring classifier performance. On the other hand, it is difficult for practitioners to reach a concrete conclusion when two metrics provide conflicting results.
This study focuses on providing a strategy for selecting appropriate performance metrics for classifiers by using Pearson linear correlation and Spearman rank correlation to analyse the potential relationships among seven widely used performance metrics, namely accuracy, F-measure, kappa statistic, root mean square error (RMSE), mean absolute error (MAE), AUC, and the area under the precision-recall (PR) curve (AUPRC). We first briefly describe these performance metrics. Based on their definitions in terms of the confusion matrix, we sketch out their characteristic features and preliminarily classify them into three groups, namely threshold metrics, rank metrics, and probability metrics. Then, we use correlation analysis to measure the correlations among these metrics. The experimental results show that metrics from the same group are closely correlated but less correlated with metrics from different groups. Additionally, we examine how the correlations change with the size and class distribution of the datasets, which are the main factors affecting the measured values.
The main contributions of this work are summarized as follows. First, we divide these seven performance metrics into three groups by analyzing their definitions. Experimental results confirm that metrics within the same group have high correlation, and metrics from different groups have low correlation. Second, based on the experimental results, we provide practitioners with the following strategies for selecting performance metrics to evaluate a classifier. For balanced training datasets, one should select multiple metrics to evaluate the classifier, with at least one metric from each group. For imbalanced training datasets, a classifier need not achieve optimal performance on all groups of metrics; instead, as long as the classifier meets the performance requirement of an application as measured by certain group(s) of metrics, we recommend adopting it regardless of its less satisfactory performance on the other groups of metrics.
Compared with existing work, our work concentrates on investigating the relationships among several especially popular performance metrics. Based on their definitions and on experiments, a clear taxonomy of these metrics is given. It should be noted that we are not attempting to prescribe specific performance metrics for particular practical applications. Instead, we resort to correlation analysis to discover the potential relationships among the most commonly used performance metrics, and we provide practitioners with a more profound understanding of these metrics and a strategy for selecting performance metrics for classifier evaluation.
The outline of this paper is as follows. Section 2 briefly reviews related work. Section 3 describes seven common performance metrics used in practice. Section 4 introduces the correlation analysis methods and explains the details of the experimental setup. Section 5 presents the experimental results and discussion. Finally, Section 6 gives our conclusions and suggestions for future research.
2. RELATED WORK
The evaluation of classifiers' performances is a critical step in the construction and selection of classification models. Because of its great importance, a number of articles have been published in the data analytics domain, and several books address this topic (Hand 1997; Pepe 2003; Gönen 2007; Krzanowski et al. 2009). In what follows, we focus on related work concerning popular metrics.

Accuracy is typically used to measure the predictive ability of a classification model on testing samples. It has long been the major metric in areas such as machine learning and data mining (Ben et al. 2007; Demšar 2006). Despite its ease of use, it has the following disadvantages: (1) it does not take the class distribution into consideration and is often biased towards the majority class (Provost 1998); and (2) it does not compensate for success due to mere chance (Ben et al. 2007).
The ROC curve is an alternative metric used in pattern recognition and machine learning (Lasko et al. 2005; Bradley 1997; Fawcett 2006). One of its attractive properties is its ability to handle different class distributions. In addition, the AUC has an important statistical property: it is equivalent to the Mann-Whitney-Wilcoxon U statistic, which provides a natural and intuitive interpretation of the ROC curve (Bradley 2014). The work in (Ferri et al. 2011) offers an alternative, coherent interpretation of the AUC as being linearly related to expected loss. Nonetheless, the AUC may give misleading results when ROC curves cross each other (Hand 2009). Furthermore, the AUC is insensitive to the class distribution. Kaymak et al. (2012) proposed a simple scalar alternative to the AUC, known as the area under the kappa curve (AUK), which compensates for the class indifference of the AUC. The AUK is particularly suitable for evaluating classifier performance on datasets with skewed class distributions.
Many performance metrics have been developed for evaluating the performance of classification algorithms, and it is not surprising that a classifier performs well under one metric but badly under another. For example, extensive experiments have shown that naive Bayes and C4.5 classifiers have similar average predictive accuracy, with no significant difference between them, whereas a naive Bayes classifier is significantly better than C4.5 decision trees in terms of AUC (Huang et al. 2003; Huang et al. 2005).
Additionally, researchers have investigated the relationship between AUC and accuracy (Huang et al. 2005; Cortes et al. 2004). These studies show that the average AUC increases monotonically as a function of classification accuracy, but that the standard deviation becomes noticeable for uneven class distributions and higher error rates. Thus, algorithms designed to minimize the error rate may not lead to the best possible AUC values (Huang et al. 2005). Similar work can be found in a study of the relationship between PR curves and ROC curves (Davis et al. 2006), which showed that there is a tight relationship between ROC space and PR space, such that a curve dominates in ROC space if and only if it dominates in PR space.
Some previous studies have compared and analysed the relationships between metrics. Caruana et al. (2004) analysed the behaviour of multiple metrics across multiple supervised learning algorithms and studied the relationships between metrics using multi-dimensional scaling and correlation. Seliya et al. (2009) applied factor analysis to investigate the relationships within the classifier performance space, which they characterized by 22 metrics. Hernández-Orallo et al. (2012) explored many old and new threshold choice methods: fixed, score-uniform, score-driven, rate-driven, and optimal, among others. Ferri et al. (2009) explored the relationships among 18 different performance metrics in multiple scenarios, identifying clusters of and relationships between metrics. However, the clustering among the performance
metrics is not obvious. In addition, some commonly-used performance metrics are not evaluated,
such as the area under the PR curve.
3. CLASSIFIER PERFORMANCE METRICS
In this section, we describe seven common performance metrics used in data analytics. For a binary classification problem, a confusion matrix can be constructed to depict the number of instances in each of the four possible outcomes, as given in Table 1. Based on the definitions of the evaluation metrics, we preliminarily divide these seven metrics into three groups, namely threshold metrics, rank metrics, and probability metrics. The notation used in this paper is summarized in Table 2.
Threshold metrics, which are sensitive to a decision threshold, include accuracy, F-measure, and kappa statistic. These metrics do not consider how close the predicted value is to the true value; they only consider whether the predicted value is above or below a threshold. A minimal code sketch illustrating this thresholding step and the resulting confusion-matrix counts is given below; the threshold metrics themselves are then presented in detail.
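The following sketch is ours, not part of the original paper; the function name and example values are hypothetical. It simply thresholds predicted scores at 0.5 and counts the four outcomes of Table 1, which is the only information the threshold metrics use.

```python
# Minimal sketch (not from the paper): threshold predicted scores at 0.5
# and count the four confusion-matrix outcomes of Table 1.
def confusion_counts(y_true, y_score, threshold=0.5):
    """Return (TP, FN, FP, TN) for binary labels in {0, 1}."""
    tp = fn = fp = tn = 0
    for truth, score in zip(y_true, y_score):
        pred = 1 if score >= threshold else 0   # threshold metrics only use this side
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn

# Example with four hypothetical test instances
print(confusion_counts([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))  # -> (1, 1, 1, 1)
```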
Table 1. The confusion matrix for binary classification

                    Predicted +               Predicted -
Actual +            True Positive (TP)        False Negative (FN)
Actual -            False Positive (FP)       True Negative (TN)

Table 2. Notations

Notation    Meaning
ACC         Accuracy
RMSE        Root mean square error
ROC         Receiver operating characteristic
MAE         Mean absolute error
AUC         Area under the ROC curve
FSC         F-score
KAP         Kappa statistic
r           Pearson correlation coefficient
AUPRC       Area under the PR curve
ρ           Spearman rank correlation

• Accuracy (ACC): This metric is the most popular performance metric for evaluating classifiers. It is defined as the percentage of correct classifications:

$$\mathrm{ACC} = \frac{TP + TN}{M} \qquad (1)$$

where $M$ denotes the total number of positive samples ($P$) and negative samples ($N$), and $TP$ and $TN$ denote the numbers of true positives and true negatives, respectively.
• Kappa Statistic (KAP): This metric measures the degree of agreement between the predicted and the actual classifications, corrected for the agreement expected by chance. It is defined as follows:

$$\mathrm{KAP} = \frac{P_0 - P_C}{1 - P_C} \qquad (2)$$

where $P_0$ is the prediction accuracy of the classifier as defined in Equation (1), and $P_C = (P\hat{P} + N\hat{N})/M^2$ is the "agreement" probability due to chance. Here, $\hat{P}$ and $\hat{N}$ represent the total numbers of samples labeled as positive and negative by the classifier, respectively.
• F-Measure (also F-Score, FSC): This metric has been widely applied in the field of information retrieval (Baeza-Yates et al. 1999). It is the harmonic mean of precision and recall:

$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (3)$$

where $\mathrm{Precision} = TP/\hat{P}$ and $\mathrm{Recall} = TP/P$.

The areas under the ROC and PR curves are the rank metrics, which measure how well a model ranks the positive instances above the negative instances. The ROC and PR curves have been widely used in information retrieval. These metrics can be viewed as a summary of the performance of a model across all possible thresholds.
• Area Under the ROC Curve (AUC): The ROC curve is a useful two-dimensional depiction of the trade-off between the true positive rate $t = TP/P$ and the false positive rate $f = FP/N$ (Fawcett 2006). In order to compare the performances of different classifiers, one often calculates the area under the ROC curve. In our notation, the AUC is defined as follows:

$$\mathrm{AUC} = \int_0^1 t \, df \qquad (4)$$
• Area Under the PR Curve (AUPRC): This metric usually serves as an alternative to the AUC, especially in the information retrieval area (Ngo 2011; Raghavan et al. 1989). The area under the PR curve is given as follows:

$$\mathrm{AUPRC} = \int_0^1 p \, dt \qquad (5)$$
where $t = TP/P$ is the recall and $p = TP/\hat{P}$ is the precision.

The probability metrics measure the deviation between the predicted values and the true values. The probability metrics studied here are mean absolute error and root mean square error, which neither compare the results directly with a threshold value nor compare the ordering of the instances with one another. These metrics are widely used in regression problems, and they are especially useful for assessing the reliability of classifiers.
• Root Mean Square Error (RMSE): This is a principal and frequently used metric, which measures the difference between the values predicted by a classifier and the true values. It is defined as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{M} \sum_{i=1}^{M} \left( \mathrm{Pred}_c(i) - \mathrm{True}_c(i) \right)^2} \qquad (6)$$

where $\mathrm{Pred}_c(i)$ denotes the predicted probability that instance $i$ belongs to class $c$, and $\mathrm{True}_c(i)$ denotes the actual probability.
• Mean Absolute Error (MAE): This metric is a common alternative to the root mean square error. It averages the magnitudes of the individual errors without taking their signs into account. The formula for the mean absolute error is

$$\mathrm{MAE} = \frac{1}{M} \sum_{i=1}^{M} \left| \mathrm{Pred}_c(i) - \mathrm{True}_c(i) \right| \qquad (7)$$
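As a concrete illustration of the seven metrics defined above, the following sketch (ours, not from the paper) computes them for a toy set of predictions. It assumes scikit-learn and NumPy are available and uses average precision as a common stand-in for the AUPRC; the paper's own experiments rely on Weka instead, and the data values here are hypothetical.

```python
# Minimal sketch (assumption: scikit-learn and NumPy are available);
# the paper itself uses Weka, so this only illustrates the seven metrics.
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, cohen_kappa_score,
                             mean_absolute_error, mean_squared_error,
                             roc_auc_score, average_precision_score)

y_true  = np.array([1, 1, 1, 0, 0, 0, 1, 0])                   # actual classes
y_score = np.array([0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1])   # predicted P(class = 1)
y_pred  = (y_score >= 0.5).astype(int)                          # thresholded predictions

metrics = {
    # threshold metrics
    "ACC":    accuracy_score(y_true, y_pred),
    "FSC":    f1_score(y_true, y_pred),
    "KAP":    cohen_kappa_score(y_true, y_pred),
    # probability metrics (reported as 1-MAE and 1-RMSE, as in the experiments)
    "1-MAE":  1 - mean_absolute_error(y_true, y_score),
    "1-RMSE": 1 - np.sqrt(mean_squared_error(y_true, y_score)),
    # rank metrics (average precision is a common stand-in for the AUPRC)
    "AUC":    roc_auc_score(y_true, y_score),
    "AUPRC":  average_precision_score(y_true, y_score),
}
for name, value in metrics.items():
    print(f"{name:7s} {value:.3f}")
```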
Although a number of evaluation metrics have been proposed so far, we focus on several
common performance metrics. In what follows, we will briefly introduce two correlation analysis
methods, and use them to investigate the relationships among the above seven performance metrics.
4. METHODOLOGY
4.1. Correlation Analysis Methods
Correlation measures the strength of the relationship between two variables. A strong correlation implies a close relationship between the variables, while a weak correlation means that the variables are hardly related. In the experiments, we resort to correlation analysis to investigate the relationships among the seven performance metrics. In what follows, we briefly describe the two correlation analysis techniques used, namely Pearson linear correlation and Spearman rank correlation.
The most widely used correlation is the Pearson correlation, which measures the strength and direction of the linear relationship between two variables (Ahlgren et al. 2003). It ranges from -1 to +1. The Pearson correlation coefficient $r$ between two random variables $x$ and $y$ is defined as follows:

$$r = \frac{E[(x - \mu_x)(y - \mu_y)]}{\sigma_x \sigma_y} \qquad (8)$$
where $\mu_x$ and $\mu_y$ are the expected values of $x$ and $y$, and $\sigma_x$ and $\sigma_y$ are their standard deviations.
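As a quick numerical illustration of Equation (8) (a sketch of ours, not from the paper; the example values are hypothetical), the sample version of the coefficient can be computed directly and cross-checked against NumPy:

```python
# Minimal sketch: sample Pearson correlation following Equation (8),
# cross-checked against numpy.corrcoef (assumption: NumPy is available).
import numpy as np

def pearson_r(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    x_c, y_c = x - x.mean(), y - y.mean()              # deviations from the means
    return (x_c * y_c).mean() / (x.std() * y.std())    # E[(x-mu_x)(y-mu_y)] / (sigma_x * sigma_y)

x = [0.90, 0.85, 0.75, 0.95, 0.80]   # e.g., hypothetical ACC values over several folds
y = [0.88, 0.86, 0.70, 0.93, 0.78]   # e.g., hypothetical FSC values over the same folds
print(pearson_r(x, y), np.corrcoef(x, y)[0, 1])        # the two values agree
```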
When the variables are not normally distributed or the relationship between them is not linear, it may be more appropriate to use the Spearman rank correlation coefficient (Govindarajulu 1992). The Spearman correlation coefficient is a non-parametric measure of the statistical dependence between two variables, and it also ranges from -1 to +1. A clear description of the difference between Pearson linear correlation and Spearman rank correlation can be found in Figure 1. The formula for the Spearman rank correlation $\rho$ is as follows:

$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \qquad (9)$$

where $d_i$ is the difference between the ranks of instance $i$ in the two variables, and $n$ is the number of paired observations (the dimension of the variable vectors).
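Analogously, the following sketch (ours, not from the paper; the example values are hypothetical) applies Equation (9) and cross-checks it against scipy.stats.spearmanr. Note that Equation (9) is exact only when there are no tied values.

```python
# Minimal sketch of Equation (9); valid when there are no tied values
# (assumption: SciPy and NumPy are available for the cross-check).
import numpy as np
from scipy.stats import spearmanr, rankdata

def spearman_rho(x, y):
    rx, ry = rankdata(x), rankdata(y)      # ranks of each observation
    d = rx - ry                            # rank differences d_i
    n = len(x)
    return 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

x = [0.62, 0.95, 0.71, 0.88, 0.55]   # e.g., hypothetical AUC values over several folds
y = [0.58, 0.97, 0.66, 0.91, 0.50]   # e.g., hypothetical AUPRC values over the same folds
print(spearman_rho(x, y), spearmanr(x, y)[0])   # both give 1.0 for this monotone example
```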
4.2. Experimental Settings
In order to investigate the relationships among the seven commonly used performance metrics, we conduct four sets of experiments. First, we carry out a correlation analysis of the performance metrics within each group on a single dataset. Second, we correspondingly carry out a correlation analysis of metrics from different groups. Third, we give an overall correlation analysis over all datasets. Finally, we analyse how the correlations of the performance metrics change with the size and class distribution of the datasets.

Figure 1. The difference between Pearson linear correlation (r) and Spearman rank correlation (ρ): r measures how close the points of a scatter plot are to a straight line, whereas ρ measures the tendency for y to increase (or decrease) as x increases, not necessarily in a linear way.

In the following, we briefly introduce the algorithms, the datasets, and the specific experimental procedure.
The experiments are run in Weka 3.7.6 (see Endnote 1). We use eight well-known classification models: Artificial Neural Network, C4.5 (J48), k-Nearest Neighbors (kNN), Logistic Regression, Naive Bayes, Random Forest, Bagging with 25 J48 trees, and AdaBoost with 25 J48 trees. More details about these classifiers and their Weka implementations can be found in (Ngo 2011). All results are obtained by stratified 10-fold cross-validation, and the default parameters are used in the experiments.

A total of 18 binary classification datasets are tested in the experiments. Table 3 lists their properties, such as size, number of features, and class distribution.
In the experiments, each model is evaluated using stratified 10-fold cross-validation and is applied to each of the 18 binary classification problems. We obtain 1,800 results for each algorithm, and 14,400 results in total for the eight algorithms. To study the relationships of metrics within the same group and of metrics from different groups, we conduct an experiment on the House-voting dataset with a kNN classifier. To verify the generality of the experimental results, we then make an overall analysis over all datasets and all eight algorithms. In addition, we analyse the correlation changes caused by the size and class distribution of the datasets. Because performance metric values are influenced in different ways by the characteristics of the specific problem, such as class distribution and dataset size, we first calculate the correlations of the performance metrics for each dataset and each algorithm. In each case, we perform the correlation analysis by computing the Pearson linear correlation coefficients and Spearman rank correlation coefficients among the seven metrics. In the experiments, we work with 1-RMSE and 1-MAE, so that larger values indicate better performance for all seven metrics.
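As an illustration of this analysis step, the sketch below (ours, not from the paper) computes the two correlation matrices from a table of per-run metric values. It assumes pandas is available and that the results have already been collected into a DataFrame whose columns are the seven metrics; the numbers shown are hypothetical placeholders, whereas the actual experiments use the Weka cross-validation output.

```python
# Minimal sketch: correlation analysis of per-run metric values
# (assumption: results were exported to a pandas DataFrame; the values are hypothetical).
import pandas as pd

# Hypothetical per-fold results for one dataset/algorithm pair; in the real
# experiments these values come from the Weka cross-validation runs.
results = pd.DataFrame({
    "ACC":    [0.93, 0.91, 0.95, 0.92, 0.94],
    "KAP":    [0.85, 0.81, 0.89, 0.83, 0.87],
    "FSC":    [0.94, 0.92, 0.96, 0.93, 0.95],
    "1-MAE":  [0.90, 0.88, 0.93, 0.89, 0.91],
    "1-RMSE": [0.78, 0.75, 0.82, 0.77, 0.80],
    "AUC":    [0.97, 0.95, 0.98, 0.96, 0.97],
    "AUPRC":  [0.96, 0.94, 0.97, 0.95, 0.96],
})

pearson_matrix  = results.corr(method="pearson")    # Pearson linear correlations
spearman_matrix = results.corr(method="spearman")   # Spearman rank correlations
print(pearson_matrix.round(2))
print(spearman_matrix.round(2))
```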
5. RESULTS
In this section, we present some interesting findings from the analysis of the Pearson linear (r) and Spearman rank (ρ) correlations among the metrics.
Firstly, we perform the experiment on the House-voting dataset to study the relationships among the performance metrics based on the kNN classification algorithm. Table 4 shows the correlation matrix of the performance metrics. From this table, we make the following preliminary observations: all correlation coefficients are positive, and the Pearson linear correlations are extremely close to the corresponding Spearman rank correlations.
Figure 2(a) shows the correlations of metrics from the same group, from which we make the following observations. (1) For Spearman rank correlation, the correlation between any two metrics from the same group is very high, i.e., 0.82 for the probability metrics, 0.92 for the rank metrics, and 0.92 to 1.0 for the threshold metrics. (2) For Pearson linear correlation, the correlation is 0.91 for the probability metrics, 0.95 for the rank metrics, and 0.99 to 1.0 for the threshold metrics.
Figure 2(b) shows that the correlations of metrics from different groups are not as high as those shown in Figure 2(a). For Spearman rank correlation, the inter-group correlations range from 0.22 to 0.65, i.e., 0.22 to 0.37 between threshold and rank metrics, 0.35 to 0.55 between rank and probability metrics, and 0.42 to 0.65 between threshold and probability metrics; for Pearson linear correlation, the inter-group correlations range from 0.17 to 0.65, i.e., 0.17 to 0.19 between threshold and rank metrics, 0.31 to 0.49 between rank and probability metrics, and 0.57 to 0.65 between threshold and probability metrics.
In conclusion, metrics from the same group are closely associated with each other, while metrics from different groups are not so closely related. That is to say, if the specific criteria are unknown in advance, one should select multiple metrics to evaluate the classifier, with at least one metric from each group.
Secondly, to investigate the general applicability of our grouping of the metrics, we calculate the correlations among the performance metrics on each dataset, i.e., the correlations across the eight models for each dataset, and average the resulting 18 correlation matrices, as shown in Table 5. Compared with Table 4, the correlation values change somewhat; in particular, the intra-group correlations become smaller. A clearer picture is given in Figure 3, where the Pearson linear correlation is shown on the x-axis and the Spearman rank correlation coefficient on the y-axis. A further observation is that there is a strong correlation (greater than 0.77) between performance metrics in the same group and a low correlation (less than 0.67) between metrics from different groups. These experimental results verify the reasonableness of our classification of the performance metrics.
Table 3. Datasets* and their properties

Dataset         #Instances   #Features   %Min-%Max
Colic           368          22          36.96-63.04
Credit-rating   690          15          44.50-55.50
Heart-disease   303          13          45.54-54.46
Heart-statlog   270          23          44.45-55.55
Hepatitis       155          19          20.65-79.35
House-voting    435          16          38.62-61.38
Ionosphere      351          34          35.90-64.10
Kr-vs-kp        3196         36          47.78-52.22
Monks1          556          6           50.00-50.00
Monks2          601          6           24.28-75.72
Monks3          554          6           48.01-51.90
Mushroom        8124         22          48.20-51.80
Optdigits       5620         64          49.79-50.21
Sick            3772         29          6.12-93.87
Sonar           208          60          46.63-53.37
Spambase        4601         57          39.40-60.60
Spectf          80           44          50.00-50.00
Tic-tac-toe     958          8           34.66-65.34

*Datasets from http://www.cs.waikato.ac.nz/ml/weka/datasets.html.
Table 4. Pearson (bottom-left) and Spearman (top-right) correlation coefficients for the House-voting dataset based on the kNN algorithm. Threshold metrics: ACC, KAP, FSC; probability metrics: 1-MAE, 1-RMSE; rank metrics: AUC, AUPRC.

           ACC    KAP    FSC    1-MAE   1-RMSE   AUC    AUPRC
ACC        -      1.0    0.93   0.65    0.58     0.32   0.22
KAP        1.0    -      0.92   0.62    0.57     0.37   0.27
FSC        1.0    0.99   -      0.63    0.42     0.27   0.22
1-MAE      0.65   0.65   0.63   -       0.82     0.50   0.35
1-RMSE     0.59   0.61   0.57   0.91    -        0.55   0.47
AUC        0.18   0.18   0.17   0.49    0.39     -      0.92
AUPRC      0.19   0.19   0.19   0.31    0.31     0.95   -
Figure 2. Intra-group correlations (a) and inter-group correlations (b) of the metrics for the House-voting dataset.
These results also show that a classifier need not achieve optimal performance on all groups of metrics; instead, as long as the classifier meets the performance requirement of an application as measured by certain group(s) of metrics, we recommend adopting it regardless of its less satisfactory performance on the other groups of metrics.
Table 5. Pearson (bottom-left) and Spearman (top-right) correlations over the 18 datasets. Threshold metrics: ACC, KAP, FSC; probability metrics: 1-MAE, 1-RMSE; rank metrics: AUC, AUPRC.

           ACC    KAP    FSC    1-MAE   1-RMSE   AUC    AUPRC
ACC        -      0.94   0.91   0.63    0.59     0.28   0.26
KAP        0.96   -      0.90   0.60    0.57     0.29   0.26
FSC        0.94   0.94   -      0.60    0.55     0.29   0.26
1-MAE      0.67   0.63   0.63   -       0.77     0.46   0.41
1-RMSE     0.63   0.61   0.58   0.79    -        0.43   0.39
AUC        0.30   0.29   0.29   0.47    0.45     -      0.85
AUPRC      0.26   0.26   0.26   0.42    0.40     0.89   -

Figure 3. The distribution of correlations between the seven common performance metrics: there is a relatively high correlation between metrics within the same group (top-right) and a low correlation between metrics from different groups (bottom-left).

Finally, we investigate the correlations of the metrics with respect to the size and class distribution of the datasets; the results are shown in Figures 4 and 5. Both figures show that the correlation ranges from 0.75 to 1.0 for metrics in the same group and from 0.05 to 0.65 for metrics in different
groups. The results comply with those shown in Figure 2. In Figure 4, a data set is regarded as
a large data set, if it contains over 3000 instances, otherwise it is regard as a small data set. The
correlation for metrics from the same group w.r.t. large data sets is slightly higher than those
w.r.t. small data sets, and results are opposite for metrics from different groups. In general,
classifier algorithm treats the data as independent, identically distributed (i.i.d.). As the data set
size increase the impact of variance can be expected to decrease. According to the definitions
of Pearson linear correlation and Spearman rank correlation, they are both related to algorithm
prediction variance or the data set size. In Figure 5, the results show that the correlation w.r.t.
balanced data sets is almost the same as that of all data sets, while correlation w.r.t. imbalanced
data sets slightly fluctuates around that of balanced data sets. Because typically there are two
parts to solving a prediction problem: model selection and model assessment. In model selec-
tion we estimate the performance of various competing models with the hope of choosing the
best one. Having chosen the final model, we assess the model by estimating the prediction error
on new data. There is no obvious choice on how to split the data. It depends on the signal to
noise ratio which we, of course, do not know. It is reasonable explanation that the parameters
of algorithms in our experiments are stationary after model selection for the balanced data sets,
but it is sensitive to the imbalanced data sets.
6. CONCLUSION
In this paper, we have intensively investigated the relationships among seven widely adopted performance metrics for binary classification. The major contributions of this work are two-fold. (1) Based on Pearson and Spearman correlation analysis, we have verified the reasonableness of classifying the seven commonly used metrics into three groups, namely threshold metrics, rank metrics, and probability metrics: any two metrics have a high correlation if they are from the same group and a low correlation otherwise. This finding provides practitioners with a better understanding of the relationships among these common metrics. (2) Based on the experimental analysis, this work also suggests a strategy for choosing adequate measures to evaluate a classifier's performance from a user perspective. In addition, we have investigated the influence of dataset size and class distribution on the correlations among the different metrics.

For the next stage of this study, an interesting and challenging direction is to investigate an all-in-one metric that can perform the functions of all three groups of metrics.
ACKNOWLEDGMENT
This work was supported by Zhejiang Provincial Natural Science Foundation of China, Grant
No. LY15F020035, LY16F030012 and LY15F030016, and partially supported by Ningbo Natu-
ral Science Foundation of China, Grant No. 2014A610066, 2011A610177, 2012A610018, and
partially Supported by Scientific Research Fund of Zhejiang Provincial Education Department,
Grant No. Y201534788, and Jiangsu Province Natural Science Foundation of China Under
Grant No.BK20150201.
Figure 4. Spearman rank correlation (a) and Pearson linear correlation (b) among the metrics for datasets of different sizes.
Figure 5. Spearman rank correlation (a) and Pearson linear correlation (b) among the metrics for datasets with different class distributions.
REFERENCES
Ahlgren, P., Jarneving, B., & Rousseau, R. (2003). Requirements for a cocitation similarity measure, with special reference to Pearson's correlation coefficient. Journal of the American Society for Information Science and Technology, 54(6), 550–560. doi:10.1002/asi.10242

Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval (Vol. 463). New York: ACM Press.

Ben-David, A. (2007). A lot of randomness is hiding in accuracy. Engineering Applications of Artificial Intelligence, 20(7), 875–885. doi:10.1016/j.engappai.2007.01.001

Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7), 1145–1159. doi:10.1016/S0031-3203(96)00142-2

Bradley, A. P. (2014). Half-AUC for the evaluation of sensitive or specific classifiers. Pattern Recognition Letters, 38, 93–98. doi:10.1016/j.patrec.2013.11.015

Caruana, R., & Niculescu-Mizil, A. (2004, August). Data mining in metric space: An empirical analysis of supervised learning performance criteria. Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 69–78). ACM. doi:10.1145/1014052.1014063

Cortes, C., & Mohri, M. (2004). AUC optimization vs. error rate minimization. Advances in Neural Information Processing Systems, 16(16), 313–320.

Davis, J., & Goadrich, M. (2006, June). The relationship between Precision-Recall and ROC curves. Proceedings of the 23rd International Conference on Machine Learning (pp. 233–240). ACM. doi:10.1145/1143844.1143874

Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1–30.

Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861–874. doi:10.1016/j.patrec.2005.10.010

Ferri, C., Hernández-Orallo, J., & Flach, P. A. (2011). A coherent interpretation of AUC as a measure of aggregated classification performance. Proceedings of the 28th International Conference on Machine Learning (ICML-11) (pp. 657–664).

Ferri, C., Hernández-Orallo, J., & Modroiu, R. (2009). An experimental comparison of performance measures for classification. Pattern Recognition Letters, 30(1), 27–38. doi:10.1016/j.patrec.2008.08.010

Gönen, M. (2007). Analyzing receiver operating characteristic curves with SAS. SAS Institute.

Govindarajulu, Z., Kendall, M., & Gibbons, J. D. (1992). Rank correlation methods. Technometrics, 34(1), 108–108. doi:10.2307/1269571

Hand, D. J. (1997). Construction and assessment of classification rules (Vol. 15). Chichester: Wiley.

Hand, D. J. (2009). Measuring classifier performance: A coherent alternative to the area under the ROC curve. Machine Learning, 77(1), 103–123. doi:10.1007/s10994-009-5119-5

Hernández-Orallo, J., Flach, P., & Ferri, C. (2012). A unified view of performance metrics: Translating threshold choice into expected classification loss. Journal of Machine Learning Research, 13(1), 2813–2869.

Huang, J., & Ling, C. X. (2005). Using AUC and accuracy in evaluating learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 17(3), 299–310.

Huang, J., Lu, J., & Ling, C. X. (2003, November). Comparing naive Bayes, decision trees, and SVM with AUC and accuracy. Proceedings of the Third IEEE International Conference on Data Mining (ICDM '03) (pp. 553–556). IEEE. doi:10.1109/ICDM.2003.1250975

Kaymak, U., Ben-David, A., & Potharst, R. (2012). The AUK: A simple alternative to the AUC. Engineering Applications of Artificial Intelligence, 25(5), 1082–1089. doi:10.1016/j.engappai.2012.02.012
Krzanowski, W. J., & Hand, D. J. (2009). ROC curves for continuous data. CRC Press. doi:10.1201/9781439800225

Lasko, T. A., Bhagwat, J. G., Zou, K. H., & Ohno-Machado, L. (2005). The use of receiver operating characteristic curves in biomedical informatics. Journal of Biomedical Informatics, 38(5), 404–415. doi:10.1016/j.jbi.2005.02.008 PMID:16198999

Ngo, T. (2011). Data mining: Practical machine learning tools and technique. Software Engineering Notes, 36(5), 51–52. doi:10.1145/2020976.2021004

Pepe, M. S. (2003). The statistical evaluation of medical tests for classification and prediction. Oxford University Press.

Provost, F. J., Fawcett, T., & Kohavi, R. (1998, July). The case against accuracy estimation for comparing induction algorithms. In ICML (Vol. 98, pp. 445–453).

Raghavan, V., Bollmann, P., & Jung, G. S. (1989). A critical investigation of recall and precision as measures of retrieval system performance. ACM Transactions on Information Systems, 7(3), 205–229. doi:10.1145/65943.65945

Seliya, N., Khoshgoftaar, T. M., & Van Hulse, J. (2009, November). A study on the relationships of classifier performance metrics. Proceedings of the 21st International Conference on Tools with Artificial Intelligence (ICTAI '09) (pp. 59–66). IEEE. doi:10.1109/ICTAI.2009.25

Sokolova, M., Japkowicz, N., & Szpakowicz, S. (2006). Beyond accuracy, F-score and ROC: A family of discriminant measures for performance evaluation. Proceedings of Advances in Artificial Intelligence (AI 2006) (pp. 1015–1021). Springer Berlin Heidelberg.

Sokolova, M., & Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing & Management, 45(4), 427–437. doi:10.1016/j.ipm.2009.03.002
ENDNOTES
1 http://www.cs.waikato.ac.nz/ml/weka/index.html