International Journal of Mobile Computing and Multimedia Communications, 6(4), 20-35, October-December 2014
Copyright © 2014, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
Keywords Classiers,Classiers’Performances,Correlation,MachineLearningCommunity,Performance
Metrics
ABSTRACT
Theevaluationofclassiers’performancesplaysacriticalroleinconstructionandselectionofclassica-
tionmodel.Althoughmanyperformancemetricshavebeenproposedinmachinelearningcommunity, no
generalguidelinesareavailableamongpractitionersregardingwhichmetrictobeselectedforevaluating
aclassier’s performance.Inthispaper,weattempttoprovidepractitionerswitha strategyonselecting
performancemetricsforclassierevaluation.Firstly,theauthorsinvestigatesevenwidelyusedperformance
metrics,namelyclassicationaccuracy,F-measure,kappastatistic,rootmeansquareerror,meanabsolute
error,theareaunderthereceiveroperatingcurve,andtheareaundertheprecision-recallcurve.Secondly,the
authorsresorttousingPearsonlinearcorrelationandSpearmanrankcorrelationtoanalysesthepotential
relationshipamongthesesevenmetrics.Experimentalresultsshowthatthesecommonlyusedmetricscanbe
dividedintothreegroups,andallmetricswithinagivengrouparehighlycorrelatedbutlesscorrelatedwith
metricsfromdifferentgroups.
A Strategy on Selecting
Performance Metrics for
Classier Evaluation
YangguangLiu,NingboInstituteofTechnology,ZhejiangUniversity,Ningbo,China
YangmingZhou,NingboInstituteofTechnology,ZhejiangUniversity,Ningbo,China
ShitingWen,NingboInstituteofTechnology,ZhejiangUniversity,Ningbo,China
ChaogangTang,ChinaUniversityofMiningandTechnology,Xuzhou,China
1. INTRODUCTION
The correct selection of performance metrics is one of the key issues in evaluating classifiers' performances. A number of performance metrics have been proposed for different application scenarios. For example, accuracy is typically used to measure the percentage of correctly classified test instances, and it is so far the primary metric for assessing classifier performance (Ben et al. 2007) and (Huang et al. 2005); precision and recall are widely applied in information retrieval (Baeza-Yates 1999); the medical decision-making community prefers the area under the receiver operating characteristic (ROC) curve (i.e., AUC) (Lasko et al. 2005). It is a very common situation that a classifier performs well on one performance metric but badly on others.
DOI: 10.4018/IJMCMC.2014100102
For example, boosted trees and SVM classifiers achieve good performance in terms of classification accuracy, while they yield poor performance in terms of root mean square error (Caruana et al. 2004).
In general, a widely accepted consensus is to choose performance metrics depending on the
practical requirements of specific applications. For example, neural networks typically optimize
squared error, and thus the metric of root mean square error can better reflect the actual perfor-
mance of a classifier than other metrics. However, in some cases, specific criteria are unknown
in advance, and practitioners tend to select several measures from widely adopted ones, such
as classification accuracy, kappa statistic, F-measure and AUC, for evaluating a new classifier
(Sokolova et al. 2006) and (Sokolova et al. 2009). Additionally, most metrics are derived from the confusion matrix of the classifier. It is therefore reasonable to expect that some of these performance metrics are closely related, which may introduce redundancy in measuring the performance of classifiers. On the other hand, it is difficult for practitioners to reach a concrete conclusion when two metrics provide conflicting results.
This study focuses on providing a strategy on selecting appropriate performance metrics
for classifiers by using Pearson linear correlation and Spearman rank correlation to analyze the potential relationships among seven widely used performance metrics, namely accuracy, F-measure, kappa statistic, root mean squared error (i.e., RMSE), mean absolute error (i.e., MAE), AUC, and the area under the precision-recall (PR) curve (i.e., AUPRC). We first briefly describe these performance metrics. Based on their definitions, we sketch out their characteristic features via the confusion matrix and preliminarily classify them into three groups, namely threshold metrics, rank metrics, and probability metrics. Then, we use correlation analysis to measure the correlations
of these metrics. The experimental results show that metrics from the same group are closely
correlated but less correlated with metrics from different groups. Additionally, we compare the
correlation changes caused by the size and class distribution of the datasets, which are the main
factors affecting measured values.
The main contributions made in this work are summarized as follows. First, we divide these
seven performance metrics into three groups by analyzing their definitions. Experimental results
confirm that metrics inside the same group have high correlation, and metrics from different
groups have low correlation. Second, based on the experimental results, we also provide practitioners with the following strategies for selecting performance metrics to evaluate a classifier's performance. For balanced training data sets, one should select multiple metrics to evaluate the classifier, with at least one metric selected from each group. For imbalanced training data sets, a classifier does not need to achieve optimal performance on all groups of metrics; instead, as long as the classifier meets the performance requirements of an application as measured by certain group(s) of metrics, we recommend adopting it regardless of its less satisfactory performance on other groups of metrics.
Compared with existing work, our work concentrates on investigating the relationships among several popular performance metrics. Based on their definitions and our experiments, a clear taxonomy of these metrics is given. It is worth noting that we are not attempting to prescribe specific performance metrics for particular applications. Instead, we resort to correlation analysis to discover the potential relationships among the most commonly used performance metrics, and to provide practitioners with a more profound understanding of these metrics and a strategy for selecting performance metrics for classifier evaluation.
The outline of this paper is as follows. Section 2 briefly reviews some related work. Seven common performance metrics used in practice are described in Section 3. Section 4 introduces the correlation analysis methods and explains the details of the experimental setup. The experimental results and some discussion are provided in Section 5. Finally, Section 6 gives our conclusions and suggestions for future research.
2. RELATED WORK
The evaluation of classifiers' performances is a critical step in the construction and selection of classification models. Because of its great importance, a number of articles have been published in data analytics domains. Additionally, there are several books on this topic (Hand 1997), (Pepe 2003), (Gönen 2007), and (Krzanowski et al. 2009). In what follows, we focus on related work on popular metrics.
Accuracy is typically used to measure the predictive ability of a classification model on the testing samples. It has long been the major metric in areas such as machine learning and data mining (Ben et al. 2007) and (Demšar 2006). Despite its ease of implementation, it has the following disadvantages: (1) it does not take the class distribution into consideration and is often biased towards the majority class (Provost 1998); (2) it does not compensate for success due to mere chance (Ben et al. 2007).
The ROC curve is an alternative metric used in pattern recognition and machine learning (Lasko et al. 2005), (Bradley 1997), and (Fawcett 2006). One of its attractive properties is its ability to handle different class distributions. In addition, AUC has an important statistical property: it is equivalent to the Mann-Whitney-Wilcoxon U statistic, which provides a naturally intuitive interpretation of the ROC curve (Bradley 2014). The work in (Ferri et al. 2011) offers an alternative, coherent interpretation of AUC as linearly related to expected loss. Nonetheless, AUC may give misleading results when ROC curves cross each other (Hand 2009). Furthermore, AUC is insensitive to the class distribution. (Kaymak et al. 2012) proposed a simple alternative scalar metric to AUC, known as the area under the kappa curve (i.e., AUK), which compensates for the class indifference of the AUC. AUK is particularly suitable for evaluating classifier performance on datasets with skewed class distributions.
Many performance metrics have been developed for evaluating the performance of classification algorithms. It is not a surprise that a classifier performs well under one metric but badly under another metric. For example, it has been shown through extensive experiments that naive Bayes and C4.5 classifiers have similar performance in terms of average predictive accuracy, with no significant difference between them. However, a naive Bayes classifier is significantly better than C4.5 decision trees in terms of AUC (Huang et al. 2003) and (Huang et al. 2005).
Additionally, researchers have investigated the relationship between AUC and accuracy (Huang et al. 2005) and (Cortes et al. 2004). The studies show that the average AUC increases monotonically with classification accuracy, but the standard deviation is noticeable for uneven distributions and higher error rates. Thus, algorithms designed to minimize the error rate may not lead to the best possible AUC values (Huang et al. 2005). Similar work can be found in a study of the relationship between the PR curve and the ROC curve (Davis et al. 2006), which showed that there is a tight relationship between ROC space and PR space, such that a curve dominates in ROC space if and only if it dominates in PR space.
There are some previous studies that compare and analyze the relationships between metrics. (Caruana et al. 2004) analyze the behavior of multiple metrics across multiple supervised learning algorithms, and the relationships between metrics, by using multi-dimensional scaling and correlation. (Seliya et al. 2009) apply factor analysis to investigate the relationships among classifiers' performance space, which is characterized by 22 metrics. (Hernández-Orallo et al. 2012) explore many old and new threshold choice methods: fixed, score-uniform, score-driven, rate-driven and optimal, among others. (Ferri et al. 2009) explore the relationships among 18 different performance metrics in multiple different scenarios, identifying clusters and relationships between metrics. However, the clustering among the performance
metrics is not obvious. In addition, some commonly-used performance metrics are not evaluated,
such as the area under the PR curve.
3. CLASSIFIER PERFORMANCE METRICS
In this section, we describe seven common performance metrics in data analytics domains. For a binary classification problem, a confusion matrix can be constructed to depict the numbers of instances falling into each of the four possible outcomes, as given in Table 1. Based on the definitions of the evaluation metrics, we preliminarily divide these seven metrics into three groups, namely threshold metrics, rank metrics, and probability metrics. The notations used in this paper can be found in Table 2.
Table 1. The confusion matrix for binary classification

                          Predicted class
                          +                        -
Actual class   +          True Positive (TP)       False Negative (FN)
               -          False Positive (FP)      True Negative (TN)

Table 2. Notations

Notation    Meaning
ACC         Accuracy
RMSE        Root Mean Square Error
ROC         Receiver Operating Characteristic
MAE         Mean Absolute Error
AUC         Area Under the ROC Curve
FSC         F-Score
KAP         Kappa Statistic
r           Pearson Correlation Coefficient
AUPRC       Area Under the PR Curve
ρ           Spearman Rank Correlation

Threshold metrics, which are sensitive to thresholds, include accuracy, F-measure, and kappa statistic. These metrics do not consider how close the predicted value is to the true value; they only consider whether the predicted value is above or below a threshold. In the following, we present the threshold metrics in detail.

Accuracy (ACC): This metric is the most popular performance metric for evaluating classifiers. It is defined as the percentage of correct classifications:
$\mathrm{ACC} = \frac{TP + TN}{M}$  (1)

where $M$ denotes the total number of positive samples ($P$) and negative samples ($N$), and $TP$ and $TN$ denote the numbers of true positives and true negatives, respectively.
Kappa Statistic (KAP): This metric was originally proposed to measure the degree of agreement between two raters; in classifier evaluation it measures the agreement between the predicted and the actual labels, corrected for agreement due to chance. It is defined as follows:

$\mathrm{KAP} = \frac{P_0 - P_C}{1 - P_C}$  (2)

where $P_0$ is the prediction accuracy of the classifier as defined in Equation (1), and $P_C = (P\hat{P} + N\hat{N})/M^2$ is the "agreement" probability due to chance. Additionally, $\hat{P}$ and $\hat{N}$ represent the total numbers of samples labeled as positive and negative, respectively.
F-Measure (also F-Score/FSC): This metric has been widely applied in the field of information retrieval (Baeza-Yates et al. 1999). It is the harmonic mean of precision and recall:

$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$  (3)

where $\mathrm{Precision} = TP/\hat{P}$ and $\mathrm{Recall} = TP/P$.
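To make the three threshold metrics concrete, the short Python sketch below computes ACC, KAP, and FSC directly from the four confusion-matrix counts of Table 1; the counts themselves are hypothetical and chosen only for illustration, not taken from the paper's experiments.

```python
# Threshold metrics computed from hypothetical confusion-matrix counts (Table 1).
TP, FN, FP, TN = 40, 10, 5, 45            # example values, for illustration only
M = TP + FN + FP + TN                     # total number of samples
P, N = TP + FN, FP + TN                   # actual positives / negatives
P_hat, N_hat = TP + FP, FN + TN           # samples labeled positive / negative

acc = (TP + TN) / M                       # Equation (1)

p0 = acc                                  # observed agreement
pc = (P * P_hat + N * N_hat) / M**2       # chance agreement P_C
kap = (p0 - pc) / (1 - pc)                # Equation (2)

precision = TP / P_hat
recall = TP / P
fsc = 2 * precision * recall / (precision + recall)   # Equation (3)

print(f"ACC={acc:.3f}, KAP={kap:.3f}, FSC={fsc:.3f}")
```

Note that ACC and FSC depend only on the counts themselves, while KAP additionally discounts the agreement that would be expected by chance, which is why it behaves differently on skewed class distributions.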
The areas under the ROC and PR curves are the rank metrics, which measure how well a model ranks the positive instances above the negative instances. The ROC and PR curves have been widely used in information retrieval. These metrics can be viewed as a summary of the performance of a model across all possible thresholds.
Area Under the ROC Curve (AUC): The ROC curve is a very useful two-dimensional depiction of the trade-off between the true positive rate (i.e., $t = TP/P$) and the false positive rate (i.e., $f = FP/N$) (Fawcett 2006). In order to compare the performances of different classifiers, one often calculates the area under the ROC curve. In our notation, the AUC is defined as follows:

$\mathrm{AUC} = \int_0^1 t \, df$  (4)
Area Under the PR Curve (AUPRC): It usually serves as an alternative metric to AUC, especially in the information retrieval area (Ngo 2011) and (Raghavan et al. 1989). The area under the PR curve is calculated as follows:

$\mathrm{AUPRC} = \int_0^1 p \, dt$  (5)

where $t = TP/P$ is the recall and $p = TP/\hat{P}$ is the precision.
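To illustrate how these two rank metrics summarize a ranking, the sketch below approximates Equations (4) and (5) by sorting instances by their predicted score and integrating with the trapezoidal rule; the scores and labels are synthetic and serve only as an illustration.

```python
import numpy as np

# Synthetic scores and labels, used only to illustrate Equations (4) and (5).
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.75, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1])

P, N = y_true.sum(), len(y_true) - y_true.sum()
order = np.argsort(-scores)                      # rank instances by decreasing score
y_sorted = y_true[order]

tp = np.cumsum(y_sorted)                         # true positives after each cut-off
fp = np.cumsum(1 - y_sorted)                     # false positives after each cut-off
tpr = np.concatenate(([0.0], tp / P))            # t = TP / P  (recall)
fpr = np.concatenate(([0.0], fp / N))            # f = FP / N
prec = np.concatenate(([1.0], tp / (tp + fp)))   # p = TP / (TP + FP)

def trapezoid(y, x):
    """Trapezoidal approximation of the integral of y with respect to x."""
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2))

auc = trapezoid(tpr, fpr)      # Equation (4): integral of t over f
auprc = trapezoid(prec, tpr)   # Equation (5): integral of p over t

print(f"AUC={auc:.3f}, AUPRC={auprc:.3f}")
```

scikit-learn's roc_auc_score and average_precision_score compute comparable quantities, the latter with a slightly different (step-wise) interpolation of the PR curve.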
The probability metrics measure the deviation between the predicted value and the true value. The metrics we study here are the mean absolute error and the root mean squared error, which neither directly compare the results with a threshold value nor directly compare the instances' ordering with one another. These metrics are widely used in regression problems, especially for assessing the reliability of classifiers.
Root Mean Square Error (RMSE): This is a principal and frequently used metric, which measures the difference between the value predicted by a classifier and the true value. It is defined as follows:

$\mathrm{RMSE} = \sqrt{\frac{1}{M}\sum_{i=1}^{M}\big(\mathrm{Pred}_c(i) - \mathrm{True}_c(i)\big)^2}$  (6)

where $\mathrm{Pred}_c(i)$ denotes the predicted probability that instance $i$ belongs to class $c$, and $\mathrm{True}_c(i)$ represents the actual probability.
Mean Absolute Error (MAE): It is usually used as an alternative to the root mean square error. It averages the magnitudes of the individual errors without taking their signs into account. The formula for calculating the mean absolute error is

$\mathrm{MAE} = \frac{1}{M}\sum_{i=1}^{M}\big|\mathrm{Pred}_c(i) - \mathrm{True}_c(i)\big|$  (7)
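A minimal sketch of these two probability metrics follows, assuming the classifier outputs a probability for the positive class and the true probability True_c(i) is taken to be the 0/1 class label; the predicted probabilities are hypothetical.

```python
import numpy as np

# Hypothetical predicted probabilities for the positive class and the true labels.
pred = np.array([0.9, 0.2, 0.7, 0.4, 0.8, 0.1])
true = np.array([1,   0,   1,   0,   1,   0])   # True_c(i) taken as the 0/1 label

rmse = np.sqrt(np.mean((pred - true) ** 2))     # Equation (6)
mae = np.mean(np.abs(pred - true))              # Equation (7)

# The experiments in Section 4.2 work with 1-RMSE and 1-MAE so that, like the
# other metrics, larger values indicate better performance.
print(f"1-RMSE={1 - rmse:.3f}, 1-MAE={1 - mae:.3f}")
```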
Although a number of evaluation metrics have been proposed so far, we focus on several
common performance metrics. In what follows, we will briefly introduce two correlation analysis
methods, and use them to investigate the relationships among the above seven performance metrics.
4. METHODOLOGY
4.1. Correlation Analysis Methods
Correlation is a measure of the strength of the relationship between two variables. A strong correlation implies that there exists a close relationship between the variables;
while a weak correlation means that the variables are hardly related. In the experiments, we resort
to correlation analysis to investigate the relationships among the seven performance metrics.
In what follows, we will briefly describe the correlation analysis techniques, including Pearson
linear correlation and Spearman rank correlation.
The most widely used correlation is the Pearson correlation, which measures the strength and direction of the linear relationship between two variables (Ahlgren et al. 2003). It ranges from -1 to +1. The Pearson correlation coefficient ($r$) between two random variables $x$ and $y$ is defined as follows:

$r = \frac{E[(x - \mu_x)(y - \mu_y)]}{\sigma_x \sigma_y}$  (8)
where $\mu_x$ and $\mu_y$ are the expected values of $x$ and $y$, and $\sigma_x$ and $\sigma_y$ are their standard deviations.
When the variables are not normally distributed or the relationship between the variables is not linear, it may be more appropriate to use the Spearman rank correlation coefficient (Govindarajulu 1992). The Spearman correlation coefficient is a non-parametric measure of the statistical dependence between two variables. It ranges from -1 to +1. A clear description of the difference between Pearson linear correlation and Spearman rank correlation can be found in Figure 1. The formula for calculating the Spearman rank correlation ($\rho$) is as follows:

$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$  (9)

where $d_i$ is the difference between the ranks of instance $i$ and $n$ is the number of instances.
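As a small worked illustration of Equations (8) and (9), the sketch below compares the two coefficients on a monotonic but non-linear relationship, mirroring the distinction drawn in Figure 1; scipy is assumed to be available.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# A monotonic but non-linear relationship: y increases with x, but not linearly.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = x ** 3

r, _ = pearsonr(x, y)       # Equation (8): strength of the linear relationship
rho, _ = spearmanr(x, y)    # Equation (9): agreement of the rankings

print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")   # rho = 1.0, r < 1.0
```

Because the ranks of x and y agree perfectly, ρ equals 1, while r stays below 1 since the points do not lie on a straight line.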
4.2. Experimental Settings
In order to investigate the relationships among the seven commonly used performance metrics,
we conduct four sets of experiments. First, we carry out a correlation analysis of performance metrics within the same group on a single dataset. Correspondingly, we make a correlation analysis of metrics from different groups. Then, we give an overall correlation analysis over all datasets. Finally, we analyze the correlation changes of performance metrics caused by the size and class distribution of the datasets.
Figure1.TheMSDofthenodeinGCLThedifferencebetweenPearsonlinearcorrelation(
r
)
andSpearmanrankcorrelation(
ρ
).
r
measureshowclosepointsonascattergrapharetoa
straightline;
ρ
measuresthetendencyfor
y
toincrease(ortodecrease)as
x
increases,but
notnecessarilyinalinearway
In the following, we briefly introduce the algorithms, the datasets, and the specific experimental procedures.
The experiments are run in Weka 3.7.6 (see Endnote 1). We use eight well-known classification models: Artificial Neural Network, C4.5 (J48), k-Nearest Neighbors (kNN), Logistic Regression, Naive Bayes, Random Forest, Bagging with 25 J48 trees, and AdaBoost with 25 J48 trees. More details about these classifiers and their Weka implementations can be found in (Ngo 2011). All the results are derived by stratified 10-fold cross-validation, and the default parameters are used in the experiments.
A total of 18 binary classification datasets are tested in the experiments. Table 3 lists detailed information about the datasets, such as their size, number of features, and class distribution.
In the experiments, each model is evaluated using 10-fold cross-validation and is applied to each of the 18 binary classification problems. We obtain 1,800 results for each algorithm, and 14,400 results in total for the eight algorithms. To study the relationships of metrics inside the same group and of metrics from different groups, we conduct the experiment on the House-voting dataset based on a kNN classifier. To verify the generalization of the experimental results, we make an overall analysis for all datasets based on the eight selected algorithms. In addition, we also analyze the correlation changes caused by the size and class distribution of the dataset. Because performance metric values are influenced in different ways by the specific problem, such as class distribution and dataset size, we first calculate the correlation of the performance metrics of each dataset on every algorithm. In each case, we make a correlation analysis by computing Pearson linear correlation coefficients and Spearman rank correlation coefficients among the seven metrics. In the experiments, we work with 1-RMSE and 1-MAE so that, like the other metrics, larger values indicate better performance.
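The experiments themselves were run in Weka, but the overall procedure can be sketched in Python with scikit-learn: compute the seven metrics on cross-validated predictions of several classifiers and then correlate the resulting metric columns. The snippet below is a simplified, hypothetical stand-in for that pipeline (a synthetic dataset, three classifiers, 10 folds), not the authors' original code.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import (accuracy_score, cohen_kappa_score, f1_score,
                             roc_auc_score, average_precision_score,
                             mean_absolute_error, mean_squared_error)

# A synthetic binary dataset standing in for one of the 18 benchmark datasets.
X, y = make_classification(n_samples=500, n_features=16, weights=[0.6, 0.4],
                           random_state=0)

rows = []
for clf in [DecisionTreeClassifier(random_state=0),
            KNeighborsClassifier(), GaussianNB()]:
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    for tr, te in cv.split(X, y):
        prob = clf.fit(X[tr], y[tr]).predict_proba(X[te])[:, 1]
        pred = (prob >= 0.5).astype(int)
        rows.append({
            "ACC": accuracy_score(y[te], pred),
            "KAP": cohen_kappa_score(y[te], pred),
            "FSC": f1_score(y[te], pred),
            "1-MAE": 1 - mean_absolute_error(y[te], prob),
            "1-RMSE": 1 - np.sqrt(mean_squared_error(y[te], prob)),
            "AUC": roc_auc_score(y[te], prob),
            "AUPRC": average_precision_score(y[te], prob),
        })

df = pd.DataFrame(rows)
pearson = df.corr(method="pearson")      # analogue of the bottom-left of Table 4
spearman = df.corr(method="spearman")    # analogue of the top-right of Table 4
print(spearman.round(2))
```

Each row of the data frame is one fold of one classifier, so the correlation matrices summarize how the seven metric values move together across runs, in the same spirit as Tables 4 and 5.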
5. RESULTS
In this section, we present some interesting findings from the analysis of the Pearson linear ($r$) and Spearman rank ($\rho$) correlations among the metrics.
Firstly, we perform the experiment on the House-voting dataset to study the relationships among the performance metrics based on the kNN classification algorithm. Table 4 shows the correlation matrix of the performance metrics. From this table, we have the following preliminary observations: all correlation coefficients are positive, and the Pearson linear correlations are extremely close to the corresponding Spearman rank correlations.
Figure 2(a) shows the correlation of metrics from the same group, from which we have the
following observations. (1) For Spearman rank correlation, the correlation between any two metrics from the same group is very high, i.e., 0.82 for probability metrics, 0.92 for rank metrics, and
0.92 to 1.0 for threshold metrics. (2) For Pearson linear correlation, the correlation is 0.91 for
probability metrics, 0.95 for rank metrics, and 0.99 to 1.0 for threshold metrics.
Figure 2(b) shows that the correlation of metrics from different groups is not so high compared
with the results shown in Figure 2 (a). That is, for Spearman rank correlation, any inter-group
correlation ranges from 0.22 to 0.65, i.e., 0.22 to 0.37 between threshold and rank metrics, 0.35
to 0.55 between rank and probability metrics, and 0.42 to 0.65 between threshold and probability
metrics; whereas for Pearson linear correlation, any inter-group correlation ranges from 0.17 to
0.65, i.e., 0.17 to 0.19 between threshold and rank metrics, 0.31 to 0.49 between rank and prob-
ability metrics, and 0.57 to 0.65 between threshold and probability metrics.
In conclusion, the metrics from the same group are closely associated with each other, while
metrics from different groups are not so closely related. That is to say, if the specific criteria are unknown in advance, one should select multiple metrics to evaluate the classifier, with at least one metric selected from each group.
Secondly, we calculate the correlation among the performance metrics on each dataset to investigate the adaptability of our grouping of metrics, i.e., the correlation over the eight models for one dataset, and obtain the average of the 18 correlation matrices, as shown in Table 5. Compared with Table 4, there are some changes in the correlation values, i.e., the intra-group correlations (in bold) become smaller.
Table3.Datasets*andtheirproperties
Data set #instance #features %min-%max
Colic 368 22 36.96-63.04
Credit-rating 690 15 44.50-55.50
Heart-disease 303 13 45.54-54.46
Heart-statlog 270 23 44.45-55.55
Hepatitis 155 19 20.65-79.35
House-voting 435 16 38.62-61.38
Ionosphere 351 34 35.90-64.10
Kr-vs-kp 3196 36 47.78-52.22
Monks1 556 6 50.00-50.00
Monks2 601 6 24.28-75.72
Monks3 554 6 48.01-51.90
Mushroom 8124 22 48.20-51.80
Optdigits 5620 64 49.79-50.21
Sick 3772 29 6.12-93.87
Sonar 208 60 46.63-53.37
Spambase 4601 57 39.40-60.60
Spectf 80 44 50.00-50.00
Tic-tac-toe 958 8 34.66-65.34
*Datasets from http://www.cs.waikato.ac.nz/ml/weka/datasets.html.
Table4.Pearson(bottom-left)andSpearman(top-right)correlationcoefficientsforhouse-voting
datasetbasedonkNNalgorithm(intra-groupcorrelationinbold)
Threshold Metrics Probability Metrics Rank Metrics
ACC KAP FSC 1-MAE 1-RMSE AUC AUPRC
ACC 1.0 0.93 0.65 0.58 0.32 0.22
KAP 1.0 0.92 0.62 0.57 0.37 0.27
FSC 1.0 0.99 0.63 0.42 0.27 0.22
1-MAE 0.65 0.65 0.63 0.82 0.50 0.35
1-RMSE 0.59 0.61 0.57 0.91 0.55 0.47
AUC 0.18 0.18 0.17 0.49 0.39 0.92
AUPRC 0.19 0.19 0.19 0.31 0.31 0.95
A clearer picture can be found in Figure 3, where the Pearson linear correlation is shown on the x-axis and the Spearman rank correlation coefficient on the y-axis. A further observation is that there is a strong correlation (greater than 0.77) between the performance metrics in the same group and a low correlation (less than 0.67) between metrics from different groups.
Figure2.Intra-groupcorrelations(a) andinter-groupcorrelations(b)of metricsforHouse-
votingdataset
These experimental results verify the reasonableness of our grouping of the performance metrics.
This shows that a classifier does not need to achieve optimal performance on all groups of metrics; instead, as long as the classifier meets the performance requirements of an application as measured by certain group(s) of metrics, we recommend adopting it regardless of its less satisfactory performance on other groups of metrics.
Finally, we investigate the correlation of metrics with respect to the size and class distribution of the data sets, and the results are shown in Figures 4 and 5.
Table5.Pearosn(bottom-left)andSpearman(top-right)correlationson18datasets(intra-group
correlationinbold)
Threshold Metrics Probability Metrics Rank Metrics
ACC KAP FSC 1-MAE 1-RMSE AUC AUPRC
ACC 0.94 0.91 0.63 0.59 0.28 0.26
KAP 0.96 0.90 0.60 0.57 0.29 0.26
FSC 0.94 0.94 0.60 0.55 0.29 0.26
1-MAE 0.67 0.63 0.63 0.77 0.46 0.41
1-RMSE 0.63 0.61 0.58 0.79 0.43 0.39
AUC 0.30 0.29 0.29 0.47 0.45 0.85
AUPRC 0.26 0.26 0.26 0.42 0.40 0.89
Figure3.Thedistributionofcorrelationsbetweensevencommonperformancemetrics.There
haverelativelyhighcorrelationbetweenmetricsinsidethesamegroup(top-right)andlowcor-
relationbetweenthemetricsformdifferentgroups(bottom-left)
Both figures show that the correlation ranges from 0.75 to 1.0 for metrics in the same group and from 0.05 to 0.65 for metrics in different groups. The results comply with those shown in Figure 2. In Figure 4, a data set is regarded as a large data set if it contains over 3,000 instances; otherwise, it is regarded as a small data set. The correlation for metrics from the same group on large data sets is slightly higher than that on small data sets, and the results are the opposite for metrics from different groups. In general, classification algorithms treat the data as independent and identically distributed (i.i.d.). As the data set size increases, the impact of variance can be expected to decrease. According to the definitions of Pearson linear correlation and Spearman rank correlation, both are related to the prediction variance of the algorithm and to the data set size. In Figure 5, the results show that the correlation on balanced data sets is almost the same as that on all data sets, while the correlation on imbalanced data sets fluctuates slightly around that of the balanced data sets. Typically, there are two parts to solving a prediction problem: model selection and model assessment. In model selection, we estimate the performance of various competing models with the hope of choosing the best one. Having chosen the final model, we assess it by estimating the prediction error on new data. There is no obvious choice of how to split the data; it depends on the signal-to-noise ratio, which we, of course, do not know. A reasonable explanation is that the parameters of the algorithms in our experiments are stable after model selection for the balanced data sets, but sensitive to the imbalanced data sets.
6. CONCLUSION
In this paper, we have intensively investigated the relationships among seven widely adopted
performance metrics for binary classification. The major contributions of this work are two-fold. (1) Based on Pearson and Spearman correlation analysis, we have verified the reasonableness of classifying the seven commonly used metrics into three groups, namely threshold metrics, rank metrics, and probability metrics. Any two metrics have a high correlation if they are from the same group, and a low correlation otherwise. This finding provides practitioners with a better understanding of the relationships among these common metrics. (2) Based on the experimental analysis, this work also suggests a strategy for choosing adequate measures to evaluate a classifier's performance from a user perspective. In addition, we have investigated the influence of datasets with different sizes and class distributions on the correlations of the different metrics.
For the next stage of study, an interesting and challenging direction is to investigate an all-
in-one metric that can perform the functions of all three groups of metrics.
ACKNOWLEDGMENT
This work was supported by Zhejiang Provincial Natural Science Foundation of China, Grant
No. LY15F020035, LY16F030012 and LY15F030016, and partially supported by Ningbo Natu-
ral Science Foundation of China, Grant No. 2014A610066, 2011A610177, 2012A610018, and
partially supported by the Scientific Research Fund of Zhejiang Provincial Education Department, Grant No. Y201534788, and by the Jiangsu Province Natural Science Foundation of China under Grant No. BK20150201.
Figure4.Spearmanrankcorrelation(a)andPearsonlinearcorrelation(b)amongthemetrics
fordatasetswithdifferentsizes
Figure5.Spearmanrankcorrelation(a)andPearsonlinearcorrelation(b)amongthemetrics
fordatasetswithdifferentclassdistribution
REFERENCES
Ahlgren, P., Jarneving, B., & Rousseau, R. (2003). Requirements for a cocitation similarity measure, with special reference to Pearson's correlation coefficient. Journal of the American Society for Information Science and Technology, 54(6), 550–560. doi:10.1002/asi.10242

Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval (Vol. 463). New York: ACM Press.

Ben-David, A. (2007). A lot of randomness is hiding in accuracy. Engineering Applications of Artificial Intelligence, 20(7), 875–885. doi:10.1016/j.engappai.2007.01.001

Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7), 1145–1159. doi:10.1016/S0031-3203(96)00142-2

Bradley, A. P. (2014). Half-AUC for the evaluation of sensitive or specific classifiers. Pattern Recognition Letters, 38, 93–98. doi:10.1016/j.patrec.2013.11.015

Caruana, R., & Niculescu-Mizil, A. (2004, August). Data mining in metric space: An empirical analysis of supervised learning performance criteria. Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 69-78). ACM. doi:10.1145/1014052.1014063

Cortes, C., & Mohri, M. (2004). AUC optimization vs. error rate minimization. Advances in Neural Information Processing Systems, 16(16), 313–320.

Davis, J., & Goadrich, M. (2006, June). The relationship between Precision-Recall and ROC curves. Proceedings of the 23rd International Conference on Machine Learning (pp. 233-240). ACM. doi:10.1145/1143844.1143874

Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1–30.

Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861–874. doi:10.1016/j.patrec.2005.10.010

Ferri, C., Hernández-Orallo, J., & Flach, P. A. (2011). A coherent interpretation of AUC as a measure of aggregated classification performance. Proceedings of the 28th International Conference on Machine Learning (ICML-11) (pp. 657-664).

Ferri, C., Hernández-Orallo, J., & Modroiu, R. (2009). An experimental comparison of performance measures for classification. Pattern Recognition Letters, 30(1), 27–38. doi:10.1016/j.patrec.2008.08.010

Gönen, M. (2007). Analyzing receiver operating characteristic curves with SAS. SAS Institute.

Govindarajulu, Z., Kendall, M., & Gibbons, J. D. (1992). Rank correlation methods. Technometrics, 34(1), 108–108. doi:10.2307/1269571

Hand, D. J. (1997). Construction and assessment of classification rules (Vol. 15). Chichester: Wiley.

Hand, D. J. (2009). Measuring classifier performance: A coherent alternative to the area under the ROC curve. Machine Learning, 77(1), 103–123. doi:10.1007/s10994-009-5119-5

Hernández-Orallo, J., Flach, P., & Ferri, C. (2012). A unified view of performance metrics: Translating threshold choice into expected classification loss. Journal of Machine Learning Research, 13(1), 2813–2869.

Huang, J., & Ling, C. X. (2005). Using AUC and accuracy in evaluating learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 17(3), 299–310.

Huang, J., Lu, J., & Ling, C. X. (2003, November). Comparing naive Bayes, decision trees, and SVM with AUC and accuracy. Proceedings of the Third IEEE International Conference on Data Mining (ICDM '03) (pp. 553-556). IEEE. doi:10.1109/ICDM.2003.1250975

Kaymak, U., Ben-David, A., & Potharst, R. (2012). The AUK: A simple alternative to the AUC. Engineering Applications of Artificial Intelligence, 25(5), 1082–1089. doi:10.1016/j.engappai.2012.02.012
Krzanowski, W. J., & Hand, D. J. (2009). ROC curves for continuous data. CRC Press. doi:10.1201/9781439800225

Lasko, T. A., Bhagwat, J. G., Zou, K. H., & Ohno-Machado, L. (2005). The use of receiver operating characteristic curves in biomedical informatics. Journal of Biomedical Informatics, 38(5), 404–415. doi:10.1016/j.jbi.2005.02.008 PMID:16198999

Ngo, T. (2011). Data mining: Practical machine learning tools and technique. Software Engineering Notes, 36(5), 51–52. doi:10.1145/2020976.2021004

Pepe, M. S. (2003). The statistical evaluation of medical tests for classification and prediction. Oxford University Press.

Provost, F. J., Fawcett, T., & Kohavi, R. (1998, July). The case against accuracy estimation for comparing induction algorithms. In ICML (Vol. 98, pp. 445-453).

Raghavan, V., Bollmann, P., & Jung, G. S. (1989). A critical investigation of recall and precision as measures of retrieval system performance. ACM Transactions on Information Systems, 7(3), 205–229. doi:10.1145/65943.65945

Seliya, N., Khoshgoftaar, T. M., & Van Hulse, J. (2009, November). A study on the relationships of classifier performance metrics. Proceedings of the 21st International Conference on Tools with Artificial Intelligence (ICTAI '09) (pp. 59-66). IEEE. doi:10.1109/ICTAI.2009.25

Sokolova, M., Japkowicz, N., & Szpakowicz, S. (2006). Beyond accuracy, F-score and ROC: A family of discriminant measures for performance evaluation. Proceedings of Advances in Artificial Intelligence (AI 2006) (pp. 1015-1021). Springer Berlin Heidelberg.

Sokolova, M., & Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing & Management, 45(4), 427–437. doi:10.1016/j.ipm.2009.03.002
ENDNOTES
1 http://www.cs.waikato.ac.nz/ml/weka/index.html