Conference Paper

The Relationship Between Precision-Recall and ROC Curves

Abstract

Receiver Operator Characteristic (ROC) curves are commonly used to present results for binary decision problems in machine learning. However, when dealing with highly skewed datasets, Precision-Recall (PR) curves give a more informative picture of an algorithm's performance. We show that a deep connection exists between ROC space and PR space, such that a curve dominates in ROC space if and only if it dominates in PR space. A corollary is the notion of an achievable PR curve, which has properties much like the convex hull in ROC space; we show an efficient algorithm for computing this curve. Finally, we also note that differences between the two types of curves are significant for algorithm design. For example, in PR space it is incorrect to linearly interpolate between points. Furthermore, algorithms that optimize the area under the ROC curve are not guaranteed to optimize the area under the PR curve.
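The interpolation claim can be made concrete with a short sketch (ours, not the authors' code): between two points on a PR curve, the confusion-matrix counts vary linearly, so precision along the connecting segment traces a curve that lies below the straight line between the points. All counts below are invented for illustration.

```python
# Illustrative sketch, not code from the paper: between PR points A and B,
# intermediate confusion-matrix counts move linearly in (TP, FP), so the
# correct precision follows a curve rather than the linear chord.

def pr_point(tp, fp, total_pos):
    """Recall and precision for given true/false positive counts."""
    return tp / total_pos, tp / (tp + fp)

total_pos = 100                       # assumed number of positive examples
tp_a, fp_a = 20, 5                    # point A: conservative threshold
tp_b, fp_b = 80, 60                   # point B: permissive threshold

for step in range(5):                 # intermediate points between A and B
    frac = step / 4
    tp = tp_a + frac * (tp_b - tp_a)
    fp = fp_a + frac * (fp_b - fp_a)  # FP grows linearly with TP
    r, p = pr_point(tp, fp, total_pos)
    p_linear = (1 - frac) * pr_point(tp_a, fp_a, total_pos)[1] \
             + frac * pr_point(tp_b, fp_b, total_pos)[1]
    print(f"recall={r:.2f}  correct precision={p:.3f}  linear={p_linear:.3f}")
```

At recall 0.5 the correct precision is about 0.61 while the linear chord gives about 0.69, so linear interpolation overstates performance.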
... Although the AUROC of 0.987 on our training sets indicates the outstanding performance of this model, we also used other evaluation approaches for further verification. Compared with ROC, PRC is more sensitive to class imbalance and better reflects classification performance when the proportions of positive and negative samples differ greatly [33]. As expected, the AUPRC of 0.981 on the training sets demonstrates that our model is highly robust. ...
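The sensitivity to imbalance described in this citing context is easy to reproduce with synthetic scores (a scikit-learn sketch unrelated to the cited study's data): replicating the negative class leaves the AUROC unchanged while the AUPRC falls.

```python
# Synthetic demonstration: the same scores evaluated at growing imbalance
# ratios. AUROC depends only on the ranking of positives vs. negatives,
# so duplicating negatives leaves it unchanged; AUPRC degrades.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
pos = rng.normal(2.0, 1.0, 100)          # scores of positive samples
neg = rng.normal(0.0, 1.0, 100)          # scores of negative samples

for neg_copies in (1, 10, 100):          # increase the imbalance ratio
    y = np.r_[np.ones(100), np.zeros(100 * neg_copies)]
    s = np.r_[pos, np.tile(neg, neg_copies)]
    print(f"1:{neg_copies}  AUROC={roc_auc_score(y, s):.3f}  "
          f"AUPRC={average_precision_score(y, s):.3f}")
```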
Article
Full-text available
Background At present, the diagnostic ability of hepatocellular carcinoma (HCC) based on serum alpha-fetoprotein level is limited. Finding markers that can effectively distinguish cancerous from non-cancerous tissues is important for improving the diagnostic efficiency of HCC. Results In this study, we developed a predictive model for HCC diagnosis using personalized biological pathways combined with a machine learning algorithm based on regularized regression, and carried out the relevant examinations. In two training sets, the overall cross-study-validated area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC) and Brier score of the diagnostic model were 0.987 [95% confidence interval (CI): 0.979–0.996], 0.981 and 0.091, respectively. In addition, the model showed good transferability in the external validation set. In the TCGA-LIHC cohort, the AUROC, AUPRC and Brier score were 0.992 (95%CI: 0.985–0.998), 0.967 and 0.112, respectively. The diagnostic model achieved very impressive performance in distinguishing HCC from non-cancerous liver tissues. Moreover, we further analyzed the extracted biological pathways to explore molecular features and prognostic factors. The risk score generated from a 12-gene signature extracted from the characteristic pathways was correlated with several immune-related pathways and served as an independent prognostic factor for HCC. Conclusion We used personalized biological pathway analysis and a machine learning algorithm to construct a highly accurate HCC diagnostic model. The excellent interpretability and good transferability of this model give it great potential for personalized medicine and can assist clinicians in diagnosing HCC.
... Several measures based on this curve exist, but they all calculate the area under the curve (AUC) in their own slightly different way. Calculating the area under the curve is an effective method for assessing the quality of predictions across the different recall levels [13]. Most of these methods use interpolation to reduce the impact of "wiggles" [14]. ...
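We assume the "wiggle"-reducing interpolation mentioned here is the classic interpolated precision from information-retrieval evaluation (the exact method of [14] may differ); a minimal sketch:

```python
# Interpolated precision (assumed variant): replace the precision at each
# recall level with the maximum precision at any equal-or-higher recall,
# which removes the sawtooth "wiggles" of a raw PR curve.

def interpolated_precision(precisions):
    """Running maximum from the right; recalls assumed sorted ascending."""
    smoothed = list(precisions)
    for i in range(len(smoothed) - 2, -1, -1):
        smoothed[i] = max(smoothed[i], smoothed[i + 1])
    return smoothed

raw = [0.90, 0.60, 0.75, 0.50, 0.55]      # precision at recalls 0.1..0.5
print(interpolated_precision(raw))         # [0.9, 0.75, 0.75, 0.55, 0.55]
```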
Thesis
Detecting the many types of potential road objects is a challenging problem: deep learning methods typically require a tremendous amount of hand-labeled ground truth, which is a very labor-intensive and costly process. Moreover, to compensate for road objects missed by the trained model, autonomous driving vehicles rely on expensive sensors such as RADAR and LiDAR. A vehicle that could detect arbitrary road objects from a limited hand-labeled dataset would therefore be beneficial in terms of cost. This problem is known as generic road obstacle detection (GROD), and this research aims to address it. Recent research has shown promising results in terms of speed and accuracy using You Only Look Once (YOLO) for road object detection. We propose a GROD network that uses YOLO for the generic road object detection problem. The GROD network achieves an mAP of 48.50% on the Berkeley BDD100K dataset, and testing shows it can be used in a real-time environment. The improvements come from modifying YOLOv3 with several techniques: WordTree, Soft-NMS, and a supervised contrastive loss function. The GROD network learns features common to a parent class from the object types in its child classes, and predicts an object as the parent class if that object type was not seen during training. For instance, the vehicle class, a parent node, has the car, bus, and truck classes as child nodes; during training, the vehicle class, even without its own ground truth, learns shared features such as wheels, structure, and shape from its child nodes. When the GROD network sees a type of vehicle it has never encountered before, it can still predict the vehicle class. This thesis shows that YOLO combined with hierarchical classification and supervised contrastive learning is one possible way of solving the GROD problem.
... An in-distribution sample is expected to have a high out-of-distribution score, while an out-of-distribution sample is expected to have a low one. To evaluate out-of-distribution detection performance, we adopt the Area Under the Receiver Operating Characteristic curve (AUROC) [44] to measure the ranking of the out-of-distribution scores. The AUROC indicates the probability that an in-distribution sample has a higher score than an out-of-distribution sample. ...
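The probabilistic reading of the AUROC quoted above can be checked directly with a pairwise count (a sketch with made-up scores):

```python
# AUROC as a rank statistic: the fraction of (ID, OOD) pairs in which the
# in-distribution sample receives the higher score, ties counted as 1/2.

def auroc_by_pairs(id_scores, ood_scores):
    wins = sum((i > o) + 0.5 * (i == o)
               for i in id_scores for o in ood_scores)
    return wins / (len(id_scores) * len(ood_scores))

id_scores  = [0.9, 0.8, 0.7, 0.4]   # hypothetical ID detection scores
ood_scores = [0.6, 0.3, 0.2]        # hypothetical OOD detection scores
print(auroc_by_pairs(id_scores, ood_scores))   # 11/12 ~ 0.917
```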
Preprint
Deep neural networks only learn to map in-distribution inputs to their corresponding ground truth labels in the training phase without differentiating out-of-distribution samples from in-distribution ones. This results from the assumption that all samples are independent and identically distributed without distributional distinction. Therefore, a pretrained network learned from the in-distribution samples treats out-of-distribution samples as in-distribution and makes high-confidence predictions on them in the test phase. To address this issue, we draw out-of-distribution samples from the vicinity distribution of training in-distribution samples for learning to reject the prediction on out-of-distribution inputs. A Cross-class Vicinity Distribution is introduced by assuming that an out-of-distribution sample generated by mixing multiple in-distribution samples does not share the same classes of its constituents. We thus improve the discriminability of a pretrained network by finetuning it with out-of-distribution samples drawn from the cross-class vicinity distribution, where each out-of-distribution input corresponds to a complementary label. Experiments on various in-/out-of-distribution datasets show that the proposed method significantly outperforms existing methods in improving the capacity of discriminating between in- and out-of-distribution samples.
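A rough sketch of the sample construction this abstract describes, as we read it (the mixing weight and data are invented, not taken from the paper):

```python
# Cross-class mixing sketch (our reading of the abstract): a convex
# combination of two in-distribution inputs from *different* classes is
# treated as out-of-distribution, since it should not share the class of
# either constituent.
import numpy as np

def cross_class_mixture(x1, y1, x2, y2, lam=0.5):
    """Return a mixed input, or None if the two samples share a class."""
    if y1 == y2:
        return None                      # same class: not a vicinity OOD sample
    return lam * x1 + (1.0 - lam) * x2   # convex combination of the inputs

x_a, x_b = np.random.rand(32, 32, 3), np.random.rand(32, 32, 3)
x_ood = cross_class_mixture(x_a, 0, x_b, 1)   # labels 0 and 1 differ
print(x_ood.shape)                            # (32, 32, 3)
```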
... The score gap between ID and OOD samples is expected to be large, which indicates we can clearly separate the two kinds of samples. Accordingly, to evaluate the detection performance on OOD samples, we adopt the following three metrics: the area under the receiver operating characteristic curve (AUROC) [48], the false positive rate at 95% true positive rate (FPR95) [20] and the detection error (Detection) [20]. A higher AUROC score and lower FPR95 and Detection scores indicate better detection performance. ...
Preprint
To classify in-distribution samples, deep neural networks learn label-discriminative representations, which, however, are not necessarily distribution-discriminative according to the information bottleneck. Therefore, trained networks could assign unexpectedly high-confidence predictions to out-of-distribution samples drawn from distributions differing from that of the in-distribution samples. Specifically, networks extract the strongly label-related information from in-distribution samples to learn label-discriminative representations but discard the weakly label-related information. Accordingly, networks treat out-of-distribution samples with minimal label-sensitive information as in-distribution samples. Exploiting the different informativeness properties of in- and out-of-distribution samples, a Dual Representation Learning (DRL) method learns distribution-discriminative representations that are weakly related to the labeling of in-distribution samples and combines label- and distribution-discriminative representations to detect out-of-distribution samples. For a label-discriminative representation, DRL constructs the complementary distribution-discriminative representation via an implicit constraint, i.e., by integrating diverse intermediate representations, where an intermediate representation less similar to the label-discriminative representation receives a higher weight. Experiments show that DRL outperforms the state-of-the-art methods for out-of-distribution detection.
... As pointed out in [16], the AUCs of different models are not comparable because the AUC's value depends on cutoffs that are themselves related to the class imbalance of the data. Cost space [17] and precision-recall space [18] are the alternatives proposed to mitigate the evaluation bias imposed by class imbalance. ...
Preprint
Full-text available
The class-imbalance issue is intrinsic to many real-world machine learning tasks, particularly rare-event classification problems. Although the impact and treatment of imbalanced data are widely known, the magnitude of a metric's sensitivity to class imbalance has attracted little attention. As a result, sensitive metrics are often dismissed even when their sensitivity may be only marginal. In this paper, we introduce an intuitive evaluation framework that quantifies metrics' sensitivity to class imbalance. Moreover, we reveal the interesting fact that metrics' sensitivity exhibits logarithmic behavior, meaning that higher imbalance ratios are associated with lower metric sensitivity. Our framework builds an intuitive understanding of the impact of class imbalance on metrics. We believe this can help avoid many common mistakes, especially the less-emphasized and incorrect assumption that all metrics' quantities are comparable under different class-imbalance ratios.
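The premise is easy to check numerically (our example, not the paper's framework): holding a classifier's TPR and FPR fixed while the imbalance ratio grows leaves rate-based metrics untouched but collapses precision.

```python
# Fixed TPR/FPR, growing imbalance: precision is the imbalance-sensitive
# quantity here, while the rate-based metrics do not move at all.
tpr, fpr = 0.9, 0.1
for neg_per_pos in (1, 10, 100, 1000):
    pos, neg = 1000, 1000 * neg_per_pos
    tp, fp = tpr * pos, fpr * neg
    precision = tp / (tp + fp)
    print(f"imbalance 1:{neg_per_pos}  TPR={tpr}  FPR={fpr}  "
          f"precision={precision:.3f}")
```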
Article
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a widely used algorithm for exploratory clustering applications. Although DBSCAN is considered an unsupervised pattern recognition method, it has two parameters that must be tuned before clustering in order to reduce uncertainty: MinPts, the minimum number of points in a cluster segment, and Eps, the radius around selected points of a given dataset. This article presents the performance of a hybrid clustering algorithm that automatically groups datasets in a two-dimensional space using the well-known DBSCAN algorithm. Here, the nearest-neighbor function and a genetic algorithm were used to automate the selection of MinPts and Eps. Furthermore, Factor Analysis (FA) was used for preprocessing, reducing the dimensionality of datasets with more than two dimensions. Finally, the performance of the resulting clustering algorithm, called FA+GA-DBSCAN, was evaluated on artificial datasets. In addition, the precision and entropy of the hybrid clustering algorithm were measured, showing a lower probability of error when clustering the most condensed datasets.
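One common way to derive Eps from nearest-neighbor distances looks roughly like the sketch below (a simple heuristic stand-in, assuming scikit-learn; the paper's genetic-algorithm pipeline is more elaborate):

```python
# Heuristic sketch: read an Eps candidate off the sorted k-nearest-neighbor
# distances (whose "knee" often marks a good density threshold), then run
# DBSCAN. MinPts is assumed here, not optimized as in the paper.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),   # two synthetic 2-D blobs
               rng.normal(3, 0.3, (50, 2))])

min_pts = 4                                   # assumed MinPts for 2-D data
nn = NearestNeighbors(n_neighbors=min_pts).fit(X)
dists, _ = nn.kneighbors(X)
k_dist = np.sort(dists[:, -1])                # distance to MinPts-th neighbor
eps = k_dist[int(0.95 * len(k_dist))]         # crude stand-in for the knee

labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
print(f"eps={eps:.3f}  clusters={len(set(labels) - {-1})}")
```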
Article
Familial Hypercholesterolemia (FH) is an inherited disorder of cholesterol metabolism. Current criteria for FH diagnosis, like the Simon Broome (SB) criteria, lead to high false positive rates. The aim of this work was to explore alternative classification procedures for FH diagnosis, based on different biological and biochemical indicators. For this purpose, logistic regression (LR), naive Bayes classifier (NB), random forest (RF) and extreme gradient boosting (XGB) algorithms were combined with the Synthetic Minority Oversampling Technique (SMOTE), or with threshold adjustment by maximizing the Youden index (YI), and compared. Data were tested through a 10 × 10 repeated k-fold cross validation design. The LR model presented an overall better performance, as assessed by the areas under the receiver operating characteristic (AUROC) and precision-recall (AUPRC) curves, and several operating characteristics (OC), regardless of the strategy to cope with class imbalance. When adopting either data processing technique, significantly higher accuracy (Acc), G-mean and F1 score values were found for all classification algorithms, compared to the SB criteria (p < 0.01), revealing a more balanced predictive ability for both classes, and higher effectiveness in classifying FH patients. Adjustment of the cut-off values through pre- or post-processing methods revealed a considerable gain in sensitivity (Sens) values (p < 0.01). Although the performance of the pre- and post-processing strategies was similar, SMOTE does not cause the model's parameters to lose interpretability. These results suggest that an LR model combined with SMOTE may be an optimal approach for use as a widespread screening tool.
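Threshold adjustment by maximizing the Youden index is straightforward to sketch (one standard formulation, J = sensitivity + specificity - 1; the study's exact procedure may differ):

```python
# Pick the decision threshold maximizing the Youden index J = TPR - FPR,
# using scikit-learn's ROC utilities on synthetic, imbalanced scores.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
y = np.r_[np.ones(50), np.zeros(500)]                   # imbalanced labels
scores = np.r_[rng.normal(1, 1, 50), rng.normal(0, 1, 500)]

fpr, tpr, thresholds = roc_curve(y, scores)
j = tpr - fpr                                           # Youden index per threshold
best = int(np.argmax(j))
print(f"best threshold={thresholds[best]:.3f}  J={j[best]:.3f}")
```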
Article
Full-text available
Gathering information about aquatic life is often done manually using video feeds. This is a time-consuming process, and it would be beneficial to capture this information in a less laborious manner. Video-based object detection can achieve this. Recent research has shown promising results with the use of YOLO for object detection of fish. Detecting fish underwater is difficult because of low visibility, and fish species can therefore be hard to discriminate. To alleviate this, this study proposes the fish detector YOLO Fish, which uses hierarchical techniques in both the classification step and the dataset. Hierarchical techniques better handle situations where features are difficult to discern and make it possible to extract more information from the data. With an mAP of 91.8%, YOLO Fish is a state-of-the-art object detector for Nordic fish species. Additionally, the algorithm has an inference time of 26.4 ms, fast enough to run on real-time video on the high-end GPU Tesla V100.
Article
Fuel power plants are one of the main sources of pollutant emissions, so special attention should be paid to improving the efficiency of the fuel combustion process. Mathematical modeling of the processes in the combustion chamber makes it possible to reliably predict and optimize the dynamic characteristics of a power plant's operation, in order to quantify the emission of harmful substances as well as the environmental, technical, and economic efficiency of various regime control actions and measures, including the use of new types of composite fuels. The main purpose of this article is to illustrate how machine learning methods can play an important role in modeling, predicting, and controlling the combustion process. The paper proposes a mathematical model of an unsteady turbulent combustion process, presents a model of a combustion chamber with a combined burner, and performs a numerical study using the STAR-CCM+ multidisciplinary platform. The influence of various input indicators on the efficiency of the burner devices, evaluated by several output parameters, is investigated. Three possible states of the burners are assumed: optimal, satisfactory, and unsatisfactory.
Conference Paper
Full-text available
When the goal is to achieve the best correct classification rate, cross entropy and mean squared error are typical cost functions used to optimize classifier performance. However, for many real-world classification problems, the ROC curve is a more meaningful performance measure. We demonstrate that minimizing cross entropy or mean squared error does not necessarily maximize the area under the ROC curve (AUC). We then consider alternative objective functions for training a classifier to maximize the AUC directly. We propose an objective function that is an approximation to the Wilcoxon-Mann-Whitney statistic, which is equivalent to the AUC. The proposed objective function is differentiable, so gradient-based methods can be used to train the classifier. We apply the new objective function to real-world customer behavior prediction problems for a wireless service provider and a cable service provider, and achieve reliable improvements in the ROC curve.
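The general idea can be sketched with a smooth surrogate for the pairwise indicator (a sigmoid here; the paper's own approximation uses a different functional form):

```python
# The WMW statistic counts correctly ordered (positive, negative) score
# pairs; replacing the 0/1 indicator with a sigmoid gives a differentiable
# objective whose gradient can train the classifier.
import numpy as np

def soft_wmw(pos_scores, neg_scores, beta=5.0):
    """Smooth approximation to P(score_pos > score_neg), i.e. the AUC."""
    diffs = pos_scores[:, None] - neg_scores[None, :]   # all pairwise gaps
    return 1.0 / (1.0 + np.exp(-beta * diffs))          # sigmoid per pair

pos = np.array([0.8, 0.6, 0.4])
neg = np.array([0.5, 0.1])
print(soft_wmw(pos, neg).mean())   # tends to the exact AUC (5/6) as beta grows
```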
Conference Paper
Full-text available
ROC analysis is increasingly being recognised as an important tool for evaluation and comparison of classifiers when the operating characteristics (i.e. class distribution and cost parameters) are not known at training time. Usually, each classifier is characterised by its estimated true and false positive rates and is represented by a single point in the ROC diagram. In this paper, we show how a single decision tree can represent a set of classifiers by choosing different labellings of its leaves, or equivalently, an ordering on the leaves. In this setting, rather than estimating the accuracy of a single tree, it makes more sense to use the area under the ROC curve (AUC) as a quality metric. We also propose a novel splitting criterion which chooses the split with the highest local AUC. To the best of our knowledge, this is the first probabilistic splitting criterion that is not based on weighted average impurity. We present experiments suggesting that the AUC splitting criterion leads to trees with equal or better AUC value, without sacrificing accuracy if a single labelling is chosen.
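A plausible reading of the "local AUC" of a candidate split, reconstructed by us rather than taken from the paper: a binary split induces a two-leaf ranking, and its AUC counts positive/negative pairs ordered by leaf, with same-leaf pairs tied at one half.

```python
# Local AUC of one binary split (our reconstruction): rank the leaf with
# the higher positive fraction above the other; cross-leaf pairs in that
# order count 1, same-leaf pairs count 1/2.

def split_auc(pos_left, neg_left, pos_right, neg_right):
    left_frac = pos_left / max(pos_left + neg_left, 1)
    right_frac = pos_right / max(pos_right + neg_right, 1)
    if left_frac < right_frac:                # make "left" the higher leaf
        pos_left, neg_left, pos_right, neg_right = \
            pos_right, neg_right, pos_left, neg_left
    p, n = pos_left + pos_right, neg_left + neg_right
    ordered = pos_left * neg_right                       # pos high, neg low
    ties = pos_left * neg_left + pos_right * neg_right   # same-leaf pairs
    return (ordered + 0.5 * ties) / (p * n)

print(split_auc(40, 10, 10, 40))   # fairly pure split  -> 0.8
print(split_auc(25, 25, 25, 25))   # uninformative split -> 0.5
```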
Article
In this paper we investigate the use of the area under the receiver operating characteristic (ROC) curve (AUC) as a performance measure for machine learning algorithms. As a case study we evaluate six machine learning algorithms (C4.5, Multiscale Classifier, Perceptron, Multi-layer Perceptron, k-Nearest Neighbours, and a Quadratic Discriminant Function) on six "real world" medical diagnostics data sets. We compare and discuss the use of AUC against the more conventional overall accuracy and find that AUC exhibits a number of desirable properties: increased sensitivity in Analysis of Variance (ANOVA) tests; a standard error that decreases as both AUC and the number of test samples increase; independence from the decision threshold; and invariance to a priori class probabilities. The paper concludes with the recommendation that AUC be used in preference to overall accuracy for "single number" evaluation of machine learning algorithms.
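One standard-error estimator with the stated behavior is the Hanley-McNeil formula (we assume this matches the estimator analyzed in the paper; it may not):

```python
# Hanley-McNeil (1982) standard error of the AUC: shrinks as the AUC and
# the number of positive/negative test samples grow.
import math

def hanley_mcneil_se(auc, n_pos, n_neg):
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    var = (auc * (1 - auc)
           + (n_pos - 1) * (q1 - auc ** 2)
           + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return math.sqrt(var)

print(hanley_mcneil_se(0.80, 50, 50))    # fewer samples -> larger SE
print(hanley_mcneil_se(0.95, 500, 500))  # more samples, higher AUC -> smaller SE
```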
Conference Paper
Many sequential prediction tasks involve locating instances of patterns in sequences. Generative probabilistic language models, such as hidden Markov models (HMMs), have been successfully applied to many of these tasks. A limitation of these models, however, is that they cannot naturally handle cases in which pattern instances overlap in arbitrary ways. We present an alternative approach, based on conditional Markov networks, that can naturally represent arbitrarily overlapping elements. We show how to efficiently train and perform inference with these models. Experimental results from a genomics domain show that our models are more accurate at locating instances of overlapping patterns than are baseline models based on HMMs.
Conference Paper
The area under an ROC curve (AUC) is a criterion used in many applications to measure the quality of a classification algorithm. However, the objective function optimized in most of these algorithms is the error rate and not the AUC value. We give a detailed statistical analysis of the relationship between the AUC and the error rate, including the first exact expression of the expected value and the variance of the AUC for a fixed error rate. Our results show that the average AUC is monotonically increasing as a function of the classification accuracy, but that the standard deviation for uneven distributions and higher error rates is noticeable. Thus, algorithms designed to minimize the error rate may not lead to the best possible AUC values. We show that, under certain conditions, the global function optimized by the RankBoost algorithm is exactly the AUC. We report the results of our experiments with RankBoost in several datasets demonstrating the benefits of an algorithm specifically designed to globally optimize the AUC over other existing algorithms optimizing an approximation of the AUC or only locally optimizing the AUC.
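A tiny concrete instance of this decoupling (our example): two score vectors with identical error rates at threshold 0.5 but different AUC values, so minimizing the error rate alone cannot determine the AUC.

```python
# Same 0/1 error rate (2 mistakes each at threshold 0.5), different AUCs.
from sklearn.metrics import roc_auc_score

y = [1, 1, 1, 0, 0, 0]
scores_a = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]   # AUC = 8/9
scores_b = [0.9, 0.8, 0.1, 0.6, 0.4, 0.2]   # AUC = 6/9, same error rate
print(roc_auc_score(y, scores_a), roc_auc_score(y, scores_b))
```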
Conference Paper
Many machine learning applications require a combination of probability and first-order logic. Markov logic networks (MLNs) accomplish this by attaching weights to first-order clauses, and viewing these as templates for features of Markov networks. Model parameters (i.e., clause weights) can be learned by maximizing the likelihood of a relational database, but this can be quite costly and lead to suboptimal results for any given prediction task. In this paper we propose a discriminative approach to training MLNs, one which optimizes the conditional likelihood of the query predicates given the evidence ones, rather than the joint likelihood of all predicates. We extend Collins's (2002) voted perceptron algorithm for HMMs to MLNs by replacing the Viterbi algorithm with a weighted satisfiability solver. Experiments on entity resolution and link prediction tasks show the advantages of this approach compared to generative MLN training, as well as compared to purely probabilistic and purely logical approaches.
Conference Paper
We study the problem of learning to accurately rank a set of objects by combining a given collection of ranking or preference functions. This problem of combining preferences arises in several applications, such as that of combining the results of different search engines, or the "collaborative-filtering" problem of ranking movies for a user based on the movie rankings provided by other users. In this work, we begin by presenting a formal framework for this general problem. We then describe and analyze an efficient algorithm called RankBoost for combining preferences based on the boosting approach to machine learning. We give theoretical results describing the algorithm's behavior both on the training data, and on new test data not seen during training. We also describe an efficient implementation of the algorithm for a particular restricted but common case. We next discuss two experiments we carried out to assess the performance of RankBoost. In the first experiment, we used the algorithm to combine different web search strategies, each of which is a query expansion for a given domain. The second experiment is a collaborative-filtering task for making movie recommendations.
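A compact sketch of the RankBoost update for binary weak rankers, following the paper's general scheme as we understand it (the items, pairs, and threshold rankers below are invented):

```python
# RankBoost sketch: D is a distribution over (worse, better) pairs; each
# round picks the weak ranker with the largest r = E_D[h(better)-h(worse)],
# adds it with weight alpha, and upweights pairs that ranker misorders.
import math

pairs = [(0, 2), (1, 2), (0, 3), (1, 3)]      # (worse_item, better_item)
features = [0.1, 0.4, 0.35, 0.9]              # one score per item
weak_rankers = [lambda i, t=t: 1.0 if features[i] >= t else 0.0
                for t in (0.2, 0.3, 0.5)]     # simple threshold rankers

D = {p: 1.0 / len(pairs) for p in pairs}
final = [0.0] * len(features)                 # combined ranking scores

for _ in range(3):                            # boosting rounds
    def r_value(h):
        return sum(D[(w, b)] * (h(b) - h(w)) for w, b in pairs)
    h = max(weak_rankers, key=r_value)
    r = max(min(r_value(h), 0.999), -0.999)   # clamp away from +/-1
    alpha = 0.5 * math.log((1 + r) / (1 - r))
    for i in range(len(final)):
        final[i] += alpha * h(i)
    for w, b in pairs:                        # upweight misordered pairs
        D[(w, b)] *= math.exp(alpha * (h(w) - h(b)))
    z = sum(D.values())
    D = {p: v / z for p, v in D.items()}

print(final)   # higher score = ranked higher
```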
Conference Paper
Markov logic networks (MLNs) combine logic and probability by attaching weights to first-order clauses, and viewing these as templates for features of Markov networks. In this paper we develop an algorithm for learning the structure of MLNs from relational databases, combining ideas from inductive logic programming (ILP) and feature induction in Markov networks. The algorithm performs a beam or shortest-first search of the space of clauses, guided by a weighted pseudo-likelihood measure. This requires computing the optimal weights for each candidate structure, but we show how this can be done efficiently. The algorithm can be used to learn an MLN from scratch, or to refine an existing knowledge base. We have applied it in two real-world domains, and found that it outperforms using off-the-shelf ILP systems to learn the MLN structure, as well as pure ILP, purely probabilistic and purely knowledge-based approaches.
Conference Paper
This paper presents a Support Vector Method for optimizing multivariate nonlinear performance measures like the F1-score. Taking a multivariate prediction approach, we give an algorithm with which such multivariate SVMs can be trained in polynomial time for large classes of potentially non-linear performance measures, in particular ROCArea and all measures that can be computed from the contingency table. The conventional classification SVM arises as a special case of our method.
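The contingency-table view can be illustrated with the F1 case (our illustration, not the paper's implementation): the loss is computed from the contingency table of an entire predicted label vector against the truth, not summed example by example.

```python
# F1-based multivariate loss: a function of the whole prediction vector's
# contingency table (TP, FP, FN), not a sum of per-example losses.

def f1_loss(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return 1.0 - f1                     # loss = 1 - F1 for the whole vector

print(f1_loss([1, 1, 0, 0], [1, 0, 1, 0]))   # 0.5
```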