Conference Paper

The Relationship Between Precision-Recall and ROC Curves

Authors: Jesse Davis and Mark Goadrich

Abstract

Receiver Operator Characteristic (ROC) curves are commonly used to present results for binary decision problems in machine learning. However, when dealing with highly skewed datasets, Precision-Recall (PR) curves give a more informative picture of an algorithm's performance. We show that a deep connection exists between ROC space and PR space, such that a curve dominates in ROC space if and only if it dominates in PR space. A corollary is the notion of an achievable PR curve, which has properties much like the convex hull in ROC space; we show an efficient algorithm for computing this curve. Finally, we also note that differences in the two types of curves are significant for algorithm design. For example, in PR space it is incorrect to linearly interpolate between points. Furthermore, algorithms that optimize the area under the ROC curve are not guaranteed to optimize the area under the PR curve.
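As a quick illustration of the connection described in the abstract, the sketch below (not from the paper; the function and example numbers are ours) maps a single ROC-space point to its PR-space counterpart for a fixed class skew: recall equals the true positive rate, and precision follows from the implied true- and false-positive counts. The example also shows how the same ROC point corresponds to very different precision values as the skew grows.

```python
# Minimal sketch (not from the paper): mapping a point in ROC space to the
# corresponding point in PR space for a dataset with a fixed class skew.
def roc_to_pr(tpr, fpr, n_pos, n_neg):
    """Convert an ROC-space point (FPR, TPR) to a PR-space point (recall, precision)."""
    tp = tpr * n_pos          # true positives implied by the TPR
    fp = fpr * n_neg          # false positives implied by the FPR
    recall = tpr              # recall is identical to the true positive rate
    if tp + fp == 0:
        return recall, 1.0    # convention when no positives are predicted
    return recall, tp / (tp + fp)

# The same ROC point (FPR=0.1, TPR=0.8) under increasing class skew:
for n_neg in (1_000, 100_000):
    r, p = roc_to_pr(tpr=0.8, fpr=0.1, n_pos=1_000, n_neg=n_neg)
    print(f"negatives={n_neg}: recall={r:.2f}, precision={p:.3f}")
```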
... Precision measures the proportion of correct positive predictions (true purchases) out of all positive predictions, which is crucial for targeted marketing applications [11]. Recall measures the proportion of actual positives (true purchases) that were correctly identified, which is important for capturing as many potential sales as possible. ...
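For concreteness, a minimal sketch of these two definitions (the counts and function name are ours, not from the cited work):

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from raw confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # correct positives / predicted positives
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # correct positives / actual positives
    return precision, recall

print(precision_recall(tp=40, fp=10, fn=60))  # -> (0.8, 0.4)
```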
Article
Full-text available
This paper presents a novel approach to predicting buying intent and product demand in e-commerce settings, leveraging a Deep Q-Network (DQN) inspired architecture. In the rapidly evolving landscape of online retail, accurate prediction of user behavior is crucial for optimizing inventory management, personalizing user experiences, and maximizing sales. Our method adapts concepts from reinforcement learning to a supervised learning context, combining the sequential modeling capabilities of Long Short-Term Memory (LSTM) networks with the strategic decision-making aspects of DQNs. We evaluate our model on a large-scale e-commerce dataset comprising over 885,000 user sessions, each characterized by 1,114 features. Our approach demonstrates robust performance in handling the inherent class imbalance typical in e-commerce data, where purchase events are significantly less frequent than non-purchase events. Through comprehensive experimentation with various classification thresholds, we show that our model achieves a balance between precision and recall, with an overall accuracy of 88% and an AUC-ROC score of 0.88. Comparative analysis reveals that our DQN-inspired model offers advantages over traditional machine learning and standard deep learning approaches, particularly in its ability to capture complex temporal patterns in user behavior. The model’s performance and scalability make it well-suited for real-world e-commerce applications dealing with high-dimensional, sequential data. This research contributes to the field of e-commerce analytics by introducing a novel predictive modeling technique that combines the strengths of deep learning and reinforcement learning paradigms. Our findings have significant implications for improving demand forecasting, personalizing user experiences, and optimizing marketing strategies in online retail environments.
... In this study, we adopt the standard severity ratio, set as the inverse of the relative class frequency. The PR curve, which focuses on the trade-off between precision (i.e., the proportion of correctly classified positive instances among all predicted positives) and recall (i.e., the proportion of correctly classified positive instances among all actual positives), has been cited as an alternative to the ROC curve for tasks with a large skew in the class distribution (Davis & Goadrich, 2006). Typically, the larger the AUC, KS, H-measure, and PRAUC, the better the performance of a prediction model. ...
Article
Full-text available
This study explores the integration of a representative large language model, ChatGPT, into lending decision-making with a focus on credit default prediction. Specifically, we use ChatGPT to analyse and interpret loan assessments written by loan officers and generate refined versions of these texts. Our comparative analysis reveals significant differences between generative artificial intelligence (AI)-refined and human-written texts in terms of text length, semantic similarity, and linguistic representations. Using deep learning techniques, we show that incorporating unstructured text data, particularly ChatGPT-refined texts, alongside conventional structured data significantly enhances credit default predictions. Furthermore, we demonstrate how the contents of both human-written and ChatGPT-refined assessments contribute to the models’ prediction and show that the effect of essential words is highly context-dependent. Moreover, we find that ChatGPT’s analysis of borrower delinquency contributes the most to improving predictive accuracy. We also evaluate the business impact of the models based on human-written and ChatGPT-refined texts, and find that, in most cases, the latter yields higher profitability than the former. This study provides valuable insights into the transformative potential of generative AI in financial services.
... In our context, this metric can be interpreted as the probability that a classifier assigns a higher score to a randomly sampled 10K filed in the year before failure (a positive) compared to a randomly sampled 10K that was not filed before failure (a negative) (Fernández et al., 2018). Despite this intuitive interpretation, the ROC-AUC can be overly optimistic with heavily imbalanced data (Davis and Goadrich, 2006), which is why we report additional metrics as well. ...
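The probabilistic reading of the ROC-AUC quoted above is the Wilcoxon-Mann-Whitney statistic; a minimal sketch (the example scores are ours) that computes it directly over positive-negative pairs:

```python
def pairwise_auc(pos_scores, neg_scores):
    """Fraction of positive-negative pairs ranked correctly; ties count one half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

print(pairwise_auc([0.9, 0.7, 0.4], [0.8, 0.3, 0.2, 0.1]))  # -> 0.833...
```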
Article
Full-text available
Business failure prediction models are crucial in high-stakes domains like banking, insurance, and investing. In this paper, we propose an interpretable model that combines numerical and sentence-level textual features through a well-known attention mechanism. Our model demonstrates competitive performance across various metrics, and the attention weights help identify sentences intuitively linked to business failure, offering a form of interpretability. Furthermore, our findings highlight the strength of traditional financial ratios for business failure prediction while textual data—particularly when represented as keywords—is mainly useful to correctly classify corporate disclosures where the possibility of failure is explicitly mentioned.
Article
Full-text available
Background: Understanding the complex interplay between life course exposures, such as adverse childhood experiences and environmental factors, and disease risk is essential for developing effective public health interventions. Traditional epidemiological methods, such as regression models and risk scoring, are limited in their ability to capture the non-linear and temporally dynamic nature of these relationships. Deep learning (DL) and explainable artificial intelligence (XAI) are increasingly applied within healthcare settings to identify influential risk factors and enable personalised interventions. However, significant gaps remain in understanding their utility and limitations, especially for sparse longitudinal life course data and how the influential patterns identified using explainability are linked to underlying causal mechanisms.
Methods: We conducted a controlled simulation study to assess the performance of various state-of-the-art DL architectures including CNNs and (attention-based) RNNs against XGBoost and logistic regression. Input data was simulated to reflect a generic and generalisable scenario with different rules used to generate multiple realistic outcomes based upon epidemiological concepts. Multiple metrics were used to assess model performance in the presence of class imbalance and SHAP values were calculated.
Results: We find that DL methods can accurately detect dynamic relationships that baseline linear models and tree-based methods cannot. However, there is no one model that consistently outperforms the others across all scenarios. We further identify the superior performance of DL models in handling sparse feature availability over time compared to traditional machine learning approaches. Additionally, we examine the interpretability provided by SHAP values, demonstrating that these explanations often misalign with causal relationships, despite excellent predictive and calibrative performance.
Conclusions: These insights provide a foundation for future research applying DL and XAI to life course data, highlighting the challenges associated with sparse healthcare data, and the critical need for advancing interpretability frameworks in personalised public health.
Article
Machine learning-augmented applications have the potential to be powerful tools for decision-making in healthcare. However, healthcare is a complex domain that presents many challenges. These challenges, such as medical errors, clinician-patient relationships, and treatment preferences, must be addressed to ensure fairness in ML-augmented healthcare applications. To better understand the influence these challenges have on fairness, 16 experienced engineers and designers with domain knowledge in healthcare technology were interviewed about how they would prioritise fairness in three healthcare scenarios (well-being improvement, chronic illness management, acute illness treatment). Using a template analysis, this work identifies the key considerations in the creation of fair ML for healthcare. These considerations clustered into categories related to technology, healthcare context, and user perspectives. To explore these categories, we propose the stakeholder fairness conceptual model. This framework aids designers and developers in understanding the complex considerations that stem from the building, management, and evaluation of ML-augmented healthcare applications, and how they affect the expectations of fairness. This work then discusses how this model may be applied when the health technology is directly provisioned to users, without a healthcare provider managing its use or adoption. This paper contributes to the understanding of fairness requirements in healthcare, including the effect of healthcare errors, clinician-application collaboration, and how the evaluation of healthcare technology becomes part of the fairness design process.
Article
This study presents an augmented hybrid approach for improving the diagnosis of malignant skin lesions by combining convolutional neural network (CNN) predictions with selective human interventions based on prediction confidence. The algorithm retains high-confidence CNN predictions while replacing low-confidence outputs with expert human assessments to enhance diagnostic accuracy. A CNN model utilizing the EfficientNetB3 backbone is trained on datasets from the ISIC-2019 and ISIC-2020 SIIM-ISIC melanoma classification challenges and evaluated on a 150-image test set. The model’s predictions are compared against assessments from 69 experienced medical professionals. Performance is assessed using receiver operating characteristic (ROC) curves and area under curve (AUC) metrics, alongside an analysis of human resource costs. The baseline CNN achieves an AUC of 0.822, slightly below the performance of human experts. However, the augmented hybrid approach improves the true positive rate to 0.782 and reduces the false positive rate to 0.182, delivering better diagnostic performance with minimal human involvement. This approach offers a scalable, resource-efficient solution to address variability in medical image analysis, effectively harnessing the complementary strengths of expert humans and CNNs.
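A hedged sketch of the confidence-based routing idea described in this abstract (the function, confidence measure, and threshold are our simplifications, not the study's implementation):

```python
def hybrid_decision(model_prob, human_label, threshold=0.25):
    """Keep the CNN's call when it is confident; otherwise defer to the expert.
    model_prob: predicted probability of malignancy; human_label: 0/1 expert call."""
    confidence = abs(model_prob - 0.5) * 2           # 0 = maximally uncertain, 1 = certain
    if confidence >= threshold:
        return int(model_prob >= 0.5), "model"       # high confidence: keep the prediction
    return human_label, "human"                      # low confidence: route to a human

print(hybrid_decision(0.92, human_label=0))  # -> (1, 'model')
print(hybrid_decision(0.55, human_label=1))  # -> (1, 'human')
```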
Conference Paper
Full-text available
When the goal is to achieve the best correct classification rate, cross entropy and mean squared error are typical cost functions used to optimize classifier performance. However, for many real-world classification problems, the ROC curve is a more meaningful performance measure. We demonstrate that minimizing cross entropy or mean squared error does not necessarily maximize the area under the ROC curve (AUC). We then consider alternative objective functions for training a classifier to maximize the AUC directly. We propose an objective function that is an approximation to the Wilcoxon-Mann-Whitney statistic, which is equivalent to the AUC. The proposed objective function is differentiable, so gradient-based methods can be used to train the classifier. We apply the new objective function to real-world customer behavior prediction problems for a wireless service provider and a cable service provider, and achieve reliable improvements in the ROC curve.
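A minimal sketch of such a differentiable pairwise surrogate (the sigmoid form and the beta parameter are our assumptions; the paper's exact objective may differ):

```python
import math

def soft_auc_loss(pos_scores, neg_scores, beta=2.0):
    """Average sigmoid of (negative - positive) score differences; driving this
    towards zero pushes positives above negatives, approximately maximizing the AUC."""
    loss = 0.0
    for p in pos_scores:
        for n in neg_scores:
            loss += 1.0 / (1.0 + math.exp(-beta * (n - p)))  # ~1 if mis-ranked, ~0 if correct
    return loss / (len(pos_scores) * len(neg_scores))

print(soft_auc_loss([2.0, 1.5], [0.2, -0.3]))  # small: positives already outrank negatives
```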
Conference Paper
Full-text available
ROC analysis is increasingly being recognised as an important tool for evaluation and comparison of classifiers when the operating characteristics (i.e. class distribution and cost parameters) are not known at training time. Usually, each classifier is characterised by its estimated true and false positive rates and is represented by a single point in the ROC diagram. In this paper, we show how a single decision tree can represent a set of classifiers by choosing different labellings of its leaves, or equivalently, an ordering on the leaves. In this setting, rather than estimating the accuracy of a single tree, it makes more sense to use the area under the ROC curve (AUC) as a quality metric. We also propose a novel splitting criterion which chooses the split with the highest local AUC. To the best of our knowledge, this is the first probabilistic splitting criterion that is not based on weighted average impurity. We present experiments suggesting that the AUC splitting criterion leads to trees with equal or better AUC value, without sacrificing accuracy if a single labelling is chosen.
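To make the leaf-ordering view concrete, a small sketch (ours, not the paper's algorithm) computing the AUC obtained when the leaves of a fixed tree are ordered by their positive-class proportion, with within-leaf ties counted as one half:

```python
def leaf_ordering_auc(leaves):
    """leaves: list of (n_pos, n_neg) counts per leaf of a trained tree."""
    ordered = sorted(leaves, key=lambda c: c[0] / (c[0] + c[1]), reverse=True)
    total_pos = sum(p for p, _ in ordered)
    total_neg = sum(n for _, n in ordered)
    correct, neg_below = 0.0, total_neg
    for n_pos, n_neg in ordered:
        neg_below -= n_neg                  # negatives falling in lower-ranked leaves
        correct += n_pos * neg_below        # positives here outrank those negatives
        correct += 0.5 * n_pos * n_neg      # ties with negatives in the same leaf
    return correct / (total_pos * total_neg)

print(leaf_ordering_auc([(30, 5), (10, 20), (2, 40)]))  # -> ~0.885
```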
Article
In this paper we investigate the use of the area under the receiver operating characteristic (ROC) curve (AUC) as a performance measure for machine learning algorithms. As a case study we evaluate six machine learning algorithms (C4.5, Multiscale Classifier, Perceptron, Multi-layer Perceptron, k-Nearest Neighbours, and a Quadratic Discriminant Function) on six "real world" medical diagnostics data sets. We compare and discuss the use of AUC to the more conventional overall accuracy and find that AUC exhibits a number of desirable properties when compared to overall accuracy: increased sensitivity in Analysis of Variance (ANOVA) tests; a standard error that decreased as both AUC and the number of test samples increased; independence from the decision threshold; and invariance to a priori class probabilities. The paper concludes with the recommendation that AUC be used in preference to overall accuracy for "single number" evaluation of machine learning algorithms. © 1997 Pattern Recognition Society. Published by Elsevier Science Ltd.
Conference Paper
Many sequential prediction tasks involve locating instances of patterns in sequences. Generative probabilistic language models, such as hidden Markov models (HMMs), have been successfully applied to many of these tasks. A limitation of these models, however, is that they cannot naturally handle cases in which pattern instances overlap in arbitrary ways. We present an alternative approach, based on conditional Markov networks, that can naturally represent arbitrarily overlapping elements. We show how to efficiently train and perform inference with these models. Experimental results from a genomics domain show that our models are more accurate at locating instances of overlapping patterns than are baseline models based on HMMs.
Conference Paper
The area under an ROC curve (AUC) is a criterion used in many applications to measure the quality of a classification algorithm. However, the objective function optimized in most of these algorithms is the error rate and not the AUC value. We give a detailed statistical analysis of the relationship between the AUC and the error rate, including the first exact expression of the expected value and the variance of the AUC for a fixed error rate. Our results show that the average AUC is monotonically increasing as a function of the classification accuracy, but that the standard deviation for uneven distributions and higher error rates is noticeable. Thus, algorithms designed to minimize the error rate may not lead to the best possible AUC values. We show that, under certain conditions, the global function optimized by the RankBoost algorithm is exactly the AUC. We report the results of our experiments with RankBoost in several datasets demonstrating the benefits of an algorithm specifically designed to globally optimize the AUC over other existing algorithms optimizing an approximation of the AUC or only locally optimizing the AUC.
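A toy sketch of the central observation (the example scores are ours): two classifiers with identical error rates at a fixed threshold can nonetheless have very different AUC values:

```python
def auc(pos, neg):
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def error_rate(pos, neg, thr=0.5):
    errors = sum(p < thr for p in pos) + sum(n >= thr for n in neg)
    return errors / (len(pos) + len(neg))

# Both classifiers misclassify exactly one of eight examples at threshold 0.5 ...
a_pos, a_neg = [0.9, 0.8, 0.7, 0.4], [0.3, 0.2, 0.1, 0.05]
b_pos, b_neg = [0.9, 0.8, 0.7, 0.05], [0.3, 0.2, 0.1, 0.4]
print(error_rate(a_pos, a_neg), auc(a_pos, a_neg))  # 0.125, 1.0
print(error_rate(b_pos, b_neg), auc(b_pos, b_neg))  # 0.125, 0.75
```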
Conference Paper
Many machine learning applications require a combination of probability and first-order logic. Markov logic networks (MLNs) accomplish this by attaching weights to first-order clauses, and viewing these as templates for features of Markov networks. Model parameters (i.e., clause weights) can be learned by maximizing the likelihood of a relational database, but this can be quite costly and lead to suboptimal results for any given prediction task. In this paper we propose a discriminative approach to training MLNs, one which optimizes the conditional likelihood of the query predicates given the evidence ones, rather than the joint likelihood of all predicates. We extend Collins's (2002) voted perceptron algorithm for HMMs to MLNs by replacing the Viterbi algorithm with a weighted satisfiability solver. Experiments on entity resolution and link prediction tasks show the advantages of this approach compared to generative MLN training, as well as compared to purely probabilistic and purely logical approaches.
Conference Paper
We study the problem of learning to accurately rank a set of objects by combining a given collection of ranking or preference functions. This problem of combining preferences arises in several applications, such as that of combining the results of different search engines, or the "collaborative-filtering" problem of ranking movies for a user based on the movie rankings provided by other users. In this work, we begin by presenting a formal framework for this general problem. We then describe and analyze an efficient algorithm called RankBoost for combining preferences based on the boosting approach to machine learning. We give theoretical results describing the algorithm's behavior both on the training data, and on new test data not seen during training. We also describe an efficient implementation of the algorithm for a particular restricted but common case. We next discuss two experiments we carried out to assess the performance of RankBoost. In the first experiment, we used the algorithm to combine different web search strategies, each of which is a query expansion for a given domain. The second experiment is a collaborative-filtering task for making movie recommendations.
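A heavily simplified sketch of the RankBoost loop for the bipartite case (the threshold weak rankers, variable names, and toy data are our assumptions, not the paper's exact formulation):

```python
import math

def rankboost(items, pairs, candidate_rankers, rounds=10):
    """items: {name: feature_vector}; pairs: list of (worse, better) names that the
    final ranking should order correctly; candidate_rankers: (feature, threshold) pairs."""
    h_val = lambda x, f, thr: 1.0 if x[f] > thr else 0.0
    dist = {pair: 1.0 / len(pairs) for pair in pairs}        # weights over crucial pairs
    ensemble = []                                            # list of (alpha, (feature, threshold))
    for _ in range(rounds):
        # Pick the weak ranker with the largest |r|, where r lies in [-1, 1].
        best_r, best_h = 0.0, None
        for f, thr in candidate_rankers:
            r = sum(w * (h_val(items[b], f, thr) - h_val(items[a], f, thr))
                    for (a, b), w in dist.items())
            if abs(r) > abs(best_r):
                best_r, best_h = r, (f, thr)
        if best_h is None or abs(best_r) >= 1.0:             # useless or perfect ranker: stop
            break
        alpha = 0.5 * math.log((1.0 + best_r) / (1.0 - best_r))
        ensemble.append((alpha, best_h))
        # Reweight: pairs that the chosen ranker still mis-orders gain weight.
        f, thr = best_h
        for (a, b) in dist:
            dist[(a, b)] *= math.exp(alpha * (h_val(items[a], f, thr) - h_val(items[b], f, thr)))
        z = sum(dist.values())
        dist = {pair: w / z for pair, w in dist.items()}
    def score(x):
        return sum(alpha * h_val(x, f, thr) for alpha, (f, thr) in ensemble)
    return score

# Toy usage: rank three items using two noisy "preference features".
items = {"a": [0.9, 0.2], "b": [0.6, 0.8], "c": [0.1, 0.4]}
score = rankboost(items, pairs=[("c", "a"), ("c", "b"), ("b", "a")],
                  candidate_rankers=[(0, 0.5), (0, 0.3), (1, 0.5)])
print(sorted(items, key=lambda k: score(items[k]), reverse=True))  # -> ['a', 'b', 'c']
```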
Conference Paper
Markov logic networks (MLNs) combine logic and probability by attaching weights to first-order clauses, and viewing these as templates for features of Markov networks. In this paper we develop an algorithm for learning the structure of MLNs from relational databases, combining ideas from inductive logic programming (ILP) and feature induction in Markov networks. The algorithm performs a beam or shortest-first search of the space of clauses, guided by a weighted pseudo-likelihood measure. This requires computing the optimal weights for each candidate structure, but we show how this can be done efficiently. The algorithm can be used to learn an MLN from scratch, or to refine an existing knowledge base. We have applied it in two real-world domains, and found that it outperforms using off-the-shelf ILP systems to learn the MLN structure, as well as pure ILP, purely probabilistic and purely knowledge-based approaches.
Conference Paper
This paper presents a Support Vector Method for optimizing multivariate nonlinear performance measures like the F1-score. Taking a multivariate prediction approach, we give an algorithm with which such multivariate SVMs can be trained in polynomial time for large classes of potentially non-linear performance measures, in particular ROCArea and all measures that can be computed from the contingency table. The conventional classification SVM arises as a special case of our method.