Article
PDF available

Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods

Authors: John C. Platt

Abstract

The output of a classifier should be a calibrated posterior probability to enable post-processing. Standard SVMs do not provide such probabilities. One method to create probabilities is to directly train a kernel classifier with a logit link function and a regularized maximum likelihood score. However, training with a maximum likelihood score will produce non-sparse kernel machines. Instead, we train an SVM, then train the parameters of an additional sigmoid function to map the SVM outputs into probabilities. This chapter compares classification error rate and likelihood scores for an SVM plus sigmoid versus a kernel method trained with a regularized likelihood error function. These methods are tested on three data-mining-style data sets. The SVM+sigmoid yields probabilities of comparable quality to the regularized maximum likelihood kernel method, while still retaining the sparseness of the SVM.
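For readers who want to see the SVM-plus-sigmoid construction in code, a minimal sketch using scikit-learn and SciPy is given below. The synthetic data, hyperparameters, and the plain 0/1 negative log-likelihood objective are illustrative assumptions; the paper itself fits the sigmoid with smoothed target values and a more careful optimization procedure.

```python
# Minimal sketch of the SVM-plus-sigmoid idea (Platt scaling).
# Synthetic data and hyperparameters are placeholders, not the paper's setup.
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_cal, y_train, y_cal = train_test_split(X, y, test_size=0.5, random_state=0)

# 1) Train a standard SVM; decision_function returns the uncalibrated output f(x).
svm = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X_train, y_train)
f_cal = svm.decision_function(X_cal)

# 2) Fit sigmoid parameters (A, B) by minimizing the negative log-likelihood of
#    P(y=1 | f) = 1 / (1 + exp(A*f + B)) on a held-out calibration set.
def neg_log_lik(params, f, y):
    A, B = params
    p = 1.0 / (1.0 + np.exp(A * f + B))
    eps = 1e-12
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

A, B = minimize(neg_log_lik, x0=[-1.0, 0.0], args=(f_cal, y_cal)).x

# 3) Map SVM outputs to calibrated posterior probabilities.
p_cal = 1.0 / (1.0 + np.exp(A * f_cal + B))
```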
... Many post-processing calibration methods have been developed for binary classification models [50,69,70]. Applying these methods to multiclass classifiers requires some adaptation. ...
... In this paper, we divide post-hoc calibration methods into two categories: scaling and binary. Scaling methods are derived from Platt scaling [50] and optimize some parameters to scale the logits. Temperature Scaling [17] is a popular simple post-processing calibration method. ...
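The snippet above singles out temperature scaling as a popular scaling method; a hedged sketch of the idea is given below: a single scalar T divides the logits and is chosen to minimize the negative log-likelihood on held-out data. The logits, labels, and search bounds are placeholder assumptions.

```python
# Minimal sketch of temperature scaling: one scalar T rescales the logits,
# fit by minimizing negative log-likelihood on held-out data. Inputs are placeholders.
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def neg_log_lik(T, logits, labels):
    p = softmax(logits / T)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

rng = np.random.default_rng(0)
logits = rng.normal(size=(500, 10))      # placeholder validation logits
labels = rng.integers(0, 10, size=500)   # placeholder validation labels

T = minimize_scalar(neg_log_lik, bounds=(0.05, 10.0), args=(logits, labels),
                    method="bounded").x
calibrated = softmax(logits / T)         # temperature-scaled probabilities
```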
Preprint
For classification models based on neural networks, the maximum predicted class probability is often used as a confidence score. This score rarely predicts well the probability of making a correct prediction and requires a post-processing calibration step. However, many confidence calibration methods fail for problems with many classes. To address this issue, we transform the problem of calibrating a multiclass classifier into calibrating a single surrogate binary classifier. This approach allows for more efficient use of standard calibration methods. We evaluate our approach on numerous neural networks used for image or text classification and show that it significantly enhances existing calibration methods.
... In settings where we have an embedding model [21] or an existing end-to-end deep learning model, we can dramatically reduce the computation by training the concept detectors via fine-tuning. In all cases, we can calibrate each model by applying a post-hoc calibration technique, e.g., Platt scaling [23]. ...
Preprint
Full-text available
We propose a new approach to promote safety in classification tasks with established concepts. Our approach -- called a conceptual safeguard -- acts as a verification layer for models that predict a target outcome by first predicting the presence of intermediate concepts. Given this architecture, a safeguard ensures that a model meets a minimal level of accuracy by abstaining from uncertain predictions. In contrast to a standard selective classifier, a safeguard provides an avenue to improve coverage by allowing a human to confirm the presence of uncertain concepts on instances on which it abstains. We develop methods to build safeguards that maximize coverage without compromising safety, namely techniques to propagate the uncertainty in concept predictions and to flag salient concepts for human review. We benchmark our approach on a collection of real-world and synthetic datasets, showing that it can improve performance and coverage in deep learning tasks.
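To make the abstention mechanism concrete, the sketch below shows the generic abstain-on-uncertainty idea behind selective classification: predict only when the top class probability clears a threshold, otherwise defer. It is not the authors' conceptual-safeguard method, and the probabilities and threshold are placeholders.

```python
# Generic selective-classification sketch: predict when confident, abstain otherwise.
# This illustrates the abstention idea only; it is not the conceptual-safeguard method.
import numpy as np

def predict_or_abstain(probs, threshold=0.9):
    """probs: (n, C) calibrated class probabilities. Returns class index or -1 to abstain."""
    conf = probs.max(axis=1)
    preds = probs.argmax(axis=1)
    return np.where(conf >= threshold, preds, -1)

probs = np.array([[0.97, 0.03], [0.55, 0.45], [0.20, 0.80]])  # placeholder outputs
print(predict_or_abstain(probs))  # [0, -1, -1] with threshold 0.9
```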
... Because of the highly flexible nature of the transformation learned by isotonic regression, it can be very effective if enough calibration data is available, but can lead to overfitting if the sample is too small [53]. As an alternative, Platt scaling [56] parameterizes the transformation using a logistic function s ↦ (1 + exp(a·s + b))^(-1), which has only two tunable parameters a and b, and is thus more adequate in scarce-data scenarios. ...
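The contrast drawn in this snippet can be illustrated with a short scikit-learn sketch. The synthetic scores and labels are placeholders, and the two-parameter Platt map is fit here with ordinary logistic regression on the one-dimensional score rather than with the exact procedure of the cited papers.

```python
# Sketch contrasting isotonic regression with a two-parameter Platt-style logistic map.
# Scores and labels are synthetic placeholders.
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
scores = rng.normal(size=1000)  # uncalibrated classifier scores
labels = (rng.random(1000) < 1.0 / (1.0 + np.exp(-2.0 * scores))).astype(int)

# Isotonic regression: flexible, monotone, non-parametric (needs more calibration data).
iso = IsotonicRegression(out_of_bounds="clip").fit(scores, labels)
p_iso = iso.predict(scores)

# Platt scaling: a logistic map with two tunable parameters, here fit via
# logistic regression on the one-dimensional score.
platt = LogisticRegression().fit(scores.reshape(-1, 1), labels)
p_platt = platt.predict_proba(scores.reshape(-1, 1))[:, 1]
```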
Preprint
Full-text available
Extreme multilabel classification (XMLC) problems occur in settings such as related product recommendation, large-scale document tagging, or ad prediction, and are characterized by a label space that can span millions of possible labels. There are two implicit tasks that the classifier performs: evaluating each potential label for its expected worth, and then selecting the best candidates. For the latter task, only the relative order of scores matters, and this is what is captured by the standard evaluation procedure in the XMLC literature. However, in many practical applications, it is important to have a good estimate of the actual probability of a label being relevant, e.g., to decide whether to pay the fee to be allowed to display the corresponding ad. To judge whether an extreme classifier is indeed suited to this task, one can look, for example, at whether it returns calibrated probabilities, which has hitherto not been done in this field. Therefore, this paper aims to establish the current status quo of calibration in XMLC by providing a systematic evaluation, comprising nine models from four different model families across seven benchmark datasets. As a naive application of Expected Calibration Error (ECE) leads to meaningless results in long-tailed XMLC datasets, we instead introduce the notion of calibration@k (e.g., ECE@k), which focuses on the top-k probability mass, offering a more appropriate measure for evaluating probability calibration in XMLC scenarios. While we find that different models can exhibit widely varying reliability plots, we also show that post-training calibration via a computationally efficient isotonic regression method enhances model calibration without sacrificing prediction accuracy. Thus, the practitioner can choose the model family based on accuracy considerations, and leave calibration to isotonic regression.
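ECE@k is defined in the paper summarized above; one plausible reading, namely binned ECE computed only over the top-k predicted probabilities per instance, is sketched below. The exact definition may differ, and the inputs are synthetic placeholders.

```python
# Hedged sketch of an ECE@k-style metric: binned calibration error restricted to the
# top-k predicted probabilities per instance. The paper's exact definition may differ.
import numpy as np

def ece_at_k(probs, relevance, k=5, n_bins=10):
    """probs: (n, L) predicted label probabilities; relevance: (n, L) binary ground truth."""
    topk = np.argsort(-probs, axis=1)[:, :k]           # indices of the top-k labels per row
    rows = np.arange(probs.shape[0])[:, None]
    p = probs[rows, topk].ravel()                      # top-k confidences
    y = relevance[rows, topk].ravel()                  # their true relevance
    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(p[mask].mean() - y[mask].mean())
    return ece

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(50), size=200)               # placeholder label probabilities
R = (rng.random((200, 50)) < P).astype(int)            # placeholder relevance matrix
print(ece_at_k(P, R, k=5))
```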
... The chosen classification method is support vector machines (SVM). More specifically, a linear support vector classification [44] is used, and a probabilistic output is achieved with Platt scaling [45]. Text data is transformed with the TF-IDF (term frequency-inverse document frequency) vectorisation method, as outlined in [44,46,47]. ...
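A pipeline of the kind described in this snippet can be assembled in scikit-learn from TF-IDF features, a linear SVM, and sigmoid (Platt) calibration. The toy corpus, labels, and cross-validation setting below are placeholder assumptions, not the study's data or exact configuration.

```python
# Sketch: TF-IDF features + linear SVM + sigmoid (Platt) calibration for probabilities.
# The toy maintenance-report corpus and labels are placeholders.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "gearbox oil leak repaired", "gearbox bearing replaced", "oil filter changed on gearbox",
    "blade inspection no findings", "rotor blade tip repaired", "lightning damage on blade",
]
labels = [1, 1, 1, 0, 0, 0]  # toy labels: 1 = gearbox-related, 0 = rotor/blade-related

clf = make_pipeline(
    TfidfVectorizer(),
    CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=2),
)
clf.fit(texts, labels)
print(clf.predict_proba(["oil leak found at gearbox"]))
```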
Article
Full-text available
This study delves into the challenge of efficiently digitalising wind turbine maintenance data, traditionally hindered by non‐standardised formats necessitating manual, expert intervention. Highlighting the discrepancies in past reliability studies based on different key performance indicators (KPIs), the paper underscores the importance of consistent standards, like RDS‐PP, for maintenance data categorisation. Leveraging established digitalisation workflows, we investigate the efficacy of text classifiers in automating the categorisation process against conventional manual labelling. Results indicate that while classifiers exhibit high performance for specific datasets, their general applicability across diverse wind farms is limited at the present stage. Furthermore, differences in failure rate KPIs derived from manual versus classifier‐processed data reveal uncertainties in both methods. The study suggests that enhanced clarity in maintenance reporting and refined designation systems can lead to more accurate KPIs.
... This method calculates the likelihood of each sequence and normalizes it by the sequence length to provide a fair comparison between sequences of different lengths. Platt scaling (Platt, 1999), a variant of the sequence-likelihood baseline, applies Platt scaling to the raw likelihoods. GraphSpectral uses graph theory to estimate confidence. ...
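The length-normalized sequence-likelihood baseline described here amounts to averaging per-token log-probabilities, optionally remapped through a Platt-style sigmoid; the sketch below uses placeholder token log-probabilities and sigmoid parameters.

```python
# Sketch of a length-normalized sequence-likelihood confidence, with an optional
# Platt-style remapping. Token log-probabilities and parameters are placeholders.
import numpy as np

def length_normalized_confidence(token_logprobs):
    """Average per-token log-probability, i.e. log-likelihood divided by length."""
    return float(np.mean(token_logprobs))

def platt_sigmoid(score, A, B):
    """Platt-style remapping of a raw score into (0, 1); A and B are fit on labeled data."""
    return 1.0 / (1.0 + np.exp(A * score + B))

seq_logprobs = [-0.2, -1.3, -0.05, -0.7]         # placeholder per-token log-probs
raw = length_normalized_confidence(seq_logprobs)
calibrated = platt_sigmoid(raw, A=-4.0, B=-1.0)  # placeholder sigmoid parameters
```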
Preprint
Full-text available
One important approach to improving the reliability of large language models (LLMs) is to provide accurate confidence estimations regarding the correctness of their answers. However, developing a well-calibrated confidence estimation model is challenging, as mistakes made by LLMs can be difficult to detect. We propose a novel method combining the LLM's self-consistency with labeled data and training an auxiliary model to estimate the correctness of its responses to questions. This auxiliary model predicts the correctness of responses based solely on their consistent information. To set up the learning problem, we use a weighted graph to represent the consistency among the LLM's multiple responses to a question. Correctness labels are assigned to these responses based on their similarity to the correct answer. We then train a graph neural network to estimate the probability of correct responses. Experiments demonstrate that the proposed approach substantially outperforms several of the most recent methods in confidence calibration across multiple widely adopted benchmark datasets. Furthermore, the proposed approach significantly improves the generalization capability of confidence calibration on out-of-domain (OOD) data.
... The image processing for identifying the gas-liquid interface is the same as that employed in Fiorini et al. (2022) and provides samples of the gas-liquid interface h(η_r, t) on a set of points η_r, with r ∈ [1, n_k]. These data were then regressed using Support Vector Regression (SVR) (Platt 1999; Smola and Schölkopf 2004) to obtain a continuous and smooth representation of the gas-liquid interface, which facilitates the extraction of the dynamic contact angle formed at the solid surface. Additionally, SVR enables the precise definition of the spatial scale within which the contact angle is reported in this study. ...
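As a generic illustration of this regression step (not the cited study's processing pipeline), the sketch below fits scikit-learn's SVR to noisy interface samples h(η_r); the sample data and hyperparameters are assumptions.

```python
# Generic sketch: smooth a sampled interface profile h(eta_r) with Support Vector
# Regression. Sample data and hyperparameters are placeholders.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
eta = np.linspace(-1.0, 1.0, 60).reshape(-1, 1)            # sampled positions eta_r
h = 0.1 * eta.ravel() ** 2 + 0.01 * rng.normal(size=60)    # noisy interface heights h(eta_r)

svr = SVR(kernel="rbf", C=10.0, epsilon=0.005, gamma=2.0).fit(eta, h)

eta_fine = np.linspace(-1.0, 1.0, 500).reshape(-1, 1)
h_smooth = svr.predict(eta_fine)   # smooth, continuous representation of the interface
# The near-wall slope (and hence a dynamic contact angle) can then be estimated, e.g.
# by finite differences on h_smooth close to the boundary.
```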
Article
Full-text available
This work investigates the capillary rise dynamics of highly wetting liquids in a divergent U-tube in the microgravity conditions provided by the 78th European Space Agency (ESA) parabolic flight campaign. This configuration produces a capillary-driven channel flow. We use image recording in backlight illumination to characterize the interface dynamics and dynamic contact angle of HFE7200 and Di-Propylene Glycol (DPG). For the case of HFE7200, we complement the interface measurements with Particle Tracking Velocimetry (PTV) to characterize the velocity fields underneath the moving meniscus. In the DPG experiments, varying liquid column heights are observed, with a notable decrease in meniscus curvature when the contact line transitions from a pre-wetted to a dry substrate. In contrast, for HFE7200, the interface consistently advances over a pre-wetted surface. Despite this, a reduction in meniscus curvature is detected, attributed to inertial effects within the underlying accelerating flow. PTV measurements reveal that the region where the velocity profile adapts to the meniscus velocity decreases as interface acceleration increases, suggesting a direct relationship between acceleration and the velocity adaptation length scale.
... In-processing techniques achieve fairness during training by adding new terms to the loss function [19] or by including more constraints in the optimization. Post-processing techniques sacrifice some utility of the output confidence scores to align them with the fairness objective [28]. ...
Article
Preprint
The iCAP is a tool for blood-based diagnostics that addresses the low signal-to-noise ratio of blood biomarkers by using cells as biosensors. The assay exposes small volumes of patient serum to standardized cells in culture and classifies disease by AI analysis of gene-expression readouts from the cells. It simplifies the complexity of blood into a concise readout in a scalable cell-based assay. We developed the LC-iCAP as a rule-out test for nodule management in CT-based lung cancer screening. The assay achieved an AUC of 0.63 (95% CI 0.50-0.75) in retrospective-blind-temporal validation. When integrated with CT data after validation, it demonstrated potential to reduce unnecessary follow-up procedures by significantly outperforming the Mayo Clinic model, with 90% sensitivity, 67% specificity, and 95% NPV at an estimated 25% prevalence. Analytical validation established LC-iCAP reproducibility and identified unwanted variation from long-term serum storage, suggesting that a prospective study design could enhance performance.