Article
PDF Available

Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods

Authors:
John C. Platt

Abstract

The output of a classifier should be a calibrated posterior probability to enable post-processing. Standard SVMs do not provide such probabilities. One method to create probabilities is to directly train a kernel classifier with a logit link function and a regularized maximum likelihood score. However, training with a maximum likelihood score will produce non-sparse kernel machines. Instead, we train an SVM, then train the parameters of an additional sigmoid function to map the SVM outputs into probabilities. This chapter compares classification error rate and likelihood scores for an SVM plus sigmoid versus a kernel method trained with a regularized likelihood error function. These methods are tested on three data-mining-style data sets. The SVM+sigmoid yields probabilities of comparable quality to the regularized maximum likelihood kernel method, while still retaining the sparseness of the SVM.
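The SVM-plus-sigmoid procedure summarized in the abstract can be sketched in a few lines. The following is a minimal illustration, not the paper's original code: it trains an SVM, then fits the sigmoid P(y=1|f) = 1/(1 + exp(A*f + B)) to held-out decision values by maximizing a regularized likelihood with Platt's prior-corrected targets. The dataset, library choices, and variable names are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary problem; the calibration split plays the role of the
# held-out set used to fit the sigmoid.
X, y = make_classification(n_samples=600, random_state=0)
X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, test_size=0.3, random_state=0)

svm = SVC(kernel="rbf", gamma="scale").fit(X_tr, y_tr)
f_cal = svm.decision_function(X_cal)  # unthresholded SVM outputs f(x)

# Platt's regularized targets: t+ = (N+ + 1)/(N+ + 2), t- = 1/(N- + 2)
n_pos, n_neg = int((y_cal == 1).sum()), int((y_cal == 0).sum())
t = np.where(y_cal == 1, (n_pos + 1.0) / (n_pos + 2.0), 1.0 / (n_neg + 2.0))

def neg_log_likelihood(params):
    A, B = params
    p = 1.0 / (1.0 + np.exp(A * f_cal + B))
    eps = 1e-12
    return -np.sum(t * np.log(p + eps) + (1.0 - t) * np.log(1.0 - p + eps))

A, B = minimize(neg_log_likelihood, x0=[0.0, 0.0], method="Nelder-Mead").x

def svm_posterior(x):
    # Map new SVM outputs through the fitted sigmoid.
    return 1.0 / (1.0 + np.exp(A * svm.decision_function(x) + B))

print(svm_posterior(X_cal[:5]))  # calibrated estimates of P(y = 1 | x)
```

Because only the two sigmoid parameters are fitted after the SVM is trained, the resulting classifier keeps the SVM's sparse set of support vectors.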
[Figure: only axis ticks survived extraction. Two panels are recoverable, one with SVM output values (roughly -6 to 4) on the horizontal axis and a vertical scale from 0 to 0.045, and one with outputs from -5 to 5 mapped to a vertical scale from 0 to 1, consistent with the histogram of SVM outputs and the fitted sigmoid described in the abstract.]
... In the past, several works have addressed the problem of unreliable uncertainty information by applying an additional calibration step during inference [35,36,13,34]. For classification, the authors in [35] originally sought a probabilistic output for support vector machines by using a sigmoid function to transform an unrestricted output score into a score in the [0, 1] interval that can be interpreted as an estimated probability of correctness. In the literature, this method is known as logistic calibration or Platt scaling and can be used as a recalibration function to obtain calibrated probabilities [35,13]. ...
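As a hedged sketch of the logistic calibration / Platt scaling step described in this excerpt, scikit-learn exposes the same sigmoid recalibration as a post-hoc wrapper; the synthetic data and parameter values below are illustrative, not taken from the cited works.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, random_state=0)

# method="sigmoid" fits Platt's sigmoid on cross-validated decision scores.
calibrated_svm = CalibratedClassifierCV(SVC(), method="sigmoid", cv=5).fit(X, y)
probs = calibrated_svm.predict_proba(X[:5])  # unrestricted scores mapped into [0, 1]
print(probs)
```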
Preprint
Full-text available
Image-based environment perception is an important component, especially for driver assistance systems and autonomous driving. In this context, modern neural networks are used to identify multiple objects as well as their position and size within a single frame. The performance of such an object detection model is important for the overall performance of the whole system. However, a detection model might also predict these objects under a certain degree of uncertainty. [...] In this work, we examine the semantic uncertainty (which object type?) as well as the spatial uncertainty (where is the object and how large is it?). We evaluate whether the predicted uncertainties of an object detection model match the observed error on real-world data. In the first part of this work, we introduce the definition of confidence calibration of the semantic uncertainty in the context of object detection, instance segmentation, and semantic segmentation. We integrate additional position information in our examinations to evaluate the effect of the object's position on the semantic calibration properties. Besides measuring calibration, it is also possible to perform a post-hoc recalibration of semantic uncertainty that might have turned out to be miscalibrated. [...] The second part of this work deals with the spatial uncertainty obtained by a probabilistic detection model. [...] We review and extend common calibration methods so that it is possible to obtain parametric uncertainty distributions for the position information in a more flexible way. In the last part, we demonstrate a possible use-case for our derived calibration methods in the context of object tracking. [...] We integrate our previously proposed calibration techniques and demonstrate the usefulness of semantic and spatial uncertainty calibration in a subsequent process. [...]
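Confidence calibration of the kind evaluated in this work is commonly measured with the expected calibration error (ECE). The following is a minimal sketch under assumed inputs (per-prediction confidences and binary correctness), not code from the cited preprint.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    # Compare mean confidence with observed accuracy inside equally spaced bins,
    # weighting each bin by the fraction of samples it contains.
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# e.g. confidences of matched detections and whether each was classified correctly
print(expected_calibration_error([0.9, 0.8, 0.6, 0.95], [1, 1, 0, 1]))
```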
... In the calibration literature, various post-processing methods, such as Platt scaling (Platt et al., 1999) and temperature scaling (Guo et al., 2017), also improve calibration by reducing p_{i,y_i} below 1, while other methods such as label smoothing (Müller et al., 2019; Liu et al., 2022) and focal loss (Mukhoti et al., 2020) achieve a similar reduction of the predicted probability. While all these methods require additional hyperparameters, squentropy does not. ...
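For context on the temperature scaling mentioned here (Guo et al., 2017), a minimal sketch follows: a single temperature T is fitted on validation logits by minimizing the negative log-likelihood, and dividing logits by T > 1 shrinks overconfident p_{i,y_i}. The synthetic data and helper names are assumptions, not the cited implementation.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels):
    # Minimize validation NLL over a single scalar temperature T = exp(log_t) > 0.
    def nll(log_t):
        p = softmax(logits / np.exp(log_t))
        return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))
    return np.exp(minimize_scalar(nll, bounds=(-3.0, 3.0), method="bounded").x)

# Synthetic overconfident classifier: logits peak sharply, but 20% of peaks are wrong.
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=500)
peak = np.where(rng.random(500) < 0.2, (y + 1) % 3, y)
logits = 5.0 * np.eye(3)[peak] + rng.normal(size=(500, 3))

T = fit_temperature(logits, y)
print(T)  # T > 1: dividing logits by T lowers the top-class probability
```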
Preprint
Nearly all practical neural models for classification are trained using the cross-entropy loss. Yet this ubiquitous choice is supported by little theoretical or empirical evidence. Recent work (Hui & Belkin, 2020) suggests that training with the (rescaled) square loss is often superior in terms of classification accuracy. In this paper we propose the "squentropy" loss, which is the sum of two terms: the cross-entropy loss and the average square loss over the incorrect classes. We provide an extensive set of experiments on multi-class classification problems showing that the squentropy loss outperforms both the pure cross-entropy and rescaled square losses in terms of classification accuracy. We also demonstrate that it provides significantly better model calibration than either of these alternative losses and, furthermore, has less variance with respect to the random initialization. Additionally, in contrast to the square loss, models can typically be trained with the squentropy loss using exactly the same optimization parameters, including the learning rate, as with the standard cross-entropy loss, making it a true "plug-and-play" replacement. Finally, unlike the rescaled square loss, multiclass squentropy contains no parameters that need to be adjusted.
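The abstract's definition (cross-entropy plus the average square loss over the incorrect classes) can be written down directly. This is a hedged NumPy sketch of that stated formula, not the authors' reference implementation; shapes and names are illustrative.

```python
import numpy as np

def squentropy_loss(logits, labels):
    # logits: (n, k) raw scores, labels: (n,) integer class indices
    n, k = logits.shape

    # Numerically stable cross-entropy term.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    cross_entropy = -log_probs[np.arange(n), labels].mean()

    # Average squared logit over the k-1 incorrect classes, averaged over examples.
    mask = np.ones_like(logits, dtype=bool)
    mask[np.arange(n), labels] = False
    square_term = (logits[mask] ** 2).reshape(n, k - 1).mean(axis=1).mean()

    return cross_entropy + square_term

logits = np.array([[2.0, -1.0, 0.5], [0.1, 1.5, -0.3]])
print(squentropy_loss(logits, np.array([0, 1])))
```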
... for an example and selects the example if ρ_i > 1 − β_i, where β_i is a Pareto random variable. Second, we consider heuristic classification with importance resampling by first calibrating the classifier's probabilities with Platt scaling (Platt, 1999) against a validation set, then using the calibrated probabilities ρ_i to compute the importance weight log(ρ_i / (1 − ρ_i)). Similarly to DSIR, we use the Gumbel top-k trick to select a subset using these importance weights. ...
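The selection step sketched in this excerpt (calibrated probabilities turned into importance weights, then Gumbel top-k sampling) might look like the following; the array sizes and constants are illustrative assumptions, not the cited pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Calibrated probabilities rho_i that each example is target-like.
rho = np.clip(rng.uniform(size=100_000), 1e-6, 1 - 1e-6)
log_w = np.log(rho / (1.0 - rho))  # importance weights in log space

# Gumbel top-k trick: adding Gumbel noise and taking the top k draws a
# without-replacement subset proportional to the (softmax of the) weights.
k = 1_000
gumbel = rng.gumbel(size=log_w.shape)
selected = np.argsort(log_w + gumbel)[-k:]
print(selected[:10])
```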
Preprint
Selecting a suitable training dataset is crucial for both general-domain (e.g., GPT-3) and domain-specific (e.g., Codex) language models (LMs). We formalize this data selection problem as selecting a subset of a large raw unlabeled dataset to match a desired target distribution, given some unlabeled target samples. Due to the large scale and dimensionality of the raw text data, existing methods use simple heuristics to select data that are similar to a high-quality reference corpus (e.g., Wikipedia), or leverage experts to manually curate data. Instead, we extend the classic importance resampling approach used in low-dimensions for LM data selection. Crucially, we work in a reduced feature space to make importance weight estimation tractable over the space of text. To determine an appropriate feature space, we first show that KL reduction, a data metric that measures the proximity between selected data and the target in a feature space, has high correlation with average accuracy on 8 downstream tasks (r=0.89) when computed with simple n-gram features. From this observation, we present Data Selection with Importance Resampling (DSIR), an efficient and scalable algorithm that estimates importance weights in a reduced feature space (e.g., n-gram features in our instantiation) and selects data with importance resampling according to these weights. When training general-domain models (target is Wikipedia + books), DSIR improves over random selection and heuristic filtering baselines by 2--2.5% on the GLUE benchmark. When performing continued pretraining towards a specific domain, DSIR performs comparably to expert curated data across 8 target distributions.
... In-processing techniques achieve fairness during training by adding new terms to the loss function [16] or including more constraints in the optimization. Post-processing techniques sacrifice the utility of output confidence scores and align them with the fairness objective [25]. ...
Preprint
Full-text available
Machine learning (ML) is playing an increasing role in decision-making tasks that directly affect individuals, e.g., loan approvals, or job applicant screening. Significant concerns arise that, without special provisions, individuals from under-privileged backgrounds may not get equitable access to services and opportunities. Existing research studies fairness with respect to protected attributes such as gender, race or income, but the impact of location data on fairness has been largely overlooked. With the widespread adoption of mobile apps, geospatial attributes are increasingly used in ML, and their potential to introduce unfair bias is significant, given their high correlation with protected attributes. We propose techniques to mitigate location bias in machine learning. Specifically, we consider the issue of miscalibration when dealing with geospatial attributes. We focus on spatial group fairness and we propose a spatial indexing algorithm that accounts for fairness. Our KD-tree inspired approach significantly improves fairness while maintaining high learning accuracy, as shown by extensive experimental results on real data.
... The constant was chosen via a black-box optimization algorithm that attempts to minimize the miscalibration area for each UQ method, Table S4. Brent's method was used for this recalibration method because, unlike Platt scaling [62] for example, it is a nonparametric method that makes no assumptions about the shape of the calibration curve(s) to be recalibrated [63]. The scikit-learn implementation of Brent's method was used [64]. ...
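A hedged sketch of this kind of single-constant recalibration (not the cited paper's exact workflow): scale the predicted standard deviations by one constant chosen with Brent's method, here via SciPy's minimize_scalar, so that a Gaussian-interval miscalibration measure is minimized. Data, names, and the exact miscalibration measure are assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def miscalibration(y_true, y_pred, y_std, n_points=100):
    # Mean gap between expected and observed central-interval coverage
    # (proportional to the miscalibration area on a uniform grid).
    expected = np.linspace(0.01, 0.99, n_points)
    z = np.abs(y_true - y_pred) / y_std
    observed = np.array([(z <= norm.ppf(0.5 + p / 2.0)).mean() for p in expected])
    return np.mean(np.abs(observed - expected))

def recalibration_constant(y_true, y_pred, y_std):
    # Optimize the log of the scale so the constant stays positive.
    res = minimize_scalar(
        lambda s: miscalibration(y_true, y_pred, np.exp(s) * y_std), method="brent"
    )
    return np.exp(res.x)

rng = np.random.default_rng(0)
y_pred = rng.normal(size=500)
y_true = y_pred + rng.normal(scale=2.0, size=500)  # true noise is 2x the claimed std
y_std = np.ones(500)                               # overconfident uncertainty estimates
print(recalibration_constant(y_true, y_pred, y_std))  # roughly 2 for this synthetic data
```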
Preprint
Full-text available
It is critical that machine learning (ML) model predictions be trustworthy for high-throughput catalyst discovery approaches. Uncertainty quantification (UQ) methods allow estimation of the trustworthiness of an ML model, but these methods have not been well explored in the field of heterogeneous catalysis. Herein, we investigate different UQ methods applied to a crystal graph convolutional neural network (CGCNN) to predict adsorption energies of molecules on alloys from the Open Catalyst 2020 (OC20) dataset, the largest existing heterogeneous catalyst dataset. We apply three UQ methods to the adsorption energy predictions, namely k-fold ensembling, Monte Carlo dropout, and evidential regression. The effectiveness of each UQ method is assessed based on accuracy, sharpness, dispersion, calibration, and tightness. Evidential regression is demonstrated to be a powerful approach for rapidly obtaining tunable, competitively trustworthy UQ estimates for heterogeneous catalysis applications when using neural networks. Recalibration of model uncertainties is shown to be essential in practical screening applications of catalysts using uncertainties.
Preprint
Full-text available
A growing body of evidence suggests that gene flow between closely related species is a widespread phenomenon. Alleles that introgress from one species into a close relative are typically neutral or deleterious, but sometimes confer a significant fitness advantage. Given the potential relevance to speciation and adaptation, numerous methods have therefore been devised to identify regions of the genome that have experienced introgression. Recently, supervised machine learning approaches have been shown to be highly effective for detecting introgression. One especially promising approach is to treat population genetic inference as an image classification problem, and feed an image representation of a population genetic alignment as input to a deep neural network that distinguishes among evolutionary models (i.e. introgression or no introgression). However, if we wish to investigate the full extent and fitness effects of introgression, merely identifying genomic regions in a population genetic alignment that harbor introgressed loci is insufficient; ideally we would be able to infer precisely which individuals have introgressed material and at which positions in the genome. Here we adapt a deep learning algorithm for semantic segmentation, the task of correctly identifying the type of object to which each individual pixel in an image belongs, to the task of identifying introgressed alleles. Our trained neural network is thus able to infer, for each individual in a two-population alignment, which of that individual's alleles were introgressed from the other population. We use simulated data to show that this approach is highly accurate, and that it can be readily extended to identify alleles that are introgressed from an unsampled "ghost" population, performing comparably to a supervised learning method tailored specifically to that task. Finally, we apply this method to data from Drosophila, showing that it is able to accurately recover introgressed haplotypes from real data. This analysis reveals that introgressed alleles are typically confined to lower frequencies within genic regions, suggestive of purifying selection, but are found at much higher frequencies in a region previously shown to be affected by adaptive introgression. Our method's success in recovering introgressed haplotypes in challenging real-world scenarios underscores the utility of deep learning approaches for making richer evolutionary inferences from genomic data.