Probabilistic Object Classification
using CNN ML-MAP layers
G. Melotti1, C. Premebida1, J. J. Bird2, D. R. Faria2, and N. Gonçalves1
1ISR-UC, University of Coimbra, Portugal.
{gledson,cpremebida,nuno}@isr.uc.pt
https://www.isr.uc.pt
2ARVIS Lab, Aston University, UK.
{birdj1,d.faria}@aston.ac.uk
http://arvis-lab.io
Abstract. Deep networks are currently the state-of-the-art for sensory perception in autonomous driving and robotics. However, deep models often generate overconfident predictions, precluding a proper probabilistic interpretation, which we argue is due to the nature of the SoftMax layer. To reduce the overconfidence without compromising the classification performance, we introduce a CNN probabilistic approach based on distributions calculated in the network's Logit layer. The approach enables Bayesian inference by means of ML and MAP layers. Experiments with calibrated and the proposed prediction layers are carried out on object classification using data from the KITTI database. Results are reported for camera (RGB) and LiDAR (range-view) modalities, where the new approach shows promising performance compared to SoftMax.
Keywords: Probabilistic inference, Perception systems, CNN probabilistic layer, object classification.
1 Introduction
In state-of-the-art research, the majority of CNN-based classifiers (convolutional neural networks) are trained to provide normalized prediction scores of observations given the set of classes, that is, scores in the interval [0, 1] [1]. Normalized outputs aim to guarantee a "probabilistic" interpretation. However, how reliable are these predictions in terms of probabilistic interpretation? Also, given an example of a non-trained class, how confident is the model? These are the key questions to be addressed in this work.
Currently, possible answers to these open issues are related to calibration techniques and to penalizing overconfident output distributions [2, 3]. Regularization is often used to reduce overconfidence, and consequently overfitting, for example the confidence penalty [3], which is added directly to the cost function. Examples of transformations of the network weights include L1 and L2 regularization [4], Dropout [5], Multi-sample Dropout [6] and Batch Normalization [7].
[Figure 1: six reliability-diagram panels plotting Accuracy against Confidence, each showing the observed accuracy and the calibration gap. Panels: (a) UC-RGB, (b) TS-RGB, (c) TS-RGB SM scores, (d) UC-RV, (e) TS-RV, (f) TS-RV SM scores.]
Fig. 1: RGB modality reliability diagrams (1st row), on the testing set, for uncalibrated (UC) in (a) and for temperature scaling (TS) in (b), with T = 1.50. Subfigure (c) shows the distribution of the calibrated prediction scores using SoftMax (SM). The 2nd row shows the LiDAR (range-view: RV) modality reliability diagrams in (d) and (e), with T = 1.71, while (f) shows the prediction-score distribution. Note that (c) and (f) are still overconfident post-calibration.
Alternatively, highly confident predictions can often be mitigated by calibration techniques such as Isotonic Regression [8], which combines binary probability estimates of multiple classes, thus jointly optimizing the bin boundaries and bin predictions; Platt Scaling [9], which uses classifier predictions as features for a logistic regression model; Beta Calibration [10], which uses a parametric formulation based on the Beta probability density function (pdf); and temperature scaling (TS) [11], which multiplies all values of the Logit vector by a scalar parameter $1/T > 0$, for all classes. The value of $T$ is obtained by minimizing the negative log likelihood on the validation set.
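For concreteness, the temperature-scaling step can be implemented as in the minimal sketch below, assuming NumPy/SciPy and pre-computed validation logits with integer class labels (all variable names are illustrative, not from the paper):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def nll(T, logits, labels):
    """Negative log-likelihood of softmax(logits / T) for integer labels."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)                  # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(val_logits, val_labels):
    """Find T > 0 that minimizes the NLL on the validation set."""
    res = minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded",
                          args=(val_logits, val_labels))
    return res.x

# Calibrated test-time probabilities are then softmax(test_logits / T).
```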
Typically, post-calibration predictions are analysed via reliability diagrams [2,12], which illustrate the relationship of the model's prediction scores to the true correctness likelihood [13]. Reliability diagrams show the expected accuracy of the examples as a function of confidence, i.e., the maximum SoftMax value. A perfectly calibrated model would trace the identity function, while any deviation from the diagonal represents a calibration error [2,12], as shown in Fig. 1a and 1b for the uncalibrated (UC) and temperature-scaled (TS) predictions on the testing set. In addition, Fig. 1c shows the distribution of scores (histogram), which, even after TS calibration, is still overconfident. Consequently, calibration does not guarantee a good balance of the prediction scores and may jeopardize an adequate probabilistic interpretation.
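As an illustration of how the quantities in such a diagram can be computed, the sketch below bins the maximum SoftMax scores into equal-width confidence bins and reports the per-bin accuracy together with the expected calibration error; the 10-bin setting is an assumption, as the paper does not state its bin count:

```python
import numpy as np

def reliability_diagram(confidences, predictions, labels, n_bins=10):
    """Per-bin (confidence, accuracy, count) plus the expected calibration error.

    confidences: max SoftMax score per example; predictions: argmax class;
    labels: ground-truth class. Bins are equal-width over [0, 1].
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    correct = (predictions == labels).astype(float)
    bins, ece = [], 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            conf, acc = confidences[mask].mean(), correct[mask].mean()
            ece += mask.mean() * abs(acc - conf)          # gap weighted by bin mass
            bins.append((conf, acc, int(mask.sum())))
    return bins, ece
```

Any bin whose observed accuracy deviates from its average confidence contributes to the gap visualised in Fig. 1.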
[Figure 2: four histogram panels: (a) Logit-layer scores, RGB; (b) SoftMax-layer scores, RGB; (c) Logit-layer scores, RV; (d) SoftMax-layer scores, RV.]
Fig. 2: Probability density functions (pdf), here modeled by histograms, calculated for the Logit-layer scores for the RGB (a) and RV (c) modalities. The graphs in (a,b,c,d) are organized from left to right according to the examples in the Training set (the positives are in orange). The distribution of the SoftMax prediction-scores in (b) and (d) is evidence of high confidence.
Complex networks such as multilayer perceptrons (MLPs) and CNNs are generally overconfident in the prediction phase, particularly when using the baseline SoftMax as the prediction function, generating ill-distributed outputs, i.e., values very close to zero or one [2]. Taking into account the importance of having models grounded on proper probability assumptions, so as to enable an adequate interpretation of the outputs and hence reliable decisions, this paper aims to contribute to the advances of multi-sensor (RGB and LiDAR) perception for autonomous vehicle systems [14–16] by using pdfs (calculated on the training data) to model the Logit-layer scores. The SoftMax is then replaced by a Maximum Likelihood (ML) or a Maximum A Posteriori (MAP) prediction layer, which provides a smoother distribution of predictive values. Note that it is not necessary to re-train the CNN, i.e., the proposed technique is practical.
2 Effects of Replacing the SoftMax Layer by a Probability Layer
The key contribution of this work is to replace the SoftMax layer (which is a "hard" normalization function) by a probabilistic layer (an ML or a MAP layer) during the testing phase. The new layers make inference based on pdfs calculated on the Logit prediction scores using the training set. The SoftMax scores are known to be overconfident (very close to zero or one); on the other hand, the distribution of the scores at the Logit-layer is far more appropriate for representing a pdf (as shown in Fig. 2). Therefore, replacing the SoftMax by ML or MAP layers is more adequate for probabilistic inference, permitting decision-making under uncertainty, which is particularly relevant in autonomous driving and robotic perception systems.
[Figure 3: six histogram panels: (a) SoftMax-layer scores, RGB; (b) SoftMax-layer scores, RV; (c) ML-layer scores, RGB; (d) ML-layer scores, RV; (e) MAP-layer scores, RGB; (f) MAP-layer scores, RV.]
Fig. 3: Prediction scores (i.e., the network outputs), on the Testing set, using SoftMax (baseline solution), ML and MAP layers, for the RGB and LiDAR (RV) modalities.
Let $X^L_i$ be the output score vector of the CNN at the Logit-layer for example $i$ (the dimensionality of $X$ is proportional to the number of classes), let $C_i$ be the target class, and let $P(X^L_i|C_i)$ be the class-conditional probability to be modelled in order to make probabilistic predictions. In this paper, a non-parametric pdf estimation, using histograms with 25 bins (for the RGB case) and 35 bins (for the RV model), was applied over the predicted scores of the Logit-layer, on the training set, to estimate $P(X^L|C)$. Assuming the priors are uniform and identically distributed over the set of classes $C$, the ML is straightforwardly calculated by normalizing $P(X^L_i|C_i)$ by $P(X_i)$ during the prediction phase. Additionally, to avoid 'zero' probabilities and to incorporate some uncertainty level in the final prediction, we apply additive smoothing (with a factor equal to 0.01) before the calculation of the posteriors. Alternatively, a MAP layer can be used by considering, for instance, the prior as modelled by a Gaussian distribution; thus the $i$th posterior becomes $P(C_i|X^L_i) = P(X^L_i|C_i)P(C_i)/P(X_i)$, where $P(C_i) \sim \mathcal{N}(\mu, \sigma^2)$ with mean $\mu$ and variance $\sigma^2$ calculated, per class, from the training set. To simplify, the rationale for using Normal-distributed priors is that, contrary to histograms or more tailored distributions, the Normal pdf fits the data more smoothly.
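To make the two prediction layers concrete, the sketch below implements one plausible reading of the procedure above: each class gets a 1-D histogram pdf of its own logit score fitted on the training set, additive smoothing of 0.01 is applied to the histogram bins, and the MAP variant multiplies the likelihood by a per-class Gaussian term. The per-dimension factorisation and all names are assumptions for illustration, not the authors' code.

```python
import numpy as np

class MLMAPLayer:
    """Test-time replacement for the SoftMax layer using ML / MAP inference."""

    def __init__(self, n_bins=25, smoothing=0.01):
        self.n_bins = n_bins          # 25 assumed for RGB, 35 for RV
        self.smoothing = smoothing    # additive smoothing factor

    def fit(self, train_logits, train_labels):
        """Estimate per-class histogram pdfs and Gaussian (mu, sigma) on logits."""
        n_classes = train_logits.shape[1]
        self.edges, self.hists, self.mu, self.sigma = [], [], [], []
        for c in range(n_classes):
            scores = train_logits[train_labels == c, c]
            hist, edges = np.histogram(scores, bins=self.n_bins, density=True)
            self.hists.append(hist + self.smoothing)      # avoid 'zero' probabilities
            self.edges.append(edges)
            self.mu.append(scores.mean())
            self.sigma.append(scores.std() + 1e-8)
        return self

    def _likelihood(self, logits):
        """Evaluate each class's histogram pdf at its own logit score."""
        n, k = logits.shape
        lik = np.empty((n, k))
        for c in range(k):
            idx = np.clip(np.digitize(logits[:, c], self.edges[c]) - 1,
                          0, self.n_bins - 1)
            lik[:, c] = self.hists[c][idx]
        return lik

    def predict_ml(self, logits):
        """ML posterior: uniform priors cancel, normalise by the evidence."""
        lik = self._likelihood(logits)
        return lik / lik.sum(axis=1, keepdims=True)

    def predict_map(self, logits):
        """MAP posterior: likelihood times a per-class Gaussian prior term."""
        lik = self._likelihood(logits)
        mu, sigma = np.array(self.mu), np.array(self.sigma)
        prior = np.exp(-0.5 * ((logits - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        post = lik * prior
        return post / post.sum(axis=1, keepdims=True)
```

With logit scores collected on the training set, `MLMAPLayer(n_bins=25).fit(train_logits, train_labels).predict_ml(test_logits)` would correspond to the RGB configuration, and `n_bins=35` to the RV one.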
3 Evaluation and Discussion
Table 1: Classification performance (%) in terms of average F-score and FPR for the baseline (SM) models compared to the proposed approach of ML and MAP layers. The performance measures on the 'unseen' dataset are the average and the variance of the prediction scores.

                     SM_RGB   ML_RGB   MAP_RGB   SM_RV    ML_RV    MAP_RV
F-score               95.89    94.85     95.04   89.48    88.09     87.84
FPR                    1.60     1.19      1.14    3.05     2.22      2.33
Ave.Scores_unseen      0.983    0.708     0.397   0.970    0.692     0.394
Var.Scores_unseen      0.005    0.025     0.004   0.010    0.017     0.003
[Figure 4: two panels: (a) SoftMax (SM), ML, and MAP scores on the RGB unseen set; (b) SoftMax (SM), ML, and MAP scores on the RV unseen set.]
Fig. 4: Prediction scores, on the unseen data (comprising non-trained classes: 'person sit.', 'tram', 'trees/trunks', 'truck', 'vans'), for the networks using the SoftMax-layer (left-most side), and the proposed ML (center) and MAP (right-side) layers.
In this work, the CNN is modeled by Inception V3. The classes of interest are pedestrians, cars, and cyclists; the classification dataset is based on the KITTI 2D object dataset [15], and the number of training examples is {2827, 18103, 1025} for 'ped', 'car', and 'cyc'. The testing set comprises {1346, 8620, 488} examples, respectively.
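As an illustrative sketch of this setup (the paper provides no code, so the head construction and training configuration below are assumptions), a three-class Inception V3 whose final Dense layer keeps a linear activation exposes the Logit-layer scores directly at test time:

```python
import tensorflow as tf

# Hypothetical 3-class head ('ped', 'car', 'cyc') on an Inception V3 backbone
# that outputs logits rather than SoftMax probabilities, so the Logit-layer
# scores needed by the ML/MAP layers can be collected directly.
base = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet", pooling="avg",
    input_shape=(299, 299, 3))
logits = tf.keras.layers.Dense(3, name="logits")(base.output)   # linear activation
model = tf.keras.Model(base.input, logits)

# Training still uses the usual SoftMax cross-entropy, applied to the logits.
model.compile(optimizer="adam",
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])

# At test time, model.predict(images) returns the logit vectors X^L; applying
# tf.nn.softmax to them reproduces the baseline SoftMax predictions.
```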
The output scores of the CNN indicate a degree of certainty of the given prediction. The "certainty level" can be defined as the confidence of the model and, in a classification problem, is the maximum value within the SoftMax layer, ideally equal to one for the target class. However, the output scores may not always represent a reliable indication of certainty with regards to the target class, especially when unseen or non-trained examples/objects occur in the prediction stage; this is particularly relevant for real-world applications involving autonomous robots and vehicles, since unpredictable objects are highly likely to be encountered. With this in mind, in addition to the trained classes ('ped', 'car', 'cyc'), a set of untrained objects is introduced: 'person sit.', 'tram', 'truck', 'vans', and 'trees/trunks', comprising {222, 511, 1094, 2914, 45} examples respectively. All classes with the exception of 'trees/trunks' come directly from the aforementioned KITTI dataset, while the latter is additionally introduced by this study. The rationale behind this is to evaluate the prediction confidence of the networks on objects that do not belong to any of the trained classes,
and consequently the consistency of the models can be assessed. Ideally, if
the classifiers are perfectly consistent in terms of probability interpretation, the
prediction scores would be identical (equal to 1/3) for all the examples on the
unseen set on a per-class basis.
Results on the testing set are shown in Table 1 in terms of the F-score metric and the average FPR of the prediction scores (classification errors). The average (Ave.Scores_unseen) and the sample variance (Var.Scores_unseen) of the predicted scores are also shown for the unseen testing set.
To summarise, the proposed probabilistic approach shows promising results, since ML and MAP reduce classifier overconfidence, as can be observed in Figures 3c, 3d, 3e and 3f. In reference to Table 1, it can be observed that the FPR values are considerably lower than the results presented by the SoftMax (baseline) function. Finally, to assess classifier robustness, or the uncertainty of the model when predicting examples of classes untrained by the network, we consider a testing set comprised of 'new' objects. Overall, the results are promising, since the distributions of the predictions are not concentrated at the extremes, as can be observed in Fig. 4. Quantitatively, the average scores of the networks using ML and MAP layers are significantly lower than those of the SoftMax approach, and thus the models are less confident on new/unseen negative objects.
References
1. Dong Su, Huan Zhang, Hongge Chen, Jinfeng Yi, Pin-Yu Chen, and Yupeng Gao.
Is robustness the cost of accuracy? A comprehensive study on the robustness of 18
deep image classification models. In ECCV, 2018.
2. Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of
modern neural networks. In Proceedings of the 34th International Conference on
Machine Learning, volume 70, pages 1321–1330, 2017.
3. Gabriel Pereyra, George Tucker, Jan Chorowski, Lukasz Kaiser, and Geoffrey E.
Hinton. Regularizing neural networks by penalizing confident output distributions.
CoRR, arXiv:1701.06548, 2017.
4. Andrew Y. Ng. Feature selection, L1 vs. L2 regularization, and rotational invari-
ance. ICML 04. Association for Computing Machinery, 2004.
5. Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan
Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting.
Journal of Machine Learning Research, 15(56):1929–1958, 2014.
6. Hiroshi Inoue. Multi-sample dropout for accelerated training and better generalization. CoRR, abs/1905.09788, 2019.
7. Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep net-
work training by reducing internal covariate shift. In Francis Bach and David Blei,
editors, 32nd ICML, volume 37, pages 448–456, 2015.
8. Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accu-
rate multiclass probability estimates. ACM SIGKDD International Conference on
KDD, 2002.
9. John Platt. Probabilistic outputs for support vector machines and comparisons to
regularized likelihood methods. Adv. Large Margin Classifiers, 10, 2000.
10. Meelis Kull, Telmo Silva Filho, and Peter Flach. Beta calibration: a well-founded
and easily implemented improvement on logistic calibration for binary classifiers.
In 20th AISTATS., pages 623–631, 2017.
11. Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a
neural network. In NIPS, 2015.
12. Meelis Kull, Miquel Perello Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song,
and Peter Flach. Beyond temperature scaling: Obtaining well-calibrated multi-
class probabilities with dirichlet calibration. In Advances in Neural Information
Processing Systems 32, pages 12316–12326. 2019.
13. Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with
supervised learning. In ICML, pages 625–632, 2005.
14. Manuel Martin, Alina Roitberg, Monica Haurilet, Matthias Horne, Simon Reiss,
Michael Voit, and Rainer Stiefelhagen. Drive&act: A multi-modal dataset for fine-
grained driver behavior recognition in autonomous vehicles. In ICCV, 2019.
15. A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The KITTI
dataset. International Journal of Robotics Research (IJRR), 32(11), 2013.
16. G. Melotti, C. Premebida, and N. Gonçalves. Multimodal deep-learning for object
recognition combining camera and LIDAR data. In IEEE ICARSC, 2020.