
Probabilistic Object Classification using CNN ML-MAP Layers

G. Melotti1, C. Premebida1, J. J. Bird2, D. R. Faria2, and N. Gonçalves1

1 ISR-UC, University of Coimbra, Portugal.

{gledson,cpremebida,nuno}@isr.uc.pt

https://www.isr.uc.pt

2ARVIS Lab, Aston University, UK.

{birdj1,d.faria}@aston.ac.uk

http://arvis-lab.io

Abstract. Deep networks are currently the state of the art for sensory perception in autonomous driving and robotics. However, deep models often generate overconfident predictions precluding a proper probabilistic interpretation, which we argue is due to the nature of the SoftMax layer. To reduce this overconfidence without compromising classification performance, we introduce a CNN probabilistic approach based on distributions calculated in the network's Logit layer. The approach enables Bayesian inference by means of ML and MAP layers. Experiments with calibrated and with the proposed prediction layers are carried out on object classification using data from the KITTI database. Results are reported for the camera (RGB) and LiDAR (range-view) modalities, where the new approach shows promising performance compared to SoftMax.

Keywords: Probabilistic inference, Perception systems, CNN probabilistic layer, Object classification.

1 Introduction

In state-of-the-art research, the majority of classifiers based on convolutional neural networks (CNNs) are trained to provide normalized prediction scores for observations given the set of classes, that is, scores in the interval [0,1] [1]. Normalized outputs aim to guarantee a "probabilistic" interpretation. However, how reliable are these predictions in terms of probabilistic interpretation? And, given an example of a non-trained class, how confident is the model? These are the key questions addressed in this work.

Currently, possible answers to these open issues relate to calibration techniques and to penalizing overconfident output distributions [2, 3]. Regularization is often used to reduce overconfidence, and consequently overfitting; the confidence penalty [3], for instance, is added directly to the cost function. Examples of transformations of the network weights include L1 and L2 regularization [4],

Dropout [5], Multi-sample Dropout [6], and Batch Normalization [7].

arXiv:2005.14565v1 [cs.LG] 29 May 2020

[Fig. 1: six panels — (a) UC-RGB, (b) TS-RGB, (c) TS-RGB SM scores, (d) UC-RV, (e) TS-RV, (f) TS-RV SM scores; panels (a), (b), (d), (e) plot Accuracy against Confidence, showing the observed accuracy and the gap to the diagonal.]

Fig. 1: RGB modality reliability diagrams (1st row), on the testing set, for the uncalibrated (UC) model in (a) and for temperature scaling (TS) in (b), with T = 1.50. Subfigure (c) shows the distribution of the calibrated prediction scores using SoftMax (SM). The 2nd row shows the LiDAR (range-view: RV) modality reliability diagrams in (d) and (e), with T = 1.71, while (f) shows the prediction-score distribution. Note that (c) and (f) are still overconfident post-calibration.

Alternatively, highly confident predictions can often be mitigated by calibration techniques such as Isotonic Regression [8], which combines binary probability estimates of multiple classes, jointly optimizing the bin boundaries and the bin predictions; Platt Scaling [9], which uses classifier predictions as features for a logistic regression model; Beta Calibration [10], a parametric formulation based on the Beta probability density function (pdf); and temperature scaling (TS) [11], which multiplies all values of the Logit vector by a scalar parameter 1/T > 0, shared across classes. The value of T is obtained by minimizing the negative log-likelihood on the validation set.
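As a minimal sketch of temperature scaling under the criterion just described (the function names and the grid-search fitting strategy are our own illustration, not the paper's code; a 1-D optimizer would work equally well):

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled SoftMax: divide the logits by T, then
    # normalize (max-subtraction for numerical stability).
    z = np.asarray(logits, dtype=float) / T
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def nll(T, logits, labels):
    # Negative log-likelihood of the true labels at temperature T.
    p = softmax(logits, T)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def fit_temperature(val_logits, val_labels, grid=None):
    # Grid search for the T > 0 that minimizes validation NLL.
    grid = np.linspace(0.5, 5.0, 91) if grid is None else grid
    return min(grid, key=lambda T: nll(T, val_logits, val_labels))
```

Calibration then amounts to reporting `softmax(test_logits, T_fitted)` instead of the plain SoftMax; note that T > 1 flattens the output distribution without changing the argmax.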

Typically, post-calibration predictions are analysed via reliability diagrams [2,12], which illustrate the relationship between the model's prediction scores and the true correctness likelihood [13]. Reliability diagrams show the expected accuracy of the examples as a function of confidence, i.e., the maximum SoftMax value. A perfectly calibrated model traces the identity function, while any deviation from the diagonal represents a calibration error [2,12], as shown in Fig. 1a and 1b with the uncalibrated (UC) and temperature scaling (TS) predictions on the testing set. In contrast, Fig. 1c shows the distribution of scores (histogram), which, even after TS calibration, is still overconfident. Consequently, calibration does not guarantee a good balance of the prediction scores and may jeopardize an adequate probabilistic interpretation.
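The reliability-diagram computation described above can be sketched as follows; this is an illustrative implementation (binning scheme and names are our own, not the paper's), operating on max-SoftMax confidences and per-example correctness:

```python
import numpy as np

def reliability_diagram(confidences, correct, n_bins=10):
    # Bin the max-SoftMax confidences; each bin's (mean confidence,
    # observed accuracy) pair is one bar of the reliability diagram,
    # and their difference is the "gap" to the perfect diagonal.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            bins.append((confidences[mask].mean(), correct[mask].mean()))
    return bins

def expected_calibration_error(confidences, correct, n_bins=10):
    # ECE: bin-size-weighted average of |observed accuracy - mean
    # confidence|; zero for a perfectly calibrated model.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, n = 0.0, len(confidences)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.sum() / n * abs(correct[mask].mean()
                                        - confidences[mask].mean())
    return ece
```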


(a) Logit-layer scores, RGB. (b) SoftMax-layer scores, RGB. (c) Logit-layer scores, RV. (d) SoftMax-layer scores, RV.

Fig. 2: Probability density functions (pdfs), here modelled by histograms, calculated for the Logit-layer scores for the RGB (a) and RV (c) modalities. The graphs in (a,b,c,d) are organized from left to right according to the examples in the training set (positives in orange). The distributions of the SoftMax prediction scores in (b) and (d) are evidence of high confidence.

Complex networks such as multilayer perceptrons (MLPs) and CNNs are generally overconfident in the prediction phase, particularly when using the baseline SoftMax as the prediction function, generating ill-distributed outputs, i.e., values very close to zero or one [2]. Taking into account the importance of having models grounded on proper probability assumptions, which enable an adequate interpretation of the outputs and hence reliable decisions, this paper aims to contribute to the advances of multi-sensor (RGB and LiDAR) perception for autonomous vehicle systems [14–16] by using pdfs (calculated on the training data) to model the Logit-layer scores. The SoftMax is then replaced by a Maximum Likelihood (ML) or a Maximum A Posteriori (MAP) prediction layer, which provides a smoother distribution of predictive values. Note that re-training the CNN is not necessary, which makes the proposed technique practical.

2 Effects of Replacing the SoftMax Layer by a Probability Layer

The key contribution of this work is to replace the SoftMax layer (which is a "hard" normalization function) by a probabilistic layer (an ML or a MAP layer) during the testing phase. The new layers make inference based on pdfs calculated on the Logit prediction scores using the training set. The SoftMax scores are known to be overconfident (very close to zero or one); the distribution of the scores at the Logit layer, on the other hand, is far more appropriate to represent a pdf (as shown in Fig. 2). Replacing SoftMax by ML or MAP layers is therefore more adequate for probabilistic inference, permitting decision-making under uncertainty, which is particularly relevant in autonomous driving and robotic perception systems.


(a) SoftMax-layer scores: RGB. (b) SoftMax-layer scores: RV. (c) ML-layer scores: RGB. (d) ML-layer scores: RV. (e) MAP-layer scores: RGB. (f) MAP-layer scores: RV.

Fig. 3: Prediction scores (i.e., the network outputs), on the testing set, using SoftMax (baseline solution), ML, and MAP layers, for the RGB and LiDAR (RV) modalities.

Let X_i^L be the output score vector³ of the CNN at the Logit layer for example i, let C_i be the target class, and let P(X_i^L | C_i) be the class-conditional probability to be modelled in order to make probabilistic predictions. In this paper, a non-parametric pdf estimate, using histograms with 25 bins (for the RGB case) and 35 bins (for the RV model), was computed over the predicted scores of the Logit layer, on the training set, to estimate P(X^L | C). Assuming the priors are uniform and identically distributed over the set of classes C, an ML prediction is straightforwardly calculated by normalizing P(X_i^L | C_i) by P(X_i) during the prediction phase. Additionally, to avoid 'zero' probabilities and to incorporate some uncertainty in the final prediction, we apply additive smoothing (with a factor equal to 0.01) before calculating the posteriors. Alternatively, a MAP layer can be used by considering, for instance, the prior as modelled by a Gaussian distribution; the i-th posterior then becomes P(C_i | X_i^L) = P(X_i^L | C_i) P(C_i) / P(X_i), where P(C_i) ~ N(µ, σ²), with the mean µ and variance σ² calculated, per class, from the training set. The rationale for using Normal-distributed priors is that, contrary to histograms or more tailored distributions, the Normal pdf fits the data more smoothly.
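The ML/MAP construction above can be sketched as follows. This is our own minimal illustration, not the authors' code: the per-dimension histogram factorization, the class API, and all names are assumptions; only the RGB bin count (25) and the additive-smoothing factor (0.01) follow the paper.

```python
import numpy as np

class HistogramLayer:
    """Class-conditional pdfs of Logit scores, one 1-D histogram per
    (class, logit dimension), estimated on the training set."""

    def __init__(self, n_bins=25, smoothing=0.01):
        self.n_bins, self.smoothing = n_bins, smoothing

    def fit(self, logits, labels):
        self.classes_ = np.unique(labels)
        self.edges_ = np.linspace(logits.min(), logits.max(),
                                  self.n_bins + 1)
        # hist_[c][d] approximates p(logit_d | class c).
        self.hist_ = {}
        for c in self.classes_:
            rows = logits[labels == c]
            self.hist_[c] = [
                np.histogram(rows[:, d], bins=self.edges_, density=True)[0]
                for d in range(logits.shape[1])
            ]
        return self

    def _likelihood(self, x):
        # Look up each logit dimension's bin; additive smoothing
        # avoids zero probabilities for unseen score regions.
        idx = np.clip(np.searchsorted(self.edges_, x) - 1,
                      0, self.n_bins - 1)
        return np.array([
            np.prod([self.hist_[c][d][idx[d]] + self.smoothing
                     for d in range(len(x))])
            for c in self.classes_
        ])

    def predict_ml(self, x):
        # ML: uniform priors, so normalizing by the evidence P(x)
        # reduces to normalizing the likelihoods.
        lik = self._likelihood(x)
        return lik / lik.sum()

    def predict_map(self, x, priors):
        # MAP: posterior proportional to likelihood times prior.
        post = self._likelihood(x) * np.asarray(priors)
        return post / post.sum()
```

For MAP, the paper models P(C_i) with a Gaussian fitted per class; here the prior is left as a caller-supplied vector to keep the sketch self-contained.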

3 Evaluation and Discussion

³ The dimensionality of X is proportional to the number of classes.


Table 1: Classification performance (%) in terms of average F-score and FPR for the baseline (SM) models compared to the proposed ML and MAP layers. The performance measures on the 'unseen' dataset are the average and the variance of the prediction scores.

Modality:           SM_RGB  ML_RGB  MAP_RGB  SM_RV  ML_RV  MAP_RV
F-score              95.89   94.85    95.04  89.48  88.09   87.84
FPR                   1.60    1.19     1.14   3.05   2.22    2.33
Ave.Scores_unseen     0.983   0.708    0.397  0.970  0.692   0.394
Var.Scores_unseen     0.005   0.025    0.004  0.010  0.017   0.003

(a) SoftMax (SM), ML, and MAP scores on the RGB unseen set. (b) SoftMax (SM), ML, and MAP scores on the RV unseen set.

Fig. 4: Prediction scores, on the unseen data (comprising non-trained classes: 'person sit.', 'tram', 'trees/trunks', 'truck', 'vans'), for the networks using the SoftMax layer (left), and the proposed ML (center) and MAP (right) layers.

In this work, the CNN is modelled by Inception V3. The classes of interest are pedestrians, cars, and cyclists; the classification dataset is based on the KITTI 2D object dataset [15], and the numbers of training examples are {2827, 18103, 1025} for 'ped', 'car', and 'cyc', respectively. The testing set comprises {1346, 8620, 488} examples, respectively.

The output scores of the CNN indicate a degree of certainty of the given prediction. This "certainty level" can be defined as the confidence of the model and, in a classification problem, represents the maximum value within the SoftMax layer, i.e., equal to one for the target class in the ideal case. However, the output scores may not always represent a reliable indication of certainty with regard to the target class, especially when unseen or non-trained examples/objects occur in the prediction stage; this is particularly relevant for real-world applications involving autonomous robots and vehicles, since unpredictable objects are highly likely to be encountered. With this in mind, in addition to the trained classes ('ped', 'car', 'cyc'), a set of untrained objects is introduced: 'person sit.', 'tram', 'truck', 'vans', and 'trees/trunks', comprising {222, 511, 1094, 2914, 45} examples, respectively. All classes with the exception of 'trees/trunks' come directly from the aforementioned KITTI dataset, while the latter is additionally introduced by this study. The rationale is to evaluate the prediction confidence of the networks on objects that do not belong to any of the trained classes, and consequently to assess the consistency of the models. Ideally, if the classifiers were perfectly consistent in terms of probabilistic interpretation, the prediction scores would be identical (equal to 1/3) for all the examples in the unseen set, on a per-class basis.
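The consistency criterion above (ideal score of 1/3 on objects of non-trained classes) suggests a simple check, sketched here with illustrative names matching the Table 1 measures:

```python
import numpy as np

def unseen_consistency(scores):
    # scores: (N, K) prediction vectors for objects of non-trained
    # classes. A perfectly consistent probabilistic classifier would
    # output 1/K everywhere; report the mean and variance of the
    # maximum score, plus the ideal uniform value for comparison.
    conf = scores.max(axis=1)
    return conf.mean(), conf.var(), 1.0 / scores.shape[1]
```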

Results on the testing set are shown in Table 1 in terms of the F-score metric and the average FPR of the prediction scores (classification errors). The average (Ave.Scores_unseen) and the sample variance (Var.Scores_unseen) of the predicted scores are also shown for the unseen testing set.

To summarise, the proposed probabilistic approach shows promising results, since ML and MAP reduce classifier overconfidence, as can be observed in Figures 3c, 3d, 3e, and 3f. With reference to Table 1, the FPR values are considerably lower than those produced by the SoftMax (baseline) function. Finally, to assess classifier robustness, i.e., the uncertainty of the model when predicting examples of classes not trained by the network, we consider a testing set comprised of 'new' objects. Overall, the results are encouraging, since the distributions of the predictions are not pushed to the extremes, as can be observed in Fig. 4. Quantitatively, the average scores of the networks using ML and MAP layers are significantly lower than those of the SoftMax approach, and thus the models are less confident on new/unseen negative objects.

References

1. Dong Su, Huan Zhang, Hongge Chen, Jinfeng Yi, Pin-Yu Chen, and Yupeng Gao. Is robustness the cost of accuracy? A comprehensive study on the robustness of 18 deep image classification models. In ECCV, 2018.

2. Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 1321–1330, 2017.

3. Gabriel Pereyra, George Tucker, Jan Chorowski, Lukasz Kaiser, and Geoffrey E. Hinton. Regularizing neural networks by penalizing confident output distributions. CoRR, arXiv:1701.06548, 2017.

4. Andrew Y. Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. In ICML '04. Association for Computing Machinery, 2004.

5. Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(56):1929–1958, 2014.

6. Hiroshi Inoue. Multi-sample dropout for accelerated training and better generalization. CoRR, abs/1905.09788, 2019.

7. Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Francis Bach and David Blei, editors, 32nd ICML, volume 37, pages 448–456, 2015.

8. Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In ACM SIGKDD International Conference on KDD, 2002.

9. John Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10, 2000.

10. Meelis Kull, Telmo Silva Filho, and Peter Flach. Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers. In 20th AISTATS, pages 623–631, 2017.

11. Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. In NIPS, 2015.

12. Meelis Kull, Miquel Perello Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, and Peter Flach. Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with Dirichlet calibration. In Advances in Neural Information Processing Systems 32, pages 12316–12326, 2019.

13. Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In ICML, pages 625–632, 2005.

14. Manuel Martin, Alina Roitberg, Monica Haurilet, Matthias Horne, Simon Reiss, Michael Voit, and Rainer Stiefelhagen. Drive&Act: A multi-modal dataset for fine-grained driver behavior recognition in autonomous vehicles. In ICCV, 2019.

15. A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (IJRR), 32(11), 2013.

16. G. Melotti, C. Premebida, and N. Gonçalves. Multimodal deep-learning for object recognition combining camera and LIDAR data. In IEEE ICARSC, 2020.