Available via license: CC BY 4.0


PIVEN: A Deep Neural Network for Prediction Intervals with Specific Value Prediction

Eli Simhayev

Ben-Gurion University of the Negev

Beer-Sheva, Israel

elisim@post.bgu.ac.il

Gilad Katz

Ben-Gurion University of the Negev

Beer-Sheva, Israel

giladkz@post.bgu.ac.il

Lior Rokach

Ben-Gurion University of the Negev

Beer-Sheva, Israel

liorrk@post.bgu.ac.il

Abstract

Improving the robustness of neural nets in regression tasks is key to their application

in multiple domains. Deep learning-based approaches aim to achieve this goal either

by improving the manner in which they produce their prediction of speciﬁc values

(i.e., point prediction), or by producing prediction intervals (PIs) that quantify

uncertainty. We present PIVEN, a deep neural network for producing both a PI and

a prediction of speciﬁc values. Benchmark experiments show that our approach

produces tighter uncertainty bounds than the current state-of-the-art approach for

producing PIs, while managing to maintain comparable performance to the state-

of-the-art approach for speciﬁc value-prediction. Additional evaluation on large

image datasets further supports our conclusions.

1 Introduction

Deep neural networks (DNNs) have been achieving state-of-the-art results in a large variety of

complex problems. These include automated decision making and recommendation systems in the

medical domain [25], autonomous control of drones [12] and self driving cars [3]. In many of these

domains, it is crucial not only that the prediction made by the DNN is accurate, but also that its

uncertainty is quantiﬁed. Quantifying uncertainty has many beneﬁts, including risk reduction and the

ability to plan in a more reliable fashion [15].

For regression problems, uncertainty is quantiﬁed by the creation of prediction intervals (PIs), which

offer upper and lower bounds on the value of a data point for a given probability (e.g., 95% or 99%).

Existing non-Bayesian methods for PI generation can be roughly divided into two groups: a) carrying
out multiple runs of the regression problem (e.g., dropout [6], ensemble-based methods [18]) and

deriving the PI from the prediction variance in a post-hoc manner, and; b) the use of dedicated

architectures for the generation of the PI, which produce the upper and lower bounds of the PI.

While effective, each approach has limitations. On the one hand, the ensemble-based approaches

produce a speciﬁc value for the regression problem (i.e., a point prediction), but they are not optimized

for PI construction. This lack of a PI makes the use of such approaches difﬁcult in domains such

as ﬁnancial risk mitigation or maintenance scheduling. For example, providing a PI for the number

of days a machine can function without malfunctioning (e.g., 30-45 days with 99% certainty) is

more valuable than a prediction for the speciﬁc time of failure. On the other hand, PI-dedicated

architectures [22, 31] provide accurate upper and lower bounds for the prediction, but do not provide

Preprint. Under review.

arXiv:2006.05139v1 [cs.LG] 9 Jun 2020

a method for speciﬁcally selecting a value within the interval. As a result, these approaches choose

the middle of the interval as their value prediction, which is a sub-optimal strategy as it makes

assumptions regarding the value distribution within the interval. The shortcomings of this approach

to value prediction are supported by [22], as well as by our own experiments in Section 5.

In this study we propose PIVEN (prediction intervals with specific value prediction), a novel

approach for uncertainty modeling using DNNs. Our approach combines the beneﬁts of the two

types of approaches described above by producing both a PI and a value prediction. We follow

the experimental procedure of recent works, and compare our approach to current best-performing

methods: Quality-Driven PI method (QD) [22] (a dedicated PI generation architecture), and Deep
Ensembles (DE) [18]. The results of our evaluation show that PIVEN outperforms QD by producing

narrower PIs, while simultaneously achieving comparable results to DE in terms of value prediction.

2 Related Work

2.1 Uncertainty Modeling in Data

In the ﬁeld of uncertainty modeling, one considers two types of uncertainty: a) Aleatoric uncertainty,

which captures noise inherent in the observations, and; b) epistemic uncertainty, which accounts for

uncertainty in the model parameters – thus capturing our ignorance about the correctness of the model

generated from our collected data. Overall uncertainty $\sigma_y^2$ can therefore be modeled as $\sigma_y^2 = \sigma_f^2 + \sigma_\xi^2$, where $\sigma_f^2$ denotes epistemic uncertainty and $\sigma_\xi^2$ denotes aleatoric uncertainty. Aleatoric uncertainty can further be categorized into homoscedastic uncertainty, where $\sigma_\xi^2$ is constant for different inputs, and heteroscedastic uncertainty, where $\sigma_\xi^2$ is dependent on the inputs to the model, with some inputs potentially being more noisy than others. In this work we quantify uncertainty using PIs, which by definition quantify $\sigma_y^2$, whereas confidence intervals (CIs) quantify only $\sigma_f^2$. Therefore, PIs are necessarily wider than CIs.

2.2 Modeling Uncertainty in Regression Problems

Enabling deep learning algorithms to cope with uncertainty has been an active area of research in

recent years [22, 24, 6, 18, 5, 13, 7, 29]. Studies of uncertainty modeling in regression can be roughly divided into two groups: sampling-based and PI-based.

Sampling-based approaches initially utilized Bayesian neural networks [19], in which a prior distribution was defined on the weights and biases of a neural net (NN), and a posterior distribution was then inferred from the training data. The main shortcomings of these approaches were their heavy computational costs and the fact that they were difficult to implement. Subsequently, non-Bayesian methods [6, 18, 24] were proposed. In [6], Monte Carlo sampling was used to estimate the predictive uncertainty of NNs through the use of dropout over multiple runs. A later study [18] employed a combination of ensemble learning and adversarial training to quantify data uncertainty. In an expansion of a previously-proposed approach [21], each NN was optimized to learn the mean and variance of the data, assuming a Gaussian distribution. In a recent study [24], the authors proposed a post-hoc procedure using Gaussian processes to measure the uncertainty of the predictions of NN regressors.

PI-based approaches, whose aim is to explicitly produce a PI for each analyzed sample, belong to a field of research that has been gaining popularity in recent years. In [14], the authors propose a post-processing approach that considers the regression problem as one of classification, and uses the output of the final softmax layer to produce PIs. Another recent study [31] proposed the use of a loss function designed to learn all conditional quantiles of a given target variable. Khosravi et al. [15] proposed a method called LUBE, which consists of a loss function optimized for the creation of PIs but has the caveat of not being able to use stochastic gradient descent (SGD) for its optimization. Finally, a recent study [22] inspired by LUBE proposed a loss function that is both optimized for the generation of PIs and can be optimized using SGD.

Each of the two groups presented above tends to under-perform when applied to tasks for which

its loss function was not optimized: sampling-based approaches, which are optimized to produce

value predictions, tend to produce PIs of lesser accuracy than those of the PI-based methods, which

are optimized to produce tight PIs, and vice versa. Recent studies [17, 26] attempted to produce both value predictions and PIs by using conformal prediction with quantile regression. While effective, these methods use a complex splitting strategy, where one part of the data is used to produce value predictions and PIs, while the other part is used to further adjust the PIs. Contrary to these approaches, PIVEN produces PIs with value predictions in an end-to-end manner by relying on a novel loss function.

3 Problem Formulation

In this work we consider a neural network regressor that processes an input $x \in \mathcal{X}$ with an associated label $y \in \mathbb{R}$, where $\mathcal{X}$ can be any feature space (e.g., tabular data, age prediction from images). Let $(x_i, y_i) \in \mathcal{X} \times \mathbb{R}$ be a data point along with its target value. Let $U_i$ and $L_i$ be the upper and lower bounds of the PI corresponding to the $i$th sample. Our goal is to construct $(L_i, U_i, y_i)$ such that $\Pr(L_i \le y_i \le U_i) \ge 1 - \alpha$. We refer to $1 - \alpha$ as the confidence level of the PI. In standard regression problems, the goal is to estimate a function $f$ such that $y(x) = f(x) + \xi(x)$, where $\xi(x)$ is referred to as noise and is usually assumed to have zero mean.

Next we define two quantitative measures for the evaluation of PIs, as defined in [15]. First we define coverage as the ratio of dataset samples that fall within their respective PIs. We measure coverage using the prediction interval coverage probability (PICP) metric:

$$PICP := \frac{1}{n} \sum_{i=1}^{n} k_i \tag{1}$$

where $n$ denotes the number of samples and $k_i = 1$ if $y_i \in (L_i, U_i)$, otherwise $k_i = 0$. We now define a metric to measure the quality of the generated PIs. Naturally, we are interested in producing as tight a bound as possible while maintaining adequate coverage. We define the mean prediction interval width (MPIW) as,

$$MPIW := \frac{1}{n} \sum_{i=1}^{n} (U_i - L_i) \tag{2}$$

When combined, these metrics enable us to comprehensively evaluate the quality of generated PIs.
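The two metrics are simple to compute once the bounds are available. The following is a minimal NumPy sketch of Eqs. 1 and 2, not the authors' implementation; the function names `picp` and `mpiw` are chosen here for illustration, and treating a target that lies exactly on a bound as captured is an assumption.

```python
import numpy as np

def picp(y, lower, upper):
    """Fraction of targets that fall inside their interval (Eq. 1)."""
    y, lower, upper = (np.asarray(a, dtype=float) for a in (y, lower, upper))
    # k_i = 1 when y_i is captured; boundary points count as captured (assumption).
    k = (y >= lower) & (y <= upper)
    return float(k.mean())

def mpiw(lower, upper):
    """Mean interval width over all samples (Eq. 2)."""
    lower, upper = np.asarray(lower, dtype=float), np.asarray(upper, dtype=float)
    return float(np.mean(upper - lower))

# Toy data: the second target lies below its interval.
y = np.array([1.0, 2.0, 3.0, 4.0])
lower = np.array([0.5, 2.5, 2.0, 3.0])
upper = np.array([1.5, 3.5, 4.0, 5.0])
print(picp(y, lower, upper), mpiw(lower, upper))
```

Reporting the two numbers together matters: a method can trivially raise PICP by widening every interval, which MPIW then exposes.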

4 Method

In this section we ﬁrst deﬁne PIVEN, a deep neural architecture for the generation of both PIs and

value predictions for regression problems. We then present a suitable loss function that enables us to

train our architecture to generate the PIs for a desired conﬁdence level 1−α.

4.1 System Architecture

The proposed architecture is presented in Figure 1. It consists of three components:

• Backbone block. The main body block, consisting of a varying number of DNN layers or sub-blocks. The goal of this component is to transform the input into a latent representation that is then provided as input to the other components. It is important to note that PIVEN supports any architecture type (e.g., dense, convolutions) that can be applied to a regression problem. Moreover, pre-trained architectures can also be used seamlessly. For example, we use pre-trained VGG-16 and DenseNet architectures in our experiments.

• Upper & lower-bound heads. $L(x)$ and $U(x)$ produce the lower and upper bounds of the PI respectively, such that $\Pr(L(x) \le y(x) \le U(x)) \ge 1 - \alpha$, where $y(x)$ is the value prediction and $1 - \alpha$ is the predefined confidence level.

• Auxiliary head. The auxiliary prediction head, $v(x)$, enables us to produce a value prediction. $v(x)$ does not produce the value prediction directly, but rather produces a parameter indicating the relative weight that should be given to each of the two bounds. We derive the value prediction using,

$$y = v \cdot U + (1 - v) \cdot L \tag{3}$$

where $v \in (0, 1)$. By expressing the value prediction as a function of the auxiliary head and the two bound heads, we bind the three heads together and improve their performance. See Section 4.3 for details.


This architecture has several advantages compared to previous studies, particularly in terms of

robustness and the ability to represent PIs that are not uniformly distributed. We elaborate on this

subject further in Section 4.3.
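To make the role of the three heads concrete, the sketch below maps a latent representation to $L$, $U$, $v$ and the value prediction of Eq. 3. It is an illustration only: the linear heads, the weight names (`w_l`, `w_u`, `w_v`) and the use of a sigmoid to keep $v$ in $(0, 1)$ are assumptions made here, not details taken from the paper.

```python
import numpy as np

def sigmoid(z):
    # Numerically stable logistic function.
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def piven_heads(h, w_l, w_u, w_v):
    """Map latent features h of shape (n, d) to (L, U, v, y_hat).

    The heads are plain linear maps here (hypothetical); in practice they
    would sit on top of whatever backbone produces h.
    """
    lower = h @ w_l
    upper = h @ w_u
    v = sigmoid(h @ w_v)                      # relative weight, kept in (0, 1)
    y_hat = v * upper + (1.0 - v) * lower     # Eq. 3
    return lower, upper, v, y_hat

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 3))                   # stand-in for the backbone output
w_l, w_u, w_v = (rng.normal(size=3) for _ in range(3))
lower, upper, v, y_hat = piven_heads(h, w_l, w_u, w_v)
print(y_hat)
```

Because $v$ is confined to $(0, 1)$, the value prediction always lies between the two bound outputs, so it cannot drift outside the interval the network reports.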

Figure 1: The PIVEN schematic architecture

4.2 Network Optimization

Our goal is to generate narrow PIs, measured by MPIW, while maintaining the desired level of coverage, measured by $PICP = 1 - \alpha$. However, PIs that fail to capture their respective data point should not be encouraged to shrink further. We follow the derivation presented in [22] and define captured MPIW ($MPIW_{capt}$) as the MPIW of only those points for which $L_i \le y_i \le U_i$,

$$MPIW_{capt} := \frac{1}{c} \sum_{i=1}^{n} (U_i - L_i) \cdot k_i \tag{4}$$

where $c = \sum_{i=1}^{n} k_i$. Hence, we seek to minimize $MPIW_{capt}$ subject to $PICP \ge 1 - \alpha$:

$$\theta^* = \arg\min_{\theta} \left( MPIW_{capt,\theta} \right) \quad \text{s.t.} \quad PICP_{\theta} \ge 1 - \alpha$$

where $\theta$ is the parameters of the neural net. To enforce the coverage constraint, we utilize a variant of the well-known Interior Point Method (IPM) [23], resulting in an unconstrained loss:

$$\mathcal{L}_{PI} = MPIW_{capt,\theta} + \sqrt{n} \cdot \lambda \, \Psi(1 - \alpha - PICP_{\theta})$$

$$\Psi(x) := \max(0, x)^2$$

where $\lambda$ is a hyperparameter controlling the relative importance of width vs. coverage, $\Psi$ is a quadratic penalty function, and $n$ is the batch size. We include dependency on batch size in the loss since a larger sample size increases confidence in the value of PICP, thus increasing the loss. In practice, optimizing the loss with the discrete version of $k$ (see eq. 4) fails to converge, because the gradient is always positive for all possible values. We therefore define a continuous version of $k$, denoted as $k_{soft} = \sigma(s \cdot (y - L)) \cdot \sigma(s \cdot (U - y))$, where $\sigma$ is the sigmoid function, and $s > 0$ is a softening factor. The final version of $\mathcal{L}_{PI}$ uses the continuous and discrete versions of $k$ in its calculations of the $PICP$ and $MPIW_{capt}$ metrics, respectively. By doing so, it discourages the PIs from shrinking further when failing to capture their respective data points.
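The interval loss described above can be sketched in a few lines. The version below is a forward-pass illustration in NumPy rather than a trainable implementation (a real one would be written in an autodiff framework); the default values of `lam` and `s` are arbitrary placeholders, and the small constant guarding against $c = 0$ is an assumption added for numerical safety.

```python
import numpy as np

def sigmoid(z):
    # Stable logistic function; avoids overflow for large |z|.
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def pi_loss(y, lower, upper, alpha=0.05, lam=15.0, s=160.0):
    """Interval loss: captured MPIW plus a quadratic coverage penalty."""
    y, lower, upper = (np.asarray(a, dtype=float) for a in (y, lower, upper))
    n = y.shape[0]

    # Discrete k is used for the width term (Eq. 4).
    k_hard = ((y >= lower) & (y <= upper)).astype(float)
    c = k_hard.sum()
    mpiw_capt = np.sum((upper - lower) * k_hard) / (c + 1e-6)  # guard c = 0

    # Continuous k is used for the coverage term.
    k_soft = sigmoid(s * (y - lower)) * sigmoid(s * (upper - y))
    picp_soft = k_soft.mean()

    penalty = max(0.0, (1.0 - alpha) - picp_soft) ** 2         # Psi
    return float(mpiw_capt + np.sqrt(n) * lam * penalty)

# Intervals that capture every target pay only for their width ...
print(pi_loss([0.0, 0.0], [-1.0, -1.0], [1.0, 1.0]))
# ... while intervals that miss every target pay the coverage penalty.
print(pi_loss([5.0, 5.0], [-1.0, -1.0], [1.0, 1.0]))
```

Note that the missed points contribute nothing to the width term, which is exactly why the loss does not push failed intervals to shrink further.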

Neural networks optimized by the abovementioned objective are able to generate well-calibrated PIs,

but they disregard the original value prediction task. This omission has two signiﬁcant drawbacks:

• Overfitting. The $MPIW_{capt}$ term in $\mathcal{L}_{PI}$, as defined in [22], focuses only on the fraction $c$ of the training set where the data points are successfully captured by the PI. As a result, the network is likely to overfit to a subset of the data. Our reasoning is supported by our experiments in Section 5 and Appendix C.

• Lack of value prediction. In its current form, a network trained with $\mathcal{L}_{PI}$ is not able to perform value prediction, i.e., returning a specific prediction for the regression problem. To overcome this limitation, one can return the middle of the PI, as done in [22, 31]. This approach sometimes yields sub-optimal results, as it is based on assumptions regarding the distribution of the data. These assumptions do not always hold, as we show in our experiments in Section 5.4.


We propose a novel loss function that combines the generation of both PIs and value predictions. To optimize the output of $v(x)$ (the auxiliary head), we minimize the standard regression loss,

$$\mathcal{L}_v = \frac{1}{n} \sum_{i=1}^{n} \ell(y(x_i), y_i) \tag{5}$$

where $\ell$ is a regression objective against the ground-truth, and $y(x_i) = v_i \cdot U_i + (1 - v_i) \cdot L_i$. Our final loss function is a convex combination of $\mathcal{L}_{PI}$ and the auxiliary loss $\mathcal{L}_v$. Thus, the overall training objective is:

$$\mathcal{L}_{PIVEN} = \beta \mathcal{L}_{PI} + (1 - \beta) \mathcal{L}_v \tag{6}$$

where $\beta$ is a hyperparameter that balances the two goals of our approach: producing narrow PIs and accurate value predictions. To quantify epistemic uncertainty, we employ an ensemble of different networks with parameter resampling, as proposed in [18]. Given an ensemble of $m$ NNs trained with $\mathcal{L}_{PIVEN}$, let $\tilde{U}$, $\tilde{L}$ represent the ensemble's upper and lower estimate of the PI, and let $\tilde{v}$ represent the ensemble's auxiliary prediction. We calculate model uncertainty and use the ensemble to generate the PIs and $\tilde{v}$ as follows:

$$\bar{U}_i = \frac{1}{m} \sum_{j=1}^{m} U_{ij} \tag{7}$$

$$\sigma^2_{model} = \sigma^2_{U_i} = \frac{1}{m - 1} \sum_{j=1}^{m} (U_{ij} - \bar{U}_i)^2 \tag{8}$$

$$\tilde{U}_i = \bar{U}_i + z_{\alpha/2} \cdot \sigma_{U_i} \tag{9}$$

$$\tilde{v}_i = \frac{1}{m} \sum_{j=1}^{m} v_{ij} \tag{10}$$

where $U_{ij}$ and $v_{ij}$ represent the upper bound of the PI and the auxiliary prediction for data point $i$, for NN $j$. A similar procedure is followed for $\tilde{L}_i$, subtracting $z_{\alpha/2} \cdot \sigma_{L_i}$, where $z_{\alpha/2}$ is the Z score for a confidence level $1 - \alpha$.
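The ensemble aggregation of Eqs. 7 to 10 amounts to a mean and an unbiased standard deviation across networks. The sketch below assumes the per-network outputs are stacked in arrays of shape `(m, n)`; the function name `aggregate_ensemble` and the default `z = 1.96` (the usual Z score for a 95% confidence level) are choices made for this illustration.

```python
import numpy as np

def aggregate_ensemble(lower, upper, v, z=1.96):
    """Combine per-network outputs of shape (m, n): m networks, n data points."""
    lower, upper, v = (np.asarray(a, dtype=float) for a in (lower, upper, v))

    upper_bar = upper.mean(axis=0)                  # Eq. 7
    sigma_u = upper.std(axis=0, ddof=1)             # square root of Eq. 8
    upper_tilde = upper_bar + z * sigma_u           # Eq. 9

    # Same procedure for the lower bound, subtracting instead of adding.
    lower_tilde = lower.mean(axis=0) - z * lower.std(axis=0, ddof=1)

    v_tilde = v.mean(axis=0)                        # Eq. 10
    return lower_tilde, upper_tilde, v_tilde

# Two networks, two data points.
upper = np.array([[1.0, 2.0], [3.0, 2.0]])
lower = np.zeros((2, 2))
v = np.array([[0.2, 0.4], [0.4, 0.6]])
print(aggregate_ensemble(lower, upper, v))
```

Where the networks disagree on a bound (the first data point above), the aggregated interval widens; where they agree, it is left unchanged.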

4.3 Discussion of contributions

PIVEN is different from previous studies in two important aspects. First, our approach is the ﬁrst

to propose an integrated architecture capable of producing both PIs and exact value predictions.

Moreover, since the auxiliary head produces predictions for all training set samples, it prevents

PIVEN from overﬁtting to only the data points which were contained in their respective PIs (a

possible problem for studies such as [22, 31]), thus increasing the robustness of our approach.
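The point about training on all samples can be seen directly in the combined objective of Eq. 6. In the sketch below, the interval loss is passed in as a precomputed number (`l_pi`), and the regression objective $\ell$ is taken to be squared error; both are simplifying assumptions for illustration, as is the default `beta = 0.5`.

```python
import numpy as np

def piven_loss(l_pi, y, lower, upper, v, beta=0.5):
    """Eq. 6, with the regression objective chosen to be squared error."""
    y, lower, upper, v = (np.asarray(a, dtype=float) for a in (y, lower, upper, v))
    y_hat = v * upper + (1.0 - v) * lower       # Eq. 3
    l_v = np.mean((y_hat - y) ** 2)             # Eq. 5: averaged over all n samples
    return float(beta * l_pi + (1.0 - beta) * l_v)

# The second target (5.0) lies outside its interval [0, 2]. It is ignored by the
# captured-width term of the interval loss, but still contributes to l_v.
print(piven_loss(l_pi=2.0, y=[1.0, 5.0], lower=[0.0, 0.0],
                 upper=[2.0, 2.0], v=[0.5, 0.5]))
```

Here the uncaptured sample dominates the auxiliary term, so the gradient of the combined loss still carries information about it.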

The second differentiating aspect of PIVEN with respect to previous work is its method for producing the value prediction. While previous studies provided either the middle of the PI [22, 31] or the mean of the predicted distribution [18] as their value predictions, PIVEN's auxiliary head can produce any value within the PI as its prediction. By expressing the value prediction as a function of the upper and lower bounds, we ensure that the three heads are synchronized. Finally, this representation enables us to produce value predictions that are not in the middle of the interval, thus creating representations that are more characteristic of many real-world cases, where the PI is not necessarily uniformly distributed. Our experiments, presented in Sections 5.4 and 5.5, support our conclusions.
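A toy example may help show why the middle of the interval can be a poor value prediction. The sketch below is not from the paper: it uses synthetic, right-skewed targets (an exponential distribution, chosen arbitrarily), an unconditional 95% interval taken from empirical quantiles, and a single constant $v$, and compares the midpoint ($v = 0.5$) with the $v$ that minimizes squared error.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(scale=1.0, size=100_000)       # right-skewed synthetic targets

lower, upper = np.quantile(y, [0.025, 0.975])      # an empirical 95% interval
midpoint = 0.5 * (lower + upper)                   # value prediction with v = 0.5

# For a constant prediction, squared error is minimized by the mean,
# which corresponds to this v in Eq. 3.
v_best = (y.mean() - lower) / (upper - lower)
y_hat = v_best * upper + (1.0 - v_best) * lower

rmse_mid = np.sqrt(np.mean((y - midpoint) ** 2))
rmse_best = np.sqrt(np.mean((y - y_hat) ** 2))
print(f"v_best={v_best:.3f}  rmse_mid={rmse_mid:.3f}  rmse_best={rmse_best:.3f}")
```

With skewed data the best $v$ falls well below 0.5, and the midpoint prediction has a visibly larger error; a learned, input-dependent $v$ generalizes this idea to conditional distributions.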

5 Evaluation

5.1 Datasets

UCI Datasets. To compare PIVEN to recent state-of-the-art studies [6, 9, 18, 22], we conduct our experiments on a set of benchmark datasets used by them for evaluation. This benchmark includes ten datasets from the UCI repository [2].


IMDB age estimation dataset (footnote 1). The IMDB-WIKI dataset [27] is currently the largest age-labeled

facial dataset available. Our dataset consists of 460,723 images from 20,284 celebrities, and the

regression goal is to predict the age of the person in the image. It is important to note that this dataset

is known to contain noise (i.e., aleatoric uncertainty), thus making it highly relevant to this study. We

apply the same preprocessing as in [32, 33], and refer the reader to the Appendix A for full details.

RSNA pediatric bone age dataset (footnote 2). This dataset is a popular medical imaging dataset consisting of X-ray images of children's hands [8]. The regression task is predicting one's age from one's bone image. The dataset contains 12,611 training images and 200 test set images.

While the ﬁrst group of datasets enables us to compare PIVEN’s performance to recent state-of-the-art

studies in the ﬁeld, the two latter datasets enable us to demonstrate that our approach is both scalable

and effective on multiple types of input.

5.2 Baselines

We compare our performance to two top-performing NN-based baselines from recent years:

• Quality driven PI method (QD) [22]. This approach produces prediction intervals that minimize a smooth combination of the PICP/MPIW metrics without considering the value prediction task in its objective function. Its reported results make this approach state-of-the-art in terms of PI width and coverage.

• Deep Ensembles (DE) [18]. This work combines individual conditional Gaussian distributions with adversarial training, and uses the models' variance to compute prediction intervals. Because DE outputs a distribution instead of PIs, we first convert it to PIs, and then compute PICP and MPIW (replicating the process described in [22]). Its reported results make this method one of the top performers with respect to the RMSE metric (i.e., value prediction).

By comparing PIVEN to these two baselines, we are able to evaluate its ability to simultaneously

satisfy the two main requirements for regression problems in domains with high certainty.
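As an illustration of how a Gaussian-output ensemble such as DE can be turned into a PI, the sketch below combines the per-network means and variances with the mixture moments given in [18] and then takes a symmetric interval around the combined mean. Whether this matches the conversion in [22] step for step is an assumption; the function name `de_to_interval` and the default `z = 1.96` (95% level) are chosen here.

```python
import numpy as np

def de_to_interval(mu, sigma2, z=1.96):
    """mu, sigma2: arrays of shape (m, n) with the means and variances
    predicted by m Gaussian-output networks for n data points."""
    mu = np.asarray(mu, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)

    mu_star = mu.mean(axis=0)                                   # ensemble mean
    var_star = (sigma2 + mu ** 2).mean(axis=0) - mu_star ** 2   # mixture variance
    half_width = z * np.sqrt(var_star)
    return mu_star - half_width, mu_star + half_width

# Two networks that disagree on the mean widen the resulting interval.
lo, hi = de_to_interval(mu=[[0.0], [2.0]], sigma2=[[1.0], [1.0]])
print(lo, hi)
```

The interval produced this way is always centered on the predicted mean, which is one reason such a method is not expected to yield the tightest PIs.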

5.3 Experimental Setup

Throughout our experiments, we evaluate our two baselines [22, 18] using their reported deep architectures and hyperparameters. For full experimental details, please see Appendix A. We ran our experiments using a GPU server with two NVIDIA Tesla P100 GPUs. Our code is implemented using TensorFlow and Keras [1, 4], and is made available online (footnote 3).

UCI datasets. We implemented the experimental setup proposed by [9], which was also used by our baselines. Results are averaged on 20 random 90%/10% splits of the data, except for the "Year Prediction MSD" and "Protein" datasets, which were split once and five times respectively. Our network architecture is identical to previous work [6, 9, 18, 22]: one hidden layer with ReLU activation function [20], and the Adam optimizer [16]. Input and target variables are normalized to zero mean and unit variance.
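One random 90%/10% split with normalization can be sketched as follows. The text does not say which statistics are used for normalization; computing them on the training portion only, as done here, is an assumption, and the helper name `split_and_standardize` is hypothetical.

```python
import numpy as np

def split_and_standardize(x, y, test_frac=0.1, seed=0):
    """Random train/test split, then scale inputs and targets to zero mean
    and unit variance using training-set statistics (assumption)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_test = int(round(test_frac * len(x)))
    test, train = idx[:n_test], idx[n_test:]

    x_mu, x_sd = x[train].mean(axis=0), x[train].std(axis=0) + 1e-12
    y_mu, y_sd = y[train].mean(), y[train].std() + 1e-12

    def scale(a, mu, sd):
        return (a - mu) / sd

    return (scale(x[train], x_mu, x_sd), scale(y[train], y_mu, y_sd),
            scale(x[test], x_mu, x_sd), scale(y[test], y_mu, y_sd),
            (y_mu, y_sd))  # kept so predictions can be mapped back

rng = np.random.default_rng(1)
x = rng.normal(3.0, 2.0, size=(100, 3))
y = rng.normal(size=100)
x_tr, y_tr, x_te, y_te, y_stats = split_and_standardize(x, y, seed=0)
print(x_tr.shape, x_te.shape)
```

Repeating this with 20 different seeds and averaging the metrics would mirror the protocol described above.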

IMDB age estimation dataset. We use the DenseNet architecture [10] as the backbone block, upon which we add two fully connected layers. We apply the data preprocessing used in [33, 34] (see appendix for details). We report the results for 5-fold cross validation, as the dataset has no predefined test set.

RSNA bone age dataset. We use the VGG-16 architecture [28] as the backbone block, with weights

pre-trained on ImageNet. We then add two convolutional layers followed by a dense layer. This

dataset has a predeﬁned test set of 200 images.

5.4 Evaluation Results: UCI Datasets

We use two evaluation metrics: MPIW and RMSE, with the desired coverage, measured by the PICP metric, set to 95% (as done in [22]). In terms of PI-quality, shown in Table 1, PIVEN outperforms QD in nine out of ten datasets (although it should be noted that no method reached the required PICP in two of these datasets, "Boston" and "Concrete"), while achieving equal performance in the remaining dataset. DE trails behind PIVEN and QD in most datasets, which is to be expected since this approach does not attempt to optimize MPIW.

Footnotes:
1 https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/
2 https://www.kaggle.com/kmader/rsna-bone-age
3 https://github.com/elisim/piven

Table 2 presents the RMSE metric values for all methods. It is clear that PIVEN and DE are the top

performers, with the former achieving the best results in ﬁve datasets, and the latter in four. The QD

baseline trails behind the other methods in all datasets but one (“Naval", where all methods achieve

equal performance). QD’s performance is not surprising given that the focus of the said approach is

the generation of PIs rather than value predictions.

The results of our experiments clearly show that PIVEN is capable of providing accurate value

predictions for regression problems (i.e., achieving competitive results with the top-performing DE

baseline) while achieving state-of-the-art results in uncertainty modeling by the use of PIs.

Ablation Analysis. In Section 4.3 we describe our rationale in expressing the value prediction as a function of the upper and lower bounds of the interval. To demonstrate the merits of our approach we evaluate two variants of PIVEN. In the first variant, denoted as POO (point-only optimization), we decouple the value prediction from the PI. The loss function of this variant is $\mathcal{L}_{QD} + \ell(v, y_{true})$, where $\ell$ is set to be the MSE loss. In the second variant, denoted MOI (middle of interval), the value prediction produced by the model is always the middle of the PI (in other words, $v$ is set to 0.5).

The results of our ablation study are presented in Table 3, which contains the results of the MPIW and RMSE metrics (the PICP values are identical for all variants and are therefore omitted here; they are presented in Appendix B). It is clear that the full PIVEN significantly outperforms the two other variants. This leads us to conclude that both novel aspects of our approach (the simultaneous optimization of PI-width and RMSE, and the ability to select any value on the PI as the value prediction) contribute to PIVEN's performance. Finally, it is important to note that even though their performance is inferior to PIVEN, both the POO and MOI variants outperform the QD baseline in terms of MPIW, while being equal or better for RMSE.

Table 1: Results on regression benchmark UCI datasets comparing PICP and MPIW. Best performance defined as in [22]: every approach with PICP ≥ 0.95 was defined as best for PICP. For MPIW, best performance was awarded to lowest value. If PICP ≥ 0.95 for neither, the largest PICP was best, and MPIW was only assessed if the one with larger PICP also had smallest MPIW.

| Dataset | PICP (DE) | PICP (QD) | PICP (PIVEN) | MPIW (DE) | MPIW (QD) | MPIW (PIVEN) |
|---|---|---|---|---|---|---|
| Boston | 0.87 ± 0.01 | 0.93 ± 0.01 | 0.93 ± 0.01 | 0.87 ± 0.03 | 1.15 ± 0.02 | 1.09 ± 0.01 |
| Concrete | 0.92 ± 0.01 | 0.93 ± 0.01 | 0.93 ± 0.01 | 1.01 ± 0.02 | 1.08 ± 0.01 | 1.02 ± 0.01 |
| Energy | 0.99 ± 0.00 | 0.97 ± 0.01 | 0.97 ± 0.00 | 0.49 ± 0.01 | 0.45 ± 0.01 | 0.42 ± 0.01 |
| Kin8nm | 0.97 ± 0.00 | 0.96 ± 0.00 | 0.96 ± 0.00 | 1.14 ± 0.01 | 1.18 ± 0.00 | 1.10 ± 0.00 |
| Naval | 0.98 ± 0.00 | 0.97 ± 0.00 | 0.98 ± 0.00 | 0.31 ± 0.01 | 0.27 ± 0.00 | 0.24 ± 0.00 |
| Power plant | 0.96 ± 0.00 | 0.96 ± 0.00 | 0.96 ± 0.00 | 0.91 ± 0.00 | 0.86 ± 0.00 | 0.86 ± 0.00 |
| Protein | 0.96 ± 0.00 | 0.95 ± 0.00 | 0.95 ± 0.00 | 2.68 ± 0.01 | 2.27 ± 0.01 | 2.26 ± 0.01 |
| Wine | 0.90 ± 0.01 | 0.91 ± 0.01 | 0.91 ± 0.01 | 2.50 ± 0.02 | 2.24 ± 0.02 | 2.22 ± 0.01 |
| Yacht | 0.98 ± 0.01 | 0.95 ± 0.01 | 0.95 ± 0.01 | 0.33 ± 0.02 | 0.18 ± 0.00 | 0.17 ± 0.00 |
| Year Prediction MSD | 0.95 ± NA | 0.95 ± NA | 0.95 ± NA | 2.91 ± NA | 2.45 ± NA | 2.42 ± NA |

5.5 Large-Scale Datasets

In our discussion in Section 4.3, we argue that PIVEN’s auxiliary head forces it to train on the entire

training set rather than overﬁt itself to the data points it manages to capture within their respective PIs.

We hypothesize that this advantage will become more pronounced in large and complex data, and

therefore perform an evaluation on two image datasets: bone age and age estimation. Since training

the DE approach on datasets of this size is computationally prohibitive, we instead use a dense layer

on top of the used architecture (DenseNet/VGG, see Section 5.3) that outputs value prediction using

the MSE metric. In doing so, we follow the approach used in [

24

] for similar evaluation. We refer to

this architecture as NN, because of the dense layer we add.

Our results are presented in Table 4. We use mean absolute error (MAE), which was the datasets’

chosen metric. For the IMDB age prediction dataset, results show that PIVEN outperforms both


Table 2: Evaluation results for the UCI benchmark datasets, using the RMSE metric

| Dataset | RMSE (DE) | RMSE (QD) | RMSE (PIVEN) |
|---|---|---|---|
| Boston | 2.87 ± 0.19 | 3.39 ± 0.26 | 3.13 ± 0.21 |
| Concrete | 5.21 ± 0.09 | 5.88 ± 0.10 | 5.43 ± 0.13 |
| Energy | 1.68 ± 0.06 | 2.28 ± 0.04 | 1.65 ± 0.03 |
| Kin8nm | 0.08 ± 0.00 | 0.08 ± 0.00 | 0.07 ± 0.00 |
| Naval | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 |
| Power plant | 3.99 ± 0.04 | 4.14 ± 0.04 | 4.08 ± 0.04 |
| Protein | 4.36 ± 0.02 | 4.99 ± 0.02 | 4.35 ± 0.02 |
| Wine | 0.62 ± 0.01 | 0.67 ± 0.01 | 0.63 ± 0.01 |
| Yacht | 1.38 ± 0.07 | 1.10 ± 0.06 | 0.98 ± 0.07 |
| Year Prediction MSD | 8.95 ± NA | 9.30 ± NA | 8.93 ± NA |

Table 3: Ablation analysis, comparing MPIW and RMSE. Results were analyzed as in Table 1

| Dataset | MPIW (POO) | MPIW (MOI) | MPIW (PIVEN) | RMSE (POO) | RMSE (MOI) | RMSE (PIVEN) |
|---|---|---|---|---|---|---|
| Boston | 1.09 ± 0.02 | 1.15 ± 0.02 | 1.09 ± 0.01 | 3.21 ± 0.24 | 3.39 ± 0.27 | 3.13 ± 0.21 |
| Concrete | 1.02 ± 0.01 | 1.07 ± 0.01 | 1.02 ± 0.01 | 5.55 ± 0.11 | 5.73 ± 0.10 | 5.43 ± 0.13 |
| Energy | 0.42 ± 0.01 | 0.45 ± 0.01 | 0.42 ± 0.01 | 2.16 ± 0.04 | 2.27 ± 0.04 | 1.65 ± 0.03 |
| Kin8nm | 1.13 ± 0.00 | 1.17 ± 0.00 | 1.10 ± 0.00 | 0.08 ± 0.00 | 0.08 ± 0.00 | 0.07 ± 0.00 |
| Naval | 0.24 ± 0.00 | 0.30 ± 0.02 | 0.24 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 |
| Power plant | 0.86 ± 0.00 | 0.86 ± 0.00 | 0.86 ± 0.00 | 4.13 ± 0.04 | 4.15 ± 0.04 | 4.08 ± 0.04 |
| Protein | 2.25 ± 0.01 | 2.27 ± 0.01 | 2.26 ± 0.01 | 4.78 ± 0.02 | 4.99 ± 0.01 | 4.35 ± 0.02 |
| Wine | 2.24 ± 0.01 | 2.23 ± 0.01 | 2.22 ± 0.01 | 0.64 ± 0.01 | 0.67 ± 0.01 | 0.63 ± 0.01 |
| Yacht | 0.18 ± 0.00 | 0.19 ± 0.01 | 0.17 ± 0.00 | 0.99 ± 0.07 | 1.15 ± 0.08 | 0.98 ± 0.07 |
| Year Prediction MSD | 2.42 ± NA | 2.43 ± NA | 2.42 ± NA | 9.10 ± NA | 9.25 ± NA | 8.93 ± NA |

baselines across all metrics. It is particularly noteworthy that our approach achieves both higher

coverage and tighter PIs compared to QD. We attribute the significant improvement in MPIW (17%) to the fact that this dataset has relatively high degrees of noise [33]. In the bone age dataset, PIVEN

outperforms both baselines in terms of MAE. Our approach fares slightly worse compared to QD on

the MPIW metric, but that is likely due to the higher coverage (i.e., PICP) it is able to achieve.

Our results support our hypothesis that for large and high-dimensional data (and in particular those

with high degrees of noise), PIVEN is likely to outperform previous work due to its ability to combine

value predictions with PI generation. PIVEN produces tighter PIs and places the value prediction more

accurately within the PI. A detailed analysis of the training process – in terms of training/validation

loss, MAE, PICP and MPIW – is presented in Appendix C and further supports our conclusions.
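As a quick check of the 17% figure quoted above, the relative reduction in MPIW on the IMDB age dataset can be recomputed from the mean values reported in Table 4 (QD: 3.47, PIVEN: 2.87), ignoring the reported standard deviations:

$$\frac{MPIW_{QD} - MPIW_{PIVEN}}{MPIW_{QD}} = \frac{3.47 - 2.87}{3.47} = \frac{0.60}{3.47} \approx 0.173$$

This is consistent with the stated improvement of roughly 17%.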

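Table 4: Results on the RSNA bone age and IMDB age estimation datasets

| Dataset | Method | PICP | MPIW | MAE |
|---|---|---|---|---|
| Bone age | NN | NA | NA | 18.68 |
| Bone age | PIVEN | 0.93 | 2.09 | 18.13 |
| Bone age | QD | 0.9 | 1.99 | 20.24 |
| IMDB age | NN | NA | NA | 7.08 ± 0.03 |
| IMDB age | PIVEN | 0.95 ± 0.01 | 2.87 ± 0.04 | 7.03 ± 0.04 |
| IMDB age | QD | 0.92 ± 0.01 | 3.47 ± 0.03 | 10.23 ± 0.12 |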

6 Conclusions

We present PIVEN, a novel deep architecture for addressing uncertainty. Our approach is the ﬁrst to

combine the generation of prediction intervals together with speciﬁc value predictions. By optimizing

for these two goals simultaneously we are able to produce tighter intervals while at the same time

achieving greater precision in our value predictions. Our evaluation on a set of widely accepted

benchmark datasets as well as large image datasets supports the merits of our approach. For future


work, we will consider applying PIVEN in the ﬁeld of deep reinforcement learning (DRL). While DRL

algorithms usually employ neural nets for their utility estimations, the ability to provide both a speciﬁc

value and a PI has, in our view, the potential to produce more effective exploration/exploitation

strategies. Such an improvement is particularly important in domains where exploration is expensive

or time consuming.

References

[1]

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin,

Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorﬂow: A system for large-scale machine

learning. In 12th

{

USENIX

}

Symposium on Operating Systems Design and Implementation (

{

OSDI

}

16),

pages 265–283, 2016.

[2] Arthur Asuncion and David Newman. UCI machine learning repository, 2007.

[3] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.

[4] François Chollet et al. Keras. https://keras.io, 2015.

[5] Charles Corbière, Nicolas Thome, Avner Bar-Hen, Matthieu Cord, and Patrick Pérez. Addressing failure prediction by learning model confidence. In Advances in Neural Information Processing Systems, pages 2898–2909, 2019.

[6] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.

[7] Yonatan Geifman, Guy Uziel, and Ran El-Yaniv. Bias-reduced uncertainty estimation for deep neural classifiers. arXiv preprint arXiv:1805.08206, 2018.

[8] Safwan S Halabi, Luciano M Prevedello, Jayashree Kalpathy-Cramer, Artem B Mamonov, Alexander Bilbily, Mark Cicero, Ian Pan, Lucas Araújo Pereira, Rafael Teixeira Sousa, Nitamar Abdala, et al. The RSNA pediatric bone age machine learning challenge. Radiology, 290(2):498–503, 2019.

[9] José Miguel Hernández-Lobato and Ryan Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning, pages 1861–1869, 2015.

[10] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.

[11] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[12] Elia Kaufmann, Antonio Loquercio, Rene Ranftl, Alexey Dosovitskiy, Vladlen Koltun, and Davide Scaramuzza. Deep drone racing: Learning agile flight in dynamic environments. arXiv preprint arXiv:1806.08548, 2018.

[13] Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems, pages 5574–5584, 2017.

[14] Gil Keren, Nicholas Cummins, and Björn Schuller. Calibrated prediction intervals for neural network regressors. IEEE Access, 6:54033–54041, 2018.

[15] Abbas Khosravi, Saeid Nahavandi, Doug Creighton, and Amir F Atiya. Lower upper bound estimation method for construction of neural network-based prediction intervals. IEEE Transactions on Neural Networks, 22(3):337–346, 2010.

[16] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[17] Danijel Kivaranovic, Kory D Johnson, and Hannes Leeb. Adaptive, distribution-free prediction intervals for deep neural networks. arXiv preprint arXiv:1905.10634, 2019.

[18] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402–6413, 2017.


[19] David JC MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.

[20] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.

[21] David A Nix and Andreas S Weigend. Estimating the mean and variance of the target probability distribution. In Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN’94), volume 1, pages 55–60. IEEE, 1994.

[22] Tim Pearce, Mohamed Zaki, Alexandra Brintrup, and Andy Neely. High-quality prediction intervals for deep learning: A distribution-free, ensembled approach. arXiv preprint arXiv:1802.07167, 2018.

[23] Florian A Potra and Stephen J Wright. Interior-point methods. Journal of Computational and Applied Mathematics, 124(1-2):281–302, 2000.

[24] Xin Qiu, Elliot Meyerson, and Risto Miikkulainen. Quantifying point-prediction uncertainty in neural networks via residual estimation with an I/O kernel. arXiv preprint arXiv:1906.00588, 2019.

[25] Muhammad Imran Razzak, Saeeda Naz, and Ahmad Zaib. Deep learning for medical image processing: Overview, challenges and the future. In Classification in BioApps, pages 323–350. Springer, 2018.

[26] Yaniv Romano, Evan Patterson, and Emmanuel Candes. Conformalized quantile regression. In Advances in Neural Information Processing Systems, pages 3538–3548, 2019.

[27] Rasmus Rothe, Radu Timofte, and Luc Van Gool. Deep expectation of real and apparent age from a single image without facial landmarks. International Journal of Computer Vision, 126(2-4):144–157, 2018.

[28] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[29] Jasper Snoek, Yaniv Ovadia, Emily Fertig, Balaji Lakshminarayanan, Sebastian Nowozin, D Sculley, Joshua Dillon, Jie Ren, and Zachary Nado. Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems, pages 13969–13980, 2019.

[30] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[31] Natasa Tagasovska and David Lopez-Paz. Single-model uncertainties for deep learning. In Advances in Neural Information Processing Systems, pages 6414–6425, 2019.

[32] Zichang Tan, Jun Wan, Zhen Lei, Ruicong Zhi, Guodong Guo, and Stan Z Li. Efficient group-n encoding and decoding for facial age estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(11):2610–2623, 2017.

[33] Tsun-Yi Yang, Yi-Hsuan Huang, Yen-Yu Lin, Pi-Cheng Hsiu, and Yung-Yu Chuang. SSR-Net: A compact soft stagewise regression network for age estimation. In IJCAI, volume 5, page 7, 2018.

[34] Yunxuan Zhang, Li Liu, Cheng Li, et al. Quantifying facial age by posterior of age comparisons. arXiv preprint arXiv:1708.09687, 2017.


A Experimental Setup

In this section we provide full details of our dataset preprocessing and experiments presented in the main study.

Our code is available online.⁴

A.1 Dataset Preprocessing

In addition to the ten benchmark datasets used by all recent studies in the ﬁeld, we evaluated PIVEN on two

large image datasets. Due to the size of the datasets and the nature of the domain, preprocessing was required.

We provide the full details of the process below.

UCI datasets.

For the UCI datasets, we used the experimental setup proposed by [9], which was also used by the two baselines described in this study. Results for all datasets were averaged over 20 random splits of the data, except for the "Year Prediction MSD" and "protein" datasets. Since "Year Prediction MSD" has predefined fixed splits by the provider, only one run was conducted. For "protein", 5 splits were used, as was done in previous work. We used identical network architectures to those described in [18, 6, 22, 9]: one dense layer with ReLU [20], containing 50 neurons for each network, except for the "Year Prediction MSD" and "protein" datasets, where the networks had 100 neurons. Regarding train/test split and hyperparameters, we employed the same setup as [22]: train/test folds were randomly split 90%/10%, and input and target variables were normalized to zero mean and unit variance. The softening factor was constant for all datasets, s = 160.0. For the majority of the datasets we used λ = 15.0, except for "naval", "protein", "wine" and "yacht", where λ was set to 4.0, 40.0, 30.0 and 3.0 respectively. The value of the parameter β was set to 0.5. The Adam optimizer [16] was used with exponential decay, where the learning rate and decay rate were tuned. A batch size of 100 was used for all datasets, except for "Year Prediction MSD", where the batch size was set to 1000. Five neural nets were used in each ensemble, using parameter re-sampling. The objective used to optimize v was Mean Square Error (MSE) for all datasets. We also tuned λ, the initializing variance, and the number of training epochs using early stopping. To ensure that our comparison with the state-of-the-art baselines is accurate, we first set the parameters of our neural nets so that they reproduce the results reported in [22]. We then used the same parameter configurations in our experiments of PIVEN.
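The split and normalization step described above can be sketched as follows. This is a minimal illustration and not the released code: the function name `split_and_standardize`, the synthetic data, and the choice to compute the normalization statistics on the training fold only are assumptions of this sketch.

```python
import numpy as np

def split_and_standardize(X, y, test_fraction=0.1, seed=0):
    """Randomly split into train/test folds and scale to zero mean, unit variance."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))
    n_test = int(round(test_fraction * len(X)))
    test_idx, train_idx = perm[:n_test], perm[n_test:]

    # Assumption of this sketch: statistics come from the training fold only.
    x_mean, x_std = X[train_idx].mean(axis=0), X[train_idx].std(axis=0)
    y_mean, y_std = y[train_idx].mean(), y[train_idx].std()
    x_std = np.where(x_std == 0.0, 1.0, x_std)  # guard against constant features

    def scale_x(a):
        return (a - x_mean) / x_std

    def scale_y(a):
        return (a - y_mean) / y_std

    return (scale_x(X[train_idx]), scale_y(y[train_idx]),
            scale_x(X[test_idx]), scale_y(y[test_idx]), (y_mean, y_std))

# Synthetic stand-in for a small tabular regression dataset.
rng = np.random.default_rng(42)
X = rng.normal(5.0, 3.0, size=(200, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(size=200)
X_tr, y_tr, X_te, y_te, (y_mean, y_std) = split_and_standardize(X, y)
```

Returning `(y_mean, y_std)` allows predicted bounds to be mapped back to the original target scale when needed.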

IMDB age estimation dataset.

For the IMDB dataset, we used the DenseNet architecture [10] as a feature extractor. On top of this architecture we added two dense layers with dropout. The sizes of the two dense layers were 128 and 32 neurons respectively, with a dropout factor of 0.2 and ReLU activation [20]. In the last layer, the biases of the PIs were initially set to [5.0, −5.0] for the upper and lower bounds respectively. We used data preprocessing similar to that of previous work [33, 34]: all face images were aligned using facial landmarks such as the eyes and the nose. After alignment, the face region of each image was cropped and resized to a 64 × 64 resolution. In addition, common data augmentation methods, including zooming, shifting, shearing, and flipping, were randomly activated. The Adam optimization method [16] was used for optimizing the network parameters over 90 epochs, with a batch size of 128. The learning rate was set to 0.002 initially and reduced by a factor of 0.1 every 30 epochs. Regarding loss hyperparameters, we used the standard configuration proposed in [22]: confidence level set to 0.95, softening factor set to 160.0 and λ = 15.0. For PIVEN we used the same setting, with β = 0.1. Since there was no predefined test set for this dataset, we employed 5-fold cross validation: in each split, we used 20% of the data as the test set. Additionally, 20% of the training set was designated as the validation set. The best model was obtained by minimizing the validation loss. In QD and PIVEN, we normalized ages to zero mean and unit variance.
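The learning rate schedule described above (0.002 initially, multiplied by 0.1 every 30 epochs, over 90 epochs) can be written as a small step-decay function. This is a sketch under stated assumptions: the function name `step_decay_lr` is hypothetical, epochs are taken to be 0-indexed, and the drop is assumed to take effect exactly at epochs 30 and 60.

```python
def step_decay_lr(epoch, initial_lr=0.002, drop_factor=0.1, epochs_per_drop=30):
    """Learning rate for a 0-indexed epoch under step decay."""
    # Integer division counts how many drops have occurred so far.
    return initial_lr * (drop_factor ** (epoch // epochs_per_drop))

# One learning rate per epoch for the 90-epoch run described in the text.
schedule = [step_decay_lr(e) for e in range(90)]
```

A function of this shape can be passed to a per-epoch scheduler callback in most deep learning frameworks.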

RSNA pediatric bone age dataset.

For the RSNA dataset, we used the well-known VGG-16 architecture [28] as a base model, with weights pre-trained on ImageNet. On top of this architecture, we added batch normalization [11], an attention mechanism with two CNN layers of 64 and 16 neurons each, two average pooling layers, dropout [30] with a 0.25 probability, and a fully connected layer with 1024 neurons. The activation function for the CNN layers was ReLU [20], and we used ELU for the fully connected layer. For the last layer of the PIs, we used biases of [2.0, −2.0] for the upper and lower bound initialization, respectively. We used standard data augmentation consisting of horizontal flips, vertical and horizontal shifts, and rotations. In addition, we normalized targets to zero mean and unit variance. To reduce computational costs, we downscaled input images to 384 × 384 pixels. The network was optimized using the Adam optimizer [16], with an initial learning rate of 0.01, which was reduced when the validation loss had stopped improving for 10 epochs. We trained the network for 50 epochs using a batch size of 100. For our loss hyperparameters, we used the standard configuration proposed in [22]: confidence level set to 0.95, softening factor set to 160.0 and λ = 15.0. For PIVEN we used the same setting, with β = 0.5.
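To illustrate the role of the softening factor in the loss configuration above, the sketch below contrasts a hard coverage indicator with a sigmoid-based soft indicator of the kind used in [22], as we understand that formulation. The function names and the toy values are assumptions of this sketch; only the value s = 160.0 comes from the text.

```python
import numpy as np

def sigmoid(z):
    # Clip the argument so exp() stays in a numerically safe range.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -50.0, 50.0)))

def soft_coverage(y, lower, upper, s=160.0):
    """Differentiable surrogate for the indicator lower <= y <= upper."""
    return sigmoid(s * (y - lower)) * sigmoid(s * (upper - y))

def hard_coverage(y, lower, upper):
    """Exact (non-differentiable) coverage indicator."""
    return ((y >= lower) & (y <= upper)).astype(float)

# Toy normalized targets: two inside their intervals, two outside.
y = np.array([0.0, 0.5, 1.2, -0.3])
lower = np.array([-1.0, 0.0, 0.0, 0.0])
upper = np.array([1.0, 1.0, 1.0, 1.0])
k_soft = soft_coverage(y, lower, upper)
k_hard = hard_coverage(y, lower, upper)
```

With a large s the soft indicator is close to the hard one away from the interval bounds, while still providing gradients near them.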

B Ablation analysis full results

We now present the full results of our ablation studies, including PICP, for the ablation variants:

⁴ https://github.com/elisim/piven


Table 5: Ablation analysis, comparing PICP and MPIW. Best was assessed as in Table 1.

                     PICP                                MPIW
Dataset              POO         MOI         PIVEN       POO         MOI         PIVEN
Boston               0.93 ±0.01  0.93 ±0.01  0.93 ±0.01  1.09 ±0.02  1.15 ±0.02  1.09 ±0.01
Concrete             0.93 ±0.01  0.93 ±0.01  0.93 ±0.01  1.02 ±0.01  1.07 ±0.01  1.02 ±0.01
Energy               0.97 ±0.01  0.97 ±0.00  0.97 ±0.00  0.42 ±0.01  0.45 ±0.01  0.42 ±0.01
Kin8nm               0.96 ±0.00  0.96 ±0.00  0.96 ±0.00  1.13 ±0.00  1.17 ±0.00  1.10 ±0.00
Naval                0.98 ±0.00  0.98 ±0.00  0.98 ±0.00  0.24 ±0.00  0.30 ±0.02  0.24 ±0.00
Power plant          0.96 ±0.00  0.96 ±0.00  0.96 ±0.00  0.86 ±0.00  0.86 ±0.00  0.86 ±0.00
Protein              0.95 ±0.00  0.95 ±0.00  0.95 ±0.00  2.25 ±0.01  2.27 ±0.01  2.26 ±0.01
Wine                 0.91 ±0.01  0.91 ±0.01  0.91 ±0.01  2.24 ±0.01  2.23 ±0.01  2.22 ±0.01
Yacht                0.95 ±0.01  0.95 ±0.01  0.95 ±0.01  0.18 ±0.00  0.19 ±0.01  0.17 ±0.00
Year Prediction MSD  0.95 ±NA    0.95 ±NA    0.95 ±NA    2.42 ±NA    2.43 ±NA    2.42 ±NA
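For reference, the two interval metrics reported in Table 5 can be computed as in the following minimal sketch: PICP is the fraction of targets that fall inside their interval, and MPIW is the mean interval width. The function names and the toy arrays are illustrative assumptions.

```python
import numpy as np

def picp(y, lower, upper):
    """Prediction interval coverage probability: share of targets inside [lower, upper]."""
    return float(np.mean((y >= lower) & (y <= upper)))

def mpiw(lower, upper):
    """Mean prediction interval width."""
    return float(np.mean(upper - lower))

# Toy example: three of the four targets are covered.
y = np.array([1.0, 2.0, 3.0, 4.0])
lower = np.array([0.0, 2.5, 2.0, 3.0])
upper = np.array([2.0, 3.5, 4.0, 5.0])
coverage = picp(y, lower, upper)
width = mpiw(lower, upper)
```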

Table 6: Ablation analysis comparing value prediction in terms of RMSE.

                     RMSE
Dataset              POO         MOI         PIVEN
Boston               3.21 ±0.24  3.39 ±0.27  3.13 ±0.21
Concrete             5.55 ±0.11  5.73 ±0.10  5.43 ±0.13
Energy               2.16 ±0.04  2.27 ±0.04  1.65 ±0.03
Kin8nm               0.08 ±0.00  0.08 ±0.00  0.07 ±0.00
Naval                0.00 ±0.00  0.00 ±0.00  0.00 ±0.00
Power plant          4.13 ±0.04  4.15 ±0.04  4.08 ±0.04
Protein              4.78 ±0.02  4.99 ±0.01  4.35 ±0.02
Wine                 0.64 ±0.01  0.67 ±0.01  0.63 ±0.01
Yacht                0.99 ±0.07  1.15 ±0.08  0.98 ±0.07
Year Prediction MSD  9.10 ±NA    9.25 ±NA    8.93 ±NA

C IMDB age estimation training process and robustness to outliers

C.1 Training process

In the following figures we present comparisons of the training progression for PIVEN, QD and NN on the MAE, PICP and MPIW evaluation metrics. We used 80% of the images as the training set, while the remaining 20% were used as the validation set (we did not define a test set, as we were only interested in analyzing the progression of the training). For the MAE metric, presented in Figure 2, we observe that the values for QD do not improve. This is to be expected, since QD does not consider this goal in its training process (i.e., loss function). This result further strengthens our argument that choosing the middle of the interval is often a sub-optimal strategy for value prediction. For the remaining two approaches, NN and PIVEN, we note that NN suffers from overfitting, given that the validation error is greater than the training error after convergence. This phenomenon does not occur in PIVEN, which further supports our conclusions regarding the method's robustness.
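The difference between taking the middle of the interval and learning where the value lies within it can be illustrated with a short sketch. It assumes that the value prediction is formed as a convex combination of the two bounds, weighted by a coefficient v in [0, 1]; the function names and the toy numbers are hypothetical and are not taken from the released code.

```python
import numpy as np

def midpoint_prediction(lower, upper):
    """Value prediction that always takes the middle of the interval."""
    return 0.5 * (lower + upper)

def weighted_prediction(lower, upper, v):
    """Value prediction as a convex combination of the bounds (assumed form)."""
    v = np.clip(v, 0.0, 1.0)  # keep the prediction inside the interval
    return v * upper + (1.0 - v) * lower

# Two toy age intervals; the second target is assumed to lie near the lower bound.
lower = np.array([20.0, 30.0])
upper = np.array([40.0, 50.0])
v = np.array([0.5, 0.1])
mid = midpoint_prediction(lower, upper)
weighted = weighted_prediction(lower, upper, v)
```

When v = 0.5 the two predictions coincide; any other value moves the prediction toward one of the bounds, which the midpoint rule cannot express.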

For the MPIW metric (Figure 3), PIVEN presents better performance for both the validation and training sets compared to QD. Moreover, we observe a smaller gap between the error produced by PIVEN for the two sets (validation and training), which indicates that PIVEN enjoys greater robustness and an ability to not overfit to a subset of the data. Our analysis also shows that for the PICP metric (Figure 4), PIVEN converges to higher coverage.


[Figure 2 panels: (a) MAE NN, (b) MAE QD, (c) MAE PIVEN, (d) MAE validation errors]

Figure 2: Comparison of the MAE metric in the training process. We observe that the values for QD (b) do not improve, which is expected since QD does not consider value prediction in its loss function. Moreover, we note that NN suffers from overfitting, given that the validation error is greater than the training error after convergence. This phenomenon does not affect PIVEN, thus providing an indication of its robustness.

C.2 Robustness to outliers

Since PIVEN is capable of learning from the entire dataset while QD learns only from data points which were

captured by the PI, it is reasonable to expect that the former will outperform the latter when coping with outliers.

In the IMDB age estimation dataset, we can consider images with very high or very low age as outliers. Our

analysis shows that for this subset of cases, there is a large gap in performance between PIVEN and QD. In

Figure 5 we provide several images of very young/old individuals and the results returned by the two methods.

We can observe that PIVEN copes with these outliers signiﬁcantly better.
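The point that only captured data points contribute can be made concrete with the width term of the QD objective, which, as we understand the formulation in [22], averages interval widths over captured points only. The sketch below is illustrative; the function name and the toy ages are assumptions.

```python
import numpy as np

def captured_mpiw(y, lower, upper):
    """Mean interval width over the points that fall inside their interval."""
    captured = (y >= lower) & (y <= upper)
    n_captured = captured.sum()
    if n_captured == 0:
        return 0.0  # no captured points, so the width term carries no signal
    return float(np.sum((upper - lower) * captured) / n_captured)

# The third point (age 90) is an outlier that its interval fails to capture.
y = np.array([30.0, 35.0, 90.0])
lower = np.array([25.0, 30.0, 30.0])
upper = np.array([35.0, 40.0, 45.0])
width_all = float(np.mean(upper - lower))
width_captured = captured_mpiw(y, lower, upper)

# Changing the outlier's interval without capturing it leaves the term unchanged.
width_captured_wider = captured_mpiw(y, lower, upper + np.array([0.0, 0.0, 20.0]))
```

In this toy case the uncaptured outlier has no effect on the width term, whereas a loss on the value prediction would still receive a signal from it.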


[Figure 3 panels: (a) MPIW QD, (b) MPIW PIVEN, (c) MPIW validation]

Figure 3: Comparison of the MPIW metric between QD and PIVEN in the training process. As can be seen, PIVEN significantly improves over QD, and has a smaller gap between training and validation errors.


[Figure 4 panels: (a) PICP QD, (b) PICP PIVEN, (c) PICP validation]

Figure 4: Comparison of the PICP metric between QD and PIVEN in the training process. PIVEN achieves higher coverage when the two methods converge.


Figure 5: The predictions produced for outliers (i.e., very young/old individuals) by both PIVEN and

QD for the IMDB age estimation dataset. The results for QD are on the left, results for PIVEN on the

right.
