PIVEN: A Deep Neural Network for Prediction
Intervals with Specific Value Prediction
Ben-Gurion University of the Negev
Abstract
Improving the robustness of neural nets in regression tasks is key to their application
in multiple domains. Deep learning-based approaches aim to achieve this goal either
by improving the manner in which they produce their prediction of specific values
(i.e., point prediction), or by producing prediction intervals (PIs) that quantify
uncertainty. We present PIVEN, a deep neural network for producing both a PI and
a prediction of specific values. Benchmark experiments show that our approach
produces tighter uncertainty bounds than the current state-of-the-art approach for
producing PIs, while maintaining performance comparable to the state-of-the-art
approach for specific value prediction. Additional evaluation on large
image datasets further supports our conclusions.
1 Introduction
Deep neural networks (DNNs) have been achieving state-of-the-art results in a large variety of
complex problems. These include automated decision making and recommendation systems in the
medical domain, autonomous control of drones, and self-driving cars. In many of these
domains, it is crucial not only that the prediction made by the DNN is accurate, but also that its
uncertainty is quantified. Quantifying uncertainty has many benefits, including risk reduction and the
ability to plan in a more reliable fashion.
For regression problems, uncertainty is quantiﬁed by the creation of prediction intervals (PIs), which
offer upper and lower bounds on the value of a data point for a given probability (e.g., 95% or 99%).
Existing non-Bayesian methods for PI generation can be roughly divided into two groups: a) carrying
out multiple runs of the regression problem (e.g., dropout or ensemble-based methods) and then
deriving the PI from the prediction variance in a post-hoc manner, and b) the use of dedicated
architectures for the generation of the PI, which produce the upper and lower bounds of the PI.
While effective, each approach has limitations. On the one hand, the ensemble-based approaches
produce a specific value for the regression problem (i.e., a point prediction), but they are not optimized
for PI construction. This lack of PI optimization makes the use of such approaches difficult in domains such
as financial risk mitigation or maintenance scheduling. For example, providing a PI for the number
of days a machine can function without malfunctioning (e.g., 30-45 days with 99% certainty) is
more valuable than a prediction for the specific time of failure. On the other hand, PI-dedicated
architectures provide accurate upper and lower bounds for the prediction, but do not provide
Preprint. Under review.
arXiv:2006.05139v1 [cs.LG] 9 Jun 2020
a method for speciﬁcally selecting a value within the interval. As a result, these approaches choose
the middle of the interval as their value prediction, which is a sub-optimal strategy as it makes
assumptions regarding the value distribution within the interval. The shortcomings of this approach
to value prediction are supported by prior work, as well as by our own experiments in Section 5.
In this study we propose PIVEN (prediction intervals with specific value prediction), a novel
approach for uncertainty modeling using DNNs. Our approach combines the benefits of the two
types of approaches described above by producing both a PI and a value prediction. We follow
the experimental procedure of recent works, and compare our approach to current best-performing
methods: the Quality-Driven PI method (QD), a dedicated PI generation architecture, and Deep
Ensembles (DE). The results of our evaluation show that PIVEN outperforms QD by producing
narrower PIs, while simultaneously achieving comparable results to DE in terms of value prediction.
2 Related Work
2.1 Uncertainty Modeling in Data
In the field of uncertainty modeling, one considers two types of uncertainty: a) aleatoric uncertainty,
which captures noise inherent in the observations, and b) epistemic uncertainty, which accounts for
uncertainty in the model parameters, thus capturing our ignorance about the correctness of the model
generated from our collected data. Overall uncertainty $\sigma_y^2$ can therefore be modeled as
$\sigma_y^2 = \sigma_{model}^2 + \sigma_{noise}^2$, where $\sigma_{model}^2$ denotes epistemic uncertainty and
$\sigma_{noise}^2$ denotes aleatoric uncertainty. Aleatoric uncertainty
can further be categorized into homoscedastic uncertainty, where $\sigma_{noise}^2$ is constant for different inputs,
and heteroscedastic uncertainty, where $\sigma_{noise}^2$ is dependent on the inputs to the model, with some inputs
potentially being more noisy than others. In this work we quantify uncertainty using PIs, which
by definition quantify $\sigma_y^2$, whereas confidence intervals (CIs) quantify only $\sigma_{model}^2$. Therefore, PIs are
necessarily wider than CIs.
2.2 Modeling Uncertainty in Regression Problems
Enabling deep learning algorithms to cope with uncertainty has been an active area of research in
recent years. Studies on uncertainty modeling in regression can be
roughly divided into two groups: sampling-based and PI-based.
Sampling-based approaches initially utilized Bayesian neural networks, in which a prior distribution
is defined on the weights and biases of a neural net (NN), and a posterior distribution is
then inferred from the training data. The main shortcomings of these approaches were their heavy
computational costs and the fact that they were difficult to implement. Subsequently, non-Bayesian
approaches were proposed. In one study, Monte Carlo sampling was used to estimate the predictive
uncertainty of NNs through the use of dropout over multiple runs. A later study employed
a combination of ensemble learning and adversarial training to quantify data uncertainty. In an
expansion of a previously proposed approach, each NN was optimized to learn the mean and
variance of the data, assuming a Gaussian distribution. In a recent study, the authors proposed a
post-hoc procedure using Gaussian processes to measure the uncertainty of the predictions of NNs.
PI-based approaches, whose aim is to explicitly produce a PI for each analyzed sample, belong to
a field of research that has been gaining popularity in recent years. One study proposes a
post-processing approach that considers the regression problem as one of classification, and uses the
output of the final softmax layer to produce PIs. Another recent study proposed the use of a loss
function designed to learn all conditional quantiles of a given target variable. Khosravi et al.
proposed a method called LUBE, which consists of a loss function optimized for the creation of PIs
but has the caveat of not being able to use stochastic gradient descent (SGD) for its optimization.
Finally, a recent study inspired by LUBE proposed a loss function that is both optimized for the
generation of PIs and can be optimized using SGD.
Each of the two groups presented above tends to under-perform when applied to tasks for which
its loss function was not optimized: sampling-based approaches, which are optimized to produce
value predictions, tend to produce PIs of lesser accuracy than those of the PI-based methods, which
are optimized to produce tight PIs, and vice versa. Recent studies attempted to
produce both value predictions and PIs by using conformal prediction with quantile regression. While
effective, these methods use a complex splitting strategy, where one part of the data is used to produce
value predictions and PIs, while the other part is used to further adjust the PIs. Contrary to these
approaches, PIVEN produces PIs with value predictions in an end-to-end manner by relying on novel
3 Problem Formulation
In this work we consider a neural network regressor that processes an input $x \in \mathcal{X}$ with an associated
label $y \in \mathbb{R}$, where $\mathcal{X}$ can be any feature space (e.g., tabular data, age prediction from images).
Let $(x_i, y_i) \in \mathcal{X} \times \mathbb{R}$
be a data point along with its target value. Let $U_i$ and $L_i$ be the upper and
lower bounds of the PI corresponding to the $i$th sample. Our goal is to construct $(L_i, U_i, y_i)$
so that $\Pr(L_i \le y_i \le U_i) \ge 1-\alpha$. We refer to $1-\alpha$
as the confidence level of the PI. In standard regression
problems, the goal is to estimate a function $f(x)$ such that $y(x) = f(x) + \xi(x)$, where $\xi(x)$ is referred
to as noise and is usually assumed to have zero mean.
Next we define two quantitative measures for the evaluation of PIs, following prior work. First we define
coverage as the ratio of dataset samples that fall within their respective PIs. We measure coverage
using the prediction interval coverage probability (PICP) metric:
$$PICP := \frac{1}{n}\sum_{i=1}^{n} k_i \qquad (1)$$
where $n$ denotes the number of samples and $k_i = 1$ if $L_i \le y_i \le U_i$, otherwise $k_i = 0$. We now
define a metric to measure the quality of the generated PIs. Naturally, we are interested in producing
as tight a bound as possible while maintaining adequate coverage. We define the mean prediction
interval width (MPIW) as
$$MPIW := \frac{1}{n}\sum_{i=1}^{n} (U_i - L_i) \qquad (2)$$
When combined, these metrics enable us to comprehensively evaluate the quality of generated PIs.
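The two metrics are simple to compute once the bounds are available. The following is a minimal NumPy sketch, not the authors' implementation; the function names `picp` and `mpiw` and the toy arrays are illustrative assumptions.

```python
import numpy as np

def picp(y, lower, upper):
    """Coverage: fraction of targets that fall inside their interval."""
    y, lower, upper = map(np.asarray, (y, lower, upper))
    captured = (lower <= y) & (y <= upper)   # k_i as a boolean mask
    return float(captured.mean())

def mpiw(lower, upper):
    """Mean interval width over all samples, captured or not."""
    lower, upper = np.asarray(lower), np.asarray(upper)
    return float((upper - lower).mean())

# Toy values, made up for illustration only.
y     = np.array([1.0, 2.0, 3.0, 4.0])
lower = np.array([0.5, 2.5, 2.0, 3.0])
upper = np.array([1.5, 3.5, 4.0, 5.0])

coverage = picp(y, lower, upper)   # 3 of the 4 targets are captured
width = mpiw(lower, upper)         # widths are 1, 1, 2, 2
print(coverage, width)
```

A narrow MPIW is only meaningful when PICP reaches the desired level, which is why the two numbers are reported together.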
4 Method
In this section we first define PIVEN, a deep neural architecture for the generation of both PIs and
value predictions for regression problems. We then present a suitable loss function that enables us to
train our architecture to generate the PIs for a desired confidence level 1−α.
4.1 System Architecture
The proposed architecture is presented in Figure 1. It consists of three components:
• Backbone block. The main body block, consisting of a varying number of DNN layers or
sub-blocks. The goal of this component is to transform the input into a latent representation
that is then provided as input to the other components. It is important to note that PIVEN
supports any architecture type (e.g., dense, convolutions) that can be applied to a regression
problem. Moreover, pre-trained architectures can also be used seamlessly. For example, we
use pre-trained VGG-16 and DenseNet architectures in our experiments.
• Upper & lower-bound heads. $L(x)$ and $U(x)$ produce the lower and upper bounds of
the PI respectively, such that $\Pr(L(x) \le y(x) \le U(x)) \ge 1-\alpha$, where $y(x)$ is the value
prediction and $1-\alpha$ is the predefined confidence level.
• Auxiliary head. The auxiliary prediction head, $v(x)$, enables us to produce a value prediction.
$v(x)$ does not produce the value prediction directly, but rather produces a parameter
indicating the relative weight that should be given to each of the two bounds. We derive the
value prediction using
$$y = v \cdot U + (1 - v) \cdot L \qquad (3)$$
where $v \in (0, 1)$. By expressing the output of the auxiliary head as a function of the other two
heads, we bind them together and improve their performance. See Section 4.3 for details.
This architecture has several advantages compared to previous studies, particularly in terms of
robustness and the ability to represent PIs that are not uniformly distributed. We elaborate on this
subject further in Section 4.3.
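To make the role of the three heads concrete, the sketch below applies them to a batch of latent vectors. It is an illustration under stated assumptions rather than the released code: the heads are plain linear maps, the sigmoid used to keep v in (0, 1) is our assumption, and the names `piven_heads`, `w_lower`, `w_upper` and `w_aux` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Logistic function; maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def piven_heads(h, w_lower, w_upper, w_aux):
    """Map latent vectors h of shape (n, d) to (L, U, v, y)."""
    lower = h @ w_lower                  # lower-bound head
    upper = h @ w_upper                  # upper-bound head
    v = sigmoid(h @ w_aux)               # auxiliary head (sigmoid is an assumption)
    y = v * upper + (1.0 - v) * lower    # value prediction, eq. (3)
    return lower, upper, v, y

# Random latent vectors and weights stand in for a trained backbone.
rng = np.random.default_rng(0)
h = rng.normal(size=(5, 8))
w_lower, w_upper, w_aux = (rng.normal(size=8) for _ in range(3))
lower, upper, v, y = piven_heads(h, w_lower, w_upper, w_aux)
```

Because y is a convex combination of the two bounds, it always lies between them, even for the untrained weights used here (where the "upper" output may happen to fall below the "lower" one).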
Figure 1: The PIVEN schematic architecture
4.2 Network Optimization
Our goal is to generate narrow PIs, measured by MPIW, while maintaining the desired level of
coverage, measured by $PICP = 1-\alpha$. However, PIs that fail to capture their respective data point
should not be encouraged to shrink further. We follow the derivation presented in prior work and define
captured MPIW ($MPIW_{capt}$) as the MPIW of only those points for which $L_i \le y_i \le U_i$,
$$MPIW_{capt} := \frac{1}{c}\sum_{i=1}^{n} (U_i - L_i) \cdot k_i \qquad (4)$$
where $c = \sum_{i=1}^{n} k_i$. Hence, we seek to minimize $MPIW_{capt}$ subject to $PICP \ge 1-\alpha$:
$$\theta^* = \arg\min_{\theta} \left( MPIW_{capt,\theta} \right) \quad \text{s.t.} \quad PICP_{\theta} \ge 1-\alpha$$
where $\theta$ denotes the parameters of the neural net. To enforce the coverage constraint, we utilize a variant of
the well-known Interior Point Method (IPM), resulting in an unconstrained loss:
$$\mathcal{L}_{PI} = MPIW_{capt,\theta} + \sqrt{n} \cdot \lambda \, \Psi(1 - \alpha - PICP_{\theta})$$
$$\Psi(x) := \max(0, x)^2$$
where $\lambda$ is a hyperparameter controlling the relative importance of width vs. coverage, $\Psi$ is a
quadratic penalty function, and $n$ is the batch size. We include dependency on batch size in the
loss since a larger sample size increases confidence in the value of PICP, thus increasing the loss.
In practice, optimizing the loss with the discrete version of $k$ (see eq. 4) fails to converge, because the
gradient is always positive for all possible values. We therefore define a continuous version of $k$,
denoted as $k_{soft} = \sigma(s \cdot (y - L)) \odot \sigma(s \cdot (U - y))$, where $\sigma$
is the sigmoid function and $s > 0$
is a softening factor. The final version of $\mathcal{L}_{PI}$
uses the continuous and discrete versions of $k$ in its
calculations of the PICP and $MPIW_{capt}$
metrics, respectively. By doing so, it discourages the PIs
from shrinking further when failing to capture their respective data points.
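The interplay between the discrete and the softened capture indicators is easier to follow in code. Below is a minimal NumPy sketch of the PI loss described above, evaluated on a single batch. The default values for `lam` and `s`, the guard against a batch with no captured points, and the name `pi_loss` are assumptions made for illustration, not values taken from this paper.

```python
import numpy as np

def sigmoid(z):
    """Numerically stable logistic function."""
    return np.exp(-np.logaddexp(0.0, -z))

def pi_loss(y, lower, upper, alpha=0.05, lam=15.0, s=50.0):
    """Captured width plus a quadratic penalty on missing coverage."""
    n = y.shape[0]
    # Discrete indicator k: used for the captured-width term.
    k_hard = ((lower <= y) & (y <= upper)).astype(float)
    # Softened indicator k_soft: used for the coverage term.
    k_soft = sigmoid(s * (y - lower)) * sigmoid(s * (upper - y))
    c = max(k_hard.sum(), 1.0)          # assumption: avoid division by zero
    mpiw_capt = ((upper - lower) * k_hard).sum() / c
    picp_soft = k_soft.mean()
    penalty = max(0.0, (1.0 - alpha) - picp_soft) ** 2
    return float(mpiw_capt + np.sqrt(n) * lam * penalty)

y = np.array([0.0, 1.0, 2.0])
loss_wide = pi_loss(y, y - 1.0, y + 1.0)   # every target captured, width 2
loss_miss = pi_loss(y, y + 1.0, y + 2.0)   # no target captured
print(loss_wide, loss_miss)
```

In the first call the coverage penalty vanishes and the loss equals the interval width. In the second call no interval contributes to the width term, so the loss is driven entirely by the coverage penalty and shrinking those intervals would not reduce it.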
Neural networks optimized by the abovementioned objective are able to generate well-calibrated PIs,
but they disregard the original value prediction task. This omission has two significant drawbacks:
• Overfitting. $MPIW_{capt}$, as defined in prior work, focuses only on the fraction
of the training set where the data points are successfully captured by the PI. As a result,
the network is likely to overfit to a subset of the data. Our reasoning is supported by our
experiments in Section 5 and Appendix C.
• Lack of value prediction. In its current form, a network trained with $\mathcal{L}_{PI}$ is not able to perform value prediction,
i.e., returning a specific prediction for the regression problem. To overcome this limitation,
one can return the middle of the PI, as done in prior work. This approach sometimes yields
sub-optimal results, as it is based on assumptions regarding the distribution of the data.
These assumptions do not always hold, as we show in our experiments in Section 5.4.
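We propose a novel loss function that combines the generation of both PIs and value predictions. To
optimize the output of $v(x)$ (the auxiliary head), we minimize the standard regression loss,
$$\mathcal{L}_v = \frac{1}{n}\sum_{i=1}^{n} \ell\left(y(x_i), y_i\right) \qquad (5)$$
where $\ell$ is a regression objective against the ground truth $y_i$, and
$y(x_i) = v_i \cdot U_i + (1 - v_i) \cdot L_i$. Our
final loss function is a convex combination of $\mathcal{L}_{PI}$ and the auxiliary loss $\mathcal{L}_v$. Thus, the overall
training objective is:
$$\mathcal{L}_{PIVEN} = \beta \mathcal{L}_{PI} + (1 - \beta) \mathcal{L}_v \qquad (6)$$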
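where $\beta$ is a hyperparameter that balances the two goals of our approach: producing narrow PIs and
accurate value predictions. To quantify epistemic uncertainty, we employ an ensemble of different
networks with parameter resampling, as proposed in prior work. Given an ensemble of $m$
NNs trained with $\mathcal{L}_{PIVEN}$, let $\tilde{U}$, $\tilde{L}$
represent the ensemble's upper and lower estimate of the PI, and $\tilde{v}$ the
ensemble's auxiliary prediction. We calculate model uncertainty and use the ensemble to generate
the PIs and $\tilde{v}$ as follows:
$$\bar{U}_i = \frac{1}{m}\sum_{j=1}^{m} U_{ij}, \quad \sigma_{U_i}^2 = \frac{1}{m-1}\sum_{j=1}^{m} (U_{ij} - \bar{U}_i)^2, \quad \tilde{U}_i = \bar{U}_i + z_{\alpha/2}\,\sigma_{U_i}, \quad \tilde{v}_i = \frac{1}{m}\sum_{j=1}^{m} v_{ij} \qquad (7)$$
where $(U_{ij}, v_{ij})$ represents the upper bound of the PI and the auxiliary prediction for data point
$i$, for NN $j$. A similar procedure is followed for $\tilde{L}_i$, subtracting $z_{\alpha/2}\,\sigma_{L_i}$, where $z_{\alpha/2}$ is the Z score
for a confidence level 1−α.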
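The ensemble step can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes that the ensemble bounds are the member mean widened by a z score times the sample standard deviation across members, that the auxiliary output is averaged, and that the ensemble value prediction reuses the weighted combination of the bounds. The name `aggregate_ensemble` is hypothetical.

```python
import numpy as np
from statistics import NormalDist

def aggregate_ensemble(lower, upper, v, alpha=0.05):
    """Combine member outputs of shape (m, n): m networks, n data points."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)    # two-sided z score
    # Widen each bound by the across-member spread (model uncertainty).
    u_tilde = upper.mean(axis=0) + z * upper.std(axis=0, ddof=1)
    l_tilde = lower.mean(axis=0) - z * lower.std(axis=0, ddof=1)
    v_tilde = v.mean(axis=0)
    # Assumption: the ensemble value prediction weights the ensemble bounds.
    y_tilde = v_tilde * u_tilde + (1.0 - v_tilde) * l_tilde
    return l_tilde, u_tilde, v_tilde, y_tilde

# Three identical members: no disagreement, so the bounds are not widened.
lower = np.tile([0.0, 1.0], (3, 1))
upper = np.tile([2.0, 3.0], (3, 1))
v = np.full((3, 2), 0.25)
l_t, u_t, v_t, y_t = aggregate_ensemble(lower, upper, v)
```

When the members disagree, the standard deviation terms become positive and the interval grows, which is how disagreement between networks is turned into a wider PI.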
4.3 Discussion of contributions
PIVEN is different from previous studies in two important aspects. First, our approach is the ﬁrst
to propose an integrated architecture capable of producing both PIs and exact value predictions.
Moreover, since the auxiliary head produces predictions for all training set samples, it prevents
PIVEN from overﬁtting to only the data points which were contained in their respective PIs (a
possible problem for studies such as [22, 31]), thus increasing the robustness of our approach.
The second differentiating aspect of PIVEN with respect to previous work is its method for producing
the value prediction. While previous studies either provided the middle of the PI [
] or the
] as their value predictions, PIVEN’s auxiliary head can produce any value within
the PI as its prediction. By expressing the value prediction as a function of the upper and lower
bounds, we ensure that the three heads are synchronized. Finally, this representation enables us to
produce value predictions that are not in the middle of the interval, thus creating representations
that are more characteristic of many real-world cases, where the PI is not necessarily uniformly
distributed. Our experiments, presented in Sections 5.4 and 5.5, support our conclusions.
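A small numeric example illustrates why a learned weight can beat the interval midpoint when the target is not centered in the PI. The sketch assumes MSE as the regression objective and uses a placeholder value for the PI loss; the function names and all numbers are made up for illustration.

```python
import numpy as np

def value_prediction(lower, upper, v):
    """Weighted combination of the bounds, as in eq. (3)."""
    return v * upper + (1.0 - v) * lower

def aux_loss(y, lower, upper, v):
    """Auxiliary loss with MSE as the regression objective (an assumption)."""
    return float(np.mean((value_prediction(lower, upper, v) - y) ** 2))

def piven_objective(loss_pi, loss_v, beta=0.5):
    """Convex combination of the PI loss and the auxiliary loss."""
    return beta * loss_pi + (1.0 - beta) * loss_v

# Targets sit near the lower bound of a wide interval.
y = np.full(4, 2.0)
lower = np.zeros(4)
upper = np.full(4, 10.0)

loss_mid = aux_loss(y, lower, upper, v=0.5)    # midpoint predicts 5
loss_skew = aux_loss(y, lower, upper, v=0.2)   # learned weight predicts 2
loss_pi = 10.0                                 # placeholder PI loss, same bounds in both cases
print(piven_objective(loss_pi, loss_mid), piven_objective(loss_pi, loss_skew))
```

Since the value prediction is a function of U and L, gradients of the auxiliary loss also reach the two bound heads, which is the coupling between the three heads discussed above.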
5 Evaluation
5.1 Datasets
To compare PIVEN to recent state-of-the-art studies, we conduct our
experiments on the set of benchmark datasets they used for evaluation. This benchmark includes
ten datasets from the UCI repository.
IMDB age estimation dataset. The IMDB-WIKI dataset is currently the largest age-labeled
facial dataset available. Our dataset consists of 460,723 images from 20,284 celebrities, and the
regression goal is to predict the age of the person in the image. It is important to note that this dataset
is known to contain noise (i.e., aleatoric uncertainty), thus making it highly relevant to this study. We
apply the same preprocessing as in [32, 33], and refer the reader to Appendix A for full details.
RSNA pediatric bone age dataset. This dataset is a popular medical imaging dataset consisting of
X-ray images of children's hands. The regression task is predicting one's age from one's bone
image. The dataset contains 12,611 training images and 200 test set images.
While the ﬁrst group of datasets enables us to compare PIVEN’s performance to recent state-of-the-art
studies in the ﬁeld, the two latter datasets enable us to demonstrate that our approach is both scalable
and effective on multiple types of input.
5.2 Baselines
We compare our performance to two top-performing NN-based baselines from recent years:
• Quality-driven PI method (QD). This approach produces prediction intervals that
minimize a smooth combination of the PICP/MPIW metrics without considering the value
prediction task in its objective function. Its reported results make this approach state-of-the-art
in terms of PI width and coverage.
• Deep Ensembles (DE). This work combines individual conditional Gaussian distributions
with adversarial training, and uses the models' variance to compute prediction intervals.
Because DE outputs a distribution instead of PIs, we first convert it to PIs, and then compute
PICP and MPIW (replicating the process described in prior work). Its reported results make this
method one of the top performers with respect to the RMSE metric (i.e., value prediction).
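The conversion from a predicted Gaussian to a PI can be sketched as below. The text does not spell out the exact procedure, so the symmetric central interval used here is an assumption, as is the helper name `gaussian_to_pi`.

```python
import numpy as np
from statistics import NormalDist

def gaussian_to_pi(mean, std, alpha=0.05):
    """Central (1 - alpha) interval of a Gaussian predictive distribution."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    mean = np.asarray(mean, dtype=float)
    std = np.asarray(std, dtype=float)
    return mean - z * std, mean + z * std

# A standard normal prediction gives roughly (-1.96, 1.96) at 95%.
lo_b, hi_b = gaussian_to_pi([0.0], [1.0])
print(lo_b, hi_b)
```

Once the bounds are available, PICP and MPIW are computed exactly as for the interval-based methods.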
By comparing PIVEN to these two baselines, we are able to evaluate its ability to simultaneously
satisfy the two main requirements for regression problems in domains with high uncertainty.
5.3 Experimental Setup
Throughout our experiments, we evaluate our two baselines using their reported deep
architectures and hyperparameters. For full experimental details, please see Appendix A. We ran our
experiments using a GPU server with two NVIDIA Tesla P100 GPUs. Our code is implemented using
TensorFlow and Keras [1, 4], and is made available online.
UCI datasets. We implemented the experimental setup proposed in prior work, which was also used by
our baselines. Results are averaged on 20 random 90%/10% splits of the data, except for the "Year
Prediction MSD" and "Protein" datasets, which were split once and five times respectively. Our network
architecture is identical to previous work: one hidden layer with ReLU activation
and the Adam optimizer. Input and target variables are normalized to zero mean
and unit variance.
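The split-and-normalize protocol can be sketched as a small generator. This is an illustrative reading of the setup, not the released experiment code: computing the normalization statistics on the training portion only is our assumption, and `random_splits` is a hypothetical helper name.

```python
import numpy as np

def random_splits(x, y, n_splits=20, test_frac=0.1, seed=0):
    """Yield normalized (x_train, y_train, x_test, y_test) for random splits."""
    rng = np.random.default_rng(seed)
    n = len(x)
    n_test = max(1, int(round(test_frac * n)))
    for _ in range(n_splits):
        perm = rng.permutation(n)
        test_idx, train_idx = perm[:n_test], perm[n_test:]
        # Assumption: statistics come from the training portion only.
        x_mu = x[train_idx].mean(axis=0)
        x_sd = x[train_idx].std(axis=0) + 1e-12
        y_mu = y[train_idx].mean()
        y_sd = y[train_idx].std() + 1e-12
        yield ((x[train_idx] - x_mu) / x_sd, (y[train_idx] - y_mu) / y_sd,
               (x[test_idx] - x_mu) / x_sd, (y[test_idx] - y_mu) / y_sd)

# Synthetic data stands in for a UCI dataset.
rng = np.random.default_rng(1)
x_all = rng.normal(loc=3.0, scale=2.0, size=(100, 3))
y_all = rng.normal(loc=-1.0, scale=5.0, size=100)
splits = list(random_splits(x_all, y_all))
```

Metrics such as PICP, MPIW and RMSE would then be computed per split and averaged, with the standard error reported alongside the mean.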
IMDB age estimation dataset. We use the DenseNet architecture as the backbone block, upon
which we add two fully connected layers. We apply the data preprocessing used in prior work (see the
appendix for details). We report the results for 5-fold cross validation, as the dataset has no predefined
test set.
RSNA bone age dataset. We use the VGG-16 architecture as the backbone block, with weights
pre-trained on ImageNet. We then add two convolutional layers followed by a dense layer. This
dataset has a predefined test set of 200 images.
5.4 Evaluation Results: UCI Datasets
We use two evaluation metrics: MPIW and RMSE, with the desired coverage, measured by the PICP
metric, set to 95% (as done in prior work). In terms of PI quality, shown in Table 1, PIVEN outperforms
QD in nine out of ten datasets (although it should be noted that no method reached the required
PICP in three of these datasets: "Boston", "Concrete" and "Wine"), while achieving equal performance in the
remaining dataset. DE trails behind PIVEN and QD in most datasets, which is to be expected since
this approach does not attempt to optimize MPIW.
Table 2 presents the RMSE metric values for all methods. It is clear that PIVEN and DE are the top
performers, with the former achieving the best results in five datasets, and the latter in four. The QD
baseline trails behind PIVEN in all datasets but one ("Naval", where all methods achieve
equal performance), and behind DE in all but three ("Naval" and "Kin8nm", where the two are tied, and
"Yacht", where QD is better). QD's performance is not surprising given that the focus of the said approach is
the generation of PIs rather than value predictions.
The results of our experiments clearly show that PIVEN is capable of providing accurate value
predictions for regression problems (i.e., achieving competitive results with the top-performing DE
baseline) while achieving state-of-the-art results in uncertainty modeling by the use of PIs.
Ablation analysis. In Section 4.3 we describe our rationale in expressing the value prediction as
a function of the upper and lower bounds of the interval. To demonstrate the merits of our approach we
evaluate two variants of PIVEN. In the first variant, denoted as POO (point-only optimization), we
decouple the value prediction from the PI. The loss function of this variant is
$\mathcal{L}_{QD} + \ell(v, y_{true})$, where $\ell$
is set to be MSE loss. In the second variant, denoted MOI (middle of interval), the value
prediction produced by the model is always the middle of the PI (in other words, $v$ is set to 0.5).
The results of our ablation study are presented in Table 3, which contains the results of the MPIW
and RMSE metrics (the PICP values are identical for all variants and are therefore omitted here;
they are presented in Appendix B). The full PIVEN matches or outperforms the two
other variants on nearly every dataset, with the clearest gains in RMSE. This leads us to conclude that
both novel aspects of our approach, namely the simultaneous
optimization of PI width and RMSE and the ability to select any value on the PI as the value
prediction, contribute to PIVEN's performance. Finally, it is important to note that even though its
performance is inferior to PIVEN, the POO variant matches or outperforms the QD baseline in
terms of both MPIW and RMSE, while the MOI variant performs on par with QD on most datasets.
Table 1: Results on regression benchmark UCI datasets comparing PICP and MPIW. Best performance
defined as in prior work: every approach with PICP ≥ 0.95 was defined as best for PICP. For MPIW, best
performance was awarded to lowest value. If PICP ≥ 0.95 for neither, the largest PICP was best, and
MPIW was only assessed if the one with larger PICP also had smallest MPIW.
                       PICP                                       MPIW
Dataset                DE           QD           PIVEN            DE           QD           PIVEN
Boston                 0.87 ± 0.01  0.93 ± 0.01  0.93 ± 0.01      0.87 ± 0.03  1.15 ± 0.02  1.09 ± 0.01
Concrete               0.92 ± 0.01  0.93 ± 0.01  0.93 ± 0.01      1.01 ± 0.02  1.08 ± 0.01  1.02 ± 0.01
Energy                 0.99 ± 0.00  0.97 ± 0.01  0.97 ± 0.00      0.49 ± 0.01  0.45 ± 0.01  0.42 ± 0.01
Kin8nm                 0.97 ± 0.00  0.96 ± 0.00  0.96 ± 0.00      1.14 ± 0.01  1.18 ± 0.00  1.10 ± 0.00
Naval                  0.98 ± 0.00  0.97 ± 0.00  0.98 ± 0.00      0.31 ± 0.01  0.27 ± 0.00  0.24 ± 0.00
Power plant            0.96 ± 0.00  0.96 ± 0.00  0.96 ± 0.00      0.91 ± 0.00  0.86 ± 0.00  0.86 ± 0.00
Protein                0.96 ± 0.00  0.95 ± 0.00  0.95 ± 0.00      2.68 ± 0.01  2.27 ± 0.01  2.26 ± 0.01
Wine                   0.90 ± 0.01  0.91 ± 0.01  0.91 ± 0.01      2.50 ± 0.02  2.24 ± 0.02  2.22 ± 0.01
Yacht                  0.98 ± 0.01  0.95 ± 0.01  0.95 ± 0.01      0.33 ± 0.02  0.18 ± 0.00  0.17 ± 0.00
Year Prediction MSD    0.95 ± NA    0.95 ± NA    0.95 ± NA        2.91 ± NA    2.45 ± NA    2.42 ± NA
5.5 Large-Scale Datasets
In our discussion in Section 4.3, we argue that PIVEN’s auxiliary head forces it to train on the entire
training set rather than overﬁt itself to the data points it manages to capture within their respective PIs.
We hypothesize that this advantage will become more pronounced in large and complex data, and
therefore perform an evaluation on two image datasets: bone age and age estimation. Since training
the DE approach on datasets of this size is computationally prohibitive, we instead use a dense layer
on top of the used architecture (DenseNet/VGG, see Section 5.3) that outputs a value prediction using
the MSE metric. In doing so, we follow the approach used in prior work for similar evaluation. We refer to
this architecture as NN, because of the dense layer we add.
Table 2: Evaluation results for the UCI benchmark datasets, using the RMSE metric
Dataset                DE           QD           PIVEN
Boston                 2.87 ± 0.19  3.39 ± 0.26  3.13 ± 0.21
Concrete               5.21 ± 0.09  5.88 ± 0.10  5.43 ± 0.13
Energy                 1.68 ± 0.06  2.28 ± 0.04  1.65 ± 0.03
Kin8nm                 0.08 ± 0.00  0.08 ± 0.00  0.07 ± 0.00
Naval                  0.00 ± 0.00  0.00 ± 0.00  0.00 ± 0.00
Power plant            3.99 ± 0.04  4.14 ± 0.04  4.08 ± 0.04
Protein                4.36 ± 0.02  4.99 ± 0.02  4.35 ± 0.02
Wine                   0.62 ± 0.01  0.67 ± 0.01  0.63 ± 0.01
Yacht                  1.38 ± 0.07  1.10 ± 0.06  0.98 ± 0.07
Year Prediction MSD    8.95 ± NA    9.30 ± NA    8.93 ± NA
Table 3: Ablation analysis, comparing MPIW and RMSE. Results were analyzed as in Table 1
                       MPIW                                       RMSE
Dataset                POO          MOI          PIVEN            POO          MOI          PIVEN
Boston                 1.09 ± 0.02  1.15 ± 0.02  1.09 ± 0.01      3.21 ± 0.24  3.39 ± 0.27  3.13 ± 0.21
Concrete               1.02 ± 0.01  1.07 ± 0.01  1.02 ± 0.01      5.55 ± 0.11  5.73 ± 0.10  5.43 ± 0.13
Energy                 0.42 ± 0.01  0.45 ± 0.01  0.42 ± 0.01      2.16 ± 0.04  2.27 ± 0.04  1.65 ± 0.03
Kin8nm                 1.13 ± 0.00  1.17 ± 0.00  1.10 ± 0.00      0.08 ± 0.00  0.08 ± 0.00  0.07 ± 0.00
Naval                  0.24 ± 0.00  0.30 ± 0.02  0.24 ± 0.00      0.00 ± 0.00  0.00 ± 0.00  0.00 ± 0.00
Power plant            0.86 ± 0.00  0.86 ± 0.00  0.86 ± 0.00      4.13 ± 0.04  4.15 ± 0.04  4.08 ± 0.04
Protein                2.25 ± 0.01  2.27 ± 0.01  2.26 ± 0.01      4.78 ± 0.02  4.99 ± 0.01  4.35 ± 0.02
Wine                   2.24 ± 0.01  2.23 ± 0.01  2.22 ± 0.01      0.64 ± 0.01  0.67 ± 0.01  0.63 ± 0.01
Yacht                  0.18 ± 0.00  0.19 ± 0.01  0.17 ± 0.00      0.99 ± 0.07  1.15 ± 0.08  0.98 ± 0.07
Year Prediction MSD    2.42 ± NA    2.43 ± NA    2.42 ± NA        9.10 ± NA    9.25 ± NA    8.93 ± NA
Our results are presented in Table 4. We use mean absolute error (MAE), which was the datasets'
chosen metric. For the IMDB age prediction dataset, results show that PIVEN outperforms both
baselines across all metrics. It is particularly noteworthy that our approach achieves both higher
coverage and tighter PIs compared to QD. We attribute the significant improvement in MPIW (17%)
to the fact that this dataset has relatively high degrees of noise. In the bone age dataset, PIVEN
outperforms both baselines in terms of MAE. Our approach fares slightly worse compared to QD on
the MPIW metric, but that is likely due to the higher coverage (i.e., PICP) it is able to achieve.
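The 17% figure can be checked directly from the IMDB rows of Table 4, taking the relative reduction in MPIW of PIVEN with respect to QD (the notation $MPIW_{QD}$ and $MPIW_{PIVEN}$ is introduced here only for this check):

```latex
\frac{MPIW_{QD} - MPIW_{PIVEN}}{MPIW_{QD}}
  = \frac{3.47 - 2.87}{3.47}
  = \frac{0.60}{3.47}
  \approx 0.173
```

That is a reduction of roughly 17% in interval width, obtained together with a higher PICP (0.95 versus 0.92).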
Our results support our hypothesis that for large and high-dimensional data (and in particular those
with high degrees of noise), PIVEN is likely to outperform previous work due to its ability to combine
value predictions with PI generation. PIVEN produces tighter PIs and places the value prediction more
accurately within the PI. A detailed analysis of the training process – in terms of training/validation
loss, MAE, PICP and MPIW – is presented in Appendix C and further supports our conclusions.
Table 4: Results on the RSNA bone age and IMDB age estimation datasets
Dataset           Method   PICP          MPIW          MAE
RSNA bone age     NN       NA            NA            18.68
                  PIVEN    0.93          2.09          18.13
                  QD       0.90          1.99          20.24
IMDB age          NN       NA            NA            7.08 ± 0.03
                  PIVEN    0.95 ± 0.01   2.87 ± 0.04   7.03 ± 0.04
                  QD       0.92 ± 0.01   3.47 ± 0.03   10.23 ± 0.12
6 Conclusions
We present PIVEN, a novel deep architecture for addressing uncertainty. Our approach is the first to
combine the generation of prediction intervals together with specific value predictions. By optimizing
for these two goals simultaneously we are able to produce tighter intervals while at the same time
achieving greater precision in our value predictions. Our evaluation on a set of widely accepted
benchmark datasets as well as large image datasets supports the merits of our approach. For future
work, we will consider applying PIVEN in the field of deep reinforcement learning (DRL). While DRL
algorithms usually employ neural nets for their utility estimations, the ability to provide both a specific
value and a PI has, in our view, the potential to produce more effective exploration/exploitation
strategies. Such an improvement is particularly important in domains where exploration is expensive
or time consuming.
References
[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th Symposium on Operating Systems Design and Implementation, pages 265–283, 2016.
[2] Arthur Asuncion and David Newman. UCI machine learning repository, 2007.
[3] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
[4] François Chollet et al. Keras. https://keras.io, 2015.
[5] Charles Corbière, Nicolas Thome, Avner Bar-Hen, Matthieu Cord, and Patrick Pérez. Addressing failure prediction by learning model confidence. In Advances in Neural Information Processing Systems.
[6] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
[7] Yonatan Geifman, Guy Uziel, and Ran El-Yaniv. Bias-reduced uncertainty estimation for deep neural classifiers. arXiv preprint arXiv:1805.08206, 2018.
[8] Safwan S Halabi, Luciano M Prevedello, Jayashree Kalpathy-Cramer, Artem B Mamonov, Alexander Bilbily, Mark Cicero, Ian Pan, Lucas Araújo Pereira, Rafael Teixeira Sousa, Nitamar Abdala, et al. The RSNA pediatric bone age machine learning challenge. Radiology, 290(2):498–503, 2019.
[9] José Miguel Hernández-Lobato and Ryan Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning, pages 1861–1869, 2015.
[10] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.
[11] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[12] Elia Kaufmann, Antonio Loquercio, Rene Ranftl, Alexey Dosovitskiy, Vladlen Koltun, and Davide Scaramuzza. Deep drone racing: Learning agile flight in dynamic environments. arXiv preprint.
[13] Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems, pages 5574–5584, 2017.
[14] Gil Keren, Nicholas Cummins, and Björn Schuller. Calibrated prediction intervals for neural network regressors. IEEE Access, 6:54033–54041, 2018.
[15] Abbas Khosravi, Saeid Nahavandi, Doug Creighton, and Amir F Atiya. Lower upper bound estimation method for construction of neural network-based prediction intervals. IEEE Transactions on Neural Networks, 22(3):337–346, 2010.
[16] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint.
[17] Danijel Kivaranovic, Kory D Johnson, and Hannes Leeb. Adaptive, distribution-free prediction intervals for deep neural networks. arXiv preprint arXiv:1905.10634, 2019.
[18] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems.
[19] David JC MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation.
[20] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.
[21] David A Nix and Andreas S Weigend. Estimating the mean and variance of the target probability distribution. In Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN'94), volume 1, pages 55–60. IEEE, 1994.
[22] Tim Pearce, Mohamed Zaki, Alexandra Brintrup, and Andy Neely. High-quality prediction intervals for deep learning: A distribution-free, ensembled approach. arXiv preprint arXiv:1802.07167, 2018.
[23] Florian A Potra and Stephen J Wright. Interior-point methods. Journal of Computational and Applied Mathematics, 124(1-2):281–302, 2000.
[24] Xin Qiu, Elliot Meyerson, and Risto Miikkulainen. Quantifying point-prediction uncertainty in neural networks via residual estimation with an I/O kernel. arXiv preprint arXiv:1906.00588, 2019.
[25] Muhammad Imran Razzak, Saeeda Naz, and Ahmad Zaib. Deep learning for medical image processing: Overview, challenges and the future. In Classification in BioApps, pages 323–350. Springer, 2018.
[26] Yaniv Romano, Evan Patterson, and Emmanuel Candes. Conformalized quantile regression. In Advances in Neural Information Processing Systems, pages 3538–3548, 2019.
[27] Rasmus Rothe, Radu Timofte, and Luc Van Gool. Deep expectation of real and apparent age from a single image without facial landmarks. International Journal of Computer Vision, 126(2-4):144–157, 2018.
[28] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[29] Jasper Snoek, Yaniv Ovadia, Emily Fertig, Balaji Lakshminarayanan, Sebastian Nowozin, D Sculley, Joshua Dillon, Jie Ren, and Zachary Nado. Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems, pages 13969–13980.
[30] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research.
[31] Natasa Tagasovska and David Lopez-Paz. Single-model uncertainties for deep learning. In Advances in Neural Information Processing Systems, pages 6414–6425, 2019.
[32] Zichang Tan, Jun Wan, Zhen Lei, Ruicong Zhi, Guodong Guo, and Stan Z Li. Efficient group-n encoding and decoding for facial age estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[33] Tsun-Yi Yang, Yi-Hsuan Huang, Yen-Yu Lin, Pi-Cheng Hsiu, and Yung-Yu Chuang. SSR-Net: A compact soft stagewise regression network for age estimation. In IJCAI, volume 5, page 7, 2018.
[34] Yunxuan Zhang, Li Liu, Cheng Li, et al. Quantifying facial age by posterior of age comparisons. arXiv preprint arXiv:1708.09687, 2017.
A Experimental Setup
In this section we provide full details of our dataset preprocessing and experiments presented in the main study.
Our code is available online 4
A.1 Dataset Preprocessing
In addition to the ten benchmark datasets used by all recent studies in the ﬁeld, we evaluated PIVEN on two
large image datasets. Due to the size of the datasets and the nature of the domain, preprocessing was required.
We provide the full details of the process below.
For the UCI datasets, we used the experimental setup proposed by [ ], which was also used by both
baselines described in this study. Results were averaged over 20 random splits of the data for all datasets
except “Year Prediction MSD” and “protein”. Since “Year Prediction MSD” has predefined fixed splits by the
provider, only one run was conducted. For “protein”, 5 splits were used, as was done in previous work. We used
network architectures identical to those described in [ ]: one dense layer with ReLU [ ], containing 50
neurons for each network, except for the “Year Prediction MSD” and “protein” datasets, where the NNs had 100
neurons. Regarding train/test split and hyperparameters, we employ the same setup as [ ]: train/test folds
were randomly split 90%/10%, and input and target variables were normalized to zero mean and unit variance.
The softening factor was held constant across all datasets. For the majority of the datasets we used a common
value of λ, except for “naval”, “protein”, “wine” and “yacht”, where λ was set to 4.0, 40.0, 30.0 and 3.0,
respectively. The value of the parameter β was set to 0.5. The Adam optimizer [ ] was used with exponential
decay, where the learning rate and decay rate were tuned. A batch size of 100 was used for all the datasets,
except for “Year Prediction MSD”, where the batch size was set to 1000. Five neural nets were used in each
ensemble, using parameter re-sampling. The objective used to optimize the value prediction was mean squared
error (MSE) for all datasets. We also tune λ, the initializing variance, and the number of training epochs using
early stopping. To ensure that our comparison with the state-of-the-art baselines is accurate, we first set the
parameters of our neural nets so that they produce the results reported in [ ]. We then use the same parameter
configurations in our experiments of PIVEN.
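The split-and-normalize step described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names, the single-feature toy data, and the choice to compute the normalization statistics on the training fold only are assumptions that the text does not specify.

```python
import random
import statistics


def standardize(values, mean, std):
    """Scale values to zero mean and unit variance using the given statistics."""
    return [(v - mean) / std for v in values]


def split_and_normalize(xs, ys, test_fraction=0.1, seed=0):
    """Randomly split one feature column and its targets 90%/10%, then normalize."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    rng.shuffle(indices)
    n_test = max(1, round(test_fraction * len(indices)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    out = {}
    for name, column in (("x", xs), ("y", ys)):
        train = [column[i] for i in train_idx]
        test = [column[i] for i in test_idx]
        # Assumption: statistics come from the training fold only.
        mean = statistics.fmean(train)
        std = statistics.pstdev(train) or 1.0  # avoid dividing by zero
        out[name + "_train"] = standardize(train, mean, std)
        out[name + "_test"] = standardize(test, mean, std)
    return out


# Hypothetical toy data: 20 random splits, as for most UCI datasets in the text.
xs = [float(i) for i in range(100)]
ys = [2.0 * x + 1.0 for x in xs]
runs = [split_and_normalize(xs, ys, seed=s) for s in range(20)]
```

Each entry of `runs` corresponds to one random split; reported metrics would then be averaged over the entries.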
IMDB age estimation dataset
For the IMDB dataset, we used the DenseNet architecture [ ] as a feature extractor. On top of this
architecture we added two dense layers with dropout. The sizes of the two dense layers were 128 and 32 neurons
respectively, with a dropout factor of 0.2 and ReLU activation [ ]. In the last layer, the biases of the PI
outputs were initialized to separate fixed values for the upper and lower bounds. We used data preprocessing
similar to that of previous work [ ]: all face images were aligned using facial landmarks such as the eyes and
the nose. After alignment, the face region of each image was cropped and resized to a 64 × 64 resolution. In
addition, common data augmentation methods, including zooming, shifting, shearing, and flipping, were randomly
activated. The Adam optimization method [ ] was used for optimizing the network parameters over 90 epochs,
with a batch size of 128. The learning rate was set to 0.002 initially and reduced by a factor of 0.1 every 30
epochs. Regarding loss hyperparameters, we used the standard configuration proposed in [ ]: confidence level
set to 0.95, softening factor set to 160.0, and λ taken from the same configuration. For PIVEN we used the same
setting, with β = 0.1. Since there was no predefined test set for this dataset, we employed 5-fold
cross-validation: in each split, we used 20% as the test set. Additionally, 20% of the train set was designated
as the validation set. The best model was selected by minimizing the validation loss. In QD and PIVEN, we
normalized ages to zero mean and unit variance.
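The learning-rate schedule and the cross-validation protocol for the IMDB experiments can be sketched as below. This is a hedged illustration: `step_decay_lr` and `five_fold_splits` are hypothetical helper names, epochs are assumed to be 0-indexed (so the rate drops at epochs 30 and 60), and the way indices are shuffled and assigned to folds is an assumption.

```python
import random


def step_decay_lr(epoch, initial_lr=0.002, factor=0.1, step=30):
    """Learning rate at a 0-indexed epoch: multiplied by `factor` every `step` epochs."""
    return initial_lr * factor ** (epoch // step)


def five_fold_splits(n_samples, val_fraction=0.2, n_folds=5, seed=0):
    """Yield (train, val, test) index lists; each fold holds out about 20% as test."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        rest = [i for j, fold in enumerate(folds) if j != k for i in fold]
        # 20% of the remaining (train) indices become the validation set.
        n_val = round(val_fraction * len(rest))
        yield rest[n_val:], rest[:n_val], test


# Hypothetical usage: 90 epochs of schedule values and 5 splits of 100 samples.
schedule = [step_decay_lr(epoch) for epoch in range(90)]
splits = list(five_fold_splits(100))
```

With 100 samples, each split contains 64 training, 16 validation and 20 test indices, and the five test sets together cover every sample exactly once.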
RSNA pediatric bone age dataset
For the RSNA dataset, we used the well-known VGG-16 architecture [ ] as a base model, with weights
pre-trained on ImageNet. On top of this architecture, we added batch normalization [ ], an attention mechanism
with two CNN layers of 64 and 16 neurons each, two average pooling layers, dropout [ ] with a 0.25 probability,
and a fully connected layer with 1024 neurons. The activation function for the CNN layers was ReLU [ ], and we
used ELU for the fully connected layer. For the last layer of the PIs, the biases were initialized to separate
fixed values for the upper and lower bounds. We used standard data augmentation consisting of horizontal flips,
vertical and horizontal shifts, and rotations. In addition, we normalized targets to zero mean and unit
variance. To reduce computational costs, we downscaled input images to 384 × 384 pixels. The network was
optimized using the Adam optimizer [ ], with an initial learning rate of 0.01, which was reduced when the
validation loss stopped improving over 10 epochs. We trained the network for 50 epochs using a batch size of
100. For our loss hyperparameters, we used the standard configuration proposed in [ ]: confidence level set to
0.95, softening factor set to 160.0, and λ taken from the same configuration. For PIVEN we used the same
setting, with β = 0.5.
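The plateau-based learning-rate reduction used for the RSNA experiments can be sketched as follows. This is an illustrative sketch only: the class name is hypothetical, and the reduction factor of 0.1 is an assumption, since the text gives the initial rate (0.01) and the patience (10 epochs) but not the factor.

```python
class ReduceOnPlateau:
    """Lower the learning rate when the validation loss stops improving."""

    def __init__(self, initial_lr=0.01, factor=0.1, patience=10):
        self.lr = initial_lr
        self.factor = factor  # assumption: not stated in the text
        self.patience = patience
        self.best = float("inf")
        self.stale_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the current learning rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr *= self.factor
                self.stale_epochs = 0
        return self.lr


# Hypothetical usage: one improving epoch followed by ten stagnant ones.
scheduler = ReduceOnPlateau()
rates = [scheduler.step(loss) for loss in [1.0] + [1.0] * 10]
```

In this toy run the rate stays at its initial value for the first ten entries and is reduced on the eleventh, after ten consecutive epochs without improvement.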
B Ablation analysis full results
We now present the full results of our ablation studies, including PICP, for the ablation variants:
Table 5: Ablation analysis, comparing PICP and MPIW. Best was assessed as in Table 1
                     PICP                                   MPIW
Datasets             POO          MOI          PIVEN        POO          MOI          PIVEN
Boston               0.93 ±0.01   0.93 ±0.01   0.93 ±0.01   1.09 ±0.02   1.15 ±0.02   1.09 ±0.01
Concrete             0.93 ±0.01   0.93 ±0.01   0.93 ±0.01   1.02 ±0.01   1.07 ±0.01   1.02 ±0.01
Energy               0.97 ±0.01   0.97 ±0.00   0.97 ±0.00   0.42 ±0.01   0.45 ±0.01   0.42 ±0.01
Kin8nm               0.96 ±0.00   0.96 ±0.00   0.96 ±0.00   1.13 ±0.00   1.17 ±0.00   1.10 ±0.00
Naval                0.98 ±0.00   0.98 ±0.00   0.98 ±0.00   0.24 ±0.00   0.30 ±0.02   0.24 ±0.00
Power plant          0.96 ±0.00   0.96 ±0.00   0.96 ±0.00   0.86 ±0.00   0.86 ±0.00   0.86 ±0.00
Protein              0.95 ±0.00   0.95 ±0.00   0.95 ±0.00   2.25 ±0.01   2.27 ±0.01   2.26 ±0.01
Wine                 0.91 ±0.01   0.91 ±0.01   0.91 ±0.01   2.24 ±0.01   2.23 ±0.01   2.22 ±0.01
Yacht                0.95 ±0.01   0.95 ±0.01   0.95 ±0.01   0.18 ±0.00   0.19 ±0.01   0.17 ±0.00
Year Prediction MSD  0.95 ±NA     0.95 ±NA     0.95 ±NA     2.42 ±NA     2.43 ±NA     2.42 ±NA
Table 6: Ablation analysis comparing value prediction in terms of RMSE
Datasets POO MOI PIVEN
Boston 3.21 ±0.24 3.39 ±0.27 3.13 ±0.21
Concrete 5.55 ±0.11 5.73 ±0.10 5.43 ±0.13
Energy 2.16 ±0.04 2.27 ±0.04 1.65 ±0.03
Kin8nm 0.08 ±0.00 0.08 ±0.00 0.07 ±0.00
Naval 0.00 ±0.00 0.00 ±0.00 0.00 ±0.00
Power plant 4.13 ±0.04 4.15 ±0.04 4.08 ±0.04
Protein 4.78 ±0.02 4.99 ±0.01 4.35 ±0.02
Wine 0.64 ±0.01 0.67 ±0.01 0.63 ±0.01
Yacht 0.99 ±0.07 1.15 ±0.08 0.98 ±0.07
Year Prediction MSD 9.10 ±NA 9.25 ±NA 8.93 ±NA
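For reference, the three metrics reported in Tables 5 and 6 can be computed as in the sketch below. It uses the standard definitions (coverage fraction, mean interval width, root mean squared error); the function names and toy numbers are illustrative assumptions, and the sketch does not reproduce any scaling of targets that may underlie the reported values.

```python
import math


def picp(y_true, lower, upper):
    """Prediction interval coverage probability: fraction of targets inside their PI."""
    captured = [lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper)]
    return sum(captured) / len(captured)


def mpiw(lower, upper):
    """Mean prediction interval width over all points."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)


def rmse(y_true, y_pred):
    """Root mean squared error of the value predictions."""
    squared = [(y - p) ** 2 for y, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical toy example: the last target falls outside its interval.
targets = [1.0, 2.0, 3.0, 10.0]
lower = [0.0, 1.0, 2.0, 3.0]
upper = [2.0, 3.0, 4.0, 5.0]
coverage = picp(targets, lower, upper)
width = mpiw(lower, upper)
```

A method is preferable when it reaches the target coverage with a smaller width, which is how the PICP and MPIW columns of Table 5 are meant to be read together.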
C IMDB age estimation training process and robustness to outliers
C.1 Training process
In the following figures we present comparisons of the training progression for PIVEN, QD and NN on the MAE,
PICP and MPIW evaluation metrics. We used 80% of the images as the training set, while the remaining 20% were
used as the validation set (we did not define a test set, as we were only interested in analyzing the
progression of the training). For the MAE metric, presented in Figure 2, we observe that the values for QD do
not improve. This is to be expected, since QD does not consider this goal in its training process (i.e., loss
function). This result further strengthens our argument that choosing the middle of the interval is often a
sub-optimal strategy for value prediction. For the remaining two approaches, NN and PIVEN, we note that NN
suffers from overfitting, given that the validation error is greater than the training error after convergence.
This phenomenon does not occur in PIVEN, which indicates robustness and further supports our conclusions
regarding the method.
For the MPIW metric (Figure 3), PIVEN presents better performance than QD on both the validation and training
sets. Moreover, we observe a smaller gap between the error produced by PIVEN for the two sets (validation and
training), which indicates that PIVEN enjoys greater robustness and an ability to not overfit to a subset of
the data. Our analysis also shows that for the PICP metric (Figure 4), PIVEN converges to higher coverage.
Figure 2: Comparison of the MAE metric in the training process. Panels: (a) MAE NN, (b) MAE QD,
(c) MAE PIVEN, (d) MAE validation errors. We observe that the values for QD (b)
do not improve, which is expected since QD does not consider value prediction in its loss function.
Moreover, we note that NN suffers from overfitting, given that the validation error is greater than the
training error after convergence. This phenomenon does not affect PIVEN, thus providing an indication
of its robustness.
C.2 Robustness to outliers
Since PIVEN is capable of learning from the entire dataset while QD learns only from data points which were
captured by the PI, it is reasonable to expect that the former will outperform the latter when coping with outliers.
In the IMDB age estimation dataset, we can consider images with very high or very low age as outliers. Our
analysis shows that for this subset of cases, there is a large gap in performance between PIVEN and QD. In
Figure 5 we provide several images of very young/old individuals and the results returned by the two methods.
We can observe that PIVEN copes with these outliers signiﬁcantly better.
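The argument above can be illustrated with a small numeric sketch. It is a simplification under stated assumptions: `captured_width` stands in for a width term computed only over captured points (the coverage penalty of the QD loss is omitted), `value_mse` stands in for a value-prediction term computed over all points, and the ages and intervals are made-up toy numbers.

```python
def captured_width(y_true, lower, upper):
    """Mean PI width over captured points only (assumed QD-style width term)."""
    widths = [hi - lo for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi]
    return sum(widths) / len(widths) if widths else 0.0


def value_mse(y_true, y_pred):
    """Mean squared error of the value prediction over all points."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy ages: the third person (age 90) is an outlier missed by the PI.
ages = [30.0, 35.0, 90.0]
lower = [25.0, 30.0, 35.0]
upper = [35.0, 40.0, 45.0]
predictions = [30.0, 35.0, 40.0]

width_term = captured_width(ages, lower, upper)
value_term = value_mse(ages, predictions)
```

In this toy case the outlier is excluded from the width term, so that term carries no information about it, while the outlier dominates the value term and therefore still influences training.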
Figure 3: Comparison of the MPIW metric between QD and PIVEN in the training process. Panels:
(a) MPIW QD, (b) MPIW PIVEN, (c) MPIW validation. As can be seen, PIVEN significantly improves
over QD, and has a smaller gap between training and validation.
Figure 4: Comparison of the PICP metric between QD and PIVEN in the training process. Panels:
(a) PICP QD, (b) PICP PIVEN, (c) PICP validation. PIVEN achieves higher coverage when the two
methods converge.
Figure 5: The predictions produced for outliers (i.e., very young/old individuals) by both PIVEN and
QD for the IMDB age estimation dataset. The results for QD are on the left, results for PIVEN on the