Conformalized predictive simulations for univariate time series
T. Moudiki
Techtonique, LLC
Abstract
Predictive simulation of time series data is useful for many applications such as risk management and stress-testing in finance or insurance, climate modeling, and electricity load forecasting. This paper proposes a new approach to uncertainty quantification for univariate time series forecasting. This approach adapts split conformal prediction to sequential data: after training the model on a proper training set, and obtaining an inference of the residuals on a calibration set, out-of-sample predictive simulations are obtained through the use of various parametric and semi-parametric simulation methods. Empirical results on uncertainty quantification scores are presented for more than 3000 time series (mostly real world and a few synthetic ones), reproducing a wide range of time series stylized facts. The methodology is benchmarked, with much success, against some of the most recent studies on conformal prediction for time series forecasting.
Keywords: Keyword 1, Keyword 2, Keyword 3
1. Introduction
Despite having been available for decades, conformal prediction (CP, Vovk et al. (2005)) is becoming more and more popular, and is now a gold-standard technique for uncertainty quantification in Statistical/Machine Learning (ML). The interested reader can find an accessible introduction to CP applied to regression and classification tasks in Angelopoulos and Bates (2023).
This paper proposes a new approach to uncertainty quantification for univariate time series
forecasting. Uncertainty quantification is useful in business settings, and can be achieved in many
different ways. The outcome is generally a prediction interval when the distribution of residuals
is parametric, or a set of future scenarios (not necessarily drawn from a parametric distribution),
from which prediction intervals could also be derived by using empirical quantiles. This latter
approach, the one described in this paper, will be denoted as predictive simulation hereafter.
The usefulness of uncertainty quantification lies in the fact that it allows one to assess the impact
of alternative, hypothetical scenarios on business metrics of interest. For example, in the context
of electricity load forecasting, uncertainty quantification can help in assessing the impact of a
drop in temperature on electricity demand and taking appropriate measures to avoid blackouts.
In the context of financial forecasting, uncertainty quantification can help to assess the impact
of an increase in a stock’s market value on a portfolio. Another application, in insurance, is the
calculation of reserves, which are used to cover future claims: the insurer needs to envisage a
range of possible future scenarios for its balance sheet and calculate, accordingly, the amount of
reserves to be set aside.
More precisely, the approach presented in this paper is inspired by Split Conformal Prediction (SCP hereafter, Vovk et al. (2005)) and adapted to sequential data. In the SCP context for supervised learning on tabular data, after training the model on a so-called proper training set, and obtaining an inference of the residuals on a calibration set, the distribution of these so-called calibrated residuals is used to quantify the uncertainty on test set data. More formally, SCP's recipe for supervised learning on tabular data is relatively straightforward. We let
(X, y), \quad X \in \mathbb{R}^{n \times p}, \; y \in \mathbb{R}^n \qquad (1)

be respectively a set of explanatory variables and a response variable, with n observations and p features. Under the assumption that the observations are exchangeable, SCP begins by splitting the training data into two disjoint subsets: a proper training set

\{(x_i, y_i) : i \in I_1\} \qquad (2)

and a calibration set

\{(x_i, y_i) : i \in I_2\} \qquad (3)

with I_1 \cup I_2 = \{1, \ldots, n\} and I_1 \cap I_2 = \emptyset. Now, let A be an ML model, for example a Ridge regression model (Golub et al., 1979) or an artificial neural network (Goodfellow et al., 2016). A is trained on I_1, and absolute calibrated residuals are computed on I_2, as follows:
R_i = |y_i - \hat{\mu}_A(x_i)|, \quad i \in I_2 \qquad (4)

where \hat{\mu}_A(x_i) is the value predicted by A on x_i, i \in I_2. For a given level of risk \alpha (the risk of being wrong by saying that the prediction belongs to a certain prediction interval), a quantile of the empirical distribution of the absolute residuals (Eq. 4) is then computed as:

Q_{1-\alpha}(R, I_2) := (1 - \alpha)(1 + 1/|I_2|)\text{-th quantile of } \{R_i : i \in I_2\} \qquad (5)

To finish, a prediction interval at a new point x_{n+1} is given by

C_{A, 1-\alpha}(x_{n+1}) := \hat{\mu}_A(x_{n+1}) \pm Q_{1-\alpha}(R, I_2) \qquad (6)

This type of prediction interval (Eq. 6) is described in (Vovk et al., 2005) as satisfying coverage guarantees, given that some assumptions hold, such as the exchangeability of the observations. The simplest case of data exchangeability is: \{x_1, \ldots, x_n\} are independent and identically distributed. Under these assumptions, and for a given, expected level of confidence 1 - \alpha, we'd have:

P(y_{n+1} \in C_{A, 1-\alpha}(x_{n+1})) \geq 1 - \alpha \qquad (7)

where y_{n+1} is the true value of the response variable y for the unseen observation x_{n+1}.
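As an illustration, the SCP recipe of Eqs. (1)-(6) can be sketched in a few lines of Python. The data, the ridge penalty lam, and the split below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative tabular data: n observations, p features (Eq. 1)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)

# Random split into proper training set I1 and calibration set I2 (Eqs. 2-3)
idx = rng.permutation(n)
I1, I2 = idx[: n // 2], idx[n // 2 :]

# Model A: ridge regression in closed form; lam is an illustrative choice
lam = 1.0
beta = np.linalg.solve(X[I1].T @ X[I1] + lam * np.eye(p), X[I1].T @ y[I1])

# Absolute calibrated residuals on I2 (Eq. 4)
R = np.abs(y[I2] - X[I2] @ beta)

# Conformal quantile of the calibrated residuals (Eq. 5)
alpha = 0.1
q = np.quantile(R, min(1.0, (1 - alpha) * (1 + 1 / len(I2))))

# Prediction interval at a new point (Eq. 6)
x_new = rng.normal(size=p)
mu = x_new @ beta
lower, upper = mu - q, mu + q
```

Exchangeability of the observations justifies the random split here; the sequel modifies precisely this step for time series.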
Time series data do possess some peculiarities: they are not exchangeable but rather, often, auto-correlated. Loosely speaking, not being exchangeable means that the sequential order of the data matters crucially, and randomly splitting as done in the usual SCP setting for tabular data doesn't work anymore in this context. In addition to the non-exchangeability of the original data, time series residuals can also, in turn, be non-exchangeable. This hasn't prevented time series forecasting from being impacted by the CP trend, and many different approaches have been proposed to circumvent the non-exchangeability and take into account the sequential structure of the input data. Without being exhaustive, and focusing on recent publications:
- (Xu and Xie, 2021) describe a CP algorithm for ML forecasting that doesn't require the exchangeability condition, but rather a stationary and strongly mixing error process, plus estimation quality (these assumptions are discussed in their paper). It relies on the well-known statistical bootstrap procedure.
- In (Gibbs and Candes, 2021), the authors introduce an adaptive time series forecasting method that improves the quality of uncertainty quantification in an online learning fashion. If a point in the test set is outside the prediction interval constructed at the current step, the length of the interval is adaptively increased for the next step. Conversely, if a point is inside the prediction interval constructed at the current step, the length is decreased. Both the increase and the decrease of the prediction interval's length are made by using a fixed learning rate \gamma. This approach is later extended by (Zaffran et al., 2022). The authors present a method for choosing the unknown, adaptive learning rate \gamma automatically, based on an aggregation of expert models.
To the best of my knowledge, there are no published studies on predictive simulations for time series forecasting in the CP setting. Predictive simulations are useful for many practical applications such as risk management and stress-testing in finance or insurance, climate modeling, and electricity load forecasting. Hence, the contribution of this paper is twofold:
1. I propose a new and different approach to CP for time series, based on SCP, but with a twist. Instead of splitting the data randomly or by stratifying the variable of interest, I split the data in a way that preserves its temporal order.
2. The conformity score, a score computed on calibrated residuals to obtain quantiles of the predictive distribution, is chosen to be the standardized residuals (instead of the absolute residuals used in classical SCP). I propose a way to obtain predictive simulations, and prediction intervals, from that approach, which is a point rarely discussed in the literature, especially for time series forecasting.
Although this straightforward (just as SCP) approach may sound simplistic, it can be applied to a wide range of time series data, both real-world and synthetic, with much success. Empirical results for predictive coverage rates and uncertainty quantification scores are presented in section 3 for more than 3000 time series (notably from the M3 competition, Makridakis and Hibon (2000), and the M5 competition, Makridakis et al. (2022)), alongside benchmarks against widely-used uncalibrated methods and recently-published high-profile methods. The R and Python code and results are available at https://gitlab.com/conf3180013/conformalkde/.
2. Context and methodology
2.1. Context
Let (y_t)_{t \geq 0} be the time series of interest, for example last year's car sales in Oklahoma City. (y_t)_{t \geq 0} is observed at discrete times t = 1 to t = T, with T \in \mathbb{N}. Based on (y_t)_{t \in \{1, \ldots, T\}}, we are interested in estimating \hat{y}_{T+h} for h > 0. We are particularly interested in assessing the potential inaccuracies of the model used to obtain a forecast \hat{y}_{T+h}, with the objective of ensuring that any deviation remains within acceptable limits. Specifically, knowing how many cars were sold in Oklahoma City last year, we seek to estimate, with an appropriate margin, the number of cars that will be sold this year. This approach aims to avoid excessive inventory levels.
A predictive coverage rate is the proportion of times the true value of the series falls between
the lower and upper bounds of the prediction interval. That is, for h > 0 and prediction bounds \left[ \hat{y}^{(l)}_{T+h}, \hat{y}^{(u)}_{T+h} \right] obtained by the forecasting model:

\text{Coverage rate} = \text{cvr} := \frac{1}{H} \sum_{h=1}^{H} \mathbb{1}\left\{ y_{T+h} \in \left[ \hat{y}^{(l)}_{T+h}, \hat{y}^{(u)}_{T+h} \right] \right\} \qquad (8)

where \mathbb{1}\{\cdot\} is an indicator function, and H is the number of prediction dates ahead of T.
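The coverage rate of Eq. (8) amounts to a one-line computation; the function name and the toy bounds below are hypothetical:

```python
import numpy as np

def coverage_rate(y_true, lower, upper):
    """Empirical coverage rate (Eq. 8): proportion of true values
    falling between the lower and upper prediction bounds."""
    y_true, lower, upper = map(np.asarray, (y_true, lower, upper))
    return np.mean((y_true >= lower) & (y_true <= upper))

# Toy example: 4 out of 5 true values fall inside their bounds
cvr = coverage_rate([1.0, 2.0, 3.0, 4.0, 10.0],
                    lower=[0.5, 1.5, 2.5, 3.5, 4.5],
                    upper=[1.5, 2.5, 3.5, 4.5, 5.5])
# cvr == 0.8
```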
Some papers discuss various results for the convergence of coverage rates to a predefined, desired level of confidence 1 - \alpha, with a small \alpha \in (0, 1). Guaranteeing the convergence of the coverage rate to a predefined level of confidence, for example 95%, means that in the future, the model will be wrong at most 5% of the time. The asymptotic properties of coverage rates from these papers hold for an infinite forecasting horizon (H \to \infty), under relatively strict assumptions. These results are quite remarkable and elegant from a pure mathematical viewpoint. In practice, however, we never forecast on an infinite horizon, and we will always have only finite data at our disposal.
As Yogi Berra (or someone else; the source is vague) said, it's hard to make predictions, especially about the future. Thus, no such theoretical coverage guarantee is proposed in this paper for the SCP method. I rather demonstrate the strong and robust empirical convergence properties of coverage rates to a predefined coverage rate, on a broad range of data sets: industrial, financial and synthetic.
2.2. Proposed methodology
The general assumption used in this algorithm is that the input time series has an underlying 'Data Generating Process' (DGP), i.e., is not excessively chaotic (no distribution shifts, aggressive regime switches or jumps, for example). Its applicability to real-world data remains quite broad, however, as will be demonstrated in section 3 on hundreds of examples.
1. Obtain sequential time series data

y_t, \quad t \in \{1, \ldots, T\}, \; T \in \mathbb{N} \qquad (9)

and choose any forecasting algorithm A, not necessarily a probabilistic one.
2. In order to quantify the uncertainty around A's point forecasts, the available training data is split into two non-overlapping subsets: a proper training set and a calibration set. For the purpose of maintaining the temporal structure of (y_t)_t, both sets are constructed in a sequential manner, and the proper training set comes, chronologically, before the calibration set. If we let p_{cal} \in (0, 1) be the proportion of data used in the proper training set, we have:

\text{Proper training set} = \left\{ y_1, \ldots, y_{\lceil T \times p_{cal} \rceil} \right\} \qquad (10)

\text{Calibration set} = \left\{ y_{\lceil T \times p_{cal} \rceil + 1}, \ldots, y_T \right\} \qquad (11)

The proper training set (Eq. 10) is used to adjust A, and calibrated residuals are obtained on the calibration set (Eq. 11) as follows:

\varepsilon^{(cal)}_{A,t} = y^{(cal)}_t - \hat{y}^{(cal)}_{A,t}, \quad t \in \{\lceil T \times p_{cal} \rceil + 1, \ldots, T\}

where \hat{y}^{(cal)}_{A,t} is the point forecast obtained on the calibration set by A.
3. Centering and re-scaling the calibrated residuals (the standard deviation is kept and denoted as \hat{\sigma}^{(cal)}_A) gives us \hat{\varepsilon}^{(cal)}_{A,t}. Our forecasting uncertainty will be estimated by employing the distribution of

\hat{\varepsilon}^{(cal)}_A := \left\{ \hat{\varepsilon}^{(cal)}_{A, \lceil T \times p_{cal} \rceil + 1}, \ldots, \hat{\varepsilon}^{(cal)}_{A, T} \right\} \qquad (12)
Notice that contrary to SCP, these aren't absolute calibrated residuals, but standardized ones. Three methods are employed (and compared here) to estimate and simulate the distribution of the calibrated residuals; other methods can be envisaged, without loss of generality:

- Estimate the empirical distribution of \hat{\varepsilon}^{(cal)}_A with a Gaussian (semi-parametric, mixture of Gaussians) Kernel Density Estimator (KDE), under the assumption that they are independent and identically distributed, and simulate from the Gaussian mixture; or
- Adjust a surrogate model to \hat{\varepsilon}^{(cal)}_A (for example, the R implementation of (Theiler et al., 1992) is used in this paper, but it could also be any of the (Haaga and Datseris, 2022) models) and simulate from the surrogate model; or
- Apply a block bootstrap procedure (Künsch, 1989) to \hat{\varepsilon}^{(cal)}_A. Block bootstrapping tries to maintain some sort of temporal dependence within blocks of the input time series.
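Two of these simulation schemes can be sketched as follows; the residuals, horizon, number of paths, and block length are illustrative, scipy's gaussian_kde stands in for the Gaussian-mixture KDE, and the surrogate method is omitted:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
resid = rng.standard_normal(100)        # stand-in for the calibrated residuals
resid = (resid - resid.mean()) / resid.std()   # centering and re-scaling (step 3)
H, B = 12, 1000                         # horizon and number of simulated paths

# (a) Gaussian KDE: fit a mixture-of-Gaussians density, then resample i.i.d.
kde_sims = gaussian_kde(resid).resample(size=H * B).reshape(B, H)

# (b) Moving-block bootstrap (Künsch, 1989): resample contiguous blocks
# so that some temporal dependence is kept within each simulated path
def block_bootstrap(x, horizon, n_paths, block_len=5, rng=rng):
    paths = np.empty((n_paths, horizon))
    for b in range(n_paths):
        path = []
        while len(path) < horizon:
            start = rng.integers(0, len(x) - block_len + 1)
            path.extend(x[start:start + block_len])
        paths[b] = path[:horizon]
    return paths

boot_sims = block_bootstrap(resid, H, B)
```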
4. Again, contrary to what is done at this step in SCP, and because of the sequential nature of (y_t)_t, A is now trained on the calibration set (Eq. 11), and used to obtain point forecasts \{\hat{y}_{A, T+1}, \ldots, \hat{y}_{A, T+H}\}.

5. To finish, predictive simulations are obtained by adding the rescaled (by \hat{\sigma}^{(cal)}_A) simulated residuals from step 3 to \{\hat{y}_{A, T+1}, \ldots, \hat{y}_{A, T+H}\}. Lower and upper bounds of prediction intervals are empirical quantiles of these predictive simulations, at a level 1 - \alpha, \alpha \in (0, 1).
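Steps 4 and 5 then amount to adding rescaled simulated residuals to the point forecasts and taking empirical quantiles; a sketch with hypothetical point forecasts and Gaussian stand-in residual simulations:

```python
import numpy as np

rng = np.random.default_rng(3)
H, B = 12, 1000
alpha = 0.05

point_forecasts = np.linspace(100.0, 111.0, H)  # hypothetical forecasts for T+1..T+H
sigma_cal = 2.5                                 # std. dev. kept at the re-scaling step
sim_resid = rng.standard_normal((B, H))         # simulated standardized residuals

# Step 5: predictive simulations = point forecasts + rescaled simulated residuals
sims = point_forecasts[None, :] + sigma_cal * sim_resid

# Prediction bounds as empirical quantiles of the simulations
lower = np.quantile(sims, alpha / 2, axis=0)
upper = np.quantile(sims, 1 - alpha / 2, axis=0)
```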
3. Numerical examples
3.1. Introduction
The results presented in this section can be reproduced entirely in R and Python by using the code found in: https://gitlab.com/conf3180013/conformalkde/. There are 4 core examples:

- Uncertainty quantification on 250 time series described in https://gitlab.com/conf3180013/conformalkde/-/blob/main/250timeseries.txt
- Comparison with Adaptive conformal methods for uncertainty quantification
- Uncertainty quantification on M3 competition data: all the 3003 time series
- Uncertainty quantification on M5 competition data: the 30000 time series are aggregated by item, so that there are 3049 aggregated time series to forecast
Except for the M5 competition, where Gradient Boosted Decision Trees (winning solutions) were employed, two univariate time series forecasting models (with different versions) are used, with the non-conformalized methods using Gaussian prediction intervals:

- Theta model ((Assimakopoulos and Nikolopoulos, 2000) and (Hyndman and Billah, 2003)), a model that won the famous M3 competition (Clements and Hendry, 2001). It's a simple exponential smoothing model, with a drift, implemented in (Hyndman and Khandakar, 2008). Fast to train when compared to an automated ARIMA, for example, this model is a good candidate for benchmarking on multiple data sets, without focusing too much on computational resources but on the main question of calibration and uncertainty quantification.
- dynrmf model, implemented in the R, Python and Julia package ahead (Moudiki, 2024). dynrmf is a model-agnostic dynamic regression model inspired by NNAR (see (Hyndman and Athanasopoulos, 2018)), with an automatic choice of the number of autoregressive and seasonal lags. Instead of an artificial neural network, as implemented in NNAR, dynrmf can use any regression model available. Here, I'll use a Ridge regression model, in which the regularization parameter is automatically chosen with generalized cross-validation (Golub et al., 1979). Because it's relatively fast and robust, dynrmf with automated ridge regression is also a good candidate for benchmarking on multiple datasets, without losing focus on the main objective of the paper: model calibration.
The different versions of Theta and dynrmf studied here are the following:

- Theta: the original Theta model implemented in Hyndman and Khandakar (2008), with no modification. Denoted as theta_0 in the results section 3.1.
- Theta+KDE: the original Theta model, with a KDE applied to the calibrated residuals. Denoted as theta_kde in the results section 3.1.
- Theta+surrogates: the original Theta model, and a surrogate model (Theiler et al. (1992)) adjusted to the calibrated residuals. Denoted as theta_surr in the results section 3.1.
- Theta+bootstrap: the original Theta model, with a block bootstrap applied to the calibrated residuals. Denoted as theta_boot in the results section 3.1.
- dynrmf: the original dynrmf model with automated ridge regression and no other modification. Denoted as dynrmf_0 in the results section 3.1.
- dynrmf+KDE: the original dynrmf model with automated ridge regression and a KDE applied to the calibrated residuals. Denoted as dynrmf_kde in the results section 3.1.
- dynrmf+surrogates: the original dynrmf model with automated ridge regression and a surrogate model adjusted to the calibrated residuals. Denoted as dynrmf_surr in the results section 3.1.
- dynrmf+bootstrap: the original dynrmf model with automated ridge regression and a block bootstrap applied to the calibrated residuals. Denoted as dynrmf_boot in the results section 3.1.
The aim in section 3.2.1 will be to benchmark these 8 variants of Theta and dynrmf on 250 various datasets (209 real-world and 41 synthetic, see https://gitlab.com/conf3180013/conformalkde/), and to see how conformalizing/calibrating with the method described in this paper influences uncertainty quantification metrics such as coverage rates and the Winkler score. Using the same notations as those introduced in previous sections, and no matter the level of uncertainty chosen, a Winkler score is defined as:
ws_h := \left( \hat{y}^{(u)}_{T+h} - \hat{y}^{(l)}_{T+h} \right) \times \mathbb{1}\left\{ y_{T+h} \in \left[ \hat{y}^{(l)}_{T+h}, \hat{y}^{(u)}_{T+h} \right] \right\}
+ \left( \hat{y}^{(l)}_{T+h} - y_{T+h} \right) \times \mathbb{1}\left\{ y_{T+h} < \hat{y}^{(l)}_{T+h} \right\}
+ \left( y_{T+h} - \hat{y}^{(u)}_{T+h} \right) \times \mathbb{1}\left\{ y_{T+h} > \hat{y}^{(u)}_{T+h} \right\} \qquad (13)

\text{Winkler score} = \text{ws} := \frac{1}{H} \sum_{h=1}^{H} ws_h

where H is the number of prediction dates ahead of T, and h > 0, h \in \mathbb{N}, indexes the forecasting horizon. Intuitively, when the observed values fall within the prediction interval, the Winkler score is the average prediction interval's length. When an observed value falls outside the prediction interval, the Winkler score is penalized by the difference between the observed value and the closest bound of the prediction interval.
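A sketch of this score, following Eq. (13) as written above (note that the classical Winkler score additionally scales the out-of-interval penalty by 2/\alpha); the toy bounds are hypothetical:

```python
import numpy as np

def winkler_score(y_true, lower, upper):
    """Winkler score as in Eq. 13: interval length when the true value
    is covered, distance to the violated bound otherwise, averaged over h."""
    y, l, u = map(np.asarray, (y_true, lower, upper))
    inside = (y >= l) & (y <= u)
    ws_h = np.where(inside, u - l, np.where(y < l, l - y, y - u))
    return ws_h.mean()

# First point covered -> length 2.0; second above -> 5.0 - 2.0 = 3.0; mean = 2.5
ws = winkler_score([1.0, 5.0], lower=[0.0, 0.0], upper=[2.0, 2.0])
```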
The 41 synthetic data sets used in this section are simulated according to the following formula, using different random seeds for the residuals:

y_t = 0.01 t + \sin\left( \frac{2 \pi t}{180} \right) + \cos\left( \frac{2 \pi t}{180} \right) + \varepsilon_t \qquad (14)

for t \in \{1, \ldots, 100\}, where (\varepsilon_t)_t is an autoregressive process of order 1 (AR(1)), with:

\varepsilon_t = 0.99 \varepsilon_{t-1} + \xi_t \qquad (15)

and (\xi_t)_t is a sequence of i.i.d. N(0, 0.01) random variables. Figure 1 depicts four (out of 41) of these synthetic data sets.
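A minimal sketch of this simulation scheme (Eqs. 14-15); the function name is hypothetical:

```python
import numpy as np

def synthetic_series(T=100, seed=0):
    """Simulate Eq. 14: linear trend + seasonal terms + AR(1) noise (Eq. 15)."""
    rng = np.random.default_rng(seed)
    xi = rng.normal(scale=np.sqrt(0.01), size=T)  # i.i.d. N(0, 0.01) innovations
    eps = np.zeros(T)
    for t in range(1, T):
        eps[t] = 0.99 * eps[t - 1] + xi[t]        # AR(1) residuals
    t = np.arange(1, T + 1)
    return 0.01 * t + np.sin(2 * np.pi * t / 180) + np.cos(2 * np.pi * t / 180) + eps

y = synthetic_series(seed=1)
```

Varying the seed reproduces the 41 data sets up to the residual draws.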
Let's start with two introductory and illustrated real-world examples, applying conformalized dynrmf and thetaf respectively to:

- USAccDeaths ((Brockwell and Davis, 1991)), a seasonal time series data set giving the monthly totals of accidental deaths in the USA from January 1973 to December 1978. 50 predictive simulations are produced, and KDE is used (see 2.2).
- GOOG: closing stock prices of GOOG from the NASDAQ exchange, for 1000 consecutive trading days between 25 February 2013 and 13 February 2017. Adjusted for splits. Source: https://finance.yahoo.com/quote/GOOG/history
For both data sets, the training set contains 80% (50% proper training + 50% calibration) of the data, and the test set, which is untouched during the whole procedure, contains the remaining 20%.
As can be observed on figures 2 and 3, the prediction intervals and corresponding simulations closely follow the seasonal pattern of USAccDeaths and the trend of GOOG. The prediction intervals are wider for the 99% coverage level than for the 95% coverage level, as expected.
Figure 1: 4 synthetic time series (panels: Synthetic # 1 to Synthetic # 4).
Figure 2: Example based on dynrmf and USAccDeaths. Left: 50 predictive simulations for conformalized DynRM 1,1[12]. Right: true observation = blue, ensemble forecast = red.
Figure 3: Example based on thetaf and GOOG stock. Left: 50 predictive simulations for conformalized Theta. Right: true observation = blue, ensemble forecast = red.
3.2. Results
3.2.1. Distributions of coverage rates on 250 time series (209 real-world and 41 synthetic)
For the 250 time series used in this section (described in https://gitlab.com/conf3180013/conformalkde/-/blob/main/250timeseries.txt), the training set contains 90% (50% proper training + 50% calibration) of the data, and the test set contains the remaining 10%. Each prediction interval for conformalized methods is obtained from 1000 simulations and an expected theoretical coverage rate equal to 95%. For non-conformalized methods, the prediction interval is based on a Gaussian hypothesis.
Table 1 contains confidence intervals for the distribution of 250 coverage rates for a desired coverage rate of 80%, and Table 2 the same confidence intervals for a desired coverage rate of 95%. Based on these results, calibrated dynrmf and thetaf obtain better out-of-sample coverage rates than their uncalibrated counterparts, whose uncertainty quantification relies on a Gaussian hypothesis on their residuals.
Shorter 95% confidence intervals for coverage rates which also contain the target coverage rates of 80% and 95% should be privileged, as the model is more certain of its predictions. With that being said, keep in mind that this is the default dynrmf, and a plethora of other models could be used in lieu of Ridge regression: the results could potentially be even better for dynrmf in general.
Table 1: Comparison of Lower and Upper Bounds with Interval Lengths (level = 80%)
lower upper length
thetaf_0 74.14 80.19 6.05
dynrmf_0 69.20 74.93 5.73
thetaf_kde 71.71 78.19 6.48
dynrmf_kde 81.37 85.93 4.56
thetaf_boot 66.45 73.20 6.75
dynrmf_boot 76.40 81.41 5.01
thetaf_surr 67.16 73.82 6.66
dynrmf_surr 76.31 81.36 5.05
Table 2: Comparison of Lower and Upper Bounds with Interval Lengths (level = 95%)
lower upper length
thetaf_0 86.34 91.24 4.90
dynrmf_0 85.46 89.85 4.39
thetaf_kde 85.64 90.73 5.09
dynrmf_kde 92.23 95.32 3.09
thetaf_boot 79.29 85.60 6.31
dynrmf_boot 87.64 92.00 4.36
thetaf_surr 82.10 87.51 5.41
dynrmf_surr 89.73 93.32 3.59
3.2.2. Comparison with Adaptive conformal methods
In this section, the Adaptive Conformal Inference (ACI, (Gibbs and Candes, 2021)) and Aggregated Conformal Inference (AgCI, (Zaffran et al., 2022)) methods are compared to SCP-calibrated thetaf and dynrmf on synthetic data sets from (Zaffran et al., 2022). These data sets are generated from the following model:

Y_t = 10 \sin(\pi X_{t,1} X_{t,2}) + 20 (X_{t,3} - 0.5)^2 + 10 X_{t,4} + 5 X_{t,5} + 0 \times X_{t,6} + \varepsilon_t

where "the X_t are multivariate uniformly distributed on [0, 1], and X_{t,6} represents an uninformative variable. The noise \varepsilon_t is generated from an ARMA(1, 1) process of parameters \varphi and \theta, i.e. \varepsilon_{t+1} = \varphi \varepsilon_t + \xi_{t+1} + \theta \xi_t, with \xi_t a white noise", with \varphi and \theta in \{0.1, 0.8, 0.9, 0.95, 0.99\}. For more details, see (Zaffran et al., 2022). Implementations of ACI and AgACI are based on (Susmann et al., 2024).
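A minimal sketch of this data-generating process; the sample size and function name are illustrative assumptions:

```python
import numpy as np

def zaffran_dgp(T=300, phi=0.8, theta=0.8, seed=0):
    """Simulate the synthetic model of Zaffran et al. (2022):
    Friedman-type mean function plus ARMA(1, 1) noise."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(T, 6))        # X_{t,6} is uninformative
    xi = rng.standard_normal(T + 1)     # white noise innovations
    eps = np.zeros(T)
    for t in range(1, T):
        eps[t] = phi * eps[t - 1] + xi[t + 1] + theta * xi[t]
    Y = (10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + 20 * (X[:, 2] - 0.5) ** 2
         + 10 * X[:, 3] + 5 * X[:, 4] + eps)
    return X, Y

X, Y = zaffran_dgp(phi=0.95, theta=0.95)
```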
Coverage rates and Winkler scores are computed on the grid of values of \varphi and \theta specified before, with 100 simulations for each pair of parameters.
Table 3 contains the 95% confidence intervals for the coverage rates of ACI and AgACI for a desired coverage rate of 80%, and Table 4 the confidence intervals for ACI and AgACI coverage rates for a desired coverage rate of 95%. Table 5 contains the confidence intervals for the Winkler scores
Table 3: Confidence intervals for coverage rates vs ACI and AgCI (level = 80)
fcast_method conformal_method mean lower upper length
dynrmf splitconformal 81.40200 81.11286 81.69114 0.5782802
thetaf splitconformal 79.85000 79.51653 80.18347 0.6669351
dynrmf FACI 77.90467 77.75933 78.05001 0.2906808
thetaf FACI 77.63267 77.48174 77.78359 0.3018476
dynrmf AgACI 77.61200 77.45267 77.77133 0.3186499
thetaf AgACI 77.41400 77.25402 77.57398 0.3199632
Table 4: Confidence intervals for coverage rates vs ACI and AgCI (level = 95)
fcast_method conformal_method mean lower upper length
dynrmf splitconformal 95.57067 95.40907 95.73227 0.3232015
thetaf splitconformal 94.37867 94.18406 94.57328 0.3892207
dynrmf FACI 90.97933 90.87836 91.08031 0.2019511
thetaf FACI 90.73800 90.63206 90.84394 0.2118827
dynrmf AgACI 89.97867 89.86041 90.09692 0.2365102
thetaf AgACI 89.73867 89.61613 89.86120 0.2450722
of ACI and AgCI at a level of 80%, and Table 6 the confidence intervals for the Winkler scores of ACI and AgCI at a level of 95%. They are compared to the results obtained by SCP-calibrated thetaf and dynrmf on the same data sets.
Here, the SCP-calibrated dynrmf and thetaf methods are compared to ACI and AgCI on synthetic data sets. The SCP-calibrated methods obtain better coverage rates and Winkler scores than both ACI and AgCI. The rule of thumb is that confidence intervals for coverage rates should be as short as possible, while still containing the target coverage rate. In the 95% case, SCP-calibrated dynrmf is judged as being better not so much because of the length of the confidence intervals, but because of the mean coverage rate. For Winkler scores, considering their definition, the smaller the better. In this area, conformalized dynrmf wins for both a target coverage of 80% and of 95%.
Table 5: Confidence intervals for Winkler scores vs ACI and AgCI (level = 80)
fcast_method conformal_method mean lower upper length
dynrmf splitconformal 20.28301 20.17123 20.39479 0.2235643
dynrmf AgACI 21.06962 20.97201 21.16724 0.1952261
thetaf splitconformal 21.15895 21.01029 21.30762 0.2973326
dynrmf FACI 21.24329 21.14597 21.34060 0.1946331
thetaf AgACI 21.84901 21.72749 21.97053 0.2430389
thetaf FACI 22.05348 21.92482 22.18214 0.2573109
3.3. Results on M3 competition data
The M3 competition (Makridakis and Hibon (2000)) is a well-known forecasting competition containing "YEARLY", "QUARTERLY", "MONTHLY" and "OTHER" frequency data, from the "DEMOGRAPHIC", "FINANCE", "INDUSTRY", "MACRO", "MICRO" and "OTHER" sectors.
The Theta forecasting method (Assimakopoulos and Nikolopoulos (2000) and Hyndman and Billah (2003)) was the winning solution in this competition. Here, the SCP-calibrated Theta model is compared to other conformalized Theta models (with methods from Susmann et al. (2024)). The SCP-calibrated Theta model uses 250 simulations for computing its prediction intervals at a 95% level.
Figure 4: Coverage rate per conformal method (see Susmann et al. (2024) for SAOCP, SF-OGD)
Figure 5: Log(Winkler score) per conformal method (see Susmann et al. (2024) for SAOCP, SF-OGD)
Figure 6: RMSE per conformal method (see Susmann et al. (2024) for SAOCP, SF-OGD)
Numerical details on these 3 graphics for M3 can be found in Appendix B. We notice that SCP-calibrated Theta may seem overconfident, but since its Winkler score is the lowest, there's not much to worry about (narrower prediction intervals are obtained).
3.4. Results on M5 competition data
3049 time series are obtained by aggregating the 30000 time series from the M5 competition. Gradient Boosted Regression Trees models were compared without hyperparameter tuning, with 1 time series lag, using the Python package nnetsauce. scikit-learn's GradientBoostingRegressor is denoted as gb, LightGBM is denoted as lgb, and XGBoost is denoted as xgb; two methods are used for interval generation: sequential split conformal (described in this paper) and quantile methods.
Figure 7: Log-error rate per conformalization method.
Figure 8: Winkler score per conformalization method.
Figure 9: Timings per conformalization method.
The Table in Appendix A presents a comprehensive comparison of prediction intervals across the three different models. The first section, titled "Coverage", illustrates the coverage rates of the prediction intervals, which indicate the proportion of actual values that fall within the generated intervals. These coverage rates, expressed as proportions (to be multiplied by 100 to obtain percentages), reveal that while the median coverage is generally high across all models, the conformal method consistently shows strong performance.
The second section, "Winkler," evaluates the performance of the models using the Winkler
score, a metric that penalizes intervals not including the true value while accounting for interval
width. Lower Winkler scores are indicative of better predictive performance, and the results show
that the conformal method tends to yield lower scores, particularly for LightGBM and XGBoost,
suggesting that these models effectively balance precision and interval width.
Lastly, the "Time" section reports the elapsed time in seconds for model training and predic-
tions, highlighting the computational efficiency of each model. Here, the XGBoost model with the
quantile method exhibits the fastest execution time, while LightGBM with the conformal method
takes the longest, suggesting a trade-off between accuracy and computational demand. Over-
all, this table offers valuable insights into the trade-offs between coverage, accuracy, and com-
putational efficiency across different machine learning models and interval estimation methods.
Regarding these timings, it’s worth mentioning that the conformal method is simulation-based
and thus requires slightly higher computational resources compared to the quantile method. A
parametric adjustment of the calibrated residuals could result in a comparably fast method.
4. Conclusion
In this paper, we've introduced a new method for conformalizing time series forecasting models. This method expands Split Conformal Prediction (SCP) and adapts it to the sequential nature of time series. We've shown that this method can be successfully applied to calibrating a wide range (literally hundreds) of time series data sets: industrial, financial and synthetic. SCP-calibrated dynrmf and thetaf have been shown to generally outperform their uncalibrated counterparts on out-of-sample coverage rates, and to be more than competitive when compared to other recent methods for time series conformalization.
References
Angelopoulos, A.N., Bates, S., 2023. Conformal prediction: A gentle introduction. Foundations and Trends® in Machine Learning 16. URL: http://dx.doi.org/10.1561/2200000101.
Assimakopoulos, V., Nikolopoulos, K., 2000. The Theta model: a decomposition approach to forecasting. International Journal of Forecasting 16, 521–530.
Brockwell, P.J., Davis, R.A., 1991. Time series: theory and methods. Springer.
Clements, M., Hendry, D., 2001. Explaining the results of the M3 forecasting competition. International Journal of
Forecasting 17.
Gibbs, I., Candes, E., 2021. Adaptive conformal inference under distribution shift. Advances in Neural Information
Processing Systems 34, 1660–1672.
Golub, G.H., Heath, M., Wahba, G., 1979. Generalized cross-validation as a method for choosing a good ridge param-
eter. Technometrics 21, 215–223.
Goodfellow, I., Bengio, Y., Courville, A., 2016. Deep learning. MIT press.
Haaga, K.A., Datseris, G., 2022. TimeseriesSurrogates.jl: a Julia package for generating surrogate data. Journal of Open Source Software 7, 4414. URL: https://doi.org/10.21105/joss.04414, doi:10.21105/joss.04414.
Hyndman, R., Khandakar, Y., 2008. Automatic time series forecasting: The forecast package for R. Journal of Statistical
Software 27, 1–22.
Hyndman, R.J., Athanasopoulos, G., 2018. Forecasting: principles and practice. OTexts.
Hyndman, R.J., Billah, B., 2003. Unmasking the theta method. International Journal of Forecasting 19, 287–290.
Künsch, H.R., 1989. The jackknife and the bootstrap for general stationary observations. The Annals of Statistics ,
1217–1241.
Makridakis, S., Hibon, M., 2000. The M3-Competition: results, conclusions and implications. International Journal of Forecasting 16, 451–476.
Makridakis, S., Spiliotis, E., Assimakopoulos, V., 2022. The M5 competition: Background, organization, and implementation. International Journal of Forecasting 38, 1325–1336.
Moudiki, T., 2024. ahead: Univariate and multivariate time series forecasting with uncertainty quantification (including simulation approaches).
Susmann, H., Chambaz, A., Josse, J., 2024. AdaptiveConformal: An R Package for Adaptive Conformal Inference. Computo. URL: https://computo.sfds.asso.fr/template-computo-quarto, doi:10.57750/edan-5f53.
Theiler, J., Eubank, S., Longtin, A., Galdrikian, B., Farmer, J.D., 1992. Testing for nonlinearity in time series: the method
of surrogate data. Physica D: Nonlinear Phenomena 58, 77–94.
Vovk, V., Gammerman, A., Shafer, G., 2005. Algorithmic learning in a random world. volume 29. Springer.
Xu, C., Xie, Y., 2021. Conformal prediction interval for dynamic time-series, in: International Conference on Machine
Learning, PMLR. pp. 11559–11569.
Zaffran, M., Féron, O., Goude, Y., Josse, J., Dieuleveut, A., 2022. Adaptive conformal predictions for time series, in:
International Conference on Machine Learning, PMLR. pp. 25834–25866.
Appendix A. Coverage, Winkler score and timing on M5 data set
Coverage (empirical coverage of the prediction intervals, as proportions)

model  pi_method   min    median  max    range
gb     conformal   0.00   0.89    1.00   1.00
gb     quantile    0.03   0.81    1.00   0.97
lgb    conformal   0.00   0.89    1.00   1.00
lgb    quantile    0.05   0.75    1.00   0.95
xgb    conformal   0.00   0.90    1.00   1.00
xgb    quantile    0.29   0.79    1.00   0.71

Winkler score (lower is better)

model  pi_method   min    median  max      range
gb     conformal   2.08   35.53   1298.16  1296.08
gb     quantile    2.08   34.44   7270.32  7268.24
lgb    conformal   5.77   36.00   1272.90  1267.13
lgb    quantile    5.00   42.34   1304.69  1299.69
xgb    conformal   5.54   35.43   1781.88  1776.34
xgb    quantile    4.78   38.08   1553.96  1549.18

Time (elapsed, in seconds)

model  pi_method   min    median  max     range
gb     conformal   2.22   12.75   92.19   89.97
gb     quantile    0.69   1.09    11.61   10.92
lgb    conformal   2.41   16.14   169.19  166.78
lgb    quantile    0.62   3.63    78.22   77.60
xgb    conformal   2.15   12.75   88.54   86.39
xgb    quantile    0.13   0.23    2.78    2.65
Appendix B. Coverage rates, Winkler score and RMSE for M3 data set
Table B.3: Average Coverage by Method and Period

conformal_method  period     average_coverage
splitconformal    MONTHLY    100.0000000
splitconformal    OTHER      100.0000000
splitconformal    QUARTERLY  100.0000000
splitconformal    YEARLY     100.0000000
AgACI             MONTHLY     72.4303909
AgACI             OTHER       48.3811720
AgACI             QUARTERLY   52.1629623
AgACI             YEARLY      31.8191214
SAOCP             MONTHLY      0.8799117
SAOCP             OTHER        1.2609469
SAOCP             QUARTERLY    0.2752630
SAOCP             YEARLY       0.1085271
SF-OGD            MONTHLY      0.0000000
SF-OGD            OTHER        0.0547345
SF-OGD            QUARTERLY    0.0000000
SF-OGD            YEARLY       0.0000000
Table B.4: Average Winkler Score by Method and Period

conformal_method  period     average_winkler
splitconformal    MONTHLY     2748.174
splitconformal    OTHER       1174.329
splitconformal    QUARTERLY   1663.088
splitconformal    YEARLY      1735.325
AgACI             MONTHLY     5368.826
AgACI             OTHER       2237.696
AgACI             QUARTERLY   5847.555
AgACI             YEARLY     14568.552
SAOCP             MONTHLY   25396.600
SAOCP             OTHER     12282.196
SAOCP             QUARTERLY 21956.867
SAOCP             YEARLY    41459.081
SF-OGD            MONTHLY   25569.669
SF-OGD            OTHER     12401.179
SF-OGD            QUARTERLY 22035.548
SF-OGD            YEARLY    41504.670
Table B.5: Average RMSE by Method and Period

conformal_method  period     average_RMSE
splitconformal    MONTHLY    1193.5822
splitconformal    OTHER       667.5194
splitconformal    QUARTERLY  1128.5092
splitconformal    YEARLY     1662.1212
AgACI             MONTHLY     779.9480
AgACI             OTHER       359.7805
AgACI             QUARTERLY   649.2999
AgACI             YEARLY     1182.6598
SAOCP             MONTHLY     779.9480
SAOCP             OTHER       359.7805
SAOCP             QUARTERLY   649.2999
SAOCP             YEARLY     1182.6598
SF-OGD            MONTHLY     779.9480
SF-OGD            OTHER       359.7805
SF-OGD            QUARTERLY   649.2999
SF-OGD            YEARLY     1182.6598