Forecasting with time series imaging
Xixi Li^{a,1}, Yanfei Kang^{a,1}, Feng Li^{b,*}
a School of Economics and Management, Beihang University, Beijing 100191, China.
b School of Statistics and Mathematics, Central University of Finance and Economics, Beijing 102206, China.

* Corresponding author.
Email addresses: lixixi199407@buaa.edu.cn (Xixi Li), yanfeikang@buaa.edu.cn (Yanfei Kang), feng.li@cufe.edu.cn (Feng Li)
URL: https://orcid.org/0000-0001-5846-3460 (Xixi Li), https://orcid.org/0000-0001-8769-6650 (Yanfei Kang), https://orcid.org/0000-0002-4248-9778 (Feng Li)
1 The authors contributed equally.
Abstract
Feature-based time series representations have attracted substantial attention in a wide range of time series analysis methods. Recently, the use of time series features for forecast model averaging has been an emerging research focus in the forecasting community. Nonetheless, most of the existing approaches depend on the manual choice of an appropriate set of features. Exploiting machine learning methods to extract features from time series automatically becomes crucial in state-of-the-art time series analysis. In this paper, we introduce an automated approach to extract time series features based on time series imaging. We first transform time series into recurrence plots, from which local features can be extracted using computer vision algorithms. The extracted features are used for forecast model averaging. Our experiments show that forecasting based on automatically extracted features, with less human intervention and a more comprehensive view of the raw time series data, yields performance highly comparable with the best methods on the largest forecasting competition dataset (M4) and outperforms the top methods on the Tourism forecasting competition dataset.
Keywords: Forecasting, Time series imaging, Time series feature extraction, Recurrence
plots, Forecast combination.
1. Introduction
Time series features are a collection of statistical representations of time series characteristics. Feature-based time series representation has attracted remarkable attention in a wide range of data mining tasks for time series. Most time series problems, including time series clustering (e.g., Wang et al., 2006; Bandara et al., 2020), classification (e.g., Fulcher and Jones, 2014; Nanopoulos et al., 2001) and anomaly detection (e.g., Hyndman et al., 2015; Talagala, Hyndman, Smith-Miles, Kandanaarachchi and Muñoz, 2019; Corizzo et al., 2020),
are eventually attributed to the quantification of similarity among time series data using time
series feature representations. Fulcher (2018) presents thousands of interpretable features that can be used to represent a time series, such as global features, subsequence features and other hybrid features, for classifying time series (Fulcher and Jones, 2014) and labeling the emotional content of speech (Fulcher et al., 2013). Christ et al. (2018) compute 794 time series features based on hypothesis tests and illustrate their applications in time series anomaly detection and classification. Another line of approaches to time series feature extraction is via auto-encoder models (e.g., Vincent et al., 2008). Corizzo et al. (2020) further exploit time series features extracted from auto-encoder models for gravitational wave detection. Other recent studies use auto-encoder models for feature representation in time series forecasting (e.g., Laptev et al., 2017; Abdollahi et al., 2020).
Instead of the traditional time series forecasting procedure – fitting a model to the historical
data and simulating future data based on the fitted model, selecting the most appropriate
forecasting model or averaging a number of candidate models based on time series features has
been a popular alternative approach. In the last few decades, many attempts have been made at feature-based model selection and averaging procedures for univariate time series forecasting. For example, Collopy and Armstrong (1992) provided 99 rules using 18 features to combine four extrapolation methods by examining a rule base to forecast annual economic and demographic time series; Arinze (1994) described the use of artificial intelligence techniques to improve forecasting accuracy, building an induction tree that maps time series features to the most accurate forecasting method; Shah (1997) constructed several individual selection rules for forecasting using discriminant analysis based on 26 time series features; Meade (2000)
used 25 summary statistics of time series as explanatory variables in predicting the relative
performances of nine forecasting methods based on a set of simulated time series with known
properties; Petropoulos et al. (2014) proposed “horses for courses” and measured the effects of
seven time series features on the forecasting performances of 14 popular forecasting methods
on the monthly data in the M3 dataset (Makridakis and Hibon, 2000); more recently, Kang
et al. (2017) visualized the performances of different forecasting methods in a two-dimensional
principal component feature space and provided a preliminary understanding of their relative
performances. Talagala et al. (2018) presented a general framework for forecast model selection
using meta-learning in which they utilize a random forest algorithm to select the best forecasting
method based on time series features. Montero-Manso et al. (2020) trained a meta-model to
obtain the weights of various forecasting methods and made a weighted forecasting combination.
The input of the meta-model is a set of features calculated on the training data, while the output is a group of weights assigned to each candidate forecasting method. Their method ranked second in the M4 competition (Makridakis et al., 2020).
Having revisited the literature on feature-based time series forecasting, we find that (i) although researchers often highlight the usefulness of time series features in selecting the best forecasting method, most of the existing approaches depend on the manual choice of an appropriate set of features, which ties the forecasting process to the data at hand and the expertise of the forecaster and therefore makes it inflexible (Fulcher, 2018); and, more importantly, (ii) the current literature on feature-based forecasting focuses on global features of time series, leaving local characteristics under-emphasized. In some instances, the local dynamics of a time series carry important information, such as signs of heart failure in medical signals or irregular weather changes. Therefore, automated feature extraction from time series data becomes vital. Inspired by the recent work of Hatami et al. (2017) and Wang and Oates (2015) on time series classification tasks, this paper explores time series forecasting based on model averaging with the idea of time series imaging, from which both global and local features of the time series can be automatically extracted using computer vision algorithms. This novel approach to time series forecasting is more flexible than forecasting based on manually curated time series features.
The rest of the paper is organized as follows. Section 2 presents our feature extraction method based on time series imaging. In Section 3, we describe how to assign weights to a group of candidate forecasting methods using imaging-based time series features and perform forecast combination accordingly. Section 4 applies our imaging-based time series forecast combination method to two large collections of real datasets, namely the M4 competition dataset and the Tourism competition dataset. Section 5 provides our discussions and insights, as well as several possible future research directions. Section 6 concludes the paper.
2. Time series imaging and feature extraction
In this paper, we extract time series features based on time series imaging in two steps. In the first step, we encode the time series into images using recurrence plots. In the second step, time series features are extracted from the images using image processing techniques. We consider two different image feature extraction approaches: the spatial bag-of-features (SBoF) model and convolutional neural networks (CNNs). We describe the details in the following sections.
2.1. Time series imaging
We use recurrence plots (RPs) to encode time series data into images. An RP provides a way to visualize the periodic nature of a trajectory through a phase space (Eckmann et al., 1987) and can contain all relevant dynamical information in the time series (Thiel et al., 2004). A recurrence plot of a time series x, showing when the time series revisits a previous state, can be formulated as
\[
R(i, j) = \Theta\left(\epsilon - \| x_i - x_j \|\right),
\]
where R(i, j) is the element of the recurrence matrix R, i indexes time on the x-axis of the recurrence plot, and j indexes time on the y-axis. \(\epsilon\) is a predefined threshold, and \(\Theta(\cdot)\) is the Heaviside step function. In short, one draws a black dot whenever \(x_i\) and \(x_j\) are closer than \(\epsilon\). An un-thresholded RP avoids the binary output but is difficult to quantify. We therefore use the following modified RP, which balances the binary and un-thresholded versions:
\[
R(i, j) =
\begin{cases}
\epsilon, & \| x_i - x_j \| > \epsilon, \\
\| x_i - x_j \|, & \text{otherwise}.
\end{cases}
\]
It takes more values than a binary RP and results in colored plots. Fig. 1 shows three typical examples of recurrence plots. They reveal different recurrence patterns for time series with randomness, periodicity, chaos, and trend. We can see that the recurrence plots shown in the right column well depict the pre-defined patterns in the time series shown in the left column.
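To make the construction concrete, the following minimal Python sketch computes the modified recurrence matrix for a univariate series. It assumes the series has been min-max scaled to [0, 1] beforehand and uses the threshold ε = 0.5, consistent with the setup in Appendix A.

```python
import numpy as np

def recurrence_plot(x, eps=0.5):
    """Modified recurrence plot: pairwise distances clipped at eps.

    Assumes x is min-max scaled to [0, 1]; eps = 0.5 follows Appendix A.
    """
    x = np.asarray(x, dtype=float)
    d = np.abs(x[:, None] - x[None, :])  # pairwise distances |x_i - x_j|
    return np.where(d > eps, eps, d)     # eps where far apart, the distance otherwise
```

The resulting matrix can then be rendered as a colored image (e.g., with a standard heatmap function) to produce plots like those in Fig. 1.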
2.2. Feature extraction with the SBoF model
We propose an image-based time series feature extraction framework using the spatial bag-of-features (SBoF) model. As shown in Fig. 2, the framework consists of three steps: (i) detect key points with the scale-invariant feature transform (SIFT) algorithm (Lowe, 1999) and find basic descriptors with k-means; (ii) generate the representation based on the locality-constrained linear coding (LLC) method (Wang et al., 2010); and (iii) extract spatial information via spatial pyramid matching (SPM) and pooling. We describe each step in detail below.

The original bag-of-features (BoF) model, which extracts features from one-dimensional signal segments, has achieved great success in time series classification (Baydogan et al., 2013; Wang et al., 2013). Hatami et al. (2017) transformed a time series into a two-dimensional recurrence image with a recurrence plot (Eckmann et al., 1987) and then applied the BoF model. Extracting time series features is then equivalent to identifying key points in images, which are called key descriptors. A promising algorithm for this is SIFT (Lowe, 1999), which detects and describes local features in images by identifying the maxima/minima of the difference of Gaussians (DoG) across the multiscale spaces of an image as its key descriptors. It consists of the following four steps, after which we give a brief code sketch of the detection and codebook construction.
Figure 1. Typical examples of recurrence plots (right column) for time series data with different patterns (left
column): uncorrelated stochastic data, i.e., white noise (top), a time series with periodicity and chaos (middle),
and a time series with periodicity and trend (bottom).
Figure 2. Image-based time series feature extraction with spatial bag-of-features model. It consists of four steps:
(i) encode a time series as an image with recurrence plots; (ii) detect key points with SIFT and obtain the basic
descriptors with k-means for the codebook; (iii) generate the representation based on LLC; and (iv) extract
spatial information via SPM and max pooling.
1. Detect extreme values in the scale spaces. We search over all the scale spaces and use the difference-of-Gaussians method to identify potential interest points, selecting those invariant to scale and orientation.
2. Find the key points. The position and scale are determined by fitting a model at each candidate position, and the key points are selected according to their stability.
3. Assign feature directions. This step assigns each key point one or more directions based on the local gradient direction of the image. All subsequent operations transform the direction, scale, and position of the key points to allow for invariance in the features.
4. Describe key points. Within the neighborhood around each feature point, the local gradient of the image is measured at selected scales and transformed into a representation that allows for larger local shape deformations and illumination changes. The SIFT method uses a 128-dimensional vector to characterize the key descriptors in an image. First, an 8-direction histogram is established in each 4 × 4 subregion, and 16 subregions around the key point are used. We then calculate the magnitude and direction of each pixel's gradient and add it to the corresponding subregion. In the end, a 128-dimensional histogram-based descriptor is generated.
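The detection and codebook-construction step, i.e., step (i) of the SBoF pipeline, can be sketched as follows. This is a minimal implementation assuming OpenCV's SIFT implementation and scikit-learn's k-means with k = 200 clusters as in Appendix A, not necessarily the exact toolchain used in our experiments.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(images, n_words=200):
    """Detect SIFT key descriptors on grayscale uint8 images and cluster
    them with k-means into a codebook of basic descriptors."""
    sift = cv2.SIFT_create()
    all_desc = []
    for img in images:
        _, desc = sift.detectAndCompute(img, None)  # desc: (n_i, 128) or None
        if desc is not None:
            all_desc.append(desc)
    kmeans = KMeans(n_clusters=n_words, n_init=10).fit(np.vstack(all_desc))
    return kmeans.cluster_centers_  # codebook B of shape (n_words, 128)
```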
The LLC method utilizes locality constraints to project each descriptor onto its local coordinate system (Wang et al., 2010), and the projected coordinates are integrated by max pooling to generate the final representation:
\[
\min_{c} \sum_{i=1}^{N} \| x_i - B c_i \|^2 + \lambda \| d_i \odot c_i \|^2, \quad \text{s.t.} \ \mathbf{1}^\top c_i = 1, \ \forall i, \tag{1}
\]
where \(d_i = \exp(\mathrm{dist}(x_i, B)/\sigma)\) and \(x_i \in \mathbb{R}^{128}\) is the vector of one descriptor. The basic descriptors \(B \in \mathbb{R}^{128 \times M}\) are obtained by k-means clustering. The representation parameters \(c_i\) obtained through Equation (1) are used as time series representations. The locality adaptor \(d_i\) gives a different degree of freedom to each basis vector, proportional to its similarity to the input descriptor. We use σ to adjust the weight decay speed of the locality adaptor, and λ is the adjustment factor. However, in reality, the number of descriptors obtained by the SIFT algorithm is usually huge. To address this problem, Wang et al. (2010) proposed an incremental codebook optimization method for LLC.
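In practice, the approximated LLC variant codes each descriptor against only its k nearest basic descriptors (k = 5 in Appendix A), which admits a closed-form solution (Wang et al., 2010). A minimal sketch of that solution follows; the regularization constant beta is an illustrative choice.

```python
import numpy as np

def llc_code(x, B, k=5, beta=1e-4):
    """Approximated LLC coding of one descriptor x (128,) against a
    codebook B (M, 128), using its k nearest basic descriptors."""
    idx = np.argsort(np.linalg.norm(B - x, axis=1))[:k]  # k nearest codewords
    z = B[idx] - x                        # shift the local bases to the origin
    C = z @ z.T                           # local covariance matrix, (k, k)
    C += beta * np.trace(C) * np.eye(k)   # regularize for numerical stability
    c = np.linalg.solve(C, np.ones(k))
    c /= c.sum()                          # enforce the constraint 1'c = 1
    code = np.zeros(B.shape[0])
    code[idx] = c
    return code                           # sparse code of length M
```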
The bag-of-features model calculates the distribution characteristics of feature points over the whole image and then generates a global histogram. As a result, the image's spatial distribution information is lost, and the image may not be accurately described. To recover the spatial information, we apply the spatial pyramid matching (SPM) method, which computes statistics of image feature points at different resolutions and has achieved high accuracy on a large dataset of 15 natural scene categories (Lazebnik et al., 2006). The image is divided into progressively finer grids at each level of the pyramid, and features are derived from each grid cell and combined into one large feature vector. Fig. 3 depicts the SPM and max pooling process. In this task, we divide the image into 1 × 1, 2 × 2 and 4 × 4 grids, and thus obtain 21 subregions. To obtain the representation for each subregion, we first collect its descriptors. Suppose that we obtain 12 descriptors, denoted by \(D_i \in \mathbb{R}^{12 \times 200}\), for the third region (the dimension of the local linear representation of a descriptor is 200). We can then take the maximum value of every dimension of \(D_i\). After max pooling, we obtain the feature representation \(f_i \in \mathbb{R}^{200 \times 1}\) for the third region. The feature representations of the other twenty regions are obtained in the same way. Finally, the 21 features are concatenated into the final representation of the time series, so the final size of the feature vector is 21 × 200 = 4200. More details about the experimental setup of the SBoF model can be found in Appendix A.
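The pyramid pooling step can be sketched as follows, assuming the key-point pixel coordinates and their LLC codes are available; the region indexing details are illustrative.

```python
import numpy as np

def spm_max_pool(coords, codes, img_size, grids=(1, 2, 4)):
    """Spatial pyramid max pooling of LLC codes.

    coords: (n, 2) array of key-point (x, y) pixel positions.
    codes:  (n, M) array of LLC codes, one row per key point.
    """
    h, w = img_size
    pooled = []
    for g in grids:  # 1x1, 2x2 and 4x4 partitions: 21 regions in total
        rows = np.minimum(coords[:, 1] * g // h, g - 1).astype(int)
        cols = np.minimum(coords[:, 0] * g // w, g - 1).astype(int)
        for r in range(g):
            for c in range(g):
                region = codes[(rows == r) & (cols == c)]
                pooled.append(region.max(axis=0) if len(region)
                              else np.zeros(codes.shape[1]))
    return np.concatenate(pooled)  # length 21 * M (4200 for M = 200)
```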
Figure 3. Spatial pyramid matching and max pooling. The image is divided into progressively finer grids at each level of the pyramid, and features are derived from each grid cell and combined into one large feature vector. We divide the image into 1 × 1, 2 × 2 and 4 × 4 grids, and thus obtain 21 subregions. We first obtain the descriptors for each region. Suppose that we obtain 12 descriptors, denoted by \(D_i \in \mathbb{R}^{12 \times 200}\), for the third region (200 is the dimension of the local linear representation of a descriptor). Then, we can take the maximum value of every dimension of \(D_i\). After max pooling, we obtain the feature representation \(f_i \in \mathbb{R}^{200 \times 1}\) for the third region. The feature representations of the other 20 regions are obtained in the same way. Finally, the 21 features are linked together for the final representation of the time series.
2.3. Feature extraction with fine-tuned deep neural networks
An alternative to SBoF for image feature extraction is to use a deep CNN, which has achieved great breakthroughs in image processing (Krizhevsky et al., 2012). For example, Berkeley researchers (Donahue et al., 2014) proposed a feature extraction method called DeCAF (a deep convolutional activation feature for generic visual recognition), using deep convolutional neural networks directly for feature extraction. Their experimental results show that the extracted features yield higher accuracy than traditional image features. In addition, some researchers (e.g., Razavian et al., 2014) use the features acquired by convolutional neural networks as the input of an image classifier, which significantly improves image classification accuracy.

Nonetheless, the performance of neural networks heavily depends on the setting of the network structure and the hyper-parameters, and deeper architectures are often essential for achieving higher performance in a task. As a result, extensive computational power is needed. An appealing feature of our time series imaging approach is that a large number of well pre-trained neural network models for image classification exist. We can easily transfer such a model to our task via transfer learning (Pan and Yang, 2010), which has recently been widely used in a variety of fields such as image classification (Han et al., 2018) and natural language processing (Ahmad et al., 2020). To simplify our task, we use the fine-tuning approach (Ge and Yu, 2017) from the field of transfer learning. In short, it takes a pre-trained network and adapts it to our task: we fix the parameters of the earlier layers of the model pre-trained on ImageNet data and fine-tune the last few layers. In general, layers closer to the input extract more general features, while layers closer to the output extract features more specific to the classification task. In this way, network training can be significantly accelerated.
Fig. 4 shows the framework of transfer learning with fine-tuning. In this task, the deep network is trained on the large ImageNet dataset (Deng et al., 2009), and the pre-trained network is publicly available. Specifically, we fix the weights of all the layers of the pre-trained network except for the last fully connected layers and then use our time series images as inputs. Finally, the high-dimensional features of the time series images can be obtained from the pre-trained network. We consider the following representative architectures in our experiments: ResNet-v1-101 (He et al., 2016), ResNet-v1-50 (He et al., 2016), Inception-v1 (Szegedy et al., 2015), and VGG-19 (Simonyan and Zisserman, 2014). The dimensions of the time series features obtained from the pre-trained ResNet-v1-101, ResNet-v1-50, Inception-v1 and VGG-19 architectures are 2048, 2048, 1024 and 1000, respectively. More details about the experimental setup of the CNN-based feature extraction can be found in Appendix A.
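As an illustration of this feature extraction step, the following PyTorch/torchvision sketch (assuming the torchvision ≥ 0.13 weights API) extracts the pooled activations of a frozen ResNet-50 pre-trained on ImageNet, yielding a 2048-dimensional feature vector per recurrence-plot image. It is a minimal sketch of the idea rather than the exact framework or checkpoints used in our experiments; fine-tuning the last layer would attach a small trainable head on top of the same frozen extractor.

```python
import torch
from torchvision import models, transforms

# Load a ResNet-50 pre-trained on ImageNet and drop the final fully
# connected layer, keeping the 2048-dimensional pooled activations.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
extractor = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()
for p in extractor.parameters():
    p.requires_grad = False  # freeze all pre-trained layers

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def image_features(pil_image):
    x = preprocess(pil_image).unsqueeze(0)  # shape (1, 3, 224, 224)
    with torch.no_grad():
        f = extractor(x)                    # shape (1, 2048, 1, 1)
    return f.flatten().numpy()              # 2048-dimensional feature vector
```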
Figure 4. Framework of transfer learning with fine-tuning. Classic CNN models are trained on a large dataset (ImageNet). In a CNN, layers closer to the input extract more general features, while layers closer to the output extract features more specific to the classification task. To extract time series features, we fix the parameters of all the layers except for the last fully connected layer and fine-tune that layer for our task. With the trained model, we obtain the final representation of the time series.
3. Time series forecasting with image features
We aim to find the optimal combination of a pool of candidate forecasting methods. The essence is to link the knowledge of forecasting errors from different forecasting methods to time series features. Therefore, in this section, we focus on the mapping from the time series image features to the performances of the forecasting methods. We use nine of the most popular time series forecasting methods as candidates for forecast combination, which are also used in many recent studies (Montero-Manso et al., 2020; Talagala, Li and Kang, 2019; Kang et al., 2020). They are the automated ARIMA algorithm (ARIMA), automated exponential smoothing algorithm (ETS), feed-forward neural network using autoregressive inputs (NNET-AR), exponential smoothing state space model with a Box-Cox transformation (TBATS), seasonal and trend decomposition using LOESS with AR modeling of the seasonally adjusted series (STLM-AR), random walk with drift (RW-DRIFT), theta method (THETA), naïve (NAIVE), and seasonal naïve (SNAIVE). They are described in Table 1 and implemented in the R package forecast (Hyndman et al., 2019).
To validate the effectiveness of our image features of the time series, we follow the work of Montero-Manso et al. (2020), who proposed a model-averaging method based on 42 manually curated time series features and won second place in the M4 competition (Makridakis et al., 2020), to obtain the weights for forecast combination based on our image features. To make our proposed method comparable with those in M4, we use the overall weighted average (OWA) to measure forecasting accuracy, as used in the M4 competition. OWA is an overall indicator based on two accuracy measures, the mean absolute scaled error (MASE) and the symmetric mean absolute percentage error (sMAPE). The individual measures are calculated as follows:
\[
\text{sMAPE} = \frac{1}{h} \sum_{t=1}^{h} \frac{2\,|Y_t - \hat{Y}_t|}{|Y_t| + |\hat{Y}_t|},
\qquad
\text{MASE} = \frac{\frac{1}{h} \sum_{t=1}^{h} |Y_t - \hat{Y}_t|}{\frac{1}{n-m} \sum_{t=m+1}^{n} |Y_t - Y_{t-m}|},
\]
\[
\text{OWA} = \frac{1}{2}\left(\text{sMAPE}/\text{sMAPE}_{\text{Naive2}} + \text{MASE}/\text{MASE}_{\text{Naive2}}\right), \tag{2}
\]
where \(Y_t\) is the real value of the time series at point t, \(\hat{Y}_t\) is the point forecast, h is the forecasting horizon, and m is the frequency of the data (e.g., 4 for quarterly series). Naïve2 is equivalent to the naïve (NAIVE) method but applied to the time series adjusted for seasonal factors.
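A direct transcription of Equation (2) into Python is given below as a minimal sketch; note that the M4 competition reports sMAPE as a percentage (multiplied by 100), which does not affect the OWA ratios.

```python
import numpy as np

def smape(y, yhat):
    return np.mean(2 * np.abs(y - yhat) / (np.abs(y) + np.abs(yhat)))

def mase(y, yhat, insample, m):
    # scale by the in-sample one-step error of the seasonal naive method
    scale = np.mean(np.abs(insample[m:] - insample[:-m]))
    return np.mean(np.abs(y - yhat)) / scale

def owa(smape_model, mase_model, smape_naive2, mase_naive2):
    return 0.5 * (smape_model / smape_naive2 + mase_model / mase_naive2)
```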
Our framework for model averaging is shown in Fig. 6. It consists of two parts. In the training process, based on the extracted image features and the OWA values of the nine forecasting methods, we train a feature-based gradient tree boosting model (XGBoost; Chen and Guestrin, 2016) to produce nine weights for forecast model averaging by minimizing the OWA
Table 1. The methods used for forecast combination. All these methods are implemented using the forecast package in R.

ARIMA: The autoregressive integrated moving average model, automatically estimated in the forecast package for R (Hyndman and Khandakar, 2008). R implementation: auto.arima().
ETS: The exponential smoothing state space model (Hyndman et al., 2002). R implementation: ets().
NNET-AR: A feed-forward neural network using autoregressive inputs. R implementation: nnetar().
TBATS: The exponential smoothing state space model with a Box-Cox transformation, ARMA errors, trend and seasonal components (De Livera et al., 2011). R implementation: tbats().
STLM-AR: The STL decomposition (Cleveland et al., 1990) with AR modeling of the seasonally adjusted series. R implementation: stlm(..., modelfunction = ar).
RW-DRIFT: The random walk model with drift. R implementation: rwf(..., drift = TRUE).
THETA: The decomposition forecasting model obtained by modifying the local curvature of the time series through a coefficient 'Theta' applied directly to the second differences of the data (Assimakopoulos and Nikolopoulos, 2000). R implementation: thetaf().
NAIVE: The naïve method, which takes the last observation as the forecast for all forecast horizons. R implementation: naive().
SNAIVE: The seasonal naïve method, which forecasts using the most recent values of the same season. R implementation: snaive().
Figure 5. The temporal holdout strategy used to generate the training dataset. Each original time series is
divided into a training period and a testing period. The length of the testing period is equal to the forecasting
horizon (h) given by the M4 competition. We calculate time series image features from the training periods of
the training dataset, generate forecasts, and compute the corresponding OWA values over the test periods for
each candidate forecasting method. We train an XGBoost model on the training dataset and obtain weights for
each candidate forecasting method, which are then used to generate forecasts by forecast combination for the
future data.
error obtained by forecast combination. Let \(f_n\) be the image features extracted from the n-th time series, and N the total number of time series. \(O_{nm}\) is the contribution of the m-th method on the n-th time series to the OWA error. \(p(f_n)_m\) is the output of the XGBoost algorithm corresponding to the m-th forecasting method, based on the features extracted from the n-th time series. The gradient tree boosting approach minimizes the weighted average loss function
\[
\arg\min_{w} \sum_{n=1}^{N} \sum_{m=1}^{M} w(f_n)_m \, O_{nm},
\]
where \(w(f_n)_m\) are the softmax-transformed weights for the outputs \(p(f_n)_m\) of the XGBoost model, defined as
\[
w(f_n)_m = \frac{\exp\{p(f_n)_m\}}{\sum_{m'} \exp\{p(f_n)_{m'}\}}.
\]
The hyper-parameter settings for XGBoost are available in Appendix B.
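The weight transform and the combination step can be expressed compactly. The sketch below assumes the per-method XGBoost outputs and the candidate forecasts are already computed; it omits the custom training objective needed to fit XGBoost against the combination loss (see Montero-Manso et al., 2020).

```python
import numpy as np

def combination_weights(p):
    """Softmax-transform XGBoost outputs p(f_n)_m into combination weights.
    p: (N, M) array with one row of method scores per series."""
    e = np.exp(p - p.max(axis=1, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=1, keepdims=True)

def combine_forecasts(weights, forecasts):
    """Weighted forecast combination.
    forecasts: (N, M, h) array of the M candidate methods' forecasts."""
    return np.einsum('nm,nmh->nh', weights, forecasts)
```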
In the testing process, we use the trained model and the image features extracted from the
testing data to obtain the weights of different forecasting models. Finally, based on the weights
and forecasts of different models, we can obtain the final forecasts for the testing data.
4. Experiments
4.1. Forecasting with M4 competition data
The first dataset we use to evaluate our proposed method is a collection of general-purpose data from the M4 competition, which consists of 100,000 time series drawn from the economic,
Figure 6. Framework of forecast model averaging based on automatic feature extraction. In the training process,
nine weights are obtained for the forecast model combination using XGBoost. Based on the weights, we obtain
the forecasts for the testing data in the testing process.
finance, demographics, and industry domains. In the training process, we divide the original time series in M4 into training and testing periods following the strategy in Fig. 5. The lengths of the testing periods are equal to the forecasting horizons (h) given by the M4 competition, i.e., 6 for yearly, 8 for quarterly, 18 for monthly, 13 for weekly, 14 for daily, and 48 for hourly data. For each time series in M4, we calculate time series features from the training period, generate forecasts, and compute the corresponding OWA values over the test period for each candidate forecasting method. We then train an XGBoost model to produce the weights for each forecasting method described in Table 1. In the testing process, we use the trained model to forecast the original M4 time series and evaluate the forecasts against the future M4 data, which were made public after the M4 competition.
We now apply our imaging-based time series forecasting method to the M4 data. To illustrate that the extracted image features are diverse and can be used to characterize the original time series, we project the features of time series with different periods into a two-dimensional feature space using t-distributed stochastic neighbor embedding (t-SNE; Maaten, 2014). From Fig. 7, we notice that yearly, quarterly, monthly, daily and hourly data can be well distinguished in the feature spaces, although the features are automatically extracted from time series images.
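For reference, such a projection can be computed with scikit-learn's t-SNE implementation; the sketch below assumes the image features are stacked into an (N, d) matrix.

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_2d(features, seed=0):
    """Project an (N, d) image-feature matrix into two dimensions with
    t-SNE, as used for the feature-space views in Fig. 7."""
    return TSNE(n_components=2, random_state=seed).fit_transform(np.asarray(features))
```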
Following the framework in Fig. 6, we obtain the forecasts of M4 based on time series
imaging. Our model averaging results are compared with the results of the top ten ranked
Figure 7. Two-dimensional feature spaces of the M4 time series with different periods. The blue points highlight the areas where the time series instances (orange points) with the corresponding seasonal pattern lie.
Table 2. Description of the top ten forecasting methods in the M4 competition (Makridakis et al., 2020).
Ranking Description
1 A hybrid model mixing exponential smoothing (ES) with a black-box recurrent neural network (RNN) forecasting engine (Smyl, 2020).
2 Weighted forecast combination of the nine standard forecasting methods in Table 1 (Montero-Manso et al., 2020).
3 Weighted average of multiple statistical methods using hold-out tests (Pawlikowski and Chorowska, 2020).
4 Combination of multiple statistical methods as described in Armstrong (2001).
5 Weighted average of the standard ARIMA, ETS and THETA methods described in Table 1 (Fiorucci and Louzada, 2020).
6 Median of the ETS, CES (complex exponential smoothing; Svetunkov and Kourentzes, 2018), ARIMA, and THETA methods (Petropoulos and Svetunkov, 2020).
7 Combination of two THIEF (temporal hierarchical forecasting; Athanasopoulos et al., 2017) forecasts, with base models of ARIMA and THETA, respectively (Shaub, 2020).
8 THETA method with data deseasonalization and Box-Cox transformation.
9 A calibrated average of Rho and Delta (Card) forecasting methods (Doornik et al., 2020).
10 Forecast combination of seven benchmarks.
methods (Table 2) from the M4 competition, which are available in the concluding paper of M4 (Makridakis et al., 2020). Detailed descriptions and the code for replicating the top ten methods are available in the M4 GitHub repository (https://github.com/Mcompetitions/M4-methods). Note that the replication results may differ slightly due to updates of the related R packages. However, since the concluding paper of the M4 competition (Makridakis et al., 2020) was made publicly available at the same time as this work, the possible code changes in the R packages used by the competitors are negligible. Tables 3, 4 and 5 report the MASE, sMAPE, and OWA values for our time series imaging method with model averaging, and for the top ten methods from the M4 competition. The optimal parameters of XGBoost on the M4 competition dataset can be found in Table 7 of Appendix B.

Overall, our model averaging method with automated time series image features achieves performance highly comparable with the top methods from the M4 competition. From Table 5, our method ranks sixth overall, but it has the following advantages: (1) limited human interaction is required during feature extraction; (2) both global and local features are utilized; (3) fine-tuned results from existing CNN models for computer vision tasks can be seamlessly transferred to our model; and (4) it opens up potential improvements in forecasting performance as neural networks for computer vision tasks advance.
4.2. Forecasting with the Tourism competition data
To validate our method’s generality and robustness in even specific forecasting domains,
we now apply the proposed method to the Tourism competition dataset that consists of 366
monthly series, 427 quarterly series, and 518 yearly series (Athanasopoulos et al.,2011). In
the training process, we use the M4 competition data as training data to train the XGBoost
model and produce the optimal weights for each candidate forecasting method, which are used
to forecast the Tourism data. Since the Tourism dataset has smaller size compared to the M4
competition data, we use M4 monthly data as the training data for the Tourism monthly data
to obtain the optimal weights from XGBoost. The same strategy is applied to the quarterly
and yearly datasets.
We apply the same accuracy metrics as in the Tourism competition (Athanasopoulos et al., 2011), namely the mean absolute percentage error (MAPE) and the mean absolute scaled error (MASE), to make the results comparable with the literature. MASE is calculated as in Equation (2), and MAPE is calculated as follows:
\[
\text{MAPE} = \frac{1}{h} \sum_{t=1}^{h} \frac{|Y_t - \hat{Y}_t|}{|Y_t|},
\]
Table 3. Benchmarking the MASE performance of our proposed forecasting method based on time series imaging
against the top 10 methods in the M4 competition.
Yearly Quarterly Monthly Weekly Daily Hourly Total
Ranking M4 competition
1 2.980 1.118 0.884 2.356 3.446 0.893 1.536
2 3.060 1.111 0.893 2.108 3.344 0.819 1.551
3 3.130 1.125 0.905 2.158 2.642 0.873 1.547
4 3.126 1.135 0.895 2.350 3.258 0.976 1.571
5 3.046 1.122 0.907 2.368 3.194 1.203 1.554
6 3.082 1.118 0.913 2.133 3.229 1.458 1.565
7 3.038 1.198 0.929 2.947 3.479 1.372 1.595
8 3.009 1.198 0.966 2.601 3.254 2.557 1.601
9 3.262 1.163 0.931 2.302 3.284 0.801 1.627
10 3.185 1.164 0.943 2.488 3.232 1.049 1.614
Method Forecasting with time series imaging
SIFT 3.135 1.125 0.908 2.266 3.463 0.849 1.579
CNN
Inception-v1+XGBoost 3.096 1.139 0.947 2.479 3.289 1.015 1.592
ResNet-v1-101+XGBoost 3.106 1.147 0.927 2.579 3.377 0.970 1.591
ResNet-v1-50+XGBoost 3.104 1.143 0.917 2.441 3.363 0.965 1.583
VGG-19+XGBoost 3.098 1.133 0.931 2.355 3.235 0.991 1.581
Table 4. Benchmarking the sMAPE performance of our proposed forecasting method based on time series imaging
against the top 10 methods in the M4 competition.
Yearly Quarterly Monthly Weekly Daily Hourly Total
Ranking M4 competition
1 13.176 9.679 12.126 7.817 3.170 9.328 11.374
2 13.528 9.733 12.639 7.625 3.097 11.506 11.720
3 13.943 9.796 12.747 6.919 2.452 9.611 11.845
4 13.712 9.809 12.487 6.814 3.037 9.934 11.695
5 13.673 9.816 12.737 8.627 2.985 15.563 11.836
6 13.669 9.800 12.888 6.726 2.995 13.167 11.897
7 13.679 10.378 12.839 7.818 3.222 13.466 12.020
8 13.366 10.155 13.002 9.148 3.041 17.567 11.986
9 13.910 10.000 12.780 6.728 3.053 8.913 11.924
10 13.821 10.093 13.151 8.989 3.026 9.765 12.114
Method Forecasting with time series imaging
SIFT 13.896 9.863 12.596 7.899 3.063 11.772 11.816
CNN
Inception-v1+XGBoost 13.899 9.962 12.659 8.228 3.047 12.521 11.891
ResNet-v1-101+XGBoost 13.917 9.991 12.714 8.277 3.110 12.480 11.914
ResNet-v1-50+XGBoost 13.918 9.973 12.723 8.086 3.123 12.396 11.914
VGG-19+XGBoost 13.872 9.912 12.652 8.294 3.049 12.598 11.853
Table 5. Benchmarking the OWA performance of our proposed forecasting method based on time series imaging
against the top 10 methods in the M4 competition.
Yearly Quarterly Monthly Weekly Daily Hourly Total
Ranking M4 competition
1 0.778 0.847 0.836 0.851 1.046 0.440 0.821
2 0.799 0.847 0.858 0.796 1.019 0.484 0.838
3 0.820 0.855 0.867 0.766 0.806 0.444 0.841
4 0.813 0.859 0.854 0.795 0.996 0.474 0.842
5 0.802 0.855 0.868 0.897 0.977 0.674 0.843
6 0.806 0.853 0.876 0.751 0.984 0.663 0.848
7 0.801 0.908 0.882 0.957 1.060 0.653 0.860
8 0.788 0.898 0.905 0.968 0.996 1.012 0.861
9 0.836 0.878 0.881 0.782 1.002 0.410 0.865
10 0.824 0.883 0.899 0.939 0.990 0.485 0.869
Method Forecasting with time series imaging
SIFT 0.820 0.858 0.863 0.839 1.009 0.498 0.848
CNN
Inception-v1+XGBoost 0.814 0.867 0.885 0.895 1.002 0.552 0.854
ResNet-v1-101+XGBoost 0.816 0.872 0.877 0.916 1.025 0.542 0.855
ResNet-v1-50+XGBoost 0.816 0.869 0.873 0.881 1.025 0.538 0.853
VGG-19+XGBoost 0.814 0.863 0.876 0.877 0.994 0.549 0.850
Table 6. Model-averaging results compared with the top methods in the Tourism competition in terms of the
MAPE and MASE values.
MAPE MASE
Forecasting method Yearly Quarterly Monthly Total Yearly Quarterly Monthly Total
ARIMA 30.639 16.172 21.746 23.444 3.197 1.595 1.495 2.200
ETS 25.065 15.316 20.965 20.745 3.000 1.592 1.526 2.130
THETA 23.409 15.927 22.390 20.688 2.730 1.661 1.649 2.080
SNAIVE 23.610 16.459 22.562 20.988 3.007 1.699 1.631 2.197
DAMPED 27.975 35.830 47.192 35.898 3.061 3.221 3.404 3.209
Forecasting with time series imaging
SIFT 24.164 15.236 19.984 20.089 2.760 1.570 1.444 2.005
CNN
Inception-v1+XGBoost 24.633 15.333 20.261 20.383 2.834 1.560 1.467 2.037
ResNet-v1-101+XGBoost 24.288 15.047 20.221 20.142 2.779 1.555 1.468 2.014
ResNet-v1-50+XGBoost 24.347 15.101 19.981 20.117 2.750 1.563 1.454 2.002
VGG-19+XGBoost 23.616 15.599 20.055 20.010 2.689 1.638 1.476 2.008
where \(Y_t\) is the real value of the time series at point t, \(\hat{Y}_t\) is the point forecast, and h is the forecasting horizon.
The top methods in the competition, which include ARIMA, ETS, THETA, SNAIVE, and DAMPED, are discussed in Athanasopoulos et al. (2011). The first four methods are described in Table 1. DAMPED is a variation of the Holt-Winters method that "dampens" the trend to a flat line at some point in the future, and is implemented using forecast::holt(..., damped = TRUE) in R. We reproduce these top methods and use them as our benchmarks.
Following the framework in Fig. 6, we obtain the forecasts of the Tourism time series based on time series imaging. Our model averaging results outperform the top methods from the Tourism competition (Athanasopoulos et al., 2011) by a clear margin. Table 6 reports the MASE and MAPE values for our model-averaging method and the top methods from the Tourism competition. The numbers in bold indicate where our method beats the benchmark. In particular, our method performs exceptionally well on the monthly and quarterly data. For the yearly dataset, our method is slightly worse, which may be due to the inadequacy of historical data. The optimal parameters of XGBoost on the Tourism competition dataset can be found in Table 8 of Appendix B.
5. Discussions
Feature-based time series forecasting has proved highly promising, primarily through the extraction and selection of an appropriate set of features. Nonetheless, traditional time series feature extraction requires the manual design of feature metrics, which is typically complicated for time series forecasting practitioners. Moreover, the known features used in the time series forecasting literature are global characteristics of a time series, which may ignore important local patterns. Evidence from the literature further indicates that feature-based forecast combination might not perform as well as simple averaging when the feature extraction and selection are not properly conducted.
We propose an automated time series imaging feature extraction approach based on computer vision algorithms, and our experimental results show that the approach works well for forecast combination. A key innovation over other feature-based time series forecasting methods is that the features are extracted automatically from time series images, which are obtained using recurrence plots. In principle, any image feature extraction algorithm is applicable within our proposed framework. We employ two widely used algorithms to extract features from time series images, namely the spatial bag-of-features (SBoF) model and deep convolutional neural networks (CNNs).
The SBoF model, combining the scale-invariant feature transform (SIFT) algorithm, the locality-constrained linear coding (LLC) method, and spatial pyramid matching (SPM) with max pooling, can capture both global and local characteristics of images. The traditional SBoF model is a fast, industry-grade model in computer vision applications. One may notice that the features extracted by the traditional SIFT-based model perform better than the deep CNN models in some scenarios with our testing data. But it is worth mentioning that the SIFT method is not a fully automated image feature extraction process, because it requires a careful specification of four steps, namely (1) detecting extreme values in the scale spaces, (2) finding the key points, (3) assigning feature directions, and (4) describing key points. Moreover, the SIFT algorithm is patent protected (Lowe, 2004), which means other open-source programs cannot incorporate it without the patent owner's permission. Having an alternative approach with highly comparable performance but without patent restrictions is important to time series forecasters.
The alternative feature extraction algorithm based on deep CNNs is an automated process once the source task is confirmed. We use transfer learning to borrow the information of well pre-trained neural network models for image classification, which avoids the complication of setting the network structure and tuning the hyper-parameters. Unlike traditional CNN tasks that require fine-tuning and massive computation, we transfer the convolutional and fully connected layers trained for the ImageNet competition to our task. Hence only one new adaptation layer needs to be trained, which significantly saves computational power.

Although the aims of the source task (ImageNet) and the target task (time series forecasting) are naturally different, the images generated from time series share similar shapes and angles with images of real objects. This explains why we can transfer a different task to time series forecasting. In practice, forecasting practitioners may train a customized CNN model to further improve the forecasting performance if a rich collection of time series is available.
Another significant merit of using deep CNNs and transfer learning for time series feature extraction is that the pre-trained neural network models (e.g., on ImageNet) are continuously updated and improved in the image processing literature. Thus, we believe that this line of automated time series feature extraction approaches has great potential in the future.
In this paper, we use the features extracted from recurrence plots to reveal the characteristics
of the corresponding time series. The recurrence plot for a given time series displays its dynamics
based on the distance correlations within the time series. However, other features such as
cross-correlation coefficients can also be used to generate cross-correlation recurrence plots.
Thus, multi-channel images, with more comprehensive information, can be obtained for each
time series, which can potentially improve the feature extraction and feature-based forecast
combination performances. Therefore, time series forecasting based on multi-channel imaging
can be one potential extension of our current work.
The forecasting framework based on time series image features is in line with the work of Montero-Manso et al. (2020), who use 42 manually curated time series features and nine forecasting methods to optimize the weights for forecast combination, and who won second place in the M4 competition (Makridakis et al., 2020). To be consistent and comparable, we employ the same set of forecasting methods on the M4 dataset. However, we note that the choice of candidate forecasting methods for forecast combination also requires expert knowledge and practical experience. The performance of forecast combinations depends on the accuracy of the individual forecasting methods and the diversity among them, since the merits of forecast combination stem from the independent information across multiple forecasts (Thomson et al., 2019). How to automatically select an appropriate set of candidate methods for combination is another interesting direction for future research.
In our experiments, all the time series are independent. Therefore, we treat the time series images as independent and apply them to the CNN framework that is also used for classifying objects in ImageNet. A further extension of our work is to extend time series forecasting with imaging to (1) forecasting with time-varying image features, and (2) hierarchical time series or multivariate time series with recurrent dependence. In both scenarios, hierarchical image classification frameworks mixing CNNs and RNNs could be further explored.
We make our code publicly available at https://github.com/lixixibj/forecasting-with-time-series-imaging. Making it open source can enrich the toolboxes of forecasting support systems by providing a competitive alternative to the existing feature-based time series forecasting methods.
6. Concluding remarks
In this paper, we propose to use image features for forecast model combination. First, time series are encoded into images. Computer vision algorithms are then applied to extract features from the images, which are used for forecast model averaging. The proposed method enables automated feature extraction, making it more flexible than using manually selected time series features. Besides, our image features can depict local as well as global features of time series. To the best of our knowledge, our paper is the first attempt to apply imaging to time series forecasting.

We examined the performance of our approach on two widely used time series competition datasets (M4 and Tourism) and compared it with the top methods in the two competitions. Our experiments show that the proposed method produces forecast accuracies highly comparable with the top-ranked benchmarks in the competitions. Moreover, forecasting based on time series imaging offers an automatic tool for time series feature extraction, in the sense that it does not rely on manual inputs for feature selection, which is crucial for forecast practitioners.
Acknowledgments
We are thankful to Dr. Slawek Smyl from Uber and Professor Christoph Bergmeir from
Monash University for their insightful suggestions on a previous version of this paper presented
at the 39th International Symposium on Forecasting.
Yanfei Kang is supported by the National Natural Science Foundation of China (No. 11701022)
and the National Key Research and Development Program (No. 2019YFB1404600). Feng Li
is supported by the National Natural Science Foundation of China (No. 11501587) and the
Beijing Universities Advanced Disciplines Initiative (No. GJJ2019163).
References
Abdollahi, M., Khaleghi, T. and Yang, K. (2020), ‘An integrated feature learning approach using
deep learning for travel time prediction’, Expert Systems with Applications 139, 112864.
Ahmad, Z., Jindal, R., Ekbal, A. and Bhattachharyya, P. (2020), ‘Borrow from rich cousin:
transfer learning for emotion detection using cross lingual embedding’, Expert Systems with
Applications 139, 112851.
Arinze, B. (1994), ‘Selecting appropriate forecasting models using rule induction’, Omega-
international Journal of Management Science 22(6), 647–658.
Armstrong, J. S. (2001), Combining forecasts, in ‘Principles of forecasting’, Springer, pp. 417–
439.
Assimakopoulos, V. and Nikolopoulos, K. (2000), ‘The theta model: a decomposition approach
to forecasting’, International Journal of Forecasting 16(4), 521–530.
Athanasopoulos, G., Hyndman, R. J., Kourentzes, N. and Petropoulos, F. (2017), ‘Forecasting
with temporal hierarchies’, European Journal of Operational Research 262(1), 60–74.
Athanasopoulos, G., Hyndman, R. J., Song, H. and Wu, D. C. (2011), ‘The tourism forecasting
competition’, International Journal of Forecasting 27(3), 822–844.
Bandara, K., Bergmeir, C. and Smyl, S. (2020), ‘Forecasting across time series databases using
recurrent neural networks on groups of similar series: a clustering approach’, Expert Systems
With Applications 140, 112896.
Baydogan, M. G., Runger, G. and Tuv, E. (2013), ‘A bag-of-features framework to classify time
series’, IEEE transactions on pattern analysis and machine intelligence 35(11), 2796–2802.
Chen, T. and Guestrin, C. (2016), XGBoost: a scalable tree boosting system, in 'ACM SIGKDD International Conference on Knowledge Discovery and Data Mining', pp. 785–794.
Christ, M., Braun, N., Neuffer, J. and Kempa-Liehr, A. W. (2018), 'Time Series FeatuRe Extraction on basis of Scalable Hypothesis tests (tsfresh – a Python package)', Neurocomputing 307, 72–77.
Cleveland, R. B., Cleveland, W. S., McRae, J. E. and Terpenning, I. (1990), ‘STL: A seasonal-
trend decomposition procedure based on loess’, Journal of Official Statistics 6(1), 3–73.
Collopy, F. and Armstrong, J. S. (1992), ‘Rule-based forecasting: development and validation
of an expert systems approach to combining time series extrapolations’, Management Science
38(10), 1394–1414.
Corizzo, R., Ceci, M., Zdravevski, E. and Japkowicz, N. (2020), ‘Scalable auto-encoders for grav-
itational waves detection from time series data’, Expert Systems with Applications p. 113378.
De Livera, A. M., Hyndman, R. J. and Snyder, R. D. (2011), ‘Forecasting time series with
complex seasonal patterns using exponential smoothing’, Journal of the American statistical
association 106(496), 1513–1527.
Deng, J., Dong, W., Socher, R., Li, L. J., Li, K. and Li, F. F. (2009), Imagenet: A large-
scale hierarchical image database, in ‘IEEE Conference on Computer Vision and Pattern
Recognition’.
Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Ning, Z., Tzeng, E., Darrell, T., Donahue, J., Jia,
Y. and Vinyals, O. (2014), Decaf: A deep convolutional activation feature for generic visual
recognition, in ‘International Conference on International Conference on Machine Learning’.
Doornik, J. A., Castle, J. L. and Hendry, D. F. (2020), ‘Card forecasts for m4’, International
Journal of Forecasting 36(1), 129–134.
Eckmann, J.-P., Kamphorst, S. O. and Ruelle, D. (1987), ‘Recurrence plots of dynamical sys-
tems’, EPL (Europhysics Letters) 4(9), 973.
Fiorucci, J. A. and Louzada, F. (2020), ‘Groec: combination method via generalized rolling
origin evaluation’, International Journal of Forecasting 36(1), 105–109.
Fulcher, B. D. (2018), Feature-based time-series analysis, in ‘Feature engineering for machine
learning and data analytics’, CRC Press, pp. 87–116.
Fulcher, B. D., Little, M. A. and Jones, N. S. (2013), ‘Highly comparative time-series analy-
sis: the empirical structure of time series and their methods’, Journal of the Royal Society
Interface 10(83), 20130048.
Fulcher, B. and Jones, N. (2014), ‘Highly comparative feature-based time-series classification’,
IEEE Transactions on Knowledge and Data Engineering 26(12), 3026–3037.
Ge, W. and Yu, Y. (2017), Borrowing treasures from the wealthy: Deep transfer learning
through selective joint fine-tuning, in ‘Computer Vision and Pattern Recognition’.
Han, D., Liu, Q. and Fan, W. (2018), ‘A new image classification method using cnn transfer
learning and web data augmentation’, Expert Systems With Applications 95, 43–56.
Hatami, N., Gavet, Y. and Debayle, J. (2017), ‘Bag of recurrence patterns representation for
time-series classification’, Pattern Analysis and Applications pp. 1–11.
He, K., Zhang, X., Ren, S. and Sun, J. (2016), Deep residual learning for image recognition, in
‘Proceedings of the IEEE conference on computer vision and pattern recognition’, pp. 770–
778.
Hyndman, R., Athanasopoulos, G., Bergmeir, C., Caceres, G., Chhay, L., O’Hara-Wild, M.,
Petropoulos, F., Razbash, S., Wang, E. and Yasmeen, F. (2019), forecast: Forecasting func-
tions for time series and linear models. R package version 8.5.
URL: http://pkg.robjhyndman.com/forecast
Hyndman, R. J. and Khandakar, Y. (2008), ‘Automatic time series forecasting: the forecast
package for R’, Journal of Statistical Software 26(3), 1–22.
Hyndman, R. J., Koehler, A. B., Snyder, R. D. and Grose, S. (2002), ‘A state space framework
for automatic forecasting using exponential smoothing methods’, International Journal of
Forecasting 18(3), 439–454.
Hyndman, R. J., Wang, E. and Laptev, N. (2015), Large-scale unusual time series detection, in
‘Proceedings of the IEEE International Conference on Data Mining’, Atlantic City, NJ, USA.
14–17 November 2015.
Kang, Y., Hyndman, R. J. and Li, F. (2020), ‘GRATIS: GeneRAting TIme Series with diverse
and controllable characteristics’, Statistical Analysis and Data Mining (in press).
URL: https://doi.org/10.1002/sam.11461
Kang, Y., Hyndman, R. J. and Smith-Miles, K. (2017), ‘Visualising forecasting algorithm perfor-
mance using time series instance spaces’, International Journal of Forecasting 33(2), 345–358.
Krizhevsky, A., Sutskever, I. and Hinton, G. E. (2012), 'ImageNet classification with deep convolutional neural networks', Advances in Neural Information Processing Systems 25.
Laptev, N., Yosinski, J., Li, L. E. and Smyl, S. (2017), Time-series extreme event forecasting
with neural networks at uber, in ‘International Conference on Machine Learning’, Vol. 34,
pp. 1–5.
Lazebnik, S., Schmid, C. and Ponce, J. (2006), Beyond bags of features: Spatial pyramid match-
ing for recognizing natural scene categories, in ‘2006 IEEE Computer Society Conference on
Computer Vision and Pattern Recognition (CVPR’06)’, Vol. 2, IEEE, pp. 2169–2178.
Lowe, D. G. (1999), Object recognition from local scale-invariant features, in ‘Computer vi-
sion, 1999. The proceedings of the seventh IEEE international conference on’, Vol. 2, IEEE,
pp. 1150–1157.
Lowe, D. G. (2004), ‘Method and apparatus for identifying scale invariant features in an image
and use of same for locating an object in an image’. US Patent 6,711,293.
Maaten, L. v. d. (2014), ‘Accelerating t-SNE using tree-based algorithms’, The Journal of
Machine Learning Research 15(1), 3221–3245.
Makridakis, S. and Hibon, M. (2000), ‘The M3-Competition: results, conclusions and implica-
tions’, International Journal of Forecasting 16(4), 451–476.
Makridakis, S., Spiliotis, E. and Assimakopoulos, V. (2020), ‘The M4 competition: 100,000 time
series and 61 forecasting methods’, International Journal of Forecasting 36(1), 54–74.
Meade, N. (2000), ‘Evidence for the selection of forecasting methods’, Journal of Forecasting
19(6), 515–535.
Montero-Manso, P., Athanasopoulos, G., Hyndman, R. J. and Talagala, T. S. (2020),
‘FFORMA: Feature-based forecast model averaging’, International Journal of Forecasting
36(1), 86 – 92.
Nanopoulos, A., Alcock, R. and Manolopoulos, Y. (2001), ‘Feature-based classification of time-
series data’, International Journal of Computer Research 10(3).
Pan, S. J. and Yang, Q. (2010), 'A survey on transfer learning', IEEE Transactions on Knowledge and Data Engineering 22(10), 1345–1359.
Pawlikowski, M. and Chorowska, A. (2020), ‘Weighted ensemble of statistical models’, Interna-
tional Journal of Forecasting 36(1), 93–97.
Petropoulos, F., Makridakis, S., Assimakopoulos, V. and Nikolopoulos, K. (2014), “Horses for
courses’ in demand forecasting’, European Journal of Operational Research 237(1), 152–163.
Petropoulos, F. and Svetunkov, I. (2020), ‘A simple combination of univariate models’, Inter-
national journal of forecasting 36(1), 110–115.
Razavian, A. S., Azizpour, H., Sullivan, J. and Carlsson, S. (2014), 'CNN features off-the-shelf: an astounding baseline for recognition'.
Shah, C. (1997), ‘Model selection in univariate time series forecasting using discriminant anal-
ysis’, International Journal of Forecasting 13(4), 489–500.
Shaub, D. (2020), ‘Fast and accurate yearly time series forecasting with forecast combinations’,
International Journal of Forecasting 36(1), 116–120.
Simonyan, K. and Zisserman, A. (2014), 'Very deep convolutional networks for large-scale image recognition', arXiv preprint arXiv:1409.1556.
Smyl, S. (2020), ‘A hybrid method of exponential smoothing and recurrent neural networks for
time series forecasting’, International Journal of Forecasting 36(1), 75–85.
Svetunkov, I. and Kourentzes, N. (2018), ‘Complex exponential smoothing for seasonal time
series’.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke,
V. and Rabinovich, A. (2015), Going deeper with convolutions, in ‘Proceedings of the IEEE
conference on computer vision and pattern recognition’, pp. 1–9.
Talagala, P. D., Hyndman, R. J., Smith-Miles, K., Kandanaarachchi, S. and Muñoz, M. A. (2019), 'Anomaly detection in streaming nonstationary temporal data', Journal of Computational and Graphical Statistics (in press), 1–28.
Talagala, T., Li, F. and Kang, Y. (2019), ‘FFORMPP: Feature-based forecast model perfor-
mance prediction’, arXiv 1908.11500.
URL: https://arxiv.org/abs/1908.11500
Talagala, T. S., Hyndman, R. J. and Athanasopoulos, G. (2018), Meta-learning how to fore-
cast time series, Working paper 6/18, Monash University, Department of Econometrics and
Business Statistics.
Thiel, M., Romano, M. C. and Kurths, J. (2004), ‘How much information is contained in a
recurrence plot?’, Physics Letters A 330(5), 343–349.
Thomson, M. E., Pollock, A. C., Onkal, D. and Gonul, M. S. (2019), ‘Combining forecasts:
Performance and coherence’, International Journal of Forecasting 35(2), 474–484.
Vincent, P., Larochelle, H., Bengio, Y. and Manzagol, P.-A. (2008), Extracting and compos-
ing robust features with denoising autoencoders, in ‘Proceedings of the 25th international
conference on Machine learning’, pp. 1096–1103.
Wang, J., Liu, P., She, M. F., Nahavandi, S. and Kouzani, A. (2013), ‘Bag-of-words repre-
sentation for biomedical time series classification’, Biomedical Signal Processing and Control
8(6), 634–644.
Wang, J., Yang, J., Yu, K., Lv, F., Huang, T. and Gong, Y. (2010), Locality-constrained linear
coding for image classification, in ‘Computer Vision and Pattern Recognition (CVPR), 2010
IEEE Conference on’, IEEE, pp. 3360–3367.
Wang, X., Smith, K. A. and Hyndman, R. J. (2006), ‘Characteristic-based clustering for time
series data’, Data Mining and Knowledge Discovery 13(3), 335–364.
Wang, Z. and Oates, T. (2015), Imaging time-series to improve classification and imputation,
in ‘Proceedings of the 24th International Conference on Artificial Intelligence’, AAAI Press,
pp. 3939–3945.
Appendices
A. Experimental setup for the SoBF and CNN models
In the traditional image processing method with SIFT, we need to obtain the basic descriptors
before the linear coding. We choose k = 200 as the number of clusters, and the 200 centroid
coordinates are used as the coordinates of the basic descriptors. For each descriptor, we select
the 5 closest descriptors from the 200 basic descriptors with the K-nearest neighbors (KNN)
algorithm, and set the adjustment factor λ = e^4 in LLC. We set 1, 2 and 4 as the SPM
parameters, so that the image is split into 1 × 1, 2 × 2 and 4 × 4 subimages, respectively. To
eliminate range differences across time series, we further apply the min-max transformation to
each time series before computing its recurrence plot. The threshold parameter ε for recurrence
plot generation is set to 0.5.
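For concreteness, the following Python sketch converts a min-max scaled series into a thresholded recurrence plot with ε = 0.5. It is a minimal illustration rather than our exact implementation: the phase space is simplified to the one-dimensional trajectory of the scaled observations, and the function name is ours.

```python
import numpy as np

def recurrence_plot(x, eps=0.5):
    """Binary recurrence plot of a univariate series (illustrative sketch).

    x   : 1-D array, the time series.
    eps : recurrence threshold applied to the min-max scaled series.
    """
    x = np.asarray(x, dtype=float)
    # Min-max scale the series to [0, 1] to eliminate range differences.
    x = (x - x.min()) / (x.max() - x.min())
    # Pairwise distances between all points of the (1-D) trajectory.
    dist = np.abs(x[:, None] - x[None, :])
    # A dot is drawn wherever two states are closer than eps.
    return (dist <= eps).astype(np.uint8)

# Example: an m x m recurrence plot for a noisy sine wave.
t = np.linspace(0, 8 * np.pi, 200)
rp = recurrence_plot(np.sin(t) + 0.1 * np.random.randn(200), eps=0.5)
print(rp.shape)  # (200, 200)
```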
Table 7. Optimal parameters of XGBoost on the M4 competition dataset.

Method                    max depth  learning rate  sample proportion  feature proportion
SIFT                      14         0.575          0.916              0.767
CNN
  Inception-v1+XGBoost    15         0.600          0.920              0.810
  ResNet-v1-101+XGBoost   20         0.660          0.892              0.871
  ResNet-v1-50+XGBoost    18         0.640          0.960              0.850
  VGG-19+XGBoost          12         0.530          0.940              0.830
The output dimensions of the pre-trained CNN models are as follows.
Inception-v1: 1024.
ResNet-v1-101: 2048.
ResNet-v1-50: 2048.
VGG-19: 1000.
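As an illustration of how such fixed-dimensional features can be obtained, the sketch below runs a recurrence-plot image through a pre-trained network with its classification head removed; it uses torchvision's ImageNet ResNet-50 weights as a stand-in for whichever framework and checkpoints the original pipeline used, and the random binary image is a placeholder for an actual recurrence plot.

```python
import numpy as np
import torch
from PIL import Image
from torchvision.models import resnet50, ResNet50_Weights

# Load an ImageNet pre-trained ResNet-50 and replace its classification
# head with the identity, so the forward pass returns the pooled
# 2048-dimensional feature vector instead of class logits.
weights = ResNet50_Weights.IMAGENET1K_V1
model = resnet50(weights=weights)
model.fc = torch.nn.Identity()
model.eval()

preprocess = weights.transforms()  # resizing + ImageNet normalization

# Placeholder for a recurrence-plot image: a random binary m x m matrix,
# replicated to three channels since the network expects RGB input.
rp = (np.random.rand(200, 200) > 0.5).astype(np.uint8) * 255
img = Image.fromarray(rp).convert("RGB")

with torch.no_grad():
    features = model(preprocess(img).unsqueeze(0))
print(features.shape)  # torch.Size([1, 2048])
```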
B. Experimental setup for XGBoost
To set the optimal parameters for XGBoost, we perform a search over subspaces of the hyper-
parameter space, measuring the OWA via 10-fold cross-validation on the training data. We
describe the hyper-parameters and the search ranges of the cross-validation procedure as follows;
a schematic implementation of the search is sketched after the list.
max depth: The maximum depth of a tree; it ranges from 6 to 25.
learning rate: The learning rate, which scales the contribution of each tree; it ranges from
0.01 to 1.
sample proportion: The proportion of the training set used to build the trees in each
iteration; it ranges from 0.7 to 1.
feature proportion: The proportion of features used to build the trees in each iteration;
it ranges from 0.7 to 1.
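The sketch below illustrates such a search with a random search over the ranges above and 10-fold cross-validation. It is a simplified stand-in for our setup: it scores candidates by RMSE on a generic regression target rather than by the OWA (which requires the full forecast-evaluation pipeline), the placeholder data and the fixed number of trees are our own choices, and the quantities above are mapped to the xgboost Python API names max_depth, learning_rate, subsample and colsample_bytree.

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

def cv_score(params, X, y, n_splits=10):
    """Mean validation RMSE of an XGBoost model under 10-fold CV.

    RMSE is a stand-in for the OWA criterion used in the paper.
    """
    scores = []
    for train_idx, val_idx in KFold(n_splits, shuffle=True, random_state=0).split(X):
        # n_estimators is fixed here for the sketch only.
        model = xgb.XGBRegressor(**params, n_estimators=200)
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[val_idx])
        scores.append(np.sqrt(np.mean((pred - y[val_idx]) ** 2)))
    return np.mean(scores)

# Random search over the ranges given above, on placeholder data.
X, y = rng.normal(size=(500, 20)), rng.normal(size=500)
best = None
for _ in range(50):
    params = {
        "max_depth": int(rng.integers(6, 26)),    # max depth: 6 to 25
        "learning_rate": rng.uniform(0.01, 1.0),  # learning rate: 0.01 to 1
        "subsample": rng.uniform(0.7, 1.0),       # sample proportion: 0.7 to 1
        "colsample_bytree": rng.uniform(0.7, 1.0) # feature proportion: 0.7 to 1
    }
    score = cv_score(params, X, y)
    if best is None or score < best[0]:
        best = (score, params)
print(best)
```

In the actual experiments, the cross-validated criterion is the OWA of the combined forecasts produced from the model output, as described above.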
Table 7 reports the optimal parameters of XGBoost on the M4 competition dataset. In this
experiment, we train XGBoost on the data of all periods together and, as a result, obtain a
single set of optimal parameters.
Table 8 shows the optimal parameters of XGBoost for the yearly, quarterly and monthly data
of the Tourism competition dataset. Due to the small size of the Tourism dataset, we use the
M4 data of the corresponding periods as the training data. Hence, we obtain three groups of
optimal parameters for yearly, quarterly and monthly data, respectively.
Table 8. Optimal parameters of XGBoost on the Tourism competition dataset.

Method                    max depth  learning rate  sample proportion  feature proportion
Yearly
  SIFT                    25         1.000          0.747              1.000
  CNN
    Inception-v1+XGBoost  12         0.907          0.700              1.000
    ResNet-v1-101+XGBoost 6          1.000          0.967              0.866
    ResNet-v1-50+XGBoost  7          0.872          0.747              0.976
    VGG-19+XGBoost        8          0.877          0.960              0.710
Quarterly
  SIFT                    12         0.880          0.851              0.861
  CNN
    Inception-v1+XGBoost  17         0.856          1.000              0.700
    ResNet-v1-101+XGBoost 8          0.985          0.985              0.947
    ResNet-v1-50+XGBoost  14         0.581          0.921              0.781
    VGG-19+XGBoost        11         0.872          0.858              0.764
Monthly
  SIFT                    14         0.575          0.916              0.767
  CNN
    Inception-v1+XGBoost  25         1.000          0.861              0.700
    ResNet-v1-101+XGBoost 25         1.000          1.000              1.000
    ResNet-v1-50+XGBoost  14         1.000          1.000              0.705
    VGG-19+XGBoost        17         0.842          0.935              0.913