Forecasting with time series imaging
Xixi Li^{a,1}, Yanfei Kang^{a,1}, Feng Li^{b,*}
a School of Economics and Management, Beihang University, Beijing 100191, China.
b School of Statistics and Mathematics, Central University of Finance and Economics, Beijing 102206, China.

* Corresponding author.
Email addresses: lixixi199407@buaa.edu.cn (Xixi Li), yanfeikang@buaa.edu.cn (Yanfei Kang), feng.li@cufe.edu.cn (Feng Li)
URL: https://orcid.org/0000-0001-5846-3460 (Xixi Li), https://orcid.org/0000-0001-8769-6650 (Yanfei Kang), https://orcid.org/0000-0002-4248-9778 (Feng Li)
1 The authors contributed equally.
Abstract
Feature-based time series representations have attracted substantial attention in a wide range of time series analysis methods. Recently, the use of time series features for forecast model averaging has been an emerging research focus in the forecasting community. Nonetheless, most of the existing approaches depend on the manual choice of an appropriate set of features. Exploiting machine learning methods to extract features from time series automatically becomes crucial in state-of-the-art time series analysis. In this paper, we introduce an automated approach to extract time series features based on time series imaging. We first transform time series into recurrence plots, from which local features can be extracted using computer vision algorithms. The extracted features are used for forecast model averaging. Our experiments show that forecasting based on automatically extracted features, with less human intervention and a more comprehensive view of the raw time series data, yields performance highly comparable with the best methods on the largest forecasting competition dataset (M4) and outperforms the top methods on the Tourism forecasting competition dataset.
Keywords: Forecasting, Time series imaging, Time series feature extraction, Recurrence
plots, Forecast combination.
1. Introduction
Time series features are a collection of statistical representations of time series characteristics. Feature-based time series representation has attracted remarkable attention in a wide range of data mining tasks for time series. Most time series problems, including time series clustering (e.g., Wang et al., 2006; Bandara et al., 2020), classification (e.g., Fulcher and Jones, 2014; Nanopoulos et al., 2001) and anomaly detection (e.g., Hyndman et al., 2015; Talagala, Hyndman, Smith-Miles, Kandanaarachchi and Muñoz, 2019; Corizzo et al., 2020),
are eventually attributed to the quantification of similarity among time series data using time
series feature representations. Fulcher (2018) presents thousands of interpretable features that can be used to represent a time series, such as global features, subsequence features and other hybrid features, for classifying time series (Fulcher and Jones, 2014) and labeling the emotional content of speech (Fulcher et al., 2013). Christ et al. (2018) compute 794 time series features based on hypothesis tests and illustrate their applications in time series anomaly detection and classification. Another line of approaches to time series feature extraction is via auto-encoder models (e.g., Vincent et al., 2008). Corizzo et al. (2020) further exploit time series features extracted from auto-encoder models for gravitational wave detection. Other recent studies use auto-encoder models for feature representation in time series forecasting (e.g., Laptev et al., 2017; Abdollahi et al., 2020).
Instead of the traditional time series forecasting procedure – fitting a model to the historical
data and simulating future data based on the fitted model, selecting the most appropriate
forecasting model or averaging a number of candidate models based on time series features has
been a popular alternative approach. In the last few decades, many attempts have been made at feature-based model selection and averaging procedures for univariate time series forecasting. For example, Collopy and Armstrong (1992) provided 99 rules using 18 features to combine four extrapolation methods by examining a rule base to forecast annual economic and demographic time series; Arinze (1994) described the use of artificial intelligence techniques to improve forecasting accuracy, building an induction tree that maps time series features to the most accurate forecasting method; Shah (1997) constructed several individual selection rules for forecasting using discriminant analysis based on 26 time series features; Meade (2000)
used 25 summary statistics of time series as explanatory variables in predicting the relative
performances of nine forecasting methods based on a set of simulated time series with known
properties; Petropoulos et al. (2014) proposed “horses for courses” and measured the effects of
seven time series features on the forecasting performances of 14 popular forecasting methods
on the monthly data in the M3 dataset (Makridakis and Hibon, 2000); more recently, Kang
et al. (2017) visualized the performances of different forecasting methods in a two-dimensional
principal component feature space and provided a preliminary understanding of their relative
performances. Talagala et al. (2018) presented a general framework for forecast model selection
using meta-learning in which they utilize a random forest algorithm to select the best forecasting
method based on time series features. Montero-Manso et al. (2020) trained a meta-model to
obtain the weights of various forecasting methods and made a weighted forecasting combination.
The input of the meta-model is a set of features calculated on the training data, while the output is a group of weights assigned to each candidate forecasting method. Their method ranked second in the M4 competition (Makridakis et al., 2020).
Having revisited the literature on feature-based time series forecasting, we find that (i) although researchers often highlight the usefulness of time series features in selecting the best forecasting method, most of the existing approaches depend on the manual choice of an appropriate set of features, which ties the forecasting process to the data at hand and the expertise of the forecaster and therefore makes it inflexible (Fulcher, 2018); and, more importantly, (ii) the current literature on feature-based forecasting focuses on global features of time series, leaving local characteristics under-emphasized. In some instances, the local dynamics of a time series carry important information, such as signs of heart failure in medical signals or irregular weather changes. Therefore, automated feature extraction from time series data becomes vital. Inspired by the recent work of Hatami et al. (2017) and Wang and Oates (2015) on time series classification tasks, this paper explores time series forecasting based on model averaging with the idea of time series imaging, from which both global and local features of the time series can be automatically extracted using computer vision algorithms. This novel approach to time series forecasting is more flexible than forecasting based on manually curated time series features.
The rest of the paper is organized as follows. Section 2 presents our feature extraction method based on time series imaging. In Section 3, we describe how to assign weights to a group of candidate forecasting methods using imaging-based time series features and perform forecast combination accordingly. Section 4 applies our imaging-based time series forecast combination method to two large collections of real datasets, namely the M4 competition dataset and the Tourism competition dataset. Section 5 provides our discussions and insights, as well as several possible future research directions. Section 6 concludes the paper.
2. Time series imaging and feature extraction
In this paper, we extract time series features based on time series imaging in two steps. In the first step, we encode the time series into images using recurrence plots. In the second step, time series features are extracted from the images using image processing techniques. We consider two different image feature extraction approaches: the spatial bag-of-features (SBoF) model and convolutional neural networks (CNNs). We describe the details in the following sections.
2.1. Time series imaging
We use recurrence plots (RPs) to encode time series data into images. An RP provides a way to visualize the periodic nature of a trajectory through a phase space (Eckmann et al., 1987) and can contain all relevant dynamical information in the time series (Thiel et al., 2004). A recurrence plot of a time series x, showing when the time series revisits a previous state, can be formulated as
\[
R(i, j) = \Theta\left(\epsilon - \| x_i - x_j \|\right),
\]
where R(i, j) is the element of the recurrence matrix R, i indexes time on the x-axis of the recurrence plot, and j indexes time on the y-axis. \(\epsilon\) is a predefined threshold, and \(\Theta(\cdot)\) is the Heaviside step function. In short, one draws a black dot whenever \(x_i\) and \(x_j\) are closer than \(\epsilon\). An un-thresholded RP avoids the binary output but is difficult to quantify. We therefore use the following modified RP, which balances the binary and un-thresholded versions:
\[
R(i, j) =
\begin{cases}
\epsilon, & \| x_i - x_j \| > \epsilon, \\
\| x_i - x_j \|, & \text{otherwise}.
\end{cases}
\]
It takes more values than a binary RP and results in colored plots. Fig. 1 shows three typical examples of recurrence plots. They reveal different recurrence patterns for time series with randomness, periodicity, chaos, and trend. We can see that the recurrence plots shown in the right column well depict the pre-defined patterns in the time series shown in the left column.
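To make the construction concrete, the following minimal Python sketch computes the modified recurrence matrix for a univariate series. It assumes the series has been min-max scaled to [0, 1] beforehand and uses the threshold ε = 0.5, consistent with the setup in Appendix A.

```python
import numpy as np

def recurrence_plot(x, eps=0.5):
    """Modified recurrence plot: pairwise distances clipped at eps.

    Assumes x is min-max scaled to [0, 1]; eps = 0.5 follows Appendix A.
    """
    x = np.asarray(x, dtype=float)
    d = np.abs(x[:, None] - x[None, :])  # pairwise distances |x_i - x_j|
    return np.where(d > eps, eps, d)     # eps where far apart, the distance otherwise
```

The resulting matrix can then be rendered as a colored image (e.g., with a standard heatmap function) to produce plots like those in Fig. 1.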
2.2. Feature extraction with the SBoF model
We propose an image-based time series feature extraction framework using the spatial bag-of-features (SBoF) model. As shown in Fig. 2, the framework consists of three steps: (i) detect key points with the scale-invariant feature transform (SIFT) algorithm (Lowe, 1999) and find basic descriptors with k-means; (ii) generate the representation based on the locality-constrained linear coding (LLC) method (Wang et al., 2010); and (iii) extract spatial information via spatial pyramid matching (SPM) and pooling. We describe each step in detail below.

The original bag-of-features (BoF) model, which extracts features from one-dimensional signal segments, has achieved great success in time series classification (Baydogan et al., 2013; Wang et al., 2013). Hatami et al. (2017) transformed a time series into a two-dimensional recurrence image with a recurrence plot (Eckmann et al., 1987) and then applied the BoF model. Extracting time series features is then equivalent to identifying key points in images, which are called key descriptors. A promising algorithm for this is SIFT (Lowe, 1999), which detects and describes local features in images by identifying the maxima/minima of the difference of Gaussians (DoG) across the multiscale spaces of an image as its key descriptors. It consists of the following four steps, after which we give a brief code sketch of the detection and codebook construction.
Figure 1. Typical examples of recurrence plots (right column) for time series data with different patterns (left
column): uncorrelated stochastic data, i.e., white noise (top), a time series with periodicity and chaos (middle),
and a time series with periodicity and trend (bottom).
Figure 2. Image-based time series feature extraction with spatial bag-of-features model. It consists of four steps:
(i) encode a time series as an image with recurrence plots; (ii) detect key points with SIFT and obtain the basic
descriptors with k-means for the codebook; (iii) generate the representation based on LLC; and (iv) extract
spatial information via SPM and max pooling.
1. Detect extreme values in the scale spaces. We search over all the scale spaces and use the difference-of-Gaussians method to identify potential interest points, selecting those invariant to scale and orientation.
2. Find the key points. The position and scale are determined by fitting a model at each candidate position, and the key points are selected according to their stability.
3. Assign feature directions. This step assigns each key point one or more directions based on the local gradient direction of the image. All subsequent operations transform the direction, scale, and position of the key points to allow for invariance in the features.
4. Describe key points. Within the neighborhood around each feature point, the local gradient of the image is measured at selected scales and transformed into a representation that allows for larger local shape deformations and illumination changes. The SIFT method uses a 128-dimensional vector to characterize the key descriptors in an image. First, an 8-direction histogram is established in each 4 × 4 subregion, and 16 subregions around the key point are used. We then calculate the magnitude and direction of each pixel's gradient and add it to the corresponding subregion. In the end, a 128-dimensional histogram-based descriptor is generated.
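The detection and codebook-construction step, i.e., step (i) of the SBoF pipeline, can be sketched as follows. This is a minimal implementation assuming OpenCV's SIFT implementation and scikit-learn's k-means with k = 200 clusters as in Appendix A, not necessarily the exact toolchain used in our experiments.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(images, n_words=200):
    """Detect SIFT key descriptors on grayscale uint8 images and cluster
    them with k-means into a codebook of basic descriptors."""
    sift = cv2.SIFT_create()
    all_desc = []
    for img in images:
        _, desc = sift.detectAndCompute(img, None)  # desc: (n_i, 128) or None
        if desc is not None:
            all_desc.append(desc)
    kmeans = KMeans(n_clusters=n_words, n_init=10).fit(np.vstack(all_desc))
    return kmeans.cluster_centers_  # codebook B of shape (n_words, 128)
```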
The LLC method utilizes locality constraints to project each descriptor onto its local coordinate system (Wang et al., 2010), and the projected coordinates are integrated by max pooling to generate the final representation:
\[
\min_{c} \sum_{i=1}^{N} \| x_i - B c_i \|^2 + \lambda \| d_i \odot c_i \|^2, \quad \text{s.t.} \ \mathbf{1}^\top c_i = 1, \ \forall i, \tag{1}
\]
where \(d_i = \exp(\mathrm{dist}(x_i, B)/\sigma)\) and \(x_i \in \mathbb{R}^{128}\) is the vector of one descriptor. The basic descriptors \(B \in \mathbb{R}^{128 \times M}\) are obtained by k-means clustering. The representation parameters \(c_i\) obtained through Equation (1) are used as time series representations. The locality adaptor \(d_i\) gives a different degree of freedom to each basis vector, proportional to its similarity to the input descriptor. We use σ to adjust the weight decay speed of the locality adaptor, and λ is the adjustment factor. However, in reality, the number of descriptors obtained by the SIFT algorithm is usually huge. To address this problem, Wang et al. (2010) proposed an incremental codebook optimization method for LLC.
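In practice, the approximated LLC variant codes each descriptor against only its k nearest basic descriptors (k = 5 in Appendix A), which admits a closed-form solution (Wang et al., 2010). A minimal sketch of that solution follows; the regularization constant beta is an illustrative choice.

```python
import numpy as np

def llc_code(x, B, k=5, beta=1e-4):
    """Approximated LLC coding of one descriptor x (128,) against a
    codebook B (M, 128), using its k nearest basic descriptors."""
    idx = np.argsort(np.linalg.norm(B - x, axis=1))[:k]  # k nearest codewords
    z = B[idx] - x                        # shift the local bases to the origin
    C = z @ z.T                           # local covariance matrix, (k, k)
    C += beta * np.trace(C) * np.eye(k)   # regularize for numerical stability
    c = np.linalg.solve(C, np.ones(k))
    c /= c.sum()                          # enforce the constraint 1'c = 1
    code = np.zeros(B.shape[0])
    code[idx] = c
    return code                           # sparse code of length M
```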
The bag-of-features model calculates the distribution characteristics of feature points over the whole image and then generates a global histogram. As a result, the image's spatial distribution information is lost, and the image may not be accurately described. To recover the spatial information, we apply the spatial pyramid matching (SPM) method, which computes statistics of image feature points at different resolutions and has achieved high accuracy on a large dataset of 15 natural scene categories (Lazebnik et al., 2006). The image is divided into progressively finer grids at each level of the pyramid, and features are derived from each grid cell and combined into one large feature vector. Fig. 3 depicts the SPM and max pooling process. In this task, we divide the image into 1 × 1, 2 × 2 and 4 × 4 grids, and thus obtain 21 subregions. To obtain the representation for each subregion, we first collect its descriptors. Suppose that we obtain 12 descriptors, denoted by \(D_i \in \mathbb{R}^{12 \times 200}\), for the third region (the dimension of the local linear representation of a descriptor is 200). We can then take the maximum value of every dimension of \(D_i\). After max pooling, we obtain the feature representation \(f_i \in \mathbb{R}^{200 \times 1}\) for the third region. The feature representations of the other twenty regions are obtained in the same way. Finally, the 21 features are concatenated into the final representation of the time series, so the final size of the feature vector is 21 × 200 = 4200. More details about the experimental setup of the SBoF model can be found in Appendix A.
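The pyramid pooling step can be sketched as follows, assuming the key-point pixel coordinates and their LLC codes are available; the region indexing details are illustrative.

```python
import numpy as np

def spm_max_pool(coords, codes, img_size, grids=(1, 2, 4)):
    """Spatial pyramid max pooling of LLC codes.

    coords: (n, 2) array of key-point (x, y) pixel positions.
    codes:  (n, M) array of LLC codes, one row per key point.
    """
    h, w = img_size
    pooled = []
    for g in grids:  # 1x1, 2x2 and 4x4 partitions: 21 regions in total
        rows = np.minimum(coords[:, 1] * g // h, g - 1).astype(int)
        cols = np.minimum(coords[:, 0] * g // w, g - 1).astype(int)
        for r in range(g):
            for c in range(g):
                region = codes[(rows == r) & (cols == c)]
                pooled.append(region.max(axis=0) if len(region)
                              else np.zeros(codes.shape[1]))
    return np.concatenate(pooled)  # length 21 * M (4200 for M = 200)
```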
Figure 3. Spatial pyramid matching and max pooling. The image is divided into progressively finer grids at each level of the pyramid, and features are derived from each grid cell and combined into one large feature vector. We divide the image into 1 × 1, 2 × 2 and 4 × 4 grids, and thus obtain 21 subregions. We first obtain the descriptors for each region. Suppose that we obtain 12 descriptors, denoted by \(D_i \in \mathbb{R}^{12 \times 200}\), for the third region (200 is the dimension of the local linear representation of a descriptor). Then, we can take the maximum value of every dimension of \(D_i\). After max pooling, we obtain the feature representation \(f_i \in \mathbb{R}^{200 \times 1}\) for the third region. The feature representations of the other 20 regions are obtained in the same way. Finally, the 21 features are linked together for the final representation of the time series.
2.3. Feature extraction with fine-tuned deep neural networks
An alternative to SBoF for image feature extraction is to use a deep CNN, which has achieved great breakthroughs in image processing (Krizhevsky et al., 2012). For example, Berkeley researchers (Donahue et al., 2014) proposed a feature extraction method called DeCAF (a deep convolutional activation feature for generic visual recognition), using deep convolutional neural networks directly for feature extraction. Their experimental results show that the extracted features yield higher accuracy than traditional image features. In addition, some researchers (e.g., Razavian et al., 2014) use the features acquired by convolutional neural networks as the input of an image classifier, which significantly improves image classification accuracy.

Nonetheless, the performance of neural networks heavily depends on the setting of the network structure and the hyper-parameters, and deeper architectures are often essential for achieving higher performance in a task. As a result, extensive computational power is needed. An appealing feature of our time series imaging approach is that a large number of well pre-trained neural network models for image classification exist. We can easily transfer such a model to our task via transfer learning (Pan and Yang, 2010), which has recently been widely used in a variety of fields such as image classification (Han et al., 2018) and natural language processing (Ahmad et al., 2020). To simplify our task, we use the fine-tuning approach (Ge and Yu, 2017) from the field of transfer learning. In short, it takes a pre-trained network and adapts it to our task: we fix the parameters of the earlier layers of the model pre-trained on ImageNet data and fine-tune the last few layers. In general, layers closer to the input extract more general features, while layers closer to the output extract features more specific to the classification task. In this way, network training can be significantly accelerated.
Fig. 4 shows the framework of transfer learning with fine-tuning. In this task, the deep network is trained on the large ImageNet dataset (Deng et al., 2009), and the pre-trained network is publicly available. Specifically, we fix the weights of all the layers of the pre-trained network except for the last fully connected layers and then use our time series images as inputs. Finally, the high-dimensional features of the time series images can be obtained from the pre-trained network. We consider the following representative architectures in our experiments: ResNet-v1-101 (He et al., 2016), ResNet-v1-50 (He et al., 2016), Inception-v1 (Szegedy et al., 2015), and VGG-19 (Simonyan and Zisserman, 2014). The dimensions of the time series features obtained from the pre-trained ResNet-v1-101, ResNet-v1-50, Inception-v1 and VGG-19 architectures are 2048, 2048, 1024 and 1000, respectively. More details about the experimental setup of the CNN-based feature extraction can be found in Appendix A.
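As an illustration of this feature extraction step, the following PyTorch/torchvision sketch (assuming the torchvision ≥ 0.13 weights API) extracts the pooled activations of a frozen ResNet-50 pre-trained on ImageNet, yielding a 2048-dimensional feature vector per recurrence-plot image. It is a minimal sketch of the idea rather than the exact framework or checkpoints used in our experiments; fine-tuning the last layer would attach a small trainable head on top of the same frozen extractor.

```python
import torch
from torchvision import models, transforms

# Load a ResNet-50 pre-trained on ImageNet and drop the final fully
# connected layer, keeping the 2048-dimensional pooled activations.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
extractor = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()
for p in extractor.parameters():
    p.requires_grad = False  # freeze all pre-trained layers

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def image_features(pil_image):
    x = preprocess(pil_image).unsqueeze(0)  # shape (1, 3, 224, 224)
    with torch.no_grad():
        f = extractor(x)                    # shape (1, 2048, 1, 1)
    return f.flatten().numpy()              # 2048-dimensional feature vector
```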
Figure 4. Framework of transfer learning with fine-tuning. Classic CNN models are trained on a large dataset (ImageNet). In a CNN, layers closer to the input extract more general features, while layers closer to the output extract features more specific to the classification task. To extract time series features, we fix the parameters of all the layers except for the last fully connected layer and fine-tune that layer for our task. With the trained model, we obtain the final representation of the time series.
3. Time series forecasting with image features
We aim to find the optimal combination of a pool of candidate forecasting methods. The essence is to link the knowledge of forecasting errors from different forecasting methods to time series features. Therefore, in this section, we focus on the mapping from the time series image features to the performances of the forecasting methods. We use nine of the most popular time series forecasting methods as candidates for forecast combination, which are also used in many recent studies (Montero-Manso et al., 2020; Talagala, Li and Kang, 2019; Kang et al., 2020). They are the automated ARIMA algorithm (ARIMA), automated exponential smoothing algorithm (ETS), feed-forward neural network using autoregressive inputs (NNET-AR), exponential smoothing state space model with a Box-Cox transformation (TBATS), seasonal and trend decomposition using LOESS with AR modeling of the seasonally adjusted series (STLM-AR), random walk with drift (RW-DRIFT), theta method (THETA), naïve (NAIVE), and seasonal naïve (SNAIVE). They are described in Table 1 and implemented in the R package forecast (Hyndman et al., 2019).
To validate the effectiveness of our image features of the time series, we follow the work of Montero-Manso et al. (2020), who proposed a model-averaging method based on 42 manually curated time series features and won second place in the M4 competition (Makridakis et al., 2020), to obtain the weights for forecast combination based on our image features. To make our proposed method comparable with those in M4, we use the overall weighted average (OWA) to measure forecasting accuracy, as used in the M4 competition. OWA is an overall indicator based on two accuracy measures, the mean absolute scaled error (MASE) and the symmetric mean absolute percentage error (sMAPE). The individual measures are calculated as follows:
\[
\text{sMAPE} = \frac{1}{h} \sum_{t=1}^{h} \frac{2\,|Y_t - \hat{Y}_t|}{|Y_t| + |\hat{Y}_t|},
\qquad
\text{MASE} = \frac{\frac{1}{h} \sum_{t=1}^{h} |Y_t - \hat{Y}_t|}{\frac{1}{n-m} \sum_{t=m+1}^{n} |Y_t - Y_{t-m}|},
\]
\[
\text{OWA} = \frac{1}{2}\left(\text{sMAPE}/\text{sMAPE}_{\text{Naive2}} + \text{MASE}/\text{MASE}_{\text{Naive2}}\right), \tag{2}
\]
where \(Y_t\) is the real value of the time series at point t, \(\hat{Y}_t\) is the point forecast, h is the forecasting horizon, and m is the frequency of the data (e.g., 4 for quarterly series). Naïve2 is equivalent to the naïve (NAIVE) method but applied to the time series adjusted for seasonal factors.
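A direct transcription of Equation (2) into Python is given below as a minimal sketch; note that the M4 competition reports sMAPE as a percentage (multiplied by 100), which does not affect the OWA ratios.

```python
import numpy as np

def smape(y, yhat):
    return np.mean(2 * np.abs(y - yhat) / (np.abs(y) + np.abs(yhat)))

def mase(y, yhat, insample, m):
    # scale by the in-sample one-step error of the seasonal naive method
    scale = np.mean(np.abs(insample[m:] - insample[:-m]))
    return np.mean(np.abs(y - yhat)) / scale

def owa(smape_model, mase_model, smape_naive2, mase_naive2):
    return 0.5 * (smape_model / smape_naive2 + mase_model / mase_naive2)
```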
Our framework for model averaging is shown in Fig. 6. It consists of two parts. In the training process, based on the extracted image features and the OWA values of the nine forecasting methods, we train a feature-based gradient tree boosting model (XGBoost; Chen and Guestrin, 2016) to produce nine weights for forecast model averaging by minimizing the OWA
Table 1. The methods used for forecast combination. All these methods are implemented using the forecast package in R.

ARIMA: The autoregressive integrated moving average model, automatically estimated in the forecast package for R (Hyndman and Khandakar, 2008). R implementation: auto.arima().
ETS: The exponential smoothing state space model (Hyndman et al., 2002). R implementation: ets().
NNET-AR: A feed-forward neural network using autoregressive inputs. R implementation: nnetar().
TBATS: The exponential smoothing state space model with a Box-Cox transformation, ARMA errors, trend and seasonal components (De Livera et al., 2011). R implementation: tbats().
STLM-AR: The STL decomposition (Cleveland et al., 1990) with AR modeling of the seasonally adjusted series. R implementation: stlm(..., modelfunction = ar).
RW-DRIFT: The random walk model with drift. R implementation: rwf(..., drift = TRUE).
THETA: The decomposition forecasting model obtained by modifying the local curvature of the time series through a coefficient 'Theta' applied directly to the second differences of the data (Assimakopoulos and Nikolopoulos, 2000). R implementation: thetaf().
NAIVE: The naïve method, which takes the last observation as the forecast for all forecast horizons. R implementation: naive().
SNAIVE: The seasonal naïve method, which forecasts using the most recent values of the same season. R implementation: snaive().
Figure 5. The temporal holdout strategy used to generate the training dataset. Each original time series is
divided into a training period and a testing period. The length of the testing period is equal to the forecasting
horizon (h) given by the M4 competition. We calculate time series image features from the training periods of
the training dataset, generate forecasts, and compute the corresponding OWA values over the test periods for
each candidate forecasting method. We train an XGBoost model on the training dataset and obtain weights for
each candidate forecasting method, which are then used to generate forecasts by forecast combination for the
future data.
error obtained by forecast combination. Let \(f_n\) be the image features extracted from the n-th time series, and N the total number of time series. \(O_{nm}\) is the contribution of the m-th method on the n-th time series to the OWA error. \(p(f_n)_m\) is the output of the XGBoost algorithm corresponding to the m-th forecasting method, based on the features extracted from the n-th time series. The gradient tree boosting approach minimizes the weighted average loss function
\[
\arg\min_{w} \sum_{n=1}^{N} \sum_{m=1}^{M} w(f_n)_m \, O_{nm},
\]
where \(w(f_n)_m\) are the softmax-transformed weights for the outputs \(p(f_n)_m\) of the XGBoost model, defined as
\[
w(f_n)_m = \frac{\exp\{p(f_n)_m\}}{\sum_{m'} \exp\{p(f_n)_{m'}\}}.
\]
The hyper-parameter settings for XGBoost are available in Appendix B.
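The weight transform and the combination step can be expressed compactly. The sketch below assumes the per-method XGBoost outputs and the candidate forecasts are already computed; it omits the custom training objective needed to fit XGBoost against the combination loss (see Montero-Manso et al., 2020).

```python
import numpy as np

def combination_weights(p):
    """Softmax-transform XGBoost outputs p(f_n)_m into combination weights.
    p: (N, M) array with one row of method scores per series."""
    e = np.exp(p - p.max(axis=1, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=1, keepdims=True)

def combine_forecasts(weights, forecasts):
    """Weighted forecast combination.
    forecasts: (N, M, h) array of the M candidate methods' forecasts."""
    return np.einsum('nm,nmh->nh', weights, forecasts)
```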
In the testing process, we use the trained model and the image features extracted from the
testing data to obtain the weights of different forecasting models. Finally, based on the weights
and forecasts of different models, we can obtain the final forecasts for the testing data.
4. Experiments
4.1. Forecasting with M4 competition data
The first dataset we use to evaluate our proposed method is a collection of general-purpose data from the M4 competition, which consists of 100,000 time series drawn from the economic,
Figure 6. Framework of forecast model averaging based on automatic feature extraction. In the training process,
nine weights are obtained for the forecast model combination using XGBoost. Based on the weights, we obtain
the forecasts for the testing data in the testing process.
finance, demographics, and industry domains. In the training process, we divide the original time series in M4 into training and testing periods following the strategy in Fig. 5. The lengths of the testing periods are equal to the forecasting horizons (h) given by the M4 competition, i.e., 6 for yearly, 8 for quarterly, 18 for monthly, 13 for weekly, 14 for daily, and 48 for hourly data. For each time series in M4, we calculate time series features from the training period, generate forecasts, and compute the corresponding OWA values over the test period for each candidate forecasting method. We then train an XGBoost model to produce the weights for each forecasting method described in Table 1. In the testing process, we use the trained model to forecast the original M4 time series and evaluate the forecasts against the future M4 data, which were made public after the M4 competition.
We now apply our imaging-based time series forecasting method to the M4 data. To illustrate that the extracted image features are diverse and can be used to characterize the original time series, we project the features of time series with different periods into a two-dimensional feature space using t-distributed stochastic neighbor embedding (t-SNE; Maaten, 2014). From Fig. 7, we notice that yearly, quarterly, monthly, daily and hourly data can be well distinguished in the feature spaces, although the features are automatically extracted from time series images.
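For reference, such a projection can be computed with scikit-learn's t-SNE implementation; the sketch below assumes the image features are stacked into an (N, d) matrix.

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_2d(features, seed=0):
    """Project an (N, d) image-feature matrix into two dimensions with
    t-SNE, as used for the feature-space views in Fig. 7."""
    return TSNE(n_components=2, random_state=seed).fit_transform(np.asarray(features))
```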
Following the framework in Fig. 6, we obtain the forecasts of M4 based on time series
imaging. Our model averaging results are compared with the results of the top ten ranked
Figure 7. Two-dimensional feature spaces of the M4 time series with different periods. The blue points highlight the areas where the time series instances (orange points) with the corresponding seasonal pattern lie.
Table 2. Description of the top ten forecasting methods in the M4 competition (Makridakis et al., 2020).
Ranking Description
1 A hybrid model mixing exponential smoothing (ES) with a black-box recurrent neural network (RNN) forecasting engine (Smyl, 2020).
2 Weighted forecast combination of the nine standard forecasting methods in Table 1 (Montero-Manso et al., 2020).
3 Weighted average of multiple statistical methods using hold-out tests (Pawlikowski and Chorowska, 2020).
4 Combination of multiple statistical methods as described in Armstrong (2001).
5 Weighted average of the standard ARIMA, ETS and THETA methods described in Table 1 (Fiorucci and Louzada, 2020).
6 Median of the ETS, CES (complex exponential smoothing; Svetunkov and Kourentzes, 2018), ARIMA, and THETA methods (Petropoulos and Svetunkov, 2020).
7 Combination of two THIEF (temporal hierarchical forecasting; Athanasopoulos et al., 2017) forecasts, with base models of ARIMA and THETA, respectively (Shaub, 2020).
8 THETA method with data deseasonalization and Box-Cox transformation.
9 A calibrated average of Rho and Delta (Card) forecasting methods (Doornik et al., 2020).
10 Forecast combination of seven benchmarks.
methods (Table 2) from the M4 competition, which are available in the concluding paper of M4 (Makridakis et al., 2020). Detailed descriptions and the code for replicating the top ten methods are available in the M4 GitHub repository (https://github.com/Mcompetitions/M4-methods). Note that the replication results may differ slightly due to updates of the related R packages. However, since the concluding paper of the M4 competition (Makridakis et al., 2020) was made publicly available at the same time as this work, the possible code changes in the R packages used by the competitors are negligible. Tables 3, 4 and 5 report the MASE, sMAPE, and OWA values for our time series imaging method with model averaging, and for the top ten methods from the M4 competition. The optimal parameters of XGBoost on the M4 competition dataset can be found in Table 7 of Appendix B.

Overall, our model averaging method with automated time series image features achieves performance highly comparable with the top methods from the M4 competition. From Table 5, our method ranks sixth overall, but it has the following advantages: (1) limited human interaction is required during feature extraction; (2) both global and local features are utilized; (3) fine-tuned results from existing CNN models for computer vision tasks can be seamlessly transferred to our model; and (4) it opens up potential improvements in forecasting performance as neural networks for computer vision tasks advance.
4.2. Forecasting with the Tourism competition data
To validate our method’s generality and robustness in even specific forecasting domains,
we now apply the proposed method to the Tourism competition dataset that consists of 366
monthly series, 427 quarterly series, and 518 yearly series (Athanasopoulos et al.,2011). In
the training process, we use the M4 competition data as training data to train the XGBoost
model and produce the optimal weights for each candidate forecasting method, which are used
to forecast the Tourism data. Since the Tourism dataset has smaller size compared to the M4
competition data, we use M4 monthly data as the training data for the Tourism monthly data
to obtain the optimal weights from XGBoost. The same strategy is applied to the quarterly
and yearly datasets.
We apply the same accuracy metrics as in the Tourism competition (Athanasopoulos et al., 2011), namely the mean absolute percentage error (MAPE) and the mean absolute scaled error (MASE), to make the results comparable with the literature. MASE is calculated as in Equation (2), and MAPE is calculated as follows:
\[
\text{MAPE} = \frac{1}{h} \sum_{t=1}^{h} \frac{|Y_t - \hat{Y}_t|}{|Y_t|},
\]
Table 3. Benchmarking the MASE performance of our proposed forecasting method based on time series imaging
against the top 10 methods in the M4 competition.
Yearly Quarterly Monthly Weekly Daily Hourly Total
Ranking M4 competition
1 2.980 1.118 0.884 2.356 3.446 0.893 1.536
2 3.060 1.111 0.893 2.108 3.344 0.819 1.551
3 3.130 1.125 0.905 2.158 2.642 0.873 1.547
4 3.126 1.135 0.895 2.350 3.258 0.976 1.571
5 3.046 1.122 0.907 2.368 3.194 1.203 1.554
6 3.082 1.118 0.913 2.133 3.229 1.458 1.565
7 3.038 1.198 0.929 2.947 3.479 1.372 1.595
8 3.009 1.198 0.966 2.601 3.254 2.557 1.601
9 3.262 1.163 0.931 2.302 3.284 0.801 1.627
10 3.185 1.164 0.943 2.488 3.232 1.049 1.614
Method Forecasting with time series imaging
SIFT 3.135 1.125 0.908 2.266 3.463 0.849 1.579
CNN
Inception-v1+XGBoost 3.096 1.139 0.947 2.479 3.289 1.015 1.592
ResNet-v1-101+XGBoost 3.106 1.147 0.927 2.579 3.377 0.970 1.591
ResNet-v1-50+XGBoost 3.104 1.143 0.917 2.441 3.363 0.965 1.583
VGG-19+XGBoost 3.098 1.133 0.931 2.355 3.235 0.991 1.581
Table 4. Benchmarking the sMAPE performance of our proposed forecasting method based on time series imaging
against the top 10 methods in the M4 competition.
Yearly Quarterly Monthly Weekly Daily Hourly Total
Ranking M4 competition
1 13.176 9.679 12.126 7.817 3.170 9.328 11.374
2 13.528 9.733 12.639 7.625 3.097 11.506 11.720
3 13.943 9.796 12.747 6.919 2.452 9.611 11.845
4 13.712 9.809 12.487 6.814 3.037 9.934 11.695
5 13.673 9.816 12.737 8.627 2.985 15.563 11.836
6 13.669 9.800 12.888 6.726 2.995 13.167 11.897
7 13.679 10.378 12.839 7.818 3.222 13.466 12.020
8 13.366 10.155 13.002 9.148 3.041 17.567 11.986
9 13.910 10.000 12.780 6.728 3.053 8.913 11.924
10 13.821 10.093 13.151 8.989 3.026 9.765 12.114
Method Forecasting with time series imaging
SIFT 13.896 9.863 12.596 7.899 3.063 11.772 11.816
CNN
Inception-v1+XGBoost 13.899 9.962 12.659 8.228 3.047 12.521 11.891
ResNet-v1-101+XGBoost 13.917 9.991 12.714 8.277 3.110 12.480 11.914
ResNet-v1-50+XGBoost 13.918 9.973 12.723 8.086 3.123 12.396 11.914
VGG-19+XGBoost 13.872 9.912 12.652 8.294 3.049 12.598 11.853
Table 5. Benchmarking the OWA performance of our proposed forecasting method based on time series imaging
against the top 10 methods in the M4 competition.
Yearly Quarterly Monthly Weekly Daily Hourly Total
Ranking M4 competition
1 0.778 0.847 0.836 0.851 1.046 0.440 0.821
2 0.799 0.847 0.858 0.796 1.019 0.484 0.838
3 0.820 0.855 0.867 0.766 0.806 0.444 0.841
4 0.813 0.859 0.854 0.795 0.996 0.474 0.842
5 0.802 0.855 0.868 0.897 0.977 0.674 0.843
6 0.806 0.853 0.876 0.751 0.984 0.663 0.848
7 0.801 0.908 0.882 0.957 1.060 0.653 0.860
8 0.788 0.898 0.905 0.968 0.996 1.012 0.861
9 0.836 0.878 0.881 0.782 1.002 0.410 0.865
10 0.824 0.883 0.899 0.939 0.990 0.485 0.869
Method Forecasting with time series imaging
SIFT 0.820 0.858 0.863 0.839 1.009 0.498 0.848
CNN
Inception-v1+XGBoost 0.814 0.867 0.885 0.895 1.002 0.552 0.854
ResNet-v1-101+XGBoost 0.816 0.872 0.877 0.916 1.025 0.542 0.855
ResNet-v1-50+XGBoost 0.816 0.869 0.873 0.881 1.025 0.538 0.853
VGG-19+XGBoost 0.814 0.863 0.876 0.877 0.994 0.549 0.850
Table 6. Model-averaging results compared with the top methods in the Tourism competition in terms of the
MAPE and MASE values.
MAPE MASE
Forecasting method Yearly Quarterly Monthly Total Yearly Quarterly Monthly Total
ARIMA 30.639 16.172 21.746 23.444 3.197 1.595 1.495 2.200
ETS 25.065 15.316 20.965 20.745 3.000 1.592 1.526 2.130
THETA 23.409 15.927 22.390 20.688 2.730 1.661 1.649 2.080
SNAIVE 23.610 16.459 22.562 20.988 3.007 1.699 1.631 2.197
DAMPED 27.975 35.830 47.192 35.898 3.061 3.221 3.404 3.209
Forecasting with time series imaging
SIFT 24.164 15.236 19.984 20.089 2.760 1.570 1.444 2.005
CNN
Inception-v1+XGBoost 24.633 15.333 20.261 20.383 2.834 1.560 1.467 2.037
ResNet-v1-101+XGBoost 24.288 15.047 20.221 20.142 2.779 1.555 1.468 2.014
ResNet-v1-50+XGBoost 24.347 15.101 19.981 20.117 2.750 1.563 1.454 2.002
VGG-19+XGBoost 23.616 15.599 20.055 20.010 2.689 1.638 1.476 2.008
where \(Y_t\) is the real value of the time series at point t, \(\hat{Y}_t\) is the point forecast, and h is the forecasting horizon.
The top methods in the competition, which include ARIMA, ETS, THETA, SNAIVE, and DAMPED, are discussed in Athanasopoulos et al. (2011). The first four methods are described in Table 1. DAMPED is a variation of the Holt-Winters method that "dampens" the trend to a flat line at some point in the future, and is implemented using forecast::holt(..., damped = TRUE) in R. We reproduce these top methods and use them as our benchmarks.
Following the framework in Fig. 6, we obtain the forecasts of the Tourism time series based on time series imaging. Our model averaging results outperform the top methods from the Tourism competition (Athanasopoulos et al., 2011) by a clear margin. Table 6 reports the MASE and MAPE values for our model-averaging method and the top methods from the Tourism competition. The numbers in bold indicate where our method beats the benchmark. In particular, our method performs exceptionally well on the monthly and quarterly data. For the yearly dataset, our method is slightly worse, which may be due to the inadequacy of historical data. The optimal parameters of XGBoost on the Tourism competition dataset can be found in Table 8 of Appendix B.
5. Discussions
Feature-based time series forecasting has proved highly promising, primarily through the extraction and selection of an appropriate set of features. Nonetheless, traditional time series feature extraction requires the manual design of feature metrics, which is typically complicated for time series forecasting practitioners. Moreover, the known features used in the time series forecasting literature are global characteristics of a time series, which may ignore important local patterns. Evidence from the literature further indicates that feature-based forecast combination might not perform as well as simple averaging when the feature extraction and selection are not properly conducted.
We propose an automated time series imaging feature extraction approach based on computer vision algorithms, and our experimental results show that the approach works well for forecast combination. A key innovation over other feature-based time series forecasting methods is that the features are extracted automatically from time series images, which are obtained using recurrence plots. In principle, any image feature extraction algorithm is applicable within our proposed framework. We employ two widely used algorithms to extract features from time series images, namely the spatial bag-of-features (SBoF) model and deep convolutional neural networks (CNNs).
The SBoF model, combining the scale-invariant feature transform (SIFT) algorithm, the locality-constrained linear coding (LLC) method, and spatial pyramid matching (SPM) with max pooling, can capture both global and local characteristics of images. The traditional SBoF model is a fast, industry-grade model in computer vision applications. One may notice that the features extracted by the traditional SIFT-based model perform better than the deep CNN models in some scenarios with our testing data. But it is worth mentioning that the SIFT method is not a fully automated image feature extraction process, because it requires a careful specification of four steps, namely (1) detecting extreme values in the scale spaces, (2) finding the key points, (3) assigning feature directions, and (4) describing key points. Moreover, the SIFT algorithm is patent protected (Lowe, 2004), which means other open-source programs cannot incorporate it without the patent owner's permission. Having an alternative approach with highly comparable performance but without patent restrictions is important to time series forecasters.
The alternative feature extraction algorithm based on deep CNNs is an automated process once the source task is confirmed. We use transfer learning to borrow the information of well pre-trained neural network models for image classification, which avoids the complication of setting the network structure and tuning the hyper-parameters. Unlike traditional CNN tasks that require fine-tuning and massive computation, we transfer the convolutional and fully connected layers trained for the ImageNet competition to our task. Hence only one new adaptation layer needs to be trained, which significantly saves computational power.

Although the aims of the source task (ImageNet) and the target task (time series forecasting) are naturally different, the images generated from time series share similar shapes and angles with images of real objects. This explains why we can transfer a different task to time series forecasting. In practice, forecasting practitioners may train a customized CNN model to further improve the forecasting performance if a rich collection of time series is available.
Another significant merit of using deep CNNs and transfer learning for time series feature extraction is that the pre-trained neural network models (e.g., on ImageNet) are continuously updated and improved in the image processing literature. Thus, we believe that this line of automated time series feature extraction approaches has great potential in the future.
In this paper, we use the features extracted from recurrence plots to reveal the characteristics
of the corresponding time series. The recurrence plot for a given time series displays its dynamics
based on the distance correlations within the time series. However, other features such as
cross-correlation coefficients can also be used to generate cross-correlation recurrence plots.
Thus, multi-channel images, with more comprehensive information, can be obtained for each
time series, which can potentially improve the feature extraction and feature-based forecast
combination performances. Therefore, time series forecasting based on multi-channel imaging
can be one potential extension of our current work.
The forecasting framework based on time series image features is in line with the work of Montero-Manso et al. (2020), who use 42 manually curated time series features and nine forecasting methods to optimize the weights for forecast combination, and who won second place in the M4 competition (Makridakis et al., 2020). To be consistent and comparable, we employ the same set of forecasting methods on the M4 dataset. However, we note that the choice of candidate forecasting methods for forecast combination also requires expert knowledge and practical experience. The performance of forecast combinations depends on the accuracy of the individual forecasting methods and the diversity among them, since the merits of forecast combination stem from the independent information across multiple forecasts (Thomson et al., 2019). How to automatically select an appropriate set of candidate methods for combination is another interesting direction for future research.
In our experiments, all the time series are independent. Therefore, we treat the time series images as independent and apply them to the CNN framework that is also used for classifying objects in ImageNet. A further extension of our work is to extend time series forecasting with imaging to (1) forecasting with time-varying image features, and (2) hierarchical time series or multivariate time series with recurrent dependence. In both scenarios, hierarchical image classification frameworks mixing CNNs and RNNs could be further explored.
We make our code publicly available at https://github.com/lixixibj/forecasting-with-time-series-imaging. Making it open source can enrich the toolboxes of forecasting support systems by providing a competitive alternative to the existing feature-based time series forecasting methods.
6. Concluding remarks
In this paper, we propose to use image features for forecast model combination. First, time series are encoded into images. Computer vision algorithms are then applied to extract features from the images, which are used for forecast model averaging. The proposed method enables automated feature extraction, making it more flexible than using manually selected time series features. Besides, our image features can depict local as well as global features of time series. To the best of our knowledge, our paper is the first attempt to apply imaging to time series forecasting.

We examined the performance of our approach on two widely used time series competition datasets (M4 and Tourism) and compared it with the top methods in the two competitions. Our experiments show that the proposed method produces forecast accuracies highly comparable with the top-ranked benchmarks in the competitions. Moreover, forecasting based on time series imaging offers an automatic tool for time series feature extraction, in the sense that it does not rely on manual inputs for feature selection, which is crucial for forecast practitioners.
Acknowledgments
We are thankful to Dr. Slawek Smyl from Uber and Professor Christoph Bergmeir from
Monash University for their insightful suggestions on a previous version of this paper presented
at the 39th International Symposium on Forecasting.
Yanfei Kang is supported by the National Natural Science Foundation of China (No. 11701022)
and the National Key Research and Development Program (No. 2019YFB1404600). Feng Li
is supported by the National Natural Science Foundation of China (No. 11501587) and the
Beijing Universities Advanced Disciplines Initiative (No. GJJ2019163).
References
Abdollahi, M., Khaleghi, T. and Yang, K. (2020), ‘An integrated feature learning approach using
deep learning for travel time prediction’, Expert Systems with Applications 139, 112864.
Ahmad, Z., Jindal, R., Ekbal, A. and Bhattachharyya, P. (2020), ‘Borrow from rich cousin:
transfer learning for emotion detection using cross lingual embedding’, Expert Systems with
Applications 139, 112851.
Arinze, B. (1994), ‘Selecting appropriate forecasting models using rule induction’, Omega-
international Journal of Management Science 22(6), 647–658.
Armstrong, J. S. (2001), Combining forecasts, in ‘Principles of forecasting’, Springer, pp. 417–
439.
Assimakopoulos, V. and Nikolopoulos, K. (2000), ‘The theta model: a decomposition approach
to forecasting’, International Journal of Forecasting 16(4), 521–530.
Athanasopoulos, G., Hyndman, R. J., Kourentzes, N. and Petropoulos, F. (2017), ‘Forecasting
with temporal hierarchies’, European Journal of Operational Research 262(1), 60–74.
Athanasopoulos, G., Hyndman, R. J., Song, H. and Wu, D. C. (2011), ‘The tourism forecasting
competition’, International Journal of Forecasting 27(3), 822–844.
Bandara, K., Bergmeir, C. and Smyl, S. (2020), ‘Forecasting across time series databases using
recurrent neural networks on groups of similar series: a clustering approach’, Expert Systems
With Applications 140, 112896.
Baydogan, M. G., Runger, G. and Tuv, E. (2013), ‘A bag-of-features framework to classify time
series’, IEEE transactions on pattern analysis and machine intelligence 35(11), 2796–2802.
Chen, T. and Guestrin, C. (2016), XGBoost: a scalable tree boosting system, in 'ACM SIGKDD International Conference on Knowledge Discovery and Data Mining', pp. 785–794.
Christ, M., Braun, N., Neuffer, J. and Kempa-Liehr, A. W. (2018), 'Time Series FeatuRe Extraction on basis of Scalable Hypothesis tests (tsfresh – a Python package)', Neurocomputing 307, 72–77.
Cleveland, R. B., Cleveland, W. S., McRae, J. E. and Terpenning, I. (1990), ‘STL: A seasonal-
trend decomposition procedure based on loess’, Journal of Official Statistics 6(1), 3–73.
Collopy, F. and Armstrong, J. S. (1992), ‘Rule-based forecasting: development and validation
of an expert systems approach to combining time series extrapolations’, Management Science
38(10), 1394–1414.
Corizzo, R., Ceci, M., Zdravevski, E. and Japkowicz, N. (2020), ‘Scalable auto-encoders for grav-
itational waves detection from time series data’, Expert Systems with Applications p. 113378.
De Livera, A. M., Hyndman, R. J. and Snyder, R. D. (2011), ‘Forecasting time series with
complex seasonal patterns using exponential smoothing’, Journal of the American statistical
association 106(496), 1513–1527.
Deng, J., Dong, W., Socher, R., Li, L. J., Li, K. and Li, F. F. (2009), Imagenet: A large-
scale hierarchical image database, in ‘IEEE Conference on Computer Vision and Pattern
Recognition’.
Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Ning, Z., Tzeng, E., Darrell, T., Donahue, J., Jia,
Y. and Vinyals, O. (2014), Decaf: A deep convolutional activation feature for generic visual
recognition, in ‘International Conference on International Conference on Machine Learning’.
Doornik, J. A., Castle, J. L. and Hendry, D. F. (2020), ‘Card forecasts for m4’, International
Journal of Forecasting 36(1), 129–134.
Eckmann, J.-P., Kamphorst, S. O. and Ruelle, D. (1987), ‘Recurrence plots of dynamical sys-
tems’, EPL (Europhysics Letters) 4(9), 973.
Fiorucci, J. A. and Louzada, F. (2020), ‘Groec: combination method via generalized rolling
origin evaluation’, International Journal of Forecasting 36(1), 105–109.
Fulcher, B. D. (2018), Feature-based time-series analysis, in ‘Feature engineering for machine
learning and data analytics’, CRC Press, pp. 87–116.
Fulcher, B. D., Little, M. A. and Jones, N. S. (2013), ‘Highly comparative time-series analy-
sis: the empirical structure of time series and their methods’, Journal of the Royal Society
Interface 10(83), 20130048.
Fulcher, B. and Jones, N. (2014), ‘Highly comparative feature-based time-series classification’,
IEEE Transactions on Knowledge and Data Engineering 26(12), 3026–3037.
Ge, W. and Yu, Y. (2017), Borrowing treasures from the wealthy: Deep transfer learning
through selective joint fine-tuning, in ‘Computer Vision and Pattern Recognition’.
Han, D., Liu, Q. and Fan, W. (2018), ‘A new image classification method using cnn transfer
learning and web data augmentation’, Expert Systems With Applications 95, 43–56.
Hatami, N., Gavet, Y. and Debayle, J. (2017), ‘Bag of recurrence patterns representation for
time-series classification’, Pattern Analysis and Applications pp. 1–11.
He, K., Zhang, X., Ren, S. and Sun, J. (2016), Deep residual learning for image recognition, in
‘Proceedings of the IEEE conference on computer vision and pattern recognition’, pp. 770–
778.
Hyndman, R., Athanasopoulos, G., Bergmeir, C., Caceres, G., Chhay, L., O’Hara-Wild, M.,
Petropoulos, F., Razbash, S., Wang, E. and Yasmeen, F. (2019), forecast: Forecasting func-
tions for time series and linear models. R package version 8.5.
URL: http://pkg.robjhyndman.com/forecast
Hyndman, R. J. and Khandakar, Y. (2008), ‘Automatic time series forecasting: the forecast
package for R’, Journal of Statistical Software 26(3), 1–22.
Hyndman, R. J., Koehler, A. B., Snyder, R. D. and Grose, S. (2002), ‘A state space framework
for automatic forecasting using exponential smoothing methods’, International Journal of
Forecasting 18(3), 439–454.
Hyndman, R. J., Wang, E. and Laptev, N. (2015), Large-scale unusual time series detection, in
‘Proceedings of the IEEE International Conference on Data Mining’, Atlantic City, NJ, USA.
14–17 November 2015.
Kang, Y., Hyndman, R. J. and Li, F. (2020), ‘GRATIS: GeneRAting TIme Series with diverse
and controllable characteristics’, Statistical Analysis and Data Mining (in press).
URL: https://doi.org/10.1002/sam.11461
Kang, Y., Hyndman, R. J. and Smith-Miles, K. (2017), ‘Visualising forecasting algorithm perfor-
mance using time series instance spaces’, International Journal of Forecasting 33(2), 345–358.
Krizhevsky, A., Sutskever, I. and Hinton, G. E. (2012), 'ImageNet classification with deep convolutional neural networks', Advances in Neural Information Processing Systems 25.
Laptev, N., Yosinski, J., Li, L. E. and Smyl, S. (2017), Time-series extreme event forecasting
with neural networks at uber, in ‘International Conference on Machine Learning’, Vol. 34,
pp. 1–5.
Lazebnik, S., Schmid, C. and Ponce, J. (2006), Beyond bags of features: Spatial pyramid match-
ing for recognizing natural scene categories, in ‘2006 IEEE Computer Society Conference on
Computer Vision and Pattern Recognition (CVPR’06)’, Vol. 2, IEEE, pp. 2169–2178.
Lowe, D. G. (1999), Object recognition from local scale-invariant features, in ‘Computer vi-
sion, 1999. The proceedings of the seventh IEEE international conference on’, Vol. 2, IEEE,
pp. 1150–1157.
Lowe, D. G. (2004), ‘Method and apparatus for identifying scale invariant features in an image
and use of same for locating an object in an image’. US Patent 6,711,293.
Maaten, L. v. d. (2014), ‘Accelerating t-SNE using tree-based algorithms’, The Journal of
Machine Learning Research 15(1), 3221–3245.
Makridakis, S. and Hibon, M. (2000), ‘The M3-Competition: results, conclusions and implica-
tions’, International Journal of Forecasting 16(4), 451–476.
Makridakis, S., Spiliotis, E. and Assimakopoulos, V. (2020), ‘The M4 competition: 100,000 time
series and 61 forecasting methods’, International Journal of Forecasting 36(1), 54–74.
Meade, N. (2000), ‘Evidence for the selection of forecasting methods’, Journal of Forecasting
19(6), 515–535.
Montero-Manso, P., Athanasopoulos, G., Hyndman, R. J. and Talagala, T. S. (2020),
‘FFORMA: Feature-based forecast model averaging’, International Journal of Forecasting
36(1), 86 – 92.
Nanopoulos, A., Alcock, R. and Manolopoulos, Y. (2001), ‘Feature-based classification of time-
series data’, International Journal of Computer Research 10(3).
Pan, S. J. and Yang, Q. (2010), 'A survey on transfer learning', IEEE Transactions on Knowledge and Data Engineering 22(10), 1345–1359.
Pawlikowski, M. and Chorowska, A. (2020), ‘Weighted ensemble of statistical models’, Interna-
tional Journal of Forecasting 36(1), 93–97.
Petropoulos, F., Makridakis, S., Assimakopoulos, V. and Nikolopoulos, K. (2014), “Horses for
courses’ in demand forecasting’, European Journal of Operational Research 237(1), 152–163.
Petropoulos, F. and Svetunkov, I. (2020), ‘A simple combination of univariate models’, Inter-
national journal of forecasting 36(1), 110–115.
Razavian, A. S., Azizpour, H., Sullivan, J. and Carlsson, S. (2014), 'CNN features off-the-shelf: an astounding baseline for recognition'.
Shah, C. (1997), ‘Model selection in univariate time series forecasting using discriminant anal-
ysis’, International Journal of Forecasting 13(4), 489–500.
Shaub, D. (2020), ‘Fast and accurate yearly time series forecasting with forecast combinations’,
International Journal of Forecasting 36(1), 116–120.
Simonyan, K. and Zisserman, A. (2014), 'Very deep convolutional networks for large-scale image recognition', arXiv preprint arXiv:1409.1556.
Smyl, S. (2020), ‘A hybrid method of exponential smoothing and recurrent neural networks for
time series forecasting’, International Journal of Forecasting 36(1), 75–85.
Svetunkov, I. and Kourentzes, N. (2018), ‘Complex exponential smoothing for seasonal time
series’.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke,
V. and Rabinovich, A. (2015), Going deeper with convolutions, in ‘Proceedings of the IEEE
conference on computer vision and pattern recognition’, pp. 1–9.
Talagala, P. D., Hyndman, R. J., Smith-Miles, K., Kandanaarachchi, S. and Muñoz, M. A. (2019), 'Anomaly detection in streaming nonstationary temporal data', Journal of Computational and Graphical Statistics (in press), 1–28.
Talagala, T., Li, F. and Kang, Y. (2019), ‘FFORMPP: Feature-based forecast model perfor-
mance prediction’, arXiv 1908.11500.
URL: https://arxiv.org/abs/1908.11500
Talagala, T. S., Hyndman, R. J. and Athanasopoulos, G. (2018), Meta-learning how to fore-
cast time series, Working paper 6/18, Monash University, Department of Econometrics and
Business Statistics.
Thiel, M., Romano, M. C. and Kurths, J. (2004), ‘How much information is contained in a
recurrence plot?’, Physics Letters A 330(5), 343–349.
Thomson, M. E., Pollock, A. C., Onkal, D. and Gonul, M. S. (2019), ‘Combining forecasts:
Performance and coherence’, International Journal of Forecasting 35(2), 474–484.
Vincent, P., Larochelle, H., Bengio, Y. and Manzagol, P.-A. (2008), Extracting and compos-
ing robust features with denoising autoencoders, in ‘Proceedings of the 25th international
conference on Machine learning’, pp. 1096–1103.
Wang, J., Liu, P., She, M. F., Nahavandi, S. and Kouzani, A. (2013), ‘Bag-of-words repre-
sentation for biomedical time series classification’, Biomedical Signal Processing and Control
8(6), 634–644.
Wang, J., Yang, J., Yu, K., Lv, F., Huang, T. and Gong, Y. (2010), Locality-constrained linear
coding for image classification, in ‘Computer Vision and Pattern Recognition (CVPR), 2010
IEEE Conference on’, IEEE, pp. 3360–3367.
Wang, X., Smith, K. A. and Hyndman, R. J. (2006), ‘Characteristic-based clustering for time
series data’, Data Mining and Knowledge Discovery 13(3), 335–364.
Wang, Z. and Oates, T. (2015), Imaging time-series to improve classification and imputation,
in ‘Proceedings of the 24th International Conference on Artificial Intelligence’, AAAI Press,
pp. 3939–3945.
Appendices
A. Experimental setup for the SoBF and CNN models
In the traditional image processing method with SIFT, we need to obtain the basic descriptors
before the linear coding. We choose k = 200 as the number of clusters, and the 200 centroid
coordinates are used as the coordinates of the basic descriptors. For each descriptor, we select
the 5 closest descriptors from the 200 basic descriptors with the K-nearest neighbors (KNN)
algorithm, and set the adjustment factor λ = e^4 in LLC. We set 1, 2 and 4 as the SPM
parameters, so that the image is split into 1 × 1, 2 × 2 and 4 × 4 subimages, respectively. To
eliminate range differences across time series, we further apply the min-max transformation to
each time series before computing its recurrence plot. The threshold parameter ε for recurrence
plot generation is set to 0.5.
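For concreteness, the following Python sketch converts a min-max scaled series into a thresholded recurrence plot with ε = 0.5. It is a minimal illustration rather than our exact implementation: the phase space is simplified to the one-dimensional trajectory of the scaled observations, and the function name is ours.

```python
import numpy as np

def recurrence_plot(x, eps=0.5):
    """Binary recurrence plot of a univariate series (illustrative sketch).

    x   : 1-D array, the time series.
    eps : recurrence threshold applied to the min-max scaled series.
    """
    x = np.asarray(x, dtype=float)
    # Min-max scale the series to [0, 1] to eliminate range differences.
    x = (x - x.min()) / (x.max() - x.min())
    # Pairwise distances between all points of the (1-D) trajectory.
    dist = np.abs(x[:, None] - x[None, :])
    # A dot is drawn wherever two states are closer than eps.
    return (dist <= eps).astype(np.uint8)

# Example: an m x m recurrence plot for a noisy sine wave.
t = np.linspace(0, 8 * np.pi, 200)
rp = recurrence_plot(np.sin(t) + 0.1 * np.random.randn(200), eps=0.5)
print(rp.shape)  # (200, 200)
```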
Table 7. Optimal parameters of XGBoost on the M4 competition dataset.

Method                    max depth  learning rate  sample proportion  feature proportion
SIFT                      14         0.575          0.916              0.767
CNN
  Inception-v1+XGBoost    15         0.600          0.920              0.810
  ResNet-v1-101+XGBoost   20         0.660          0.892              0.871
  ResNet-v1-50+XGBoost    18         0.640          0.960              0.850
  VGG-19+XGBoost          12         0.530          0.940              0.830
The output dimensions of the pre-trained CNN models are as follows.
Inception-v1: 1024.
ResNet-v1-101: 2048.
ResNet-v1-50: 2048.
VGG-19: 1000.
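As an illustration of how such fixed-dimensional features can be obtained, the sketch below runs a recurrence-plot image through a pre-trained network with its classification head removed; it uses torchvision's ImageNet ResNet-50 weights as a stand-in for whichever framework and checkpoints the original pipeline used, and the random binary image is a placeholder for an actual recurrence plot.

```python
import numpy as np
import torch
from PIL import Image
from torchvision.models import resnet50, ResNet50_Weights

# Load an ImageNet pre-trained ResNet-50 and replace its classification
# head with the identity, so the forward pass returns the pooled
# 2048-dimensional feature vector instead of class logits.
weights = ResNet50_Weights.IMAGENET1K_V1
model = resnet50(weights=weights)
model.fc = torch.nn.Identity()
model.eval()

preprocess = weights.transforms()  # resizing + ImageNet normalization

# Placeholder for a recurrence-plot image: a random binary m x m matrix,
# replicated to three channels since the network expects RGB input.
rp = (np.random.rand(200, 200) > 0.5).astype(np.uint8) * 255
img = Image.fromarray(rp).convert("RGB")

with torch.no_grad():
    features = model(preprocess(img).unsqueeze(0))
print(features.shape)  # torch.Size([1, 2048])
```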
B. Experimental setup for XGBoost
To set the optimal parameters for XGBoost, we perform a search over subspaces of the hyper-
parameter space, measuring the OWA via 10-fold cross-validation on the training data. We
describe the hyper-parameters and the search ranges of the cross-validation procedure as follows;
a schematic implementation of the search is sketched after the list.
max depth: The maximum depth of a tree; it ranges from 6 to 25.
learning rate: The learning rate, which scales the contribution of each tree; it ranges from
0.01 to 1.
sample proportion: The proportion of the training set used to build the trees in each
iteration; it ranges from 0.7 to 1.
feature proportion: The proportion of features used to build the trees in each iteration;
it ranges from 0.7 to 1.
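The sketch below illustrates such a search with a random search over the ranges above and 10-fold cross-validation. It is a simplified stand-in for our setup: it scores candidates by RMSE on a generic regression target rather than by the OWA (which requires the full forecast-evaluation pipeline), the placeholder data and the fixed number of trees are our own choices, and the quantities above are mapped to the xgboost Python API names max_depth, learning_rate, subsample and colsample_bytree.

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

def cv_score(params, X, y, n_splits=10):
    """Mean validation RMSE of an XGBoost model under 10-fold CV.

    RMSE is a stand-in for the OWA criterion used in the paper.
    """
    scores = []
    for train_idx, val_idx in KFold(n_splits, shuffle=True, random_state=0).split(X):
        # n_estimators is fixed here for the sketch only.
        model = xgb.XGBRegressor(**params, n_estimators=200)
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[val_idx])
        scores.append(np.sqrt(np.mean((pred - y[val_idx]) ** 2)))
    return np.mean(scores)

# Random search over the ranges given above, on placeholder data.
X, y = rng.normal(size=(500, 20)), rng.normal(size=500)
best = None
for _ in range(50):
    params = {
        "max_depth": int(rng.integers(6, 26)),    # max depth: 6 to 25
        "learning_rate": rng.uniform(0.01, 1.0),  # learning rate: 0.01 to 1
        "subsample": rng.uniform(0.7, 1.0),       # sample proportion: 0.7 to 1
        "colsample_bytree": rng.uniform(0.7, 1.0) # feature proportion: 0.7 to 1
    }
    score = cv_score(params, X, y)
    if best is None or score < best[0]:
        best = (score, params)
print(best)
```

In the actual experiments, the cross-validated criterion is the OWA of the combined forecasts produced from the model output, as described above.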
Table 7 reports the optimal parameters of XGBoost on the M4 competition dataset. In this
experiment, we train XGBoost on the data of all periods together and, as a result, obtain a
single set of optimal parameters.
Table 8 shows the optimal parameters of XGBoost for the yearly, quarterly and monthly data
of the Tourism competition dataset. Due to the small size of the Tourism dataset, we use the
M4 data of the corresponding periods as the training data. Hence, we obtain three groups of
optimal parameters for yearly, quarterly and monthly data, respectively.
Table 8. Optimal parameters of XGBoost on the Tourism competition dataset.

Method                    max depth  learning rate  sample proportion  feature proportion
Yearly
  SIFT                    25         1.000          0.747              1.000
  CNN
    Inception-v1+XGBoost  12         0.907          0.700              1.000
    ResNet-v1-101+XGBoost 6          1.000          0.967              0.866
    ResNet-v1-50+XGBoost  7          0.872          0.747              0.976
    VGG-19+XGBoost        8          0.877          0.960              0.710
Quarterly
  SIFT                    12         0.880          0.851              0.861
  CNN
    Inception-v1+XGBoost  17         0.856          1.000              0.700
    ResNet-v1-101+XGBoost 8          0.985          0.985              0.947
    ResNet-v1-50+XGBoost  14         0.581          0.921              0.781
    VGG-19+XGBoost        11         0.872          0.858              0.764
Monthly
  SIFT                    14         0.575          0.916              0.767
  CNN
    Inception-v1+XGBoost  25         1.000          0.861              0.700
    ResNet-v1-101+XGBoost 25         1.000          1.000              1.000
    ResNet-v1-50+XGBoost  14         1.000          1.000              0.705
    VGG-19+XGBoost        17         0.842          0.935              0.913