Department of Geography and Planning
School of Environmental Sciences
Spatio-temporal analysis and machine learning for traffic speed
Dissertation submitted in partial fulfilment of the degree
MSc Geographic Data Science
Word count: 10,175
I declare that the material presented for this dissertation is entirely my own work and
has not previously been presented for the award of any other degree of any institution
and that all secondary sources used are properly acknowledged.
September 15, 2020
Advances in machine learning, computational power and the explosion of traffic
data are driving the development and use of Intelligent Transport Systems. Real-
time road traffic forecasting is a fundamental element required to make effective
use of these Intelligent Transport Systems. Traffic modelling techniques have moved
beyond traditional statistical approaches towards data-driven and computational
intelligence methods. Often still overlooked is the spatial structure of traffic data
and the impact of road network topology on traffic flows. In order to develop
effective forecasting methods, understanding the autocorrelation structure among
space-time observations is crucial. This paper presents methods for quantifying the
complex spatio-temporal relationships present in transportation network data and
assesses the impact of upstream and downstream congestion propagation. Additionally,
it introduces machine learning techniques for the forecasting of traffic speeds in a
supervised learning task where, by leveraging spatial and temporal characteristics
of dynamic traffic flows, the LightGBM algorithm is shown to produce state-of-
the-art performance on real-world motorway traffic data with predictions up to 30
minutes in advance, outperforming other established forecasting techniques.
List of Figures
List of Tables
1 Introduction
2 Literature Review
2.1 Traffic data
2.2 Space-time autocorrelations
2.3 Traffic modelling
2.3.1 ARIMA
2.3.2 OLS
2.3.3 Nonparametric machine learning
2.3.4 Recurrent Neural Networks
2.3.5 Gradient Boosting
2.3.6 Developing the current literature
3 Methods and Data
3.1 Data
3.1.1 Preprocessing
3.1.2 Overview
3.2 Space-time autocorrelations
3.2.1 Temporal
3.2.2 Spatio-temporal
3.3 Experimental set up
3.4 Model specification and tuning
3.4.1 OLS
3.4.2 LightGBM
3.4.3 LSTM
3.5 Evaluation metrics
3.6 Implementation
4 Results
4.1 Space-time autocorrelations
4.2 Modelling
5 Discussion
5.1 Literature implications
5.2 Real world applications
5.3 Limitations
5.4 Future work
6 Conclusion
7 Bibliography
List of Figures
1 Neural Network architectures
2 Decision tree example
3 Sensor locations
4 Heat-map of traffic speeds for all sensors on 2019-09-06
5 Average traffic speeds across each day of the week
6 Average traffic speeds by time of day for each sensor
7 Traffic speeds on 2019-09-06 for each sensor
8 Box plot of average daily traffic speeds across all sensors
9 Input windows for time series prediction
10 Train, validation and test procedure for modelling
11 Gradient boosted decision tree learning process
12 Structure of a general LSTM network layer
13 Implemented LSTM network structure
14 Temporal autocorrelations
15 CCF coefficients for all times
16 CCF coefficients for afternoon peak hours
17 MAE for all predictions in each hour of the day
18 Actual and predicted traffic speed values for LightGBM and LSTM models
List of Tables
1 Four model input vectors taken over 6 previous time steps
2 LightGBM hyperparameters
3 MAE and RMSE scores for target sensor M60/9223A at time t+1
4 MAE and RMSE scores for target sensor M60/9223A at time t+2
1 Introduction

The prediction of future traffic characteristics such as flow, travel time and speed has
been a key component of Intelligent Transport Systems (ITS) for the past couple of
decades. Accurate forecasting of future traffic conditions has benefits for both road
users and traffic authorities alike. The broad aims of Intelligent Transport Systems are
to enable road users and administrators to be better informed and make safer and more
efficient use of transport networks. Specifically, the foremost consideration of almost
all traffic management systems is reducing the high levels of congestion that generate
economic, social and environmental issues in many cities around the world (Knittel et al.).
The rapid development of technologies that allow for the capture and storage of data
describing the prevailing traffic conditions, such as sensors, radars, inductive loops and
GPS (Leduc 2008), has resulted in increasingly sophisticated ITS. Real-time traffic
information is vital for effective and swift traffic management, allowing traffic authorities to
actively respond to prevailing traffic conditions in order to reduce congestion, provide
information to road users and deal with collisions. It is a key element in many of the
operational components of ITS such as Advanced Traveller Information Systems (ATIS)
and Dynamic Traffic Assignment (DTA). In the past, much of the focus from traffic
managers was on reactive traffic measures based on current traffic conditions, dealing
with congestion once it had developed with no attempt to anticipate it. Now that many
authorities possess real-time traffic data feeds and subsequently large volumes of
historical traffic data, the focus has switched towards developing methods for forecasting
traffic conditions ahead of time, allowing for proactive traffic measures based on
expected traffic conditions rather than decisions based on a traffic state soon to be out
of date.
The past decade has seen a significant uptick in research surrounding the development
of predictive models for traffic state predictions ahead of time. Currently there
is far from any consensus as to the most effective methods; this is in part due to the
variability in problem framing. Given that there is no universal standard in traffic data
collection and aggregation techniques, there are huge differences in structure across traffic
data sets, with differences in data size, type, frequency and the prediction horizon.
Given the increasing variety and huge volume of traffic data being collected, recent
years have seen a move away from traffic theory and towards more data-driven and
machine-learning methods in an attempt to develop fast, accurate and scalable models.
Initial attempts mainly involved the use of parametric univariate time series models such
as the Autoregressive Integrated Moving Average (ARIMA) and Seasonal Autoregressive
Integrated Moving Average (SARIMA) methods (Williams & Hoel 2003). Many
of these models, however, have shown shortcomings in generating traffic predictions, which
has led to the wider use of nonparametric machine-learning (ML) or deep-learning (DL)
based models which form the basis for most of the current literature. These are
advantageous for a few different reasons: ML and DL models are much better at extracting
non-linear relationships within the data, they can handle very high-dimensional data
and co-linear features more effectively, and they make no prior assumptions regarding the
normality of the data distribution, all of these factors culminating in more accurate
and robust predictions. There are several examples of ML applied to traffic forecasting,
such as Support Vector Regression (Castro-Neto et al. 2009) and K-nearest-neighbour
(Sun et al. 2018). The most current literature is largely centred around Recurrent Neural
Networks (RNN) and specifically Long Short-Term Memory (LSTM) networks; the
main reason behind their usage is their high representational power and ability to capture
short and long temporal dependencies of traffic data. Gradient Boosting, and in
particular efficient implementations of Gradient Boosted Decision Trees (GBDT) such
as LightGBM and XGBoost, are hugely popular amongst data science practitioners and
in competitive data science events such as those hosted on Kaggle. They have shown
state-of-the-art performance on a range of tasks including time-series prediction and
applications to spatio-temporal data (Bojer & Meldgaard 2020). Despite this, there has
been limited attempt to apply them to the problem of traffic forecasting; this paper will
attempt to bridge that gap by implementing LightGBM for traffic prediction and
comparing results to both OLS regression and LSTM, the most popular Neural Network
architecture for traffic forecasting.
The First Law of Geography (Tobler 1970) states that “everything is related to everything
else, but near things are more related than distant things.” This would imply that traffic
speeds at a given location are not only affected by the traffic state at previous points
in time but also by traffic states at other nearby locations, given the connected nature
of road networks. Shockwave theory has long been used to model the downstream-to-
upstream progression of queues (Richards 1956), and several studies have shown traffic
networks to exhibit both spatial and temporal patterns (Cheng et al. 2012) which must
be carefully considered in any modelling framework. Recent studies have shown the
improvement of predictions when including upstream or downstream traffic information
(Chandra & Al-Deek 2009). The importance of space and network topology is clear;
however, it is often overlooked and is still an area that requires further development in the
domain of traffic forecasting (Vlahogianni et al. 2014). More recently, attempts have
been made to integrate both spatial and temporal dependence in the modelling process
(Luo et al. 2019). Furthermore, studies have shown that the inclusion of exogenous
variables such as current weather conditions, incident reports and geo-tagged tweets can
improve prediction accuracy (Essien et al. 2020). This paper will attempt to quantify
the importance of spatial variables while also exploring the effectiveness of state-of-the-
art machine learning models for traffic prediction.
The main aims of this dissertation can be summarised as follows:
•To identify and quantify the relevant spatio-temporal patterns present in traffic data.
•To develop traffic speed prediction models integrating both spatial and temporal
elements of traffic speed data, comparing statistical and machine learning approaches.
•To assess the importance of exploiting the spatial structure of traffic data on model
performance.
The remainder of this thesis is organised as follows. Section 2 presents an overview of
the methodologies prevalent in the current traffic forecasting literature. Section 3 provides
a description of the data and outlines the experimental setup and model specifications.
Section 4 presents the findings. Section 5 discusses the findings and their implications.
Section 6 concludes the paper.
2 Literature Review
There is a significant amount of literature relating to short-term traffic forecasting. The
continually growing urban population places ever greater emphasis on reliable traffic
management and, in turn, traffic prediction methods. Generally, traffic forecasting
methods can be divided into two main strands. The first approach is one based on explicit
modelling of the traffic system itself and carrying out simulations incorporating
parameters such as traffic flow, signal controls and driver behaviour. Much of this work is
grounded in classical traffic flow theory (Drew 1968); macroscopic traffic models define
explicit mathematical equations to describe the relationship between variables such as
traffic flow, density, congestion shockwaves and signalling. The main advantage of these
traffic simulation approaches is that “they allow inclusion of traffic control measures
(ramp metering, routing, traffic lights, and even traffic information) in the prediction,
and that they provide full insight into the locations and causes of possible delays on the
road network of interest” (Lint 2004). However, there are a number of drawbacks to
this approach, such as the computational complexity of parameter calibration and also
their lack of flexibility. These models require a priori knowledge of traffic processes, and
their predictive quality relies on the quality of their inputs, which must be determined “by
hand”; they cannot learn the latent traffic patterns from given data.
This paper focuses on the second main approach: statistical and data-driven methods
including machine learning (ML). Machine learning methods take no prior knowledge
of the traffic system and aim to define functional approximations between input data
and output data (Hastie et al. 2009), learning the relationships in the traffic data to
predict variables such as speed and flow. Activity in this research area has been
expedited in recent years due to improved data collection methods resulting in large,
rich data sets, which has occurred in parallel to the rapid advancement of technology and
available computational power, making it possible to implement advanced data-driven
methodologies on big data sets. Despite this, there is still considerable variety in applied
methods and large areas for development within this research area, as highlighted in
recent reviews of the research field (Vlahogianni et al. 2014). One such area is the
understanding and effective incorporation of space-time autocorrelations in traffic data.
This section will provide an overview of the relevant literature with regards to traffic
forecasting methods, with particular emphasis on exploiting the latent spatial structure
of the underlying traffic network data to improve forecasting models.
2.1 Traffic data
Most of the recent developments in traffic forecasting have come about as a result of
data-intensive, as opposed to analytical, approaches, and thus their effectiveness is highly
reliant on the prevalence, quality and resolution of available traffic data (Vlahogianni
et al. 2004). Given the enormous undertaking it would be for a researcher to collect task-
specific traffic data, the development of traffic forecasting methods almost always relies
on data already available, collected by transport authorities or previous large research
projects. The two main data characteristics to be considered for traffic forecasting
problems are data resolution and the measured parameters. Resolution of traffic data
can vary greatly, with traffic state readings being made as often as every few seconds
(Ye et al. 2012) to as infrequently as every hour (Hong 2012). The forecasting step is
equivalent to the resolution of the data; for example, for data collected every 15 minutes,
one forecasting step would be 15 minutes in advance of the current time step. The
forecasting horizon is a related concept and refers to the specific time horizon in the
future for which we want predictions to be made, and can include several forecasting
steps depending on the data resolution. For instance, Ishak & Al-Deek (2002) and
Liu et al. (2019) conclude that prediction becomes increasingly difficult
as the prediction horizon increases. However, as Vlahogianni et al. (2014) notes, the
usefulness of a prediction model degrades given too short a prediction horizon, as for
the majority of traffic management applications a reasonable amount of notice is required
for proactive steps to be put in place or information to be communicated to road users.
It is also the case that particularly short time steps make predictions difficult due to large
fluctuations in readings from one step to the next (Smith & Demetsky 1997); because
of this, high-resolution data is oftentimes aggregated for easier modelling (Florio &
Mussone 1996).
There are several methods of data collection for traffic data; the most common by
some way is traffic sensors (Coleri et al. 2004). These can offer measurements for a
range of traffic characteristics such as speed, flow, occupancy and travel times. There
is no real consensus in the literature on which variable is more suitable in describing
underlying traffic conditions, but Vlahogianni et al. (2014) found that the most common
characteristics chosen for prediction were flow and travel time. There have also
been attempts to forecast different measures simultaneously, such as Florio & Mussone
(1996), who tried to predict the evolution of traffic flow, density and speed over 10 minutes.
However, Innamaa (2005) found that one single model predicting two variables gave
worse forecasts than two separate models for speed and flow. In addition to this, some
researchers have found success in incorporating exogenous variables in the modelling
process. These can be things such as weather conditions and traffic incident updates
mined from Twitter, as done by Essien et al. (2020), or, in the case of Xia et al. (2019),
contextual temporal information such as the day of the week or whether it is a national
holiday.

Defining the appropriate data resolution and measurement parameters is of particular
importance for data-driven methods, as it affects the quality of information about
traffic conditions lying in the data (Vlahogianni et al. 2004). Poor or inappropriate
data for the task at hand could result in a poor model; the “garbage in, garbage out”
principle.
2.2 Space-time autocorrelations
Whilst many traditional approaches to traffic forecasting consider it a stationary,
single-point time series problem, the most current literature makes it clear that it is of
paramount importance to consider both spatial and temporal dimensions of the data in
order to develop accurate traffic forecasting models. First, an overview is provided of
the literature regarding autocorrelations in space-time and their presence across traffic
networks. (Temporal) autocorrelation is a term most prevalent in time series analysis
and signal processing; it is a mathematical representation of the degree of similarity
between a given time series and a lagged version of itself over successive time intervals.
However, it can also be extended to the spatial dimension, hence the term “spatial
autocorrelation”, which has groundings in the First Law of Geography (Tobler 1970)
and can be defined as follows: “Spatial autocorrelation is the correlation among values
of a single variable strictly attributable to their relatively close locational positions on
a two-dimensional surface, introducing a deviation from the independent observations
assumption of classical statistics” (Griffith 1987). Until recently the spatial component
of autocorrelations in traffic data was overlooked; however, a number of studies, most
notably Cheng et al. (2012), highlighted the extent of spatial autocorrelation in road
network data and its potential importance to traffic forecasting. Various indices exist
in the literature to measure autocorrelations in both space and time, most of which are
derived from Pearson's product moment correlation coefficient (PMCC) (Soper et al.
1917), which, given two variables X and Y, is defined as their covariance over
the product of their standard deviations, or the equation:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X\,\sigma_Y} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}} \quad (1)$$

where $\bar{x}$, $\bar{y}$ and $\sigma_X$, $\sigma_Y$ are the means and standard deviations of variables X and
Y respectively. The coefficient $\rho_{X,Y}$ is used as a measure of the strength of linear
association between variables and can fall in the range −1 to 1, with 1 indicating perfect
positive correlation, −1 perfect negative correlation and 0 no correlation. Temporal
autocorrelation can be calculated by taking a lagged specification of PMCC: rather
than calculating the correlation between two variables, it is calculated between a variable
at time $t$ and the same variable at time $t-k$ for some lag $k$.
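The lagged PMCC is straightforward to compute; the following short illustration, on a synthetic periodic speed series rather than the study's data, shows the idea:

```python
import numpy as np

def lagged_autocorrelation(series, k):
    """Pearson correlation between a series and itself lagged by k steps (PMCC)."""
    x = np.asarray(series, dtype=float)
    if k == 0:
        return 1.0                       # a series is perfectly correlated with itself
    return np.corrcoef(x[:-k], x[k:])[0, 1]

# A strongly periodic "speed" series correlates highly with itself one cycle back
t = np.arange(200)
speeds = 60 + 10 * np.sin(2 * np.pi * t / 20)    # period of 20 time steps
print(round(lagged_autocorrelation(speeds, 20), 3))   # lag = one full period
```

At a lag equal to the full period the autocorrelation is (numerically) 1, while at half a period it is −1, mirroring the daily periodicity visible in real traffic speed series.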
Moran's I is a measure commonly found in the broader geospatial literature and is
an extension of PMCC to the spatial dimension (Moran 1950); it is more complicated
than temporal autocorrelation in that spatial dependence can occur in any direction.
Moran's I, however, does not consider the temporal dimension, so it cannot capture
dynamic spatio-temporal correlation properties. Thus, adaptations are required to capture
spatial and temporal correlations simultaneously. A global Space-Time Autocorrelation
Function (ST-ACF) is proposed by Pfeifer & Deutsch (1980) and implemented in the
domain of network flow data by Griffith (1987); this measures the cross-covariances
between all possible pairs of locations lagged in both time and space. The ST-ACF has
been used in Space-Time Autoregressive Integrated Moving Average (STARIMA)
modelling to determine the range of spatial neighbourhoods that contribute to the current
location measurement at a given time lag (Pfeifer & Deutsch 1980).
As is the case with Moran's I, the ST-ACF has a local variant. Local spatio-temporal
correlations can be quantified using the Cross-Correlation Function (CCF) (see Box et al.
(2011)). This treats two time series in separate spatial locations as a bivariate stochastic
process and measures the cross-covariance coefficients between each series at specified
time lags. Yue & Yeh (2008) show that the CCF can effectively measure spatio-temporal
dependencies between a road sensor and its neighbours and can be used for analysing
the forecast-ability of a given traffic data set. For example, a significant positive CCF
coefficient between sensor A at time $t$ and downstream sensor B at time $t-k$
would suggest a back-propagation of congestion from B to A over the time period $k$.
Despite the empirical evidence provided by Yue & Yeh (2008) and Cheng et al. (2012)
that the CCF is a sound method of measuring spatio-temporal relationships between
neighbouring road links, it has had limited exposure in the field of traffic forecasting, where
it could be used for determining appropriate modelling approaches or selecting spatial
model-input features. In this study the CCF will be implemented to determine the
extent of local spatio-temporal autocorrelations and their dynamic behaviour; the exact
specification with respect to this data set will be provided in the methods section.
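The CCF idea can be sketched on synthetic data where one series leads another by a known lag; the toy setup (white-noise "speeds", a fixed 3-step propagation delay) is illustrative only, not this study's specification:

```python
import numpy as np

def ccf(x, y, k):
    """Cross-correlation between series x at time t and series y at time t-k.

    A large positive value at lag k suggests conditions at y's location
    propagate to x's location roughly k time steps later.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if k == 0:
        return np.corrcoef(x, y)[0, 1]
    return np.corrcoef(x[k:], y[:-k])[0, 1]

# Toy example: sensor B "leads" sensor A by 3 steps (congestion propagating B -> A)
rng = np.random.default_rng(0)
b = rng.normal(60, 5, 500)                    # downstream sensor B
a = np.roll(b, 3) + rng.normal(0, 0.5, 500)   # A echoes B with a 3-step delay
best = max(range(6), key=lambda k: ccf(a, b, k))
print(best)                                   # the CCF peaks at the true lag, 3
```

Scanning the lag at which the CCF peaks is exactly how the function is used to select how far upstream or downstream (and how far back in time) model-input features should reach.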
2.3 Traffic modelling

Traffic forecasting has evolved a great deal over the past couple of decades; however,
there is still no unified modelling approach. This section will provide an overview of the
most widely used prediction modelling techniques, both traditionally and in the most current
literature. Data-driven forecasting methods can broadly be divided into two distinct
categories: statistical parametric approaches and non-parametric machine learning.

2.3.1 ARIMA

Parametric models have a fixed and finite number of parameters, the number determined
prior to model fitting. One of the most popular parametric models for time series
prediction, both within traffic forecasting and across other disciplines, is the Autoregressive
Integrated Moving Average (ARIMA) model. ARIMA uses a statistical approach to
identify recurring patterns of different periodicity from previous observations of a time
series. It can be defined by the following differencing equation:
$$Y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \theta_1 \epsilon_{t-1} + \dots + \theta_q \epsilon_{t-q} + \epsilon_t \quad (2)$$

This is an ARIMA(p, d, q) model where:
•$y_{t-1} \dots y_{t-p}$ represent the previous $p$ values of the time series; this is the autoregressive part of the model;
•$Y_t$ represents the forecast for $y$ at time $t$;
•$\epsilon_t \dots \epsilon_{t-q}$ are zero-mean white noise terms, forming the $q$ moving-average terms;
•$\phi$ and $\theta$ are parameters to be determined by model fitting;
•$d$ is the order of differencing applied to the data set, as the time series must be stationary
for the application of ARIMA.
The first recorded application of the ARIMA model in the transport forecasting literature
was by Ahmed & Cook (1979), in which an ARIMA(0,1,3) model was proposed to predict
traffic flow and occupancy on freeways, where it was shown to be more accurate than
previous smoothing techniques. A primary assumption made by this model is the
stationarity of the mean, variance and autocorrelation. Thus, a major criticism directed
toward ARIMA models concerns their tendency to perform poorly during abnormal traffic
states and their inability to predict extreme values in time series (Vlahogianni et al.
2004). Furthermore, the univariate nature of ARIMA means that the integration of
spatial or contextual features is not possible.
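To make the autoregressive part of equation (2) concrete, the sketch below estimates AR(p) coefficients by least squares on a simulated series. This covers only the AR component: for a full ARIMA fit the series would first be differenced $d$ times (e.g. `np.diff(series, n=d)`) and the moving-average terms fitted iteratively, which is normally delegated to a library such as statsmodels.

```python
import numpy as np

def fit_ar(series, p):
    """Least-squares estimate of AR(p) coefficients phi_1..phi_p plus a constant c."""
    s = np.asarray(series, dtype=float)
    n = len(s)
    lagged = [s[p - i:n - i] for i in range(1, p + 1)]   # columns y_{t-1} ... y_{t-p}
    X = np.column_stack(lagged + [np.ones(n - p)])       # ones column for the constant
    coeffs, *_ = np.linalg.lstsq(X, s[p:], rcond=None)
    return coeffs                                        # (phi_1, ..., phi_p, c)

# Simulate an AR(1) process y_t = 0.8 * y_{t-1} + noise and recover phi_1
rng = np.random.default_rng(1)
y = np.zeros(2000)
for t in range(1, 2000):
    y[t] = 0.8 * y[t - 1] + rng.normal()
phi = fit_ar(y, p=1)
print(phi[0])   # close to the true value 0.8
```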
2.3.2 OLS

Other linear methods include Ordinary Least Squares (OLS) regression and its extension,
Geographically Weighted Regression, as applied by Gebresilassie (2017). OLS allows
the simple integration of multiple input variables and additional features, such as multiple
sensor measurements across a network or contextual features. OLS regression aims
to find a linear relationship between input and output variables, and an OLS model is
defined as:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_n x_{in} + \epsilon_i \quad (3)$$

where $y_i$ is the response variable, $x_i : i = 1, 2, \dots, n$ are the $n$ independent predictor
variables and $\epsilon_i$ is a random error term; $\beta_0$ is the model constant (intercept) and $\beta_i : i =
1, 2, \dots, n$ are the model parameters to be determined using the Ordinary Least Squares
estimator, which is defined by the equation:

$$\hat{\beta} = (X^{\top}X)^{-1}X^{\top}y \quad (4)$$
Parametric approaches have favourable statistical properties and capture regular variations
very well. However, they find prediction difficult given more volatile traffic states, such
as when a traffic accident or severe congestion occurs (Smith et al. 2002). Further to this,
the aforementioned parametric models make the assumption of linearity between input
and response variables, which makes them sub-optimal when considering the complex
spatio-temporal structure and non-linear relationships in traffic data.
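The closed-form estimator in equation (4) can be sketched in a few lines; this toy example, which recovers known coefficients from noiseless data, is purely illustrative:

```python
import numpy as np

def ols_fit(X, y):
    """OLS estimate beta_hat = (X'X)^(-1) X'y, with an intercept as a ones column."""
    Xd = np.column_stack([np.ones(len(X)), X])
    # lstsq solves the same normal equations but is numerically stabler
    # than forming the explicit matrix inverse
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return beta

# Recover known coefficients from noiseless data: y = 2 + 3*x1 - 1*x2
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 2 + 3 * X[:, 0] - X[:, 1]
beta = ols_fit(X, y)
print(np.round(beta, 6))   # → [ 2.  3. -1.]
```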
2.3.3 Nonparametric machine learning
Nonparametric models function in a fundamentally different way to parametric models.
Nonparametric models do not assume that the structure of a model is fixed, and in
most cases the model complexity grows to accommodate the complexity of the data; in
other words, the model structure is “learned” from the data (Bishop 2006). Compared
to parametric methods, nonparametric methods are believed to offer more flexibility in
fitting nonlinear characteristics and are better able to process noisy data (Karlaftis
& Vlahogianni 2011a). These methods include algorithms such as K-Nearest Neighbours
(KNN), Support Vector Regression (SVR) and Artificial Neural Networks (ANN). There
have been several attempts to implement machine learning models for traffic forecasting
(Sun et al. 2018, Castro-Neto et al. 2009, Essien et al. 2020) and even hybrid
combinations of these models (Wang et al. 2015, Luo et al. 2019). These non-linear, flexible
and multivariate models have proven to show greater accuracy in traffic forecasting than
parametric models and are better able to leverage the spatio-temporal structure of the
traffic data (Vlahogianni et al. 2014).
2.3.4 Recurrent Neural Networks
Recurrent Neural Networks (RNN) are a subset of Artificial Neural Networks (ANN)
that aim to preserve the temporal dimension of time series data using a hidden recurrent
state. They are seen as the most appropriate ANN architecture for series prediction
problems and have shown success in traffic forecasting, as demonstrated by Cheng et al.
(2018) when using an RNN model to predict traffic flow in the case of special
events. In conventional ANNs there are only full connections between adjacent layers,
but no connections among the nodes of the same layer. As Zhao et al. (2017) explains,
the ANN may be sub-optimal when dealing with spatio-temporal problems, because
there are always interactions among the nodes in a spatio-temporal network. Different
from conventional networks, the hidden units in an RNN receive feedback from
the previous state to the current state (Fig 1), allowing them to understand relationships
between states, i.e. time steps.
(a) Simple feed-forward Artificial Neural Network
(b) Recurrent Neural Network (Essien et al. 2020)
Figure 1: Neural Network architectures
RNNs can, however, often struggle with long-term dependencies and are unable to
process a large number of time lags due to correlations caused by periodic or seasonal
structures that are often seen in spatio-temporal time series. For this reason, most traffic
forecasting literature now implements the Long Short-Term Memory (LSTM) (Hochreiter
& Schmidhuber 1997) adaptation of the RNN. The LSTM network is a special kind of
RNN: by treating the hidden layer as a memory unit, LSTM networks can cope with
both short- and long-term correlations. LSTM neural networks have found great
popularity in traffic forecasting; many researchers have applied the algorithm to spatio-
temporal traffic data with great success, where it has been found to outperform other
popular techniques including ARIMA, Support Vector Machines and Random Forests
(Yang et al. 2019, Zhao et al. 2017). It is probably the most popular single ML algorithm
in the current traffic forecasting literature.
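A single LSTM time step can be sketched directly to show how the gates regulate the memory cell (the now-standard formulation with a forget gate). The weights here are random and the dimensions arbitrary, so this illustrates the mechanism rather than the network used in this study:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step: input, forget and output gates regulate the cell memory c."""
    z = W @ np.concatenate([h_prev, x]) + b   # all four gate pre-activations at once
    H = len(h_prev)
    i = sigmoid(z[0:H])            # input gate: how much new information to write
    f = sigmoid(z[H:2 * H])        # forget gate: how much old memory to keep
    o = sigmoid(z[2 * H:3 * H])    # output gate: how much memory to expose
    g = np.tanh(z[3 * H:4 * H])    # candidate values
    c = f * c_prev + i * g         # long-term memory update
    h = o * np.tanh(c)             # hidden state (short-term output)
    return h, c

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
W = rng.normal(scale=0.1, size=(4 * n_hidden, n_hidden + n_in))
b = np.zeros(4 * n_hidden)
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x in rng.normal(size=(6, n_in)):          # six consecutive (toy) speed readings
    h, c = lstm_step(x, h, c, W, b)
print(h.shape)                                # final hidden state, shape (4,)
```

Because the cell state `c` is updated additively rather than through repeated squashing, gradients can flow over many time steps, which is what lets the LSTM capture both short- and long-term correlations.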
2.3.5 Gradient Boosting
Gradient Boosting, and in particular Gradient Boosted Decision Trees (GBDT), is a tree-
based machine learning approach that is becoming increasingly popular due to fast
implementations such as LightGBM (Ke et al. 2017) and XGBoost (Chen & Guestrin
2016), and also their state-of-the-art performance across a broad range of problems.
Figure 2: Decision tree example
GBDT works by combining many decision trees (Fig 2) with low accuracy into a model
which has significantly higher accuracy than its base learners. It is an additive model
with the general expression:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F} \quad (5)$$

where $\hat{y}_i$ is the response variable given input vector $x_i$, $K$ is the number of decision trees
and $f$ is a function from the function space $\mathcal{F}$ which contains all possible decision trees. A
decision tree (Fig 2) is an algorithm for classification and regression that sequentially
splits the data based on feature values; the process is repeated in a recursive manner on
each successive subset following the split, and the end nodes represent the predicted value for
data in that subset. This carries on until splitting no longer adds value to the predictions
(Breiman et al. 1984).
GBDT algorithms, despite their popularity in the data mining and ML communities,
have been implemented only a couple of times in the current literature, namely by Xia et al.
(2019), who implement LightGBM to predict traffic characteristics in Shenzhen and find
it significantly outperforms Linear Regression, ANN and ARIMA. Also, Mei et al. (2018)
combine both XGBoost and LightGBM for accurate flow predictions on spatio-temporal
traffic data.
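The additive model in equation (5) can be illustrated with a bare-bones boosting loop using depth-1 trees (stumps) on one feature; for squared error, the negative gradient is simply the residual. This sketch is illustrative only: LightGBM and XGBoost build on the same idea with histogram-based splitting, regularisation and efficient leaf-wise tree growth.

```python
import numpy as np

def fit_stump(x, y):
    """Best single-split regression tree on one feature (squared-error criterion)."""
    best = (np.inf, None, y.mean(), y.mean())
    for s in np.unique(x)[:-1]:
        left, right = y[x <= s], y[x > s]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if err < best[0]:
            best = (err, s, left.mean(), right.mean())
    return best[1:]                              # (split point, left value, right value)

def gbdt_fit_predict(x, y, n_trees=50, lr=0.1):
    """Additive model: start from the mean, repeatedly fit stumps to residuals."""
    pred = np.full_like(y, y.mean(), dtype=float)
    for _ in range(n_trees):
        s, lv, rv = fit_stump(x, y - pred)       # gradient of squared error = residual
        pred += lr * np.where(x <= s, lv, rv)    # shrunken update, as in boosting
    return pred

x = np.linspace(0, 10, 200)
y = np.sin(x)                                    # smooth nonlinear target
pred = gbdt_fit_predict(x, y)
print(float(np.abs(y - pred).mean()))            # small mean absolute error
```

Each successive tree corrects what the ensemble so far gets wrong, which is why the combined model far outperforms any individual weak learner.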
2.3.6 Developing the current literature
From this review, and as also highlighted in the comprehensive review of the research area
by Vlahogianni et al. (2014), it can be seen that historically most effort has been put
into employing univariate statistical models, predicting traffic volume or travel time, and
using data from single point sources. Currently, considerably more effort is being
applied to incorporating the spatial structure of the data using multivariate methods such
as OLS, LSTM and SVM; however, there is often less of an attempt to understand how or
why spatial dependencies affect traffic forecasting models. Gradient boosting methods
such as LightGBM have largely been overlooked in the literature. Lastly, attempts to
predict traffic speed are far less prevalent than for other traffic characteristics such as flow or
travel time. With this in mind, this paper will first attempt to quantify the significance
and nature of local spatial autocorrelations in traffic data and then implement three
modelling approaches, LSTM, LightGBM and OLS, in order to predict traffic speeds,
assessing their performance with respect to both each other and the spatial richness
of various input data.
3 Methods and Data
This section outlines the applied methodology and experimental procedure, guided
by the key research themes of this paper. As a reminder, these can be briefly
summarised as follows.
• Quantify spatio-temporal relationships in traffic data.
• Develop and compare traffic forecasting models using OLS, LSTM and LightGBM.
• Assess the impact of spatial features on model performance.
The data used is real-world traffic data collected by Highways England, which can be
downloaded from their WebTris website, a map-based data query service with measurements
collected by road sensors along the Strategic Road Network (SRN).
The selected data set contains measurements for 21 site locations along the M60
motorway in England, in the clockwise direction between J7 and J20 (Fig 3). The data
contains measurements for both traffic speed and traffic volume, although only speed was
considered for this analysis. The total time period covered is 2019-05-01 to 2019-10-31,
a 6-month period (184 days) for which measurements are available at 15-minute intervals,
with traffic speeds aggregated over each 15-minute period. This results in 17,664 speed
readings per site, or 370,944 individual measurements across the 21 sites.
Figure 3: Sensor locations
Across the 21 sites there were 1,668 missing data points, equating to 0.44% of the data,
a very small proportion. For a given time stamp with missing data points, the missingness
was not consistent across all sites; furthermore, discarding entire time steps' worth of
data can result in inconsistent time steps across the series and be detrimental to model
performance, as periodic and seasonal effects are distorted. Because of this, where data
was missing it was imputed with the average speed for that sensor location at that
particular time of day across the rest of the data set, in order to maintain the day-of-week
and time-of-day characteristics of traffic.
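The imputation rule described above can be expressed compactly with pandas. This is a sketch: the long-format column names (`site`, `timestamp`, `speed`) are illustrative, not those of the WebTris export.

```python
import pandas as pd

def impute_time_of_day_mean(df):
    """Fill missing speeds with the sensor's mean speed at the same
    15-minute time of day across the rest of the series.

    df: long-format frame with columns 'site', 'timestamp', 'speed'.
    """
    df = df.copy()
    tod = df["timestamp"].dt.time                 # the 15-minute slot of day
    # Mean per (sensor, time-of-day) group, broadcast back to every row.
    site_tod_mean = df.groupby([df["site"], tod])["speed"].transform("mean")
    df["speed"] = df["speed"].fillna(site_tod_mean)
    return df
```

Grouping additionally by day of week would preserve weekday/weekend differences more strictly; the text's rule, followed here, averages over all days at that time.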
Before carrying out any analysis or modelling, some exploratory work was carried out
with the data in order to better understand any obvious patterns and spot potential
measurement errors. Figure 4 is a heat-map showing a snapshot of average traffic speeds
on 2019-09-06. Time is shown on the x-axis and the spatial component on the y-axis,
with sensor locations ordered from downstream to upstream, or south to north as shown
in Fig 3. Both spatial and temporal patterns are apparent, with a drop in speed around
peak times and speed values clustered by sensor location, visible as the vertical strip
patterns. The three horizontal strip patterns (M60/919K, M60/9229L, M60/9298K)
represent road sensors located on entry or exit slip roads where speed is consistently reduced.
Figure 4: Heat-map of traﬃc speeds for all sensors on 2019-09-06
Figure 5: Average traﬃc speeds across each day of the week
Figure 6: Average traﬃc speeds by time of day for each sensor
Figure 7: Traﬃc speeds on 2019-09-06 for each sensor
Fig 5 shows time-of-day and day-of-week variations in traffic speeds across the network.
During the working week severe congestion peaks can be seen around the "rush hour"
period, with speeds dropping by almost half for 1-2 hours every day. There is a clear
weekend effect, with speeds higher across the day and no visible congestion at any point;
Fridays also show less severe congestion, occurring earlier in the day than on the rest of
the working week. Clearly traffic speeds are largely influenced by the commuting patterns
of the working population. Fig 6 shows the distribution of aggregated speeds across
the day for each individual sensor, coloured by location from the southernmost sensor
progressing downstream. Not all sensors share the same daily patterns, with some
exhibiting more severe congestion than others. The spatial dependence is obvious, with
closely located pairs of sensors sharing more similarities in traffic speeds than sensor
pairs spread more widely apart. The congestion during the PM period is far more severe
than in the AM period. Meanwhile, Fig 7 shows a snapshot of traffic speeds by hour
taken on 2019-09-06, which highlights the day-to-day volatility in traffic patterns; the
patterns on this particular day differ significantly from the average patterns over the
whole time series shown in Fig 6. Again the strong spatial dependence is visible, with
abnormal congestion centred around a handful of adjacent sensors. Congestion events
can also extend beyond an hour-by-hour basis. Fig 8 shows a box-plot of average speeds
across the network by day of the week. The large interquartile range and number of
outliers suggest there can be large variation in traffic speeds not only within days but
between days of aggregated speeds as well. These variations, along with the complex
spatial relationships and factors such as unexpected events, traffic accidents, slow-moving
vehicles and bad weather, bring many challenges for robust short-term traffic prediction
by traditional linear regression models.
Figure 8: Box plot of average daily traﬃc speeds across all sensors
3.2 Space-time autocorrelations
The standard autocorrelation function is used to calculate the level of correlation between
the speed value at a sensor and a lagged version of itself at previous time steps. Given a
time series y_1, y_2, ..., y_t at the target sensor, the autocorrelation at lag k is described
by the equation:

\rho(k) = \frac{E[(Y_t - \bar{y})(Y_{t+k} - \bar{y})]}{\sigma_{Y_t}\,\sigma_{Y_{t+k}}}

where E[(Y_t - \bar{y})(Y_{t+k} - \bar{y})] is the autocovariance between Y_t and Y_{t+k},
and \sigma_{Y_t}, \sigma_{Y_{t+k}} represent the standard deviations of Y_t and Y_{t+k}
respectively. \rho(k) measures the relationship between y_t and y_{t+k}.
Also measured is the partial autocorrelation. This is similar to the standard autocorrelation
but accounts for the influence of intermediate values in the time series between y_t
and y_{t-k}. The Partial Autocorrelation Function at lag k can be written as:

\alpha(k) = \mathrm{Corr}\left(y_t - \hat{y}_t,\; y_{t-k} - \hat{y}_{t-k}\right)

where \hat{y}_t and \hat{y}_{t-k} are the linear projections of y_t and y_{t-k} on the
intermediate values y_{t-1}, ..., y_{t-k+1}. This gives a clearer understanding of the
relationship between a time series and a version of itself lagged over multiple time steps,
as it removes information carried by intermediate time steps. This will help us to determine
the feasibility of multi-step forecasts using a single sensor.
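Both quantities can be computed directly from the sample (a NumPy sketch, not necessarily the implementation used here; the PACF is obtained via the Yule-Walker equations, where the lag-k partial autocorrelation equals the last coefficient of a fitted AR(k) model):

```python
import numpy as np

def acf(y, k):
    """Sample autocorrelation of series y at lag k."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()
    return float(np.sum(y[:-k] * y[k:]) / np.sum(y * y)) if k else 1.0

def pacf(y, k):
    """Partial autocorrelation at lag k: the last coefficient of an AR(k)
    fit, which removes the influence of the intermediate lags 1..k-1."""
    r = np.array([acf(y, i) for i in range(k + 1)])
    # Toeplitz autocorrelation matrix of the Yule-Walker system.
    R = np.array([[r[abs(i - j)] for j in range(k)] for i in range(k)])
    phi = np.linalg.solve(R, r[1:])
    return float(phi[-1])
```

For an AR(1) process the ACF decays geometrically while the PACF cuts off after lag 1, which is exactly the pattern reported for the target sensor in Fig 14.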
The spatial element is introduced to the autocorrelation calculations through the Cross
Correlation Function (CCF). Let X_t denote the time series of measurements collected at
sensor X up until time T, and let Y_t denote the time series of measurements collected at
a different sensor Y, either upstream or downstream, up until time T. The CCF at lag k
is then given by the equation:

\rho_{XY}(k) = \frac{E[(X_t - \bar{x})(Y_{t+k} - \bar{y})]}{\sigma_X\,\sigma_Y}

where E[(X_t - \bar{x})(Y_{t+k} - \bar{y})] represents the cross-covariance between X and Y
at lag k. This equation allows us to calculate the correlation between sensors in different
locations where one is temporally lagged; thus the CCF can be used to quantitatively measure
the spatio-temporal relationship of traffic flow and speed.
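The CCF above translates to a few lines of NumPy (a sketch following the sign convention used later in the results, where positive k lags Y back in time relative to X):

```python
import numpy as np

def ccf(x, y, k):
    """Sample cross-correlation between series x and y at lag k,
    estimating E[(X_t - mean_x)(Y_{t+k} - mean_y)] / (sd_x * sd_y)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    y = np.asarray(y, dtype=float) - np.mean(y)
    n = len(x)
    if k >= 0:
        cov = np.sum(x[:n - k] * y[k:]) / n     # pair x_t with y_{t+k}
    else:
        cov = np.sum(x[-k:] * y[:n + k]) / n    # negative lag: shift forwards
    return float(cov / (np.std(x) * np.std(y)))
```

Evaluating `ccf` between the target sensor and each upstream/downstream sensor over a range of lags produces the curves shown in Figs 15 and 16.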
3.3 Experimental set up
In order to train and validate ML models, the data must first be in a form that can be
interpreted by the algorithms. In this case we frame the data as a supervised learning
problem where each individual input vector is mapped to a unique output value. To do
this, input feature vectors must be extracted from the raw sensor data. Starting with the
simplest case of using one sensor's historical data to predict the traffic speed at the next
time step, fixed-size, non-overlapping historic feature windows are constructed (Fig 9)
which are mapped to a future output value. The temporal length of the feature window
needs to be determined to ensure best model performance without unneeded complexity
or redundant features; in this case the previous 6 time steps were used, i.e. 90 minutes.
So, for a prediction at time t at target sensor k, considering only historical information
from the target sensor, the input vector \hat{x}_{k,t} would be
[x_k(t-1), x_k(t-2), ..., x_k(t-6)] and the ground truth output is the traffic speed
value at the target sensor k at time t.
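The window construction for the single-sensor case can be sketched as follows (illustrative code, with the 6-step window as described above):

```python
import numpy as np

def make_windows(series, window=6):
    """Build (input, target) pairs from one sensor's series: each input
    holds the previous `window` readings x(t-6)..x(t-1) and the target
    is the ground-truth value at time t."""
    X, y = [], []
    for t in range(window, len(series)):
        X.append(series[t - window:t])
        y.append(series[t])
    return np.array(X), np.array(y)
```

Stacking the corresponding windows from upstream and downstream sensors side by side yields the multi-sensor input vectors of Table 1.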
Figure 9: Input windows for time series prediction
Expanding this to include upstream and downstream sensors is fairly simple: the
input vector becomes [x_k(t-1), ..., x_k(t-6), x_{j_1}(t-1), ..., x_{j_J}(t-6)], where
x_{j_1}, x_{j_2}, ..., x_{j_J} are the upstream and downstream sensors to be included in the
input data. This paper focuses only on predictions for a single target sensor at a time. Thus,
the problem definition becomes to develop a functional approximation F such that

\sum_t \left(F(\hat{x}_{k,t}) - x_k(t)\right)^2

is minimised, i.e. the sum of the squared errors between the function mapping, in this case
a model prediction, and the true speed value is minimised. For each modelling technique
four different input vectors are tested in order to evaluate the impact of spatial features;
these can be seen in Table 1.
Table 1: Four model input vectors taken over 6 previous time steps
Input 1  Target sensor
Input 2  Target sensor + downstream sensors
Input 3  Target sensor + upstream sensors
Input 4  All sensors
3.4 Model speciﬁcation and tuning
Machine learning models can contain hyper-parameters that determine structural features
of the model and are very important to model performance. These hyper-parameters
can be highly sensitive, with small perturbations resulting in large changes in prediction
outcomes. They are also difficult to determine heuristically, so an empirical method
must be employed to find the most appropriate values. The process of "tuning"
hyper-parameters requires repeatedly testing model accuracy with different sets of
hyper-parameters on a data set that was not used during the training procedure; this is
called model validation. It is not to be confused with model testing: the validation data
set must be distinct from the testing data set, because hyper-parameter values are chosen
based on validation performance, which gives a potentially biased representation of model
accuracy. A separate test set, sharing no overlap with the training or validation sets, must
be held back for testing model prediction accuracy and offers an unbiased view of model
performance on unseen data. The raw data set is split into train, validation and test sets.
This is done in a sequential manner in order to avoid look-ahead bias, so data from
2019-05-01 to 2019-08-31 is used for model training, data from 2019-09-01 to 2019-09-30
for validation, and the data for October 2019 for model testing. The train-validation-test
workflow can be seen in Fig 10.
Figure 10: Train validation and test procedure for modeling
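The chronological split described above is a one-liner per partition (a sketch assuming a long-format frame with a `timestamp` column, which is an illustrative name):

```python
import pandas as pd

def sequential_split(df):
    """Chronological train/validation/test split to avoid look-ahead bias:
    May-August for training, September for validation, October for testing."""
    train = df[df["timestamp"] < "2019-09-01"]
    val = df[(df["timestamp"] >= "2019-09-01") & (df["timestamp"] < "2019-10-01")]
    test = df[df["timestamp"] >= "2019-10-01"]
    return train, val, test
```

A random shuffle split, common in other ML tasks, would leak future observations into training here, which is why the sequential scheme is used instead.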
The OLS model implemented is described by the following equation in the univariate case:

\hat{x}_t = \beta_0 + \sum_{i=t-6}^{t-1} \beta_i\, x_i

where \hat{x}_t is the prediction for the target sensor at time t, x_i are the traffic speed
values at the target sensor at time i, and \beta are regression coefficients to be determined
by model fitting. In the multivariate case the equation is

\hat{x}_{k,t} = \beta_0 + \sum_{i=t-6}^{t-1} \beta_{i,k}\, x_{i,k} + \sum_{j=1}^{J} \sum_{i=t-6}^{t-1} \beta_{i,j}\, x_{i,j}

where \hat{x}_{k,t} is the prediction for target sensor k at time t, x_{i,k} are the traffic speed
values at target sensor k, and x_{i,j} are the traffic speed values at the additional J
surrounding sensors.
Gradient Boosted Decision Trees (GBDT), as briefly introduced in the literature review,
are an ensemble machine learning technique that combines many decision trees to create
a strong learner. Decision trees are a popular algorithm but tend to overfit the data
and fail to generalise to unseen data; GBDTs are able to generalise better. They work
by sequentially combining many weak decision trees such that each new tree fits the
residuals of the previous trees and the model improves, employing the logic that
subsequent predictors learn from the mistakes of the previous predictors.
LightGBM (Ke et al. 2017) is an efficient implementation of GBDTs which uses a
histogram-based decision tree algorithm and leaf-wise tree growth rather than the
level-wise method that most other GBDT implementations opt for.
LightGBM has numerous hyper-parameters, tuned here using a grid search method:
iteratively traversing a grid of possible parameter combinations and comparing their
performance on the validation data set. The number of boosting iterations was
determined by an early stopping criterion with a patience of 100, i.e. if validation
performance did not improve after 100 rounds then there were no further boosting
iterations. Final hyper-parameters are shown in Table 2.
Figure 11: Gradient boosted decision tree learning process
Table 2: LightGBM hyperparameters
feature_fraction 0.9
bagging_fraction 0.5
num_leaves 64
bagging_freq 4
learning_rate 0.007
boosting rounds 515
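The grid search can be sketched generically as follows. The parameter values are illustrative, and `evaluate` is a hypothetical stand-in for training a LightGBM model with early stopping and returning its validation error:

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Exhaustively traverse every parameter combination and return the
    one with the lowest validation error.

    param_grid: dict mapping parameter name -> list of candidate values.
    evaluate:   callable(params_dict) -> validation error (lower is better).
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

For example, `grid_search({"num_leaves": [31, 64], "learning_rate": [0.007, 0.05]}, evaluate)` tests all four combinations; grid search is simple and exhaustive but its cost grows multiplicatively with each added hyper-parameter.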
Long Short-Term Memory (LSTM) networks, introduced by Hochreiter & Schmidhuber
(1997), are an adaptation of the Recurrent Neural Network that fixes many of the issues
of its predecessor, such as vanishing/exploding gradients and the handling of long-term
dependencies. Fig 12 shows the structure of an LSTM layer, which is similar to that of
an RNN (Fig 1) in that it is a chain of repeating modules of a neural network, but
the repeating modules have a different structure. The memory module contains input,
output, and forget gates, which respectively carry out write, read, and reset functions
on each cell. The multiplicative gates, ⊕ and ⊗, refer to operations of matrix addition
and dot product respectively and allow the model to store information over long periods,
alleviating the aforementioned vanishing/exploding gradient problem.
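The gating mechanism can be made concrete with a single LSTM cell step in NumPy (a sketch of the standard formulation, not the exact Keras internals):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W (4H x D), U (4H x H) and b (4H,) hold the
    stacked parameters for the input (i), forget (f), output (o) and
    candidate (g) transforms, where H is the hidden size."""
    z = W @ x + U @ h_prev + b           # stacked pre-activations
    H = len(h_prev)
    i = sigmoid(z[0:H])                  # input gate: what to write
    f = sigmoid(z[H:2 * H])              # forget gate: what to reset
    o = sigmoid(z[2 * H:3 * H])          # output gate: what to read
    g = np.tanh(z[3 * H:4 * H])          # candidate cell values
    c = f * c_prev + i * g               # additive cell-state update
    h = o * np.tanh(c)                   # hidden state passed onward
    return h, c
```

The additive update of the cell state `c` is what lets gradients flow across many time steps without vanishing, in contrast to the purely multiplicative recurrence of a plain RNN.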
LSTM networks, like most ANNs, are trained using back-propagation and gradient
descent. Given an error function E(X, \theta), which defines the error between the ground
truth y_i and the predicted output \hat{y}_i of the neural network with input vector x_i,
for a set of input-output pairs (x_i, y_i) \in X and a set of network weights and biases
\theta, gradient descent requires the calculation of the gradient of the error function with
respect to the weights w^k_{ij} and biases b^k_i. Then, according to the learning rate
\alpha, each iteration updates the weights and biases according to

\theta_{t+1} = \theta_t - \alpha \frac{\partial E(X, \theta_t)}{\partial \theta}

where \theta_t represents the parameters of the network (weights and biases) at iteration
t of the gradient descent. The goal is to minimise the error function.
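The update rule above amounts to a few lines of code; this sketch illustrates it on a simple least-squares error rather than the LSTM itself:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=500):
    """Minimise E(theta) = ||X @ theta - y||^2 / (2n) by repeatedly
    stepping against the gradient: theta <- theta - alpha * dE/dtheta."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / n   # dE/dtheta for squared error
        theta -= alpha * grad              # the update rule above
    return theta
```

For a neural network the gradient is obtained by back-propagation rather than this closed form, but the parameter update itself is identical.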
Figure 12: Structure of a general LSTM network layer
The structure of the network used in this paper is shown in Fig 13 and consists of the
input layer, a hidden LSTM layer, a dropout layer, which randomly drops 5% of nodes
during a run through the network and is an effective technique for preventing overfitting
(Srivastava et al. 2014), and finally a fully connected dense layer which outputs the final
prediction. The learning rate was 0.001 and early stopping was used to determine 29
epochs as optimal to balance training error and generalisation strength.
Figure 13: Implemented LSTM network structure
3.5 Evaluation metrics
In order to assess and compare model prediction performance, two evaluation metrics are
employed to evaluate the difference between actual and predicted speed values. These are
Root Mean Squared Error (RMSE), the most frequently used metric in previous work
(Luo et al. 2019), and Mean Absolute Error (MAE). RMSE places more weight on
particularly large individual errors and is thus more useful when large errors are
especially undesirable.
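Both metrics are straightforward to compute:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: the average magnitude of the errors."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def rmse(y_true, y_pred):
    """Root Mean Squared Error: squaring weights large errors more heavily."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))
```

On the error vector [2, 0, 4], for instance, MAE is 2 while RMSE is sqrt(20/3) ≈ 2.58, showing how the single large error dominates the squared metric.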
All model training and computation was carried out using Python 3 running on macOS
Catalina v10.15.6 on a MacBook Pro (2018) with a 2.3 GHz quad-core Intel i5 and 8 GB
of 2133 MHz memory. The LSTM was implemented using the Keras API running on top
of TensorFlow. The LightGBM implementation of GBDTs used here is freely available online.
4.1 Space-time autocorrelations
First, presented here is an evaluation of the spatial and temporal patterns found within
the data, using the measures of autocorrelation previously outlined. Fig 14 shows the
temporal autocorrelations found in the sequence of measurements at the target sensor for
up to 10 lagged time steps. Fig 14a shows a steady decrease in temporal autocorrelation as
the time lag increases. The partial autocorrelation shown in Fig 14b, which controls for
intermediate time lags, shows a steep drop-off once the variable is lagged by more than
one time step.
(a) Autocorrelation (b) Partial autocorrelation
Figure 14: Temporal autocorrelations
Examining the CCF coefficient between the target sensor and the other network sensors
gives a local view of the space-time autocorrelation structure of the network. The CCF
coefficient is calculated both across the whole day and separately for the peak afternoon
period (4pm-8pm), to see how space-time autocorrelations vary both spatially and
temporally. Fig 15 and Fig 16 are both centred at time order 0, the correlation
between sensors at the same point in time. Positive time lags indicate that the sensor
has been lagged in time with respect to the target sensor, whereas a negative lag value is
given when a series has been shifted forwards in time with respect to the target sensor.
Each time lag step represents 15 minutes in real terms. The CCF coefficient
was calculated between the target sensor and all other surrounding sensors. Fig 15 shows
the CCF coefficient values across the whole day, lagged up to 10 time steps, with the
plotting distinguishing between upstream and downstream sensors. Almost all sensors, both
upstream and downstream, show some level of cross-correlation with the target sensor.
Generally, cross-correlation decreases as time lag increases, which is to be expected, with
upstream sensors generally showing greater levels of correlation with the target, although
as shown in Fig 3 the upstream sensors are in this instance located slightly closer on
average than the downstream ones. At peak times, as shown in Fig 16, downstream sensors
show increased CCF coefficient values due to the backward propagation of traffic through the
network at congested times. The plots demonstrate cross-correlations varying both in
space (sensor locations) and time of day.
Figure 15: CCF coeﬃcients for all times
Figure 16: CCF coefficients for afternoon peak hours
This section gives a comparative overview of the different modelling approaches
and their performance when provided with varying input data. The metrics employed to
evaluate model performance are RMSE and MAE, whose calculations are outlined
in the methods section. The data used for evaluation is traffic speed data collected
between 2019-10-01 and 2019-10-31; this data was not used during model training and thus
offers an unbiased estimate of true model prediction accuracy on unseen data. Models
are also compared to a "common sense" approach, which here means taking the
traffic speed at the current time and assuming that it will remain the same for the
next time step. This consideration is often overlooked in traffic forecasting
but is vital to ensure that any proposed, potentially complex, models are in fact able
to provide genuine predictive power when compared to a much simpler, intuitive approach.
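The "common sense" baseline amounts to a one-line persistence forecast (a sketch for clarity):

```python
import numpy as np

def persistence_forecast(series, horizon=1):
    """Naive baseline: predict that the current speed persists, i.e. the
    forecast for time t+h is simply the observation at time t.
    Returns forecasts aligned with the targets series[horizon:]."""
    series = np.asarray(series, dtype=float)
    return series[:-horizon]
```

Because traffic speed changes slowly most of the time, this baseline is surprisingly hard to beat at short horizons, which is exactly why it is a meaningful yardstick in the tables below.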
Table 3: MAE and RMSE scores for target sensor M60/9223A at time t+1
Model Input Data MAE RMSE
OLS Target sensor 1.71923 3.42504
Target sensor + downstream sensors 1.67342 2.97104
Target sensor + upstream sensors 1.72988 3.4096
All sensors 1.69123 2.97821
LightGBM Target sensor 1.66323 3.32318
Target sensor + downstream sensors 1.48142 2.6648
Target sensor + upstream sensors 1.59021 3.18623
All sensors 1.43430 2.62749
LSTM Target sensor 1.64175 3.26481
Target sensor + downstream sensors 1.58658 2.95209
Target sensor + upstream sensors 1.57595 3.32331
All sensors 1.56655 3.17687
Extrapolate 1.59935 3.40976
Table 3 presents a comparison of model accuracy for traffic speed prediction at the
target sensor M60/9223A at time horizon t+1. Taking the OLS model, it is clear that
prediction improvement can be gained by incorporating spatial features. The MAE falls
from 1.71 to 1.69 when the previous time step readings for upstream and downstream
sensors are incorporated as model input alongside previous target sensor readings. In
terms of MAE, however, the OLS model fails to outperform the "common sense" approach,
which takes the speed value at time t as the prediction for the speed at time t+1; it does
outperform this approach with regard to RMSE, which suggests the OLS is making fewer
very large errors than the "common sense" approach. The LightGBM is the overall best
performing model by a clear margin. When only the previous 6 time steps for the target
sensor are considered as input data it fails to outperform the "common sense" approach
in terms of MAE; however, as can be seen from the table, the incorporation of upstream
and downstream sensor readings improves model performance significantly. The LightGBM
model appears the most capable of leveraging the spatio-temporal structure of the data,
with a 15.96% decrease in MAE for LightGBM when incorporating previous time step
data for all sensors compared to only the target sensor. LightGBM is in fact the only
model to outperform the "common sense" approach in terms of MAE, and it does so by
a good margin, an MAE of 1.43 compared to 1.60, but only when incorporating surrounding
sensor data. The LSTM performs with slightly better accuracy than the OLS model but
gains only a slight improvement in error when incorporating upstream and downstream
sensor information; it marginally outperforms the "common sense" approach in terms of
MAE and RMSE.
Table 4 presents the results for model predictions at time horizon t+2. The results
share similarities with those for t+1 predictions, with LightGBM once again being
by far the best performing model, with an MAE of 1.885 and RMSE of 3.796 when
including all sensors as input. The advantage over the common sense approach is
magnified when predicting 2 time steps ahead: there is a 12% decrease in MAE for the
LightGBM model predictions compared to common sense, whereas when predicting one
time step ahead there was a 10.32% difference in error. Again, OLS and LightGBM model
performance improves, with a reduction in MAE and RMSE, when incorporating spatial
features by way of upstream and downstream sensor readings. At this time horizon, the
LSTM model does not seem to take advantage of the spatial elements of the data, as it
fails to improve when incorporating surrounding sensor readings.
Table 4: MAE and RMSE scores for target sensor M60/9223A at time t+2
Model Input Data MAE RMSE
OLS Target sensor 2.29718 4.8005
Target sensor + downstream sensors 2.23167 4.2297
Target sensor + upstream sensors 2.27521 4.65812
All sensors 2.23841 4.19049
LightGBM Target sensor 2.1893 4.62002
Target sensor + downstream sensors 1.91728 3.81149
Target sensor + upstream sensors 2.07082 4.3758
All sensors 1.88533 3.7961
LSTM Target sensor 2.16217 4.68135
Target sensor + downstream sensors 2.20957 4.65229
Target sensor + upstream sensors 2.26725 4.6352
All sensors 2.20115 4.78406
Extrapolate 2.15487 4.65977
The results in terms of MAE for the best performing LightGBM model are shown in
Fig 17, where the MAE of predictions is aggregated for each hour of the day; the plot
shows the influence of both time of day and the inclusion of spatial features on model
performance. During non-peak times the accuracy of models that take input data only
from the target sensor is marginally worse than models that take inputs from both the
target sensor and surrounding sensors upstream and downstream. At peak times, however,
the difference is more pronounced: the error is much higher in models that take input
data from only the target sensor or the target sensor + upstream sensors. Congested
times appear to be when the incorporation of spatial features is most important, especially
downstream sensors, due to the back-propagation of traffic through the network at busy
periods.
For one week, 2019-10-21 to 2019-10-28 (Mon-Sun), Fig 18 shows the predicted traffic
speeds for both the LightGBM and LSTM models, as well as the actual target sensor
readings for that period, also known as the "ground truth" values. Whilst both models
fit the general trend of the data, the LightGBM better matches the ground truth at times
when the traffic speeds are particularly unstable or congested, such as the areas circled.
Figure 17: MAE for all predictions in each hour of the day
Figure 18: Actual and predicted traffic speed values for LightGBM and LSTM models
To reiterate, the main research questions of this study were as follows. To what extent
does traffic data exhibit spatio-temporal patterns? Which is the best performing traffic
forecasting technique: OLS regression, Gradient Boosted Decision Trees or LSTM neural
networks? Does the inclusion of spatial features improve traffic forecasting models? The
results from this study provide clear indications as to the answers. The exploration of
spatio-temporal patterns using the Cross-Correlation Function indicated that there are
clear spatial and temporal dependencies in the data. These spatio-temporal patterns
are dynamic in both space and time, with distinctions found between the influence of
upstream traffic compared to downstream traffic, for instance downstream traffic patterns
becoming more significant during congested periods (Fig 16). The results found
the LightGBM model to be the most accurate by a significant margin; it appears to
be the most appropriate model for capturing the complex spatio-temporal patterns and
generated accurate predictions under both free-flowing and congested conditions (Fig
18) up to 30 minutes into the future. The LightGBM model also saw considerable
improvements when including the effects of upstream and downstream sensors in the
input vector (Table 3), which in turn answers our final question.
5.1 Literature implications
The spatio-temporal correlations studied in this paper concur with ﬁndings from Yue &
Yeh (2008) and Cheng et al. (2012) that traﬃc networks exhibit strong spatial as well
as temporal patterns as traﬃc propagates through the network. Although this is an
aspect that has previously been overlooked in the literature (Vlahogianni et al. 2014)
the importance of space in traﬃc forecasting has been brought more to the fore in recent
years. The CCF proved an e↵ective way of quantifying space time relationships which
could aid in model selection and feature selection for traﬃc forecast models.
As previously stated, the literature is still divided over which modelling techniques
are most appropriate. In relation to this study, both Xia et al. (2019) and Mei et al.
(2018) also found success in developing traffic flow forecasting models with GBDTs. On
the other hand, Essien et al. (2020) found that their Autoencoder-LSTM outperformed
XGBoost when predicting traffic flows across the A56 in Manchester, and Wang & Li
(2018) found that their hybrid Convolutional LSTM network outperformed LightGBM
in predicting travel speed 15 to 90 minutes into the future. In both of these studies
the GBDT models were being used as comparisons against significantly more complex
models and networks with multiple Convolutional and bidirectional-LSTM hidden layers.
There are few, if any, studies that show LightGBM to be outperformed by a pure LSTM
model. This study confirms the findings of Xia et al. (2019) and Mei et al. (2018) that
LightGBM appears to be a useful method, dealing with spatio-temporal relationships
well without the need for supplementary models or network layers to capture the spatial
aspect of the data, as is the case with LSTM networks. The importance of integrating
upstream and downstream sensors agrees with findings by both Gebresilassie (2017) and
Du et al. (2018) and forms part of a growing body of evidence for the importance of
explicitly extracting spatial characteristics of traffic data to improve traffic forecasting.
Perhaps unexpectedly, some of the model variations implemented failed to outperform
a much simpler "common sense" approach that required almost no computation.
This is interesting for a couple of reasons. First of all, it is evidence of just how
challenging it can be to accurately model such a dynamic and complex system as an
urban traffic network. The intricate spatio-temporal relationships in the data, outlined
in this study and previous works (Cheng et al. 2012, Yue & Yeh 2008), mean that traffic
prediction is difficult; it requires serious thought about a number of interconnected
factors in spatial and temporal dimensions, as well as consideration of contextual factors
that drive traffic patterns such as weather and accidents. Second of all, it raises questions
around the methods of evaluation for traffic forecasting models in the literature. Very
few studies of a similar nature to this one test their models against such heuristic
common sense measures, usually opting instead to test developed models against other
common modelling techniques. The problem with this is that it tends to lose sight of the
underlying goals of traffic forecasting; rather than leveraging computational techniques
to enhance or improve existing methods, much research is pitting increasingly complex
models against each other (Karlaftis & Vlahogianni 2011b). As Vlahogianni et al. (2014)
note, increasing model complexity can also be detrimental to the explanatory power of
the model, which in the real world is imperative to make models adaptable and responsive
to dynamic traffic changes (Karlaftis & Vlahogianni 2011b). Tree-based methods such
as GBDTs can offer ways to explain predictions by extracting the input features that
have most influence on model predictions (Chen & Guestrin 2016). This is an area where
models such as LSTM networks struggle.
5.2 Real world applications
Given that this is a field of research that developed with real-world applications in
mind, it is sensible to examine the findings of this paper in the context of real
Intelligent Transport Systems. The results gained here regarding the spatio-temporal
structure of the traffic data emphasise the need for comprehensive and widespread traffic
monitoring and data collection systems: these methods rely on high-quality, rich and
multidimensional raw data sources, and the clear spatial dependence exhibited by the
type of network data shown in this study means that performance is only likely to
improve with the addition of more sensors distributed across the network. The promising
ability of the LightGBM model to provide accurate predictions even in abnormal
conditions (Fig 18) and over multiple time steps (Table 4) suggests that it is a modelling
technique that could be of significant use in real-world applications. It is important to
highlight here that this study focused only on generating forecasts for an individual
sensor. Clearly, for traffic forecasts to be an effective tool in a traffic management system
they need to be extended across the entire network. There are a number of challenges
that come with this, however. It is not clear, for instance, how many upstream and/or
downstream sensors are required for accurate predictions; in this study all sensors
appeared to exhibit some level of spatio-temporal correlation with the target sensor, but
this is unlikely to remain the case as the distance between sensors increases.
A further challenge regarding real-world integration is the availability of real-time
data streams. While this study used historical data to assess performance, real-world
application requires that data can be captured, processed and fed to any model in a
short window of time for any prediction scheme to be viable. The reverse is also true:
there is a challenge in determining which ITS and traffic management technologies
are capable of integrating and adapting to information taken from forecasting models
(Vlahogianni et al. 2014).
Some of the main limitations have been briefly touched upon already in this discussion.
Firstly, the modelling techniques employed in this paper are suited to predictions at a
single sensor at a time, and more general methods able to predict traffic speeds across
an entire network might be preferred. One option is Geographically Weighted Regression
(Brunsdon et al. 1998) as implemented by Gebresilassie (2017). This technique fits a
single regression model for the entire network but allows the relationships between
independent and dependent variables to vary by locality. Luo et al. (2019) developed a
method in which models were built for individual sensors, their predictions combined,
and a function based on spatial weights between sensors applied in order to improve
overall accuracy when making simultaneous predictions across the network.
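A spatially weighted combination of per-sensor forecasts can be illustrated with a minimal sketch. The function name, the Gaussian kernel and the bandwidth below are illustrative assumptions for exposition, not the actual weighting function used by Luo et al. (2019):

```python
import numpy as np

def spatially_weighted_predictions(preds, coords, bandwidth=500.0):
    """Blend per-sensor forecasts using spatial weights between sensors.

    preds  : (n,) raw forecasts, one per sensor-level model
    coords : (n, 2) sensor positions in metres
    Nearby sensors receive the largest weights via a Gaussian kernel,
    so each adjusted forecast borrows strength from its neighbours.
    """
    preds = np.asarray(preds, dtype=float)
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]      # pairwise offsets
    dist = np.sqrt((diff ** 2).sum(axis=-1))            # Euclidean distances
    w = np.exp(-((dist / bandwidth) ** 2))              # Gaussian spatial weights
    w /= w.sum(axis=1, keepdims=True)                   # rows sum to one
    return w @ preds                                    # convex combination

# Four sensors along a carriageway; the middle two are congested
preds = [95.0, 60.0, 58.0, 90.0]                        # km/h forecasts
coords = [[0, 0], [500, 0], [1000, 0], [1500, 0]]
adjusted = spatially_weighted_predictions(preds, coords)
```

Because each row of weights sums to one, every adjusted value is a convex combination of the raw forecasts, smoothing isolated errors at individual sensors.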
A further limitation of this study is the lack of understanding of the causality behind
model predictions. While models may be able to accurately map an input vector to an
output value, this offers no information as to the underlying causes of congestion or
abnormal traffic conditions.

Finally, the homogeneity of the data used for model evaluation limits what can be
inferred from the results. While this paper highlighted the potential of the LightGBM
model when compared to other methods, it is not possible to conclude from this one
study that LightGBM is a more appropriate method than LSTM for traffic forecasting;
for that, further tests would need to be carried out on a number of data sets across
differing road networks.
5.4 Future work
Although this study highlighted the importance of spatial dependence in traffic fore-
casting, beyond the inclusion of explicitly spatial features in the form of upstream and
downstream sensors there was no significant effort to more formally encode the spatial
structure of the data in a way that might help models better learn its latent spatial
structure. There have been attempts to develop models that are "spatially aware":
for instance, Du et al. (2018) and Wang & Li (2018) introduce Convolutional Neural
Networks (Cheng et al. 2018). These were originally developed for image recognition
but have been applied in the traffic forecasting literature because they are designed to
extract spatial features from unstructured data such as images and graphs, which is
appropriate since transport networks can easily be represented as a graph structure.
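The spatial feature extraction a convolutional layer performs can be illustrated without any deep learning framework: sliding a filter along the sensor axis of a speed matrix. The filter values and toy data below are made up for illustration; a real CNN would learn its filters from data:

```python
import numpy as np

def conv1d_sensors(speeds, kernel):
    """Slide a filter along the sensor axis of a (time, sensor) matrix.

    Each output column captures a local spatial pattern between
    adjacent sensors, the kind of feature a convolutional layer
    would learn automatically from data.
    """
    k = len(kernel)
    n_t, n_s = speeds.shape
    out = np.empty((n_t, n_s - k + 1))
    for j in range(n_s - k + 1):
        out[:, j] = speeds[:, j:j + k] @ kernel
    return out

# A hand-picked difference filter highlights speed drops between neighbours
speeds = np.array([[100., 98., 60., 55.],   # congestion at the last two sensors
                   [100., 99., 97., 96.]])  # free flow everywhere
gradients = conv1d_sensors(speeds, np.array([-1.0, 1.0]))
```

The large negative value in the first row marks the sharp speed gradient at the congestion boundary, exactly the kind of localised spatial feature a downstream forecasting model can exploit.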
Given that this study focused on one continuous road, there is room for further explo-
ration of the spatio-temporal patterns of interconnected road networks, to see whether
relationships hold between roads that are adjacent, or even completely disjoint but in
close proximity to one another. Work by Cheng et al. (2012) suggests that the influence
of road links on their neighbours is local and varies widely in space and time, making it
hard to measure spatio-temporal relationships across complex road topologies.
The growth of urban populations globally has resulted in rising levels of traffic and
congestion, creating numerous social, environmental and economic issues (Hymel 2009),
with the pollution caused by increased traffic levels in urban areas even being found
to increase infant mortality rates (Knittel et al. 2016). The need has therefore never
been greater for intelligent traffic management systems guided by effective forecasting
techniques and a comprehensive understanding of the spatial and temporal patterns that
drive traffic flows. This dissertation first examined how traffic forecasting can benefit
from an understanding of how traffic speeds amongst neighbouring locations are related.
Furthermore, it compared the effectiveness of machine learning algorithms in providing
accurate and robust forecasts, improved by incorporating the effects of upstream and
downstream traffic conditions.
A Cross-Correlation Function was used to identify and quantify the spatio-temporal
dependencies of traffic speed data, where it was found that significant relationships exist
that are dynamic in both space and time. There were distinct differences between the
influence of upstream and downstream traffic patterns on a given road segment, which
also varied over the course of the day and provided evidence of traffic "shockwaves"
(Richards 1956) during congested periods as congestion propagated upstream. The
Cross-Correlation Function was found to be an effective method of quantifying space-
time relationships, which could aid in model selection, feature selection and determining
the feasibility of short-term traffic forecasts on a given road network.
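The lagged-correlation idea behind this analysis can be sketched in a few lines. The estimator below normalises by the full-series variances, and the synthetic series (an upstream sensor whose pattern reaches the target two steps later) is purely illustrative, not the M60 data:

```python
import numpy as np

def cross_correlation(x, y, max_lag):
    """Sample cross-correlation r_xy(k) for lags k = 0..max_lag.

    Lag k correlates neighbour speeds x(t) with target speeds y(t + k),
    i.e. how strongly the neighbour series leads the target series.
    """
    x = np.asarray(x, dtype=float) - np.mean(x)
    y = np.asarray(y, dtype=float) - np.mean(y)
    denom = np.sqrt((x ** 2).sum() * (y ** 2).sum())
    return np.array([(x[:len(x) - k] * y[k:]).sum() / denom
                     for k in range(max_lag + 1)])

# Synthetic example: the target echoes the upstream sensor two steps later
rng = np.random.default_rng(0)
upstream = rng.normal(100, 10, 500)
target = np.roll(upstream, 2) + rng.normal(0, 1, 500)

ccf = cross_correlation(upstream, target, max_lag=6)
best_lag = int(np.argmax(ccf))   # the lag at which the CCF peaks
```

The lag at which the CCF peaks estimates the propagation delay between the two locations, which is what makes the CCF useful for choosing lagged features and judging forecast feasibility.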
Three traffic forecasting models were developed and tested on traffic data collected
on the M60 motorway in Manchester, England. The goal was to predict traffic speeds
both 15 minutes and 30 minutes ahead of time. The LightGBM algorithm, an efficient
gradient boosting implementation, was the best performing model by a significant mar-
gin. It outperformed Ordinary Least Squares regression and Long Short-Term Memory
neural networks at both prediction horizons and under varying traffic conditions. It also
outperformed a heuristic approach by a large enough margin to suggest it is a method
with genuine predictive power. This is hypothesised to be due to the ability of the
LightGBM algorithm to more effectively learn the latent spatial structure of the traffic
speed data, as opposed to the hugely popular LSTM neural network sequence prediction
algorithm, which evidence suggests does not capture spatial dependencies well on its
own, resulting in sub-optimal performance as a standalone algorithm that may require
the use of hybrid modelling structures to extract spatial features prior to its application
(Du et al. 2018, Cheng et al. 2012). Empirical evidence was provided for the benefit of
including neighbouring road sensor measurements in traffic forecasting models, given an
appropriate model able to capture the spatial structure. The results highlighted the
importance both of spatial dependencies and of correct model selection and specification
in order to produce acceptable performance in complex traffic forecasting tasks. The
role of geo-space in this study was clear, but until recent years it has been overlooked
in the literature, which leaves open the much broader question of which other related
fields may benefit from an explicit treatment of spatial structure.
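The supervised framing used throughout this work — lagged speeds from the target sensor and its neighbours as features, the speed several steps ahead as the label — can be sketched as follows. The window sizes, sensor layout and random-walk data are illustrative assumptions, and an ordinary least squares fit stands in for LightGBM to keep the sketch dependency-free; it is not the configuration used in this study:

```python
import numpy as np

def build_supervised(speeds, target_col, n_lags, horizon):
    """Turn a (time, sensor) speed matrix into a supervised data set.

    Each row of X stacks the last n_lags readings from every sensor
    (the spatio-temporal feature vector); y is the target sensor's
    speed `horizon` steps ahead.
    """
    X, y = [], []
    for t in range(n_lags - 1, speeds.shape[0] - horizon):
        X.append(speeds[t - n_lags + 1: t + 1].ravel())
        y.append(speeds[t + horizon, target_col])
    return np.array(X), np.array(y)

# Toy data: 3 sensors (upstream, target, downstream), 300 one-minute readings
rng = np.random.default_rng(1)
speeds = 100 + np.cumsum(rng.normal(0, 0.5, (300, 3)), axis=0)

# 15-minutes-ahead task at the middle sensor, using 5 lags per sensor
X, y = build_supervised(speeds, target_col=1, n_lags=5, horizon=15)

# A linear least-squares fit as a stand-in forecaster
A = np.c_[np.ones(len(X)), X]
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = A @ coef
```

Any regressor, from OLS to LightGBM or an LSTM, can be trained on the same (X, y) pairs, which is what makes the comparison between modelling techniques in this study well defined.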
Ahmed, M. S. & Cook, A. R. (1979), Analysis of freeway traffic time-series data by
using Box-Jenkins techniques, number 722.
Bishop, C. M. (2006), Pattern recognition and machine learning, Springer.
Bojer, C. S. & Meldgaard, J. P. (2020), 'Kaggle forecasting competitions: An overlooked
learning opportunity', International Journal of Forecasting.
Box, G. E., Jenkins, G. M. & Reinsel, G. C. (2011), Time series analysis: forecasting
and control, Vol. 734, John Wiley & Sons.
Breiman, L., Friedman, J., Stone, C. J. & Olshen, R. A. (1984), Classification and
regression trees, Wadsworth.
Brunsdon, C., Fotheringham, S. & Charlton, M. (1998), 'Geographically weighted regres-
sion', Journal of the Royal Statistical Society: Series D (The Statistician) 47(3), 431–
Castro-Neto, M., Jeong, Y.-S., Jeong, M.-K. & Han, L. D. (2009), 'Online-SVR for short-
term traffic flow prediction under typical and atypical traffic conditions', Expert Sys-
tems with Applications 36(3, Part 2), 6164–6173.
Chandra, S. R. & Al-Deek, H. (2009), 'Predictions of freeway traffic speeds and volumes
using vector autoregressive models', Journal of Intelligent Transportation Systems.
Chen, T. & Guestrin, C. (2016), XGBoost: A scalable tree boosting system, in 'Proceed-
ings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining', KDD '16, Association for Computing Machinery, New York, NY,
USA, pp. 785–794.
Cheng, N., Lyu, F., Chen, J., Xu, W., Zhou, H., Zhang, S. & Shen, X. (2018), 'Big data
driven vehicular networks', IEEE Network 32(6), 160–167.
Cheng, T., Haworth, J. & Wang, J. (2012), 'Spatio-temporal autocorrelation of road
network data', Journal of Geographical Systems 14(4), 389–413.
Coleri, S., Cheung, S. Y. & Varaiya, P. (2004), Sensor networks for monitoring traffic,
in 'Allerton Conference on Communication, Control and Computing'.
Drew, D. R. (1968), Traffic flow theory and control, Technical report.
Du, S., Li, T., Gong, X. & Horng, S.-J. (2018), 'A hybrid method for traffic flow fore-
casting using multimodal deep learning', arXiv preprint arXiv:1803.02099.
Essien, A., Petrounias, I., Sampaio, P. & Sampaio, S. (2020), 'A deep-learning model
for urban traffic flow prediction with traffic events mined from Twitter', World Wide
Web.
Florio, L. & Mussone, L. (1996), 'Neural-network models for classification and forecast-
ing of freeway traffic flow stability', Control Engineering Practice 4(2), 153–164.
Gebresilassie, M. A. (2017), Spatio-temporal traffic flow prediction.
Griffith, D. A. (1987), Spatial autocorrelation: A primer, Association of American
Geographers, Washington, DC.
Hastie, T., Tibshirani, R. & Friedman, J. (2009), The elements of statistical learning:
data mining, inference, and prediction, Springer Science & Business Media.
Hochreiter, S. & Schmidhuber, J. (1997), 'Long short-term memory', Neural Computation.
Hong, W.-C. (2012), 'Application of seasonal SVR with chaotic immune algorithm in
traffic flow forecasting', Neural Computing and Applications 21(3), 583–593.
Hymel, K. (2009), 'Does traffic congestion reduce employment growth?', Journal of
Urban Economics 65(2), 127–135.
Innamaa, S. (2005), Short-term prediction of traffic situation using MLP-neural networks.
Ishak, S. & Al-Deek, H. (2002), 'Performance evaluation of short-term time-series traffic
prediction model', Journal of Transportation Engineering 128(6), 490–498.
Karlaftis, M. G. & Vlahogianni, E. I. (2011a), 'Statistical methods versus neural net-
works in transportation research: Differences, similarities and some insights', Trans-
portation Research Part C: Emerging Technologies 19(3), 387–399.
Karlaftis, M. & Vlahogianni, E. (2011b), 'Statistical methods versus neural networks in
transportation research: Differences, similarities and some insights', Transportation
Research Part C: Emerging Technologies 19(3), 387–399.
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q. & Liu, T.-Y.
(2017), LightGBM: A highly efficient gradient boosting decision tree, in I. Guyon,
U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan & R. Garnett,
eds, 'Advances in Neural Information Processing Systems 30', Curran Associates, Inc.
Knittel, C. R., Miller, D. L. & Sanders, N. J. (2016), 'Caution, drivers! Children present:
Traffic, pollution, and infant health', Review of Economics and Statistics 98(2), 350–
Leduc, G. (2008), Road traffic data: Collection methods and applications.
Lint, J. (2004), Reliable travel time prediction for freeways, PhD thesis.
Liu, D., Tang, L., Shen, G. & Han, X. (2019), 'Traffic speed prediction: An attention-
based method', Sensors 19(18).
Luo, X., Li, D., Yang, Y. & Zhang, S. (2019), 'Spatiotemporal traffic flow prediction
with KNN and LSTM', Journal of Advanced Transportation 2019, 4145353.
Mei, Z., Xiang, F. & Zhen-hui, L. (2018), Short-term traffic flow prediction based on
combination model of XGBoost-LightGBM, in '2018 International Conference on Sensor
Networks and Signal Processing (SNSP)', pp. 322–327.
Moran, P. A. (1950), 'Notes on continuous stochastic phenomena', Biometrika.
Pfeifer, P. E. & Deutsch, S. J. (1980), 'A three-stage iterative procedure for space-time
modeling', Technometrics 22(1), 35–47.
Richards, P. I. (1956), 'Shock waves on the highway', Operations Research 4(1), 42–51.
Smith, B. L. & Demetsky, M. J. (1997), 'Traffic flow forecasting: Comparison of modeling
approaches', Journal of Transportation Engineering 123(4), 261–266.
Smith, B. L., Williams, B. M. & Keith Oswald, R. (2002), 'Comparison of parametric
and nonparametric models for traffic flow forecasting', Transportation Research Part
C: Emerging Technologies 10(4), 303–321.
Soper, H. E., Young, A. W., Cave, B. M., Lee, A. & Pearson, K. (1917), 'On the
distribution of the correlation coefficient in small samples. Appendix II to the papers
of "Student" and R. A. Fisher', Biometrika 11(4), 328–413.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. (2014),
'Dropout: a simple way to prevent neural networks from overfitting', The Journal of
Machine Learning Research 15(1), 1929–1958.
Sun, B., Cheng, W., Goswami, P. & Bai, G. (2018), 'Short-term traffic forecasting using
self-adjusting k-nearest neighbours', IET Intelligent Transport Systems 12(1), 41–48.
Tobler, W. R. (1970), 'A computer movie simulating urban growth in the Detroit region',
Economic Geography 46(sup1), 234–240.
Vlahogianni, E. I., Golias, J. C. & Karlaftis, M. G. (2004), 'Short-term traffic forecasting:
Overview of objectives and methods', Transport Reviews 24(5), 533–557.
Vlahogianni, E. I., Karlaftis, M. G. & Golias, J. C. (2014), 'Short-term traffic forecasting:
Where we are and where we're going', Transportation Research Part C: Emerging
Technologies 43, 3–19. Special Issue on Short-term Traffic Flow Forecasting.
Wang, W. & Li, X. (2018), 'Travel speed prediction with a hierarchical convolutional
neural network and long short-term memory model framework'.
Wang, X., An, K., Tang, L. & Chen, X. (2015), 'Short term prediction of freeway exiting
volume based on SVM and KNN', International Journal of Transportation Science and
Technology 4(3), 337–352.
Williams, B. M. & Hoel, L. A. (2003), 'Modeling and forecasting vehicular traffic flow as
a seasonal ARIMA process: Theoretical basis and empirical results', Journal of Trans-
portation Engineering 129(6), 664–672.
Xia, H., Wei, X., Gao, Y. & Lv, H. (2019), Traffic prediction based on ensemble machine
learning strategies with bagging and LightGBM, in '2019 IEEE International Conference
on Communications Workshops (ICC Workshops)', pp. 1–6.
Yang, B., Sun, S., Li, J., Lin, X. & Tian, Y. (2019), 'Traffic flow prediction using LSTM
with feature enhancement', Neurocomputing 332, 320–327.
Ye, Q., Szeto, W. Y. & Wong, S. C. (2012), 'Short-term traffic speed forecasting based on
data recorded at irregular intervals', IEEE Transactions on Intelligent Transportation
Systems 13(4), 1727–1737.
Yue, Y. & Yeh, A. G.-O. (2008), 'Spatiotemporal traffic-flow dependency and short-term
traffic forecasting', Environment and Planning B: Planning and Design 35(5), 762–
Zhao, Z., Chen, W., Wu, X., Chen, P. C. Y. & Liu, J. (2017), 'LSTM network: a deep
learning approach for short-term traffic forecast', IET Intelligent Transport Systems.