Department of Geography and Planning
School of Environmental Sciences
Spatio-temporal analysis and
machine learning for traffic speed
forecasting
Author:
Leo McCarthy
Dissertation submitted in partial fulfilment of the degree
MSc Geographic Data Science
September 2020
Word count: 10,175
DECLARATION
I declare that the material presented for this dissertation is entirely my own work and has not previously been presented for the award of any other degree of any institution, and that all secondary sources used are properly acknowledged.
Signature
September 15, 2020
Date
Abstract
Advances in machine learning, computational power and the explosion of traffic data are driving the development and use of Intelligent Transport Systems. Real-time road traffic forecasting is a fundamental element required to make effective use of these Intelligent Transport Systems. Traffic modelling techniques have moved beyond traditional statistical approaches towards data-driven and computational intelligence methods. Often still overlooked is the spatial structure of traffic data and the impact of road network topology on traffic flows. In order to develop effective forecasting methods, understanding the autocorrelation structure among space-time observations is crucial. This paper presents methods for quantifying the complex spatio-temporal relationships present in transportation network data and assesses the impact of upstream and downstream congestion propagation. Additionally, it introduces machine learning techniques for the forecasting of traffic speeds in a supervised learning task where, by leveraging spatial and temporal characteristics of dynamic traffic flows, the LightGBM algorithm is shown to produce state-of-the-art performance on real-world motorway traffic data with predictions up to 30 minutes in advance, outperforming other established forecasting techniques.
Contents

List of Figures
List of Tables

1 Introduction
2 Literature Review
   2.1 Traffic data
   2.2 Space-time autocorrelations
   2.3 Traffic modelling
       2.3.1 ARIMA
       2.3.2 OLS
       2.3.3 Nonparametric machine learning
       2.3.4 Recurrent Neural Networks
       2.3.5 Gradient Boosting
       2.3.6 Developing the current literature
3 Methods and Data
   3.1 Data
       3.1.1 Preprocessing
       3.1.2 Overview
   3.2 Space-time autocorrelations
       3.2.1 Temporal
       3.2.2 Spatio-temporal
   3.3 Experimental set up
   3.4 Model specification and tuning
       3.4.1 OLS
       3.4.2 LightGBM
       3.4.3 LSTM
   3.5 Evaluation metrics
   3.6 Implementation
4 Results
   4.1 Space-time autocorrelations
   4.2 Modelling
5 Discussion
   5.1 Literature implications
   5.2 Real world applications
   5.3 Limitations
   5.4 Future work
6 Conclusion
7 Bibliography
List of Figures

1 Neural Network architectures
2 Decision tree example
3 Sensor locations
4 Heat-map of traffic speeds for all sensors on 2019-09-06
5 Average traffic speeds across each day of the week
6 Average traffic speeds by time of day for each sensor
7 Traffic speeds on 2019-09-06 for each sensor
8 Box plot of average daily traffic speeds across all sensors
9 Input windows for time series prediction
10 Train, validation and test procedure for modelling
11 Gradient boosted decision tree learning process
12 Structure of a general LSTM network layer
13 Implemented LSTM network structure
14 Temporal autocorrelations
15 CCF coefficients for all times
16 CCF coefficients for afternoon peak hours
17 MAE for all predictions in each hour of the day
18 Actual and predicted traffic speed values for LightGBM and LSTM models

List of Tables

1 Four model input vectors taken over 6 previous time steps
2 LightGBM hyperparameters
3 MAE and RMSE scores for target sensor M60/9223A at time t+1
4 MAE and RMSE scores for target sensor M60/9223A at time t+2
1 Introduction
The prediction of future traffic characteristics such as flow, travel time and speed has been a key component of Intelligent Transport Systems (ITS) for the past couple of
decades. Accurate forecasting of future traffic conditions has benefits for both road
users and traffic authorities alike. The broad aims of Intelligent Transport Systems are
to enable road users and administrators to be better informed and make safer and more
efficient use of transport networks. Specifically, the foremost consideration of almost
all traffic management systems is reducing the high levels of congestion that generate
economic, social and environmental issues in many cities around the world (Knittel et al.
2016).
The rapid development of technologies that allow for the capture and storage of data
describing the prevailing traffic conditions such as sensors, radars, inductive loops and
GPS (Leduc 2008) has resulted in increasingly sophisticated ITS. Real-time traffic information is vital for effective and swift traffic management, allowing traffic authorities to actively respond to prevailing traffic conditions in order to reduce congestion, provide information to road users and deal with collisions. It is a key element in many of the operational components of ITS such as Advanced Traveller Information Systems (ATIS) and Dynamic Traffic Assignment (DTA). In the past, much of the focus from traffic managers was on reactive traffic measures based on current traffic conditions, dealing with congestion once it had developed with no attempt to anticipate it. Now that many authorities possess real-time traffic data feeds and subsequently large volumes of historical traffic data, the focus has switched towards developing methods for forecasting traffic conditions ahead of time, allowing for proactive traffic measures based on expected traffic conditions rather than making decisions based on a traffic state soon to be obsolete.
The past decade has seen a significant uptick in research surrounding the development of models for predicting traffic states ahead of time. Currently there is far from any consensus as to the most effective methods; this is in part due to the variability in problem framing. Given there is no universal standard in traffic data collection and aggregation techniques, there are huge differences in structure across traffic data sets, with differences in data size, type, frequency and the prediction horizon.
Given the increasing variety and huge volume of traffic data being collected, more recent years have seen a move away from traffic theory and towards more data-driven and machine-learning methods in an attempt to develop fast, accurate and scalable models. Initial attempts mainly involved the use of parametric univariate time series models such as the Autoregressive Integrated Moving Average (ARIMA) and Seasonal Autoregressive Integrated Moving Average (SARIMA) methods (Williams & Hoel 2003). Many of these models, however, have shown shortcomings in generating traffic predictions, which has led to the wider use of nonparametric machine-learning (ML) or deep-learning (DL) based models, which form the basis for most of the current literature. These are advantageous for a few different reasons: ML and DL models are much better at extracting non-linear relationships within the data, they can handle very high-dimensional data and co-linear features more effectively, and they make no prior assumptions regarding the normality of the data distribution, all of these factors culminating in more accurate and robust predictions. There are several examples of ML applied to traffic forecasting, such as Support Vector Regression (Castro-Neto et al. 2009) and K-nearest-neighbour (Sun et al. 2018). The most current literature is largely centred around Recurrent Neural Networks (RNN) and specifically Long Short-Term Memory (LSTM) networks; the main reason behind their usage is their high representational power and ability to capture short and long temporal dependencies of traffic data. Gradient Boosting, and in particular efficient implementations of Gradient Boosted Decision Trees (GBDT) such
as LightGBM and XGBoost, are hugely popular amongst data science practitioners and in competitive data science events such as those hosted on Kaggle. They have shown state-of-the-art performance on a range of tasks including time-series prediction and applications to spatio-temporal data (Bojer & Meldgaard 2020). Despite this, there has been limited attempt to apply them to the problem of traffic forecasting; this paper will attempt to bridge that gap by implementing LightGBM for traffic prediction and comparing results to both OLS regression and LSTM, the most popular Neural Network architecture for traffic forecasting.
The First Law of Geography (Tobler 1970) states "everything is related to everything else, but near things are more related than distant things." This would imply that traffic speeds at a given location are not only affected by the traffic state at previous points in time but also by traffic states at other nearby locations, given the connected nature of road networks. Shockwave theory has long been used to model the downstream-to-upstream progression of queues (Richards 1956), and several studies have shown traffic networks to exhibit both spatial and temporal patterns (Cheng et al. 2012), which must be carefully considered in any modelling framework. Recent studies have shown the improvement of predictions when including upstream or downstream traffic information (Chandra & Al-Deek 2009). The importance of space and network topology is clear; however, it is often overlooked and is still an area that requires further development in the domain of traffic forecasting (Vlahogianni et al. 2014). More recently, attempts have been made to integrate both spatial and temporal dependence in the modelling process (Luo et al. 2019). Furthermore, studies have shown that the inclusion of exogenous variables such as current weather conditions, incident reports and geo-tagged tweets can improve prediction accuracy (Essien et al. 2020). This paper will attempt to quantify the importance of spatial variables while also exploring the effectiveness of state-of-the-art machine learning models for traffic prediction.
The main aims of this dissertation can be summarised as follows:
• To identify and quantify the relevant spatio-temporal patterns present in traffic speed data.
• To develop traffic speed prediction models integrating both spatial and temporal elements of traffic speed data, comparing statistical and machine learning techniques.
• To assess the importance of exploiting the spatial structure of traffic data on model performance.
The remainder of this thesis is organised as follows. Section 2 presents an overview of
the methodologies prevalent in the current traffic forecast literature. Section 3 provides
a description of the data and outlines the experimental setup and model specifications.
Section 4 presents the findings. Section 5 discusses the findings and their implications.
Section 6 concludes the paper.
2 Literature Review
There is a significant amount of literature relating to short-term traffic forecasting. The continually growing urban population places ever greater emphasis on reliable traffic management and, in turn, traffic prediction methods. Generally, traffic forecasting methods can be divided into two main strands. The first approach is one based on explicit
modelling of the traffic system itself and carrying out simulations incorporating parameters such as traffic flow, signal controls and driver behaviour. Much of this work is grounded in classical traffic flow theory (Drew 1968); macroscopic traffic models define explicit mathematical equations to describe the relationship between variables such as traffic flow, density, congestion shockwaves and signalling. The main advantages of these traffic simulation approaches are that "they allow inclusion of traffic control measures (ramp metering, routing, traffic lights, and even traffic information) in the prediction, and that they provide full insight into the locations and causes of possible delays on the road network of interest" (Lint 2004). However, there are a number of drawbacks to this approach, such as the computational complexity of parameter calibration and also their lack of flexibility. These models require a priori knowledge of traffic processes, their predictive quality relies on the quality of inputs which must be determined "by hand", and they cannot learn the latent traffic patterns from given data.
This paper focuses on the second main approach: statistical and data-driven methods including machine learning (ML). Machine learning methods assume no prior knowledge of the traffic system and aim to define functional approximations between input data and output data (Hastie et al. 2009), learning the relationships in the traffic data to predict variables such as speed and flow. Activity in this research area has been expedited in recent years due to improved data collection methods resulting in large, rich data sets, which has occurred in parallel to the rapid advancement of technology and available computational power, making it possible to implement advanced data-driven methodologies on big data sets. Despite this, there is still considerable variety in applied methods and large areas for development within this research area, as highlighted in recent reviews of the research field (Vlahogianni et al. 2014). One such area is the understanding and effective incorporation of space-time autocorrelations in traffic data.
This section will provide an overview of the relevant literature with regards to traffic
forecasting methods with particular emphasis on exploiting the latent spatial structure
of the underlying traffic network data to improve forecasting models.
2.1 Traffic data
Most of the recent developments in traffic forecasting have come about as a result of data-intensive, as opposed to analytical, approaches and thus their effectiveness is highly reliant on the prevalence, quality and resolution of available traffic data (Vlahogianni et al. 2004). Given the enormous undertaking it would be for a researcher to collect task-specific traffic data, the development of traffic forecasting methods almost always relies
on data already available, collected by transport authorities or previous large research
projects. The two main data characteristics to be considered for traffic forecasting
problems are data resolution and the measured parameters. Resolution of traffic data
can vary greatly, with traffic state readings being made as often as every few seconds
(Ye et al. 2012) to as infrequently as every hour (Hong 2012). The forecasting step is equivalent to the resolution of the data: for example, for data collected every 15 minutes, one forecasting step would be 15 minutes in advance of the current time step. The forecasting horizon is a related concept and refers to the specific time horizon in the future for which predictions are to be made, and can include several forecasting steps depending on the data resolution. For instance, Ishak & Al-Deek (2002) and Liu et al. (2019) conclude that prediction becomes increasingly difficult as the prediction horizon increases. However, as Vlahogianni et al. (2014) notes, the usefulness of a prediction model degrades given too short a prediction horizon, as for the majority of traffic management applications a reasonable amount of notice is required for proactive steps to be put in place or information to be communicated to road users. It is also the case that particularly short time steps make predictions difficult due to large
fluctuations in readings from one step to the next (Smith & Demetsky 1997); because of this, high-resolution data is often aggregated for easier modelling (Florio & Mussone 1996).
There are several methods of data collection for traffic data; the most common by some way is traffic sensors (Coleri et al. 2004). These can offer measurements for a range of traffic characteristics such as speed, flow, occupancy and travel times. There is no real consensus in the literature on which variable is more suitable in describing underlying traffic conditions, but Vlahogianni et al. (2014) found that the most common characteristics chosen for prediction were flow and travel time. There have also been attempts to forecast different measures simultaneously, such as Florio & Mussone (1996), who tried to predict the evolution of traffic flow, density and speed over 10 minutes. However, Innamaa (2005) found that one single model predicting two variables gave worse forecasts than two separate models for speed and flow. In addition to this, some researchers have found success in incorporating exogenous variables in the modelling process. These can be things such as weather conditions and traffic incident updates mined from Twitter, as done by Essien et al. (2020), or, in the case of Xia et al. (2019), contextual temporal information such as the day of the week or whether it is a national holiday.
Defining the appropriate data resolution and measurement parameters is of particular importance for data-driven methods, as it affects the quality of information about traffic conditions lying in the data (Vlahogianni et al. 2004). Poor or inappropriate data for the task at hand could result in a poor model; the "garbage in, garbage out" phenomenon.
phenomenon.
2.2 Space-time autocorrelations
Whilst many traditional approaches to traffic forecasting treat it as a stationary, single-point time series problem, the most current literature makes it clear that it is of paramount importance to consider both spatial and temporal dimensions of the data in order to develop accurate traffic forecasting models. First provided, therefore, is an overview of the literature regarding autocorrelations in space-time and their presence across traffic networks. (Temporal) autocorrelation is a term most prevalent in time series analysis and signal processing; it is a mathematical representation of the degree of similarity between a given time series and a lagged version of itself over successive time intervals. However, it can also be extended to the spatial dimension, hence the term "spatial autocorrelation", which has groundings in the First Law of Geography (Tobler 1970) and can be defined as follows: "Spatial autocorrelation is the correlation among values of a single variable strictly attributable to their relatively close locational positions on a two-dimensional surface, introducing a deviation from the independent observations assumption of classical statistics" (Griffith 1987). Until recently the spatial component of autocorrelations in traffic data was overlooked; however, a number of studies, most notably Cheng et al. (2012), highlighted the extent of spatial autocorrelation in road network data and its potential importance to traffic forecasting. Various indices exist in the literature to measure autocorrelations in both space and time, most of which are derived from Pearson's product moment correlation coefficient (PMCC) (Soper et al. 1917), which given two variables X and Y is defined as their covariance over the product of their standard deviations, or the equation:
\[
\rho_{X,Y} = \frac{E[(X - \bar{x})(Y - \bar{y})]}{\sigma_X \sigma_Y} \tag{1}
\]

where x̄, ȳ and σ_X, σ_Y are the means and standard deviations of variables X and Y respectively. The coefficient ρ_{X,Y} is used as a measure of the strength of linear
8
association between variables and can fall in the range −1 to 1, with 1 indicating perfect positive correlation, −1 perfect negative correlation and 0 no correlation. Temporal autocorrelation can be calculated by taking a lagged specification of the PMCC: rather than calculating the correlation between two variables, it is calculated between a variable at time t and the same variable at time t−k for some lag k.
Moran's I is a measure commonly found in the broader geospatial literature and is an extension of the PMCC to the spatial dimension (Moran 1950); it is more complicated than temporal autocorrelation in that correlation can occur in any direction. Moran's I, however, does not consider the temporal dimension, so cannot capture dynamic spatio-temporal correlation properties. Thus, adaptations are required to capture spatial and temporal correlations simultaneously. A global Space-Time Autocorrelation Function (ST-ACF) is proposed by Pfeifer & Deutsch (1980) and implemented in the domain of network flow data by Griffith (1987); this measures the cross-covariances between all possible pairs of locations lagged in both time and space. The ST-ACF has been used in Space-Time Autoregressive Integrated Moving Average (STARIMA) modelling to determine the range of spatial neighbourhoods that contribute to the current location measurement at a given time lag (Pfeifer & Deutsch 1980).
As is the case with Moran's I, the ST-ACF has a local variant. Local spatio-temporal correlations can be quantified using the Cross-Correlation Function (CCF) (see Box et al. (2011)). This treats two time series in separate spatial locations as a bivariate stochastic process and measures the cross-covariance coefficients between each series at specified time lags. Yue & Yeh (2008) show that the CCF can effectively measure spatio-temporal dependencies between a road sensor and its neighbours and can be used for analysing the forecast-ability of a given traffic data set. For example, a significant positive CCF coefficient between sensor A at time t and downstream sensor B at time t−k would suggest a backward propagation of congestion from B to A over the time period k. Despite the empirical evidence provided by Yue & Yeh (2008) and Cheng et al. (2012) that the CCF is a sound method of measuring spatio-temporal relationships between neighbouring road links, it has had limited exposure in the field of traffic forecasting, where it could be used for determining appropriate modelling approaches or selecting spatial model-input features. In this study the CCF will be implemented to determine the extent of local spatio-temporal autocorrelations and their dynamic behaviour; the exact specification with respect to this data set will be provided in the methods section.
2.3 Traffic modelling
Traffic forecasting has evolved a great deal over the past couple of decades; however, there is still no unified modelling approach. This section will provide an overview of the most used prediction modelling techniques, both traditional and current. The data-driven forecasting methods can broadly be divided into two distinct categories: statistical parametric approaches and non-parametric machine learning.
2.3.1 ARIMA
Parametric models have a fixed and finite number of parameters, the number determined prior to model fitting. One of the most popular parametric models for time series prediction, both within traffic forecasting and across other disciplines, is the Autoregressive Integrated Moving Average (ARIMA) model. ARIMA uses a statistical approach to identify recurring patterns of different periodicity from previous observations of a time series. It can be defined by the following differencing equation:

\[
Y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \theta_1 \epsilon_{t-1} + \dots + \theta_q \epsilon_{t-q} + \epsilon_t \tag{2}
\]
9
This is an ARIMA(p, d, q) model where:

y_{t−1}, …, y_{t−p} represent the previous p values of the time series (the autoregressive part of the model);
Y_t represents the forecast for y at time t;
ε_t, …, ε_{t−q} are zero-mean white noise terms (the q moving average terms);
φ and θ are parameters to be determined by model fitting;
d is the order of differencing applied to the data set, as the time series must be stationary for the application of ARIMA.
The first recorded application of the ARIMA model in the transport forecast literature was by Ahmed & Cook (1979), in which an ARIMA(0,1,3) model was proposed to predict traffic flow and occupancy on freeways, where it was shown to be more accurate than previous smoothing techniques. A primary assumption made by this model is the stationarity of the mean, variance, and autocorrelation. Thus, a major criticism directed toward ARIMA models concerns their tendency to perform poorly during abnormal traffic states and be unable to predict extreme values in time series (Vlahogianni et al. 2004). Furthermore, the univariate nature of ARIMA means that the integration of spatial or contextual features is not possible.
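As an illustration, a minimal sketch of fitting the ARIMA(0,1,3) specification above using the statsmodels library is given below; the file and variable names are hypothetical, and any regularly spaced univariate speed series would do.

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical input: a univariate series of 15-minute speed readings
speeds = pd.read_csv("sensor_speeds.csv", index_col=0, parse_dates=True).squeeze()

# order=(p, d, q): no AR terms, first-order differencing, 3 MA terms,
# matching the ARIMA(0,1,3) model of Ahmed & Cook (1979)
result = ARIMA(speeds, order=(0, 1, 3)).fit()
print(result.forecast(steps=1))  # one-step-ahead (15-minute) forecast
```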
2.3.2 OLS
Other linear methods include Ordinary Least Squares (OLS) regression and its extension Geographically Weighted Regression, as applied by Gebresilassie (2017). OLS allows the simple integration of multiple input variables and additional features such as multiple sensor measurements across a network or contextual features. OLS regression aims to find a linear relationship between input and output variables, and an OLS model is defined as:

\[
y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon_i \tag{3}
\]
where y_i is the response variable, the x_i (i = 1, 2, …, n) are the n independent predictor variables, ε_i is a random error term, β_0 is the model constant (intercept) and the β_i (i = 1, 2, …, n) are the model parameters to be determined using the Ordinary Least Squares estimator, which is defined by the equation:

\[
\hat{\beta} = (X'X)^{-1} X'y \tag{4}
\]
Parametric approaches have favourable statistical properties and capture regular variations very well; however, they find prediction difficult given more volatile traffic states, such as when a traffic accident or severe congestion occurs (Smith et al. 2002). Further to this, the aforementioned parametric models make the assumption of linearity between input and response variables, which makes them sub-optimal when considering the complex spatio-temporal structure and non-linear relationships in traffic data.
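For concreteness, the estimator in equation (4) can be computed directly with numpy; the sketch below is a bare closed-form implementation, not the fitting routine used later in this paper.

```python
import numpy as np

def fit_ols(X, y):
    """Closed-form OLS estimator: beta_hat = (X'X)^(-1) X'y (equation 4).

    X is an (n, p) matrix of predictors, y an (n,) response vector.
    A column of ones is prepended so beta[0] is the intercept.
    """
    X = np.column_stack([np.ones(len(X)), X])
    # Solving the normal equations is numerically safer than inverting X'X.
    return np.linalg.solve(X.T @ X, X.T @ y)
```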
2.3.3 Nonparametric machine learning
Nonparametric models function in a fundamentally different way to parametric models. Nonparametric models do not assume that the structure of a model is fixed, and in most cases the model complexity grows to accommodate the complexity of the data; in other words, the model structure is "learned" from the data (Bishop 2006). Compared to parametric methods, nonparametric methods are believed to offer more flexibility in fitting nonlinear characteristics and are better able to process noisy data (Karlaftis & Vlahogianni 2011a). These methods include algorithms such as K-Nearest Neighbours (KNN), Support Vector Regression (SVR) and Artificial Neural Networks (ANN). There have been several attempts to implement machine learning models for traffic forecasting (Sun et al. 2018, Castro-Neto et al. 2009, Essien et al. 2020) and even hybrid combinations of these models (Wang et al. 2015, Luo et al. 2019). These non-linear, flexible and multivariate models have proven to show greater accuracy in traffic forecasting than parametric models and are better able to leverage the spatio-temporal structure of the traffic data (Vlahogianni et al. 2014).
2.3.4 Recurrent Neural Networks
Recurrent Neural Networks (RNN) are a subset of Artificial Neural Networks (ANN) that aim to preserve the temporal dimension of time series data using a hidden recurrent state. They are seen as the most appropriate ANN architecture for series prediction problems and have shown success in traffic forecasting, as demonstrated by Cheng et al. (2018) when using an RNN model to predict traffic flow in the case of special events. In conventional ANNs there are only full connections between adjacent layers, but no connections among the nodes of the same layer. As Zhao et al. (2017) explains, the ANN may be sub-optimal when dealing with spatio-temporal problems, because there are always interactions among the nodes in a spatio-temporal network. Different from conventional networks, the hidden units in an RNN receive feedback from the previous state to the current state (Fig 1), allowing them to understand relationships between states, i.e. time steps.
(a) Simple feed-forward Artificial Neural Network
(b) Recurrent Neural Network (Essien et al. 2020)
Figure 1: Neural Network architectures
RNNs can, however, often struggle with long-term dependencies and are unable to process a large number of time lags due to correlations caused by periodic or seasonal structures that are often seen in spatio-temporal time series. For this reason, most traffic forecast literature now implements the Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber 1997) adaptation of the RNN. The LSTM network is a special kind of RNN: by treating the hidden layer as a memory unit, LSTM networks can cope with both short and long-term correlations. LSTM neural networks have found great popularity in traffic forecasting; many researchers have applied the algorithm to spatio-temporal traffic data with great success, where it has been found to outperform other popular techniques including ARIMA, Support Vector Machines and Random Forests (Yang et al. 2019, Zhao et al. 2017). It is probably the most popular single ML algorithm in the current traffic forecasting literature.
2.3.5 Gradient Boosting
Gradient Boosting, and in particular Gradient Boosted Decision Trees (GBDT), is a tree-based machine learning method that is becoming increasingly popular due to fast implementations such as LightGBM (Ke et al. 2017) and XGBoost (Chen & Guestrin 2016), and also its state-of-the-art performance across a broad range of problems.
Figure 2: Decision tree example
GBDT works by combining many decision trees (Fig 2) with low accuracy into a model which has significantly higher accuracy than its base learners. It is an additive model with the general expression:

\[
\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F} \tag{5}
\]

where ŷ_i is the response variable given input vector x_i, K is the number of decision trees and f_k is a function from the function space F which contains all possible decision trees. A decision tree (Fig 2) is an algorithm for classification and regression that sequentially splits the data based on feature values; the process is repeated in a recursive manner on each successive subset following the split, and the end nodes represent the predicted value for data in that subset. This carries on until splitting no longer adds value to the predictions (Breiman et al. 1984).
GBDT algorithms, despite their popularity in the Data Mining and ML communities, have been implemented only a couple of times in the current literature: namely by Xia et al. (2019), who implement LightGBM to predict traffic characteristics in Shenzhen and find it significantly outperforms Linear Regression, ANN and ARIMA, and by Mei et al. (2018), who combine both XGBoost and LightGBM for accurate flow predictions on spatio-temporal data.
2.3.6 Developing the current literature
From this review, and as also highlighted in the comprehensive review of the research area by Vlahogianni et al. (2014), it can be seen that historically most effort has been put into employing univariate statistical models, predicting traffic volume or travel time and using data from single point sources. Currently there is considerably more effort being applied to incorporating the spatial structure of the data using multivariate methods such as OLS, LSTM and SVM; however, there is often less of an attempt to understand how or why spatial dependencies affect traffic forecasting models. Gradient boosting methods such as LightGBM have largely been overlooked in the literature. Lastly, attempts to predict traffic speed are far less prevalent than other traffic characteristics such as flow or travel time. With this in mind, this paper will first attempt to quantify the significance and nature of local spatial autocorrelations in traffic data and then implement three modelling approaches, LSTM, LightGBM and OLS, in order to predict traffic speeds, assessing their performance with respect to both each other and also the spatial richness of various input data.
3 Methods and Data
This section aims to outline the applied methodology and experimental procedure carried out, guided by the key research themes of this paper. As a reminder, these can be briefly summarised as follows:
• Quantify spatio-temporal relationships in traffic data.
• Develop and compare traffic forecasting models using OLS, LSTM and LightGBM.
• Assess the impact of spatial features on model performance.
3.1 Data
The data used is real-world traffic data collected by Highways England, which can be downloaded from their WebTris website¹, a map-based data query service with measurements collected by road sensors along the Strategic Road Network (SRN).
The selected data set contains measurements for 21 site locations along the M60 motorway in England, in the clockwise direction between J7 and J20 (Fig 3). The data contains measurements for both traffic speed and traffic volume, although only speed was considered for this analysis. The total time period for the data used was between 2019-05-01 and 2019-10-31, a 6-month period (184 days) for which measurements are available at 15-minute intervals, with traffic speeds aggregated over each 15-minute period. This results in 17,664 speed readings per site across the 21 sites, or 370,944 individual data points in total.

¹ https://webtris.highwaysengland.co.uk
Figure 3: Sensor locations
3.1.1 Preprocessing
Across the 21 sites there were 1,668 missing data points, equating to 0.44% of the data, a very small proportion. For a given time stamp where missing data points existed, the missingness was not consistent across all sites; further to this, discarding entire time steps' worth of data can result in inconsistent time steps across the series and be detrimental to model performance, as periodic and seasonal effects are distorted. Because of this, where data was missing it was imputed with values calculated by taking the average speed for that sensor location at that particular time of day across the rest of the data set, in order to maintain the day-of-week and time-of-day characteristics of the traffic data.
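A brief pandas sketch of this imputation strategy is shown below; the column names are hypothetical, and the grouping (sensor plus time of day) follows the description above.

```python
import pandas as pd

# Hypothetical long-format readings: 'site' (sensor id),
# 'timestamp' (datetime) and 'speed' (mph, containing NaNs)
df = pd.read_csv("webtris_speeds.csv", parse_dates=["timestamp"])

# Replace each missing reading with that sensor's mean speed at the
# same time of day, preserving the daily traffic profile of the site.
df["speed"] = (
    df.groupby(["site", df["timestamp"].dt.time])["speed"]
      .transform(lambda s: s.fillna(s.mean()))
)
```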
3.1.2 Overview
Before carrying out any analysis or modelling, some exploratory work was carried out with the data in order to identify any obvious patterns and spot any potential measurement errors. Figure 4 is a heat-map showing a snapshot of average traffic speeds on 2019-09-06. The time is shown on the x-axis and the spatial component is shown on the y-axis, with sensor locations ordered from downstream to upstream, or from south to north as shown in Fig 3. Both spatial and temporal patterns are apparent, with a drop in speed around peak times and also speed values clustered based on sensor location, which can be seen in the vertical strip patterns. The three horizontal strip patterns (M60/919K, M60/9229L, M60/9298K) represent road sensors located on entry or exit slip roads where speed is consistently reduced.
Figure 4: Heat-map of traffic speeds for all sensors on 2019-09-06
Figure 5: Average traffic speeds across each day of the week
Figure 6: Average traffic speeds by time of day for each sensor
Figure 7: Traffic speeds on 2019-09-06 for each sensor
Fig 5 shows time-of-day and day-of-week variations in traffic speeds across the network. During the working week severe congestion peaks can be seen around the "rush hour" period, with speeds dropping by almost half for 1-2 hours every day. There is a clear weekend effect, with speeds higher across the day and no visible congestion at any point; Fridays also show less severe congestion that occurs earlier in the day than during the rest of the working week. Clearly traffic speeds are largely influenced by the commuting patterns of the working population. Fig 6 shows the distribution of aggregated speeds across the day for each individual sensor, coloured depending on location from the southernmost sensor progressing downstream. Not all sensors share the same daily patterns, with some exhibiting more severe congestion patterns than others; the spatial dependence is obvious, with closely located pairs of sensors sharing more similarities in traffic speeds compared to sensor pairs more widely spread. The congestion during the PM period is far more severe than that in the AM period. Meanwhile, Fig 7 shows a snapshot of traffic speeds by hour taken on 2019-09-06, which highlights the volatility in traffic patterns on a day-to-day basis; the patterns on this particular day differ significantly from the average patterns over the whole time series shown in Fig 6. Again shown is the strong spatial dependence, with abnormal congestion centred around a handful of adjacent sensors. Congestion events can also extend beyond an hour-by-hour basis: Fig 8 shows a box plot of average speeds across the network by day of the week. The large interquartile range and number of outliers suggest there can be large variation in traffic speeds not only within days but between daily aggregated speeds as well. These variations, along with the complex spatial relationships and factors such as unexpected events, traffic accidents, slow moving vehicles and bad weather, bring many challenges for robust short-term traffic prediction by traditional linear regression models or ARIMA.
Figure 8: Box plot of average daily traffic speeds across all sensors
3.2 Space-time autocorrelations
3.2.1 Temporal
The standard autocorrelation function is used to calculate the level of correlation between the speed value at a sensor and a lagged version of itself at previous time steps. Given a time series y_1, y_2, …, y_t at the target sensor, the autocorrelation at lag k is described by the equation:

\[
\rho(k) = \frac{E[(Y_t - \bar{y})(Y_{t+k} - \bar{y})]}{\sigma_{Y_t} \sigma_{Y_{t-k}}}, \qquad k = 0, \pm 1, \pm 2, \dots \tag{6}
\]

where E[(Y_t − ȳ)(Y_{t+k} − ȳ)] is the covariance between Y_t and Y_{t−k}, and σ_{Y_t}, σ_{Y_{t−k}} represent the standard deviations of Y_t and Y_{t−k} respectively. ρ(k) measures the relationship between y_t and y_{t−k}.
Also measured is the partial autocorrelation. This is similar to the standard autocorrelation but accounts for the influence of intermediate values in the time series between y_t and y_{t−k}. The Partial Autocorrelation Function is given as:

\[
\rho_Y(k) = \frac{\operatorname{Cov}\left(Y_t \mid Y_{t-1}, \dots, Y_{t-k+1},\; Y_{t-k} \mid Y_{t-1}, \dots, Y_{t-k+1}\right)}{\sigma_{Y_t \mid Y_{t-1}, \dots, Y_{t-k+1}}\; \sigma_{Y_{t-k} \mid Y_{t-1}, \dots, Y_{t-k+1}}} \tag{7}
\]

This gives a clearer understanding of the relationship between a time series and a version of itself lagged over multiple time steps, as it removes information carried by intermediate time steps. This will help to determine the feasibility of multi-step forecasts using a single sensor.
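Both functions are readily available in statsmodels; a short sketch of computing them for the target sensor (series name hypothetical) is:

```python
from statsmodels.tsa.stattools import acf, pacf

# speeds: univariate series of 15-minute readings for the target sensor
autocorr = acf(speeds, nlags=10)            # equation (6), lags 0..10
partial_autocorr = pacf(speeds, nlags=10)   # equation (7), lags 0..10

for k, (a, p) in enumerate(zip(autocorr, partial_autocorr)):
    print(f"lag {k:2d}: ACF={a:+.3f}  PACF={p:+.3f}")
```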
3.2.2 Spatio-temporal
The introduction of the spatial element to autocorrelation calculations is brought by the Cross-Correlation Function (CCF). Let X_t denote the time series of measurements collected at sensor X up until time T, and let Y_t denote the time series of measurements collected at a different sensor Y, either upstream or downstream, up until time T. The CCF at lag k is then given by the equation:

\[
\rho_{X,Y}(k) = \frac{E[(X_t - \bar{x})(Y_{t+k} - \bar{y})]}{\sigma_X \sigma_Y}, \qquad k = 0, \pm 1, \pm 2, \dots \tag{8}
\]

where E[(X_t − x̄)(Y_{t+k} − ȳ)] represents the cross-covariance between X and Y at lag k, and

\[
\sigma_X = \sqrt{\frac{\sum_{t=1}^{T} (x_t - \bar{x})^2}{T-1}}, \qquad \sigma_Y = \sqrt{\frac{\sum_{t=1}^{T} (y_{t-k} - \bar{y})^2}{T-1}}
\]

This equation allows the correlation to be calculated between sensors in different locations where one is temporally lagged; thus the CCF can be used to quantitatively measure the spatio-temporal relationship of traffic flow and speed.
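A small numpy sketch of equation (8), written directly from the definition above rather than taken from any particular library, might look as follows:

```python
import numpy as np

def ccf(x, y, k):
    """Cross-correlation between x_t and y_(t+k) (equation 8).

    x, y are equal-length arrays of readings from two sensors;
    k may be negative, i.e. y shifted backwards in time.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    if k >= 0:
        a, b = x[:len(x) - k], y[k:]
    else:
        a, b = x[-k:], y[:k]
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    return float(np.mean(a * b))

# e.g. ccf(target, downstream, k=4): correlation with the downstream
# sensor lagged one hour (4 x 15-minute steps) behind the target.
```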
3.3 Experimental set up
In order to train and validate ML models, the data must first be in a form that can be interpreted by the algorithms. In this case the data is framed as a supervised learning problem where each individual input vector is mapped to a unique output value. To do this, input feature vectors must be extracted from the raw sensor data. Starting with the simplest case of using one sensor's historical data to predict the traffic speed at the next time step, fixed-size and non-overlapping historic feature windows are constructed (Fig 9) which are mapped to a future output value. The temporal length of the feature window needs to be determined in order to ensure best model performance but without unneeded complexity or redundant features; in this case the previous 6 time steps were used, i.e. 90 minutes. So, for a prediction at time t at target sensor k, considering only historical information from the target sensor, the input vector x̂_{k,t} would be [x_{k(t−1)}, x_{k(t−2)}, …, x_{k(t−6)}] and the ground truth output is the traffic speed value at the target sensor k at time t.
Figure 9: Input windows for time series prediction
Expanding this to include upstream and downstream sensors is fairly simple; the input vector becomes

\[
\hat{x}_{k,t} = [x_{k(t-1)}, x_{k(t-2)}, \dots, x_{k(t-6)},\; x_{1(t-1)}, x_{1(t-2)}, \dots, x_{1(t-6)},\; \dots,\; x_{j(t-1)}, x_{j(t-2)}, \dots, x_{j(t-6)}]
\]

where x_1, x_2, …, x_j are the upstream and downstream sensors to be included in the input data. This paper focuses only on predictions for a single target sensor at a time. Thus, the problem definition becomes to develop a functional approximation F minimising

\[
\sum_{t=7}^{T} \left[ F(\hat{x}_{k,t}) - x_{k,t} \right]^2
\]

i.e. the sum of the squared errors between the function mapping, in this case a model prediction, and the true speed value is minimised. For each modelling technique four different input vectors are tested in order to evaluate the impact of spatial features; these can be seen in Table 1.
Table 1: Four model input vectors taken over 6 previous time steps

          Data source
Input 1   Target sensor
Input 2   Target sensor + downstream sensors
Input 3   Target sensor + upstream sensors
Input 4   All sensors
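A numpy sketch of how the windowed input vectors can be built from a (time × sensor) speed matrix follows; the function name and matrix layout are assumptions for illustration.

```python
import numpy as np

def make_supervised(speeds, target, lags=6):
    """Turn a (T, n_sensors) speed matrix into supervised (X, y) pairs.

    Each input row stacks the previous `lags` readings of every sensor
    in the matrix; the output is the target sensor's speed at time t.
    """
    X, y = [], []
    for t in range(lags, len(speeds)):
        X.append(speeds[t - lags:t].ravel(order="F"))  # lags grouped per sensor
        y.append(speeds[t, target])
    return np.array(X), np.array(y)

# Input 1 corresponds to passing only the target sensor's column,
# Input 4 to passing the full 21-sensor matrix.
```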
3.4 Model specification and tuning
Machine learning models can contain hyper-parameters that determine structural features of the model and are very important to model performance. These hyper-parameters can be highly sensitive, with small perturbations resulting in large changes in prediction outcomes. They are also difficult to determine heuristically, and an empirical method must be employed in order to determine the most appropriate values. The process of "tuning" hyper-parameters requires repeatedly testing the model accuracy with different sets of hyper-parameters on a data set that was not used during the training procedure, in order to determine their best values; this is called model validation. It is not to be confused, however, with model testing: the validation data set must be distinct from the testing data set, due to the fact that hyper-parameter values are being chosen based on validation performance, which gives a potentially biased representation of model accuracy. A separate test set, sharing no overlap with the training or validation sets, must be held back for testing model prediction accuracy, offering an unbiased view of model performance on unseen data. The raw data set is split into train, validation and test sets. This is done in a sequential manner in order to avoid look-ahead bias, so data from 2019-05-01 to 2019-08-31 is used for model training, data from 2019-09-01 to 2019-09-30 for validation and the data for October 2019 for model testing. The train-validation-test workflow can be seen in Fig 10.
Figure 10: Train, validation and test procedure for modelling
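Assuming the readings sit in a DataFrame indexed by timestamp, the sequential split described above reduces to three label-based slices, a sketch of which is:

```python
# df: speed readings with a DatetimeIndex (assumption); .loc slicing on
# a DatetimeIndex is inclusive of both endpoints.
train = df.loc["2019-05-01":"2019-08-31"]
val   = df.loc["2019-09-01":"2019-09-30"]
test  = df.loc["2019-10-01":"2019-10-31"]
```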
3.4.1 OLS
The OLS model implemented is described by the following equation in the univariate case:

\[
x_t = \beta_0 + \sum_{i=t-6}^{t-1} \beta_i x_i
\]

where x_t is the prediction for the target sensor at time t, the x_i are the traffic speed values at the target sensor at time i, and the β are regression coefficients to be determined by model fitting. In the multivariate case the equation is

\[
x_{t,k} = \beta_0 + \sum_{i=t-6}^{t-1} \beta_{i,k} x_{i,k} + \sum_{i=t-6}^{t-1} \sum_{j=1}^{J} \beta_{i,j} x_{i,j}
\]

where x_{t,k} is the prediction for target sensor k at time t, the x_{i,k} are the traffic speed values at target sensor k, and the x_{i,j} are the traffic speed values at the additional J surrounding sensors.
3.4.2 LightGBM
Gradient Boosted Decision Trees (GBDT), as briefly introduced in the literature review, are an ensemble machine learning technique that combines many decision trees to create a strong learner. Decision trees are a popular algorithm but tend to overfit the data and are unable to generalise to unseen data; GBDTs are able to generalise better. They work by sequentially combining many weak decision trees in such a way that each new tree fits the residuals of the previous tree so that the model improves, employing the logic in which the subsequent predictors learn from the mistakes of the previous predictors (Fig 11).
LightGBM (Ke et al. 2017) is an efficient implementation of GBDTs which uses a histogram-based decision tree algorithm and leaf-wise tree growth methods, rather than the level-wise method that most other GBDT implementations opt for.
LightGBM has numerous hyper-parameters, tuned here using a grid search method, iteratively traversing a grid of possible parameter combinations and comparing their performance on the validation data set. The number of boosting iterations was determined by an early stopping criterion with a patience of 100, i.e. if validation performance didn't improve after 100 rounds then there were no further boosting iterations. Final hyper-parameters are shown in Table 2.
Figure 11: Gradient boosted decision tree learning process²
Table 2: LightGBM hyperparameters
Hyper-parameter    Value
objective          regression
feature_fraction   0.9
bagging_fraction   0.5
num_leaves         64
bagging_freq       4
learning_rate      0.007
boosting rounds    515
² https://towardsdatascience.com/gradient-boosted-decision-trees-explained-9259bd8205af
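A sketch of training with the Table 2 settings through the lightgbm Python API is given below; the feature matrices are assumed to come from the windowing step in Section 3.3, the 10,000-round cap is an arbitrary ceiling for early stopping, and the callback shown is the interface of recent lightgbm releases.

```python
import lightgbm as lgb

params = {
    "objective": "regression",
    "feature_fraction": 0.9,
    "bagging_fraction": 0.5,
    "num_leaves": 64,
    "bagging_freq": 4,
    "learning_rate": 0.007,
}

train_set = lgb.Dataset(X_train, label=y_train)
val_set = lgb.Dataset(X_val, label=y_val, reference=train_set)

booster = lgb.train(
    params,
    train_set,
    num_boost_round=10_000,  # ceiling; early stopping settled on 515 rounds
    valid_sets=[val_set],
    callbacks=[lgb.early_stopping(stopping_rounds=100)],  # patience of 100
)
y_pred = booster.predict(X_test, num_iteration=booster.best_iteration)
```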
3.4.3 LSTM
Long Short-Term Memory (LSTM) networks, introduced by Hochreiter & Schmidhuber (1997), are an adaptation of the Recurrent Neural Network that fixes many of the issues of its predecessor, such as vanishing/exploding gradients and the handling of long-term dependencies. Fig 12 shows the structure of an LSTM layer, which is similar to that of an RNN (Fig 1) in that it is a chain of repeating modules of a neural network, but the repeating modules have a different structure. The memory module contains input, output, and forget gates, which respectively carry out write, read, and reset functions on each cell. The multiplicative gates, ⊕ and ⊗, refer to operations for matrix addition and dot product respectively and allow the model to store information over long periods, eliminating the aforementioned vanishing/exploding gradient problem.
LSTM networks, like most ANNs, are trained using backpropagation and gradient descent. Given an error function E(X, θ), which defines the error between the ground truth y_i and the predicted output ŷ_i of the neural network with input vector x_i, for a set of input-output pairs (x_i, y_i) ∈ X and a set of network weights and biases θ, gradient descent requires the calculation of the gradient of the error function with respect to the weights w^k_{ij} and biases b^k_i. Then, according to the learning rate α, each iteration updates the weights and biases according to

\[
\theta_{t+1} = \theta_t - \alpha \frac{\partial E(X, \theta_t)}{\partial \theta}
\]

where θ_t represents the parameters of the network (weights and biases) at iteration t of the gradient descent. The goal is to minimise the error function.
Figure 12: Structure of a general LSTM network layer
The structure of the network used in this paper is shown in Fig 13 and consists of the input layer, a hidden LSTM layer, a dropout layer, which randomly drops 5% of nodes during a run through the network and is an effective technique for preventing overfitting (Srivastava et al. 2014), then finally a fully connected dense layer which outputs the final prediction. The learning rate was 0.001 and early stopping was used to determine 29 epochs as optimal to balance training error and generalisation strength.
Figure 13: Implemented LSTM network structure
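A minimal Keras sketch of this architecture follows. The LSTM layer width and early-stopping patience are assumptions, since the text above fixes only the dropout rate, the learning rate and the resulting 29 epochs; note that, unlike the flattened LightGBM inputs, the LSTM expects 3-D input of shape (samples, time steps, sensors).

```python
from tensorflow import keras
from tensorflow.keras import layers

n_lags, n_sensors = 6, 21  # window length (Section 3.3) and sensor count

model = keras.Sequential([
    layers.Input(shape=(n_lags, n_sensors)),
    layers.LSTM(64),        # hidden LSTM layer (width assumed)
    layers.Dropout(0.05),   # randomly drops 5% of nodes during training
    layers.Dense(1),        # fully connected layer giving the prediction
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss="mse")

early_stop = keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=100,  # early stopping selected 29 epochs in this study
    callbacks=[early_stop],
)
```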
3.5 Evaluation metrics
In order to assess and compare model prediction performance, two evaluation metrics are employed to evaluate the difference between actual and predicted speed values. These are Root Mean Squared Error (RMSE), the most frequently used metric in previous work (Luo et al. 2019), and Mean Absolute Error (MAE). RMSE tends to place more weight on particularly large individual errors and is thus more useful when large errors are undesirable.
\[
RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}, \qquad MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
\]
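Both metrics are one-liners in numpy; a sketch of the helpers used conceptually throughout the evaluation is:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error: weights large individual errors heavily."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def mae(y_true, y_pred):
    """Mean Absolute Error: average absolute deviation, here in mph."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))
```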
3.6 Implementation
All model training and computation was carried out using Python 3 running on macOS Catalina v10.15.6 on a MacBook Pro (2018) with a 2.3GHz quad-core Intel i5 and 8 GB of 2133MHz memory. The LSTM was implemented using the Keras API³ running on top of TensorFlow. The LightGBM implementation of GBDTs used here is freely available online⁴.

³ https://keras.io
⁴ https://lightgbm.readthedocs.io
4 Results
4.1 Space-time autocorrelations
First presented here is an evaluation of the spatial and temporal patterns found within the data, using the measures of autocorrelation previously outlined. Fig 14 shows the temporal autocorrelations found in the sequence of measurements of the target sensor for up to 10 lagged time steps. Fig 14a shows a steady decrease in temporal autocorrelation as the time lag increases. The partial autocorrelation shown in Fig 14b, which controls for intermediate time lags, shows a steep drop-off once the variable is lagged more than one time step.
(a) Autocorrelation (b) Partial autocorrelation
Figure 14: Temporal autocorrelations
Examining the CCF coefficient between the target sensor and the other network sensors gives a local view of the space-time autocorrelation structure of the network. The CCF coefficient is calculated both across the whole day and also separately for the peak afternoon period (4pm-8pm) to see how space-time autocorrelations vary both spatially and temporally. Fig 15 and Fig 16 are both centred at time order 0, which is the correlation between sensors at the same point in time. Positive time lags indicate that the sensor has been lagged in time with respect to the target sensor, whereas a negative lag value is given when a series has been shifted forwards in time with respect to the target sensor. Each time lag step represents 15 minutes in real terms. The CCF coefficient was calculated between the target sensor and all other surrounding sensors. Fig 15 shows the CCF coefficient values across the whole day lagged up to 10 time steps, with the plotting distinguishing between upstream and downstream sensors. Almost all sensors, both upstream and downstream, show some level of cross-correlation with the target sensor. Generally, cross-correlation decreases as time lag increases, which is to be expected, with upstream sensors generally showing greater levels of correlation with the target, although, as shown in Fig 3, upstream sensors are in this instance located slightly closer on average than downstream sensors. At peak times, as shown in Fig 16, downstream sensors show increased CCF coefficient values due to the backward propagation of traffic through the network at congested times. The plots demonstrate cross-correlations varying both in space (sensor locations) and time of day.
Figure 15: CCF coefficients for all times
Figure 16: CCF coefficients for afternoon peak hours
4.2 Modelling
This section aims to give a comparative overview of the different modelling approaches and their performance when provided with varying input data. The metrics employed to evaluate model performance are RMSE and MAE; the calculations for these are outlined in the methods section. The data used for evaluation is traffic speed data collected between 2019-10-01 and 2019-10-31; this data was not used during model training and thus offers an unbiased estimate of true model prediction accuracy on unseen data. Models are also compared to a "common sense" approach, in this case taking the traffic speed at the current time and assuming that it will remain the same for the next time step. This is a consideration that is often overlooked in traffic forecasting but is vital to ensure that any proposed, potentially complex, models are in fact able to provide genuine predictive power when compared to a much simpler, intuitive approach.
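As a sketch, this "common sense" baseline is a one-line persistence forecast (array name hypothetical; mae and rmse as defined in Section 3.5):

```python
# speeds_test: observed speeds for the target sensor over the test month.
# Persistence: predict the speed at t+1 as the speed observed at t.
y_pred_naive = speeds_test[:-1]
y_true = speeds_test[1:]
print("MAE:", mae(y_true, y_pred_naive), "RMSE:", rmse(y_true, y_pred_naive))
```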
Table 3: MAE and RMSE scores for target sensor M60/9223A at time t+1
Model Input Data MAE RMSE
OLS Target sensor 1.71923 3.42504
Target sensor + downstream sensors 1.67342 2.97104
Target sensor + upstream sensors 1.72988 3.4096
All sensors 1.69123 2.97821
LightGBM Target sensor 1.66323 3.32318
Target sensor + downstream sensors 1.48142 2.6648
Target sensor + upstream sensors 1.59021 3.18623
All sensors 1.43430 2.62749
LSTM Target sensor 1.64175 3.26481
Target sensor + downstream sensors 1.58658 2.95209
Target sensor + upstream sensors 1.57595 3.32331
All sensors 1.56655 3.17687
Extrapolate 1.59935 3.40976
Table 3 presents a comparison of model accuracy for traffic speed prediction at the target sensor M60/9223A at time horizon t+1. Taking the OLS model, it is clear that prediction improvement can be gained by incorporating spatial features: the MAE falls from 1.71 to 1.69 when the previous time step readings for upstream and downstream sensors are incorporated as model input, as well as previous target sensor readings. In terms of MAE, the OLS model fails to outperform the "common sense" approach, which takes the speed value at time t to be the prediction for the speed at time t+1; it does, however, outperform this approach with regard to RMSE, which suggests the OLS is making fewer very large errors than the "common sense" approach. The LightGBM is the overall best performing model by a clear margin. When only the previous 6 time steps for the target sensor are considered as input data, it fails to outperform the "common sense" approach in terms of MAE; however, as can be seen from the table, the incorporation of upstream and downstream sensor readings improves model performance significantly. The LightGBM model appears the most capable of leveraging the spatio-temporal structure of the data, with a 15.96% decrease in MAE for LightGBM when incorporating previous time step data for all sensors compared to only the target sensor. LightGBM is in fact the only model to outperform the "common sense" approach in terms of MAE, and it does so by a good margin, an MAE of 1.43 compared to 1.60, but only when incorporating surrounding sensor data. The LSTM performs with slightly better accuracy than the OLS model but gains only a slight improvement in error when incorporating upstream and downstream sensor information; it marginally outperforms the "common sense" approach in terms of MAE and RMSE.
Table 4 presents the results for model predictions at time horizon t+2. The results share similarities with those for the t+1 predictions, with LightGBM once again being by far the best performing model, with an MAE of 1.885 and RMSE of 3.796 when including all sensors as input. The model performance relative to the common sense approach is magnified when predicting 2 time steps ahead: there is a 12% decrease in MAE in LightGBM model predictions compared to common sense, while when predicting one time step ahead there was a 10.32% difference in error. Again, OLS and LightGBM model performance improves, with a reduction in MAE and RMSE, when incorporating spatial features by way of upstream and downstream sensor readings. At this time horizon, the LSTM model does not seem to take advantage of the spatial elements of the data, as it fails to improve when incorporating surrounding sensor readings.
Table 4: MAE and RMSE scores for target sensor M60/9223A at time t+2

Model        Input Data                             MAE       RMSE
OLS          Target sensor                          2.29718   4.8005
             Target sensor + downstream sensors     2.23167   4.2297
             Target sensor + upstream sensors       2.27521   4.65812
             All sensors                            2.23841   4.19049
LightGBM     Target sensor                          2.1893    4.62002
             Target sensor + downstream sensors     1.91728   3.81149
             Target sensor + upstream sensors       2.07082   4.3758
             All sensors                            1.88533   3.7961
LSTM         Target sensor                          2.16217   4.68135
             Target sensor + downstream sensors     2.20957   4.65229
             Target sensor + upstream sensors       2.26725   4.6352
             All sensors                            2.20115   4.78406
Extrapolate                                         2.15487   4.65977
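The supervised-learning framing behind these input variants can be sketched as follows, assuming the readings sit in a pandas DataFrame with one column per sensor; the neighbouring sensor IDs and the synthetic data are hypothetical, with only the target ID M60/9223A taken from the study.

    import numpy as np
    import pandas as pd

    def build_features(df: pd.DataFrame, target: str,
                       n_lags: int = 6, horizon: int = 1) -> pd.DataFrame:
        # Turn a wide table of speed readings (one column per sensor, one
        # row per 15-minute interval) into a supervised-learning table: the
        # previous n_lags readings of every sensor become features, and the
        # target sensor's reading `horizon` steps ahead becomes the label.
        out = pd.DataFrame(index=df.index)
        for sensor in df.columns:
            for lag in range(n_lags):
                out[f"{sensor}_lag{lag}"] = df[sensor].shift(lag)
        out["y"] = df[target].shift(-horizon)
        return out.dropna()

    # Illustrative data: one upstream and one downstream neighbour
    rng = np.random.default_rng(0)
    idx = pd.date_range("2019-10-21", periods=96, freq="15min")
    df = pd.DataFrame(rng.normal(60, 8, size=(96, 3)), index=idx,
                      columns=["M60/9222A", "M60/9223A", "M60/9224A"])

    supervised = build_features(df, target="M60/9223A", n_lags=6, horizon=1)
    X, y = supervised.drop(columns="y"), supervised["y"]

Dropping the neighbouring columns before calling build_features reproduces the “target sensor only” variant; keeping only one neighbour reproduces the upstream- or downstream-only variants.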
The results in terms of MAE for the best performing LightGBM model are shown in
Fig 17, where the MAE of predictions is aggregated for each hour of the day. The
plot shows the influence of both time of day and the inclusion of spatial features
on model performance. During off-peak times the accuracy of models that take input
only from the target sensor is marginally worse than that of models that also take
inputs from the surrounding sensors upstream and downstream. At peak times, however,
the difference is more pronounced: the error is much higher for models that take
input only from the target sensor or from the target sensor plus upstream sensors.
Congested periods appear to be when the incorporation of spatial features matters
most, especially downstream sensors, owing to the backward propagation of congestion
through the network at busy times.
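The hourly aggregation behind Fig 17 amounts to grouping absolute errors by hour of day; a minimal sketch, using synthetic series in place of the real test-set predictions, is given below.

    import numpy as np
    import pandas as pd

    def hourly_mae(y_true: pd.Series, y_pred: pd.Series) -> pd.Series:
        # Mean absolute error of the predictions, aggregated by hour of day.
        abs_err = (y_true - y_pred).abs()
        return abs_err.groupby(abs_err.index.hour).mean()

    # Synthetic stand-ins for the test-set actuals and model predictions
    idx = pd.date_range("2019-10-21", periods=96, freq="15min")
    y_true = pd.Series(60 + 5 * np.sin(np.arange(96) / 8), index=idx)
    noise = np.random.default_rng(1).normal(0, 1.5, 96)
    y_pred = y_true + pd.Series(noise, index=idx)

    print(hourly_mae(y_true, y_pred))  # one MAE value per hour, as in Fig 17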
For one week, 2019-10-21 to 2019-10-28 (Mon-Sun), Fig 18 shows the predicted traffic
speeds for both the LightGBM and LSTM models, together with the actual target sensor
readings for that period, also known as the “ground truth” values. Whilst both models
fit the general trend of the data, the LightGBM better matches the ground truth at
times when traffic speeds are particularly unstable or congested, such as in the
areas circled.
Figure 17: MAE for all predictions in each hour of the day
Figure 18: Actual and predicted traffic speed values for the LightGBM and LSTM models
5 Discussion
To reiterate, the main research questions of this study were as follows. To what
extent does traffic data exhibit spatio-temporal patterns? Which is the best
performing traffic forecasting technique: OLS regression, Gradient Boosted Decision
Trees or LSTM neural networks? Does the inclusion of spatial features improve traffic
forecasting models? The results of this study provide clear indications as to the
answers. The exploration of spatio-temporal patterns using the Cross-Correlation
Function indicated that there are clear spatial and temporal dependencies in the
data. These spatio-temporal patterns are dynamic in both space and time, with
distinctions found between the influence of upstream and downstream traffic; for
instance, downstream traffic patterns become more significant during congested
periods (Fig 16). The results found the LightGBM model to be the most accurate by a
significant margin. It appears to be the most appropriate model for capturing the
complex spatio-temporal patterns, generating accurate predictions under both
free-flowing and congested conditions (Fig 18) up to 30 minutes into the future. The
LightGBM model also saw considerable improvements when the effects of upstream and
downstream sensors were included in the input vector (Table 3), which answers the
final question.
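For concreteness, a minimal sketch of fitting such a model through LightGBM's scikit-learn interface is shown below; the synthetic data and hyperparameter values are generic placeholders, not the tuned configuration described in Section 3.4.2.

    import lightgbm as lgb
    import numpy as np
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for the lagged feature table (e.g. 6 lags x 3 sensors)
    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 18))
    y = 0.8 * X[:, 0] + rng.normal(scale=0.5, size=2000)

    # shuffle=False keeps the temporal ordering of a time-series split
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, shuffle=False)

    model = lgb.LGBMRegressor(objective="regression", num_leaves=31,
                              learning_rate=0.05, n_estimators=500)
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)],
              callbacks=[lgb.early_stopping(stopping_rounds=50)])

    y_pred = model.predict(X_val)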
5.1 Literature implications
The spatio-temporal correlations studied in this paper concur with findings from Yue &
Yeh (2008) and Cheng et al. (2012) that traffic networks exhibit strong spatial as
well as temporal patterns as traffic propagates through the network. Although this is
an aspect that has previously been overlooked in the literature (Vlahogianni et al.
2014), the importance of space in traffic forecasting has been brought to the fore in
recent years. The CCF proved an effective way of quantifying space-time relationships,
which could aid in model selection and feature selection for traffic forecast models.
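A minimal sketch of this kind of cross-correlation computation is given below; it is a simplified estimator assuming regularly sampled, roughly stationary series, not the exact implementation used in this study.

    import numpy as np

    def cross_correlation(x: np.ndarray, y: np.ndarray,
                          max_lag: int = 12) -> np.ndarray:
        # Correlation of x at time t with y at time t - k, for k = 0..max_lag.
        # A positive peak at lag k suggests y leads x by k time steps.
        x = (x - x.mean()) / x.std()
        y = (y - y.mean()) / y.std()
        n = len(x)
        return np.array([np.mean(x[k:] * y[:n - k])
                         for k in range(max_lag + 1)])

    # Synthetic example: an "upstream" series leading the target by ~2 steps
    rng = np.random.default_rng(2)
    upstream = rng.normal(60, 8, 500)
    target = np.roll(upstream, 2) + rng.normal(0, 2, 500)

    ccf = cross_correlation(target, upstream)
    print(ccf.argmax())  # ~2: strongest correlation at a two-step lag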
As previously stated, the literature is still divided over which modelling techniques
are most appropriate. In relation to this study, both Xia et al. (2019) and Mei et al.
(2018) also found success in developing traffic flow forecasting models with GBDTs. On
the other hand, Essien et al. (2020) found that their Autoencoder-LSTM outperformed
XGBoost when predicting traffic flows across the A56 in Manchester, and Wang & Li
(2018) found that their hybrid Convolutional LSTM network outperformed LightGBM in
predicting travel speed 15 to 90 minutes into the future. In both of these studies,
however, the GBDT models were used as comparisons against significantly more complex
models and networks with multiple Convolutional and bidirectional-LSTM hidden layers.
There are few, if any, studies that show LightGBM to be outperformed by a pure LSTM
model. This study confirms the findings of Xia et al. (2019) and Mei et al. (2018)
that LightGBM appears to be a useful method, dealing with spatio-temporal
relationships well without the need for supplementary models or network layers to
capture the spatial aspect of the data, as is the case with LSTM networks. The
importance of integrating upstream and downstream sensors agrees with findings by
both Gebresilassie (2017) and Du et al. (2018), and forms part of a growing body of
evidence for the importance of explicitly extracting spatial characteristics of
traffic data to improve traffic forecasting.
Perhaps unexpectedly, some of the model variations implemented failed to outperform a
much simpler “common sense” approach that required almost no computation. This is
interesting for two reasons. First, it is evidence of just how challenging it can be
to accurately model a system as dynamic and complex as an urban traffic network. The
intricate spatio-temporal relationships in the data, outlined in this study and in
previous works (Cheng et al. 2012, Yue & Yeh 2008), mean that traffic prediction is
difficult: it requires serious thought about a number of interconnected factors in
the spatial and temporal dimensions, as well as consideration of contextual factors
that drive traffic patterns, such as weather and accidents. Second, it raises
questions around the methods of evaluation for traffic forecasting models in the
literature. Very few studies of a similar nature to this one test their models
against such heuristic common-sense measures, usually opting instead to test
developed models against other common modelling techniques. The problem is that this
tends to lose sight of the underlying goals of traffic forecasting: rather than
leveraging computational techniques to enhance or improve existing methods, much
research pits increasingly complex models against each other (Karlaftis &
Vlahogianni 2011b). As Vlahogianni et al. (2014) note, increasing model complexity
can also be detrimental to the explanatory power of a model, which in the real world
is imperative to make it adaptable and responsive to dynamic traffic changes
(Karlaftis & Vlahogianni 2011b). Tree-based methods such as GBDTs can offer ways to
explain predictions by extracting the input features that have the most influence on
model output (Chen & Guestrin 2016). This is an area where models such as LSTM
networks struggle.
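As a sketch of how such explanations can be obtained, the gain-based importances of a fitted LightGBM model can be read off directly; this assumes `model` is a fitted LGBMRegressor such as the one sketched earlier, and the feature names are whatever the training table provided.

    import pandas as pd

    # Gain-based importance: the total loss reduction contributed by
    # splits on each feature, summed over all trees in the ensemble.
    importance = pd.Series(
        model.booster_.feature_importance(importance_type="gain"),
        index=model.booster_.feature_name(),
    ).sort_values(ascending=False)

    print(importance.head(10))  # e.g. which sensor/lag columns drive predictions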
5.2 Real world applications
Given that this is a field of research that developed with real-world applications in
mind, it is sensible to examine the findings of this paper in the context of real
Intelligent Transport Systems. The results regarding the spatio-temporal structure of
the traffic data emphasise the need for comprehensive and widespread traffic
monitoring and data collection systems. These methods rely on high-quality, rich and
multidimensional raw data sources, and the clear spatial dependence exhibited by the
type of network data used in this study means that performance is only likely to
improve with the addition of more sensors distributed across the network. The
promising performance shown by the LightGBM model, providing accurate predictions
even in abnormal conditions (Fig 18) and over multiple time steps (Table 4), suggests
that it is a modelling technique that could be of significant use in real-world
applications. It is important to highlight that this study focused only on generating
forecasts for an individual sensor. Clearly, for traffic forecasts to be an effective
tool in a traffic management system they need to be extended across the entire
network. This brings a number of challenges, however. It is not clear, for instance,
how many upstream and/or downstream sensors are required for accurate predictions:
in this study all sensors appeared to exhibit some level of spatio-temporal
correlation with the target sensor, but this is unlikely to remain the case as the
distance between sensors increases.

A further challenge regarding real-world integration is the availability of real-time
data streams. While this study used historical data to assess performance, real-world
application requires that data can be captured, processed and fed to a model within a
short window of time for any prediction scheme to be viable. The reverse is also
true: there is a challenge in determining which ITS and traffic management
technologies are capable of integrating and adapting to information taken from
forecasting models (Vlahogianni et al. 2014).
5.3 Limitations
Some of the main limitations have already been touched upon briefly in this
discussion. Firstly, the modelling techniques employed in this paper are suited to
predictions at a single sensor at a time, and more general methods able to predict
traffic speeds across an entire network may be preferred. One option is
Geographically Weighted Regression (Brunsdon et al. 1998), as implemented by
Gebresilassie (2017). This technique involves a single regression model for the
entire network but allows the relationships between independent and dependent
variables to vary by locality. Luo et al. (2019) developed a method in which models
were built for individual sensors, their predictions combined, and a function based
on spatial weights between sensors applied in order to improve overall accuracy when
making simultaneous predictions across the network.
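The general idea behind that spatial-weight fusion, though not Luo et al.'s exact formulation, can be sketched as an inverse-distance-weighted combination of per-sensor model outputs; all values below are illustrative.

    import numpy as np

    def fuse_predictions(preds: np.ndarray, distances: np.ndarray) -> float:
        # Combine independent per-sensor model predictions for one location
        # using inverse-distance spatial weights: closer sensors weigh more.
        # An illustration of the idea only, not Luo et al.'s formulation.
        w = 1.0 / np.maximum(distances, 1e-6)
        return float(np.sum(w * preds) / np.sum(w))

    # e.g. the target's own model predicts 48.2 mph; two nearby models,
    # 1.2 km and 1.8 km away along the carriageway, predict 50.1 and 46.7
    print(fuse_predictions(np.array([48.2, 50.1, 46.7]),
                           np.array([0.1, 1.2, 1.8])))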
A further limitation of this study is the lack of understanding of the causality
behind model predictions. While models may be able to accurately map an input vector
to an output value, this offers no information as to the underlying causes of
congestion or abnormal traffic patterns.
Finally, the homogeneity of the data used for model evaluation limits what can be
inferred from the results. While this paper highlighted the potential of the LightGBM
model when compared to other methods, it is not possible to conclude from this one
study that LightGBM is a more appropriate method than LSTM for traffic forecasting;
for that, more tests would need to be carried out on a number of data sets across
differing road networks.
5.4 Future work
Although this study highlighted the importance of spatial dependence in traffic
forecasting, beyond the inclusion of explicitly spatial features in the form of
upstream and downstream sensors there was no significant effort to more formally
encode the spatial structure of the data in a way that might help models better learn
its latent spatial patterns. There have been attempts to develop models that are
“spatially aware”: for instance, Du et al. (2018) and Wang & Li (2018) introduce
Convolutional Neural Networks (Cheng et al. 2018). These were originally developed
for image recognition but have been applied in the traffic forecasting literature
because they are designed to extract spatial features from unstructured data such as
images and graphs, which is appropriate since transport networks can easily be
represented as a graph structure.
Given that this study focused on one continuous road, there is room for further
exploration of the spatio-temporal patterns of interconnected road networks, to see
whether relationships hold between roads that are adjacent, or even completely
disjoint but in close proximity to each other. Work by Cheng et al. (2012) would
suggest that the influence of road links on their neighbours is local and varies
widely in space and time, making it hard to measure spatio-temporal relationships
across complex road topologies.
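As a brief illustration of that graph representation, the sketch below builds a toy directed sensor graph with networkx and reads off a sensor's upstream and downstream neighbours; all sensor IDs except M60/9223A, and the edge distances, are hypothetical.

    import networkx as nx

    # Toy directed sensor graph; edges follow the direction of travel and
    # carry carriageway distance (km).
    G = nx.DiGraph()
    G.add_weighted_edges_from([
        ("M60/9221A", "M60/9222A", 0.5),
        ("M60/9222A", "M60/9223A", 0.6),
        ("M60/9223A", "M60/9224A", 0.5),
    ])

    target = "M60/9223A"
    upstream = list(G.predecessors(target))    # sensors feeding traffic in
    downstream = list(G.successors(target))    # sensors receiving traffic
    print(upstream, downstream)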
6 Conclusion
The growth of urban populations globally has resulted in rising levels of traffic and
congestion, creating numerous social, environmental and economic issues (Hymel 2009),
with the pollution caused by increased traffic levels in urban areas even being found
to increase infant mortality rates (Knittel et al. 2016). The need has therefore
never been greater for intelligent traffic management systems guided by effective
forecasting techniques and a comprehensive understanding of the spatial and temporal
patterns that drive traffic flows. This dissertation first examined how traffic
forecasting can benefit from an understanding of how traffic speeds amongst
neighbouring locations are related. It then compared the effectiveness of machine
learning algorithms in providing accurate and robust forecasts, improved by
incorporating the effects of upstream and downstream sensors.
A Cross-Correlation Function was used to identify and quantify the spatio-temporal
dependencies of traffic speed data, where it was found that significant relationships
exist that are dynamic in both space and time. There were distinct differences
between the influence of upstream and downstream traffic patterns on a given road
segment, which also varied over the course of the day and provided evidence of
traffic “shockwaves” (Richards 1956) during congested periods as congestion
propagated upstream. The Cross-Correlation Function was found to be an effective
method of quantifying space-time relationships, which could aid in model selection,
feature selection and determining the feasibility of short-term traffic forecasts on
a given road network.
Three traffic forecasting models were developed and tested on traffic data collected
on the M60 motorway in Manchester, England. The goal was to predict traffic speeds
both 15 minutes and 30 minutes ahead of time. The LightGBM algorithm, an efficient
gradient boosting implementation, was the best performing model by a significant
margin. It outperformed Ordinary Least Squares regression and Long Short-Term Memory
neural networks at both prediction horizons and under varying traffic conditions. It
also outperformed a heuristic approach by a large enough margin to suggest it is a
method with genuine predictive power. This is hypothesised to be due to the ability
of the LightGBM algorithm to more effectively learn the latent spatial structure of
the traffic speed data, in contrast to the hugely popular LSTM neural network
sequence prediction algorithm, which the evidence suggests does not capture spatial
dependencies well on its own, resulting in sub-optimal performance as a standalone
algorithm that may require hybrid modelling structures to extract spatial features
prior to its application (Du et al. 2018, Cheng et al. 2012). Empirical evidence was
provided for the benefit of including neighbouring road sensor measurements in
traffic forecasting models, given an appropriate model able to capture the spatial
structure. The results highlighted both the importance of spatial dependencies and
that of correct model selection and specification in order to produce acceptable
performance in complex traffic forecasting tasks. The role of geo-space in this study
was clear, yet until recent years it has been overlooked in the literature, which
leaves open the much broader question of which other related fields may benefit from
a more explicit treatment of spatial structure.
7 Bibliography
Ahmed, M. S. & Cook, A. R. (1979), Analysis of freeway traffic time-series data by
using Box-Jenkins techniques, Transportation Research Record, number 722.
Bishop, C. M. (2006), Pattern recognition and machine learning, Springer.
Bojer, C. S. & Meldgaard, J. P. (2020), ‘Kaggle forecasting competitions: An overlooked
learning opportunity’, International Journal of Forecasting .
URL: http://www.sciencedirect.com/science/article/pii/S0169207020301114
Box, G. E., Jenkins, G. M. & Reinsel, G. C. (2011), Time series analysis: forecasting
and control, Vol. 734, John Wiley & Sons.
Breiman, L., Friedman, J., Stone, C. J. & Olshen, R. A. (1984), Classification and
regression trees, CRC Press.
Brunsdon, C., Fotheringham, S. & Charlton, M. (1998), ‘Geographically weighted regres-
sion’, Journal of the Royal Statistical Society: Series D (The Statistician) 47(3), 431–
443.
Castro-Neto, M., Jeong, Y.-S., Jeong, M.-K. & Han, L. D. (2009), ‘Online-svr for short-
term traffic flow prediction under typical and atypical traffic conditions’, Expert Sys-
tems with Applications 36(3, Part 2), 6164–6173.
URL: http://www.sciencedirect.com/science/article/pii/S0957417408004740
Chandra, S. R. & Al-Deek, H. (2009), ‘Predictions of freeway traffic speeds and volumes
using vector autoregressive models’, Journal of Intelligent Transportation Systems
13(2), 53–72.
URL: https://doi.org/10.1080/15472450902858368
Chen, T. & Guestrin, C. (2016), Xgboost: A scalable tree boosting system, in ‘Proceed-
ings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining’, KDD ’16, Association for Computing Machinery, New York, NY,
USA, p. 785–794.
URL: https://doi.org/10.1145/2939672.2939785
Cheng, N., Lyu, F., Chen, J., Xu, W., Zhou, H., Zhang, S. & Shen, X. (2018), ‘Big data
driven vehicular networks’, IEEE Network 32(6), 160–167.
Cheng, T., Haworth, J. & Wang, J. (2012), ‘Spatio-temporal autocorrelation of road
network data’, Journal of Geographical Systems 14(4), 389–413.
URL: https://doi.org/10.1007/s10109-011-0149-5
Coleri, S., Cheung, S. Y. & Varaiya, P. (2004), Sensor networks for monitoring traffic,
in ‘In Allerton Conference on Communication, Control and Computing’.
Drew, D. R. (1968), Traffic flow theory and control, Technical report.
Du, S., Li, T., Gong, X. & Horng, S.-J. (2018), ‘A hybrid method for traffic flow fore-
casting using multimodal deep learning’, arXiv preprint arXiv:1803.02099 .
Essien, A., Petrounias, I., Sampaio, P. & Sampaio, S. (2020), ‘A deep-learning model
for urban traffic flow prediction with traffic events mined from twitter’, World Wide
Web .
URL: https://doi.org/10.1007/s11280-020-00800-3
Florio, L. & Mussone, L. (1996), ‘Neural-network models for classification and forecast-
ing of freeway traffic flow stability’, Control Engineering Practice 4(2), 153 – 164.
URL: http://www.sciencedirect.com/science/article/pii/0967066195002219
Gebresilassie, M. A. (2017), Spatio-temporal traffic flow prediction.
Griffith, D. A. (1987), Spatial autocorrelation: A primer, Association of American
Geographers, Washington DC.
Hastie, T., Tibshirani, R. & Friedman, J. (2009), The elements of statistical learning:
data mining, inference, and prediction, Springer Science & Business Media.
Hochreiter, S. & Schmidhuber, J. (1997), ‘Long short-term memory’, Neural computation
9(8), 1735–1780.
Hong, W.-C. (2012), ‘Application of seasonal svr with chaotic immune algorithm in
traffic flow forecasting’, Neural Computing and Applications 21(3), 583–593.
URL: https://doi.org/10.1007/s00521-010-0456-7
Hymel, K. (2009), ‘Does traffic congestion reduce employment growth?’, Journal of
Urban Economics 65(2), 127 – 135.
URL: http://www.sciencedirect.com/science/article/pii/S0094119008001162
Innamaa, S. (2005), Short-term prediction of traffic situation using mlp-neural networks.
Ishak, S. & Al-Deek, H. (2002), ‘Performance evaluation of short-term time-series traffic
prediction model’, Journal of Transportation Engineering 128(6), 490–498.
Karlaftis, M. G. & Vlahogianni, E. I. (2011a), ‘Statistical methods versus neural net-
works in transportation research: Differences, similarities and some insights’, Trans-
portation Research Part C: Emerging Technologies 19(3), 387–399.
Karlaftis, M. & Vlahogianni, E. (2011b), ‘Statistical methods versus neural networks in
transportation research: Differences, similarities and some insights’, Transportation
Research Part C: Emerging Technologies 19(3), 387 – 399.
URL: http://www.sciencedirect.com/science/article/pii/S0968090X10001610
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q. & Liu, T.-Y.
(2017), Lightgbm: A highly efficient gradient boosting decision tree, in I. Guyon,
U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan & R. Garnett,
eds, ‘Advances in Neural Information Processing Systems 30’, Curran Associates, Inc.,
pp. 3146–3154.
URL: http://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf
Knittel, C. R., Miller, D. L. & Sanders, N. J. (2016), ‘Caution, drivers! children present:
Traffic, pollution, and infant health’, Review of Economics and Statistics 98(2), 350–
366.
Leduc, G. (2008), Road traffic data: Collection methods and applications.
Lint, J. (2004), Reliable travel time prediction for freeways, PhD thesis.
Liu, D., Tang, L., Shen, G. & Han, X. (2019), ‘Traffic speed prediction: An attention-
based method’, Sensors 19(18).
URL: https://www.mdpi.com/1424-8220/19/18/3836
Luo, X., Li, D., Yang, Y. & Zhang, S. (2019), ‘Spatiotemporal traffic flow prediction
with knn and lstm’, Journal of Advanced Transportation 2019, 4145353.
URL: https://doi.org/10.1155/2019/4145353
Mei, Z., Xiang, F. & Zhen-hui, L. (2018), Short-term traffic flow prediction based on
combination model of xgboost-lightgbm, in ‘2018 International Conference on Sensor
Networks and Signal Processing (SNSP)’, pp. 322–327.
Moran, P. A. (1950), ‘Notes on continuous stochastic phenomena’, Biometrika
37(1/2), 17–23.
Pfeifer, P. E. & Deutsch, S. J. (1980), ‘A three-stage iterative procedure for space-time
modeling’, Technometrics 22(1), 35–47.
Richards, P. I. (1956), ‘Shock waves on the highway’, Operations Research 4(1), 42–51.
URL: https://doi.org/10.1287/opre.4.1.42
Smith, B. L. & Demetsky, M. J. (1997), ‘Traffic flow forecasting: Comparison of modeling
approaches’, Journal of Transportation Engineering 123(4), 261–266.
Smith, B. L., Williams, B. M. & Keith Oswald, R. (2002), ‘Comparison of parametric
and nonparametric models for traffic flow forecasting’, Transportation Research Part
C: Emerging Technologies 10(4), 303 – 321.
URL: http://www.sciencedirect.com/science/article/pii/S0968090X02000098
Soper, H. E., Young, A. W., Cave, B. M., Lee, A. & Pearson, K. (1917), ‘On the
distribution of the correlation coefficient in small samples. Appendix II to the papers
of “Student” and R. A. Fisher’, Biometrika 11(4), 328–413.
URL: http://www.jstor.org/stable/2331830
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. (2014),
‘Dropout: a simple way to prevent neural networks from overfitting’, The journal of
machine learning research 15(1), 1929–1958.
Sun, B., Cheng, W., Goswami, P. & Bai, G. (2018), ‘Short-term traffic forecasting using
self-adjusting k-nearest neighbours’, IET Intelligent Transport Systems 12(1), 41–48.
Tobler, W. R. (1970), ‘A computer movie simulating urban growth in the detroit region’,
Economic Geography 46(sup1), 234–240.
URL: https://www.tandfonline.com/doi/abs/10.2307/143141
Vlahogianni, E. I., Golias, J. C. & Karlaftis, M. G. (2004), ‘Short-term traffic forecasting:
Overview of objectives and methods’, Transport Reviews 24(5), 533–557.
URL: https://doi.org/10.1080/0144164042000195072
Vlahogianni, E. I., Karlaftis, M. G. & Golias, J. C. (2014), ‘Short-term traffic forecasting:
Where we are and where we’re going’, Transportation Research Part C: Emerging
Technologies 43, 3 – 19. Special Issue on Short-term Traffic Flow Forecasting.
URL: http://www.sciencedirect.com/science/article/pii/S0968090X14000096
Wang, W. & Li, X. (2018), ‘Travel speed prediction with a hierarchical convolutional
neural network and long short-term memory model framework’.
Wang, X., An, K., Tang, L. & Chen, X. (2015), ‘Short term prediction of freeway exiting
volume based on svm and knn’, International Journal of Transportation Science and
Technology 4(3), 337–352.
Williams, B. M. & Hoel, L. A. (2003), ‘Modeling and forecasting vehicular traffic flow as
a seasonal arima process: Theoretical basis and empirical results’, Journal of Trans-
portation Engineering 129(6), 664–672.
Xia, H., Wei, X., Gao, Y. & Lv, H. (2019), Traffic prediction based on ensemble machine
learning strategies with bagging and lightgbm, in ‘2019 IEEE International Conference
on Communications Workshops (ICC Workshops)’, pp. 1–6.
Yang, B., Sun, S., Li, J., Lin, X. & Tian, Y. (2019), ‘Traffic flow prediction using lstm
with feature enhancement’, Neurocomputing 332, 320 – 327.
URL: http://www.sciencedirect.com/science/article/pii/S092523121831470X
Ye, Q., Szeto, W. Y. & Wong, S. C. (2012), ‘Short-term traffic speed forecasting based on
data recorded at irregular intervals’, IEEE Transactions on Intelligent Transportation
Systems 13(4), 1727–1737.
Yue, Y. & Yeh, A. G.-O. (2008), ‘Spatiotemporal traffic-flow dependency and short-term
traffic forecasting’, Environment and Planning B: Planning and Design 35(5), 762–
771.
Zhao, Z., Chen, W., Wu, X., Chen, P. C. Y. & Liu, J. (2017), ‘Lstm network: a deep
learning approach for short-term traffic forecast’, IET Intelligent Transport Systems
11(2), 68–75.