
Department of Geography and Planning

School of Environmental Sciences

Spatio-temporal analysis and

machine learning for traﬃc speed

forecasting

Author:

Leo McCarthy

Dissertation submitted in partial fulfilment of the degree

MSc Geographic Data Science

September 2020

Word count: 10,175

DECLARATION

I declare that the material presented for this dissertation is entirely my own work, has not previously been presented for the award of any other degree at any institution, and that all secondary sources used are properly acknowledged.

Signature

September 15, 2020

Date


Abstract

Advances in machine learning, computational power and the explosion of traffic data are driving the development and use of Intelligent Transport Systems. Real-time road traffic forecasting is a fundamental element required to make effective use of these Intelligent Transport Systems. Traffic modelling techniques have moved beyond traditional statistical approaches towards data-driven and computational intelligence methods. Still often overlooked are the spatial structure of traffic data and the impact of road network topology on traffic flows. To develop effective forecasting methods, understanding the autocorrelation structure among space-time observations is crucial. This paper presents methods for quantifying the complex spatio-temporal relationships present in transportation network data and assesses the impact of upstream and downstream congestion propagation. Additionally, it introduces machine learning techniques for forecasting traffic speeds in a supervised learning task where, by leveraging spatial and temporal characteristics of dynamic traffic flows, the LightGBM algorithm is shown to produce state-of-the-art performance on real-world motorway traffic data with predictions up to 30 minutes in advance, outperforming other established forecasting techniques.


Contents

List of Figures 4

List of Tables 4

1 Introduction 5

2 Literature Review 6

2.1 Traﬃc data .................................. 7

2.2 Space-time autocorrelations ......................... 8

2.3 Traﬃc modelling ............................... 9

2.3.1 ARIMA ................................ 9

2.3.2 OLS .................................. 10

2.3.3 Nonparametric machine learning .................. 11

2.3.4 Recurrent Neural Networks ..................... 11

2.3.5 Gradient Boosting .......................... 12

2.3.6 Developing the current literature .................. 13

3 Methods and Data 14

3.1 Data ...................................... 14

3.1.1 Preprocessing ............................. 15

3.1.2 Overview ............................... 16

3.2 Space-time autocorrelations ......................... 19

3.2.1 Temporal ............................... 19

3.2.2 Spatio-temporal ........................... 20

3.3 Experimental set up ............................. 20

3.4 Model speciﬁcation and tuning ....................... 21

3.4.1 OLS .................................. 22

3.4.2 LightGBM .............................. 23

3.4.3 LSTM ................................. 24

3.5 Evaluation metrics .............................. 25

3.6 Implementation ................................ 25

4 Results 26

4.1 Space-time autocorrelations ......................... 26

4.2 Modelling ................................... 27

5 Discussion 30

5.1 Literature implications ............................ 31

5.2 Real world applications ........................... 32

5.3 Limitations .................................. 32

5.4 Future work .................................. 33

6 Conclusion 33

7 Bibliography 35


List of Figures

1 Neural Network architectures ........................ 12

2 Decision tree example ............................ 13

3 Sensor locations ................................ 15

4 Heat-map of traﬃc speeds for all sensors on 2019-09-06 ......... 16

5 Average traﬃc speeds across each day of the week ............ 17

6 Average traﬃc speeds by time of day for each sensor ........... 17

7 Traﬃc speeds on 2019-09-06 for each sensor ................ 18

8 Box plot of average daily traﬃc speeds across all sensors ......... 19

9 Input windows for time series prediction .................. 21

10 Train validation and test procedure for modeling ............. 22

11 Gradient boosted decision tree learning process .............. 23

12 Structure of a general LSTM network layer ................ 24

13 Implemented LSTM network structure ................... 25

14 Temporal autocorrelations .......................... 26

15 CCF coeﬃcients for all times ........................ 27

16 CCF coefficients for afternoon peak hours ................ 27

17 MAE for all predictions in each hour of the day .............. 30

18 Actual and predicted traffic speed values for LightGBM and LSTM models 30

List of Tables

1 Four model Input vectors taken over 6 previous time steps ........ 21

2 LightGBM hyperparameters ......................... 23

3 MAE and RMSE scores for target sensor M60/9223A at time t+1 ... 28

4 MAE and RMSE scores for target sensor M60/9223A at time t+2 ... 29


1 Introduction

The prediction of future traffic characteristics such as flow, travel time and speed has been a key component of Intelligent Transport Systems (ITS) for the past couple of decades. Accurate forecasting of future traffic conditions has benefits for road users and traffic authorities alike. The broad aims of Intelligent Transport Systems are to enable road users and administrators to be better informed and make safer and more efficient use of transport networks. Specifically, the foremost consideration of almost all traffic management systems is reducing the high levels of congestion that generate economic, social and environmental issues in many cities around the world (Knittel et al. 2016).

The rapid development of technologies that allow for the capture and storage of data describing prevailing traffic conditions, such as sensors, radars, inductive loops and GPS (Leduc 2008), has resulted in increasingly sophisticated ITS. Real-time traffic information is vital for effective and swift traffic management, allowing traffic authorities to actively respond to prevailing traffic conditions in order to reduce congestion, provide information to road users and deal with collisions; it is a key element in many of the operational components of ITS, such as Advanced Traveller Information Systems (ATIS) and Dynamic Traffic Assignment (DTA). In the past, much of the focus from traffic managers was on reactive measures based on current traffic conditions, dealing with congestion once it had developed with no attempt to anticipate it. Now that many authorities possess real-time traffic data feeds, and subsequently large volumes of historical traffic data, the focus has switched towards developing methods for forecasting traffic conditions ahead of time, allowing for proactive traffic measures based on expected conditions rather than decisions based on a traffic state soon to be obsolete.

The past decade has seen a significant uptick in research surrounding the development of predictive models for traffic state predictions ahead of time. Currently there is far from any consensus as to the most effective methods; this is in part due to the variability in problem framing. Given that there is no universal standard in traffic data collection and aggregation techniques, there are huge differences in structure across traffic data sets, with differences in data size, type, frequency and the prediction horizon.

Given the increasing variety and huge volume of traffic data being collected, recent years have seen a move away from traffic theory and towards more data-driven and machine-learning methods in an attempt to develop fast, accurate and scalable models. Initial attempts mainly involved the use of parametric univariate time series models such as the Autoregressive Integrated Moving Average (ARIMA) and Seasonal Autoregressive Integrated Moving Average (SARIMA) methods (Williams & Hoel 2003). Many of these models, however, have shown shortcomings in generating traffic predictions, which has led to the wider use of nonparametric machine-learning (ML) or deep-learning (DL) based models, which form the basis for most of the current literature. These are advantageous for a few different reasons: ML and DL models are much better at extracting non-linear relationships within the data, they can handle very high-dimensional data and co-linear features more effectively, and they make no prior assumptions regarding the normality of the data distribution, all of which culminates in more accurate and robust predictions. There are several examples of ML applied to traffic forecasting, such as Support Vector Regression (Castro-Neto et al. 2009) and K-nearest-neighbour

(Sun et al. 2018). The most current literature is largely centred around Recurrent Neural Networks (RNN) and specifically Long Short-Term Memory (LSTM) networks; the main reason behind their usage is their high representational power and ability to capture short and long temporal dependencies in traffic data. Gradient Boosting, and in particular efficient implementations of Gradient Boosted Decision Trees (GBDT) such as LightGBM and XGBoost, are hugely popular amongst data science practitioners and in competitive data science events such as those hosted on Kaggle. They have shown state-of-the-art performance on a range of tasks including time-series prediction and applications to spatio-temporal data (Bojer & Meldgaard 2020). Despite this, there has been limited attempt to apply them to the problem of traffic forecasting; this paper will attempt to bridge that gap by implementing LightGBM for traffic prediction and comparing results to both OLS regression and LSTM, the most popular Neural Network architecture for traffic forecasting.

The First Law of Geography (Tobler 1970) states that “everything is related to everything else, but near things are more related than distant things.” This would imply that traffic speeds at a given location are affected not only by the traffic state at previous points in time but also by traffic states at other nearby locations, given the connected nature of road networks. Shockwave theory has long been used to model the downstream-to-upstream progression of queues (Richards 1956), and several studies have shown traffic networks to exhibit both spatial and temporal patterns (Cheng et al. 2012) which must be carefully considered in any modelling framework. Recent studies have shown the improvement of predictions when including upstream or downstream traffic information (Chandra & Al-Deek 2009). The importance of space and network topology is clear; however, it is often overlooked and is still an area that requires further development in the domain of traffic forecasting (Vlahogianni et al. 2014). More recently, attempts have been made to integrate both spatial and temporal dependence in the modelling process (Luo et al. 2019). Furthermore, studies have shown that the inclusion of exogenous variables such as current weather conditions, incident reports and geo-tagged tweets can improve prediction accuracy (Essien et al. 2020). This paper will attempt to quantify the importance of spatial variables while also exploring the effectiveness of state-of-the-art machine learning models for traffic prediction.

The main aims of this dissertation can be summarised as follows:

• To identify and quantify the relevant spatio-temporal patterns present in traffic speed data.

• To develop traffic speed prediction models integrating both spatial and temporal elements of traffic speed data, comparing statistical and machine learning techniques.

• To assess the importance of exploiting the spatial structure of traffic data for model performance.

The remainder of this thesis is organised as follows. Section 2 presents an overview of the methodologies prevalent in the current traffic forecasting literature. Section 3 provides a description of the data and outlines the experimental setup and model specifications. Section 4 presents the findings. Section 5 discusses the findings and their implications. Section 6 concludes the paper.

2 Literature Review

There is a signiﬁcant amount of literature relating to short term traﬃc forecasting. The

continually growing urban population places ever greater emphasis on reliable traﬃc

management and in turn traﬃc prediction methods. Generally, traﬃc forecasting meth-

ods can be divided into two main strands. The ﬁrst approach is one based on explicit


modelling of the traffic system itself and carrying out simulations incorporating parameters such as traffic flow, signal controls and driver behaviour. Much of this work is grounded in classical traffic flow theory (Drew 1968); macroscopic traffic models define explicit mathematical equations to describe the relationships between variables such as traffic flow, density, congestion shockwaves and signalling. The main advantages of these traffic simulation approaches are that “they allow inclusion of traffic control measures (ramp metering, routing, traffic lights, and even traffic information) in the prediction, and that they provide full insight into the locations and causes of possible delays on the road network of interest” (Lint 2004). However, there are a number of drawbacks to this approach, such as the computational complexity of parameter calibration and their lack of flexibility. These models require a-priori knowledge of traffic processes, and their predictive quality relies on the quality of inputs which must be determined “by hand”; they cannot learn the latent traffic patterns from given data.

This paper focuses on the second main approach: statistical and data-driven methods, including machine learning (ML). Machine learning methods assume no prior knowledge of the traffic system and aim to define functional approximations between input data and output data (Hastie et al. 2009), learning the relationships in the traffic data to predict variables such as speed and flow. Activity in this research area has been expedited in recent years by improved data collection methods resulting in large, rich data sets, in parallel with the rapid advancement of technology and available computational power, making it possible to implement advanced data-driven methodologies on big data sets. Despite this, there is still considerable variety in applied methods and large areas for development within this research area, as highlighted in recent reviews of the field (Vlahogianni et al. 2014). One such area is the understanding and effective incorporation of space-time autocorrelations in traffic data. This section will provide an overview of the relevant literature with regards to traffic forecasting methods, with particular emphasis on exploiting the latent spatial structure of the underlying traffic network data to improve forecasting models.

2.1 Traﬃc data

Most of the recent developments in traffic forecasting have come about as a result of data-intensive, as opposed to analytical, approaches, and thus their effectiveness is highly reliant on the prevalence, quality and resolution of available traffic data (Vlahogianni et al. 2004). Given the enormous undertaking it would be for a researcher to collect task-specific traffic data, the development of traffic forecasting methods almost always relies on data already available, collected by transport authorities or previous large research projects. The two main data characteristics to be considered for traffic forecasting problems are data resolution and the measured parameters. Resolution of traffic data

can vary greatly, with traffic state readings being made as often as every few seconds (Ye et al. 2012) or as infrequently as every hour (Hong 2012). The forecasting step is equivalent to the resolution of the data; for example, for data collected every 15 minutes, one forecasting step is 15 minutes in advance of the current time step. The forecasting horizon is a related concept and refers to the specific time horizon in the future for which predictions are to be made; it can include several forecasting steps depending on the data resolution. For instance, Ishak & Al-Deek (2002) and Liu et al. (2019) conclude that prediction becomes increasingly difficult as the prediction horizon increases. However, as Vlahogianni et al. (2014) notes, the usefulness of a prediction model degrades given too short a prediction horizon, as for the majority of traffic management applications a reasonable amount of notice is required for proactive steps to be put in place or information to be communicated to road users. It is also the case that particularly short time steps make predictions difficult due to large fluctuations in readings from one step to the next (Smith & Demetsky 1997); because of this, high-resolution data is often aggregated for easier modelling (Florio & Mussone 1996).
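The aggregation step just described can be sketched in a few lines. This is an illustrative example only (the per-minute readings are invented), assuming raw data arrives as (minute-of-day, speed) pairs:

```python
from collections import defaultdict

def aggregate_speeds(readings, bin_minutes=15):
    """Average fine-grained speed readings into coarser time bins.

    `readings` is a list of (minute_of_day, speed_mph) tuples; returns
    {bin_start_minute: mean_speed} for each occupied bin.
    """
    bins = defaultdict(list)
    for minute, speed in readings:
        bin_start = (minute // bin_minutes) * bin_minutes
        bins[bin_start].append(speed)
    return {start: sum(v) / len(v) for start, v in sorted(bins.items())}

# Hypothetical raw readings: free flow for 15 minutes, then congestion.
raw = [(m, 60.0) for m in range(15)] + [(m, 30.0) for m in range(15, 30)]
print(aggregate_speeds(raw))  # two 15-minute bins: {0: 60.0, 15: 30.0}
```

One forecasting step then corresponds to one bin, e.g. 15 minutes at this resolution.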

There are several methods of data collection for traffic data, the most common by some way being traffic sensors (Coleri et al. 2004). These can offer measurements for a range of traffic characteristics such as speed, flow, occupancy and travel times. There is no real consensus in the literature on which variable is more suitable for describing underlying traffic conditions, but Vlahogianni et al. (2014) found that the most common characteristics chosen for prediction were flow and travel time. There have also been attempts to forecast different measures simultaneously, such as Florio & Mussone (1996), who tried to predict the evolution of traffic flow, density and speed over 10 min. However, Innamaa (2005) found that one single model predicting two variables gave worse forecasts than two separate models for speed and flow. In addition to this, some researchers have found success in incorporating exogenous variables in the modelling process. These can be factors such as weather conditions and traffic incident updates mined from Twitter, as done by Essien et al. (2020), or, in the case of Xia et al. (2019), contextual temporal information such as the day of the week or whether it is a national holiday.

Defining the appropriate data resolution and measurement parameters is of particular importance for data-driven methods, as it affects the quality of information about traffic conditions lying in the data (Vlahogianni et al. 2004). Poor or inappropriate data for the task at hand could result in a poor model; the “garbage in, garbage out” phenomenon.

2.2 Space-time autocorrelations

Whilst many traditional approaches to traffic forecasting treat it as a stationary, single-point time series problem, the most current literature makes it clear that it is of paramount importance to consider both spatial and temporal dimensions of the data in order to develop accurate traffic forecasting models. First, an overview is provided of the literature regarding autocorrelations in space-time and their presence across traffic networks. (Temporal) autocorrelation is a term most prevalent in time series analysis and signal processing; it is a mathematical representation of the degree of similarity between a given time series and a lagged version of itself over successive time intervals. It can, however, also be extended to the spatial dimension, hence the term “spatial autocorrelation”, which has groundings in the First Law of Geography (Tobler 1970) and can be defined as follows: “Spatial autocorrelation is the correlation among values of a single variable strictly attributable to their relatively close locational positions on a two-dimensional surface, introducing a deviation from the independent observations assumption of classical statistics” (Griffith 1987). Until recently the spatial component of autocorrelations in traffic data was overlooked; however, a number of studies, most notably Cheng et al. (2012), highlighted the extent of spatial autocorrelation in road network data and its potential importance to traffic forecasting. Various indices exist in the literature to measure autocorrelations in both space and time, most of which are derived from Pearson’s product moment correlation coefficient (PMCC) (Soper et al. 1917), which, given two independent variables X and Y, is defined as their covariance over the product of their standard deviations, or the equation:

$$\rho_{X,Y} = \frac{E[(X - \bar{x})(Y - \bar{y})]}{\sigma_X \sigma_Y} \qquad (1)$$

where $\bar{x}$, $\bar{y}$ and $\sigma_X$, $\sigma_Y$ are the means and standard deviations of variables X and Y respectively. The coefficient $\rho_{X,Y}$ is used as a measure of the strength of linear association between variables and can fall in the range −1 to 1, with 1 indicating perfect positive correlation, −1 perfect negative correlation and 0 no correlation. Temporal autocorrelation can be calculated by taking a lagged specification of PMCC, so rather than calculating the correlation between two variables it is calculated between a variable at time t and the same variable at time t − k for some lag k.
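The lagged PMCC can be sketched directly from the definition above. This is an illustrative implementation, not the dissertation's code, and the example series is synthetic:

```python
import math

def autocorrelation(series, k):
    """Lag-k temporal autocorrelation: the Pearson correlation between
    the series and a copy of itself shifted back by k steps."""
    x = series[:-k] if k else series
    y = series[k:]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# A strongly periodic series has high autocorrelation at its period.
speeds = [50, 60, 70, 60] * 24  # repeating 4-step pattern
print(round(autocorrelation(speeds, 4), 3))  # ~1.0 at the period
```

Traffic speeds typically show exactly this kind of structure at daily and weekly periods.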

Moran’s I is a measure commonly found in the broader geospatial literature and is an extension of PMCC to the spatial dimension (Moran 1950); it is more complicated than temporal autocorrelation in that it can occur in any direction. Moran’s I, however, does not consider the temporal dimension, so it cannot capture dynamic spatio-temporal correlation properties. Thus, adaptations are required to capture spatial and temporal correlations simultaneously. A global Space-Time Autocorrelation Function (ST-ACF) is proposed by Pfeifer & Deutsch (1980) and implemented in the domain of network flow data by Griffith (1987); this measures the cross-covariances between all possible pairs of locations lagged in both time and space. The ST-ACF has been used in Space-Time Autoregressive Integrated Moving Average (STARIMA) modelling to determine the range of spatial neighbourhoods that contribute to the current location measurement at a given time lag (Pfeifer & Deutsch 1980).

As is the case with Moran’s I, the ST-ACF has a local variant. Local spatio-temporal correlations can be quantified using the Cross-Correlation Function (CCF) (see Box et al. (2011)). This treats two time series in separate spatial locations as a bivariate stochastic process and measures the cross-covariance coefficients between each series at specified time lags. Yue & Yeh (2008) show that the CCF can effectively measure spatio-temporal dependencies between a road sensor and its neighbours and can be used for analysing the forecast-ability of a given traffic data set. For example, a significant positive CCF coefficient between sensor A at time t and downstream sensor B at time t − k would suggest a back propagation of congestion from B to A over the time period k. Despite the empirical evidence provided by Yue & Yeh (2008) and Cheng et al. (2012) that the CCF is a sound method of measuring spatio-temporal relationships between neighbouring road links, it has had limited exposure in the field of traffic forecasting, where it could be used for determining appropriate modelling approaches or selecting spatial model-input features. In this study the CCF will be implemented to determine the extent of local spatio-temporal autocorrelations and their dynamic behaviour; the exact specification with respect to this data set will be provided in the methods section.
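A minimal sketch of the CCF idea follows, assuming two aligned speed series from hypothetical upstream and downstream sensors; the 3-step propagation lag in the synthetic data is fabricated purely for illustration:

```python
import math
import random

def pearson(x, y):
    """Plain Pearson correlation between two equal-length series."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ccf(upstream, downstream, k):
    """Cross-correlation between sensor A at time t and sensor B at
    time t - k; a large positive value suggests conditions at B
    propagate to A after roughly k time steps."""
    if k == 0:
        return pearson(upstream, downstream)
    return pearson(upstream[k:], downstream[:-k])

# Synthetic example: sensor A repeats sensor B's speeds 3 steps later.
random.seed(0)
b = [random.uniform(30, 70) for _ in range(200)]
a = [0.0] * 3 + b[:-3]
best = max(range(6), key=lambda k: ccf(a, b, k))
print(best)  # lag with the strongest cross-correlation
```

Scanning the CCF over candidate lags in this way is one route to selecting spatial input features for a forecasting model.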

2.3 Traﬃc modelling

Traffic forecasting has evolved a great deal over the past couple of decades; however, there is still no unified modelling approach. This section will provide an overview of the most-used prediction modelling techniques, both traditionally and in the most current literature. The data-driven forecasting methods can broadly be divided into two distinct categories: statistical parametric approaches and non-parametric machine learning.

2.3.1 ARIMA

Parametric models have a fixed and finite number of parameters, the number being determined prior to model fitting. One of the most popular parametric models for time series prediction, both within traffic forecasting and across other disciplines, is the Autoregressive Integrated Moving Average (ARIMA) model. ARIMA uses a statistical approach to identify recurring patterns of different periodicity from previous observations of a time series. It can be defined by the following differencing equation:

$$Y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \theta_1 \epsilon_{t-1} + \dots + \theta_q \epsilon_{t-q} + \epsilon_t \qquad (2)$$

This is an ARIMA(p, d, q) model where:

$y_{t-1}, \dots, y_{t-p}$ represent the previous p values of the time series; this is the autoregressive part of the model.

$Y_t$ represents the forecast for y at time t.

$\epsilon_{t-1}, \dots, \epsilon_{t-q}$ are zero-mean white noise, or the q moving average terms.

$\phi$ and $\theta$ are parameters to be determined by model fitting.

d is the order of differencing applied to the data set, as the time series must be stationary for the application of ARIMA.

The first recorded application of the ARIMA model in the transport forecasting literature was by Ahmed & Cook (1979), in which an ARIMA(0,1,3) model was proposed to predict traffic flow and occupancy on freeways, where it was shown to be more accurate than previous smoothing techniques. A primary assumption made by this model is the stationarity of the mean, variance, and autocorrelation. Thus, a major criticism directed toward ARIMA models concerns their tendency to perform poorly during abnormal traffic states and their inability to predict extreme values in time series (Vlahogianni et al. 2004). Furthermore, the univariate nature of ARIMA means that the integration of spatial or contextual features is not possible.
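The autoregressive part of Eq. (2) can be illustrated by estimating the $\phi$ coefficients with least squares on lagged values. This is a simplified AR(p) sketch only (no differencing or moving-average terms) and is not the dissertation's fitting procedure; it assumes NumPy is available:

```python
import numpy as np

def fit_ar(series, p):
    """Fit y_t = c + phi_1*y_{t-1} + ... + phi_p*y_{t-p} by ordinary
    least squares on the lagged values; returns [c, phi_1, ..., phi_p]."""
    y = np.asarray(series, dtype=float)
    # Design matrix: a constant column plus one column per lag.
    X = np.column_stack([np.ones(len(y) - p)] +
                        [y[p - i - 1:len(y) - i - 1] for i in range(p)])
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef

def forecast(series, coef):
    """One-step-ahead forecast from the last p observations."""
    p = len(coef) - 1
    lags = list(series[-1:-p - 1:-1])  # y_{t-1}, ..., y_{t-p}
    return float(coef[0] + np.dot(coef[1:], lags))

# Hypothetical use: fit on a synthetic speed series, forecast one step.
rng = np.random.default_rng(0)
speeds = [50.0]
for _ in range(500):
    speeds.append(10.0 + 0.8 * speeds[-1] + rng.normal(0, 1.0))
coef = fit_ar(speeds, 2)
print(forecast(speeds, coef))
```

With d = 0 and q = 0 this is the AR(p) special case; full ARIMA additionally differences the series d times and fits the moving-average terms.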

2.3.2 OLS

Other linear methods include Ordinary Least Squares (OLS) regression and its extension, Geographically Weighted Regression, as applied by Gebresilassie (2017). OLS allows the simple integration of multiple input variables and additional features, such as multiple sensor measurements across a network or contextual features. OLS regression aims to find a linear relationship between input and output variables, and an OLS model is defined as:

$$y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon_i \qquad (3)$$

where $y_i$ is the response variable, $x_i : i = 1, 2, \dots, n$ are the n independent predictor variables, $\epsilon_i$ is a random error term, $\beta_0$ is the model constant (intercept) and $\beta_i : i = 1, 2, \dots, n$ are the model parameters to be determined using the Ordinary Least Squares estimator, which is defined by the equation:

$$\hat{\beta} = (X'X)^{-1}X'y \qquad (4)$$

Parametric approaches have favourable statistical properties and capture regular variations very well. However, they find prediction difficult given more volatile traffic states, such as when a traffic accident or severe congestion occurs (Smith et al. 2002). Further to this, the aforementioned parametric models make the assumption of linearity between input and response variables, which makes them sub-optimal when considering the complex spatio-temporal structure and non-linear relationships in traffic data.
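Eq. (4) translates almost directly into code. The sketch below (illustrative data, assuming NumPy) recovers known coefficients from noiseless observations:

```python
import numpy as np

def ols_fit(X, y):
    """Closed-form OLS estimator beta_hat = (X'X)^{-1} X'y, with a
    column of ones prepended so beta_0 is the intercept. Solving the
    normal equations avoids forming the inverse explicitly."""
    Xd = np.column_stack([np.ones(len(X)), X])
    return np.linalg.solve(Xd.T @ Xd, Xd.T @ y)

# Recover known coefficients from noiseless data: y = 3 + 2*x1 - x2.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = 3 + 2 * X[:, 0] - X[:, 1]
print(ols_fit(X, y))  # recovers approximately [3, 2, -1]
```

In the multi-sensor setting each column of X would be a lagged reading from one sensor, which is how spatial features enter an OLS forecaster.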


2.3.3 Nonparametric machine learning

Nonparametric models function in a fundamentally different way to parametric models. They do not assume that the structure of a model is fixed; in most cases the model complexity grows to accommodate the complexity of the data, in other words the model structure is “learned” from the data (Bishop 2006). Compared to parametric methods, nonparametric methods are believed to offer more flexibility in fitting nonlinear characteristics and are better able to process noisy data (Karlaftis & Vlahogianni 2011a). These methods include algorithms such as K-Nearest Neighbours (KNN), Support Vector Regression (SVR) and Artificial Neural Networks (ANN). There have been several attempts to implement machine learning models for traffic forecasting (Sun et al. 2018, Castro-Neto et al. 2009, Essien et al. 2020) and even hybrid combinations of these models (Wang et al. 2015, Luo et al. 2019). These non-linear, flexible and multivariate models have been shown to achieve greater accuracy in traffic forecasting than parametric models and are better able to leverage the spatio-temporal structure of the traffic data (Vlahogianni et al. 2014).

2.3.4 Recurrent Neural Networks

Recurrent Neural Networks (RNN) are a subset of Artificial Neural Networks (ANN) that aim to preserve the temporal dimension of time series data using a hidden recurrent state. They are seen as the most appropriate ANN architecture for series prediction problems and have shown success in traffic forecasting, as demonstrated by Cheng et al. (2018) when using an RNN model to predict traffic flow in the case of special events. In conventional ANNs there are only full connections between adjacent layers, but no connections among the nodes of the same layer. As Zhao et al. (2017) explains, the ANN may be sub-optimal when dealing with spatio-temporal problems, because there are always interactions among the nodes in a spatio-temporal network. Unlike conventional networks, the hidden units in an RNN receive feedback from the previous state to the current state (Fig 1), allowing them to capture relationships between states, i.e. time steps.


(a) Simple feed-forward Artiﬁcial Neural Network

(b) Recurrent Neural Network (Essien et al. 2020)

Figure 1: Neural Network architectures

RNNs can, however, often struggle with long-term dependencies and are unable to process a large number of time lags due to correlations caused by periodic or seasonal structures that are often seen in spatio-temporal time series. For this reason, most traffic forecast literature now implements the Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber 1997) adaptation of the RNN. The LSTM network is a special kind of RNN: by treating the hidden layer as a memory unit, LSTM networks can cope with both short- and long-term correlations. LSTM neural networks have found great popularity in traffic forecasting; many researchers have applied the algorithm to spatio-temporal traffic data with great success, where it has been found to outperform other popular techniques including ARIMA, Support Vector Machines and Random Forests (Yang et al. 2019, Zhao et al. 2017). It is probably the most popular single ML algorithm in the current traffic forecasting literature.
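The feedback loop that distinguishes recurrent from feed-forward networks can be sketched with a plain vanilla-RNN cell (the simpler cousin of the LSTM). The weights below are random and untrained, purely to show how each hidden state depends on both the current input and the previous state:

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Forward pass of a single vanilla RNN layer: at every time step
    the hidden state is updated from the current input AND the previous
    hidden state, which is how temporal context is carried forward."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return states  # one hidden state per time step

# Toy dimensions: 1 input feature, 4 hidden units, 6 time steps.
rng = np.random.default_rng(0)
W_x = rng.normal(size=(4, 1)) * 0.5
W_h = rng.normal(size=(4, 4)) * 0.5
b = np.zeros(4)
sequence = [np.array([s / 60.0]) for s in [55.0, 54.0, 40.0, 35.0, 38.0, 50.0]]
states = rnn_forward(sequence, W_x, W_h, b)
print(len(states), states[-1].shape)
```

An LSTM replaces the single tanh update with gated input, forget and output paths around a memory cell, which is what lets it retain the longer-range dependencies discussed above.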

2.3.5 Gradient Boosting

Gradient Boosting, and in particular Gradient Boosted Decision Trees (GBDT), is a tree-based machine learning approach that is becoming increasingly popular due to fast implementations such as LightGBM (Ke et al. 2017) and XGBoost (Chen & Guestrin 2016), and also its state-of-the-art performance across a broad range of problems.


Figure 2: Decision tree example

GBDT works by combining many decision trees (Fig 2) with low accuracy into a model

which has signiﬁcantly higher accuracy than its base learners. It is an additive model

with the general expression:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F \qquad (5)$$

where $\hat{y}_i$ is the response variable given input vector $x_i$, K is the number of decision trees and $f_k$ is a function from the function space F, which contains all possible decision trees. A

decision tree (Fig 2) is an algorithm for classification and regression that sequentially splits the data based on feature values; the process is repeated recursively on each successive subset following the split, and the end nodes represent the predicted value for the data in that subset. This carries on until splitting no longer adds value to the predictions (Breiman et al. 1984).

GBDT algorithms, despite their popularity in the data mining and ML communities, have been implemented only a couple of times in the current literature, namely by Xia et al. (2019), who implement LightGBM to predict traffic characteristics in Shenzhen and find it significantly outperforms Linear Regression, ANN and ARIMA, and by Mei et al. (2018), who combine XGBoost and LightGBM for accurate flow predictions on spatio-temporal data.

2.3.6 Developing the current literature

From this review, and as also highlighted in the comprehensive review of the research area by Vlahogianni et al. (2014), it can be seen that historically most effort has been put into employing univariate statistical models, predicting traffic volume or travel time and using data from single point sources. Currently there is considerably more effort being applied to incorporating the spatial structure of the data using multivariate methods such as OLS, LSTM and SVM; however, there is often less of an attempt to understand how or why spatial dependencies affect traffic forecasting models. Gradient boosting methods such as LightGBM have largely been overlooked in the literature. Lastly, attempts to


predict traffic speed are far less prevalent than for other traffic characteristics such as flow or travel time. With this in mind, this paper will first attempt to quantify the significance and nature of local spatial autocorrelations in traffic data, and then implement three modelling approaches, LSTM, LightGBM and OLS, in order to predict traffic speeds, assessing their performance with respect to both each other and the spatial richness of various input data.

3 Methods and Data

This section aims to outline the applied methodology and experimental procedure carried out, guided by the key research themes of this paper. As a reminder, these can be briefly summarised as follows.

•Quantify spatio-temporal relationships in traﬃc data.

•Develop and compare traﬃc forecasting models using OLS, LSTM and LightGBM.

•Assess the impact of spatial features on model performance.

3.1 Data

The data used is real-world traffic data collected by Highways England, which can be downloaded from their WebTris website¹, a map-based data query service with measurements collected by road sensors along the Strategic Road Network (SRN).

The selected data set contains measurements for 21 site locations along the M60

motorway in England, in the clockwise direction between J7 and J20 (Fig 3). The data

contains measurements for both traﬃc speed and traﬃc volume, although only speed was

considered for this analysis. The total time period for the data used was between 2019-05-01 and 2019-10-31, a 6-month period (184 days) for which measurements are available at 15-minute intervals, with traffic speeds aggregated over each 15-minute period. This results in 17,664 speed readings per site across the 21 sites, so 370,944 individual data points.

¹ https://webtris.highwaysengland.co.uk


Figure 3: Sensor locations

3.1.1 Preprocessing

Across the 21 sites there were 1,668 missing data points, equating to 0.44% of the data, a very small proportion. For a given time stamp with missing data points, the missingness was not consistent across all sites. Further to this, discarding entire time steps' worth of data can result in inconsistent time steps across the series and be detrimental to model performance, as periodic and seasonal effects are distorted. Because of this, where data was missing it was imputed with the average speed for that sensor location at that particular time of day across the rest of the data set, in order to maintain the day-of-week and time-of-day characteristics of traffic data.
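A minimal pandas sketch of this imputation step, assuming the readings sit in a DataFrame with `site`, `timestamp` and `speed` columns (the column names are illustrative, not from the original pipeline):

```python
import pandas as pd
import numpy as np

# Illustrative frame: one site, two days of 15-minute readings, one gap
df = pd.DataFrame({
    "site": ["A", "A", "A", "A"],
    "timestamp": pd.to_datetime([
        "2019-05-01 08:00", "2019-05-01 08:15",
        "2019-05-02 08:00", "2019-05-02 08:15",
    ]),
    "speed": [60.0, np.nan, 50.0, 40.0],
})

# Fill each gap with the mean speed for that sensor at that time of day
tod = df["timestamp"].dt.time
df["speed"] = df["speed"].fillna(
    df.groupby(["site", tod])["speed"].transform("mean")
)
```

The gap at 08:15 is filled with the mean of the other 08:15 readings at that site, preserving the time-of-day profile described above.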


3.1.2 Overview

Before carrying out any analysis or modelling, some exploratory work was carried out with the data in order to better understand any obvious patterns and spot any potential measurement errors. Figure 4 is a heat-map showing a snapshot of average traffic speeds on 2019-09-06. Time is shown on the x-axis and the spatial component on the y-axis, with sensor locations ordered from downstream to upstream, or from south to north as shown in Fig 3. Both spatial and temporal patterns are apparent, with a drop in speed around peak times and also speed values clustered based on sensor location, which can be seen in the vertical strip patterns. The three horizontal strip patterns (M60/919K, M60/9229L, M60/9298K) represent road sensors located on entry or exit slip roads where speed is consistently reduced.

Figure 4: Heat-map of traﬃc speeds for all sensors on 2019-09-06


Figure 5: Average traﬃc speeds across each day of the week

Figure 6: Average traﬃc speeds by time of day for each sensor


Figure 7: Traﬃc speeds on 2019-09-06 for each sensor

Fig 5 shows time-of-day and day-of-week variations in traffic speeds across the network. During the working week severe congestion peaks can be seen around the "rush hour" period, with speeds dropping by almost half for 1-2 hours every day. There is a clear weekend effect, with speeds higher across the day and no visible congestion at any point; Fridays also show less severe congestion, occurring earlier in the day, than the rest of the working week. Clearly traffic speeds are largely influenced by the commuting patterns of the working population. Fig 6 shows the distribution of aggregated speeds across the day for each individual sensor, coloured depending on location from the southernmost sensor progressing downstream. Not all sensors share the same daily patterns, with some exhibiting more severe congestion than others; the spatial dependence is obvious, with closely located pairs of sensors sharing more similarities in traffic speeds than sensor pairs more widely spread. The congestion during the PM period is far more severe than that in the AM period. Meanwhile, Fig 7 shows a snapshot of traffic speeds by hour taken on 2019-09-06, which highlights the volatility in traffic patterns on a day-to-day basis; the patterns on this particular day differ significantly from the average patterns over the whole time series as shown in Fig 6. Again shown is the strong spatial dependence, with abnormal congestion centred around a handful of adjacent sensors. Congestion events can also extend past an hour-by-hour basis: Fig 8 shows a box-plot of average speeds across the network by day of the week. The large interquartile range and number of outliers suggest there can be large variation in traffic speeds not only during days but between days of aggregated speeds as well. These variations, along with the complex spatial relationships and factors such as unexpected events, traffic accidents, slow moving vehicles and bad weather, bring many challenges for robust short-term traffic prediction by traditional linear regression models or ARIMA.


Figure 8: Box plot of average daily traﬃc speeds across all sensors

3.2 Space-time autocorrelations

3.2.1 Temporal

The standard autocorrelation function is used to calculate the level of correlation between the speed value at a sensor and a lagged version of itself at previous time steps. Given a time series $y_1, y_2, \ldots, y_t$ at the target sensor, the autocorrelation at lag $k$ is described by the equation:

$$\rho(k) = \frac{E[(Y_t - \bar{y})(Y_{t+k} - \bar{y})]}{\sigma_{Y_t}\,\sigma_{Y_{t+k}}}, \qquad k = 0, \pm 1, \pm 2, \ldots \tag{6}$$

Where $E[(Y_t - \bar{y})(Y_{t+k} - \bar{y})]$ is the covariance between $Y_t$ and $Y_{t+k}$, $\bar{y}$ is the mean of the series, and $\sigma_{Y_t}$, $\sigma_{Y_{t+k}}$ represent the standard deviations of $Y_t$ and $Y_{t+k}$ respectively. $\rho(k)$ measures the relationship between $y_t$ and $y_{t-k}$.

Also measured is the partial autocorrelation. This is similar to the standard autocorrelation, but accounts for the influence of intermediate values in the time series between $y_t$ and $y_{t-k}$. The Partial Autocorrelation Function is given as:

$$\rho_Y(k) = \frac{\mathrm{Cov}\!\left(Y_t \mid Y_{t-1}, \ldots, Y_{t-k+1},\; Y_{t-k} \mid Y_{t-1}, \ldots, Y_{t-k+1}\right)}{\sigma_{Y_t \mid Y_{t-1}, \ldots, Y_{t-k+1}}\; \sigma_{Y_{t-k} \mid Y_{t-1}, \ldots, Y_{t-k+1}}} \tag{7}$$

This gives a clearer understanding of the relationship between a time series and a version of itself lagged over multiple time steps, as it removes information carried by intermediate time steps. This will help us to determine the feasibility of multi-step forecasts using a single sensor.


3.2.2 Spatio-temporal

The introduction of the spatial element to autocorrelation calculations is brought by the Cross Correlation Function (CCF). Let $X_t$ denote the time series of measurements collected at sensor $X$ up until time $T$, and let $Y_t$ denote the time series of measurements collected at a different sensor $Y$, either upstream or downstream, up until time $T$. The CCF at lag $k$ is then given by the equation:

$$\rho_{X,Y}(k) = \frac{E[(X_t - \bar{x})(Y_{t+k} - \bar{y})]}{\sigma_X\,\sigma_Y}, \qquad k = 0, \pm 1, \pm 2, \ldots \tag{8}$$

Where $E[(X_t - \bar{x})(Y_{t+k} - \bar{y})]$ represents the cross-covariance between $X$ and $Y$ at lag $k$, and

$$\sigma_X = \sqrt{\frac{\sum_{t=1}^{T}(x_t - \bar{x})^2}{T-1}}, \qquad \sigma_Y = \sqrt{\frac{\sum_{t=1}^{T}(y_{t-k} - \bar{y})^2}{T-1}}$$

This equation allows us to calculate the correlation between sensors in different locations where one is temporally lagged; thus the CCF can be used to quantitatively measure the spatio-temporal relationship of the traffic flow and speed.
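A hedged NumPy sketch of equation (8), computing the correlation between a target series and a neighbouring sensor's series shifted by `k` steps (the function and variable names are illustrative, not from the dissertation's code):

```python
import numpy as np

def ccf(x: np.ndarray, y: np.ndarray, k: int) -> float:
    """Sample cross-correlation between x_t and y_{t+k} (Eq. 8 sketch)."""
    # Align the two series so that y is shifted k steps relative to x
    if k >= 0:
        x_al, y_al = x[: len(x) - k], y[k:]
    else:
        x_al, y_al = x[-k:], y[: len(y) + k]
    x_al = x_al - x.mean()
    y_al = y_al - y.mean()
    # Normalise by the full-series sample standard deviations
    denom = x.std(ddof=1) * y.std(ddof=1) * (len(x_al) - 1)
    return float((x_al * y_al).sum() / denom)

# A series cross-correlated with itself at zero lag gives exactly 1
x = np.sin(np.linspace(0, 20, 200))
print(round(ccf(x, x, 0), 3))  # 1.0
```

Sweeping `k` over positive and negative lags for each upstream and downstream sensor reproduces the kind of profiles plotted in Figs 15 and 16.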

3.3 Experimental set up

In order to train and validate ML models, the data must first be in a form that can be interpreted by the algorithms. In this case we frame the data as a supervised learning problem where each individual input vector is mapped to a unique output value. To do this, input feature vectors must be extracted from the raw sensor data. Starting with the simplest case of using one sensor's historical data to predict the traffic speed at the next time step, fixed-size, non-overlapping historic feature windows are constructed (Fig 9) which are mapped to a future output value. The temporal length of the feature window needs to be determined in order to ensure best model performance without unneeded complexity or redundant features; in this case the previous 6 time steps were used, i.e. 90 minutes. So, for a prediction at time $t$ at target sensor $k$, considering only historical information from the target sensor, the input vector $\hat{x}_{k,t}$ would be $[x_k(t-1), x_k(t-2), \ldots, x_k(t-6)]$, and the ground truth output is the traffic speed value at the target sensor $k$ at time $t$.


Figure 9: Input windows for time series prediction

Expanding this to include upstream and downstream sensors is fairly simple; the input vector becomes

$$\hat{x}_{k,t} = [x_k(t-1), \ldots, x_k(t-6),\; x_1(t-1), \ldots, x_1(t-6),\; \ldots,\; x_j(t-1), \ldots, x_j(t-6)]$$

Where $x_1, x_2, \ldots, x_j$ are the upstream and downstream sensors to be included in the input data. This paper focuses only on predictions for a single target sensor at a time. Thus, the problem definition becomes to develop a functional approximation $F$ such that

$$\sum_{t=7}^{T} \left[F(\hat{x}_{k,t}) - x_{k,t}\right]^2$$

is minimised, i.e. the sum of the squared errors between the function mapping, in this case a model prediction, and the true speed value. For each modelling technique, four different input vectors are tested in order to evaluate the impact of spatial features; these can be seen in Table 1.

Table 1: Four model input vectors taken over 6 previous time steps

        Data source
Input 1 Target sensor
Input 2 Target sensor + downstream sensors
Input 3 Target sensor + upstream sensors
Input 4 All sensors
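The windowing scheme described above can be sketched as follows, assuming a 2-D array `speeds` of shape (time steps, sensors) and the index of the target sensor's column (a simplified sliding-window sketch; the dissertation does not publish its exact feature-extraction code):

```python
import numpy as np

def make_windows(speeds: np.ndarray, target: int, window: int = 6):
    """Build (X, y) pairs: `window` past steps for every sensor -> next speed."""
    X, y = [], []
    for t in range(window, speeds.shape[0]):
        # Previous `window` readings for all sensors, flattened into one vector
        X.append(speeds[t - window : t].ravel())
        y.append(speeds[t, target])
    return np.array(X), np.array(y)

speeds = np.arange(40, dtype=float).reshape(10, 4)  # 10 steps, 4 sensors
X, y = make_windows(speeds, target=0)
print(X.shape, y.shape)  # (4, 24) (4,)
```

Dropping all but the target sensor's columns before windowing yields Input 1; keeping only the downstream or upstream columns yields Inputs 2 and 3.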

3.4 Model speciﬁcation and tuning

Machine learning models can contain hyper-parameters that determine structural fea-

tures of the model and are very important to model performance. These hyper-parameters

can be highly sensitive, with small perturbations resulting in large changes in prediction

outcomes. They are also diﬃcult to determine heuristically and an empirical method

must be employed in order to determine the most appropriate values. The process of

“tuning” hyper-parameters requires repeatedly testing the model accuracy with di↵erent

sets of hyper-parameters on a data set that was not used during the training procedure,

in order to determine their best values, this is called model validation. It is not to be

confused however with model testing, the validation data set must be distinct from the

21

testing data set due to the fact hyper-parameter values are being chosen based on vali-

dation performance which gives a potentially biased representation on model accuracy.

A separate test set, that shares no overlap with training or validation sets, must be held

back for testing model prediction accuracy and o↵ers an unbiased view on model perfor-

mance on unseen data. The raw data set is split into train, validation and test sets. This

is done in a sequential manner in order to avoid look-ahead bias so data from 2019-05-01

to 2019-08-31 is used for model training, data from 2019-09-01 to 2019-09-31 for vali-

dation and the data for October 2019 used for model testing. The train-validation-test

workﬂow can be seen in Fig 10
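A minimal pandas sketch of this sequential split, assuming a DataFrame indexed by timestamp (the boundary dates come from the text; the daily-resolution frame below is only illustrative):

```python
import pandas as pd
import numpy as np

# Illustrative frame covering the full study period at daily resolution
idx = pd.date_range("2019-05-01", "2019-10-31", freq="D")
df = pd.DataFrame(
    {"speed": np.random.default_rng(0).normal(60, 5, len(idx))}, index=idx
)

# Sequential split: no shuffling, so no look-ahead bias
train = df.loc["2019-05-01":"2019-08-31"]
val = df.loc["2019-09-01":"2019-09-30"]
test = df.loc["2019-10-01":"2019-10-31"]

print(len(train), len(val), len(test))  # 123 30 31
```

Because the split is chronological, every validation and test observation lies strictly after the last training observation.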

Figure 10: Train validation and test procedure for modeling

3.4.1 OLS

The OLS model implemented is described by the following equation in the univariate case:

$$x_t = \beta_0 + \sum_{i=t-6}^{t-1} \beta_i x_i$$

Where $x_t$ is the prediction for the target sensor at time $t$, $x_i$ are the traffic speed values at the target sensor at time $i$, and $\beta$ are regression coefficients to be determined by model fitting.

In the multivariate case the equation is

$$x_{t,k} = \beta_0 + \sum_{i=t-6}^{t-1} \beta_{ik} x_{i,k} + \sum_{i=t-6}^{t-1} \sum_{j=1}^{J} \beta_{ij} x_{i,j}$$

Where $x_{t,k}$ is the prediction for target sensor $k$ at time $t$, $x_{i,k}$ are the traffic speed values at target sensor $k$, and $x_{i,j}$ are the traffic speed values at the additional $J$ surrounding sensors.
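A sketch of the univariate lagged OLS model using scikit-learn (an assumed implementation choice; the dissertation does not name its OLS library, and the series below is synthetic):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
series = 60 + rng.normal(0, 3, 300)  # stand-in speed series

# Lagged design matrix: previous 6 readings predict the next reading
LAGS = 6
X = np.column_stack(
    [series[i : len(series) - LAGS + i] for i in range(LAGS)]
)
y = series[LAGS:]

ols = LinearRegression().fit(X, y)  # fits beta_0 plus six lag coefficients
pred = ols.predict(X[-1:])
```

The multivariate case simply stacks the lagged columns of the surrounding sensors alongside the target sensor's lags before fitting.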

3.4.2 LightGBM

Gradient Boosted Decision Trees (GBDT), as briefly introduced in the literature review, are an ensemble machine learning technique that combines many decision trees to create a strong learner. Decision trees are a popular algorithm but tend to overfit the data and are unable to generalise to unseen data; GBDTs are able to generalise better. They work by sequentially combining many weak decision trees such that each new tree fits the residuals of the previous tree and the model improves, employing the logic that subsequent predictors learn from the mistakes of the previous predictors (Fig 11).

LightGBM (Ke et al. 2017) is an efficient implementation of GBDTs which uses a histogram-based decision tree algorithm and leaf-wise tree growth, rather than the level-wise method that most other GBDT implementations opt for.

LightGBM has numerous hyper-parameters, which are tuned here using a grid search method: iteratively traversing a grid of possible parameter combinations and comparing their performance on the validation data set. The number of boosting iterations was determined by an early stopping criterion with a patience of 100, i.e. if validation performance did not improve after 100 rounds then there were no further boosting iterations. Final hyper-parameters are shown in Table 2.

Figure 11: Gradient boosted decision tree learning process²

Table 2: LightGBM hyperparameters

Hyper-parameter   Value
Objective         Regression
Feature fraction  0.9
Bagging fraction  0.5
Num leaves        64
Bagging freq      4
Learning rate     0.007
Boosting rounds   515

² https://towardsdatascience.com/gradient-boosted-decision-trees-explained-9259bd8205af


3.4.3 LSTM

Long Short Term Memory (LSTM) networks, introduced by Hochreiter & Schmidhuber (1997), are an adaptation of the Recurrent Neural Network that fixes many of the issues of its predecessor, such as vanishing/exploding gradients and the handling of long-term dependencies. Fig 12 shows the structure of an LSTM layer, which is similar to that of an RNN (Fig 1) in that it is a chain of repeating modules of a neural network, but the repeating modules have a different structure. The memory module contains input, output, and forget gates, which respectively carry out write, read, and reset functions on each cell. The multiplicative gates, $\oplus$ and $\otimes$, refer to operations for matrix addition and dot product respectively, and allow the model to store information over long periods, eliminating the aforementioned vanishing/exploding gradient problem.

LSTM networks, like most ANN’s are trained using back propagation and gradient

descent. Given an error function E(X, ✓), which deﬁnes the error between the ground

truth yiand the predicted output ˆyiof the neural network with input vector xifor a

set of input output pairs (xi,y

i)2Xand given a set of network weights and biases ✓

. Gradient descent requires the calculation of the gradient of the error function with

respect to the weights wk

ij and biases bk

i. Then according to the learning rate ↵, each

iteration updates the weights and biases according to

✓t+1 =✓t↵@E(X, ✓t)

@✓

Where ✓trepresents the parameters of the network (weights and biases) at iteration

tof the gradient descent. The goal is to minimise the error function.

Figure 12: Structure of a general LSTM network layer

The structure of the network used in this paper is shown in Fig 13 and consists of the input layer, a hidden LSTM layer, a dropout layer, which randomly drops 5% of nodes during a run through the network and is an effective technique for preventing overfitting (Srivastava et al. 2014), and finally a fully connected dense layer which outputs the final prediction. The learning rate was 0.001, and early stopping was used to determine 29 epochs as optimal to balance training error and generalisation strength.
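A hedged Keras sketch of this architecture (the LSTM layer width is an illustrative assumption; the dissertation reports only the layer types, the 5% dropout rate and the learning rate):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

WINDOW, N_SENSORS = 6, 21  # 6 lagged steps, 21 sensors as input features

model = keras.Sequential([
    layers.Input(shape=(WINDOW, N_SENSORS)),
    layers.LSTM(64),       # hidden LSTM layer (64 units assumed)
    layers.Dropout(0.05),  # drops 5% of nodes to limit overfitting
    layers.Dense(1),       # single speed prediction
])
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001), loss="mse"
)

# One forward pass on dummy data to confirm the output shape
dummy = np.zeros((2, WINDOW, N_SENSORS), dtype="float32")
out = model.predict(dummy, verbose=0)
print(out.shape)  # (2, 1)
```

Training would then call `model.fit` with an early-stopping callback on the validation set, which in the dissertation's runs halted at 29 epochs.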


Figure 13: Implemented LSTM network structure

3.5 Evaluation metrics

In order to assess and compare model prediction performance, two evaluation metrics are employed to evaluate the difference between actual and predicted speed values. These are Root Mean Squared Error (RMSE), which is the most frequently used metric in previous work (Luo et al. 2019), and Mean Absolute Error (MAE). RMSE tends to place more weight on particularly large individual errors, and is thus more useful when large errors are undesirable.

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$

$$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$
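Both metrics reduce to a couple of NumPy lines; a minimal sketch with invented example values:

```python
import numpy as np

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root Mean Squared Error: penalises large individual errors more."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean Absolute Error: average absolute deviation."""
    return float(np.mean(np.abs(y_true - y_pred)))

y_true = np.array([60.0, 55.0, 40.0])
y_pred = np.array([58.0, 57.0, 35.0])
print(mae(y_true, y_pred))   # 3.0
print(rmse(y_true, y_pred))  # ~3.317
```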

3.6 Implementation

All model training and computation was carried out using Python 3 running on macOS Catalina v10.15.6, on a MacBook Pro (2018) with a 2.3GHz quad-core Intel i5 and 8 GB of 2133MHz memory. The LSTM was implemented using the Keras³ API library running on top of TensorFlow. The LightGBM implementation of GBDTs used here is freely available online⁴.

³ https://keras.io
⁴ https://lightgbm.readthedocs.io


4 Results

4.1 Space-time autocorrelations

Presented first is an evaluation of the spatial and temporal patterns found within the data, using the measures of autocorrelation previously outlined. Fig 14 shows the temporal autocorrelations found in the sequence of measurements at the target sensor for up to 10 lagged time steps. Fig 14a shows a steady decrease in temporal autocorrelation as the time lag increases. The partial autocorrelation shown in Fig 14b, which controls for intermediate time lags, shows a steep drop-off once the variable is lagged more than one time step.

(a) Autocorrelation (b) Partial autocorrelation

Figure 14: Temporal autocorrelations

Examining the CCF coefficient between the target sensor and the other network sensors gives a local view of the space-time autocorrelation structure of the network. The CCF coefficient is calculated both across the whole day and separately for the peak afternoon period (4pm-8pm), to see how space-time autocorrelations vary both spatially and temporally. Fig 15 and Fig 16 are both centred at time order 0, which is the correlation between sensors at the same point in time. Positive time lags indicate that the sensor has been lagged in time with respect to the target sensor, whereas a negative lag value is given when a series has been shifted forwards in time with respect to the target sensor. Each time lag step represents 15 minutes in real terms. The CCF coefficient was calculated between the target sensor and all other surrounding sensors. Fig 15 shows the CCF coefficient values across the whole day, lagged up to 10 time steps, with the plotting distinguishing between upstream and downstream sensors. Almost all sensors, both upstream and downstream, show some level of cross-correlation with the target sensor. Generally, cross-correlation decreases as time lag increases, which is to be expected, with upstream sensors generally showing greater levels of correlation with the target, although as shown in Fig 3 upstream sensors are in this instance located slightly closer on average than downstream ones. At peak times, as shown in Fig 16, downstream sensors show increased CCF coefficient values due to the backward propagation of traffic through the network at congested times. The plots demonstrate cross-correlations varying both in space (sensor locations) and time of day.


Figure 15: CCF coeﬃcients for all times

Figure 16: CCF coefficients for afternoon peak hours

4.2 Modelling

This section aims to give a comparative overview of the different modelling approaches and their performance when provided with varying input data. The metrics employed to evaluate model performance are RMSE and MAE; the calculations for these are outlined in the methods section. The data used for evaluation is traffic speed data collected between 2019-10-01 and 2019-10-31; this data was not used during model training and thus offers an unbiased estimate of true model prediction accuracy on unseen data. Models are also compared to a "common sense" approach: taking the traffic speed at the current time and assuming that it will remain the same for the next time step. This consideration is often overlooked in traffic forecasting, but it is vital to ensure that any proposed, potentially complex, models are in fact able to provide genuine predictive power when compared to a much simpler, intuitive approach.
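The "common sense" persistence baseline amounts to a single shift; a sketch of scoring it on a toy series:

```python
import numpy as np

speeds = np.array([62.0, 60.0, 45.0, 30.0, 35.0, 55.0])  # toy series

# Persistence forecast: the speed at time t is the prediction for t+1
pred = speeds[:-1]
truth = speeds[1:]

baseline_mae = float(np.mean(np.abs(truth - pred)))
print(baseline_mae)  # 11.4
```

Any proposed model must beat this near-zero-cost baseline to justify its complexity, which is exactly the comparison made in Tables 3 and 4.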

Table 3: MAE and RMSE scores for target sensor M60/9223A at time t+1

Model Input Data MAE RMSE

OLS Target sensor 1.71923 3.42504

Target sensor + downstream sensors 1.67342 2.97104

Target sensor + upstream sensors 1.72988 3.4096

All sensors 1.69123 2.97821

LightGBM Target sensor 1.66323 3.32318

Target sensor + downstream sensors 1.48142 2.6648

Target sensor + upstream sensors 1.59021 3.18623

All sensors 1.43430 2.62749

LSTM Target sensor 1.64175 3.26481

Target sensor + downstream sensors 1.58658 2.95209

Target sensor + upstream sensors 1.57595 3.32331

All sensors 1.56655 3.17687

Extrapolate 1.59935 3.40976

Table 3 presents a comparison of model accuracy for traffic speed prediction at the target sensor M60/9223A at time horizon t+1. Taking the OLS model, it is clear that prediction improvement can be gained by incorporating spatial features: the MAE falls from 1.71 to 1.69 when the previous time step readings for upstream and downstream sensors are incorporated as model input, alongside previous target sensor readings. In terms of MAE, however, the OLS model fails to outperform the "common sense" approach, which takes the speed value at time t as a prediction for the speed at time t+1; it can outperform this approach with regard to RMSE, which suggests the OLS is making fewer very large errors than the "common sense" approach. LightGBM is the overall best performing model by a clear margin. When only the previous 6 time steps for the target sensor are considered as input data it fails to outperform the "common sense" approach in terms of MAE; however, as can be seen from the table, the incorporation of upstream and downstream sensor readings can improve model performance significantly. The LightGBM model appears the most capable of leveraging the spatio-temporal structure of the data, with a 15.96% decrease in MAE for LightGBM when incorporating previous time step data for all sensors compared to only the target sensor. LightGBM is in fact the only model to outperform the "common sense" approach in terms of MAE, and it does so by a good margin, an MAE of 1.43 compared to 1.60, but only when incorporating surrounding sensor data. The LSTM performs with slightly better accuracy than the OLS model but gains only a slight improvement in error when incorporating upstream and downstream sensor information; it marginally outperforms the "common sense" approach in terms of MAE and RMSE.

Table 4 presents the results for model predictions at time horizon t+2. The results share similarities with the results for t+1 predictions, with once again LightGBM being by far the best performing model, with an MAE of 1.885 and RMSE of 3.796 when including all sensors as input. The model's performance advantage over common sense is magnified when predicting 2 time steps ahead: there is a 12% decrease in MAE for LightGBM model predictions compared to common sense, while when predicting one time step ahead there was a 10.32% difference in error. Again, OLS and LightGBM model performance improves, with a reduction in MAE and RMSE, when incorporating spatial features by way of upstream and downstream sensor readings. At this time horizon, the LSTM model does not seem to take advantage of the spatial elements of the data, as it fails to improve when incorporating surrounding sensor readings.

Table 4: MAE and RMSE scores for target sensor M60/9223A at time t+2

Model Input Data MAE RMSE

OLS Target sensor 2.29718 4.8005

Target sensor + downstream sensors 2.23167 4.2297

Target sensor + upstream sensors 2.27521 4.65812

All sensors 2.23841 4.19049

LightGBM Target sensor 2.1893 4.62002

Target sensor + downstream sensors 1.91728 3.81149

Target sensor + upstream sensors 2.07082 4.3758

All sensors 1.88533 3.7961

LSTM Target sensor 2.16217 4.68135

Target sensor + downstream sensors 2.20957 4.65229

Target sensor + upstream sensors 2.26725 4.6352

All sensors 2.20115 4.78406

Extrapolate 2.15487 4.65977

The results in terms of MAE for the best performing LightGBM model are shown in Fig 17, where the MAE of predictions is aggregated for each hour of the day; the plot shows the influence of both time of day and the inclusion of spatial features on model performance. During non-peak times, the accuracy of models that take input data only from the target sensor is marginally worse than that of models taking inputs from both the target sensor and surrounding sensors upstream and downstream. At peak times, however, the difference is more pronounced: the error is much higher in models that take input data from only the target sensor or the target sensor + upstream sensors. Congested times appear to be when the incorporation of spatial features is most important, especially downstream sensors, due to the back-propagation of traffic through the network at busy times.

For one week, 2019-10-21 to 2019-10-28 (Mon-Sun), Fig 18 shows the predicted traffic speeds for both the LightGBM and LSTM models, as well as the actual target sensor readings for that period, also known as "ground truth" values. Whilst both models fit the general trend of the data, the LightGBM better matches the ground truth at times when the traffic speeds are particularly unstable or congested, such as in the areas circled.


Figure 17: MAE for all predictions in each hour of the day

Figure 18: Actual and predicted traffic speed values for LightGBM and LSTM models

5 Discussion

To reiterate, the main research questions of this study were as follows. To what extent does traffic data exhibit spatio-temporal patterns? Which is the best performing traffic forecasting technique: OLS regression, Gradient Boosted Decision Trees or LSTM neural networks? Does the inclusion of spatial features improve traffic forecasting models? The results from this study provide clear indications as to the answers. The exploration of spatio-temporal patterns using the Cross-Correlation Function indicated that there are clear spatial and temporal dependencies in the data. These spatio-temporal patterns are dynamic in both space and time, with distinctions found between the influence of upstream traffic compared to downstream traffic, for instance downstream traffic patterns becoming more significant during congested periods (Fig 16). The results found the LightGBM model to be the most accurate by a significant margin; it appears to be the most appropriate model for capturing the complex spatio-temporal patterns, and generated accurate predictions under both free-flowing and congested conditions (Fig 18) up to 30 minutes into the future. The LightGBM model also saw considerable improvements when including the effects of upstream and downstream sensors in the input vector (Table 3), which in turn answers our final question.

5.1 Literature implications

The spatio-temporal correlations studied in this paper concur with ﬁndings from Yue &

Yeh (2008) and Cheng et al. (2012) that traﬃc networks exhibit strong spatial as well

as temporal patterns as traﬃc propagates through the network. Although this is an

aspect that has previously been overlooked in the literature (Vlahogianni et al. 2014)

the importance of space in traﬃc forecasting has been brought more to the fore in recent

years. The CCF proved an e↵ective way of quantifying space time relationships which

could aid in model selection and feature selection for traﬃc forecast models.

As previously stated, the literature is still divided over which modelling techniques are most appropriate. In relation to this study, both Xia et al. (2019) and Mei et al. (2018) also found success in developing traffic flow forecasting models with GBDTs. On the other hand, Essien et al. (2020) found that their Autoencoder-LSTM outperformed XGBoost when predicting traffic flows across the A56 in Manchester, and Wang & Li (2018) found that their hybrid Convolutional LSTM network outperformed LightGBM in predicting travel speed 15 to 90 minutes into the future. In both of these studies the GBDT models were being used as comparisons against significantly more complex models and networks with multiple Convolutional and bidirectional-LSTM hidden layers. There are few, if any, studies that show LightGBM to be outperformed by a pure LSTM model. This study confirms the findings of Xia et al. (2019) and Mei et al. (2018) that LightGBM appears to be a useful method, dealing with spatio-temporal relationships well without the need for supplementary models or network layers to capture the spatial aspect of the data, as is the case with LSTM networks. The importance of integrating upstream and downstream sensors agrees with findings by both Gebresilassie (2017) and Du et al. (2018), and forms part of a growing body of evidence for the importance of explicitly extracting spatial characteristics of traffic data to improve traffic forecasting.

Perhaps unexpectedly, some of the model variations implemented failed to outperform a much simpler "common sense" approach that required almost no computation. This is interesting for two reasons. First, it is evidence of just how challenging it can be to accurately model a system as dynamic and complex as an urban traffic network. The intricate spatio-temporal relationships in the data, outlined in this study and in previous works (Cheng et al. 2012, Yue & Yeh 2008), mean that traffic prediction is difficult: it requires serious thought about a number of interconnected factors in the spatial and temporal dimensions, as well as consideration of contextual factors that drive traffic patterns, such as weather and accidents. Second, it raises questions about how traffic forecasting models are evaluated in the literature. Very few studies of a similar nature to this one test their models against such heuristic common-sense measures, usually opting instead to test developed models against other common modelling techniques. The problem with this is that it tends to lose sight of the underlying goals of traffic forecasting: rather than leveraging computational techniques to enhance or improve existing methods, much research is pitting increasingly complex models against each other (Karlaftis & Vlahogianni 2011b). As Vlahogianni et al. (2014) note, increasing model complexity can also be detrimental to the explanatory power of a model, which in the real world is imperative to make it adaptable and responsive to dynamic traffic changes (Karlaftis & Vlahogianni 2011b). Tree-based methods such as GBDTs can offer ways to explain predictions by extracting the input features that have most influence on model outputs (Chen & Guestrin 2016). This is an area where models such as LSTM networks struggle.
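The "common sense" baseline discussed above can be as simple as a persistence forecast. The following sketch uses hypothetical speed readings, not data from the study, to show how such a baseline and its error can be computed with essentially no computation:

```python
# A minimal persistence baseline: predict that the speed observed now
# will still hold `horizon` steps ahead. All values here are illustrative.
def persistence_forecast(speeds, horizon):
    """Forecast speeds[t + horizon] as speeds[t] for every valid t."""
    return speeds[:-horizon]

def mae(actual, predicted):
    """Mean absolute error between paired observations and forecasts."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up 15-minute speed readings (mph), not measurements from the M60 data.
speeds = [65, 64, 60, 45, 30, 28, 35, 50, 62]
h = 1  # one step ahead
preds = persistence_forecast(speeds, h)
print(round(mae(speeds[h:], preds), 2))  # baseline error on this toy series
```

Any learned model has to beat this number before its extra complexity is justified, which is exactly the evaluation concern raised above.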


5.2 Real world applications

Given that this is a field of research that developed with real-world applications in mind, it is sensible to examine the findings of this paper in the context of real Intelligent Transport Systems. The results regarding the spatio-temporal structure of the traffic data emphasise the need for comprehensive and widespread traffic monitoring and data collection systems: these methods rely on high-quality, rich and multidimensional raw data sources, and the clear spatial dependence exhibited by the type of network data shown in this study means that performance is only likely to improve with the addition of more sensors distributed across the network. The promising performance of the LightGBM model, providing accurate predictions even in abnormal conditions (Fig. 18) and over multiple time steps (Table 4), suggests that it is a modelling technique that could be of significant use in real-world applications. It is important to highlight that this study focused only on generating forecasts for an individual sensor. Clearly, for traffic forecasts to be an effective tool in a traffic management system they need to be extended across the entire network. This brings a number of challenges, however. It is not clear, for instance, how many upstream and downstream sensors are required for accurate predictions; in this study all sensors appeared to exhibit some level of spatio-temporal correlation with the target sensor, but this is unlikely to remain the case as the distance between sensors increases.

A further challenge regarding real-world integration is the availability of real-time data streams. While this study used historical data to assess performance, real-world application requires that data can be captured, processed and fed to a model within a short window of time for any prediction scheme to be viable. The reverse is also true: there is a challenge in determining which ITS and traffic management technologies are capable of integrating and adapting to information taken from forecasting models (Vlahogianni et al. 2014).

5.3 Limitations

Some of the main limitations have been briefly touched upon already in this discussion. Firstly, the modelling techniques employed in this paper are suited to predictions at a single sensor at a time, and more general methods able to predict traffic speeds across an entire network might be preferred. One option is Geographically Weighted Regression (Brunsdon et al. 1998) as implemented by Gebresilassie (2017). This technique involves a single regression model for the entire network but allows the relationships between independent and dependent variables to vary by locality. Luo et al. (2019) developed a method in which models were built for individual sensors, the predictions combined, and a function based on spatial weights between sensors applied in order to improve overall accuracy when making simultaneous predictions across the network.
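The combination step described above can be pictured as a simple spatially weighted blend. This sketch is hypothetical: the inverse-distance weighting, the mixing weight `alpha`, and all values are illustrative choices, not taken from Luo et al. (2019):

```python
# Blend a sensor's own forecast with an inverse-distance-weighted average
# of neighbouring sensors' forecasts. `alpha` is a hypothetical mixing weight.
def spatially_blended_forecast(own_pred, neighbour_preds, distances, alpha=0.7):
    weights = [1.0 / d for d in distances]  # closer sensors count more
    total = sum(weights)
    neighbour_avg = sum(w * p for w, p in zip(weights, neighbour_preds)) / total
    return alpha * own_pred + (1.0 - alpha) * neighbour_avg

# Own forecast 60 mph; neighbours at 1 km and 2 km forecast 50 and 70 mph.
print(round(spatially_blended_forecast(60.0, [50.0, 70.0], [1.0, 2.0]), 1))
```

The appeal of such a scheme is that the per-sensor models stay simple while the blend injects spatial structure at prediction time.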

A further limitation of this study is the lack of insight into the causality behind model predictions. While a model may accurately map an input vector to an output value, this offers no information as to the underlying causes of congestion or abnormal traffic patterns.

Finally, the homogeneity of the data used for model evaluation limits what can be inferred from the results. While this paper highlighted the potential of the LightGBM model when compared to other methods, it is not possible to conclude that LightGBM is a more appropriate method than LSTM for traffic forecasting based on this one study; for that, more tests would need to be carried out on a number of data sets across differing road networks.


5.4 Future work

Although this study highlighted the importance of spatial dependence in traffic forecasting, beyond the inclusion of explicitly spatial features in the form of upstream and downstream sensors there was no significant effort to more formally encode the spatial structure of the data in a way that might help models better learn its latent spatial structure. There have been attempts to develop models that are "spatially aware": for instance, Du et al. (2018) and Wang & Li (2018) introduce Convolutional Neural Networks (Cheng et al. 2018). These were originally developed for image recognition but have been applied in the traffic forecasting literature because they are designed to extract spatial features from unstructured data such as images and graphs, which is appropriate since transport networks can easily be represented as graph structures.

Given that this study focused on one continuous road, there is room for further exploration of the spatio-temporal patterns of interconnected road networks, to see whether relationships hold between roads that are adjacent, or even completely disjoint but in close proximity to each other. Work by Cheng et al. (2012) would suggest that the influence of road links on their neighbours is local and varies widely in space and time, making it hard to measure spatio-temporal relationships across complex road topologies.

6 Conclusion

The growth of urban populations globally has resulted in rising levels of traffic and congestion, creating numerous social, environmental and economic issues (Hymel 2009), with the pollution caused by increased traffic levels in urban areas even being found to increase infant mortality rates (Knittel et al. 2016). The need has therefore never been greater for intelligent traffic management systems guided by effective forecasting techniques and a comprehensive understanding of the spatial and temporal patterns that drive traffic flows. This dissertation first examined how traffic forecasting can benefit from an understanding of how traffic speeds at neighbouring locations are related. Furthermore, it compared the effectiveness of machine learning algorithms in providing accurate and robust forecasts, improved by incorporating the effects of upstream and downstream sensors.

A Cross-Correlation Function was used to identify and quantify the spatio-temporal dependencies in the traffic speed data, where it was found that significant relationships exist that are dynamic in both space and time. There were distinct differences between the influence of upstream and downstream traffic patterns on a given road segment, which also varied over the course of the day and provided evidence of traffic "shockwaves" (Richards 1956) during congested periods as congestion propagated upstream. The Cross-Correlation Function was found to be an effective method of quantifying space-time relationships, which could aid in model selection, feature selection and determining the feasibility of short-term traffic forecasts on a given road network.
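The lagged-correlation idea behind the Cross-Correlation Function can be sketched as follows, assuming two equal-length speed series sampled at the same regular interval (the series below are made up for illustration):

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def cross_correlation(x, y, max_lag):
    """Correlation between x at time t and y at time t + k, for k = 0..max_lag.
    A peak at some k > 0 suggests conditions at x lead conditions at y."""
    return {k: pearson(x[:len(x) - k], y[k:]) for k in range(max_lag + 1)}

# Toy series: y repeats x two steps later, so the CCF should peak at lag 2.
x = [60, 58, 50, 40, 35, 38, 45, 55, 60, 62, 61, 59]
y = [61, 60] + x[:-2]  # hypothetical downstream sensor, a delayed copy of x
ccf = cross_correlation(x, y, max_lag=3)
print(max(ccf, key=ccf.get))  # lag with the strongest correlation
```

In practice the lag at which the CCF peaks indicates the propagation delay between sensors, which is what makes it useful for feature selection.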

Three traffic forecasting models were developed and tested on traffic data collected on the M60 motorway in Manchester, England. The goal was to predict traffic speeds both 15 minutes and 30 minutes ahead of time. The LightGBM algorithm, an efficient gradient boosting implementation, was the best performing model by a significant margin. It outperformed Ordinary Least Squares regression and Long Short-Term Memory neural networks at both prediction horizons and under varying traffic conditions. It also outperformed a heuristic approach by a large enough margin to suggest it is a method with genuine predictive power. This is hypothesised to be due to the ability of the LightGBM algorithm to more effectively learn the latent spatial structure of the traffic speed data, as opposed to the hugely popular LSTM neural network, a sequence prediction algorithm which evidence suggests does not capture spatial dependencies well on its own, resulting in sub-optimal performance as a standalone algorithm that may require hybrid modelling structures to extract spatial features prior to its application (Du et al. 2018, Cheng et al. 2012). Empirical evidence was provided for the benefit of including neighbouring road sensor measurements in traffic forecasting models, given an appropriate model able to capture the spatial structure. The results highlighted both the importance of spatial dependence and that of correct model selection and specification in order to produce acceptable performance in complex traffic forecasting tasks. The role of geo-space in this study was clear, but until recent years it has been overlooked in the literature, which leaves open the much broader question of which other related fields may benefit from an explicit treatment of spatial structure.


7 Bibliography

Ahmed, M. S. & Cook, A. R. (1979), Analysis of freeway traﬃc time-series data by

using Box-Jenkins techniques, number 722.

Bishop, C. M. (2006), Pattern recognition and machine learning, springer.

Bojer, C. S. & Meldgaard, J. P. (2020), ‘Kaggle forecasting competitions: An overlooked

learning opportunity’, International Journal of Forecasting .

URL: http://www.sciencedirect.com/science/article/pii/S0169207020301114

Box, G. E., Jenkins, G. M. & Reinsel, G. C. (2011), Time series analysis: forecasting

and control, Vol. 734, John Wiley & Sons.

Breiman, L., Friedman, J., Stone, C. J. & Olshen, R. A. (1984), Classification and regression trees, CRC Press.

Brunsdon, C., Fotheringham, S. & Charlton, M. (1998), ‘Geographically weighted regres-

sion’, Journal of the Royal Statistical Society: Series D (The Statistician) 47(3), 431–

443.

Castro-Neto, M., Jeong, Y.-S., Jeong, M.-K. & Han, L. D. (2009), ‘Online-svr for short-

term traﬃc ﬂow prediction under typical and atypical traﬃc conditions’, Expert Sys-

tems with Applications 36(3, Part 2), 6164–6173.

URL: http://www.sciencedirect.com/science/article/pii/S0957417408004740

Chandra, S. R. & Al-Deek, H. (2009), ‘Predictions of freeway traﬃc speeds and volumes

using vector autoregressive models’, Journal of Intelligent Transportation Systems

13(2), 53–72.

URL: https://doi.org/10.1080/15472450902858368

Chen, T. & Guestrin, C. (2016), Xgboost: A scalable tree boosting system, in ‘Proceed-

ings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery

and Data Mining’, KDD ’16, Association for Computing Machinery, New York, NY,

USA, p. 785–794.

URL: https://doi.org/10.1145/2939672.2939785

Cheng, N., Lyu, F., Chen, J., Xu, W., Zhou, H., Zhang, S. & Shen, X. (2018), ‘Big data

driven vehicular networks’, IEEE Network 32(6), 160–167.

Cheng, T., Haworth, J. & Wang, J. (2012), ‘Spatio-temporal autocorrelation of road

network data’, Journal of Geographical Systems 14(4), 389–413.

URL: https://doi.org/10.1007/s10109-011-0149-5

Coleri, S., Cheung, S. Y. & Varaiya, P. (2004), Sensor networks for monitoring traﬃc,

in ‘In Allerton Conference on Communication, Control and Computing’.

Drew, D. R. (1968), Traﬃc ﬂow theory and control, Technical report.

Du, S., Li, T., Gong, X. & Horng, S.-J. (2018), ‘A hybrid method for traﬃc ﬂow fore-

casting using multimodal deep learning’, arXiv preprint arXiv:1803.02099 .

Essien, A., Petrounias, I., Sampaio, P. & Sampaio, S. (2020), ‘A deep-learning model

for urban traﬃc ﬂow prediction with traﬃc events mined from twitter’, World Wide

Web .

URL: https://doi.org/10.1007/s11280-020-00800-3


Florio, L. & Mussone, L. (1996), ‘Neural-network models for classiﬁcation and forecast-

ing of freeway traﬃc ﬂow stability’, Control Engineering Practice 4(2), 153 – 164.

URL: http://www.sciencedirect.com/science/article/pii/0967066195002219

Gebresilassie, M. A. (2017), Spatio-temporal traﬃc ﬂow prediction.

Griﬃth, D. A. (1987), ‘Spatial autocorrelation’, A Primer. Washington DC: Association

of American Geographers .

Hastie, T., Tibshirani, R. & Friedman, J. (2009), The elements of statistical learning:

data mining, inference, and prediction, Springer Science & Business Media.

Hochreiter, S. & Schmidhuber, J. (1997), ‘Long short-term memory’, Neural computation

9(8), 1735–1780.

Hong, W.-C. (2012), ‘Application of seasonal svr with chaotic immune algorithm in

traﬃc ﬂow forecasting’, Neural Computing and Applications 21(3), 583–593.

URL: https://doi.org/10.1007/s00521-010-0456-7

Hymel, K. (2009), ‘Does traﬃc congestion reduce employment growth?’, Journal of

Urban Economics 65(2), 127 – 135.

URL: http://www.sciencedirect.com/science/article/pii/S0094119008001162

Innamaa, S. (2005), Short-term prediction of traﬃc situation using mlp-neural networks.

Ishak, S. & Al-Deek, H. (2002), ‘Performance evaluation of short-term time-series traﬃc

prediction model’, Journal of Transportation Engineering 128(6), 490–498.

Karlaftis, M. G. & Vlahogianni, E. I. (2011a), 'Statistical methods versus neural networks in transportation research: Differences, similarities and some insights', Transportation Research Part C: Emerging Technologies 19(3), 387–399.

Karlaftis, M. & Vlahogianni, E. (2011b), 'Statistical methods versus neural networks in transportation research: Differences, similarities and some insights', Transportation Research Part C: Emerging Technologies 19(3), 387–399.
URL: http://www.sciencedirect.com/science/article/pii/S0968090X10001610

Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q. & Liu, T.-Y.

(2017), Lightgbm: A highly eﬃcient gradient boosting decision tree, in I. Guyon,

U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan & R. Garnett,

eds, ‘Advances in Neural Information Processing Systems 30’, Curran Associates, Inc.,

pp. 3146–3154.

URL: http://papers.nips.cc/paper/6907-lightgbm-a-highly-eﬃcient-gradient-boosting-

decision-tree.pdf

Knittel, C. R., Miller, D. L. & Sanders, N. J. (2016), ‘Caution, drivers! children present:

Traﬃc, pollution, and infant health’, Review of Economics and Statistics 98(2), 350–

366.

Leduc, G. (2008), Road traﬃc data: Collection methods and applications.

Lint, J. (2004), Reliable travel time prediction for freeways, PhD thesis.

Liu, D., Tang, L., Shen, G. & Han, X. (2019), ‘Traﬃc speed prediction: An attention-

based method’, Sensors 19(18).

URL: https://www.mdpi.com/1424-8220/19/18/3836


Luo, X., Li, D., Yang, Y. & Zhang, S. (2019), ‘Spatiotemporal traﬃc ﬂow prediction

with knn and lstm’, Journal of Advanced Transportation 2019, 4145353.

URL: https://doi.org/10.1155/2019/4145353

Mei, Z., Xiang, F. & Zhen-hui, L. (2018), Short-term traﬃc ﬂow prediction based on

combination model of xgboost-lightgbm, in ‘2018 International Conference on Sensor

Networks and Signal Processing (SNSP)’, pp. 322–327.

Moran, P. A. (1950), ‘Notes on continuous stochastic phenomena’, Biometrika

37(1/2), 17–23.

Pfeifer, P. E. & Deutsch, S. J. (1980), 'A three-stage iterative procedure for space-time modeling', Technometrics 22(1), 35–47.

Richards, P. I. (1956), ‘Shock waves on the highway’, Operations Research 4(1), 42–51.

URL: https://doi.org/10.1287/opre.4.1.42

Smith, B. L. & Demetsky, M. J. (1997), ‘Traﬃc ﬂow forecasting: Comparison of modeling

approaches’, Journal of Transportation Engineering 123(4), 261–266.

Smith, B. L., Williams, B. M. & Keith Oswald, R. (2002), ‘Comparison of parametric

and nonparametric models for traﬃc ﬂow forecasting’, Transportation Research Part

C: Emerging Technologies 10(4), 303 – 321.

URL: http://www.sciencedirect.com/science/article/pii/S0968090X02000098

Soper, H. E., Young, A. W., Cave, B. M., Lee, A. & Pearson, K. (1917), 'On the distribution of the correlation coefficient in small samples. Appendix II to the papers of "Student" and R. A. Fisher', Biometrika 11(4), 328–413.

URL: http://www.jstor.org/stable/2331830

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. (2014),

‘Dropout: a simple way to prevent neural networks from overﬁtting’, The journal of

machine learning research 15(1), 1929–1958.

Sun, B., Cheng, W., Goswami, P. & Bai, G. (2018), ‘Short-term traﬃc forecasting using

self-adjusting k-nearest neighbours’, IET Intelligent Transport Systems 12(1), 41–48.

Tobler, W. R. (1970), ‘A computer movie simulating urban growth in the detroit region’,

Economic Geography 46(sup1), 234–240.

URL: https://www.tandfonline.com/doi/abs/10.2307/143141

Vlahogianni, E. I., Golias, J. C. & Karlaftis, M. G. (2004), ‘Short-term traﬃc forecasting:

Overview of objectives and methods’, Transport Reviews 24(5), 533–557.

URL: https://doi.org/10.1080/0144164042000195072

Vlahogianni, E. I., Karlaftis, M. G. & Golias, J. C. (2014), ‘Short-term traﬃc forecasting:

Where we are and where we’re going’, Transportation Research Part C: Emerging

Technologies 43, 3 – 19. Special Issue on Short-term Traﬃc Flow Forecasting.

URL: http://www.sciencedirect.com/science/article/pii/S0968090X14000096

Wang, W. & Li, X. (2018), ‘Travel speed prediction with a hierarchical convolutional

neural network and long short-term memory model framework’.

Wang, X., An, K., Tang, L. & Chen, X. (2015), ‘Short term prediction of freeway exiting

volume based on svm and knn’, International Journal of Transportation Science and

Technology 4(3), 337–352.


Williams, B. M. & Hoel, L. A. (2003), ‘Modeling and forecasting vehicular traﬃc ﬂow as

a seasonal arima process: Theoretical basis and empirical results’, Journal of Trans-

portation Engineering 129(6), 664–672.

Xia, H., Wei, X., Gao, Y. & Lv, H. (2019), Traﬃc prediction based on ensemble machine

learning strategies with bagging and lightgbm, in ‘2019 IEEE International Conference

on Communications Workshops (ICC Workshops)’, pp. 1–6.

Yang, B., Sun, S., Li, J., Lin, X. & Tian, Y. (2019), ‘Traﬃc ﬂow prediction using lstm

with feature enhancement’, Neurocomputing 332, 320 – 327.

URL: http://www.sciencedirect.com/science/article/pii/S092523121831470X

Ye, Q., Szeto, W. Y. & Wong, S. C. (2012), ‘Short-term traﬃc speed forecasting based on

data recorded at irregular intervals’, IEEE Transactions on Intelligent Transportation

Systems 13(4), 1727–1737.

Yue, Y. & Yeh, A. G.-O. (2008), ‘Spatiotemporal traﬃc-ﬂow dependency and short-term

traﬃc forecasting’, Environment and Planning B: Planning and Design 35(5), 762–

771.

Zhao, Z., Chen, W., Wu, X., Chen, P. C. Y. & Liu, J. (2017), ‘Lstm network: a deep

learning approach for short-term traﬃc forecast’, IET Intelligent Transport Systems

11(2), 68–75.
