Scientific Reports | (2021) 11:15173 | https://doi.org/10.1038/s41598-021-94696-2
www.nature.com/scientificreports
Improving prediction of COVID‑19
evolution by fusing epidemiological
and mobility data
Santi García‑Cremades1, Juan Morales‑García2, Rocío Hernández‑Sanjaime1,
Raquel Martínez‑España2, Andrés Bueno‑Crespo2, Enrique Hernández‑Orallo3,
José J. López‑Espín1 & José M. Cecilia3*
We are witnessing the dramatic consequences of the COVID-19 pandemic which, unfortunately, go beyond the impact on the health system. Until herd immunity is achieved with vaccines, the only available mechanisms for controlling the pandemic are quarantines, perimeter closures and social distancing with the aim of reducing mobility. Governments only apply these measures for a reduced period, since they involve the closure of economic activities such as tourism, cultural activities, or nightlife. The main criterion for establishing these measures and planning socioeconomic subsidies is the evolution of infections. However, the collapse of the health system and the unpredictability of human behavior, among others, make it difficult to predict this evolution in the short to medium term. This article evaluates different models for the early prediction of the evolution of the COVID-19 pandemic to create a decision support system for policy-makers. We consider a wide range of models including artificial neural networks such as LSTM and GRU and statistically based models such as autoregressive (AR) or ARIMA. Moreover, several consensus strategies to ensemble all models into one system are proposed to obtain better results in this uncertain environment. Finally, a multivariate model that includes mobility data provided by Google is proposed to better forecast trend changes in the 14-day cumulative incidence (CI). A real case study in Spain is evaluated, providing very accurate results for the prediction of 14-day CI in scenarios with and without trend changes, reaching 0.93 $R^2$, 4.16 RMSE and 1.08 MAE.
The COVID-19 pandemic is the biggest global challenge in our recent history, which puts the welfare state of today’s society at risk. Spain is undoubtedly among the countries most affected by the pandemic, with up to 3,697,987 total cases of infection, and a total of 80,196 deaths (as reported on June 7, 2021)1. Governments worldwide are taking drastic measures such as social distancing, contact tracing, perimeter closures and even quarantines, which are either reinforced or alleviated depending on the epidemiological status of the disease2. These non-sanitary measures focus on the reduction of human mobility, which has an important socio-economic effect3. For instance, according to the European Commission, the economic forecast for Spain is the worst in its recent history with a 9.4% drop in GDP, and an expected unemployment of up to 18.9% at the end of 2020. Globally speaking, the Organisation for Economic Co-operation and Development (OECD)4 stated that these bad economic projections will lead to widespread poverty, child malnutrition, stress, and suicides, just to mention a few of the dramatic consequences for the population. However, beyond the economic consequences, the measures of social distancing and lockdowns can raise new social scenarios in fundamental aspects such as education, gender violence, immigration and other new issues that may arise because of such extreme public health measures.
Early understanding of the evolution of the pandemic helps prevent scenarios that could increase the number of COVID-19 victims. Governments have implemented public health surveillance systems for COVID-19 based on the fundamental principles provided by the World Health Organization (WHO); i.e., tracking clinical and epidemiological figures such as confirmed cases, deaths and active cases, just to mention a few5,6. This information is usually provided by governments daily, and currently, these surveillance systems provide robust and stable information on the evolution of the pandemic7. However, this epidemiological information shows a posterior picture of the pandemic, i.e., once people have been infected and are showing symptoms, usually after an incubation period of 7-10 days8. From these epidemiological data, novel Machine Learning (ML), Artificial Intelligence (AI) and data science methods can provide significant outcomes for tracking and detecting COVID-19 evolution at national and regional level9. All in all, the infection curve can be seen as a time series in which trend changes are hardly predictable, as it does not follow a seasonal pattern, mainly due to the chaotic interaction of people.
1Center of Operations Research, Miguel Hernandez University of Elche (UMH), 03202 Elche, Spain. 2Computer Science Department, Universidad Católica de Murcia (UCAM), 30107 Murcia, Spain. 3Computer Engineering Department, Universitat Politècnica de València (UPV), 46022 Valencia, Spain. *email: jmcecilia@disca.upv.es
Content courtesy of Springer Nature, terms of use apply. Rights reserved
Figure 3 shows the 14-day CI in Spain from July 20, 2020 until January 2021. The first Spanish wave officially ended on July 20, 2020 and the 14-day CI started to increase again from that date onwards. It is worth mentioning that from the second wave until today, there have been several waves, understood as trend changes in the 14-day CI. On October 9th, the 14-day CI started to increase again, coinciding with a national vacation period from October 9th to 12th. In addition, in mid-December a trend change of the 14-day CI was reported, also coinciding with a vacation period (December 8–12, 2020), and the 14-day CI has been increasing from that date until now. These trend changes are one of the most difficult scenarios for modelling. The 14-day CI is a time series that includes daily data from July. However, not every day is reported: COVID-19 data in Spain is only reported on working days, i.e., Monday through Friday, except holidays. The lack of historical data, as well as the scarcity of trend changes during the training period, makes it very difficult for the models to learn these changes.
In this paper, we propose a multivariate model to predict trend changes in the 14-day Cumulative Incidence (CI) of COVID-19. We conducted a comprehensive analysis of different mobility components offered by Google to incorporate this information into our multivariate model as exogenous information. The multivariate model resulting from adding this information can predict trend changes in 14-day CI with greater accuracy. The main contributions of the paper are the following:
1. Several state-of-the-art ML and statistical methods are evaluated to predict the 14-day CI, using only the historical information of this variable as input for two different scenarios, i.e., 14-day CI with trend changes and without trend changes in the time series.
2. An ensemble strategy is provided to combine previous models and provide an optimal prediction. These methods offer very good performance for this time series when there are no clear trend changes.
3. A multivariate model is designed and fed with 14-day CI and mobility variables provided by Google as exogenous information.
4. The multivariate model is optimized using operations research techniques to achieve better prediction of trend changes in 14-day CI.
5. The evaluation is based on information from several waves in Spain in which clear trend changes were reported.
The remainder of the paper is structured as follows. First, we discuss the related work. Then, the methods of this article are introduced in the “Methods” section, including the main ML and statistical models proposed, their ensemble and the exogenous information targeted. Finally, the “Evaluation and results” section shows the main results and findings of our article before the main conclusions and directions for future work are introduced.
Related work
Since the very beginning of the COVID-19 pandemic, scientists have struggled to design models that could forecast not only the evolution of the disease but also the impact of the different measures taken. The problem is that these models must characterise not only how the virus spreads, which is far from being understood, but also human behaviour, which can be erratic. Firstly, it is necessary to evaluate and model how fast COVID-19 is spreading. A fundamental epidemiological quantity, the reproductive number R, represents the average number of new infections an infected person can generate (so the greater the number, the faster the spreading). First estimations of the $R_0$ value for COVID-19 evidenced a relatively high value, in the range 2.4–5.6 (refs. 10–12). Fortunately, measures such as social distancing, facial masks and mobility reduction have allowed health authorities to control the spread of the disease.
Dierent types of models have been proposed for forecasting COVID-19 evolution: compartmental models,
statistical-based models and machine learning (ML) based models13. In epidemiological compartmental models,
the population is assigned to dierent compartments (for example, the simple SIR models with three compart-
ments: Susceptible, Infectious, and Recovered). ese compartmental models have been used to evaluate and
forecast the impact of the dierent measures taken, such as quarantine, isolation and contact tracing. For example,
in14,15 the authors model and evaluate the general eects of containment mechanisms. Regarding contact trac-
ing, in10,11 it was stated that contact tracing and isolation as currently practiced is not helping in preventing the
COVID-19 pandemic. Finally, in16,17, the authors evaluated the impact of the technological aspects (such as reso-
lution, centralised vs decentralised approaches) of the current smart-based contact tracing application showing
that for being eective, it would have required a high adoption rate and a centralised technology. Unfortunately,
it was not the case, so these kinds of contact tracing applications failed to control the disease.
On the other hand, statistical-based models, i.e., time series analysis and forecasting, rely only on past data to predict the near future. There are many different methods, such as Auto-Regressive Moving Average (ARMA), Auto-Regressive Integrated Moving Average (ARIMA), Support Vector Regressor (SVR), Linear Regressor polynomial (LRP), Bayesian Ridge Regression (BRR), Linear Regression (LR), Random Forest Regressor (RFR), Holt-Winter Exponential Smoothing (HW), and Extreme Gradient Boost Regressor (XGB). Note that some authors consider some of these methods as Machine Learning methods18 but none of them seems to improve the overall quality of the prediction19–21 (see below for a detailed description of these references). Among these models, we may highlight the ARIMA model22, which has shown good results forecasting the COVID-19 infections. For instance, Benvenuto et al.23 proposed the use of ARIMA models to predict the COVID-19 spread around the world, while Perone et al.24 proposed a model for different regions of Italy and Sahai et al.25 did the same for the top five affected countries. Nevertheless, these models can only predict short-term behaviour, as confidence intervals grow extremely fast as time elapses26. Petropoulos et al.27 also recognized the limitations of forecasting longer-term trajectories of an outbreak.
As previously commented, some authors consider most of the previous statistical methods to be part of more general Machine Learning (ML) and Deep Learning (DL) methods19. For example, Shahit et al.28 used DL methods for the prediction of time series of confirmed cases, deaths and recoveries in COVID-19 affected countries, where the performance of models was measured by mean absolute error (MAE), root mean square error (RMSE) and $R^2$. They focus on different variables (but not 14-day CI) with stable trends. Similarly, Zerorual et al.29 compared up to five DL models for COVID-19 forecasting using COVID-19 information from different countries, including Italy, Spain, France, China, USA and Australia. Nevertheless, more specific ML methods such as neural networks and Support Vector Machines (SVMs) have been shown to perform poorly since they require more training data than the currently available datasets20,21. Furthermore, as stated by Ribeiro et al.30, this fact can also be attributed to the chaotic dynamics of the analysed data, as well as the diversity of exogenous factors.
Several studies have shown the relationship between mobility and the disease spread. Linka et al.31 showed a strong correlation between the reduction in mobility and the effective reproduction number across Europe, which was particularly high for countries such as the Netherlands, Germany, Ireland, Spain, and Sweden (which have a Spearman’s rank correlation $\rho$ of 0.99). The authors in refs. 32,33 found that mobility statistics offered in open COVID-19 datasets showed the evolution of the COVID-19 spread in China, placing the contagion peak at the very beginning of 2020. A recent study using mobile phone data of more than 13 million users in Spain34 has shown that these data can be used as a predictor of COVID-19-related deaths. Particularly, they stated that there is a critical level (around 70% of the radius of gyration, which quantifies the mobility range of an individual during a given week35) at which hospitalizations and deaths tend to increase two to three weeks after this threshold is exceeded. Finally, Google and Apple mobility data, which are used in this paper, have proven to be of great help in quantifying and predicting the effects of COVID-19. For example, Cot et al.36 quantify the effects of social distancing on the COVID-19 spreading dynamics in Europe and in the USA, and Nouvellet et al.37 show the correlation between the reduction in mobility and COVID-19 transmission. One key aspect of all these models is the quality of the data used. Having a wide range of data, updated on a real-time basis and accessible, is critical to characterizing disease outbreaks and obtaining useful models38. Nevertheless, better data are necessary, but not sufficient. As stated by Castro et al.26, human dynamics are really hard to model since there is always uncertainty in human behaviour, so most models can fail to forecast some important issues such as turning points and the end of the expansion.
Summing up, the problem with the described forecasting models is to accurately predict trend changes (i.e., waves) when using only previous historical information. These changes in trends can depend on varying external elements, such as mobility, social distancing, etc. Therefore, a way to improve the precision of the previous forecasting methods is to combine several data sources. Particularly, in this paper, we show that the utilisation of mobility data can improve on forecasting that uses only time series (such as the 14-day CI).
Methods
Temporal data are omnipresent in many application domains, such as medicine, agriculture or robotics39,40. Increasingly, time series forecasting is being introduced in these fields. It follows a quantitative approach that uses historical information along with certain associated patterns such as trends, seasonality and irregular components to predict future observations. Trend data in the time series offers long-term information for the prediction. Seasonality refers to patterns in the time series that occur at specific and regular intervals. Finally, irregular components are unsystematic fluctuations due to external factors. Having access to historical time-series data, forecasting models can be used to understand the behaviour of the time series. However, the irregular components of the time series are difficult to predict as they do not follow a given pattern. Generally speaking, time-series models cannot learn these irregular components from the historical data of the time series, so they need additional information to identify these possible events41.
Indeed, the evolution of the 14-day CI of COVID-19 is based on irregular components that are mainly caused by the different implementations of the national legislation that reduces people’s mobility42. Several ICT companies such as Google or Apple have provided mobility data taken from smartphones that run mobility applications, such as Google Maps or Apple Maps, to figure out the changes that have occurred in people’s mobility as a result of the policies to deal with COVID-19 (ref. 43). As previously explained in the “Related work” section, several works have recently been done to predict the COVID-19 evolution based on trends and seasonality in time series, but none of them has analysed trend changes due to these irregular components. This section introduces the ML and statistical univariate models used in this article to predict the 14-day CI using only the endogenous variable; i.e. previous observations of the 14-day CI. These models are combined through an ensemble approach that uses different consensus strategies based on quality metrics that are first described. Finally, the multivariate model is introduced to improve the prediction of the 14-day CI in those time lags where there are trend changes.
Metrics and statistical models used. The main metrics and statistical models used in this work are the following (where $x_i$ is the real data for instance i and $P_i$ is the prediction for instance i):
Coefficient of determination ($R^2$) is used to analyse how differences in one variable can be explained by differences in a second variable. It is a value ranging from 0 to 1 and indicates that the regression line represents none or all of the data, respectively, so that the higher the value, the better the goodness of fit of the model44.
Root mean square error (RMSE) is the standard deviation of the prediction errors, which are a measure of the distance of the data from the regression line, indicating the concentration of the data around the line of best fit. It is, therefore, a measure of the dispersion of these errors (also known as residuals)45.
Mean absolute error (MAE) allows measurement of the average magnitude of the errors for a set of predictions, regardless of their direction. It represents the mean of the absolute differences in the sample between the prediction and the actual observation, taking into account that all individual differences are of equal significance45.
Spearman correlation: Spearman’s correlation coefficient is a non-parametric measure of rank correlation; i.e. statistical dependence of the ranking between two variables. It measures the strength and direction of the association between two ranked variables46.
Granger causality: Granger causality is a testing framework comparing the unrestricted model, in which a time series y is explained by the lags of y and the lags of an additional series of observations x (both lags up to the same fixed order), and the restricted model, in which y is only explained by the lags of y. Thus, Granger causality determines if one time series is helpful for predicting another, and in some cases, it may be used to assert stronger causal statements47.
Principal component analysis (PCA): The aim of this technique is to reduce the dimensionality of multivariate data preserving as much of the relevant information as possible48.
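The three error metrics in the list above can be illustrated with a short sketch. It follows the forms given in Eqs. (1)–(3), with $R^2$ computed as the squared correlation between real and predicted values. The function names and the sample numbers are hypothetical and only serve the illustration; they are not taken from the study.

```python
import math


def r2_score(actual, predicted):
    """Squared correlation between real data x_i and predictions P_i (Eq. 1).

    Assumes neither series is constant, otherwise the denominator is zero.
    """
    n = len(actual)
    mean_x = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((x - mean_x) * (p - mean_p) for x, p in zip(actual, predicted))
    var_x = sum((x - mean_x) ** 2 for x in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_x * var_p)


def rmse(actual, predicted):
    """Root of the mean squared prediction error (Eq. 2)."""
    n = len(actual)
    return math.sqrt(sum((x - p) ** 2 for x, p in zip(actual, predicted)) / n)


def mae(actual, predicted):
    """Mean of the absolute prediction errors (Eq. 3)."""
    n = len(actual)
    return sum(abs(x - p) for x, p in zip(actual, predicted)) / n


# Illustrative values only (not data from the paper).
actual = [100.0, 120.0, 150.0, 170.0]
predicted = [102.0, 118.0, 149.0, 175.0]

print("R2  :", r2_score(actual, predicted))
print("RMSE:", rmse(actual, predicted))
print("MAE :", mae(actual, predicted))
```

Because RMSE squares the errors before averaging, the single error of 5 in the last point weighs more in RMSE than in MAE, which is why the two metrics are reported together.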
Ensemble approach for univariate prediction. This subsection proposes a combination of time series and ML models and techniques to provide a consensus strategy that brings all the results into one. Each method and model has demonstrated good results in the literature for predicting different epidemiological variables related to COVID-19. Moreover, different configurations and/or parameterisations of these models are also important for the quality of the predicted results. With the proposed ensemble, the search space of the models is explored automatically in order to obtain the best possible prediction. The statistical and machine learning methods under study are the following:
1. Autoregressive (AR) is a univariate model49 where a prediction is made using a linear combination of past values of that variable. The term autoregression indicates that it is a regression of the variable against itself. Thus, an autoregressive model is established according to its order p. Autoregressive models are remarkably flexible at handling a wide range of different time series patterns.
2. Autoregressive Integrated Moving Average (ARIMA) is a linear statistical model50, which uses variations and regressions of statistical data in order to find patterns for a prediction into the future. Autoregression (AR) is the term that refers to the lags of the differenced series, Moving Average (MA) refers to the lags of the errors and integration (I) is the number of differences used to make the time series stationary.
3. Long short-term memory (LSTM) is a type of recurrent neural architecture with a state memory and multilayer cell structure51. An LSTM unit is composed of a cell, an input gate, an output gate and a forget gate. The cell remembers values over arbitrary time intervals and the three gates regulate the flow of information into and out of the cell (Fig. 1b). The LSTM differs from a classic recurrent network in that it does not overwrite its content at each time step but is able to decide whether to keep the existing memory through the introduced gates. If the LSTM unit detects an important characteristic of an input sequence at an early stage, it carries this information over long distances and therefore captures long-distance dependencies.
4. Gated Recurrent Unit (GRU) is a type of recurrent neural network with a modification that addresses the vanishing gradient problem of this type of network: the model does not wash out the new input every single time but keeps the relevant information and passes it down to the next time steps of the network52. It is similar to LSTM but without memory cells, which makes it simpler to compute and implement. It is composed of two gates (reset and update) (Fig. 1a), which allow each recurrent unit to adaptively capture dependencies at different time scales. Through these two gates, it is decided what information should be passed on to the output, without eliminating information that is apparently irrelevant to the prediction, so that the information is retained for a long time.
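To make the autoregressive model of item 1 concrete, the following minimal sketch fits an AR(p) model by ordinary least squares and produces recursive forecasts. It is an illustration under stated assumptions, not the implementation used in the study: the helper names (`solve_linear_system`, `fit_ar`, `forecast_ar`) are hypothetical, the fit uses the normal equations, and the sample series is synthetic.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ar(series, p):
    """Least-squares fit of y_t = c + a_1*y_(t-1) + ... + a_p*y_(t-p).

    Returns [c, a_1, ..., a_p].
    """
    rows = [[1.0] + [series[t - k] for k in range(1, p + 1)]
            for t in range(p, len(series))]
    targets = [series[t] for t in range(p, len(series))]
    k = p + 1
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def forecast_ar(series, coeffs, steps):
    """Recursive multi-step forecast: each prediction feeds the next one."""
    p = len(coeffs) - 1
    history = list(series)
    for _ in range(steps):
        nxt = coeffs[0] + sum(coeffs[k] * history[-k] for k in range(1, p + 1))
        history.append(nxt)
    return history[len(series):]


# Synthetic series generated by y_t = 2 + 0.5 * y_(t-1) (illustrative only).
series = [10.0]
for _ in range(9):
    series.append(2.0 + 0.5 * series[-1])

coeffs = fit_ar(series, p=1)
print("fitted [c, a_1]:", coeffs)
print("next 3 values  :", forecast_ar(series, coeffs, steps=3))
```

The recursive forecast also shows why such univariate models struggle with trend changes: every future value is a function of past values only, so a change driven by an external factor cannot appear in the forecast until it has already appeared in the data.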
In the process of combining the information of the proposed ensemble approach, the validation metrics for the regression task are used. Particularly, our ensemble approach uses the coefficient of determination ($R^2$), root mean square error (RMSE) and mean absolute error (MAE) metrics53, defined in Eqs. (1)–(3):

$$R^2 = \frac{\left(\sum_{i=1}^{n}(x_i-\bar{x})(P_i-\bar{P})\right)^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2\,\sum_{i=1}^{n}(P_i-\bar{P})^2} \tag{1}$$

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n}(x_i-P_i)^2}{n}} \tag{2}$$

$$\mathrm{MAE} = \frac{\sum_{i=1}^{n}\lvert x_i-P_i\rvert}{n} \tag{3}$$

Before describing in detail the phases of this proposed ensemble approach, the 4 combination methods used to obtain and calculate the model for the inference are described. The combination methods used are briefly detailed below:
Maximum: The predictions of the model with the greatest $R^2$ are selected.
Minimum: The models with the lowest RMSE and MAE metrics are selected and a weighted average is computed.
Average: An average of all models is made without taking into account their metric values.
Weighted average: A weighted average is made based on the $R^2$ score of each model.
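The four combination methods can be sketched as small functions operating on one prediction per model. This is a hedged illustration: the function names, the model scores and the predictions are hypothetical, and because the text does not specify the weights used by the Minimum method, equal weights are assumed here.

```python
def combine_maximum(predictions, r2_scores):
    """Maximum: keep the prediction of the model with the greatest R^2."""
    best = max(predictions, key=lambda name: r2_scores[name])
    return predictions[best]


def combine_minimum(predictions, rmse_scores, mae_scores, weights=(0.5, 0.5)):
    """Minimum: weighted average of the lowest-RMSE and lowest-MAE models.

    The equal default weights are an assumption of this sketch.
    """
    best_rmse = min(predictions, key=lambda name: rmse_scores[name])
    best_mae = min(predictions, key=lambda name: mae_scores[name])
    w_rmse, w_mae = weights
    total = w_rmse + w_mae
    return (w_rmse * predictions[best_rmse] + w_mae * predictions[best_mae]) / total


def combine_average(predictions):
    """Average: plain mean over all models, ignoring their metric values."""
    return sum(predictions.values()) / len(predictions)


def combine_weighted_average(predictions, r2_scores):
    """Weighted average: each model is weighted by its R^2 score."""
    total = sum(r2_scores[name] for name in predictions)
    return sum(predictions[name] * r2_scores[name] for name in predictions) / total


# Hypothetical predictions of the 14-day CI and validation scores per model.
predictions = {"AR": 200.0, "ARIMA": 210.0, "LSTM": 190.0, "GRU": 220.0}
r2_scores = {"AR": 0.9, "ARIMA": 0.8, "LSTM": 0.7, "GRU": 0.6}
rmse_scores = {"AR": 5.0, "ARIMA": 4.0, "LSTM": 6.0, "GRU": 7.0}
mae_scores = {"AR": 3.0, "ARIMA": 3.5, "LSTM": 2.5, "GRU": 4.0}

print("maximum         :", combine_maximum(predictions, r2_scores))
print("minimum         :", combine_minimum(predictions, rmse_scores, mae_scores))
print("average         :", combine_average(predictions))
print("weighted average:", combine_weighted_average(predictions, r2_scores))
```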
The proposed ensemble approach consists of the following steps. Figure 2 summarizes these steps.
1. Let $|E|$ be the training dataset and $|E|_v$ a validation dataset.
2. Each technique t is trained with the $|E|$ dataset, generating $P^t_{|E|}$ for each t.
3. For each technique t, the values $R^2$, RMSE and MAE are calculated using the predictions $P^t_{|E|}$ and the $|E|_v$ dataset.
4. Using the combination methods $|C|$, models whose predictions are effective are selected.
5. Depending on the combination method, the $P_{|E|_v}$ predictions are calculated by taking the data from the validation dataset $|E|_v$ as input.
6. The metrics $R^2$, RMSE and MAE are calculated with the predictions $P_{|E|_v}$, leaving the model built and ready to infer values.
7. Equation (4) is used to infer a new value i in the model, where $P_i^{MaxR_t}$ is the prediction for instance i provided by the model t with the maximum $R^2$; $P_i^{MinRMSE_t}$ is the prediction for instance i provided by the model t with the minimum RMSE; and $P_i^{MinMAE_t}$ is the prediction for instance i provided by the model t with the minimum MAE.
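The inference step can be sketched as follows: each model is scored on a validation set, the best model per metric is identified, and the new prediction is the mean of the three selected predictions, as stated for Eq. (4). The function names, model labels and all numbers are hypothetical; the training of the individual models is assumed to have happened elsewhere.

```python
import math


def validation_metrics(actual, predicted):
    """Return R^2 (squared correlation), RMSE and MAE for one model."""
    n = len(actual)
    mean_x = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((x - mean_x) * (p - mean_p) for x, p in zip(actual, predicted))
    var_x = sum((x - mean_x) ** 2 for x in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return {
        "r2": cov ** 2 / (var_x * var_p),
        "rmse": math.sqrt(sum((x - p) ** 2 for x, p in zip(actual, predicted)) / n),
        "mae": sum(abs(x - p) for x, p in zip(actual, predicted)) / n,
    }


def ensemble_predict(validation_actual, validation_preds, new_preds):
    """Mean of the predictions of the max-R^2, min-RMSE and min-MAE models."""
    scores = {name: validation_metrics(validation_actual, preds)
              for name, preds in validation_preds.items()}
    best_r2 = max(scores, key=lambda name: scores[name]["r2"])
    best_rmse = min(scores, key=lambda name: scores[name]["rmse"])
    best_mae = min(scores, key=lambda name: scores[name]["mae"])
    return (new_preds[best_r2] + new_preds[best_rmse] + new_preds[best_mae]) / 3.0


# Hypothetical validation data: each model wins on a different metric.
validation_actual = [100.0, 110.0, 120.0, 130.0]
validation_preds = {
    "model_a": [105.0, 115.0, 125.0, 135.0],  # perfectly correlated but biased
    "model_b": [100.0, 110.0, 120.0, 138.0],  # lowest MAE
    "model_c": [103.0, 107.0, 123.0, 127.0],  # lowest RMSE
}
new_preds = {"model_a": 150.0, "model_b": 141.0, "model_c": 144.0}

result = ensemble_predict(validation_actual, validation_preds, new_preds)
print("ensemble prediction:", result)
```

When one model is best on all three metrics, the same prediction enters the mean three times and the ensemble reduces to that single model.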
Measuring mobility for the multivariate model. Reducing mobility has been one of the main tools that governments worldwide are using to prevent the COVID-19 spread. Tracing infection from mobility data has been used from the very beginning of the COVID-19 outbreak. Kraemer et al.32,33 found that mobility statistics offered in open COVID-19 datasets showed the evolution of the COVID-19 spread in China, placing the contagion peak at the very beginning of 2020. Therefore, the measurement of mobility in different cities has been studied by different public and private organizations. Huang et al.54 showed that mobility patterns obtained from Twitter can quantitatively reflect the mobility dynamics.
Google mobility data (GMD) (https://www.google.com/covid19/mobility/) is a tool developed by Google to deal with COVID-19. It shows a set of aggregated and anonymized data obtained from information in products such as Google Maps55. These data are provided through local mobility reports which offer valuable information on changes in people’s mobility patterns as a consequence of the measures taken by the governments to deal with the COVID-19 pandemic. Among the information found in these reports, of particular interest to us are the movement trends of citizens over time. This information is arranged by geographical area and classified into various categories of places, such as workplaces, stores, supermarkets, leisure spaces, pharmacies, parks, transportation stations and residential areas. The main variables GMD provides are the following:
Retail and recreation: This variable shows mobility trends for places such as restaurants, cafes, museums, malls, cinemas and libraries.
Supermarket and pharmacy: This variable shows mobility trends for places such as supermarkets, food warehouses and pharmacies.
Parks: This variable shows mobility trends for places such as national parks, public beaches, plazas and public gardens.
Public transport: This variable shows mobility trends for places that are public transport hubs, such as train, subway or bus stations.
Workplaces: This variable shows mobility trends for places of work.
Residential: This variable shows mobility trends for places of residence.
The number provided by GMD is used to compare the mobility on the date of the report with the mobility on the day of the reference value. The data corresponding to the date of the report is calculated (if the information is available) and a positive or negative percentage is shown. The data shows how the number of visitors to (or time spent in) the categorized locations changes compared to the baseline. A baseline represents a normal value on that day of the week. The baseline is the average value for the 5-week period from January 3 to February 6, 2020. In each region-category, the baseline is not a single value, but 7 individual values, one per day of the week. Therefore, the same number of visitors on two different days of the week results in different percentage changes. It is important to note that baseline days never change. In the calculation of the reference values, seasonality has not been taken into account. For example, the number of people going to the parks usually increases as the weather improves.
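The baseline comparison described above can be illustrated with a small sketch. It assumes a hypothetical visit count per day and a hypothetical list of seven baseline values (one per day of the week); Google publishes only the resulting percentages, so the raw numbers below are invented for the illustration.

```python
from datetime import date


def percent_change_from_baseline(day, visits, weekday_baseline):
    """Relative change (in %) against the baseline for that day of the week.

    weekday_baseline holds 7 values indexed Monday=0 ... Sunday=6.
    """
    baseline = weekday_baseline[day.weekday()]
    return 100.0 * (visits - baseline) / baseline


# Hypothetical baselines for one region-category, Monday to Sunday.
weekday_baseline = [1000.0, 1000.0, 1000.0, 1000.0, 1200.0, 1600.0, 800.0]

friday = date(2020, 10, 9)
sunday = date(2020, 10, 11)

# The same number of visitors on two different days of the week
# yields different percentage changes.
print(percent_change_from_baseline(friday, 900.0, weekday_baseline))
print(percent_change_from_baseline(sunday, 900.0, weekday_baseline))
```

In this example 900 visitors represent a decrease on a Friday and an increase on a Sunday, which is the behaviour to keep in mind when mobility percentages from different weekdays are compared or fed into a model.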
$$P_i = \frac{P_i^{MaxR_t} + P_i^{MinRMSE_t} + P_i^{MinMAE_t}}{3} \tag{4}$$

A multivariate model including these variables is proposed to predict the 14-day CI. Our first approach was to explore a multivariate regression model which includes the ensemble information and additional information in the mobility variables as exogenous information. The multivariate equation is shown in Eq. (5).

$$CI_{14day} = \beta_0 + \beta_1(\mathrm{Ensemble}) + \beta_2\,GMD_2 + \beta_3\,GMD_3 + \cdots + \beta_i\,GMD_i \tag{5}$$

where the response variable is $CI_{14day}$, $\beta_0$ is the independent term, $\beta_1$ is the term that weights the values obtained by our ensemble, and $\beta_i$ is the term that weights the Google mobility variables ($GMD_i$, where i = 2, 3, 4, 5, 6, 7). GMD variables will be evaluated through the t-statistic to figure out if there is a significant relationship between the response variable (14-day CI) and each of the predictors included in the model (ensemble and mobility variables). If so, these variables will be included in the multivariate model.
It is important to note that the main assumptions of multivariate regression, such as a linear relationship between
the target variable and the independent variables, normality of all variables and lack of multicollinearity, are not
met in our case, as shown in the “Evaluation and results” section. Therefore, an operations research approach is
proposed to optimize the coefficients of our multivariate model in order to minimize the MAE. In particular, the
Non-Linear Minimization (NLM) procedure56 included in the R programming software, which carries out an
iterative minimization, is applied to look for optimal coefficients. This method requires a seed to initialize
the optimization of the coefficients, and three different starting values were analysed: (i) coefficients randomly
generated from a uniform distribution from −10 to 10, (ii) coefficients with the same weight for each of the
independent variables and (iii) coefficient estimates for the multivariate regression model described in Table 9.
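The role of the seed can be illustrated with a small, self-contained sketch. It is not the authors' code: the study uses the NLM procedure in R, whereas the sketch below swaps in a simple compass (pattern) search written in plain Python. The toy data, the helper names (`predict`, `mae`, `compass_search`) and the step-size settings are all hypothetical. It only shows how the same MAE objective can be minimized from the three kinds of starting values listed above.

```python
import random

def predict(coefs, row):
    """Linear model: coefs[0] + sum(coefs[j + 1] * row[j])."""
    return coefs[0] + sum(c * x for c, x in zip(coefs[1:], row))

def mae(coefs, X, y):
    """Mean absolute error of the linear model on (X, y)."""
    return sum(abs(predict(coefs, r) - t) for r, t in zip(X, y)) / len(y)

def compass_search(objective, seed, step=1.0, tol=1e-6, max_iter=2000):
    """Derivative-free compass search (a stand-in for R's nlm, not a port of it).

    Tries +/- step along each coordinate, keeps any improvement, and halves
    the step when no move helps. Returns (best_coefs, best_value, iterations).
    """
    best = list(seed)
    best_val = objective(best)
    iterations = 0
    while step > tol and iterations < max_iter:
        iterations += 1
        improved = False
        for j in range(len(best)):
            for delta in (step, -step):
                trial = best[:]
                trial[j] += delta
                val = objective(trial)
                if val < best_val:
                    best, best_val, improved = trial, val, True
        if not improved:
            step /= 2.0
    return best, best_val, iterations

# Hypothetical toy data: two predictors standing in for the ensemble output
# and one mobility variable; the target is built from known coefficients.
random.seed(0)
X = [[random.uniform(180, 230), random.uniform(-1, 1)] for _ in range(40)]
y = [5.0 + 0.95 * r[0] + 3.0 * r[1] + random.gauss(0, 1) for r in X]

n_coefs = 3  # intercept + two predictors
seeds = {
    "random U(-10, 10)": [random.uniform(-10, 10) for _ in range(n_coefs)],
    "equal weights": [0.0] + [1.0 / (n_coefs - 1)] * (n_coefs - 1),
    "regression-like guess": [0.0, 1.0, 0.0],  # placeholder for OLS estimates
}

results = {}
for name, seed in seeds.items():
    start = mae(seed, X, y)
    coefs, final, iters = compass_search(lambda c: mae(c, X, y), seed)
    results[name] = (start, final, iters)
    print(f"{name}: MAE {start:.2f} -> {final:.2f} in {iters} iterations")
```

Because the search only accepts moves that lower the objective, the final MAE can never exceed the MAE of the seed, but different seeds may stop at different local optima, which is why several starting values are worth comparing.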
Evaluation and results
This section presents the evaluation of our models for estimating the 14-day CI of COVID-19. First, the datasets
used to perform the experiments are explained. Next, the different univariate ML models and the ensemble approach
previously explained in the “Methods” section for the prediction of the 14-day CI are evaluated. The Google mobility
information is then statistically analysed and a PCA is performed to obtain exogenous information to be included in a
multivariate model. Finally, the multivariate model with this exogenous information is evaluated.
Benchmarking. This section summarizes the datasets used to carry out the experiments. As previously
mentioned, the evaluation is based on the data provided by the Spanish Ministry of Health. They provide
several variables for all Spanish regions (19 regions in total). Among them, we may highlight total cases in the last
24 h, 14-day cumulative incidence and 7-day cumulative incidence. The information is provided by the regional
governments, which report daily, except on weekends and holidays, to the Spanish Ministry of Health, which compiles
a report on the current COVID-19 situation in Spain. It is important to note that the information is updated
backwards when new notifications arrive from previous days, mainly due to delays, error detection, etc. Therefore,
we focus on the more stable notification period (i.e. 14 days) as it includes all previous notifications. In
particular, we focus on estimating the 14-day cumulative incidence, i.e. the number of new cases of COVID-19
during 14 days divided by the size of the population at the start of the period.
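As a worked illustration of this definition, the sketch below computes a rolling 14-day cumulative incidence from a list of daily new cases. The function name `cumulative_incidence`, the toy case counts and the population figure are hypothetical. The scaling to cases per 100,000 inhabitants is an assumption based on common reporting practice, since the definition above does not state the unit, and the population is treated as constant over the window.

```python
def cumulative_incidence(daily_cases, population, window=14, per=100_000):
    """Rolling cumulative incidence: cases in the last `window` days
    divided by the population, scaled to `per` inhabitants (assumed unit).

    Returns one value per day, starting at the first day with a full window.
    """
    if window <= 0 or population <= 0:
        raise ValueError("window and population must be positive")
    values = []
    for end in range(window, len(daily_cases) + 1):
        cases_in_window = sum(daily_cases[end - window:end])
        values.append(cases_in_window / population * per)
    return values

# Hypothetical example: 16 days of new cases in a region of 1,000,000 people.
daily_cases = [100] * 14 + [50, 50]
ci_14 = cumulative_incidence(daily_cases, population=1_000_000)
print(ci_14)
```

With 16 days of data and a 14-day window, the function returns three values, one for each day on which a full window is available.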
Of particular interest is the information from the surveillance system from July, when the way the
Spanish Ministry of Health develops the strategy of early detection, monitoring and control of COVID-19 changed. Since
then, the count of COVID-19 cases has been kept uniform, with slight changes and updates. Table 1 shows the
two different periods under study, which are translated into two different datasets. For each period, train and
test datasets have been designed to assess the different trend changes, as indicated in Table 1. In particular,
the first dataset (DS1) includes the information from July 20, 2020 to December 4, 2020. The second dataset
(DS2) includes the information from July 20, 2020 to December 18, 2020. In DS1, the models are trained with
the information up to and including November 29. Testing, however, is carried out using the data of the week
from November 30 to December 4. In DS2, the models are trained with the information up to and including December
4. The evaluation is carried out with the data from December 5 to December 18, both included.
It is important to note that the 14-day CI was decreasing in the DS1 test period (see Fig. 3). However, the
14-day CI was decreasing at the beginning of the DS2 test period but it suddenly started to increase from
December 11 onwards. Moreover, DS1 only includes 5 days to predict and DS2 includes 9 days.
Figure 1. Diagram of a GRU and LSTM unit, where $x_t$ represents the input and $y_t$ the forecast in a step ($y_{t-1}$
for the forecast in the previous step). For LSTM, $C_t$ indicates the state that is passed from one LSTM unit to
another.

Moreover, the metrics used for testing the performance of each model are the coefficient of determination
($R^2$), the root-mean-square error (RMSE) and the mean absolute error (MAE). All of them are calculated using
the scikit-learn metrics package57. The best possible score for the $R^2$ is 1.0. A constant model that always predicts
the expected value of y, regardless of the input features, would get an $R^2$ score of 0.0.
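The three metrics can be written out directly from their definitions. The sketch below is a plain-Python illustration rather than the scikit-learn implementation used in the study; the helper names and the toy values are hypothetical. It also checks the statement above that a constant model predicting the mean of y obtains an $R^2$ of 0.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root-mean-square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical observations and forecasts (not taken from the study).
y_true = [226.0, 216.0, 208.0, 202.0, 193.0]
y_pred = [225.0, 217.0, 205.0, 204.0, 196.0]

# A constant model that always predicts the mean of the observations.
constant = [sum(y_true) / len(y_true)] * len(y_true)

print("MAE :", mae(y_true, y_pred))
print("RMSE:", rmse(y_true, y_pred))
print("R2  :", r2(y_true, y_pred))
print("R2 of constant model:", r2(y_true, constant))
```

Note that $R^2$ defined this way can be negative when a model is worse than the constant predictor, so it is best read together with RMSE and MAE rather than on its own.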
The models obtained have been previously validated and tested using different configurations. For ARIMA-based
models, several (p, d, q) parameter sets were tested, including (1, 1, 1), (3, 1, 3), (6, 1, 6), (1, 2, 1), (3, 2,
3) and (6, 2, 6). For AR-based models, the best performing configurations were those with p = 1, 3 and 6. Finally,
Table 2 shows the configurations for the GRU and LSTM neural networks that were included in the evaluation. These
parameters were empirically determined after several experiments.
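Table 2 lists a delay sequence of 6 for the recurrent networks. Assuming this means that each training sample consists of the 6 previous observations with the next value as target (an interpretation, not something the text states explicitly), the supervised samples can be built with a sliding window. The helper `make_windows` and the toy series are hypothetical.

```python
def make_windows(series, delay=6):
    """Split a univariate series into (inputs, target) pairs, where each
    input holds `delay` consecutive values and the target is the next one."""
    if delay <= 0:
        raise ValueError("delay must be positive")
    samples = []
    for start in range(len(series) - delay):
        window = series[start:start + delay]
        target = series[start + delay]
        samples.append((window, target))
    return samples

# Toy series 0.0 ... 9.0 standing in for the daily 14-day CI values.
series = [float(v) for v in range(10)]
samples = make_windows(series, delay=6)
for window, target in samples:
    print(window, "->", target)
```

A series of n values yields n minus 6 samples under this scheme, which makes the shortage of training data for the neural networks easy to see for a series of roughly one hundred points.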
Finally, two well-known time series libraries have been included for comparison purposes, i.e. PROPHET58
and TPOT59. Prophet is a Python-based library developed by Facebook which, according to its authors, “aims
at forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and
daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several
seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers
well”. TPOT is also a Python-based automated ML tool that optimizes ML pipelines using genetic programming.
TPOT explores many configurations of models and pipelines to find the best one for the target data. The main
output of TPOT is the Python code for the best pipeline it has found for the data. These methods have been
successfully applied to COVID-19 prediction in different countries such as India, Brazil or the UK60,61.
14‑day CI estimation. Tables 3 and 4 show the $R^2$, RMSE and MAE scores for the different ML and statistical
models targeted in this study using the evaluation environment previously mentioned in the “Benchmarking”
section. Let us remind the reader that the main difference between both datasets is the test set. DS1 develops
the prediction in a shorter time series (i.e. 1 week) but with a stable trend (i.e. a decreasing time series).
DS2 develops the prediction in a longer time series (i.e. 2 weeks) but with an unstable trend (i.e. increasing and
decreasing time series).
Table3 shows the performance of those algorithms when they target the DS1 dataset. In general, articial
neural networks models do not work well for predicting 14-day CI. e dataset includes 1 data item per day,
which means a total of data for the largest dataset of up to 109 data items. erefore, there is not enough informa-
tion to train the articial neural network models for a good inference. However, statistical models perform very
well in general. e best performing model for the DS1 is the ARIMA with the parameter set up
p=3
,
d=1
,
q=3
, reaching up to 0.99
R2
score, with an RMSE of 4.48 and MAE of 3.90. ese results are slightly improved
with our ensemble approach, reaching up to 0.99
R2
, with an RMSE of 4.16 and MAE of 3.55. Figure4a shows
graphically the actual data and the prediction made by the ensemble for dataset 1.
Figure2. Outline of the proposed ensemble approach.
Table 1. Datasets for training and testing ML algorithms. ey include dierent periods with dierent spatio-
temporal characteristics.
Dataset name DS1 DS2
Training period July, 20–November, 29 July, 20–December, 4
Testing period November, 30–December, 4 December, 5–December, 18
Testing period trend Decreasing Decreasing–increasing
Table4 shows the performance of targeted models for the DS2 dataset. e results are signicantly worse than
those shown in the Table3. DS2 is more challenging as for the features previously commented (i.e. longer period
and unstable trend). Again, our ensemble approach achieves the best performance of all models but, in this case,
it only achieves up to 0.62
R2
score, with an RMSE score of 6.84 and MAE score of 5.49. It is important to note
that the ensemble approach takes the results of the AR(3) method as the other methods are signicantly worse
in terms of MAE and RMSE. Moreover, these tests revealed that the prediction of 14-day CI with only historical
information performs well for short periods and, above all, clearly marked tendencies. Change in trends due to
irregular components are very dicult to predict only using endogenous information and therefore, to improve
our forecast for this scenario, we propose the inclusion of an exogenous variable that allows the prediction of
these changes in tendency over long periods. Figure4 shows graphically the actual data and the prediction made
by the ensemble for dataset 2.
Exogeneity evaluation and multivariate model. The inclusion of exogenous variables into the multivariate
model requires a preliminary study of the relationship between the 14-day CI and the mobility variables.
For that purpose, Spearman’s correlation between 14-day CI and Google mobility variables has first been
calculated under different scenarios. Table 5 shows Spearman’s correlation between 14-day CI and different lags of
the mobility time series.
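A lagged Spearman correlation of this kind can be sketched in a few lines. The code below is a plain-Python illustration with hypothetical helper names and toy series, not the procedure used in the study. It ranks both series (ties share the mean of their ranks), computes the Pearson correlation of the ranks, and pairs each CI value with the mobility value observed `lag` days earlier.

```python
import math

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def lagged_spearman(ci, mobility, lag):
    """Spearman correlation between ci[t] and mobility[t - lag], lag >= 0."""
    if lag < 0 or lag >= len(ci):
        raise ValueError("lag out of range")
    ci_part = ci[lag:]
    mob_part = mobility[:len(mobility) - lag]
    return pearson(ranks(ci_part), ranks(mob_part))

# Toy series: CI is low whenever mobility five days earlier was high.
mobility = [float(v) for v in [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]]
ci = [0.0] * 5 + [100.0 - 2.0 * m for m in mobility[:-5]]

rho_lag5 = lagged_spearman(ci, mobility, lag=5)
print("Spearman correlation at lag 5:", rho_lag5)
```

In the toy data the relationship at a lag of 5 is perfectly monotone and decreasing, so the coefficient is at its negative extreme; real mobility series give intermediate values such as those in Table 5.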
e analysis in Table5 indicates that most mobility variables have a relevant correlation with 14-day CI,
especially retail and recreation, parks and public transport. Interestingly, leisure-related mobility variables, i.e.
retail and recreation and parks, have a negative correlation with CI while non-leisure mobility variables have a
Figure3. 14-day cumulative incidence (CI) in Spain. e evaluation dates are highlighted to let the reader
know the trend of 14-day CI at that period.
Table 2. Parameter setup for GRU and LSTM ANNs.
Parameter LSTM GRU
Number of input neurons 70 70
Batch size 32 32
Number of epochs 600 600
Learning factor 0.001 0.001
Optimizer Adam Adam
Activation function Hyperbolic tangent Hyperbolic tangent
Loss function Mean squared error Mean squared error
Delay sequence 6 6
Content courtesy of Springer Nature, terms of use apply. Rights reserved
9
Vol.:(0123456789)
Scientic Reports | (2021) 11:15173 | https://doi.org/10.1038/s41598-021-94696-2
www.nature.com/scientificreports/
positive correlation. Additionally, it is worth highlighting that the two situations are distinguished. If the correla-
tion between 14-day CI and a mobility variable (in absolute value) grows as the lags of the exogenous variable
increases, past values of the mobility variable have a more signicant association with current cumulative inci-
dence than recent ones. In contrast, if correlation decreases as the number of lags augments, the corresponding
mobility variable might be considered either not signicantly associated with 14-day CI or more signicantly
related with 14-day CI for recent values of the mobility variable. is underscores a pragmatic limitation of
univariate models, in that available exogenous variables cannot be used to forecast changes in 14-day CI curve
trend such as an uptick in new coronavirus cases.
Nevertheless, in practice, the establishment of causal statements between series of observations is not
straightforward. Our interest is to examine whether the mobility time series help to predict future values of 14-day CI,
controlling for lags. Table 6 reports Granger causality test outcomes for different lag orders, analysing whether
past values of mobility variables provide additional information about 14-day CI beyond past values of 14-day CI.
From the results in Table6, the eect of lags of mobility variables retail and recreation, parks and public
transport on 14-day CI is highly signicant whatever the number of lags is. e stationarity of the variables was
previously checked using the Augmented Dickey-Fuller test via the adf.test function in R. Bearing this in mind,
according to WHO, the incubation period of COVID-19 is on average 5–6 days but can be as long as 14 days,
lags have been considered varying from 5 to 14 days. However, it is important to note that too few lags can lead
to a biased test due to residual autocorrelation whereas with too many, null hypothesis might be incorrectly
rejected because of spurious correlation. erefore, the number of lags that need to be chosen reaching is a
tradeo between bias and power. en, it can be concluded that these three mobility variables are predictive of
future cumulative incidence gures.
Table 3. 14-day CI accuracy prediction for the first dataset. Training from July 20, 2020 to November 29,
2020; prediction from November 30 to December 4.
Model | $R^2$ score | RMSE score | MAE score
GRU | 0.96 | 92.90 | 91.49
LSTM | 0.86 | 109.91 | 108.72
AR (1) | > 0.99 | 37.82 | 33.01
AR (3) | 0.99 | 6.28 | 5.61
AR (6) | > 0.99 | 13.30 | 13.10
ARIMA (1, 1, 1) | > 0.99 | 10.67 | 10.54
ARIMA (3, 1, 3) | 0.99 | 4.48 | 3.90
ARIMA (6, 1, 6) | 0.99 | 4.96 | 3.72
ARIMA (1, 2, 1) | > 0.99 | 16.71 | 16.04
ARIMA (3, 2, 3) | > 0.99 | 7.96 | 7.86
ARIMA (6, 2, 6) | > 0.99 | 11.08 | 10.62
Ensemble approach | > 0.99 | 4.16 | 3.55
PROPHET | 0.99 | 39.54 | 36.89
TPOT | 0.99 | 30.94 | 28.37
Table 4. 14-day CI accuracy prediction for the second dataset. Training from July 20, 2020 to December 4,
2020; prediction from December 5 to December 18.
Model | $R^2$ score | RMSE score | MAE score
GRU | 0.59 | 15.16 | 11.43
LSTM | 0.65 | 27.18 | 25.03
AR (1) | 0.07 | 44.79 | 42.48
AR (3) | 0.62 | 6.84 | 5.49
AR (6) | 0.16 | 35.11 | 26.94
ARIMA (1, 1, 1) | 0.10 | 46.21 | 35.17
ARIMA (3, 1, 3) | 0.11 | 38.50 | 27.45
ARIMA (6, 1, 6) | 0.11 | 40.41 | 29.56
ARIMA (1, 2, 1) | 0.06 | 67.44 | 52.50
ARIMA (3, 2, 3) | 0.06 | 54.76 | 39.28
ARIMA (6, 2, 6) | 0.06 | 56.33 | 42.57
Ensemble approach | 0.62 | 6.84 | 5.49
PROPHET | 0.74 | 20.08 | 13.21
TPOT | 0.01 | 41.72 | 31.37
Reciprocally, Granger causality tests analysing whether 14-day CI values help to predict future values of mobility
variables have been run, and the corresponding p-values are shown in Table 7. According to these results, 14-day
CI is highly significant on retail and recreation for every lag order and, in general, for the rest of the mobility
variables from a lag length of 8. In other words, 14-day CI is predictive of mobility variables in a period of a week
from current values. This finding is consistent with the incubation period; however, these results should be
cautiously interpreted. An increase in new coronavirus cases is bound to force government intervention and the
application of measures aimed at restricting citizens’ mobility. Likewise, a decline of the 14-day CI curve would
lead to social relaxation, which would be translated into an increase in mobility.
As a result, reverse or bidirectional causation may be present in our problem. Therefore, we cannot conclude
that mobility variables potentially cause future values of 14-day CI. Moreover, government containment measures
in mobility, nightclubs or bars and other factors such as social alarm also involve changes in 14-day CI trends
and thus, there may be latent confounders that are correlated with 14-day CI underlying the true cause of the
evolution of new coronavirus cases. Hence, making a strong causal statement is hard; however, our intention
was less ambitious, aimed at shedding light on which mobility variables are useful for predicting 14-day CI.
Figure 4. 14-day CI accuracy prediction for both datasets.

Table 5. Spearman’s correlation between 14-day CI and Google mobility variables for different lags in the
mobility time series.
Lags | Retail and recreation | Supermarket and pharmacy | Parks | Public transport | Workplaces | Residential
0 | −0.42 | 0.28 | −0.59 | 0.38 | 0.23 | 0.32
−5 | −0.39 | 0.21 | −0.53 | 0.35 | 0.14 | 0.25
−6 | −0.38 | 0.22 | −0.52 | 0.36 | 0.14 | 0.24
−7 | −0.37 | 0.21 | −0.51 | 0.36 | 0.14 | 0.22
−8 | −0.35 | 0.21 | −0.50 | 0.37 | 0.14 | 0.21
−9 | −0.34 | 0.21 | −0.48 | 0.37 | 0.14 | 0.20
−10 | −0.32 | 0.22 | −0.47 | 0.37 | 0.14 | 0.19
−11 | −0.30 | 0.22 | −0.46 | 0.38 | 0.13 | 0.18
−12 | −0.28 | 0.22 | −0.44 | 0.39 | 0.13 | 0.17
−13 | −0.27 | 0.22 | −0.43 | 0.39 | 0.13 | 0.15
−14 | −0.25 | 0.23 | −0.42 | 0.40 | 0.13 | 0.13

Based on this preliminary study, the results obtained by our ensemble approach and the retail and recreation, parks
and public transport time series will be used hereafter as explanatory variables to develop a multivariate model
where 14-day CI is the response variable. Because the average incubation period of COVID-19 outlined by the
WHO lasts a minimum of 5 days, the selected mobility variables will be lagged by 5 periods. Furthermore,
Google mobility variables will be standardised and rescaled to the last three days of 14-day CI before predictions
are made in order to provide meaningful information to the model.
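The lagging and standardisation steps can be sketched as follows. The helper names and toy values are hypothetical, and the final rescaling to the last three days of 14-day CI is deliberately left out, because the text does not specify how it is carried out.

```python
import statistics

def lag_series(values, periods):
    """Shift a series forward by `periods` steps: position t holds the value
    observed at t - periods; the first `periods` positions are None."""
    if periods < 0:
        raise ValueError("periods must be non-negative")
    return [None] * periods + list(values[:len(values) - periods])

def standardise(values):
    """Z-score the non-missing entries (population standard deviation)."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    std = statistics.pstdev(observed)
    return [None if v is None else (v - mean) / std for v in values]

# Toy mobility series (percentage change from baseline), not real data.
mobility = [-40.0, -38.0, -35.0, -30.0, -28.0, -25.0, -27.0, -33.0]

lagged = lag_series(mobility, periods=5)
scaled = standardise(lagged)
print(lagged)
print(scaled)
```

A lag of 5 means that the first five days have no usable mobility value, so those rows would be dropped before fitting a model.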
Finally, a principal component analysis (PCA) is computed considering these variables. Table 8 indicates that
two components would preserve more than 87% of the total variance in the original data. In other words, two
components explain more than 87% of the information provided by the exogenous variables. Figure 5 graphically
illustrates that mobility variables are clearly differentiated from the ensemble approach in the PCA. Thus,
mobility variables would provide additional information to the proposed multivariate model.
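The proportions reported in Table 8 follow directly from the eigenvalues: each component explains its eigenvalue divided by the sum of all eigenvalues. The short check below uses the rounded eigenvalues printed in Table 8, so the percentages agree with the table only up to rounding; the variable names are illustrative.

```python
from itertools import accumulate

# Rounded eigenvalues as printed in Table 8.
eigenvalues = [2.91, 0.597, 0.391, 0.099]

total = sum(eigenvalues)
proportions = [100.0 * ev / total for ev in eigenvalues]
cumulative = list(accumulate(proportions))

for k, (p, c) in enumerate(zip(proportions, cumulative), start=1):
    print(f"component {k}: {p:.2f}% of variance, {c:.2f}% cumulative")
```

The eigenvalues sum to approximately 4, which is what one would expect if the PCA was run on four standardised variables, although the text does not state this explicitly.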
In particular, this paper includes an optimization model aimed at improving forecasts in the 14-day CI time
series which uses multivariate regression as a starting point. Table 9 shows the regression outcomes obtained for
the DS2 training period. The coefficient estimates and standard errors are calculated. The p-value corresponding
to the t-statistic of each coefficient indicates if there is a significant relationship between the response variable
(14-day CI) and each of the predictors included in the model (ensemble and mobility variables). Table 10 shows
the results obtained by the NLM method for the different seed values previously described in the “Methods” section,
i.e. the MAE and the number of iterations performed by the procedure in each case. It is important to highlight
that when the seed of NLM is the set of coefficients randomly generated from a uniform distribution from −10 to 10,
the NLM algorithm is executed 10 times and the MAE and number of iterations in Table 10 are calculated as
the average over the 10 simulation runs. As can be seen, the best result, a MAE of 3.77 after 36 iterations of the
algorithm, is achieved when the NLM procedure uses the multivariate regression model as seed.

Table 6. Granger causality testing mobility variables predictive of 14-day CI for different lag orders.
Lags | Retail and recreation | Supermarket and pharmacy | Parks | Public transport | Workplaces | Residential
5 | 0.03 | 0.72 | <0.01 | <0.01 | 0.52 | 0.16
6 | 0.01 | 0.66 | 0.01 | <0.01 | 0.17 | 0.22
7 | 0.01 | 0.70 | 0.02 | <0.01 | 0.18 | 0.28
8 | 0.03 | 0.61 | 0.08 | <0.01 | 0.17 | 0.37
9 | <0.01 | 0.49 | 0.17 | <0.01 | 0.13 | 0.14
10 | 0.02 | 0.78 | <0.01 | <0.01 | 0.19 | 0.30
11 | <0.01 | 0.32 | 0.01 | <0.01 | 0.31 | 0.32
12 | 0.01 | 0.35 | <0.01 | <0.01 | 0.35 | 0.29
13 | <0.01 | 0.15 | 0.01 | <0.01 | 0.19 | 0.19
14 | <0.01 | 0.04 | 0.01 | <0.01 | 0.21 | 0.01

Table 7. Granger causality testing 14-day CI predictive of mobility variables for different lag orders.
Lags | Retail and recreation | Supermarket and pharmacy | Parks | Public transport | Workplaces | Residential
5 | <0.01 | 0.20 | 0.38 | 0.26 | 0.05 | 0.25
6 | <0.01 | 0.35 | 0.10 | 0.17 | 0.31 | 0.08
7 | <0.01 | 0.01 | 0.21 | 0.03 | <0.01 | <0.01
8 | <0.01 | 0.02 | 0.04 | <0.01 | <0.01 | <0.01
9 | 0.01 | <0.01 | 0.12 | <0.01 | <0.01 | <0.01
10 | 0.01 | <0.01 | 0.13 | <0.01 | <0.01 | <0.01
11 | 0.03 | 0.01 | 0.12 | <0.01 | <0.01 | <0.01
12 | 0.05 | 0.01 | 0.16 | <0.01 | <0.01 | <0.01
13 | 0.02 | 0.02 | 0.17 | <0.01 | <0.01 | <0.01
14 | 0.03 | 0.04 | 0.30 | <0.01 | <0.01 | <0.01

Table 8. Eigenvalues and proportion of variance (i.e. information) explained by each component in the PCA.
Number of components | Eigenvalues | Proportion of variance (%) | Cumulative proportion (%)
1 | 2.91 | 72.83 | 72.83
2 | 0.597 | 14.93 | 87.76
3 | 0.391 | 9.77 | 97.52
4 | 0.099 | 2.48 | 100
Once the MAE has been minimized, Table 11 presents 14-day CI predictions for an evaluation period from
5th to 18th of December using the multivariate model with the optimal coefficient values obtained by NLM for
the minimum MAE. It is important to remark that if exogenous variables are not extended, 14-day CI forecasts
are restricted to a five-period prediction horizon. Nonetheless, forecasts in the evaluation period have been
obtained using the observed past values of the mobility variables. This approach might not be realistic, but the
purpose of the study is to validate the performance of the model using mobility data against other ML methods
that do not include this exogenous information. To assess the accuracy of the model, the mean absolute error is
measured and a comparison is made with regard to predictions given by the univariate strategy in the ensemble
approach. In addition, Fig. 6 shows the true 14-day CI curve and the ensemble approach and multivariate predicted
values throughout the forecast horizon. It is noteworthy that the multivariate model substantially outperforms
the ensemble approach. The results also suggest that both models produce reasonably good estimates, but the
multivariate model tracks changing trends in 14-day CI better.
To conclude, it is interesting to note that predictions made from 16th to 18th of December (labeled 12,
13, 14 in Fig. 5), when a new uptick in coronavirus infections and hospitalizations began, are located in the
exogenous area of the PCA plot, meaning that for these values mobility variables have a higher impact.
Again, these results show that exogenous variables offer valuable information to cope with trend changes in
the 14-day CI curve and justify the use of a multivariate model.
Figure 5. PCA of the ensemble approach and mobility variables (biplot of the variables ENSEMBLE, TRANSPORT, PARKS and
RETAIL and of observations 1 to 14; axes Dim1 (74%) and Dim2 (16%)). Positively correlated variables point to the same
side of the plot. Negatively correlated variables point to opposite sides of the graph.
Discussion
The use of a regression model entails the acceptance of assumptions that may be questionable at best in the
context of time series data. Methodologically, this approach is flawed mainly because accuracy may be seriously
affected in the presence of autocorrelation. Furthermore, difficulties in data collection due to discrepancies in
regional notifications and differences in the COVID-19 medical tests carried out are added to statistical problems,
which are compounded when data include measurement error. In view of the foregoing, this multivariate
technique cannot be used as an inference method. However, the use of an operations research optimization method
such as NLM, taking the regression coefficients as a seed, improves the solution obtained by the univariate
model. Evidently, this option has its own drawbacks, such as the problem of falling into local optima or the setting
of good initial values for the solver.
The ensemble approach rendered a smoother curve that could not detect trend changes. Indeed, the results
provided by the ensemble approach reinforce the need for monitoring models that can also detect changes in
trend with some foresight. Accordingly, despite the potential limitations mentioned above, the proposed
multivariate approach can be gainfully used for predicting possible upticks in COVID-19 cases, at least in the short
term. Therefore, the inclusion of the two models within a decision support system provides us with a positive
result that covers the different types of data behavior, both when the trend is constant and when the trend changes.
In this system, depending on the error produced by each model when introducing a new value to predict, either
the ensemble approach or the multivariate approach will be selected.
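One possible reading of this selection rule is sketched below: for each new prediction, the system keeps whichever model had the lower mean absolute error over the most recent observations. The functions `recent_mae` and `select_model`, the window length and the toy numbers are all hypothetical; the text does not specify how the error of each model is tracked.

```python
def recent_mae(actual, predicted, window):
    """MAE over the last `window` (actual, predicted) pairs."""
    pairs = list(zip(actual, predicted))[-window:]
    return sum(abs(a - p) for a, p in pairs) / len(pairs)

def select_model(actual, ensemble_pred, multivariate_pred, window=3):
    """Return the name of the model with the lower MAE over the last
    `window` observed days (ties go to the ensemble)."""
    err_ens = recent_mae(actual, ensemble_pred, window)
    err_mv = recent_mae(actual, multivariate_pred, window)
    return "ensemble" if err_ens <= err_mv else "multivariate"

# Toy history: a stable decreasing trend first, then a trend change that
# only the multivariate forecasts follow.
actual       = [200.0, 195.0, 190.0, 192.0, 197.0, 203.0]
ensemble     = [201.0, 195.5, 190.5, 186.0, 182.0, 178.0]
multivariate = [203.0, 197.0, 192.0, 191.0, 196.0, 201.0]

stable_choice = select_model(actual[:3], ensemble[:3], multivariate[:3])
change_choice = select_model(actual, ensemble, multivariate)
print("stable period :", stable_choice)
print("after change  :", change_choice)
```

A short window makes the system switch quickly after a trend change, at the cost of reacting to noise; the appropriate length would have to be tuned on real data.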
Table 9. Multivariate regression for DS2 training period. $R^2 = 0.79$, p-value < 0.01.
Coefficients | Estimate | Std. Error | p-value
β0 (Independent) | −110.59 | 259.76 | 0.68
β1 (Ensemble) | 1.31 | 0.26 | <0.01
β2 (Retail and recreation) | 1.00 | 0.59 | 0.13
β3 (Parks) | −0.20 | 1.29 | 0.88
β4 (Public transport) | −0.60 | 0.63 | 0.37

Table 10. MAE achieved and iterations performed by NLM procedure using different seeds.
Seed | Avg. of 10 random runs | Weighted equally | Multivariate regression model
MAE | 4.66 | 4.06 | 3.77
NLM iterations | 50 | 46 | 36

Table 11. 14-day CI accuracy prediction for ensemble approach (EA) and NLM method (NLM). Training
from July 20, 2020 to December 4, 2020; prediction from December 5 to December 18.
Date | 14-day CI | CI Ensemble | CI NLM | MAE_EA | MAE_NLM
December 5 | 226.39 | 226.08 | 225.10 | 3.14 | 0.31
December 6 | 216.07 | 216.28 | 214.58 | 1.83 | 0.26
December 7 | 207.52 | 202.21 | 204.94 | 2.46 | 1.94
December 8 | 201.59 | 205.76 | 204.93 | 3.18 | 2.50
December 9 | 193.26 | 205.11 | 202.78 | 4.62 | 4.37
December 10 | 188.72 | 197.11 | 197.34 | 5.92 | 5.04
December 11 | 189.56 | 197.94 | 195.49 | 6.48 | 5.52
December 12 | 194.19 | 194.19 | 193.76 | 6.19 | 4.83
December 13 | 196.61 | 193.09 | 191.53 | 5.64 | 4.68
December 14 | 193.65 | 190.11 | 188.13 | 5.50 | 4.57
December 15 | 198.77 | 198.64 | 195.77 | 5.04 | 4.16
December 16 | 201.16 | 202.87 | 201.91 | 4.79 | 3.96
December 17 | 207.26 | 201.91 | 202.32 | 4.96 | 4.07
December 18 | 214.12 | 214.11 | 210.12 | 5.49 | 3.78

Conclusions and future work
COVID-19 has caused one of the biggest crises in our recent history. Most countries have developed monitoring
systems based on pandemic evolution indicators to trigger social distancing measures whenever significant
increases in infections are detected. Data analysis can help forecast the short- and medium-term evolution of the
pandemic and thus help policymakers in their decision making. In this paper, we have analysed the evolution
of the 14-day cumulative incidence in Spain from the beginning of the second wave of COVID-19 until January
2021, where several trend changes (also called waves) occurred. We have proposed a set of statistical and ML
models to achieve maximum performance, reaching very good results for short and stable periods. However,
the 14-day CI is affected by irregular components, which are very challenging for traditional models
using only historical information. Therefore, the mobility data provided by Google as a consequence of the
COVID-19 outbreak are fed into our models as exogenous information to predict these irregular components.
Our results reveal that this information improves the prediction of this unstable scenario, providing an MAE
of up to 1.08 on average.
Data fusion between socio-economic and endogenous variables is still at a relatively early stage, and we
acknowledge that we have tested a relatively simple variant of a multivariate model. But, with many other types
of multivariate models and data such as vaccination figures yet to be explored, this field seems to offer a
promising and potentially fruitful area of research. Moreover, this approach can be followed at the international level
to predict changes in trends and coordinate the pandemic response globally.
Received: 28 April 2021; Accepted: 9 July 2021
References
1. Cecilia, J. M., Cano, J.-C., Hernández-Orallo, E., Calafate, C. T. & Manzoni, P. Mobile crowdsensing approaches to address the
covid-19 pandemic in Spain. IET Smart Cities 2, 58–63 (2020).
2. Kissler, S. M., Tedijanto, C., Goldstein, E., Grad, Y. H. & Lipsitch, M. Projecting the transmission dynamics of sars-cov-2 through
the postpandemic period. Science 368, 860–868 (2020).
3. Bonaccorsi, G. et al. Economic and social consequences of human mobility restrictions under covid-19. Proc. Natl. Acad. Sci. 117,
15530–15535 (2020).
4. OECD & Staff, O. OECD Economic Outlook, vol. 2020 (OECD Publishing, 2020).
5. Organization, W. H. et al. Critical Preparedness, Readiness and Response Actions for Covid-19: Interim Guidance, 4 Nov 2020,
World Health Organization, Technical Report (2020).
6. Organization, W. H. et al. Public Health Surveillance for Covid-19: Interim Guidance, 16 Dec 2020, World Health Organization,
Technical Report (2020).
Figure 6. 14-day CI accuracy prediction for different estimated models (x-axis: days; y-axis: 14-day CI; series: 14-day CI,
CI Ensemble, CI Multivariate).
7. Han, E. et al. Lessons learnt from easing covid-19 restrictions: An analysis of countries and regions in Asia Pacic and Europe.
Lancet (2020).
8. Z aki, N. & Mohamed, E. A. e estimations of the covid-19 incubation period: A scoping reviews of the literature. J. Infect. Public
Health 14, 638–646 (2021).
9. Zoabi, Y., Deri-Rozov, S. & Shomron, N. Machine learning-based prediction of covid-19 diagnosis based on symptoms. NPJ Dig.
Med. 4, 1–5 (2021).
10. Hellewell, J. et al. Feasibility of controlling covid-19 outbreaks by isolation of cases and contacts. Lancet Global Health 8, e488–e496
(2020).
11. Ferretti, L. et al. Quantifying sars-cov-2 transmission suggests epidemic control with digital contact tracing. Science (2020).
12. Flaxman, S. et al. Estimating the eects of non-pharmaceutical interventions on covid-19 in Europe. Nature 584, 257–261 (2020).
13. Estrada, E. Covid-19 and sars-cov-2. Modeling the present, looking at the future. Phys.Rep. 869, 1–51 (2020).
14. Maier, B. F. & Brockmann, D. Eective containment explains subexponential growth in recent conrmed covid-19 cases in china.
Science 368, 742–746 (2020).
15. Wong, G. N. et al. Modeling covid-19 dynamics in illinois under nonpharmaceutical interventions. Phys. Rev. X 10, 041033 (2020).
16. Hernández-Orallo, E., Manzoni, P., Calafate, C. T. & Cano, J. Evaluating how smartphone contact tracing technology can reduce
the spread of infectious diseases: e case of covid-19. IEEE Access 8, 99083–99097 (2020).
17. Hernández-Orallo, E., Manzoni, P., Calafate, C. T. & Cano, J. Evaluating the eectiveness of covid-19 bluetooth-based smartphone
contact tracing applications. Appl. Sci. 10, 7113 (2020).
18. Khakharia, A. et al. Outbreak prediction of covid-19 for dense and populated countries using machine learning. Ann. Data Sci. 8,
1–19 (2021).
19. Lalmuanawma, S., Hussain, J. & Chhakchhuak, L. Applications of machine learning and articial intelligence for covid-19 (sars-
cov-2) pandemic: A review. Chaos Solitons Fractals 139, 110059 (2020).
20. Rustam, F. et al. Covid-19 future forecasting using supervised machine learning models. IEEE Access 8, 101489–101499 (2020).
21. Chimmula, V. K. R . & Zhang, L. Time series forecasting of covid-19 transmission in Canada using ISTM networks. Chaos Solitons
Fractals 135, 109864 (2020).
22. Hernandez-Matamoros, A., Fujita, H., Hayashi, T. & Perez-Meana, H. Forecasting of covid19 per regions using arima models and
polynomial functions. Appl. So Comput. 96, 106610 (2020).
23. Benvenuto, D., Giovanetti, M., Vassallo, L., Angeletti, S. & Ciccozzi, M. Application of the Arima model on the covid-2019 epidemic
dataset. Data Brief 29, 105340 (2020).
24. Perone, G. An arima model to forecast the spread and the nal size of covid-2019 epidemic in italy. medRxiv (2020).
25. Sahai, A. K., Rath, N., So od, V. & Singh, M. P. Arima modelling and forecasting of covid-19 in top ve aected countries. Diabetes
Metab. Syndr. 14, 1419–1427 (2020).
26. Castro, M., Ares, S., Cuesta, J. A. & Manrubia, S. e turning point and end of an expanding epidemic cannot be precisely forecast.
Proc. Natl. Acad. Sci. 117, 26190–26196 (2020).
27. Petropoulos, F., Makridakis, S. & Stylianou, N. Covid-19: Forecasting confirmed cases and deaths with a simple time series model.
Int. J. Forecast. (2020).
28. Shahid, F., Zameer, A. & Muneeb, M. Predictions for covid-19 with deep learning models of LSTM, GRU and Bi-LSTM. Chaos
Solitons Fractals 140, 110212 (2020).
29. Zeroual, A., Harrou, F., Dairi, A. & Sun, Y. Deep learning methods for forecasting covid-19 time-series data: A comparative study.
Chaos Solitons Fractals 140, 110212 (2020).
30. Ribeiro, M. H. D. M., da Silva, R. G., Mariani, V. C. & dos Santos Coelho, L. Short-term forecasting covid-19 cumulative confirmed
cases: Perspectives for Brazil. Chaos Solitons Fractals 135, 109853 (2020).
31. Linka, K., Peirlinck, M. & Kuhl, E. The reproduction number of covid-19 and its correlation with public health interventions.
Comput. Mech. 66, 1035–1050 (2020).
32. Kraemer, M. U. et al. The effect of human mobility and control measures on the covid-19 epidemic in China. Science 368, 493–497
(2020).
33. Buckee, C. O. et al. Aggregated mobility data could help fight covid-19. Sci. (N. Y.) 368, 145 (2020).
34. Hernando, A., Mateo, D., Bayer, J. & Barrios, I. Radius of gyration as predictor of covid-19 deaths trend with three-weeks offset.
medRxiv (2021).
35. Gonzalez, M. C., Hidalgo, C. A. & Barabasi, A.-L. Understanding individual human mobility patterns. Nature 453, 779–782 (2008).
36. Cot, C., Cacciapaglia, G. & Sannino, F. Mining google and apple mobility data: Temporal anatomy for covid-19 social distancing.
Sci. Rep. 11, 4150 (2021).
37. Nouvellet, P. et al. Reduction in mobility and covid-19 transmission. Nat. Commun. 12, 1090 (2021).
38. Kraemer, M. U. G. et al. Data curation during a pandemic and lessons learned from covid-19. Nat. Comput. Sci. 1, 9–10 (2021).
39. Palit, A. K. & Popovic, D. Computational Intelligence in Time Series Forecasting: Theory and Engineering Applications (Springer
Science & Business Media, 2006).
40. Guillén-Navarro, M. A. et al. A decision support system for water optimization in anti-frost techniques by sprinklers. Sensors 20,
7129 (2020).
41. Tavenard, R. et al. Tslearn, a machine learning toolkit for time series data. J. Mach. Learn. Res. 21, 1–6 (2020).
42. deSanidad, M. Plan de respuesta temprana en un escenario de control de la pandemia por COVID-19 (Gobierno de España, 2020).
43. Cot, C., Cacciapaglia, G. & Sannino, F. Mining google and apple mobility data: Temporal anatomy for covid-19 social distancing.
Sci. Rep. 11, 1–8 (2021).
44. Nagelkerke, N. J. et al. A note on a general definition of the coefficient of determination. Biometrika 78, 691–692 (1991).
45. Chai, T. & Draxler, R. R. Root mean square error (RMSE) or mean absolute error (MAE). Geosci. Model Dev. Discuss. 7, 1525–1534
(2014).
46. Spearman, C. The proof and measurement of association between two things. (1961).
47. Granger, C. W. Investigating causal relations by econometric models and cross-spectral methods. Econometrica J. Econ. Soc. 424–438
(1969).
48. Jollie, I. Principal component analysis. Technometrics 45, 276 (2003).
49. Mills, T. C. & Mills, T. C. Time Series Techniques for Economists (Cambridge University Press, 1991).
50. Box, G. E., Jenkins, G. M., Reinsel, G. C. & Ljung, G. M. Time Series Analysis: Forecasting and Control (John Wiley & Sons, 2015).
51. Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
52. Cho, K. et al. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint
arXiv:1406.1078 (2014).
53. Hoffmann, F., Bertram, T., Mikut, R., Reischl, M. & Nelles, O. Benchmarking in classification and regression. Wiley Interdiscip.
Rev. Data Mining Knowl. Discov. 9, e1318 (2019).
54. Huang, X., Li, Z., Jiang, Y., Li, X. & Porter, D. Twitter reveals human mobility dynamics during the covid-19 pandemic. PloS ONE
15, e0241957 (2020).
55. Yilmazkuday, H. Stay-at-home works to fight against covid-19: International evidence from google mobility data. J. Human Behav.
Soc. Environ. 31, 1–11 (2020).
56. Schnabel, R. B., Koontz, J. E. & Weiss, B. E. A modular system of algorithms for unconstrained minimization. ACM Trans. Math.
Softw. (TOMS) 11, 419–440 (1985).
57. Pedregosa, F. et al. Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
58. Taylor, S. J. & Letham, B. Forecasting at scale. Am. Stat. 72, 37–45 (2018).
59. Le, T. T., Fu, W. & Moore, J. H. Scaling tree-based automated machine learning to biomedical big data with a feature set selector.
Bioinformatics 36, 250–256 (2020).
60. Indhuja, M. & Sindhuja, P. Prediction of covid-19 cases in India using prophet. Int. J. Stat. Appl. Math. 5, 103–106 (2020).
61. Han, T., Gois, F. N. B., Oliveira, R., Prates, L. R. & de Almeida Porto, M. M. Modeling the progression of covid-19 deaths using
kalman filter and automl. Soft Comput. 1–16 (2021).
Acknowledgements
This work has been partially supported by the Spanish Ministry of Science and Innovation, under Grants
RYC2018-025580-I, RTI2018-096384-B-I00, RTC-2017-6389-5 and RTC2019-007159-5, by the Fundación
Séneca del Centro de Coordinación de la Investigación de la Región de Murcia under Project 20813/PI/18, by
the “Conselleria de Educación, Investigación, Cultura y Deporte, Direcció General de Ciéncia i Investigació,
Proyectos AICO/2020”, Spain, under Grant AICO/2020/302 and a predoctoral contract by the Generalitat Valenciana
and the European Social Fund under Grant ACIF/2018/219.
Author contributions
Conceptualization, S.G.C. and J.L.E.; methodology, S.G.C. and J.L.E.; software, J.M.G., R.M.E., A.B.C. and E.H.O.;
validation, S.G.C., J.L.E., R.M.E. and J.M.C.; formal analysis, S.G.C., R.H.S., R.M.E., J.L.E. and A.B.C.; investigation,
S.G.C., R.H.S., J.L.E., R.M.E., A.B.C. and E.H.O.; resources, S.G.C. and J.M.G.; data curation, S.G.C., J.M.G., R.H.S.
and R.M.E.; writing (original draft preparation), S.G.C. and J.M.C.; writing (review and editing), J.M.G. and E.H.O.;
visualization, J.M.G., R.H.S. and A.B.C.; supervision, J.L.E. and J.M.C.; funding acquisition, J.M.C. All authors have
read and agreed to the published version of the manuscript.
Competing interests
The authors declare no competing interests.
Additional information
Correspondence and requests for materials should be addressed to J.M.C.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional aliations.
Open Access is article is licensed under a Creative Commons Attribution 4.0 International
License, which permits use, sharing, adaptation, distribution and reproduction in any medium or
format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the
Creative Commons licence, and indicate if changes were made. e images or other third party material in this
article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the
material. If material is not included in the article’s Creative Commons licence and your intended use is not
permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.
© e Author(s) 2021
... With these efforts, nonpharmaceutical interventions (such as national lockdowns) have been evaluated for their effectiveness and socioeconomic impact on different groups [9][10][11], models have been developed to predict disease spatial diffusion [12,13], and scenarios have been modeled to assess their outcomes [14][15][16][17]. Studies have demonstrated that mobility data are a meaningful proxy measure of social distancing [18], affect viral spreading [19,20], and are useful for predicting the spread of COVID-19 [21][22][23]. In particular, to control the spread of new cases and plan efficiently for hospital needs and capacities during an epidemic, public health decision-makers require accurate predictions of future case numbers [7]. ...
... Their study demonstrated that public mobility data can be used to develop reduced-form and simple models that mimic the behavior of more sophisticated epidemiological models for predicting COVID-19 cases on a 10-day basis [21]. Another study examined several state-of-the-art machine learning models and statistical methods and demonstrated how mobility data can improve prediction trends when used as exogenous information in models [22]. ...
Article
Full-text available
In light of the outbreak of COVID-19, analyzing and measuring human mobility has become increasingly important. A wide range of studies have explored spatiotemporal trends over time, examined associations with other variables, evaluated non-pharmacologic interventions (NPIs), and predicted or simulated COVID-19 spread using mobility data. Despite the benefits of publicly available mobility data, a key question remains unanswered: are models using mobility data performing equitably across demographic groups? We hypothesize that bias in the mobility data used to train the predictive models might lead to unfairly less accurate predictions for certain demographic groups. To test our hypothesis, we applied two mobility-based COVID infection prediction models at the county level in the United States using SafeGraph data, and correlated model performance with sociodemographic traits. Findings revealed that there is a systematic bias in models’ performance toward certain demographic characteristics. Specifically, the models tend to favor large, highly educated, wealthy, young, and urban counties. We hypothesize that the mobility data currently used by many predictive models tends to capture less information about older, poorer, less educated and people from rural regions, which in turn negatively impacts the accuracy of the COVID-19 prediction in these areas. Ultimately, this study points to the need of improved data collection and sampling approaches that allow for an accurate representation of the mobility patterns across demographic groups.
... A review of the ARIMA model literature through February 4, 2023, suggests that only human mobility has been used to predict COVID-19 transmission before universal vaccination implementation in such models [42][43][44][45][46], and that vaccination was included as a predictor in only one study [47]. With increasing vaccination coverage and SARS-CoV-2 evolution, a comprehensive set of factors should be examined and vaccination and other protective behaviors should be included as predictors to improve dynamic forecasting. ...
... The COVID-19 case growth rate, rather than the number of new cases per million, was selected because it better reflects epidemic trends and meets the stationary condition for time-dependent trends, thereby enabling more accurate prediction [42,43]. Furthermore, the variance of the series is stabilized by this taking of a logarithmic approach [63]. ...
Article
Full-text available
Background Mathematical and statistical models are used to predict trends in epidemic spread and determine the effectiveness of control measures. Automatic regressive integrated moving average (ARIMA) models are used for time-series forecasting, but only few models of the 2019 coronavirus disease (COVID-19) pandemic have incorporated protective behaviors or vaccination, known to be effective for pandemic control. Methods To improve the accuracy of prediction, we applied newly developed ARIMA models with predictors (mask wearing, avoiding going out, and vaccination) to forecast weekly COVID-19 case growth rates in Canada, France, Italy, and Israel between January 2021 and March 2022. The open-source data was sourced from the YouGov survey and Our World in Data. Prediction performance was evaluated using the root mean square error (RMSE) and the corrected Akaike information criterion (AICc). Results A model with mask wearing and vaccination variables performed best for the pandemic period in which the Alpha and Delta viral variants were predominant (before November 2021). A model using only past case growth rates as autoregressive predictors performed best for the Omicron period (after December 2021). The models suggested that protective behaviors and vaccination are associated with the reduction of COVID-19 case growth rates, with booster vaccine coverage playing a particularly vital role during the Omicron period. For example, each unit increase in mask wearing and avoiding going out significantly reduced the case growth rate during the Alpha/Delta period in Canada (–0.81 and –0.54, respectively; both p < 0.05). In the Omicron period, each unit increase in the number of booster doses resulted in a significant reduction of the case growth rate in Canada (–0.03), Israel (–0.12), Italy (–0.02), and France (–0.03); all p < 0.05. 
Conclusions The key findings of this study are incorporating behavior and vaccination as predictors led to accurate predictions and highlighted their significant role in controlling the pandemic. These models are easily interpretable and can be embedded in a “real-time” schedule with weekly data updates. They can support timely decision making about policies to control dynamically changing epidemics.
... Interestingly, too lenient measures stimulated a sizeable proportion of the population to spontaneously engage in health-protective behavior such as avoiding social contacts 21 and public transportation. 22 These spontaneous selfrestrictions apparently reduced feelings of uncertainty and ambiguity that were provoked by inadequate governmental interventions to contain the (perceived) health threat. 23 Present Study Although prior studies have mainly focused on relating psychological variables to either the epidemiological situation 1 or to stringency of the measures, 24 the above description suggests that their impact is not one-sidedly positive or negative. ...
Article
Background: The stringency of the measures taken by governments to combat the COVID-19 pandemic varied considerably across countries and time. In the present study, we examined how the proportionality to the epidemiological situation is related to citizens’behavior, motivation and mental health. Methods: Across 421 days between March 2020 and March 2022, 273,722 Belgian participants (Mage = 49.47; 63.9% female; 33% single) completed an online questionnaire. Multiple linear mixed regression modeling was used to examine the interaction between the epidemiological situation, as indicated by the actual hospitalization numbers, and the stringency index to predict day-to-day variation in the variables of interest. Results: Systematic evidence emerged showing that disproportional situations, as opposed to proportional situations, were associated with a clear pattern of maladaptive outcomes. Specifically, when either strict or lenient measures were disproportional in relation to the epidemiological situation, people reported lower autonomous motivation, more controlled motivation and amotivation, less adherence to sanitary rules, higher perceived risk of infection, lower need satisfaction, and higher anxiety and depressive symptoms. Perceived risk severity especially covaried with the stringency of the measures. At the absolute level, citizens reported the highest need satisfaction and mental health during days with proportional lenient measures. Conclusion: Stringent measures are not per se demotivating or compromising of people’s well-being, nor are lenient measures as such motivating or enhancing well-being. Only proportional measures, that is, measures with a level of stringency that is aligned with the actual epidemiological situation, are associated with the greatest motivational, behavioral, and mental health benefits.
... Previous research on the topic of enhancing the accuracy of COVID-19 prediction by integrating epidemiological and mobility data (García-Cremades et al., 2021) aimed to assess various models that can be used to make early predictions about the progression of the COVID-19 outbreak, with the ultimate goal of developing a decision support system to aid policy-makers. For example, spatiotemporal disease models and graph neural networks were integrated to enhance the forecasting accuracy of weekly COVID-19 cases in Germany (Fritz et al., 2022). ...
Article
Full-text available
Research background: The COVID-19 pandemic has caused unprecedented disruptions to the global tourism industry, resulting in significant impacts on both human and economic activities. Travel restrictions, border closures, and quarantine measures have led to a sharp decline in tourism demand, causing businesses to shut down, jobs to be lost, and economies to suffer. Purpose of the article: This study aims to examine the correlation and causal relationship between real-time mobility data and statistical data on tourism, specifically tourism overnights, across eleven European countries during the first 14 months of the pandemic. We analyzed the short longitudinal connections between two dimensions of tourism and related activities. Methods: Our method is to use Google and Apple's observational data to link with tourism statistical data, enabling the development of early predictive models and econometric models for tourism overnights (or other tourism indices). This approach leverages the more timely and more reliable mobility data from Google and Apple, which is published with less delay than tourism statistical data. Findings & value added: Our findings indicate statistically significant correlations between specific mobility dimensions, such as recreation and retail, parks, and tourism statistical data, but poor or insignificant relations with workplace and transit dimensions. We have identified that leisure and recreation have a much stronger influence on tourism than the domestic and routine-named dimensions. Additionally, our neural network analysis revealed that Google Mobility Parks and Google Mobility Retail & Recreation are the best predictors for tourism, while Apple Driving and Apple Walking also show significant correlations with tourism data. 
The main added value of our research is that it combines observational data with statistical data, demonstrates that Google and Apple location data can be used to model tourism phenomena, and identifies specific methods to determine the extent, direction, and intensity of the relationship between mobility and tourism flows
... Although literature has documented the direct relationship between mobility and number of cases using simulation studies and analysis of aggregated mobility data [36][37][38], community quarantine alone was shown to be ineffective in curtailing the spread of infection and brought devastating economic and societal impact on different populations [39]. As the governments design their respective lockdown exit strategy, it is crucial to maintain the awareness of the population about the persistence of the threat due to the virus and sustain its interest in disease prevention and control measures [40]. ...
Article
Full-text available
Background: Traditional surveillance systems rely on routine collection of data. The inherent delay in retrieval and analysis of data leads to reactionary rather than preventive measures. Forecasting and analysis of behavior-related data can supplement the information from traditional surveillance systems. Objective: We assessed the use of behavioral indicators, such as the general public's interest in the risk of contracting SARS-CoV-2 and changes in their mobility, in building a vector autoregression model for forecasting and analysis of the relationships of these indicators with the number of COVID-19 cases in the National Capital Region. Methods: An etiologic, time-trend, ecologic study design was used to forecast the daily number of cases in 3 periods during the resurgence of COVID-19. We determined the lag length by combining knowledge on the epidemiology of SARS-CoV-2 and information criteria measures. We fitted 2 models to the training data set and computed their out-of-sample forecasts. Model 1 contains changes in mobility and number of cases with a dummy variable for the day of the week, while model 2 also includes the general public's interest. The forecast accuracy of the models was compared using mean absolute percentage error. Granger causality test was performed to determine whether changes in mobility and public's interest improved the prediction of cases. We tested the assumptions of the model through the Augmented Dickey-Fuller test, Lagrange multiplier test, and assessment of the moduli of eigenvalues. Results: A vector autoregression (8) model was fitted to the training data as the information criteria measures suggest the appropriateness of 8. Both models generated forecasts with similar trends to the actual number of cases during the forecast period of August 11-18 and September 15-22. 
However, the difference in the performance of the 2 models became substantial from January 28 to February 4, as the accuracy of model 2 remained within reasonable limits (mean absolute percentage error [MAPE]=21.4%) while model 1 became inaccurate (MAPE=74.2%). The results of the Granger causality test suggest that the relationship of public interest with number of cases changed over time. During the forecast period of August 11-18, only change in mobility (P=.002) improved the forecasting of cases, while public interest was also found to Granger-cause the number of cases during September 15-22 (P=.001) and January 28 to February 4 (P=.003). Conclusions: To the best of our knowledge, this is the first study that forecasted the number of COVID-19 cases and explored the relationship of behavioral indicators with the number of COVID-19 cases in the Philippines. The resemblance of the forecasts from model 2 with the actual data suggests its potential in providing information about future contingencies. Granger causality also implies the importance of examining changes in mobility and public interest for surveillance purposes.
... In [17], researchers found a consistent pattern of a sharp reduction in deaths after mobility is reduced. Other groups implemented prediction models [19][20][21][22][23][24][25] to estimate the effects of mobility reduction and predict the number of cases and deaths. These models were implemented with varying levels of complexity; for instance, [19,20] added additional variables, including (in [19]) meteorological variables, such as temperature, humidity, and rainfall, along with the correlation between mobility and COVID-19 case counts. ...
Article
Full-text available
Human mobility plays an important role in the spread of COVID-19. Given this knowledge, countries implemented mobility-restricting policies. Concomitantly, as the pandemic progressed, population resistance to the virus increased via natural immunity and vaccination. We address the question: “What is the impact of mobility-restricting measures on a resistant population?” We consider two factors: different types of points of interest (POIs)—including transit stations, groceries and pharmacies, retail and recreation, workplaces, and parks—and the emergence of the Delta variant. We studied a group of 14 countries and estimated COVID-19 transmission based on the type of POI, the fraction of population resistance, and the presence of the Delta variant using a Pearson correlation between mobility and the growth rate of cases. We find that retail and recreation venues, transit stations, and workplaces are the POIs that benefit the most from mobility restrictions, mainly if the fraction of the population with resistance is below 25–30%. Groceries and pharmacies may benefit from mobility restrictions when the population resistance fraction is low, whereas in parks, there is little advantage to mobility-restricting measures. These results are consistent for both the original strain and the Delta variant; Omicron data were not included in this work.
... 03/01/2020-06/02/2020). Google data were widely used in previous studies to evaluate the reduction in the movement of people during the COVID-19 pandemic (Siqueira et al., 2020), and the prediction of COVID-19 spread was improved by fusing epidemiological and mobility data (Al Zobbi et al., 2020;García-Cremades et al., 2021;Sulyok and Walker, 2020). ...
Article
Many countries imposed lockdown (LD) to limit the spread of COVID-19, which led to a reduction in the emission of anthropogenic atmospheric pollutants. Several studies have investigated the effects of LD on air quality, mostly in urban settings and criteria pollutants. However, less information is available on background sites, and virtually no information is available on particle number size distribution (PNSD). This study investigated the effect of LD on air quality at an urban background site representing a near coast area in the central Mediterranean. The analysis focused on equivalent black carbon (eBC), particle mass concentrations in different size fractions: PM2.5 (aerodynamic diameter Da < 2.5 μm), PM10 (Da < 10 μm), PM10-2.5 (2.5 < Da < 10 μm); and PNSD in a wide range of diameters (0.01-10 μm). Measurements in 2020 during the national LD in Italy and period immediately after LD (POST-LD period) were compared with those in the corresponding periods from 2015 to 2019. The results showed that LD reduced the frequency and intensity of high-pollution events. Reductions were more relevant during POST-LD than during LD period for all variables, except quasi-ultrafine particles and PM10-2.5. Two events of long-range transport of dust were observed, which need to be identified and removed to determine the effect of LD. The decreases in the quasi-ultrafine particles and eBC concentrations were 20%, and 15-22%, respectively. PM2.5 concentration was reduced by 13-44% whereas PM10-2.5 concentration was unaffected. The concentration of accumulation mode particles followed the behaviour of PM2.5, with reductions of 19-57%. The results obtained could be relevant for future strategies aimed at improving air quality and understanding the processes that influence the number and mass particle size distributions.
Article
The COVID-19 pandemic has mainstreamed human mobility data into the public domain, with research focused on understanding the impact of mobility reduction policies as well as on regional COVID-19 case prediction models. Nevertheless, current research on COVID-19 case prediction tends to focus on performance improvements, masking relevant insights about when mobility data does not help, and more importantly, why, so that it can adequately inform local decision making. In this paper, we carry out a systematic analysis to reveal the conditions under which human mobility data provides (or not) an enhancement over individual regional COVID-19 case prediction models that do not use mobility as a source of information. Our analysis - focused on US county-based COVID-19 case prediction models - shows that (1) at most, 60% of counties improve their performance after adding mobility data; (2) that the performance improvements are modest, with median correlation improvements of approximately 0.13; (3) that improvements were lower for counties with higher Black, Hispanic, and other non-White populations as well as low-income and rural populations, pointing to potential bias in the mobility data negatively impacting predictive performance; and that (4) different mobility datasets, predictive models and training approaches bring about diverse performance improvements.
Article
Background. COVID-19 has challenged every country to issue policies to control population mobility. This policy paper discusses policies related to controlling population mobility from 2020 to the end of 2021 issued by the government agencies under the authority of the central government in Indonesia. All of these policies were accessed from the official website, then identified and sorted into appropriate categories. Policy and Implications. Mobility control was applied in two periods, namely PSBB (Large-Scale Social Restrictions) and PPKM (Community Activity Restriction Implementation). This control was carried out strictly, but as the vaccination program developed, the government started to loosen the control depending on the number of cases and the progress of the vaccination program in the country. In the middle of 2021, the government continued to loosen the control by making presentation of the vaccination card mandatory instead of a COVID-19 test. Recommendations. Mobility control during PSBB and PPKM in Indonesia has proven successful in controlling the transmission of COVID-19. This initiative may prove to be the best practice to control contagious diseases even in the future. Conclusions. This pandemic and its control measures in Indonesia show the strong role of the state in controlling the pandemic, as the health of the population is always the main concern.
Article
People mobility data sets played a role during the COVID-19 pandemic in assessing the impact of lockdown measures and correlating mobility with pandemic trends. Two global data sets were Apple’s Mobility Trends Reports and Google’s Community Mobility Reports. The former is no longer available online, while the latter is no longer updated since October 2022. Thus, new products are required. To establish a lower bound on data set penetration guaranteeing high adherence between new products and the Big Tech products, an independent mobility data set based on 3.8 million smartphone trajectories is analysed to compare its information content with that of the Google data set. This lower bound is determined to be around 10⁻⁴ (1 trajectory every 10,000 people) suggesting that relatively small data sets are suitable for replacing Big Tech reports.
Article
We employ the Google and Apple mobility data to identify, quantify and classify different degrees of social distancing and characterise their imprint on the first wave of the COVID-19 pandemic in Europe and in the United States. We identify the period of enacted social distancing via Google and Apple data, independently from the political decisions. Our analysis allows us to classify different shades of social distancing measures for the first wave of the pandemic. We observe a strong decrease in the infection rate occurring two to five weeks after the onset of mobility reduction. A universal time scale emerges, after which social distancing shows its impact. We further provide an actual measure of the impact of social distancing for each region, showing that the effect amounts to a reduction by 20–40% in the infection rate in Europe and 30–70% in the US.
Article
In response to the COVID-19 pandemic, countries have sought to control SARS-CoV-2 transmission by restricting population movement through social distancing interventions, thus reducing the number of contacts. Mobility data represent an important proxy measure of social distancing, and here, we characterise the relationship between transmission and mobility for 52 countries around the world. Transmission significantly decreased with the initial reduction in mobility in 73% of the countries analysed, but we found evidence of decoupling of transmission and mobility following the relaxation of strict control measures for 80% of countries. For the majority of countries, mobility explained a substantial proportion of the variation in transmissibility (median adjusted R-squared: 48%, interquartile range - IQR - across countries [27–77%]). Where a change in the relationship occurred, predictive ability decreased after the relaxation; from a median adjusted R-squared of 74% (IQR across countries [49–91%]) pre-relaxation, to a median adjusted R-squared of 30% (IQR across countries [12–48%]) post-relaxation. In countries with a clear relationship between mobility and transmission both before and after strict control measures were relaxed, mobility was associated with lower transmission rates after control measures were relaxed indicating that the beneficial effects of ongoing social distancing behaviours were substantial.
Article
Background. A novel coronavirus (COVID-19) has taken the world by storm. The disease has spread very swiftly worldwide. A timely clue which includes the estimation of the incubation period among COVID-19 patients can allow governments and healthcare authorities to act accordingly. Objectives. To undertake a review and critical appraisal of all published/preprint reports that offer an estimation of incubation periods for COVID-19. Eligibility criteria. This research looked for all relevant published articles between the dates of December 1, 2019, and April 25, 2020, i.e. those that were related to the COVID-19 incubation period. Papers were included if they were written in English and involved human participants. Papers were excluded if they were not original (e.g. reviews, editorials, letters, commentaries, or duplications). Sources of evidence. COVID-19 Open Research Dataset supplied by Georgetown’s Centre for Security and Emerging Technology as well as PubMed and Embase via Arxiv, medRxiv, and bioRxiv. Charting methods. A data-charting form was jointly developed by the two reviewers (NZ and EA) to determine which variables to extract. The two reviewers independently charted the data, discussed the results, and updated the data-charting form. Results and conclusions. Screening was undertaken on 44,000 articles, with a final selection of 25 studies referring to 18 different experimental projects related to the estimation of the incubation period of COVID-19. The majority of extant published estimates offer empirical evidence showing that the incubation period for the virus is a mean of 7.8 days, with a median of 5.01 days, which falls into the ranges proposed by the WHO (0 to 14 days) and the ECDC (2 to 12 days). Nevertheless, a number of authors proposed that quarantine time should be a minimum of 14 days and that for estimates of mortality risks a median time delay of 13 days between illness and mortality should be under consideration.
It is unclear as to whether any correlation exists between the age of patients and the length of time they incubate the virus.
Preprint
Total and perimetral lockdowns were the strongest nonpharmaceutical interventions to fight against COVID-19, as well as those with the strongest socioeconomic collateral effects. Lacking a metric to predict the effect of lockdowns on the spreading of COVID-19, authorities and decision-makers opted for preventive measures that proved either too strong or not strong enough after a period of two to three weeks, once data about hospitalizations and deaths was available. We present here the radius of gyration as a candidate predictor of the trend in deaths by COVID-19 with an offset of three weeks. Indeed, the radius of gyration aggregates the most relevant microscopic aspects of human mobility into a macroscopic value, very sensitive to temporary trends and local effects, such as lockdowns and mobility restrictions. We use mobile phone data of more than 13 million users in Spain during a period of one year (from January 6th 2020 to January 10th 2021) to compute the users’ daily radius of gyration and compare the median value of the population with the evolution of COVID-19 deaths: we find that for all weeks where the radius of gyration is above a critical value (70% of its pre-pandemic score) the number of weekly deaths increases three weeks later. The reverse also holds: for all weeks where the radius of gyration is below the critical value, the number of weekly deaths decreases after three weeks. This observation leads to two conclusions: i) the radius of gyration can be used as a predictor of COVID-19-related deaths; and ii) partial mobility restrictions are as effective as a total lockdown as long as the radius of gyration is below this critical value. Background. Authorities around the world have used lockdowns and partial mobility restrictions as major nonpharmaceutical interventions to control the expansion of COVID-19.
While effective, the efficiency of these measures on the number of COVID-19 cases and deaths is difficult to quantify, severely limiting the feedback that can be used to tune the intensity of these measures. In addition, collateral socioeconomic effects challenge the overall effectiveness of lockdowns in the long term, and the degree to which they are followed can be difficult to estimate. It is desirable to find both a metric to accurately monitor the mobility restrictions and a predictor of their effectiveness. Methods. We correlate the median of the daily radius of gyration of more than 13M users in Spain during all of 2020 with the evolution of COVID-19 deaths for the same period. Mobility data is obtained from mobile phone metadata from one of the major operators in the country. Results. The radius of gyration is a predictor of the trend in the number of COVID-19 deaths with a three-week offset. When the radius is above/below a critical threshold (70% of the pre-pandemic score), the number of deaths increases/decreases three weeks later. Conclusions. The radius of gyration can be used to monitor in real time the effectiveness of the mobility restrictions. The existence of a critical threshold suggests that partial lockdowns can be as efficient as total lockdowns, while reducing their socioeconomic impact. The mechanism behind the critical value is still unknown, and more research is needed.
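As a rough illustration of the quantity this abstract relies on, the sketch below computes a radius of gyration using the standard definition (root-mean-square distance of visited locations from their centroid) and applies the 70% threshold rule described above. It is not the authors' implementation: the function names, the assumption that locations are already projected to planar kilometre coordinates, and the handling of a ratio exactly equal to the threshold are all hypothetical choices made for this example.

```python
import math


def radius_of_gyration(points):
    """Root-mean-square distance of one user's visited locations from their centroid.

    `points` is a list of (x_km, y_km) tuples; planar coordinates are an
    assumption of this sketch, not something stated in the abstract.
    """
    n = len(points)
    if n == 0:
        raise ValueError("at least one location is required")
    # Centroid (centre of mass) of the visited locations.
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # Mean squared distance from the centroid, then its square root.
    mean_sq = sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n
    return math.sqrt(mean_sq)


def death_trend_signal(median_rg, baseline_rg, threshold=0.70):
    """Expected direction of weekly deaths three weeks later.

    Follows the rule in the abstract: above 70% of the pre-pandemic value
    means an increase, below means a decrease. Treating an exactly equal
    ratio as "decrease" is an arbitrary choice for this sketch.
    """
    ratio = median_rg / baseline_rg
    return "increase" if ratio > threshold else "decrease"
```

For example, `death_trend_signal(8.0, 10.0)` compares a population median of 8 km against a pre-pandemic baseline of 10 km; the ratio 0.8 exceeds the threshold, so the rule signals an increase.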
Article
The COVID-19 pandemic continues to have a destructive effect on the health and well-being of the global population. A vital step in the battle against it is the successful screening of infected patients, together with one of the effective screening methods being radiology examination using chest radiography. Recognition of epidemic growth patterns across temporal and social factors can improve our capability to create epidemic transmission designs, including the critical job of predicting the estimated intensity of the outbreak morbidity or mortality impact at the end. The study’s primary motivation is to be able to estimate with a certain level of accuracy the number of deaths due to COVID-19, managing to model the progression of the pandemic. Predicting the number of possible deaths from COVID-19 can provide governments and decision-makers with indicators for purchasing respirators and pandemic prevention policies. Thus, this work presents itself as an essential contribution to combating the pandemic. Kalman Filter is a widely used method for tracking and navigation and filtering and time series. Designing and tuning machine learning methods are a labor- and time-intensive task that requires extensive experience. The field of automated machine learning Auto Machine Learning relies on automating this task. Auto Machine Learning tools enable novice users to create useful machine learning units, while experts can use them to free up valuable time for other tasks. This paper presents an objective method of forecasting the COVID-19 outbreak using Kalman Filter and Auto Machine Learning. We use a COVID-19 dataset of Ceará, one of the 27 federative units in Brazil. Ceará has more than 235,222 confirmed cases of COVID-19 and 8850 deaths due to the disease. The TPOT automobile model showed the best result with a 0.99 of \(R^2\) score.
Article
Daily Google mobility data covering 130 countries over the period between February 15, 2020 and May 2, 2020 suggest that less mobility is associated with fewer COVID-19 cases and deaths. This observation is formally tested by using a difference-in-difference design, where country-fixed effects, day-fixed effects, as well as the country-specific timing of the 100th COVID-19 case are controlled for. The results suggest that a 1% weekly increase in being at residential places leads to about 70 fewer weekly COVID-19 cases and about 7 fewer weekly COVID-19 deaths, whereas a 1% weekly decrease in visits to transit stations leads to about 33 fewer weekly COVID-19 cases and about 4 fewer weekly COVID-19 deaths, on average across countries. Similarly, a 1% weekly reduction in visits to retail & recreation results in about 25 fewer weekly COVID-19 cases and about 3 fewer weekly COVID-19 deaths, and a 1% weekly reduction in visits to workplaces results in about 18 fewer weekly COVID-19 cases and about 2 fewer weekly COVID-19 deaths.
Article
Effective screening of SARS-CoV-2 enables quick and efficient diagnosis of COVID-19 and can mitigate the burden on healthcare systems. Prediction models that combine several features to estimate the risk of infection have been developed. These aim to assist medical staff worldwide in triaging patients, especially in the context of limited healthcare resources. We established a machine-learning approach that trained on records from 51,831 tested individuals (of whom 4769 were confirmed to have COVID-19). The test set contained data from the subsequent week (47,401 tested individuals of whom 3624 were confirmed to have COVID-19). Our model predicted COVID-19 test results with high accuracy using only eight binary features: sex, age ≥60 years, known contact with an infected individual, and the appearance of five initial clinical symptoms. Overall, based on the nationwide data publicly reported by the Israeli Ministry of Health, we developed a model that detects COVID-19 cases by simple features accessed by asking basic questions. Our framework can be used, among other considerations, to prioritize testing for COVID-19 when testing resources are limited.
Article
Precision agriculture is a growing sector that improves traditional agricultural processes through the use of new technologies. In southeast Spain, farmers are continuously fighting against harsh conditions caused by the effects of climate change. Among these problems, the great variability of temperatures (up to 20 °C in the same day) stands out. This causes the stone fruit trees to flower prematurely, and the low winter temperatures then freeze the flowers, causing the loss of the crop. Farmers use anti-freeze techniques to prevent crop loss, and the most widely used techniques are those that use water irrigation, as they are cheaper than other techniques. However, these techniques waste too much water, which is a scarce resource, especially in this area. In this article, we propose a novel intelligent Internet of Things (IoT) monitoring system to optimize the use of water in these anti-frost techniques while minimizing crop loss. The intelligent component of the IoT system is designed using an approach based on a multivariate Long Short-Term Memory (LSTM) model, designed to predict low temperatures. We compare the proposed multivariate model with its univariate counterpart to determine which model predicts low temperatures more accurately. An accurate prediction of low temperatures would translate into significant water savings, as anti-frost techniques would not be activated unnecessarily. Our experimental results show that the proposed multivariate LSTM approach improves on its univariate counterpart, obtaining an average quadratic error no greater than 0.65 °C and a coefficient of determination R2 greater than 0.97. The proposed system has been deployed and is currently operating in a real environment with satisfactory performance.
Article
Detailed, accurate data related to a disease outbreak enable informed public health decision making. Given the variety of data types available across different regions, global data curation and standardization efforts are essential to guarantee rapid data integration and dissemination in times of a pandemic.
Article
Forecasting the outcome of outbreaks as early and as accurately as possible is crucial for decision making and policy implementations. A significant challenge faced by forecasters is that not all outbreaks and epidemics turn into pandemics making the prediction of their severity difficult. At the same time, the decisions made to enforce lockdowns and other mitigating interventions versus their socioeconomic consequences are not only hard to make, but also highly uncertain. The majority of modeling approaches to outbreaks, epidemics, and pandemics take an epidemiological approach that considers biological and disease processes. In this paper, we accept the limitations of forecasting to predict the long-term trajectory of an outbreak, and instead, we propose a statistical, time-series approach to modelling and predicting the short-term behaviour of COVID-19. Our model assumes a multiplicative trend, aiming to capture the continuation of the two variables we predict (global confirmed cases and deaths) as well as their uncertainty. We present the timeline of producing and evaluating 10-day-ahead forecasts over a period of four months. Our simple model offers competitive forecast accuracy and estimates of uncertainty that are useful and practically relevant.