We are witnessing the dramatic consequences of the COVID-19 pandemic which, unfortunately, go beyond the impact on the health system. Until herd immunity is achieved with vaccines, the only available mechanisms for controlling the pandemic are quarantines, perimeter closures and social distancing with the aim of reducing mobility. Governments only apply these measures for a reduced period, since they involve the closure of economic activities such as tourism, cultural activities, or nightlife. The main criterion for establishing these measures and planning socioeconomic subsidies is the evolution of infections. However, the collapse of the health system and the unpredictability of human behavior, among others, make it difficult to predict this evolution in the short to medium term. This article evaluates different models for the early prediction of the evolution of the COVID-19 pandemic to create a decision support system for policy-makers. We consider a wide range of models including artificial neural networks such as LSTM and GRU and statistically based models such as autoregressive (AR) or ARIMA. Moreover, several consensus strategies to ensemble all models into one system are proposed to obtain better results in this uncertain environment. Finally, a multivariate model that includes mobility data provided by Google is proposed to better forecast trend changes in the 14-day cumulative incidence (CI). A real case study in Spain is evaluated, providing very accurate results for the prediction of 14-day CI in scenarios with and without trend changes, reaching an $R^2$ of 0.93, an RMSE of 4.16 and an MAE of 1.08.
Scientic Reports | (2021) 11:15173 | https://doi.org/10.1038/s41598-021-94696-2
www.nature.com/scientificreports
Improving prediction of COVID-19 evolution by fusing epidemiological and mobility data
Santi García‑Cremades1, Juan Morales‑García2, Rocío Hernández‑Sanjaime1,
Raquel Martínez‑España2, Andrés Bueno‑Crespo2, Enrique Hernández‑Orallo3,
José J. López‑Espín1 & José M. Cecilia3*
The COVID-19 pandemic is the biggest global challenge in our recent history, which puts the welfare state of today's society at risk. Spain is undoubtedly among the countries most affected by the pandemic, with up to 3,697,987 total cases of infection, and a total of 80,196 deaths (as reported on June 7, 2021)1. Governments worldwide are taking drastic measures such as social distancing, contact tracing, perimeter closures and even quarantines, which are either reinforced or alleviated depending on the epidemiological status of the disease2. These non-sanitary measures focus on the reduction of human mobility, which has an important socio-economic effect3. For instance, according to the European Commission, the economic forecast for Spain is the worst in its recent history with a 9.4% drop in GDP, and an expected unemployment of up to 18.9% at the end of 2020. Globally speaking, the Organisation for Economic Co-operation and Development (OECD)4 stated that these bad economic projections will lead to widespread poverty, child malnutrition, stress, and suicides, just to mention a few of the dramatic consequences for the population. However, beyond the economic consequences, the measures of social distancing and lockdowns can raise new social scenarios in fundamental aspects such as education, gender violence, immigration and other new issues that may arise because of such extreme public health measures.
Early understanding of the evolution of the pandemic helps to prevent scenarios that could increase the number of COVID-19 victims. Governments have implemented public health surveillance systems for COVID-19 based on the fundamental principles provided by the World Health Organization (WHO); i.e., tracking clinical and epidemiological figures such as confirmed cases, deaths and active cases, just to mention a few5,6. This information is usually provided by governments daily, and currently, these surveillance systems provide robust and stable information on the evolution of the pandemic7. However, this epidemiological information shows a posterior picture of the pandemic, i.e., once people have been infected and are showing symptoms, usually after an incubation period of 7-10 days8. From these epidemiological data, novel Machine Learning (ML), Artificial Intelligence (AI) and data science methods can provide significant outcomes for tracking and detecting COVID-19 evolution at national and regional level9. All in all, the infection curve can be seen as a time series in which trend changes are hardly predictable, as it does not follow a seasonal pattern, mainly due to the chaotic interaction of people.
1Center of Operations Research, Miguel Hernandez University of Elche (UMH), 03202 Elche, Spain. 2Computer Science Department, Universidad Católica de Murcia (UCAM), 30107 Murcia, Spain. 3Computer Engineering Department, Universitat Politècnica de València (UPV), 46022 Valencia, Spain. *email: jmcecilia@disca.upv.es
Content courtesy of Springer Nature, terms of use apply. Rights reserved
Figure3 shows the 14-day CI in Spain from July 20, 2020 until January 2021. e rst Spanish wave ocially
ended on July 20, 2020 and the 14-day CI started to increase again from that date onwards. It is worth mention-
ing that from the second wave until today, there have been several waves, understood as trend changes in the
14-days CI. At the beginning of October, 9th the 14-day CI started to increase again, matching with a vacation
period at the national level, from October 9th to 12th. In addition, in mid-December a trend change of the 14-day
CI was reported, also coinciding with a vacation period (December 8–12, 2020), which is increasing from that
date until now. ese trend changes are one of the most dicult scenarios for modelling. e 14-day CI is a
time series that includes daily data from July. Besides, not every day is reported, COVID-19 data in Spain is only
reported on working days, i.e., Monday through Friday, except holidays. e lack of historical data, as well as the
scarce changes in trends during the training period makes it very dicult to let the models learn these changes.
In this paper, we propose a multivariate model to predict trend changes in the 14-day Cumulative Incidence (CI) of COVID-19. We conducted a comprehensive analysis of different mobility components offered by Google to incorporate this information into our multivariate model as exogenous information. The multivariate model resulting from adding this information can predict trend changes in 14-day CI with greater accuracy. The main contributions of the paper are the following:
1. Several state-of-the-art ML and statistical methods are evaluated to predict the 14-day CI, using only the historical information of this variable as input for two different scenarios, i.e., 14-day CI with trend changes and without trend changes in the time series.
2. An ensemble strategy is provided to combine previous models and provide an optimal prediction. These methods offer very good performance for this time series when there are no clear trend changes.
3. A multivariate model is designed and fed with 14-day CI and mobility variables provided by Google as exogenous information.
4. The multivariate model is optimized using operational research techniques to achieve better prediction of trend changes in 14-day CI.
5. The evaluation is based on information from several waves in Spain in which clear trend changes were reported.
The remainder of the paper is structured as follows. Firstly, we discuss the related work. Then, the methods of this article are introduced in “Methods” section, including the main ML and statistical models proposed, their ensemble and the exogenous information targeted. Finally, “Evaluation and results” section shows the main results and findings of our article before the main conclusions and directions for future work are introduced.
Related work
Since the very beginning of the COVID-19 pandemic, scientists have struggled to design models that could forecast not only the evolution of the disease but also the impact of the different measures taken. The problem is that these models must characterise not only how the virus spreads, which is far from being understood, but also human behaviour, which can be erratic. Firstly, it is necessary to evaluate and model how fast COVID-19 is spreading. A fundamental epidemiological quantity, the reproductive number R, represents the average number of new infections an infected person can generate (so the greater the number, the faster the spreading). First estimations of the $R_0$ value for COVID-19 evidenced a relatively high value, in the range (2.4–5.6)10–12. Fortunately, measures such as social distancing, facial masks and mobility reduction have allowed health authorities to control the spread of the disease.
Dierent types of models have been proposed for forecasting COVID-19 evolution: compartmental models,
statistical-based models and machine learning (ML) based models13. In epidemiological compartmental models,
the population is assigned to dierent compartments (for example, the simple SIR models with three compart-
ments: Susceptible, Infectious, and Recovered). ese compartmental models have been used to evaluate and
forecast the impact of the dierent measures taken, such as quarantine, isolation and contact tracing. For example,
in14,15 the authors model and evaluate the general eects of containment mechanisms. Regarding contact trac-
ing, in10,11 it was stated that contact tracing and isolation as currently practiced is not helping in preventing the
COVID-19 pandemic. Finally, in16,17, the authors evaluated the impact of the technological aspects (such as reso-
lution, centralised vs decentralised approaches) of the current smart-based contact tracing application showing
that for being eective, it would have required a high adoption rate and a centralised technology. Unfortunately,
it was not the case, so these kinds of contact tracing applications failed to control the disease.
On the other hand, statistical-based models, i.e., time series analysis and forecasting, only rely on past data to predict the near future. There are many different methods, such as Auto-Regressive Moving Average (ARMA), Auto-Regressive Integrated Moving Average (ARIMA), Support Vector Regressor (SVR), Linear Regressor polynomial (LRP), Bayesian Ridge Regression (BRR), Linear Regression (LR), Random Forest Regressor (RFR), Holt-Winter Exponential Smoothing (HW), and Extreme Gradient Boost Regressor (XGB). Note that some authors consider some of these methods as Machine Learning methods18 but none of them seems to improve the overall quality of the prediction19–21 (see below for a detailed description of these references). Among these models, we may highlight the ARIMA model22, which has shown good results forecasting the COVID-19 infections. For instance, Benvenuto et al.23 proposed the use of ARIMA models to predict the COVID-19 spread around the world, while Perone et al.24 proposed a model for different regions of Italy and Sahai et al.25 did the same for the top five affected countries. Nevertheless, these models can only predict short-time behaviour, as confidence intervals grow extremely fast as time elapses26. Petropoulos et al.27 also recognized the limitations of forecasting longer term trajectories of an outbreak.
As previously commented, some authors consider most of the previous statistical methods to be part of more general Machine Learning (ML) and Deep Learning (DL) methods19. For example, Shahit et al.28 used DL methods for the prediction of time series of confirmed cases, deaths and recoveries in COVID-19 affected countries, where the performance of models was measured by mean absolute error (MAE), root mean square error (RMSE) and $R^2$. They focus on different variables (but not 14-day CI) with stable trends. Similarly, Zerorual et al.29 compared up to five DL models for COVID-19 forecasting using COVID-19 information from different countries, including Italy, Spain, France, China, USA and Australia. Nevertheless, more specific ML methods such as neural networks and Support Vector Machines (SVMs) have been shown to perform poorly since they require more training data than the currently available datasets20,21. Furthermore, as stated by Ribeiro et al.30, this fact can also be attributed to the chaotic dynamics of the analysed data, as well as the diversity of exogenous factors.
Several studies have shown the relationship between mobility and the disease spread. Linka et al.31 showed a strong correlation between the reduction in mobility and the effective reproduction number across Europe, which was particularly high for countries such as the Netherlands, Germany, Ireland, Spain, and Sweden (which have a Spearman's rank correlation $\rho$ of 0.99). The authors in32,33 found that mobility statistics offered in open COVID-19 datasets showed the evolution of the COVID-19 spread in China, placing the contagious peak at the early beginning of 2020. A recent study using mobile phone data of more than 13 million users in Spain34 has shown that these data can be used as a predictor of COVID-19-related deaths. Particularly, they stated that there is a critical level (around 70% of the radius of gyration, which quantifies the mobility range of an individual during a given week35) such that hospitalizations and deaths tend to increase two to three weeks after this threshold is exceeded. Finally, Google and Apple mobility data, which are used in this paper, have been shown to be of great help in quantifying and predicting the effects of COVID-19. For example, Cot et al.36 quantify the effects of social distancing on the COVID-19 spreading dynamics in Europe and in the USA, and Nouvellet et al.37 show the correlation between the reduction in mobility and COVID-19 transmission. One key aspect of all these models is the quality of the data used. Having a wide range of data, updated on a real-time basis and accessible, is critical to characterizing disease outbreaks and obtaining useful models38. Nevertheless, better data are necessary, but not sufficient. As stated by Castro et al.26, human behaviour is really hard to model since it always carries uncertainty, so most models can fail to forecast some important issues such as turning points and the end of the expansion.
Summing up, the problem with the described forecasting models is to accurately predict trend changes (i.e., waves) when using only previous historical information. These changes in trends can depend on varying external elements, such as mobility, social distancing, etc. Therefore, a way to improve the precision of the previous forecasting methods is to combine several data sources. Particularly, in this paper, we show that the utilisation of mobility data can improve forecasting compared with using only time series (such as 14-day CI).
Methods
Temporal data are omnipresent in many application domains, such as medicine, agriculture or robotics39,40. Increasingly, time series forecasting is being introduced in these fields; it follows a quantitative approach that uses historical information along with certain associated patterns such as trends, seasonality and irregular components to predict future observations. Trend data in the time series offers long-term information for the prediction. Seasonality refers to patterns in the time series that occur at specific and regular intervals. Finally, irregular components are unsystematic fluctuations due to external factors. Having access to historical time-series data, forecasting models can be used to understand the behaviour of the time series. However, the irregular components of the time series are difficult to predict as they do not follow a given pattern. Generally speaking, time-series models cannot learn these irregular components from the historical data of the time series, so they need additional information to identify these possible events41.
Indeed, the evolution of the 14-day CI of COVID-19 is based on irregular components that are mainly caused by the different implementations of the national legislation that reduces people's mobility42. Several ICT companies such as Google or Apple have provided mobility data taken from smartphones that run mobility applications, such as Google Maps or Apple Maps, to figure out the changes that have occurred in people's mobility as a result of the policies to deal with the COVID-19 pandemic43. As previously explained in “Related work” section, several works have recently been carried out to predict the COVID-19 evolution based on trends and seasonality in time series, but none of them has analysed trend changes due to these irregular components. This section introduces the ML and statistical univariate models used in this article to predict the 14-day CI using only the endogenous variable; i.e. previous observations of the 14-day CI. These models are combined through an ensemble approach that uses different consensus strategies based on quality metrics that are first described. Finally, the multivariate model is introduced to improve the prediction of the 14-day CI in those time lags where there are trend changes.
Metrics and statistical models used. The main metrics and statistical models used in this work are the following (where $x_i$ is the real data for instance i and $P_i$ is the prediction for instance i):
Coefficient of determination ($R^2$) is used to analyse how differences in one variable can be explained by differences in a second variable. It is a value ranging from 0 to 1 and indicates that the regression line represents none or all of the data, respectively, so that the higher the value, the better the goodness of fit of the model44.
Root mean square error (RMSE) is the standard deviation of the prediction errors, which are a measure of the distance of the data from the regression line, indicating the concentration of the data around the line of best fit. It is, therefore, a measure of the dispersion of these errors (also known as residuals)45.
Mean absolute error (MAE) allows measurement of the average magnitude of the errors for a set of predictions, regardless of their direction. It represents the mean of the absolute differences in the sample between the prediction and the actual observation, taking into account that all individual differences are of equal significance45.
Spearman correlation. Spearman's correlation coefficient is a non-parametric measure of rank correlation; i.e. statistical dependence of the ranking between two variables. It measures the strength and direction of the association between two ranked variables46.
Granger causality. Granger causality is a testing framework comparing the unrestricted model, in which a time series y is explained by the lags of y and the lags of an additional series of observations x (both lags up to the same fixed order), and the restricted model, in which y is only explained by the lags of y. Thus, Granger causality determines if one time series is helpful for predicting another, and in some cases, it may be used to assert stronger causal statements47.
Principal component analysis (PCA). The aim of this technique is to reduce the dimensionality of multivariate data while preserving as much of the relevant information as possible48.
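As a minimal illustration of the three error metrics described above, the following Python sketch computes them from paired lists of observations and predictions. It assumes that $R^2$ is taken as the squared Pearson correlation between the real data $x_i$ and the predictions $P_i$; the function names are illustrative and are not part of the original study.

```python
import math


def r2_score(actual, predicted):
    """Squared Pearson correlation between observations and predictions."""
    n = len(actual)
    mean_x = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((x - mean_x) * (p - mean_p) for x, p in zip(actual, predicted))
    var_x = sum((x - mean_x) ** 2 for x in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_x * var_p)


def rmse(actual, predicted):
    """Root of the mean squared difference between x_i and P_i."""
    squared = sum((x - p) ** 2 for x, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def mae(actual, predicted):
    """Mean of the absolute differences between x_i and P_i."""
    return sum(abs(x - p) for x, p in zip(actual, predicted)) / len(actual)
```

Note that RMSE penalises large individual errors more strongly than MAE, which is why the two metrics can rank the same set of models differently.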
Ensemble approach for univariate prediction. This subsection proposes a combination of time series and ML models and techniques to provide a consensus strategy that brings all the results into one. Each method and model has demonstrated good results in the literature for predicting different epidemiological variables related to COVID-19. Moreover, different configurations and/or parameterisations of these models are also important for the quality of the predicted results. With the proposed ensemble, the search space of the models is explored automatically in order to obtain the best possible prediction. The statistical and machine learning methods under study are the following:
1. Autoregressive (AR) is a univariate model49 where a prediction is made using a linear combination of past values of that variable. The term autoregression indicates that it is a regression of the variable against itself. Thus, an autoregressive model is established according to its order p. Autoregressive models are remarkably flexible at handling a wide range of different time series patterns.
2. Autoregressive Integrated Moving Average (ARIMA) is a linear statistical model50, which uses variations and regressions of statistical data in order to find patterns for a prediction into the future. Autoregression (AR) is the term that refers to the lags of the differenced series, Moving Average (MA) refers to the lags of the errors and Integration (I) is the number of differences used to make the time series stationary.
3. Long short-term memory (LSTM) is a type of recurrent neural architecture with a state memory and multilayer cell structure51. An LSTM unit is composed of a cell, an input gate, an output gate and a forget gate. The cell remembers values over arbitrary time intervals and the three gates regulate the flow of information into and out of the cell (Fig. 1b). The LSTM differs from a classic recurrent network in that it does not overwrite its content at each time step but is able to decide whether to keep the existing memory through the introduced gates. If the LSTM unit detects an important characteristic of an input sequence at an early stage, it carries this information over long distances and therefore captures long-distance dependencies.
4. Gated Recurrent Unit (GRU) is a type of recurrent neural network with a modification that addresses the vanishing gradient problem of this type of network, since the model does not wash out the new input every single time but keeps the relevant information and passes it down to the next time steps of the network52. It is similar to LSTM but without memory cells, which makes it simpler to compute and implement. It is composed of two gates (reset and update) (Fig. 1a), so that each recurrent unit can capture dependencies adaptively at different time scales. Through these two gates, it is decided what information should be passed on to the output, without eliminating information that is apparently irrelevant to the prediction, so that the information is retained for a long time.
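To make the simplest of the models above concrete, the following sketch fits an AR model of order p by ordinary least squares and produces recursive forecasts. This is only an illustration under stated assumptions: ordinary least squares is one common estimator and is not necessarily the one used in the study, and the names `solve_linear`, `fit_ar` and `forecast_ar` are hypothetical helpers written for this example.

```python
def solve_linear(a, b):
    """Solve a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ar(series, p):
    """Least-squares fit of y_t = c + phi_1*y_{t-1} + ... + phi_p*y_{t-p}."""
    rows = [[1.0] + [series[t - k] for k in range(1, p + 1)]
            for t in range(p, len(series))]
    targets = series[p:]
    size = p + 1
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(size)]
           for i in range(size)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(size)]
    return solve_linear(xtx, xty)  # [c, phi_1, ..., phi_p]


def forecast_ar(series, coeffs, steps):
    """Feed each prediction back as input to forecast several steps ahead."""
    history = list(series)
    p = len(coeffs) - 1
    for _ in range(steps):
        nxt = coeffs[0] + sum(coeffs[k] * history[-k] for k in range(1, p + 1))
        history.append(nxt)
    return history[len(series):]
```

Because each forecast is fed back as an input, errors accumulate with the horizon, which is one reason such models are better suited to short-term prediction.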
In the process of combining the information of the proposed ensemble approach, the validation metrics for the regression task are used. Particularly, our ensemble approach uses the coefficient of determination ($R^2$), root mean square error (RMSE) and mean absolute error (MAE) metrics53:

$$R^2 = \frac{\left(\sum_{i=1}^{n} (x_i - \bar{x})(P_i - \bar{P})\right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (P_i - \bar{P})^2} \tag{1}$$

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - P_i)^2}{n}} \tag{2}$$

$$\mathrm{MAE} = \frac{\sum_{i=1}^{n} |x_i - P_i|}{n} \tag{3}$$

Before describing in detail the phases of this proposed ensemble approach, the 4 combination methods used to obtain and calculate the model for the inference are described. The combination methods used are briefly detailed below:
Maximum. The predictions of the model that has the greatest $R^2$ metric are selected.
Minimum. The models with the lowest RMSE and MAE metrics are selected and a weighted average is computed.
Average. An average of all models is made without taking into account their metric values.
Weighted average. A weighted average is made based on the $R^2$ score of each model.
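The four combination methods can be sketched as follows. The data layout (a dictionary of per-model prediction lists and a dictionary of per-model validation metrics) is an assumption made for this example. The text does not specify the weights used by the Minimum method, so equal weights are assumed here; all function names are hypothetical.

```python
def combine_maximum(preds, metrics):
    """Use the predictions of the model with the highest R^2."""
    best = max(preds, key=lambda m: metrics[m]["r2"])
    return list(preds[best])


def combine_minimum(preds, metrics):
    """Average the lowest-RMSE and lowest-MAE models.

    Assumption: equal weights, since the source does not state them.
    """
    by_rmse = min(preds, key=lambda m: metrics[m]["rmse"])
    by_mae = min(preds, key=lambda m: metrics[m]["mae"])
    return [(a + b) / 2 for a, b in zip(preds[by_rmse], preds[by_mae])]


def combine_average(preds):
    """Plain mean over all models, ignoring their metrics."""
    models = list(preds)
    horizon = len(preds[models[0]])
    return [sum(preds[m][i] for m in models) / len(models)
            for i in range(horizon)]


def combine_weighted(preds, metrics):
    """Mean over all models weighted by each model's R^2 score."""
    total = sum(metrics[m]["r2"] for m in preds)
    horizon = len(next(iter(preds.values())))
    return [sum(metrics[m]["r2"] * preds[m][i] for m in preds) / total
            for i in range(horizon)]
```

When a single model has both the lowest RMSE and the lowest MAE, `combine_minimum` simply returns that model's predictions.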
The proposed ensemble approach consists of the following steps. Figure 2 summarizes these steps.
1. Let $|E|$ be the training dataset and $|E|_v$ a validation dataset.
2. Each technique t is trained with the $|E|$ dataset, generating $P^t_{|E|}$ for each t.
3. For each technique t, the values $R^2$, RMSE and MAE are calculated using the predictions $P^t_{|E|}$ and the $|E|_v$ dataset.
4. Using the combination methods $|C|$, models whose predictions are effective are selected.
5. Depending on the combination method, the $P_{|E|_v}$ predictions are calculated by taking the data from the validation dataset $|E|_v$ as input.
6. The metrics of $R^2$, RMSE and MAE are calculated with the predictions $P_{|E|_v}$, leaving the model built and ready to infer values.
7. Equation (4) is used to infer a new value i in the model, where $P_i^{\mathrm{MaxR}_t}$ is the prediction for instance i provided by the model t with the maximum $R^2$; $P_i^{\mathrm{MinRMSE}_t}$ is the prediction for instance i provided by the model t with the minimum RMSE and $P_i^{\mathrm{MinMAE}_t}$ is the prediction for instance i provided by the model t with the minimum MAE.
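The inference in step 7 can be illustrated with a short sketch: the models with the best validation $R^2$, RMSE and MAE are identified, and their predictions for the new instance are averaged as in Eq. (4). The dictionary layout and the name `infer_consensus` are assumptions made for this example.

```python
def infer_consensus(new_preds, metrics):
    """Average the predictions of the best-R^2, min-RMSE and min-MAE models.

    new_preds: model name -> prediction for the new instance i
    metrics:   model name -> {"r2": ..., "rmse": ..., "mae": ...}
               computed beforehand on the validation dataset
    """
    best_r2 = max(metrics, key=lambda m: metrics[m]["r2"])
    best_rmse = min(metrics, key=lambda m: metrics[m]["rmse"])
    best_mae = min(metrics, key=lambda m: metrics[m]["mae"])
    selected = (best_r2, best_rmse, best_mae)
    return sum(new_preds[m] for m in selected) / 3
```

The same model may be selected more than once; in that case its prediction simply counts two or three times in the average.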
Measuring mobility for the multivariate model. Reducing mobility has been one of the main tools that governments worldwide are using to prevent the COVID-19 spread. Tracing infection from mobility data has been used from the early beginning of the COVID-19 outbreak. Kraemer et al.32,33 found that mobility statistics offered in open COVID-19 datasets showed the evolution of the COVID-19 spread in China, placing the contagious peak at the early beginning of 2020. Therefore, the measurement of mobility in different cities has been studied by different public and private organizations. Huang et al.54 showed that mobility patterns obtained from Twitter can quantitatively reflect the mobility dynamics.
Google mobility data (GMD) (https://www.google.com/covid19/mobility/) is a tool developed by Google to deal with COVID-19. It shows a set of aggregated and anonymized data obtained from information in products such as Google Maps55. This data is provided through local mobility reports which offer valuable information on changes in people's mobility patterns as a consequence of the measures taken by the governments to deal with the COVID-19 pandemic. Among the information found in these reports, of particular interest to us are the movement trends of citizens over time. This information is arranged by geographical area and classified into various categories of places, such as workplaces, stores, supermarkets, leisure spaces, pharmacies, parks, transportation stations and residential areas. The main variables GMD provides are the following:
Retail and recreation. This variable shows mobility trends for places such as restaurants, cafes, museums, malls, cinemas and libraries.
Supermarket and pharmacy. This variable shows mobility trends for places such as supermarkets, food warehouses and pharmacies.
Parks. This variable shows mobility trends for places such as national parks, public beaches, plazas and public gardens.
Public transport. This variable shows mobility trends for places that are public transport hubs, such as train, subway or bus stations.
Workplaces. This variable shows mobility trends for places of work.
Residential. This variable shows mobility trends for places of residence.
The number provided by GMD compares the mobility on the date of the report with the mobility on the reference (baseline) day. The data corresponding to the date of the report is calculated (if the information is available) and shown as a positive or negative percentage. The data shows how the number of visitors to (or time spent in) the categorized locations changes compared to the baseline. A baseline represents a normal value for that day of the week. The baseline is the median value for the 5-week period from January 3 to February 6, 2020. In each region-category, the baseline is not a single value, but 7 individual values, one per day of the week. Consequently, the same number of visitors on two different days of the week results in different percentage changes. It is important to note that baseline days never change. Seasonality has not been taken into account in the calculation of the reference values. For example, the number of people going to the parks usually increases as the weather improves.
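The baseline comparison described above can be sketched in a few lines. This is an illustration only: it assumes the baseline for each day of the week is the median over the 5-week reference window, as described in Google's documentation of the reports, and it uses made-up visit counts; `weekday_baselines` and `percent_change` are hypothetical names.

```python
from datetime import date, timedelta
from statistics import median

BASELINE_START = date(2020, 1, 3)
BASELINE_END = date(2020, 2, 6)


def weekday_baselines(visits):
    """Return one baseline per day of the week (Monday = 0).

    visits: dict mapping a date to the visit count for one
    region-category. Only dates inside the reference window are used.
    """
    buckets = {wd: [] for wd in range(7)}
    day = BASELINE_START
    while day <= BASELINE_END:
        if day in visits:
            buckets[day.weekday()].append(visits[day])
        day += timedelta(days=1)
    return {wd: median(vals) for wd, vals in buckets.items() if vals}


def percent_change(day, count, baselines):
    """Percentage change of a count against the baseline of its weekday."""
    base = baselines[day.weekday()]
    return 100.0 * (count - base) / base
```

With a weekday baseline of 100 visits and a weekend baseline of 200, the same count of 150 visits yields +50% on a Monday and -25% on a Sunday, which is why the weekday-specific baseline matters when comparing reported values.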
A multivariate model including these variables is proposed to predict 14-day CI. Our first approach was to explore a multivariate regression model which includes the ensemble information and additional information in the mobility variables as exogenous information. The multivariate equation is shown in Eq. (5).
$$P_i = \frac{P_i^{\mathrm{MaxR}_t} + P_i^{\mathrm{MinRMSE}_t} + P_i^{\mathrm{MinMAE}_t}}{3} \tag{4}$$

$$CI_{14\,\mathrm{day}} = \beta_0 + \beta_1(\mathrm{Ensemble}) + \beta_2 \mathrm{GMD}_2 + \beta_3 \mathrm{GMD}_3 + \cdots + \beta_i \mathrm{GMD}_i \tag{5}$$
where the response variable is $CI_{14\,\mathrm{day}}$, $\beta_0$ is the independent term, $\beta_1$ is the term that weights the values obtained by our ensemble, and $\beta_i$ is the term that weights the Google mobility variables ($\mathrm{GMD}_i$ where $i = 2, 3, 4, 5, 6, 7$). GMD variables will be evaluated through the t-statistic to figure out if there is a significant relationship between the response variable (14-day CI) and each of the predictors included in the model (ensemble and mobility variables). If so, these variables will be included in the multivariate model.
It is important to note that the main assumptions of multivariate regression, such as a linear relationship
between the target variable and the independent variables, normality of all variables and lack of
multicollinearity, are not met in our case, as shown in the “Evaluation and results” section. Therefore, an
operations research approach is proposed to optimize the coefficients of our multivariate model in order to
minimize the MAE. In particular, the Non-Linear Minimization (NLM) procedure56, an iterative minimization
routine included in the R programming software, is applied to look for optimal coefficients. This method
requires a seed to initialize the optimization of the coefficients, and three different starting values were
analysed: (i) coefficients randomly generated from a uniform distribution from −10 to 10, (ii) coefficients with
the same weight for each of the independent variables and (iii) the coefficient estimates of the multivariate
regression model described in Table 9.
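The seeding strategy can be sketched as follows. This is not R's `nlm` routine: a simple derivative-free compass search stands in for it, the toy data are invented, and the value 1.0 used for the equal-weight seed is an assumption. Only the third seed comes from the document (the estimates in Table 9), and it is reused here purely as a set of starting values.

```python
import random

# Hypothetical toy data: columns are ensemble, retail, parks, transport.
X = [[0, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0],
     [0, 0, 1, 0], [0, 0, 0, 1], [1, 1, 1, 1]]
y = [10.0, 11.2, 10.5, 9.7, 9.4, 10.8]


def mae(beta, X, y):
    """Mean absolute error of the linear model beta[0] + sum(beta[j] * x[j])."""
    preds = [beta[0] + sum(b * v for b, v in zip(beta[1:], row)) for row in X]
    return sum(abs(p - t) for p, t in zip(preds, y)) / len(y)


def compass_search(seed, X, y, step=1.0, tol=1e-6, max_iter=2000):
    """Try +/- step on each coefficient; halve the step when nothing improves."""
    beta, best = list(seed), mae(seed, X, y)
    for _ in range(max_iter):
        if step < tol:
            break
        improved = False
        for j in range(len(beta)):
            for delta in (step, -step):
                trial = beta.copy()
                trial[j] += delta
                score = mae(trial, X, y)
                if score < best:
                    beta, best, improved = trial, score, True
        if not improved:
            step /= 2
    return beta, best


rng = random.Random(0)  # fixed seed so the random start is reproducible
seeds = {
    "random": [rng.uniform(-10, 10) for _ in range(5)],
    "equal": [1.0] * 5,                                   # assumed equal weights
    "regression": [-110.59, 1.31, 1.00, -0.20, -0.60],    # Table 9 estimates
}
results = {}
for name, seed in seeds.items():
    _, final_mae = compass_search(seed, X, y)
    results[name] = (mae(seed, X, y), final_mae)
```

Each entry of `results` holds the MAE at the seed and the MAE after the search. The search only accepts improving moves, so the second value can never exceed the first, but like any local method it may stop in a local optimum that depends on the seed.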
Evaluation and results
This section presents the evaluation of our models for estimating the 14-day CI of COVID-19. First, the
datasets used to perform the experiments are explained. Next, the different univariate ML models and the
ensemble approach previously explained in the “Methods” section for the prediction of the 14-day CI are
evaluated. The Google mobility information is then statistically analysed and a PCA is performed to obtain
exogenous information to be included in a multivariate model. Finally, the multivariate model with this
exogenous information is evaluated.
Benchmarking. This section summarizes the datasets used to carry out the experiments. As previously
mentioned, the evaluation is based on the data provided by the Spanish Ministry of Health, which publishes
several variables for all Spanish regions (19 regions in total). Among them, we may highlight total cases in the
last 24 h, 14-day cumulative incidence and 7-day cumulative incidence. The information is provided by the
regional governments, which report daily, except on weekends and holidays, to the Spanish Ministry of Health,
which in turn prepares a report on the current COVID-19 situation in Spain. It is important to note that the
information is updated backwards when new notifications arrive from previous days, mainly due to delays, error
detection, etc. Therefore, we focus on the more stable notification period (i.e. 14 days), as it includes all
previous notifications. In particular, we focus on estimating the 14-day cumulative incidence, i.e. the number of
new cases of COVID-19 during 14 days divided by the size of the population at the start of the period.
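The definition above can be written as a rolling sum. The sketch below is a minimal illustration; the scaling per 100,000 inhabitants and the use of a single fixed population figure are assumptions of this example, and the function name and input numbers are hypothetical.

```python
def cumulative_incidence(daily_cases, population, window=14, per=100_000):
    """Rolling `window`-day sum of new cases, expressed per `per` inhabitants."""
    values = []
    for end in range(window, len(daily_cases) + 1):
        cases_in_window = sum(daily_cases[end - window:end])
        values.append(cases_in_window * per / population)
    return values


# Hypothetical example: 100 new cases every day in a region of one million people.
daily = [100] * 20
ci_14 = cumulative_incidence(daily, population=1_000_000)
```

With 20 days of input the function returns 7 values, one for each complete 14-day window.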
Of particular interest is the information from the surveillance system from July onwards, when the way the
Spanish Ministry of Health develops its strategy of early detection, monitoring and control of COVID-19
changed. Since then, the count of COVID-19 cases has been kept uniform, with slight changes and updates.
Table 1 shows the two different periods under study, which translate into two different datasets. For each
period, training and test sets have been designed to assess the different trend changes, as indicated in Table 1.
In particular, the first dataset (DS1) includes the information from July 20, 2020 to December 4, 2020. The
second dataset (DS2) includes the information from July 20, 2020 to December 18, 2020. In DS1, the models are
trained with the information up to and including November 29. Testing, however, is carried out using the data
of the week from November 30 to December 4. In DS2, the models are trained with the information up to and
including December 4. The evaluation is carried out with the data from December 5 to December 18, both
included.
It is important to note that the 14-day CI was decreasing in the DS1 test period (see Fig. 3). In contrast, the
14-day CI was decreasing at the beginning of the DS2 test period but suddenly started to increase from
December 11 onwards. Moreover, DS1 only includes 5 days to predict and DS2 includes 9 days.
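The two splits can be expressed as date ranges. In this sketch the boundary dates come from the text, while the container (a dict from date to 14-day CI) and the helper names are hypothetical. Calendar days are enumerated, so days without a report are simply skipped when they are absent from the series.

```python
from datetime import date, timedelta


def date_range(start, end):
    """All calendar days from start to end, both included."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


DATASETS = {
    "DS1": {"train": (date(2020, 7, 20), date(2020, 11, 29)),
            "test": (date(2020, 11, 30), date(2020, 12, 4))},
    "DS2": {"train": (date(2020, 7, 20), date(2020, 12, 4)),
            "test": (date(2020, 12, 5), date(2020, 12, 18))},
}


def split(series, name):
    """Split a {date: value} series into train and test lists for one dataset."""
    parts = {}
    for part, (start, end) in DATASETS[name].items():
        parts[part] = [(d, series[d]) for d in date_range(start, end) if d in series]
    return parts


# Hypothetical series with a constant placeholder value on every calendar day.
series = {d: 200.0 for d in date_range(date(2020, 7, 20), date(2020, 12, 18))}
ds1 = split(series, "DS1")
ds2 = split(series, "DS2")
```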
Figure 1. Diagram of a GRU and an LSTM unit, where $x_t$ represents the input and $y_t$ the forecast at a given
step ($y_{t-1}$ for the forecast at the previous step). For LSTM, $C_t$ indicates the state that is passed from one
LSTM unit to another.

Moreover, the metrics used for testing the performance of each model are the coefficient of determination
(R²), the root-mean-square error (RMSE) and the mean absolute error (MAE). All of them are calculated using
the scikit-learn metrics package57. The best possible score for R² is 1.0. A constant model that always predicts
the expected value of y, regardless of the input features, would get an R² score of 0.0.
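For reference, the three metrics can be restated in a few lines of plain Python. This is a sketch of the standard definitions rather than the study's code; scikit-learn provides equivalent functions in `sklearn.metrics` (`r2_score`, `mean_squared_error` and `mean_absolute_error`). The numbers in the example are invented.

```python
import math


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root-mean-square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented values, only to exercise the functions.
y_true = [1.0, 2.0, 3.0]
y_pred = [2.0, 2.0, 5.0]
scores = (r2_score(y_true, y_pred), rmse(y_true, y_pred), mae(y_true, y_pred))
```

Under this definition R² is not bounded below by zero: a forecast that is worse than the constant mean prediction, as in the invented example, yields a negative score.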
The models obtained have been previously validated and tested using different configurations. For ARIMA-based
models, several (p, d, q) parameter sets were tested, including (1, 1, 1), (3, 1, 3), (6, 1, 6), (1, 2, 1),
(3, 2, 3) and (6, 2, 6). For AR-based models, the best performing configurations were those with p = 1, 3 and 6.
Finally, Table 2 shows the configurations for the GRU and LSTM neural networks that were included in the
evaluation. These parameters were empirically determined after several experiments.
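Table 2 lists a delay sequence of 6 for both networks. One plausible reading, assumed here and not confirmed by the text, is that each training sample consists of the 6 previous values of the series and the target is the next value. The helper name `make_windows` and the input series are hypothetical.

```python
def make_windows(series, delay=6):
    """Build (input window, next value) pairs from a univariate series."""
    inputs, targets = [], []
    for i in range(delay, len(series)):
        inputs.append(series[i - delay:i])  # the `delay` values preceding step i
        targets.append(series[i])           # the value to be forecast
    return inputs, targets


# Placeholder series 0..9; a real run would use the 14-day CI values.
windows, targets = make_windows(list(range(10)), delay=6)
```

The same lag structure underlies an AR(p) model, where p plays the role of the delay.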
Finally, two well-known time series libraries have been included for comparison purposes, i.e. PROPHET58
and TPOT59. Prophet is a Python-based library developed by Facebook which, according to its authors, “aims
at forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly,
and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and
several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically
handles outliers well”. TPOT is also a Python-based automated ML tool that optimizes ML pipelines using
genetic programming. TPOT explores many configurations of models and pipelines to find the best one for the
target data. The main output of TPOT is the Python code of the best pipeline it has found for the data. These
methods have been successfully applied to COVID-19 prediction in different countries such as India, Brazil or
the UK60,61.
14-day CI estimation. Tables 3 and 4 show the R², RMSE and MAE scores for the different ML and statistical
models targeted in this study using the evaluation environment previously described in the “Benchmarking”
section. Let us remind the reader that the main difference between both datasets is the test set. DS1 develops
the prediction in a shorter time series (i.e. 1 week) but with a stable trend (i.e. a decreasing time series).
DS2 develops the prediction in a longer time series (i.e. 2 weeks) but with an unstable trend (i.e. an
increasing and decreasing time series).
Table3 shows the performance of those algorithms when they target the DS1 dataset. In general, articial
neural networks models do not work well for predicting 14-day CI. e dataset includes 1 data item per day,
which means a total of data for the largest dataset of up to 109 data items. erefore, there is not enough informa-
tion to train the articial neural network models for a good inference. However, statistical models perform very
well in general. e best performing model for the DS1 is the ARIMA with the parameter set up
p=3
,
d=1
,
q=3
, reaching up to 0.99
R2
score, with an RMSE of 4.48 and MAE of 3.90. ese results are slightly improved
with our ensemble approach, reaching up to 0.99
R2
, with an RMSE of 4.16 and MAE of 3.55. Figure4a shows
graphically the actual data and the prediction made by the ensemble for dataset 1.
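One reading of the consensus rule in Eq. (4) is to average, at each forecast step, the predictions of the model with the highest R², the model with the lowest RMSE and the model with the lowest MAE. The sketch below follows that reading. The model names, scores and forecasts are hypothetical, and the assumption that selection scores are available before forecasting is not taken from the text.

```python
def ensemble_forecast(scores, forecasts):
    """Average the forecasts of the best model under each of the three metrics."""
    best_r2 = max(scores, key=lambda m: scores[m]["r2"])
    best_rmse = min(scores, key=lambda m: scores[m]["rmse"])
    best_mae = min(scores, key=lambda m: scores[m]["mae"])
    chosen = [best_r2, best_rmse, best_mae]  # the same model may appear twice
    horizon = len(forecasts[best_r2])
    combined = [sum(forecasts[m][t] for m in chosen) / 3 for t in range(horizon)]
    return combined, chosen


# Hypothetical scores and two-step forecasts for three candidate models.
scores = {
    "model_a": {"r2": 0.95, "rmse": 6.0, "mae": 5.0},
    "model_b": {"r2": 0.90, "rmse": 4.0, "mae": 4.5},
    "model_c": {"r2": 0.92, "rmse": 5.0, "mae": 3.0},
}
forecasts = {
    "model_a": [210.0, 204.0],
    "model_b": [207.0, 201.0],
    "model_c": [204.0, 198.0],
}
combined, chosen = ensemble_forecast(scores, forecasts)
```

When one model wins on all three metrics, the average collapses to that model's forecast.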
Figure2. Outline of the proposed ensemble approach.
Table 1. Datasets for training and testing ML algorithms. ey include dierent periods with dierent spatio-
temporal characteristics.
Dataset name DS1 DS2
Training period July, 20–November, 29 July, 20–December, 4
Testing period November, 30–December, 4 December, 5–December, 18
Testing period trend Decreasing Decreasing–increasing
Table4 shows the performance of targeted models for the DS2 dataset. e results are signicantly worse than
those shown in the Table3. DS2 is more challenging as for the features previously commented (i.e. longer period
and unstable trend). Again, our ensemble approach achieves the best performance of all models but, in this case,
it only achieves up to 0.62
R2
score, with an RMSE score of 6.84 and MAE score of 5.49. It is important to note
that the ensemble approach takes the results of the AR(3) method as the other methods are signicantly worse
in terms of MAE and RMSE. Moreover, these tests revealed that the prediction of 14-day CI with only historical
information performs well for short periods and, above all, clearly marked tendencies. Change in trends due to
irregular components are very dicult to predict only using endogenous information and therefore, to improve
our forecast for this scenario, we propose the inclusion of an exogenous variable that allows the prediction of
these changes in tendency over long periods. Figure4 shows graphically the actual data and the prediction made
by the ensemble for dataset 2.
Exogeneity evaluation and multivariate model. The inclusion of exogenous variables into the multivariate
model requires a preliminary study of the relationship between the 14-day CI and the mobility variables.
For that purpose, Spearman’s correlation between the 14-day CI and the Google mobility variables has first
been calculated under different scenarios. Table 5 shows Spearman’s correlation between the 14-day CI and
different lags of the mobility time series.
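A lagged Spearman correlation can be sketched in plain Python as the Pearson correlation of the ranks, after shifting the mobility series back by the chosen number of days. The function names and toy series below are hypothetical. Lags are given as non-negative integers here, so a lag of 5 in the code corresponds to the row labelled −5 in Table 5.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def lagged_spearman(ci, mobility, lag):
    """Spearman correlation between ci[t] and mobility[t - lag], with lag >= 0."""
    shifted_mobility = mobility[:len(mobility) - lag]
    aligned_ci = ci[lag:]
    return pearson(ranks(shifted_mobility), ranks(aligned_ci))


# Toy series: CI mirrors mobility from two days earlier with the opposite sign.
mobility = [5, 3, 8, 1, 9, 2, 7, 4]
ci = [0, 0] + [-m for m in mobility[:-2]]
rho = lagged_spearman(ci, mobility, lag=2)
```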
e analysis in Table5 indicates that most mobility variables have a relevant correlation with 14-day CI,
especially retail and recreation, parks and public transport. Interestingly, leisure-related mobility variables, i.e.
retail and recreation and parks, have a negative correlation with CI while non-leisure mobility variables have a
Figure3. 14-day cumulative incidence (CI) in Spain. e evaluation dates are highlighted to let the reader
know the trend of 14-day CI at that period.
Table 2. Parameter setup for GRU and LSTM ANNs.
Parameter LSTM GRU
Number of input neurons 70 70
Batch size 32 32
Number of epochs 600 600
Learning factor 0.001 0.001
Optimizer Adam Adam
Activation function Hyperbolic tangent Hyperbolic tangent
Loss function Mean squared error Mean squared error
Delay sequence 6 6
Content courtesy of Springer Nature, terms of use apply. Rights reserved
9
Vol.:(0123456789)
Scientic Reports | (2021) 11:15173 | https://doi.org/10.1038/s41598-021-94696-2
www.nature.com/scientificreports/
positive correlation. Additionally, it is worth highlighting that the two situations are distinguished. If the correla-
tion between 14-day CI and a mobility variable (in absolute value) grows as the lags of the exogenous variable
increases, past values of the mobility variable have a more signicant association with current cumulative inci-
dence than recent ones. In contrast, if correlation decreases as the number of lags augments, the corresponding
mobility variable might be considered either not signicantly associated with 14-day CI or more signicantly
related with 14-day CI for recent values of the mobility variable. is underscores a pragmatic limitation of
univariate models, in that available exogenous variables cannot be used to forecast changes in 14-day CI curve
trend such as an uptick in new coronavirus cases.
Nevertheless, in practice, the establishment of causal statements between series of observations is not
straightforward. Our interest is to examine whether the mobility time series help to predict future values of
the 14-day CI, controlling for lags. Table 6 reports Granger causality test outcomes for different lag orders,
analysing whether past values of the mobility variables provide additional information about the 14-day CI
beyond past values of the 14-day CI itself.
From the results in Table6, the eect of lags of mobility variables retail and recreation, parks and public
transport on 14-day CI is highly signicant whatever the number of lags is. e stationarity of the variables was
previously checked using the Augmented Dickey-Fuller test via the adf.test function in R. Bearing this in mind,
according to WHO, the incubation period of COVID-19 is on average 5–6 days but can be as long as 14 days,
lags have been considered varying from 5 to 14 days. However, it is important to note that too few lags can lead
to a biased test due to residual autocorrelation whereas with too many, null hypothesis might be incorrectly
rejected because of spurious correlation. erefore, the number of lags that need to be chosen reaching is a
tradeo between bias and power. en, it can be concluded that these three mobility variables are predictive of
future cumulative incidence gures.
Table 3. 14-day CI accuracy prediction for the first dataset. Training from July 20, 2020 to November 29,
2020; prediction from November 30 to December 4.

Model                R² score    RMSE score    MAE score
GRU                  0.96        92.90         91.49
LSTM                 0.86        109.91        108.72
AR (1)               > 0.99      37.82         33.01
AR (3)               0.99        6.28          5.61
AR (6)               > 0.99      13.30         13.10
ARIMA (1, 1, 1)      > 0.99      10.67         10.54
ARIMA (3, 1, 3)      0.99        4.48          3.90
ARIMA (6, 1, 6)      0.99        4.96          3.72
ARIMA (1, 2, 1)      > 0.99      16.71         16.04
ARIMA (3, 2, 3)      > 0.99      7.96          7.86
ARIMA (6, 2, 6)      > 0.99      11.08         10.62
Ensemble approach    > 0.99      4.16          3.55
PROPHET              0.99        39.54         36.89
TPOT                 0.99        30.94         28.37

Table 4. 14-day CI accuracy prediction for the second dataset. Training from July 20, 2020 to December 4,
2020; prediction from December 5 to December 18.

Model                R² score    RMSE score    MAE score
GRU                  0.59        15.16         11.43
LSTM                 0.65        27.18         25.03
AR (1)               0.07        44.79         42.48
AR (3)               0.62        6.84          5.49
AR (6)               0.16        35.11         26.94
ARIMA (1, 1, 1)      0.10        46.21         35.17
ARIMA (3, 1, 3)      0.11        38.50         27.45
ARIMA (6, 1, 6)      0.11        40.41         29.56
ARIMA (1, 2, 1)      0.06        67.44         52.50
ARIMA (3, 2, 3)      0.06        54.76         39.28
ARIMA (6, 2, 6)      0.06        56.33         42.57
Ensemble approach    0.62        6.84          5.49
PROPHET              0.74        20.08         13.21
TPOT                 0.01        41.72         31.37
Reciprocally, Granger causality tests analysing whether 14-day CI values help to predict future values of the
mobility variables have been run, and the corresponding p-values are shown in Table 7. According to these
results, the 14-day CI is highly significant for retail and recreation at every lag order and, in general, for
the rest of the mobility variables from a lag length of 8. In other words, the 14-day CI is predictive of the
mobility variables over a period of about a week from current values. This finding is consistent with the
incubation period; however, these results should be interpreted cautiously. An increase in new coronavirus cases
is bound to force government intervention and the application of measures aimed at restricting citizens’
mobility. Likewise, a decline of the 14-day CI curve would lead to social relaxation, which would translate into
an increase in mobility.
As a result, reverse or bidirectional causation may be present in our problem. Therefore, we cannot conclude
that mobility variables potentially cause future values of the 14-day CI. Moreover, government containment
measures affecting mobility, nightclubs or bars, and other factors such as social alarm, also involve changes
in 14-day CI trends. Thus, there may be latent confounders correlated with the 14-day CI that underlie the true
cause of the evolution of new coronavirus cases. Hence, making a strong causal statement is hard. Our intention,
however, was less ambitious: to shed light on which mobility variables are useful for predicting the 14-day CI.
Figure 4. 14-day CI accuracy prediction for both datasets.

Table 5. Spearman’s correlation between 14-day CI and Google mobility variables for different lags in the
mobility time series.

Lags    Retail and recreation    Supermarket and pharmacy    Parks    Public transport    Workplaces    Residential
0       −0.42                    0.28                        −0.59    0.38                0.23          0.32
−5      −0.39                    0.21                        −0.53    0.35                0.14          0.25
−6      −0.38                    0.22                        −0.52    0.36                0.14          0.24
−7      −0.37                    0.21                        −0.51    0.36                0.14          0.22
−8      −0.35                    0.21                        −0.50    0.37                0.14          0.21
−9      −0.34                    0.21                        −0.48    0.37                0.14          0.20
−10     −0.32                    0.22                        −0.47    0.37                0.14          0.19
−11     −0.30                    0.22                        −0.46    0.38                0.13          0.18
−12     −0.28                    0.22                        −0.44    0.39                0.13          0.17
−13     −0.27                    0.22                        −0.43    0.39                0.13          0.15
−14     −0.25                    0.23                        −0.42    0.40                0.13          0.13

Based on this preliminary study, the results obtained by our ensemble approach, together with the retail and
recreation, parks and public transport time series, will be used hereafter as explanatory variables to develop
a multivariate model where the 14-day CI is the response variable. Because the average incubation period of
COVID-19 outlined by the WHO lasts a minimum of 5 days, the selected mobility variables will be lagged by 5
periods. Furthermore, the Google mobility variables will be standardised and rescaled to the last three days of
the 14-day CI before predictions are made, in order to provide meaningful information to the model.
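The lagging and standardisation steps can be sketched as below. The helper names and toy values are hypothetical, and z-scoring with the population standard deviation is an assumption. The subsequent rescaling to the last three days of the 14-day CI is not reproduced, because the text does not specify the exact rule.

```python
import statistics


def standardise(values):
    """Z-scores: subtract the mean and divide by the standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # must be non-zero for the division below
    return [(v - mean) / sd for v in values]


def align_with_lag(ci, mobility, lag=5):
    """Pair ci[t] with mobility[t - lag]; both outputs have the same length."""
    return ci[lag:], mobility[:len(mobility) - lag]


# Toy series of equal length (placeholders for 14-day CI and one mobility variable).
ci_toy = [float(v) for v in range(10)]
mobility_toy = [float(v) for v in range(100, 110)]

ci_aligned, mobility_lagged = align_with_lag(ci_toy, mobility_toy, lag=5)
mobility_z = standardise(mobility_lagged)
```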
Finally, a principal component analysis (PCA) is computed considering these variables. Table 8 indicates that
two components would preserve more than 87% of the total variance in the original data. In other words, two
components explain more than 87% of the information provided by the exogenous variables. Figure 5 graphically
illustrates that the mobility variables are clearly differentiated from the ensemble approach in the PCA. Thus,
the mobility variables would provide additional information to the proposed multivariate model.
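The 87% figure follows directly from the eigenvalues reported in Table 8: each component's share is its eigenvalue divided by the sum of all eigenvalues. The short sketch below redoes that arithmetic; the function name is hypothetical, and small differences from the table are expected because the published eigenvalues are rounded.

```python
EIGENVALUES = [2.91, 0.597, 0.391, 0.099]  # values reported in Table 8


def explained_variance(eigenvalues):
    """Per-component and cumulative proportion of variance, in percent."""
    total = sum(eigenvalues)
    proportions = [100 * ev / total for ev in eigenvalues]
    cumulative, running = [], 0.0
    for p in proportions:
        running += p
        cumulative.append(running)
    return proportions, cumulative


proportions, cumulative = explained_variance(EIGENVALUES)
```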
In particular, this paper includes an optimization model aimed at improving forecasts of the 14-day CI time
series which uses multivariate regression as a starting point. Table 9 shows the regression outcomes obtained
for the DS2 training period. The coefficient estimates and standard errors are calculated. The p-value
corresponding to the t-statistic of each coefficient indicates whether there is a significant relationship
between the response variable (14-day CI) and each of the predictors included in the model (ensemble and
mobility variables). Table 10 shows the results obtained by the NLM method for the different seed values
previously described in the “Methods” section, i.e. the MAE and the number of iterations performed by the
procedure in each case. It is important to highlight that when the seed of NLM is the set of coefficients
randomly generated from a uniform distribution from −10 to 10, the NLM algorithm is executed 10 times and the
MAE and number of iterations in Table 10 are calculated as the average over the 10 simulation runs. As can be
seen, the best result, an MAE of 3.77, is reached after 36 iterations of the algorithm, when the NLM procedure
uses the multivariate regression model as seed.

Table 6. Granger causality tests of whether mobility variables are predictive of 14-day CI, for different lag
orders (p-values).

Lags    Retail and recreation    Supermarket and pharmacy    Parks    Public transport    Workplaces    Residential
5       0.03                     0.72                        <0.01    <0.01               0.52          0.16
6       0.01                     0.66                        0.01     <0.01               0.17          0.22
7       0.01                     0.70                        0.02     <0.01               0.18          0.28
8       0.03                     0.61                        0.08     <0.01               0.17          0.37
9       <0.01                    0.49                        0.17     <0.01               0.13          0.14
10      0.02                     0.78                        <0.01    <0.01               0.19          0.30
11      <0.01                    0.32                        0.01     <0.01               0.31          0.32
12      0.01                     0.35                        <0.01    <0.01               0.35          0.29
13      <0.01                    0.15                        0.01     <0.01               0.19          0.19
14      <0.01                    0.04                        0.01     <0.01               0.21          0.01

Table 7. Granger causality tests of whether 14-day CI is predictive of mobility variables, for different lag
orders (p-values).

Lags    Retail and recreation    Supermarket and pharmacy    Parks    Public transport    Workplaces    Residential
5       <0.01                    0.20                        0.38     0.26                0.05          0.25
6       <0.01                    0.35                        0.10     0.17                0.31          0.08
7       <0.01                    0.01                        0.21     0.03                <0.01         <0.01
8       <0.01                    0.02                        0.04     <0.01               <0.01         <0.01
9       0.01                     <0.01                       0.12     <0.01               <0.01         <0.01
10      0.01                     <0.01                       0.13     <0.01               <0.01         <0.01
11      0.03                     0.01                        0.12     <0.01               <0.01         <0.01
12      0.05                     0.01                        0.16     <0.01               <0.01         <0.01
13      0.02                     0.02                        0.17     <0.01               <0.01         <0.01
14      0.03                     0.04                        0.30     <0.01               <0.01         <0.01

Table 8. Eigenvalues and proportion of variance (i.e. information) explained by each component in the PCA.

Number of components    Eigenvalues    Proportion of variance (%)    Cumulative proportion (%)
1                       2.91           72.83                         72.83
2                       0.597          14.93                         87.76
3                       0.391          9.77                          97.52
4                       0.099          2.48                          100
Once the MAE has been minimized, Table 11 presents the 14-day CI predictions for the evaluation period from
December 5 to December 18 using the multivariate model with the optimal coefficient values obtained by NLM for
the minimum MAE. It is important to remark that, if the exogenous variables are not extended, 14-day CI
forecasts are restricted to a five-period prediction horizon. Nonetheless, forecasts in the evaluation period
have been obtained using the observed past values of the mobility variables. This approach might not be
realistic, but the purpose of the study is to validate the performance of the model using mobility data against
other ML methods that do not include this exogenous information. To assess the accuracy of the model, the mean
absolute error is measured and a comparison is made with the predictions given by the univariate strategy in
the ensemble approach. In addition, Fig. 6 shows the true 14-day CI curve together with the ensemble approach
and multivariate predicted values throughout the forecast horizon. It is noteworthy that the multivariate model
substantially outperforms the ensemble approach. The results also suggest that both models produce reasonably
good estimates, but the multivariate model tracks changing trends in the 14-day CI better.
To conclude, it is interesting to note that the predictions made from December 16 to December 18 (labeled 12,
13 and 14 in Fig. 5), when a new uptick in coronavirus infections and hospitalizations began, are located in the
exogenous area of the PCA plot, meaning that for these values the mobility variables have a higher impact.
Again, these results show that exogenous variables offer valuable information to cope with trend changes in
the 14-day CI curve and justify the use of a multivariate model.
Figure 5. PCA biplot of the ensemble approach and mobility variables (variables: ENSEMBLE, TRANSPORT, PARKS
and RETAIL; observations labeled 1 to 14; axes: Dim1 (74%) and Dim2 (16%)). Positively correlated variables
point to the same side of the plot. Negatively correlated variables point to opposite sides of the plot.
Discussion
The use of a regression model entails the acceptance of assumptions that may be questionable at best in the
context of time series data. Methodologically, this approach is flawed mainly because accuracy may be seriously
affected in the presence of autocorrelation. Furthermore, difficulties in data collection due to discrepancies
in regional notifications and differences in the COVID-19 medical tests carried out add to the statistical
problems, which are compounded when data include measurement error. In view of the foregoing, this multivariate
technique cannot be used as an inference method. However, the use of an operations research optimization method
such as NLM, with the regression coefficients as a seed, improves the solution obtained by the univariate
model. Evidently, this option has its own drawbacks, such as the problem of falling into local optima or the
need to set good initial values for the solver.
The ensemble approach rendered a smoother curve that could not detect trend changes. Indeed, the results
provided by the ensemble approach reinforce the need for monitoring models that can also detect changes in
trend with some foresight. Accordingly, despite the potential limitations mentioned above, the proposed
multivariate approach can be gainfully used for predicting possible upticks in COVID-19 cases, at least in the
short term. Therefore, the inclusion of the two models within a decision support system provides a positive
result that covers the different types of data behavior, both when the trend is constant and when it changes.
In this system, depending on the error produced by each model when a new value to predict is introduced, either
the ensemble approach or the multivariate approach will be selected.
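The selection rule can be sketched as a comparison of recent errors. The text does not say how the error is measured or over which period, so the three-day window, the use of MAE and the function name below are all assumptions of this example, as are the numbers.

```python
def select_model(actual, ensemble_preds, multivariate_preds, window=3):
    """Name of the model with the lower MAE over the last `window` observed days."""
    def recent_mae(preds):
        pairs = list(zip(actual[-window:], preds[-window:]))
        return sum(abs(a - p) for a, p in pairs) / len(pairs)

    ensemble_error = recent_mae(ensemble_preds)
    multivariate_error = recent_mae(multivariate_preds)
    return "ensemble" if ensemble_error <= multivariate_error else "multivariate"


# Hypothetical trend change: the ensemble stays flat while the true CI rises.
changing = select_model([200.0, 205.0, 210.0],
                        [200.0, 200.0, 200.0],
                        [201.0, 204.0, 209.0])

# Hypothetical stable trend: the ensemble tracks the series exactly.
stable = select_model([200.0, 199.0, 198.0],
                      [200.0, 199.0, 198.0],
                      [202.0, 201.0, 200.0])
```

Ties go to the ensemble in this sketch, which is an arbitrary choice.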
Conclusions and future work
COVID-19 has caused one of the biggest crises in our recent history. Most countries have developed monitoring
systems based on pandemic evolution indicators to trigger social distancing measures whenever significant
increases in infections are detected. Data analysis can help forecast the short- and medium-term evolution of
the pandemic and thus help policymakers in their decision making. In this paper, we have analysed the evolution
of the 14-day cumulative incidence in Spain from the beginning of the second wave of COVID-19 until January
2021, where several trend changes (also called waves) occurred. We have proposed a set of statistical and ML
models to achieve maximum performance, reaching very good results for short and stable periods. However, the
14-day CI is affected by irregular components, which are very challenging scenarios for traditional models
using only historical information. Therefore, the mobility data provided by Google as a consequence of the
COVID-19 outbreak are fed into our models as exogenous information to predict these irregular components. Our
results reveal that this information improves the prediction in this unstable scenario, providing an MAE of up
to 1.08 on average.

Table 9. Multivariate regression for the DS2 training period. R² = 0.79, p-value < 0.01.

Coefficients                  Estimate    Std. Error    p-value
β0 (Independent)              −110.59     259.76        0.68
β1 (Ensemble)                 1.31        0.26          <0.01
β2 (Retail and recreation)    1.00        0.59          0.13
β3 (Parks)                    −0.20       1.29          0.88
β4 (Public transport)         −0.60       0.63          0.37

Table 10. MAE achieved and iterations performed by the NLM procedure using different seeds.

Seed              Avg. of 10 random runs    Weighted equally    Multivariate regression model
MAE               4.66                      4.06                3.77
NLM iterations    50                        46                  36

Table 11. 14-day CI accuracy prediction for the ensemble approach (EA) and the NLM method (NLM). Training
from July 20, 2020 to December 4, 2020; prediction from December 5 to December 18.

Date           14-day CI    CI Ensemble    CI NLM    MAE EA    MAE NLM
December 5     226.39       226.08         225.10    3.14      0.31
December 6     216.07       216.28         214.58    1.83      0.26
December 7     207.52       202.21         204.94    2.46      1.94
December 8     201.59       205.76         204.93    3.18      2.50
December 9     193.26       205.11         202.78    4.62      4.37
December 10    188.72       197.11         197.34    5.92      5.04
December 11    189.56       197.94         195.49    6.48      5.52
December 12    194.19       194.19         193.76    6.19      4.83
December 13    196.61       193.09         191.53    5.64      4.68
December 14    193.65       190.11         188.13    5.50      4.57
December 15    198.77       198.64         195.77    5.04      4.16
December 16    201.16       202.87         201.91    4.79      3.96
December 17    207.26       201.91         202.32    4.96      4.07
December 18    214.12       214.11         210.12    5.49      3.78
Data fusion between socio-economic and endogenous variables is still at a relatively early stage, and we
acknowledge that we have tested a relatively simple variant of a multivariate model. However, with many other
types of multivariate models and data such as vaccination figures yet to be explored, this field seems to offer
a promising and potentially fruitful area of research. Moreover, this approach can be followed at the
international level to predict changes in trends and coordinate the response to the pandemic globally.
Received: 28 April 2021; Accepted: 9 July 2021
References
1. Cecilia, J. M., Cano, J.-C., Hernández-Orallo, E., Calafate, C. T. & Manzoni, P. Mobile crowdsensing approaches to address the
COVID-19 pandemic in Spain. IET Smart Cities 2, 58–63 (2020).
2. Kissler, S. M., Tedijanto, C., Goldstein, E., Grad, Y. H. & Lipsitch, M. Projecting the transmission dynamics of SARS-CoV-2 through
the postpandemic period. Science 368, 860–868 (2020).
3. Bonaccorsi, G. et al. Economic and social consequences of human mobility restrictions under COVID-19. Proc. Natl. Acad. Sci. 117,
15530–15535 (2020).
4. OECD & Staff, O. OECD Economic Outlook, vol. 2020 (OECD Publishing, 2020).
5. Organization, W. H. et al. Critical Preparedness, Readiness and Response Actions for COVID-19: Interim Guidance, 4 Nov 2020,
World Health Organization, Technical Report (2020).
6. Organization, W. H. et al. Public Health Surveillance for COVID-19: Interim Guidance, 16 Dec 2020, World Health Organization,
Technical Report (2020).
Figure 6. 14-day CI accuracy prediction for different estimated models (x-axis: days; y-axis: 14-day CI;
series: 14-day CI, CI Ensemble and CI Multivariate).
7. Han, E. et al. Lessons learnt from easing covid-19 restrictions: An analysis of countries and regions in Asia Pacic and Europe.
Lancet (2020).
8. Z aki, N. & Mohamed, E. A. e estimations of the covid-19 incubation period: A scoping reviews of the literature. J. Infect. Public
Health 14, 638–646 (2021).
9. Zoabi, Y., Deri-Rozov, S. & Shomron, N. Machine learning-based prediction of covid-19 diagnosis based on symptoms. NPJ Dig.
Med. 4, 1–5 (2021).
10. Hellewell, J. et al. Feasibility of controlling covid-19 outbreaks by isolation of cases and contacts. Lancet Global Health 8, e488–e496
(2020).
11. Ferretti, L. et al. Quantifying sars-cov-2 transmission suggests epidemic control with digital contact tracing. Science (2020).
12. Flaxman, S. et al. Estimating the eects of non-pharmaceutical interventions on covid-19 in Europe. Nature 584, 257–261 (2020).
13. Estrada, E. Covid-19 and sars-cov-2. Modeling the present, looking at the future. Phys.Rep. 869, 1–51 (2020).
14. Maier, B. F. & Brockmann, D. Eective containment explains subexponential growth in recent conrmed covid-19 cases in china.
Science 368, 742–746 (2020).
15. Wong, G. N. et al. Modeling covid-19 dynamics in illinois under nonpharmaceutical interventions. Phys. Rev. X 10, 041033 (2020).
16. Hernández-Orallo, E., Manzoni, P., Calafate, C. T. & Cano, J. Evaluating how smartphone contact tracing technology can reduce
the spread of infectious diseases: e case of covid-19. IEEE Access 8, 99083–99097 (2020).
17. Hernández-Orallo, E., Manzoni, P., Calafate, C. T. & Cano, J. Evaluating the eectiveness of covid-19 bluetooth-based smartphone
contact tracing applications. Appl. Sci. 10, 7113 (2020).
18. Khakharia, A. et al. Outbreak prediction of covid-19 for dense and populated countries using machine learning. Ann. Data Sci. 8,
1–19 (2021).
19. Lalmuanawma, S., Hussain, J. & Chhakchhuak, L. Applications of machine learning and articial intelligence for covid-19 (sars-
cov-2) pandemic: A review. Chaos Solitons Fractals 139, 110059 (2020).
20. Rustam, F. et al. Covid-19 future forecasting using supervised machine learning models. IEEE Access 8, 101489–101499 (2020).
21. Chimmula, V. K. R . & Zhang, L. Time series forecasting of covid-19 transmission in Canada using ISTM networks. Chaos Solitons
Fractals 135, 109864 (2020).
22. Hernandez-Matamoros, A., Fujita, H., Hayashi, T. & Perez-Meana, H. Forecasting of covid19 per regions using arima models and
polynomial functions. Appl. So Comput. 96, 106610 (2020).
23. Benvenuto, D., Giovanetti, M., Vassallo, L., Angeletti, S. & Ciccozzi, M. Application of the Arima model on the covid-2019 epidemic
dataset. Data Brief 29, 105340 (2020).
24. Perone, G. An arima model to forecast the spread and the final size of covid-2019 epidemic in italy. medRxiv (2020).
25. Sahai, A. K., Rath, N., Sood, V. & Singh, M. P. Arima modelling and forecasting of covid-19 in top five affected countries. Diabetes
Metab. Syndr. 14, 1419–1427 (2020).
26. Castro, M., Ares, S., Cuesta, J. A. & Manrubia, S. The turning point and end of an expanding epidemic cannot be precisely forecast.
Proc. Natl. Acad. Sci. 117, 26190–26196 (2020).
27. Petropoulos, F., Makridakis, S. & Stylianou, N. Covid-19: Forecasting confirmed cases and deaths with a simple time series model.
Int. J. Forecast. (2020).
28. Shahid, F., Zameer, A. & Muneeb, M. Predictions for covid-19 with deep learning models of LSTM, GRU and Bi-LSTM. Chaos
Solitons Fractals 140, 110212 (2020).
29. Zeroual, A., Harrou, F., Dairi, A. & Sun, Y. Deep learning methods for forecasting covid-19 time-series data: A comparative study.
Chaos Solitons Fractals 140, 110212 (2020).
30. Ribeiro, M. H. D. M., da Silva, R. G., Mariani, V. C. & dos Santos Coelho, L. Short-term forecasting covid-19 cumulative confirmed
cases: Perspectives for Brazil. Chaos Solitons Fractals 135, 109853 (2020).
31. Linka, K., Peirlinck, M. & Kuhl, E. The reproduction number of covid-19 and its correlation with public health interventions.
Comput. Mech. 66, 1035–1050 (2020).
32. Kraemer, M. U. et al. The effect of human mobility and control measures on the covid-19 epidemic in China. Science 368, 493–497
(2020).
33. Buckee, C. O. et al. Aggregated mobility data could help fight covid-19. Sci. (N. Y.) 368, 145 (2020).
34. Hernando, A., Mateo, D., Bayer, J. & Barrios, I. Radius of gyration as predictor of covid-19 deaths trend with three-weeks offset.
medRxiv (2021).
35. Gonzalez, M. C., Hidalgo, C. A. & Barabasi, A.-L. Understanding individual human mobility patterns. Nature 453, 779–782 (2008).
36. Cot, C., Cacciapaglia, G. & Sannino, F. Mining google and apple mobility data: Temporal anatomy for covid-19 social distancing.
Sci. Rep. 11, 4150 (2021).
37. Nouvellet, P. et al. Reduction in mobility and covid-19 transmission. Nat. Commun. 12, 1090 (2021).
38. Kraemer, M. U. G. et al. Data curation during a pandemic and lessons learned from covid-19. Nat. Comput. Sci. 1, 9–10 (2021).
39. Palit, A. K. & Popovic, D. Computational Intelligence in Time Series Forecasting: Theory and Engineering Applications (Springer
Science & Business Media, 2006).
40. Guillén-Navarro, M. A. et al. A decision support system for water optimization in anti-frost techniques by sprinklers. Sensors 20,
7129 (2020).
41. Tavenard, R. et al. Tslearn, a machine learning toolkit for time series data. J. Mach. Learn. Res. 21, 1–6 (2020).
42. deSanidad, M. Plan de respuesta temprana en un escenario de control de la pandemia por COVID-19 (Gobierno de España, 2020).
43. Cot, C., Cacciapaglia, G. & Sannino, F. Mining google and apple mobility data: Temporal anatomy for covid-19 social distancing.
Sci. Rep. 11, 1–8 (2021).
44. Nagelkerke, N. J. et al. A note on a general definition of the coefficient of determination. Biometrika 78, 691–692 (1991).
45. Chai, T. & Draxler, R. R. Root mean square error (RMSE) or mean absolute error (MAE). Geosci. Model Dev. Discuss. 7, 1525–1534
(2014).
46. Spearman, C. The proof and measurement of association between two things. (1961).
47. Granger, C. W. Investigating causal relations by econometric models and cross-spectral methods. Econometrica J. Econ. Soc. 424–438
(1969).
48. Jolliffe, I. Principal component analysis. Technometrics 45, 276 (2003).
49. Mills, T. C. & Mills, T. C. Time Series Techniques for Economists (Cambridge University Press, 1991).
50. Box, G. E., Jenkins, G. M., Reinsel, G. C. & Ljung, G. M. Time Series Analysis: Forecasting and Control (John Wiley & Sons, 2015).
51. Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
52. Cho, K. etal. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:
1406. 1078 (2014).
53. Homann, F., Bertram, T., Mikut, R., Reischl, M. & Nelles, O. Benchmarking in classication and regression. Wiley Interdiscip.
Rev. Data Mining Knowl. Discov. 9, e1318 (2019).
54. Huang, X., Li, Z., Jiang, Y., Li, X. & Porter, D. Twitter reveals human mobility dynamics during the covid-19 pandemic. PloS ONE
15, e0241957 (2020).
55. Yilmazkuday, H. Stay-at-home works to fight against covid-19: International evidence from google mobility data. J. Human Behav.
Soc. Environ. 31, 1–11 (2020).
56. Schnabel, R. B., Koonatz, J. E. & Weiss, B. E. A modular system of algorithms for unconstrained minimization. ACM Trans. Math.
Softw. (TOMS) 11, 419–440 (1985).
57. Pedregosa, F. et al. Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
58. Taylor, S. J. & Letham, B. Forecasting at scale. Am. Stat. 72, 37–45 (2018).
59. Le, T. T., Fu, W. & Moore, J. H. Scaling tree-based automated machine learning to biomedical big data with a feature set selector.
Bioinformatics 36, 250–256 (2020).
60. Indhuja, M. & Sindhuja, P. Prediction of covid-19 cases in India using prophet. Int. J. Stat. Appl. Math. 5, 103–106 (2020).
61. Han, T., Gois, F. N. B., Oliveira, R., Prates, L. R. & de Almeida Porto, M. M. Modeling the progression of covid-19 deaths using
kalman filter and automl. Soft Comput. 1–16 (2021).
Acknowledgements
This work has been partially supported by the Spanish Ministry of Science and Innovation, under Grants
RYC2018-025580-I, RTI2018-096384-B-I00, RTC-2017-6389-5 and RTC2019-007159-5, by the Fundación
Séneca del Centro de Coordinación de la Investigación de la Región de Murcia under Project 20813/PI/18, by
the “Conselleria de Educación, Investigación, Cultura y Deporte, Direcció General de Ciéncia i Investigació,
Proyectos AICO/2020”, Spain, under Grant AICO/2020/302 and a predoctoral contract by the Generalitat
Valenciana and the European Social Fund under Grant ACIF/2018/219.
Author contributions
Conceptualization, S.G.C. and J.L.E.; methodology, S.G.C. and J.L.E.; software, J.M.G., R.M.E., A.B.C., E.H.O.;
validation, S.G.C., J.L.E., R.M.E. and J.M.C.; formal analysis, S.G.C., R.H.S., R.M.E., J.L.E., A.B.C.; investigation,
S.G.C., R.H.S., J.L.E., R.M.E., A.B.C., E.H.O.; resources, S.G.C. and J.M.G.; data curation, S.G.C., J.M.G., R.H.S.
and R.M.E.; writing – original draft preparation, S.G.C., J.M.C.; writing – review and editing, J.M.G., E.H.O.;
visualization, J.M.G., R.H.S., A.B.C.; supervision, J.L.E. and J.M.C.; funding acquisition, J.M.C. All authors have
read and agreed to the published version of the manuscript.
Competing interests
The authors declare no competing interests.
Additional information
Correspondence and requests for materials should be addressed to J.M.C.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Open Access is article is licensed under a Creative Commons Attribution 4.0 International
License, which permits use, sharing, adaptation, distribution and reproduction in any medium or
format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the
Creative Commons licence, and indicate if changes were made. e images or other third party material in this
article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the
material. If material is not included in the article’s Creative Commons licence and your intended use is not
permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.
© e Author(s) 2021
... The second regression model, with the factors (principal components) resulted from PCA as independent variables and overnight stays as the dependent variable, confirmed the previous results, respectively, that for Romania, during the COVID-19 pandemic, the best predictor is Google Mobility parks. Our methodological approach confirms the results of Garcia-Cremades et al. [49] by using multivariate analysis for forecasting changes. Moreover, our research results from the regression models show the direction of the mobility behavior of people and/or tourists during the COVID-19 pandemic, out of restrictive areas (retail, grocery, pharmacy, transit, workplace), with a latent behavior toward local and/or national parks, public beaches, etc. ...
... Regarding the results inferred from outside research, this study also confirms the Pramana et al. [26] study results showing that the big data sources, such as Google Mobility data, are shown to be a good proxy to infer the impact of the pandemic on domestic tourism, despite the unpredictability of human behavior during the COVID-19 pandemic [49]. The results from this study could help explain the mobility patterns of park visits during the pandemic and predict the process of returning to normal. ...
Our exploratory research focuses on the possible relations between tourism and the mobility of people, using short longitudinal data for mobility dimensions during the COVID-19 pandemic. One of these is real-time, exhaustive type data, published by Google, about the mobility of people in six different dimensions, (retail, parks, residential, workplace, grocery, and transit). The aim is to analyze the directional, intensity, causal, and complex interplay between the statistical data of tourism and mobility data for Romanian counties. The main objective is to determine if real-world big data can be linked with tourism arrivals in the first 14 months of the pandemic. We have found, using correlations, factorial analysis (PCA), regression models, and SEM, that there are strong and/or medium relationships between retail and parks and overnights, and weak or no relations between other mobility dimensions (workplace, transit). By applying factorial analysis (PCA), we have regrouped the six Google Mobility dimensions into two new factors that are good predictors for Romanian tourism at the county location. These findings can help provide a better understanding of the relationship between the real movement of people in different urban areas and the tourism phenomenon: the GM parks dimension best predicts tourism indicators (over-nights), the GM residential dimension correlates inversely with the tourism indicator, and the rest of the GM indices are generally weak predictors for tourism. A more complex analysis could signal the potential and the character of tourism in different destinations, by territorially and chronologically determining the GM indices that are better linked with the tourism statistical indicators. Further research is required to establish forecasting models using Google Mobility data
... While the smart devices are GPS-tracked, the locations of the transaction data can be found at retail outlets, leisure facilities and other public amenities. Among the public datasets currently available, the Google COVID-19 Community Mobility Reports have been widely used for forecasting cases of infection and providing insights on how to use mobility characteristics efficiently (Wang and Yamamoto 2020;Bryant and Elofsson 2020;Achterberg et al. 2020;Schwabe et al. 2021;García-Cremades et al. 2021). Sufficient mobility records in both spatial and temporal dimensions enable the training of machine learning models that require large amount of data. ...
... Sufficient mobility records in both spatial and temporal dimensions enable the training of machine learning models that require large amount of data. Some research has adopted a number of statistical and machine learning models based on a recurrent neural network as well as an ensemble approach in order to predict trend changes in the 14-day cumulative incidence (García-Cremades et al. 2021). In this work, two datasets with similar training periods but different testing periods were used to compare the models. ...
At the beginning of 2022 the global daily count of new cases of COVID-19 exceeded 3.2 million, a tripling of the historical peak value reported between the initial outbreak of the pandemic and the end of 2021. Aerosol transmission through interpersonal contact is the main cause of the disease’s spread, although control measures have been put in place to reduce contact opportunities. Mobility pattern is a basic mechanism for understanding how people gather at a location and how long they stay there. Due to the inherent dependencies in disease transmission, models for associating mobility data with confirmed cases need to be individually designed for different regions and time periods. In this paper, we propose an autoregressive count data model under the framework of a generalized linear model to illustrate a process of model specification and selection. By evaluating a 14-day-ahead prediction from Sweden, the results showed that for a dense population region, using mobility data with a lag of 8 days is the most reliable way of predicting the number of confirmed cases in relative numbers at a high coverage rate. It is sufficient for both of the autoregressive terms, studied variable and conditional expectation, to take one day back. For sparsely populated regions, a lag of 10 days produced the lowest error in absolute value for the predictions, where weekly periodicity on the studied variable is recommended for use. Interventions were further included to identify the most relevant mobility categories. Statistical features were also presented to verify the model assumptions.
... In addition to this, mobility data have also been used to identify environments that facilitate virus spreading such as shopping malls, sport facilities, leisure centres, public libraries, theatres and cinemas, clarifying also how short and long distance travelling may contribute to the spread of the contagion [70,71]. Furthermore, the integration of mobility and epidemiological data have been widely used in order to explain or predict COVID-19 cases with high levels of precision, highlighting how different economic sectors impact on the virus diffusion [28,29,72]. In general, mobility data have been suggested as a relevant input information to support the design of optimal restriction strategies [73,74]. ...
Due to the COVID-19 pandemic, countries around the world are facing one of the most severe health and economic crises of recent history and human society is called to figure out effective responses. However, as current measures have not produced valuable solutions, a multidisciplinary and open approach, enabling collaborations across private and public organizations, is crucial to unleash successful contributions against the disease. Indeed, the COVID-19 represents a Grand Challenge to which joint forces and extension of disciplinary boundaries have been recognized as main imperatives. As a consequence, Open Innovation represents a promising solution to provide a fast recovery. In this paper we present a practical application of this approach, showing how knowledge sharing constitutes one of the main drivers to tackle pressing social needs. To demonstrate this, we propose a case study regarding a data sharing initiative promoted by Facebook, the Data For Good program. We leverage a large-scale dataset provided by Facebook to the research community to offer a representation of the evolution of the Italian mobility during the lockdown. We show that this repository allows to capture different patterns of movements on the territory with increasing levels of detail. We integrate this information with Open Data provided by the Lombardy region to illustrate how data sharing can also provide insights for private businesses and local authorities. Finally, we show how to interpret Data For Good initiatives in light of the Open Innovation Framework and discuss the barriers to adoption faced by public administrations regarding these practices.
... We find a comparably small but significant transmission-increasing effect of mobility by about 7.0% (0.7), even controlling for regional measures. While it is well-established that mobility serves as a surrogate measure to quantify the effectiveness of the corresponding NPI regime, [55] our results indicate that telecommunications data derived mobility estimates might capture additional behavioural differences. ...
The drivers behind regional differences of SARS-CoV-2 spread on finer spatio-temporal scales are yet to be fully understood. Here we develop a data-driven modelling approach based on an age-structured compartmental model that compares 116 Austrian regions to a suitably chosen control set of regions to explain variations in local transmission rates through a combination of meteorological factors, non-pharmaceutical interventions and mobility. We find that more than 60% of the observed regional variations can be explained by these factors. Decreasing temperature and humidity, increasing cloudiness, precipitation and the absence of mitigation measures for public events are the strongest drivers for increased virus transmission, leading in combination to a doubling of the transmission rates compared to regions with more favourable weather. We conjecture that regions with little mitigation measures for large events that experience shifts toward unfavourable weather conditions are particularly predisposed as nucleation points for the next seasonal SARS-CoV-2 waves.
... For example, contact tracing can help in predicting the evolution of the COVID-19 infections so that predictions of the peak of the epidemic become easier [25]. ...
Fig. 4 The course of the hospital COVID-19 caseload in Kuwait if the lockdown starts 5, 10, or 15 days before the peak and lasts for 15, 30, or 45 days. The uncertainty is shown in the gray shaded areas, while the solid black curve shows the mean of the simulation results.
Background Kuwait had its first COVID-19 in late February, and until October 6, 2020 it recorded 108,268 cases and 632 deaths. Despite implementing one of the strictest control measures-including a three-week complete lockdown, there was no sign of a declining epidemic curve. The objective of the current analyses is to determine, hypothetically, the optimal timing and duration of a full lockdown in Kuwait that would result in controlling new infections and lead to a substantial reduction in case hospitalizations. Methods The analysis was conducted using a stochastic Continuous-Time Markov Chain (CTMC), eight state model that depicts the disease transmission and spread of SARS-CoV 2. Transmission of infection occurs between individuals through social contacts at home, in schools, at work, and during other communal activities. Results The model shows that a lockdown 10 days before the epidemic peak for 90 days is optimal but a more realistic duration of 45 days can achieve about a 45% reduction in both new infections and case hospitalizations. Conclusions In the view of the forthcoming waves of the COVID19 pandemic anticipated in Kuwait using a correctly-timed and sufficiently long lockdown represents a workable management strategy that encompasses the most stringent form of social distancing with the ability to significantly reduce transmissions and hospitalizations.
... For example, predictions about when aviation industries can return to normal [79], how much transportation use reduced in each country due to COVID-19 [80], and recommendations of show/music to alleviate people's stress during this pandemic [81]. AI can play a vital role to access the risk and challenges of any sector during these unprecedented and unanticipated times [82]. In some sense, AI is lowering the human involvement in many sectors through automated and real-time decision making abilities [83]. ...
This paper presents the role of artificial intelligence (AI) and other latest technologies that were employed to fight the recent pandemic (i.e., novel coronavirus disease-2019 (COVID-19)). These technologies assisted the early detection/diagnosis, trends analysis, intervention planning, healthcare burden forecasting, comorbidity analysis, and mitigation and control, to name a few. The key enabler of these technologies was data obtained from heterogeneous sources (i.e., social networks (SN), internet of (medical) things (IoT/IoMT), cellular networks, transport usage, epidemiological investigations, and other digital/sensing platforms). To this end, we provide an insightful overview of the role of data-driven analytics leveraging AI in the era of COVID-19. Specifically, we discuss major services that AI can provide in the context of COVID-19 pandemic based on six grounds, (i) AI role in seven different epidemic containment strategies (a.k.a. non-pharmaceutical interventions (NPIs)), (ii) AI role in data life cycle phases employed to control pandemic via digital solutions, (iii) AI role in performing analytics on heterogeneous types of data stemming from the COVID-19 pandemic, (iv) AI role in the healthcare sector in the context of COVID-19 pandemic, (v) general-purpose applications of AI in COVID-19 era, and (vi) AI role in drug design and repurposing (e.g., iteratively aligning protein spikes and applying three/four-fold symmetry to yield a low-resolution candidate template) against COVID-19. Further, we discuss the challenges involved in applying AI to the available data and privacy issues that can arise from personal data transitioning into cyberspace. We also provide a concise overview of other latest technologies that were increasingly applied to limit the spread of the ongoing pandemic. Finally, we discuss the avenues of future research in the respective area.
This insightful review aims to highlight existing AI-based technological developments and future research dynamics in this area.
Background. During the COVID-19 pandemic, mobile sensing and data analytics techniques have demonstrated their capabilities in monitoring the trajectories of the pandemic, by collecting behavioral, physiological, and mobility data on individual, neighborhood, city, and national scales. Notably, mobile sensing has become a promising way to detect individuals’ infectious status, track the change in long-term health, trace the epidemics in communities, and monitor the evolution of viruses and subspecies. Methods. We followed the PRISMA practice and reviewed 60 eligible papers on mobile sensing for monitoring COVID-19. We proposed a taxonomy system to summarize literature by the time duration and population scale under mobile sensing studies. Results. We found that existing literature can be naturally grouped in four clusters, including remote detection, long-term tracking, contact tracing, and epidemiological study. We summarized each group and analyzed representative works with regard to the system design, health outcomes, and limitations on techniques and societal factors. We further discussed the implications and future directions of mobile sensing in communicable diseases from the perspectives of technology and applications. Conclusion. Mobile sensing techniques are effective, efficient, and flexible to surveil COVID-19 in scales of time and populations. In the post-COVID era, technical and societal issues in mobile sensing are expected to be addressed to improve healthcare and social outcomes.
Controlling human mobility is thought to be an effective measure to prevent the spread of the COVID-19 pandemic. This study aims to clarify the human mobility types that impacted the number of COVID-19 cases during the medium-term COVID-19 pandemic in the Osaka metropolitan area. The method used in this study was analysis of the statistical relationship between human mobility changes and the total number of COVID-19 cases after two weeks. In conclusion, the results indicate that it is essential to control the human mobility of groceries/pharmacies to between −5 and 5% and that of parks to more than −20%. The most significant finding for urban sustainability is that urban transit was not found to be a source of infection. Hence governments in cities around the world may be able to encourage communities to return to transit mobility, if they are able to follow the kind of hygiene processes conducted in Osaka.
The purpose of this review was to provide evidence on the following key question: Where can AI and emerging digital technologies potentially add value in COVID responses to mitigate, control, or prevent COVID-19 and its consequences? Includes both innovative applications and developments, and new applications of established technologies and processes.
We employ the Google and Apple mobility data to identify, quantify and classify different degrees of social distancing and characterise their imprint on the first wave of the COVID-19 pandemic in Europe and in the United States. We identify the period of enacted social distancing via Google and Apple data, independently from the political decisions. Our analysis allows us to classify different shades of social distancing measures for the first wave of the pandemic. We observe a strong decrease in the infection rate occurring two to five weeks after the onset of mobility reduction. A universal time scale emerges, after which social distancing shows its impact. We further provide an actual measure of the impact of social distancing for each region, showing that the effect amounts to a reduction by 20–40% in the infection rate in Europe and 30–70% in the US.
In response to the COVID-19 pandemic, countries have sought to control SARS-CoV-2 transmission by restricting population movement through social distancing interventions, thus reducing the number of contacts. Mobility data represent an important proxy measure of social distancing, and here, we characterise the relationship between transmission and mobility for 52 countries around the world. Transmission significantly decreased with the initial reduction in mobility in 73% of the countries analysed, but we found evidence of decoupling of transmission and mobility following the relaxation of strict control measures for 80% of countries. For the majority of countries, mobility explained a substantial proportion of the variation in transmissibility (median adjusted R-squared: 48%, interquartile range - IQR - across countries [27–77%]). Where a change in the relationship occurred, predictive ability decreased after the relaxation; from a median adjusted R-squared of 74% (IQR across countries [49–91%]) pre-relaxation, to a median adjusted R-squared of 30% (IQR across countries [12–48%]) post-relaxation. In countries with a clear relationship between mobility and transmission both before and after strict control measures were relaxed, mobility was associated with lower transmission rates after control measures were relaxed indicating that the beneficial effects of ongoing social distancing behaviours were substantial.
Background A novel coronavirus (COVID-19) has taken the world by storm. The disease has spread very swiftly worldwide. A timely clue which includes the estimation of the incubation period among COVID-19 patients can allow governments and healthcare authorities to act accordingly. Objectives To undertake a review and critical appraisal of all published/preprint reports that offer an estimation of incubation periods for COVID-19. Eligibility criteria This research looked for all relevant published articles between the dates of December 1, 2019, and April 25, 2020, i.e. those that were related to the COVID-19 incubation period. Papers were included if they were written in English, and involved human participants. Papers were excluded if they were not original (e.g. reviews, editorials, letters, commentaries, or duplications). Sources of evidence COVID-19 Open Research Dataset supplied by Georgetown’s Centre for Security and Emerging Technology as well as PubMed and Embase via Arxiv, medRxiv, and bioRxiv. Charting methods A data-charting form was jointly developed by the two reviewers (NZ and EA), to determine which variables to extract. The two reviewers independently charted the data, discussed the results, and updated the data-charting form. Results and conclusions Screening of 44,000 articles was undertaken, with a final selection of 25 studies referring to 18 different experimental projects related to the estimation of the incubation period of COVID-19. The majority of extant published estimates offer empirical evidence showing that the incubation period for the virus is a mean of 7.8 days, with a median of 5.01 days, which falls into the ranges proposed by the WHO (0 to 14 days) and the ECDC (2 to 12 days). Nevertheless, a number of authors proposed that quarantine time should be a minimum of 14 days and that for estimates of mortality risks a median time delay of 13 days between illness and mortality should be under consideration.
It is unclear as to whether any correlation exists between the age of patients and the length of time they incubate the virus.
Preprint
Total and perimetral lockdowns were the strongest nonpharmaceutical interventions to fight against COVID-19, as well as those with the strongest socioeconomic collateral effects. Lacking a metric to predict the effect of lockdowns on the spreading of COVID-19, authorities and decision-makers opted for preventive measures that proved either too strong or not strong enough after a period of two to three weeks, once data about hospitalizations and deaths was available. We present here the radius of gyration as a candidate predictor of the trend in deaths by COVID-19 with an offset of three weeks. Indeed, the radius of gyration aggregates the most relevant microscopic aspects of human mobility into a macroscopic value, very sensitive to temporary trends and local effects, such as lockdowns and mobility restrictions. We use mobile phone data of more than 13 million users in Spain during a period of one year (from January 6th 2020 to January 10th 2021) to compute the users’ daily radius of gyration and compare the median value of the population with the evolution of COVID-19 deaths: we find that for all weeks where the radius of gyration is above a critical value (70% of its pre-pandemic score) the number of weekly deaths increases three weeks later. The reverse also holds: for all weeks where the radius of gyration is below the critical value, the number of weekly deaths decreases three weeks later. This observation leads to two conclusions: i) the radius of gyration can be used as a predictor of COVID-19-related deaths; and ii) partial mobility restrictions are as effective as a total lockdown as long as the radius of gyration is below this critical value.
Background: Authorities around the world have used lockdowns and partial mobility restrictions as major nonpharmaceutical interventions to control the expansion of COVID-19. While effective, the efficiency of these measures on the number of COVID-19 cases and deaths is difficult to quantify, severely limiting the feedback that can be used to tune the intensity of these measures. In addition, collateral socioeconomic effects challenge the overall effectiveness of lockdowns in the long term, and the degree to which they are followed can be difficult to estimate. It is desirable to find both a metric to accurately monitor the mobility restrictions and a predictor of their effectiveness. Methods: We correlate the median of the daily radius of gyration of more than 13M users in Spain during all of 2020 with the evolution of COVID-19 deaths for the same period. Mobility data is obtained from mobile phone metadata from one of the major operators in the country. Results: The radius of gyration is a predictor of the trend in the number of COVID-19 deaths with a three-week offset. When the radius is above/below a critical threshold (70% of the pre-pandemic score), the number of deaths increases/decreases three weeks later. Conclusions: The radius of gyration can be used to monitor in real time the effectiveness of the mobility restrictions. The existence of a critical threshold suggests that partial lockdowns can be as efficient as total lockdowns, while reducing their socioeconomic impact. The mechanism behind the critical value is still unknown, and more research is needed.
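The radius of gyration described above can be illustrated with a short sketch. It uses the standard definition (root-mean-square distance of a user's visited positions from their centre of mass) and assumes positions are already projected onto a plane in kilometres; the function names, the planar simplification and the sample numbers are assumptions for illustration, not the cited study's pipeline.

```python
import math
from statistics import median


def radius_of_gyration(points):
    """Root-mean-square distance of (x, y) positions from their centre of mass.

    points: non-empty list of (x, y) tuples in a planar projection (km assumed).
    """
    n = len(points)
    centre_x = sum(x for x, _ in points) / n
    centre_y = sum(y for _, y in points) / n
    mean_sq_dist = sum((x - centre_x) ** 2 + (y - centre_y) ** 2
                       for x, y in points) / n
    return math.sqrt(mean_sq_dist)


def above_critical_value(population_median, pre_pandemic_median, ratio=0.70):
    """True when the population median exceeds 70% of the pre-pandemic score."""
    return population_median > ratio * pre_pandemic_median


# Toy usage: three hypothetical users' positions for one day.
daily_positions = {
    "user_a": [(0.0, 0.0), (2.0, 0.0)],
    "user_b": [(0.0, 0.0), (0.0, 6.0)],
    "user_c": [(1.0, 1.0), (1.0, 1.0)],
}
daily_median = median(radius_of_gyration(p) for p in daily_positions.values())
rising_expected = above_critical_value(daily_median, pre_pandemic_median=2.0)
```

Under the abstract's finding, a week in which the population median sits above the threshold would be followed by rising weekly deaths about three weeks later. Real latitude/longitude data would need geodesic rather than planar distances.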
Article
The COVID-19 pandemic continues to have a destructive effect on the health and well-being of the global population. A vital step in the battle against it is the successful screening of infected patients, with one of the effective screening methods being radiological examination using chest radiography. Recognition of epidemic growth patterns across temporal and social factors can improve our capability to create epidemic transmission designs, including the critical job of predicting the estimated intensity of the outbreak's final morbidity or mortality impact. The study’s primary motivation is to be able to estimate with a certain level of accuracy the number of deaths due to COVID-19, managing to model the progression of the pandemic. Predicting the number of possible deaths from COVID-19 can provide governments and decision-makers with indicators for purchasing respirators and for pandemic prevention policies. Thus, this work presents itself as an essential contribution to combating the pandemic. The Kalman Filter is a widely used method for tracking, navigation, filtering and time series. Designing and tuning machine learning methods is a labor- and time-intensive task that requires extensive experience. The field of automated machine learning (Auto Machine Learning) focuses on automating this task. Auto Machine Learning tools enable novice users to create useful machine learning units, while experts can use them to free up valuable time for other tasks. This paper presents an objective method of forecasting the COVID-19 outbreak using Kalman Filter and Auto Machine Learning. We use a COVID-19 dataset of Ceará, one of the 27 federative units in Brazil. Ceará has more than 235,222 confirmed cases of COVID-19 and 8850 deaths due to the disease. The TPOT AutoML model showed the best result, with an \(R^2\) score of 0.99.
Article
Effective screening of SARS-CoV-2 enables quick and efficient diagnosis of COVID-19 and can mitigate the burden on healthcare systems. Prediction models that combine several features to estimate the risk of infection have been developed. These aim to assist medical staff worldwide in triaging patients, especially in the context of limited healthcare resources. We established a machine-learning approach that trained on records from 51,831 tested individuals (of whom 4769 were confirmed to have COVID-19). The test set contained data from the subsequent week (47,401 tested individuals of whom 3624 were confirmed to have COVID-19). Our model predicted COVID-19 test results with high accuracy using only eight binary features: sex, age ≥60 years, known contact with an infected individual, and the appearance of five initial clinical symptoms. Overall, based on the nationwide data publicly reported by the Israeli Ministry of Health, we developed a model that detects COVID-19 cases by simple features accessed by asking basic questions. Our framework can be used, among other considerations, to prioritize testing for COVID-19 when testing resources are limited.
Article
Precision agriculture is a growing sector that improves traditional agricultural processes through the use of new technologies. In southeast Spain, farmers are continuously fighting against harsh conditions caused by the effects of climate change. Among these problems, the great variability of temperatures (up to 20 °C in the same day) stands out. This causes the stone fruit trees to flower prematurely, and the low winter temperatures then freeze the flowers, causing the loss of the crop. Farmers use anti-freeze techniques to prevent crop loss, and the most widely used techniques are those that use water irrigation, as they are cheaper than other techniques. However, these techniques waste too much water, which is a scarce resource, especially in this area. In this article, we propose a novel intelligent Internet of Things (IoT) monitoring system to optimize the use of water in these anti-frost techniques while minimizing crop loss. The intelligent component of the IoT system is designed using an approach based on a multivariate Long Short-Term Memory (LSTM) model, designed to predict low temperatures. We compare the proposed multivariate model with its univariate counterpart to determine which model predicts low temperatures more accurately. An accurate prediction of low temperatures would translate into significant water savings, as anti-frost techniques would not be activated unnecessarily. Our experimental results show that the proposed multivariate LSTM approach improves on the univariate counterpart, obtaining an average quadratic error no greater than 0.65 °C and a coefficient of determination R2 greater than 0.97. The proposed system has been deployed and is currently operating in a real environment with satisfactory performance.
Article
We present modeling of the COVID-19 epidemic in Illinois, USA, capturing the implementation of a stay-at-home order and scenarios for its eventual release. We use a non-Markovian age-of-infection model that is capable of handling long and variable time delays without changing its model topology. Bayesian estimation of model parameters is carried out using Markov chain Monte Carlo methods. This framework allows us to treat all available input information, including both the previously published parameters of the epidemic and available local data, in a uniform manner. To accurately model deaths as well as demand on the healthcare system, we calibrate our predictions to total and in-hospital deaths as well as hospital and ICU bed occupancy by COVID-19 patients. We apply this model not only to the state as a whole but also its subregions in order to account for the wide disparities in population size and density. Without prior information on nonpharmaceutical interventions, the model independently reproduces a mitigation trend closely matching mobility data reported by Google and Unacast. Forward predictions of the model provide robust estimates of the peak position and severity and also enable forecasting the regional-dependent results of releasing stay-at-home orders. The resulting highly constrained narrative of the epidemic is able to provide estimates of its unseen progression and inform scenarios for sustainable monitoring and control of the epidemic.
Article
Detailed, accurate data related to a disease outbreak enable informed public health decision making. Given the variety of data types available across different regions, global data curation and standardization efforts are essential to guarantee rapid data integration and dissemination in times of a pandemic.
Article
Forecasting the outcome of outbreaks as early and as accurately as possible is crucial for decision making and policy implementations. A significant challenge faced by forecasters is that not all outbreaks and epidemics turn into pandemics making the prediction of their severity difficult. At the same time, the decisions made to enforce lockdowns and other mitigating interventions versus their socioeconomic consequences are not only hard to make, but also highly uncertain. The majority of modeling approaches to outbreaks, epidemics, and pandemics take an epidemiological approach that considers biological and disease processes. In this paper, we accept the limitations of forecasting to predict the long-term trajectory of an outbreak, and instead, we propose a statistical, time-series approach to modelling and predicting the short-term behaviour of COVID-19. Our model assumes a multiplicative trend, aiming to capture the continuation of the two variables we predict (global confirmed cases and deaths) as well as their uncertainty. We present the timeline of producing and evaluating 10-day-ahead forecasts over a period of four months. Our simple model offers competitive forecast accuracy and estimates of uncertainty that are useful and practically relevant.
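To make the idea of a multiplicative trend concrete, the sketch below implements a generic exponential-smoothing forecaster in which the trend is a growth ratio rather than an additive increment. The abstract does not specify the exact model, so the update equations, the function name, the smoothing parameters and the toy series are all assumptions chosen for illustration; no uncertainty estimate is produced here.

```python
def multiplicative_trend_forecast(series, horizon, alpha=0.5, beta=0.5):
    """Exponential smoothing with a multiplicative trend (illustrative sketch).

    series: strictly positive observations, oldest first (at least two).
    horizon: number of steps ahead to forecast.
    alpha, beta: smoothing weights for level and trend (assumed values).
    """
    if len(series) < 2 or min(series) <= 0:
        raise ValueError("need at least two strictly positive observations")
    level = series[0]
    trend = series[1] / series[0]  # initial growth ratio
    for y in series[1:]:
        prev_level = level
        # Level: blend the new observation with the projected previous level.
        level = alpha * y + (1 - alpha) * prev_level * trend
        # Trend: blend the latest level ratio with the previous growth ratio.
        trend = beta * (level / prev_level) + (1 - beta) * trend
    # h-step-ahead forecast compounds the growth ratio h times.
    return [level * trend ** h for h in range(1, horizon + 1)]


# Toy usage: a made-up cumulative count growing by 10% per step.
toy_cases = [100.0, 110.0, 121.0, 133.1]
toy_forecast = multiplicative_trend_forecast(toy_cases, horizon=10)
```

Because the trend compounds, forecasts from this kind of model grow geometrically with the horizon, which suits short horizons such as the 10-day-ahead forecasts mentioned above better than long-term projections.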