Early epidemiological analysis of the coronavirus disease 2019 outbreak based on crowdsourced data: a population-level observational study

www.thelancet.com/digital-health Published online February 20, 2020 https://doi.org/10.1016/S2589-7500(20)30026-1
Articles
Lancet Digital Health 2020
Published Online February 20, 2020
https://doi.org/10.1016/S2589-7500(20)30026-1
See Online/Comment https://doi.org/10.1016/S2589-7500(20)30055-8
Division of International
Epidemiology and Population
Studies, Fogarty International
Center, US National Institutes
of Health, Bethesda MD, USA
(K Sun PhD, J Chen BSc,
C Viboud PhD)
Correspondence to:
Dr Cécile Viboud, Division of
International Epidemiology and
Population Studies, Fogarty
International Center, US National
Institutes of Health,
Bethesda, MD 20892, USA
viboudc@mail.nih.gov
Early epidemiological analysis of the coronavirus disease
2019 outbreak based on crowdsourced data: a population-
level observational study
Kaiyuan Sun, Jenny Chen, Cécile Viboud
Summary
Background As the outbreak of coronavirus disease 2019 (COVID-19) progresses, epidemiological data are needed to
guide situational awareness and intervention strategies. Here we describe efforts to compile and disseminate
epidemiological information on COVID-19 from news media and social networks.
Methods In this population-level observational study, we searched DXY.cn, a health-care-oriented social network that
is currently streaming news reports on COVID-19 from local and national Chinese health agencies. We compiled a
list of individual patients with COVID-19 and daily province-level case counts between Jan 13 and Jan 31, 2020, in
China. We also compiled a list of internationally exported cases of COVID-19 from global news media sources (Kyodo
News, The Straits Times, and CNN), national governments, and health authorities. We assessed trends in the
epidemiology of COVID-19 and studied the outbreak progression across China, assessing delays between symptom
onset, seeking care at a hospital or clinic, and reporting, before and after Jan 18, 2020, as awareness of the outbreak
increased. All data were made publicly available in real time.
Findings We collected data for 507 patients with COVID-19 reported between Jan 13 and Jan 31, 2020, including
364 from mainland China and 143 from outside of China. 281 (55%) patients were male and the median age was
46 years (IQR 35–60). Few patients (13 [3%]) were younger than 15 years and the age profile of Chinese patients
adjusted for baseline demographics confirmed a deficit of infections among children. Across the analysed period,
delays between symptom onset and seeking care at a hospital or clinic were longer in Hubei province than in other
provinces in mainland China and internationally. In mainland China, these delays decreased from 5 days before
Jan 18, 2020, to 2 days thereafter until Jan 31, 2020 (p=0·0009). Although our sample captures only 507 (5·2%) of
9826 patients with COVID-19 reported by official sources during the analysed period, our data align with an official
report published by Chinese authorities on Jan 28, 2020.
Interpretation News reports and social media can help reconstruct the progression of an outbreak and provide detailed
patient-level data in the context of a health emergency. The availability of a central physician-oriented social network
facilitated the compilation of publicly available COVID-19 data in China. As the outbreak progresses, social media and
news reports will probably capture a diminishing fraction of COVID-19 cases globally due to reporting fatigue and
overwhelmed health-care systems. In the early stages of an outbreak, availability of public datasets is important to
encourage analytical efforts by independent teams and provide robust evidence to guide interventions.
Funding Fogarty International Center, US National Institutes of Health.
Copyright © 2020 The Author(s). Published by Elsevier Ltd. This is an Open Access article under the CC BY 4.0 license.
Introduction
As the outbreak of coronavirus disease 2019 (COVID-19)
is rapidly expanding in China and beyond, with the
potential to become a worldwide pandemic,1 real-time
analyses of epidemiological data are needed to increase
situational awareness and inform interventions.2
Previously, real-time analyses have shed light on the
transmissibility, severity, and natural history of an
emerging pathogen in the first few weeks of an outbreak,
such as with severe acute respiratory syndrome (SARS),
the 2009 influenza pandemic, and Ebola.3–6 Analyses of
detailed line lists of patients are particularly useful to
infer key epidemiological parameters, such as the
incubation and infectious periods, and delays between
infection and detection, isolation, and reporting of cases.3,4
However, ocial individual patient data rarely become
publicly available early on in an outbreak, when the
information is most needed.
Building on our previous experience collating news
reports to monitor transmission of Ebola virus,7 here
we present an effort to compile individual patient
information and subnational epidemic curves on
COVID-19 from a variety of online resources. Data were
made publicly available in real time and were used by
the infectious disease modelling community to generate
and compare epidemiological estimates relevant to
interventions. We describe the data generation process
and provide an early analysis of age patterns of COVID-19,
case counts across China and internationally, and delays
between symptom onset, admissions to hospital, and
reporting, for cases reported until Jan 31, 2020.
Methods
Study design and Chinese data sources
In this population-level observational study, we used
crowdsourced reports from DXY.cn, a social network for
Chinese physicians, health-care professionals, pharmacies,
and health-care facilities established in 2000. This
online platform is providing real-time coverage of the
COVID-19 outbreak in China, obtained by collating and
curating reports from news media, government television,
and national and provincial health agencies. The
information reported includes time-stamped cumulative
counts of COVID-19 infections, outbreak maps, and real-
time streaming of health authority announcements in
Chinese (directly or through state media).8 Every report is
linked to an online source, which can be accessed for
more detailed information on individual cases.
These are publicly available, de-identified patient data
reported directly by public health authorities or by state
media. No patient consent was needed and no ethics
approval was required.
Data compilation
We closely monitored updates on DXY.cn between
Jan 20, 2020, and Jan 31, 2020, to extract key information
on individual patients in near real-time, and reports of
daily case counts. For individual-level patient data, we
used descriptions from the original source in Chinese to
retrieve age, sex, province of identification, travel history,
reporting date, dates of symptom onset and seeking care
at a hospital or clinic, and discharge status, when
available. Individual-level patient data were formatted
into a line-list database for further quantitative analysis.
Individual-level patient data were entered from DXY.cn
by a native Chinese speaker (KS), who also generated an
English summary for each patient. Entries were checked
by a second person (JC). Since DXY.cn primarily provides
For DXY website see DXY.cn
Research in context
Evidence before this study
An outbreak of coronavirus disease 2019 (COVID-19) was
recognised in early January, 2020, in Wuhan City, Hubei
province, China. The new virus is thought to have originated
from an animal-to-human spillover event linked to seafood and
live-animal markets. The infection has spread locally in Wuhan
and elsewhere in China, despite strict intervention measures
implemented in the region where the infection originated on
Jan 23, 2020. More than 500 patients infected with COVID-19
outside of mainland China have been reported between Jan 1
and Feb 14, 2020. Although laboratory testing for COVID-19
quickly ramped up in China and elsewhere, information on
individual patients remains scarce and official datasets have not
been made publicly available. Patient-level information is
important to estimate key time-to-delay events (such as the
incubation period and interval between symptom onset and
visit to a hospital), analyse the age profile of infected patients,
reconstruct epidemic curves by onset dates, and infer
transmission parameters. We searched PubMed for publications
between Jan 1, 1990, and Feb 6, 2020, using combinations of
the following terms: (“coronavirus” OR “2019-nCoV”) AND
(“line list” OR “case description” OR “patient data”) AND
(“digital surveillance” OR “social media” OR “crowd-sourced
data”). The search retrieved one relevant study on Middle East
respiratory syndrome coronavirus that mentioned FluTrackers
in their discussion, a website that aggregates epidemiological
information on emerging pathogens. However, FluTrackers
does not report individual-level data on COVID-19.
Added value of this study
To our knowledge, this is the first study that uses crowdsourced
data from social media sources to monitor the COVID-19
outbreak. We searched DXY.cn, a Chinese health-care-oriented
social network that broadcasts information from local and
national health authorities, to reconstruct patient-level
information on COVID-19 in China. We also queried
international media sources and national health agency
websites to collate data on international exportations of
COVID-19. We describe the demographic characteristics, delays
between symptom onset, seeking care at a hospital or clinic,
and reporting for 507 patients infected with COVID-19
reported until Jan 31, 2020. The overall cumulative progression
of the outbreak is consistent between our line list and an official
report published by the Chinese national health authorities on
Jan 28, 2020. The estimated incubation period in our data
aligns with that of previous work. Our dataset was made
available in the public domain on Jan 21, 2020.
Implications of all the available evidence
Crowdsourced line-list data can be reconstructed from social
media data, especially when a central resource is available to
curate relevant information. Public access to line lists is
important so that several teams with different expertise can
provide their own insights and interpretations of the data,
especially in the early phase of an outbreak when little
information is available. Publicly available line lists can also
increase transparency. The main issue with the quality of
patient-level data obtained during health emergencies is the
potential lack of information from locations overwhelmed by
the outbreak (in this case, Hubei province and other provinces
with weaker health infrastructures). Future studies based on
larger samples of patients with COVID-19 could explore in more
detail the transmission dynamics of the outbreak in different
locations, the effectiveness of interventions, and the
demographic factors driving transmission.
For an example of an online source see https://ncov.dxy.cn/ncovh5/view/pneumonia
information on patients reported in China, we also
compiled additional information on internationally
exported cases of COVID-19. We obtained data for
21 countries outside of mainland China (Australia,
Cambodia, Canada, France, Germany, Hong Kong,
India, Italy, Japan, Malaysia, Nepal, Russia, Singapore,
South Korea, Sri Lanka, Taiwan, Thailand, United Arab
Emirates, the UK, the USA, and Vietnam). We gathered
and cross-checked data for infected patients outside of
China using several sources, including global news
media (Kyodo News, Straits Times, and CNN), official
press releases from each country’s Ministry of Health,
and disease control agencies.
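As a rough illustration of the line-list structure described above, the sketch below stores one patient per record with the fields named in the text (age, sex, location, travel history, reporting date, dates of symptom onset and seeking care, and outcome). The class name, field names, and the example patient are hypothetical and are not taken from the published spreadsheet.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class LineListEntry:
    """One patient in a crowdsourced line list (field names are illustrative)."""
    patient_id: int
    age: Optional[int]
    sex: Optional[str]                # "male", "female", or None when missing
    location: str                     # province or country of identification
    travel_history: Optional[str]     # e.g. "visited Wuhan", "resident of Wuhan"
    report_date: date
    onset_date: Optional[date] = None
    care_date: Optional[date] = None  # date of seeking care at a hospital or clinic
    died: Optional[bool] = None       # None when the outcome is unknown

    def onset_to_care_days(self) -> Optional[int]:
        """Delay from symptom onset to seeking care, in days, if both dates are known."""
        if self.onset_date is None or self.care_date is None:
            return None
        return (self.care_date - self.onset_date).days

    def care_to_report_days(self) -> Optional[int]:
        """Delay from seeking care to reporting, in days, if the care date is known."""
        if self.care_date is None:
            return None
        return (self.report_date - self.care_date).days


# Invented patient, for illustration only.
example = LineListEntry(
    patient_id=1, age=46, sex="male", location="Beijing",
    travel_history="visited Wuhan", report_date=date(2020, 1, 25),
    onset_date=date(2020, 1, 20), care_date=date(2020, 1, 22),
)
print(example.onset_to_care_days(), example.care_to_report_days())
```

Keeping missing dates as None, rather than imputing them, means a delay is only computed for patients whose source report gave both dates.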
In addition to detailed information on individual
patients, we reconstructed the daily progression of
reported patients in each province of China from Jan 13,
until Jan 31, 2020. We used the daily outbreak situation
reports communicated by provincial health authorities,
covered by state television and media, and posted on
DXY.cn. All patients in our databases had a laboratory
confirmed SARS coronavirus 2 (SARS-CoV-2) infection.
Our COVID-19 database was made publicly available as
a Google Sheet, disseminated via Twitter on Jan 21, 2020,
and posted on the website of Northeastern University
(Boston, MA, USA) on Jan 24, 2020, where it is updated
in real time. Data used in this analysis, frozen at Jan 31,
2020, are available online as a spreadsheet.
Statistical analysis
We assessed the age distribution of all patients with
COVID-19 by discharge status. We adjusted the age profile
of Chinese patients by the population of China. We used
2016 population estimates from the Institute for Health
Metrics and Evaluation9 to calculate the relative risk (RR) of
infection with COVID-19 by age group. To calculate the
RR, we followed the method used by Lemaitre and
colleagues10 to explore the age profile of influenza, where
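RR for age group i is defined as the share of all cases that fall in age group i (Ci divided by
the sum of Ci over all age groups) divided by the share of the population in age group i (Ni
divided by the sum of Ni over all age groups), where Ci is the number of cases in age group i
and Ni is the population size of age group i.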
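The relative risk calculation can be sketched in a few lines. This is a minimal illustration, not the authors' R code, of the ratio between each age group's share of cases and its share of the population. The age bands and counts below are invented.

```python
def relative_risk(cases, population):
    """Return RR per age group: (C_i / sum of C) / (N_i / sum of N)."""
    total_cases = sum(cases.values())
    total_population = sum(population.values())
    return {
        group: (cases[group] / total_cases) / (population[group] / total_population)
        for group in cases
    }


# Invented counts, for illustration only.
cases = {"0-14": 5, "15-64": 70, "65+": 25}
population = {"0-14": 200, "15-64": 700, "65+": 100}

rr = relative_risk(cases, population)
for group, value in rr.items():
    print(group, round(value, 2))
```

An RR of 1 means the age group is represented among cases in proportion to its population size. Values below 1 indicate a deficit of reported cases and values above 1 an excess.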
To estimate trends in the strength of case detection and
interventions, we analysed delays between symptom onset
and visit to a health-care provider, at a hospital or clinic,
and from seeking care at a hospital or clinic to reporting,
by time period and location. We considered the period
before and after Jan 18, 2020, when media attention and
awareness of the outbreak became more pronounced.11
We used non-parametric tests to assess differences in
delays between seeking care at a hospital or clinic and
reporting between locations (Wilcoxon test to compare
two locations and Kruskal–Wallis test to compare three or
more locations).
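To show what the two-sample comparison involves, the sketch below implements a two-sided Wilcoxon rank-sum (Mann–Whitney) test using only the standard library. It uses the normal approximation with a tie correction and no continuity correction, so p values for very small samples will differ from exact tests such as those available in R. The function names and the delay values are illustrative assumptions.

```python
import math
from collections import Counter


def midranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided rank-sum test; returns (U statistic for x, approximate p value)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = list(x) + list(y)
    ranks = midranks(pooled)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    ties = sum(t**3 - t for t in Counter(pooled).values())
    variance = n1 * n2 / 12 * ((n + 1) - ties / (n * (n - 1)))
    if variance == 0:  # all observations identical
        return u, 1.0
    z = (u - n1 * n2 / 2) / math.sqrt(variance)
    return u, math.erfc(abs(z) / math.sqrt(2))


# Invented delays in days for two locations, for illustration only.
delays_a = [5, 7, 4, 9, 6, 8]
delays_b = [2, 1, 3, 2, 0, 1]
u_stat, p_value = rank_sum_test(delays_a, delays_b)
print(u_stat, p_value < 0.05)
```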
We estimated the duration of the incubation period on
the basis of our line list data. We analysed a subset of
patients returning from Wuhan who had spent less than
a week in Wuhan, to ensure a narrowly defined exposure
window. The incubation period was estimated as the
interval between the midpoint of the time spent in Wuhan and the date
of symptom onset.
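The midpoint approach can be sketched as follows. The function name, its arguments, and the example dates are assumptions made for illustration. The rule that stays of a week or longer are excluded follows the subset definition given in the text.

```python
from datetime import date
from statistics import median


def incubation_days(arrival, departure, onset):
    """Days from the midpoint of the stay in Wuhan to symptom onset."""
    if departure < arrival:
        raise ValueError("departure precedes arrival")
    if (departure - arrival).days >= 7:
        raise ValueError("stay of a week or longer: exposure window too wide")
    # Work with day ordinals so that half-day midpoints are preserved.
    midpoint = (arrival.toordinal() + departure.toordinal()) / 2
    return onset.toordinal() - midpoint


# Invented travel histories, for illustration only.
travellers = [
    (date(2020, 1, 10), date(2020, 1, 13), date(2020, 1, 16)),
    (date(2020, 1, 12), date(2020, 1, 14), date(2020, 1, 16)),
    (date(2020, 1, 15), date(2020, 1, 15), date(2020, 1, 21)),
]
estimates = [incubation_days(*t) for t in travellers]
print(estimates, median(estimates))
```

This estimate is the same as the midpoint between the shortest possible incubation period (onset minus departure) and the longest (onset minus arrival).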
We did all analyses in R (version 3.5.3). We considered
p values of less than 0·05 to be significant.
Role of the funding source
The funder had no role in study design, data compilation,
data analysis, data interpretation, or writing of the report.
All authors had access to the data, and had final
responsibility for the decision to submit for publication.
Results
Our line list comprised 507 patients reported from Jan
13, to Jan 31, 2020, including 364 (72%) from mainland
China and 143 (28%) from outside of China (table). Our
sample captured 5·2% of 9826 COVID-19 cases reported
by WHO on Jan 31, 2020. The sex ratio was skewed
towards males. In mainland China, five of 30 provinces
were represented, with 133 (26%) patients reported by
Characteristic                                          Patients (n=507)
Age, years                                              46 (35–60)
Sex
  Male                                                  281 (55%)
  Female                                                201 (40%)
  Data missing                                          25 (5%)
Location
  Mainland China                                        364 (72%)
    Beijing                                             133 (26%)
    Shaanxi                                             87 (17%)
    Hubei*                                              41 (8%)
    Tianjin                                             22 (4%)
    Yunnan                                              19 (4%)
  International cases, reported outside of
  mainland China                                        143 (28%)
Relation to Wuhan
  Visited Wuhan                                         153 (30%)
  Resident of Wuhan                                     152 (30%)
  None                                                  80 (16%)
  Unknown†                                              122 (24%)
Disease outcome: death at time of reporting             40 (8%)

Data are median (IQR) or n (%). Data are publicly available on the Laboratory for
the Modeling of Biological + Socio-technical systems website and on our frozen
spreadsheet. COVID-19=coronavirus disease 2019. *Including 32 from Wuhan.
†All patients with unknown relation to Wuhan were reported by Beijing Municipal
Health Commission, Beijing, China.
Table: Characteristics of patients with COVID-19 included in the
crowdsourced line list
Equation: relative risk (RR) for age group i
$$RR_i = \frac{C_i / \sum_i C_i}{N_i / \sum_i N_i}$$
For the WHO situation report as of Jan 31, 2020, see https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200131-sitrep-11-ncov.pdf?sfvrsn=de7c0f7_4
For the Laboratory for the Modeling of Biological + Socio-technical systems website at Northeastern University see https://www.mobs-lab.org/2019ncov.html
For the spreadsheet of patient-level data until Jan 31, 2020, see https://docs.google.com/spreadsheets/d/1Gb5cyg0fjUtsqh3hl_L-C5A23zIOXmWH5veBklfSHzg/edit?usp=sharing
Beijing, 87 (17%) by Shaanxi, 41 (8%) by Hubei (capital
city is Wuhan), 22 (4%) by Tianjin, and 19 (4%) by
Yunnan. Of 385 patients with known relation to Wuhan
city, most reported a travel history to the city (153 [30%])
or were residents of the city (152 [30%]), while 80 (16%)
had no direct relation to the city. 122 (24%) patients, all
reported in Beijing, had no information about their
recent history with Wuhan.
The age distribution of COVID-19 cases was skewed
towards older age groups with a median age of 45 years
(IQR 33–56) for patients who were alive or who had an
unknown outcome at the time of reporting (figure 1). The
median age of patients who had died at the time of
reporting was 70 years (IQR 65–81). Few patients (13 [3%])
were younger than 15 years. Adjustment for the age
demographics of China confirmed a deficit of infections
among children, with a RR below 0·5 in patients younger
than 15 years (figure 1). The RR measure indicated a
sharp increase in the likelihood of reported COVID-19
among people aged 30 years and older.
A timeline of cases in our crowdsourced patient line
list is shown by date of onset in figure 2, indicating
an acceleration of reported cases by Jan 13, 2020.
The outbreak progression based on the crowdsourced
patient line list was consistent with the timeline
published by China Center for Disease Control and
Prevention (CDC) on Jan 28, 2020,12 which is based
on a more comprehensive database of more than
6000 patients with COVID-19. Since Jan 23, 2020, growth in the
cumulative number of cases has slowed in both the
crowdsourced and China CDC curves (figure 2), which
probably reflects the delay between disease onset and
reporting. The median reporting delay was 5 days
(IQR 3–8) in our data.
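A timeline by date of symptom onset, together with its cumulative curve, can be built from a line list as sketched below. The function name and onset dates are invented. The factor of 20 mirrors the rescaling of the crowdsourced curve described in the figure 2 caption and is used here only for visual comparison with a larger official series.

```python
from collections import Counter
from datetime import date, timedelta
from itertools import accumulate


def epidemic_curve(onset_dates):
    """Return (days, daily counts, cumulative counts) over the full date range."""
    counts = Counter(onset_dates)
    start, end = min(counts), max(counts)
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    daily = [counts.get(day, 0) for day in days]  # days without cases count as 0
    cumulative = list(accumulate(daily))
    return days, daily, cumulative


# Invented onset dates, for illustration only.
onsets = [date(2020, 1, 1), date(2020, 1, 1), date(2020, 1, 3), date(2020, 1, 4)]
days, daily, cumulative = epidemic_curve(onsets)

SCALE = 20  # rescaling factor taken from the figure 2 caption
scaled = [SCALE * c for c in cumulative]
print(daily, cumulative, scaled)
```

Because recent cases have not yet been reported, the last few days of any curve built by onset date will be incomplete, which is the reporting-delay effect described above.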
Province-level epidemic curves are shown by reporting
date in figure 3. As of Jan 31, 2020, 16 (52%) of
30 provinces in mainland China had reported more than
100 confirmed cases. The apparent rapid growth of newly
reported cases between Jan 18, and Jan 31, 2020, in
several provinces outside of Hubei province is consistent
with sustained local transmission.
Across the study period, the median delay between
symptom onset and seeking care at a hospital or clinic was
2 days (IQR 0–5 days) in mainland China (figure 4). This
delay decreased from 5 days before Jan 18, 2020, to 2 days
thereafter (Wilcoxon test p=0·0009). Some provinces, such
as Tianjin and Yunnan, had shorter delays (data by province
not shown), while the early cases from Hubei province
were characterised by longer delays in seeking care
(median 0 days [IQR 0–1]).
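The before-and-after comparison of delays can be summarised as in the sketch below. The text does not state which date assigns a patient to a period, so splitting on the date of symptom onset is an assumption here, as are the function names and the example records. The quartile method may also differ from the one used in the original R analysis.

```python
from datetime import date
from statistics import median, quantiles

CUTOFF = date(2020, 1, 18)  # date used in the text to separate the two periods


def delays_by_period(records, cutoff=CUTOFF):
    """Split onset-to-care delays (days) by whether onset precedes the cutoff."""
    before = [(care - onset).days for onset, care in records if onset < cutoff]
    after = [(care - onset).days for onset, care in records if onset >= cutoff]
    return before, after


def summarise(delays):
    """Return (median, lower quartile, upper quartile); needs at least two values."""
    q1, _, q3 = quantiles(delays, n=4, method="inclusive")
    return median(delays), q1, q3


# Invented (onset date, care date) pairs, for illustration only.
records = [
    (date(2020, 1, 10), date(2020, 1, 15)),
    (date(2020, 1, 12), date(2020, 1, 18)),
    (date(2020, 1, 14), date(2020, 1, 18)),
    (date(2020, 1, 20), date(2020, 1, 22)),
    (date(2020, 1, 22), date(2020, 1, 23)),
    (date(2020, 1, 25), date(2020, 1, 28)),
]
before, after = delays_by_period(records)
print(summarise(before), summarise(after))
```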
The median delay between seeking care at a hospital
or clinic and reporting was 2 days (IQR 2–5 days) in
mainland China and decreased from 9 days before
Jan 18, 2020, to 2 days thereafter (Wilcoxon test
p<0·0001; figure 4). Similarly to delays in seeking care
at a hospital or clinic, reporting was quickest in Tianjin
and Yunnan (median 1 day [IQR 0–1]) and slowest in
Hubei province (median 12 days [IQR 7–16]).
The median delay between symptom onset and seeking
care at a hospital or clinic was 1 day (IQR 0–3) for
international travellers, and shorter than for patients in
Hubei province or the rest of mainland China (Kruskal–
Wallis test p<0·0001; figure 4). Even in the period after
Jan 18, 2020, when awareness of the outbreak increased,
a shorter delay between symptom onset and seeking care
at a hospital or clinic was seen for international patients
than for those in mainland China (Wilcoxon test
p<0·0001). For international cases, the delay between
seeking care at a hospital or clinic and reporting was
2 days (IQR 1–4), also shorter than for mainland China
(Wilcoxon test p<0·0001; figure 4).
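For the three-location comparison (Hubei province, the rest of mainland China, and international travellers), a Kruskal–Wallis test can be sketched with the standard library alone. This version is restricted to exactly three groups, because the chi-square approximation with 2 degrees of freedom has the closed-form p value exp(−H/2). The function name and delay values are illustrative assumptions, and the approximation is rough for samples this small.

```python
import math
from collections import Counter


def kruskal_wallis_3(a, b, c):
    """Kruskal–Wallis H test for three groups; returns (H, approximate p value)."""
    groups = [list(a), list(b), list(c)]
    pooled = sorted(value for group in groups for value in group)
    n = len(pooled)

    # Average rank (starting at 1) for each distinct value.
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + j) / 2 + 1
        i = j + 1

    rank_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * rank_term - 3 * (n + 1)

    # Tie correction; if every value is identical the test is uninformative.
    ties = sum(t**3 - t for t in Counter(pooled).values())
    correction = 1 - ties / (n**3 - n)
    if correction == 0:
        return 0.0, 1.0
    h = max(h / correction, 0.0)
    return h, math.exp(-h / 2)  # chi-square survival function with 2 df


# Invented delays in days for three locations, for illustration only.
h_stat, p_value = kruskal_wallis_3([1, 2, 3], [4, 5, 6], [7, 8, 9])
print(round(h_stat, 2), round(p_value, 4))
```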
On the basis of 33 patients with a travel history to
Wuhan, we estimated the median incubation period
for COVID-19 to be 4·5 days (IQR 3·0–5·5; appendix p 2).
Figure 1: Age distribution of patients with COVID-19 from crowdsourced data
(A) All 507 cases by disease outcome (alive or unknown or deceased at time of reporting); vertical bars are case
counts in each age group and the dotted lines show the median age for patients who were alive or with unknown
outcomes at the time of reporting and those who had died at the time of reporting. (B) Relative risk by 5-year age
band for 364 cases reported in China. The observed data are shown by bars and the estimated relative risk is shown
by datapoints and a spline-smoothed curve. COVID-19=coronavirus disease 2019.
[Figure 1 panels. (A) Case count by age group (years), by outcome (alive or unknown; deceased). (B) Case count and estimated relative risk by age group (years), in age bands from 1–4 to ≥80.]
See Online for appendix
Discussion
Information from patient line lists is crucial but difficult
to obtain at the beginning of an outbreak. Here we have
shown that careful compilation of crowdsourced reports
curated by a long-standing Chinese medical social
network provides a valuable picture of the outbreak
of COVID-19 in real time. The outbreak timeline is
consistent with aggregated case counts provided by health
authorities. For comparison, China CDC published the
first epidemic curve by symptom onset on Jan 28, 2020.12
Line lists provide unique information on the delays
between symptom onset and detection by the health-care
system, reporting delays, and travel histories. This
information cannot be extracted from aggregated case
counts published by official sources. Line list data can
help assess the effectiveness of interventions and the
potential for widespread transmission beyond the initial
foci of infection. In particular, shorter delays between
symptom onset and admission to hospital or seeking care
in a hospital or clinic accelerate detection and isolation of
cases, effectively shortening the infectious period.
A useful feature of our crowdsourced database was the
availability of travel histories for patients returning from
Wuhan, which, along with dates of symptom onset,
allowed for estimation of the incubation period here and
in related work.13,14 A narrow window of exposure could
be defined for a subset of patients who had a short stay
in Wuhan, at a time when the epidemic was still localised
to Wuhan. Several teams have used our dataset and
datasets from others to estimate a mean incubation
period for COVID-19 to be 5–6 days (95% CI 2–11).13–16
Our own estimate (median 4·5 days [IQR 3·0–5·5]) is
consistent with previous work that used other modelling
approaches.13–16 The incubation period is a useful
parameter to guide isolation and contact tracing; based
on existing data, the disease status of a contact should be
known with near certainty after a period of observation
of 14 days.13 Availability of a public dataset enables
independent estimation of important epidemiological
parameters by several teams, allowing for confirmation
and cross-checking at a time when information can be
conflicting and noisy.
An interesting finding in our data relates to the age
distribution of patients. We found a heavy skew of
infection towards older age groups, with substantially
fewer children infected. This pattern could indicate age-
related differences in susceptibility to infection, severe
outcomes, or behaviour. However, a substantial portion of
the patients in our database are travellers, a population
that is usually predominantly adult (although this does not
exclude children). Furthermore, because patient data in
our dataset were captured by the health system, they are
biased towards the more severe spectrum of the disease,
especially for patients from mainland China. Clinical
reports have shown that severity of COVID-19 is associated
with the presence of chronic conditions,16,17 which are
more frequent in older age groups. Nevertheless, we
would also expect children younger than 5 years to be at
risk of severe outcomes and to be reported to the health-
care system, as is seen for other respiratory infections.18
Biological differences could have a role in shaping
these age profiles. A detailed analysis of one of the early
COVID-19 clusters by Chan and colleagues19 revealed
symptomatic infections in five adult members of the
same household, while a child in the same household
aged 10 years was infected but remained asymptomatic,
potentially indicating biological differences in the risk of
clinical disease driven by age. Previous immunity from
infection with a related coronavirus has been speculated
to potentially protect children from SARS,20,21 and so
might also have a role in COVID-19. In any case, if the
age distribution of cases reported here were to be
confirmed and the epidemic were to progress globally,
we would expect an increase in respiratory mortality
concentrated among people aged 30 years and older. This
mortality pattern would be substantially different from
the profile of the 2009 influenza pandemic, for which
excess mortality was concentrated in those younger than
65 years.21
In our dataset, we saw a rapid increase in the number
of people infected with COVID-19 in several provinces of
China, consistent with local transmission outside of
Hubei province. As of Jan 31, 2020, province-level
epidemic curves are only available by date of reporting,
rather than date of symptom onset, which usually inflates
recent case counts if detection has increased.
Figure 2: Daily timeline of the COVID-19 epidemic based on crowdsourced data and official sources, by location
All data are by date of symptom onset. Cumulative curves are shown for the official China CDC data (published on
Jan 28, 2020), and for the crowdsourced data. Crowdsourced data have been rescaled and multiplied by 20 to
enable clear comparison with the official China CDC data. Histograms are daily case count, based on crowdsourced
data for Hubei province, mainland China non-Hubei province, and cases outside of mainland China. CDC=Centers
for Disease Control. COVID-19=coronavirus disease 2019.
[Figure 2 axes and legend. x-axis: date of symptom onset (Dec 9, 2019, to Feb 3, 2020). y-axes: daily cases (0–40) and cumulative number of cases (0–8000). Daily cases: mainland China, Hubei province; mainland China, non-Hubei province; international. Cumulative cases (mainland China): China CDC; crowdsourced data (×20).]
Furthermore, province-level data include both returning
travellers from Hubei province (ie, importations) and
locally acquired cases, which also usually inflate the
apparent risk of local transmission. Notably, other lines
of evidence suggest that local transmission is now well
established outside of Hubei province, because travel
increased just before the Chinese New Year on
Jan 25, 2020, and before implementation of the travel
ban in Wuhan.22 Accordingly, our own data include
evidence of transmission clusters in non-travellers, with,
for instance, a second-generation transmission event
reported in Shaanxi on Jan 21, 2020.
Our study had several limitations, one of which was the
data we used. Although all provinces in mainland China
provide aggregated information on infections and deaths,
individual-level patient descriptions are only available for
a subset of provinces. Geographical coverage is heterogeneous
in our line list, and we have a notable deficit of
cases from Hubei province, the focus of the COVID-19
outbreak. We expect that little patient-level information is
shared on social media by province-level and city-level
health authorities in Wuhan and Hubei province because
health systems are overwhelmed. For similar reasons,
provinces with a large total case count at the end of
January, 2020, or with a weaker health infrastructure,
were under-represented in our line list, with the exception
of Beijing. Other limitations in our data include severity
(only patients who had severe enough symptoms to seek
care were captured) and changes in case definition. A
series of epidemiological criteria were required for
COVID-19 testing, including travel history to Wuhan
within the past 2 weeks; residence in Wuhan within the
past 2 weeks; contact with individuals from Wuhan (with
fever and respiratory symptoms) within the past 2 weeks;
and being part of an established disease cluster. Some
of these criteria (eg, relation to Wuhan) were relaxed
over time (appendix). As a result, we have an over-
representation of travel-related cases in our database.
The reproduction number is an important quantity for
outbreak control. We refrained from estimating this
parameter because reporting changes could bias
estimates relying on epidemic growth rates. Furthermore,
our dataset captured cases all over China and does not
reflect transmission patterns in any particular location. A
mean reproduction number of 2·5–2·7 has previously
been estimated on the basis of the volume of importations
of international cases in the pre-intervention period in
Wuhan.11
Figure 3: Daily timeline of the COVID-19 epidemic at the provincial level in China, during January, 2020
Vertical bars show the daily counts of new reported cases, with provinces sorted by total number of reported cases. The timeline for each province is reconstructed on the basis of daily outbreak
situation reports provided by provincial health authorities and posted on DXY.cn and are true as of Jan 31, 2020. COVID-19=coronavirus disease 2019.
[Figure 3 panels, one per province (x axis: date, Jan 18 to Jan 30; y axis: new cases daily), with total cases: Hubei 7153; Zhejiang 599; Guangdong 520; Henan 422; Hunan 389; Anhui 297; Jiangxi 287; Chongqing 238; Sichuan 207; Jiangsu 202; Shandong 196; Shanghai 153; Beijing 145; Fujian 144; Shaanxi 102; Guangxi 100; Hebei 96; Heilongjiang 80; Yunnan 78; Liaoning 60; Hainan 54; Shanxi 43; Tianjin 32; Gansu 26; Ningxia 22; Inner Mongolia 21; Jilin 17; Xinjiang 15; Guizhou 15; Qinghai 9.]
We recognise that, although our data source is useful
and timely, it should not replace official statistics. Manual
compilation of detailed line lists from media sources is
highly time consuming and is not sustainable when case
counts reach several thousands. Here we provide detailed
data on 507 patients when the official case count was over
9000 by Jan 31, 2020, representing a sample of
approximately 5% of reported cases and a much smaller
proportion of the full spectrum of COVID-19 cases,
which include mild infections. A crowdsourced system
would not be expected to catch all cases, especially if
many cases are too mild to be captured by the health-care
system, digital surveillance, or social media. Notably,
DXY.cn does not generate data outside of traditional
surveillance systems but rather provides a channel of
rapid communication between the public and health
authorities. In turn, our approach has helped extract and
repackage information from health authorities into an
analytical format, which was not available elsewhere.
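The sampling fraction quoted above can be checked with a short calculation. The sketch below uses the two numbers given in the text; because the official count is stated only as "over 9000", the value 9000 is treated as a floor and the result as an upper bound. Variable names are illustrative.

```python
# Line-list size and official case count as quoted in the text.
line_list_patients = 507
official_cases_lower_bound = 9000  # the text says "over 9000", so this is a floor

# Share of reported cases captured in the line list. This is an upper bound,
# because the true official count exceeded 9000.
sample_fraction = line_list_patients / official_cases_lower_bound
print(f"Sampled share of reported cases: at most {sample_fraction:.1%}")
```

An upper bound slightly above 5% is consistent with the "approximately 5%" stated in the text.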
At the time of writing, eorts are underway to
coordinate compilation of COVID-19 data from online
sources across several academic teams. Ultimately, we
expect that a line list of patients will be shared by
government sources with the global community;
however, data cleaning and access issues might take a
prohibitively long time to resolve. For the west African
Ebola outbreak, a similarly coordinated effort to publish
a line list took 2 years.23 Given the progression of the
COVID-19 outbreak, such a long delay would be
counterproductive.
Overall, the novelty of our approach was to rely on a
unique source for social media and news reports in China,
which aggregated and curated relevant information. This
approach facilitated entry of robust and standard data on
clinical and demographic information. Reassuringly,
DXY.cn maintains a special section dedicated to debunking
fake news, myths, and rumours about the COVID-19
outbreak. Looking to the future, collection of patient data
in the context of emergencies could include information
on whether patients are identified through contact tracing
or because they seek care on their own. Furthermore, data
interpretability could be improved by gathering more
quantitative information on how case definitions are used
in practice.
In conclusion, crowdsourced epidemiological data can
be useful to monitor emerging outbreaks, such as
COVID-19 and, as previously, Ebola virus.7 These efforts
can help generate and disseminate detailed information
in the early stages of an outbreak when little other data
are available, enabling independent estimation of key
parameters that aect interventions. Based on our small
sample of patients with COVID-19, we note an intriguing
age distribution, reminiscent of that of SARS, which
warrants further epidemiological and serological studies.
We also report early signs that the response is
strengthening in China on the basis of a decrease in case
detection time, and rapid management of travel-related
infections that are identified internationally. This is an
early report of a rapidly evolving situation and the
parameters discussed here could change quickly. In the
coming weeks, we will continue to monitor the
epidemiology of this outbreak using data from news
reports and ocial sources.
Contributors
KS and CV contributed to the study design. KS and JC contributed to the
data compilation. KS, JC, and CV contributed to data analysis. KS and JC
contributed to the design and drawing of figures. KS, JC, and CV
contributed to the writing of the manuscript.
Declaration of interests
We declare no competing interests.
Data sharing
All data used in this report have been made publicly available on the
Laboratory for the Modeling of Biological + Socio-technical systems
website of Northeastern University. The available data include daily case
counts of COVID-19 by reporting date and Chinese province, and a
de-identified line list of patients with COVID-19. The line list includes
geographical location (country and province), reporting date, dates of
symptom onset and seeking care at a hospital or clinic, relation to
Wuhan, discharge status when known, an English summary of the case
description from media sources, and a link to the original source of data.
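To make the line-list layout described above concrete, here is a minimal Python sketch of one record, together with the two delays analysed in this study (symptom onset to seeking care, and seeking care to reporting). The class name, field names, and example record are hypothetical and are not taken from the published dataset, whose actual column names may differ.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class LineListRecord:
    """One de-identified patient entry; all field names are hypothetical."""
    country: str
    province: Optional[str]
    reporting_date: date
    symptom_onset: Optional[date]       # may be missing in media reports
    care_seeking_date: Optional[date]   # first hospital or clinic visit
    relation_to_wuhan: Optional[str]
    discharged: Optional[bool]          # None when the status is unknown
    summary: str                        # English summary of the case description
    source_url: str

    def onset_to_care_days(self) -> Optional[int]:
        """Days from symptom onset to seeking care, if both dates are known."""
        if self.symptom_onset is None or self.care_seeking_date is None:
            return None
        return (self.care_seeking_date - self.symptom_onset).days

    def care_to_report_days(self) -> Optional[int]:
        """Days from seeking care to reporting, if the visit date is known."""
        if self.care_seeking_date is None:
            return None
        return (self.reporting_date - self.care_seeking_date).days


# Made-up record for illustration only (not a real patient).
example = LineListRecord(
    country="China",
    province="Shaanxi",
    reporting_date=date(2020, 1, 25),
    symptom_onset=date(2020, 1, 20),
    care_seeking_date=date(2020, 1, 22),
    relation_to_wuhan="travel history",
    discharged=None,
    summary="Illustrative entry.",
    source_url="https://example.org/placeholder",
)
print(example.onset_to_care_days(), example.care_to_report_days())
```

Keeping the dates optional reflects the fact that media reports do not always state symptom onset or the date of the first clinic visit, so delays can be computed only for the subset of records with complete dates.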
Acknowledgments
The study was funded by the in-house research division of the Fogarty
International Center. CV and KS acknowledge support from the Bill &
Melinda Gates Foundation. The findings and conclusions in this study
are those of the authors and do not necessarily represent the official
position of the US National Institutes of Health or US Department of
Health and Human Services.
Figure 4: Delay between symptom onset and seeking care at a hospital or clinic (A) and between seeking care
at a hospital or clinic and reporting (B) of COVID-19 cases, by location
Data are for the entire study period and include all cases reported between Jan 13 and Jan 31, 2020. Datapoints are
medians, with the spread of data indicated by the filled shapes. All time intervals significantly differ between
locations (Kruskal-Wallis test, p<0·0001). COVID-19=coronavirus disease 2019.
[Figure 4 panels: (A) symptom onset to seeking care at hospital or clinic; (B) seeking care at hospital or clinic to report. x axis: location (Hubei province, China; non-Hubei province, China; international); y axis: days (0 to 30).]
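The comparison of delays across three locations reported in the Figure 4 caption uses a Kruskal-Wallis test. The sketch below is a plain-Python illustration of that test for exactly three groups, not the authors' code (the software they used is not stated here). The delay values are made up for illustration. With three groups the reference chi-square distribution has 2 degrees of freedom, so the p-value is simply exp(-H/2).

```python
import math
from typing import Dict, List, Tuple


def kruskal_wallis_three_groups(groups: Dict[str, List[float]]) -> Tuple[float, float]:
    """Return (H, p) for exactly three non-empty groups, with tie correction."""
    if len(groups) != 3:
        raise ValueError("this sketch handles exactly three groups")

    # Pool all observations, remembering which group each came from.
    pooled = sorted((value, name) for name, values in groups.items() for value in values)
    n = len(pooled)

    # Assign mid-ranks so that tied values share the average of their ranks.
    rank_sums = {name: 0.0 for name in groups}
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        mid_rank = (i + 1 + j) / 2  # average of ranks i+1 .. j
        t = j - i
        tie_term += t**3 - t
        for k in range(i, j):
            rank_sums[pooled[k][1]] += mid_rank
        i = j

    if tie_term == n**3 - n:
        return 0.0, 1.0  # all values identical: no evidence of a difference

    h = 12 / (n * (n + 1)) * sum(
        rank_sums[name] ** 2 / len(values) for name, values in groups.items()
    ) - 3 * (n + 1)
    h /= 1 - tie_term / (n**3 - n)  # tie correction

    # Chi-square survival function with 2 degrees of freedom.
    return h, math.exp(-h / 2)


# Made-up delays in days, for illustration only.
delays = {
    "Hubei province": [5, 7, 8, 10, 12],
    "Non-Hubei province": [2, 3, 4, 4, 6],
    "International": [0, 1, 1, 2, 3],
}
h_stat, p_value = kruskal_wallis_three_groups(delays)
print(f"H = {h_stat:.2f}, p = {p_value:.4f}")
```

A rank-based test is a natural choice for delay data, which are skewed and contain many tied integer values. The chi-square reference distribution is an approximation that becomes more reliable as group sizes grow.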
References
1 WHO. Statement on the second meeting of the International
Health Regulations (2005) Emergency Committee regarding the
outbreak of novel coronavirus (2019-nCoV). Geneva: World Health
Organization, 2020. https://www.who.int/news-room/detail/30-01-2020-statement-on-the-second-meeting-of-the-international-health-regulations-(2005)-emergency-committee-regarding-the-outbreak-of-novel-coronavirus-(2019-ncov) (accessed Feb 10, 2020).
2 Rivers C, Chretien JP, Riley S, et al. Using “outbreak science” to
strengthen the use of models during epidemics. Nat Commun 2019;
10: 3102.
3 Chowell G, Bertozzi SM, Colchero MA, et al. Severe respiratory
disease concurrent with the circulation of H1N1 influenza.
N Engl J Med 2009; 361: 674–79.
4 Chowell G, Echevarría-Zuno S, Viboud C, et al. Characterizing the
epidemiology of the 2009 influenza A/H1N1 pandemic in Mexico.
PLoS Med 2011; 8: e1000436.
5 Fraser C, Donnelly CA, Cauchemez S, et al. Pandemic potential of
a strain of influenza A (H1N1): early findings. Science 2009;
324: 1557–61.
6 Lipsitch M, Cohen T, Cooper B, et al. Transmission dynamics and
control of severe acute respiratory syndrome. Science 2003;
300: 1966–70.
7 Cleaton JM, Viboud C, Simonsen L, Hurtado AM, Chowell G.
Characterizing Ebola transmission patterns based on internet news
reports. Clin Infect Dis 2016; 62: 24–31.
8 DXY.cn. Pneumonia. 2020. http://3g.dxy.cn/newh5/view/pneumonia
(accessed Feb 10, 2020; in Chinese).
9 Institute for Health Metrics and Evaluation. Global health data
exchange. Seattle, WA: Institute for Health Metrics and Evaluation,
2020. ghdx.healthdata.org (accessed Feb 14, 2020).
10 Lemaitre M, Carrat F. Comparative age distribution of influenza
morbidity and mortality during seasonal influenza epidemics and
the 2009 H1N1 pandemic. BMC Infect Dis 2010; 10: 162.
11 MRC Centre for Global Infectious Disease Analysis. News/
2019-nCoV. London: Imperial College London, 2020. https://www.imperial.ac.uk/mrc-global-infectious-disease-analysis/news--wuhan-coronavirus/ (accessed Feb 10, 2020).
12 China Center for Disease Control and Prevention. 2019 epidemic
update and risk assessment of 2019 novel coronavirus. Beijing:
China Center for Disease Control and Prevention, 2020.
http://www.chinacdc.cn/yyrdgz/202001/P020200128523354919292.pdf (accessed Feb 10, 2020).
13 Backer JA, Klinkenberg D, Wallinga J. Incubation period of 2019
novel coronavirus (2019-nCoV) infections among travellers from
Wuhan, China, 20–28 January 2020. Euro Surveill 2020; 25: 2000062.
14 Lauer SA, Grantz KH, Bi Q, et al. The incubation period of
2019-nCoV from publicly reported confirmed cases: estimation and
application. medRxiv 2020; published online Feb 4.
DOI:10.1101/2020.02.02.20020016 (preprint).
15 Zhao A, Ran J, Musa SS, et al. Preliminary estimation of the basic
reproduction number of novel coronavirus (2019-nCoV) in China,
from 2019 to 2020: a data-driven analysis in the early phase of the
outbreak. bioRxiv 2020; published online Jan 24.
DOI:10.1101/2020.01.23.916395v1 (preprint).
16 Li Q, Guan X, Wu P, et al. Early transmission dynamics in Wuhan,
China, of novel coronavirus-infected pneumonia. N Engl J Med
2020; published online Jan 29. DOI:10.1056/NEJMoa2001316.
17 Huang C, Wang Y, Li X, et al. Clinical features of patients infected
with 2019 novel coronavirus in Wuhan, China. Lancet 2020;
published online Jan 24. https://doi.org/10.1016/S0140-6736(20)30183-5.
18 Greenbaum AH, Chen J, Reed C, et al. Hospitalizations for severe
lower respiratory tract infections. Pediatrics 2014; 134: 546–54.
19 Chan JF, Yuan S, Kok KH, et al. A familial cluster of pneumonia
associated with the 2019 novel coronavirus indicating person-to-
person transmission: a study of a family cluster. Lancet 2020;
published online Jan 24. https://doi.org/10.1016/S0140-6736(20)30154-9.
20 Leung GM, Hedley AJ, Ho LM, et al. The epidemiology of severe
acute respiratory syndrome in the 2003 Hong Kong epidemic:
an analysis of all 1755 patients. Ann Intern Med 2004; 141: 662–73.
21 Simonsen L, Spreeuwenberg P, Lustig R, et al. Global mortality
estimates for the 2009 Influenza Pandemic from the GLaMOR
project: a modeling study. PLoS Med 2013; 10: e1001558.
22 Du W, Wang L, Cauchemez S, et al. Risk of 2019 novel coronavirus
importations throughout China prior to the Wuhan quarantine.
medRxiv 2020; published online Feb 4.
DOI:10.1101/2020.01.28.20019299 (preprint).
23 Agua-Agum J, Ariyarajah A, Aylward B, et al. Exposure patterns
driving Ebola transmission in west Africa: a retrospective
observational study. PLoS Med 2016; 13: e1002170.
Article
Full-text available
The coronavirus (COVID-19) pandemic has dramatically changed how people can safely use public spaces, particularly in densely populated urban environments. This short commentary revealed the effects of people’s absence in public places due to the series of lockdowns or social distancing policies imposed due to the pandemic. The argument presented was that the absence of people affects urban studies, and considers urban atmospheres in public places during such pandemic times. This article provides a six-step conceptual framework that guides urban planners and designers in reconstructing public space settings for affective atmospheres based on new realities in people’s presence and proximity.
Preprint
Full-text available
A novel human coronavirus (2019-nCoV) was identified in China in December, 2019. There is limited support for many of its key epidemiologic features, including the incubation period, which has important implications for surveillance and control activities. Here, we use data from public reports of 101 confirmed cases in 38 provinces, regions, and countries outside of Wuhan (Hubei province, China) with identifiable exposure windows and known dates of symptom onset to estimate the incubation period of 2019-nCoV. We estimate the median incubation period of 2019-nCoV to be 5.2 days (95% CI: 4.4, 6.0), and 97.5% of those who develop symptoms will do so within 10.5 days (95% CI: 7.3, 15.3) of infection. These estimates imply that, under conservative assumptions, 64 out of every 10,000 cases will develop symptoms after 14 days of active monitoring or quarantine. Whether this risk is acceptable depends on the underlying risk of infection and consequences of missed cases. The estimates presented here can be used to inform policy in multiple contexts based on these judgments.
Article
Full-text available
A novel coronavirus (2019-nCoV) is causing an outbreak of viral pneumonia that started in Wuhan, China. Using the travel history and symptom onset of 88 confirmed cases that were detected outside Wuhan in the early outbreak phase, we estimate the mean incubation period to be 6.4 days (95% credible interval: 5.6–7.7), ranging from 2.1 to 11.1 days (2.5th to 97.5th percentile). These values should help inform 2019-nCoV case definitions and appropriate quarantine durations. © 2020 European Centre for Disease Prevention and Control (ECDC). All rights reserved.
Article
Full-text available
Background: An ongoing outbreak of a novel coronavirus (2019-nCoV) pneumonia hit a major city of China, Wuhan, in December 2019 and subsequently reached other provinces/regions of China and other countries. We present estimates of the basic reproduction number, R0, of 2019-nCoV in the early phase of the outbreak. Methods: Accounting for the impact of the variations in disease reporting rate, we modelled the epidemic curve of the 2019-nCoV case time series in mainland China from January 10 to January 24, 2020, through exponential growth. With the estimated intrinsic growth rate (γ), we estimated R0 by using the serial intervals (SI) of two other well-known coronavirus diseases, MERS and SARS, as approximations for the true unknown SI. Findings: The early outbreak data largely follow exponential growth. We estimated that the mean R0 ranges from 2.24 (95% CI: 1.96-2.55) to 3.58 (95% CI: 2.89-4.39), associated with an 8-fold to 2-fold increase in the reporting rate. We demonstrated that changes in reporting rate substantially affect estimates of R0. Conclusion: The mean estimate of R0 for 2019-nCoV ranges from 2.24 to 3.58, and is significantly larger than 1. Our findings indicate the potential of 2019-nCoV to cause outbreaks.
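As a rough illustration of how an exponential growth rate and a serial interval distribution combine into R0, the sketch below uses the standard relation R0 = 1/M(−γ), where M is the moment generating function of the serial interval. It assumes a gamma-distributed serial interval, which the abstract does not state; the function names and the numeric inputs in the usage lines are hypothetical placeholders, not values from the study.

```python
import math


def growth_rate_from_doubling_time(doubling_time_days):
    """Exponential growth rate (per day) implied by a doubling time."""
    return math.log(2) / doubling_time_days


def r0_from_growth_rate(growth_rate, si_mean, si_sd):
    """R0 = 1 / M(-r) for a gamma-distributed serial interval.

    Assumption: the serial interval is gamma distributed with the given
    mean and standard deviation (in days). For a gamma distribution with
    shape k and scale theta, M(-r) = (1 + r * theta) ** (-k).
    """
    shape = (si_mean / si_sd) ** 2
    scale = si_sd ** 2 / si_mean
    return (1.0 + growth_rate * scale) ** shape


# Hypothetical inputs, chosen only to show the mechanics.
example_rate = 0.1   # per day
example_r0 = r0_from_growth_rate(example_rate, si_mean=8.0, si_sd=4.0)
print(f"R0 for growth rate {example_rate}/day: {example_r0:.2f}")
```

With zero growth the function returns exactly 1, and a faster growth rate or a longer serial interval both push the estimate upward, which is why the choice of serial interval matters when the true one is unknown.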
Article
Full-text available
Background: The initial cases of novel coronavirus (2019-nCoV)-infected pneumonia (NCIP) occurred in Wuhan, Hubei Province, China, in December 2019 and January 2020. We analyzed data on the first 425 confirmed cases in Wuhan to determine the epidemiologic characteristics of NCIP. Methods: We collected information on demographic characteristics, exposure history, and illness timelines of laboratory-confirmed cases of NCIP that had been reported by January 22, 2020. We described characteristics of the cases and estimated the key epidemiologic time-delay distributions. In the early period of exponential growth, we estimated the epidemic doubling time and the basic reproductive number. Results: Among the first 425 patients with confirmed NCIP, the median age was 59 years and 56% were male. The majority of cases (55%) with onset before January 1, 2020, were linked to the Huanan Seafood Wholesale Market, as compared with 8.6% of the subsequent cases. The mean incubation period was 5.2 days (95% confidence interval [CI], 4.1 to 7.0), with the 95th percentile of the distribution at 12.5 days. In its early stages, the epidemic doubled in size every 7.4 days. With a mean serial interval of 7.5 days (95% CI, 5.3 to 19), the basic reproductive number was estimated to be 2.2 (95% CI, 1.4 to 3.9). Conclusions: On the basis of this information, there is evidence that human-to-human transmission has occurred among close contacts since the middle of December 2019. Considerable efforts to reduce transmission will be required to control outbreaks if similar dynamics apply elsewhere. Measures to prevent or reduce transmission should be implemented in populations at risk. (Funded by the Ministry of Science and Technology of China and others.).
Article
Full-text available
Background: An ongoing outbreak of pneumonia associated with a novel coronavirus was reported in Wuhan city, Hubei province, China. Affected patients were geographically linked with a local wet market as a potential source. No data on person-to-person or nosocomial transmission have been published to date. Methods: In this study, we report the epidemiological, clinical, laboratory, radiological, and microbiological findings of five patients in a family cluster who presented with unexplained pneumonia after returning to Shenzhen, Guangdong province, China, after a visit to Wuhan, and an additional family member who did not travel to Wuhan. Phylogenetic analysis of genetic sequences from these patients were done. Findings: From Jan 10, 2020, we enrolled a family of six patients who travelled to Wuhan from Shenzhen between Dec 29, 2019 and Jan 4, 2020. Of six family members who travelled to Wuhan, five were identified as infected with the novel coronavirus. Additionally, one family member, who did not travel to Wuhan, became infected with the virus after several days of contact with four of the family members. None of the family members had contacts with Wuhan markets or animals, although two had visited a Wuhan hospital. Five family members (aged 36-66 years) presented with fever, upper or lower respiratory tract symptoms, or diarrhoea, or a combination of these 3-6 days after exposure. They presented to our hospital (The University of Hong Kong-Shenzhen Hospital, Shenzhen) 6-10 days after symptom onset. They and one asymptomatic child (aged 10 years) had radiological ground-glass lung opacities. Older patients (aged >60 years) had more systemic symptoms, extensive radiological ground-glass lung changes, lymphopenia, thrombocytopenia, and increased C-reactive protein and lactate dehydrogenase levels. 
The nasopharyngeal or throat swabs of these six patients were negative for known respiratory microbes by point-of-care multiplex RT-PCR, but five patients (four adults and the child) were RT-PCR positive for genes encoding the internal RNA-dependent RNA polymerase and surface Spike protein of this novel coronavirus, which were confirmed by Sanger sequencing. Phylogenetic analysis of these five patients' RT-PCR amplicons and two full genomes by next-generation sequencing showed that this is a novel coronavirus, which is closest to the bat severe acute respiratory syndrome (SARS)-related coronaviruses found in Chinese horseshoe bats. Interpretation: Our findings are consistent with person-to-person transmission of this novel coronavirus in hospital and family settings, and the reports of infected travellers in other geographical regions. Funding: The Shaw Foundation Hong Kong, Michael Seak-Kan Tong, Respiratory Viral Research Foundation Limited, Hui Ming, Hui Hoy and Chow Sin Lan Charity Fund Limited, Marina Man-Wai Lee, the Hong Kong Hainan Commercial Association South China Microbiology Research Fund, Sanming Project of Medicine (Shenzhen), and High Level-Hospital Program (Guangdong Health Commission).
Article
Full-text available
Infectious disease modeling has played a prominent role in recent outbreaks, yet integrating these analyses into public health decision-making has been challenging. We recommend establishing ‘outbreak science’ as an inter-disciplinary field to improve applied epidemic modeling.
Article
Full-text available
Background The ongoing West African Ebola epidemic began in December 2013 in Guinea, probably from a single zoonotic introduction. As a result of ineffective initial control efforts, an Ebola outbreak of unprecedented scale emerged. As of 4 May 2015, it had resulted in more than 19,000 probable and confirmed Ebola cases, mainly in Guinea (3,529), Liberia (5,343), and Sierra Leone (10,746). Here, we present analyses of data collected during the outbreak identifying drivers of transmission and highlighting areas where control could be improved. Methods and Findings Over 19,000 confirmed and probable Ebola cases were reported in West Africa by 4 May 2015. Individuals with confirmed or probable Ebola (“cases”) were asked if they had exposure to other potential Ebola cases (“potential source contacts”) in a funeral or non-funeral context prior to becoming ill. We performed retrospective analyses of a case line-list, collated from national databases of case investigation forms that have been reported to WHO. These analyses were initially performed to assist WHO’s response during the epidemic, and have been updated for publication. We analysed data from 3,529 cases in Guinea, 5,343 in Liberia, and 10,746 in Sierra Leone; exposures were reported by 33% of cases. The proportion of cases reporting a funeral exposure decreased over time. We found a positive correlation (r = 0.35, p < 0.001) between this proportion in a given district for a given month and the within-district transmission intensity, quantified by the estimated reproduction number (R). We also found a negative correlation (r = −0.37, p < 0.001) between R and the district proportion of hospitalised cases admitted within ≤4 days of symptom onset. These two proportions were not correlated, suggesting that reduced funeral attendance and faster hospitalisation independently influenced local transmission intensity. We were able to identify 14% of potential source contacts as cases in the case line-list. 
Linking cases to the contacts who potentially infected them provided information on the transmission network. This revealed a high degree of heterogeneity in inferred transmissions, with only 20% of cases accounting for at least 73% of new infections, a phenomenon often called super-spreading. Multivariable regression models allowed us to identify predictors of being named as a potential source contact. These were similar for funeral and non-funeral contacts: severe symptoms, death, non-hospitalisation, older age, and travelling prior to symptom onset. Non-funeral exposures were strongly peaked around the death of the contact. There was evidence that hospitalisation reduced but did not eliminate onward exposures. We found that Ebola treatment units were better than other health care facilities at preventing exposure from hospitalised and deceased individuals. The principal limitation of our analysis is limited data quality, with cases not being entered into the database, cases not reporting exposures, or data being entered incorrectly (especially dates, and possible misclassifications). Conclusions Achieving elimination of Ebola is challenging, partly because of super-spreading. Safe funeral practices and fast hospitalisation contributed to the containment of this Ebola epidemic. Continued real-time data capture, reporting, and analysis are vital to track transmission patterns, inform resource deployment, and thus hasten and maintain elimination of the virus from the human population.
Article
Full-text available
Background: Detailed information on patient exposure, contact patterns, and discharge status, is rarely available in real time from traditional surveillance systems in the context of an emerging infectious disease outbreak. Here we validate the systematic collection of Internet news reports to characterize epidemiological patterns of Ebola virus disease (EVD) infections during the West African 2014-2015 outbreak. Methods: Based on 58 news reports, we analyzed a total of 79 EVD clusters (286 cases) of size ranging from 1 to 33 cases between January 2014 and February 2015 in Guinea, Sierra Leone and Liberia. Results and conclusions: The great majority of reported exposures stemmed from contact with family members (57.3%) followed by hospitals (18.2%) and funerals (12.7%). Our data indicated that funeral exposure was significantly more frequent in Sierra Leone (27.3%) followed by Guinea (18.2%) and Liberia (1.8%) (Chi-square test; P<0.0001). Funeral transmission was the dominant route of transmission until April 2014 (60%) and was replaced with hospital exposure in June-July 2014 (70%), both of which declined after interventions were put in place. The mean reproduction number of the outbreak was 2.3 (95% CI: 1.8, 2.7). The case fatality rate was estimated at 74.4% (95% CI: 68.3, 79.8). Overall our findings based on news reports are in close agreement with those derived from traditional epidemiological surveillance data and with those reported for prior outbreaks. Our findings support the use of real time-information from trustworthy news reports to provide timely estimates of key epidemiological parameters that may be hard to ascertain otherwise.
Article
Background: A novel human coronavirus, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), was identified in China in December 2019. There is limited support for many of its key epidemiologic features, including the incubation period for clinical disease (coronavirus disease 2019 [COVID-19]), which has important implications for surveillance and control activities. Objective: To estimate the length of the incubation period of COVID-19 and describe its public health implications. Design: Pooled analysis of confirmed COVID-19 cases reported between 4 January 2020 and 24 February 2020. Setting: News reports and press releases from 50 provinces, regions, and countries outside Wuhan, Hubei province, China. Participants: Persons with confirmed SARS-CoV-2 infection outside Hubei province, China. Measurements: Patient demographic characteristics and dates and times of possible exposure, symptom onset, fever onset, and hospitalization. Results: There were 181 confirmed cases with identifiable exposure and symptom onset windows to estimate the incubation period of COVID-19. The median incubation period was estimated to be 5.1 days (95% CI, 4.5 to 5.8 days), and 97.5% of those who develop symptoms will do so within 11.5 days (CI, 8.2 to 15.6 days) of infection. These estimates imply that, under conservative assumptions, 101 out of every 10 000 cases (99th percentile, 482) will develop symptoms after 14 days of active monitoring or quarantine. Limitation: Publicly reported cases may overrepresent severe cases, the incubation period for which may differ from that of mild cases. Conclusion: This work provides additional evidence for a median incubation period for COVID-19 of approximately 5 days, similar to SARS. Our results support current proposals for the length of quarantine or active monitoring of persons potentially exposed to SARS-CoV-2, although longer monitoring periods might be justified in extreme cases. Primary funding source: U.S. Centers for Disease Control and Prevention, National Institute of Allergy and Infectious Diseases, National Institute of General Medical Sciences, and Alexander von Humboldt Foundation.
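To make the link between the reported quantiles and the fraction of cases with onset after a 14-day monitoring window more concrete, here is a minimal sketch. It assumes the incubation period is log-normally distributed (an assumption for illustration; the abstract does not name the distribution) and fits that distribution to the two point estimates quoted above, the median of 5.1 days and the 97.5th percentile of 11.5 days. Because it ignores parameter uncertainty and the study's conservative assumptions, it need not reproduce the published figure of 101 per 10 000. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def lognormal_from_quantiles(median, upper, upper_prob=0.975):
    """Fit log-normal parameters (mu, sigma) from a median and an upper quantile."""
    mu = math.log(median)
    z = NormalDist().inv_cdf(upper_prob)          # standard normal quantile
    sigma = (math.log(upper) - mu) / z
    return mu, sigma


def tail_fraction(mu, sigma, days):
    """Fraction of symptomatic cases whose onset falls after `days`."""
    return 1.0 - NormalDist(mu, sigma).cdf(math.log(days))


# Point estimates quoted in the abstract above.
mu, sigma = lognormal_from_quantiles(median=5.1, upper=11.5)
late_per_10000 = 10_000 * tail_fraction(mu, sigma, days=14)
print(f"Cases per 10 000 with onset after day 14: {late_per_10000:.0f}")
```

Changing `days` shows how quickly the missed fraction shrinks as the monitoring window lengthens, which is the trade-off the abstract refers to when it discusses quarantine duration.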
Article
Background: Hospitalization for lower respiratory tract infections (LRTIs) among children have been well characterized. We characterized hospitalizations for severe LRTI among children. Methods: We analyzed claims data from commercial and Medicaid insurance enrollees (MarketScan) ages 0 to 18 years from 2007 to 2011. LRTI hospitalizations were identified by the first 2 listed International Classification of Diseases, Ninth Revision discharge codes; those with ICU admission and/or receiving mechanical ventilation were defined as severe LRTI. Underlying conditions were determined from out- and inpatient discharge codes in the preceding year. We report insurance specific and combined rates that used both commercial and Medicaid rates and adjusted for age and insurance status. Results: During 2007-2011, we identified 16797 and 12053 severe LRTI hospitalizations among commercial and Medicaid enrollees, respectively. The rates of severe LRTI hospitalizations per 100000 person-years were highest in children aged <1 year (commercial: 244; Medicaid: 372, respectively), and decreased with age. Among commercial enrollees, ≥ 1 condition increased the risk for severe LRTI (1 condition: adjusted relative risk, 2.68; 95% confidence interval, 2.58-2.78; 3 conditions: adjusted relative risk, 4.85; 95% confidence interval, 4.65-5.07) compared with children with no medical conditions. Using commercial/Medicaid combined rates, an estimated 31289 hospitalizations for severe LRTI occurred each year in children in the United States. Conclusions: Among children, the burden of hospitalization for severe LRTI is greatest among children aged <1 year. Children with underlying medical conditions are at greatest risk for severe LRTI hospitalization.