
Modeling Spread of Disease from Social Interactions


Abstract

Research in computational epidemiology to date has concentrated on coarse-grained statistical analysis of populations, often synthetic ones. By contrast, this paper focuses on fine-grained modeling of the spread of infectious diseases throughout a large real-world social network. Specifically, we study the roles that social ties and interactions between specific individuals play in the progress of a contagion. We focus on public Twitter data, where we find that for every health-related message there are more than 1,000 unrelated ones. This class imbalance makes classification particularly challenging. Nonetheless, we present a framework that accurately identifies sick individuals from the content of online communication. Evaluation on a sample of 2.5 million geo-tagged Twitter messages shows that social ties to infected, symptomatic people, as well as the intensity of recent co-location, sharply increase one's likelihood of contracting the illness in the near future. To our knowledge, this work is the first to model the interplay of social activity, human mobility, and the spread of infectious disease in a large real-world population. Furthermore, we provide the first quantifiable estimates of the characteristics of disease transmission on a large scale without active user participation, a step towards our ability to model and predict the emergence of global epidemics from day-to-day interpersonal interactions. Copyright © 2012, Association for the Advancement of Artificial Intelligence. All rights reserved.
Adam Sadilek
Department of Computer Science
University of Rochester
Rochester, NY 14627
Henry Kautz
Department of Computer Science
University of Rochester
Rochester, NY 14627
Vincent Silenzio
School of Medicine and Dentistry
University of Rochester
Rochester, NY 14627
Given that five of your friends have flu-like symptoms, and
that you have recently met eight people, possibly strangers,
who complained about having runny noses and headaches,
what is the probability that you will soon become ill as well?
This work explores how accurately such questions can be
answered across a large sample of people participating in
online social media (see Fig. 1).
Imagine Joe is about to take off on an airplane and quickly
posts a Twitter update from his phone. He writes that he
has a fever and feels awful. Since Joe has a public Twitter
profile, we know who some of his friends are, and from his
GPS-tagged messages we see some of the places he has
recently visited. Additionally, we can infer a large fraction
of the hidden parts of Joe’s social network and his latent
locations by applying the results of previous work, as we
discuss below. In the same manner, we can identify other
people who are likely to be at Joe’s airport, or even on the
same flight. Using both the observed and inferred information,
we can now monitor individuals who likely came into contact
with Joe, such as the passengers seated next to him. Joe’s
disease may have been transmitted to them, and vice versa,
though they may not exhibit any symptoms yet. As people
travel to their respective destinations, they may be infecting
others encountered along the way. Eventually, some of the
people will tweet about how they feel, and we can observe at
least a fraction of the population that actually contracted
the disease.

Figure 1: Visualization of a sample of friends in New York
City. The red links between users represent friendships, and
the colored pins show their current location on a map. We
see the highlighted person complaining about her health, and
hinting about the specifics of her ailment. This work
investigates the impact of such a person on the health of her
friends, and of people around her.

The example just given illustrates our vision of what public
health modeling may look like in the near future. This
paper reports on our initial progress towards making this
vision a reality.
Traditionally, public health is monitored via surveys and
by aggregating statistics obtained from healthcare providers.
Such methods are costly, slow, and may be biased. For in-
stance, a person with flu is recorded only after he or she vis-
its a doctor’s office and the information is sent to the appro-
priate agency. Affected people who do not seek treatment,
or do not respond to surveys are virtually invisible to the
traditional methods.
Recently, digital media has been successfully used to
significantly reduce the latency and improve the overall
effectiveness of public health monitoring. Perhaps most notably,
Google Flu Trends models the prevalence of flu via analysis
of geo-located search queries (Ginsberg et al. 2008).
Twitter itself has recently been shown to accurately assess
the overall prevalence of flu independently in a number of
countries, with accuracy comparable to current state-of-the-art
methods, including Google Flu Trends and Centers for Disease
Control and Prevention (CDC) statistics (Lampos, De Bie,
and Cristianini 2010; Culotta 2010; Signorini, Segre, and
Polgreen 2011). However, even the state-of-the-art systems
suffer from two major drawbacks. First, they produce only
coarse, aggregate statistics, such as the expected number of
people afflicted by flu in Texas. Second, they often perform
mere passive monitoring, and prediction is severely
limited by the low resolution of the aggregate approach.
By contrast, our work takes a bottom-up approach, where
we take into account the fine-grained interactions between
individuals. We apply machine learning techniques to the
difficult task of detecting ill individuals based on the content
of their Twitter status updates. We are then able to estimate
the physical interactions between healthy and sick people
via their online activities, and model the impact of these in-
teractions on public health.
As a result, this work is one of the first steps towards
the development of automated methods that identify disease
vectors, trace the transmission between concrete individu-
als, and ultimately help us understand and predict the spread
of infectious diseases with fine granularity. Specifically, we
investigate the following research question: What roles do
co-location and social ties play in the spread of infectious
diseases from person to person? Our answers to this ques-
tion provide a solid stepping stone for further research.
The Data
Our experiments are based on data obtained from Twitter, a
popular micro-blogging service where people post message
updates at most 140 characters long. The forced brevity
encourages frequent mobile updates, as we show below.
Relationships between users on Twitter are not necessarily
symmetric. One can follow (subscribe to receive messages from)
a user without being followed back. When users do reciprocate
following, we say they are friends on Twitter. There is
anecdotal evidence that Twitter friendships have a substantial
overlap with offline friendships (Gruzd, Wellman, and
Takhteyev 2011). Twitter launched in 2006 and has
experienced explosive growth since then. As of March
2011, approximately 200 million accounts were registered on
Twitter.

Figure 2: A snapshot of a heatmap animation of Twitter
users’ movement within New York City that captures a typical
distribution of geo-tagged messaging on a weekday afternoon.
The hotter (more red) an area is, the more people
have recently tweeted from that location.

Using the Twitter Search API, we collected a sample of
public tweets that originated from the New York City (NYC)
metropolitan area shown in Fig. 2. The collection period
was one month long and started on May 18, 2010. Using
a Python script, we periodically queried Twitter for all re-
cent tweets within 100 kilometers of the NYC city center. In
order to avoid exceeding Twitter’s query rate limits and sub-
sequently missing some tweets, we distributed the work over
a number of machines with different IP addresses that asyn-
chronously queried the server and merged their results. Twit-
ter does not provide any guarantees as to what sample of ex-
isting tweets can be retrieved through their API, but a com-
parison to official Twitter statistics shows that our method
recorded the majority of the publicly available tweets in the
region. Altogether, we have logged nearly 16 million tweets
authored by more than 630 thousand unique users (see
Table 1). To put these statistics in context, the entire NYC
metropolitan area has an estimated population of 19 million.
We concentrate on accounts that posted more than
100 GPS-tagged tweets during the one-month data collec-
tion period. We refer to them as geo-active users. The social
network of the 6,237 geo-active users is shown in Fig. 3.
Methodology and Models
In this section, we first present our method for automatic
detection of Twitter messages that suggest the author
contracted an infectious disease. We then discuss how we
leverage this framework in order to develop our model of public
health.

Figure 3: Visualization of the social network consisting
of the geo-active users. Edges between nodes represent
friendships on Twitter. The image has been created using
the LaNet-vi package implementing k-core decomposition
(Beiró, Alvarez-Hamelin, and Busch 2008). The coreness of
nodes is color-coded using the scale on the right. The degree
of a node is represented by its size, shown on the left. We see
that there are relatively few important “hubs” in the central
area, and a large number of less connected individuals on the
periphery.

Detecting Illness-Related Messages
As a first step, we need to identify Twitter messages that
indicate the author is infected with a disease of interest at
the time of posting. Based on the results of previous work,
we expect that health-related tweets are relatively scarce
as compared to other types of messages (Culotta 2010;
Paul and Dredze 2011a). Given this class imbalance prob-
lem, we formulate a semi-supervised cascade-based ap-
proach (shown in Fig. 4) to learning a robust support vector
machine (SVM) classifier with a large area under the ROC
curve (i.e., consistently high precision and high recall). SVM
is an established model of data in machine learning (Cortes
and Vapnik 1995). We learn an SVM for linear binary classi-
fication to accurately distinguish between tweets indicating
the author is afflicted by an infectious ailment (we call such
tweets “sick”), and all other tweets (called “other” or
“normal”).
In order to learn such a classifier, we need a high-quality
set of labeled training data that can be obtained with little
manual effort. We achieve this via the following
“bootstrapping” process, which begins by training two
different binary SVM classifiers.
(In this study, the diseases of interest include those with
symptoms that overlap with, but are not necessarily limited
to, influenza-like illness.)
New York City Dataset
Unique users                                           632,611
Unique geo-active users                                  6,237
Tweets total                                        15,944,084
GPS-tagged tweets                                    4,405,961
GPS-tagged tweets by geo-active users                2,535,706
GPS-tagged tweets by geo-active users
  that show a symptom of an illness                      2,047
Distinct visited locations                              57,109
“Follows” relationships between geo-active users       102,739
“Friends” relationships between geo-active users        31,874

Table 1: Summary statistics of the data collected from NYC.
Geo-active users are ones who geo-tag their tweets relatively
frequently (more than 100 times per month). Note that
following reciprocity is about 31%, which is consistent with
previous findings (Kwak et al. 2010). The number of distinct
visited locations is calculated as the number of cells
(100 by 100 meters) of the NYC grid that have been visited
by at least one geo-active individual.
Figure 4: A diagram of our cascade learning of SVMs. The
diagram connects a corpus of 5,128 labeled tweets, a corpus
of 1.6 million machine-labeled tweets, a random sample of
200 million tweets, and the resulting training corpus of
“sick” and “other” tweets. The threshold symbols in the
diagram denote thresholding of the classification score,
where we select the bottom 10% of the scores predicted by
one classifier (i.e., tweets that are normal with high
probability), and the top 10% of scores predicted by the
other (i.e., likely “sick” tweets).
The two classifiers are trained with an SVM package and
differ only in their misclassification penalties. The first is
highly penalized for inducing false positives (mistakenly
labeling a normal tweet as one about sickness), whereas the
second is heavily penalized for creating false negatives
(labeling symptomatic tweets as normal). For both classifiers,
the misclassification penalty for one direction was always a
hundred times larger than in the opposite direction. We train
both classifiers on a dataset of 5,128 tweets, each labeled as
either “sick” or “other” by multiple Amazon Mechanical Turk
workers and carefully checked by the authors. After training,
we used the two classifiers to label a set of 1.6 million
tweets that are likely health-related, but contain some noise.
We obtained both datasets from Paul and Dredze (2011a), and
they are completely disjoint from our NYC data.
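
To make the effect of the hundredfold penalty asymmetry concrete, the following minimal sketch replaces SVM training with a much simpler one-dimensional stand-in: it picks the score threshold that minimizes a cost-weighted count of errors. This is not the training procedure used in the study; the function name `best_threshold`, the toy scores, and the toy labels are illustrative assumptions.

```python
def best_threshold(scores, labels, fp_cost, fn_cost):
    """Return the threshold t that minimizes the weighted error cost.

    A tweet is predicted "sick" when its score is >= t.
    labels: 1 for "sick", 0 for "other".
    NOTE: a 1-D stand-in for cost-sensitive learning, not an SVM.
    """
    # Candidate thresholds: every observed score, plus one above the maximum
    # (which corresponds to predicting "other" for everything).
    candidates = sorted(set(scores)) + [max(scores) + 1.0]
    best_t, best_cost = None, float("inf")
    for t in candidates:
        false_pos = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        false_neg = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        cost = fp_cost * false_pos + fn_cost * false_neg
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t


# Hypothetical toy data: higher score means "more likely sick".
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
toy_labels = [0, 0, 1, 0, 1, 0, 1, 1]

# Penalizing false positives 100x pushes the threshold up (fewer "sick" calls).
cautious = best_threshold(toy_scores, toy_labels, fp_cost=100, fn_cost=1)
# Penalizing false negatives 100x pushes the threshold down (more "sick" calls).
eager = best_threshold(toy_scores, toy_labels, fp_cost=1, fn_cost=100)
```

On this toy data the false-positive-averse setting selects a higher threshold than the false-negative-averse one, which mirrors why the two classifiers are useful for harvesting confident labels at opposite ends of the score range.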
The intuition behind this cascading process is to extract,
from the corpus of 1.6 million tweets, tweets that are with
high confidence about sickness using one classifier, and
tweets that are almost certainly about other topics using the
other. We further supplement the final corpus with messages
from a sample of 200 million tweets (also disjoint from all
other corpora considered here) that were classified as “other”
with high probability. We apply thresholding on the
classification score to reduce the noise in the cascade, as
shown in Fig. 4.
The cascade yields a final corpus with over 700 thousand
“sick” messages and 3 million “other” tweets, which we use
for training the final SVM. We will discuss how we leverage
the final SVM to model the disease spread below, but first let
us describe the feature space and our learning methodology in
more detail.
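
The thresholding step of the cascade can be sketched as follows, assuming each tweet already carries a real-valued classification score. The helper name `select_extremes` and the toy inputs are hypothetical; only the 10% cut-offs come from the description of Fig. 4.

```python
def select_extremes(items, scores, percent=10):
    """Return the lowest-scoring and highest-scoring `percent` of items.

    items and scores are parallel lists; a higher score means the
    classifier considers the item more likely to be "sick".
    """
    ranked = [item for _, item in sorted(zip(scores, items), key=lambda p: p[0])]
    # Integer arithmetic avoids floating-point surprises in the cut-off size.
    k = max(1, len(ranked) * percent // 100)
    return ranked[:k], ranked[-k:]


# Hypothetical usage: 20 tweets with made-up scores.
tweets = [f"tweet {i}" for i in range(20)]
scores = [i / 20 for i in range(20)]
likely_other, likely_sick = select_extremes(tweets, scores)
```

In the cascade, the low end and the high end are taken from the scores of two different classifiers, so in practice the function would be called once per classifier and only the relevant half of each result kept.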
As features, we use all unigram, bigram, and trigram word
tokens that appear in the training data. For example, a tweet
“I feel sick.” is represented by the following feature vector:
(i, feel, sick, i feel, feel sick, i feel sick)
Before tokenization, we convert all text to lower case, strip
punctuation and special characters, and remove mentions of
user names (the “@” tag). All re-tweets (analogous to email
forwarding) have been removed as well, since those mes-
sages typically refer to popular news and social games, and
rarely describe the current state of the author. However, we
do keep hashtags (such as “#sick”), as those are often rele-
vant to the author’s health state, and are particularly useful
for disambiguation of short or ill-formed messages. When
learning the final SVM, we only consider tokens that appear
at least three times in the training set.
While our feature space has a very high dimensionality
(the final SVM operates in more than 1.7 million dimensions), with
many possibly irrelevant features, support vector machines
with a linear kernel have been shown to perform very well
under such circumstances (Joachims 2006; Sculley et al.
2011; Paul and Dredze 2011a).
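
The preprocessing and n-gram extraction described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used in the study: the function names are hypothetical, retweets are assumed to be recognizable by a leading "RT" marker, and the character filter keeps only ASCII letters, digits, "#", and the straight apostrophe.

```python
import re
from collections import Counter


def preprocess(tweet):
    """Lower-case, drop @mentions, strip punctuation, and split into tokens."""
    text = tweet.lower()
    text = re.sub(r"@\w+", " ", text)            # remove user mentions
    text = re.sub(r"[^a-z0-9#'\s]", " ", text)   # keep hashtags and apostrophes
    return text.split()


def ngrams(tokens, max_n=3):
    """All unigrams, then bigrams, then trigrams, in order of appearance."""
    feats = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(" ".join(tokens[i:i + n]))
    return feats


def is_retweet(tweet):
    # Assumption: a retweet starts with the conventional "RT" marker.
    return tweet.lstrip().lower().startswith("rt ")


def build_vocabulary(tweets, min_count=3):
    """Features that occur at least `min_count` times in non-retweets."""
    counts = Counter()
    for tweet in tweets:
        if is_retweet(tweet):
            continue
        counts.update(ngrams(preprocess(tweet)))
    return {feat for feat, count in counts.items() if count >= min_count}


features = ngrams(preprocess("I feel sick."))
```

Here `features` reproduces the six-element feature vector from the example in the text. Counting occurrences over the whole training set before building the vocabulary is what implements the "at least three times" cut-off.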
To overcome the class imbalance problem, where the
number of tweets about an illness is much smaller than the
number of other messages, we apply the ROCArea SVM
learning method that directly optimizes the area under the
ROC curve, as described in Joachims (2005). Traditional
objective functions, such as the 0-1 loss, perform poorly under
severe class imbalance. For instance, a trivial model that
labels every example as belonging to the majority class has
an excellent accuracy, because it misses only the relatively
few minority examples. By contrast, the ROCArea method
works by implicitly transforming the classical SVM learning
problem over individual training examples into one over
pairs of examples. This allows efficient calculation of the
area under the ROC curve from the predicted ranking of the
examples.
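
The contrast between accuracy and the area under the ROC curve can be illustrated with a short sketch. It computes the area directly from its pairwise definition (the fraction of positive-negative pairs that are ranked correctly, with ties counted as one half), which is a quadratic-time illustration and not the optimization procedure of Joachims (2005). The function names and the synthetic 1:1,000 data are assumptions made for this example.

```python
def auc_from_scores(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for the minority ("sick") class, 0 for the majority class.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # correctly ordered pair
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(pos) * len(neg))


def accuracy(predictions, labels):
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


# Synthetic data: one "sick" example for every 1,000 "other" examples.
labels = [1] + [0] * 1000

# A trivial model that always predicts the majority class.
trivial_predictions = [0] * len(labels)
trivial_scores = [0.0] * len(labels)

trivial_accuracy = accuracy(trivial_predictions, labels)
trivial_auc = auc_from_scores(trivial_scores, labels)
```

The trivial model is right on all but one example, yet its constant scores rank no positive above any negative, so its area under the curve stays at chance level. This is the reason a pairwise objective is better suited to such skewed data than the 0-1 loss.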
Modeling the Spread of Disease
Human contact is the single most important factor in the
transmission of infectious diseases (Clayton, Hills, and
Pickles 1993). Since the contact is often indirect, such as
via a doorknob, we focus on a more general notion of
co-location. We consider two individuals co-located if they visit
the same 100 by 100 meter cell within a time window (slack)
of length T. For clarity, we show results for T ∈ {1, 4, 12}
hours. We use the 100m threshold, as that is the typical
lower bound on the accuracy of a GPS sensor in obstructed
areas, such as Manhattan. Since we focus on geo-active in-
dividuals, we can calculate co-location with high accuracy.
The results below assume that a person is ill for up to two
days after writing a “sick” tweet. It is important to note
that the relationships among friendship, co-location, and
health are consistent over a wide range of durations of
contagiousness (from 1 to 7 days). Most infectious
illnesses produce influenza-like symptoms that stop within a
few days, and thus within these temporal bounds.
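
The co-location definition above can be sketched as follows. The sketch assumes that positions have already been projected to planar coordinates in meters and that "within a time window of length T" means the two visits are at most T hours apart; both are assumptions of this example, as are the function names and the toy visits.

```python
from collections import defaultdict

CELL_SIZE_M = 100  # side of a grid cell in meters, as in the text


def cell_of(x_m, y_m):
    """Map planar coordinates (meters) to a 100 by 100 meter grid cell."""
    return (int(x_m // CELL_SIZE_M), int(y_m // CELL_SIZE_M))


def colocated_pairs(visits, slack_hours):
    """Return the set of user pairs that shared a cell within the slack.

    visits: iterable of (user, x_m, y_m, time_hours) tuples.
    """
    by_cell = defaultdict(list)
    for user, x_m, y_m, t in visits:
        by_cell[cell_of(x_m, y_m)].append((t, user))

    pairs = set()
    for events in by_cell.values():
        events.sort()  # chronological order within the cell
        for i, (t_i, u_i) in enumerate(events):
            for t_j, u_j in events[i + 1:]:
                if t_j - t_i > slack_hours:
                    break  # later events are even further apart in time
                if u_i != u_j:
                    pairs.add(frozenset((u_i, u_j)))
    return pairs


# Hypothetical visits: users a, b, d share a cell; c is in a neighboring cell.
toy_visits = [
    ("a", 10, 10, 0.0),
    ("b", 90, 50, 0.5),
    ("c", 150, 10, 0.2),
    ("d", 20, 20, 3.0),
]
pairs_1h = colocated_pairs(toy_visits, slack_hours=1)
pairs_4h = colocated_pairs(toy_visits, slack_hours=4)
```

Widening the slack can only add pairs, which is consistent with treating a more lenient window as a weaker signal of an actual physical encounter.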
To quantify the effect of social ties on disease transmis-
sion, we leverage users’ Twitter friendships. Clearly, there
are complex events and interactions that take place “behind
the scenes”, which are not directly recorded in online social
media. However, this work posits that these latent events of-
ten exhibit themselves in the activity of the sample of people
we can observe. For instance, as we will see, having social
ties to infected people significantly increases your chances
of becoming ill in the near future. However, we do not be-
lieve that the social ties themselves cause or even facilitate
the spread of an infection. Instead, the Twitter friendships
are proxies and indicators for a complex set of phenomena
that may not be directly accessible. For example, friends of-
ten eat out together, meet in classes, share items, and travel
together. While most of these events are never explicitly
mentioned online, they are crucial from the disease trans-
mission perspective. However, their likelihood is modulated
by the structure of the social ties, allowing us to reason about
Limitations. Our observations are limited by the prevalence
of public tweets in which users talk about their health,
and by our ability to identify them in the flood of other types
of messages. Both these factors contribute to the fact that
the number of infected individuals is systematically
underestimated, but evaluation of our final classifier
suggests that the latter effect
is small. We can approximate the magnitude of this bias us-
ing the statistics presented earlier. We see that about 1 in 30
residents of NYC appears in our dataset. If we strictly focus
on the geo-active individuals, the ratio is roughly 1:3,000.
However, the results in this paper indicate that, by leveraging
the latent effects of our observations, such a sampling
ratio may be sufficient.
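
The sampling ratios quoted above follow directly from the population estimate and the counts in Table 1, as the short calculation below shows. The variable names are illustrative, and reading reciprocity as the ratio of the "Friends" row to the "Follows" row is an assumption of this sketch.

```python
# Figures taken from the text and from Table 1.
POPULATION = 19_000_000      # estimated NYC metropolitan population
UNIQUE_USERS = 632_611
GEO_ACTIVE_USERS = 6_237
FOLLOWS = 102_739            # "Follows" relationships between geo-active users
FRIENDS = 31_874             # "Friends" relationships between geo-active users

residents_per_user = POPULATION / UNIQUE_USERS
residents_per_geo_active = POPULATION / GEO_ACTIVE_USERS
reciprocity = FRIENDS / FOLLOWS

print(f"about 1 in {residents_per_user:.0f} residents appears in the dataset")
print(f"about 1 in {residents_per_geo_active:.0f} residents is geo-active")
print(f"following reciprocity is about {reciprocity:.0%}")
```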
We note that currently used methods suffer from similar
biasing effects. For example, infected people who do not
visit a doctor, or do not respond to surveys are virtually in-
visible to the traditional methods. Similarly, efforts such as
Google Flu Trends can only observe individuals who search
the web for certain types of content when sick. A fully com-
prehensive coverage of a population will require a combi-
nation of diverse methods, and application of AI techniques
capable of inferring the missing information.
Experiments and Results
In this section, we evaluate our approach, compare the re-
sults of our model with an established baseline, and discuss
insights gained.
Positive Features          Negative Features
Feature       Weight       Feature        Weight
sick          0.9579       sick of        -0.4005
headache      0.5249       you            -0.3662
flu           0.5051       of             -0.3559
fever         0.3879       your           -0.3131
feel          0.3451       lol            -0.3017
cough         0.3062       who            -0.1816
feeling       0.3055       u              -0.1778
coughing      0.2917       love           -0.1753
throat        0.2842       it             -0.1627
cold          0.2825       her            -0.1618
home          0.2107       they           -0.1617
still         0.2101       people         -0.1548
bed           0.2088       shit           -0.1486
better        0.1988       smoking        -0.0980
being         0.1943       i’m sick of    -0.0894
being sick    0.1919       so sick of     -0.0887
stomach       0.1703       pressure       -0.0837
and my        0.1687       massage        -0.0726
infection     0.1686       i love         -0.0719
morning       0.1647       pregnant       -0.0639

Table 2: Top twenty most significant negatively and
positively weighted features of our SVM model.
Evaluation of the final SVM described in the previous
section on a held-out test set of 700,000 tweets shows
0.98 precision and 0.97 recall. This evaluation run also
allows us to choose an optimal threshold on the classification
score that separates the normal tweets from sick tweets.
Table 2 lists the most significant features the classifier
found. Table 3 shows examples of tweets that it identified as
“sick”. We now apply the classifier to modeling the spread of
infectious diseases throughout the sampled population of NYC
described above.
The correlation between the prevalence of infectious dis-
eases predicted by our model and the predictions made
by Google Flu Trends specifically for New York City is
0.73. The official CDC data for NYC is not available with
sufficiently fine granularity, but previous work has shown
that Google’s predictions closely correspond to the offi-
cial statistics for larger geographical areas (Ginsberg et al.
2008). Google Flu Trends may have greater specificity to
“influenza-like illness”, whereas our approach may be less
specific, but more sensitive to detect other, related infectious
processes exhibiting these nonspecific features in Twitter
content. Furthermore, the only overlap between our predic-
tions and those of Google is for May 18 through 23, 2010.
Thus, the correlation between the two needs to be interpreted
with this context in mind.
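
For reference, a correlation of this kind can be computed as a Pearson coefficient between two aligned daily series. The sketch below is a generic implementation; the function name is hypothetical, and no values from the study are reproduced here. Because the two sources overlap for only six days, any coefficient computed from such a short series is sensitive to individual data points.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

A call such as `pearson(model_series, reference_series)` would take one prevalence estimate per day from each source, where both argument names are placeholders for data that is not included in this paper's text.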
Figures 5a and 5b show the impact of co-location and
friendship with infected people on a given day on one’s
health the following day. We analyze both the individual and
joint effects of the two factors on disease transmission. For
brevity, we include plots only for a 1-day lag, since other
time offsets result in a similar relationship.
Looking at the co-location effect alone, we observe a clear
exponential relationship between probable physical encounters
and ensuing sickness. All three curves in Fig. 5a
consistently fit $f(x) = C e^{0.055x}$, where C is a constant
that captures the length of time overlap T (note that
$C \approx 0.011/T$; thus the larger the slack the smaller
the effect). For instance,
having 40 encounters with sick individuals with a 1-hour
slack makes one ill with 20% probability. With a more le-
nient slack, such as 4 hours, one needs over 80 encounters
to reach the same level of risk.
In Fig. 5b, we see that the number of sick friends also
has an exponential effect on the probability of getting sick:
$f(x) = 0.003 e^{0.413x}$. By contrast, the number of friends
in any health state (i.e., the size of one’s friend list) has no
impact on one’s health. In fact, the conditional probability
of getting sick given n friends (the blue line in Fig. 5b) is
virtually identical to the prior probability of getting sick (the
black line).
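
The fitted curves reported in Fig. 5 can be evaluated and inverted with a few lines of code. The coefficients below are the ones printed in the figure; the assignment of each co-location fit to a time window follows the order of the figure legend, and the function names are assumptions of this sketch. The fits approximate the empirical points, so values read off the curves need not coincide exactly with the figures quoted in the text.

```python
import math

# (a, b) pairs for f(x) = a * exp(b * x), keyed by slack T in hours (Fig. 5a).
COLOCATION_FITS = {1: (0.013, 0.055), 4: (0.002, 0.054), 12: (0.001, 0.055)}
# Fit for the number of sick friends (Fig. 5b).
FRIEND_FIT = (0.003, 0.413)


def fitted_probability(a, b, x):
    """Evaluate the exponential fit, capped at 1 because it is a probability."""
    return min(1.0, a * math.exp(b * x))


def count_needed(a, b, target_probability):
    """Invert the fit: the x at which a * exp(b * x) reaches the target."""
    return math.log(target_probability / a) / b


for slack, (a, b) in COLOCATION_FITS.items():
    needed = count_needed(a, b, 0.20)
    print(f"T={slack}h: fitted curve reaches 20% at about {needed:.0f} encounters")

a, b = FRIEND_FIT
print(f"10 sick friends: fitted probability {fitted_probability(a, b, 10):.2f}")
```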
We have discussed the latent influence of friendships ear-
lier. This is quantitatively shown as the green line in Fig. 5b,
where we subtract out the effect of co-location from the
influence of social ties. We do this by counting only sick
friends who have not been encountered. Comparison with
the red curve shows that for smaller numbers of friends
(n ≤ 6), co-location has a weak additional effect over the
proxy effect of social ties. However, for larger n, the
residual impact of friendships plateaus, and co-location begins to
dominate.
Related Work
Since the famous cholera study by John Snow, much work
has been done in capturing the mechanisms of epidemics
(Snow 1855). There is ample previous work in computa-
tional epidemiology on building models of coarse-grained
disease spread via differential equations (Anderson and May
1979), by harnessing simulated populations (Eubank et al.
2004), and by analysis of official statistics (Grenfell, Bjorn-
stad, and Kappey 2001). Such models are typically devel-
oped for the purposes of assessing the impact a particular
combination of an outbreak and a vaccination or contain-
ment strategy would have on humanity, a country’s defense,
or ecology (Chen, David, and Kempe 2010). However, the
above works focus on simulated populations and hypothet-
ical scenarios. By contrast, we address the problem of as-
sessing and modeling the health of real-world populations
composed of individuals embedded in a fine social structure.
As a result, our work is a major step towards prediction of
actual threats and instances of disease outbreaks.
In the context of social media, Krieck et al. (2011) ex-
plore augmenting the traditional notification channels about
a disease outbreak with data extracted from Twitter. By man-
ually examining a large number of tweets, they show that
self-reported symptoms are the most reliable signal in de-
tecting if a tweet is relevant to an outbreak or not. This is
because people often do not know what their true problem is
until diagnosed by an expert, but they can readily write about
how they feel. Researchers have also concentrated on
capturing the overall trend of a particular disease outbreak,
typically influenza, by monitoring social media (Culotta 2010;
Lampos, De Bie, and Cristianini 2010; Chunara, Andrews,
and Brownstein 2012). Interesting work of Ritterman,
Osborne, and Klein (2009) shows that noisy Twitter data is a
valuable information channel for predicting public opinion
regarding the likelihood of a pandemic. Freifeld et al. (2010)
use information actively submitted by cell phone users to
model aggregate public health. However, scaling such
systems poses considerable challenges.

Came home sick today from work with a killer headache and severe nausea, took 2 advil and slept for 6 hours. I feel much better now.
Meh I actually have to go to school tomorrrow.. #sick
Not feeling good at all...that sucks because I plans with my bff and job interviews set up until Tuesday. Stomach is killing me
I’m feeling better today still stuffed up but my nose isn’t running like it was yesterday and my cough is better as well it hurts.
Guys I’m sorry. I’m really have to get some rest. I have nausea, headache, is tired, freezing & now have I got fever. Good Night! :-*
It hurts to breathe, swallow, cough or yawn. I must be getting sick, though because my ear feels worse than my throat.
I just sneezed 6 times in a row. i hate being sick.
feeling misserable. stomach hurts, headache, and no, I’m not pregnant.
Been sleep all day smh.... Currently soothing my jimmy frm α headache as I go back to sleep
Just not feeling it today. Looks like man flu has come back for a visit. I need to be well and have work - is that too much to ask?
Table 3: Example tweets that our SVM model identified as “sick”. Note the high degree of variability, and sometimes
subtlety, in the way different people describe their health.

Figure 5: Being co-located with ill, symptomatic individuals, and having sick friends on a given day (t) makes one more likely
to get sick the next day (t + 1). On the horizontal axis in (a), we plot the amount of co-location of an asymptomatic user with
known sick people on a given day. In (b), we show the number of friends (of an asymptomatic user); either only sick ones or
any, depending on the curve. The vertical axes show the conditional probability of getting sick the next day. We also plot the
prior probability of being sick. For co-location, results for three slack time windows, within which we consider an appearance
of two users close together as co-location, are shown (1, 4, and 12 hours).
(a) Effects of co-location alone. Horizontal axis: number of estimated encounters with sick individuals at time t. Vertical axis: conditional probability of getting sick at t + 1. Fitted curves: $f(x) = 0.013 e^{0.055x}$ (T = 1h), $f(x) = 0.002 e^{0.054x}$ (T = 4h), $f(x) = 0.001 e^{0.055x}$ (T = 12h). The prior probability of being sick is also shown.
(b) Effects of friendship and co-location. Horizontal axis: number of friends (n). Vertical axis: conditional probability of getting sick. Curves: probability of getting sick at t + 1 given n friends are sick at t, with fit $f(x) = 0.003 e^{0.413x}$; probability of getting sick given having n friends (any); probability of getting sick at t + 1 given n unencountered friends are sick at t; prior probability of being sick.

Other researchers focus on a more detailed modeling of
the language of the tweets and its relevance to public health
in general (Paul and Dredze 2011a), and to influenza surveil-
lance in particular (Collier, Son, and Nguyen 2011). Paul
et al. develop a variant of topic models that captures the
symptoms and possible treatments for ailments, such as
traumatic injuries and allergies, that people discuss on Twitter.
In a follow-up work Paul and Dredze (2011b) begin to con-
sider the geographical patterns in the prevalence of such ail-
ments, and show a good agreement of their models with of-
ficial statistics and Google Flu Trends. There is a potential
for synergy between the work of Paul et al. and ours that
would allow us to model the spread of specific diseases by
leveraging the rich language models.
However, all these works consider only aggregate patterns
captured by coarse-grained statistics, whereas the primary
contribution of our paper is a more detailed study of the in-
terplay among human mobility, social structure, and disease
transmission. Our framework allows us to track—without
active user participation—specific likely events of contagion
between individuals, and model the relationship between an
epidemic and self-reported symptoms of actual users of on-
line social media.
While this paper concentrates on “traditional” infectious
diseases, such as flu, similar techniques can be applied to
study mental health disorders, such as depression, that have
strong contagion patterns as well. Pioneering work in this
broad area includes Silenzio et al. (2009), which studies
characteristics of young lesbian, gay, and bisexual individ-
uals in online social networks. They focus on discovering
such members of a community, and design methods for ef-
fective peer-driven information diffusion and preventative
care, focusing specifically on suicide. Twitter has also been
used to monitor the seasonal variation in affect around the
globe (Golder and Macy 2011).
Looking at a more global scale, Bettencourt and West
(2010) argue for a comprehensive scientific approach to ur-
ban planning. They show there are underlying patterns that
tie together the size of a city with its emergent characteris-
tics, such as crime rate, number of patents produced, walk-
ing speed of its inhabitants, and prevalence of epidemics.
The authors argue that cities are the source of many ma-
jor problems, but also contain the solutions because of their
concentrated creativity and productivity.
Since this work leverages social ties and user location, the
large body of prior work on inferring and predicting these
characteristics becomes relevant. A number of researchers
have demonstrated that it is possible to accurately predict
people’s fine-grained location from their online behavior
and interactions (Cho, Myers, and Leskovec 2011; Sadilek,
Kautz, and Bigham 2012). Much progress has been made
in predicting the social structure of participants in online
media, including Twitter, from various types of observed
data (Crandall et al. 2010; Backstrom and Leskovec 2011;
Sadilek, Kautz, and Bigham 2012). Applying these machine
learning techniques will significantly expand the breadth of
data available by allowing us to consider not only declared
friendships and public check-ins, but also their inferred—
though more ambiguous—counterparts.
Conclusions and Future Work
This work is the first to take on modeling the spread of infec-
tious diseases throughout a real-world population with fine
granularity. We focus on self-reported symptoms that appear
in people’s Twitter status updates, and show that although
such messages are rare, we can identify them with high pre-
cision as well as high recall. We achieve this by developing
an SVM model that is robust even in the presence of strong
class imbalance. This is a necessary precondition for further
progress, as false negatives and false positives cannot
be traded off against each other in this domain: both
carry equal importance.
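
To illustrate why precision and recall, rather than accuracy, are the relevant measures under this kind of class imbalance, consider the following minimal sketch. All counts are hypothetical (chosen to mirror roughly one health-related message per 1,000) and are not results from this paper; the function names are illustrative assumptions.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def accuracy(tp, fp, fn, tn):
    """Fraction of all messages that were labeled correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical corpus: 1,000 "sick" messages among 1,000,000 in total.
total, positives = 1_000_000, 1_000

# Trivial classifier that labels every message as "not sick".
trivial = dict(tp=0, fp=0, fn=positives, tn=total - positives)

# Hypothetical classifier: finds 900 sick messages, raises 100 false alarms.
useful = dict(tp=900, fp=100, fn=100, tn=total - positives - 100)

for name, c in (("trivial", trivial), ("useful", useful)):
    p, r, f1 = precision_recall_f1(c["tp"], c["fp"], c["fn"])
    print(f"{name}: accuracy={accuracy(**c):.4f} "
          f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Both hypothetical classifiers score above 99.9% accuracy, yet the trivial one never finds a single sick message. Only precision and recall separate the two.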
We have seen that avoiding encounters with infected peo-
ple decreases your chances of becoming ill, whereas a large
amount of contact with them makes an onset of a disease
almost certain (Fig. 5a). Similarly, by interpreting a virtual
friendship as a proxy for unobservable phenomena and in-
teractions, we have shown that the likelihood of becoming
ill exponentially increases as the number of infected friends
grows. For example, having more than 5 sick friends in-
creases one’s likelihood of getting sick by a factor of 3, as
compared to prior probability, and even more with respect to
the probability given no sick friends (Fig. 5b). Additionally,
we model the joint influence of co-location and social ties,
and quantify the latent impact of friendships.
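
The comparison against the prior probability can be made concrete with a small sketch. It estimates the probability of becoming ill conditioned on the number of sick friends and reports the ratio to the overall (prior) rate. The records below are hypothetical and are not data from this study; the function name and the pooling threshold `cap` are assumptions made for illustration.

```python
from collections import defaultdict


def illness_rates(records, cap=6):
    """Estimate P(sick) and P(sick | k sick friends) from observations.

    records: iterable of (num_sick_friends, became_sick) pairs.
    Friend counts of `cap` or more are pooled into a single bucket.
    """
    totals = defaultdict(int)
    sick = defaultdict(int)
    for num_sick_friends, became_sick in records:
        bucket = min(num_sick_friends, cap)
        totals[bucket] += 1
        sick[bucket] += int(became_sick)
    prior = sum(sick.values()) / sum(totals.values())
    conditional = {k: sick[k] / totals[k] for k in sorted(totals)}
    return prior, conditional


# Hypothetical observations, not data from the paper.
records = (
    [(0, True)] * 10 + [(0, False)] * 90
    + [(2, True)] * 4 + [(2, False)] * 16
    + [(7, True)] * 6 + [(7, False)] * 4
)

prior, conditional = illness_rates(records)
print(f"prior P(sick) = {prior:.3f}")
for k, rate in conditional.items():
    label = f"{k}+" if k == 6 else str(k)
    print(f"{label:>2} sick friends: P(sick) = {rate:.2f}, "
          f"ratio to prior = {rate / prior:.2f}")
```

The ratio to the prior is the quantity compared in the text above. The ratio to the rate for zero sick friends is larger still, because the prior already includes people with sick friends.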
Figure 6: Visualization of a sample of Twitter users (yel-
low pins) at the Newark Liberty International Airport. The
highlighted person X says he will be back in 16 days and
mentions specific friends for whom this message is rele-
vant. We immediately see the people at the airport who
could have come into contact with X, and additional candi-
dates can be inferred using methods developed by previous
work (Crandall et al. 2010; Backstrom and Leskovec 2011;
Sadilek, Kautz, and Bigham 2012). Additionally, recent re-
sults show that the future location and co-location of the in-
dividuals can be predicted at various temporal scales with
high accuracy (Cho, Myers, and Leskovec 2011; Sadilek,
Kautz, and Bigham 2012). Since some people explicitly
mention their symptoms, it can be expected that putting all
this information together will yield strong predictions about
the spread of an infection.
Early identification of infected individuals is especially
crucial in preventing and containing devastating disease
outbreaks. Important work of Eubank et al. (2004)
shows that by far the most effective way to fight an epidemic
in urban areas is to quickly confine infected individuals to
their homes. However, this strategy is truly effective only
when applied early on in the outbreak. The agility of tar-
geted vaccination ranks second in effectiveness. This paper
shows that finding these key symptomatic individuals, along
with other people that may have already contracted the dis-
ease, can be done effectively and in a timely manner through
social media. As our final contribution, we show that the
predictions made by our model strongly correlate with Google
Flu Trends, currently the state-of-the-art system for monitoring
the prevalence of influenza-like illnesses.
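
As a rough illustration of how agreement between two such time series can be measured, the sketch below computes a Pearson correlation coefficient. The choice of Pearson's r is an assumption made for this example, and both weekly series are hypothetical values, not output of our model or of Google Flu Trends.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical weekly values: sick users flagged by a model, and a
# reference influenza index for the same weeks.
model_estimates = [12, 15, 21, 30, 28, 19, 14]
reference_index = [1.1, 1.4, 2.0, 2.9, 2.6, 1.9, 1.2]

print(f"r = {pearson_r(model_estimates, reference_index):.3f}")
```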
In future work, we will focus on larger geographical areas
(including airplane travel), while maintaining the same
level of detail (i.e., social ties between concrete individuals
and their fine-grained location). This will allow us to model
and predict the emergence of global epidemics from the
day-to-day interactions of individuals, and subsequently answer
questions such as “How did the current flu epidemic in city
A start and where did it come from?” and “How likely am I
to catch a cold if I visit the mall?”
Prior work has developed a repertoire of powerful AI
techniques for revealing hidden social ties and predicting
user location—two features heavily leveraged by our pub-
lic health model. Therefore, there are opportunities for great
synergy in these areas, as we illustrate in Fig. 6.
References
Anderson, R., and May, R. 1979. Population biology of
infectious diseases: Part I. Nature 280(5721):361.
Backstrom, L., and Leskovec, J. 2011. Supervised ran-
dom walks: predicting and recommending links in social
networks. In WSDM 2011, 635–644. ACM.
Beiró, M.; Alvarez-Hamelin, J.; and Busch, J. 2008. A low
complexity visualization tool that helps to perform complex
systems analysis. New Journal of Physics 10:125003.
Bettencourt, L., and West, G. 2010. A unified theory of
urban living. Nature 467(7318):912–913.
Chen, P.; David, M.; and Kempe, D. 2010. Better vaccina-
tion strategies for better people. In Proceedings of the 11th
ACM conference on Electronic commerce, 179–188. ACM.
Cho, E.; Myers, S. A.; and Leskovec, J. 2011. Friendship
and mobility: User movement in location-based social net-
works. ACM SIGKDD International Conference on Knowl-
edge Discovery and Data Mining (KDD).
Chunara, R.; Andrews, J.; and Brownstein, J. 2012. Social
and news media enable estimation of epidemiological pat-
terns early in the 2010 Haitian cholera outbreak. The Ameri-
can Journal of Tropical Medicine and Hygiene 86(1):39–45.
Clayton, D.; Hills, M.; and Pickles, A. 1993. Statistical
models in epidemiology, volume 41. Oxford University Press.
Collier, N.; Son, N.; and Nguyen, N. 2011. OMG U got
flu? Analysis of shared health messages for bio-surveillance.
Journal of Biomedical Semantics.
Cortes, C., and Vapnik, V. 1995. Support-vector networks.
Machine learning 20(3):273–297.
Crandall, D.; Backstrom, L.; Cosley, D.; Suri, S.; Hutten-
locher, D.; and Kleinberg, J. 2010. Inferring social ties
from geographic coincidences. Proceedings of the National
Academy of Sciences 107(52):22436.
Culotta, A. 2010. Towards detecting influenza epidemics
by analyzing Twitter messages. In Proceedings of the First
Workshop on Social Media Analytics, 115–122. ACM.
Eubank, S.; Guclu, H.; Anil Kumar, V.; Marathe, M.; Srini-
vasan, A.; Toroczkai, Z.; and Wang, N. 2004. Modelling
disease outbreaks in realistic urban social networks. Nature
Freifeld, C.; Chunara, R.; Mekaru, S.; Chan, E.; Kass-Hout,
T.; Iacucci, A.; and Brownstein, J. 2010. Participatory
epidemiology: use of mobile phones for community-based
health reporting. PLoS medicine 7(12):e1000376.
Ginsberg, J.; Mohebbi, M.; Patel, R.; Brammer, L.; Smolin-
ski, M.; and Brilliant, L. 2008. Detecting influenza
epidemics using search engine query data. Nature
Golder, S., and Macy, M. 2011. Diurnal and seasonal mood
vary with work, sleep, and daylength across diverse cultures.
Science 333(6051):1878–1881.
Grenfell, B.; Bjornstad, O.; and Kappey, J. 2001. Travelling
waves and spatial hierarchies in measles epidemics. Nature
Gruzd, A.; Wellman, B.; and Takhteyev, Y. 2011. Imagining
Twitter as an imagined community. In American Behavioral
Scientist, Special issue on Imagined Communities.
Joachims, T. 2005. A support vector method for multivariate
performance measures. In ICML 2005, 377–384. ACM.
Joachims, T. 2006. Training linear SVMs in linear time. In
Proceedings of the 12th ACM SIGKDD international con-
ference on Knowledge discovery and data mining, 217–226.
Krieck, M.; Dreesman, J.; Otrusina, L.; and Denecke, K.
2011. A new age of public health: Identifying disease out-
breaks by analyzing tweets. Proceedings of Health Web-
Science Workshop, ACM Web Science Conference.
Kwak, H.; Lee, C.; Park, H.; and Moon, S. 2010. What is
Twitter, a Social Network or a News Media? In WWW.
Lampos, V.; De Bie, T.; and Cristianini, N. 2010. Flu
detector-tracking epidemics on Twitter. Machine Learning
and Knowledge Discovery in Databases 599–602.
Paul, M., and Dredze, M. 2011a. A model for mining public
health topics from Twitter. Technical Report. Johns Hopkins
University. 2011.
Paul, M., and Dredze, M. 2011b. You are what you tweet:
Analyzing Twitter for public health. In Fifth International
AAAI Conference on Weblogs and Social Media (ICWSM
Ritterman, J.; Osborne, M.; and Klein, E. 2009. Using pre-
diction markets and Twitter to predict a swine flu pandemic.
1st International Workshop on Mining Social Media.
Sadilek, A.; Kautz, H.; and Bigham, J. P. 2012. Finding your
friends and following them to where you are. In Fifth ACM
International Conference on Web Search and Data Mining.
(Best Paper Award).
Sculley, D.; Otey, M.; Pohl, M.; Spitznagel, B.; Hainsworth,
J.; and Yunkai, Z. 2011. Detecting adversarial advertise-
ments in the wild. In Proceedings of the 17th ACM SIGKDD
international conference on Knowledge discovery and data
mining. ACM.
Signorini, A.; Segre, A.; and Polgreen, P. 2011. The use of
Twitter to track levels of disease activity and public concern
in the US during the influenza A H1N1 pandemic. PLoS One
Silenzio, V.; Duberstein, P.; Tang, W.; Lu, N.; Tu, X.; and
Homan, C. 2009. Connecting the invisible dots: Reaching
lesbian, gay, and bisexual adolescents and young adults at
risk for suicide through online social networks. Social Sci-
ence & Medicine 69(3):469–474.
Snow, J. 1855. On the mode of communication of cholera.
John Churchill.