Detecting influenza epidemics using search engine query data
Jeremy Ginsberg1, Matthew H. Mohebbi1, Rajan S. Patel1, Lynnette Brammer2, Mark S. Smolinski1 & Larry Brilliant1
Seasonal influenza epidemics are a major public health concern,
causing tens of millions of respiratory illnesses and 250,000 to
enza, a new strain of influenza virus against which no previous
immunity exists and that demonstrates human-to-human trans-
mission could result in a pandemic with millions of fatalities2.
Early detection of disease activity, when followed by a rapid
response, can reduce the impact of both seasonal and pandemic
influenza3,4. One way to improve early detection is to monitor
health-seeking behaviour in the form of queries to online search
engines, which are submitted by millions of users around the
world each day. Here we present a method of analysing large
numbers of Google search queries to track influenza-like illness
patient presents with influenza-like symptoms, we can accurately
estimate the current level of weekly influenza activity in each
region of the United States, with a reporting lag of about one
day. This approach may make it possible to use search queries to
detect influenza epidemics in areas with a large population of web
Traditional surveillance systems, including those used by the US
Influenza Surveillance Scheme (EISS), rely on both virological and
clinical data, including influenza-like illness (ILI) physician visits.
systems on a weekly basis, typically with a 1–2-week reporting lag.
In an attempt to provide faster detection, innovative surveillance
systems have been created to monitor indirect signals of influenza
activity, such as call volume to telephone triage advice lines5 and
over-the-counter drug sales6. About 90 million American adults are
believed to search online for information about specific diseases or
medical problems each year7, making web search queries a uniquely
valuable source of information about health trends. Previous
attempts at using online activity for influenza surveillance have
counted search queries submitted to a Swedish medical website (A.
Hulth, G. Rydevik and A. Linde, manuscript in preparation), visitors
to certain pages on a US health website8, and user clicks on a search
keyword advertisement in Canada9. A set of Yahoo search queries
containing the words ‘flu’ or ‘influenza’ were found to correlate with
virological and mortality surveillance data over multiple years10.
Our proposed system builds on this earlier work by using an automated
method of discovering influenza-related search queries. By
processing hundreds of billions of individual searches from 5 years
models for use in influenza surveillance, with regional and state-level
of online search engines may eventually enable models to be
developed in international settings.
50 million of the most common search queries in the United States.
Separate aggregate weekly counts were kept for every query in each
time series was normalized by dividing the count for each query in a
in that location during the week, resulting in a query fraction
(Supplementary Fig. 1).
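The normalization step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function name `query_fractions`, the dictionary layout and the toy numbers are all hypothetical, and the denominator is assumed to be the total number of searches submitted from the same location in the same week.

```python
def query_fractions(counts, totals):
    """Normalize weekly query counts into query fractions.

    counts: {(query, location, week): count}  (hypothetical layout)
    totals: {(location, week): total searches}  (assumed denominator)
    """
    fractions = {}
    for (query, location, week), count in counts.items():
        total = totals[(location, week)]
        if total > 0:  # skip location-weeks with no recorded searches
            fractions[(query, location, week)] = count / total
    return fractions


# Toy data, invented purely for illustration.
counts = {
    ("flu symptoms", "mid-Atlantic", "2008-02-03"): 50,
    ("oscar nominations", "mid-Atlantic", "2008-02-03"): 200,
}
totals = {("mid-Atlantic", "2008-02-03"): 100_000}
fractions = query_fractions(counts, totals)
```

Dividing by the weekly total makes series comparable across locations and weeks with very different overall search volume.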
We sought to develop a simple model that estimates the probability
that a random physician visit in a particular region is related to an
ILI; this is equivalent to the percentage of ILI-related physician visits.
search query submitted from the same region is ILI-related, as deter-
mined by an automated method described below. We fit a linear
model using the log-odds of an ILI physician visit and the log-odds
of an ILI-related search query: logit(I(t)) = α logit(Q(t)) + ε, where
I(t) is the percentage of ILI physician visits, Q(t) is the ILI-related
query fraction at time t, α is the multiplicative coefficient, and ε is the
error term. logit(p) is simply ln(p/(1 − p)).
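As a rough illustration of the log-odds model, the sketch below fits the single coefficient by ordinary least squares through the origin, following the equation as printed (no intercept term). The helper names `fit_alpha` and `estimate_ili` and the toy data are assumptions for illustration, and both I(t) and Q(t) are assumed to be expressed as proportions strictly between 0 and 1 before the logit is taken.

```python
import math


def logit(p):
    """Log-odds of a proportion p in (0, 1)."""
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Inverse of logit: maps log-odds back to a proportion."""
    return 1.0 / (1.0 + math.exp(-x))


def fit_alpha(ili, query):
    """Least-squares slope for logit(I) = alpha * logit(Q), with no intercept."""
    xs = [logit(q) for q in query]
    ys = [logit(i) for i in ili]
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def estimate_ili(alpha, q):
    """Estimated ILI proportion for an ILI-related query fraction q."""
    return inv_logit(alpha * logit(q))


# Toy data generated from a known coefficient, invented for illustration.
true_alpha = 0.5
query = [0.0001, 0.0002, 0.0004, 0.0008]
ili = [inv_logit(true_alpha * logit(q)) for q in query]
alpha = fit_alpha(ili, query)
```

Working in log-odds keeps every estimate inside (0, 1) once it is mapped back with `inv_logit`.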
Publicly available historical data from the CDC’s US Influenza
Sentinel Provider Surveillance Network (http://www.cdc.gov/flu/weekly)
were used to help build our models. For each of the nine
surveillance regions of the United States, the CDC reported the aver-
age percentage of all outpatient visits to sentinel providers that were
ILI-related on a weekly basis. No data were provided for weeks outside
of the annual influenza season, and we excluded such dates from
model fitting, although our model was used to generate unvalidated
ILI estimates for these weeks.
We designed an automated method of selecting ILI-related search
queries, requiring no previous knowledge about influenza. We mea-
sured how effectively our model would fit the CDC ILI data in each
Each of the 50 million candidate queries in our database was sepa-
rately tested in this manner, to identify the search queries which
could most accurately model the CDC ILI visit percentage in each
region. Our approach rewarded queries that showed regional varia-
tions similar to the regional variations in CDC ILI data: the chance
that a random search query can fit the ILI percentage in all nine
regions is considerably less than the chance that a random search
query can fit a single location (Supplementary Fig. 2).
est scoring search queries, sorted by mean Z-transformed correlation
the ILI-related query fraction, Q(t), we considered different sets of n
top-scoring queries. We measured the performance of these models
obtained the best fit against out-of-sample ILI data across the nine
regions (Fig. 1).
Vol 457|19 February 2009|doi:10.1038/nature07634
Macmillan Publishers Limited. All rights reserved
the best fit. These 45 search queries, although selected automatically,
top 100, not included in our model, included topics like ‘high school
basketball’, which tend to coincide with influenza season in the
United States (Table 1).
Using this ILI-related query fraction as the explanatory variable,
2007 for all nine regions together, thus obtaining a single, region-
CDC-reported ILI percentages, with a mean correlation of 0.90
(min = 0.80, max = 0.96, n = 9 regions; Fig. 2).
untested data from 2007 to 2008, which were excluded from all
previous steps. Estimates generated for these 42 points obtained a
mean correlation of 0.97 (min = 0.92, max = 0.99, n = 9 regions)
with the CDC-observed ILI percentages.
Throughout the 2007–08 influenza season we used preliminary
versions of our model to generate ILI estimates, and shared our
results each week with the Epidemiology and Prevention Branch of
Influenza Division at the CDC to evaluate timeliness and accuracy.
Figure 3 illustrates data available at different points throughout the
season. Across the nine regions, we were able to estimate consistently
the current ILI percentage 1–2 weeks ahead of the publication of
reports by the CDC’s US Influenza Sentinel Provider Surveillance Network.
Because localized influenza surveillance is particularly useful for
public health planning, we sought to validate further our model
against weekly ILI percentages for individual states. The CDC does
not make state-level data publicly available, but we validated our
model against state-reported ILI percentages provided by the state
(Supplementary Fig. 3).
Google web search queries can be used to estimate ILI percentages
accurately in each of the nine public health regions of the United
States. Because search queries can be processed quickly, the resulting
ILI estimates were consistently 1–2 weeks ahead of CDC ILI surveillance
reports. The early detection provided by this approach may
become an important line of defence against future influenza epi-
demics in the United States, and perhaps eventually in international
Up-to-date influenza estimates may enable public health officials
and health professionals to respond better to seasonal epidemics. If a
region experiences an early, sharp increase in ILI physician visits, it
may be possible to focus additional resources on that region to
identify the aetiology of the outbreak, providing extra vaccine capa-
city or raising local media awareness as necessary.
This system is not designed to be a replacement for traditional
surveillance networks or supplant the need for laboratory-based dia-
[Figure 1 plot: x-axis, number of queries (0–100)]
Figure 1 | An evaluation of how many top-scoring queries to include in the
ILI-related query fraction. Maximal performance at estimating out-of-
sample points during cross-validation was obtained by summing the top 45
81, which is ‘oscar nominations’.
Table 1 | Topics found in search queries which were found to be most
correlated with CDC ILI data
Search query topic    Top 45 queries    Next 55 queries
General influenza symptoms
Term for influenza
Specific influenza symptom
Symptoms of an influenza
General influenza remedies
Symptoms of a related disease
Unrelated to influenza
The top 45 queries were used in our final model; the next 55 queries are presented for
comparison purposes. The number of queries in each topic is indicated, as well as query-
volume-weighted counts, reflecting the relative frequency of queries in each topic.
Figure 2 | A comparison of model estimates for the mid-Atlantic region
(black) against CDC-reported ILI percentages (red), including points over
which the model was fit and validated. A correlation of 0.85 was obtained
over 128 points from this region to which the model was fit, whereas a
correlation of 0.96 was obtained over 42 validation points. Dotted lines
indicate 95% prediction intervals. The region comprises New York, New
Jersey and Pennsylvania.
[Figure 3 panels: data available as of 4 February 2008, 3 March 2008, 31 March 2008 and 12 May 2008]
Figure 3 | ILI percentages estimated by our model (black) and provided by
the CDC (red) in the mid-Atlantic region, showing data available at four
points in the 2007-2008 influenza season. During week 5 we detected a
sharply increasing ILI percentage in the mid-Atlantic region; similarly, on 3
March our model indicated that the peak ILI percentage had been reached
during week 8, with sharp declines in weeks 9 and 10. Both results were later
confirmed by CDC ILI data.
may indicate a need for public health inquiry to identify the pathogen
or pathogens involved. Demographic data, often provided by
traditional surveillance, cannot be obtained using search queries.
In the event that a pandemic-causing strain of influenza emerges,
accurate and early detection of ILI percentages may enable public
cannot be certain how search engine users will behave in such a
queries used in our model. Alternatively, panic and concern among
healthy individuals may cause a surge in the ILI-related query frac-
tion and exaggerated estimates of the ongoing ILI percentage.
The search queries in our model are not, of course, exclusively
submitted by users who are experiencing influenza-like symptoms,
and the correlations we observe are only meaningful across large
populations. Despite strong historical correlations, our system
remains susceptible to false alerts caused by a sudden increase in
ILI-related queries. An unusual event, such as a drug recall for a
popular cold or flu remedy, could cause such a false alert.
Harnessing the collective intelligence of millions of users, Google
web search logs can provide one of the most timely, broad-reaching
influenza monitoring systems available today. Whereas traditional
systems require 1–2 weeks to gather and process surveillance data,
our estimates are current each day. As with other syndromic surveil-
lance systems, the data are most useful as a means to spur further
investigation and collection of direct measures of disease activity.
This system will be used to track the spread of ILI throughout the
2008–09 influenza season in the United States. Results are freely
available online at http://www.google.org/flutrends.
Privacy. None of the queries in the Google database for this project can be
associated with a particular individual. The database retains no information
about the identity, internet protocol (IP) address, or specific physical location
of any user. Furthermore, any original web search logs older than 9 months are
Search query database. For the purposes of our database, a search query is a
complete, exact sequence of terms issued by a Google search user; we don’t
combine linguistic variations, synonyms, cross-language translations, misspell-
For example, we tallied the search query ‘indications of flu’ separately from the
search queries ‘flu indications’ and ‘indications of the flu’.
Our database of queries contains 50 million of the most common search
queries on all possible topics, without pre-filtering. Billions of queries occurred
infrequently and were excluded. Using the internet protocol address associated
with each search query, the general physical location from which the query
originated can often be identified, including the nearest major city if within
the United States.
Model data. In the query selection process, we fit per-query models using all
weeks between 28 September 2003 and 11 March 2007 (inclusive) for which the
CDC reported a non-zero ILI percentage, yielding 128 training points for each
region (each week is one data point). Forty-two additional weeks of data (18
March 2007 through to 11 May 2008) were reserved for final validation. Search
query data before 2003 was not available for this project.
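The split sizes above can be checked against two figures quoted elsewhere in the paper: the 1,152 points used in the final regression and the statement that the 42 validation weeks are approximately 25% of the available data. The short Python check below is only an arithmetic illustration; the constant names are made up for this sketch.

```python
N_REGIONS = 9            # CDC surveillance regions
TRAIN_WEEKS = 128        # training points per region
VALIDATION_WEEKS = 42    # held-out weeks per region

# Pooling the training weeks of all regions gives the regression sample size.
pooled_training_points = N_REGIONS * TRAIN_WEEKS

# Share of each region's weeks that was reserved for final validation.
validation_share = VALIDATION_WEEKS / (TRAIN_WEEKS + VALIDATION_WEEKS)

print(pooled_training_points)          # 1152
print(round(validation_share * 100))   # 25
```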
Full Methods and any associated references are available in the online version of
the paper at www.nature.com/nature.
Received 14 August; accepted 13 November 2008.
Published online 19 November 2008; corrected 19 February 2009 (details online).
1. World Health Organization. Influenza fact sheet. http://www.who.int/
2. World Health Organization. WHO consultation on priority public health
interventions before and during an influenza pandemic. http://www.who.int/
Southeast Asia. Nature 437, 209–214 (2005).
4. Longini, I. M. et al. Containing pandemic influenza at the source. Science 309,
5. Espino, J., Hogan, W. & Wagner, M. Telephone triage: A timely data source for
surveillance of influenza-like diseases. AMIA Annu. Symp. Proc. 215–219 (2003).
6. Magruder, S. Evaluation of over-the-counter pharmaceutical sales as a possible
early warning indicator of human disease. Johns Hopkins APL Tech. Digest 24,
7. Fox, S. Online Health Search 2006. Pew Internet & American Life Project http://
8. Johnson, H. et al. Analysis of Web access logs for surveillance of influenza. Stud.
Health Technol. Inform. 107, 1202–1206 (2004).
9. Eysenbach, G. Infodemiology: tracking flu-related searches on the web for
syndromic surveillance. AMIA Annu. Symp. Proc. 244–248 (2006).
10. Polgreen, P. M., Chen, Y., Pennock, D. M. & Forrest, N. D. Using internet searches
for influenza surveillance. Clin. Infect. Dis. 47, 1443–1448 (2008).
Supplementary Information is linked to the online version of the paper at
Acknowledgements We thank L. Finelli for providing background knowledge,
helping us validate results and commenting on this manuscript. We are grateful to
R. Rolfs, L. Wyman and M. Patton for providing ILI data. We thank V. Sahai for his
contributions to data collection and processing, and C. Nevill-Manning, A. Roetter
and K. Sarvian for their comments on this manuscript.
Author Contributions J.G. and M.H.M. conceived, designed and implemented the
system. J.G., M.H.M. and R.S.P. analysed the results and wrote the paper. L.B.
contributed data. All authors edited and commented on the paper.
Author Information Reprints and permissions information is available at
www.nature.com/reprints. Correspondence and requests for materials should be
addressed to J.G. or M.H.M. (firstname.lastname@example.org).
METHODS
validation, we fit models to four 96-point subsets of the 128 points in each
region. Each per-query model was validated by measuring the correlation
regional ILI percentage at those points. Temporal lags were considered, but
ultimately not used in our modelling process.
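One way to picture the cross-validation layout is the sketch below, which builds four 96-point training subsets from 128 weekly points and scores a model by correlation on the remaining points. It is an assumption-laden illustration: holding out contiguous blocks of 32 weeks, validating on exactly those held-out points, and the names `four_fold_splits` and `pearson` are choices made for this example, not details confirmed by the text.

```python
import math


def four_fold_splits(n_points=128, n_folds=4):
    """Return (train_indices, held_out_indices) pairs, one per fold.

    Assumes contiguous held-out blocks; each training subset then has
    n_points - n_points // n_folds points (96 when n_points is 128).
    """
    fold_size = n_points // n_folds
    splits = []
    for k in range(n_folds):
        held_out = list(range(k * fold_size, (k + 1) * fold_size))
        held = set(held_out)
        train = [i for i in range(n_points) if i not in held]
        splits.append((train, held_out))
    return splits


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


splits = four_fold_splits()
```

For each fold, a per-query model would be fit on the `train` indices and its estimates compared with the observed ILI values at the `held_out` indices using `pearson`.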
Each candidate search query was evaluated nine times, once per region, using
in that region. With four cross-validation folds per region, we obtained 36
different correlations between the candidate model’s estimates and the observed
performance, we applied the Fisher Z-transformation11 to each correlation, and
took the mean of the 36 Z-transformed correlations.
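The scoring step can be sketched as follows. The Fisher Z-transformation of a correlation r is atanh(r) = 0.5 ln((1 + r)/(1 − r)); the function names and the three example correlations are illustrative only (a real candidate query would contribute 36 values).

```python
import math


def fisher_z(r):
    """Fisher Z-transformation of a correlation coefficient r in (-1, 1)."""
    return math.atanh(r)


def mean_z_score(correlations):
    """Mean of the Z-transformed correlations: one score per candidate query."""
    return sum(fisher_z(r) for r in correlations) / len(correlations)


# Toy example with three correlations instead of 36.
correlations = [0.80, 0.90, 0.96]
score = mean_z_score(correlations)

# Mapping the mean back with tanh gives a correlation-scale summary.
summary_r = math.tanh(score)
```

Averaging on the Z scale rather than averaging raw correlations gives more weight to differences between values close to 1, where the raw scale is compressed.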
Computation and pre-filtering. In total, we fit 450 million different models to
to divide the work among hundreds of machines efficiently. The amount of
computation required could have been reduced by making assumptions about
which queries might be correlated with ILI. For example, we could have
attempted to eliminate non-influenza-related queries before fitting any models.
However, we were concerned that aggressive filtering might accidentally elim-
inate valuable data. Furthermore, if the highest-scoring queries seemed entirely
unrelated to influenza, it would provide evidence that our query selection
approach was invalid.
Constructing the ILI-related query fraction. We concluded the query selection
process by choosing to keep the search queries whose models obtained the highest
mean Z-transformed correlations across regions: these queries were deemed
to be ‘ILI-related’.
To combine the selected search queries into a single aggregate variable, we
summed the query fractions on a regional basis, yielding our estimate of the ILI-
selected for each region.
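A minimal sketch of the aggregation, assuming per-query scores and per-week query fractions for one region are already available as plain dictionaries. The names `top_n_queries` and `ili_related_query_fraction`, the query labels and all numbers are hypothetical.

```python
def top_n_queries(scores, n):
    """Return the n queries with the highest mean Z-transformed correlation."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


def ili_related_query_fraction(fractions_by_query, selected_queries):
    """Sum the weekly query fractions of the selected queries for one region.

    fractions_by_query: {query: [fraction for week 0, week 1, ...]}
    """
    series = [fractions_by_query[q] for q in selected_queries]
    return [sum(week_values) for week_values in zip(*series)]


# Toy data for a single region, invented for illustration.
scores = {"query a": 1.2, "query b": 1.5, "query c": 0.1}
fractions_by_query = {
    "query a": [0.001, 0.002],
    "query b": [0.003, 0.004],
    "query c": [0.100, 0.100],
}
selected = top_n_queries(scores, 2)
q_t = ili_related_query_fraction(fractions_by_query, selected)
```

Here `q_t` plays the role of Q(t): one aggregate value per week, built from the same selected queries regardless of region.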
making estimates in any region or state based on the ILI-related query fraction
from that region or state. We regressed over 1,152 points, combining all 128
additional weeks of previously untested data in each region, from the most
recently available time period (18 March 2007 through to 11 May 2008).
These 42 points represent approximately 25% of the total data available for
State-level model validation. To evaluate the accuracy of state-level ILI esti-
mates generated using our final model, we compared our estimates against
weekly ILI percentages provided by the state of Utah. Because the model was
fit using regional data through 11 March 2007, we validated our Utah ILI esti-
mates using 42 weeks of previously untested data, from the most recently avail-
able time period (18 March 2007 through to 11 May 2008).
11. David, F. The moments of the z and F distributions. Biometrika 36, 394–403
Sixth Symp. Oper. Syst. Des. Implement., (2004).