Analysis of Tata-1mg data for Covid-19 2nd wave prediction in India
Rajat Jain 1, Utkarsh Gupta 2, Sethuraman TV 3, Rohan Sukumaran 4, Christin Glorioso MD PhD 5
1 Data Scientist, Tata-1mg; 2 Head of Data Science and AI, Tata-1mg; 3 Data Science Researcher, Pathcheck Foundation;
4 Research Manager, Pathcheck Foundation; 5 Head of Research, Data Informatics Center for Epidemiology, Pathcheck Foundation
ABSTRACT
Objective: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is the cause of the ongoing Covid-19 pandemic,
which is having devastating effects around the globe. Identifying early indicators of case surges is pivotal for effective
pandemic preparedness and response. This paper looks at Tata-1mg users’ medicine and symptom search data in India and
studies its potential to provide early warning of upcoming waves in the current pandemic.
Methods and Materials: Tata-1mg is an online healthcare brand present in India, with 50 million monthly active users,
that allows users to search for and order medicines. With the help of clinical practitioners, we segment the search terms used
by customers according to their association with illness and severity, and then assess their correlation with reported Covid-19
case numbers for Indian cities.
Results: We found that the search terms relating to flu/antiviral medication had the highest leading correlation among all
seven search tags tested. We also performed a granular, city-level cross-correlation analysis for 75 Indian cities. We show that
the search terms had a lead of up to 19 days on average (ranging from 0 to 40 days) over reported Covid-19 case numbers, with
significant Pearson correlations (r = 0.7, p < 0.01) for most cities.
Conclusion: We can use information from search data to formulate better healthcare policies to control the coronavirus
pandemic outbreak in the future as well as stock adequate resources. We highlight the ability of search trend data of online
pharmaceutical e-commerce platforms to serve as an early warning indicator for future waves.
Keywords: Infodemiology, Covid-19, Correlation Analysis, Epidemiology
INTRODUCTION
The novel SARS-CoV-2 coronavirus was first identified in Wuhan City, China, on December 31st, 2019[1]. Since its first
occurrence, the virus has spread throughout the world, impacting a majority of the population. Even after a year, the pandemic
is still raging. After an apparent drop in many countries, more infections and deaths are being reported due to the impact of
subsequent waves. India is one of many countries that have been impacted by the pandemic, particularly during the second
wave. As of July 1st, 2021, the total number of cases was more than 30 million, and the number of deaths was over 400,000[2]. With
delayed (or partially ineffective) quarantining, and the inception of newer, more transmissible, variants, the 2nd wave had a
devastating impact on the health infrastructure as well as the livelihood of people in India (and globally). This resulted in lives
lost, economic instability, and a general sense of disharmony. To tackle the spread of Covid-19 cases in India, the government
established an organization called the Integrated Disease Surveillance Programme (IDSP), whose aim is to strengthen and maintain a
decentralized, laboratory-based, IT-enabled disease surveillance system for epidemic-prone diseases. It monitors disease trends
and detects and responds to outbreaks in the early rising phase through trained Rapid Response Teams[3]. However, the system
captures data only when people access healthcare services[4]. Because such direct data sources may be delayed or may under-report
the true magnitude of cases, it is difficult to estimate the actual spatial and temporal evolution of Covid-19 cases.
Over the past several years, numerous scientific studies have shown that user interactions with web applications generate
latent health-related signals that reflect individual as well as community-level trends[5]. Infodemiology (information
epidemiology), introduced by Gunther Eysenbach[6], suggests using online data sources to inform public health and policy[7],[8],
as these approaches have been suggested to support monitoring and forecasting during past outbreaks and epidemics[9],
such as Ebola[10], Zika[11], influenza[12], and measles[13], as well as for mental health[14]. During the Covid-19 pandemic, there
has been an abundance of online activity on various platforms. Popular infodemiology tools like Google Trends, Twitter, and
Facebook can provide the information necessary to analyze the pandemic in different parts of the world. The major benefit
of search/browsing (trends) data is that it has the potential to offer community level insights that current monitoring systems
are unable to obtain due to limited testing capacity [15], and confinement measures that limit people from interacting with
healthcare services. Hence, search trend data can act as a supplemental surveillance tool for national and city level monitoring
of the pandemic. More importantly, unlike invasive testing or surveys that require people to fill in information explicitly, these
search trends data are passively generated by people. Apart from proprietary company-specific insights, these signals provide
a more important secondary function: the ability to predict the trajectory of the pandemic. There has been a myriad of research
around this topic. Amaryllis Mavragani et al. used Google Trends to compare case and death data for US states[6], and
Samira Yousefinaghani et al. used Google Trends along with Twitter data to establish correlations with case data for the US
and Canada[16].
Google searches can be used for exploration, knowledge gathering, and transaction-related purposes. In contrast to search
trends collected by general-purpose search engines, such as Google Trends, where users can casually search for information
on a topic, searches on a medicine-related search engine are closely linked with the intent of purchasing medicines. This
link is particularly important for the data to serve as a strong signal for early pandemic prediction. Furthermore, the data
retrieved from Google Trends is normalized over the selected period; thus, the exact query volumes are unknown to third parties,
which limits data processing and analytical capabilities. Also, the actual algorithm by which Google Trends detects query data is
undisclosed, making it difficult, if not impossible, to identify the causes of any phenomena. Further, the mainstream media
heavily influences Google search activity, while a good indicator for any epidemic spike has to be driven by personal need
[17]. Hence, this paper explores an alternative signal, namely search data from Tata-1mg (an online healthcare platform), for
predicting future case outbreaks.
Tata-1mg is one of India’s largest online healthcare platforms, where users can search for medicines and place orders for
prescription and non-prescription medication[18]. During the second wave, India went through a supply crunch in local
pharmacies for even common drugs, such as paracetamol[19]. This led online healthcare brands like Tata-1mg to experience
an unprecedented growth of over 80% in their consumer base from January 2021 to June 2021. Consumers have increasingly
turned to online healthcare firms to check availability and order medicines, even critical ones, that are unavailable in their
vicinity. The company has eight major warehouses located across India, which serve critical as well as general direct-to-home
medicine delivery across the country. Appropriately stocking warehouses is a core initiative at Tata-1mg: it avoids stock-outs
and lost customer demand while keeping inventory costs, such as holding fees, transportation costs, and storage costs, at a
reasonable level. Building better prediction techniques and detecting early warning signs through changes in user
search patterns are helpful strategies for stocking an appropriate amount of life-saving medications. Diagnostic lab tests and
e-consults with physicians are also available on the platform, giving information on the Covid-19 positivity rate in a region
and on disease dynamics based on the history of patient illness. The healthcare platform is still a growing venture with almost 50
million unique active users per month across India, giving a significant user base from which to draw conclusions. Further,
Tata-1mg serves orders in more than 1000 cities in India, including Tier 1, 2, and 3 cities. This paper used this prior information
to analyze the lead/lag relationship between multiple search term categories (based on different stages of the disease) and
the official Covid-19 cases. We show that the search trend data can be used as an early indicator for the pandemic as well
as for alerting the administration about the rising demand for medicines. This would act as a supplementary data source to
appropriately stock up warehouses across the country to aid a proper supply of drugs in areas with increasing caseloads and
meet the supply-demand gap.
METHODS
Users can search for drugs, such as paracetamol salt for fever, using the term "paracetamol" and will get all available medicine
suggestions matching the searched term. Each search term is stored and anonymized for aggregated analysis across the
Android, iOS, and desktop applications, along with the city from which the search
originated. The recorded aggregated count of each search term can be used as a signal similar to Google Trends and enables
a more in-depth analysis. We only use and present aggregated results on search term count and location to maintain user
anonymity, and no user-level analysis is performed in this paper. For this paper, we use data from 1st January 2021 to the
end of May 2021. To further comply with data privacy rules, we rescale the data to the range 0 to 1.
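As an illustration of the 0-to-1 rescaling mentioned above, the following minimal Python sketch applies min-max scaling to a list of daily counts. The paper does not state the exact scaling formula, so min-max scaling is an assumption here, and the function name and the sample counts are hypothetical.

```python
def min_max_scale(counts):
    """Linearly rescale a list of counts to the range [0, 1] (assumed scaling)."""
    lo, hi = min(counts), max(counts)
    if hi == lo:
        # A constant series has no variation; map every value to 0.
        return [0.0 for _ in counts]
    return [(c - lo) / (hi - lo) for c in counts]


# Hypothetical aggregated daily search counts for one search tag.
daily_counts = [120, 300, 480, 210]
scaled = min_max_scale(daily_counts)
print(scaled)
```

Scaling each series separately hides absolute volumes while preserving the shape of the curve, which is all that a correlation analysis needs.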
Further, the search terms are divided into seven categories based on the symptoms/medication stages and severity clues
from medication, with the help of a panel of 4 clinical experts at Tata-1mg. The panel consisted of three general practitioners
and one internal medicine specialist treating Covid-19 patients on a daily basis since the start of the pandemic. The search
terms were selected after a unanimous consensus of the panel of 4 doctors and alignment with the guidelines provided from
the Covid-19 National Task Force / Joint monitoring group spearheaded by apex medical bodies in the country, including the
All India Institute of Medical Sciences (AIIMS) and the Indian Council of Medical Research (ICMR).
During the initial stages of the second wave, the treatment protocol for managing Covid-19 patients was still evolving,
and continuous changes based on advancing evidence were being made in the treatment strategy. Here we define severity
based on the disease progression and medication used for the treatment at various stages. A guideline from the apex medical
institutes, AIIMS and ICMR, was first published on 22nd April 2021 and revised on 19th May 2021, where the medications
and other supportive treatment guidelines were laid out[20]. As mentioned above, a certified pool of medical practitioners
was consulted to classify the search terms based on practical clinical expertise and the trend of the second wave in India.
In the mild form of Covid-19, physicians generally advised fever/cough medications and immunity boosters
such as zinc and other multivitamins. These drugs, used for symptomatic relief or for boosting the patient’s immunity, are
primarily over-the-counter drugs and do not require a prescription. To treat moderate cases, Fabiflu and other antiviral agents
and steroids are used, and for cases with symptoms of severe respiratory distress and/or damaged lungs, targeted
respiratory medications, steroids such as Methylprednisolone, and blood thinners are recommended. The medicines used to
manage moderate and severe cases of Covid-19 require a valid prescription from a registered medical practitioner. For each
search term, we derive all possible medicine types based on the primary use case linked to the search term and
aggregate the searches for each search type at the district as well as the national level. We use the knowledge provided by
clinical practitioners during the second Covid-19 wave to build a list of search terms that we compare against the confirmed
case data.
The case data is obtained from the covid19india website through its API[23]. The API offers data at granular geographical
units, which helps localize the search patterns at a district level. Further, the API provides daily cases and 7-day average
cases, along with deaths, vaccination data, and recovery statistics.
Statistics
We fetch the search volumes, including the minimum and maximum search levels, for the seven search tags. We compute the
Pearson correlation coefficient (r) and the p-value, and plot cross-correlations at multiple time lags to judge the power
of each search tag to act as an early indicator of the pandemic waves, at the national as well as the city level. We also compute
the early indication factor in terms of medication severity. The analysis was done in Python 3.8[21], and NumPy[22] was
used to calculate the correlations and significance. A correlation of r > 0.7 is considered significant, and a p-value < 0.05
is considered statistically significant. The p-value of the Pearson correlation coefficient was below 0.01 in
all reported results.
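$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{1}
$$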
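To make the lead/lag analysis concrete, here is a minimal sketch of Equation (1) and of a lagged cross-correlation in which searches on day t are compared with cases on day t + lead. The paper reports using NumPy; this sketch uses only the Python standard library so that it runs anywhere, and the function names and the synthetic series are hypothetical, not the authors' code or data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length series (Equation 1)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = (math.sqrt(sum((a - mean_x) ** 2 for a in x))
           * math.sqrt(sum((b - mean_y) ** 2 for b in y)))
    if den == 0:
        raise ValueError("correlation is undefined for a constant series")
    return num / den


def lagged_correlations(searches, cases, max_lead):
    """Map each lead (in days) to r between searches[t] and cases[t + lead]."""
    if len(searches) != len(cases):
        raise ValueError("series must cover the same days")
    result = {}
    for lead in range(max_lead + 1):
        shifted_searches = searches[:len(searches) - lead]
        shifted_cases = cases[lead:]
        result[lead] = pearson_r(shifted_searches, shifted_cases)
    return result


# Synthetic example: the case curve is the search curve delayed by 3 days.
base = [1, 2, 4, 8, 16, 9, 5, 3, 2, 1, 1, 1]
searches = base + [1, 1, 1]
cases = [1, 1, 1] + base
corrs = lagged_correlations(searches, cases, max_lead=6)
best_lead = max(corrs, key=corrs.get)
print(best_lead, round(corrs[best_lead], 3))
```

The sketch omits the p-value. One option, if SciPy is available, is `scipy.stats.pearsonr`, which returns both the coefficient and a p-value.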
RESULTS
Search terms are classified in Table 1 by severity. We have four mild, one moderate, and two severe search
terms, based on the strength of the medicines and the stage of Covid-19 disease progression at which they are required. According
to governmental guidelines, steroids and nasal/respiratory medications can be classified as moderate as well as severe,
depending on the case. However, here we have classified them as severe based on the maximum severity for
which the drug can be used. Table 2 lists the search queries used for each classification category in detail.

Term | Severity | Prescription required | min_searches | max_searches | avg_searches | std
Early Symptoms | Mild | No | 8823 | 58469 | 19296 | 11725
Vitamin and Minerals | Mild | No | 10091 | 109056 | 30604 | 25375
Antibiotic | Mild | No | 1912 | 9296 | 3784 | 1769
Fever Related | Mild | No | 7069 | 52948 | 16697 | 10884
Flu/Antiviral | Moderate | Yes | 1066 | 9727 | 3732 | 1602
Nasal/Respiratory | Moderate/Severe | Yes | 1923 | 14571 | 4633 | 3124
Steroid | Moderate/Severe | Yes | 2979 | 62639 | 9516 | 10008
Table 1. Exploratory data analysis on segregated search types based on physicians’ recommendations for pan-India (1st Jan - 28th May ’21)

Term | Search Terms
Early Symptoms | Fever, Cough, Body Ache, Body Pain, Sore throat, taste, smell
Vitamin and Minerals | A to Z, Multivitamins, Vitamins, Calcirol, Uprise-D3, Limcee, Celin, Zincovit, Zinconia, Zinc, Vitamin c
Antibiotic | Azithral, Antibiotic, Augmentin, Ceftum, Claribid, Zadocef
Fever Related | Paracetamol, Dolo, Crocin, Calpol, Saridon, Lanol
Flu/Antiviral | Fabilflu, Fluguard, Antiflu, Fluvir, Remdesivir, Antiviral
Nasal/Respiratory | Levolin, Respiratory distress, Budecort, Duolin, Levolin, Karvol Plus, Seroflo, Budesal, Foracort
Steroid | MethylPrednisolone, Prednisolone, Blood Thinners, Medrol, Dexamethasone, Omnacort, Wysolone, Ecosprin, Clexaane, Decmax
Table 2. Search terms used for each segregated category related to the Covid-19 epidemic
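The grouping of raw search terms into the categories of Table 2 can be sketched as a simple lookup followed by a sum. The mapping below contains only a subset of the Table 2 terms. The exact-match rule, the function name, and the sample log are assumptions made for illustration, since the paper does not describe how queries were matched.

```python
from collections import Counter

# Subset of the Table 2 mapping, lower-cased for matching.
TERM_TO_CATEGORY = {
    "fever": "Early Symptoms",
    "cough": "Early Symptoms",
    "zinc": "Vitamin and Minerals",
    "azithral": "Antibiotic",
    "paracetamol": "Fever Related",
    "dolo": "Fever Related",
    "remdesivir": "Flu/Antiviral",
    "budecort": "Nasal/Respiratory",
    "dexamethasone": "Steroid",
}


def aggregate_by_category(search_log):
    """Sum counts per category for (term, count) pairs; unmapped terms are skipped."""
    totals = Counter()
    for term, count in search_log:
        category = TERM_TO_CATEGORY.get(term.strip().lower())
        if category is not None:
            totals[category] += count
    return dict(totals)


# Hypothetical aggregated log for one city on one day.
log = [("Paracetamol", 40), ("Dolo ", 25), ("Cough", 10),
       ("Remdesivir", 5), ("Shampoo", 99)]
totals = aggregate_by_category(log)
print(totals)
```

Running the same aggregation per city and per day yields one time series per search tag, which is the input to the correlation analysis.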
PAN–India
Results were observed on the normalized 7-day average search data across India for the 7 search tags listed in Table 1. We
further group the search tags into 2 categories: Mild/Moderate, which includes the medication search categories "Early Symptoms",
"Fever Related", "Vitamin and Minerals", "Antibiotic", and "Flu/Antiviral"; and Severe, which includes the "Nasal/Respiratory"
and "Steroid" medication categories.
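The 7-day averaging mentioned above can be sketched as a rolling mean. The paper does not say whether the window is trailing or centered; a trailing window is assumed here, and the function name and sample values are hypothetical.

```python
def rolling_mean(values, window=7):
    """Trailing moving average; returns one value per full window."""
    if window < 1:
        raise ValueError("window must be at least 1")
    averaged = []
    for end in range(window, len(values) + 1):
        averaged.append(sum(values[end - window:end]) / window)
    return averaged


# Hypothetical daily counts for eight consecutive days.
daily = [1, 2, 3, 4, 5, 6, 7, 8]
smoothed = rolling_mean(daily)
print(smoothed)
```

Averaging over a full week removes the day-of-week pattern that both search and case-reporting data typically show, so that the later correlations reflect the trend rather than weekly cycles.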
Figures 1a and 1b show the normalized PAN-India search term variations alongside the confirmed Covid-19 case data. Mild
and moderate searches (Figure 1a) are leading signals relative to the case data, with flu/antiviral search terms showing the
maximum lead. Most medicines in this category do not require a prescription for purchase, and they are mainly used for
treating the symptoms of Covid-19. The peak of severe search terms (Figure 1b) coincides with reported cases, showing no lead
time.
Figure 2 shows the cross-correlations of each search term with the reported Covid-19 case data. The horizontal axis
shows increasing lead times, from left to right, and a horizontal line marks the minimum correlation threshold of r = 0.7.
We find the highest lead, 20 days, in Fabiflu/Antiviral searches, still with a significant correlation (r = 0.7). Antibiotic, early
symptom, and fever-related search terms display similar leads of 19, 18, and 18 days, respectively.
As observed earlier, severe search terms, including respiratory and nasal distress medication along with steroids, do not
show a significant lead, owing to the late onset of severe symptoms. All the displayed Pearson correlations have a p-value
< 0.01.
Figure 1. (a) Mild and Moderate search terms against the confirmed daily Covid-19 cases for PAN-India (b) Severe search
terms against the confirmed daily Covid-19 cases for PAN-India
City Level Analysis
We take the top 75 cities by search volume on Tata-1mg from January 1st to May 28th and replicate the above analysis to identify
the best individual search terms and leads for these cities. The 75 cities were selected
based on a significant volume of Covid-19 related searches: all search terms mentioned in Table 1 are considered, and
the cities with a combined average search volume greater than 500 for the period of January to May 2021 are selected for
this analysis. We plot figures similar to Figure 1 for the top 4 Indian cities by search volume, namely New Delhi,
Mumbai, Bangalore, and Kolkata. The patterns for the mild/moderate and severe searches are shown in Figure 3 and Figure 4.
Table 3 (in appendix) lists the maximum lead at which a significant correlation was observed for each city. Further,
the table also shows the specific search terms that can possibly serve as an early indicator of the next peak across cities.
Cross-correlation heat maps of each search term against confirmed cases are also plotted (in appendix) for each of the 75
cities. Out of 75 cities, almost 50 show a significant lead of more than 2 weeks for one of the 5 mild/moderate search
terms. As observed in the case of PAN-India, the selected cities, which capture different zonal information along
with variability in searches, also present a lead in mild/moderate search terms against the observed cases. Consistent with the
national-level variation, severe searches present little to no lead in terms of early predictability.

Figure 2. Cross-correlation plots for selected search terms for PAN-India analysis with confirmed Covid-19 cases. Search
terms described in the figure are as follows: (a) Antibiotic (b) Flu/Antiviral (c) Fever Related (d) Early symptoms (e) Nasal/
Respiratory distress (f) Steroidal

Figure 3. Mild/Moderate search terms against the confirmed daily Covid-19 case data for the following major cities in
India: (a) New Delhi (b) Mumbai (c) Kolkata (d) Bangalore
Figure 4. Severe search terms against the confirmed daily Covid-19 case data for the following major cities in India: (a) New
Delhi (b) Mumbai (c) Kolkata (d) Bangalore
As observed in Figures 5a and 5b, cities in the west and south display average correlations of r >
0.7 as well as long leads of more than 20 days. The north and the east coast show a lower lead and a lower
average correlation than the south and west.

Figure 5. (a) Average cross-correlation across multiple cities in India up to 40 days (b) Maximum lead, in days, up to which a
significant (r > 0.7) correlation is observed across multiple cities in India
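The city-level procedure described in this section (keep cities above the volume threshold, then report the longest lead at which the correlation stays significant) can be sketched as follows. The thresholds of 500 searches and r > 0.7 come from the text; the function names, the city labels, and the correlation values are hypothetical. Taking the largest qualifying lead even when intermediate leads fall below the threshold is also an assumption.

```python
def select_cities(avg_volume_by_city, min_volume=500):
    """Return the cities whose combined average search volume exceeds min_volume."""
    return sorted(city for city, volume in avg_volume_by_city.items()
                  if volume > min_volume)


def max_significant_lead(corr_by_lead, threshold=0.7):
    """Largest lead (days) whose correlation exceeds the threshold, or None."""
    leads = [lead for lead, r in corr_by_lead.items() if r > threshold]
    return max(leads) if leads else None


# Hypothetical inputs: average volumes and correlations at a few leads.
avg_volume = {"City A": 1200.0, "City B": 480.0, "City C": 650.0}
corr = {
    "City A": {0: 0.92, 10: 0.85, 20: 0.74, 30: 0.55},
    "City C": {0: 0.66, 10: 0.61, 20: 0.40, 30: 0.20},
}

selected = select_cities(avg_volume)
leads = {city: max_significant_lead(corr[city]) for city in selected}
print(leads)
```

A city with no lead above the threshold returns `None` here, which keeps it distinguishable from a city whose correlation is significant only at a lead of 0 days.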
DISCUSSION
We observe a significant difference between the behavior of mild/moderate medication searches and that of severe medication
searches. Medicines that provide symptomatic relief, mainly mild, non-prescription medicines, along with the antivirals category,
which includes medicines such as Remdesivir and Fabiflu, preceded the peak by 15-20 days. Prescription medicines or more targeted
medications that require more technical expertise from a physician coincided with the peak. Antivirals, while requiring a prescription,
are in the mild/moderate category in terms of treatment and might have been prescribed in abundance to patients all across
India[23]. Since a final objective of this study can be the appropriate stocking of warehouses, the supply chain perspective
matters: it takes almost 5-7 days for a medicine to reach the warehouse from the supplier (known as "lead time" in the
supply chain field) for almost all the medicine types used in the search term analysis, across all warehouses. Thus, a
lead in search data greater than roughly twice the supply-chain lead time gives the ability to adequately stock
warehouses with critical life-saving medication beforehand.
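The comparison between the search lead and the supplier lead time can be written as a small check. The 20-day search lead is the PAN-India flu/antiviral figure reported above, and the 5-7 day supplier lead time and the factor of two come from the text. The function name and the idea of expressing the result as a margin in days are illustrative assumptions.

```python
def stocking_margin(search_lead_days, supplier_lead_days, safety_factor=2.0):
    """Days left over after reserving safety_factor times the supplier lead time."""
    return search_lead_days - safety_factor * supplier_lead_days


# Search lead of 20 days against the 5-day and 7-day supplier lead times.
margin_best = stocking_margin(20, 5)
margin_worst = stocking_margin(20, 7)
print(margin_best, margin_worst)
```

A positive margin means the search signal arrives early enough to place and receive an order with time to spare; a negative one would mean the warning comes too late for that supplier.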
Mild and moderate searches seem to lead the severe search terms by an average of almost 12-18 days. This interval approximately
matches the combined duration of symptom onset, which is typically within four or five days after exposure[24], and
the time from the development of initial symptoms to hospitalization, which has a median in the range of 3
to 10.4 days (longest delay in the age group 20–60 years)[25]. The search pattern hence gives an approximate indication of
the dynamics of disease progression. A sudden spike in the mild/moderate search terms can also signal
the need to prepare severe-case supplies such as steroids, blood thinners, and medical oxygen in advance, and to notify
authorities of a possible spike in the requirement for hospital beds and medical staff. We can also see a similar trend for
major Indian cities, as shown in Figure 3. This gives us insight into the nature of the Covid-19 disease and acts as a
confirmatory data point for understanding the disease dynamics in multiple regions.
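Read as a simple sum of the two cited intervals (an interpretation, not a calculation given in the paper), the combined range discussed above works out as:

$$
T_{\text{combined}} = T_{\text{exposure} \to \text{onset}} + T_{\text{onset} \to \text{hospitalization}} \in [4 + 3,\ 5 + 10.4] = [7,\ 15.4] \text{ days}
$$

Under this reading, the observed 12-18 day lead of mild/moderate over severe searches overlaps the combined range between 12 and 15.4 days, so the agreement is approximate rather than exact.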
While analyzing the 75 Indian cities, a longer lead is observed in India’s western and southern regions compared to India’s
northern and eastern parts. This can be attributed to the fact that the wave first hit parts of Maharashtra and Kerala (west and
south regions) during the early onset of the second wave in March. Migration patterns and increased travel from these
early affected parts to the later affected regions may have disturbed the disease pattern there. Also,
lockdowns were implemented swiftly in other regions after the early markers in the western and southern parts. These factors,
along with an increased media frenzy, might have affected the predictable nature of the disease and might be a reason for the
lower leads and the differences in the behavior of the regions.
Medical authorities across the world do not recommend the antibiotic medicines selected as search terms in this analysis, as
their use leads to significant development of antibiotic resistance. However, excessive use of antibiotics and a spike in sales of
hydroxychloroquine (HCQ) along with antibiotics were seen in India in other studies, apart from our own data[26]. Since the
treatment and diagnostic methods were still evolving during the second wave, and due to the prevalent use of these medications
in this part of the world, antibiotics have been included in the analysis. A possible limitation of the study is that
the penetration of e-commerce platforms in India is still low in remote rural areas and denser in urbanized sections of
the country. Almost 65% of the country’s population resides in rural areas, and hence the findings of this study can be
generalized only to limited geographies. Nevertheless, because of India’s large population, the urban population is still very
significant. The search volume for certain cities is limited, and results are based on observations limited to the Tata-1mg platform.
The case numbers considered for analysis may be underestimated due to possible reporting errors and the asymptomatic nature
of Covid-19. Despite these limitations, search data can be a guiding indicator for predicting possible outbreaks. Similar
strategies and approaches have been shown to improve preparedness in a myriad of previous works. The
approach can also be applied by other large online medical and healthcare platforms to their own data to build an even more robust
system and to be better prepared for another Covid-19 wave if and when it arrives.
FUTURE AREAS OF WORK
An early warning system can be built using similar search trends data for predicting the next wave in a region. Building
infrastructure to alert administrative authorities as well as planning the implementation of a lockdown to curb the spread of
the disease can be a possible use of the study.
Improving the existing supply chain from the Tata-1mg perspective, allocating life-saving medications, and stocking
oxygen before the next wave hits can be another use case of this study. Prediction models trained on past searches
and running Covid-19 case counts in multiple cities, with searches used as a leading feature in a time series model, can
automatically detect spikes in these search terms. This can lead to accurate demand prediction and to stocking the appropriate
warehouses in times of shortage of critical medication.
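As a rough illustration of using searches as a leading feature, the sketch below fits a one-variable least-squares line that predicts cases on day t + lead from searches on day t, then forecasts the days beyond the observed case series. This is a deliberately simple stand-in for the time series models the text envisions, not a model from the paper; the function names and the synthetic data are hypothetical.

```python
def fit_lagged_linear(searches, cases, lead):
    """Least-squares fit of cases[t + lead] = slope * searches[t] + intercept."""
    if len(searches) != len(cases):
        raise ValueError("series must cover the same days")
    if lead < 1 or lead >= len(searches) - 1:
        raise ValueError("lead must leave at least two overlapping points")
    x = searches[:len(searches) - lead]
    y = cases[lead:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((v - mean_x) ** 2 for v in x)
    if sxx == 0:
        raise ValueError("searches are constant; slope is undefined")
    slope = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast_cases(searches, slope, intercept, lead):
    """Predict cases for the `lead` days following the last observed day."""
    return [slope * s + intercept for s in searches[-lead:]]


# Synthetic data: cases two days ahead equal 3 * searches + 5.
searches = [10, 12, 15, 20, 28, 40, 55, 70]
cases = [30, 33, 35, 41, 50, 65, 89, 125]
slope, intercept = fit_lagged_linear(searches, cases, lead=2)
predicted = forecast_cases(searches, slope, intercept, lead=2)
print(round(slope, 3), round(intercept, 3), [round(p, 1) for p in predicted])
```

The lead would in practice be chosen from the cross-correlation analysis, for example the lead with the highest significant correlation for that city and search tag.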
Apart from medical searches and doorstep delivery, Tata-1mg also conducts multiple lab tests, including RT-PCR and antigen
tests. There is also an abundance of prescription data, both handwritten and digitized, since Tata-1mg receives more
than 10 million prescription as well as non-prescription orders annually. A prescription is a doctor’s order which stipulates the
administration of drugs in the specified amount, duration, and frequency, and contains details of the patient such as name, age,
and gender, and also the details of the doctor who writes the prescription[27]. This data can be anonymized, aggregated, and
analyzed for different demographics. Leveraging this information can provide insights on the spread of Covid-19, the effect
of new variants, symptom progression, trajectory, and the spatio-temporal impact of the virus on various age groups and genders.
This study can aid in fighting the pandemic to the best of our ability and prevent further loss of life.
REFERENCES
[1] Mavragani A. Tracking COVID-19 in Europe: Infodemiology Approach. JMIR Public Health Surveill. 2020
Apr;6(2):e18941. Available from: http://publichealth.jmir.org/2020/2/e18941/.
[2] Johns Hopkins University. New COVID-19 Cases Worldwide. https://coronavirus.jhu.edu/data/new-cases.
[3] Ministry of Health and Family Welfare, Govt of India. Integrated Disease Surveillance Programme. https://idsp.nic.in/.
[4] Venkatesh U, Gandhi P. Prediction of COVID-19 Outbreaks Using Google Trends in India: A Retrospective Analysis.
Healthcare Informatics Research. 2020;26:175 – 184.
[5] Lampos V, Moura S, Yom-Tov E, Edelstein M, Majumder M, McKendry RA, et al. Tracking COVID-19 using online
search. CoRR. 2020;abs/2003.08086. Available from: https://arxiv.org/abs/2003.08086.
[6] Mavragani A, Gkillas K. COVID-19 predictability in the United States using Google Trends time series. Scientific
Reports. 2020 Nov;10(1):20693. PMID: 33244028. Available from: https://pubmed.ncbi.nlm.nih.gov/33244028.
[7] Mavragani A. Infodemiology and Infoveillance: Scoping Review. J Med Internet Res. 2020 Apr;22(4):e16206. Available
from: http://www.jmir.org/2020/4/e16206/.
[8] Bernardo T, Rajić A, Young I, Robiadek KM, Pham M, Funk J. Scoping Review on Search Queries and Social Media
for Disease Surveillance: A Chronology of Innovation. Journal of Medical Internet Research. 2013;15.
[9] Eysenbach G. SARS and Population Health Technology. J Med Internet Res. 2003 Jun;5(2):e14. Available from:
http://www.jmir.org/2003/2/e14/.
[10] van Lent LG, Sungur H, Kunneman FA, van de Velde B, Das E. Too Far to Care? Measuring Public Attention and Fear
for Ebola Using Twitter. J Med Internet Res. 2017 Jun;19(6):e193. Available from: http://www.jmir.org/2017/6/e193/.
[11] Farhadloo M, Winneg K, pui Sally Chan M, Jamieson KH, Albarracín D. Associations of Topics of Discussion on Twitter
With Survey Measures of Attitudes, Knowledge, and Behaviors Related to Zika: Probabilistic Study in the United States.
JMIR Public Health and Surveillance. 2018;4.
[12] Mavragani A, Ochoa G. The Internet and the Anti-Vaccine Movement: Tracking the 2017 EU Measles Outbreak. Big
Data and Cognitive Computing. 2018;2(1). Available from: https://www.mdpi.com/2504-2289/2/1/2.
[13] Du J, Tang L, Xiang Y, Zhi D, Xu J, Song HY, et al. Public Perception Analysis of Tweets During the 2015 Measles
Outbreak: Comparative Study Using Convolutional Neural Network Models. J Med Internet Res. 2018 Jul;20(7):e236.
Available from: https://doi.org/10.2196/jmir.9413.
[14] McClellan C, Ali MM, Mutter R, Kroutil L, Landwehr J. Using social media to monitor mental health discussions
evidence from Twitter. Journal of the American Medical Informatics Association. 2016 10;24(3):496–502. Available
from: https://doi.org/10.1093/jamia/ocw133.
[15] Roser M, Ritchie H, Ortiz-Ospina E, Hasell J. Statistics and Research-Coronavirus Pandemic (COVID-19).
https://ourworldindata.org/coronavirus.
[16] Yousefinaghani S, Dara R, Mubareka S, Sharif S. Prediction of COVID-19 Waves Using Social Media and Google
Search: A Case Study of the US and Canada. Frontiers in Public Health. 2021;9:359. Available from:
https://www.frontiersin.org/article/10.3389/fpubh.2021.656635.
[17] Lampos V, Majumder MS, Yom-Tov E, Edelstein M, Moura S, Hamada Y, et al. Tracking COVID-19 using online
search. npj Digital Medicine. 2021 Feb;4(1):17. Available from: https://doi.org/10.1038/s41746-021-00384-w.
[18] Tata-1mg. Tata-1mg;. https://1mg.com.
[19] TimesOfIndia. TOI;. https://timesofindia.indiatimes.com/city/delhi/panic-buying-and-lack-of-supply-causing-shortage-of-key-drugs/articleshow/82281888.cms.
[20] MOHFW. MOHFW;. https://www.mohfw.gov.in/pdf/COVID19ClinicalManagementProtocolAlgorithmAdults19thMay2021.pdf.
[21] Python org. python3.8;. https://www.python.org/downloads/release/python-380/.
[22] numpy org. numpy;. https://numpy.org.
[23] BusinessStandard. Antiviralsales;. https://www.business-standard.com/article/companies/fabiflu-numero-uno-in-indian-pharma-market-show-april-sales-data-121050900853_1.html.
[24] HarvardHealthedu. Harvardhealth;. https://www.health.harvard.edu/diseases-and-conditions/if-youve-been-exposed-to-the-coronavirus.
[25] Faes C, Abrams S, Van Beckhoven D, Meyfroidt G, Vlieghe E, Hens N, et al. Time between Symptom Onset,
Hospitalisation and Recovery or Death: Statistical Analysis of Belgian COVID-19 Patients. International Journal
of Environmental Research and Public Health. 2020 Oct;17(20):7560. PMC7589278[pmcid]. Available from:
https://doi.org/10.3390/ijerph17207560.
[26] Sulis G, Batomen B, Kotwani A, Pai M, Gandra S. Sales of antibiotics and hydroxychloroquine in India during the
COVID-19 epidemic: An interrupted time series analysis. PLOS Medicine. 2021 Jul;18(7):1–18. Available from:
https://doi.org/10.1371/journal.pmed.1003682.
[27] Gupta M, Soeny K. Algorithms for rapid digitalization of prescriptions. Visual Informatics. 2021. Available from:
https://www.sciencedirect.com/science/article/pii/S2468502X21000334.
APPENDIX
City | Search type | Maximum searches per day | Maximum significant lead (days) || City | Search type | Maximum searches per day | Maximum significant lead (days)
Nashik | antibiotic | 180 | 40 || Kolhapur | flu | 24 | 16
Jalgaon | fever | 446 | 40 || Durg | vit | 284 | 16
Vadodara | symptom | 276 | 40 || Faridabad | antibiotic | 204 | 16
Buldhana | fever | 234 | 39 || Prakasam | steroid | 118 | 16
Pune | antibiotic | 424 | 37 || Chennai | fever | 938 | 15
Ahmednagar | fever | 452 | 37 || East Godavari | steroid | 210 | 15
Amravati | fever | 220 | 36 || West Godavari | steroid | 28 | 15
Indore | antibiotic | 262 | 33 || Ranchi | antibiotic | 160 | 15
Beed | vit | 216 | 33 || Chittoor | vit | 194 | 14
Satara | fever | 144 | 31 || Mysore | vit | 362 | 14
Solapur | symptom | 476 | 31 || Jodhpur | fever | 238 | 14
Surat | symptom | 468 | 31 || Varanasi | antibiotic | 188 | 14
Ludhiana | fever | 168 | 28 || Anantapur | nasal | 158 | 13
Bhopal | antibiotic | 256 | 27 || Palghar | vit | 94 | 13
Kolkata | flu | 110 | 26 || Lucknow | antibiotic | 680 | 12
Ahmedabad | fever | 904 | 25 || Kannur | vit | 128 | 12
Nagpur | antibiotic | 438 | 24 || Raipur | fever | 606 | 12
Yavatmal | vit | 182 | 24 || Srikakulam | fever | 118 | 12
Latur | vit | 244 | 23 || Kurnool | nasal | 216 | 12
Wardha | fever | 84 | 23 || Kanpur | antibiotic | 210 | 12
Thane | antibiotic | 202 | 22 || Ernakulam | fever | 144 | 11
Visakhapatnam | fever | 742 | 21 || Coimbatore | vit | 586 | 11
Nanded | vit | 388 | 21 || Kozhikode | vit | 216 | 10
South 24 Parganas | fever | 196 | 21 || Hassan | vit | 90 | 10
Howrah | fever | 466 | 20 || Allahabad | fever | 882 | 10
Mumbai | fever | 2134 | 19 || Thrissur | vit | 218 | 9
Jaipur | flu | 54 | 19 || Palakkad | vit | 186 | 9
Guntur | fever | 298 | 19 || Erode | vit | 148 | 9
Bangalore | flu | 178 | 18 || Madurai | vit | 280 | 9
Patna | antibiotic | 502 | 18 || Kottayam | vit | 124 | 8
Sangli | vit | 226 | 18 || Tiruchirappalli | vit | 210 | 6
Chandrapur | fever | 152 | 18 || Kollam | vit | 110 | 5
New Delhi | flu | 1220 | 17 || Pathanamthitta | symptom | 56 | 3
North 24 Parganas | fever | 316 | 17 || Tiruppur | vit | 88 | 3
Bhubaneshwar | fever | 492 | 17 || Malappuram | antibiotic | 54 | 0
Nellore | fever | 218 | 17 || Alappuzha | antibiotic | 26 | 0
Gurugram | antibiotic | 744 | 16 || Kasaragod | antibiotic | 36 | 0
Dehradun | fever | 492 | 16 ||
Table 3. Maximum lead (in days) at which the Pearson correlation with confirmed-case data remains significant (r > 0.7),
for the listed search type in each Indian city, together with the maximum number of searches per day for that search type
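
The lead values in Table 3 can be illustrated with a lagged Pearson correlation. The following is a minimal sketch and not the authors' code: it assumes two equal-length daily series for one city (search volume and confirmed cases), shifts the search series forward by 0 to 40 days, and reports the largest shift at which r exceeds 0.7. The function names `pearson_r` and `max_significant_lead` and the synthetic data in the usage example are hypothetical; the p-value check reported in the paper (p<0.01) is omitted here for brevity.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


def max_significant_lead(searches, cases, max_lag=40, threshold=0.7):
    """Largest lead (in days) at which searches correlate with later cases.

    For each lag, searches on day t are paired with cases on day t + lag.
    Returns None when no lag in [0, max_lag] gives r > threshold.
    Assumption: both inputs are daily series of the same length.
    """
    searches = np.asarray(searches, dtype=float)
    cases = np.asarray(cases, dtype=float)
    best = None
    for lag in range(max_lag + 1):
        s = searches[: len(searches) - lag]  # drop the last `lag` days
        c = cases[lag:]                      # drop the first `lag` days
        if len(s) < 3:
            break  # too few overlapping points for a meaningful correlation
        if pearson_r(s, c) > threshold:
            best = lag
    return best


# Usage with synthetic data: the search peak is placed 10 days before the case peak.
days = np.arange(120)
cases = np.exp(-(((days - 70) / 8.0) ** 2))
searches = np.exp(-(((days - 60) / 8.0) ** 2))
lead = max_significant_lead(searches, cases)
print("maximum significant lead (days):", lead)
```

Because neighbouring lags are also strongly correlated when the curves are smooth, the maximum significant lead is generally larger than the lag at which the correlation peaks, which is one reason the values in Table 3 should be read as upper bounds on the usable lead time.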
Figure 6. Pearson correlation heat map between confirmed cases and search volume of Antibiotic medicines for 75 Indian cities
Figure 7. Pearson correlation heat map between confirmed cases and search volume of Fever medicines for 75 Indian cities
Figure 8. Pearson correlation heat map between confirmed cases and search volume of Vitamin/Mineral supplements for 75 Indian cities
Figure 9. Pearson correlation heat map between confirmed cases and search volume of Early symptom medication for 75 Indian cities
Figure 10. Pearson correlation heat map between confirmed cases and search volume of Flu/Antiviral medicines for 75 Indian cities
Figure 11. Pearson correlation heat map between confirmed cases and search volume of Nasal/Respiratory distress medicines for 75 Indian cities
Figure 12. Pearson correlation heat map between confirmed cases and search volume of Steroidal medicines for 75 Indian cities