
Mining social media for prescription medication abuse monitoring: A review and proposal for a data-centric framework



Objective: Prescription medication (PM) misuse and abuse is a major health problem globally, and a number of recent studies have focused on exploring social media as a resource for monitoring nonmedical PM use. Our objectives are to present a methodological review of social media-based PM abuse or misuse monitoring studies, and to propose a potential generalizable, data-centric processing pipeline for the curation of data from this resource. Materials and methods: We identified studies involving social media, PMs, and misuse or abuse (inclusion criteria) from Medline, Embase, Scopus, Web of Science, and Google Scholar. We categorized studies based on multiple characteristics including but not limited to data size; social media source(s); medications studied; and primary objectives, methods, and findings. Results: A total of 39 studies met our inclusion criteria, with 31 (∼79.5%) published since 2015. Twitter has been the most popular resource, with Reddit and Instagram gaining popularity recently. Early studies focused mostly on manual, qualitative analyses, with a growing trend toward the use of data-centric methods involving natural language processing and machine learning. Discussion: There is a paucity of standardized, data-centric frameworks for curating social media data for task-specific analyses and near real-time surveillance of nonmedical PM use. Many existing studies do not quantify human agreements for manual annotation tasks or take into account the presence of noise in data. Conclusion: The development of reproducible and standardized data-centric frameworks that build on the current state-of-the-art methods in data and text mining may enable effective utilization of social media data for understanding and monitoring nonmedical PM use.
Abeed Sarker, Annika DeRoos, and Jeanmarie Perrone
Department of Biomedical Informatics, Emory University School of Medicine, Atlanta, Georgia, USA,
College of Arts and Sciences, University of Pennsylvania, Philadelphia, Pennsylvania, USA, and
Department of Emergency Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
Corresponding Author: Abeed Sarker, PhD, Department of Biomedical Informatics, Emory University School of Medicine,
101 Woodruff Circle, Office 4101, Atlanta, GA 30322, USA;
Received 9 July 2019; Revised 14 August 2019; Editorial Decision 15 August 2019; Accepted 0 Month 0000
Key words: social media, prescription drug misuse, substance abuse detection, natural language processing, text mining
© The Author(s) 2019. Published by Oxford University Press on behalf of the American Medical Informatics Association.
This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs licence (
nc-nd/4.0/), which permits non-commercial reproduction and distribution of the work, in any medium, provided the original work is not altered or transformed in any way,
and that the work is properly cited. For commercial re-use, please
Journal of the American Medical Informatics Association, 00(0), 2019, 1–15
doi: 10.1093/jamia/ocz162
Prescription medication (PM) abuse (we use the terms abuse, misuse,
and nonmedical use interchangeably in this article to represent all
forms of use that are not medically prescribed, unless explicitly
stated otherwise) is a major public health crisis that has reached
epidemic proportions in many countries including the United States.
According to a report published in 2011 by the Drug Abuse
Warning Network, about half of all emergency department visits for
drug misuse were attributed to PMs.
A national survey conducted
in 2014 showed that over 50 million people in the United States
have used PMs nonmedically—a significant portion of which can be
classified as abuse.
Commonly abused PMs include opioids, depres-
sants and stimulants,
and the consequences range from minor side
effects such as nausea to serious adverse outcomes including addic-
tion and death. Owing to the rapidly escalating morbidity and mor-
tality, the problem is now receiving international attention,
particularly for opioids and their relation to illicit analogs such as
heroin and fentanyl.
Despite the enormity of the problem, there is a
lack of surveillance mechanisms that would enable investigations on
the factors contributing to PM abuse, the natural history of the indi-
viduals who develop substance use disorders, and the characteristics
of the populations affected (eg, age and gender) by distinct classes of
abuse-prone PMs. This is emphasized in a recent study delineating
10 steps that the United States government should take to curb the
opioid epidemic, where the top suggestion was new and innovative
methods of surveillance.
The 2016 National Drug Threat Assessment Summary published
by the Drug Enforcement Administration (DEA) revealed that the number
of deaths involving PMs has outpaced those from cocaine and heroin
combined, for every year since 2002,
with approximately 52
people dying each day in the United States from PM overdose. More
recently, the Centers for Disease Control and Prevention published a
report showing that in the year 2017, there were 70 237 deaths due
to drug overdose, of which 17 029 were attributable to prescription
opioids, 11 537 to benzodiazepines and 5269 to antidepressants. A
portion of these deaths were due to coingestion, and more than half
of these deaths involved an opioid, including prescription opioids.
Statistics from the WONDER database
suggest that overdoses
from prescription opioids were a pivotal factor in the 15-year in-
crease in opioid overdose deaths, with the sales of pain-related PMs
quadrupling since 1999. This multifold increase in the prescribing
and sales of pain medications occurred despite the total volumes of
office-based physician visits and emergency department visits due to
pain as the primary symptom remaining stable from 2000 to
While the long-term impact and costs of prescription
opioids are now well understood, less is known about other classes
of PMs,
although the recently published survey by the Substance
Abuse and Mental Health Services Administration presents some
alarmingly high numbers.
The survey, which estimated abuse
based on self-reports, revealed the following statistics: 3.3 million
Americans misused opioid pain relievers, 2.0 million misused tran-
quilizers (eg, benzodiazepines and muscle relaxants), 1.7 million
misused stimulants (eg, Adderall), and 0.5 million misused sedatives
(eg, zolpidem). Financial costs associated with PM abuse have been
on the rise as well. Prescription opioid abuse alone amounted to an
estimated total cost of $55.7 billion in 2007
and $78.5 billion in
and recent estimates made by the Centers for Disease Con-
trol and Prevention suggest that PM misuse costs health insurers up
to $72.5 billion annually in direct healthcare costs.
Owing to the enormity of the problem of drug abuse and over-
dose, the White House announced widespread programs in 2015,
which included monitoring and raising awareness about PM abuse,
particularly among young people.
In an earlier report by the Office
of National Drug Control Policy, 4 major areas of focus were de-
tailed, including the improvement of tracking and monitoring tech-
niques to detect and prevent diversion and abuse.
Current PM
abuse monitoring strategies are aimed primarily at distributors and
licensed practitioners. The DEA requires that wholesalers have mon-
itoring programs in place to identify suspicious orders. For licensed,
prescribing health practitioners, most states have Prescription Drug
Monitoring Programs, and pharmacies are required to report the
patients, prescribers, and specific medications dispensed for con-
trolled substances. This data is used by prescribers and law enforce-
ment agencies to identify and limit possible medication abuse. Data
at the national level is obtained through large-scale surveys by the
DEA and others.
These surveys are expensive to conduct and
there are significant lags between the survey dates and the release of
the results (eg, report for the 2016 National Survey on Drug Use
and Health was made available in September 2017). Current PM
monitoring programs are also plagued with numerous limitations,
with efficacies varying widely.
Other existing control measures
and interventions lack critical information such as the patterns of
usage of various PMs and the demographics of the users. Such infor-
mation can be crucial in designing control measures and outreach
programs. For example, warnings to deter PM abuse might be more
successful if broadcast during high abuse periods, if known. In re-
sponse to the necessity of identifying novel strategies for monitoring
PM abuse, the National Institute on Drug Abuse launched PA-18-
encouraging applicants to “develop innovative research
applications on prescription drug abuse,” “examine the factors,”
and “characterize this problem in terms of classes of drugs abused
and combinations of drug types, etiology of abuse, and populations
most affected.”
Social media and medication abuse
Recent studies, including our preliminary studies on the topic,
have validated the use of social media as a platform for monitoring
PM abuse. For example, they have shown that although nonmedical
users of PMs may not voluntarily report their actions to medical
practitioners, their self-reports are often detectable in the social me-
dia sphere.
To summarize, these studies have shown that (1)
many people publicly self-report PM abuse information in social me-
dia, (2) automatic natural language processing (NLP) and machine
learning methods are capable of detecting PM abuse-indicating
posts, and (3) additional information such as temporal patterns of
abuse and common coingestion behaviors can be detected from so-
cial media chatter. The Social Media Fact Sheet
from Pew Re-
search Center shows that currently 69% of all adult Americans use
social media, with particularly high numbers for younger adults
(86% for 18- to 29-year-olds; 80% for 30- to 49-year-olds), and the
trend of adoption is still upward. Similar trends are also visible glob-
ally. Social media may also provide access to communities and infor-
mation generated through social interactions that may not be
available from other sources.
Thus, social media presents a
unique opportunity to study PM abuse at the population level, and
discover unique information.
Challenges of social media–based text mining
Social media provides unfiltered information in near real time,
posted by people from diverse demographic groups. While
the volume of data available from this resource is an asset, proper
utilization of this data for knowledge discovery is challenging.
Knowledge from social media must be automatically curated, as it is
not feasible to process such big data manually. Identifying and filter-
ing out relevant data automatically is arduous, requiring customized
methods. Knowledge generation typically requires standardization
of the data, which in turn requires advanced NLP methods to parse
the texts. The language used in social media is unique and
complicated, due to the presence of colloquialisms, misspellings,
emojis and ambiguities, and often the lack of context. Additionally,
the language in social media is ever evolving, requiring the
development of adaptable, intelligent systems that can evolve with the
data. Consequently, while early works attempted to manually create
static consumer health vocabularies from social networks and online
health communities,
some recent research tasks have attempted
to develop data-centric methods for automatically discovering
common nonstandard consumer health terms
and misspellings. PM
abuse-related chatter also presents mining challenges that illicit drug
abuse-related chatter does not present. For example, any expression
of consumption of illicit drugs is by definition abuse. However, for
PMs, consumption information may represent medical use, misuse,
or abuse, consequently complicating automated mining further.
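To make the misspelling challenge concrete, the following is a minimal sketch, not a method from any reviewed study, of matching tokens in a post against a small medication lexicon with approximate string matching. The lexicon, the function name `find_drug_mentions`, and the similarity cutoff of 0.8 are assumptions chosen for illustration; only Python's standard `difflib` and `re` modules are used.

```python
import difflib
import re

# Hypothetical lexicon for illustration only; a real study would curate its own.
DRUG_LEXICON = ["oxycodone", "oxycontin", "hydrocodone", "vicodin", "adderall"]


def find_drug_mentions(post, lexicon=DRUG_LEXICON, cutoff=0.8):
    """Return lexicon entries that tokens in `post` match exactly or approximately."""
    tokens = re.findall(r"[a-z]+", post.lower())
    mentions = set()
    for token in tokens:
        # get_close_matches keeps candidates whose similarity ratio is >= cutoff
        # and returns them best-first; we keep only the single best candidate.
        matches = difflib.get_close_matches(token, lexicon, n=1, cutoff=cutoff)
        if matches:
            mentions.add(matches[0])
    return sorted(mentions)


print(find_drug_mentions("took two oxycotin last night"))
```

A matcher of this kind only detects that a medication is mentioned. It cannot tell medical use from misuse or abuse, which is the distinction the paragraph above identifies as the harder problem, and the cutoff trades missed variants against false matches.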
Data search and selection
We searched the databases Medline and Embase, the citation databases
Scopus and Web of Science, and Google Scholar to find relevant
articles published within the last 15 years. We searched for keywords
indicating social media AND prescription medication AND abuse.
Besides searching the databases, we also reviewed the reference lists of
studies that met our inclusion criteria, to find additional related stud-
ies that may not be identifiable by our keyword-based approaches (eg,
studies naming specific medications and utilizing social media data
along with data from other sources). Table 1 presents the variants of
the keywords used for each of the 3 categories.
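As an illustration of how the 3 keyword categories combine, the sketch below builds a Boolean query of the form (social media terms) AND (medication terms) AND (abuse terms). The keyword lists are the legible variants from Table 1; the function names and the quoting and OR/AND syntax are assumptions, since each database has its own query language and the exact queries submitted are not given in the text.

```python
from itertools import product

# Keyword variants per category (legible subset of Table 1).
SOCIAL_MEDIA = ["social media", "social network", "forum", "discussion board"]
MEDICATION = ["prescription", "medication", "drug", "substance"]
ABUSE = ["misuse", "use", "usage", "nonmedical use"]


def or_group(terms):
    """Join the variants of one category into a parenthesized OR group."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def build_query(*categories):
    """AND together one OR group per category."""
    return " AND ".join(or_group(c) for c in categories)


def enumerate_triples(*categories):
    """List every combination that takes exactly one term from each category."""
    return [" AND ".join(f'"{t}"' for t in combo) for combo in product(*categories)]


print(build_query(SOCIAL_MEDIA, MEDICATION, ABUSE))
print(len(enumerate_triples(SOCIAL_MEDIA, MEDICATION, ABUSE)), "single-term combinations")
```

The single grouped query suits engines that support nested Boolean operators; the enumerated form is a fallback for search boxes that accept only flat keyword strings.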
We sorted the search engine results by relevance, filtered a selec-
tive set for review, and obtained their full texts. We included articles
if the titles or abstracts suggested that they used data from social me-
dia for detecting, characterizing or studying PM abuse or misuse.
Studies that met our inclusion criteria were those that presented
original data, utilized any internet-based resource of consumer-
generated data (eg, online health communities, forums, message
boards, social networks), and presented qualitative or quantitative
analyses or well-defined outcomes or results that were directly rele-
vant to at least 1 PM. We included articles that employed manual
analysis as well as those that employed NLP or machine learning
approaches. We excluded studies that solely focused on illicit drug
abuse or trade, or utilized sources such as electronic health records
or published literature. Studies were also excluded if they only de-
scribed clinical trials or extracted information from medication
labels suggesting possibility of abuse, if they were news articles or
other non–peer-reviewed sources, or if they were not published in
English. Additionally, we excluded short commentaries, letters, and
responses, unless they provided methodological insights. Articles fo-
cused on computational methodologies, which are not relevant or
unique to the PM abuse problem, were also excluded unless they in-
cluded at least a case study involving a named social network (eg,
the study by Yakushev and Mityagin
was excluded based on this criterion).
Data abstraction
For all the included studies, we abstracted the pertinent information
presented in them, such as study sizes, sources of data, medications
studied, and the primary objectives, methods and findings of the
studies. For study size, we focused on the sample size of the data (eg,
number of tweets) and the number of medications. We broadly cate-
gorized studies into “big” and “small,” with big studies including at
least 10 000 posts in the articles’ primary analyses. We also identi-
fied the medication classes studied, when available (eg, opioids, ben-
zodiazepines). For studies presenting multiple objectives or findings,
we focused on the primary ones only or those that are related to mis-
use or abuse. In our analyses of primary methods and results, we
attempted to critique the data processing method(s) employed, the
primary contributions of the methods, and the relevance and
strengths of the evaluation methods employed.
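The size categorization described above reduces to a single threshold test. A minimal sketch follows; the constant and function names are hypothetical, and the 10 000-post threshold is the one stated in the text.

```python
BIG_DATA_THRESHOLD = 10_000  # posts in a study's primary analysis, as defined above


def categorize_study_size(num_posts):
    """Label a study "Big" or "Small" by the post count of its primary analysis."""
    if num_posts < 0:
        raise ValueError("num_posts must be non-negative")
    return "Big" if num_posts >= BIG_DATA_THRESHOLD else "Small"


print(categorize_study_size(12838), categorize_study_size(1290))
```

Studies whose size is reported in other units, such as a number of forums or websites, cannot be labeled by this rule alone and need a manual judgment.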
Data collection
Our searches resulted in an initial set of over 1000 articles. Many of
these articles focused more generally on substance abuse (eg, illicit
drugs and alcohol) and social media, or PM abuse from non–social
media data sources. It was particularly challenging to identify stud-
ies that included both prescription and illicit drugs. Based on an in-
spection of the titles and abstracts of these articles, we selected a
sample of 63 articles for further review. From this set, 39 studies—
journal articles and conference proceedings—were deemed to meet
our inclusion criteria.
The earliest study we identified, which suggested the possibility
of utilizing web-based, consumer-generated sources for studying
drug abuse, was from 2006.
Research on this topic, however, be-
gan gaining attention from 2012, with 3 articles published in that
year. Since then, generally speaking, there has been an increasing
trend in the number of articles published on the topic every year
(Figure 1).
Study characterizations
Detailed characterizations of the included studies across several
dimensions are summarized in Tables 2 and 3. The articles in the 2
tables are listed in the same chronological order. Table 2 shows the
years of publication of the articles, the data sources utilized by the
studies, the number of medications, medication categories studied,
the sizes of the datasets and whether the datasets could be catego-
rized as big data or not. Twitter has been the most commonly used
data source, with 20 (51.3%) studies relying on it. This is particu-
larly due to the early availability and popularity of Twitter’s public
streaming application programming interface (https://developer. The
application programming interface makes available a sample of pub-
lic Twitter posts in real-time, which can be collected using keywords
for research purposes. Among generic social networks, other than
Twitter, Instagram and Reddit are increasing in popularity due to
the growing user bases and the typically public nature of the posts.
Table 1. Sample search queries used to retrieve articles for this review
Social media: social media; social network; forum; online health; discussion board
Prescription medication: prescription; medication; drug; substance
Abuse: misuse; use; usage; nonmedical use
Many studies attempted to utilize specialized topic-oriented web
forums for research.
As depicted in Table 2, only 6 studies focused on a single medication,
and at least 10 studies included both prescription and illicit drugs.
Opioids have been the most common medication category studied,
with 16 (41%) papers focusing solely on this category. This is unsur-
prising, considering the growing interest in opioids following the
opioid crisis in the United States. Based on our categorization
threshold for study size, 20 (51.3%) studies included big data, with
3 studies from this set also performing elaborate manual analyses on
smaller samples.
Table 3 details the (1) objectives of the included studies, (2) the
primary methods employed by them, and (3) their primary findings.
A number of studies included multiple objectives, approaches, or
findings, and in the table, we focus on the main contributions of the
articles according to our review guidelines. The objectives of the
articles varied considerably and included studies to assess if social
media chatter contained evidence of abuse, characterize chatter
about specific medications manually or automatically, assess user
sentiments, develop new methods for automating the surveillance of
drug misuse or abuse via social media, discover nonstandard names
or terms associated with abuse-prone drugs, and analyze the geo-
graphic distributions of abuse-related chatter. Methods for data
analysis or characterization included manual analyses, and unsuper-
vised and supervised automatic approaches. We now provide a brief
summary of the key findings.
Summary of methodologies and findings
Early studies mostly relied on manual analyses and characterizations
to ascertain that user posts contained information about misuse or
abuse and the types of the information posted. Typical studies man-
ually annotated small samples for further analyses.
For Twitter,
keyword-based approaches were utilized to analyze the volumes of
chatter mentioning specific medications over time, followed by anal-
yses of the chatter to better understand the patterns in volume.
Following the publication by Cameron et al,
many studies
employed NLP to parse conversations and better categorize the
meanings of the posts, moving beyond keyword-based approaches.
More recently, due to the availability of big data and the absence of
manually annotated data, some studies have employed unsupervised
topic modeling methods such as latent Dirichlet allocation (LDA)
to identify themes associated with the chatter mentioning specific
medications and to identify the abuse-associated topics. The
evaluation approaches for such studies, however, have been ad hoc
in nature, and no standard method has been proposed to determine
the performances of the topic generation methods. Only 9 of the
reviewed studies employed some form of supervised machine learn-
ing using manually annotated data. The performances of the
employed methods suggest that such methods are still very much in
their exploratory phases and the annotated datasets used are rather
small. Due to the sensitive nature of the topic, there is also a lack of
publicly available manually annotated data, which has perhaps
acted as an obstacle to community-driven method development.
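The keyword-based volume analyses mentioned above can be sketched in a few lines: filter posts by a medication keyword and count matches per month. This is a simplified illustration under stated assumptions, not the pipeline of any reviewed study; the function name, the (date, text) input format, and the sample posts are invented for the example.

```python
from collections import Counter
from datetime import date


def monthly_mention_counts(posts, keyword):
    """Count posts per (year, month) whose text contains `keyword`, ignoring case."""
    counts = Counter()
    needle = keyword.lower()
    for posted_on, text in posts:
        if needle in text.lower():
            counts[(posted_on.year, posted_on.month)] += 1
    return dict(sorted(counts.items()))


# Made-up posts, for illustration only.
sample_posts = [
    (date(2011, 11, 3), "Adderall and coffee, finals week begins"),
    (date(2011, 12, 12), "need adderall to finish this paper"),
    (date(2011, 12, 14), "no sleep, just studying"),
    (date(2011, 12, 15), "ADDERALL got me through the exam"),
]

print(monthly_mention_counts(sample_posts, "adderall"))
```

Counts of this kind measure how often a medication is mentioned, not how often it is abused. Separating the two is what the NLP and supervised classification approaches described in this section are meant to address.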
In terms of findings, all studies have reported the presence of im-
portant information regarding nonmedical use of PMs—early stud-
ies typically verified the presence of such information, while a
number of recent studies have attempted to develop methods for au-
tomatically detecting and extracting the knowledge contained
within the posts. In addition to the presence of abuse-related infor-
mation, studies reported finding chatter involving illicit trading of
drugs, discovering population subgroups engaged in abuse of spe-
cific PMs (eg, high prevalence of Adderall usage among college stu-
dents), quantifying relapse rates during recovery, measuring
geographic distributions of misuse, and their associations with other
topics (eg, overdose-related deaths). Although some studies reported
the presence of noise in generic social networks, none of the pro-
posed unsupervised methods addressed the issue. Supervised meth-
ods that apply a classification filter prior to data analysis have the
potential of filtering out varying levels of noise. Some studies com-
puted agreement/correlations between social media signals and other
sources, such as metrics from National Survey on Drug Use and
Health surveys
and geolocation-specific overdose deaths.
Broadly speaking, there is still a paucity of studies that have pro-
posed full data-centric processing pipelines for automating the use
of social media data for monitoring or characterization of PM
abuse, or to find novel insights about abuse-prone medications.
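Where studies compared social media signals with external sources, a common way to express agreement is a correlation coefficient. The sketch below computes Pearson's r in plain Python; the choice of Pearson's r is an assumption for illustration (the reviewed studies may have used other measures), and `pearson_r` is a hypothetical helper name.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up per-region values: abuse-indicating post rates vs. a survey-based estimate.
post_rates = [1.0, 2.0, 3.0, 4.0]
survey_estimates = [2.0, 4.0, 6.0, 8.0]
print(pearson_r(post_rates, survey_estimates))
```

In practice both series would be aggregated to the same unit (for example, state or month) before comparison, and a high correlation would indicate agreement in trends rather than in absolute prevalence.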
Figure 1. Number of articles meeting the inclusion criteria of our review from 2012 to 2018. The figure does not include the year 2019 because full data will not be
available until the end of the year.
Table 2. Articles published on social media mining for prescription medication abuse or misuse monitoring; their years of publication, data sources, and medications/drugs of focus, and sizes of datasets studied
Study | Year | Data source | Number of medications | Medications | Medication categories | Data size/number of instances | Big/small data
Schifano et al | 2006 | Multiple websites | Unspecified | Multiple (prescription and illicit) | Multiple | 290 websites (for prescription |
McNaughton et al | 2012 | Unspecified | 6 | Oxycodone, hydrocodone, hydromorphone, oxymorphone, morphine, tramadol | Opioids | 12 838 | Big
Davey et al | 2012 | Unspecified | Multiple | Unspecified | Unspecified | Data from 8 forums | Big
Daniulaityte et al | 2012 | Unspecified | 1 | Loperamide | Diarrhea medication | 1290 | Small
Cameron et al | 2013 | Unspecified | Multiple | Unspecified | Multiple: cannabinoids, buprenorphine, opioids, sedatives, and stimulants are mentioned. | 1 066 502 | Big
Hanson et al | 2013 | Twitter | 1 | Adderall | Stimulant | 213 633 | Big
Hanson et al | 2013 | Twitter | Multiple (all | Multiple | Multiple | 3 389 771 initial tweets | Big
McNaughton et al | 2014 | 7 Forums; | 3 | Oxycontin (oxycodone), Vicodin (hydrocodone), Dilaudid | Opioids | 88 484 | Big
McNaughton et al | 2015 | 7 Forums; | 1 | Tapentadol | Opioid | 1 940 121 | Big
MacLean et al | 2015 | Forum77 | Unspecified | Unspecified | Opioids | 2848 | Small
Shutler et al | 2015 | Twitter | Unspecified | Unspecified | Opioids | 2100 | Small
Buntain and Goldbeck | 2015 | Twitter | 21 | Multiple | Opioids + mostly illicit drugs | 821 000 000 | Big
Katsuki et al | 2015 | Twitter | Unspecified | Multiple | Opioids, benzodiazepines + | 1000 | Small
Chan et al | 2015 | Twitter | 11 (keywords) | duragesic, fentanyl, hydrocodone, hydros, oxy, oxycodone, oxycotin, oxycotton, vicodin, vikes, oxycontin | Opioids | 540 | Small
Seaman and Giraud- | 2016 | Twitter | 73 | Multiple | Opioids, benzodiazepines, stimulants, and others (including illicit opioids) | 98 691 | Big
Ding et al | 2016 | Twitter | Unspecified | Multiple (prescription and illicit) | Multiple | 116 885; 255 annotated | Big/small
Jenhani et al | 2016 | Twitter | Unspecified | Multiple (prescription and illicit) | Multiple | 80 000 | Big
Zhou et al | 2016 | Instagram | Unspecified | Vicodin + other prescription drugs; illicit drugs | Opioids and others | 1000 posts initially, followed by 16+ million posts and all posts from 2362 users |
Sarker et al | 2016 | Twitter | 4 | Oxycodone, Adderall, Quetiapine, and metformin (control | Multiple | 6400 annotated; followed by 100 000+ unlabeled posts |
Anderson et al | 2017 | Bluelight, | 1 | Bupropion + 2 comparators (amitriptyline and venlafaxine) | Antidepressant | 7756 | Small
Kalyanam et al | 2017 | Twitter | 3 | Percocet, OxyContin and Oxyco- | Opioids | 11 million | Big
Phan et al | 2017 | Twitter | Unspecified | OxyContin, Ritalin and opiates + illicit drugs | Opiates (illicit and prescription) | 300 | Small
Yang et al | 2017 | Instagram | Unspecified | Multiple (prescription + illicit) | Multiple | 4819 from Instagram; 4329 from |
Chary et al | 2017 | Twitter | Unspecified | Prescription opioids | Opioids | 3 611 528 | Big
D’Agostino et al | 2017 | Reddit | Unspecified | Unspecified | Opioids | 100 posts | Small
Cherian et al | 2018 | Instagram | 1 | Codeine | Opioid | 1156 | Small
Graves et al | 2018 | Twitter | Unspecified | Multiple (prescription + illicit) | Opioids | 84 023 | Big
Hu et al | 2018 | Twitter | Unspecified | Multiple (prescription + illicit) | Multiple | More than 3 million raw tweets; 1794 annotated |
Chary et al | 2018 | Lycaeum | Unspecified | Multiple (prescription + illicit) | Sedative-hypnotic, hallucinogen, stimulant, nootropic, psychiatric, anticholinergic, analgesic, antipyretic, antiemetic, antihypertensive, cannabinoid, and | 9289 | Small
Fan et al | 2018 | Twitter | Unspecified | Multiple (prescription + illicit) | Opioids | 4 447 507 tweets from 4 051 423 users; 19 722 tweets from 2312 users annotated |
Bigeard et al | 2018 | Doctissimo | Unspecified | Multiple (prescription + illicit) | Antidepressants, anxiolytics, and mood disorder drugs | 1850 annotated posts | Small
Chen et al | 2018 | French forums: | 1 | Methylphenidate (trade names: Ritalin, Quasym, Concerta, | Stimulant | 3443 | Small
Pandrekar et al | 2018 | Reddit | Unspecified | Multiple (prescription + illicit) | Opioids | 51 537 | Big
and Bian | 2018 | Twitter | 13 (prescription | Multiple (prescription + illicit) | Opioids | 310 323 | Big
Hu et al | 2018 | Twitter | Unspecified | Multiple | Multiple | 3 million tweets with 6794 annotated tweets |
Adams et al | 2019 | Reddit and | Unspecified | Opioids, fentanyl, cocaine, methamphetamine, marijuana, and | Multiple | Not Available or Applicable | Big
Lu et al | 2019 | Reddit | Unspecified | Unspecified | Opioids | 309 528 | Big
Tibebu et al | 2019 | Twitter | 14 (prescription | Multiple (prescription + illicit) | Opioids | 2602 | Small
Chancellor et al | 2019 | Reddit | Unspecified | Multiple | Opioids and opioid use disorder recovery drugs | 1 446 948 posts from 63 unique |
Table 3. Summary of the primary objectives, approaches, and findings from the studies included in this review
Study Primary objective(s) and/or significance Primary approach(es) Primary finding(s)
Schifano et
First study to explore web forums for
drug abuse research. Objective was to
analyze data from “web pages” re-
lated to information on consumption,
manufacture and sales of psychoactive
Manual exploration of search engines using
drug names as keywords. User posts from
1633 websites were analyzed primarily
for contents (personal intake and/or trad-
ing) and stance (pro- vs anti-drug).
18% websites included pro-drug chatter,
10% included harm reduction, and
10% included drug trading. Previ-
ously unknown coingestion patterns
were discovered.
et al
To explore the sentiment expressed by
opioid abusers and their endorsement
behavior on internet forums. First
study to employ automated methods
for analyzing social media chatter re-
lated to abuse or misuse.
Mixed-effects multinomial logistic regres-
sion was applied to model the probability
of endorsing, discouraging, mixed, or
unclear messages per compound. Endorse-
ment to discouragement ratios were esti-
mated for each compound.
The following list (ordered), in terms of
endorsement ratio, was obtained for
the included drugs: oxymorphone,
hydromorphone, hydrocodone, oxy-
codone, morphine, and tramadol.
Daniulaityte et
To analyze nonmedical use of lopera-
mide, as reported on a specific patient
Retrieved posts mentioning from 2005 to
2011. A random sample of 258 posts
were manually annotated to identify in-
tent, dosage, and side effects.
The discussion suggested that high doses
of loperamide are used to address opi-
oid withdrawal symptoms or as a
methadone substitute.
Davey et al | To analyze the key features of drug-related Internet forums and the commu- | Categories, themes, and attributions were manually analyzed from 8 forums (quali- | The study identified unique communities of recreational drug users that can provide information about new drugs and drug compounds.
Cameron et al | The development of a semantic web platform called PREDOSE (PREscription Drug abuse Online Surveillance and Epidemiology), designed to facilitate the epidemiologic study of prescription (and related) drug abuse practices using social media. | A drug abuse ontology is used to recognize 3 types of data, namely (1) entities, (2) relationships, and (3) triples. Basic natural language processing approaches are used to extract entities and relationships, and to identify sentiment. | The reported approach obtains 85% precision and 72% recall in entity identification on a manually created gold standard dataset. In manual evaluation, the system obtains 36% precision in relationship identification and 33% precision in triple extraction.
Hanson et al | To identify variations in the volume of Adderall chatter by time and geographic location in the United States, as well as commonly mentioned side effects and coingested substances. | Tweets containing the term Adderall were collected from November 2011 to May 2012, and a keyword-based approach was used to detect coingested substances and side effects using manual analysis of geolocation clusters and temporal pattern. | Twitter posts confirm Adderall as a study aid among college students. Twitter may contribute to normative behavior regarding its abuse.
Hanson et al | To analyze the networks of users who report abusing/misusing prescription | Tweets mentioning prescription medications were collected from Twitter as well as users mentioning prescription medications multiple times. Social circles of 100 users were analyzed, particularly their discussions associated with prescription drug | Twitter users who discuss prescription drug abuse online are surrounded by others who also discuss it, potentially reinforcing a negative behavior and social norm.
et al | To evaluate the reactions to the introduction of reformulated OxyContin. To identify methods aimed to defeat the abuse-deterrent properties of the | Posts spanning over 5 years collected from 7 forums were evaluated before and after the introduction of reformulated OxyContin on August 9, 2010. Qualitative and quantitative analyses of the posts were performed to assess proportions and | Sentiment profile of OxyContin changed following reformulation. OxyContin was discouraged significantly more following reformulation. Frequency of posts reporting abuse decreased over
et al | To assess the amount of discussion and endorsement for abuse of tapentadol and comparator drugs. | Internet messages posted between January 1, 2011, and September 30, 2012, on 7 web forums were evaluated. Proportions of posts and unique authors discussing tapentadol were compared with 8 comparator compounds. | Recreational abusers appeared to be less interested in discussing tapentadol
MacLean et al | To assess the effectiveness of a specialized forum in helping misusers/abusers of prescription opioids. | A taxonomy describing the phases of addiction was developed, and the activities and linguistic features across phases of use/abuse, withdrawal, and recovery were examined. Statistical classifiers were developed to identify addiction, relapse, and recovery phases. | According to the forum data, almost 50% of recovering abusers relapsed, but their prognosis for recovery is fa-
Shutler et al | Qualitatively assess tweets mentioning prescription opioids to determine if they represent abuse or nonabuse, or were not characterizable. To assess the connotation (positive, negative, | Manual categorization of posts into predefined categories: abuse, nonabuse, and not characterizable; and, in terms of connotation, positive, negative, and not char- | Twitter can be a potential resource for monitoring prescription opioid use, as abuse is commonly described by users (mostly with a positive connotation).
Journal of the American Medical Informatics Association, 2019, Vol. 00, No. 0 7
Downloaded from by guest on 19 December 2019
Buntain and | To assess how tweets can augment a public health program that studies emerging patterns of illicit drug use. | The article proposed an architecture for collecting vast numbers of tweets over time. Automatic topic modeling was employed to identify topics, and temporal and geolocation-based analyses were discussed. | An architecture for mining Twitter data for drug abuse monitoring (illicit and
Katsuki et al | To conduct surveillance and analysis of tweets to characterize the frequency of prescription medication abuse-related chatter, and identify illegal online pharmacies involved in drug trading. | Tweets collected using medication keywords and street names were manually coded to indicate misuse or abuse behavior and attitude (positive/negative). Supervised machine learning automatically identified over 100 000 tweets mentioning abuse or promotion. Word frequency-based experiments identified associations. Geolocations were analyzed for geographic | The study found a large number of tweets (over 45 000) that directly marketed prescription medications illegally. Supervised machine learning showed adequate performance in automatic detection.
Chan et al | To manually analyze opioid chatter from | Data were collected from Twitter over 2 weeks and manually coded (eg, personal vs general experiences including nonmedical use, and user sentiments toward opioids) for analysis. | Personal opioid misuse was the most common theme among the tweets ana-
Seaman and | To present statistics about volume as well as attitudes toward distribution (selling/buying) and need. | Only a small number (500) of tweets were manually analyzed. New York-based tweets showed that buying/selling and “need” were the most common topics associated with the drug names. | Twitter users often express the need for Adderall and Xanax; chatter related to specific drugs is directly impacted by media events involving such sub-
Ding et al | To detect abuse-related posts and discover new, unknown street names for | A sample of Instagram posts was annotated for medical use, illicit use, not related, or not sure. Topic modeling (LDA) was used to track changes in hashtags. Hand-annotated tweets were used to identify proportions for abuse-related tweets. Manual analysis of hashtags was performed to assess the performance of the word embeddings. | The topic modeling approach retrieves drug-related posts with 78.1% accuracy. Word embeddings learned from social media data are useful for finding new hashtags and street terms associated with abuse.
Jenhani et al | To propose methods for automatically detecting drug-abuse-related events from Twitter. | A hybrid approach consisting of a rule-based component and supervised machine learning is described. Automatically annotated tweets are used for evaluation, showing 0.51 F-score. | Machine learning-based approach can detect events not detected by rules. Findings are limited by the fact that only automatically annotated data are used for evaluation, which is prone to
Zhou et al | To explore the possibility of using multimedia data (images and text) to discover drug usage patterns at a fine-grained level with respect to demo- | Posts were retrieved from Instagram using drug-related hashtags. An initial set of hashtags was used to create a dictionary of hashtags. User demographics, such as age and gender, were predicted using face-image analysis algorithms. Patterns of drug usage associated with demographics, time and location were then analyzed. | Findings from social media mining are consistent with findings of the NSDUH (qualitatively), even at a fine-grained level.
Sarker et al | To verify that abuse information for abuse-prone medications in social media is higher than non-abuse-prone medications. To assess the possibility of automatically detecting abuse via NLP and machine learning. To compare automatically classified temporal data with past manual analysis. | Manually annotated 6400 tweets to indicate abuse vs nonabuse. Evaluation of automatic classification was performed via 10-fold cross-validation; tests for proportions of abuse-related posts between case and control medications. Compared classified Adderall tweets with past manual analy- | There is significantly more abuse-related information for abuse-prone medications compared with non-abuse-prone medications. Supervised machine learning is an effective approach for automated monitoring.
Anderson et al | To determine if misuse or abuse could be detected via social media listening. To describe and characterize social media | Posts were collected using generic, brand, and vernacular brand names and were reviewed manually by coders. | Agreement among raters in manual categorization was low (0.448). Analysis of posts revealed that 8.61% referenced misuse or abuse, including routes of intake. Web forums present a valuable new source for monitoring nonmedical use of medications.
Kalyanam et al | To demonstrate that the geographic variation of social media posts mentioning prescription opioid misuse strongly correlates with government estimates of prescription opioid misuse in the previous month. | Tweets were collected from 2012 to 2014, using opioid keywords. Tweets were automatically quantified using semantic distance with word centroids. Unsupervised classification/clustering was used to group tweets mentioning opioid misuse. Volume of abuse-related chatter was correlated with NSDUH surveys, with separate correlations for different age groups. | Mentions of misuse or abuse of prescription opioids on Twitter correlate strongly with state-by-state NSDUH
Phan et al | To verify that tweets contain patterns of drug abuse. To study the correlations among different levels of drug usage including abuse, addiction and death, and assess the applicability of large-scale systems for online social network-based drug abuse monitoring. | Manual annotation of opiate-mentioning tweets and basic feature selection methods were developed. Several machine learning classifiers were then trained and evaluated. Word co-occurrence patterns for abuse-indicating tweets were identified and used as features in machine learning experiments. Correlations between words and drug terms were computed. | The best performance was obtained by a decision tree-based classifier, but performance was low compared with human judgment.
Yang et al | To propose a multitask learning method to leverage images from Instagram for recognition of drug abuse. To identify user accounts involved in illicit drug | A multitask learning method was employed for image classification (stage 1) and accounts of interest were identified. Drug-related patterns, temporal patterns, and relational information patterns were detected from the user timelines and potential dealer accounts were detected (stage 2). | A reproducible machine learning model for tracking and combating illicit drug trade on Instagram. The framework can be reused and improved for practical tracking and combating of illicit drug trade on Instagram.
Chary et al | Demonstrate that the geographic variation of tweets mentioning prescription opioid misuse strongly correlates with government estimates in the previous | Basic preprocessing was performed on tweets from 2012 to 2014 (signal tweets and basal tweets) collected by keywords linked to prescription opioid use (misspellings as well). Tweets were manually annotated and geodata were collected. Compared tweets with NSDUH. | State-by-state correlation between Twitter and NSDUH data was high. Correlation was strongest in NSDUH data for 18- to 25-year-olds.
D’Agostino et al | To examine the online Reddit community’s ability to target and support individuals recovering from opiate | Collected 100 Reddit posts and their comments from August 19, 2016. Manually annotated the posts/comments according to DSM-5 criteria to determine the addiction phases of individual users. | Demonstrated the supportive environment of the online recovery community and the willingness to share self-reported struggles to help others.
Cherian et al | To characterize information about codeine misuse through analysis of public posts on Instagram to understand text phrases related to misuse. | 1156 posts were collected over 2 weeks from Instagram via hashtags and text associated with codeine misuse. Themes and culture around misuse were identified through manual analysis. | 50% of reported abuse involved combining codeine with soda (lean). Common misuse mechanisms included coingestion with alcohol, cannabis, and benzodiazepines.
Graves et al | To determine whether Twitter data could be used to identify geographic differences in opioid-related discussion. To study whether opioid topics were significantly correlated with opioid overdose death rate. | Tweets were collected using keywords from 2009 to 2015. Topic modeling (LDA) was used to summarize contents into 50 topics. The correlations between topic distribution and census region, census division and opioid overdose death rates were quanti- | Selected topics were significantly correlated with county- and state-level opioid overdose death rates.
Hu et al | To build a system for effective drug abuse-related data collection from social media and develop an annotation strategy for categorization of data (abuse vs nonabuse) and a deep learning model that can automatically categorize tweets. | More than 800 keywords were used to collect data, followed by crowd-sourced annotation of 4985 tweets. Deep learning model built on small annotated data and evaluated via 10-fold cross-validation. Geographic distribution over 100 000 tweets (positively classified) were ana- | The crowd-sourced annotation method enabled annotation at a much faster rate and lower cost. Deep learning model achieved state-of-the-art classification performance. Semantic analysis of tweets revealed drug abuse behaviors. Geolocation-based analysis enabled the identification of geographic hotspots.
Chary et al | To demonstrate that data concerning polysubstance use can be extracted from online user posts, and that these data can be used to infer novel as well as known coingestion patterns. | Posts were retrieved via web scraping and basic natural language processing methods were applied to identify possible mentions of drugs. Correlation was computed between mentions of pairs of drugs to identify common ingestion patterns based on mentions of drugs. | 183 coingestion combinations were discovered, including 44 that had not been studied before.
Fan et al | To propose a novel framework named AutoDOA to automatically detect opioid addicts from Twitter. | Five groups of annotators (18 persons) with domain expertise labeled 19 722 tweets from 2312 users to identify potential addicts. Using only annotations with full agreement, an approach relying on meta-path-based similarity was used to perform transductive classification of the users based on the tweets, their likes, and their | Evaluation on annotated data shows that this method outperforms other approaches. A case study on 1132 identified heroin addicts qualitatively shows similarities with CDC estimates of overdoses.
Bigeard et al | To create a typology for drug abuse or misuse and methods for automatic detection and propose methods for classification of drug misuses by analyzing user-generated data in French social | 1850 posts were annotated into 4 categories: misuse, normal use, no use, and unable to decide. Categories were used to create a typology of misuses and to evaluate an automatic system. Several machine learning algorithms were then trained on artificially balanced data to categorize among misuse, no use, and normal use. | Multinomial naïve Bayes is shown to achieve the best performance on the artificially balanced data. The manual categorization of the data reveals an elaborate typology of intentional and unintentional misuse. The annotator agreements are relatively low, showing the difficulty of the misuse annotation task.
Chen et al | To qualitatively analyze posts about methylphenidate from French patient forums including an analysis of information about misuse or abuse. | Data were collected from French social networks that mentioned methylphenidate keywords. Text mining methods such as named entity recognition and topic modeling were used to analyze the chatter, including the identification of adverse | Analysis of the data revealed cases of misuse of the medication and abuse.
Pandrekar et al | To demonstrate the potential of analyzing social media (specifically Reddit) data to reveal patterns about opioid abuse at a national level. | Collected 51 537 Reddit posts between January 2014 and October 2017; evaluated psychological categories of the posts and characterized the extent of social support; performed topic modeling to determine major topics of interest and tracked differences between anonymous and nonanonymous posts. | The information shared on Reddit can provide a candid and meaningful resource to better understand the opioid
tura and | To study and understand (1) the contents of opioid-related discussions on Twitter, (2) the coingestion of opioids with other substances, (3) the trajectory of individual-level opioid use behavior, and (4) the vocabulary used to discuss | 310 323 tweets were collected over 4 months, and 124 143 tweets were included in the study following rule-based filtering. Keyword frequency and co-occurrence-based methods were applied to meet the objectives of the study. | Although most of the chatter talked about use of opioids as legitimate pain relievers, there was considerable discussion about misuse or abuse and coingestion of opioids with other substances; 18 new terms for opioids, which were previously not encoded, were discovered.
Hu et al | To establish a framework for automatic, large-scale collection of tweets based on supervised machine learning and crowd sourcing, with a self-taught learning approach for automatic de- | Data were collected from Twitter using keywords and, following an initial annotation by the authors, crowdsourcing was utilized for obtaining reliable annotations. An iterative automatic classification approach is applied where the training data are augmented with machine-classified tweets to improve performance. Both traditional and neural network-based classifiers were experimented with. | The neural network-based (convolutional and recurrent) deep, self-taught learning algorithms outperformed traditional models in the binary classification task with 86% accuracy.
Adams et al | To demonstrate the benefit of mining platforms other than Twitter, and the | The synonym discovery method was compared for finding terms relevant to mari- | The synonym discovery method yielded more synonyms from Reddit than
Our review covers research efforts that have attempted to mine user-posted web and social media data for studying, curating, monitoring, or characterizing PM abuse-related information. The 39 studies that met our inclusion criteria unanimously concluded that social media is a potentially useful resource for studying PM abuse because of the considerable amount of unfiltered information available. From the perspective of the methodology employed, the reviewed studies fall into 3 broad categories: (1) manual analysis, (2) automatic unsupervised analysis, and (3) supervised analysis.
Most studies employed some form of manual analysis, primarily targeted toward hypothesis generation (eg, “does social media provide information about PM abuse?” and “can we study information about mechanisms of PM abuse from social media?”) and hypothesis testing via manual annotation of samples of data. Such analyses generated the crucial early hypotheses and helped establish social media as a valuable resource for toxicovigilance research. However, they are limited to small data samples, are difficult to reproduce, and cannot be used for continuous analysis. Therefore, despite their effectiveness in some cases, manual approaches are not suitable for long-term, data-centric efforts that take advantage of the primary attraction of social media: the continuous generation of big data. We have also reached a point at which further manual validation of hypotheses regarding the presence of abuse-related information at the post level is not required.
Unsupervised approaches have primarily focused on big data to identify trends, for example, through analyses of the volume of data to estimate abuse rates at specific time periods or, more recently, through topic modeling to identify abuse-related topics associated with selected medications. Volume-oriented unsupervised approaches (eg, keyword based) are capable of tracking interests and discovering trending hidden topics in real time (eg, via LDA). However, studies have shown that only small proportions of the data may present abuse information, so such methods are likely to be significantly affected by unrelated chatter, and the conclusions derived may be particularly unreliable when the proportions of abuse-indicating posts for specific medications are low. Some of the studies mentioned in Tables 2 and 3 have shown that, for certain medications, a very small portion of the social media chatter may be associated with abuse. For example, a significant portion of Twitter chatter mentioning opioids is generated by users sharing general information, such as news articles, rather than personal experiences. This characteristic of the data is not unique to the problem of PM abuse, but is generalizable across social media-based datasets, and has been observed in other studies including influenza and vaccine monitoring, cancer com- and pharmacovigilance. Thus, especially when working with generic social media data, applying a supervised classification filter before the analysis of topics or trends is perhaps methodologically more robust.
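The effect of filtering before counting can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the sample posts are invented, and `toy_filter` is a hypothetical cue-word stand-in for a trained supervised classifier, not a method used by any of the reviewed studies.

```python
from typing import Callable, Iterable, List, Tuple


def keyword_match(post: str, keywords: Iterable[str]) -> bool:
    """Return True if any keyword occurs in the post (case-insensitive)."""
    text = post.lower()
    return any(k in text for k in keywords)


def volumes(posts: List[str],
            keywords: List[str],
            is_abuse_related: Callable[[str], bool]) -> Tuple[int, int]:
    """Return (keyword-matched volume, volume remaining after the filter)."""
    matched = [p for p in posts if keyword_match(p, keywords)]
    relevant = [p for p in matched if is_abuse_related(p)]
    return len(matched), len(relevant)


def toy_filter(post: str) -> bool:
    """Hypothetical stand-in for a trained abuse/nonabuse classifier."""
    cues = ("to get high", "snorted", "crushed")
    return any(c in post.lower() for c in cues)


# Invented example posts: mostly news and medical use, one abuse report.
posts = [
    "New report on the opioid crisis released today",
    "Doctor prescribed oxycodone after my surgery",
    "crushed and snorted an oxycodone last night",
    "Oxycodone recall announced, news at 9",
    "Going for a run this morning",
]

raw, filtered = volumes(posts, ["opioid", "oxycodone"], toy_filter)
print(raw, filtered)  # 4 1
```

In this toy sample, a keyword-volume estimate would count 4 posts, while only 1 actually reports nonmedical use, which is the kind of gap the paragraph above warns about.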
Few studies have employed supervised classification approaches to identify salient information, as supervised learning algorithms require large volumes of data to be manually annotated for training, which is time consuming and expensive. However, supervised approaches, because of their ability to filter out irrelevant information, are likely to have greater longevity in the constantly evolving sphere of social media. The time spent annotating data for supervised classification may be valuable for long-term studies and stable systems, provided the annotations follow explicit guidelines and are portable across studies.
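As a rough illustration of what supervised classification on annotated posts involves, the following sketch trains a small multinomial naive Bayes classifier with add-one smoothing in plain Python. The six labeled posts and the label names are invented for the example; a real system would need thousands of annotated posts, richer features, and a proper evaluation (eg, cross-validation).

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenization (deliberately simplistic)."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(count / n_docs)  # log prior
            for tok in tokens:
                # Smoothed log likelihood of each token given the class.
                score += math.log(
                    (self.word_counts[label][tok] + 1) / (total + vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented, manually "annotated" training posts (illustrative only).
train_texts = [
    "took adderall to pull an all nighter not prescribed",
    "popped two xanax to get high",
    "snorted oxy at the party",
    "doctor prescribed xanax for my anxiety",
    "picked up my adderall refill from the pharmacy",
    "news article about oxycodone regulation",
]
train_labels = ["abuse", "abuse", "abuse", "nonabuse", "nonabuse", "nonabuse"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("snorted adderall to get high"))  # abuse
print(model.predict("doctor prescribed my refill"))   # nonabuse
```

The annotation effort shows up here as `train_texts` and `train_labels`: the classifier can only be as consistent as the guidelines under which those labels were produced.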
Despite the promise of supervised classification approaches, the performances reported by the reviewed systems are typically low.
Table 3. continued
Study | Primary objective(s) and/or significance | Primary approach(es) | Primary finding(s)
Adams et al (continued) | use of word embeddings for keyword synonym discovery resulting in increased collected data. | marijuana and opioids from 2 sources: Twitter and Reddit. | Twitter. Twitter, however, provided more slang terms.
Lu et al | To demonstrate the insights that can be obtained from employing data mining techniques on social media to better understand drug addiction. | Collected 309 528 posts from 125 194 unique Reddit users between January 2012 and May 2018. Used a trained classifier to predict transition from casual drug discussion to drug recovery. Used a Cox regression model to calculate the likelihood of the transition. | Found that certain utterances and linguistic features of one’s post can help predict the transition to drug recovery and determined specific drugs that are associated more with transition to recovery, which offers insight into drug
Tibebu et al | To assess if Twitter may be used as a data source for studying population-level opioid use and perceptions in | Collected 2602 tweets over 1 month and manually categorized 826 tweets to study usage and perceptions. | The analyzed tweets presented information about medical usage of opioids, impacts of opioid use on family and friends, and drug use in public places. Tweets representing user perceptions were mostly associated with the keywords heroin, fentanyl, and opioids.
Chancellor et al | To assess if Reddit contains information on clinically unverified alternative treatments to opioid use disorder, develop a machine learning approach for discovering posts representing alternative treatments, and identify commonly reported agents for successful | A transfer learning approach was developed to automatically detect posts discussing recovery from opioid use disorder and was applied to all the posts collected from 63 subreddits. An approach involving regular expressions and word embeddings is used to identify alternative treatments from the positively classified posts. | The transfer learning-based classification approach obtained accuracy of 91.7%, leading to 93 104 recovery posts. Common drugs discovered for alternative treatments included both prescription (eg, Loperamide, Xanax, Valium, Klonopin, gabapentin) and nonprescription (eg, kratom) drugs.
CDC: Centers for Disease Control and Prevention; LDA: latent Dirichlet allocation; NSDUH: National Survey on Drug Use and Health.
This is a known issue for social media data: the text is very difficult to classify automatically because of the factors discussed previously. Social media data can be hard to decipher even for humans, as contents can be ambiguous. Studies that double-annotated sample data typically reported low agreement. To improve the performance of future classification methods, it is essential to increase human agreement rates during annotation tasks. Only 10 (25.6%) of the reviewed articles in our sample reported the creation, presence, or use of a detailed annotation guide, guidelines, or coding rules that the annotators followed to improve agreement rates. In our view, future research should put more focus on developing thorough annotation guidelines that can be used as a reference for annotating data. For researchers from distinct institutions attempting to perform identical tasks, the use of publicly available, elaborate guidelines will enable the direct comparison of research methodologies (eg, classification performances), even if the data are not shared. There is also a shortage of publicly available annotated data for tasks such as automatic abuse detection. The recent adoption of social media for similar tasks has been accelerated by the creation of publicly available annotated data (eg, for pharmacovigilance). However, there have been no such efforts for studying PM abuse from social media, and such efforts should accelerate the research in this space as well. Such data preparation and release efforts also need to consider the potential ethical
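Human agreement on a double-annotated sample is commonly quantified with a chance-corrected statistic. The sketch below computes Cohen's kappa for 2 annotators; the choice of kappa is an assumption made for illustration (the reviewed studies do not all report the same statistic), and the label sequences are invented.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


# Invented annotations for 10 posts (illustrative only).
annotator_1 = ["abuse", "abuse", "nonabuse", "nonabuse", "abuse",
               "nonabuse", "nonabuse", "nonabuse", "abuse", "nonabuse"]
annotator_2 = ["abuse", "nonabuse", "nonabuse", "nonabuse", "abuse",
               "nonabuse", "abuse", "nonabuse", "abuse", "nonabuse"]

print(round(cohen_kappa(annotator_1, annotator_2), 3))  # 0.583
```

Here the annotators agree on 8 of 10 posts (80%), but because half of that agreement is expected by chance, kappa is only about 0.58, which shows why raw percentage agreement can overstate annotation reliability.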
We conclude our review by proposing a possible data-centric NLP and machine learning framework informed by the extensive review presented in this paper. The proposed framework may be used for monitoring PM abuse from social media and for related research problems within the broader health domain that have characteristics similar to PM abuse.
Framework for mining social media for prescription medication abuse
Our proposed framework consists of a data processing pipeline that starts from data collection, which is often not trivial for social media-based studies. The data collection strategy has to take into account common misspellings and street names for medications, as many abuse-prone medications have commonly used street names (eg, “oxy,” “percs,” “addy,” “xanny”; a list of such street names is provided by the DEA). Collection is particularly difficult for generic social networks, such as Twitter, compared with targeted online health communities, because of the large numbers of misspellings and nonstandard terms. Following data collection, it is essential to filter out noise or irrelevant posts, which are likely to make up most of the retrieved data. This is best achieved by classification methods, which not only filter out noise but may also classify the posts into relevant categories (eg, medical consumption vs abuse). Considering the reported performances of past systems, future efforts are needed to improve the state of the art in PM abuse classification. These strategies and steps of data collection followed by supervised classification are also applicable to research problems that resemble PM abuse monitoring. Such studies include, for example, research on alcohol misuse or abuse and on medical and nonmedical consumption of marijuana from social media. For both of these research topics, as for PM abuse, consumption alone, without additional evidence, may not indicate misuse or abuse.
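One way to make the collection step tolerant of street names and misspellings is to combine an exact-match lexicon with approximate matching for longer terms. The sketch below is a minimal illustration under stated assumptions: the `LEXICON` mapping is a small hypothetical example built from the street names quoted above, and the edit-distance threshold and minimum term length are arbitrary choices, not values taken from the reviewed studies.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# Hypothetical lexicon: surface form -> canonical medication name.
LEXICON = {
    "oxycodone": "oxycodone", "oxy": "oxycodone",
    "percocet": "percocet", "percs": "percocet",
    "adderall": "adderall", "addy": "adderall",
    "xanax": "xanax", "xanny": "xanax",
}


def find_mentions(post, max_distance=1, min_fuzzy_len=6):
    """Return canonical medications mentioned in a post.

    Exact matches are accepted for every lexicon entry. Approximate
    matches are only attempted against longer terms, because short
    street names would otherwise match many unrelated words.
    """
    mentions = set()
    for token in post.lower().split():
        token = token.strip(".,!?;:\"'()#@")
        if token in LEXICON:
            mentions.add(LEXICON[token])
            continue
        for term, canonical in LEXICON.items():
            if (len(term) >= min_fuzzy_len
                    and edit_distance(token, term) <= max_distance):
                mentions.add(canonical)
    return mentions


print(find_mentions("Need an aderall for finals"))       # {'adderall'}
print(sorted(find_mentions("took two percs and a xanny")))  # ['percocet', 'xanax']
```

Restricting approximate matching to longer terms trades recall for precision; even so, posts retrieved this way still need the downstream classification filter, since a mention alone does not indicate misuse.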
Following the effective removal of unrelated data or noise, the relevant chatter can be passed on for further NLP and machine learning-based processing for the discovery of knowledge. In Figure 2, we have specified a few possible studies. For example, once the noise has been removed, it is appropriate to employ unsupervised chatter analysis methods such as topic modeling to discover salient topics closely related to PM misuse or abuse. While topic modeling methods, such as LDA, without any prior filters may retrieve mostly irrelevant latent topics, the application of a classification filter helps ensure the relevance of the topics to PM abuse. Geotagged social media data, if available, can be utilized to compare abuse- or misuse-related information across different locations. Similarly, timestamps can be used to analyze temporal patterns of abuse for different medications. Combinations of unlabeled methods, coupled with geolocation and temporal information, can be used to compare information about distinct medications (eg, Vicodin and Percocet) and categories of medications (eg, opioids and benzodiazepines). Finally, studying longitudinal data related to abuse from groups of users may enable us to detect cohort-level behavioral patterns and trends.
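The temporal and geographic comparisons described above reduce, in the simplest case, to grouping classified posts by medication, time window, and location. The following sketch assumes a hypothetical record layout (`medication`, `timestamp`, `state`, `label`) and invented posts; the field names and the monthly granularity are illustrative choices only.

```python
from collections import Counter
from datetime import datetime

# Invented, already-classified posts (illustrative only).
posts = [
    {"medication": "Vicodin", "timestamp": "2019-01-14T09:30:00",
     "state": "PA", "label": "abuse"},
    {"medication": "Vicodin", "timestamp": "2019-01-20T18:05:00",
     "state": "PA", "label": "nonabuse"},
    {"medication": "Vicodin", "timestamp": "2019-02-03T22:10:00",
     "state": "GA", "label": "abuse"},
    {"medication": "Percocet", "timestamp": "2019-01-05T11:00:00",
     "state": "GA", "label": "abuse"},
    {"medication": "Percocet", "timestamp": "2019-01-25T07:45:00",
     "state": None, "label": "nonabuse"},
    {"medication": "Percocet", "timestamp": "2019-01-28T13:20:00",
     "state": "PA", "label": "nonabuse"},
]


def monthly_abuse_proportion(posts):
    """Proportion of abuse-labeled posts per (medication, YYYY-MM)."""
    totals, abuse = Counter(), Counter()
    for p in posts:
        month = datetime.fromisoformat(p["timestamp"]).strftime("%Y-%m")
        key = (p["medication"], month)
        totals[key] += 1
        if p["label"] == "abuse":
            abuse[key] += 1
    return {key: abuse[key] / totals[key] for key in totals}


def abuse_counts_by_state(posts):
    """Count abuse-labeled posts per state, skipping posts without geodata."""
    return Counter(p["state"] for p in posts
                   if p["label"] == "abuse" and p.get("state"))


for key, proportion in sorted(monthly_abuse_proportion(posts).items()):
    print(key, round(proportion, 2))
print(dict(abuse_counts_by_state(posts)))
```

Reporting proportions rather than raw counts makes medications with very different chatter volumes comparable, and skipping posts without geodata reflects the fact that only a fraction of social media posts are geotagged.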
Figure 2. High-level framework for deriving knowledge about prescription medication abuse from social media big data.
Research reported in this publication was supported by the National Institute on Drug Abuse of the National Institutes of Health under Award Number R01DA046619. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
AS contributed significantly to the article review and selection process, wrote the majority of the content in the manuscript, and performed critical analysis and comparison of the included studies. AD contributed significantly to the article review and selection process, helped the primary author to summarize included studies, and contributed to the preparation of the manuscript. JP provided toxicology domain expertise for the review, helped identify key articles, and contributed to the manuscript writing and revisions.
None declared.
... In terms of the specific health topic of interest, 22/58 papers included any health condition [7,15,16,21,30,32,34,36,37,39,44,45,47,52,54,55,62,65,69,[72][73][74]. Twelve focused on mental health conditions [20,23,24,28,42,48,50,53,58,64,66,71], 9 on adverse drug reactions (ADRs) [31,43,46,51,57,67,68,70,75], 4 on infectious diseases [25,29,40,41], two each on chronic disease [26,56], substance misuse [49,60], public health [27,59], breast cancer [33,38] and with one each for symptom identification [35], use of complementary and alternative medicine (CAM) therapies [61] and the reasons for existing use by health researchers [63]. ...
... Two reviews looked at the misuse of prescription medicines [49,60]. Kim [49] used findings from existing Twitter analysis to create a typology of SM big data analysis on the topic based on the four conceptual dimensions of poster characteristics, communication characteristics, predictors and mechanism for the discussion of problematic use, and the psychological or behavioural consequences of discussing it on social media. ...
... Among the analysis methods used, sentiment analysis was the most commonly utilised [15, 16, 20, 32, 37, 42-44, 48, 49, 53, 61, 62, 64, 71, 74]. Our review found that much early sentiment analysis was often performed on small volumes of text, using qualitative or content analysis methods [46,60]. Developed originally as a marketing tool for business to understand consumer opinion towards their product [15], sentiment analysis has frequently been used to identify emotions that can signify a posters thinking and mood when trying to identify potential suicide risk [15,20,28,53], to track ADRs and to interpret patient reviews of health care services [74]. ...
Full-text available
Purpose Social media has led to fundamental changes in the way that people look for and share health related information. There is increasing interest in using this spontaneously generated patient experience data as a data source for health research. The aim was to summarise the state of the art regarding how and why SGOPE data has been used in health research. We determined the sites and platforms used as data sources, the purposes of the studies, the tools and methods being used, and any identified research gaps. Methods A scoping umbrella review was conducted looking at review papers from 2015 to Jan 2021 that studied the use of SGOPE data for health research. Using keyword searches we identified 1759 papers from which we included 58 relevant studies in our review. Results Data was used from many individual general or health specific platforms, although Twitter was the most widely used data source. The most frequent purposes were surveillance based, tracking infectious disease, adverse event identification and mental health triaging. Despite the developments in machine learning the reviews included lots of small qualitative studies. Most NLP used supervised methods for sentiment analysis and classification. Very early days, methods need development. Methods not being explained. Disciplinary differences - accuracy tweaks vs application. There is little evidence of any work that either compares the results in both methods on the same data set or brings the ideas together. Conclusion Tools, methods, and techniques are still at an early stage of development, but strong consensus exists that this data source will become very important to patient centred health research.
... The tools were evaluated based on their compatibility with the dataset. Sarker et al. (2020) presented a framework based on data-centric NLP and ML to monitor the abuse of social-media-based prescription medicine (PM). The authors first reviewed 39 studies targeting social media-based PM abuse or misuse to analyze trends. ...
Full-text available
Artificial intelligence (AI) relies on data and algorithms. State-of-the-art (SOTA) AI smart algorithms have been developed to improve the performance of AI-oriented structures. However, model-centric approaches are limited by the absence of high-quality data. Data-centric AI is an emerging approach for solving machine learning (ML) problems. It is a collection of various data manipulation techniques that allow ML practitioners to systematically improve the quality of the data used in an ML pipeline. However, data-centric AI approaches are not well documented. Researchers have conducted various experiments without a clear set of guidelines. This survey highlights six major data-centric AI aspects that researchers are already using to intentionally or unintentionally improve the quality of AI systems. These include big data quality assessment, data preprocessing, transfer learning, semi-supervised learning, MLOps, and the effect of adding more data. In addition, it highlights recent data-centric techniques adopted by ML practitioners. We addressed how adding data might harm datasets and how HoloClean can be used to restore and clean them. Finally, we discuss the causes of technical debt in AI. Technical debt builds up when software design and implementation decisions run into "or outright collide with "business goals and timelines. This survey lays the groundwork for future data-centric AI discussions by summarizing various data-centric approaches.
... Twitter data are a diverse and salient data source for researchers (Sarker, DeRoos, & Perrone, 2020) and policymakers (Aladwani, 2015). Thus, among other social media platforms, Twitter serves as the dominant discursive space (Munoriyarwa & Chambwera, 2020). ...
Full-text available
The Coronavirus Disease 2019 (COVID-19) emerged in Wuhan, China in December 2019. As it spread its tentacles beyond national frontiers, its devastating effects, both as a public health threat and a development challenge, had extensive socioeconomic and political ramifications on a global scale. Zimbabwe, a less economically developed country (LEDC), with a severely incapacitated and fragile public healthcare system, responded to the threat of this novel epidemic in a myriad of ways, such as enforcing a national lockdown and vigorous health education. This qualitative study elicited the views of selected Zimbabweans who commented on the governments' response to the pandemic through Twitter. These views were analysed using critical discourse analysis. The researchers selected tweets posted over a period of one week (14-21 March 2020), following a controversial remark by Zimbabwe's Defence Minister, Oppah Muchinguri, characterising COVID-19 as God's punitive response to the West for imposing economic sanctions on Zimbabwe. Although the Minister's remark was condemned by many for its alleged insensitivity, it emerged that this anti-United States propaganda inadvertently awakened the government of Zimbabwe from an extraordinary slumber characterised by sheer rhetoric and inactivity. From the public health promotion perspective, this article reflects on the implications of such a hurried and ill-conceived response on public health. It exposes the glaring policy disjuncture in the Zimbabwean context. Overall, it advocates genuine political will and commitment in the promotion of public health in Zimbabwe.
... The use of social media text for analysis of information spread in communities and the cascading effects can help understand how deeply such forums influence the social well-being, perceptions, beliefs, public health, and political decisions [3]. Based on the different demographic categories of Twitter users, the data provided by Twitter is a diverse source for researchers and policymakers [4] [5] [6]. Twitter data has been widely used in a variety of different research areas including disaster management, community analysis, network analysis, stock market prediction, recommender systems, health discussion, sports/entertainment, and politics [7]. ...
In this digital era, social media is an important tool for information dissemination. Twitter is a popular social media platform. Social media analytics helps make informed decisions based on people's needs and opinions. This information, when properly perceived provides valuable insights into different domains, such as public policymaking, marketing, sales, and healthcare. Topic modeling is an unsupervised algorithm to discover a hidden pattern in text documents. In this study, we explore the Latent Dirichlet Allocation (LDA) topic model algorithm. We collected tweets with hashtags related to corona virus related discussions. This study compares regular LDA and LDA based on collapsed Gibbs sampling (LDAMallet) algorithms. The experiments use different data processing steps including trigrams, without trigrams, hashtags, and without hashtags. This study provides a comprehensive analysis of LDA for short text messages using un-pooled and pooled tweets. The results suggest that a pooling scheme using hashtags helps improve the topic inference results with a better coherence score.
... Some data were collected from social media platforms such as twitter. Twitter data is diverse and so salient to the investigation of topical issues (Sarker et al., 2020). Among other social media platforms, Twitter serves as a dominant discursive space (Munoriyarwa et al., 2020). ...
Full-text available
The outbreak of COVID-19 ominously heralded a health crisis across the world. The World Health Organization (WHO) declared the virus a public health emergency in January 2020 and the world became concerned with the havoc the disease would potentially cause. Though the pandemic was a health crisis, it posed a lethal challenge to shaky contemporary democracies across the world. Governments, including that of Zimbabwe, responded to the pandemic by enacting sweeping stringent lockdown regulations to control the spread of the disease. The regulations curtailed freedom of mobility, regulated public gatherings, and suspended electoral processes. This article looks at how the pandemic has been used to limit citizens’ civil liberties, erode the tenets of democracy, profoundly altered the pre-existing democratic trajectory, and entrenched the Zimbabwe state on the path of authoritarianism. Through a review of available secondary literature, the study systematically analysed the political developments as reported in electronic (including social media-twitter) and print media during the pandemic. The study established that the response by the Zimbabwe government to the COVID-19 pandemic compromised the security of persons, undermined democracy and electoral processes. The health crisis presented an opportunity for the consolidation of authoritarian rule.
... The use of small data sets by some of the studies impacts the generalizability of the results, and some of the researchers acknowledged this and indicated a plan to replicate their studies with more data and the use of automated methods. Consequently, we observed that although such studies may be sampling social media data for hypothesis generation, they do not leverage one of the most important features of social media data, which is the ability to observe the continuous generation of big data to create long-term data-centric insights [73]. ...
Full-text available
Background: Medicinal cannabis is increasingly being used for a variety of physical and mental health conditions. Social media and web-based health platforms provide valuable, real-time, and cost-effective surveillance resources for gleaning insights regarding individuals who use cannabis for medicinal purposes. This is particularly important considering that the evidence for the optimal use of medicinal cannabis is still emerging. Despite the web-based marketing of medicinal cannabis to consumers, currently, there is no robust regulatory framework to measure clinical health benefits or individual experiences of adverse events. In a previous study, we conducted a systematic scoping review of studies that contained themes of the medicinal use of cannabis and used data from social media and search engine results. This study analyzed the methodological approaches and limitations of these studies. Objective: We aimed to examine research approaches and study methodologies that use web-based user-generated text to study the use of cannabis as a medicine. Methods: We searched MEDLINE, Scopus, Web of Science, and Embase databases for primary studies in the English language from January 1974 to April 2022. Studies were included if they aimed to understand web-based user-generated text related to health conditions where cannabis is used as a medicine or where health was mentioned in general cannabis-related conversations. Results: We included 42 articles in this review. In these articles, Twitter was used 3 times more than other computer-generated sources, including Reddit, web-based forums, GoFundMe, YouTube, and Google Trends. Analytical methods included sentiment assessment, thematic analysis (manual and automatic), social network analysis, and geographic analysis. Conclusions: This study is the first to review techniques used by research on consumer-generated text for understanding cannabis as a medicine. 
It is increasingly evident that consumer-generated data offer opportunities for a greater understanding of individual behavior and population health outcomes. However, research using these data has some limitations that include difficulties in establishing sample representativeness and a lack of methodological best practices. To address these limitations, deidentified annotated data sources should be made publicly available, researchers should determine the origins of posts (organizations, bots, power users, or ordinary individuals), and powerful analytical techniques should be used.
... For instance, a qualitative assessment of the text content from Twitter on NMPDU (specifically, prescription opioids) delivered insights about the epidemic of use and misuse of PMs at specific times [22]. Multiple studies have suggested that although users engaging in NMPDU may not voluntarily report their nonmedical use to medical experts, their selfreports in social media are detectable [21,24,25], and these can potentially be used for public health surveillance. A critical review [18] concluded that social media big data could be an effective resource to comprehend, monitor, and intervene in drug misuses and addiction problems. ...
Full-text available
Background: The behaviors and emotions associated with and reasons for nonmedical prescription drug use (NMPDU) are not well-captured through traditional instruments such as surveys and insurance claims. Publicly available NMPDU-related posts on social media can potentially be leveraged to study these aspects unobtrusively and at scale. Methods: We applied a machine learning classifier to detect self-reports of NMPDU on Twitter and extracted all public posts of the associated users. We analyzed approximately 137 million posts from 87,718 Twitter users in terms of expressed emotions, sentiments, concerns, and possible reasons for NMPDU via natural language processing. Results: Users in the NMPDU group express more negative emotions and less positive emotions, more concerns about family, the past, and body, and less concerns related to work, leisure, home, money, religion, health, and achievement compared to a control group (i.e., users who never reported NMPDU). NMPDU posts tend to be highly polarized, indicating potential emotional triggers. Gender-specific analyses show that female users in the NMPDU group express more content related to positive emotions, anticipation, sadness, joy, concerns about family, friends, home, health, and the past, and less about anger than males. The findings are consistent across distinct prescription drug categories (opioids, benzodiazepines, stimulants, and polysubstance). Conclusion: Our analyses of large-scale data show that substantial differences exist between the texts of the posts from users who self-report NMPDU on Twitter and those who do not, and between males and females who report NMPDU. Our findings can enrich our understanding of NMPDU and the population involved.
Issues: The sale of illicit drugs online has expanded to mainstream social media apps. These platforms provide access to a wide audience, especially children and adolescents. Research is in its infancy and scattered due to the multidisciplinary aspects of the phenomena. Approach: We present a multidisciplinary systematic scoping review on the advertisement and sale of illicit drugs to young people. Peer-reviewed studies written in English, Spanish and French were searched for the period 2015 to 2022. We extracted data on users, drugs studied, rate of posts, terminology used and study methodology. Key findings: A total of 56 peer-reviewed papers were included. The analysis of these highlights the variety of drugs advertised and platforms used to do so. Various methodological designs were considered. Approaches to detecting illicit content were the focus of many studies as algorithms move from detecting drug-related keywords to drug selling behaviour. We found that on average, for the studies reviewed, 13 in 100 social media posts advertise illicit drugs. However, popular platforms used by adolescents are rarely studied. Implications: Promotional content is increasing in sophistication to appeal to young people, shifting towards healthy, glamourous and seemingly legal depictions of drugs. Greater inter-disciplinary collaboration between computational and qualitative approaches are needed to comprehensively study the sale and advertisement of illegal drugs on social media across different platforms. This requires coordinated action from researchers, policy makers and service providers.
Introduction: Drug utilization is currently assessed through traditional data sources such as big electronic medical records (EMRs) databases, surveys, and medication sales. Social media and internet data have been reported to provide more accessible and more timely access to medications' utilization. Objective: This review aims at providing evidence comparing web data on drug utilization to other sources before the COVID-19 pandemic. Methods: We searched Medline, EMBASE, Web of Science, and Scopus until November 25th, 2019, using a predefined search strategy. Two independent reviewers conducted screening and data extraction. Results: Of 6,563 (64%) deduplicated publications retrieved, 14 (0.2%) were included. All studies showed positive associations between drug utilization information from web and comparison data using very different methods. A total of nine (64%) studies found positive linear correlations in drug utilization between web and comparison data. Five studies reported association using other methods: One study reported similar drug popularity rankings using both data sources. Two studies developed prediction models for future drug consumption, including both web and comparison data, and two studies conducted ecological analyses but did not quantitatively compare data sources. According to the STROBE, RECORD, and RECORD-PE checklists, overall reporting quality was mediocre. Many items were left blank as they were out of scope for the type of study investigated. Conclusion: Our results demonstrate the potential of web data for assessing drug utilization, although the field is still in a nascent period of investigation. Ultimately, social media and internet search data could be used to get a quick preliminary quantification of drug use in real time. Additional studies on the topic should use more standardized methodologies on different sets of drugs in order to confirm these findings. 
In addition, currently available checklists for study quality of reporting would need to be adapted to these new sources of scientific information.
Drug abuse is a global social issue of concern. As the drug market expands, there is an urgent need for technological methods to rapidly detect drug abuse to meet the needs of different situations. Here, we present a strategy for the rapid identification of benzodiazepines (midazolam and diazepam) using surface-enhanced Raman scattering (SERS) combined with convolutional neural networks (CNNs). The method uses a self-assembled silver nanoparticle paper-based SERS substrate for detection. Then, a SERS spectrum intelligent recognition model based on deep learning technology was constructed to realize the rapid and sensitive distinction between the two drugs. In this work, a total of 560 SERS spectra were collected, and the qualitative and quantitative identification of the two drugs in water and a beverage (Sprite) was realized by a trained CNN. The predicted concentrations for each scenario could reach 0.1-50 ppm (midazolam in water), 0.5-50 ppm (midazolam in water and diazepam in Sprite), and 5-150 ppm (diazepam in Sprite), with a strong coefficient of determination (R²) larger than 0.9662. The advantage of this method is that the neural network can extract data features from the entire SERS spectrum, which makes up for information loss when manually identifying the spectrum and selecting a limited number of characteristic peaks. This work shows that the combination of SERS and deep learning technology has become an inevitable development trend, and also demonstrates the great potential of this strategy in the practical application of SERS.
The opioid-abuse epidemic in the United States has escalated to national attention due to the dramatic increase in opioid overdose deaths. Analyzing opioid-related social media has the potential to reveal patterns of opioid abuse at a national scale, understand opinions of the public, and provide insights to support prevention and treatment. Reddit is a community-based social media platform with more reliable content curated by the community through voting. In this study, we collected and analyzed all opioid-related discussions from January 2014 to October 2017, which contain 51,537 posts by 16,162 unique users. We analyzed the data to understand the psychological categories of the posts, and performed topic modeling to reveal the major topics of interest. We also characterized the extent of social support received from comments and scores by each post. Last, we analyzed statistically significant differences in the posts between anonymous and non-anonymous users.
Introduction The Centers for Disease Control and Prevention (CDC) spend significant time and resources to track influenza vaccination coverage each influenza season using national surveys. Emerging data from social media provide an alternative solution to surveillance at both national and local levels of influenza vaccination coverage in near real time. Objectives This study aimed to characterise and analyse the vaccinated population from temporal, demographical and geographical perspectives using automatic classification of vaccination-related Twitter data. Methods In this cross-sectional study, we continuously collected tweets containing both influenza-related terms and vaccine-related terms covering four consecutive influenza seasons from 2013 to 2017. We created a machine learning classifier to identify relevant tweets, then evaluated the approach by comparing to data from the CDC’s FluVaxView. We limited our analysis to tweets geolocated within the USA. Results We assessed 1 124 839 tweets. We found strong correlations of 0.799 between monthly Twitter estimates and CDC, with correlations as high as 0.950 in individual influenza seasons. We also found that our approach obtained geographical correlations of 0.387 at the US state level and 0.467 at the regional level. Finally, we found a higher level of influenza vaccine tweets among female users than male users, also consistent with the results of CDC surveys on vaccine uptake. Conclusion Significant correlations between Twitter data and CDC data show the potential of using social media for vaccination surveillance. Temporal variability is captured better than geographical and demographical variability. We discuss potential paths forward for leveraging this approach.
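The comparison between monthly Twitter estimates and survey figures described above can be illustrated with a minimal sketch. It assumes Pearson's r as the correlation measure, which the abstract does not specify, and the function name `pearson_r` and both value lists are hypothetical placeholders, not data from the study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical monthly values for illustration only (NOT from the study):
# share of classified "vaccinated" tweets vs. survey-based monthly uptake.
twitter_monthly = [0.02, 0.10, 0.35, 0.30, 0.15, 0.08]
survey_monthly = [0.03, 0.12, 0.30, 0.28, 0.17, 0.10]

r = pearson_r(twitter_monthly, survey_monthly)
print(f"Pearson r between the two toy series: {r:.3f}")
```

A rank-based measure such as Spearman's coefficient would be a reasonable alternative when the two series are not expected to be linearly related.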
Background: Data collection and extraction from noisy text sources such as social media typically rely on keyword-based searching/listening. However, health-related terms are often misspelled in such noisy text sources due to their complex morphology, resulting in the exclusion of relevant data for studies. In this paper, we present a customizable data-centric system that automatically generates common misspellings for complex health-related terms, which can improve the data collection process from noisy text sources. Materials and methods: The spelling variant generator relies on a dense vector model learned from large, unlabeled text, which is used to find semantically close terms to the original/seed keyword, followed by the filtering of terms that are lexically dissimilar beyond a given threshold. The process is executed recursively, converging when no new terms similar (lexically and semantically) to the seed keyword are found. The weighting of intra-word character sequence similarities allows further problem-specific customization of the system. Results: On a dataset prepared for this study, our system outperforms the current state-of-the-art medication name variant generator with best F1-score of 0.69 and F1/4-score of 0.78. Extrinsic evaluation of the system on a set of cancer-related terms showed an increase of over 67% in retrieval rate from Twitter posts when the generated variants are included. Discussion: Our proposed spelling variant generator has several advantages over the existing spelling variant generators: (i) it is capable of filtering out lexically similar but semantically dissimilar terms, (ii) the number of variants generated is low, as many low-frequency and ambiguous misspellings are filtered out, and (iii) the system is fully automatic, customizable and easily executable. While the base system is fully unsupervised, we show how supervision may be employed to adjust weights for task-specific customizations.
Conclusion: The performance and relative simplicity of our proposed approach make it a much-needed spelling variant generation resource for health-related text mining from noisy sources. The source code for the system has been made publicly available for research.
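The recursive procedure described in this abstract (find semantically close terms, drop those that are lexically too far from the seed, repeat until nothing new appears) can be sketched roughly as follows. This is not the authors' released code: the tiny embedding table, the thresholds, and the use of `difflib.SequenceMatcher` as the lexical similarity measure are all assumptions made for illustration, and the weighting of intra-word character sequences is omitted.

```python
from difflib import SequenceMatcher
from math import sqrt

# Toy "dense vector model": hypothetical 3-d vectors, for illustration only.
TOY_VECTORS = {
    "oxycodone": (0.90, 0.10, 0.05),
    "oxycodon": (0.88, 0.12, 0.06),
    "oxycodne": (0.91, 0.09, 0.04),
    "oxicodone": (0.87, 0.11, 0.07),
    "oxycontin": (0.85, 0.15, 0.10),  # semantically close, lexically farther
    "oxygen": (0.05, 0.90, 0.30),     # semantically distant
    "pizza": (0.10, 0.20, 0.95),      # unrelated
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def lexical_similarity(a, b):
    """Character-level similarity in [0, 1] (stand-in for an edit-distance ratio)."""
    return SequenceMatcher(None, a, b).ratio()


def generate_variants(seed, vectors, sem_threshold=0.95, lex_threshold=0.85):
    """Expand from the seed until no new term passes both filters."""
    variants = set()
    frontier = [seed]
    while frontier:  # converges when no new terms are added
        term = frontier.pop()
        for candidate, vec in vectors.items():
            if candidate == seed or candidate in variants:
                continue
            # Semantic filter: candidate must be close to the current term.
            if cosine(vectors[term], vec) < sem_threshold:
                continue
            # Lexical filter: candidate must resemble the original seed.
            if lexical_similarity(seed, candidate) < lex_threshold:
                continue
            variants.add(candidate)
            frontier.append(candidate)
    return variants


print(sorted(generate_variants("oxycodone", TOY_VECTORS)))
```

In this toy setup, "oxycontin" passes the semantic filter but is rejected by the lexical one, while "oxygen" and "pizza" fail the semantic filter; with a real embedding model both thresholds would need tuning on held-out data.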
Objective: We executed the Social Media Mining for Health (SMM4H) 2017 shared tasks to enable the community-driven development and large-scale evaluation of automatic text processing methods for the classification and normalization of health-related text from social media. An additional objective was to publicly release manually annotated data. Materials and methods: We organized 3 independent subtasks: automatic classification of self-reports of 1) adverse drug reactions (ADRs) and 2) medication consumption, from medication-mentioning tweets, and 3) normalization of ADR expressions. Training data consisted of 15 717 annotated tweets for (1), 10 260 for (2), and 6650 ADR phrases and identifiers for (3); and exhibited typical properties of social-media-based health-related texts. Systems were evaluated using 9961, 7513, and 2500 instances for the 3 subtasks, respectively. We evaluated performances of classes of methods and ensembles of system combinations following the shared tasks. Results: Among 55 system runs, the best system scores for the 3 subtasks were 0.435 (ADR class F1-score) for subtask-1, 0.693 (micro-averaged F1-score over two classes) for subtask-2, and 88.5% (accuracy) for subtask-3. Ensembles of system combinations obtained best scores of 0.476, 0.702, and 88.7%, outperforming individual systems. Discussion: Among individual systems, support vector machines and convolutional neural networks showed high performance. Performance gains achieved by ensembles of system combinations suggest that such strategies may be suitable for operational systems relying on difficult text classification tasks (eg, subtask-1). Conclusions: Data imbalance and lack of context remain challenges for natural language processing of social media text. Annotated data from the shared task have been made available as reference standards for future studies (
Conference Paper
Drug abuse continues to accelerate toward becoming the most severe public health problem in the United States. The ability to detect drug abuse risk behavior at a population scale, such as among the population of Twitter users, can help us to monitor the trend of drug-abuse incidents. Unfortunately, traditional methods do not effectively detect drug abuse risk behavior, given tweets. This is because: (1) Tweets usually are noisy and sparse; and (2) The availability of labeled data is limited. To address these challenging problems, we proposed a deep self-taught learning system to detect and monitor drug abuse risk behaviors in the Twitter sphere, by leveraging a large amount of unlabeled data. Our models automatically augment annotated data: (i) To improve the classification performance, and (ii) To capture the evolving picture of drug abuse on online social media. Our extensive experiment has been conducted on 3 million drug abuse-related tweets with geo-location information. Results show that our approach is highly effective in detecting drug abuse risk behaviors.
Drug misuse may happen when patients do not follow their prescriptions and take actions that lead to potentially harmful situations, such as intake of an incorrect dosage (overuse or underuse) or drug use for indications different from those prescribed. Although such situations are dangerous, patients usually do not report the misuse of drugs to their physicians. Hence, other sources of information are necessary for studying these issues. We assume that online health fora can provide such information and propose to exploit them. The general purpose of our work is the automatic detection and classification of drug misuses by analysing user-generated data in French social media. To this end, we propose a multi-step method, the main steps of which are: (1) indexing of messages with an extended vocabulary adapted to social media writing; (2) creation of a typology of drug misuses; and (3) automatic classification of messages according to whether they contain drug misuses or not. We present the results obtained at different steps and discuss them. The proposed method detects misuses with an F-measure of up to 0.773.
We explored social media as a potential data source for acquiring real-time information on opioid use and perceptions in Canada. Twitter messages were collected through a social media analytics platform between June 15, 2017, and July 13, 2017, and analyzed to identify recurring topics mentioned in the messages. Messages concerning the medical use of opioids as well as commentary on the Canadian government’s current response efforts to the opioid crisis were common. The findings of this study may help to inform public health practice and community stakeholders in their efforts to address the opioid crisis.
Conference Paper
Increasing rates of opioid drug abuse and heightened prevalence of online support communities underscore the necessity of employing data mining techniques to better understand drug addiction using these rapidly developing online resources. In this work, we obtained data from Reddit, an online collection of forums, to gather insight into drug use/misuse using text snippets from users' narratives. Specifically, using users' posts, we trained a binary classifier which predicts a user's transitions from casual drug discussion forums to drug recovery forums. We also proposed a Cox regression model that outputs likelihoods of such transitions. In doing so, we found that utterances of select drugs and certain linguistic features contained in one's posts can help predict these transitions. Using unfiltered drug-related posts, our research delineates drugs that are associated with higher rates of transitions from recreational drug discussion to support/recovery discussion, offers insight into modern drug culture, and provides tools with potential applications in combating the opioid crisis.
Conference Paper
Opioid use disorder (OUD) poses substantial risks to personal well-being and public health. In online communities, users support those seeking recovery, in part by promoting clinically grounded treatments. However, some communities also promote clinically unverified OUD treatments, such as unregulated and untested drugs. Little research exists on which alternative treatments people use, whether these treatments are effective for recovery, or if they cause negative side effects. We provide the first large-scale social media study of clinically unverified, alternative treatments in OUD recovery on Reddit, partnering with an addiction research scientist. We adopt transfer learning across 63 subreddits to precisely identify posts related to opioid recovery. Then, we quantitatively discover potential alternative treatments and contextualize their effectiveness. Our work benefits health research and practice by identifying undiscovered recovery strategies. We also discuss the impacts to online communities dealing with stigmatized behavior and research ethics.
Social media research often has two things in common: Twitter is the platform used and a keyword filter list is used to extract only relevant Tweets. Here we propose that (a) alternative platforms be considered more often when doing social media research, and (b) regardless of platform, researchers use word embeddings as a type of synonym discovery to improve their keyword filter list, both of which lead to more relevant data. We demonstrate the benefit of these proposals by comparing how successful our synonym discovery method is at finding terms for marijuana and select opioids on Twitter versus a platform that can be filtered by topic, Reddit. We also find words that are not on the U.S. Drug Enforcement Administration (DEA) drug slang list for that year, some of which appear on the list the subsequent year, showing that this method could be employed to find drug terms faster than traditional means.