TITLE: Social media mining for toxicovigilance of prescription medications: Progress,
challenges and future work
Authors:
a Abeed Sarker, PhD (abeed.sarker@emory.edu)
KEYWORDS: substance use; substance use disorder; social media; natural language processing;
data science
Author affiliations:
a Department of Biomedical Informatics, School of Medicine, Emory University, Woodruff
Memorial Research Building, 101 Woodruff Circle, Suite 4101, Atlanta, GA 30322, USA
Corresponding author: Abeed Sarker
101 Woodruff Circle, Suite 4101, Atlanta, GA 30322, USA
Email: abeed.sarker@emory.edu
ABSTRACT
Substance use, substance use disorder, and overdoses related to substance use are major public
health problems globally and in the United States. A key aspect of addressing these problems
from a public health standpoint is improved surveillance. Traditional surveillance systems are
laggy, and social media are potentially useful sources of timely data. However, mining
knowledge from social media is a challenging task and requires the development of advanced
artificial intelligence, specifically natural language processing and machine learning methods.
Funded by the National Institute on Drug Abuse, we developed a sophisticated end-to-end
pipeline for mining information about nonmedical prescription medication use from social
media, namely Twitter and Reddit. In this paper, we describe the progress we have made over
four years, including our automated data mining infrastructure, existing challenges in social
media mining for toxicovigilance, and possible future research directions.
Main Article
Substance use, including nonmedical use of
prescription medications, substance use
disorder, and overdoses related to substance
use are major public health problems in the
United States (US) and globally. According
to the latest estimates (available in
November 2022), in the 12 months leading
up to May 2022, more than 100,000 overdose-
related deaths occurred in the US (over 275
deaths per day on average).1 Despite many
years of effort, it has thus far not been
possible to curb the opioid epidemic in the
US. While there are many reliable traditional
sources of information, such as the
WONDER database from the Centers for
Disease Control and Prevention (CDC) for
tracking overdose deaths2 and the National
Survey on Drug Use and Health (NSDUH),3
these sources are often laggy. There are
substantial lags associated with the process
of data collection, curation, and publication.
Overdose death data, for example, may take
more than a year to compile and publish.
Consequently, trends in substance use and
its impact are only known retrospectively.
Often, by the time we are able to obtain a
complete picture of the trends within a given
time period, considerable damage has
already been done and the patterns have
shifted. This is particularly problematic
because patterns in population-level
substance use are constantly evolving.4
There is thus a need for establishing
complementary sources and methods for
surveillance. Social media, coupled with
methods for mining knowledge from them,
have the potential to serve as timely and
complementary sources of information for
substance use. However, there has been
limited research on developing and
validating methods for leveraging social
media data for substance use surveillance or
toxicovigilance. In this paper, we outline our
progress in developing the social media
mining infrastructure needed for mining
substance use-related knowledge from social
media. Our specific focus was on prescription
medications, including opioids,
benzodiazepines, and stimulants. In the
following paragraphs, we outline our
methodological infrastructure, including
natural language processing (NLP) and
machine learning methods, and findings
over a period of four years of research.
Data collection and annotation
The first step in successfully leveraging
social media data for toxicovigilance is to
establish a data collection strategy. For our
work, we initially focused on Twitter and
later integrated data from Reddit. For
Twitter, we collected data about a given set
of prescription medications including
opioids (such as oxycodone), stimulants
(such as Adderall®), and benzodiazepines
(such as alprazolam). We used the Twitter
academic application programming
interface (API) for collecting data using the
medication names (generic and trade) as
keywords. Since medication names are often
misspelled by Twitter subscribers, we had to
devise a strategy for incorporating common
misspellings for our chosen set of medication
names. We developed an automatic, data-
centric misspelling generator that used a
phrase embedding model learned from
social media and a recursive algorithm that
combined semantic and lexical similarity
measures.5 We found that the inclusion of
misspellings automatically generated by our
algorithm increased our post retrieval rate
by over 30% and we later extended the
algorithm for generating multi-word lexical
variants.6
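To illustrate the general idea only (this is a minimal sketch, not the published algorithm), the snippet below keeps an embedding neighbor of a seed medication name as a candidate variant when it is both semantically close (embedding similarity) and lexically close (character similarity). The model object, thresholds, and helper names are illustrative assumptions; the published method additionally applies the procedure recursively over newly discovered variants.

```python
# Illustrative sketch: combine semantic and lexical similarity to propose variants.
# `model` is assumed to be a pre-trained word/phrase embedding (e.g., a gensim
# KeyedVectors object learned from social media text); thresholds are placeholders.
from difflib import SequenceMatcher


def lexical_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def generate_variants(model, seed: str, sem_threshold: float = 0.6,
                      lex_threshold: float = 0.75, max_neighbors: int = 100):
    """Keep embedding neighbors of `seed` that are also lexically similar."""
    variants = set()
    for neighbor, sem_score in model.most_similar(seed, topn=max_neighbors):
        if sem_score >= sem_threshold and lexical_similarity(seed, neighbor) >= lex_threshold:
            variants.add(neighbor)
    return variants


# Hypothetical usage: variants of "oxycodone" might include "oxycodon" or "oxycodne".
# print(generate_variants(model, "oxycodone"))
```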
In line with past literature,7 a manual review
of the posts retrieved by our data collection
mechanism revealed that only a small
portion of all posts represented personal
nonmedical use, while most were simply
mentions of the medications (e.g., sharing of
news articles). Since our focus was on
studying nonmedical use of prescription
medications, we decided against conducting
unsupervised analysis of the entire data and
instead decided to apply a supervised
machine learning filter to automatically
identify the posts that represented
nonmedical use. As a first step, in the
absence of annotated data for this supervised
classification task, we prepared a detailed
annotation guideline and then manually
annotated data into four classes—(i)
nonmedical use, (ii) consumption, (iii)
mention, and (iv) unrelated. We annotated a
total of 16,443 posts, obtaining an average,
pairwise inter-annotator agreement of 0.86
(Cohen's kappa8). The annotation guideline
and the dataset are available for research
purposes.
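As a brief illustration, pairwise agreement of this kind can be computed with scikit-learn's implementation of Cohen's kappa; the labels below are placeholder examples rather than annotations from our dataset.

```python
# Sketch: pairwise inter-annotator agreement via Cohen's kappa (placeholder labels).
from sklearn.metrics import cohen_kappa_score

annotator_1 = ["nonmedical_use", "mention", "consumption", "mention", "unrelated"]
annotator_2 = ["nonmedical_use", "mention", "mention", "mention", "unrelated"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")
```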
Data collection from Reddit was
substantially simpler since Reddit consists of
many special-interest communities
(subreddits) that host topic-specific chatter.
For our targeted studies, we first identified
subreddits of interest and then collected all
posts available via the PRAW API.9
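The sketch below illustrates this style of collection with PRAW; the credentials and the subreddit name are placeholders and do not reflect the exact configuration used in our studies.

```python
# Illustrative sketch: collect recent posts from a topic-specific subreddit via PRAW.
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # placeholder credentials
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="toxicovigilance-research-script",
)

posts = []
for submission in reddit.subreddit("suboxone").new(limit=1000):  # placeholder subreddit
    posts.append({
        "id": submission.id,
        "created_utc": submission.created_utc,
        "title": submission.title,
        "selftext": submission.selftext,
    })

print(f"Collected {len(posts)} posts")
```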
Supervised classification
Using the manually-annotated data, we
experimented with several supervised
learning algorithms to identify the best
strategy. Specifically, we compared the
performances of traditional classification
models such as support vector machines
(SVMs), deep learning methods, and
transformer-based methods.10 We found that
fusion-based classifiers involving multiple
transformer-based models achieved the best
performance in terms of F1 score (0.67) for the
nonmedical use (minority) class. Because of
the challenging nature of this classification
task, we later attempted to improve the
classification performance for the
nonmedical use class by incorporating
additional innovations, such as source-
adaptive pretraining and topic-specific
pretraining.11 We also attempted to promote
community-driven development of effective
solutions for this task by proposing it in the
Social Media Mining for Health Applications
(SMM4H) shared tasks in 2020.12
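For readers interested in a reproducible starting point, the sketch below fine-tunes a single RoBERTa model on the four-class task with the Hugging Face transformers library. It is an illustrative single-model baseline, not our fusion classifier; the example posts, label indices, and hyperparameters are placeholders.

```python
# Single-model baseline sketch (not the fusion classifier described in the text).
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LABELS = ["nonmedical_use", "consumption", "mention", "unrelated"]
texts = ["ran out of my script so took my friend's xanax",   # placeholder posts
         "picked up my adderall refill today"]
labels = [0, 1]  # placeholder label indices into LABELS

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=len(LABELS))

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_ds = Dataset.from_dict({"text": texts, "label": labels}).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="nmu-classifier", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_ds,
)
trainer.train()
```

A fusion-based approach of the kind described above would combine the predictions of several such fine-tuned models rather than relying on a single one.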
Our approach for detecting self-reports of
nonmedical prescription medication use
enabled us to create what is to date the
largest social media-based cohort for
nonmedical prescription medication use. At
the time of writing, this cohort consists of
over 600,000 members. Each member of the
cohort has been automatically detected at
some point to have publicly expressed
nonmedical prescription medication use.
Using the API, we collected all publicly-
available past posts of each cohort member,
and we repeated this collection strategy
every two weeks, resulting in multi-year
longitudinal timelines of these cohort
members. We used the classified posts and
the longitudinal data for targeted,
downstream analyses.
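As an illustration of this style of collection, the sketch below pages through one user's publicly available tweets with the Twitter API v2 via tweepy; the bearer token and user ID are placeholders, and the deployed pipeline's implementation may differ.

```python
# Illustrative sketch: retrieve a cohort member's publicly available timeline.
import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")  # placeholder token


def collect_timeline(user_id: str):
    """Page through all retrievable tweets for one user."""
    tweets = []
    for page in tweepy.Paginator(client.get_users_tweets, id=user_id,
                                 max_results=100,
                                 tweet_fields=["created_at", "text"]):
        if page.data:
            tweets.extend(page.data)
    return tweets


# Hypothetical usage: timeline = collect_timeline("1234567890")
```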
Post-classification tasks
We conducted several studies in which we
tried to further filter our data to remove
noise. For example, we developed methods
for detecting and removing bots from our
cohort,13 comparing therapeutic and
recreational use of opioids from Twitter data
by employing a multi-class classification
strategy,14 and automating the detection of
illicit opioid use.15 For some of our targeted
analysis of Twitter data, we used only post-
level data samples (i.e., only the posts that
contained the medication names rather than
longitudinal data from the cohort members).
Such studies were conducted primarily early
in our project, when sufficient amounts of
longitudinal cohort data had not yet been
collected. For example, we conducted
thematic analyses to study provider
perceptions about buprenorphine
initiation,16 and to compare chatter regarding
medications for opioid use disorder such as
buprenorphine-naloxone and methadone.17
A key advance made by our Twitter-based
approach was in geolocation-centric
analysis. As mentioned earlier, one of our
key motivations was to detect signals of
nonmedical use from social media earlier
than other sources. Hence, we wanted to
compare past signals that we could collect
from social media with known metrics from
traditional sources, such as the CDC
WONDER database and NSDUH. Using the
state of Pennsylvania as our subject, we
compared the rates of tweets classified as
nonmedical use with overdose death
numbers at the county level and several
relevant metrics from the NSDUH at the
substate level.18 We found significant
correlations between the social media
estimates and the metrics from the
traditional sources, suggesting that social
media data may serve as a complementary
resource for predicting substance use and
related overdose deaths.
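As a simplified illustration of this type of comparison, rank correlation between location-level estimates can be computed with SciPy; the numbers below are placeholders rather than study data.

```python
# Sketch: Spearman correlation between location-level tweet-based estimates and
# overdose death rates (placeholder values, not study data).
from scipy.stats import spearmanr

tweet_rates = [1.2, 0.8, 2.5, 0.3, 1.9]        # e.g., classified tweets per 10,000 residents
overdose_rates = [14.1, 9.7, 22.4, 6.2, 18.0]  # e.g., deaths per 100,000 residents

rho, p_value = spearmanr(tweet_rates, overdose_rates)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```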
Our more recent work focused on leveraging
the longitudinal data posted by our cohort
members. In particular, we attempted to
address one key limitation of social media
data compared to the NSDUH and other
traditional sources of information. The latter
typically have demographic information
available (e.g., biological sex/gender identity
and race), which social media data lack.
Therefore, we developed methods for
automatically estimating the distribution of
gender identities, race, and age groups from
our cohort data and compared them to those
reported in traditional sources. Broadly
speaking, we found the binary gender
distributions to largely agree with
traditional sources,19 often agreeing more
with one source than with another. For race,
we found a strong correlation between
estimates derived from Twitter and
NSDUH.20 We found moderate correlation
for age-group-related metrics, which was
explainable since the subscriber base of
Twitter is skewed towards overrepresenting
younger people and underrepresenting older
people. We also conducted a large-
scale analysis of emotions expressed along
with nonmedical prescription medication
use, and we compared these between cohort
subsets.21 Our analyses revealed significant
differences in the emotions and concerns
expressed by people who nonmedically use
stimulants, opioids, and benzodiazepines
compared with those who do not. Our
analyses also revealed significant differences
in the emotional content of posts by men and
women who report nonmedical use of
prescription medications.
Reddit data analysis
While we used Twitter primarily to advance
the public health aspect of our work, we
found Reddit to be particularly helpful in
providing us with clinical insights. Early on
in our project, we used Reddit to understand
consumer perceptions about medications for
opioid use disorder, such as buprenorphine-
naloxone.22 Unlike Twitter, we found Reddit
to encapsulate intricate details about user
experiences and perceptions. This is
particularly the case because Reddit does not
impose length limits at the post level, unlike
Twitter, and the platform is built on the
concept of anonymity (i.e., subscribers can
remain anonymous if they desire).
Consequently, discussions on Reddit are
often candid and contain more depth. In a
later study, we utilized data from Reddit to
understand the experiences of people who
use opioids in terms of precipitated
withdrawal when initiating buprenorphine
treatment. We found that while the literature
lacked information about this problem,
which is observed relatively commonly in
emergency departments, Reddit subscribers
had been discussing this topic for multiple
years. We even found that Reddit
subscribers had advocated self-management
strategies for avoiding precipitated
withdrawal, including using a specific
microdosing strategy called the Bernese
method.23
Conclusion and future work
Our deployed pipeline will continue to
automatically collect cohort members based
on their self-reported nonmedical
prescription medication use. Longitudinal
data from cohort members will also continue
to be collected. Thus, our cohort dataset is
the largest of its kind and it will uniquely
preserve longitudinal data posted by the
cohort members. While our completed work
represents considerable progress in this
research space, there are a number of
important future directions to expand our
work. We outline some of these future
research tasks below.
i. Expanding to illicit substances: this is
a natural extension of our current
pipeline which focuses on
nonmedical use of prescription
medications only. We plan to
incorporate illicit substances into our
pipeline. This will require manual
annotation of additional data and
training new supervised
classification models. Note that there
is a subtle difference in the
classification process between
prescription and illicit substances.
Any consumption of illicit substances
is considered nonmedical use, but
that is not the case for prescription
medications.
ii. Discovering novel psychoactive
substances: this is an area where
social media may have a high impact.
Emerging and novel substances often
appear on social media first before
they are detected by traditional
systems. Our preliminary work in
this space focusing on
benzodiazepines suggests that it may
be possible to obtain strong signals
about novel benzodiazepines before
they become widespread in the
population.24
iii. Studying long-term impacts and
trajectories of substance use: our
massive and growing cohort will
have multiple years of longitudinal
data available. These can be used to
study long-term trajectories and
impacts. While impacts, such as
social and clinical impacts, can be
sparsely occurring, the availability of
a large cohort may make the
detection of patterns possible. Our
preliminary work in analyzing
trends has produced promising
results, and we envision the
application of few-/low-shot learning
methods for automatically detecting
sparse concepts.
iv. Creating a publicly available
resource: while most of our data and
methods are public, there is the
potential to make aggregated
statistics available to the broader
research community in easy-to-use
formats. For example, creating a web-
based dashboard that enables the
easy download of aggregated
statistics may help public health and
related researchers.
v. Reporting back to the community:
while we conduct research using
publicly available data, we rarely
report back to the people whose data
are being used in the research. Our
vision is to establish a protocol for
reporting the findings of our work
back to the community whose data are used.
Funding
Research reported in this publication is supported by the NIDA of the NIH under the award
numbers R01DA046619 and R01DA057599. The content is solely the responsibility of the authors
and does not necessarily represent the official views of the NIH.
References
1. National Center for Health Statistics. Products - Vital Statistics Rapid Release - Provisional
Drug Overdose Data. https://www.cdc.gov/nchs/nvss/vsrr/drug-overdose-data.htm (2022).
2. CDC WONDER. https://wonder.cdc.gov/.
3. National Survey on Drug Use and Health. https://nsduhweb.rti.org/respweb/homepage.cfm.
4. Jalal, H. et al. Changing dynamics of the drug overdose epidemic in the United States from
1979 through 2016. Science 361, eaau1184 (2018). https://www.science.org/doi/10.1126/science.aau1184.
5. Sarker, A. & Gonzalez-Hernandez, G. An unsupervised and customizable misspelling
generator for mining noisy health-related text sources. J. Biomed. Inform. 88, 98–107 (2018).
6. Sarker, A. LexExp: a system for automatically expanding concept lexicons for noisy
biomedical texts. Bioinformatics 37, 2499–2501 (2021).
7. Sarker, A. et al. Social Media Mining for Toxicovigilance: Automatic Monitoring of
Prescription Medication Abuse from Twitter. Drug Saf. 39, 231–240 (2016).
8. Cohen, J. A Coefficient of Agreement for Nominal Scales. Educ. Psychol. Meas. 20, 37–46 (1960).
https://journals.sagepub.com/doi/10.1177/001316446002000104.
9. PRAW: The Python Reddit API Wrapper. PRAW 7.6.1 documentation.
https://praw.readthedocs.io/en/stable/.
10. Al-Garadi, M. A. et al. Text classification models for the automatic detection of
nonmedical prescription medication use from social media. BMC Med. Inform. Decis. Mak. 21,
27 (2021).
11. Guo, Y., Ge, Y., Yang, Y.-C., Al-Garadi, M. A. & Sarker, A. Comparison of Pretraining
Models and Strategies for Health-Related Social Media Text Classification. Healthcare 10, 1478
(2022).
12. Klein, A. et al. Overview of the Fifth Social Media Mining for Health Applications
(#SMM4H) Shared Tasks at COLING 2020. in Proceedings of the Fifth Social Media Mining for
Health Applications Workshop & Shared Task 27–36 (Association for Computational Linguistics,
2020).
13. Davoudi, A., Klein, A. Z., Sarker, A. & Gonzalez-Hernandez, G. Towards Automatic Bot
Detection in Twitter for Health-related Tasks. AMIA Summits Transl. Sci. Proc. 2020, 136–141
(2020).
14. Fodeh, S. J. et al. Utilizing a multi-class classification approach to detect therapeutic and
recreational misuse of opioids on Twitter. Comput. Biol. Med. 129, 104132 (2021).
15. Sarker, A., Gonzalez-Hernandez, G. & Perrone, J. Towards automating location-specific
opioid toxicosurveillance from Twitter via data science methods. Stud. Health Technol. Inform.
264, 333–337 (2019).
16. Chenworth, M. et al. Buprenorphine Initiation in the Emergency Department: a Thematic
Content Analysis of a #firesidetox Tweetchat. J. Med. Toxicol. 16, 262–268 (2020).
17. Chenworth, M. et al. Methadone and suboxone® mentions on twitter: thematic and
sentiment analysis. Clin. Toxicol. Phila. Pa 59, 982–991 (2021).
18. Sarker, A., Gonzalez-Hernandez, G., Ruan, Y. & Perrone, J. Machine Learning and
Natural Language Processing for Geolocation-Centric Monitoring and Characterization of
Opioid-Related Social Media Chatter. JAMA Netw. Open 2, e1914672 (2019).
19. Yang, Y.-C., Al-Garadi, M. A., Love, J. S., Perrone, J. & Sarker, A. Automatic gender
detection in Twitter profiles for health-related cohort studies. JAMIA Open 4, ooab042 (2021).
20. Yang, Y.-C. et al. Can accurate demographic information about people who use
prescription medications non-medically be derived from Twitter? medRxiv (2022).
21. Al-Garadi, M. A. et al. Large-Scale Social Media Analysis Reveals Emotions Associated
with Nonmedical Prescription Drug Use. Health Data Sci. 2022, 1–12 (2022).
22. Graves, R. L. et al. Thematic Analysis of Reddit Content About Buprenorphine-naloxone
Using Manual Annotation and Natural Language Processing Techniques. J. Addict. Med. 16,
454–460 (2022).
23. Spadaro, A. et al. Reddit discussions about buprenorphine associated precipitated
withdrawal in the era of fentanyl. Clin. Toxicol. 60, 694–701 (2022).
24. Sarker, A. et al. Evidence of the emergence of illicit benzodiazepines from online drug
forums. Eur. J. Public Health ckac161 (2022) doi:10.1093/eurpub/ckac161.