Original Investigation | Health Informatics
Machine Learning and Natural Language Processing
for Geolocation-Centric Monitoring and Characterization
of Opioid-Related Social Media Chatter
Abeed Sarker, PhD; Graciela Gonzalez-Hernandez, PhD; Yucheng Ruan, BEng; Jeanmarie Perrone, MD
Abstract
IMPORTANCE Automatic curation of consumer-generated, opioid-related social media big data may
enable real-time monitoring of the opioid epidemic in the United States.
OBJECTIVE To develop and validate an automatic text-processing pipeline for geospatial and
temporal analysis of opioid-mentioning social media chatter.
DESIGN, SETTING, AND PARTICIPANTS This cross-sectional, population-based study was
conducted from December 1, 2017, to August 31, 2019, and used more than 3 years of publicly
available social media posts on Twitter, dated from January 1, 2012, to October 31, 2015, that were
geolocated in Pennsylvania. Opioid-mentioning tweets were extracted using prescription and illicit
opioid names, including street names and misspellings. Social media posts (tweets) (n = 9006) were
manually categorized into 4 classes, and training and evaluation of several machine learning
algorithms were performed. Temporal and geospatial patterns were analyzed with the best-
performing classifier on unlabeled data.
MAIN OUTCOMES AND MEASURES Pearson and Spearman correlations of county- and substate-
level abuse-indicating tweet rates with opioid overdose death rates from the Centers for Disease
Control and Prevention WONDER database and with 4 metrics from the National Survey on Drug Use
and Health for 3 years were calculated. Classifier performances were measured through
microaveraged F1 scores (harmonic mean of precision and recall) or accuracies and 95% CIs.
RESULTS A total of 9006 social media posts were annotated, of which 1748 (19.4%) were related to abuse, 2001 (22.2%) were related to information, 4830 (53.6%) were unrelated, and 427 (4.7%) were not in the English language. Yearly rates of abuse-indicating social media posts showed statistically significant correlation with county-level opioid-related overdose death rates (n = 75) for 3 years (Pearson r = 0.451, P < .001; Spearman r = 0.331, P = .004). Abuse-indicating tweet rates showed consistent correlations with 4 NSDUH metrics (n = 13) associated with nonmedical prescription opioid use (Pearson r = 0.683, P = .01; Spearman r = 0.346, P = .25), illicit drug use (Pearson r = 0.850, P < .001; Spearman r = 0.341, P = .25), illicit drug dependence (Pearson r = 0.937, P < .001; Spearman r = 0.495, P = .09), and illicit drug dependence or abuse (Pearson r = 0.935, P < .001; Spearman r = 0.401, P = .17) over the same 3-year period, although the tests lacked power to demonstrate statistical significance. A classification approach involving an ensemble of classifiers produced the best performance in accuracy or microaveraged F1 score (0.726; 95% CI, 0.708-0.743).
Key Points
Question Can natural language
processing be used to gain real-time
temporal and geospatial information
from social media data about
opioid abuse?
Findings In this cross-sectional,
population-based study of 9006 social
media posts, supervised machine
learning methods performed automatic
4-class classification of opioid-related
social media chatter with a maximum F1
score of 0.726. Rates of automatically
classified opioid abuse–indicating social
media posts from Pennsylvania
correlated with county-level overdose
death rates and with 4 national survey
metrics at the substate level.
Meaning The findings suggest that
automatic processing of social media
data, combined with geospatial and
temporal information, may provide
close to real-time insights into the status
and trajectory of the opioid epidemic.
CONCLUSIONS AND RELEVANCE The correlations obtained in this study suggest that a social media–based approach reliant on supervised machine learning may be suitable for geolocation-centric monitoring of the US opioid epidemic in near real time.
JAMA Network Open. 2019;2(11):e1914672. doi:10.1001/jamanetworkopen.2019.14672
Introduction
The problem of drug addiction and overdose has reached epidemic proportions in the United States, and it is largely driven by opioids, both prescription and illicit.1 More than 72 000 overdose-related deaths in the United States were estimated to have occurred in 2017, of which more than 47 000 (approximately 68%) involved opioids,2 meaning that a mean of more than 130 people died each day from opioid overdoses, and approximately 46 of these deaths were associated with prescription opioids.3 According to the Centers for Disease Control and Prevention, the opioid crisis has hit some US states harder than others, with West Virginia, Ohio, and Pennsylvania having death rates greater than 40 per 100 000 people in 2017 and with statistically significant increases in death rates year by year.4 Studies have suggested that the state-by-state variations in opioid overdose–related deaths are multifactorial but may be associated with differences in state-level policies and laws regarding opioid prescribing practices and population-level awareness or education regarding the risks and benefits of opioid use.5 Although the geographic variation is now known, strategies for monitoring the crisis are grossly inadequate.6,7 Current monitoring strategies have a substantial time lag, meaning that the outcomes of recent policy changes, efforts, and implementations8-10 cannot be assessed close to real time. Kolodny and Frieden11 discussed some of the drawbacks of current monitoring strategies and suggested 10 federal-level steps for reversing the opioid epidemic, with improved monitoring or surveillance as a top priority.
In recent years, social media has emerged as a valuable resource for performing public health surveillance,12-15 including for drug abuse.16-18 Adoption of social media is at an all-time high19 and continues to grow. Consequently, social media chatter is rich in health-related information, which, if mined appropriately, may provide unprecedented insights. Studies have suggested that social media posts mentioning opioids and other abuse-prone substances contain detectable signals of abuse or misuse,20-22 with some users openly sharing such information, which they may not share with their physicians or through any other means.13,17,23,24 Manual analyses established the potential of social media for drug abuse research, but automated, data-centric processing pipelines are required to fully realize social media's research potential. However, the characteristics of social media data present numerous challenges to automatic processing from the perspective of natural language processing and machine learning, including the presence of misspellings, colloquial expressions, data imbalance, and noise. Some studies have automated social media mining for this task by proposing approaches such as rule-based categorization,22 supervised classification,17 and unsupervised methods.5 Studies that have compared opioid-related chatter and its association with the opioid crisis have been unsupervised in nature, and they either do not filter out information unrelated to personal abuse5 or do not quantitatively evaluate the performance of their filtering strategy.21 These and similar studies have, however, established the importance of social media data for toxicovigilance and have paved the way for end-to-end automatic pipelines for using social media information in near real time.
In this cross-sectional study, we developed and evaluated the building blocks, based on natural language processing and machine learning, for an automated social media–based pipeline for toxicovigilance. The proposed approach relies on supervised machine learning to automatically characterize opioid-related chatter and combines the output of the data processing pipeline with temporal and geospatial information from Twitter to analyze the opioid crisis at a specific time and place. We believe this supervised learning–based model is more robust than unsupervised approaches because it is not dependent on the volume of the overall chatter, which fluctuates from time to time depending on various factors, such as media coverage. This study, which focused on the state of Pennsylvania, suggests that the rate of personal opioid abuse–related chatter on Twitter was reflective of the opioid overdose deaths from the Centers for Disease Control and Prevention WONDER database and 4 metrics from the National Survey on Drug Use and Health (NSDUH) over a period of 3 years.
Methods
Data Collection, Refinement, and Annotation
This cross-sectional study was conducted from December 1, 2017, to August 31, 2019. It was deemed
by the University of Pennsylvania Institutional Review Board to be exempt from review as all data
used were publicly available. Informed consent was not necessary for this reason. This study followed
the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting
guideline.
Publicly available social media posts on Twitter from January 1, 2012, to October 31, 2015, were collected as part of a broader project through the public streaming API (application programming interface).25 The API provides access to a representative random sample of approximately 1% of the data in near real time.26 Social media posts (tweets) originating from Pennsylvania were identified through the geolocation detection process described in Schwartz et al.27 To include opioid-related posts only, our research team, led by a medical toxicologist (J.P.), identified keywords, including street names (relevant unambiguous street names were chosen from the US Drug Enforcement Administration website28), that represented prescription and illicit opioids. Because social media posts have been reported to include many misspellings,29 and drug names are often misspelled, we used an automatic spelling variant generator for the selected keywords.30 We observed an increase in retrieval rate for certain keywords when we combined these misspellings with the original keywords (example in eFigure 1 in the Supplement).
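The actual variant generator is the published method cited above;30 as a rough illustration of the underlying idea only, the following Python sketch enumerates single-edit misspelling candidates for a keyword. The function name and the edit-distance-1 heuristic are ours, not the authors' published method.

```python
from itertools import chain

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edit1_variants(keyword: str) -> set[str]:
    """Return all strings within edit distance 1 of `keyword` (illustrative only)."""
    splits = [(keyword[:i], keyword[i:]) for i in range(len(keyword) + 1)]
    deletes = (l + r[1:] for l, r in splits if r)
    swaps = (l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1)
    replaces = (l + c + r[1:] for l, r in splits if r for c in ALPHABET)
    inserts = (l + c + r for l, r in splits for c in ALPHABET)
    return set(chain(deletes, swaps, replaces, inserts)) - {keyword}

# Example: the variants of "oxycodone" include deletions such as
# "oxycodon" and "oxycodne", which could then be added to the keyword list.
variants = edit1_variants("oxycodone")
```

In practice, a data-driven generator such as the one cited would also rank candidates by plausibility rather than emitting every possible edit.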
We wanted to exclude noisy terms with low signal-to-noise ratios for the manual annotation phase. We manually analyzed a random sample of approximately 16 000 social media posts to identify such noisy terms. We found that 4 keywords (dope, tar, skunk, and smack) and their spelling variants occurred in more than 80% of the tweets (eFigure 2 in the Supplement). Manual review performed by one of us (A.S.) and the annotators suggested that almost all social media posts retrieved by these keywords were referring to nonopioid content. For example, the term dope is typically used in social media to indicate something is good (eg, "that song is dope"). We removed all the posts mentioning these keywords, which reduced the data set from more than 350 000 to approximately 131 000 posts, a decrease of more than 50%.
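A minimal sketch of this filtering step, assuming the tweets are held as plain strings; the term list shown contains only the 4 noisy keywords named above, whereas in practice their generated spelling variants would be included as well.

```python
import re

# Noisy terms identified by manual review (spelling variants omitted here).
NOISY_TERMS = {"dope", "tar", "skunk", "smack"}

noisy_pattern = re.compile(
    r"\b(" + "|".join(re.escape(t) for t in NOISY_TERMS) + r")\b", re.IGNORECASE
)

def keep_tweet(text: str) -> bool:
    """Drop tweets that mention any of the noisy keywords."""
    return not noisy_pattern.search(text)

tweets = ["that song is dope", "ran out of my oxy prescription again"]
filtered = [t for t in tweets if keep_tweet(t)]  # keeps only the second tweet
```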
We developed annotation guidelines using the grounded theory approach.31 First, we grouped tweets into topics and then into broad categories. Four annotation categories or classes were chosen: self-reported abuse or misuse (A), information sharing (I), unrelated (U), and non-English (E). Iterative annotation of a smaller set of 550 posts was used to develop the guidelines and to increase agreement between the annotators. For the final annotation set, disagreements were resolved by a third annotator. Further details about the annotation can be found in the pilot publication32 and eTable 1 in the Supplement.
Machine Learning Models and Classification
We used the annotated posts to train and evaluate several supervised learning algorithms and to
compare their performances. We experimented with 6 classifiers: naive bayes, decision tree,
k-nearest neighbors, random forest, support vector machine, and a deep convolutional neural
network. Tweets were preprocessed before training or evaluation by lowercasing. For the first 5 of
the 6 classifiers (or traditional classifiers), we stemmed the terms as a preprocessing step using the
Porter stemmer.
33
As features for the traditional classifiers, we used word n-grams (contiguous
JAMA Network Open | Health Informatics Machine Learning and Natural Language Processing for Opioid-Related Social Media Chatter
JAMA Network Open. 2019;2(11):e1914672. doi:10.1001/jamanetworkopen.2019.14672 (Reprinted) November 6, 2019 3/14
Downloaded From: https://jamanetwork.com/ on 11/06/2019
sequences of words) along with 2 additional engineered features (word clusters and presence and
counts of abuse-indicating terms) that we had found to be useful in our related past work.
17
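A minimal sketch of such a traditional pipeline, assuming scikit-learn and NLTK; the n-gram range and SVM parameters are illustrative rather than the paper's tuned values, and the 2 engineered features (word clusters and abuse-term counts) are omitted for brevity.

```python
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

stemmer = PorterStemmer()

def preprocess(text: str) -> str:
    # Lowercase and Porter-stem each token, as described for the traditional classifiers.
    return " ".join(stemmer.stem(tok) for tok in text.lower().split())

# Word n-grams as features; the (1, 3) range is an assumption, not the paper's value.
pipeline = Pipeline([
    ("ngrams", CountVectorizer(ngram_range=(1, 3))),
    ("svm", LinearSVC()),  # one of the 5 traditional classifiers; tuned via 10-fold CV
])

# texts, labels = load_annotated_tweets()  # hypothetical loader for the annotated set
# pipeline.fit([preprocess(t) for t in texts], labels)
```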
The sixth classifier, a deep convolutional neural network, consisted of 3 layers and used dense vector representations of words, commonly known as word embeddings,34 which were learned from a large social media data set.35 Because the word embeddings we used were learned from social media drug-related chatter, they captured the semantic representations of drug-related keywords.
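A sketch of a comparable text convolutional network in Keras, under stated assumptions: the vocabulary size, embedding dimension, filter count, and sequence length are illustrative and not the authors' exact configuration, and the pretrained drug-chatter embeddings35 would be loaded into the embedding layer rather than learned from scratch.

```python
import tensorflow as tf

VOCAB_SIZE, EMB_DIM, MAX_LEN, N_CLASSES = 20000, 400, 50, 4  # assumed sizes

model = tf.keras.Sequential([
    # Pretrained drug-chatter word embeddings would be loaded as initial weights here.
    tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM, input_length=MAX_LEN),
    tf.keras.layers.Conv1D(128, kernel_size=3, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(N_CLASSES, activation="softmax"),  # classes A, I, U, N
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```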
We randomly split the annotated posts into 3 sets: training, validation, and testing. For parameter optimization of the traditional classifiers, we combined the training and validation sets and identified optimal parameter values by using 10-fold cross-validations (eTable 2 in the Supplement). For the deep convolutional neural network, we used the validation set at training time for finding optimal parameter values, given that running 10-fold cross-validation for parameter optimization of neural networks is time consuming and hence infeasible. The best performance achieved by each classifier over the training set is presented in eTable 3 in the Supplement. To address the data imbalance between classes, we evaluated each individual classifier using random undersampling of the majority class (U) and oversampling of the pertinent smaller classes (A and I) using SMOTE (synthetic minority oversampling technique).36
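A hedged sketch of this resampling step, assuming the imbalanced-learn library; the toy arrays stand in for the real n-gram feature matrix and 4-class labels from the training split.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Toy stand-ins for the real feature matrix and imbalanced class labels.
X_train = np.random.rand(200, 20)
y_train = np.array(["U"] * 120 + ["I"] * 40 + ["A"] * 30 + ["N"] * 10)

# Randomly undersample the majority class (U)...
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)
# ...or oversample with SMOTE; by default SMOTE balances every minority class,
# so restricting it to classes A and I would use the sampling_strategy argument.
X_smote, y_smote = SMOTE(random_state=42).fit_resample(X_train, y_train)
```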
In addition, we used ensembling strategies for combining the classifications of the classifiers.
The first ensembling method was based on majority voting; the most frequent classification label by
a subset of the classifiers was chosen as the final classification. In the case of ties, the classification
by the best-performing individual classifier was used. For the second ensembling approach, we
attempted to improve recall for the 2 nonmajority classes (A and I), which represented content-rich
posts. For this system variant, if any post was classified as A or I by at least 2 classifiers, the post was
labeled as such. Otherwise, the majority rule was applied.
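The two voting rules lend themselves to a compact sketch; the classifier names and the choice of the CNN as tie-breaker (standing in for "the best-performing individual classifier") are illustrative.

```python
from collections import Counter

def majority_vote(predictions: dict[str, str], best: str = "CNN") -> str:
    """First ensembling rule: most frequent label; ties go to the best classifier."""
    counts = Counter(predictions.values())
    top_label, top_n = counts.most_common(1)[0]
    if list(counts.values()).count(top_n) > 1:  # tie between labels
        return predictions[best]
    return top_label

def biased_vote(predictions: dict[str, str], best: str = "CNN") -> str:
    """Second rule: 2 votes suffice for the content-rich classes A or I."""
    counts = Counter(predictions.values())
    for label in ("A", "I"):
        if counts.get(label, 0) >= 2:
            return label
    return majority_vote(predictions, best)

# Example with 4 hypothetical base classifiers voting on one post:
votes = {"CNN": "A", "RF": "U", "SVM": "A", "NB": "I"}
print(majority_vote(votes), biased_vote(votes))  # -> "A", "A"
```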
We used the best-performing classification strategy for all the unlabeled posts in the data set. Our goal was to study the distributions of abuse- and information-related social media chatter over time and geolocations, as past research has suggested that such analyses may reveal interesting trends.5,21,37
Statistical Analysis
We compared the performances of the classifiers using the precision, recall, and microaveraged F1 or accuracy scores. The formulas for computing the metrics were as follows, with tp representing true positives; fn, false negatives; and fp, false positives:

$$\text{recall} = \frac{tp}{tp + fn}; \quad \text{precision} = \frac{tp}{tp + fp}; \quad F_1 = \frac{2 \times \text{recall} \times \text{precision}}{\text{recall} + \text{precision}}$$
To compute the microaveraged F1 score, the tp, fp, and fn values for all of the classes are summed before calculating precision and recall. Formally,

$$F_{\text{MICRO}} = F\!\left(\sum_{c=1}^{M} tp_c,\ \sum_{c=1}^{M} fp_c,\ \sum_{c=1}^{M} fn_c\right)$$

in which F is the function to compute the metric, c indexes a label, and M is the number of labels. For a multiclass problem such as this, the microaveraged F1 score and accuracy are equal. We computed 95% CIs for the F1 scores using the bootstrap resampling technique38 with 1000 resamples.
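A sketch of this bootstrap procedure, assuming scikit-learn for the microaveraged F1 computation; y_true and y_pred stand for the gold and predicted labels on the test set.

```python
import numpy as np
from sklearn.metrics import f1_score

def bootstrap_ci(y_true, y_pred, n_resamples=1000, seed=0):
    """95% CI for micro-averaged F1 via bootstrap resampling of the test set."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for _ in range(n_resamples):
        # Resample test-set indices with replacement and rescore.
        idx = rng.integers(0, len(y_true), len(y_true))
        scores.append(f1_score(y_true[idx], y_pred[idx], average="micro"))
    return np.percentile(scores, [2.5, 97.5])

# Toy example with hypothetical labels:
# lo, hi = bootstrap_ci(["A", "U", "I", "U"], ["A", "U", "U", "U"])
```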
For geospatial analyses, we compared the abuse-indicating social media post rates from Pennsylvania with related metrics for the same period from 2 reference data sets: the WONDER database39 and the NSDUH.40 We obtained county-level yearly opioid overdose death rates from WONDER and percentages for 4 relevant substate-level measures (past month use of illicit drugs [no marijuana], past year nonmedical use of pain relievers, past year illicit drug dependence or abuse, and past year illicit drug dependence) from NSDUH. All the data collected were for the years 2012 to 2015. For the NSDUH measures, percentage values of annual means over the 3 years were obtained.
We investigated the possible correlations (Pearson and Spearman) between the known metrics and
the automatically detected abuse-indicating tweet rates and then visually compared them using
geospatial heat maps and scatterplots.
For Pearson and Spearman correlation analyses, we used the Python library SciPy, version 1.3.1. Two-tailed P < .05 was interpreted as statistical significance.
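Because the paper names SciPy explicitly, the correlation step reduces to two library calls; the arrays below are hypothetical placeholders for the county-level rates.

```python
from scipy.stats import pearsonr, spearmanr

tweet_rates = [12.1, 30.4, 8.7, 22.9]   # hypothetical abuse-indicating post rates
death_rates = [10.3, 28.1, 6.2, 19.5]   # hypothetical overdose death rates

# Each call returns the correlation coefficient and its two-tailed P value.
pearson_r, pearson_p = pearsonr(tweet_rates, death_rates)
spearman_rho, spearman_p = spearmanr(tweet_rates, death_rates)
```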
Results
We used 56 expressions of illicit and prescription opioids for data collection, with a total of 213 keywords or phrases, including spelling variants (eTable 4 in the Supplement). The annotations resulted in a final set of 9006 social media posts (6304 [70.0%] for training, 900 [10.0%] for validation, and 1802 [20.0%] for testing). There were 550 overlapping posts between the 2 annotators, and interannotator agreement was moderate (Cohen κ = 0.75).41 Of the 9006 posts, 4830 (53.6%) were unrelated to opioids, 427 (4.7%) were not in the English language, and the proportions of abuse (1748 [19.4%]) and information (2001 [22.2%]) posts were similar (eTable 5 in the Supplement).
To capture the natural variation in the distribution of posts in real time, we did not stratify the sets by class during the training or testing set splitting. Consequently, the testing set consisted of a marginally lower proportion of abuse-indicating posts (17.7%) compared with the training set (19.8%). Statistically significant variation was found in the distribution of posts mentioning prescription opioids (2257 [25.1%]) and illicit opioids (7038 [78.1%]), an approximate illicit-to-prescription ratio of 3:1. Proportions of class A and class I tweets were much higher for prescription opioid tweets (24.7% vs 18.0% for class A; 30.4% vs 20.9% for class I), whereas the proportion of class U tweets (55.1% vs 44.5%) was much higher for the illicit opioid posts (see eTable 5 in the Supplement for post distributions per class).
Model Performances
Table 1 presents the performances of the classification algorithms, showing the recall, precision, and
microaveraged F1 score and 95% CIs. Among the traditional classifiers, support vector machines (F1
score = 0.700; 95% CI, 0.681-0.718) and random forests (F1 score = 0.701; 95% CI, 0.683-0.718)
showed similar performances, outperforming the others in F1 scores. The deep convolutional neural
network outperformed all of the traditional classifiers (F1 score = 0.720; 95% CI, 0.699-0.735). The
resampling experiments did not improve performance of the individual classifiers. Both pairs of
ensemble classification strategies shown in Table 1 performed better than the individual classifiers,
with the simple majority voting ensemble of 4 classifiers (Ensemble_1) producing the best
microaveraged F1 score (0.726; 95% CI, 0.708-0.743). Performances of the classifiers were high for
class U and class N and low for class A.
The most common errors for the best-performing system (Ensemble_1) were incorrect
classification to class U, comprising 145 (79.2%) of the 183 incorrect classifications for posts originally
labeled as class A, 122 (67.4%) of the 181 incorrect classifications for posts labeled as class I, and all 4
(100%) of the incorrect classifications for posts labeled as class N (eTable 7 in the Supplement).
Temporal and Geospatial Analyses
Figure 1 shows the monthly frequency and proportion distributions of class A and class I posts. The frequencies of both categories of posts increased over time, which was unsurprising given the growth in the number of daily active Twitter users over the 3 years of study as well as greater awareness about the opioid crisis. Greater awareness is perhaps also reflected by the increasing trend in information-related tweets. However, although the volume of abuse-related chatter increased, its overall proportion in all opioid-related chatter decreased over time, from approximately 0.055 to approximately 0.042. The true signals of opioid abuse from social media were likely hidden in large volumes of other types of information as awareness about the opioid crisis increased.
Figure 2 shows the similarities between 2 sets of county-level heat maps for population-adjusted, overdose-related death rates and abuse-indicating post rates as well as a scatterplot illustrating the positive association between the 2 variables. We found a statistically significant correlation (Pearson r = 0.451, P < .001; Spearman r = 0.331, P = .004) between the county-level overdose death rates and the abuse-indicating social media posts over 3 years (n = 75). In comparison, the pioneering study by Graves et al,5 perhaps the study most similar to ours, reported a maximum (among 50 topics) Pearson correlation of 0.331 between a specific opioid-related social media topic and county-level overdose death rates. In addition, we found that the Pearson correlation coefficient increased when the threshold for the minimum number of deaths for including counties was raised. If only counties with at least 50 deaths were included, the Pearson correlation coefficient increased to 0.54; for 100 deaths, the correlation coefficient increased to 0.67.
Table 1. Performances of Different Classifiers on the Testing Set

| Classifier | Precision, A | Precision, I | Precision, U | Precision, N | Recall, A | Recall, I | Recall, U | Recall, N | Microaveraged F1 or Accuracy Score (95% CI) |
|---|---|---|---|---|---|---|---|---|---|
| Random classifier^a | 0.166 | 0.235 | 0.535 | 0.052 | 0.189 | 0.224 | 0.530 | 0.044 | 0.375 (0.360-0.394) |
| NB | 0.307 | 0.501 | 0.788 | 0.737 | 0.670 | 0.504 | 0.463 | 0.811 | 0.539 (0.518-0.558) |
| NB, random oversampling | 0.297 | 0.502 | 0.806 | 0.745 | 0.695 | 0.495 | 0.456 | 0.778 | 0.523 (0.505-0.542) |
| NB, undersampling | 0.293 | 0.620 | 0.820 | 0.735 | 0.733 | 0.454 | 0.499 | 0.867 | 0.548 (0.529-0.568) |
| NB, SMOTE | 0.319 | 0.509 | 0.793 | 0.737 | 0.651 | 0.498 | 0.526 | 0.811 | 0.555 (0.536-0.574) |
| DT | 0.389 | 0.540 | 0.725 | 0.816 | 0.371 | 0.447 | 0.783 | 0.889 | 0.638 (0.618-0.655) |
| DT, random oversampling | 0.388 | 0.510 | 0.752 | 0.818 | 0.455 | 0.476 | 0.724 | 0.900 | 0.617 (0.599-0.644) |
| DT, undersampling | 0.341 | 0.481 | 0.797 | 0.802 | 0.487 | 0.548 | 0.630 | 0.900 | 0.599 (0.579-0.617) |
| DT, SMOTE | 0.307 | 0.437 | 0.723 | 0.833 | 0.365 | 0.488 | 0.638 | 0.889 | 0.568 (0.549-0.587) |
| k-NN | 0.314 | 0.791 | 0.589 | 0.852 | 0.101 | 0.081 | 0.942 | 0.876 | 0.593 (0.574-0.612) |
| k-NN, random oversampling | 0.287 | 0.629 | 0.627 | 0.861 | 0.248 | 0.159 | 0.852 | 0.900 | 0.587 (0.567-0.607) |
| k-NN, undersampling | 0.355 | 0.474 | 0.815 | 0.781 | 0.522 | 0.572 | 0.606 | 0.911 | 0.599 (0.580-0.618) |
| k-NN, SMOTE | 0.317 | 0.446 | 0.724 | 0.868 | 0.380 | 0.493 | 0.643 | 0.878 | 0.574 (0.549-0.587) |
| SVM | 0.476 | 0.717 | 0.728 | 0.895 | 0.374 | 0.529 | 0.856 | 0.944 | 0.700 (0.681-0.718) |
| SVM, random oversampling | 0.446 | 0.657 | 0.821 | 0.895 | 0.560 | 0.756 | 0.644 | 0.944 | 0.704 (0.683-0.720) |
| SVM, undersampling | 0.409 | 0.611 | 0.862 | 0.843 | 0.629 | 0.668 | 0.667 | 0.956 | 0.675 (0.656-0.693) |
| SVM, SMOTE | 0.330 | 0.598 | 0.764 | 0.920 | 0.566 | 0.548 | 0.616 | 0.900 | 0.605 (0.587-0.624) |
| RF | 0.493 | 0.762 | 0.713 | 0.835 | 0.330 | 0.469 | 0.897 | 0.956 | 0.701 (0.683-0.718) |
| RF, random oversampling | 0.447 | 0.679 | 0.775 | 0.835 | 0.462 | 0.569 | 0.809 | 0.956 | 0.700 (0.684-0.719) |
| RF, undersampling | 0.414 | 0.561 | 0.883 | 0.791 | 0.616 | 0.688 | 0.639 | 0.967 | 0.663 (0.645-0.682) |
| RF, SMOTE | 0.379 | 0.539 | 0.771 | 0.843 | 0.465 | 0.565 | 0.688 | 0.956 | 0.634 (0.616-0.652) |
| CNN | 0.532 | 0.676 | 0.759 | 0.902 | 0.386 | 0.608 | 0.858 | 0.922 | 0.720 (0.699-0.735) |
| CNN, random oversampling | 0.532 | 0.677 | 0.758 | 0.902 | 0.386 | 0.602 | 0.860 | 0.922 | 0.720 (0.699-0.734) |
| CNN, undersampling | 0.414 | 0.551 | 0.866 | 0.902 | 0.400 | 0.565 | 0.639 | 0.922 | 0.638 (0.618-0.658) |
| CNN, SMOTE | 0.493 | 0.598 | 0.800 | 0.902 | 0.414 | 0.548 | 0.688 | 0.922 | 0.658 (0.640-0.677) |
| Ensemble_1 (CNN, RF, SVM, NB) | 0.517 | 0.721 | 0.758 | 0.887 | 0.425 | 0.565 | 0.866 | 0.956 | 0.726 (0.708-0.743)^b |
| Ensemble_biased_1 (CNN, RF, SVM, NB) | 0.489 | 0.716 | 0.780 | 0.887 | 0.506 | 0.563 | 0.836 | 0.956 | 0.721 (0.703-0.739) |
| Ensemble_2 (CNN, RF, SVM, NB, DT) | 0.482 | 0.707 | 0.743 | 0.878 | 0.377 | 0.517 | 0.875 | 0.956 | 0.709 (0.692-0.726) |
| Ensemble_biased_2 (CNN, RF, SVM, NB, DT) | 0.456 | 0.708 | 0.810 | 0.878 | 0.597 | 0.577 | 0.786 | 0.956 | 0.713 (0.696-0.730) |

Abbreviations: A, self-reported abuse or misuse; CNN, convolutional neural network; DT, decision tree; I, information sharing; k-NN, k-nearest neighbors; N, non-English; NB, naive Bayes; RF, random forest; SMOTE, synthetic minority oversampling technique; SVM, support vector machine; U, unrelated.
^a The random classifier randomly assigns 1 of the 4 classes to a tweet.
^b Best performance.
Figure 3 shows the substate-level heat maps for abuse-indicating social media posts and 4 NSDUH metrics over the same 3-year period, along with scatterplots for the 2 sets of variables. All the computed correlations and their significances are summarized in Table 2 (see eTable 6 in the Supplement for the substate information). Table 2 illustrates the consistently high correlations between abuse-indicating social media post rates and the NSDUH survey metrics over the same 3-year period (n = 13): nonmedical prescription opioid use (Pearson r = 0.683, P = .01; Spearman r = 0.346, P = .25), illicit drug use (Pearson r = 0.850, P < .001; Spearman r = 0.341, P = .25), illicit drug dependence (Pearson r = 0.937, P < .001; Spearman r = 0.495, P = .09), and illicit drug dependence or abuse (Pearson r = 0.935, P < .001; Spearman r = 0.401, P = .17). However, we could not establish statistical significance owing to the small sample sizes.
Discussion
Opioid misuse or abuse and addiction are among the most consequential and preventable public health threats in the United States.42 Social media big data, coupled with advances in data science, present a unique opportunity to monitor the problem in near real time.20,37,43-45 Because of varying volumes of noise in generic social media data, the first requirement we believe needs to be satisfied for opioid toxicosurveillance is the development of intelligent, data-centric systems that can automatically collect and curate data, a requirement this cross-sectional study addressed. We explored keyword-based data collection approaches and proposed, through empirical evaluations, supervised machine learning methods for automatic categorization of social media chatter on Twitter. The best F1 score achieved was 0.726, which was comparable to human agreement.
Figure 1. Monthly Distributions of the Frequencies and Proportions of Social Media Posts Classified as Abuse and Information in the Unlabeled Data Set Over 3 Years. [Two panels covering January 2012 to December 2014: monthly frequencies (top) and proportions (bottom) of posts classified as abuse and as information.]
Recent studies have investigated potential correlations between social media data and other sources, such as overdose death rates5 and NSDUH survey metrics.21 The primary differences between the current work and past studies are that we used a more comprehensive data collection strategy by incorporating spelling variants, and we applied supervised machine learning as a preprocessing step. Unlike purely keyword-based or unsupervised models,5,46,47 the approach we used appears to be robust at handling varying volumes of social media chatter, which is important when using social media data for monitoring and forecasting, given that the volume of data can be associated with factors such as movies or news articles, as suggested by Figure 1. The heat maps in Figures 2 and 3 show that the rates of abuse-related chatter were much higher in the more populous Pennsylvania counties (eg, Philadelphia and Allegheny), which was likely related to the social media user base being skewed to large cities. More advanced methods for adjusting or normalizing the data in large cities may further improve the correlations.
Figure 2. Comparison of County-Level Heat Maps of Opioid-Related Death Rates and Abuse-Related Social Media Post Rates in Pennsylvania, 2012-2014, and Scatterplot of the Association Between the 2 Variables. [Panel A: yearly county-level heat maps of opioid-related death rates and abuse-related post rates, 2012-2014. Panel B: scatterplot of county-level abuse-indicating post rate against county-level overdose death rate.]
Figure 3. Substate-Level Heat Maps and Scatterplots Comparing Frequencies of Abuse-Indicating Social Media Posts With 4 Survey Metrics, 2012-2014. [Panel A: substate-level abuse-indicating posts. Panels B-E: abuse-indicating post rates plotted against the NSDUH metrics for past year illicit drug dependence, past year illicit drug dependence or abuse, past month illicit drug use (no marijuana), and past year nonmedical pain reliever use.] The computed correlations and their statistical significance are summarized in Table 2. Pennsylvania substate information is found in eTable 6 in the Supplement. NSDUH indicates National Survey on Drug Use and Health.
We also found that the correlation coefficient tended to increase when only counties with
higher death rates were included. This finding suggests that Twitter-based classification may be more
reliable for counties or geolocations with higher populations and therefore higher numbers of users.
If this assertion is true, the increasing adoption of social media in recent years, specifically Twitter, is
likely to aid the proposed approach. The correlations between social media post rates and the
NSDUH metrics were consistently high, but statistical significance could not be established owing to
the smaller sample sizes.
The proposed model we present in this study enables the automatic curation of opioid misuse–
related chatter from social media despite fluctuating numbers of posts over time. The outputs of the
proposed approach correlate with related measures from other sources and therefore may be used
for obtaining near-real-time insights into the opioid crisis or for performing other analyses associated
with opioid misuse or abuse.
Classification Error Analysis
As mentioned, the most common error made by the best-performing classifier (Ensemble_1) was to misclassify social media posts to class U, whereas misclassifications to the other 3 classes occurred with much lower frequencies (eTable 7 in the Supplement). We reviewed the confusion matrices from the other classifiers and saw a similar trend. Because class U was the majority class by a wide margin, it was the category to which the classifiers tended to assign posts that lacked sufficient context. The short lengths of certain posts and the presence of misspellings or rare nonstandard expressions made it difficult for the classifiers to decipher contextual cues, a major cause of classification errors.
Lack of context in posts also hindered the manual annotations, making the categorizations dependent on the subjective assessments of the annotators. Although the final agreement level between the annotators was higher than the levels in initial iterations, it could be improved. Our previous work suggests that preparing thorough annotation guidelines and elaborate annotation strategies for social media–based studies helps in obtaining relatively high annotator agreement levels and, eventually, improved system performances.48,49 We plan to address this issue in future research.
Another factor that affected the performance of the classifiers on class A and class I was data imbalance; the relatively low number of annotated instances for these classes made it difficult for algorithms to optimally learn. The resampling experiments were not associated with improved performances, which is consistent with findings from past research.49,50 Annotating more data is likely to produce improved performances for these classes. Given that several recent studies obtained knowledge from Twitter about opioid use or abuse, combining all the available data in a distant supervision framework may be valuable.51 We will also explore the use of sentence-level contextual embeddings, which have been shown to outperform past text classification approaches.52
Table 2. Pearson and Spearman Correlations for Geolocation-Specific Abuse-Indicating Social Media Post Rates With County-Level Opioid Overdose Death Rates and 4 Metrics From the National Survey on Drug Use and Health

| Measure | Pearson r | P Value | Spearman r | P Value | No. of Data Points |
|---|---|---|---|---|---|
| Opioid overdose death rate | 0.451 | <.001^a | 0.331 | .004^a | 75 |
| Illicit drug use, no marijuana, past mo | 0.850 | <.001^a | 0.341 | .25 | 13 |
| Nonmedical use of pain relievers, past y | 0.683 | .01 | 0.346 | .25 | 13 |
| Illicit drug dependence or abuse, past y | 0.935 | <.001^a | 0.401 | .17 | 13 |
| Illicit drug dependence, past y | 0.937 | <.001^a | 0.495 | .09 | 13 |

^a Indicates statistical significance.
In future research, we plan to expand this work to other classes of drugs and prescription medications, such as stimulants and benzodiazepines. Combining machine learning and available metadata, we will estimate the patterns of drug consumption and abuse over time and across geolocations and analyze cohort-level data, building on our previous work.53
Limitations
This cross-sectional study has several limitations. First, we included only social media posts that originated from Pennsylvania. The advantage of machine learning over rule-based approaches is portability, but the possibly differing contents of social media chatter in different geolocations may reduce machine learning performance unless additional training data are added. Social media chatter is also always evolving, with new expressions introduced constantly. Therefore, systems trained with data from specific periods and geolocations may not perform optimally for other periods. The use of dense vector-based representations of texts may address this problem, as semantic representations of emerging terms may be learned from large, unlabeled data sets without requiring human annotations.
Second, the moderate interannotator agreement in this study provided a relatively low ceiling
for the machine learning classifier performance. More detailed annotation guidelines and strategies
may address this problem by making the annotation process less subjective. Furthermore, the
correlations we obtained did not necessarily indicate any higher-level associations between abuse-
related social media posts and overdose death rates and/or survey responses.
Conclusions
Big data derived from social media such as Twitter present the opportunity to perform localized
monitoring of the opioid crisis in near real time. In this cross-sectional study, we presented the
building blocks for such social media–based monitoring by proposing data collection and
classification strategies that employ natural language processing and machine learning.
ARTICLE INFORMATION
Accepted for Publication: August 4, 2019.
Published: November 6, 2019. doi:10.1001/jamanetworkopen.2019.14672
Open Access: This is an open access article distributed under the terms of the CC-BY License. © 2019 Sarker A et al.
JAMA Network Open.
Corresponding Author: Abeed Sarker, PhD, Department of Biomedical Informatics, School of Medicine, Emory University, 101 Woodruff Circle, Office 4101, Atlanta, GA 30322 (abeed@dbmi.emory.edu).
Author Affiliations: Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine,
University of Pennsylvania, Philadelphia (Sarker, Gonzalez-Hernandez); Department of Biomedical Informatics,
School of Medicine, Emory University, Atlanta, Georgia (Sarker); School of Engineering and Applied Science,
University of Pennsylvania, Philadelphia (Ruan); Department of Emergency Medicine, Perelman School of
Medicine, University of Pennsylvania, Philadelphia (Perrone).
Author Contributions: Dr Sarker had full access to all of the data in the study and takes responsibility for the
integrity of the data and the accuracy of the data analysis.
Concept and design: Sarker, Gonzalez-Hernandez, Perrone.
Acquisition, analysis, or interpretation of data: Sarker, Ruan.
Drafting of the manuscript: Sarker, Gonzalez-Hernandez.
Critical revision of the manuscript for important intellectual content: All authors.
Statistical analysis: Sarker, Ruan.
Administrative, technical, or material support: Gonzalez-Hernandez.
Supervision: Gonzalez-Hernandez, Perrone.
Conflict of Interest Disclosures: Dr Sarker reported receiving grants from the National Institute on Drug Abuse
(NIDA), grants from Pennsylvania Department of Health, and nonfinancial support from NVIDIA Corporation
during the conduct of the study as well as personal fees from the National Board of Medical Examiners, grants from
the Robert Wood Johnson Foundation, and honorarium from the National Institutes of Health (NIH) outside the
submitted work. Dr Gonzalez-Hernandez reported receiving grants from NIH/NIDA during the conduct of the
study and grants from AbbVie outside the submitted work. No other disclosures were reported.
Funding/Support: This study was funded in part by award R01DA046619 from the NIH/NIDA. The data collection
and annotation efforts were partly funded by a grant from the Pennsylvania Department of Health.
Role of the Funder/Sponsor: The funders had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.
Disclaimer: The content of this article is solely the responsibility of the authors and does not necessarily represent
the official views of NIDA or NIH.
Additional Contributions: Karen O'Connor, MS, and Alexis Upshur, BS, Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, and Annika DeRoos, College of Arts and Sciences, University of Pennsylvania, performed the annotations. Mss O'Connor and Upshur received compensation for their contributions as staff researchers, and Ms DeRoos received compensation as a sessional research assistant under the mentorship of Dr Sarker. The Titan Xp GPU used for the deep learning experiments was donated by the NVIDIA Corporation.
REFERENCES
1. National Academies of Sciences, Engineering, and Medicine; Health and Medicine Division; Board on Health Sciences Policy; Committee on Pain Management and Regulatory Strategies to Address Prescription Opioid Abuse. Pain Management and the Opioid Epidemic: Balancing Societal and Individual Benefits and Risks of Prescription Opioid Use. Washington, DC: National Academies Press; 2017.
2. National Institute on Drug Abuse. Overdose death rates. https://www.drugabuse.gov/related-topics/trends-statistics/overdose-death-rates. Published 2019. Accessed September 11, 2019.
3. Scholl L, Seth P, Kariisa M, Wilson N, Baldwin G. Drug and opioid-involved overdose deaths—United States, 2013-2017. MMWR Morb Mortal Wkly Rep. 2018;67(5152):1419-1427. doi:10.15585/mmwr.mm675152e1
4. Centers for Disease Control and Prevention. Opioid overdose: drug overdose deaths. https://www.cdc.gov/drugoverdose/data/statedeaths.html. Published 2018. Accessed September 11, 2019.
5. Graves RL, Tufts C, Meisel ZF, Polsky D, Ungar L, Merchant RM. Opioid discussion in the Twittersphere. Subst
Use Misuse. 2018;53(13):2132-2139. doi:10.1080/10826084.2018.1458319
6. Griggs CA, Weiner SG, Feldman JA. Prescription drug monitoring programs: examining limitations and future approaches. West J Emerg Med. 2015;16(1):67-70. doi:10.5811/westjem.2014.10.24197
7. Manasco AT, Griggs C, Leeds R, et al. Characteristics of state prescription drug monitoring programs: a state-by-state survey. Pharmacoepidemiol Drug Saf. 2016;25(7):847-851. doi:10.1002/pds.4003
8. Holton D, White E, McCarty D. Public health policy strategies to address the opioid epidemic. Clin Pharmacol
Ther. 2018;103(6):959-962. doi:10.1002/cpt.992
9. Kolodny A, Courtwright DT, Hwang CS, et al. The prescription opioid and heroin crisis: a public health approach to an epidemic of addiction. Annu Rev Public Health. 2015;36:559-574. doi:10.1146/annurev-publhealth-031914-122957
10. Penm J, MacKinnon NJ, Boone JM, Ciaccia A, McNamee C, Winstanley EL. Strategies and policies to address the opioid epidemic: a case study of Ohio. J Am Pharm Assoc (2003). 2017;57(2S):S148-S153. doi:10.1016/j.japh.2017.01.001
11. Kolodny A, Frieden TR. Ten steps the federal government should take now to reverse the opioid addiction
epidemic. JAMA. 2017;318(16):1537-1538. doi:10.1001/jama.2017.14567
12. Fung IC, Tse ZT, Fu KW. The use of social media in public health surveillance. Western Pac Surveill Response J.
2015;6(2):3-6. doi:10.5365/wpsar.2015.6.1.019
13. Chan B, Lopez A, Sarkar U. The canary in the coal mine tweets: social media reveals public perceptions of
non-medical use of opioids. PLoS One. 2015;10(8):e0135072. doi:10.1371/journal.pone.0135072
14. Sarker A, Ginn R, Nikfarjam A, et al. Utilizing social media data for pharmacovigilance: a review. J Biomed Inform. 2015;54:202-212. doi:10.1016/j.jbi.2015.02.004
15. Velasco E, Agheneza T, Denecke K, Kirchner G, Eckmanns T. Social media and internet-based data in global systems for public health surveillance: a systematic review. Milbank Q. 2014;92(1):7-33. doi:10.1111/1468-0009.12038
16. Phan N, Chun SA, Bhole M, Geller J. Enabling real-time drug abuse detection in tweets. In: 2017 IEEE 33rd International Conference on Data Engineering (ICDE). Piscataway, NJ: IEEE; 2017.
17. Sarker A, O’Connor K, Ginn R, et al. Social media mining for toxicovigilance: automatic monitoring of
prescription medication abuse from Twitter. Drug Saf. 2016;39(3):231-240. doi:10.1007/s40264-015-0379-4
18. Cherian R, Westbrook M, Ramo D, Sarkar U. Representations of codeine misuse on Instagram: content analysis. JMIR Public Health Surveill. 2018;4(1):e22. doi:10.2196/publichealth.8144
19. Pew Research Center. Social media fact sheet. https://www.pewinternet.org/fact-sheet/social-media/. Published June 12, 2019. Accessed September 1, 2019.
20. Chary M, Genes N, McKenzie A, Manini AF. Leveraging social networks for toxicovigilance. J Med Toxicol.
2013;9(2):184-191. doi:10.1007/s13181-013-0299-6
21. Chary M, Genes N, Giraud-Carrier C, Hanson C, Nelson LS, Manini AF. Epidemiology from tweets: estimating misuse of prescription opioids in the USA from social media. J Med Toxicol. 2017;13(4):278-286. doi:10.1007/s13181-017-0625-5
22. Bigeard E, Grabar N, Thiessard F. Detection and analysis of drug misuses: a study based on social media messages. Front Pharmacol. 2018;9:791. doi:10.3389/fphar.2018.00791
23. Buntain C, Golbeck J. This is your Twitter on drugs. Any questions? In: Proceedings of the 24th International
Conference on World Wide Web. WWW ’15 Companion. New York, NY: ACM; 2015:777-782.
24. Shutler L, Nelson LS, Portelli I, Blachford C, Perrone J. Drug use in the Twittersphere: a qualitative contextual analysis of tweets about prescription drugs. J Addict Dis. 2015;34(4):303-310. doi:10.1080/10550887.2015.1074505
25. Tufts C, Polsky D, Volpp KG, et al. Characterizing tweet volume and content about common health conditions
across Pennsylvania: retrospective analysis. JMIR Public Health Surveill. 2018;4(4):e10834. doi:10.2196/10834
26. Wang Y, Callan J, Zheng B. Should we use the sample? analyzing datasets sampled from Twitter's stream API. ACM Trans Web. 2015;3(13):1-23. doi:10.1145/2746366
27. Schwartz H, Eichstaedt J, Kern M, et al. Characterizing geographic variation in well-being using tweets. Seventh International AAAI Conference on Weblogs and Social Media. https://www.aaai.org/ocs/index.php/ICWSM/ICWSM13/paper/view/6138. Accessed October 2, 2019.
28. Drug Facts. US Drug Enforcement Administration website. https://www.dea.gov/factsheets. Accessed September
11, 2019.
29. Han B, Cook P, Baldwin T. Lexical normalization for social media text. ACM Trans Intell Syst Technol. 2013;4(1):1-27. doi:10.1145/2414425.2414430
30. Sarker A, Gonzalez-Hernandez G. An unsupervised and customizable misspelling generator for mining noisy
health-related text sources. J Biomed Inform. 2018;88:98-107. doi:10.1016/j.jbi.2018.11.007
31. Martin PY, Turner BA. Grounded theory and organizational research. J Appl Behav Sci. 1986;22(2):141-157. doi:10.1177/002188638602200207
32. Sarker A, Gonzalez-Hernandez G, Perrone J. Towards automating location-specific opioid toxicosurveillance from Twitter via data science methods. Stud Health Technol Inform. 2019;264:333-337. doi:10.3233/SHTI190238
33. Porter MF. An algorithm for suffix stripping. Program. 1980;14(3):130-137. doi:10.1108/eb046814
34. Mikolov T, Sutskever I, Chen K, Corrado G, Dean J. Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems 26 (NIPS 2013). San Diego, CA: Neural Information Processing Systems Foundation Inc; 2013:1-9.
35. Sarker A, Gonzalez G. A corpus for mining drug-related knowledge from Twitter chatter: language models and
their utilities. Data Brief. 2016;10:122-131. doi:10.1016/j.dib.2016.11.056
36. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. J Artif
Intell Res. 2002;16:321-357. doi:10.1613/jair.953
37. Hanson CL, Burton SH, Giraud-Carrier C, West JH, Barnes MD, Hansen B. Tweaking and tweeting: exploring
Twitter for nonmedical use of a psychostimulant drug (Adderall) among college students. J Med Internet Res.
2013;15(4):e62. doi:10.2196/jmir.2503
38. Efron B. Bootstrap methods: another look at the jackknife. Ann Stat. 1979;7(1):1-26. doi:10.1214/aos/1176344552
39. Centers for Disease Control and Prevention. CDC WONDER. https://wonder.cdc.gov/. Accessed October 2, 2019.
40. Substance Abuse and Mental Health Services Administration. Substate estimates of substance use and mental illness from the 2012-2014 NSDUH: results and detailed tables. https://www.samhsa.gov/samhsa-data-outcomes-quality/major-data-collections/state-reports-NSDUH/2012-2014-substate-reports. Accessed October 2, 2019.
41. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960;20(1):37-46. doi:10.1177/
001316446002000104
42. Gostin LO, Hodge JG Jr, Noe SA. Reframing the opioid epidemic as a national emergency. JAMA. 2017;318(16):
1539-1540. doi:10.1001/jama.2017.13358
43. Katsuki T, Mackey TK, Cuomo R. Establishing a link between prescription drug abuse and illicit online pharmacies: analysis of Twitter data. J Med Internet Res. 2015;17(12):e280. doi:10.2196/jmir.5144
44. Yang X, Luo J. Tracking illicit drug dealing and abuse on Instagram using multimodal analysis. ACM Trans Intell
Syst Technol. 2017;8(4):1-15. doi:10.1145/3011871
45. Cameron D, Smith GA, Daniulaityte R, et al. PREDOSE: a semantic web platform for drug abuse epidemiology
using social media. J Biomed Inform. 2013;46(6):985-997. doi:10.1016/j.jbi.2013.07.007
46. Paul MJ, Dredze M, Broniatowski D. Twitter improves influenza forecasting. PLoS Curr. 2014;6:1-13. doi:10.1371/currents.outbreaks.90b9ed0f59bae4ccaa683a39865d9117
47. Sharpe JD, Hopkins RS, Cook RL, Striley CW. Evaluating Google, Twitter, and Wikipedia as tools for influenza surveillance using bayesian change point analysis: a comparative analysis. JMIR Public Health Surveill. 2016;2(2):e161. doi:10.2196/publichealth.5901
48. Klein A, Sarker A, Rouhizadeh M, O'Connor K, Gonzalez G. Detecting personal medication intake in Twitter: an annotated corpus and baseline classification system. In: Proceedings of the BioNLP 2017 Workshop. Vancouver, Canada: Association for Computational Linguistics; 2017:136-142.
49. Sarker A, Belousov M, Friedrichs J, et al. Data and systems for medication-related text classification and concept normalization from Twitter: insights from the Social Media Mining for Health (SMM4H)-2017 shared task. J Am Med Inform Assoc. 2018;25(10):1274-1283. doi:10.1093/jamia/ocy114
50. Klein AZ, Sarker A, Weissenbacher D, Gonzalez-Hernandez G. Automatically detecting self-reported birth
defect outcomes on Twitter for large-scale epidemiological research [published online October 22, 2018]. arXiv.
doi:10.1038/s41746-019-0170-5
51. Sahni T, Chandak C, Chedeti NR, Singh M. Efficient Twitter sentiment classification using subjective distant
supervision. In: 2017 9th International Conference on Communication Systems and Networks (COMSNETS).
Piscataway, NJ: IEEE; 2017:548-553. doi:10.1109/COMSNETS.2017.7945451
52. Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language
understanding. In: Proceedings of NAACL-HLT 2019. Minneapolis, MN: Association for Computational Linguistics;
2019:4171-4186.
53. Sarker A, Chandrashekar P, Magge A, Cai H, Klein A, Gonzalez G. Discovering cohorts of pregnant women from
social media for safety surveillance and analysis. J Med Internet Res. 2017;19(10):e361. doi:10.2196/jmir.8164
SUPPLEMENT.
eFigure 1. Frequencies of Misspellings of Six Opioid Keywords Relative to the Frequencies of the Original Spellings
eFigure 2. Distribution of Opioid-Related Keywords in a Sample of 16,320 Tweets
eTable 1. Definitions of the Four Annotation Categories
eTable 2. Optimal Parameter Values for the Different Classifiers Presented
eTable 3. Class-Specific Recall and Precision, Average Accuracy and Standard Deviation Over Ten Folds for Each
Classifier
eTable 4. Opioid Keywords and Spelling Variants
eTable 5. Distribution of Tweet Classes Across the Training and the Evaluation Sets
eTable 6. Counties Within Each Substate in Pennsylvania
eTable 7. Confusion Matrices Illustrating Common Errors Made by the 2 Best Performing Systems (Ensemble_1 and
Ensemble_biased_1 in Table 1)
... A review of Twitter data [8] revealed numerous mentions of opioid terms such as fentanyl, heroin, and morphine in tweets. As reported by numerous past studies [9], many social media subscribers openly discuss their substance use with their online networks, even if they might not feel comfortable discussing these topics with their doctors. Recent studies [10][11][12] also suggest the potential to use social media data to supplement survey results in studying psychoactive substances and their effects. ...
... These works primarily utilize data from scientific literature [26] or clinical research studies [21] to extract crucial insights pertaining to the subject matter. Another line of research [9,[27][28][29] employed machine learning algorithms to classify social media data, such as tweets [9] or Reddit posts [30], and determine patterns of drug misuse, providing valuable insights into the public perception and understanding of the issue. Through diverse NLP techniques, these studies were able to extract and analyse textual data, uncovering trends and common themes associated with drug misuse within these virtual communities. ...
Article
Background Substance use, including the non-medical use of prescription medications, is a global health problem resulting in hundreds of thousands of overdose deaths and other health problems. Social media has emerged as a potent source of information for studying substance use-related behaviours and their consequences. Mining large-scale social media data on the topic requires the development of natural language processing (NLP) and machine learning frameworks customized for this problem. Our objective in this research is to develop a framework for conducting a content analysis of Twitter chatter about the non-medical use of a set of prescription medications. Methods We collected Twitter data for four medications—fentanyl and morphine (opioids), alprazolam (benzodiazepine), and Adderall® (stimulant), and identified posts that indicated non-medical use using an automatic machine learning classifier. In our NLP framework, we applied supervised named entity recognition (NER) to identify other substances mentioned, symptoms, and adverse events. We applied unsupervised topic modelling to identify latent topics associated with the chatter for each medication. Results The quantitative analysis demonstrated the performance of the proposed NER approach in identifying substance-related entities from data with a high degree of accuracy compared to the baseline methods. The performance evaluation of the topic modelling was also notable. The qualitative analysis revealed knowledge about the use, non-medical use, and side effects of these medications in individuals and communities. Conclusions NLP-based analyses of Twitter chatter associated with prescription medications belonging to different categories provide multi-faceted insights about their use and consequences. Our developed framework can be applied to chatter about other substances. Further research can validate the predictive value of this information on the prevention, assessment, and management of these disorders.
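The unsupervised topic-modelling step this framework describes can be illustrated with a minimal sketch; the tweets, the number of topics, and the scikit-learn pipeline below are illustrative assumptions, not the authors' actual implementation or data:

```python
# Minimal LDA topic-modelling sketch over a few invented tweets.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tweets = [                                   # hypothetical examples
    "took fentanyl again for the back pain",
    "adderall got me through finals week",
    "morphine drip after surgery was rough",
    "selling my leftover adderall, dm me",
]

# LDA operates on bag-of-words counts.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(tweets)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Show the top words that define each latent topic.
vocab = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top = [vocab[i] for i in topic.argsort()[-5:][::-1]]
    print(f"topic {idx}: {top}")
```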
... NLP-based Opioid Abuse Analysis Recently, with the development of NLP technology, studies have been actively conducted to analyze information relevant to opioid abuse and OOD from text (e.g. EHR notes, social media) [32,6,14,48,34]. Studies have explored a broad range of NLP techniques to identify OUD [48]. ...
... Goodman-Meza et al. [14] utilized text features such as term frequency-inverse document frequency (TF-IDF), concept unique identifier (CUI) embeddings, and word embeddings to analyze substances that contribute to opioid overdose deaths. Sarker et al. [32] conducted a geospatial and temporal analysis of opioid-related mentions in Twitter posts. They found positive correlations between the rate of opioid abuse-indicating posts and both opioid misuse rates and county-level overdose death rates. ...
Preprint
Opioid-related aberrant behaviors (ORAB) present novel risk factors for opioid overdose. Previously, ORAB have mainly been assessed through survey results and by monitoring drug administrations. Such methods, however, cannot scale up and do not cover the entire spectrum of aberrant behaviors. On the other hand, ORAB are widely documented in electronic health record (EHR) notes. This paper introduces a novel biomedical natural language processing benchmark dataset named ODD, for ORAB Detection Dataset. ODD is an expert-annotated dataset comprising more than 750 publicly available EHR notes. ODD has been designed to identify ORAB from patients' EHR notes and classify them into nine categories: 1) Confirmed Aberrant Behavior, 2) Suggested Aberrant Behavior, 3) Opioids, 4) Indication, 5) Diagnosed Opioid Dependency, 6) Benzodiazepines, 7) Medication Changes, 8) Central Nervous System-Related, and 9) Social Determinants of Health. We explored two state-of-the-art natural language processing (NLP) approaches (fine-tuning pretrained language models and prompt-tuning) to identify ORAB. Experimental results show that the prompt-tuning models outperformed the fine-tuning models in most categories, and the gains were especially high among the uncommon categories (Suggested Aberrant Behavior, Diagnosed Opioid Dependency, and Medication Changes). Although the best model achieved the highest area under the precision-recall curve (83.92%), the uncommon classes still leave substantial room for performance improvement.
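As a rough illustration of the fine-tuning baseline this preprint evaluates, the sketch below attaches a classification head to a generic pretrained encoder; the model name, the binary label set, and the toy notes are assumptions for illustration (the actual benchmark uses nine categories and expert annotations):

```python
# Hedged sketch: fine-tuning a pretrained transformer for note classification.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

texts = ["patient requested an early refill again",   # invented notes
         "no aberrant behavior documented today"]
labels = torch.tensor([1, 0])                          # toy binary labels

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):                                     # toy training loop
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.eval()
print(model(**batch).logits.argmax(dim=-1))            # predicted classes
```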
... Virtually every area of criminology and crime research has been -to some extent -explored by computational approaches: from white collar crime (Ribeiro et al., 2018;Luna-Pla & Nicolás-Carlock, 2020;Kertész & Wachs, 2021) to terrorism (Moon & Carley, 2007;Chuang et al., 2019;Campedelli et al., 2021), from illicit drugs (Mackey et al., 2018;Sarker et al., 2019;Magliocca et al., 2019) to organized crime (Nardin et al., 2016;Troitzsch, 2017;Calderoni et al., 2021), from gun violence (Mohler, 2014;Green et al., 2017;Loeffler & Flaxman, 2018) to cyber-crime (Shalaginov et al., 2017;Duxbury & Haynie, 2018;, from recidivism (Tollenaar & van der Heijden, 2013;Duwe & Kim, 2017;Berk & Elzarka, 2020) to predictive policing (Mohler et al., 2011;Caplan et al., 2011;Perry, 2013). Particularly, the dialogue between computational methods and the study of recidivism and predictive policing not only focused on technical innovations to optimize forecasting and predictive models, but also provoked vivid debates regarding critical issues of algorithmic accountability, fairness, and transparency (Lum & Isaac, 2016;Dressel & Farid, 2018;Richardson et al., 2019;Akpinar et al., 2021;Purves, 2022). ...
Article
Urban agglomerations are constantly and rapidly evolving ecosystems, with globalization and increasing urbanization posing new challenges in sustainable urban development, well summarized in the United Nations' Sustainable Development Goals (SDGs). The advent of the digital age, with the modern alternative data sources it has generated, provides new tools to tackle these challenges at spatio-temporal scales that were previously unavailable with census statistics. In this review, we present how new digital data sources are employed to provide data-driven insights to study and track (i) urban crime and public safety; (ii) socioeconomic inequalities and segregation; and (iii) public health, with a particular focus on the city scale.
... Often containing billions of individual data points, social media language allows for both fine-grained temporal [33] and spatial [34] analysis. In the domain of OPM, the typical application is to measure the frequency of opioid-related keywords (e.g., mentions of the word "opioid" or "fentanyl") in order to track mortality rates or prescriptions [35][36][37][38][39][40]. While such keyword approaches can accurately predict real-world outcomes, they may fail to fully characterize communities, analogous to examining over-prescribing. ...
Article
Opioid poisoning mortality is a substantial public health crisis in the United States, with opioids involved in approximately 75% of the nearly 1 million drug-related deaths since 1999. Research suggests that the epidemic is driven by both over-prescribing and social and psychological determinants such as economic stability, hopelessness, and isolation. Hindering this research is a lack of measurements of these social and psychological constructs at fine-grained spatial and temporal resolutions. To address this issue, we use a multi-modal data set consisting of natural language from Twitter, psychometric self-reports of depression and well-being, and traditional area-based measures of socio-demographics and health-related risk factors. Unlike previous work using social media data, we do not rely on opioid- or substance-related keywords to track community poisonings. Instead, we leverage a large, open vocabulary of thousands of words in order to fully characterize communities suffering from opioid poisoning, using a sample of 1.5 billion tweets from 6 million U.S. county-mapped Twitter users. Results show that Twitter language predicted opioid poisoning mortality better than factors relating to socio-demographics, access to healthcare, physical pain, and psychological well-being. Additionally, risk factors revealed by the Twitter language analysis included negative emotions, discussions of long work hours, and boredom, whereas protective factors included resilience, travel/leisure, and positive emotions, dovetailing with results from the psychometric self-report data. The results show that natural language from public social media can be used as a surveillance tool for both predicting community opioid poisonings and understanding the dynamic social and psychological nature of the epidemic.
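The open-vocabulary idea can be sketched minimally, assuming scikit-learn, pooled per-county text, and invented mortality figures (the study's actual models and 1.5-billion-tweet corpus are far richer):

```python
# Regress a county-level outcome on open-vocabulary text features
# instead of a fixed opioid keyword list. All inputs are fabricated.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

county_text = [                      # pooled tweets, one string per county
    "double shifts again so bored nothing changes here",
    "great hike this weekend feeling grateful and rested",
    "long hours no sleep everything is hopeless lately",
]
mortality = [14.2, 6.1, 12.8]        # hypothetical deaths per 100,000

X = TfidfVectorizer().fit_transform(county_text)
model = Ridge(alpha=1.0).fit(X, mortality)
print(model.predict(X))              # in-sample predictions for the toy data
```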
Article
Background: Delta-8 tetrahydrocannabinol (THC) has experienced significant cultivation, use, and online marketing growth in recent years. Objectives: This study utilized natural language processing on Twitter data to examine trends in public discussions regarding this novel psychoactive substance. Methods: This study analyzed the frequency of #Delta8 tweets over time, most commonly used words, sentiment classification of words in tweets, and a qualitative analysis of a random sample of tweets containing the hashtag "Delta8" from January 1, 2020 to September 26, 2021. Results: A total of 41,828 tweets were collected, with 30,826 unique tweets (73.7%) and 11,002 quotes, retweets, or replies (26.3%). Tweet activity increased from 2020 to 2021, with daily original tweets rising from 8.55 to 149. This increase followed a high-engagement retailer promotion in June 2021. Commonly used terms included "cbd," "cannabis," "edibles," and "cbdoil." Sentiment classification revealed a predominance of "positive" (30.93%) and "trust" (14.26%) categorizations, with 8.42% classified as "negative." Qualitative analysis identified 20 codes, encompassing substance type, retailers, links, and other characteristics. Conclusion: Twitter discussions on Delta-8 THC exhibited a sustained increase in prevalence from 2020 to 2022, with online retailers playing a dominant role. The content also demonstrated significant overlap with cannabidiol and various cannabis products. Given the growing presence of retailer marketing and sales on social media, it is crucial for public health researchers to monitor and promote relevant Delta-8 health recommendations on these platforms to ensure a balanced conversation.
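The frequency-over-time analysis is straightforward to reproduce in outline; a sketch with pandas and invented timestamps (not the study's collection):

```python
# Count #Delta8 tweets per calendar day from timestamped rows.
import pandas as pd

df = pd.DataFrame({"created_at": pd.to_datetime([
    "2021-06-01 09:10", "2021-06-01 18:42", "2021-06-02 11:05",
    "2021-06-03 20:30", "2021-06-03 21:12"])})

daily = df.set_index("created_at").resample("D").size()
print(daily)   # daily tweet counts; spikes can be matched to promotions
```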
Article
Background: Due to the high burden of chronic pain, and the detrimental public health consequences of its treatment with opioids, there is a high-priority need to identify effective alternative therapies. Social media is a potentially valuable resource for knowledge about self-reported therapies by chronic pain sufferers. Methods: We attempted to (a) verify the presence of large-scale chronic pain-related chatter on Twitter, (b) develop natural language processing and machine learning methods for automatically detecting self-disclosures, (c) collect longitudinal data posted by them, and (d) semiautomatically analyze the types of chronic pain-related information reported by them. We collected data using chronic pain-related hashtags and keywords and manually annotated 4,998 posts to indicate if they were self-reports of chronic pain experiences. We trained and evaluated several state-of-the-art supervised text classification models and deployed the best-performing classifier. We collected all publicly available posts from detected cohort members and conducted manual and natural language processing-driven descriptive analyses. Results: Interannotator agreement for the binary annotation was 0.82 (Cohen's kappa). The RoBERTa model performed best (F1 score: 0.84; 95% confidence interval: 0.80 to 0.89), and we used this model to classify all collected unlabeled posts. We discovered 22,795 self-reported chronic pain sufferers and collected over 3 million of their past posts. Further analyses revealed information about, but not limited to, alternative treatments, patient sentiments about treatments, side effects, and self-management strategies. Conclusion: Our social media-based approach will result in an automatically growing large cohort over time, and the data can be leveraged to identify effective opioid-alternative therapies for diverse chronic pain types.
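The reported interannotator agreement (Cohen's kappa, also reference 41 above) can be computed directly; a sketch with scikit-learn and two invented annotators:

```python
# Cohen's kappa = (p_o - p_e) / (1 - p_e): observed agreement corrected
# for the agreement expected by chance.
from sklearn.metrics import cohen_kappa_score

annotator_a = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = self-report of chronic pain
annotator_b = [1, 0, 1, 0, 0, 0, 1, 1]
print(cohen_kappa_score(annotator_a, annotator_b))
```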
Article
Components of artificial intelligence (AI) for analysing social big data, such as natural language processing (NLP) algorithms, have improved the timeliness and robustness of health data. NLP techniques have been implemented to analyse large volumes of text from social media platforms to gain insights on disease symptoms, understand barriers to care and predict disease outbreaks. However, AI-based decisions may contain biases that could misrepresent populations, skew results or lead to errors. Bias, within the scope of this paper, is described as the difference between the predictive values and true values within the modelling of an algorithm. Bias within algorithms may lead to inaccurate healthcare outcomes and exacerbate health disparities when results derived from these biased algorithms are applied to health interventions. Researchers who implement these algorithms must consider when and how bias may arise. This paper explores algorithmic biases as a result of data collection, labelling and modelling of NLP algorithms. Researchers have a role in ensuring that efforts towards combating bias are enforced, especially when drawing health conclusions derived from social media posts that are linguistically diverse. Through the implementation of open collaboration, auditing processes and the development of guidelines, researchers may be able to reduce bias and improve NLP algorithms that improve health surveillance.
Article
Intimate partner violence (IPV) increased during the COVID-19 pandemic. Collecting actionable IPV-related data from conventional sources (e.g., medical records) was challenging during the pandemic, generating a need to obtain relevant data from non-conventional sources, such as social media. Social media, like Reddit, is a preferred medium of communication for IPV survivors to share their experiences and seek support with protected anonymity. Nevertheless, the scope of available IPV-related data on social media is rarely documented. Thus, we examined the availability of IPV-related information on Reddit and the characteristics of the reported IPV during the pandemic. Using natural language processing, we collected publicly available Reddit data from four IPV-related subreddits between January 1, 2020 and March 31, 2021. Of 4,000 collected posts, we randomly sampled 300 posts for analysis. Three individuals on the research team independently coded the data and resolved the coding discrepancies through discussions. We adopted quantitative content analysis and calculated the frequency of the identified codes. 36% of the posts (n = 108) constituted self-reported IPV by survivors, of which 40% regarded current/ongoing IPV, and 14% contained help-seeking messages. A majority of the survivors' posts reflected psychological aggression, followed by physical violence. Notably, 61.4% of the psychological aggression involved expressive aggression, followed by gaslighting (54.3%) and coercive control (44.3%). Survivors' top three needs during the pandemic were hearing similar experiences, legal advice, and validating their feelings/reactions/thoughts/actions. Albeit limited, data from bystanders (survivors' friends, family, or neighbors) were also available. Rich data reflecting IPV survivors' lived experiences were available on Reddit. Such information will be useful for IPV surveillance, prevention, and intervention.
Article
Importance: Despite compelling evidence that statins are safe, are generally well tolerated, and reduce cardiovascular events, statins are underused even in patients with the highest risk. Social media may provide contemporary insights into public perceptions about statins. Objective: To characterize and classify public perceptions about statins that were gleaned from more than a decade of statin-related discussions on Reddit, a widely used social media platform. Design, setting, and participants: This qualitative study analyzed all statin-related discussions on the social media platform that were dated between January 1, 2009, and July 12, 2022. Statin- and cholesterol-focused communities were identified to create a list of statin-related discussions. An artificial intelligence (AI) pipeline was developed to cluster these discussions into specific topics and overarching thematic groups. The pipeline consisted of a semisupervised natural language processing model (BERT [Bidirectional Encoder Representations from Transformers]), a dimensionality reduction technique, and a clustering algorithm. The sentiment for each discussion was labeled as positive, neutral, or negative using a pretrained BERT model. Exposures: Statin-related posts and comments containing the terms statin and cholesterol. Main outcomes and measures: Statin-related topics and thematic groups. Results: A total of 10 233 unique statin-related discussions (961 posts and 9272 comments) from 5188 unique authors were identified. The number of statin-related discussions increased by a mean (SD) of 32.9% (41.1%) per year. A total of 100 discussion topics were identified and were classified into 6 overarching thematic groups: (1) ketogenic diets, diabetes, supplements, and statins; (2) statin adverse effects; (3) statin hesitancy; (4) clinical trial appraisals; (5) pharmaceutical industry bias and statins; and (6) red yeast rice and statins. The sentiment analysis revealed that most discussions had a neutral (66.6%) or negative (30.8%) sentiment. Conclusions and relevance: Results of this study demonstrated the potential of an AI approach to analyze large, contemporary, publicly available social media data and generate insights into public perceptions about statins. This information may help guide strategies for addressing barriers to statin use and adherence.
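A compressed sketch of the embed-reduce-cluster pipeline the study describes, with assumed components (a small sentence-transformers model, PCA, and k-means) standing in for the study's BERT model, dimensionality reduction technique, and clustering algorithm:

```python
# Embed discussions, reduce dimensionality, then cluster into topics.
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

posts = [                                        # invented discussions
    "statins wrecked my ketogenic diet",
    "red yeast rice instead of a statin?",
    "muscle aches stopped after switching statins",
    "the trial data on statins look convincing",
]

embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(posts)
reduced = PCA(n_components=2).fit_transform(embeddings)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)
print(labels)                                    # one cluster id per post
```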
Article
Social media has recently been used to identify and study a small cohort of Twitter users whose pregnancies with birth defect outcomes—the leading cause of infant mortality—could be observed via their publicly available tweets. In this study, we exploit social media on a larger scale by developing natural language processing (NLP) methods to automatically detect, among thousands of users, a cohort of mothers reporting that their child has a birth defect. We used 22,999 annotated tweets to train and evaluate supervised machine learning algorithms—feature-engineered and deep learning-based classifiers—that automatically distinguish tweets referring to the user’s pregnancy outcome from tweets that merely mention birth defects. Because 90% of the tweets merely mention birth defects, we experimented with under-sampling and over-sampling approaches to address this class imbalance. An SVM classifier achieved the best performance for the two positive classes: an F1-score of 0.65 for the “defect” class and 0.51 for the “possible defect” class. We deployed the classifier on 20,457 unlabeled tweets that mention birth defects, which helped identify 542 additional users for potential inclusion in our cohort. Contributions of this study include (1) NLP methods for automatically detecting tweets by users reporting their birth defect outcomes, (2) findings that an SVM classifier can outperform a deep neural network-based classifier for highly imbalanced social media data, (3) evidence that automatic classification can be used to identify additional users for potential inclusion in our cohort, and (4) a publicly available corpus for training and evaluating supervised machine learning algorithms.
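One of the imbalance strategies mentioned above (under-sampling the 90% majority class before training an SVM) can be sketched as follows; the tweets, labels, and TF-IDF features are placeholders rather than the study's engineered features:

```python
# Randomly under-sample the majority class, then train a linear SVM.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

texts = np.array([
    "my daughter was born with spina bifida",      # user's own outcome
    "donate for birth defect awareness month",     # mere mentions below
    "walk for birth defects this sunday",
    "retweet to support birth defect research",
])
y = np.array([1, 0, 0, 0])                         # 1 = rare positive class

rng = np.random.default_rng(0)
majority = np.flatnonzero(y == 0)
keep = np.concatenate([np.flatnonzero(y == 1),
                       rng.choice(majority, size=2, replace=False)])

X = TfidfVectorizer().fit_transform(texts[keep])
clf = LinearSVC().fit(X, y[keep])                  # near-balanced training set
```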
Article
Social media may serve as an important platform for near real-time monitoring of population-level opioid abuse. Our objectives for this study were to (i) manually characterize a sample of opioid-mentioning Twitter posts, (ii) compare the rates of abuse/misuse-related posts between prescription and illicit opioids, and (iii) implement and evaluate the performances of supervised machine learning algorithms for the characterization of opioid-related chatter, which can potentially automate social media-based monitoring in the future. We annotated a total of 9006 tweets into four categories, trained several machine learning algorithms, and compared their performances. Deep convolutional neural networks marginally outperformed support vector machines and random forests, with an accuracy of 70.4%. Lack of context in tweets and data imbalance resulted in the misclassification of many tweets into the majority class. The automatic classification experiments produced promising results, although there is room for improvement.
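A skeletal version of the convolutional classifier family this abstract reports on, written with Keras; the layer sizes, vocabulary, and random inputs are assumptions, not the authors' architecture:

```python
# Toy 1-D convolutional text classifier over integer token ids.
import numpy as np
from tensorflow.keras import layers, models

vocab_size, max_len, n_classes = 5000, 50, 4       # 4 tweet categories
X = np.random.randint(1, vocab_size, size=(32, max_len))  # fake token ids
y = np.random.randint(0, n_classes, size=(32,))

model = models.Sequential([
    layers.Embedding(vocab_size, 64),
    layers.Conv1D(128, 5, activation="relu"),      # n-gram-like filters
    layers.GlobalMaxPooling1D(),
    layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=1, verbose=0)
```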
Article
The 63,632 drug overdose deaths in the United States in 2016 represented a 21.4% increase from 2015; two thirds of these deaths involved an opioid (1). From 2015 to 2016, drug overdose deaths increased in all drug categories examined; the largest increase occurred among deaths involving synthetic opioids other than methadone (synthetic opioids), which includes illicitly manufactured fentanyl (IMF) (1). Since 2013, driven largely by IMF, including fentanyl analogs (2-4), the current wave of the opioid overdose epidemic has been marked by increases in deaths involving synthetic opioids. IMF has contributed to increases in overdose deaths, with geographic differences reported (1). CDC examined state-level changes in death rates involving all drug overdoses in 50 states and the District of Columbia (DC) and those involving synthetic opioids in 20 states, during 2013-2017. In addition, changes in death rates from 2016 to 2017 involving all opioids and opioid subcategories were examined by demographics, county urbanization levels, and by 34 states and DC. Among 70,237 drug overdose deaths in 2017, 47,600 (67.8%) involved an opioid. From 2013 to 2017, drug overdose death rates increased in 35 of 50 states and DC, and significant increases in death rates involving synthetic opioids occurred in 15 of 20 states, likely driven by IMF (2,3). From 2016 to 2017, overdose deaths involving all opioids and synthetic opioids increased, but deaths involving prescription opioids and heroin remained stable. The opioid overdose epidemic continues to worsen and evolve because of the continuing increase in deaths involving synthetic opioids. Provisional data from 2018 indicate potential improvements in some drug overdose indicators; however, analysis of final data from 2018 is necessary for confirmation. More timely and comprehensive surveillance data are essential to inform efforts to prevent and respond to opioid overdoses; intensified prevention and response measures are urgently needed to curb deaths involving prescription and illicit opioids, specifically IMF.
Article
Background: Data collection and extraction from noisy text sources such as social media typically rely on keyword-based searching/listening. However, health-related terms are often misspelled in such noisy text sources due to their complex morphology, resulting in the exclusion of relevant data for studies. In this paper, we present a customizable data-centric system that automatically generates common misspellings for complex health-related terms, which can improve the data collection process from noisy text sources. Materials and methods: The spelling variant generator relies on a dense vector model learned from large, unlabeled text, which is used to find semantically close terms to the original/seed keyword, followed by the filtering of terms that are lexically dissimilar beyond a given threshold. The process is executed recursively, converging when no new terms similar (lexically and semantically) to the seed keyword are found. The weighting of intra-word character sequence similarities allows further problem-specific customization of the system. Results: On a dataset prepared for this study, our system outperforms the current state-of-the-art medication name variant generator with best F1-score of 0.69 and F14-score of 0.78. Extrinsic evaluation of the system on a set of cancer-related terms showed an increase of over 67% in retrieval rate from Twitter posts when the generated variants are included. Discussion: Our proposed spelling variant generator has several advantages over the existing spelling variant generators: (i) it is capable of filtering out lexically similar but semantically dissimilar terms, (ii) the number of variants generated is low, as many low-frequency and ambiguous misspellings are filtered out, and (iii) the system is fully automatic, customizable and easily executable. While the base system is fully unsupervised, we show how supervision may be employed to adjust weights for task-specific customizations. Conclusion: The performance and relative simplicity of our proposed approach make it a much-needed spelling variant generation resource for health-related text mining from noisy sources. The source code for the system has been made publicly available for research.
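The generator's core loop (semantic neighbours from a dense vector model, then a lexical-similarity filter) can be sketched as below; the vector file, the neighbourhood size, and the 0.75 threshold are assumptions, and the published system additionally recurses over newly found variants until convergence:

```python
# Semantic neighbours that also look like the seed are kept as variants.
from difflib import SequenceMatcher
from gensim.models import KeyedVectors

# Hypothetical word vectors trained on noisy health-related text.
vectors = KeyedVectors.load_word2vec_format("twitter_vectors.bin", binary=True)

def spelling_variants(seed, topn=50, min_lex_sim=0.75):
    out = []
    for term, _ in vectors.most_similar(seed, topn=topn):
        # Drop semantically close but lexically dissimilar terms
        # (synonyms, brand names) and keep likely misspellings.
        if SequenceMatcher(None, seed, term).ratio() >= min_lex_sim:
            out.append(term)
    return out

print(spelling_variants("oxycodone"))
```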
Article
Objective: We executed the Social Media Mining for Health (SMM4H) 2017 shared tasks to enable the community-driven development and large-scale evaluation of automatic text processing methods for the classification and normalization of health-related text from social media. An additional objective was to publicly release manually annotated data. Materials and methods: We organized 3 independent subtasks: automatic classification of self-reports of 1) adverse drug reactions (ADRs) and 2) medication consumption, from medication-mentioning tweets, and 3) normalization of ADR expressions. Training data consisted of 15 717 annotated tweets for (1), 10 260 for (2), and 6650 ADR phrases and identifiers for (3); and exhibited typical properties of social-media-based health-related texts. Systems were evaluated using 9961, 7513, and 2500 instances for the 3 subtasks, respectively. We evaluated performances of classes of methods and ensembles of system combinations following the shared tasks. Results: Among 55 system runs, the best system scores for the 3 subtasks were 0.435 (ADR class F1-score) for subtask-1, 0.693 (micro-averaged F1-score over two classes) for subtask-2, and 88.5% (accuracy) for subtask-3. Ensembles of system combinations obtained best scores of 0.476, 0.702, and 88.7%, outperforming individual systems. Discussion: Among individual systems, support vector machines and convolutional neural networks showed high performance. Performance gains achieved by ensembles of system combinations suggest that such strategies may be suitable for operational systems relying on difficult text classification tasks (eg, subtask-1). Conclusions: Data imbalance and lack of context remain challenges for natural language processing of social media text. Annotated data from the shared task have been made available as reference standards for future studies (http://dx.doi.org/10.17632/rxwfb3tysd.1).
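The ensembles-of-systems finding can be illustrated with a simple majority vote over heterogeneous classifiers; the scikit-learn VotingClassifier below, with toy tweets and labels, is a sketch of the idea rather than the shared-task systems:

```python
# Hard voting: each classifier predicts a label; the majority wins.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

texts = ["this med gave me a horrible rash",       # invented tweets
         "took my morning dose as usual",
         "new study out on adverse reactions",
         "no side effects so far thankfully"]
y = [1, 0, 0, 0]                                   # 1 = self-reported ADR

X = TfidfVectorizer().fit_transform(texts)
ensemble = VotingClassifier(
    estimators=[("svm", SVC()),
                ("rf", RandomForestClassifier(random_state=0)),
                ("lr", LogisticRegression())],
    voting="hard")
ensemble.fit(X, y)
print(ensemble.predict(X))
```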
Article
Drug misuse may happen when patients do not follow their prescriptions and take actions that lead to potentially harmful situations, such as intake of an incorrect dosage (overuse or underuse) or drug use for indications different from those prescribed. Although such situations are dangerous, patients usually do not report the misuse of drugs to their physicians. Hence, other sources of information are necessary for studying these issues. We assume that online health fora can provide such information and propose to exploit them. The general purpose of our work is the automatic detection and classification of drug misuses through the analysis of user-generated data in French social media. To this end, we propose a multi-step method, the main steps of which are: (1) indexing of messages with an extended vocabulary adapted to social media writing; (2) creation of a typology of drug misuses; and (3) automatic classification of messages according to whether they contain drug misuses or not. We present the results obtained at the different steps and discuss them. The proposed method detects misuses with an F-measure of up to 0.773.
Article
Background Tweets can provide broad, real-time perspectives about health and medical diagnoses that can inform disease surveillance in geographic regions. Less is known, however, about how much individuals post about common health conditions or what they post about. Objective We sought to collect and analyze tweets from 1 state about high prevalence health conditions and characterize the tweet volume and content. Methods We collected 408,296,620 tweets originating in Pennsylvania from 2012-2015 and compared the prevalence of 14 common diseases to the frequency of disease mentions on Twitter. We identified and corrected bias induced due to variance in disease term specificity and used the machine learning approach of differential language analysis to determine the content (words and themes) most highly correlated with each disease. Results Common disease terms were included in 226,802 tweets (174,381 tweets after disease term correction). Posts about breast cancer (39,156/174,381 messages, 22.45%; 306,127/12,702,379 prevalence, 2.41%) and diabetes (40,217/174,381 messages, 23.06%; 2,189,890/12,702,379 prevalence, 17.24%) were overrepresented on Twitter relative to disease prevalence, whereas hypertension (17,245/174,381 messages, 9.89%; 4,614,776/12,702,379 prevalence, 36.33%), chronic obstructive pulmonary disease (1648/174,381 messages, 0.95%; 1,083,627/12,702,379 prevalence, 8.53%), and heart disease (13,669/174,381 messages, 7.84%; 2,461,721/12,702,379 prevalence, 19.38%) were underrepresented. The content of messages also varied by disease. Personal experience messages accounted for 12.88% (578/4487) of prostate cancer tweets and 24.17% (4046/16,742) of asthma tweets. Awareness-themed tweets were more often about breast cancer (9139/39,156 messages, 23.34%) than asthma (1040/16,742 messages, 6.21%). Tweets about risk factors were more often about heart disease (1375/13,669 messages, 10.06%) than lymphoma (105/4927 messages, 2.13%). Conclusions Twitter provides a window into the Web-based visibility of diseases and how the volume of Web-based content about diseases varies by condition. Further, the potential value in tweets is in the rich content they provide about individuals’ perspectives about diseases (eg, personal experiences, awareness, and risk factors) that are not otherwise easily captured through traditional surveys or administrative data.
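The over/under-representation comparison reduces to a ratio of shares; a worked example using the breast cancer figures quoted above:

```python
# Share of disease tweets vs share of disease prevalence in Pennsylvania.
tweet_share = 39156 / 174381          # breast cancer tweets  ~22.45%
prevalence_share = 306127 / 12702379  # breast cancer prevalence ~2.41%
print(tweet_share / prevalence_share) # ~9.3x over-represented on Twitter
```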
Article
Background: Prescription opioid misuse has doubled over the past 10 years and is now a public health epidemic. Analysis of social media data may provide additional insights into opioid misuse to supplement the traditional approaches of data collection (eg, self-report on surveys). Objective: The aim of this study was to characterize representations of codeine misuse through analysis of public posts on Instagram to understand text phrases related to misuse. Methods: We identified hashtags and searchable text phrases associated with codeine misuse by analyzing 1156 sequential Instagram posts over the course of 2 weeks from May 2016 to July 2016. Content analysis of posts associated with these hashtags identified the most common themes arising in images, as well as culture around misuse, including how misuse is happening and being perpetuated through social media. Results: A majority of images (50/100; 50.0%) depicted codeine in its commonly misused form, combined with soda (lean). Codeine misuse was commonly represented with the ingestion of alcohol, cannabis, and benzodiazepines. Some images highlighted the previously noted affinity between codeine misuse and hip-hop culture or mainstream popular culture images. Conclusions: The prevalence of codeine misuse images, glamorizing of ingestion with soda and alcohol, and their integration with mainstream, popular culture imagery holds the potential to normalize and increase codeine misuse and overdose. To reduce harm and prevent misuse, immediate public health efforts are needed to better understand the relationship between the potential normalization, ritualization, and commercialization of codeine misuse.
Article
Background: The rise in opioid use and overdose has increased the importance of improving data collection methods for the purpose of targeting resources to high-need populations and responding rapidly to emerging trends. Objective: To determine whether Twitter data could be used to identify geographic differences in opioid-related discussion and whether opioid topics were significantly correlated with opioid overdose death rate. Methods: We filtered approximately 10 billion tweets for keywords related to opioids between July 2009 and October 2015. The content of the messages was summarized into 50 topics generated using Latent Dirichlet Allocation, a machine learning analytic tool. The correlations between topic distribution and census region, census division, and opioid overdose death rate were quantified. Results: We evaluated a tweet cohort of 84,023 tweets from 72,211 unique users across the US. Unique opioid-related topics were significantly correlated with different Census Bureau divisions and with opioid overdose death rates at the state and county level. Drug-related crime, language of use, and online drug purchasing emerged as themes in various Census Bureau divisions. Drug-related crime, opioid-related news, and pop culture themes were significantly correlated with county-level opioid overdose death rates, and online drug purchasing was significantly correlated with state-level opioid overdoses. Conclusions: Regional differences in opioid-related topics reflect geographic variation in the content of Twitter discussion about opioids. Analysis of Twitter data also produced topics significantly correlated with opioid overdose death rates. Ongoing analysis of Twitter data could provide a means of identifying emerging trends related to opioids.
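The topic-versus-mortality correlations here, like the county-level correlations in the main article, come down to Pearson and Spearman coefficients; a sketch with SciPy and invented regional values:

```python
# Correlate a topic's regional prevalence with overdose death rates.
from scipy.stats import pearsonr, spearmanr

topic_share = [0.04, 0.11, 0.07, 0.15, 0.09]   # topic weight per region
death_rate = [9.8, 21.3, 14.0, 26.7, 17.5]     # deaths per 100,000

print(pearsonr(topic_share, death_rate))        # (r, p value)
print(spearmanr(topic_share, death_rate))
```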