Research and Applications
Deep neural networks ensemble for detecting medication
mentions in tweets
Davy Weissenbacher,1 Abeed Sarker,1 Ari Klein,1 Karen O'Connor,1 Arjun Magge,2 and Graciela Gonzalez-Hernandez1

1Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA; 2Biodesign Center for Environmental Health Engineering, Biodesign Institute, Arizona State University, Tempe, Arizona, USA
Corresponding Author: Davy Weissenbacher, PhD, Department of Biostatistics, Epidemiology and Informatics, Perelman
School of Medicine, University of Pennsylvania, 480-492-0477, 404 Blockley Hall, 423 Guardian Drive, Philadelphia, PA
19104-6021, USA; dweissen@pennmedicine.upenn.edu
Received 28 March 2019; Revised 26 July 2019; Editorial Decision 8 August 2019; Accepted 13 August 2019
ABSTRACT
Objective: Twitter posts are now recognized as an important source of patient-generated data, providing unique
insights into population health. A fundamental step toward incorporating Twitter data in pharmacoepidemio-
logic research is to automatically recognize medication mentions in tweets. Given that lexical searches for med-
ication names suffer from low recall due to misspellings or ambiguity with common words, we propose a more
advanced method to recognize them.
Materials and Methods: We present Kusuri, an Ensemble Learning classifier able to identify tweets mentioning drug
products and dietary supplements. Kusuri (薬, "medication" in Japanese) is composed of 2 modules: first, 4 different
classifiers (lexicon based, spelling variant based, pattern based, and a weakly trained neural network) are applied in par-
allel to discover tweets potentially containing medication names; second, an ensemble of deep neural networks encod-
ing morphological, semantic, and long-range dependencies of important words in the tweets makes the final decision.
Results: On a class-balanced (50-50) corpus of 15 005 tweets, Kusuri demonstrated performances close to hu-
man annotators with an F1 score of 93.7%, the best score achieved thus far on this corpus. On a corpus made of
all tweets posted by 112 Twitter users (98 959 tweets, with only 0.26% mentioning medications), Kusuri
obtained an F1 score of 78.8%. To the best of our knowledge, Kusuri is the first system to achieve this score on
such an extremely imbalanced dataset.
Conclusions: The system identifies tweets mentioning drug names with performance high enough to ensure its
usefulness, and is ready to be integrated in pharmacovigilance, toxicovigilance, or more generally, public health
pipelines that depend on medication name mentions.
Key words: social media, pharmacovigilance, drug name detection, ensemble learning, text classification
INTRODUCTION
Twitter has been utilized as an important source of patient-generated data that can provide unique insights into population health [1]. Many of these studies involve retrieving tweets that mention drugs, for tasks such as syndromic surveillance [2,3], pharmacovigilance [4], and monitoring drug abuse [5]. A common approach is to search for tweets containing lexical matches of drug
names occurring in a manually compiled dictionary. However, this
approach has several limitations. Many tweets contain drugs that
are misspelled or not referred to by name (eg, “it” or “antibiotic”).
Even when a match is found, oftentimes the referent is not actually a
drug; for example, tweets that mention Lyrica are predominantly
about the singer, Lyrica Anderson, and not about the antiepileptic
drug. In this study, when using the lexical match approach on a
corpus where names of drugs are naturally rare, we retrieved only
71% of the tweets that we manually identified as mentioning a drug,
and more than 45% of the tweets retrieved were noise. Enhancing the
utility of social media for public health research requires methods that
are capable of improving the detection of posts that mention drugs.
The task of automatically detecting mentions of concepts in text is generally referred to as named entity recognition (NER) [6]. State-of-the-art NER systems are based on machine learning (ML) and achieve performances close to humans when they are trained and evaluated on formal texts. However, they tend to perform relatively poorly when they are trained and evaluated on social media [7,8]. Tweets are short messages, so they do not provide large contexts that NER systems can use to disambiguate concepts. Furthermore, the colloquial style of tweets (misspellings, elongations, abbreviations, neologisms, and nonstandard grammatical structures, cases, and punctuation) poses challenges for computing features in ML-based NER systems [9]. Although large sets of annotated corpora are available for training NER systems to detect general concepts on Twitter (eg, people, organizations), there is a need for the collection and annotation of additional data for automatic detection of more specialized concepts (eg, drugs and diseases) [10].
Over the last decade, researchers have competed to improve NER on tweets. Challenges have been organized for tweets written in English (eg, the Named Entity Recognition and Linking challenge series [11] or the Workshop on Noisy User-generated Text series [12]) as well as in other languages (eg, Conference sur l'Apprentissage Automatique 2017 [13]). Specifically for drug detection in Twitter, we organized the Third Social Media Mining for Health Applications shared task in 2018 [14]. The sizes of the corpora annotated during these challenges vary from 4000 tweets [15] to 10 000 tweets [13].
ML methods have evolved over recent years, with a noticeable shift from support vector machine (SVM)- or conditional random field-based frameworks trained on carefully engineered features [16] to deep neural networks that automatically discover relevant features from word embeddings. In the 2016 Workshop on Noisy User-generated Text, a large disparity between the performances obtained by the NERs on different types of entities was observed. The winners [17], with an overall F1 score of 52.4%, reported an F1 score of 72.61% on Geo-Locations, the most frequent NEs in their corpus, but much lower scores on rare NEs, such as an F1 score of 5.88% on TV Shows. Kusuri (薬, "medication" in Japanese; https://en.wiktionary.org/wiki/%E8%96%AC) detects names of drugs with sufficient performance even on a natural corpus in which drugs are very rarely mentioned.
The primary objective of this study was to automatically detect tweets that mention drug products (prescription and over-the-counter) and dietary supplements. The Food and Drug Administration (FDA Glossary of Terms: https://www.fda.gov/drugs/informationondrugs/ucm079436.htm; Drug; Drug product) defines a drug product as the final form of a drug, containing the drug substance generally in association with other active or inactive ingredients. This study includes drug products that are referred to by their trademark names (eg, NyQuil), generic names (eg, acetaminophen), and class names (eg, antibiotic or seizure medication). We formulate this problem as a binary classification task. Formally, given a set of tweets T, our goal is to learn a function f such that f(t) = 1 for all tweets t in T containing at least 1 phrase referring to a drug product/dietary supplement, and f(t) = 0 otherwise. A tweet is a positive example if it contains text referring to a drug (and not only "matching" a drug name), and a negative example otherwise. For example, the tweet "I didn't know Lyrica had a twin" is a negative example because Lyrica refers to the singer, Lyrica Anderson, whereas the tweet "Lyrica experiences? I was on Gabapentin." is a positive example because, in this context, it mentions 2 antiepileptics. The use of sequence labeling to delimit drug name boundaries in the positive examples (named entity recognition) and their mapping to a standardized name (named entity identification) are outside the scope of this work.
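Written out, the target function is the indicator

```latex
f(t) \;=\;
\begin{cases}
  1 & \text{if } t \text{ contains at least one phrase referring to a drug product or dietary supplement,} \\
  0 & \text{otherwise,}
\end{cases}
\qquad \forall t \in T.
```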
The main contributions of this study are (1) a gold standard cor-
pus of 15 005 annotated tweets, for training ML-based classifiers to
automatically detect drug products mentioned on Twitter; (2) a
binary classifier based on Ensemble Learning, which we call Kusuri;
and (3) an evaluation of Kusuri on 98 959 tweets with the natural
balance of 0.2% positive to 99.8% negative for the presence of med-
ication names. We describe the corpora in the Materials and Meth-
ods, as well as the details of our classifier, followed by its evaluation
in the Results.
Automatic drug name recognition has mostly been studied for
extracting drug names from biomedical articles and medical docu-
ments, with several articles published [18] and challenges organized in the last decade [19-21]. Most works that have tackled the task of
detecting drug names in Twitter have focused on building corpora.
Sarker et al [22] created a large corpus of 260 000 tweets mentioning drugs. However, they restricted their search by strict matching to a preselected list of 250 drugs plus their lexical variants, and they did not annotate the generated corpus. In a similar study, Carbonell et al [10] explored the distributions of drug and disease names in Twitter as preliminary work on drug-drug interactions. While the authors searched for a larger set of drugs (all unambiguous drugs listed in the DrugBank database), they did not annotate the corpus of 1.4 million tweets generated, nor did they count false positives (ambiguous mentions) in their statistical analysis.
The first evaluation of automatic drug name recognition in Twitter that we are aware of was performed by Jimeno-Yepes et al [23] on a corpus of 1300 tweets. Two off-the-shelf classifiers, MetaMap and the Stanford NER tagger, as well as an in-house classifier based on conditional random fields with hand-crafted features, were evaluated. The latter obtained the best F1 score, 65.8%. Aside from the aforementioned problem of selecting the tweets using a lexical match, their study has additional limitations arising from the choices made. To remove nonmedical tweets, they retained only tweets containing at least 2 medical concepts (eg, a drug and a disease). This ensured good precision, but also artificially biased their corpus in 2 ways: by retaining only the tweets that mentioned the drugs in their dictionary and by eliminating tweets that mention a drug alone (eg, "me and ZzzQuil are best friends"). In November 2018, we organized the Third Social Media Mining for Health Applications shared task (SMM4H) [14], with Task 1 of our challenge dedicated to the problem of automatically recognizing drug names in Twitter. Eleven teams tested multiple approaches on the provided balanced corpus, which we selected using 4 classifiers (the first module of Kusuri). A wide range of deep learning-based classifiers were used by participants, as well as some feature-based classifiers and a few attempts with ensemble learning systems. The system THU_NGN by Wu et al [24], an ensemble of hierarchical neural networks with multihead self-attention integrating features modeling sentiments, was the top performer, with an F1 score of 91.8%. This established a recent benchmark for the community on an artificially balanced corpus (with approximately the same number of positive and negative examples). Our evaluation data, described in the UPennHLP Twitter Pregnancy Corpus subsection, include both the artificially balanced corpus and, in addition, a corpus of all available tweets posted by selected Twitter users in which the mentions of drug products were manually annotated. We refer to the latter as a corpus with "natural" balance.
MATERIALS AND METHODS
We collected all publicly available tweets posted by 112 500 Twitter
users (their timelines). To do so, we first used the Twitter streaming
application programming interface to detect tweets mentioning key-
words used when announcing a pregnancy. These keywords were
manually defined. Then, we used a simple SVM classifier to confirm
that the tweets were really announcing a pregnancy and discarded
other tweets. The keywords and the SVM classifier are described in
our previous work [25]. Once a tweet announcing a pregnancy was iden-
tified, we collected the timeline of the author of the tweet. We used
the REST application programming interface provided by Twitter to
download all tweets posted by this user within the Twitter-imposed
limit of 3200 most recent tweets, and continued collection afterward.
We did not remove bots or accounts managed by businesses and other
entities, as they may also tweet about drugs. We intermittently col-
lected posts from January 2014 to April 2017. After April 2017, we
systematically collected posts until September 2017. Through this pro-
cess, we collected a total of 421.5 million tweets. Using this dataset as
a source allows us to avoid the bias of a drug-name keyword based
collection. When a dataset is collected using a list of drug names, the
resulting dataset will obviously contain only tweets mentioning the
drugs occurring in the list: the you-find-what-you-are-looking-for bias
(ie, confirmation) bias. This is evident when reported recall climbs to
98% or more. Our method, collecting all tweets posted by the users in
our cohort, captured drugs mentioned by the users in the way that
they naturally occur. Our dataset thus captures the natural variants of drug names as expressed by Twitter users, variants that would have been missed had they not been present in a list upfront.
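As an illustration of the timeline collection step, the sketch below uses the tweepy library (v3-style API); the credentials and the helper name are placeholders, not the code actually used for the study.

```python
import tweepy

# Hypothetical credentials; the study's actual harvesting code is not shown here.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

def collect_timeline(user_id):
    """Download up to the 3200 most recent tweets of one user via the REST API,
    the Twitter-imposed limit mentioned in the text."""
    timeline = []
    for status in tweepy.Cursor(api.user_timeline, user_id=user_id,
                                count=200, tweet_mode="extended").items(3200):
        timeline.append(status.full_text)
    return timeline
```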
All tweets were collected from public Twitter accounts, and a
certificate of exemption was obtained from the Institutional Review
Board of the University of Pennsylvania. All tweets used and re-
leased to the community were used and released without violating
Twitter’s terms and conditions.
UPennHLP Twitter Drug Corpus
Building a corpus of tweets containing drug names to train and evaluate
a drug name classifier is a challenging task. Tweets mentioning drug
names are extremely rare. We found that they only represent 0.26% of
the tweets in the UPennHLP Twitter Pregnancy Corpus (see the follow-
ing section), and are often ambiguous with common and proper nouns.
If a naive lexicon matching method is used to create the corpus, it often matches a large number of tweets not containing any drug names.
Therefore, to build a gold-standard corpus, we had to rely on a
more sophisticated method than simple lexicon matching. We cre-
ated 4 simple classifiers to detect tweets mentioning drug names:
one based on a lexicon matching, one on lexical variants matching,
one on regular expressions, and a classifier trained with weak super-
vision. The 4 classifiers are described briefly below, and in detail in
Supplementary Appendix A.
Lexicon-based drug classifier
The first classifier is built on top of a lexicon of drug names
generated from the RxNorm Database (https://www.nlm.nih.gov/
research/umls/rxnorm/overview.html, Accessed June 11, 2018). If a
tweet contains a word or phrase occurring in the lexicon, the tweet
is classified as a positive example without any further analysis. We
chose RxNorm because of its large coverage: it combines, in a single database, 15 existing dictionaries, including DrugBank, a database often used in previous works on drug name detection.
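A minimal sketch of this first classifier, assuming the RxNorm names have already been extracted into a set of lowercased phrases (the three entries shown are placeholders):

```python
import re

# Placeholder entries; in practice, phrases extracted from RxNorm.
drug_lexicon = {"nyquil", "acetaminophen", "seizure medication"}

# One word-boundary pattern per entry, so that "NyQuil!" matches
# but "acetaminophenx" does not.
patterns = [re.compile(r"\b" + re.escape(name) + r"\b") for name in drug_lexicon]

def lexicon_classifier(tweet: str) -> bool:
    """Label a tweet positive if any lexicon phrase occurs in it."""
    text = tweet.lower()
    return any(p.search(text) for p in patterns)
```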
Variant-based drug classifier
Names of drugs may have a complex morphology and, as a conse-
quence, are often misspelled on Twitter. Lexicon-based approaches
detect drugs mentioned in tweets only if the drug names are cor-
rectly spelled. The inability to detect misspelled drug names
results in low recall for the lexicon-based classifier. In an attempt to
increase recall, we used a data-centric misspelling generation algorithm [26] to generate variants of drug names and used the variants to detect tweets mentioning misspelled drugs.
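A simplified illustration of the matching step follows; the variant map below is hypothetical, whereas the actual variants come from the generator of reference 26.

```python
# Sketch only: hand-made examples of misspelling variants mapped to
# their canonical drug names.
variant_to_drug = {
    "benadril": "benadryl",   # hypothetical variants
    "xanex": "xanax",
    "tylenal": "tylenol",
}

def variant_classifier(tweet: str) -> bool:
    """Label a tweet positive if any token is a known drug-name variant."""
    tokens = tweet.lower().split()
    return any(tok.strip(".,!?") in variant_to_drug for tok in tokens)
```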
Weakly trained drug long short-term memory classifier
Our third classifier is a long short-term memory (LSTM) neural net-
work that integrates an attention mechanism [27] and is trained on noisy training examples obtained through weak supervision. One annotator identified drug names tending to be unambiguous [28] in our timelines (eg, Benadryl or Xanax). We selected the 126 500
tweets containing these unambiguous names in our timelines as posi-
tive examples. Then, given that drug names occur very rarely in
tweets, we randomly selected an additional 126 500 tweets from
our timelines as negative examples and trained our LSTM on these
examples. We chose this simple classifier to discover a large number
of tweets that could potentially contain drug names, and integrated
an attention mechanism to ensure that the neural network focuses
on the words occurring recurrently in the context of drug mentions
in tweets and discards irrelevant words.
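The weak-labeling procedure reduces to the following sketch, with unambiguous_drugs standing in for the names selected by the annotator; the LSTM itself follows the general architecture sketched for Module 2 later.

```python
import random

# Placeholder list: drug names judged unambiguous by the annotator.
unambiguous_drugs = {"benadryl", "xanax"}

def build_weak_training_set(timelines, n=126_500):
    """Noisy positives: tweets containing an unambiguous drug name.
    Noisy negatives: random tweets; drug mentions are rare enough that
    a randomly drawn tweet is almost always a true negative."""
    def mentions_drug(t):
        return any(d in t.lower() for d in unambiguous_drugs)
    positives = [t for t in timelines if mentions_drug(t)][:n]
    negatives = random.sample([t for t in timelines if not mentions_drug(t)], n)
    return [(t, 1) for t in positives] + [(t, 0) for t in negatives]
```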
Pattern-based drug classifier
Our last classifier implements a common method to detect general
named entities, regular expressions (REs). REs describe precisely the
linguistic contexts used in Twitter to speak about drugs. We manu-
ally crafted our REs by inspecting 9530 n-grams, which were the
most frequent n-grams occurring before and after the most frequent
unambiguous names of drugs in our 126 500 tweets. We retained
81 patterns (eg, “prescibed me”, “prescription filled for”, “doctor
switched”). When inspecting these n-grams, one annotator used his
knowledge of the language to reject noisy patterns. Before including
a pattern in the list, the annotator confirmed empirically by querying
the pattern in a search engine indexing our dataset that the pattern
actually retrieved tweets mentioning drugs in the first 100 tweets re-
trieved for the query. Davy Weissenbacher created the REs.
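A sketch of the pattern-based classifier, using three of the 81 retained patterns quoted above:

```python
import re

# Three of the retained patterns; the full list of 81 is not reproduced here.
drug_context_patterns = [
    re.compile(r"\bprescibed me\b"),
    re.compile(r"\bprescription filled for\b"),
    re.compile(r"\bdoctor switched\b"),
]

def pattern_classifier(tweet: str) -> bool:
    """Label a tweet positive if any drug-context pattern matches."""
    text = tweet.lower()
    return any(p.search(text) for p in drug_context_patterns)
```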
To obtain positive examples, we selected tweets retrieved by at
least 2 classifiers, as they were most likely to mention drug names.
To obtain negative examples, we selected tweets detected by only 1
classifier, given that if these tweets did not contain a drug name,
they were nonobvious negative examples. Following this process,
from our 421.5 million tweets, we created a corpus of 15 005
tweets, henceforth referred to as the UPennHLP Twitter Drug Cor-
pus (Table 1). We removed from the corpus duplicated tweets,
tweets not written in English, and tweets that were no longer on
Twitter (eg, tweets deleted by the users) at the time of the collection.
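The selection logic described above amounts to counting classifier votes; a sketch follows (the resulting candidates were then manually annotated):

```python
def corpus_candidate(tweet, classifiers):
    """Route a tweet according to the votes of the 4 simple classifiers.

    votes >= 2 -> likely positive candidate; votes == 1 -> nonobvious
    negative candidate; votes == 0 -> not included in the corpus."""
    votes = sum(clf(tweet) for clf in classifiers)
    if votes >= 2:
        return "candidate_positive"
    if votes == 1:
        return "candidate_negative"
    return None
```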
Two annotators annotated the corpus in its entirety, with a high
interannotator agreement (IAA) measured as Cohen’s kappa of
.892. Our corpus was annotated by our 2 staff annotators, who
have over 7 years combined experience annotating texts in the
biomedical domain. One annotator holds a degree in biology and
our senior annotator has a master of science degree in biomedical
informatics. We randomly selected 9623 tweets for the training set
(4975 positive and 4648 negative examples) and 5382 tweets for the
testing set (2852 positive and 2530 negative examples). We publicly
released the corpus and our guidelines to the research community
for Task 1 of the SMM4H 2018 shared task [14].
UPennHLP Twitter Pregnancy Corpus
A balanced corpus, such as the UPennHLP Twitter Drug Corpus, is a
useful resource to study how people speak about drugs on social me-
dia. However, due to the mechanism of its construction, a balanced
corpus does not represent the natural distribution of tweets mention-
ing drugs on Twitter. Consequently, any evaluation made on a bal-
anced corpus will never be indicative of the performance expected
from a drug name classifier used in practice. To further assess whether
Kusuri could reliably be used in practice, we ran additional experi-
ments on the corpus used for an epidemiologic study of birth defect
outcomes [29]. For that, we collected 397 timelines of women who had announced their pregnancy on Twitter and who had tweeted during
their pregnancy, and manually identified all tweets mentioning a drug
during the period of pregnancy. It took an average of 2.5 hours to an-
notate each timeline. We ran our experiments on a subset of 112 time-
lines (98 959 tweets) from this corpus, referred to as UPennHLP
Twitter Pregnancy Corpus in the remainder of the article (Table 1).
The annotators manually verified that these timelines were owned by
individuals by looking at their profiles and their posts and making a
judgment call, and removed all timelines administered by bots or by associations/companies. Our senior annotator and the main author of this
article annotated these timelines. An IAA of 0.88 (Cohen’s kappa)
was computed over 12 timelines, which were dual annotated by the
senior annotator. We randomly selected 70% (69 272 tweets) of the
Pregnancy Corpus for training and the remaining 30% (29 687
tweets) for testing. When splitting the corpus into a training set and a
testing set, we kept the same 70%/30% split within the positive and the negative examples (ie, 181 positive/69 091 negative tweets in training and 77 positive/29 610 negative tweets in testing).
Kusuri architecture
Kusuri, described in Figure 1, applies 2 modules sequentially to de-
tect tweets mentioning drugs in the UPennHLP Twitter Pregnancy
Corpus. This section describes each module. The success of deep
learning classifiers in natural language processing lies on their ability
to automatically discover relevant linguistic features from word
embeddings [30], an ability even more valuable when working on
short and colloquial texts such as tweets. For this reason, we pre-
ferred to integrate in our modules deep learning classifiers over
more traditional classifiers based on feature engineering.

Figure 1. Architecture of Kusuri, an ensemble learning classifier for drug detection in Twitter. LSTM: long short-term memory.
Module 1: tweet prefilter
Kusuri applies our 4 classifiers—the lexicon-based, variant-based,
pattern-based, and weakly trained LSTM classifiers—in parallel to
discover tweets that potentially contain drug names. Among the
tweets discovered, Kusuri selects those flagged by the lexicon-based classifier, those flagged by the variant-based classifier, and those flagged by both the pattern-based and the weakly trained classifiers. Tweets flagged by only 1 of these last 2 classifiers were too noisy and were discarded. The selected tweets are then submitted to Module 2, an
ensemble of deep neural networks (DNN) that makes the final deci-
sion for the labels. The 4 classifiers act as filters, collecting only
good candidates for the ensemble of DNNs, which was, in turn, op-
timized to recognize positive examples among them.
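The decision rule of Module 1, written out (the 4 functions stand for the classifiers described in the previous section, with weak_lstm_classifier denoting the weakly trained LSTM):

```python
def module1_prefilter(tweet) -> bool:
    """Keep a tweet if the lexicon classifier fires, if the variant
    classifier fires, or if BOTH the pattern-based and the weakly
    trained classifiers fire; otherwise discard it."""
    if lexicon_classifier(tweet) or variant_classifier(tweet):
        return True
    return pattern_classifier(tweet) and weak_lstm_classifier(tweet)
```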
Module 2: ensemble of DNNs
As a single element for the ensemble of neural networks composing
the second module of Kusuri, we designed a DNN following a stan-
dard architecture for classification of NEs. Described in Figure 2,
our DNN starts by independently encoding each sequence of charac-
ters composing the tokens of a tweet through 3 layers sequentially
connected: a recurrent layer, an attention layer, and a densely con-
nected layer. All resulting vectors, encoding the morphological prop-
erties of the tokens, are then concatenated with their respective
pretrained word embedding vectors, which encode the semantic
properties of the tokens. The concatenated vectors are passed to a
bidirectional-gated recurrent unit (GRU) layer to learn long-range
dependencies between the words, followed by an attention layer
that, as additional memory, helps the NN to focus on the most dif-
ferentiating words for the classification. A final dense layer com-
putes the probability for the tweet to contain a mention of a drug.

Figure 2. Deep neural network predicting ŷ, the probability for a tweet to mention a drug name. GRU: gated recurrent unit.
All neural networks in our study were given pretrained word vectors
as input. We chose the word vectors trained with the GloVe algo-
rithm on 2 billion tweets, available for download on the webpage of
the project (https://nlp.stanford.edu/projects/glove/). Supplementary
Appendix B describes in detail the preprocessing steps, the embed-
dings, and the parameters of our training. We experimented with
early stopping to avoid overfitting when training our models. We
kept 70% of our training corpus to train a model, and 30% for vali-
dation. We found that 8 iterations were sufficient to train the model
before overfitting the training corpus. However, contrary to our ex-
pectation, the models trained with 8 iterations gave slightly lower
performances on the test set of the UPennHLP Drug Corpus than
models trained with 20 iterations, with F1 scores of 92.7% and
93.1%, respectively. The reason for this is not clear, but it may be
because our model continues to improve as more examples are pro-
vided in the training corpus, making the estimation of the best num-
ber of iterations inexact with early stopping. We report the results of
our model trained with 20 iterations.
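A sketch of a single DNN of the ensemble in Keras follows. The layer sizes, the sequence limits, and the choice of a GRU in the character encoder are our assumptions for illustration (the text specifies only "a recurrent layer"); the actual preprocessing and hyperparameters are given in Supplementary Appendix B.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_WORDS, MAX_CHARS = 40, 20          # assumed sequence limits
CHAR_VOCAB, WORD_VOCAB = 100, 50_000   # assumed vocabulary sizes
WORD_DIM = 200                         # GloVe Twitter embedding dimension

def attention_pool(sequence):
    """Additive attention: score each timestep, normalize over time,
    return the attention-weighted sum of the hidden states."""
    scores = layers.Dense(1, activation="tanh")(sequence)   # (batch, T, 1)
    weights = layers.Softmax(axis=1)(scores)                # attention over time
    return layers.Lambda(
        lambda x: tf.reduce_sum(x[0] * x[1], axis=1))([sequence, weights])

# Character-level encoder, applied to each token independently:
# recurrent layer -> attention layer -> dense layer.
char_in = layers.Input(shape=(MAX_CHARS,), dtype="int32")
c = layers.Embedding(CHAR_VOCAB, 25)(char_in)
c = layers.Bidirectional(layers.GRU(25, return_sequences=True))(c)
c = attention_pool(c)
c = layers.Dense(50, activation="relu")(c)
char_encoder = Model(char_in, c)

word_ids = layers.Input(shape=(MAX_WORDS,), dtype="int32")
char_ids = layers.Input(shape=(MAX_WORDS, MAX_CHARS), dtype="int32")

# Pretrained GloVe Twitter vectors would be loaded into this layer's weights.
w = layers.Embedding(WORD_VOCAB, WORD_DIM)(word_ids)
m = layers.TimeDistributed(char_encoder)(char_ids)   # morphological encodings
x = layers.Concatenate()([w, m])                     # semantics + morphology
x = layers.Bidirectional(layers.GRU(100, return_sequences=True))(x)  # long-range deps
x = attention_pool(x)                                # focus on differentiating words
y_hat = layers.Dense(1, activation="sigmoid")(x)     # P(tweet mentions a drug)

model = Model([word_ids, char_ids], y_hat)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```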
Owing to the stochastic nature of the initialization of the NN,
the learning process may discover a local optimum and return a sub-
optimal model. To reduce the effect of local optima, we resorted to
ensemble averaging. We independently learned 9 models using our
DNN and computed the final decision, for a tweet to mention a
medication name or not, by taking the mean of the probabilities
Table 1. Statistics of the UPennHLP Twitter Drug and Pregnancy Corpora

                                   UPennHLP Twitter Drug Corpus    UPennHLP Twitter Pregnancy Corpus
Training set                       9623 tweets (4975 +/4648 -)     69 272 tweets (181 +/69 091 -)
Testing set                        5382 tweets (2852 +/2530 -)     29 687 tweets (77 +/29 610 -)
Users in training/testing set      7584/4535                       112/112
Users posting in training set      1054 (these users posted        112
and in testing set                 1713 tweets in the testing
                                   set, 31.8% of the testing set)
computed by the models (because a soft voting algorithm [31] in our
experiments did not improve over the simple averaging method, we
kept the latter). When applied on the Pregnancy Corpus, all DNNs
of the ensemble were trained on the Drug Corpus, and, at test time,
all DNNs of the ensemble only have to classify the tweets in the
Pregnancy Corpus filtered by the first module of Kusuri.
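The ensemble decision then reduces to averaging the model probabilities; a sketch (the 0.5 decision threshold is our assumption, not stated in the text):

```python
import numpy as np

def ensemble_predict(models, inputs, threshold=0.5):
    """Average the probabilities of the 9 independently trained DNNs
    and threshold the mean to obtain the final label."""
    probs = np.mean([m.predict(inputs).ravel() for m in models], axis=0)
    return (probs >= threshold).astype(int)
```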
RESULTS
This section details the performances of Kusuri and its ensemble of
DNNs during 2 series of experiments on the Drug Corpus and on
the Pregnancy Corpus.
Drug detection in the UPennHLP Twitter Drug Corpus
We first ran a series of experiments to measure the performances of
the ensemble of DNNs composing the second module of Kusuri.
The detailed results are reported in Table 2. We compare the perfor-
mance of our ensemble with 3 baseline classifiers.
The first baseline classifier is a combination of the lexicon- and
variant-based classifiers. It labeled as positive examples all tweets of
the test set of the Drug Corpus that contained a phrase found in our
Lexicon or in our list of variants. This baseline classifier provides a
good estimation of the performances to expect when a lexicon-based
approach is used. We chose as a second baseline system a
bidirectional-GRU classifier. Because this baseline system has a sim-
pler architecture than our final DNN, their comparison allows us to
estimate the benefits of the components we added in our system.
The baseline system was trained on the same training data with the
same hyperparameters, and took as input the same word embed-
dings, but it did not have information about the morphology of the
words or the help of the attention mechanism. As a third strong
baseline, we compare our system with the best system of the Task 1
of the SMM4H 2018 competition, the THU_NGN system [24].

Table 2. Precision, recall, and F1 scores for drug detection classifiers on the test set of the UPennHLP Twitter Drug Corpus

System                                       Precision   Recall   F1 score
1. Lexicon + variant classifier                   66.4     88.5       75.9
2. Supervised bidirectional-GRU                   93.5     89.5       91.4
3. THU_NGN hierarchical-NNs                       93.3     90.4       91.8
4. Best DNN model in the ensemble                 93.7     92.5       93.1
5. Ensemble DNNs (module 2 of Kusuri)             95.1     92.5       93.7

DNN: deep neural network; GRU: gated recurrent unit; NN: neural network.
The results in Table 2 are interesting in several ways. The combined
lexicon- and variant-based classifier has a high recall on the test data
(88.5%), an unsurprising result considering the central role played by
the lexicon and the variants during the construction of the Drug Cor-
pus. This classifier is vulnerable to the frequent ambiguity of drug
names, resulting in a low precision of 66.4%. The classifier has no
knowledge of the context in which the name of a drug appears, and
thus cannot disambiguate tweets mentioning Lyrica (antiepileptic vs
Lyrica Anderson), lozenge (type of pills vs geometric shape), or halls
(brand name vs misspelling for Hall’s), for example. The fully super-
vised bidirectional-GRU confirms its ability to learn features from the word embeddings alone [32], and achieves an F1 score of 91.4%, a
higher score than the IAA computed on this corpus. However, such sys-
tems can be improved as demonstrated by the better performances of
the best DNN in the ensemble (4 in Table 2). The encoding of the token
morphology and the attention layer of the best DNN improve the F1
score by 1.7 points. Also, despite having a simpler architecture and at-
tention mechanism, the best DNN system performs better than the en-
semble of hierarchical NNs proposed by Wu et al [24]. The reason for this
result is not clear, but it may be a suboptimal set of hyperparameters
chosen by the authors or the difficulty to train such a complex network.
The highest performance is obtained by the ensemble DNNs,
which shows an improvement of 0.6 points over the best model in
the ensemble, with a final F1 score of 93.7%. We confirmed the dis-
agreement between the ensemble DNNs and the THU_NGN sys-
tems to be statistically significant with a McNemar test [33]. The null
hypothesis was rejected with a significance level set to .001. We ana-
lyzed randomly selected labeling errors made by the ensemble
DNNs (Table 3). We distinguished 8 nonexclusive categories of false
positives. With 41 cases, most false positives were tweets discussing
medical topics without mentioning a drug. As medical tweets often de-
scribe symptoms or discuss medical concepts, their lexical fields are
strongly associated with drug names (eg, cough, flu, doctor) and con-
fuse the classifier. The causes of false negatives seem to mirror those
for the false positives. With 36 cases, false negatives were mostly
caused by the ambiguity of not only common English words (eg, air-
borne), but also dietary supplements and food products sometimes
consumed for their medicinal properties (eg, clove, arnica, aloe). This
could be a positive turn of events if nutritional supplements are to be
included in a study. A second important cause was unseen, or rarely
seen, drug names in our training corpus, with 25 cases.
The ensemble DNNs correctly detected 245 tweets that were in-
correctly detected by the THU_NGN system. On the other hand, the
THU_NGN system correctly detected 138 tweets that were incor-
rectly detected by our system. We manually analyzed 100 tweets
that were randomly selected from the 245 tweets, but could not dis-
cern any evident patterns explaining the differences in performances
between the 2 systems. Further linguistic analysis, such as the analy-
sis proposed in Vanni et al [30], may help uncover these patterns, but is
beyond the scope of this study.
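For reference, the McNemar comparison reported above can be reproduced from a 2x2 table of per-tweet correctness; the sketch below uses statsmodels, with the off-diagonal disagreement counts taken from the text (245 and 138) and placeholder diagonal counts.

```python
from statsmodels.stats.contingency_tables import mcnemar

# Rows: ensemble DNNs correct/incorrect; columns: THU_NGN correct/incorrect.
# Only the off-diagonal (disagreement) cells drive the test; the diagonal
# values here are placeholders chosen to sum to the 5382 test tweets.
table = [[4700, 245],
         [138, 299]]

result = mcnemar(table, exact=False, correction=True)
print(result.statistic, result.pvalue)  # significant at the .001 level per the text
```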
Drug detection in the UPennHLP Twitter Pregnancy
Corpus
The results of the ensemble DNNs on the Drug Corpus are prom-
ising, but they were obtained under ideal conditions. The
training corpus is balanced, and most of the drugs found in the
test set were present in the training set. These conditions are un-
likely to be satisfied when the classifier is used on naturally oc-
curring data. We ran a second series of experiments on the
Pregnancy Corpus to get a more realistic evaluation of our classi-
fiers. As for the previous experiments, we kept the lexicon- and
variant-based classifier as well as an ensemble of bidirectional-
GRU networks as baseline systems (1 and 2 in Table 4, respec-
tively). Each network of the ensemble was trained on the Preg-
nancy Corpus with 5 iterations and a batch size of 64 examples,
and a simple averaging was used to combine their results. Since
the ensemble DNNs gave the best performances on the Drug Cor-
pus (Table 2), we chose it as a third baseline system. This base-
line applies the ensemble of DNNs without prefiltering the
tweets using the first module of Kusuri. In system 3.a, we trained
all DNNs of the ensemble on the training set of the Drug Corpus,
with 20 iterations and a batch size of 2 examples. In system 3.b,
we trained all DNNs of the ensemble on the training set of the
Pregnancy Corpus, with 4 iterations and a batch size of 64 exam-
ples. These hyperparameters were the best parameters found af-
ter early stopping and a manual search through standard batch
sizes of 2, 64, and 128. The last system evaluated was the
“complete” Kusuri system, with both modules applied
sequentially.
The results are reported in Table 4. What is striking about the
figures in this table is the poor performances of the lexicon- and var-
iant-based classifier and the ensemble DNNs classifier. The score of
the former dropped from an F1 score of 75.9% when applied on the
Drug Corpus, to 62.2% when applied on the Pregnancy Corpus. As
we used the lexicon to build the Drug Corpus, the drugs in the lexi-
con were overrepresented in this corpus, increasing the baseline’s re-
call by 17.1 points. The ensemble DNNs classifier (3.a) did worse,
with an F1 score of only 18%. Trained on a balanced set of medi-
cally related tweets, the classifier was found too sensitive. It gives
too much weight to words that are related to medication but used in
other contexts, such as overdose, bi-polar, or isnt working, resulting in a total of 549 false positives, whereas only 77 tweets in the whole test set actually mention a drug. Surprisingly, the ensemble of bidirectional-GRU
networks (system 2.a) was capable of learning our task with very
few positive examples. Despite the 5.11-point difference in F1 score
of Kusuri over system 2.a, a McNemar test shows that we cannot
conclude this difference is significant. Additional experiments, with
more positive examples, are needed to confirm Kusuri’s superiority.
While lower than the ideal score of the ensemble DNNs classifier on
the Drug Corpus, the F1 score of Kusuri (78.8%) is comparable to the scores published for the best NERs when applied on the most frequent types of NEs in Twitter [17]. More importantly, we believe
that this score is high enough to expect a positive impact of Kusuri
when integrated in larger applications.

Table 4. Precision, recall, F1 scores, and true positive/false positive/false negative counts for drug detection classifiers on the UPennHLP Twitter Pregnancy Corpus testing set

System                                                     Precision   Recall   F1 score   TP/FP/FN
1. Lexicon + variant classifier                                 55.0    71.43      62.15    55/45/22
2. Ensemble supervised bidirectional-GRUs
   a. Trained on UPennHLP Twitter Pregnancy Corpus              87.5    63.64      73.68    49/7/28
3. Ensemble DNNs (only module 2 [classifier] of Kusuri)
   a. Trained on UPennHLP Twitter Drug Corpus                  10.15    80.52      18.02    62/549/15
   b. Trained on UPennHLP Twitter Pregnancy Corpus             93.75    58.44      72.00    45/3/32
4. Kusuri (module 1 [filters] + module 2 [classifier])         94.55    67.53      78.79    52/3/25

DNN: deep neural network; GRU: gated recurrent unit.
DISCUSSION
As stated before, our experiments show Kusuri outperforming
the system 2.a (Table 4), a bidirectional GRU, but the difference
is not statistically significant. Increasing the number of positive
examples to the point at which the performance difference is sta-
tistically significant would require the collection and annotation
of a much larger corpus that exhibits the natural balance (<1%
of tweets positive for a drug mention). This could prove cost
prohibitive.
In this study, we opted for an oversampling strategy and cre-
ated an artificially balanced corpus to train our classifier, the
UPenn Twitter Drug Corpus. However, while our ensemble of
DNNs performs well on the Drug Corpus (an F1 score of 93.7%), its performance drops considerably when applied to the Pregnancy Corpus (an F1 score of 18%). Trained on our
balanced corpus, this classifier was biased toward disambiguat-
ing examples easily recognizable by our basic filters and could
not generalize well on other examples occurring in the Preg-
nancy Corpus, making the ensemble useless for real
applications.
The solution we implemented with Kusuri is to prefilter the
tweets of the timelines before applying our ensemble of DNNs.
This solution increases performance by 5.1 points over the best
baseline system (2.a in Table 4). However, our strategy is far
from perfect, as it reduces the number of false positives with hard filters
and, consequently, also limits the overall performance of Kusuri
by removing 27% (21 tweets) of the few tweets mentioning drugs
in the Pregnancy Corpus. We are currently replacing the hard fil-
ters with active learning to further train our ensemble of DNNs
and reduce its oversensitivity to medical phrases in general
tweets.
One may argue that, given that the users selected for our cohort are women reporting a pregnancy, the drugs mentioned in our dataset are biased toward a specific set of medications, with a higher num-
ber of tweets mentioning drugs commonly used in pregnancy and a
lower number of tweets mentioning drugs not recommended during
this period. However, this limitation is alleviated to an extent due to
the fact that our dataset includes tweets beyond those that were
posted during pregnancy; we collected the full timelines available
from the users, including a large number of tweets posted before and
after pregnancy.
Finally, we designed our neural network around a standard
representation of a sentence as a sequence of word embeddings
learned on local windows of words. Better alternatives have recently been proposed [34] and could be integrated in our system to help drug name disambiguation. We could replace our word embeddings with ELMo or BERT, which learn each word embedding within the whole context of a sentence [35,36], or supplement our current sentence representation with sentence embeddings [37].
Table 3. Categories of false positives and false negatives made by the drug detection classifier on the test set of the UPennHLP Twitter Drug Corpus

False positives
- Medical topic (41 errors). Examples: "<user> you should see a dermatologist if you can. You may just need something to break you out of a cycle. I used a topical and took pills"; "Lola may has a stye, or pink eye. Doc recommends warm compresses to see if it gets better today, but my eyes are itchy just looking at her."
- Weighted words/patterns (19). Examples: "i can take a wax brazillian a g"; "<user> i was robbed a foul when i took a three point shot and they got a few three pointers in. good game."
- Ambiguous name (12). Example: "<user> I actually really like Lyrica & A1."
- Food topic (11). Example: "This aerobically fermented product was tested & it's antibiotic residue free. also certified organic."
- Insufficient context (7). Example: "<user> adding Arnica to my shopping list"
- Cosmetic topic (5). Example: "Doc prescribed me this dandruff shampoo, if it works, I'm definitely getting a sew in after I'm done using it"
- Unknown (2). Example: "Ice_Cream, Ice-Cream and More Ice-Cream...thats Ol i Want"
- Error annotation (3)

False negatives
- Ambiguous name (36). Examples: "Trying Oil of Oregano & garlic for congestion for my sinus infection." [ambiguous dietary supplement]; "In the church the person close to me's sniffling & coughing... I need a bathe of bactine and some Airborne, right now" [ambiguous English word]
- Drug not/rarely seen (25). Examples: "That's the benzo effects!" [missing variant]; "Pennsylvania Appellate Court Revives 1,000 Prempro Cases Against Pfizer" [missing in lexicon]; "the percocet-thief plot makes Real World New Orleans look almost intriguing" [preprocessing error]
- Generic terms (18). Example: "Tossing and turning. I need ur sleep aid. Waiting patiently <user>"
- Nonmedical topic (11). Example: "<user> Meet Mr an Mrs Lexapro... .. guarenteed fidelity."
- Short tweets (3). Example: "arnica-ointment-7"
- Error annotation (7)
CONCLUSION
In this article, we presented Kusuri, an ensemble learning classifier
to identify tweets mentioning drug names. Given the unavailability
of a labeled corpus to train our system, we created and annotated
a balanced corpus of 15 005 tweets, the UPennHLP Twitter Drug
Corpus. The ensemble of deep neural networks at the core of
Kusuri’s decisions (Module 2) demonstrated performances close to
human annotators without requiring engineered features on this
corpus, with an F1 score of 93.7%. However, because we built this
corpus artificially, it did not represent the natural distribution of
drug mentions in Twitter. We evaluated Kusuri on a second cor-
pus, UPennHLP Twitter Pregnancy Corpus, made of all tweets
posted by 112 Twitter users, a total of 98 959 annotated tweets,
with only 258 tweets mentioning drugs. On this corpus, Kusuri
obtained an F1 score of 78.8%, a score comparable to the score
obtained on the most frequent types of NEs by the best systems
competing in well-established challenges, despite our corpus hav-
ing only 0.26% positive instances in it. The code of Kusuri and the
models used for these experiments are publicly available at https://
bitbucket.org/pennhlp/kusuri/. The UPennHLP Twitter Drug Cor-
pus is available at https://healthlanguageprocessing.org/kusuri. We
will release the UPennHLP Twitter Pregnancy Corpus during the
Fifth Social Media Mining for Health Applications shared task in
2020.
FUNDING
This work was supported by National Library of Medicine grant number
R01LM011176 to GG-H. The content is solely the responsibility of the
authors and does not necessarily represent the official view of National
Library of Medicine.
AUTHOR CONTRIBUTIONS
DW designed the experiments, preprocessed the data and anno-
tated a part of it, implemented Kusuri and computed the models,
analyzed the prediction errors, and wrote the majority of the man-
uscript. AS integrated the variant-based drug classifier in Kusuri,
wrote its description, and proofread the manuscript. AK edited the
manuscript. KO annotated the data and computed the interanno-
tator agreement. AM helped optimize the neural networks. GG-H
supervised the overall study design and edited the final
manuscript.
SUPPLEMENTARY MATERIAL
Supplementary material is available at Journal of the American
Medical Informatics Association online.
CONFLICT OF INTEREST STATEMENT
None declared.
REFERENCES
1. Sinnenberg L, Buttenheim AM, Padrez K, Mancheno C, Ungar L, Merchant RM. Twitter as a tool for health research: a systematic review. Am J Public Health 2017; 107 (1): e1–e8.
2. Velardi P, Stilo G, Tozzi AE, Gesualdo F. Twitter mining for fine-grained
syndromic surveillance. Artif Intell Med 2014; 61 (3): 153–63.
3. Kagashe I, Yan Z, Suheryani I. Enhancing seasonal influenza surveillance:
topic analysis of widely used medicinal drugs using twitter data. J Med In-
ternet Res 2017; 19 (9): e315.
4. Magge A, Sarker A, Nikfarjam A, Gonzalez-Hernandez G. Comment on: “Deep learning for pharmacovigilance: recurrent neural network architectures for labeling adverse drug reactions in twitter posts”. J Am Med Inform Assoc 2019; 26 (6): 577–9.
5. Kazemi DM, Borsari B, Levine MJ, Dooley B. Systematic review of sur-
veillance by social media platforms for illicit drug use. J Public Health
(Oxf) 2017; 39 (4): 763–76.
6. Sekine S, Nobata C. Definition, dictionaries and tagger for extended named
entity hierarchy. In: Proceedings of the Fourth International Conference on
Language Resources and Evaluation (LREC’04); 2004: 1977–80.
7. Liu X, Zhang S, Wei F, Zhou M. Recognizing named entities in tweets. In:
Proceedings of the 49th Annual Meeting of the Association for Computa-
tional Linguistics: Human Language Technologies; 2011: 359–67. https://
www.aclweb.org/anthology/P11-1037/
8. Sarker A, Belousov M, Friedrichs J, et al. Data and systems for
medication-related text classification and concept normalization from
Twitter: insights from the Social Media Mining for Health (SMM4H)-
2017 shared task. J Am Med Inform Assoc 2018; 25 (10): 1274–83.
9. Ritter A, Clark S, Etzioni M, Etzioni O. Named entity recognition in
tweets: an experimental study. In: Proceedings of the 2011 Conference on
Empirical Methods in Natural Language Processing; 2011: 1524–34.
10. Carbonell P, Mayer MA, Bravo À. Exploring brand-name drug mentions on twitter for pharmacovigilance. Stud Health Technol Inform 2015; 210: 55–9.
11. Rizzo G, Pereira B, Varga A, van Erp M, Basave AEC. Lessons learnt from
the named entity recognition and linking (NEEL) challenge series. Semant
Web 2017; 8 (5): 667–700.
12. Derczynski L, Nichols E, Erp MV, Limsopatham N. Results of the
WNUT2017 shared task on novel and emerging entity recognition. In: Proceedings of the 3rd Workshop on Noisy User-generated Text; 2017: 140–7.
13. Lopez C, Partalas I, Balikas G, et al. CAp 2017 challenge: twitter named entity recognition. arXiv 2017 Jul 24 [E-pub ahead of print].
14. Weissenbacher D, Sarker A, Paul MJ, Gonzalez-Hernandez G. Overview
of the third social media mining for health (SMM4H) shared tasks at
EMNLP 2018. In: Proceedings of the 2018 EMNLP Workshop SMM4H:
The 3rd Social Media Mining for Health Applications Workshop and
Shared Task; 2018: 13–16.
15. Strauss B, Toma BE, Ritter A, Marneffe M-C, Xu DW. Results of the
wnut16 named entity recognition shared task. In: Proceedings of the 2nd
Workshop on Noisy User-generated Text (WNUT); 2016: 138–44.
16. Sileo D, Pradel C, Muller P, Van de Cruys T. Synapse at CAp 2017 NER
challenge: Fasttext CRF. arXiv 2017 Sept 14 [E-pub ahead of print].
17. Limsopatham N, Collier N. Bidirectional LSTM for named entity recogni-
tion in twitter messages. In: Proceedings of the 2nd Workshop on Noisy
User-generated Text; 2016: 145–52.
18. Liu S, Tang B, Chen Q, Wan X. Drug name recognition: approaches and
resources. Information 2015; 6 (4): 790–810.
19. Uzuner O, Solti I, Cadag E. Extracting medication information from clini-
cal text. J Am Med Inform Assoc 2010; 17 (5): 514–8.
20. Segura-Bedmar I, Martínez P, Herrero-Zazo M. SemEval-2013 Task 9: extraction of drug-drug interactions from biomedical texts (DDIExtraction 2013). In: Proceedings of the 7th International Workshop on Semantic Evaluation (SemEval 2013); 2013: 341–50.
21. Krallinger M, Leitner F, Rabal O, Vazquez M, Oyarzabal J, Valencia A.
CHEMDNER: the drugs and chemical names extraction challenge. J
Cheminform 2015; 7: S1.
22. Sarker A, Gonzalez G. A corpus for mining drug-related knowledge from twit-
ter chatter: language models and their utilities. Data Brief 2017; 10: 122–31.
23. Jimeno-Yepes A, MacKinlay A, Han B, Chen Q. Identifying diseases, drugs,
and symptoms in twitter. Stud Health Technol Inform 2015; 216: 643–7.
24. Wu C, Wu F, Wu J, et al. Detecting tweets mentioning drug name and ad-
verse drug reaction with hierarchical tweet representation and multi-head
self-attention. In: Proceedings of the 2018 EMNLP Workshop SMM4H:
the 3rd Social Media Mining for Health Applications Workshop and
Shared Task; 2018: 34–7.
25. Sarker A, Chandrashekar P, Magge A, Cai H, Klein A, Gonzalez G. Dis-
covering cohorts of pregnant women from social media for safety surveil-
lance and analysis. J Med Internet Res 2017; 19 (10): e361.
26. Sarker A, Gonzalez-Hernandez G. An unsupervised and customizable mis-
spelling generator for mining noisy health-related text sources. J Biomed
Inform 2018; 88: 98–107.
27. Shen S-S, Lee HY. Neural attention models for sequence classification:
analysis and application to key term extraction and dialogue act detection.
In: Proceedings of INTERSPEECH’16; 2016: 2716–20.
28. Grave E. Weakly supervised named entity classification. Workshop on Au-
tomated Knowledge Base Construction (AKBC); December 13, 2014;
Montreal, Canada.
29. Golder SP, Chiuve S, Weissenbacher D, et al. Pharmacoepidemiologic
evaluation of birth defects from health-related postings in social media
during pregnancy. Drug Saf 2019; 42: 389–400.
30. Vanni L, Ducoffe M, Mayaffre D, et al. Text deconvolution saliency
(TDS): a deep tool box for linguistic analysis. In: Proceedings of ACL’18,
56th Annual Meeting of the Association for Computational Linguistics
(ACL); 2018.
31. Raschka S, Mirjalili V. Python Machine Learning: Machine Learning and
Deep Learning with Python, Scikit-Learn, and TensorFlow. 2nd ed. Bir-
mingham, UK: Packt Publishing Ltd; 2017.
32. Chalapathy R, Borzeshi E, Piccardi M. An investigation of recurrent neu-
ral architectures for drug name recognition. In: proceedings of the Seventh
International Workshop on Health Text Mining and Information Analysis
(LOUHI); 2016: 1–5.
33. Dietterich TG. Approximate statistical tests for comparing supervised
classification learning algorithms. Neural Comput 1998; 10 (7):
1895–923.
34. Wang A, Singh A, Michael J, Hill F, Levy O, Bowman SR. GLUE: a multi-
task benchmark and analysis platform for natural language understand-
ing. BlackboxNLP@EMNLP; 2018.
35. Devlin J, Chang M-W, Lee K, Toutanova K. BERT: Pre-training of deep
bidirectional transformers for language understanding. arXiv 2019 May
24 [E-pub ahead of print].
36. Peters ME, Neumann M, Iyyer M, et al. Deep contextualized word repre-
sentations. In Proceedings of NAACL-HLT 2018; 2018: 227–37.
37. Conneau A, Kiela D, Schwenk H, Barrault L, Bordes A. Supervised learn-
ing of universal sentence representations from natural language inference
data. In: Proceedings of the 2017 Conference on Empirical Methods in
Natural Language Processing; 2017: 670–80.
... This approach has several limitations, even when allowing for variants and misspellings. In a prior study, (1), when using the lexical match approach on a corpus where names of medications are rare, the authors retrieved only 71% of the tweets manually identified as mentioning a medication, and more than 45% of the tweets retrieved were false positives. For example, when tweets mention the word 'propel', it denotes predominantly the verb and not the brand name of a corticosteroid. ...
... The lexicon contains 44 498 medication names from RxNorm (13). Automatically generated variants of the medication names were added to the lexicon to account for misspellings using the method described in (1). The variants were manually curated to remove those that could be confused with common English words (such as 'some', a variant generated for 'Sone', a corticosteroid). ...
... One may think that the classifier module is not needed since the extractor performs at the same time the detection of the tweets mentioning medication names and the extraction of their positions. However, empirical results in (1) indicate that separating the classification and the extraction steps may facilitate optimizing the system as a whole by optimizing each individually. The loss function of the classifier focuses on the semantics of health-related tweets and that of the extractor on detecting the spans of the medications. ...
Article
Full-text available
This study presents the outcomes of the shared task competition BioCreative VII (Task 3) focusing on the extraction of medication names from a Twitter user's publicly available tweets (the user's 'timeline'). In general, detecting health-related tweets is notoriously challenging for natural language processing tools. The main challenge, aside from the informality of the language used, is that people tweet about any and all topics, and most of their tweets are not related to health. Thus, finding those tweets in a user's timeline that mention specific health-related concepts such as medications requires addressing extreme imbalance. Task 3 called for detecting tweets in a user's timeline that mentions a medication name and, for each detected mention, extracting its span. The organizers made available a corpus consisting of 182 049 tweets publicly posted by 212 Twitter users with all medication mentions manually annotated. The corpus exhibits the natural distribution of positive tweets, with only 442 tweets (0.2%) mentioning a medication. This task was an opportunity for participants to evaluate methods that are robust to class imbalance beyond the simple lexical match. A total of 65 teams registered, and 16 teams submitted a system run. This study summarizes the corpus created by the organizers and the approaches taken by the participating teams for this challenge. The corpus is freely available at https://biocreative.bioinformatics.udel.edu/tasks/biocreative-vii/track-3/. The methods and the results of the competing systems are analyzed with a focus on the approaches taken for learning from class-imbalanced data.
... Former works have reported optimal results by means of pre-annotating medical texts [7,8] or data augmentation using synonyms from a lexicon [9]. Recent teams have applied hybrid methods [10], integrating pre-annotation in the pipeline or using the prediction of a terminology-based system as features for a neural network model [11][12][13]. Thus, creating resources adapted to the medical terminology and health literature is beneficial to obtain optimal results [14]. ...
... This resource provides an extensive coverage of rare diseases. 10 The Spanish Drug Effect database (SDEdb) [54]: this resource gathers terms related to adverse effects obtained from drug packages and medical web sites and social media. This database provides both new drugrelated terms and laymen variants of technical words (e.g. ...
Article
Full-text available
Background: Medical lexicons enable the natural language processing (NLP) of health texts. Lexicons gather terms and concepts from thesauri and ontologies, and linguistic data for part-of-speech (PoS) tagging, lemmatization or natural language generation. To date, there is no such type of resource for Spanish. Construction and content: This article describes a unified medical lexicon for Medical Natural Language Processing in Spanish. MedLexSp includes terms and inflected word forms with PoS information and Unified Medical Language System® (UMLS) semantic types, groups and Concept Unique Identifiers (CUIs). To create it, we used NLP techniques and domain corpora (e.g. MedlinePlus). We also collected terms from the Dictionary of Medical Terms from the Spanish Royal Academy of Medicine, the Medical Subject Headings (MeSH), the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED-CT), the Medical Dictionary for Regulatory Activities Terminology (MedDRA), the International Classification of Diseases version 10, the Anatomical Therapeutic Chemical Classification, the National Cancer Institute (NCI) Dictionary, the Online Mendelian Inheritance in Man (OMIM) and OrphaData. Terms related to COVID-19 were assembled by applying a similarity-based approach with word embeddings trained on a large corpus. MedLexSp includes 100 887 lemmas, 302 543 inflected forms (conjugated verbs and number/gender variants), and 42 958 UMLS CUIs. We report two use cases of MedLexSp. First, applying the lexicon to pre-annotate a corpus of 1200 texts related to clinical trials. Second, PoS tagging and lemmatizing texts about clinical cases. MedLexSp improved the scores for PoS tagging and lemmatization compared to the default Spacy and Stanza python libraries. Conclusions: The lexicon is distributed in a delimiter-separated value file; an XML file with the Lexical Markup Framework; a lemmatizer module for the Spacy and Stanza libraries; and complementary Lexical Record (LR) files. The embeddings and code to extract COVID-19 terms, and the Spacy and Stanza lemmatizers enriched with medical terms are provided in a public repository.
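As an illustration of how such a delimiter-separated lexicon might be consumed, here is a minimal sketch assuming a tab-separated layout of form, lemma, PoS, and CUI columns; the actual MedLexSp column layout may differ, so consult its distribution for the real schema:

```python
# Sketch: load a delimiter-separated medical lexicon and look up an inflected
# form. The four-column layout (form, lemma, PoS, CUI) is an assumption for
# illustration; the real MedLexSp schema may differ.
import csv
from collections import defaultdict

def load_lexicon(path: str) -> dict:
    entries = defaultdict(list)
    with open(path, newline="", encoding="utf-8") as f:
        for form, lemma, pos, cui in csv.reader(f, delimiter="\t"):
            entries[form.lower()].append({"lemma": lemma, "pos": pos, "cui": cui})
    return entries

# Hypothetical usage, assuming the file name and schema above:
# lexicon = load_lexicon("MedLexSp.tsv")
# lexicon["cefaleas"] -> [{'lemma': 'cefalea', 'pos': 'NOUN', 'cui': 'C0018681'}]
```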
... It is thus not surprising that the Social Media Mining for Health (#SMM4H) initiative is addressing these problems in its agenda (Klein, 2021; Magge et al., 2021). This initiative uses social media data as a source for solving health-related tasks and problems such as finding mentions of diseases and symptoms (Klein, 2021; Magge et al., 2021; Weissenbacher et al., 2019). However, this rich data source lacks the demographic information necessary for statistics on social variation in the health literacy study. ...
... All these datasets have been used for multiple use-case scenarios. For instance, the Twitter dataset on drug-related knowledge [57] was used for detecting medication mentions on Twitter [84], region-specific monitoring and characterization of opioid-related social media chatter [85], tracking birth defect-related conversations on Twitter [86], detection of self-reports of prescription medication abuse from Twitter [87], development of a methodology for automatic detection of a breast cancer cohort from Tweets [88], development of a methodology to identify mentions of specific drugs on Twitter [89], and identifying conversations on Twitter related to the adverse drug reactions (ADRs) of marketed drugs [90]. Similarly, the Twitter dataset of conversations about the efficacy of hydroxychloroquine as a treatment for COVID-19 was used for stance detection in Tweets related to COVID-19 [91], misinformation detection on Twitter [92], detection of fake news related to COVID-19 [93], studying the public perceptions of approved versus off-label use for COVID-19-related medications [94], understanding public opinion on using hydroxychloroquine for COVID-19 treatment [95], stance detection towards vaccination for COVID-19 [96], and a few other applications. ...
Article
Full-text available
The mining of Tweets to develop datasets on recent issues, global challenges, pandemics, virus outbreaks, emerging technologies, and trending matters has been of significant interest to the scientific community in the recent past, as such datasets serve as a rich data resource for the investigation of different research questions. Furthermore, the virus outbreaks of the past, such as COVID-19, Ebola, Zika virus, and flu, just to name a few, were associated with various works related to the analysis of the multimodal components of Tweets to infer the different characteristics of conversations on Twitter related to these respective outbreaks. The ongoing outbreak of the monkeypox virus, declared a Global Public Health Emergency (GPHE) by the World Health Organization (WHO), has resulted in a surge of conversations about this outbreak on Twitter, which is resulting in the generation of tremendous amounts of Big Data. There has been no prior work in this field thus far that has focused on mining such conversations to develop a Twitter dataset. Furthermore, no prior work has focused on performing a comprehensive analysis of Tweets about this ongoing outbreak. To address these challenges, this work makes three scientific contributions to this field. First, it presents an open-access dataset of 556,427 Tweets about monkeypox that have been posted on Twitter since the first detected case of this outbreak. A comparative study is also presented that compares this dataset with 36 prior works in this field that focused on the development of Twitter datasets to further uphold the novelty, relevance, and usefulness of this dataset. Second, the paper reports the results of a comprehensive analysis of the Tweets of this dataset. This analysis presents several novel findings; for instance, out of all the 34 languages supported by Twitter, English has been the most used language to post Tweets about monkeypox, about 40,000 Tweets related to monkeypox were posted on the day WHO declared monkeypox as a GPHE, a total of 5470 distinct hashtags have been used on Twitter about this outbreak out of which #monkeypox is the most used hashtag, and Twitter for iPhone has been the leading source of Tweets about the outbreak. The sentiment analysis of the Tweets was also performed, and the results show that despite a lot of discussions, debate, opinions, information, and misinformation, on Twitter on various topics in this regard, such as monkeypox and the LGBTQI+ community, monkeypox and COVID-19, vaccines for monkeypox, etc., “neutral” sentiment was present in most of the Tweets. It was followed by “negative” and “positive” sentiments, respectively. Finally, to support research and development in this field, the paper presents a list of 50 open research questions related to the outbreak in the areas of Big Data, Data Mining, Natural Language Processing, and Machine Learning that may be investigated based on this dataset.
... Yang, Liu, Qian, Guan, & Yuan, 2019; Zhang, Zhang, Zhou, & Pang, 2019) and clinical entity relation extraction (Chen et al., 2018; Hu et al., 2018; Z. Li et al., 2019b; Munkhdalai et al., 2018; Shi et al., 2019), temporal relation (Choi et al., 2017), temporal matching (Lüneburg et al., 2019), semantic representation (Deng, Faulstich, & Denecke, 2017), de-identification (Lee, Filannino, & Uzuner, 2019; Obeid et al., 2019; Richter-Pechanski et al., 2019), medical question-answering (Ben Abacha & Demner-Fushman, 2019; Hu et al., 2018), and dealing with text ambiguity, such as abbreviation disambiguation (Joopudi, Dandala, & Devarakonda, 2018), prediction of ambiguous terms (Pesaranghader, Matwin, Sokolova, & Pesaranghader, 2019) and disambiguation methods (Wei, Lee, Leaman, & Lu, 2019; Weissenbacher et al., 2019). ...
Article
Full-text available
Neurologic disability level at hospital discharge is an important outcome in many clinical research studies. Outside of clinical trials, neurologic outcomes must typically be extracted by labor-intensive manual review of clinical notes in the electronic health record (EHR). To overcome this challenge, we set out to develop a natural language processing (NLP) approach that automatically reads clinical notes to determine neurologic outcomes, to make it possible to conduct larger scale neurologic outcomes studies. We obtained 7314 notes from 3632 patients hospitalized at two large Boston hospitals between January 2012 and June 2020, including discharge summaries (3485), occupational therapy (1472) and physical therapy (2357) notes. Fourteen clinical experts reviewed notes to assign scores on the Glasgow Outcome Scale (GOS) with 4 classes, namely ‘good recovery’, ‘moderate disability’, ‘severe disability’, and ‘death’ and on the Modified Rankin Scale (mRS), with 7 classes, namely ‘no symptoms’, ‘no significant disability’, ‘slight disability’, ‘moderate disability’, ‘moderately severe disability’, ‘severe disability’, and ‘death’. For 428 patients’ notes, 2 experts scored the cases, generating interrater reliability estimates for GOS and mRS. After preprocessing and extracting features from the notes, we trained a multiclass logistic regression model using LASSO regularization and 5-fold cross validation for hyperparameter tuning. The model performed well on the test set, achieving a micro-averaged area under the receiver operating characteristic curve and F-score of 0.94 (95% CI 0.93–0.95) and 0.77 (0.75–0.80) for GOS, and 0.90 (0.89–0.91) and 0.59 (0.57–0.62) for mRS, respectively. Our work demonstrates that an NLP algorithm can accurately assign neurologic outcomes based on free text clinical notes. This algorithm increases the scale of research on neurological outcomes that is possible with EHR data.
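A minimal scikit-learn sketch of the modeling recipe this abstract describes, multiclass logistic regression with LASSO (L1) regularization tuned by 5-fold cross-validation; the feature extraction, grid, and data below are placeholders, not the authors' code:

```python
# Sketch of the recipe described above: multiclass logistic regression with
# LASSO (L1) regularization and 5-fold cross-validation for hyperparameter
# tuning. Features, grid values, and data are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("clf", LogisticRegression(penalty="l1", solver="saga", max_iter=5000)),
])
search = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1.0, 10.0]},
                      cv=5, scoring="f1_micro")

notes = ["pt ambulating independently", "unresponsive, comfort measures"]
labels = ["good recovery", "death"]
# search.fit(notes, labels)  # with a real corpus; two examples cannot fill 5 folds
```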
... These recent advances have been mostly implemented in the field of pharmacovigilance. Weissenbacher et al. [40] used deep neural networks, notably LSTMs, to find mentions of medications in tweets. This could be further used to detect medication mentions associated with mentions of adverse effects or toxicities. ...
Article
Full-text available
Adverse Outcome Pathways (AOPs) are conceptual frameworks that tie an initial perturbation (molecular initiating event) to a phenotypic toxicological manifestation (adverse outcome), through a series of steps (key events). They therefore provide a standardized way to map and organize toxicological mechanistic information. As such, AOPs inform on key events underlying toxicity, thus supporting the development of New Approach Methodologies (NAMs), which aim to reduce the use of animal testing for toxicology purposes. However, the establishment of a novel AOP relies on the gathering of multiple streams of evidence and information, from available literature to knowledge databases. Often, this information is in the form of free text, also called unstructured text, which is not immediately digestible by a computer. Processing this information manually is thus both tedious and increasingly time-consuming given the growing volume of data available. The advancement of machine learning provides alternative solutions to this challenge. To extract and organize information from relevant sources, it seems valuable to employ deep learning Natural Language Processing techniques. We review here some of the recent progress in the NLP field, and show how these techniques have already demonstrated value in the biomedical and toxicology areas. We also propose an approach to efficiently and reliably extract and combine relevant toxicological information from text. This data can be used to map underlying mechanisms that lead to toxicological effects and start building quantitative models, in particular AOPs, ultimately allowing animal-free human-based hazard and risk assessment.
... They can also be used to communicate timely messages to the population [84] and thus to increase the chance of successful adoption of measures by the population. The development of indicators based on the real-time tracking of health-related conversations on social media is becoming crucial [9, 85-87]. A major contribution of this study is to show the usefulness of deep learning methods to simultaneously capture public opinion and associated sentiments from large amounts of social media data. ...
Article
Full-text available
Background Public engagement is a key element for mitigating pandemics, and a good understanding of public opinion could help to encourage the successful adoption of public health measures by the population. In past years, deep learning has been increasingly applied to the analysis of text from social networks. However, most of the developed approaches can only capture topics or sentiments alone but not both together. Objective Here, we aimed to develop a new approach, based on deep neural networks, for simultaneously capturing public topics and sentiments and applied it to tweets sent just after the announcement of the COVID-19 pandemic by the World Health Organization (WHO). Methods A total of 1,386,496 tweets were collected, preprocessed, and split 80:20 into training and validation sets. We combined lexicons and convolutional neural networks to improve sentiment prediction. The trained model achieved an overall accuracy of 81% and a precision of 82% and was able to simultaneously capture the weighted words associated with a predicted sentiment intensity score. These outputs were then visualized via an interactive and customizable web interface based on a word cloud representation. Using word cloud analysis, we captured the main topics for extreme positive and negative sentiment intensity scores. Results In reaction to the announcement of the pandemic by the WHO, 6 negative and 5 positive topics were discussed on Twitter. Twitter users seemed to be worried about the international situation, economic consequences, and medical situation. Conversely, they seemed to be satisfied with the commitment of medical and social workers and with the collaboration between people. Conclusions We propose a new method based on deep neural networks for simultaneously extracting public topics and sentiments from tweets. This method could be helpful for monitoring public opinion during crises such as pandemics.
Article
Full-text available
Background: Medication noncompliance is a critical issue because of the increased number of drugs sold on the web. Web-based drug distribution is difficult to control, causing problems such as drug noncompliance and abuse. The existing medication compliance surveys lack completeness because it is impossible to cover patients who do not go to the hospital or provide accurate information to their doctors, so a social media-based approach is being explored to collect information about drug use. Social media data, which includes information on drug usage by users, can be used to detect drug abuse and medication compliance in patients. Objective: This study aimed to assess how the structural similarity of drugs affects the efficiency of machine learning models for text classification of drug noncompliance. Methods: This study analyzed 22,022 tweets about 20 different drugs. The tweets were labeled as either noncompliant use or mention, noncompliant sales, general use, or general mention. The study compares 2 methods for training machine learning models for text classification: single-sub-corpus transfer learning, in which a model is trained on tweets about a single drug and then tested on tweets about other drugs, and multi-sub-corpus incremental learning, in which models are trained on tweets about drugs in order of their structural similarity. The performance of a machine learning model trained on a single subcorpus (a data set of tweets about a specific category of drugs) was compared to the performance of a model trained on multiple subcorpora (data sets of tweets about multiple categories of drugs). Results: The results showed that the performance of the model trained on a single subcorpus varied depending on the specific drug used for training. The Tanimoto similarity (a measure of the structural similarity between compounds) was weakly correlated with the classification results. The model trained by transfer learning on a corpus of drugs with close structural similarity performed better than the model trained by randomly adding a subcorpus when the number of subcorpora was small. Conclusions: The results suggest that structural similarity improves the classification performance of messages about unknown drugs if only a few drugs are available in the training corpus. On the other hand, this indicates that there is little need to consider the influence of the Tanimoto structural similarity if a sufficient variety of drugs is ensured.
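For reference, the Tanimoto similarity between two binary molecular fingerprints is |A ∩ B| / |A ∪ B|; a minimal worked sketch follows, where the bit sets are toy examples rather than fingerprints of real compounds:

```python
# Sketch: Tanimoto similarity between two binary molecular fingerprints,
# T(A, B) = |A & B| / |A | B|. The bit sets below are toy examples, not
# fingerprints of real compounds.
def tanimoto(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a | b) else 0.0

fp_drug1 = {1, 4, 7, 9, 12}   # indices of set bits in a toy fingerprint
fp_drug2 = {1, 4, 8, 9, 15}
print(tanimoto(fp_drug1, fp_drug2))  # 3 shared / 7 total = ~0.43
```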
Article
Full-text available
Most of the medicines used to treat the novel coronavirus infection (COVID-19) are either approved under an accelerated procedure or not approved for the indication. Consequently, their safety requires special attention. The aim of the study was to review methodological approaches to collecting data on the safety of medicines, using COVID-19 treatment regimens involving azithromycin as a case study. Materials and methods: PubMed® (MEDLINE), Scopus, eLIBRARY, and Cyberleninka databases were searched for publications on azithromycin as part of combination therapy for COVID-19 in 2020–2021. Search queries included names of the medicinal product or its pharmacotherapeutic group and words describing adverse drug reactions (ADRs) during treatment. Results: the analysis included 7 publications presenting the results of studies covering the use of azithromycin as part of COVID-19 combination therapy in more than 4000 patients. Most commonly, the patients receiving COVID-19 therapy including azithromycin developed cardiovascular ADRs (up to 30% of azithromycin prescription cases). In 3 of the analysed publications, safety information was collected through spontaneous reporting and active identification based on the findings of laboratory and instrumental investigations performed during the clinical studies; in 3 others, only spontaneous reports were used; and in the last one, ADR database information was studied. Conclusion: currently, information on ADRs associated with the use of medicines is mainly gathered via spontaneous reporting. Sourcing information on personal experiences with a product directly from patients, including through social media analysis, opens a promising direction for improving existing approaches to collecting safety data.
Article
Full-text available
Background: Data collection and extraction from noisy text sources such as social media typically rely on keyword-based searching/listening. However, health-related terms are often misspelled in such noisy text sources due to their complex morphology, resulting in the exclusion of relevant data for studies. In this paper, we present a customizable data-centric system that automatically generates common misspellings for complex health-related terms, which can improve the data collection process from noisy text sources. Materials and methods: The spelling variant generator relies on a dense vector model learned from large, unlabeled text, which is used to find semantically close terms to the original/seed keyword, followed by the filtering of terms that are lexically dissimilar beyond a given threshold. The process is executed recursively, converging when no new terms similar (lexically and semantically) to the seed keyword are found. The weighting of intra-word character sequence similarities allows further problem-specific customization of the system. Results: On a dataset prepared for this study, our system outperforms the current state-of-the-art medication name variant generator with a best F1-score of 0.69 and F14-score of 0.78. Extrinsic evaluation of the system on a set of cancer-related terms showed an increase of over 67% in retrieval rate from Twitter posts when the generated variants are included. Discussion: Our proposed spelling variant generator has several advantages over the existing spelling variant generators: (i) it is capable of filtering out lexically similar but semantically dissimilar terms, (ii) the number of variants generated is low, as many low-frequency and ambiguous misspellings are filtered out, and (iii) the system is fully automatic, customizable and easily executable. While the base system is fully unsupervised, we show how supervision may be employed to adjust weights for task-specific customizations. Conclusion: The performance and relative simplicity of our proposed approach make it a much-needed spelling variant generation resource for health-related text mining from noisy sources. The source code for the system has been made publicly available for research.
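A simplified sketch of the generate-and-filter loop this abstract describes: take semantic neighbors of a seed term from a word-embedding model, keep those that are also lexically close to the seed, and recurse until no new variants appear. The embedding source, similarity measure, and threshold below are placeholders; the cited system's actual scoring is more elaborate:

```python
# Simplified sketch of the recursive generate-and-filter loop described above.
# The neighbor source, lexical similarity measure, and threshold are
# placeholders, not the cited system's actual components.
from difflib import SequenceMatcher

def lexical_sim(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def generate_variants(seed, neighbors_fn, lex_threshold=0.8):
    variants, frontier = {seed}, [seed]
    while frontier:
        term = frontier.pop()
        for neighbor in neighbors_fn(term):       # semantically close terms
            if neighbor not in variants and lexical_sim(seed, neighbor) >= lex_threshold:
                variants.add(neighbor)            # lexically close too: keep
                frontier.append(neighbor)         # recurse on the new variant
    return variants - {seed}

# neighbors_fn would wrap e.g. a gensim KeyedVectors.most_similar() call;
# here a toy neighbor table stands in for the embedding model.
toy_neighbors = {"seroquel": ["serroquel", "sleep"],
                 "serroquel": ["seroquil"], "seroquil": []}
print(generate_variants("seroquel", lambda t: toy_neighbors.get(t, [])))
# -> {'serroquel', 'seroquil'}; 'sleep' is semantically close but lexically far
```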
Article
Full-text available
Introduction Adverse effects of medications taken during pregnancy are traditionally studied through post-marketing pregnancy registries, which have limitations. Social media data may be an alternative data source for pregnancy surveillance studies. Objective The objective of this study was to assess the feasibility of using social media data as an alternative source for pregnancy surveillance for regulatory decision making. Methods We created an automated method to identify Twitter accounts of pregnant women. We identified 196 pregnant women with a mention of a birth defect in relation to their baby and 196 without such a mention. We extracted information on pregnancy and maternal demographics, medication intake and timing, and birth defects. Results Although often incomplete, we extracted data for the majority of the pregnancies. Among women who reported birth defects, 35% reported taking one or more medications during pregnancy compared with 17% of controls. After accounting for age, race, and place of residence, a higher medication intake was observed in women who reported birth defects. The rate of birth defects in the pregnancy cohort was lower (0.44%) compared with the rate in the general population (3%). Conclusions Twitter data capture information on medication intake and birth defects; however, the information obtained cannot replace pregnancy registries at this time. Development of improved methods to automatically extract and annotate social media data may increase their value to support regulatory decision making regarding pregnancy outcomes in women using medications during their pregnancies.
Article
Full-text available
Objective: We executed the Social Media Mining for Health (SMM4H) 2017 shared tasks to enable the community-driven development and large-scale evaluation of automatic text processing methods for the classification and normalization of health-related text from social media. An additional objective was to publicly release manually annotated data. Materials and methods: We organized 3 independent subtasks: automatic classification of self-reports of 1) adverse drug reactions (ADRs) and 2) medication consumption, from medication-mentioning tweets, and 3) normalization of ADR expressions. Training data consisted of 15 717 annotated tweets for (1), 10 260 for (2), and 6650 ADR phrases and identifiers for (3); and exhibited typical properties of social-media-based health-related texts. Systems were evaluated using 9961, 7513, and 2500 instances for the 3 subtasks, respectively. We evaluated performances of classes of methods and ensembles of system combinations following the shared tasks. Results: Among 55 system runs, the best system scores for the 3 subtasks were 0.435 (ADR class F1-score) for subtask-1, 0.693 (micro-averaged F1-score over two classes) for subtask-2, and 88.5% (accuracy) for subtask-3. Ensembles of system combinations obtained best scores of 0.476, 0.702, and 88.7%, outperforming individual systems. Discussion: Among individual systems, support vector machines and convolutional neural networks showed high performance. Performance gains achieved by ensembles of system combinations suggest that such strategies may be suitable for operational systems relying on difficult text classification tasks (eg, subtask-1). Conclusions: Data imbalance and lack of context remain challenges for natural language processing of social media text. Annotated data from the shared task have been made available as reference standards for future studies (http://dx.doi.org/10.17632/rxwfb3tysd.1).
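As background on the system-combination ensembles evaluated in this shared task, here is a minimal majority-vote sketch over per-system predictions; the labels and systems are illustrative, and the task's actual combination strategies varied (e.g., weighted voting):

```python
# Minimal sketch of a majority-vote system combination of the kind evaluated
# in the shared task; real combinations also explored weighted schemes.
from collections import Counter

def majority_vote(predictions_per_system):
    """predictions_per_system: list of label lists, one list per system."""
    return [Counter(labels).most_common(1)[0][0]
            for labels in zip(*predictions_per_system)]

system_a = ["ADR", "noADR", "ADR"]
system_b = ["ADR", "noADR", "noADR"]
system_c = ["noADR", "noADR", "ADR"]
print(majority_vote([system_a, system_b, system_c]))  # ['ADR', 'noADR', 'ADR']
```

Combining systems this way tends to help most on difficult, imbalanced tasks such as subtask-1, which is consistent with the gains the abstract reports.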
Article
For natural language understanding (NLU) technology to be maximally useful, both practically and as a scientific object of study, it must be general: it must be able to process language in a way that is not exclusively tailored to any one specific task or dataset. In pursuit of this objective, we introduce the General Language Understanding Evaluation benchmark (GLUE), a tool for evaluating and analyzing the performance of models across a diverse range of existing NLU tasks. GLUE is model-agnostic, but it incentivizes sharing knowledge across tasks because certain tasks have very limited training data. We further provide a hand-crafted diagnostic test suite that enables detailed linguistic analysis of NLU models. We evaluate baselines based on current methods for multi-task and transfer learning and find that they do not immediately give substantial improvements over the aggregate performance of training a separate model per task, indicating room for improvement in developing general and robust NLU systems.
Conference Paper
Many modern NLP systems rely on word embeddings, previously trained in an unsupervised manner on large corpora, as base features. Efforts to obtain embeddings for larger chunks of text, such as sentences, have however not been so successful. Several attempts at learning unsupervised representations of sentences have not reached satisfactory enough performance to be widely adopted. In this paper, we show how universal sentence representations trained using the supervised data of the Stanford Natural Language Inference datasets can consistently outperform unsupervised methods like SkipThought vectors on a wide range of transfer tasks. Much like how computer vision uses ImageNet to obtain features, which can then be transferred to other tasks, our work tends to indicate the suitability of natural language inference for transfer learning to other NLP tasks. Our encoder is publicly available.