Proceedings of the 4th Social Media Mining for Health Applications (#SMM4H) Workshop & Shared Task, pages 21–30
Florence, Italy, August 2, 2019. ©2019 Association for Computational Linguistics
Overview of the Fourth Social Media Mining for Health (#SMM4H)
Shared Task at ACL 2019
Davy Weissenbacher, Abeed Sarker, Arjun Magge, Ashlynn Daughton,
Karen O’Connor, Michael Paul, Graciela Gonzalez-Hernandez
DBEI, Perelman School of Medicine, University of Pennsylvania, PA, USA
Biodesign Center for Environmental Health Engineering, Biodesign Institute,
Arizona State University, AZ, USA
Information Science, University of Colorado Boulder, CO, USA
Abstract

The number of users of social media continues to grow, with nearly half of adults worldwide and two-thirds of all American adults using social networking on a regular basis1.
Advances in automated data processing and
NLP present the possibility of utilizing this
massive data source for biomedical and pub-
lic health applications, if researchers address
the methodological challenges unique to this
media. We present the Social Media Mining
for Health Shared Tasks collocated with the
ACL at Florence in 2019, which address these
challenges for health monitoring and surveil-
lance, utilizing state of the art techniques for
processing noisy, real-world, and substantially
creative language expressions from social me-
dia users. For the fourth execution of this chal-
lenge, we proposed four different tasks. Task
1 asked participants to distinguish tweets re-
porting an adverse drug reaction (ADR) from
those that do not. Task 2, a follow-up to Task
1, asked participants to identify the span of
text in tweets reporting ADRs. Task 3 is an
end-to-end task where the goal was to first de-
tect tweets mentioning an ADR and then map
the extracted colloquial mentions of ADRs in
the tweets to their corresponding standard con-
cept IDs in the MedDRA vocabulary. Finally,
Task 4 asked participants to classify whether
a tweet contains a personal mention of one’s
health, a more general discussion of the health
issue, or is an unrelated mention. A total of
34 teams from around the world registered
and 19 teams from 12 countries submitted a
system run. We summarize here the corpora
for this challenge, which are freely available
at https://competitions.codalab.
org/competitions/22521, and present
an overview of the methods and the results of
the competing systems.
1Pew Research Center. Social Media Fact Sheet. 2017. [Online]. Available:
1 Introduction
The intent of the #SMM4H shared tasks se-
ries is to challenge the community with Natu-
ral Language Processing tasks for mining rele-
vant data for health monitoring and surveillance
in social media. Such challenges require pro-
cessing imbalanced, noisy, real-world, and sub-
stantially creative language expressions from so-
cial media. The competing systems should be
able to deal with many linguistic variations and
semantic complexities in the various ways peo-
ple express medication-related concepts and out-
comes. It has been shown in past research (Liu
et al.,2011;Giuseppe et al.,2017) that automated
systems frequently under-perform when exposed
to social media text because of the presence of
novel/creative phrases, misspellings and frequent
use of idiomatic, ambiguous and sarcastic expres-
sions. The tasks act as a discovery and verification
process of what approaches work best for social
media data.
As in previous years, our tasks focused on min-
ing health information from Twitter. This year
we challenged the community with two different
problems. The first problem focuses on perform-
ing pharmacovigilance from social media data. It
is now well understood that social media data may
contain reports of adverse drug reactions (ADRs)
and these reports may complement traditional ad-
verse event reporting systems, such as the FDA
adverse event reporting system (FAERS). How-
ever, automatically curating reports from adverse
reactions from Twitter requires the application of
a series of NLP methods in an end-to-end pipeline
(Sarker et al.,2015). The first three tasks of this
year’s challenge represent three key NLP prob-
lems in a social media based pharmacovigilance
pipeline — (i) automatic classification of ADRs,
(ii) extraction of spans of ADRs and (iii) normal-
ization of the extracted ADRs to standardized IDs.
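The three stages of such a pipeline can be sketched end to end. This is a minimal illustrative sketch only: the keyword heuristics, the `so drowsy` phrase lookup, and the `PT_SOMNOLENCE` code are invented stand-ins for trained classifiers, extractors, and real MedDRA identifiers.

```python
def classify_has_adr(tweet):
    # stage (i): binary ADR classification (stub: naive keyword check,
    # standing in for a trained classifier)
    return "makes me" in tweet or "gave me" in tweet

def extract_adr_spans(tweet):
    # stage (ii): span extraction (stub: fixed-phrase lookup,
    # standing in for a trained NER model)
    phrase = "so drowsy"
    start = tweet.find(phrase)
    return [(start, start + len(phrase))] if start >= 0 else []

def normalize_to_meddra(tweet, span):
    # stage (iii): normalization (stub mapping; the PT code is invented)
    mention = tweet[span[0]:span[1]].lower()
    return {"so drowsy": "PT_SOMNOLENCE"}.get(mention)

def pipeline(tweet):
    # compose the three stages: classify, extract, normalize
    if not classify_has_adr(tweet):
        return []
    return [(s, normalize_to_meddra(tweet, s)) for s in extract_adr_spans(tweet)]

results = pipeline("this med makes me so drowsy")
```

Each stub corresponds to one of tasks 1-3; real systems replace them with learned components, but the composition shown here is the shape of the end-to-end task.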
The second problem explores the generalizabil-
ity of predictive models. In health research us-
ing social media, it is often necessary for re-
searchers to build individual classifiers to iden-
tify health mentions of a particular disease in a
particular context. Classification models that can
generalize to different health contexts would be
greatly beneficial to researchers in these fields
(e.g., (Payam and Eugene,2018)), as this would
allow researchers to more easily apply existing
tools and resources to new problems. Motivated
by these ideas, Task 4 tested tweet classification methods across diverse health contexts: the test data included a very different health context than the training data. This setting measures
the ability of tweet classifiers to generalize across
health contexts.
The fourth iteration of our series follows the
same organization as previous iterations. We col-
lected posts from Twitter, annotated the data for
the four tasks proposed and released the posts to
the registered teams. This year, we conducted the
evaluation of all participating systems using CodaLab, an open-source platform for data science competitions. The performances of the systems were compared on blind evaluation sets
for each task.
All registered teams were allowed to participate in one or more tasks. We provided the participants with two sets of data for each task: a training set and a test set. Participants had six weeks, from March 5th to April 15th, to train their systems on our training sets, and four days, from April 16th to 20th, to calibrate their systems on our test sets and submit their predictions. In total, 34 teams registered and 19 teams submitted at least one run (each team was allowed to submit at most three runs per task). In detail,
we received 43 runs for task 1, 24 for task 2, 10 for
task 3 and 15 for task 4. We briefly describe each
task and their data in section 2, before discussing
the results obtained in section 3.
2 Task Descriptions
2.1 Tasks
Task 1: Automatic classification of tweets men-
tioning an ADR. This is a binary classification
task for which systems are required to predict if a
tweet mentions an ADR or not. In an end-to-end
social media based pharmacovigilance pipeline,
such a system is needed after data collection to
filter out the large volume of medication-related
chatter that is not a mention of an ADR. This task
is a rerun of the popular classification task orga-
nized in past years.
Task 2: Automatic extraction of ADR mentions
from tweets. This is a named entity recogni-
tion (NER) task that typically follows the ADR
classification step (Task 1) in an ADR extraction
pipeline. Given a set of tweets containing drug
mentions and potentially containing ADRs, the
objective was to determine the span of the ADR
mention, if any. ADRs are rare events, making
ADR classification a challenging task with an F1-
score in the vicinity of 0.5 (based on previous
shared task results (Weissenbacher et al.,2018))
for the ADR class. The dataset for the ADR ex-
traction task contains tweets that are both positive
and negative for the presence of ADRs. This al-
lowed participants to choose to train their systems
on either the set of tweets containing ADRs alone or to include tweets that were negative for the presence of ADRs as well.
Task 3: Automatic extraction of ADR mentions
and normalization of extracted ADRs to Med-
DRA preferred term identifiers. This is an ex-
tension of Task 2 consisting of the combination of
NER and entity normalization tasks: a named en-
tity resolution task. In this task, given the same
set of tweets as in Task 2, the objective was to ex-
tract the span of an ADR mention and to normalize it to MedDRA identifiers2. MedDRA (Medical Dictionary for Regulatory Activities) is the standard nomenclature for monitoring medical products and includes diseases, disorders, signs, symptoms, and adverse events or adverse drug reactions. For the normalization task, MedDRA
version 21.1 was used, containing 79,507 lower
level terms (LLTs) and 23,389 respective preferred
terms (PTs).
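The LLT-to-PT normalization step can be illustrated with a toy lookup. The lexicon entries and identifiers below are invented for illustration; real systems use the licensed MedDRA 21.1 tables, where many LLTs map to a single PT.

```python
# Invented toy fragments of the two MedDRA levels: surface phrases map to
# lower level terms (LLTs), and LLTs map to their preferred term (PT).
LLT_LEXICON = {
    "can't sleep": "LLT_0001",
    "insomnia": "LLT_0002",
}
LLT_TO_PT = {
    "LLT_0001": "PT_INSOMNIA",
    "LLT_0002": "PT_INSOMNIA",  # several LLTs share one PT
}

def normalize(adr_mention):
    """Return the PT identifier for an extracted ADR mention, or None."""
    llt = LLT_LEXICON.get(adr_mention.lower().strip())
    return LLT_TO_PT.get(llt)

pt = normalize("Can't sleep")
```

The many-to-one LLT-to-PT mapping is why the task is annotated at the LLT level but scored at the coarser PT level.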
Task 4: Automatic classification of personal
mentions of health. In this binary classifica-
tion task, the systems were required to distinguish
tweets of personal health status or opinions across
different health domains. The proposed task was
intended to provide a baseline understanding of
the ability to identify personal health mentions in
a generalized context.
2.2 Data
All corpora were composed of public tweets
downloaded using the official streaming API pro-
vided by Twitter and made available to the partici-
pants in accordance with Twitter’s data use policy.
This study received an exempt determination by
the Institutional Review Board of the University
of Pennsylvania.
Task 1. For training, participants were provided
with all the tweets from the #SMM4H 2017 shared
tasks (Sarker et al.,2018), which are publicly
available at: https://data.mendeley.
com/datasets/rxwfb3tysd/2. A total of
25,678 tweets were made available for training.
The test set consisted of 4575 tweets with 626
(13.7%) tweets representing ADRs. The evalua-
tion metric for this task was micro-averaged F1-
score for the ADR class.
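The evaluation metric for this task, F1-score over the positive (ADR) class, can be reproduced in a few lines. This is a minimal stdlib sketch with an invented helper name (`adr_f1`) and toy labels, not the official evaluation script.

```python
def adr_f1(gold, pred, positive=1):
    """F1-score, precision and recall for the positive (ADR) class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return f1, precision, recall

# toy example: 4 tweets, 2 truly reporting an ADR, 1 correctly predicted
gold = [1, 0, 1, 0]
pred = [1, 0, 0, 0]
f1, p, r = adr_f1(gold, pred)
```

Because only the ADR class is scored, a system that labels everything negative gets an F1 of zero despite high accuracy on this imbalanced data.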
Task 2. Participants of Task 2 were provided
with a training set containing 2276 tweets which
mentioned at least one drug name. The dataset
contained 1300 tweets that were positive for the
presence of ADRs and 976 tweets that were neg-
ative. Participants were allowed to include addi-
tional negative instances from Task 1 for training
purposes. Positive tweets were annotated with the
start and end indices of the ADRs and the corre-
sponding span text in the tweets. The evaluation
set contained 1573 tweets; 785 and 788 tweets were positive and negative for the presence of ADRs, respectively. The participants were asked
to submit outputs from their systems that con-
tained the predicted start and end indices of ADRs.
The participants’ submissions were evaluated us-
ing standard strict and overlapping F1-scores for
extracted ADRs. Under strict mode of evaluation,
ADR spans were considered correct only if both
start and end indices matched with the indices in
our gold standard annotations. Under overlapping
mode of evaluation, ADR spans were considered
correct only if spans in predicted annotations over-
lapped with our gold standard annotations.
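The two evaluation modes can be sketched as follows; `match_counts` is a hypothetical illustration of strict vs. overlapping span matching on (start, end) index pairs, not the official scorer.

```python
def match_counts(gold_spans, pred_spans, strict=True):
    """Count TP/FP/FN under strict (exact-index) or overlapping matching.

    Spans are (start, end) character-index pairs; each gold span may be
    matched by at most one predicted span.
    """
    def overlaps(a, b):
        # half-open intervals intersect
        return a[0] < b[1] and b[0] < a[1]

    tp = 0
    unmatched_gold = list(gold_spans)
    for p in pred_spans:
        for g in unmatched_gold:
            if (p == g) if strict else overlaps(p, g):
                tp += 1
                unmatched_gold.remove(g)
                break
    return tp, len(pred_spans) - tp, len(unmatched_gold)  # tp, fp, fn

gold = [(10, 18), (25, 33)]
pred = [(10, 18), (26, 30)]
strict_tp, _, _ = match_counts(gold, pred, strict=True)
relaxed_tp, _, _ = match_counts(gold, pred, strict=False)
```

In this toy case only one prediction matches exactly, but both overlap a gold span, which is why relaxed F1-scores run well above strict ones in Tables 6 and 7.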
Task 3. Participants were provided with the
same training and evaluation datasets as in Task
2. However, the datasets contained additional
columns for the MedDRA annotated LLT and PT
identifiers for each ADR mention. In total, of the
79,507 LLT and 23,389 PT identifiers available in
MedDRA, the training set of 2276 tweets and 1832
annotated ADRs contained 490 unique LLT iden-
tifiers and 327 unique PT identifiers. The evalua-
tion set contained 112 PT identifiers that were not
present as part of the training set. The participants
were asked to submit outputs containing the pre-
dicted start and end indices of ADRs and respec-
tive PT identifiers. Although the training dataset
contained annotations at the LLT level, the perfor-
mance was only evaluated at the higher PT level.
The participants’ submissions were evaluated us-
ing standard strict and overlapping F-scores for ex-
tracted ADRs and respective MedDRA identifiers.
Under strict mode of evaluation, ADR spans were
considered correct only if both start and end in-
dices matched along with matching MedDRA PT
identifiers. Under overlapping mode of evaluation,
ADR spans were considered correct only if spans in predicted ADRs overlapped with gold standard ADR spans, in addition to matching MedDRA PT identifiers.
Task 4. Participants were provided training data from one disease domain, influenza,
across two contexts, being sick and getting vac-
cinated, both annotated for personal mentions: the
user is personally sick or the user has been per-
sonally vaccinated. Test data included new tweets
of personal health mentions about influenza and
tweets from an additional disease domain, Zika
virus, with two different contexts, the user is
changing their travel plans in response to Zika
concerns, or the user is minimizing potential
mosquito exposure due to Zika concerns.
2.3 Annotation and Inter-Annotator Agreement
Two annotators with biomedical education and
both experienced in Social Media research tasks
manually annotated the corpora for tasks 1, 2 and
3. Our annotators independently dual-annotated each test set to ensure the quality of our annotations. Disagreements were resolved in an adjudication phase between our two annotators. On task 1, the classification task, the inter-annotator agreement (IAA) was high, with a Cohen's kappa of 0.82. On task 2, the information extraction task, IAA was good, with an F1-score of 0.73 for strict agreement and 0.85 for overlapping agreement3. On task 3, our annotators double annotated
3Since task 2 is a named-entity recognition task, we followed the recommendations of (Hripcsak and Rothschild, 2005) and used precision and recall metrics to estimate inter-annotator agreement.
535 of the extracted ADR terms and normalized them to MedDRA lower level terms (LLTs). They achieved an agreement accuracy of 82.6%. After converting the LLTs to their corresponding preferred terms (PTs) in MedDRA, which is the coding the task was scored against, accuracy improved to
The annotation process followed for task 4 was
slightly different due to the nature of the task. We
obtained the two datasets of our training set, fo-
cusing on flu vaccination and flu infection, from
(Huang et al.,2017) and (Lamb et al.,2013) re-
spectively. Huang et al. (2017) used Mechanical Turk to crowdsource labels (Fleiss' kappa = 0.793). Lamb et al. (2013)
did not report their labeling procedure or annotator
agreement metrics, but do report annotation guide-
lines5. A few of the tweets released by Lamb et
al. appeared to be mislabeled and were corrected
in accordance with the annotation guidelines de-
fined by the authors. We obtained the test data
for task 4 by compiling three datasets. For the
dataset related to travel changes due to Zika con-
cerns, we selected a subset of data already avail-
able from (Daughton and Paul,2019). Initial la-
beling of these tweets was performed by two an-
notators with a public health background (Cohen’s
kappa = 0.66). We reused the original annotations
for this dataset without changes. For the mosquito
exposure dataset, tweets were labeled by one an-
notator with public health knowledge and expe-
rienced with social media, and then verified by
a second annotator with similar experience. The
additional set of data on personal exposure to influenza was obtained from a separate group, who used an independent labeling procedure.
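Cohen's kappa, reported above for the annotation of tasks 1 and 4, corrects observed agreement for agreement expected by chance. A minimal stdlib sketch with toy labels (not our actual annotations):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' label sequences."""
    n = len(labels_a)
    # observed agreement: fraction of items labeled identically
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    # expected chance agreement from each annotator's marginal distribution
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# toy example: two annotators label 10 tweets as ADR (1) / non-ADR (0)
a = [1, 1, 0, 0, 0, 1, 0, 0, 1, 0]
b = [1, 1, 0, 0, 1, 1, 0, 0, 0, 0]
kappa = cohens_kappa(a, b)
```

As noted in footnote 4, when the label set is huge (e.g., over 70,000 LLTs for task 3) the expected-chance term becomes negligible, so raw accuracy is a reasonable agreement measure there.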
3 Results
The challenge received a solid response with 19
teams from 12 countries (7 from North America,
1 from South America, 6 from Asia and 5 from
Europe) submitting 92 runs in total in one or more
tasks. We present an overview of all architec-
tures competing in the different tasks in Tables 1, 2, 3 and 4. We also list in these tables the external resources competitors integrated for improving the pre-training of their systems or for embedding high-level features to help decision-making.

4We measured agreement using accuracy instead of Cohen's kappa because, with greater than 70,000 LLTs for the annotators to choose from, agreement due to chance is expected to be small.
5We used the awareness vs. infection labels as defined in (Lamb et al., 2013).
The overview of all architectures is interest-
ing in two ways. First, this challenge confirms
the tendency of the community to abandon tradi-
tional Machine Learning systems based on hand-
crafted features for deep learning architectures ca-
pable of discovering the features relevant for the
task at hand from pre-trained embeddings. Dur-
ing the challenge, when participants implemented
traditional systems, such as SVM or CRF, they
used such systems as baselines and, observing significant performance differences with deep-learning-based systems on their validation sets, most did not submit their predictions as official runs. Second, while last year convolutional or recurrent neural networks "fed" with pre-trained word embeddings learned on local windows of words (e.g., word2vec, GloVe) were the most popular architectures, this year we saw a clear dominance of neural architectures using word embeddings pre-trained with Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018), or fine-tuning these word embeddings on our training corpora. BERT computes word embeddings based on the full context of a sentence rather than only on local windows.
A notable result from tasks 1-3 is that, despite an improvement in performance for the detection of ADRs, their resolution remains challenging and will require further research. The participants largely adopted contextual word embeddings during this challenge, a choice rewarded by new record performances on task 1, the only task rerun from previous years. The performance of the best system increased from an F1-score of .522 (.442 P, .636 R) (Weissenbacher et al., 2018) to .646 (.608 P, .689 R). However, with a best F1-score of .432 (.362 P, .535 R) under relaxed matching (Table 7), the performances obtained in task 3 for ADR resolution are still low, and human inspection is still required to make use of the data extracted automatically. As shown by the best accuracy of .887 obtained on ADR normalization in the task 3 run during #SMM4H in 2017 (Sarker et al., 2018)6, once ADRs are extracted, their normalization can be performed with good reliability. However, errors are made during all steps of the resolution (detection, extraction, normalization), and their overall accumulation renders current automatic systems ineffective. Note that the bulk of the errors are made during the extraction of the ADRs, as shown by the low strict F1-score of the best system in task 2: .464 (.389 P, .576 R).

6Organizers of the task 3 run during #SMM4H 2017 provided participants with manually curated expressions referring to ADRs, and participants had to map them to their corresponding preferred terms in MedDRA.
For task 4, we were especially interested in the
generalizability of first person health classifiers to
a domain separate from that of the training data.
We find that, on average, teams do reasonably
well across the full test dataset (average F1-score:
0.70, range: 0.41-0.87). Unsurprisingly, classi-
fiers tended to do better on a test set in the same
domain as the training dataset (context 1, average
F1-score: 0.82) and more modestly on the Zika
travel and mosquito datasets (average F1-score:
0.40 and 0.52, respectively). Interestingly, in all
contexts, precision was higher than recall. We note
that both the training and the testing data were lim-
ited in quantity, and that classifiers would likely
improve with more data. However, in general, it is
encouraging that classifiers trained in one health
domain can be applied to separate health domains.
4 Conclusion
In this paper we presented an overview of the results of #SMM4H 2019, which focused on a) the resolution of adverse drug reactions (ADRs) mentioned in Twitter and b) the distinction between tweets reporting personal health status and tweets expressing opinions across different health domains. With a total of 92 runs submitted by 19 teams, the challenge was well attended. The participants, in large part, opted for neural architectures and integrated pre-trained, context-sensitive word embeddings based on the recent Bidirectional Encoder Representations from Transformers. Such architectures were the most efficient on our four tasks. Results on tasks 1-3 show that, despite a continuous improvement in performance on the detection of tweets mentioning ADRs over the past years, their end-to-end resolution still remains a major challenge for the community and an opportunity for further research. Results on task 4 were more encouraging, with systems able to generalize their predictions to domains not present in their training data.
References

Ashlynn R. Daughton and Michael J. Paul. 2019. Identifying protective health behaviors on Twitter: Observational study of travel advisories and Zika virus. Journal of Medical Internet Research. In press.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2018. BERT: pre-training of
deep bidirectional transformers for language under-
standing. CoRR, abs/1810.04805.
Rizzo Giuseppe, Pereira Bianca, Varga Andrea,
van Erp Marieke, and Elizabeth Cano Basave Am-
paro. 2017. Lessons learnt from the named entity
recognition and linking (neel) challenge series. Se-
mantic Web Journal, 8(5):667–700.
George Hripcsak and Adam S Rothschild. 2005.
Agreement, the f-measure, and reliability in infor-
mation retrieval. Journal of the American Medical
Informatics Association, 12(3):296–298.
Xiaolei Huang, Michael C. Smith, Michael J. Paul,
Dmytro Ryzhkov, Sandra C. Quinn, David A. Bro-
niatowski, and Mark Dredze. 2017. Examining
patterns of influenza vaccination in social media.
In AAAI Joint Workshop on Health Intelligence
Alex Lamb, Michael J. Paul, and Mark Dredze. 2013.
Separating fact from fear: Tracking flu infections on
twitter. In Proceedings of Conference of the North
American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies
(NAACL-HLT 2013).
Xiaohua Liu, Shaodian Zhang, Furu Wei, and Ming
Zhou. 2011. Recognizing named entities in tweets.
In Proceedings of the 49th Annual Meeting of the
Association for Computational Linguistics: Human
Language Technologies - Volume 1, HLT ’11, pages
359–367. Association for Computational Linguistics.
Zulfat Miftahutdinov, Elena Tutubalina, and Alexander
Tropsha. 2017. Identifying disease-related expres-
sions in reviews using conditional random fields. In
Proceedings of International Conference on Com-
putational Linguistics and Intellectual Technologies
Dialog, volume 1, pages 155–167.
Azadeh Nikfarjam, Abeed Sarker, Karen O’connor,
Rachel Ginn, and Graciela Gonzalez-Hernandez.
2015. Pharmacovigilance from social media: min-
ing adverse drug reaction mention using sequence
labeling with word embedding cluster features.
Journal of the American Medical Informatics Asso-
ciation, 22(3):671–681.
Karisani Payam and Agichtein Eugene. 2018. Did you
really just have a heart attack? towards robust de-
tection of personal health mentions in social media.
In Proceedings of the 2018 World Wide Web Confer-
ence on World Wide Web, pages 137–146.
Abeed Sarker, Maksim Belousov, Jasper Friedrichs,
Kai Hakala, Svetlana Kiritchenko, Farrokh
Mehryary, Sifei Han, Tung Tran, Anthony Rios,
Ramakanth Kavuluru, Berry de Bruijn, Filip Gin-
ter, Debanjan Mahata, Saif Mohammad, Goran
Nenadic, and Graciela Gonzalez-Hernandez. 2018.
Data and systems for medication-related text clas-
sification and concept normalization from twitter:
insights from the social media mining for health
(smm4h)-2017 shared task. J Am Med Inform
Assoc, 25(10):1274–1283.
Abeed Sarker, Rachel Ginn, Azadeh Nikfarjam, Karen
OConnor, Karen Smith, Swetha Jayaraman, Tejaswi
Upadhaya, and Graciela Gonzalez. 2015. Utilizing
social media data for pharmacovigilance: A review.
Journal of Biomedical Informatics, 54:202 – 212.
Abeed Sarker and Graciela Gonzalez. 2017. A cor-
pus for mining drug-related knowledge from twitter
chatter: Language models and their utilities. Data
in Brief, 10:122–131.
Abeed Sarker and Graciela Gonzalez-Hernandez.
2015. Portable automatic text classification for ad-
verse drug reaction detection via multi-corpus train-
ing. Journal of biomedical informatics, 53:196–207.
Davy Weissenbacher, Abeed Sarker, Michael J Paul,
and Graciela Gonzalez-Hernandez. 2018. Overview
of the third social media mining for health (smm4h)
shared tasks at emnlp 2018. In in Proceedings of
the 2018 EMNLP Workshop SMM4H: The 3rd So-
cial Media Mining for Health Applications Work-
shop and Shared Task, pages 13–16.
FF: Feedforward
CNN: Convolutional Neural Network
BiLSTM: Bidirectional Long Short-Term Memory
SVM: Support Vector Machine
CRF: Conditional Random Field
POS: Part-Of-Speech
RNN: Recurrent Neural Network
Rank Team System details
1 ICRC Architecture: BERT + FF + Softmax
Details: lexicon features (pairs of drug-ADR)
Resources: SIDER
2 UZH Architecture: ensemble of BERT & C CNN + W BiLSTM (+ CRF)
Details: multi-task-learning
Resources: CADEC corpus
3 MIDAS@IIITD Architecture: 1. BERT 2. ULMFit 3. W BiLSTM
Details: BERT + GloVe + Flair
Resources: additional corpus (Sarker and Gonzalez-Hernandez,2015)
4 KFU NLP Architecture: BERT + logistic regression
Details: BioBERT
5 CLaC Architecture: Bert + W BiLSTM + attention + softmax + SVM
Details: BERT, Word2Vec, Glove, embedded features
Resources: POS, modality, ADR list
6 THU NGN Architecture: C CNN + W BiLSTM + features + Multi-Head attention + Softmax
Details: Word2Vec, POS, ELMo
Resources: sentiment Lexicon, SIDER, CADEC
7 BigODM Architecture: ensemble of SVMs
Resources: Word Embeddings
8 UMich-NLP4Health Architecture: 1. W BiLSTM + attention + softmax; 2. W CNN + BiLSTM + softmax; 3. SVM
Details: GloVe, POS, case
Resources: Metamap, cTAKES, CIDER
9 TMRLeiden Architecture: ULMfit
Details: Flair + Glove + Bert; transfer learning
Resources: external corpus (Sarker and Gonzalez,2017)
10 CIC-NLP Architecture: C BiLSTM + W FF + LSTM + FF
Details: GloVe + BERT
12 SINAI Architecture: 1. SVM 2. CNN + Softmax
Details: GloVe
Resources: MetaMap
13 nlp-uned Architecture: W BiLSTM + Sigmoid
Details: GloVe
14 ASU BioNLP Architecture: 1. Lexicon; 2. BioBert
Details: Lexicon learned with Logistic regression model
15 Klick Health Architecture: ELMo + FF + Softmax
Details: Lexicons
Resources: MedDRA, Consumer Health Vocabulary, (Nikfarjam et al.,2015)
16 GMU Architecture: encoder-decoder (W biLSTM + attention)
Details: Glove
Resources: #SMM4H 2017-2018, UMLS
Table 1: Task 1. System and resource descriptions for ADR mentions detection in tweets7.
7We use C BiLSTM and C CNN to denote bidirectional LSTMs or CNNs encoding sequences of characters, and W BiLSTM and W FF to denote bidirectional LSTM or feedforward encoders of word embeddings.
Rank Team System details
1 KFU NLP Architecture: ensemble of BioBERT + CRF
Details: BioBERT
Resources: external dictionaries (Miftahutdinov et al.,2017);
CADEC, PsyTAR, TwADR-L corpora; #SMM4H 2017
2 THU NGN Architecture: C CNN + W BiLSTM + features + Multi-Head self-attention + CRF
Details: Word2Vec, POS, ELMo
Resources: sentiment Lexicon, SIDER, CADEC
3 MIDAS@IIITD Architecture: W BiLSTM + CRF
Details: BERT + GloVe + Flair
4 TMRLeiden Architecture: BERT + Flair
Details: Flair + Glove + Bert; transfer learning
5 ICRC Architecture: BERT + CRF
Resources: SIDER
6 GMU Architecture: C biLSTM + W biLSTM + CRF
Details: Glove
Resources: #SMM4H 2017-2018, UMLS
7 HealthNLP Architecture: W BiLSTM + CRF
Details: Word2vec, BERT, ELMo, POS
Resources: external dictionaries
8 SINAI Architecture: CRF
Details: GloVe
Resources: MetaMap
9Architecture: BiLSTM + CRF
Details: Word2Vec
Resources: MIMIC-III
10 Klick Health Architecture: Similarity
Details: Lexicons
Resources: MedDRA, Consumer Health Vocabulary, (Nikfarjam et al.,2015)
Table 2: Task 2. System and resource descriptions for ADR mentions extraction in tweets
Rank Team System details
1 KFU NLP Architecture: BioBERT + softmax
2 myTomorrows-TUDelft Architecture: ensemble RNN & Few-Shot Learning
Details: Word2Vec
Resources: MedDRA, Consumer Health Vocabulary, UMLS
3 TMRLeiden Architecture: BERT + Flair + RNN
Details: Flair + Glove + Bert; transfer learning
Resources: Consumer Health Vocabulary
4 GMU Architecture: encoder-decoder (W biLSTM + attention)
Details: Glove
Resources: #SMM4H 2017-2018, UMLS
Table 3: Task 3. System and resource descriptions for ADR mentions resolution in tweets.
Rank Team System details
1 UZH Architecture: ensemble BERT
Resources: CADEC corpus
2 ASU1 Architecture: BioBERT + FF
Resources: Word2vec, manually compiled list, ConceptNet
4 MIDAS@IIITD Architecture: BERT; W BiLSTM
Details: BERT + GloVe + Flair
5 TMRLeiden Architecture: ULMfit
Details: Flair + Glove + Bert; transfer learning
Resources: external corpus (Payam and Eugene,2018)
6 CLaC Architecture: Bert + W BiLSTM + attention + softmax + SVM
Details: BERT, Word2Vec, Glove, embedded features
Resources: POS, modality, ADR list
Table 4: Task 4. System and resource descriptions for detection of personal mentions of health in tweets.
Team F1 P R
ICRC 0.6457 0.6079 0.6885
UZH 0.6048 0.6478 0.5671
MIDAS@IIITD 0.5988 0.6647 0.5447
KFU NLP 0.5738 0.6914 0.4904
CLaC 0.5738 0.5427 0.6086
THU NGN 0.5718 0.4667 0.738
BigODM 0.5514 0.4762 0.655
UMich-NLP4Health 0.5369 0.5654 0.5112
TMRLeiden 0.5327 0.6419 0.4553
CIC-NLP 0.5209 0.6203 0.4489
UChicagoCompLx 0.4993 0.4574 0.5495
SINAI 0.4969 0.5517 0.4521
nlp-uned 0.4723 0.5244 0.4297
ASU BioNLP 0.4317 0.3223 0.6534
Klick Health 0.4099 0.5824 0.3163
GMU 0.3587 0.4526 0.2971
Table 5: System performances for each team for task 1 of the shared task. F1-score, Precision and Recall over the
ADR class are shown. Top scores in each column are shown in bold.
Relaxed Strict
Team F1 P R F1 P R
KFU NLP 0.658 0.554 0.81 0.464 0.389 0.576
THU NGN 0.653 0.614 0.697 0.356 0.328 0.388
MIDAS@IIITD 0.641 0.537 0.793 0.328 0.274 0.409
TMRLeiden 0.625 0.555 0.715 0.431 0.381 0.495
ICRC 0.614 0.538 0.716 0.407 0.357 0.474
GMU 0.597 0.596 0.599 0.407 0.406 0.407
HealthNLP 0.574 0.632 0.527 0.336 0.37 0.307
SINAI 0.542 0.612 0.486 0.36 0.408 0.322
ASU BioNLP 0.535 0.415 0.753 0.269 0.206 0.39
Klick Health 0.396 0.416 0.378 0.194 0.206 0.184
Table 6: System performances for each team for task 2 of the shared task. (Strict/Relaxed) F1-score, Precision
and Recall over the ADR mentions are shown. Top scores in each column are shown in bold.
Relaxed Strict
Team F1 P R F1 P R
KFU NLP 0.432 0.362 0.535 0.344 0.288 0.427
myTomorrows-TUDelft 0.345 0.336 0.355 0.244 0.237 0.252
TMRLeiden 0.312 0.37 0.27 0.25 0.296 0.216
GMU 0.208 0.221 0.196 0.109 0.116 0.102
Table 7: System performances for each team for task 3 of the shared task. (Strict/Relaxed) F1-score, Precision
and Recall over the ADR resolution are shown. Top scores in each column are shown in bold.
Team                 Acc     F1      P       R

Health concerns in all contexts
UZH                  0.8772  0.8727  0.8392  0.9091
ASU1                 0.8456  0.8036  0.9783  0.6818
UChicagoCompLx       0.8316  0.7913  0.9286  0.6894
MIDAS@IIITD          0.8211  0.7830  0.8932  0.6970
TMRLeiden            0.7930  0.7256  0.9398  0.5909
CLaC                 0.6386  0.4607  0.7458  0.3333

Health concerns in Context 1: Flu virus (infection/vaccination)
UZH                  0.9438  0.9474  0.9101  0.9878
UChicagoCompLx       0.9250  0.9231  0.9730  0.8780
ASU1                 0.9250  0.9221  0.9861  0.8659
MIDAS@IIITD          0.8875  0.8800  0.9706  0.8049
TMRLeiden            0.8625  0.8493  0.9688  0.7561
CLaC                 0.6625  0.5645  0.8333  0.4268

Health concerns in Context 2: Zika virus, travel plan changes
UZH                  0.7536  0.7385  0.7059  0.7742
MIDAS@IIITD          0.6667  0.5818  0.6667  0.5161
ASU1                 0.6957  0.5116  0.9167  0.3548
UChicagoCompLx       0.6377  0.4681  0.6875  0.3548
TMRLeiden            0.6377  0.4186  0.7500  0.2903
CLaC                 0.5362  0.2000  0.4444  0.1290

Health concerns in Context 3: Zika virus, reducing mosquito exposure
UZH                  0.8393  0.7692  0.7500  0.7895
MIDAS@IIITD          0.8214  0.6667  0.9091  0.5263
ASU1                 0.8036  0.5926  1.0000  0.4211
UChicagoCompLx       0.8036  0.5926  1.0000  0.4211
TMRLeiden            0.7857  0.5385  1.0000  0.3684
CLaC                 0.6964  0.3704  0.6250  0.2632

Table 8: System performance of each team for Task 4 of the shared task. Accuracy, F1-score, precision, and recall over the personal mentions are shown. Top scores in each column are shown in bold.
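The four metrics reported in Table 8 follow from a straightforward confusion-matrix computation. A minimal sketch, assuming binary gold/predicted labels with 1 marking a personal health mention; this is not the shared task's official scorer:

```python
# Hedged sketch of the metrics in Table 8, assuming binary labels
# (1 = personal health mention, 0 = not); not the official scorer.

def classification_metrics(gold, pred):
    """Return (accuracy, f1, precision, recall) for binary labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    accuracy = (tp + tn) / len(gold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, f1, precision, recall
```

Because F1 is computed over the positive class only, a system can pair high accuracy with a much lower F1 on a context where positives are rare, which matches the gap visible in the Context 3 rows.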