Proceedings of the 4th Social Media Mining for Health Applications (#SMM4H) Workshop & Shared Task, pages 21–30
Florence, Italy, August 2, 2019.
© 2019 Association for Computational Linguistics
Overview of the Fourth Social Media Mining for Health (#SMM4H)
Shared Task at ACL 2019
Davy Weissenbacher, Abeed Sarker, Arjun Magge, Ashlynn Daughton,
Karen O’Connor, Michael Paul, Graciela Gonzalez-Hernandez
DBEI, Perelman School of Medicine, University of Pennsylvania, PA, USA
Biodesign Center for Environmental Health Engineering, Biodesign Institute,
Arizona State University, AZ, USA
Information Science, University of Colorado Boulder, CO, USA
{dweissen,abeed,gragon}@pennmedicine.upenn.edu,amaggera@asu.edu
{mpaul,ashlynn.daughton}@colorado.edu
Abstract
The number of users of social media continues
to grow, with nearly half of adults worldwide
and two-thirds of all American adults using
social networking on a regular basis¹.
Advances in automated data processing and NLP
present the possibility of utilizing this
massive data source for biomedical and public
health applications, if researchers address
the methodological challenges unique to this
medium. We present the Social Media Mining
for Health Shared Tasks co-located with ACL
2019 in Florence, which address these
challenges for health monitoring and
surveillance, utilizing state-of-the-art
techniques for processing noisy, real-world,
and substantially creative language
expressions from social media users. For the
fourth execution of this challenge, we
proposed four different tasks. Task 1 asked
participants to distinguish tweets reporting
an adverse drug reaction (ADR) from those
that do not. Task 2, a follow-up to Task 1,
asked participants to identify the span of
text in tweets reporting ADRs. Task 3 is an
end-to-end task where the goal was to first
detect tweets mentioning an ADR and then map
the extracted colloquial mentions of ADRs in
the tweets to their corresponding standard
concept IDs in the MedDRA vocabulary. Finally,
Task 4 asked participants to classify whether
a tweet contains a personal mention of one’s
health, a more general discussion of the
health issue, or is an unrelated mention. A
total of 34 teams from around the world
registered and 19 teams from 12 countries
submitted a system run. We summarize here the
corpora for this challenge, which are freely
available at
https://competitions.codalab.org/competitions/22521,
and present an overview of the methods and the
results of the competing systems.
¹ Pew Research Center. Social Media Fact Sheet. 2017. [Online].
Available: http://www.pewinternet.org/fact-sheet/social-media/
1 Introduction
The intent of the #SMM4H shared tasks se-
ries is to challenge the community with Natu-
ral Language Processing tasks for mining rele-
vant data for health monitoring and surveillance
in social media. Such challenges require pro-
cessing imbalanced, noisy, real-world, and sub-
stantially creative language expressions from so-
cial media. The competing systems should be
able to deal with many linguistic variations and
semantic complexities in the various ways people
express medication-related concepts and outcomes.
It has been shown in past research (Liu et al., 2011;
Giuseppe et al., 2017) that automated
systems frequently under-perform when exposed
to social media text because of the presence of
novel/creative phrases, misspellings and frequent
use of idiomatic, ambiguous and sarcastic expres-
sions. The tasks act as a discovery and verification
process of what approaches work best for social
media data.
As in previous years, our tasks focused on min-
ing health information from Twitter. This year
we challenged the community with two different
problems. The first problem focuses on perform-
ing pharmacovigilance from social media data. It
is now well understood that social media data may
contain reports of adverse drug reactions (ADRs)
and these reports may complement traditional ad-
verse event reporting systems, such as the FDA
adverse event reporting system (FAERS). However,
automatically curating reports of adverse
reactions from Twitter requires the application of
a series of NLP methods in an end-to-end pipeline
(Sarker et al., 2015). The first three tasks of this
year’s challenge represent three key NLP problems
in a social media based pharmacovigilance
pipeline: (i) automatic classification of ADRs,
(ii) extraction of spans of ADRs and (iii)
normalization of the extracted ADRs to standardized IDs.
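The three stages of this pipeline can be sketched as follows. This is only an illustrative skeleton: the lexicon lookup stands in for the trained classifiers and taggers used by participants, and `TOY_LEXICON`, `AdrMention`, `classify`, `extract`, `normalize` and `run_pipeline` are hypothetical names, with invented concept identifiers rather than real MedDRA codes.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Toy lexicon: colloquial ADR phrase -> concept identifier.
# The identifiers are invented placeholders, not real MedDRA codes.
TOY_LEXICON: Dict[str, str] = {
    "can't sleep": "PT_INSOMNIA",
    "headache": "PT_HEADACHE",
}


@dataclass
class AdrMention:
    start: int  # character offset of the first character
    end: int    # character offset one past the last character
    text: str
    concept_id: Optional[str] = None


def classify(tweet: str) -> bool:
    """Stage (i): decide whether the tweet reports an ADR (keyword stand-in)."""
    lowered = tweet.lower()
    return any(phrase in lowered for phrase in TOY_LEXICON)


def extract(tweet: str) -> List[AdrMention]:
    """Stage (ii): locate the spans of ADR mentions (first occurrence only)."""
    lowered = tweet.lower()
    mentions = []
    for phrase in TOY_LEXICON:
        pos = lowered.find(phrase)
        if pos != -1:
            end = pos + len(phrase)
            mentions.append(AdrMention(pos, end, tweet[pos:end]))
    return sorted(mentions, key=lambda m: m.start)


def normalize(mention: AdrMention) -> AdrMention:
    """Stage (iii): map the extracted span to a standardized identifier."""
    mention.concept_id = TOY_LEXICON.get(mention.text.lower())
    return mention


def run_pipeline(tweet: str) -> List[AdrMention]:
    """Run the three stages in sequence; an error at any stage propagates."""
    if not classify(tweet):
        return []
    return [normalize(m) for m in extract(tweet)]


if __name__ == "__main__":
    for m in run_pipeline("This med gives me a headache and I can't sleep"):
        print(m.start, m.end, m.text, m.concept_id)
```

Because each stage consumes the output of the previous one, a tweet missed in stage (i) or a span missed in stage (ii) can never be recovered by stage (iii).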
The second problem explores the generalizabil-
ity of predictive models. In health research us-
ing social media, it is often necessary for re-
searchers to build individual classifiers to iden-
tify health mentions of a particular disease in a
particular context. Classification models that can
generalize to different health contexts would be
greatly beneficial to researchers in these fields
(e.g., Payam and Eugene, 2018), as this would
allow researchers to more easily apply existing
tools and resources to new problems. Motivated
by these ideas, Task 4 tested tweet classification
methods across diverse health contexts: the test
data included a very different health context
from the training data. This setting measures
the ability of tweet classifiers to generalize across
health contexts.
The fourth iteration of our series follows the
same organization as previous iterations. We col-
lected posts from Twitter, annotated the data for
the four tasks proposed and released the posts to
the registered teams. This year, we conducted the
evaluation of all participating systems using
CodaLab, an open source platform facilitating data
science competitions. The performances of the
systems were compared on blind evaluation sets
for each task.
All registered teams were allowed to participate
in one or multiple tasks. We provided the
participants with two sets of data for each task, a
training set and a test set. Participants had a
period of six weeks, from March 5th to April 15th,
for training their systems on our training sets,
and 4 days, from the 16th to the 20th of April, for
calibrating their systems on our test sets and
submitting their predictions. In total, 34 teams
registered and 19 teams submitted at least one run
(each team was allowed to submit at most three runs
per task). In detail, we received 43 runs for task 1,
24 for task 2, 10 for task 3 and 15 for task 4. We
briefly describe each task and its data in section 2,
before discussing the results obtained in section 3.
2 Task Descriptions
2.1 Tasks
Task 1: Automatic classification of tweets men-
tioning an ADR. This is a binary classification
task for which systems are required to predict if a
tweet mentions an ADR or not. In an end-to-end
social media based pharmacovigilance pipeline,
such a system is needed after data collection to
filter out the large volume of medication-related
chatter that is not a mention of an ADR. This task
is a rerun of the popular classification task orga-
nized in past years.
Task 2: Automatic extraction of ADR mentions
from tweets. This is a named entity recogni-
tion (NER) task that typically follows the ADR
classification step (Task 1) in an ADR extraction
pipeline. Given a set of tweets containing drug
mentions and potentially containing ADRs, the
objective was to determine the span of the ADR
mention, if any. ADRs are rare events, making
ADR classification a challenging task with an
F1-score in the vicinity of 0.5 for the ADR class,
based on previous shared task results
(Weissenbacher et al., 2018). The dataset for the
ADR extraction task contains tweets that are both positive
and negative for the presence of ADRs. This al-
lowed participants to choose to train their systems
on either the set of tweets containing ADRs or in-
clude tweets that were negative for the presence of
ADRs.
Task 3: Automatic extraction of ADR mentions
and normalization of extracted ADRs to MedDRA
preferred term identifiers. This is an extension of
Task 2 that combines NER and entity normalization
into a named entity resolution task. Given the same
set of tweets as in Task 2, the objective was to
extract the span of an ADR mention and to normalize
it to MedDRA identifiers². MedDRA (Medical
Dictionary for Regulatory Activities) is the
standard nomenclature for monitoring medical
products, and includes diseases, disorders, signs,
symptoms, adverse events or adverse drug reactions.
For the normalization task, MedDRA version 21.1 was
used, containing 79,507 lower level terms (LLTs) and
23,389 respective preferred terms (PTs).
Task 4: Automatic classification of personal
mentions of health. In this binary classifica-
tion task, the systems were required to distinguish
tweets of personal health status or opinions across
different health domains. The proposed task was
intended to provide a baseline understanding of
the ability to identify personal health mentions in
a generalized context.
² https://www.meddra.org/ Accessed: 05/13/2019.
2.2 Data
All corpora were composed of public tweets
downloaded using the official streaming API pro-
vided by Twitter and made available to the partici-
pants in accordance with Twitter’s data use policy.
This study received an exempt determination by
the Institutional Review Board of the University
of Pennsylvania.
Task 1. For training, participants were provided
with all the tweets from the #SMM4H 2017 shared
tasks (Sarker et al., 2018), which are publicly
available at
https://data.mendeley.com/datasets/rxwfb3tysd/2.
A total of 25,678 tweets were made available for training.
The test set consisted of 4575 tweets with 626
(13.7%) tweets representing ADRs. The evalua-
tion metric for this task was micro-averaged F1-
score for the ADR class.
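As an illustration of this metric, the following minimal sketch computes precision, recall and F1-score restricted to the positive (ADR) class from two lists of binary labels. The function name `positive_class_prf` and the 0/1 label encoding are assumptions made for the example; it is not the official evaluation script.

```python
from typing import Sequence, Tuple


def positive_class_prf(
    gold: Sequence[int], pred: Sequence[int], positive: int = 1
) -> Tuple[float, float, float]:
    """Precision, recall and F1 computed over the positive class only."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


if __name__ == "__main__":
    gold = [1, 1, 1, 0, 0, 0, 0, 0]
    pred = [1, 1, 0, 1, 0, 0, 0, 0]
    print(positive_class_prf(gold, pred))
    # A system that never predicts the ADR class scores 0 on this metric,
    # however many negative tweets it labels correctly.
    print(positive_class_prf(gold, [0] * len(gold)))
```

Scoring only the ADR class matters because of the class imbalance: on a test set with 626 ADR tweets out of 4575, a system labelling every tweet as negative would reach about 86% accuracy while obtaining an F1-score of 0 for the ADR class.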
Task 2. Participants of Task 2 were provided
with a training set containing 2276 tweets which
mentioned at least one drug name. The dataset
contained 1300 tweets that were positive for the
presence of ADRs and 976 tweets that were neg-
ative. Participants were allowed to include addi-
tional negative instances from Task 1 for training
purposes. Positive tweets were annotated with the
start and end indices of the ADRs and the
corresponding span text in the tweets. The evaluation
set contained 1573 tweets, of which 785 were positive
and 788 were negative for the presence of ADRs.
The participants were asked to submit outputs from
their systems that contained the predicted start and
end indices of ADRs. The participants’ submissions
were evaluated using standard strict and overlapping
F1-scores for extracted ADRs. Under the strict mode
of evaluation, ADR spans were considered correct
only if both start and end indices matched the
indices in our gold standard annotations. Under the
overlapping mode of evaluation, ADR spans were
considered correct if the predicted spans overlapped
with our gold standard annotations.
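The two evaluation modes can be sketched as below. This is a hedged illustration rather than the official scorer: the `(tweet_id, start, end)` span representation, the exclusive end index, and the way precision and recall are counted under overlap (predicted spans touching any gold span, and gold spans touched by any predicted span) are assumptions of this example.

```python
from typing import Iterable, Tuple

# (tweet_id, start, end); end is assumed to be exclusive in this sketch.
Span = Tuple[str, int, int]


def overlaps(a: Span, b: Span) -> bool:
    """True if both spans are in the same tweet and share a character."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def span_prf(
    gold: Iterable[Span], pred: Iterable[Span], mode: str = "strict"
) -> Tuple[float, float, float]:
    """Precision, recall and F1 over spans in 'strict' or 'overlap' mode."""
    gold_set, pred_set = set(gold), set(pred)
    if mode == "strict":
        # Exact match of tweet, start index and end index.
        matched_pred = matched_gold = len(gold_set & pred_set)
    elif mode == "overlap":
        matched_pred = sum(
            1 for p in pred_set if any(overlaps(p, g) for g in gold_set)
        )
        matched_gold = sum(
            1 for g in gold_set if any(overlaps(p, g) for p in pred_set)
        )
    else:
        raise ValueError("mode must be 'strict' or 'overlap'")
    precision = matched_pred / len(pred_set) if pred_set else 0.0
    recall = matched_gold / len(gold_set) if gold_set else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


if __name__ == "__main__":
    gold = [("t1", 10, 18), ("t2", 0, 9)]
    pred = [("t1", 10, 18), ("t2", 3, 9), ("t3", 5, 8)]
    print("strict :", span_prf(gold, pred, "strict"))
    print("overlap:", span_prf(gold, pred, "overlap"))
```

In this toy example the prediction on tweet `t2` has the wrong start index, so it counts as an error under strict matching but as a hit under overlapping matching.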
Task 3. Participants were provided with the
same training and evaluation datasets as in Task
2. However, the datasets contained additional
columns for the MedDRA annotated LLT and PT
identifiers for each ADR mention. In total, of the
79,507 LLT and 23,389 PT identifiers available in
MedDRA, the training set of 2276 tweets and 1832
annotated ADRs contained 490 unique LLT iden-
tifiers and 327 unique PT identifiers. The evalua-
tion set contained 112 PT identifiers that were not
present as part of the training set. The participants
were asked to submit outputs containing the pre-
dicted start and end indices of ADRs and respec-
tive PT identifiers. Although the training dataset
contained annotations at the LLT level, the perfor-
mance was only evaluated at the higher PT level.
The participants’ submissions were evaluated using
standard strict and overlapping F1-scores for
extracted ADRs and respective MedDRA identifiers.
Under strict mode of evaluation, ADR spans were
considered correct only if both start and end in-
dices matched along with matching MedDRA PT
identifiers. Under overlapping mode of evaluation,
ADR spans were considered correct only if spans
in predicted ADRs overlapped with gold standard
ADR spans in addition to matching MedDRA PT
identifiers.
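A minimal sketch of this scoring scheme is given below, assuming annotations are stored as `(tweet_id, start, end, pt_id)` tuples with an exclusive end index. The `LLT_TO_PT` table contains invented placeholder codes, not real MedDRA identifiers, and the names `Annotation`, `to_pt`, `is_match` and `resolution_f1` are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Annotation(NamedTuple):
    tweet_id: str
    start: int
    end: int      # exclusive end index (assumption of this sketch)
    pt_id: str    # preferred term identifier


# Placeholder LLT -> PT table; several LLTs may share one PT.
LLT_TO_PT: Dict[str, str] = {
    "LLT_A1": "PT_A",
    "LLT_A2": "PT_A",
    "LLT_B1": "PT_B",
}


def to_pt(llt_id: str) -> str:
    """Convert an LLT-level annotation to the PT level used for scoring."""
    return LLT_TO_PT[llt_id]


def is_match(gold: Annotation, pred: Annotation, strict: bool) -> bool:
    """A prediction counts only if the PT matches and the span agrees."""
    if gold.tweet_id != pred.tweet_id or gold.pt_id != pred.pt_id:
        return False
    if strict:
        return (gold.start, gold.end) == (pred.start, pred.end)
    return gold.start < pred.end and pred.start < gold.end


def resolution_f1(
    gold: List[Annotation], pred: List[Annotation], strict: bool
) -> float:
    matched_pred = sum(
        1 for p in pred if any(is_match(g, p, strict) for g in gold)
    )
    matched_gold = sum(
        1 for g in gold if any(is_match(g, p, strict) for p in pred)
    )
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = [
        Annotation("t1", 5, 13, to_pt("LLT_A1")),
        Annotation("t2", 0, 8, to_pt("LLT_B1")),
        Annotation("t3", 4, 10, to_pt("LLT_B1")),
    ]
    pred = [
        Annotation("t1", 5, 13, to_pt("LLT_A2")),  # other LLT, same PT
        Annotation("t2", 2, 8, "PT_A"),            # overlapping span, wrong PT
        Annotation("t3", 6, 10, "PT_B"),           # overlapping span, right PT
    ]
    print("strict :", resolution_f1(gold, pred, strict=True))
    print("relaxed:", resolution_f1(gold, pred, strict=False))
```

The first prediction illustrates why scoring at the PT level is more forgiving than scoring at the LLT level: two different LLTs that map to the same PT are treated as the same concept.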
Task 4. Participants were provided with training
data from one disease domain, influenza, across two
contexts, being sick and getting vaccinated, both
annotated for personal mentions: the user is
personally sick or the user has been personally
vaccinated. Test data included new tweets of
personal health mentions about influenza and tweets
from an additional disease domain, Zika virus, with
two different contexts: the user is changing their
travel plans in response to Zika concerns, or the
user is minimizing potential mosquito exposure due
to Zika concerns.
2.3 Annotation and Inter-Annotator
Agreements
Two annotators with biomedical education, both
experienced in social media research tasks,
manually annotated the corpora for tasks 1, 2 and 3.
Our annotators independently dual-annotated each
test set to ensure the quality of our annotations.
Disagreements were resolved in an adjudication
phase between our two annotators. On task 1, the
classification task, the inter-annotator agreement
(IAA) was high, with a Cohen’s kappa of 0.82. On
task 2, the information extraction task, IAAs were
good, with an F1-score of 0.73 for strict agreement
and 0.85 for overlapping agreement³. On task 3, our
annotators double annotated 535 of the extracted
ADR terms and normalized them to MedDRA lower level
terms (LLTs). They achieved an agreement accuracy of
82.6%. After converting the LLTs to their
corresponding preferred terms (PTs) in MedDRA, which
is the coding the task was scored against, accuracy
improved to 87.7%⁴.
³ Since task 2 is a named-entity recognition task, we
followed the recommendations of Hripcsak and
Rothschild (2005) and used precision and recall
metrics to estimate the inter-annotator rate.
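For reference, Cohen's kappa corrects the observed agreement between two annotators for the agreement expected by chance given each annotator's label distribution. The sketch below is a plain implementation of the standard formula on toy labels; the function name `cohen_kappa` and the example data are illustrative and unrelated to the shared task annotations.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two annotators."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(1 for x, y in zip(a, b) if x == y) / n
    # Chance agreement: product of the annotators' marginal label rates.
    count_a, count_b = Counter(a), Counter(b)
    labels = set(count_a) | set(count_b)
    p_e = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label
    return (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    annotator_1 = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    annotator_2 = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

In the toy example the annotators agree on 9 of 10 items, yet kappa is well below 0.9 because most of that agreement is expected by chance when one label dominates. With tens of thousands of candidate labels, the chance term becomes very small and raw accuracy is close to the chance-corrected value.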
The annotation process followed for task 4 was
slightly different due to the nature of the task. We
obtained the two datasets of our training set,
focusing on flu vaccination and flu infection, from
Huang et al. (2017) and Lamb et al. (2013)
respectively. Huang et al. (2017) used Mechanical
Turk to crowdsource labels (Fleiss’ kappa = 0.793).
Lamb et al. (2013) did not report their labeling
procedure or annotator agreement metrics, but do
report annotation guidelines⁵. A few of the tweets
released by Lamb et al. appeared to be mislabeled
and were corrected in accordance with the annotation
guidelines defined by the authors. We obtained the
test data for task 4 by compiling three datasets.
For the dataset related to travel changes due to
Zika concerns, we selected a subset of data already
available from Daughton and Paul (2019). Initial
labeling of these tweets was performed by two
annotators with a public health background (Cohen’s
kappa = 0.66). We reuse the original annotations
for this dataset without changes. For the mosquito
exposure dataset, tweets were labeled by one
annotator with public health knowledge and
experience with social media, and then verified by
a second annotator with similar experience. The
additional set of data on personal exposure to
influenza was obtained from a separate group, who
used an independent labeling procedure.
3 Results
The challenge received a solid response with 19
teams from 12 countries (7 from North America, 1
from South America, 6 from Asia and 5 from Europe)
submitting 92 runs in total in one or more tasks.
We present an overview of all architectures
competing in the different tasks in Tables 1, 2, 3
and 4. We also list in these tables the external
resources competitors integrated for improving the
pre-training of their systems or for embedding
high-level features to help decision-making.
⁴ We measured agreement using accuracy instead of
Cohen’s kappa because, with greater than 70,000 LLTs
for the annotators to choose from, agreement due to
chance is expected to be small.
⁵ We used the awareness vs. infection labels as
defined in (Lamb et al., 2013).
The overview of all architectures is interesting
in two ways. First, this challenge confirms the
tendency of the community to abandon traditional
machine learning systems based on hand-crafted
features for deep learning architectures capable
of discovering the features relevant for the task
at hand from pre-trained embeddings. During the
challenge, when participants implemented
traditional systems, such as SVM or CRF, they used
such systems as baselines and, observing
significant differences in performance with systems
based on deep learning on their validation sets,
most of them did not submit their predictions as
official runs. Second, while last year
convolutional or recurrent neural networks “fed”
with pre-trained word embeddings learned on local
windows of words (e.g. word2vec, GloVe) were the
most popular architectures, this year we can see a
clear dominance of neural architectures using word
embeddings pre-trained with the Bidirectional
Encoder Representations from Transformers (BERT)
proposed by Devlin et al. (2018), or fine-tuning
these word embeddings on our training corpora. BERT
computes word embeddings based on the full context
of sentences and not only on local windows.
A notable result from tasks 1-3 is that, despite
an improvement in performance for the detection of
ADRs, their resolution remains challenging and will
require further research. The participants largely
adopted contextual word embeddings during this
challenge, a choice rewarded by new records in
performance on task 1, the only task rerun from
last year. The performance increased from .522
F1-score (.442 P, .636 R) (Weissenbacher et al.,
2018) to .646 F1-score (.608 P, .689 R) for the
best systems of each year. However, with a relaxed
matching F1-score of .432 (.362 P, .535 R) for the
best system (Table 7), the performance obtained in
task 3 for ADR resolution is still low and human
inspection is still required to make use of the
data extracted automatically. As shown by the best
score of .887 accuracy obtained on the ADR
normalization in task 3 run during #SMM4H in 2017
(Sarker et al., 2018)⁶, once ADRs are extracted,
the normalization of the ADRs can be performed with
good reliability. However, errors are made during
all steps of the resolution (detection, extraction,
normalization) and their overall accumulation
renders current automatic systems inefficient. Note
that the bulk of the errors are made during the
extraction of the ADRs, as shown by the low strict
F1-score of the best system in task 2, .464
F1-score (.389 P, .576 R).
⁶ Organizers of the task 3 run during #SMM4H 2017
provided participants with manually curated
expressions referring to ADRs and participants had
to map them to their corresponding preferred terms
in MedDRA.
For task 4, we were especially interested in the
generalizability of first person health classifiers to
a domain separate from that of the training data.
We find that, on average, teams do reasonably
well across the full test dataset (average F1-score:
0.70, range: 0.41-0.87). Unsurprisingly, classi-
fiers tended to do better on a test set in the same
domain as the training dataset (context 1, average
F1-score: 0.82) and more modestly on the Zika
travel and mosquito datasets (average F1-score:
0.40 and 0.52, respectively). Interestingly, in all
contexts, precision was higher than recall. We note
that both the training and the testing data were lim-
ited in quantity, and that classifiers would likely
improve with more data. However, in general, it is
encouraging that classifiers trained in one health
domain can be applied to separate health domains.
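The per-context breakdown used in this analysis can be sketched as follows: predictions are grouped by the health context of each tweet, and the F1-score of the personal-mention class is computed within each group and over the pooled data. The record format `(context, gold, pred)`, the context names and the function name `f1_by_context` are assumptions of this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, int, int]  # (context, gold label, predicted label)


def f1_by_context(records: Iterable[Record]) -> Dict[str, float]:
    """F1 of the positive class per context, plus a pooled 'all' score."""
    counts: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"tp": 0, "fp": 0, "fn": 0}
    )
    for context, gold, pred in records:
        # Each record contributes to its own context and to the pooled set.
        for key in (context, "all"):
            if gold == 1 and pred == 1:
                counts[key]["tp"] += 1
            elif gold == 0 and pred == 1:
                counts[key]["fp"] += 1
            elif gold == 1 and pred == 0:
                counts[key]["fn"] += 1
    scores = {}
    for key, c in counts.items():
        denom = 2 * c["tp"] + c["fp"] + c["fn"]
        scores[key] = 2 * c["tp"] / denom if denom else 0.0
    return scores


if __name__ == "__main__":
    toy_records = [
        ("flu", 1, 1), ("flu", 1, 1), ("flu", 0, 0), ("flu", 1, 0),
        ("zika_travel", 1, 0), ("zika_travel", 1, 1),
        ("zika_travel", 0, 0), ("zika_travel", 0, 1),
    ]
    for context, score in f1_by_context(toy_records).items():
        print(f"{context}: {score:.3f}")
```

Reporting the in-domain and out-of-domain contexts separately keeps a strong in-domain score from masking weaker generalization to the unseen contexts.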
4 Conclusion
In this paper we presented an overview of the
results of #SMM4H 2019, which focuses on a) the
resolution of adverse drug reactions (ADRs)
mentioned in Twitter and b) distinguishing tweets
reporting personal health status from opinions
across different health domains. With a total of
92 runs submitted by 19 teams, the challenge was
well attended. The participants, in large part,
opted for neural architectures and integrated
pre-trained, context-sensitive word embeddings
based on the recent Bidirectional Encoder
Representations from Transformers. Such
architectures were the most efficient on our four
tasks. Results on tasks 1-3 show that, despite a
continuous improvement in performance in the
detection of tweets mentioning ADRs over the past
years, their end-to-end resolution still remains a
major challenge for the community and an
opportunity for further research. Results of task 4
were more encouraging, with systems able to
generalize their predictions over domains not
present in their training data.
References
Ashlynn R. Daughton and Michael J. Paul. 2019. Iden-
tifying protective health behaviors on twitter: Ob-
servational study of travel advisories and zika virus.
Journal of Medical Internet Research. In Press.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2018. BERT: pre-training of
deep bidirectional transformers for language under-
standing. CoRR, abs/1810.04805.
Rizzo Giuseppe, Pereira Bianca, Varga Andrea,
van Erp Marieke, and Elizabeth Cano Basave Am-
paro. 2017. Lessons learnt from the named entity
recognition and linking (neel) challenge series. Se-
mantic Web Journal, 8(5):667–700.
George Hripcsak and Adam S Rothschild. 2005.
Agreement, the f-measure, and reliability in infor-
mation retrieval. Journal of the American Medical
Informatics Association, 12(3):296–298.
Xiaolei Huang, Michael C. Smith, Michael J. Paul,
Dmytro Ryzhkov, Sandra C. Quinn, David A. Bro-
niatowski, and Mark Dredze. 2017. Examining
patterns of influenza vaccination in social media.
In AAAI Joint Workshop on Health Intelligence
(W3PHIAI).
Alex Lamb, Michael J. Paul, and Mark Dredze. 2013.
Separating fact from fear: Tracking flu infections on
twitter. In Proceedings of Conference of the North
American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies
(NAACL-HLT 2013).
Xiaohua Liu, Shaodian Zhang, Furu Wei, and Ming
Zhou. 2011. Recognizing named entities in tweets.
In Proceedings of the 49th Annual Meeting of the
Association for Computational Linguistics: Human
Language Technologies - Volume 1, HLT ’11, pages
359–367. Association for Computational Linguis-
tics.
Zulfat Miftahutdinov, Elena Tutubalina, and Alexander
Tropsha. 2017. Identifying disease-related expres-
sions in reviews using conditional random fields. In
Proceedings of International Conference on Com-
putational Linguistics and Intellectual Technologies
Dialog, volume 1, pages 155–167.
Azadeh Nikfarjam, Abeed Sarker, Karen O’Connor,
Rachel Ginn, and Graciela Gonzalez-Hernandez.
2015. Pharmacovigilance from social media: min-
ing adverse drug reaction mention using sequence
labeling with word embedding cluster features.
Journal of the American Medical Informatics Asso-
ciation, 22(3):671–681.
Karisani Payam and Agichtein Eugene. 2018. Did you
really just have a heart attack? towards robust de-
tection of personal health mentions in social media.
In Proceedings of the 2018 World Wide Web Confer-
ence on World Wide Web, pages 137–146.
Abeed Sarker, Maksim Belousov, Jasper Friedrichs,
Kai Hakala, Svetlana Kiritchenko, Farrokh
Mehryary, Sifei Han, Tung Tran, Anthony Rios,
Ramakanth Kavuluru, Berry de Bruijn, Filip Gin-
ter, Debanjan Mahata, Saif Mohammad, Goran
Nenadic, and Graciela Gonzalez-Hernandez. 2018.
Data and systems for medication-related text clas-
sification and concept normalization from twitter:
insights from the social media mining for health
(smm4h)-2017 shared task. J Am Med Inform
Assoc, 25(10):1274–1283.
Abeed Sarker, Rachel Ginn, Azadeh Nikfarjam, Karen
O’Connor, Karen Smith, Swetha Jayaraman, Tejaswi
Upadhaya, and Graciela Gonzalez. 2015. Utilizing
social media data for pharmacovigilance: A review.
Journal of Biomedical Informatics, 54:202 – 212.
Abeed Sarker and Graciela Gonzalez. 2017. A cor-
pus for mining drug-related knowledge from twitter
chatter: Language models and their utilities. Data
in Brief, 10:122–131.
Abeed Sarker and Graciela Gonzalez-Hernandez.
2015. Portable automatic text classification for ad-
verse drug reaction detection via multi-corpus train-
ing. Journal of biomedical informatics, 53:196–207.
Davy Weissenbacher, Abeed Sarker, Michael J Paul,
and Graciela Gonzalez-Hernandez. 2018. Overview
of the third social media mining for health (smm4h)
shared tasks at emnlp 2018. In Proceedings of
the 2018 EMNLP Workshop SMM4H: The 3rd Social
Media Mining for Health Applications Workshop
and Shared Task, pages 13–16.
Abbreviations
FF: Feedforward
CNN: Convolutional Neural Network
BiLSTM: Bidirectional Long Short-Term Mem-
ory
SVM: Support Vector Machine
CRF: Conditional Random Field
POS: Part-Of-Speech
RNN: Recurrent Neural Network
Rank Team System details
1 ICRC Architecture: BERT + FF + Softmax
Details: lexicon features (pairs of drug-ADR)
Resources: SIDER
2 UZH Architecture: ensemble of BERT & C CNN + W BiLSTM (+ CRF)
Details: multi-task-learning
Resources: CADEC corpus
3 MIDAS@IIITD Architecture: 1. BERT 2. ULMFit 3. W BiLSTM
Details: BERT + GloVe + Flair
Resources: additional corpus (Sarker and Gonzalez-Hernandez,2015)
4 KFU NLP Architecture: BERT + logistic regression
Details: BioBERT
5 CLaC Architecture: Bert + W BiLSTM + attention + softmax + SVM
Details: BERT, Word2Vec, Glove, embedded features
Resources: POS, modality, ADR list
6 THU NGN Architecture: C CNN + W BiLSTM + features + Multi-Head attention + Softmax
Details: Word2Vec, POS, ELMo
Resources: sentiment Lexicon, SIDER, CADEC
7 BigODM Architecture: ensemble of SVMs
Resources: Word Embeddings
8 UMich-NLP4Health Architecture: 1. W BiLSTM + attention + softmax; 2. W CNN + BiLSTM + softmax; 3. SVM
Details: GloVe, POS, case
Resources: Metamap, cTAKES, CIDER
9 TMRLeiden Architecture: ULMfit
Details: Flair + Glove + Bert; transfer learning
Resources: external corpus (Sarker and Gonzalez,2017)
10 CIC-NLP Architecture: C BiLSTM + W FF + LSTM + FF
Details: GloVe + BERT
12 SINAI Architecture: 1. SVM 2. CNN + Softmax
Details: GloVe
Resources: MetaMap
13 nlp-uned Architecture: W BiLSTM + Sigmoid
Details: GloVe
14 ASU BioNLP Architecture: 1. Lexicon; 2. BioBert
Details: Lexicon learned with Logistic regression model
15 Klick Health Architecture: ELMo + FF + Softmax
Details: Lexicons
Resources: MedDRA, Consumer Health Vocabulary, (Nikfarjam et al.,2015)
16 GMU Architecture: encoder-decoder (W biLSTM + attention)
Details: Glove
Resources: #SMM4H 2017-2018, UMLS
Table 1: Task 1. System and resource descriptions for ADR mentions detection in tweets⁷.
⁷ We use C BiLSTM and C CNN to denote bidirectional LSTMs or CNNs encoding sequences of characters, W BiLSTM and
W FF to denote bidirectional LSTMs or Feed Forward encoders of word embeddings.
Rank Team System details
1 KFU NLP Architecture: ensemble of BioBERT + CRF
Details: BioBERT
Resources: external dictionaries (Miftahutdinov et al.,2017); CADEC, PsyTAR, TwADR-L corpora; #SMM4H 2017
2 THU NGN Architecture: C CNN + W BiLSTM + features + Multi-Head self-attention + CRF
Details: Word2Vec, POS, ELMo
Resources: sentiment Lexicon, SIDER, CADEC
3 MIDAS@IIITD Architecture: W BiLSTM + CRF
Details: BERT + GloVe + Flair
4 TMRLeiden Architecture: BERT + Flair
Details: Flair + Glove + Bert; transfer learning
5 ICRC Architecture: BERT + CRF
Resources: SIDER
6 GMU Architecture: C biLSTM + W biLSTM + CRF
Details: Glove
Resources: #SMM4H 2017-2018, UMLS
7 HealthNLP Architecture: W BiLSTM + CRF
Details: Word2vec, BERT, ELMo, POS
Resources: external dictionaries
8 SINAI Architecture: CRF
Details: GloVe
Resources: MetaMap
9 Architecture: BiLSTM + CRF
Details: Word2Vec
Resources: MIMIC-III
10 Klick Health Architecture: Similarity
Details: Lexicons
Resources: MedDRA, Consumer Health Vocabulary, (Nikfarjam et al.,2015)
Table 2: Task 2. System and resource descriptions for ADR mentions extraction in tweets
Rank Team System details
1 KFU NLP Architecture: BioBERT + softmax
2 myTomorrows-TUDelft Architecture: ensemble RNN & Few-Shot Learning
Details: Word2Vec
Resources: MedDRA, Consumer Health Vocabulary, UMLS
3 TMRLeiden Architecture: BERT + Flair + RNN
Details: Flair + Glove + Bert; transfer learning
Resources: Consumer Health Vocabulary
4 GMU Architecture: encoder-decoder (W biLSTM + attention)
Details: Glove
Resources: #SMM4H 2017-2018, UMLS
Table 3: Task 3. System and resource descriptions for ADR mentions resolution in tweets.
Rank Team System details
1 UZH Architecture: ensemble BERT
Resources: CADEC corpus
2 ASU1 Architecture: BioBERT + FF
Resources: Word2vec, manually compiled list, ConceptNet
4 MIDAS@IIITD Architecture: BERT; W BiLSTM
Details: BERT + GloVe + Flair
5 TMRLeiden Architecture: ULMfit
Details: Flair + Glove + Bert; transfer learning
Resources: external corpus (Payam and Eugene,2018)
6 CLaC Architecture: Bert + W BiLSTM + attention + softmax + SVM
Details: BERT, Word2Vec, Glove, embedded features
Resources: POS, modality, ADR list
Table 4: Task 4. System and resource descriptions for detection of personal mentions of health in tweets.
Team F1 P R
ICRC 0.6457 0.6079 0.6885
UZH 0.6048 0.6478 0.5671
MIDAS@IIITD 0.5988 0.6647 0.5447
KFU NLP 0.5738 0.6914 0.4904
CLaC 0.5738 0.5427 0.6086
THU NGN 0.5718 0.4667 0.738
BigODM 0.5514 0.4762 0.655
UMich-NLP4Health 0.5369 0.5654 0.5112
TMRLeiden 0.5327 0.6419 0.4553
CIC-NLP 0.5209 0.6203 0.4489
UChicagoCompLx 0.4993 0.4574 0.5495
SINAI 0.4969 0.5517 0.4521
nlp-uned 0.4723 0.5244 0.4297
ASU BioNLP 0.4317 0.3223 0.6534
Klick Health 0.4099 0.5824 0.3163
GMU 0.3587 0.4526 0.2971
Table 5: System performances for each team for task 1 of the shared task. F1-score, Precision and Recall over the
ADR class are shown. Top scores in each column are shown in bold.
Team Relaxed-F1 Relaxed-P Relaxed-R Strict-F1 Strict-P Strict-R
KFU NLP 0.658 0.554 0.81 0.464 0.389 0.576
THU NGN 0.653 0.614 0.697 0.356 0.328 0.388
MIDAS@IIITD 0.641 0.537 0.793 0.328 0.274 0.409
TMRLeiden 0.625 0.555 0.715 0.431 0.381 0.495
ICRC 0.614 0.538 0.716 0.407 0.357 0.474
GMU 0.597 0.596 0.599 0.407 0.406 0.407
HealthNLP 0.574 0.632 0.527 0.336 0.37 0.307
SINAI 0.542 0.612 0.486 0.36 0.408 0.322
ASU BioNLP 0.535 0.415 0.753 0.269 0.206 0.39
Klick Health 0.396 0.416 0.378 0.194 0.206 0.184
Table 6: System performances for each team for task 2 of the shared task. (Strict/Relaxed) F1-score, Precision
and Recall over the ADR mentions are shown. Top scores in each column are shown in bold.
Team Relaxed-F1 Relaxed-P Relaxed-R Strict-F1 Strict-P Strict-R
KFU NLP 0.432 0.362 0.535 0.344 0.288 0.427
myTomorrows-TUDelft 0.345 0.336 0.355 0.244 0.237 0.252
TMRLeiden 0.312 0.37 0.27 0.25 0.296 0.216
GMU 0.208 0.221 0.196 0.109 0.116 0.102
Table 7: System performances for each team for task 3 of the shared task. (Strict/Relaxed) F1-score, Precision
and Recall over the ADR resolution are shown. Top scores in each column are shown in bold.
Team            Acc     F1      P       R

Health concerns in all contexts
UZH             0.8772  0.8727  0.8392  0.9091
ASU1            0.8456  0.8036  0.9783  0.6818
UChicagoCompLx  0.8316  0.7913  0.9286  0.6894
MIDAS@IIITD     0.8211  0.783   0.8932  0.697
TMRLeiden       0.793   0.7256  0.9398  0.5909
CLaC            0.6386  0.4607  0.7458  0.3333

Health concerns in Context 1: Flu virus (infection/vaccination)
UZH             0.9438  0.9474  0.9101  0.9878
UChicagoCompLx  0.925   0.9231  0.973   0.878
ASU1            0.925   0.9221  0.9861  0.8659
MIDAS@IIITD     0.8875  0.88    0.9706  0.8049
TMRLeiden       0.8625  0.8493  0.9688  0.7561
CLaC            0.6625  0.5645  0.8333  0.4268

Health concerns in Context 2: Zika virus, travel plans changes
UZH             0.7536  0.7385  0.7059  0.7742
MIDAS@IIITD     0.6667  0.5818  0.6667  0.5161
ASU1            0.6957  0.5116  0.9167  0.3548
UChicagoCompLx  0.6377  0.4681  0.6875  0.3548
TMRLeiden       0.6377  0.4186  0.75    0.2903
CLaC            0.5362  0.2     0.4444  0.129

Health concerns in Context 3: Zika virus, reducing mosquito exposure
UZH             0.8393  0.7692  0.75    0.7895
MIDAS@IIITD     0.8214  0.6667  0.9091  0.5263
ASU1            0.8036  0.5926  1.0     0.4211
UChicagoCompLx  0.8036  0.5926  1.0     0.4211
TMRLeiden       0.7857  0.5385  1.0     0.3684
CLaC            0.6964  0.3704  0.625   0.2632
Table 8: System performances for each team for task 4 of the shared task. Accuracy, F1-score, Precision and
Recall over the personal mentions are shown. Top scores in each column are shown in bold.
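As a reading aid for the F1, P and R columns, the sketch below recomputes F1 as the harmonic mean of precision and recall for a few rows copied from the table above. The helper name `f1_score` and the tolerance are illustrative choices; because the published precision and recall are already rounded, the recomputed value is only expected to agree with the reported F1 to about three decimal places.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# (team, context, reported F1, P, R) copied from the task 4 table.
rows = [
    ("UZH", "all contexts", 0.8727, 0.8392, 0.9091),
    ("ASU1", "all contexts", 0.8036, 0.9783, 0.6818),
    ("ASU1", "Context 3", 0.5926, 1.0, 0.4211),
]

# Tolerance chosen to absorb rounding in the published P and R values.
TOLERANCE = 1e-3

for team, context, reported, p, r in rows:
    recomputed = f1_score(p, r)
    agrees = abs(recomputed - reported) < TOLERANCE
    print(f"{team:5s} {context:13s} reported={reported:.4f} "
          f"recomputed={recomputed:.4f} agrees={agrees}")
```

The third row shows why a precision of 1.0 does not guarantee a high F1: with recall at 0.4211, the harmonic mean stays close to 0.59.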
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
Giuseppe Rizzo, Bianca Pereira, Andrea Varga, Marieke van Erp, and Amparo Elizabeth Cano Basave. 2017. Lessons learnt from the named entity recognition and linking (NEEL) challenge series. Semantic Web Journal, 8(5):667-700.
Alex Lamb, Michael J. Paul, and Mark Dredze. 2013. Separating fact from fear: Tracking flu infections on Twitter. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2013).