Recovering Patient Journeys:
A Corpus of Biomedical Entities and Relations on Twitter (BEAR)
Amelie Wührl, Roman Klinger
Institut für Maschinelle Sprachverarbeitung, University of Stuttgart
Pfaffenwaldring 5b, 70569 Stuttgart, Germany
{amelie.wuehrl, roman.klinger}@ims.uni-stuttgart.de
Abstract
For a long time, text mining and information extraction for the medical domain has focused on scientific text generated by
researchers. However, their direct access to individual patient experiences or patient-doctor interactions is sometimes limited.
Information provided on social media, e.g., by patients and their relatives, complements the knowledge available in scientific
text. It reflects the patient’s journey and their subjective perspective on the process of developing symptoms, being diagnosed
and offered a treatment, being cured or learning to live with a medical condition. The value of this type of data is therefore
twofold: Firstly, it offers direct access to people’s perspectives. Secondly, it might cover information that is not available
elsewhere, including self-treatment or self-diagnoses. Named entity recognition and relation extraction are methods to structure
information that is available in unstructured text. However, existing medical social media corpora focus on a comparably
small set of entities and relations and on particular domains, rather than putting the patient at the center of the
analysis. With this paper we contribute a corpus with a rich set of annotation layers following the motivation to uncover and
model patients’ journeys and experiences in more detail. We label 14 entity classes (incl. environmental factors, diagnostics,
biochemical processes, patients’ quality-of-life descriptions, pathogens, medical conditions, and treatments) and 20 relation
classes (e.g., prevents, influences, interactions, causes) most of which have not been considered before for social media data.
The publicly available dataset consists of 2,100 tweets with ≈6,000 entity and ≈3,000 relation annotations. In a corpus analysis
we find that over 80 % of documents contain relevant entities. Over 50 % of tweets express relations which we consider essential
for uncovering patients’ narratives about their journeys.
Keywords: social media health mining, biomedical information extraction, BioNLP, relation extraction
1. Introduction
On social media, doctors, patients, concerned relatives
or other laypeople frequently discuss medical informa-
tion. Twitter posts for example contain opinions and
recommendations about treatments, recounts of medi-
cal experiences, or hypotheses and assumptions about
medical issues like in Figure 1. This information is by
design centered around the patient. It is impacted by
the patient’s journey and their subjective perspective on
processes like developing symptoms, being diagnosed
and offered a treatment, being cured or learning to live
with a disease. This data offers direct access to people’s
perspectives and covers information that is not avail-
able elsewhere, e.g., aspects that might not be considered important in clinical settings or that are difficult to assess there.
This includes, e.g., assessments of a patient’s quality
of life (Table 1, Ex. 2 and 5), or which environmental
factors people consider when talking about their health
(Table 1, Ex. 3 and 4).
At the same time, established resources and systems
for text mining and information extraction in the med-
ical domain have mostly been centered around scientific and biomedical text generated by researchers.

[Figure 1: Example of our annotation scheme. The tweet "As females (SOCIO-ECON) we tend to have more arthritis (MEDC)" is annotated with a neg influence relation between the two entities.]
Such texts seldom focus on individual patients' experiences or patient-doctor interactions, which makes the
information and knowledge contained in the text dis-
tant by nature. While scientific resources contain high
quality information, many studies struggle with gender
biases and population imbalance (Weber et al., 2021),
which leads to blind spots in the literature. The time-
consuming nature of clinical studies causes delays until
information is available to practitioners. Both limita-
tions can be mitigated by accessing social media data.
Duh et al. (2016) find in fact that social media can lead
to earlier detection of adverse drug reactions.
While social media data has come more into focus re-
cently, existing corpora are limited with respect to the
types of entities and relations they cover. Most com-
monly, biomedical entity corpora focus on diseases,
symptoms and drugs (Jimeno-Yepes et al., 2015; Al-
varo et al., 2017, i.a.). With regards to relation de-
tection, work on Twitter is limited to causal relations
(Doan et al., 2019), or a very small number of relation
classes (i.e., reason-to-use, outcome-negative, outcome-positive) (Alvaro et al., 2017). This leaves a gap with respect to many medical information needs. As described above, con-
tent from social media holds this type of information.
Extracting it is required if we want to uncover more
fine-grained aspects of patients’ medical journeys com-
plementary to the knowledge in scientific text.
To facilitate research in this area, we contribute a cor-
pus of medical tweets annotated with a fine-grained
set of medical entities and relations between them.
For the BEAR Corpus of Biomedical Entities And
Relations on Twitter, we annotate 14 entity and 20 re-
lation classes. Entities include environmental factors,
diagnostics, biochemical processes, quality-of-life as-
sessments, pathogens, as well as more established en-
tity classes such as medical conditions, and treatments.
Relation classes model how entities prevent, influence,
interact with, cause or worsen other entities, or how
they relate to each other as a symptom, side-effect, or
diagnosis.
The dataset consists of 2,100 tweets with roughly 6,000
entities and 3,000 relations. To the best of our knowl-
edge the majority of those classes which are centered
around patient journeys have not been considered be-
fore. The dataset is available at
https://www.ims.uni-stuttgart.de/data/bioclaim.
2. Related Work
Biomedical natural language processing (BioNLP) is
an established field in computational linguistics, with a
rich set of shared tasks including BioCreative and the
competitions organized by the BioNLP workshop se-
ries (BioCreative VII, 2021; Ben Abacha et al., 2021). Research
topics include automatic information extraction from
clinical reports, discharge summaries or life science ar-
ticles, e.g., in the form of entity recognition for dis-
eases, proteins, drug and gene names (Habibi et al.,
2017; Giorgi and Bader, 2018; Lee et al., 2019, i.a.).
A subsequent task to entity recognition is relation ex-
traction which covers clinical relations (Uzuner et al.,
2011; Wang and Fan, 2014; Sahu et al., 2016; Lin
et al., 2019; Akkasi and Moens, 2021) or biomedi-
cal relations/interactions (e.g., drug-drug-interactions)
between entities (Lamurias et al., 2019; Sousa et al.,
2021, i.a.).
While scientific resources contain high quality infor-
mation, studies might not be fully representative re-
garding population groups or gender (Weber et al.,
2021), which leads to blind spots in the literature – the
general population can barely be captured in such stud-
ies. In addition, clinical studies or reports are time-
consuming which inevitably leads to delays, e.g., with
regards to indications of adverse drug events. Both
limitations can be mitigated by accessing social media
data. Duh et al. (2016) find in fact that social media
can lead to earlier detection of adverse drug reactions.
This is why biomedical NLP also works with social
media texts and online content (Wegrzyn-Wolska et al.,
2011; Yang et al., 2016; Sullivan et al., 2016, i.a.), in-
cluding established shared tasks (Magge et al., 2021a).
A major focus has been to inform pharmacovigilance
by identifying and extracting mentions of adverse drug
reactions (Nikfarjam et al., 2015; Cocos et al., 2017;
Magge et al., 2021b). Additionally, the community has
explored leveraging social media postings to monitor
public health (Paul and Dredze, 2012; Choudhury et
al., 2013; Sarker et al., 2016; Stefanidis et al., 2017),
and detect personal health mentions (Yin et al., 2015;
Klein et al., 2017; Karisani and Agichtein, 2018).
A few studies compare biomedical information in sci-
entific documents with social media: Thorne and
Klinger (2017) explore how disease names are referred
to across both domains, while Seiffe et al. (2020) look
into laypersons’ medical vocabulary. A related task is
entity normalization which links a given mention of an
entity to the respective concept in a formalized medical
ontology. Limsopatham and Collier (2016) and later
Basaldella et al. (2020) explore this task for medical
entities on social media showcasing the difficulties in
mapping laypeople’s health terminology to structured
medical knowledge bases.
The ongoing COVID-19 pandemic has sparked
BioNLP research to leverage or contextualize infor-
mation about the disease and virus from social me-
dia. A number of studies explore detecting COVID-19-
related misinformation and fact-checking (Hossain et
al., 2020; Chen and Hasan, 2021; Mattern et al., 2021;
Saakyan et al., 2021, i.a.). Others have looked into
monitoring information surrounding the virus using so-
cial media (Cornelius et al., 2020; Hu et al., 2020).
2.1. BioNER on Social Media
Early contributions on biomedical information extrac-
tion from Twitter aimed at the extraction of adverse
drug reactions from social media – a fundamentally dif-
ferent use case than scientific text analytics. The goal
is to provide access to information even before it be-
comes available to doctors or researchers. This work
includes corpus creation efforts on dedicated platforms
like AskAPatient¹ (Karimi et al., 2015) and Twitter
(Nikfarjam et al., 2015; Magge et al., 2021b).
With a similar motivation, Jimeno-Yepes et al. (2015)
created Micromed, a Twitter corpus annotated with
disease names, drug names, and symptom mentions.
Further, TwiMed (Alvaro et al., 2017) is a dataset
which combines social media and scientific text with
annotations of diseases, symptoms and drug names to
study drug reports across both sources. Annotated with
the same entity classes, the MedRed dataset consists
of Reddit posts (Scepanovic et al., 2020) labeled via
crowdsourcing.
In addition to identifying entities, there has also been
some work on linking them to existing databases. To
facilitate this task for social media, Limsopatham and
Collier (2016) contribute a Twitter corpus in which en-
tities are linked to the SIDER 4 (Kuhn et al., 2016)
database of drug profiles. Basaldella et al. (2020)
subsequently introduce COMETA, a Reddit corpus in
which entities are linked to SNOMED-CT². With regards to the groups of entities considered (phenotype, disease, anatomy, molecule (incl. drugs, toxins, nutrients etc.), gene/DNA/RNA, device, procedure) this is similar to our contribution.

¹ https://www.askapatient.com/
² https://www.snomed.org/

Table 1: Annotated tweets from the dataset.
1  Prochlorperazine (DRUG) is compazine (DRUG), just the generic name. Ativan (DRUG) also causes drowsiness (MEDC) [...] I was on Lyrica (DRUG) to help with the horrific neuropathic pain (MEDC) but cause mind numbing (MEDC) and bowel problems (MEDC)
   Relations: is type of, side-effect-of, treats, side-effect-of, side-effect-of
2  so I stopped taking the Lyrica (MEDC). I'm in more pain now but feel more like me. (QOL)
   Relations: cause of
3  Meditation (HABITUAL), yoga (HABITUAL) [...] are all effective at relieving stress (MEDC) and helping with #IBS (MEDC).
   Relations: pos influence (4x)
4  Alcohol (DIETARY) disrupts production of adenosine (PROCESS) which results in lighter sleep (MEDC) [...]
   Relations: neg influence, cause of
5  I'm awake just can't get going. (QOL) Need cat food seriously only reason go out [...] #SpoonieLife (MEDC)
   Relations: cause of
6  [...] neighbour has been diagnosed with c19 (MEDC) which means admin has to self isolate (THERAPY) and do a test (DIAGNOSTICS) [...]
   Relations: pos influence, may diagnose
7  Support dogs (OTHER) can improve the effectiveness of dementia (MEDC) therapy! Miracle creatures.
   Relations: pos influence
Existing resources do not cover enough entity types to extract patient narratives from social media. They do not yet provide access to the fine-grained information that social media content holds and that would allow us to fill the information gap in scientific text.
2.2. Detection of Medical Relations on Social
Media
Relation extraction contextualizes entities with each
other. Medical relation extraction resources for social
media are rare. Existing studies have focused on causal
relations (Doan et al., 2019), or a small number of
relation classes (i.e., reason-to-use, outcome-negative, outcome-positive) (Alvaro et al., 2017).
With regards to scientific text, and specifically clinical
relation extraction, closest to our annotation scheme are
approaches by Uzuner et al. (2011) and Wang and Fan
(2014). The classes in both works describe relations between treatments and medical conditions, or relations between two treatments, medical conditions, or diagnoses (e.g., treatment caused medical problem, treatment improved or cure medical problem, test reveal medical problem in Uzuner et al. (2011), or treats, prevents, has symptom, contraindicates in Wang and Fan (2014)). However, both work with clinical and scien-
tific texts. Medical relation extraction on social media is understudied and lacks resources that facilitate extracting patients' experiences of and opinions about the entities in their medical history, which would allow us to recover their medical narratives.
3. Corpus Creation
3.1. Data Collection
We collect English tweets between January 01 and
November 02, 2021 using the official keyword-based
Twitter API.
3.1.1. Corpus Subselection
The list of keywords to retrieve the data stems from
three different sources. Refer to Table 6 for examples
for each source.
1. DrugBank: DrugBank is a database for drugs
which provides molecular information about
drugs, their mechanisms, interactions and tar-
gets (Wishart et al., 2018). We use generic and
brand/product names which allows us to collect
tweets discussing treatments, or descriptions of
off-label drug use.
2. MeSH: Medical Subject Headings is a controlled
vocabulary thesaurus used for indexing articles in
PubMed.³ We use terms from the subcategories
disease and therapeutics to collect tweets that ad-
dress specific diseases and therapeutic measures.
We use all terms that appear with a frequency ≥ 1000 in PubMed articles, hypothesizing that the distribution of those terms mirrors their usage on Twitter.
3. Manual: MeSH and DrugBank mostly contain scientific terms (see Table 6), so we also query with a manually compiled list of medical terms. Partly, those relate to 10 medical conditions.⁴ This is to collect tweets that either use Twitter-specific hashtags, abbreviations, or community-based terms related to a condition, or mention terms generally related to the medical domain.

³ https://pubmed.ncbi.nlm.nih.gov/
All terms combined result in a list of 22,874 keywords.
From this list, 10,599 terms return results from Twit-
ter during a test crawl. We remove unproductive terms
and use a final list of 7,358 keywords from Drug-
Bank, 3,120 from MeSH, and 121 from the manually
compiled list.⁵ We acknowledge that by using this approach, we cannot sample tweets with incorrectly spelled mentions of drug or disease names.
We only keep non-duplicate tweets (based on the tweet
ID) which do not contain a URL due to their increased
probability of containing advertisements. Further, we
only keep tweets which contain a relational term. Ex-
amples include words like treats, prescribed, or diag-
nosed (and variations thereof). From the resulting col-
lection of tweets, we draw a sample balanced across the
three keyword sources. We subsequently annotate 700
tweets per data source (350 per MeSH subcategory)
which amounts to a total of 2,100 tweets.
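The filtering and sampling steps just described can be summarized in a short script. The following sketch assumes the tweets have already been retrieved and are available as dictionaries with id and text fields; the field names, helper functions, and the (heavily truncated) relational-term list are illustrative and not the actual implementation used for BEAR.

```python
import random
import re

# Illustrative subset of relational terms; the full keyword and filter lists
# are released together with the corpus.
RELATIONAL_TERMS = {"treat", "treats", "treated", "prescribed", "diagnosed", "causes", "prevents"}

def keep_tweet(text: str) -> bool:
    """Keep a tweet only if it contains no URL (often ads) and at least one relational term."""
    lowered = text.lower()
    if "http://" in lowered or "https://" in lowered:
        return False
    tokens = set(re.findall(r"[a-z#']+", lowered))
    return bool(tokens & RELATIONAL_TERMS)

def filter_and_sample(tweets_by_source: dict, per_source: int = 700, seed: int = 0) -> list:
    """Deduplicate by tweet ID, filter, and draw an equally sized sample per keyword source."""
    random.seed(seed)
    sample = []
    for source, tweets in tweets_by_source.items():   # keys e.g. "drugbank", "mesh", "manual"
        seen_ids, kept = set(), []
        for tweet in tweets:                           # tweet: {"id": ..., "text": ...}
            if tweet["id"] in seen_ids:
                continue
            seen_ids.add(tweet["id"])
            if keep_tweet(tweet["text"]):
                kept.append(tweet)
        sample.extend(random.sample(kept, per_source))
    return sample
```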
3.2. Annotation
We label entity and relation classes that allow us
to include individual aspects within people’s disease-
treatment cycles. Classes cover information concern-
ing developing symptoms, being diagnosed and offered
a treatment, being cured or learning to live with a medi-
cal condition. They allow us to model statements about
how people self-diagnose or treat a particular condition themselves, and to capture how they perceive risk factors. For both annotation tasks, we therefore follow the
central paradigm which tells annotators to label entities
and relations the way a tweet’s author intends or under-
stands them. A mention like UV radiation could either
be intended as an environmental factor (High UV radi-
ation causes skin cancer.), or a treatment (UV radiation
will help with my low vitamin D levels).
3.2.1. Entity classes
We label seven groups of entities. Each group contains
a respective label or subset of labels which the annota-
tors use to label the text. We visualize the entities in
Figure 2 and depict which entity-pairs can be related.
Each entity group is briefly described in the following section. Table 1 additionally provides fully annotated examples from the dataset, to which we refer in the following descriptions.
Medical Conditions. All mentions of diseases,
symptoms, side effects, and medical events or descriptions thereof. See #IBS (MEDC) in Ex. 3 or drowsiness (MEDC) in Ex. 1. Phrases like stopped taking the Lyrica (MEDC) in Ex. 2 are considered relevant medical events, and labeled as medC, too.

⁴ COVID-19, Alzheimer's disease, borderline personality disorder, cancer, depression, irritable bowel syndrome, measles, multiple sclerosis, post-traumatic stress disorder, stroke.

⁵ Lists we used to collect and filter the data are available in the suppl. material together with the corpus.

[Figure 2: Visualization of entity classes and the relations between them. Entity groups: medical condition; treatments (drug, therapy); environmental factors (socio-eco, geo-climate, dietary, habitual, pollution); pathogen; biochemical (substance, process); diagnostics; quality-of-life; other.]
Treatments. Mentions of any kind of treatment.
That includes drug names, generic and brand names
(see Prochlorperazine (DRUG) and compazine (DRUG) in Ex. 1) and all types of therapy or prevention methods (see self isolate (THERAPY) in Ex. 6).
Environmental Factors. Entities that influence,
cause or contribute to a medical condition. We annotate
socio-economic (age, gender, ethnicity, social back-
ground etc.), geographic/climatic (geography, climate,
weather etc.), dietary, habitual (exercise, stress etc.) or
pollution-related (air/water pollution, UV or nuclear ra-
diation) factors. See yoga (HABITUAL) in Ex. 3 or Alcohol (DIETARY) in Ex. 4.
Pathogens. Pathogens are organisms that cause dis-
eases. This includes mentions of bacteria, fungi, para-
sites, or viruses, e.g., coronavirus (PATHOGEN).
Biochemical Entities. Biochemical substances such
as proteins or hormones (e.g., Lactose (SUBSTANCE)). The class includes biochemical processes such as biological, pathogenic or chemical mechanisms (see production of adenosine (PROCESS) in Ex. 4).
Diagnostics. Mentions of tests or other diagnostic in-
struments that are used to diagnose or test for a medical
condition. Refer to test (DIAGNOSTICS) in Ex. 6.
Quality of Life Assessments. Descriptions of pa-
tients’ quality of life, i.e. mentions of how a disease
or its management impacts a patient’s well-being. See
I’m awake just can’t get going
QOL
in Ex. 5.
Other. Relevant entities that can not be covered by
any of the other classes. See Support dogs (OTHER) in Ex. 7.
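To make the annotation layers concrete, Ex. 7 from Table 1 could be stored as a simple standoff record like the one below. This structure is purely illustrative and not necessarily the file format of the released corpus; the relation layer it contains is described in Section 3.2.2.

```python
# Hypothetical standoff record for Ex. 7; start/end are character offsets into the text.
example_7 = {
    "text": "Support dogs can improve the effectiveness of dementia therapy! Miracle creatures.",
    "entities": [
        {"id": "T1", "type": "other", "start": 0,  "end": 12, "span": "Support dogs"},
        {"id": "T2", "type": "medC",  "start": 46, "end": 54, "span": "dementia"},
    ],
    "relations": [
        {"type": "pos influence on", "source": "T1", "target": "T2"},
    ],
}
```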
3.2.2. Relation Classes
Each relation is directed and connects two entities (see
Figure 2 for a depiction of which entities can be related
and Table 1 for examples). We annotate the follow-
ing entity pairs with relations (± indicates that a relation has a positive and a negative variant, e.g., (does not) treat):

treat → medC: ±treats, worsens, ±prevents, ±causes, contraindicates, prescribed, ±influences
medC → treat: side effect of
env/pathogen/biochem → medC: ±causes, ±influences, ±prevents
medC → medC/biochem: has symptom, ±causes, is similar to
treat → treat: ±interaction, is similar to
diag → medC/pathogen: ±diagnoses
pathogen → biochem: ±causes
medC/treat/env/diag → qol: ±causes, ±influences
general: type of, other
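One way to operationalize this scheme, e.g., for the relation validity checks used during aggregation (see the appendix), is a mapping from directed entity-group pairs to admissible relation labels. The sketch below is only illustrative: the entity-group keys and label spellings are ours (following Table 5), and only part of the scheme is spelled out.

```python
# Illustrative, partial encoding of the relation scheme from Section 3.2.2.
# Remaining pairs (env/pathogen/biochem -> medC, pathogen -> biochem, ... -> qol)
# follow the same pattern.
RELATION_SCHEMA = {
    ("treat", "medC"): {"treats", "does not treat", "worsens", "prevents", "does not prevent",
                        "cause of", "not cause of", "is contraindicated", "prescribed for",
                        "pos influence on", "neg influence on"},
    ("medC", "treat"): {"side effect of"},
    ("medC", "medC"): {"has symptom", "cause of", "not cause of", "is similar to"},
    ("treat", "treat"): {"pos interaction", "neg interaction", "is similar to"},
    ("diag", "medC"): {"may diagnose", "may not diagnose"},
}

GENERAL_RELATIONS = {"is type of", "other"}  # applicable to any entity pair

def is_valid(relation: str, source_type: str, target_type: str) -> bool:
    """A relation annotation is valid if the scheme admits it for the directed entity pair."""
    if relation in GENERAL_RELATIONS:
        return True
    return relation in RELATION_SCHEMA.get((source_type, target_type), set())
```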
3.2.3. Evaluation metrics
We measure the agreement between annotations by cal-
culating the inter-annotator F1. Specifically, we treat
one annotator’s labels as the gold annotations and con-
sider the other annotator’s labels as predictions (Hripc-
sak and Rothschild, 2005).
We report the agreement for varying levels of strictness.
We consider entity span (S) and type (T) as follows:
S1T1 The two spans and types of the entities are en-
tirely identical.
S0T1 The two spans overlap by min. one token, entity
type is identical.
S0T0 The two spans overlap by min. one token, entity
type is ignored in the comparison.
When evaluating the annotated relation (R) between
two entities, we consider two modes:
R1 Relation type and direction are identical.
R0 Relation type and direction are ignored in the
comparison.
On the entity level, comparing S1T1 to S0T1 shows to
which extent the span of an entity influences the an-
notation task. Comparing S0T1 to S0T0 indicates the
impact of assigning a label on the difficulty of the task.
Analyzing the relation annotation follows the same ob-
jectives with respect to the entities, but adds the im-
pact of the relation assignment. R1S1T1 is the strictest
evaluation mode. The comparison to both R1S0T1 and
R1S0T0 helps in understanding how the entity anno-
tation influences the relation annotation task. R0S0T0
captures the most general level of agreement indicating
how well the annotators can identify the fact that any
two entities are somehow related. Comparing this to
R1S0T0, we can conclude how difficult it is to identify
relation types.
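A minimal sketch of this computation on the entity level is given below; the entity representation (token offsets with exclusive end), the matching logic, and the function names are our own reading of the modes above, not the evaluation code used for the paper, and the reported scores additionally macro-average over classes.

```python
from typing import List, Tuple

Entity = Tuple[int, int, str]  # (token start, token end exclusive, entity type)

def matches(gold: Entity, pred: Entity, strict_span: bool, use_type: bool) -> bool:
    """S1 = identical span, S0 = overlap by at least one token; T1 = same type, T0 = type ignored."""
    (gs, ge, gt), (ps, pe, pt) = gold, pred
    span_ok = (gs, ge) == (ps, pe) if strict_span else (gs < pe and ps < ge)
    type_ok = gt == pt if use_type else True
    return span_ok and type_ok

def agreement_f1(gold: List[Entity], pred: List[Entity], strict_span=True, use_type=True) -> float:
    """Inter-annotator F1: one annotator's labels are treated as gold, the other's as predictions."""
    matched_gold = sum(any(matches(g, p, strict_span, use_type) for p in pred) for g in gold)
    matched_pred = sum(any(matches(g, p, strict_span, use_type) for g in gold) for p in pred)
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# S1T1: agreement_f1(a1, a2, True, True); S0T1: (..., False, True); S0T0: (..., False, False)
```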
3.2.4. Guideline development & annotator
training
We work with two in-house annotators (A1, A2) to la-
bel the tweets with entities and relations.

Round   S1T1  S0T1  S0T0  R1S1T1  R1S0T1  R1S0T0  R0S0T0
 1       .44   .66   .73    .18     .4      .4      .4
 2       .23   .34   .75    .18     .25     .35     .35
 3       .37   .45   .79    .05     .23     .27     .59
 4       .64   .76   .95    .44     .48     .51     .68
 5       .39   .46   .76    .07     .1      .29     .43
 6       .44   .49   .89    .12     .3      .44     .58
 7       .62   .68   .96    .4      .4      .44     .47
 8       .42   .61   .77    .15     .39     .41     .56
 9       .69   .77   .82    .31     .33     .35     .54
10       .53   .55   .8     .28     .4      .59     .66

Table 2: Macro inter-annotator F1 across all entity and relation classes throughout the training rounds. S1T1 through R0S0T0 indicate the evaluation mode.

Both annotators are female, ages 20 to 25, and 25 to 30, respec-
tively. Their backgrounds are in linguistics and compu-
tational linguistics. They have no medical training. We
iteratively train the annotators over the course of three
months. In each training iteration, all annotators label a
small set of instances independently following our an-
notation guidelines. Subsequently we discuss each set
within the group. In addition, we calculate the inter-
annotator F1 for each round of training annotations (re-
fer to Section 3.2.3 for an explanation of the eval. met-
rics used), and adapt the guidelines with findings from
the discussions and analysis to clarify the annotation
tasks further. The training instances are not part of the
final corpus. The final version of the guideline docu-
ment is available in the supplementary material.
Table 2 shows the development of the inter-annotator
F1 over the training iterations. For each round we report the macro F1 score across all entity/relation classes
in the different evaluation settings. We find that the
agreement increases for the entities and the relation an-
notation over time. The agreement increases as we al-
low for less precise matches to be counted as true pos-
itive instances. By the end of the training period, an-
notators agreed with .53 F1 on exact entity types and
boundaries (S1T1). Comparing the impact of each sub-
task in the last round, we observe that agreeing on the
entity type is more challenging than identifying the en-
tity span (decrease of .25 F1 between S0T0 and S0T1 vs. .02 F1 decrease between S0T1 and S1T1). Evalu-
ating the relation type strictly (R1S0T0 vs. R0S0T0),
the agreement drops by .07 F1, which indicates that the
relation type is fairly ambiguous, and therefore hard to
agree upon. The strictest evaluation measures (S1T1,
R1S1T1) show that the task remains challenging even
after substantial annotator training which we attribute
to the diverse nature of text in tweets. Presumably,
this is also why the agreement fluctuates over training
rounds.
Figure 3: Inter-annotator macro F1-scores for each sub-
sample of the corpus (DrugBank (drug), MeSH (mesh),
manually researched keywords (manual)), and the full
dataset (full) across evaluation modes.
3.3. Aggregation
We provide an adjudicated version of the dataset which
combines both annotators’ results. In case of dis-
agreements of entity spans between the annotators, we
choose the longest overlapping sequence between two
instances. We further prefer more frequent entity and
relation classes over less frequent ones, and choose
more general concepts over more specific ones. Gen-
erally, our aggregation strategy is motivated by a high
recall approach to ensure that we lose as little of the nu-
ances from the individual annotations as possible. We
aggregate in two steps and first align the entity anno-
tations, followed by aggregating the relations. Please
refer to the appendix for more details.
4. Analysis
4.1. Agreement Between Annotators
The annotators labeled the final corpus over the course
of four months. Since both sets of annotations provide
unique perspectives on the data, we release the indi-
vidual annotations along with an aggregated version.
We evaluate the annotations using the inter-annotator
F1-scores as described in Section 3.2.3 and provide
scores for the full dataset as well as individual scores
for each sampling method in the following. Figure 3
shows the inter-annotator F1-scores for each subsam-
ple of the corpus evaluated with descending strictness.
For the final corpus we find that annotators are fairly
synchronized in identifying entities in tweets (.67 F1
S0T0). Agreeing on the entity type is more challenging
than identifying the same entity span (.07 F1 decrease between S0T1 and S1T1 vs. .23 F1 decrease between
S0T0 and S0T1). This is also the case for the rela-
tion agreement. Labeling the relation type is by far the
most difficult task. When we compare the agreement
levels in R0S0T0 with R1S0T0, we report a difference
of .13 F1, which showcases how ambiguous the relations
are.
We observe a slight decrease of the agreement com-
pared to the last training round. We attribute this to the
fact that annotators continue to be faced with novel
variations of entities and relations because of Twitter’s
diverse nature.
Agreement across sources. Across all evaluation
modes, tweets from the subsample Manual show the
strongest agreement, followed by subsamples MeSH,
and DrugBank. The results indicate that tweets from
the Manual category are easier to annotate than the
other documents, presumably because they mostly use
laypeople’s vocabulary. Due to the nature of the Drug-
Bank database, tweets from this set might be more sci-
entific, making them more difficult to annotate.
Agreement across entities. Table 4 reports the inter-
annotator F1-score (iaa) for each entity class (eval.
mode: S1T1). A1 and A2 agree most strongly on in-
stances of medC and treat drug (.73 and .74 F1, respec-
tively). We observe the lowest agreement for mentions
of biochem process (.05 F1).
We observe that the agreement for highly frequent
classes is stronger than the agreement in less frequent
ones. Presumably, this is because these classes are also
the most concrete, and therefore easier to detect. Less
frequent classes (e.g., env or qol) could be considered
more abstract or vague. At the same time, we presume
that seeing a certain type of entity more often acts like
a training effect for the annotators.
Agreement across relations. Table 5 reports the
inter-annotator F1-score for each relation class (eval-
uation mode: R1S0T0⁶). Across all classes, we
report a macro F1-score of .35. has symptom and
does not prevent are the classes with highest agree-
ment (.59 F1, respectively), followed by treats (.58
F1), may diagnose and prevents (.56 F1, respectively).
We observe no agreement for is contraindicated,
may not diagnose, and pos/neg interaction.
4.2. Corpus Statistics
The final corpus contains 2,100 tweets with labels for
medical entities and the relations connecting them. Ta-
ble 3 lists the number of documents with and without
entities and relations. The majority of documents in
the dataset contain entities. 86.2 % of all documents in
the dataset are labeled with at least one entity. Slightly
more than half of all documents containing entities also
express a relevant relation (56.5 %).
The corpus consists of 93,258 words (17,559 words
are unique). The longest tweet consists of 114 words,
the two shortest tweets are made up of 4 words each
(see Table 7). A tweet from our corpus has an average
length of 44.41 words. There is no substantial differ-
ence between tweets from different sampling sources.
⁶ We choose R1S0T0 to focus specifically on the relation while allowing an imprecise agreement on the entities.
Number of documents
         no ent        with ent       no rel        with rel
A1    330 (15.7)    1770 (84.3)    835 (47.2)     935 (52.8)
A2    378 (18.0)    1722 (82.0)    833 (48.4)     889 (51.6)
agg   289 (13.8)    1811 (86.2)    788 (43.5)    1023 (56.5)

Table 3: Number of documents with and without entities (ent) and relations (rel) for both annotators (A1, A2) and the aggregated dataset (agg). Values in parentheses report the respective percentages. For relations this is w.r.t. all instances which contain entities.
The following sections describe our dataset in more de-
tail. We present corpus statistics regarding the entity
and relation class distribution. Note that we describe
the aggregated version of the dataset.
4.2.1. Entities
Table 4 shows the number of instances per entity class.
We include the statistics for both annotators (A1, A2)
and for the adjudicated dataset. Additionally, we report
the statistics for the whole corpus (full), and divided by
the method the documents were sampled with (Drug-
Bank, MeSH terms, Manual).
The dataset contains 6,324 entities. The biggest en-
tity class is medical conditions (3,553 instances), fol-
lowed by mentions of treat drug (1,240). The re-
maining entity classes are substantially less frequent.
env pollution has the smallest number of instances (5).
Annotators label approx. 3.01 entities per document.
Entities across sources. Mentions of medical con-
ditions are more frequent in tweets from the subsam-
ples MeSH and Manual (1,458 and 1,367, respectively)
than they are in the DrugBank sample (728). Tweets
from set DrugBank exhibit the majority of mentions of
treat drug as well as biochem substance entities (1,035
and 163, respectively). Notably, mentions of the sec-
ond treatment-related entity class, treat therapy, are
more frequent in tweets from the MeSH and Manual
sample.
These results confirm that tweets in the DrugBank sam-
ple more frequently discuss treatments, and therefore
exhibit a high number of drug and biochemical entities.
treat therapy captures more general treatment descrip-
tions than specific mentions of drugs. Regarding the
subsample Manual, we presume that the high frequency
of therapy mentions indicates that laypeople speak in
more general terms about treatments.
4.2.2. Relations
Table 5 reports the number of annotated relations for
each class. We calculate the statistics for both annota-
tors (A1, A2) and for the adjudicated data. We report
the numbers of relations for the full corpus as well as
for each of the three subsamples (DrugBank, MeSH,
Manual).
In total, the corpus contains 2,959 relations. The
cause of relation is the most frequent (983), followed
by treats (500), is type of (336), and pos influence
(263). worsens is the class with the lowest frequency
(1 instance). For relations which can be either positive
or negative, the negative relations are always less fre-
quent. On average, a document in our dataset contains
1.41 relations.
Relations across sources. While documents from
the subsamples DrugBank and MeSH show relatively equal numbers of total relations (1,043 and 1,081, respectively), the Manual subsample contains the fewest relations (835). cause of rela-
tions are most frequent in the subsamples MeSH (407)
and Manual (331). In the DrugBank set, treats is the
most prevalent relation class (277). Notably, for the Manual set, we find that cause of is by far the most frequent relation. All other classes count (mostly substantially) fewer than 100 instances each.
5. Conclusion and Future Work
We introduce and describe BEAR, a corpus of 2,100
medical tweets annotated with a detailed set of biomed-
ical entities, and the relations connecting them. Both
the entity and relation classes are motivated by the
need to capture fine-grained aspects of patients’ med-
ical journeys. In our annotation study, we show that
tweets hold this type of information, and that non-
expert annotators can detect this reasonably well.
With this dataset, we lay the groundwork to develop
entity and relation extraction systems that give medi-
cal professionals access to patient narratives which are
not covered in scientific texts. This includes quality-of-
life assessments, perception of risk factors, unconven-
tional treatments, or self-diagnoses that people might feel uncomfortable sharing with their doctors or consider irrelevant. Such systems could help answer detailed ques-
tions like ”How does chemotherapy affect the social
life of breast cancer patients?” or ”Which habits serve
as coping mechanisms for people suffering from de-
pression?”.
Acknowledgments
This research has been conducted as part of the FIBISS
project which is funded by the German Research Coun-
cil (DFG, project number: KL 2869/5-1). We thank our
annotators for their hard work and tireless attention to
detail.
Entity classes (columns 1-14): (1) medC, (2) treat drug, (3) treat therapy, (4) env geo-cli, (5) env diet, (6) env habit, (7) env pollution, (8) env socio-econ, (9) pathogen, (10) qol, (11) diag, (12) biochem process, (13) biochem subst, (14) other.

Sample  Ann.     1     2    3   4    5   6   7   8    9   10  11  12   13   14  total  av./doc
DB      A1     674  1053   99   1   47   4   1   3   14   23  23  24  139   48   2153    3.08
DB      A2     620   967   84   0   43   2   1   6   13   22   8  19  189   39   2013    2.88
DB      agg    728  1035  114   0   34   2   1   5    7   28  13   6  163   41   2177    3.11
MeSH    A1    1361   151  366   2   42  23   2   8   63   11  40  12   43   45   2169    3.10
MeSH    A2    1288   108  324   3   18  23   1  20   45   15  25   8   39   58   1975    2.82
MeSH    agg   1458   128  347   3   36  25   2  19   35   12  27   3   38   43   2176    3.11
Manual  A1    1329    89  254   0   29  30   3  10   36   38  33   4   18   49   1922    2.75
Manual  A2    1276    60  226   3   29  27   0  40   41   64  21  14   31   45   1877    2.68
Manual  agg   1367    77  234   3   34  21   2  28   30   67  28   7   22   51   1971    2.82
full    A1    3364  1293  719   3  118  57   6  21  113   72  96  40  200  142   6244    2.97
full    A2    3184  1135  634   6   90  52   2  66   99  101  54  41  259  142   5865    2.79
full    agg   3553  1240  695   6  104  48   5  52   72  107  68  16  223  135   6324    3.01
iaa            .73   .74  .66 .22  .36 .39 .25 .18  .43  .15 .44 .05  .42  .12    .37

Table 4: Number of annotated entities and inter-annotator F1 (iaa) per entity class. We report the statistics across the whole corpus (full) as well as divided by the method the documents were sampled with (DB = DrugBank, MeSH = Medical Subject Headings, Manual = manually compiled medical keywords). Within each sampling method we report the statistics for annotator 1 and 2 and for the adjudicated dataset. Reported agreement scores (iaa, eval. mode S1T1) for all instances across the full corpus.
Relation classes (columns 1-20): (1) cause of, (2) does not prevent, (3) does not treat, (4) has symptom, (5) is contraindicated, (6) is similar to, (7) is type of, (8) may diagnose, (9) may not diagnose, (10) neg influence on, (11) neg interaction, (12) not cause of, (13) pos influence on, (14) pos interaction, (15) prescribed for, (16) prevents, (17) side effect of, (18) treats, (19) worsens, (20) other.

Sample  Ann.    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20  total  av./doc
DB      A1    215    1   13    3    3    8  138    8    0   49   12    7   67   29    9   27   35  197    1   30    852     1.22
DB      A2    157    1   16   12    7    6   92    4    0   31    1   13   75    0    7   29   41  212    1   13    718     1.03
DB      agg   245    1   15   11    4    1  163    7    0   50   11   12   88   28    8   31   52  277    0   39   1043     1.49
MeSH    A1    324    0   10   81    5    3   81   14    1   45    2   10   69    3    1   30   23  132    1   10    845     1.21
MeSH    A2    242    0   12   90    1    6   51    9    2   74    0   20   61    0    0   44   38  121    1    6    778     1.11
MeSH    agg   407    0   11  108    5    5  102   16    3   48    1   18   92    3    0   40   39  166    1   16   1081     1.54
Manual  A1    292    7    7   36    0   13   69   20    1   36    2    7   62    1    0   28    3   48    0   35    667     0.95
Manual  A2    257    8    8   62    0    8   24   17    0   91    0   17   80    0    0   31    9   40    0    8    660     0.94
Manual  agg   331    9   10   57    0   13   71   23    1   78    0   15   83    1    0   41   10   57    0   35    835     1.19
full    A1    831    8   30  120    8   24  288   42    2  130   16   24  198   33   10   85   61  377    2   75   2364     1.13
full    A2    656    9   36  164    8   20  167   30    2  196    1   50  216    0    7  104   88  373    2   27   2156     1.03
full    agg   983   10   36  176    9   19  336   46    4  176   12   45  263   32    8  112  101  500    1   90   2959     1.41
iaa           .48  .59  .52  .59   .0  .55  .45  .56   .0  .25   .0  .24  .38   .0  .12  .56  .54  .58   .5  .08    .35

Table 5: Number of annotated relations and inter-annotator F1 (iaa) per class. We report the statistics across the whole corpus (full) as well as divided by the method the documents were sampled with (DB = DrugBank, MeSH = Medical Subject Headings, Manual = manually researched medical keywords). Within each sampling method we report the statistics for annotator 1 and 2 and for the aggregated dataset. Reported agreement scores (iaa, eval. mode R1S0T0) for all instances across the full corpus.
6. Bibliographical References
Akkasi, A. and Moens, M.-F. (2021). Causal rela-
tionship extraction from biomedical text using deep
neural models: A comprehensive survey. Journal of
Biomedical Informatics, 119:103820.
Alvaro, N., Miyao, Y., and Collier, N. (2017).
TwiMed: Twitter and PubMed comparable corpus of
drugs, diseases, symptoms, and their relations. JMIR
Public Health Surveill, 3(2):e24.
Basaldella, M., Liu, F., Shareghi, E., and Collier, N.
(2020). COMETA: A corpus for medical entity link-
ing in the social media. In Proceedings of the 2020
Conference on Empirical Methods in Natural Lan-
guage Processing (EMNLP), pages 3122–3137, On-
line, November. Association for Computational Lin-
guistics.
Ben Abacha, A., Mrabet, Y., Zhang, Y., Shivade, C.,
Langlotz, C., and Demner-Fushman, D. (2021).
Overview of the MEDIQA 2021 shared task on sum-
marization in the medical domain. In Proceedings
of the 20th Workshop on Biomedical Language Pro-
cessing, pages 74–85, Online, June. Association for
Computational Linguistics.
BioCreative VII (2021). Proceedings of the BioCreative VII Challenge Evaluation Workshop.
Chen, Y. and Hasan, M. (2021). Navigating the kalei-
doscope of COVID-19 misinformation using deep
learning. In Proceedings of the 2021 Conference on
Empirical Methods in Natural Language Processing,
pages 6000–6017, Online and Punta Cana, Domini-
can Republic, November. Association for Computa-
tional Linguistics.
Choudhury, M. D., Counts, S., and Horvitz, E. (2013).
Social media as a measurement tool of depression in
populations. In Proceedings of the 5th ACM International Conference on Web Science (WebSci 2013), Paris, France, May 2013.
Cocos, A., Fiks, A. G., and Masino, A. J. (2017).
Deep learning for pharmacovigilance: recurrent neu-
ral network architectures for labeling adverse drug
reactions in twitter posts. Journal of the American
Medical Informatics Association, 24(4):813–821.
Cornelius, J., Ellendorff, T., Furrer, L., and Rinaldi,
F. (2020). COVID-19 Twitter monitor: Aggregat-
ing and visualizing COVID-19 related trends in so-
cial media. In Proceedings of the Fifth Social Media
Mining for Health Applications Workshop & Shared
Task, pages 1–10, Barcelona, Spain (Online), De-
cember. Association for Computational Linguistics.
Doan, S., Yang, E. W., Tilak, S. S., Li, P. W., Zisook,
D. S., and Torii, M. (2019). Extracting health-
related causality from twitter messages using natural
language processing. BMC Medical Informatics and
Decision Making, 19(3):79.
Duh, M. S., Cremieux, P., Audenrode, M. V., Veke-
man, F., Karner, P., Zhang, H., and Greenberg, P.
(2016). Can social media data lead to earlier detec-
tion of drug-related adverse events? Pharmacoepi-
demiology and drug safety, 25(12):1425–1433.
Giorgi, J. M. and Bader, G. D. (2018). Transfer
learning for biomedical named entity recognition
with neural networks. Bioinformatics, 34(23):4087–
4094, 06.
Habibi, M., Weber, L., Neves, M., Wiegandt, D. L.,
and Leser, U. (2017). Deep learning with word em-
beddings improves biomedical named entity recog-
nition. Bioinformatics, 33(14):i37–i48, 07.
Hossain, T., Logan IV, R. L., Ugarte, A., Matsubara,
Y., Young, S., and Singh, S. (2020). COVIDLies:
Detecting COVID-19 misinformation on social me-
dia. In Proceedings of the 1st Workshop on NLP for
COVID-19 (Part 2) at EMNLP 2020, Online, De-
cember. Association for Computational Linguistics.
Hripcsak, G. and Rothschild, A. S. (2005). Agree-
ment, the f-measure, and reliability in information
retrieval. Journal of the American Medical Infor-
matics Association : JAMIA, 12(3):296–298.
Hu, Y., Huang, H., Chen, A., and Mao, X.-L. (2020).
Weibo-COV: A large-scale COVID-19 social me-
dia dataset from Weibo. In Proceedings of the 1st
Workshop on NLP for COVID-19 (Part 2) at EMNLP
2020, Online, December. Association for Computa-
tional Linguistics.
Jimeno-Yepes, A., MacKinlay, A., Han, B., and
Chen, Q. (2015). Identifying diseases, drugs, and
symptoms in twitter. In MEDINFO 2015: eHealth-enabled Health – Proceedings of the 15th World Congress on Health and Biomedical Informatics, pages 643–647.
Karimi, S., Metke-Jimenez, A., Kemp, M., and Wang,
C. (2015). Cadec: A corpus of adverse drug
event annotations. Journal of Biomedical Informat-
ics, 55:73–81.
Karisani, P. and Agichtein, E. (2018). Did you re-
ally just have a heart attack? Towards robust de-
tection of personal health mentions in social me-
dia. In Proceedings of the 2018 World Wide Web
Conference, page 137–146, Republic and Canton of
Geneva, CHE.
Klein, A., Sarker, A., Rouhizadeh, M., O’Connor,
K., and Gonzalez, G. (2017). Detecting personal
medication intake in Twitter: An annotated corpus
and baseline classification system. In BioNLP 2017,
pages 136–142, Vancouver, Canada, August. Asso-
ciation for Computational Linguistics.
Kuhn, M., Letunic, I., Jensen, L. J., and Bork, P.
(2016). The SIDER database of drugs and side ef-
fects. Nucleic acids research, 44:D1075–1079.
Lamurias, A., Sousa, D., Clarke, L. A., and Couto,
F. M. (2019). Bo-lstm: classifying relations via long
short-term memory networks along biomedical on-
tologies. BMC Bioinformatics, 20(1):10, 01.
Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H.,
and Kang, J. (2019). Biobert: a pre-trained biomed-
ical language representation model for biomedical
text mining. Bioinformatics, 09.
Limsopatham, N. and Collier, N. (2016). Normalis-
ing medical concepts in social media texts by learn-
ing semantic representation. In Proceedings of the
54th Annual Meeting of the Association for Compu-
tational Linguistics (Volume 1: Long Papers), pages
1014–1023, Berlin, Germany, August. Association
for Computational Linguistics.
Lin, C., Miller, T., Dligach, D., Bethard, S., and
Savova, G. (2019). A BERT-based universal model
for both within- and cross-sentence clinical temporal
relation extraction. In Proceedings of the 2nd Clin-
ical Natural Language Processing Workshop, pages
65–71, Minneapolis, Minnesota, USA, 06. Associa-
tion for Computational Linguistics, Association for
Computational Linguistics.
Magge, A., et al., editors (2021a). Proceed-
ings of the Sixth Social Media Mining for Health
(#SMM4H) Workshop and Shared Task, Mexico
City, Mexico, June. Association for Computational
Linguistics.
Magge, A., Tutubalina, E., Miftahutdinov, Z., Alimova,
I., Dirkson, A., Verberne, S., Weissenbacher, D., and
Gonzalez-Hernandez, G. (2021b). DeepADEMiner:
a deep learning pharmacovigilance pipeline for ex-
traction and normalization of adverse drug event
mentions on Twitter. Journal of the American Medi-
cal Informatics Association, 28(10):2184–2192, 07.
Mattern, J., Qiao, Y., Kerz, E., Wiechmann, D., and
Strohmaier, M. (2021). FANG-COVID: A new
large-scale benchmark dataset for fake news detec-
tion in German. In Proceedings of the Fourth Work-
shop on Fact Extraction and VERification (FEVER),
pages 78–91, Dominican Republic, November. As-
sociation for Computational Linguistics.
Nikfarjam, A., Sarker, A., O’Connor, K., Ginn, R., and
Gonzalez, G. (2015). Pharmacovigilance from so-
cial media: mining adverse drug reaction mentions
using sequence labeling with word embedding clus-
ter features. Journal of the American Medical Infor-
matics Association, 22(3):671–681, 03.
Paul, M. J. and Dredze, M. (2012). A model for min-
ing public health topics from Twitter. Health, 11(1).
Saakyan, A., Chakrabarty, T., and Muresan, S. (2021).
COVID-fact: Fact extraction and verification of real-
world claims on COVID-19 pandemic. In Proceed-
ings of the 59th Annual Meeting of the Association
for Computational Linguistics and the 11th Interna-
tional Joint Conference on Natural Language Pro-
cessing (Volume 1: Long Papers), pages 2116–2129,
Online, August. Association for Computational Lin-
guistics.
Sahu, S., Anand, A., Oruganty, K., and Gattu, M.
(2016). Relation extraction from clinical texts us-
ing domain invariant convolutional neural network.
In Proceedings of the 15th Workshop on Biomedi-
cal Natural Language Processing, pages 206–215,
Berlin, Germany, August. Association for Computa-
tional Linguistics.
Sarker, A., O’Connor, K., Ginn, R., Scotch, M., Smith,
K., Malone, D., and Gonzalez, G. (2016). Social
media mining for toxicovigilance: automatic moni-
toring of prescription medication abuse from Twitter.
Drug safety, 39(3):231–240.
Scepanovic, S., Martin-Lopez, E., Quercia, D., and
Baykaner, K. (2020). Extracting medical entities
from social media. In Proceedings of the ACM Con-
ference on Health, Inference, and Learning, CHIL
’20, pages 170–181. Association for Computing Ma-
chinery. event-place: Toronto, Ontario, Canada.
Seiffe, L., Marten, O., Mikhailov, M., Schmeier, S.,
Möller, S., and Roller, R. (2020). From witch's
shot to music making bones - resources for med-
ical laymen to technical language and vice versa.
In Proceedings of the 12th Language Resources
and Evaluation Conference, pages 6185–6192, Mar-
seille, France, May. European Language Resources
Association.
Sousa, D., Lamurias, A., and Couto, F. M. (2021). Us-
ing Neural Networks for Relation Extraction from
Biomedical Literature, pages 289–305. Springer
US, New York, NY.
Stefanidis, A., Vraga, E., Lamprianidis, G.,
Radzikowski, J., Delamater, P. L., Jacobsen,
K. H., Pfoser, D., Croitoru, A., and Crooks, A.
(2017). Zika in twitter: Temporal variations of
locations, actors, and concepts. JMIR Public Health
Surveill, 3(2):e22, Apr.
Sullivan, R., Sarker, A., O’Connor, K., Goodin, A.,
Karlsrud, M., and Gonzalez, G. (2016). Finding
potentially unsafe nutritional supplements from user
reviews with topic modeling. In Biocomputing 2016,
pages 528–539, Kohala Coast, Hawaii, USA, Jan-
uary.
Thorne, C. and Klinger, R. (2017). Towards confi-
dence estimation for typed protein-protein relation
extraction. In Proceedings of the Biomedical NLP
Workshop associated with RANLP 2017, pages 55–
63, Varna, Bulgaria, September.
Uzuner, O., South, B. R., Shen, S., and DuVall,
S. L. (2011). 2010 i2b2/VA challenge on concepts,
assertions, and relations in clinical text. Journal
of the American Medical Informatics Association,
18(5):552–556, 06.
Wang, C. and Fan, J. (2014). Medical relation extrac-
tion with manifold models. In Proceedings of the
52nd Annual Meeting of the Association for Compu-
tational Linguistics (Volume 1: Long Papers), pages
828–838, Baltimore, Maryland, June. Association
for Computational Linguistics.
Weber, A. M., Gupta, R., Abdalla, S., Cislaghi,
B., Meausoone, V., and Darmstadt, G. L. (2021).
Gender-related data missingness, imbalance and bias
in global health surveys. BMJ global health, 6(11).
Wegrzyn-Wolska, K., Bougueroua, L., and Dz-
iczkowski, G. (2011). Social media analysis for e-
health and medical purposes. In 2011 International
Conference on Computational Aspects of Social Net-
works (CASoN), pages 278–283.
Wishart, D. S., Feunang, Y. D., Guo, A. C., Lo, E. J.,
Marcu, A., Grant, J. R., Sajed, T., Johnson, D., Li,
C., Sayeeda, Z., Assempour, N., Iynkkaran, I., Liu,
Y., Maciejewski, A., Gale, N., Wilson, A., Chin, L.,
Cummings, R., Le, D., Pon, A., Knox, C., and Wil-
son, M. (2018). DrugBank 5.0: a major update to
the DrugBank database for 2018. Nucleic acids re-
search, 46:D1074–D1082.
Yang, F.-C., Lee, A. J., and Kuo, S.-C. (2016). Mining
health social media with sentiment analysis. Journal
of Medical Systems, 40(11):236, Sep.
Yin, Z., Fabbri, D., Rosenbloom, S. T., and Malin,
B. (2015). A scalable framework to detect personal
health mentions on Twitter. Journal of Medical In-
ternet Research, 17(6):e138, 06.
Appendix
Example Terms from the Sampling Methods
Table 6 shows terms from each sampling method.
Source     Example terms
DrugBank   advil, Benzylamine, Cobalt, S-Acetyl-Cysteine, Wellbutrin, zzzquil
MeSH       Anaphylaxis, Cough, Drainage, Hospitalization, Neoplasms, Self-Testing
Manual     #antivaxxer, #cancersucks, #depressionisreal, #mswarrior, #plantbasedhealing, #SocialDistancing

Table 6: Example terms from each sampling method.
Additional Examples from the Corpus
Table 7 shows the shortest and longest tweets in the
dataset.
id  #words  tweet
1        4  bpd symptoms on 1000
2        4  Increasing pain unlocked #PTSDAwarenessDay
3      114  @username [...] @username I've been to every hospital in my region, "Sorry, can't help you" I don't want drugs, I want my back fixed. I know for a fact the tech is there. They don't want the liability. They should just Quit Medicine! I'm called inoperable with intractable pain, none will help. No Pain RX

Table 7: Longest and shortest tweet in the dataset.
Annotation Aggregation Strategies
We provide an aggregated version of the dataset which
adjudicates both annotators’ results. In general, our
strategy is motivated by a high recall approach to en-
sure we do not lose any annotated perspectives on the
data. When combining the annotations, we choose the
longest overlapping sequence between two instances.
We prefer more frequent entity and relation classes over
less frequent ones, and choose more general concepts
over more specific ones. We aggregate in two steps by
first aligning the entity annotations, followed by aggre-
gating the relations.
Entities With regards to the entity span, we use the
longest overlapping span between A1’s and A2’s an-
notation. In cases in which they disagree on the en-
tity type, we choose the more frequent class. Exceptions
are the entity classes treatment and biochem. For those
classes, one subgroup is more general than the other.
If both annotators agree on the major class (treat), but
disagree on the subtype (drug vs. therapy), we aggregate to the more general one, i.e., treat therapy or biochem substance, respectively.
For cases in which one annotator labeled an entity as
other while the second annotator chose a different en-
tity class, we aggregate to the more frequent entity
class. However, if the annotator used other to model
a relation, we keep the entity as other to keep the rela-
tion intact and valid.⁷
If one annotator labeled an entity, but the other one did
not, we generally follow a high recall approach and add
this entity to the aggregated document. However, we
additionally check if the annotator who marked the en-
tity used it to model a relation. If the relation is valid
(i.e. the involved entities are allowed to be connected),
we use the entity, otherwise it is dropped.
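A condensed sketch of this entity alignment step is given below. It assumes character-offset spans and a precomputed frequency table of entity classes; "longest overlapping sequence" is read here as the union of the two overlapping spans, and the special handling of other and of unmatched entities is omitted. Names and tie-breaking details are illustrative.

```python
def char_overlap(e1: dict, e2: dict) -> int:
    """Number of overlapping characters between two entity spans (end offsets exclusive)."""
    return max(0, min(e1["end"], e2["end"]) - max(e1["start"], e2["start"]))

def aggregate_entity(e1: dict, e2: dict, class_freq: dict) -> dict:
    """Merge two overlapping entity annotations following the rules described above."""
    start, end = min(e1["start"], e2["start"]), max(e1["end"], e2["end"])
    t1, t2 = e1["type"], e2["type"]
    if t1 == t2:
        label = t1
    elif {t1, t2} == {"treat drug", "treat therapy"}:
        label = "treat therapy"              # same major class: back off to the more general subtype
    elif {t1, t2} == {"biochem process", "biochem substance"}:
        label = "biochem substance"
    else:
        label = max((t1, t2), key=lambda t: class_freq.get(t, 0))  # prefer the more frequent class
    return {"start": start, "end": end, "type": label}
```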
Relations To adjudicate the relation annotation, we
identify cases in which both annotators agreed on the
fact that there is any type of relation between a given
entity pair. First, we check if the relation tags are valid
(i.e. the involved entities are allowed to be connected).
If one of them is invalid, we choose the valid one for
the aggregated version. If both are invalid, the relation
is dropped. If they are both valid, we choose the more
frequent relation class. One exception to this rule con-
cerns cases in which one annotator identified an other
relation while the second annotator chose a different
relation class. Here, the tag other indicates a vague re-
lation which is not in line with our aim to adjudicate
to the more specific class. Therefore, we can not re-
solve this by simply assigning the more frequent label,
because some of the small relation classes are less fre-
quent than the class other. A1 and A2 consequently revisit those cases (11 instances) and decide jointly which relation type should be added to the aggregated version.

⁷ In the guidelines annotators are instructed to prioritize assigning an accurate relation over an accurate entity type. In some cases this means they may default to an other entity if the relation they want to model is not allowed for a particular entity pair.
For annotations in which A1 and A2 only agreed on
one of the involved entities, we follow a high recall
approach and keep both relations for the adjudicated
version of the data as long as the relations are valid. Fi-
nally, we consider cases in which one annotator did not
label any relation while the other identified one. For
those, we hypothesize that they are ambiguous and that
the missing relation reflects that (i.e. that the relation
marked by one of the annotators might be covering a
political claim about a medical topic). In an effort not
to lose these borderline cases, we add them to the ag-
gregation as long as the relation is valid.
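Under the same assumptions, and reusing the is_valid check sketched in Section 3.2.2, the relation adjudication for an entity pair that both annotators related could look roughly as follows; the field names and the handling of the unresolved cases are illustrative.

```python
from typing import Optional

def aggregate_relation(r1: dict, r2: dict, class_freq: dict) -> Optional[dict]:
    """Adjudicate two relation annotations over the same aligned entity pair."""
    v1 = is_valid(r1["type"], r1["source_type"], r1["target_type"])
    v2 = is_valid(r2["type"], r2["source_type"], r2["target_type"])
    if not v1 and not v2:
        return None                    # both invalid: drop the relation
    if v1 != v2:
        return r1 if v1 else r2        # exactly one valid: keep the valid annotation
    if r1["type"] != r2["type"] and "other" in (r1["type"], r2["type"]):
        return None                    # other vs. a specific class: left for the joint manual decision
    # both valid and specific: keep the more frequent relation class
    return max((r1, r2), key=lambda r: class_freq.get(r["type"], 0))
```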