
From Witch's Shot to Music Making Bones -- Resources for Medical Laymen to Technical Language and Vice Versa



Many people share information on social media or in forums, such as the food they eat, the sports they do or the events they have visited. This also applies to information about a person's health status. The information we share online directly or indirectly reveals our lifestyle and health situation and thus provides a valuable data resource. If we can take advantage of that data, applications can be created that enable, e.g., the detection of possible risk factors of diseases or adverse drug reactions of medications. However, as most people are not medical experts, the language they use tends to be more descriptive than the precise medical expressions used by medical professionals. To detect and use this relevant information, laymen language has to be translated and/or linked to the corresponding medical concept. This work presents baseline data sources to address this challenge for German. We introduce a new data set which annotates medical laymen and technical expressions in a patient forum, along with a set of medical synonyms and definitions, and present first baseline results on the data.
Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 6185–6192
Marseille, 11–16 May 2020
European Language Resources Association (ELRA), licensed under CC-BY-NC
Laura Seiffe, Oliver Marten, Michael Mikhailov, Sven Schmeier,
Sebastian Möller, Roland Roller
German Research Center for Artificial Intelligence (DFKI),
Speech and Language Technology Lab, Berlin, Germany
Keywords: medical laymen to technical language, text simplification, concept normalization
1. Introduction
Every day, people generate and share information online
which sheds light on their lifestyle and, to a certain extent,
on their health situation. The information provided might
include data about sports activities, food, alcohol and drug
intake, but also, indirectly, about potential risk factors of
diseases or possible adverse drug reactions, see e.g. Abbar
et al. (2015) or Weissenbacher et al. (2018). Mining adverse
drug reactions, for instance, is highly relevant for the general
public as well as for pharmaceutical companies. As the
level of medication intake is generally increasing all over
the world, so is the risk of unwanted side effects
(Karapetiantz et al., 2018).
In most cases, models to extract health related information
from text are trained on large annotated data sets, mainly
in English, and on well formed sentences. Text in social
media and forums, but also in emails, can differ in terms
of sentence structure, writing style and word usage from
news articles or scientific publications. For health related
information in particular, the language used might be more
casual and descriptive than the precise medical expression,
as most people are not medical experts. This makes it
difficult to identify the precise technical expression and to
link it to a unique concept in a biomedical ontology, in
order to e.g. gather further background knowledge. For
instance, referring to the title of this work, patients might
use laymen expressions such as ‘Hexenschuss’ (lit.: ‘a
witch’s shot’, known as ‘lumbago’) or ‘Musizierknochen’
(lit.: ‘music making bone’, aka ‘funny bone’ or ‘ulnar
nerve’) rather than their technical equivalent.
Conversely, medical language might be difficult to
understand for non-experts. Technical terms and a special
language use make it difficult for patients to get easy access
to information that concerns them. Medical science is built
on a vast number of technical expressions that are not
necessarily part of a patient’s everyday language. The
majority of the clinical lexicon has its origin in Latin or Greek.
Although access to information is crucial for keeping
track of personal conditions, for most patients the structure
of medical language remains obscure. Thus, understanding
medical articles and, most importantly, understanding
our own clinical reports written by our attending
doctor may pose some challenges. In order to understand
a possibly serious health condition faster, automatic
methods might help to simplify technical language. However,
as most resources concern English, a technical-laymen
translation (and vice versa) for languages other than
English raises further issues.
To address those challenges, this work introduces new data
sets for German which support the linking of medical lay-
men language to technical language. Firstly we introduce a
new corpus which annotates medical laymen language and
technical language in a patient forum. Additionally we in-
troduce two data sets which include different synonyms of
medical concepts and sort them by complexity (rather tech-
nical to rather laymen). All data sets described in this paper
will be made available1. Our corpus in combination with
the additional resources can serve as a baseline to train and
to evaluate systems to map laymen into technical language
and vice versa.
2. Related Work
In recent years, the biomedical domain has become an im-
portant field of research for natural language processing
tasks. Enhancing the patient’s understanding of clinical
texts is one major objective. The automatic processing of
medical free text is one obstacle that is addressed by these
research efforts. One step towards the processing is the
mapping from free text-expressions to structured represen-
tations of domain knowledge. This includes the detection
of technical terms and the normalization to an appropriate
knowledge base. Synonymous expressions, terminological
variants and paraphrases as well as spelling mistakes and
abbreviations occur frequently in natural texts. By linking
them to one unique concept, the lexical information in the
text is structured and unified. In the context of medical
language, different approaches address the normalization of
medical concepts, such as Leaman et al. (2013), Suominen et
al. (2013) or Doğan et al. (2014).
Systems and methods that particularly address the transi-
tion from medical technical language to lay language of-
ten pursue similar approaches. Under these conditions,
the linked knowledge base must provide lay language syn-
onyms or simplified explanations for technical terms. In
Zeng-Treitler et al. (2007), the Unified Medical Language
System (UMLS) and especially the Consumer Health Vo-
cabulary (CHV) are used as sources of lay vocabulary
knowledge. Abrahamsson et al. (2014) conduct a synonym
replacement for medical Swedish, using a system which as-
sesses the difficulty of technical terms. If the technical term
is considered as more difficult than the corresponding entry
in the Swedish MeSH, the terms are replaced.
Apart from approaches that aim at simplifying technical
language, the mapping of laymen language to medical
technical expressions has also gained traction. Social media
texts are a thriving resource for genuine lay language
use. Recognizing meaningful elements and linking these
expressions to technical counterparts allows structured
insights into health status or health related behaviour.
For example, O’Connor et al. (2014) create a data set of
annotated tweets with potential adverse drug reactions. The
authors test a lexicon-based approach to detect the concepts
of interest. Limsopatham and Collier (2015) improve this
baseline in order to normalize medical terms from social
media messages using a phrase-based machine translation
technique. The authors also present a system which learns
the transition between lay language used in social media
and the formal medical language used in descriptions of
medical concepts in a standard ontology (Limsopatham and
Collier, 2016).
Recently the Shared Task of Social Media Mining for
Health (SMM4H) has gained much interest and targets this
topic as well. Some of the tasks involve for instance classi-
fication of tweets presenting adverse drug reactions or vac-
cine behavior mentions, see Weissenbacher et al. (2019) for
more information.
Now that we have introduced work on making technical
expressions more comprehensible and methods to map laymen
expressions to their precise equivalent and vice versa,
one question still remains open: What actually are laymen
expressions and how are medical technical expressions defined?
Previous and related work does not provide a clear definition
for either. Elhadad and Sutaria (2007) make use of the
contrast between a text written by a medical professional
(scientific articles) and a text written by a journalist
addressing a lay audience. They consider a term an
appropriate lay expression if it is the most frequent
candidate in the lay texts.
Chen et al. (2017) provide a method to rank medical terms
extracted from electronic health records. The higher a term
is ranked, the more urgently a lay translation is needed.
Therefore they consider unithood, termhood, unfamiliarity
and quality of compound term as relevant criteria for terms
that must be translated for a lay audience. In contrast to
these vague definitions, Grabar and Hamon (2014) concen-
trate on terms that show neoclassical compounding word
formation. Consequently words with Latin or Greek roots
are seen as technical terms.
Definition 1: (a) A medical technical term is that
which is used by physicians whereas (b) a medical
lay term can be easily understood by patients
(medical non-experts).
Definition 2: (a) A medical term which includes (at
least in parts) words with a Latin or Greek origin is
defined as medical technical term. (b) All other terms
belong to lay language. Lay terms are based on
everyday words/language.
Table 1: Definitions used in this work of medical technical
terms and laymen expressions
As there is no clear definition for technical and lay
expressions, we decided to incorporate the mentioned aspects
and use the definitions in Table 1. Neither definition is
entirely satisfactory. The first definition is subjective,
depends on the background of a person and potentially
requires a manually generated gold standard data set.
Moreover, there might be words which belong to both groups,
as they are used by physicians and at the same time are
understood by patients, such as cancer. The second
definition makes it much easier to distinguish between both
language types. However, words with Latin or Greek roots can
also be very common in daily language and thus be easily
understood by medical non-experts, such as hallucination.
3. Technical-Laymen Corpus
This section introduces the Technical-Laymen Corpus
(TLC), an annotated corpus based on the forum Med1.de2.
Med1 is a German patient forum that covers a large variety
of health related topics. Users are non-professionals who
seek exchange, opinions and advice. Med1 is freely accessible
and the discussions can be read without being registered.
A registration is necessary to participate in the discussion.
The operating team of Med1 does not provide medical
consultation; however, they guide the community in terms of
netiquette. The users are anonymous and only their
usernames are known to us. We would have been prepared to
anonymize any personal data but we did not encounter data
that could be linked to an individual.
Forum   Example                                               Translation
Kidney  Ja. Der Termin ist tatsächlich durch. Ich wurde       Yes, the appointment is really over. The renal
        an den Nieren geschallt die dort unauffällig          ultrasound showed no pathologies. (no idea what
        aussehen. (Kp was das schon ausschließt) 24h Urin     it can rule out) I gave 24 urine sample and a 24h
        [...]urde abgegeben und eine 24h                      blood pressure test was ordered. They have
        Blutdruckuntersuchung angeordnet. Die haben mich      analyzed me completely: EKG, blood analysis,
        komplett zerlegt: EKG Blut Spontanurin.               urine [...]

        Hallo, ich bin momentan sehr verunsichert, meine      Hi, I am very unsure at the moment, my doctors
        Ärzte sind nicht gleicher Meinung, die einen          have different opinions, some doctors say that my
        Ärzte sagen meine Nieren sehen nicht gut aus, die     kidneys are not looking well, the others say that
        anderen sagen, solange der GFR nicht fällt muss       I should not be worried until GFR decreases, but
        ich mir keine Gedanken machen, was stimmt den [...]   what is right?
Table 2: Excerpt of patient forum in German and (translated) English
Tag  Example                                                         Annotation
L    Blut im Urin (blood in urine)                                   Hämaturie (haematuria)
     Hexenschuss (lit.: a witch’s shot)                              Lumbago (lumbago)
     Eiweissverlust über die Nieren (protein loss through kidneys)   Proteinurie (proteinuria)
     Durchfall (lit.: fall through)                                  Diarrhö (diarrhea)
     [...]ummerung (smashing of kidney stones)                       Extrakorporale Stoßwellenlithotripsie (extracorporeal shockwave therapy)
T    Aerophagie (aerophagy)                                          Luftschlucken (air swallowing)
     Appendizitis (appendicitis)                                     Blinddarmentzündung (appendix infection)
Table 3: Annotated examples of both tags (Lay, Technical) from the Technical-Laymen Corpus, including translations
We are mainly interested in the medical language that is
used by patients and medical laymen. A non-professional
forum is likely to be the richest source of lay language
use. A corpus consisting of this kind of data should give the
most realistic impression of medical lay language. The
annotation of technical and lay expressions should provide
valuable insights into the relationship of technical and lay
language.
For this work we selected two subforums, namely kidney
diseases and stomach and intestines as text source. Each
subforum provides a variety of user questions (“threads”),
each containing a varying number of corresponding an-
swers. We crawled posts of the two subforums, including
the time of posting, the author’s nickname and the thread
title. As the forum continuously grows, the corpus only
represents the forum’s status of the crawling date. Table
2 shows two exemplary sentences from the patient forum.
The examples show characteristic entries in the forum, in-
cluding a specific syntax and spelling errors.
3.1. Annotation Schema
We are mainly interested in terms and expressions that are
used by medical non-professionals, as those provide a large
variety which cannot be entirely covered in medical
dictionaries. However, as people might undergo lifelong
treatment (kidney diseases are chronic diseases), patients
are often well informed and also frequently use technical
terms and abbreviations. For a newcomer these might be
difficult to understand. Thus, we also target the other
direction: the detection of technical terms in order to
simplify them. Our annotation involves two different
concepts: (1) lay expressions and (2) technical expressions.
For both, we mainly focus on symptoms, diseases, as well
as treatments and examinations. However, annotators were
free to also label information that goes beyond this focus
(e.g. body parts, medication).
For a lay expression, annotators were asked to include
the corresponding technical counterpart as well, and for
a technical expression, the most common lay expression.
We opt for a single word counterpart. If this is not possible,
we choose a paraphrase or a short, appropriate explanation.
Abbreviations are treated accordingly: If the abbreviation
is presumably known to a layman or is even typical
layman use (e.g. KKH for “Krankenhaus”, hospital),
we annotate it as a lay expression. If the abbreviation is
untypical or unlikely to be known to a patient (e.g. NBE
for “Nierenbeckenentzündung”, inflammation of the renal
pelvis), we treat it as a technical term. In both cases we add
the expanded version. Table 3 presents examples of the
categories including their English translation.
3.2. Annotation Setup and Process
The annotation was then carried out by two medical
students over several iterations using the brat3 annotator
tool (Stenetorp et al., 2012). The first annotation cycle
concentrated on medically obvious cases. This means that we
focused on medically clear translations from lay to technical
language or vice versa. For example, the term
“Normotonie” (normotonia) is assigned the tag technical and the
corresponding lay expression “normaler Blutdruck” (normal
blood pressure) is given as free text.
However, the results of the first cycle were not yet
satisfactory, as most translations were already well documented
in existing vocabularies. Therefore we extended the
annotations by including cases in which a non-professional
describes a medical concept in such a way that a definite
technical translation is difficult. For example, if a user
describes problems with passing water (“Probleme beim
Wasserlassen”), a possible technical equivalent could be
dysuria.
From the medical point of view, this procedure is difficult
because it involves, to some extent, interpretation work:
While problems with passing water is only a rough symptom
description, dysuria is a pathological state. The transfer
from a symptom description to a disease can be seen as a
kind of diagnostic process, which must be avoided at that
point. As the annotation was carried out by medical
students, we trusted their expertise to decide at which point the
annotation would exceed a reasonable interpretation. Thus
we do not opt for a diagnostic interpretation of symptoms.
In order to retrace such cases, the annotators highlighted
annotated terms that came close to a critical interpretation.
Within a final iteration one of the authors examined the
annotations and highlighted potential errors (wrong labels,
missing information etc.). The highlighted items were then
manually examined again, in order to provide a corpus of
appropriate quality.
3.3. Corpus Analysis
Table 4 provides an overview of TLC. The table lists
for each forum topic the number of included files, the number
of tokens, as well as the average number of tokens per file
and the average number of annotations per file. Note that
not all files included relevant information to be annotated.
A more detailed overview of the annotated information
itself is presented in Table 5. The table lists the overall
number and the number of unique annotations for each
label. As the table shows, the most frequently annotated
label is the lay expression. Moreover, lay expressions also
show the largest variety in terms of unique terms. This
is plausible and highlights the importance of detecting
laymen expressions.
                          Kidney    Stomach-Intestines
Number of files           2000      2000
Number of tokens          203,553   234,914
Avg. tokens / file        101.78    117.46
Avg. annotations / file   2.52      1.41
Table 4: General overview of the Med1 corpus
4. Additional Resources and Methods to
Process Technical-Laymen Language
In addition to the Technical-Laymen Corpus we extract data
from two additional resources: UMLS and Wiktionary. We
aim at providing assorted data sets which incorporate a
matching of technical and laymen language in the
biomedical domain. Both resources are processed and can be used
to support the linking from laymen to technical terms and
vice versa. However, as neither resource systematically
distinguishes between lay and technical terms, we
additionally propose a simple method to identify technical (and
less technical) terms.
Label            #Annotations   #Unique
Lay Expression   4727           1246
Technical Term   1745           376
Table 5: Overview of the number of annotated and unique
concepts for each category label.
4.1. UMLS Synonym Subset
The Unified Medical Language System (UMLS) is a
biomedical ontology and knowledge source. The
Metathesaurus of UMLS provides a vocabulary database for the
biomedical and health domain. Synonymous expressions
are linked by the same concept unique identifier (CUI). The
same CUI also links equivalent expressions in different
languages. The Semantic Network of UMLS categorises all
terms into broad subject categories, providing a
categorization into 127 semantic types (STY) and 54 relation types
(RL). Overall, UMLS includes over 34 million concepts in
English, but only approximately 100,000 in German.
Roughly half of those concepts include at least two
mentions. While the German UMLS subset is relevant for
concept normalization in general, concepts including synonyms
are particularly interesting, as they might
include technical and laymen expressions.
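The selection of German concepts with at least two mentions can be sketched as follows. This is a minimal sketch, not the authors' pipeline: it assumes the mentions are read from the pipe-delimited Metathesaurus file MRCONSO.RRF (CUI in the first column, language code in the second, string in the fifteenth), and the rows and CUIs in the example are invented toy data.

```python
from collections import defaultdict

# Column positions in a pipe-delimited MRCONSO.RRF row (assumed input format).
CUI_COL, LAT_COL, STR_COL = 0, 1, 14


def make_row(cui, lat, string):
    """Build a toy MRCONSO-style row; all other columns are left empty."""
    fields = [""] * 18
    fields[CUI_COL], fields[LAT_COL], fields[STR_COL] = cui, lat, string
    return "|".join(fields) + "|"


def german_synonym_sets(rows, min_mentions=2):
    """Group German mentions by CUI and keep concepts with enough distinct mentions."""
    mentions = defaultdict(set)
    for row in rows:
        fields = row.rstrip("\n").split("|")
        if len(fields) > STR_COL and fields[LAT_COL] == "GER":
            mentions[fields[CUI_COL]].add(fields[STR_COL])
    return {
        cui: sorted(strings)
        for cui, strings in mentions.items()
        if len(strings) >= min_mentions
    }


# Toy rows with hypothetical CUIs; real rows would be read from the file.
toy_rows = [
    make_row("C9999991", "GER", "Hämaturie"),
    make_row("C9999991", "GER", "Blut im Urin"),
    make_row("C9999991", "ENG", "hematuria"),
    make_row("C9999992", "GER", "Lumbago"),
]
print(german_synonym_sets(toy_rows))
```

Only the first toy concept survives the filter, because the second has a single German mention and therefore cannot contain a technical/lay synonym pair.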
4.2. Wiktionary Synonym Subset
Our second resource is built from the German version of
Wiktionary4. Wiktionary provides 741,260 (Jan 2019)
entries in German. Although biomedical information is not
a special focus of Wiktionary, there is a large range of
related subcategories. To create our technical/laymen
language resource, we downloaded the German Wiktionary
dump (the newest as of November 2019), parsed it and
automatically gathered for each entry the term, its
explanation and, if available, synonyms.
Term              Explanation                                               Synonym
Dialyse           Anwendung der Dialyse, vor allem zur Reinigung von Blut   Blutreinigung; Blutw[...]
Diabetes          Stoffwechselerkrankung, bei der eine gesteigerte          Zuckerkrankheit; Zucker
                  Unempfindlichkeit gegenüber Insulin besteht (sogenannter
                  Diabetes mellitus Typ 2 oder Typ-2-Diabetes oder
                  Altersdia[...]
Delirium tremens  Ernste und potentiell lebensbedrohende Komplikation im    Alkoholdelir; Önomanie; S[...]
                  Alkoholentzug bei einer schon länger bestehenden
                  Alko[...]
Table 6: Example of extracted information from Wiktionary
CUI       English       German        Spanish       French        Swedish     Russian
C0007097  carcinoma     Karzinom      carcinoma     carcinome     Karcinom    KARTSINOMA
C0012503  Dioxins       Dioxine       Dioxinas      Dioxines      Dioxiner    DIOKSINY
C0023531  Leukoplakia   Leukoplakie   Leucoplaquia  Leucoplasie   Leukoplaki  LEUKOPLAKIJA
C0027804  Neurasthenia  Neurasthenie  neurastenia   Neurasthénie  Neurasteni  NEVRASTENIIA
Table 7: Similar mentions of different languages in UMLS linked by the same concept unique identifier (CUI).
Our focus is the biomedical domain, thus we limited the
data by selecting medically related entries only. These entries
come from the categories Medicine, Pharmacy,
Pharmacology, Anatomy, Psychiatry, Psychology, Physiology,
Ophthalmology, Pathology, Dentistry, Gynaecology and
Dermatology. Additionally, we included every entry that
contains, at any place, a match for the regular expression
krank (sick), which should relate to mentions of diseases.
By doing so, the resulting resource is larger than necessary
(e.g. some veterinary entries are included). However, we
ensure that we make use of all entries that could be relevant.
Only entries of the mentioned categories were used for our
resource. The final biomedical Wiktionary subset comprises
4468 concepts, nearly all of which include a definition. 2155
of the entries include at least one synonym. Overall this
subset includes 8657 different entries.
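The category and keyword filter for Wiktionary entries might look roughly like the sketch below. The entry layout (a dict with the keys term, categories, explanation and synonyms) is hypothetical, the category names are taken as written in the text rather than from the dump, and matching krank case-insensitively is an assumption made here so that capitalised nouns such as "Krankheit" are also found.

```python
import re

# Category names as listed in the text; the labels used in the dump may differ.
MEDICAL_CATEGORIES = {
    "Medicine", "Pharmacy", "Pharmacology", "Anatomy", "Psychiatry",
    "Psychology", "Physiology", "Ophthalmology", "Pathology", "Dentistry",
    "Gynaecology", "Dermatology",
}

# Assumption: case-insensitive matching of the keyword pattern.
KRANK = re.compile(r"krank", re.IGNORECASE)


def is_medical(entry):
    """Keep an entry if it has a medical category or mentions 'krank' anywhere."""
    if MEDICAL_CATEGORIES & set(entry["categories"]):
        return True
    searchable = " ".join([entry["term"], entry["explanation"], *entry["synonyms"]])
    return bool(KRANK.search(searchable))


# Toy entries in a hypothetical parsed form.
toy_entries = [
    {"term": "Diabetes", "categories": ["Medicine"],
     "explanation": "Stoffwechselerkrankung",
     "synonyms": ["Zuckerkrankheit", "Zucker"]},
    {"term": "Krankenwagen", "categories": [],
     "explanation": "Fahrzeug", "synonyms": []},
    {"term": "Haus", "categories": [],
     "explanation": "Gebäude", "synonyms": []},
]
selected = [entry["term"] for entry in toy_entries if is_medical(entry)]
print(selected)
```

The second toy entry illustrates why the keyword rule makes the resource larger than strictly necessary: it is kept although it has no medical category.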
Even though the data set appears to be small in
comparison to UMLS, an interesting aspect of Wiktionary is the
variety of laymen synonyms. It includes lay expressions
which are often not covered by UMLS. Table 6 shows some
examples: Diabetes, for instance, is characterized by
recurrent or persistent high blood sugar. A non-professional
German term for diabetes is “Zuckerkrankheit” (lit.: sugar
disease) or simply “Zucker” (sugar). These terms, even
though frequently used, are not listed in UMLS. The large
variety covers not only lay expressions for the respective
technical term but also colloquial or even vulgar terms.
For example, the entry for “Diarrhoe” (diarrhea) lists as
synonyms “Schnelle Katharina” (fast Katharina) and
“Flotter Otto” (quick Otto).
4.3. Aligning data sets
UMLS is frequently used for concept normalization and it
comprises many more concepts than the Wiktionary
subset. Conversely, Wiktionary appears to be a highly useful
resource as it contains more casual expressions in a medical
context. For this reason we try to combine both data sets.
To do so, we identify expressions from Wiktionary which
also occur in UMLS. If a term from Wiktionary occurs
within exactly one CUI in UMLS, we can simply align
the Wiktionary concept with all its synonyms to this CUI.
For instance, if the Wiktionary term ‘pain’ occurs only in
the context of one single UMLS CUI, we can map the
Wiktionary term ‘pain’ and all its synonyms to this CUI.
However, this is not possible in all cases, as terms in
UMLS might be assigned to various CUIs.
In this way, 768 CUIs can be extended by overall 3082
additional mentions. We refer to the resulting data set as
Wiktionary-UMLS (WUMLS).
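The alignment rule can be illustrated with a short sketch. It is not the authors' implementation: the input layout (a mapping from CUI to mentions and from Wiktionary term to synonyms), the lower-cased string comparison and all CUIs and terms in the example are assumptions for illustration only.

```python
from collections import defaultdict


def align(wiktionary, umls):
    """Attach Wiktionary synonyms to a CUI when the term maps to exactly one CUI.

    wiktionary: dict mapping a term to a list of its synonyms.
    umls: dict mapping a CUI to a set of mentions.
    Returns a new dict mapping each CUI to its (possibly extended) mentions.
    """
    # Index UMLS mentions by lower-cased string (assumed matching strategy).
    term_to_cuis = defaultdict(set)
    for cui, mentions in umls.items():
        for mention in mentions:
            term_to_cuis[mention.lower()].add(cui)

    extended = {cui: set(mentions) for cui, mentions in umls.items()}
    for term, synonyms in wiktionary.items():
        cuis = term_to_cuis.get(term.lower(), set())
        if len(cuis) == 1:  # unambiguous: safe to align
            (cui,) = cuis
            extended[cui].update([term, *synonyms])
    return extended


# Toy data with hypothetical CUIs.
toy_umls = {
    "C9999991": {"Diabetes"},
    "C9999992": {"Schmerz"},
    "C9999993": {"Schmerz", "Schmerzen"},
}
toy_wiktionary = {
    "Diabetes": ["Zuckerkrankheit", "Zucker"],
    "Schmerz": ["Pein"],
}
wumls = align(toy_wiktionary, toy_umls)
print(wumls["C9999991"])
```

In the toy data, "Schmerz" occurs under two CUIs and is therefore left unaligned, which mirrors the limitation described above.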
4.4. Sorting Synonyms
The mapping from technical to laymen language is one of
the aspects of this work. However, the largest of our
supporting resources, UMLS, does not provide any
information about technical or laymen language for German. For
this reason we provide a simple technique to identify
technical and less technical terms according to definition 2 (see
Table 1). According to this definition, technical terms have
their origin in Latin or Greek. Moreover, we know that
those technical terms are very common in many (particularly
European) languages. Table 7 shows examples of similar
expressions across various languages. Using this
characteristic, we propose the following method to identify medical
technical expressions:
For each German target mention (G_t) we identify the
English (E_j) and French (F_k) synonym with the lowest
Levenshtein distance (lev(a, b)) in each of the two languages.
Next we calculate the average of both minimum distance
scores. Note that we chose two languages rather than one to
obtain a more robust distance score. Finally we harmonize
this score by dividing it by the length of the target mention
(len(a)). This should avoid that short strings are favoured
over longer strings with similar edits. We refer to this score
as the harmonized distance (h_dist). The harmonized
distance can be formulated as follows:
$$h\_dist_{G_t} = \frac{\min_j lev(G_t, E_j) + \min_k lev(G_t, F_k)}{2 \cdot len(G_t)}$$
Sorted Synonym data set (SSD): Following the assumption
from above, we assume that a German mention with a
low harmonized distance is likely to have a Greek or
Latin origin and thus tends to be a technical term. We
therefore calculate the harmonized distance of all German
mentions of UMLS (and WUMLS) and sort all synonyms of
each concept according to this score, starting with the term
with the lowest distance score and finishing with the one
with the largest score.
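The scoring and sorting steps can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names, the lower-casing before comparison and the toy English and French reference strings are assumptions. The function returns the plain ratio described in the text; the scores reported in the evaluation (up to 120) suggest an additional scaling, for example by 100, which the text does not state.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def h_dist(german, english, french):
    """Average of the minimum distances to both languages, divided by len(german)."""
    if not german or not english or not french:
        raise ValueError("need a non-empty mention and reference synonyms")
    target = german.lower()  # assumption: compare lower-cased strings
    best_en = min(levenshtein(target, e.lower()) for e in english)
    best_fr = min(levenshtein(target, f.lower()) for f in french)
    return (best_en + best_fr) / 2 / len(target)


def sort_synonyms(german_mentions, english, french):
    """Order German synonyms from presumably most technical to least technical."""
    return sorted(german_mentions, key=lambda g: h_dist(g, english, french))


# Toy concept: German synonyms with assumed English and French reference strings.
ranked = sort_synonyms(
    ["Blinddarmentzündung", "Appendizitis"],
    english=["appendicitis"],
    french=["appendicite"],
)
print(ranked)
```

With the toy input, "Appendizitis" is ranked first because it is only a few edits away from both reference strings, while the compound "Blinddarmentzündung" shares little with either.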
distance (>=)          0    5   10   15   20   25   30   35   40   45   50
#instance            300  237  193  161  144  124   97   87   74   56   49
%is-easier            50   59   65   71   74   74   75   74   70   70   71
%is-easier-or-equal   88   89   91   92   92   93   93   92   91   89   88
Table 8: Manual evaluation of 300 selected examples to explore if the term ranked as easiest term is in fact easier than the
term ranked as most technical. Considering only pairs with a larger edit distance, the results show that precision increases
for both is-easier (checking whether the term is in fact simplified) and is-easier-or-equal (checking whether the term is at
least not more complicated).
As we are interested in particular concepts, we select only
those which belong to one of these semantic types (STY):
‘Anatomical Abnormality’, ‘Anatomical Structure’, ‘Body
Location or Region’, ‘Body Part, Organ, or Organ
Component’, ‘Body Space or Junction’, ‘Disease or Syndrome’,
‘Injury or Poisoning’, ‘Mental or Behavioral Dysfunction’,
‘Sign or Symptom’. Using the technique from above and
including English and French as reference languages, we
can generate sorted synonym sequences of 28,495 different
concepts with overall 47,996 different mentions.
Evaluation 1 – Are synonyms with a low harmonic
distance technical terms? In order to examine this question
we randomly select 300 concepts and their lowest h_dist
mention from UMLS-SSD. All selected mentions had a
different harmonic score, and the largest score of the
subset was 120. The selected mentions have been manually
evaluated according to our two definitions by one of the
authors. The analysis shows that 75% of all terms are
technical expressions according to definition 1 and 90%
according to definition 2. Table 9 shows an analysis considering
only concepts below a certain harmonic distance threshold.
In this way we can see that a harmonic distance of up to 60
leads to a high accuracy, which supports our assumption.
The larger the distance, the more the accuracy decreases.
However, the score decreases faster using definition 1.
distance (<=)    20   40   60   80  100  120
#instances       59  105  174  277  297  299
%definition-1    93   93   91   79   75   75
%definition-2    98   99   99   94   90   90
Table 9: Manual examination of 300 randomly selected
expressions of a concept with the lowest harmonic distance
Evaluation 2 – Are synonym mentions with a larger
harmonic distance less technical and possibly laymen
expressions? In order to examine this question we
examine whether the term with the lowest score in
UMLS-SSD is more or at least similarly technical as the term
with the largest score of all synonyms. Thus, we randomly
selected 300 German concept mention pairs, this time
with the lowest and the largest harmonic distance score,
and examined whether the first term is a) more technical,
b) similarly technical or c) less technical than the second
term. As we do not know whether there is always a
simplified term within the synonym set, we evaluate according
to is-easier (a/(a+b+c)), as well as is-easier-or-equal
((a+b)/(a+b+c)).
The results in Table 8 show that in only 50% of the cases
the expression with the highest harmonic distance is less
technical than the expression with the lowest harmonic
distance. This does not look very promising at first.
However, we can make the following observations: First,
considering all synonym pairs, in 88% of the cases the expression
with the highest harmonic distance is easier than or at least
similarly technical as the expression with the lowest score.
Moreover, the table shows that the absolute distance between
both scores has a strong influence on the outcome. Increasing
the absolute distance between both scores quickly increases
the accuracy (%) as well. When examining whether the
expression with the higher score is in fact less technical, we
can see a steady increase from 50%, using all pairs, to
75%, considering a minimum absolute distance of 30.
Increasing the distance naturally decreases the number of
synonym pairs. After reaching a maximum of 75%, the
scores drop slightly, but never fall below 70%. A similar
effect can be observed for is-easier-or-equal. After a
maximum of 93% at a distance of 30, the values slightly
decrease but never fall below 88%.
Overall, these results are very promising. With a certain
minimum distance (e.g. 15 or more), in more than 70% of
the cases the synonym with the larger harmonic distance is
less technical, and in 92% of the cases the term is at least
not more complicated.
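As an illustration of the two measures used in Evaluation 2, the following minimal sketch computes is-easier and is-easier-or-equal over manually judged pairs, optionally restricted to pairs whose scores differ by a minimum absolute distance. It is not the original evaluation code: the names JudgedPair and easiness_rates are hypothetical, the toy judgements are invented, and is-easier-or-equal is assumed to be (a+b)/(a+b+c).

```python
from dataclasses import dataclass


@dataclass
class JudgedPair:
    low_score: float   # harmonic distance of the lowest-scored synonym
    high_score: float  # harmonic distance of the highest-scored synonym
    label: str         # "a", "b" or "c": low-scored term is more / similarly / less technical


def easiness_rates(pairs, min_distance=0.0):
    """Return (n, is_easier, is_easier_or_equal) over all pairs whose
    absolute score difference is at least min_distance."""
    kept = [p for p in pairs if abs(p.high_score - p.low_score) >= min_distance]
    if not kept:
        return 0, 0.0, 0.0
    n = len(kept)
    a = sum(p.label == "a" for p in kept)
    b = sum(p.label == "b" for p in kept)
    # is-easier = a / (a+b+c); is-easier-or-equal = (a+b) / (a+b+c) (assumed)
    return n, a / n, (a + b) / n


# Toy judgements, invented for illustration only.
pairs = [
    JudgedPair(10, 90, "a"),
    JudgedPair(20, 60, "a"),
    JudgedPair(30, 35, "b"),
    JudgedPair(40, 45, "c"),
]

print(easiness_rates(pairs))                   # (4, 0.5, 0.75)
print(easiness_rates(pairs, min_distance=30))  # (2, 1.0, 1.0)
```

Raising min_distance shrinks the number of pairs that remain, which mirrors the trade-off between accuracy and the number of synonym pairs described above.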
5. Baseline Experiments
In the previous sections we presented the TLC corpus and
two further resources that support the mapping between
German medical laymen language and technical language
in both directions. The main focus of our work is the
presentation of new resources in this domain. In this
section, however, we additionally present some baseline
results on TLC which can be used as a benchmark for
future work.
Regarding baseline results, we carry out two different
experiments: 1) the normalization of medical technical
terms, including a term simplification, and 2) the
normalization of medical laymen expressions. For our
experiment we indexed the mentions (and their stemmed
versions) from
5.1. Experiment 1 – Normalization and
For experiment 1 we extract all technical terms and
examine whether we can align each of them to a
corresponding concept unique identifier (CUI). Using
UMLS, we find the corresponding medical concept in
72.10% of the cases. However, in only 31.11% of those
cases do we find an easier synonym. The usage of WUMLS
does not increase the performance much.
If we analyse the terms found in UMLS in more detail, we
see that the average harmonic distance score of those
expressions is 39.93. As we know from Evaluation 1 in
Section 4.4 that a low score is an indicator for a technical
term, this score is no surprise. We can also see that a large
number of expressions have a larger harmonic distance;
for instance, 143 expressions have a score of 70 or
5.2. Experiment 2 – Normalization Laymen
For experiment 2 we extract all laymen terms and examine
whether the corresponding technical term can be found.
Using UMLS terms, a corresponding CUI can be detected
for only 57.37% of the mentions. As laymen expressions
show much more variation than technical terms, this
outcome was expected. If we again examine the
expressions found in UMLS in more detail, we see that the
average harmonic distance is 82.05. However, here too we
find a large number of expressions that are supposed to be
non-technical but have a low harmonic distance. For
instance, 137 expressions have a score below
Finally, using WUMLS data for the normalization, the score
can be increased to 64.08%. This clearly shows the
advantage of including additional information from
Wiktionary.
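The general shape of such a lookup-based baseline can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used in the experiments: it assumes a plain dictionary index over synonym terms and their stemmed forms, uses a crude suffix stripper in place of a real German stemmer, and picks an "easier" synonym as the one with the highest harmonic distance if it exceeds the matched term's score by a hypothetical margin min_gap. All function names, the placeholder identifiers CUI-A and CUI-B, and the scores are invented.

```python
def crude_stem(token):
    """Very rough suffix stripper; a stand-in for a real German stemmer."""
    for suffix in ("en", "e", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


def normalize_key(mention, stem=False):
    """Lowercase the mention and optionally stem each token."""
    tokens = mention.lower().split()
    if stem:
        tokens = [crude_stem(t) for t in tokens]
    return " ".join(tokens)


def build_index(synonyms):
    """synonyms: iterable of (cui, term, harmonic_distance) triples."""
    exact, stemmed, by_cui = {}, {}, {}
    for cui, term, score in synonyms:
        exact.setdefault(normalize_key(term), (cui, score))
        stemmed.setdefault(normalize_key(term, stem=True), (cui, score))
        by_cui.setdefault(cui, []).append((term, score))
    return exact, stemmed, by_cui


def lookup(mention, index):
    """Return (cui, score) via exact match, then stemmed match, else None."""
    exact, stemmed, _ = index
    hit = exact.get(normalize_key(mention))
    if hit is None:
        hit = stemmed.get(normalize_key(mention, stem=True))
    return hit


def easier_synonym(mention, index, min_gap=15.0):
    """Return the highest-scored synonym of the matched concept if its
    score exceeds the matched term's score by at least min_gap."""
    hit = lookup(mention, index)
    if hit is None:
        return None
    cui, score = hit
    term, best = max(index[2][cui], key=lambda ts: ts[1])
    return term if best - score >= min_gap else None


def coverage(mentions, index):
    """Fraction of mentions for which a concept identifier is found."""
    found = sum(lookup(m, index) is not None for m in mentions)
    return found / len(mentions) if mentions else 0.0


# Toy synonym set; identifiers and scores are invented placeholders.
SYNONYMS = [
    ("CUI-A", "Hypertonie", 20.0),
    ("CUI-A", "Bluthochdruck", 85.0),
    ("CUI-B", "Nephrolithiasis", 15.0),
    ("CUI-B", "Nierenstein", 80.0),
]
index = build_index(SYNONYMS)

print(lookup("Nierensteine", index))        # matched through the stemmed index
print(easier_synonym("Hypertonie", index))  # Bluthochdruck
print(coverage(["Hypertonie", "Nierensteine", "Hexenschuss"], index))
```

In this toy setting the laymen mention "Hexenschuss" is not covered by the synonym set, which is the kind of gap that additional laymen resources are meant to close.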
5.3. Discussion
Overall, the results of our baseline experiments show that
concept normalization of laymen language is much more
difficult than the normalization of medical technical
expressions. This highlights the importance of creating
further resources of laymen synonyms, but also methods
that are able to map between those language types.
Methods trained on definitions, such as in Limsopatham and
Collier (2016), might be helpful to tackle this challenge.
However, UMLS and Wiktionary do not contain as many
definitions for German as for English. This again highlights
that German, in comparison to English, is a low-resource
language with respect to existing and freely available
structured resources. As mentioned above, the German
UMLS subset covers only 3.2% of all English concepts and
involves only 2.3% of all existing English synonyms. Thus,
concept normalization is much more challenging for
German, even for technical terms. Cross-lingual methods
such as in Roller et al. (2018) might help to increase the
coverage of technical terms.
6. Conclusion
In this work we presented a new corpus based upon a
patient forum on kidney disease and on stomach and
intestines. The data set labels medical laymen language
and technical terms and assigns a corresponding
description or expression. It might be a valuable resource
for mapping and translating between both language styles
in the medical domain. In addition, we provided two
resources which can support this translation process.
Finally, we tested a simple baseline on our corpus which
can be used as a reference for more complex methods.
This project was funded by the European Union’s Horizon
2020 research and innovation program under grant agree-
ment No 780495 (BigMedilytics) and by the German Fed-
eral Ministry of Economics and Energy through the project
MACSS (01MD16011F).
7. Bibliographical References
Abbar, S., Mejova, Y., and Weber, I. (2015). You tweet
what you eat: Studying food consumption through twit-
ter. In Proceedings of the 33rd Annual ACM Confer-
ence on Human Factors in Computing Systems, CHI ’15,
pages 3197–3206, New York, NY, USA. ACM.
Abrahamsson, E., Forni, T., Skeppstedt, M., and Kvist, M.
(2014). Medical text simplification using synonym re-
placement: Adapting assessment of word difficulty to
a compounding language. In Proceedings of the 3rd
Workshop on Predicting and Improving Text Readabil-
ity for Target Reader Populations (PITR), pages 57–65,
Gothenburg, Sweden, April. Association for Computa-
tional Linguistics.
Chen, J., Jagannatha, A. N., Fodeh, S. J., and Yu, H.
(2017). Ranking medical terms to support expansion
of lay language resources for patient comprehension of
electronic health record notes: adapted distant supervi-
sion approach. JMIR medical informatics, 5(4):e42.
Doğan, R. I., Leaman, R., and Lu, Z. (2014). NCBI disease
corpus: a resource for disease name recognition and
concept normalization. Journal of Biomedical Informatics,
47:1–10.
Elhadad, N. and Sutaria, K. (2007). Mining a lexicon of
technical terms and lay equivalents. In Proceedings of
the Workshop on BioNLP 2007: Biological, Transla-
tional, and Clinical Language Processing, pages 49–56.
Association for Computational Linguistics.
Grabar, N. and Hamon, T. (2014). Automatic extraction
of layman names for technical medical terms. In 2014
IEEE International Conference on Healthcare Informat-
ics, pages 310–319. IEEE.
Karapetiantz, P., Audeh, B., Lillo-Le Louët, A., and
Bousquet, C. (2018). Signal Detection for Baclofen in Web
Forums: A Preliminary Study. In MIE, pages 421–425.
Leaman, R., Islamaj Doğan, R., and Lu, Z. (2013). DNorm:
disease name normalization with pairwise learning to
rank. Bioinformatics, 29(22):2909–2917.
Limsopatham, N. and Collier, N. (2015). Adapting Phrase-
based Machine Translation to Normalise Medical Terms
in Social Media Messages. In Proceedings of the 2015
Conference on Empirical Methods in Natural Lan-
guage Processing, pages 1675–1680, Lisbon, Portugal,
September. Association for Computational Linguistics.
Limsopatham, N. and Collier, N. (2016). Normalising
Medical Concepts in Social Media Texts by Learning Se-
mantic Representation. In Proceedings of the 54th An-
nual Meeting of the Association for Computational Lin-
guistics (Volume 1: Long Papers), pages 1014–1023,
Berlin, Germany, August. Association for Computa-
tional Linguistics.
O’Connor, K., Pimpalkhute, P., Nikfarjam, A., Ginn, R.,
Smith, K. L., and Gonzalez, G. (2014). Pharmacovigilance
on twitter? Mining tweets for adverse drug reactions.
In AMIA annual symposium proceedings, volume
2014, page 924. American Medical Informatics Association.
Roller, R., Kittner, M., Weissenborn, D., and Leser, U.
(2018). Cross-lingual Candidate Search for Biomedical
Concept Normalization. In Proceedings of Multilingual
BIO, Miyazaki, Japan, May.
Stenetorp, P., Pyysalo, S., Topić, G., Ohta, T.,
Ananiadou, S., and Tsujii, J. (2012). brat: a Web-based Tool
for NLP-Assisted Text Annotation. In Proceedings of
the Demonstrations Session at EACL 2012, Avignon,
France, April. Association for Computational Linguistics.
Suominen, H., Salanterä, S., Velupillai, S., Chapman,
W. W., Savova, G., Elhadad, N., Pradhan, S., South,
B. R., Mowery, D. L., Jones, G. J., et al. (2013).
Overview of the share/clef ehealth evaluation lab 2013.
In International Conference of the Cross-Language
Evaluation Forum for European Languages, pages 212–
231. Springer.
Weissenbacher, D., Sarker, A., Paul, M. J., and Gonzalez-
Hernandez, G. (2018). Overview of the Third Social
Media Mining for Health (SMM4H) Shared Tasks at
EMNLP 2018. In Proceedings of the 2018 EMNLP
Workshop SMM4H: The 3rd Social Media Mining for
Health Applications Workshop & Shared Task, pages 13–
16, Brussels, Belgium, October. Association for Compu-
tational Linguistics.
Weissenbacher, D., Sarker, A., Magge, A., Daughton, A.,
O’Connor, K., Paul, M. J., and Gonzalez-Hernandez,
G. (2019). Overview of the Fourth Social Media Min-
ing for Health (SMM4H) Shared Tasks at ACL 2019.
In Proceedings of the Fourth Social Media Mining for
Health Applications (#SMM4H) Workshop & Shared
Task, pages 21–30, Florence, Italy, August. Association
for Computational Linguistics.
Zeng-Treitler, Q., Goryachev, S., Kim, H., Keselman, A.,
and Rosendale, D. (2007). Making texts in electronic
health records comprehensible to consumers: a proto-
type translator. In AMIA Annual Symposium Proceed-
ings, volume 2007, page 846. American Medical Infor-
matics Association.