Roller et al.
This is the extended journal paper version of our IEEE ICHI 2020 paper ‘Information Extraction Models for
German Clinical Text’. Since then this version is still under review ...
A Medical Information Extraction Workbench to
Process German Clinical Text
Roland Roller1,2*, Laura Seiffe1, Ammer Ayach1, Sebastian Möller1, Oliver Marten1, Michael
Mikhailov1,2, Christoph Alt1, Danilo Schmidt3, Fabian Halleck2, Marcel Naik2,4, Wiebke
Duettmann2,4 and Klemens Budde2
Background: In the information extraction and natural language processing domain, accessible datasets are
crucial to reproduce and compare results. Publicly available implementations and tools can serve as benchmarks
and facilitate the development of more complex applications. However, in the context of clinical text
processing, accessible datasets are scarce, and so are existing tools. One of the
main reasons is the sensitivity of the data. This problem is even more evident for non-English languages.
Approach: In order to address this situation, we introduce a workbench: a collection of German clinical text
processing models. The models are trained on a de-identified corpus of German nephrology reports.
Result: The presented models provide promising results on in-domain data. Moreover, we show that our
models can also be successfully applied to other biomedical text in German. Our workbench is made publicly
available so it can be used out of the box, as a benchmark, or transferred to related problems.
Keywords: Clinical Text Processing; Information Extraction; Part-of-Speech Tagging; NLP Workbench
The year is 2022 AD. The research community re-
lies entirely on neural and computationally intensive
methods using large datasets and pre-trained language
models, which push the current development rapidly
forward. Well, not entirely. One small domain still
struggles with more fundamental problems: lack of ex-
isting datasets, corpora or pre-trained models; hurdles
with legal aspects; and outdated computer infrastruc-
ture with limited user rights just to name a few. The
domain is called clinical text processing. This applies
particularly to non-English clinical text processing.
In the English speaking world, however, various tools
can be applied to process clinical text, such as cTAKES
[1] or MetaMap [2]. Most of those tools focus on named
entity recognition, partly combined with concept nor-
malization and disambiguation. Moreover, some tools
target a very particular topic or domain, such as
pharmacovigilance (MedLEE [3]), detection of medications
(MedEx [4]) or temporal information (MedTime [5]).
A more detailed overview can be found in Wang et al. [6].
1Speech and Language Technology, German Research Center for Artificial
Intelligence (DFKI), Berlin, Germany
Full list of author information is available at the end of the article
Regarding the availability of annotated clinical text
in English, various shared-task challenges have been conducted in
recent years, in which participants share data and experiences
on the same problem. The i2b2 NLP challenge
(now n2c2), for instance, addresses a large variety
of different tasks, such as medication detection
[7] or identification of heart disease risk factors [8].
Other shared tasks, such as CLEF eHealth or SemEval,
have also been carried out several times
on clinical text (e.g. [9,10,11,12]). In more recent
years, those challenges have also targeted non-English clinical
texts. The CLEF eHealth challenge 2018 (Task
1 [13]), for instance, focused on the mapping of ICD
codes to death certificates in French, Hungarian, and
Italian, while NEGES at IberLEF targets negation detection
in Spanish clinical reports [14]. A good
overview of non-English clinical text processing in
general is provided in Névéol et al. [15].
For the German language, the situation looks worse
in comparison to French or Spanish. In 2019 a CLEF
arXiv:2207.03885v1 [cs.CL] 8 Jul 2022
eHealth challenge for German text was carried
out, in which ICD-10 codes were mapped to health-related,
non-technical summaries of experiments [16].
However, this data concerned animals, not humans.
Beyond that, not much other data has been published.
An overview of the current situation is presented in
Borchert et al. [17]. The paper lists 13 different German
text corpora with a clinical/biomedical context,
but only three are freely available: first, GGPOnc [17],
a dataset of clinical practice guidelines; second, TLC
[18], posts from a patient forum with annotated laymen
expressions; and finally, JSynCC [19], a dataset of German
case reports extracted from medical literature. In
the case of JSynCC, the authors provide software
to extract the relevant text passages from digital
medical books, instead of providing the data itself,
for legal reasons.
Concerning existing tools and pre-trained methods
to process German clinical text, the situation is sim-
ilar to the availability of text data. Most prominent
is JPOS [20], a tokenizer and part of speech tagger
trained on clinical text. The authors published the
tool, as they were not allowed to publish the under-
lying FRAMED corpus [21] itself. In addition to that,
two NegEx [22] versions for German exist [23] [24], as
well as a dependency tree parser [25], an abbreviation
expansion tool [26] and a tool to pseudonymize protected
health information (PHI) in German clinical text [27].
Very recently, some additional resources
highly related to this work have been published, as the
field is developing quickly. BRONCO
[28] is an annotated dataset of German discharge summaries
from the oncology domain, and the first clinical text
dataset in German to be published. The data
is de-identified and the sentences of all 200 documents
were shuffled to lower the risk of re-identification.
Along with the data, the authors also make their baseline
models available on request. GERNERMED [29]
is a German medical NER model, which is trained on
translated data from n2c2 2018 [30]. Finally, German
MedBERT [31], a BERT model optimized for
German clinical text, has been published on Hugging
Face [32], targeting ICD-10 code mapping.
In order to further support the development of re-
sources to process German clinical text, this work de-
scribes an annotated dataset of German nephrology
reports. It contains fine-grained annotations of con-
cepts, relations, attributes, as well as part-of-speech
(POS) labels and dependency trees. Unfortunately, we
are unable to share the dataset at this point due to un-
solved data protection concerns. Thus, instead of pub-
lishing the dataset, we release the machine learning
models which we trained on the de-identified data[1].
[1]The workbench can be found here: http://biomedical. and Docker-Deployment.
While the BRONCO models mainly target diagnosis,
treatment, medications, as well as factuality, our work
includes a larger variety of different named entities,
which might be useful for other use cases. Moreover,
our workbench also includes relation detection and a
POS tagger. Similarly this applies to GERNERMED,
which mainly targets medications, as well as its dosage,
duration, frequency etc. Note that, as this work was developed
over several years, we still rely on classical
word embeddings, rather than testing the efficiency of
German MedBERT for our scenario.
German Nephrology Corpus
The corpus consists of German documents from the
nephrology division at Charité Universitätsmedizin
Berlin. All documents have been de-identified by removing
protected health information (PHI) as defined by
HIPAA (Health Insurance Portability and Accountability
Act), using deID [33]. Next, the documents were
enriched with semantic annotations, as described below.
We considered two different document types for our
annotations, clinical notes and discharge summaries
(or discharge letters). Both document types report
on kidney transplanted patients who underwent long-
term treatment, are written by medical professionals
and address medical professionals. Discharge sum-
maries (“Arztbriefe”) serve as a summary of a pa-
tient’s hospital stay, expressed as letters sent to the
patient’s GP, and cover history, diagnostic and thera-
peutic procedures. Since discharge letters are relatively
long, they often provide additional structural elements
such as headings, enumerations, etc. Clinical notes
(“Verlaufsnotizen”) summarize the results from a sin-
gle consultation in the outpatient department, which
results in rather short texts. In contrast to discharge
letters, their content, form, and function are not sub-
ject to professional and structural standards, as they
are used only for internal communication and largely
depend on the doctor’s writing style.
Characteristics of Clinical Language
A characteristic of the (German) clinical language is
the large proportion of technical terms, which mostly
have their origin in Latin or Greek, and underlie spe-
cific morphological rules. Moreover, clinical language
provides a characteristic and individual use of syntax,
for example, a notation style that excludes function
words (e.g., articles, auxiliary verbs) and might include
non-standard term variants. As documents might be
written in a hurry, sentences can include typos, can be
incomplete, or can lack punctuation marks. Overall,
the clinical language prefers a compact and reduced
language use with a high information density. Example
(1) shows a typical clinical sentence structure:
(1) Im Sono kein Stau.
Sonogram [shows] no congestion.
This short expression does not use a verb; something
like “to show” is presumably meant in this context.
This information is, however, significant, as it forms
the relation between the examination process (Sono,
“Sonogram”) and the (negated) finding (Stau,
“congestion”). As the verb and therefore the relation is
only implicitly expressed, a basic understanding of the
subject is presupposed for correct manual annotation.
In order to use the models, a user agreement has to be
signed first.
Table 1 Overview of all concepts
Concept | Description | Translated examples
Medical Condition | Signs, symptoms, diagnoses, diseases, findings | Aphasia, cachectic
Diagnostic Lab Procedure | Procedures (diagnostic or laboratory) that serve the clinical examination of the patient’s | measure, CT angiography, sonogra-
Treatment | All variants of clinical interventions aiming at improving the health state | blood pressure management, trans-
Medication | Names of medications, their active substances | Prograf, Sandimmun
Biological Chemistry | Biochemical substances that play a role in human organism | creatinine, ANA, HbA1c
Process | Endogenous processes and functions in human | peristaltic sounds, defecation
Person | Mentions of people | gynecologist, GP
Body Part | Parts of the human body | renal, lung
Body Fluid | Fluid and excretions of the human body | urine, sputum, blood
Medical Device | Artificial or biological system that supports or replaces a failed function of human organism | kidney allograft, TEP, shunt
Biological Parameter | Functions, features and characteristics of the human body | nutritional status, blood count, body
Medical Specification | Clinically specifying elements | chronic, symptomatic
Local Specification | Locally specifying elements | perirenal, right
Time Information | Temporally specifying elements | today in the morning, 2014, on
Dosing | Dosing instructions for medications | 1 sachet daily, 1 in the morning - 1 in the afternoon - none at night, 5 mg
Measurement | Measurements and evaluations of functions, findings, states | sonorous, active, distinct
State of Health | The (aimed) health state or the ongoing im- | properly adjusted, satisfactory, stable
Table 2 Overview of all relations
Relation | Description
Has state | Argument1 is described as being pathologic (Medical Condition) or as being healthy (State of health)
Has dosing | A Medication or a Treatment is linked to a Dosing instruction
Has time info | Argument1 is described by a Time information
Has measure | Argument1 is described by a Measurement
Is located | Argument1 is described locally by Local specification or Body part
Is specified | Argument1 is described by Medical specification
Shows | A DiagLab Procedure or a Biological Parameter leads to a finding (Argument2)
Examines | A DiagLab Procedure examines Argument2
Involves | A Treatment involves a Medical Device, a Medication or another Treatment
Table 3 Overview of all attributes
Attribute | Value | Explanation
DocTime | Past | The entity or the event existed or happened in the past, respectively (related to the current temporal setting of the document)
DocTime | Future | The entity or the event is planned, prescribed or recommended in the future (related to the current temporal setting of the document)
DocTime | Past present | The entity or the event has begun in the past and endures to the current temporal setting of the document
LevelOfTruth | Possible future | The entity or the event might happen in the future, e.g. in an if-clause
LevelOfTruth | Negative | The entity or the event is negated
LevelOfTruth | Speculated | The truth of the entity or the event is only assumed
LevelOfTruth | Unlikely | The truth of the entity or the event is doubtful
Furthermore, many abbreviations are used in both
document types. These abbreviations are often standardized,
and their expansion and meaning are well
documented. However, the expansion of abbreviations
can be complicated by the fact that abbreviations
are often ambiguous. For example, the German
web dictionary Beckers Abkürzungslexikon Medizinischer
Begriffe [34] for medical abbreviations
lists 61 possible expansions for KS (for instance
Kaltschweißigkeit, Klopfschall, Kaiserschnitt,
Kaufmann-Schema) with additional subcategories of
varieties. Only sufficient context can help to disambiguate
an abbreviation. However, as the clinical language
tends toward compact and reduced expression, such
a disambiguating context is not always given. Another
characteristic of clinical documents is the large number
of negations and vague descriptions, particularly
in the context of symptoms and findings.
Semantic Annotations
Our semantic annotation schema is intended to cover
the most relevant textual information in the corpus,
and was developed during the annotation process [35]. The
schema has been developed from scratch, together with
linguists, computer scientists, and physicians, and focuses
on the pathological health state (medical condition)
of the patient as well as his or her treatments and
diagnostic and laboratory examinations. The schema
mainly targets the recognition of that information and
everything connected to it. It is individually
adapted to the demands of the German nephrology
domain and applies to both discharge summaries and
clinical notes. In order to capture this meaning, the
schema is constructed of concepts, binary relations,
and concept attributes, which are introduced in the following.
Concepts: The concept schema can be divided
into three groups: central, relating, and specifying.
Central concepts describe, from our perspective,
the most crucial information about a patient. This
concerns the pathological health state of the patient
as well as his or her treatments and diagnostic and
laboratory examinations. Relating concepts describe
other relevant information within the documents.
Connected via relations (see below) to mostly
central concepts, this information helps to gather the
relevant content of the documents. Specifying concepts
provide more detailed information about the other
concepts, such as dosing, local or time information. An
overview of the concept schema is provided in Table 1
and includes a short definition and examples.
Relations: Our relation schema describes a binary
semantic relation between two concepts. It intends
to connect the annotated concepts within the docu-
ment with each other and to give the single concepts
a stronger meaning. On a high level, relations can be
divided into two groups, describing and medical re-
lations. Describing relations connect two concepts
with each other, of which one argument adds more
information to the other one, such as the dosing of a
medication, the pathological state of something, or a
further specification. Usually one argument is a spec-
ifying concept. Medical relations instead describe
more complex situations related to the examination
and treatment of a patient.
Table 2 presents the set of relations including a short
description. In most cases the relations are defined in a
broader sense. While one argument is usually defined
to be bound to one or two particular concept types,
the other argument often has more freedom and can
be bound to various concept types.
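The argument constraints described above can be made explicit in a small data structure. The following is a minimal sketch, not the authors' implementation: the argument types are taken from the relation descriptions in Table 2, while the identifier names (`RELATION_SCHEMA`, `Concept`, `is_valid_relation`) and the convention that `None` means "not restricted by the table" are assumptions made for illustration.

```python
from dataclasses import dataclass

# Allowed concept types per relation, as (argument1, argument2).
# None = the table does not restrict this argument (assumption of this sketch).
RELATION_SCHEMA = {
    "has_state": (None, {"Medical Condition", "State of Health"}),
    "has_dosing": ({"Medication", "Treatment"}, {"Dosing"}),
    "has_time_info": (None, {"Time Information"}),
    "has_measure": (None, {"Measurement"}),
    "is_located": (None, {"Local Specification", "Body Part"}),
    "is_specified": (None, {"Medical Specification"}),
    "shows": ({"Diagnostic Lab Procedure", "Biological Parameter"}, None),
    "examines": ({"Diagnostic Lab Procedure"}, None),
    "involves": ({"Treatment"}, {"Medical Device", "Medication", "Treatment"}),
}


@dataclass(frozen=True)
class Concept:
    """An annotated concept mention: surface text plus concept type."""
    text: str
    label: str


def is_valid_relation(name: str, arg1: Concept, arg2: Concept) -> bool:
    """Check whether both arguments satisfy the type constraints of a relation."""
    allowed1, allowed2 = RELATION_SCHEMA[name]
    ok1 = allowed1 is None or arg1.label in allowed1
    ok2 = allowed2 is None or arg2.label in allowed2
    return ok1 and ok2
```

For example, `is_valid_relation("has_dosing", Concept("Prograf", "Medication"), Concept("5 mg", "Dosing"))` is accepted, whereas the same relation with a Body Part as first argument is rejected.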
Attributes: Attributes are used for the further spec-
ification of an annotated concept. While a concept
covers the term’s lexical information, the selected attribute
value refers to extra-lexical/contextual information.
Such information relates to temporal information
or information about the level of truth. See
Table 3 for the annotation schema of attributes.
The time information attribute DocTime helps to
structure the described temporal course of the document.
By applying one of its values, an entity can
be highlighted as having happened in the past or as being
planned in or predicted for the future. In most
cases, a surrounding time information concept triggers
the attribute selection. If possible, both concepts
are additionally linked by the relation has time info.
This means that in many cases, this information is expressed
twice: first by concept and second by attribute.
The attribute LevelOfTruth highlights information
that indicates vagueness, possibility, and negated ex-
pressions. Generally, both document types comprise
plenty of expressions of assumptions. It is necessary to
differentiate between a statement expressing certainty
and an assumption, for example.
Corpus Generation
The annotation was carried out by three students (2
linguists, 1 medical student). The medical student in
particular contributed to the understanding of the
medical terminology. The task itself was conducted by
using the Brat annotator tool [36] within several anno-
tation cycles. This method led to various adaptations
and updates of the annotation schema.
Table 4 Analysis of annotated documents
Statistic | Discharge Summaries | Clinical Notes
# docs | 61 | 1300
# words | 57,219 | 54,206
# sentences | 6,213 | 6,618
avg. words (std) | 938 (246.33) | 54 (45.43)
Table 4 provides an overview of the annotated
dataset. Overall, 1300 clinical notes and 61 discharge
summaries have been annotated. Most documents were
examined at least twice by two different annotators.
However, the two linguists, who together annotated
more than 80% of the data (53.3% and 30.9%), share
a larger number of overlapping documents with each other
than with the medical student (11% between the linguists, 8%
and 2% between the linguists and the medical student;
only a small portion of documents was annotated by
all three annotators, namely 1 discharge summary and
20 clinical notes).
The table shows the number of different annotated
documents, the overall number of words in all docu-
ments, the overall number of sentences and the aver-
age words per document, including standard deviation
(in brackets). Discharge summaries contain a larger
average number of words per document compared to
the clinical notes[2]. However, the standard deviation
of the word count per document shows that
both document types have a large variation in text
length. Some clinical notes contain only a few words.
[2]The information is generated by applying a German
tokenizer and a sentence splitter.
Similar to Hripcsak and Rothschild [37], we calculate
the inter-annotator agreement (IAA) using the
pairwise average F-score (micro) on character level.
This results in an average F-score across all annotators of
0.761 for concepts and 0.636 for relations. The
score for relations in particular does not seem to be very
high. However, two aspects need to be taken into consideration:
first, a consistent relation annotation strongly
depends on whether the underlying concepts have been
annotated correctly beforehand; and second, the IAA
between the two linguists, who annotated more than
80% of the data, is considerably higher,
namely 0.822 for concepts and 0.697 for relations.
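To make the agreement measure concrete, the following sketch shows one plausible reading of a pairwise, micro-averaged, character-level F-score for concept annotations. It is an illustration under stated assumptions rather than the evaluation script used for the corpus: annotations are assumed to be `(start, end, label)` character offsets, only documents annotated by both annotators of a pair are compared, and all function names are hypothetical.

```python
from itertools import combinations


def char_items(spans):
    """Expand (start, end, label) spans into a set of (char_position, label) items."""
    return {(i, label) for start, end, label in spans for i in range(start, end)}


def pairwise_micro_f1(docs_a, docs_b):
    """Micro F1 between two annotators over the documents both have annotated.

    docs_a, docs_b: dict mapping doc_id -> list of (start, end, label).
    Returns None if the two annotators share no annotated characters at all.
    """
    tp = total_a = total_b = 0
    for doc_id in docs_a.keys() & docs_b.keys():
        a = char_items(docs_a[doc_id])
        b = char_items(docs_b[doc_id])
        tp += len(a & b)          # characters labelled identically by both
        total_a += len(a)
        total_b += len(b)
    if total_a + total_b == 0:
        return None
    # F1 = 2PR / (P + R) with P = tp / total_b and R = tp / total_a
    return 2 * tp / (total_a + total_b)


def average_iaa(annotators):
    """Average the pairwise F1 over all annotator pairs with comparable data.

    annotators: dict mapping annotator name -> docs dict as above.
    """
    scores = []
    for x, y in combinations(sorted(annotators), 2):
        score = pairwise_micro_f1(annotators[x], annotators[y])
        if score is not None:
            scores.append(score)
    return sum(scores) / len(scores) if scores else None
```

Because the F-score is symmetric in its two inputs, it does not matter which annotator of a pair is treated as the reference.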
Challenges & Limitations
In order to meet the complexity of the German clinical
language, we created a detailed and extensive annotation
schema. We faced multiple challenges during
the annotation process, due to the complexity of
the language and the schema. These challenges had a
substantial impact on the consistency of the annotations
across annotators (IAA); therefore, decisions
about the approach had to be made. The most relevant
ones are presented in the following.
Fine-Grained Annotation: The German language
includes a large number of compound words. As com-
pounds consist of two (or more) meaningful units that
are linked via an inherent linguistic relation, the anno-
tation of that relation seems possible. This can lead to
subword annotations on multiple levels. Notably this
applies to medical technical terms.
Figure 1 Annotation Granularity
Figure 1 shows an example of annotating on different
levels. It shows the compound Niereninsuffizienz (“renal
insufficiency”), whose two elements refer to the concepts
body part and medical condition. The word
itself also refers to a medical condition. The specifying
adjective terminale (“terminal”) should be considered
an attached part of the technical term. Thus
the preferred annotation here is Terminale Niereninsuffizienz
as medical condition. Alternatively, terminale
is annotated as medical specification, Niereninsuffizienz
as medical condition, and both are
connected via the relation is specified. The process
of subword annotation goes beyond the scope of this
corpus; therefore, we opt for the annotation of larger
spans. However, the possibility of both alternatives decreases
the consistency of the annotation.
Figure 2 “Pain on percussion along the spine”
Ambiguity: The relations in our annotation schema
are rather broadly defined. This means that one re-
lation can be used to link several different argu-
ment pairs. For example, the relation shows with
the concept medical condition as the second argu-
ment can make use of two different first arguments:
Firstly it links to diagLab procedure. In such a
case, the result or the finding of a diagnostic proce-
dure is expressed. Alternatively it links to biologi-
cal parameter. Then the relation expresses that a
specific parameter indicates a pathological condition.
In some contexts, the connection between the two
entities can be expressed equally by two different rela-
tions. This is mainly the case for the link between the
two concepts medical condition and body part.
See Figure 2: The sentence describes the location of
a symptom. This finding can either be described by
using an is located relation (Figure 2, first line) or
by defining the health state of a body part as being
pathological (has state relation, Figure 2, second
line). Both are, according to our schema, correct and
express the intended semantic relation equally. In order
to achieve a consistent annotation, we opt for the
first version, as the medical condition information
seems more central here.
Additional Datasets
While the previous section introduced the main cor-
pus in detail, this part presents additional relevant
data sources, namely Nephro Gold, the Hamburg De-
pendencies Dataset, as well as a biomedical text corpus
in German.
Nephro Gold: A syntactic dataset
In addition to the semantic clinical corpus, a small
clinical syntactic dataset has been created in previous
work [38] [25]. We refer to this corpus as Nephro Gold.
The dataset also consists of clinical notes and discharge
summaries, is rather small (44 clinical notes and 11 discharge
summaries) and is a subset of the dataset described
above. It includes part-of-speech (POS) annotations
using the Stuttgart-Tübingen Tagset (STTS)
[39] and dependency trees using the Universal Dependencies
(UD) tagset [40].
Hamburg Dependency Treebank
The Hamburg Dependency Treebank (HDT) [41] is
a large dataset of more than 261,000 German sentences,
and includes POS labels and syntactic annotations.
The annotation scheme of HDT is also based
on the Stuttgart-Tübingen Tagset for morphological
and POS annotation and a set of 35 dependency labels
for dependency annotation. The corpus is freely available
for scientific purposes. In this work, the dataset
is used for the POS tagger.
BTC - A Biomedical Text Collection in German
Many modern NLP (natural language processing)
methods use word embeddings as input. However, it
turns out that embeddings specialized on the given
domain often outperform embeddings trained only on
general text. Although embeddings with a focus on
the biomedical domain have been published, there are
no embeddings specialized on German clinical text[3].
Moreover, as mentioned above, clinical text contains
a large number of technical terms, which do not fre-
quently occur in general news text or Wikipedia. This
might result in many out-of-vocabulary words if we
train a system using standard pre-trained embeddings.
For this reason, we collected German text data from
multiple sources in order to train our own custom embeddings.
Table 5 and Table 6 provide an overview
of the different sources used. We scraped data from
multiple webpages with a focus on biomedical topics,
as well as forums. In addition, we also used text
from different medical books. In the following, we
refer to our biomedical text collection as BTC.
Methods and Setup
This section provides an overview of the technical aspects
of our work. We briefly present the relevant
methods and explain how we use or modify them for
our experiments. As we mainly rely on existing implementations,
the technical components are presented
only briefly.
[3]Note, after submission of the manuscript, the situation
has slightly changed. See related work.
Table 5 Overview of different biomedical text sources in German to create a new biomedical text collection
Size | Source | Description
7.5 GB | Med1 Forum[42] | German forum for clinical topics and information exchange
91.8 MB | Deutsches Medizin Forum[43] | German forum for clinical topics and information exchange
28.6 MB | Spiegel Online[44] | German news webpage, articles downloaded from the health section
10.7 MB | Aerzte-Blatt[45] | Official news publication of the German Medical Association
10 MB | NetDoktor[46] | German online health portal for medical information from experts to patients
7.1 MB | Onmeda[47] | German online health portal, content extracted from the symptoms and diseases
3.6 MB | German PubMed Abstracts[48] | Archive of biomedical and life sciences journal literature
1.9 MB | eDocTrainer[49] | German collection of clinical case studies from all specialist disciplines
16.5 MB | Medical Books | Content from various medical books, see Table 6
Table 6 Overview of medical books
Name | Author
Chirurgie: Mit integriertem Fallquiz - 40 Fälle nach neuer AO. Springer-Verlag, 2009 [50] | Siewert, Stein
Neurologie. Springer-Verlag, 2006 [51] | Poeck, Hacke
Urologie. Springer-Verlag, 2014 [52] | Hautmann, Gschwend
Basiswissen Augenheilkunde. Springer-Verlag, 2016 [53] | Walter, Plange
Hals-Nasen-Ohren-Heilkunde. Springer-Verlag, 2012 [54] | Lenarz, Boenninghaus
Notfallmedizin. Springer-Verlag, 2016 [55] | Ziegenfuß
Basiswissen Dermatologie. Springer-Verlag, 2017 [56] | Goebeler, Hamm
Basiswissen Psychiatrie und Psychotherapie. Springer-Verlag, 2011 [57] | Arolt, Reimer, Dilling
Custom Word Embeddings with fastText
All methods used in this work rely on various types
of word, character, and document embeddings. For
reasons of simplicity and compatibility with our NLP
pipeline, fastText [58] was the method of choice. fastText
is a lightweight library that offers pre-trained text
classifiers. Moreover, it provides various multilingual
word vectors that can be fine-tuned on new unlabeled
data to obtain a better domain-specific characterization
of our clinical data. In this work we used fastText
to fine-tune our own embeddings using BTC and the
German Nephrology corpus. We refer to this representation
as ‘custom embeddings’. The new embeddings
were trained with CBOW for a total of 5 epochs.
In addition, we used the default German
fastText embeddings, which we refer to as ‘default embeddings’.
Clinical Text processing with Flair
Large parts of our work relied on Flair [59], a state-of-the-art
NLP framework, which provides various
functionalities, for instance, named entity recognition
(NER) and part-of-speech (POS) tagging. Furthermore,
its codebase, built on PyTorch, allows
easy and efficient modifications to realize new tasks.
For this work, we partially relied on contextual string
embeddings (FlairEmbeddings) [59] and pooled contextualized
embeddings (Pool) [60]. Flair embeddings
are a type of word embedding that merges the best
attributes of three embeddings (word embeddings,
character-level features, contextualized word embeddings).
Those embeddings represent words as a sequence
of characters and are contextualized by their
surrounding text, so that the same word receives a different
embedding depending on its context. Flair
also offers the opportunity to fine-tune this kind of embedding
on unlabeled datasets for a specific domain.
Pool embeddings address the problem of representing
rare words that might carry more than
one meaning in a given text. A pooling operation is
applied to all contextualized instances of a word to
generate a global word representation that encodes all
the gathered features into a new one.
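The pooling idea can be illustrated without any deep learning framework. The sketch below is a simplified stand-in, not Flair's implementation: it keeps a memory of every contextual vector seen for a word and aggregates them element-wise. The class name, the plain Python lists used as vectors, and the choice among mean, min and max pooling are assumptions made for illustration.

```python
class PooledEmbeddingMemory:
    """Collects contextual vectors per word and pools them into one vector."""

    def __init__(self, pooling="mean"):
        if pooling not in ("mean", "min", "max"):
            raise ValueError(f"unknown pooling operation: {pooling}")
        self.pooling = pooling
        self.memory = {}  # word -> list of contextual vectors seen so far

    def add(self, word, vector):
        """Store one contextualized instance of a word."""
        self.memory.setdefault(word, []).append(list(vector))

    def pooled(self, word):
        """Aggregate all stored instances of a word dimension by dimension."""
        vectors = self.memory.get(word)
        if not vectors:
            raise KeyError(word)
        columns = list(zip(*vectors))  # one tuple per embedding dimension
        if self.pooling == "mean":
            return [sum(col) / len(col) for col in columns]
        if self.pooling == "min":
            return [min(col) for col in columns]
        return [max(col) for col in columns]
```

With two stored instances `[1.0, 4.0]` and `[3.0, 0.0]` of an ambiguous abbreviation such as KS, mean pooling yields `[2.0, 2.0]` and max pooling yields `[3.0, 4.0]`.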
Part-of-speech Tagging: For the POS tagging we used
a BiLSTM-CRF implementation of Flair. Overall we
explored different setups with different embeddings
and their combination. As the Nephro Gold dataset
is rather small, we also ran experiments with a training
and development dataset extended by the Hamburg
Dependency Treebank.
Concept Detection: For the concept detection we
again relied on a BiLSTM-CRF model implemented
in Flair. Similar to POS tagging, we examined a range
of different embeddings and their combination.
Relation Extraction
For relation extraction, we re-implemented a CNN-
based relation classifier, as proposed by Nguyen and
Grishman (2015) [61]. We used word- and positional
embeddings to represent the meaning and relative po-
sition of each token. Short sentences and reduced lan-
guage might provide insufficient context. Thus, we also
added embeddings to provide the model with concept
information about the two relation arguments.
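As a rough sketch of the input described above, each token can be paired with its relative offset to the two relation arguments, and the concept types of both arguments can be attached to the instance. The function names, the example sentence and the concept labels are hypothetical. The embedding lookup and the CNN itself are omitted.

```python
def relative_offset(index, span):
    """Signed token distance from `index` to `span` = (start, end), end exclusive."""
    start, end = span
    if index < start:
        return index - start
    if index >= end:
        return index - (end - 1)
    return 0  # the token is part of the argument itself


def encode_instance(tokens, arg1_span, arg2_span, arg1_concept, arg2_concept):
    """Collect per-token offsets plus the concept types of both arguments."""
    rows = []
    for i, token in enumerate(tokens):
        rows.append({
            "token": token,
            "offset_arg1": relative_offset(i, arg1_span),
            "offset_arg2": relative_offset(i, arg2_span),
        })
    return {"tokens": rows, "arg_concepts": (arg1_concept, arg2_concept)}


# Hypothetical example sentence and concept labels, for illustration only.
example = encode_instance(
    ["Kreatinin", "bei", "1.4", "mg/dl"],
    arg1_span=(0, 1),
    arg2_span=(2, 4),
    arg1_concept="Biological_parameter",
    arg2_concept="Measurement",
)
```

In a classifier of this kind, the token, both offsets and the concept types would each be mapped to an embedding vector and concatenated before the convolution.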
Table 7 Part-of-speech Tagging: Average accuracy with standard deviation (in brackets) over 5-fold cross validation
Training data        Def. Word     Custom Word   Def. Flair    Custom Flair  Def. Word+Flair  Cust. Word+Flair
only Nephro Gold     65.53 (0.66)  77.65 (0.95)  81.41 (0.75)  80.80 (0.47)  81.54 (0.69)     81.84 (1.37)
Nephro Gold & HDT    97.07 (0.02)  97.29 (0.02)  98.47 (0.01)  98.37 (0.01)  98.57 (0.02)     97.96 (0.01)
Model Size           1.3 GB        4.7 GB        248.7 MB      248.7 MB      1.6 GB           5.0 GB
Table 8 Concept Detection: Average micro F1 score with standard deviation (in brackets) over 5-fold cross validation
Matching                                                  Strict                      Lenient
Embeddings                                    Model Size  Prec.  Rec.   F1            Prec.  Rec.   F1
Default Word Embeddings                       1.3 GB      61.20  57.11  58.97 (0.44)  72.4   67.4   69.6 (0.54)
Custom Word Embeddings                        4.7 GB      73.82  71.19  72.47 (0.61)  83.2   79.4   81.4 (0.54)
Default Flair Embeddings                      248.6 MB    74.82  73.85  74.33 (0.30)  83.0   81.75  82.25 (0.50)
Custom Flair Embeddings                       248.6 MB    75.36  75.45  75.40 (0.46)  83.75  83.25  83.25 (0.50)
Default Word + Def Flair Embeddings           1.6 GB      73.20  72.65  72.92 (1.17)  81.8   80.8   81.6 (0.54)
Custom Word + Custom Flair Embeddings         5.0 GB      75.45  75.09  75.26 (0.56)  84.2   83.4   83.8 (0.83)
Cust. Word+Cust. Flair+Cust. Pool Embeddings  5.9 GB      76.12  75.94  76.02 (0.64)  84.8   84.2   84.4 (0.54)
In the following, we present our results in evaluat-
ing the models introduced in the previous section. To
merge the overlapping documents, we prioritized the
number of documents each student annotated (more
documents, higher priority) to achieve a better con-
sistency across the dataset. Then, we used JCoRe
[20] for tokenization and sentence splitting, and re-
moved annotations crossing sentence boundaries. We
also removed nested annotations and retained only the
longest span. In cases where tokens were assigned mul-
tiple concept annotations, we favored the one with the
higher occurrence.
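The removal of nested annotations can be sketched as follows. This is a minimal illustration and not the preprocessing code used for the corpus: the `(start, end, label)` tuple layout with an exclusive end offset and the function name `drop_nested` are assumptions, and the frequency-based choice between competing labels on the same tokens is not modelled.

```python
def drop_nested(spans):
    """Remove spans that lie completely inside a longer (or equally long) kept span.

    Each span is a (start, end, label) tuple with an exclusive end offset.
    """
    # Visit longer spans first so that they win over the spans nested in them.
    by_length = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    for start, end, label in by_length:
        nested = any(k_start <= start and end <= k_end for k_start, k_end, _ in kept)
        if not nested:
            kept.append((start, end, label))
    return sorted(kept)


# Hypothetical annotations on one sentence.
cleaned = drop_nested([
    (0, 5, "Body_part"),
    (0, 10, "Medical_condition"),
    (12, 15, "Medication"),
])
```

Spans that only partially overlap are both kept in this sketch, since the text above only describes the treatment of nested annotations.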
We used accuracy to report results for POS tagging,
and precision, recall, and micro-averaged F1 score for
concept detection and relation extraction. The con-
cept detection was evaluated by using strict and le-
nient matching. Strict matching considers a predicted
concept to be a true positive only if its offsets exactly
match the ground truth, whereas for lenient matching
it is sufficient for tags to overlap, similarly as defined
in Henry et al. [30].
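The difference between the two matching modes can be made concrete with a small sketch. It assumes spans given as `(start, end, label)` tuples with exclusive end offsets and assumes that a lenient match still requires identical labels. The function names and the toy spans are illustrative and not taken from the evaluation scripts.

```python
def overlaps(a, b):
    """True if the character ranges of two spans share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def matches(span, other, lenient):
    """Same label plus identical offsets (strict) or overlapping offsets (lenient)."""
    if span[2] != other[2]:
        return False
    if lenient:
        return overlaps(span, other)
    return span[:2] == other[:2]


def precision_recall_f1(gold, predicted, lenient=False):
    """Precision, recall and F1 of predicted spans against gold spans."""
    pred_hits = sum(any(matches(p, g, lenient) for g in gold) for p in predicted)
    gold_hits = sum(any(matches(g, p, lenient) for p in predicted) for g in gold)
    precision = pred_hits / len(predicted) if predicted else 0.0
    recall = gold_hits / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = [(0, 5, "A"), (10, 15, "B")]
predicted = [(0, 5, "A"), (11, 15, "B"), (20, 22, "A")]
strict_scores = precision_recall_f1(gold, predicted, lenient=False)
lenient_scores = precision_recall_f1(gold, predicted, lenient=True)
```

In the toy example, the second prediction misses the gold start offset by one character. It counts as an error under strict matching and as a hit under lenient matching, which is why lenient scores can never be lower than strict ones.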
We evaluated all models using a 5-fold cross vali-
dation with a training, development, and test split of
75/10/15. The reported results only include part-of-
speech tagging, concept detection, and relation extrac-
tion, because dependency tree parsing and attribute
(negation) detection results were partly already pre-
sented in previous work and tools made available (see
Kara et al. [25] and Cotik et al. [24]).
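The text does not state how the five splits were drawn. One possible way to obtain five 75/10/15 splits with non-overlapping test sets is sketched below. The rotation scheme, the seed and the function name `make_splits` are assumptions for illustration.

```python
import random


def make_splits(doc_ids, n_folds=5, dev_share=0.10, test_share=0.15, seed=13):
    """Build `n_folds` train/dev/test splits by rotating a shuffled document list."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_test = round(n * test_share)
    n_dev = round(n * dev_share)
    splits = []
    for fold in range(n_folds):
        # Shift the start of the list so that each fold tests on different documents.
        shift = fold * n // n_folds
        rotated = ids[shift:] + ids[:shift]
        splits.append({
            "test": rotated[:n_test],
            "dev": rotated[n_test:n_test + n_dev],
            "train": rotated[n_test + n_dev:],
        })
    return splits


splits = make_splits(range(100))
```

Because each test window covers 15% of the documents while the rotation step is 20%, the five test sets are disjoint, but they do not cover the whole corpus.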
Part-of-speech Tagging
Table 7 presents the results of part-of-speech tagging.
The first row shows the performance of the POS tagger
trained only on Nephro Gold, and the second row the
performance of the model trained on the training and
development data extended with HDT. The columns compare
the default word and Flair embeddings, the custom
embeddings, and their combinations. The size of each
model is given in the bottom row.
The table shows that combining Nephro Gold with
the much larger HDT data strongly increases accuracy
in all setups. Custom word embeddings always outperform
the default word embeddings, although the gain is much
smaller when HDT is included. Moreover, Flair embeddings
outperform their word embedding equivalents in all cases.
Interestingly, the default Flair embeddings outperform
the (fine-tuned) custom Flair embeddings. Finally, the
combination of word and Flair embeddings tends to
outperform the single embeddings, except for the custom
setup with HDT data. The best result is achieved with HDT
and the combination of default word and Flair embeddings.
The presented results indicate the custom embed-
dings are not necessary for the POS use case. In ad-
dition to that, regarding the model size, the default
Flair embedding using HDT is our favored setup, as
the performance is only 0.1 below the best performing
system, but the model is much smaller.
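The trade-off between accuracy and model size can be written down as a simple selection rule: among all setups whose score lies within a tolerance of the best score, take the smallest model. The sketch below applies this rule to the "Nephro Gold & HDT" row of Table 7. The function name, the tolerance of 0.15 points and the conversion of GB to MB with a factor of 1000 are assumptions for illustration.

```python
def pick_model(candidates, tolerance):
    """Return the smallest candidate whose score is within `tolerance` of the best.

    Each candidate is a (name, score, size_in_mb) tuple.
    """
    best = max(score for _, score, _ in candidates)
    close_enough = [c for c in candidates if best - c[1] <= tolerance]
    # Prefer the smaller model; break ties in size by the higher score.
    return min(close_enough, key=lambda c: (c[2], -c[1]))


# Accuracy and model size from Table 7 (Nephro Gold & HDT), sizes converted to MB.
table7_hdt = [
    ("Def. Word", 97.07, 1300.0),
    ("Custom Word", 97.29, 4700.0),
    ("Def. Flair", 98.47, 248.7),
    ("Custom Flair", 98.37, 248.7),
    ("Def. Word+Flair", 98.57, 1600.0),
    ("Cust. Word+Flair", 97.96, 5000.0),
]
chosen = pick_model(table7_hdt, tolerance=0.15)
```

With these values the rule selects the default Flair setup, in line with the preference stated above.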
Concept Detection
The results of the concept detection are presented in
Table 8 and show the average micro F1 scores over the
cross validation for the different setups. The evaluation
was carried out using strict and lenient matching
as described above.
The results show that the lenient score is higher in
all cases, which is no surprise as lenient matching
leaves more flexibility. Moreover, we can see that the
custom word embeddings outperform the default fastText
embeddings. The Flair embeddings on the other
Table 9 Relation Extraction: Average micro F1 score with standard deviation (in brackets) over 5-fold cross validation
Embeddings                                                          Model Size  Prec.  Rec.   F1
Def. Word Embeddings + Relative Offsets (Nguyen and Grishman [61])  2.9 GB      63.0   74.0   68.0 (0.008)
Custom Word Embeddings + Relative Offsets                           4.7 GB      61.0   76.0   68.0 (0.007)
Def. Word + Concept Embeddings + Relative Offsets                   2.9 GB      80.0   87.0   83.0 (0.01)
Custom Word + Concept Embeddings + Relative Offsets                 4.7 GB      79.0   89.0   84.0 (0.003)
Table 10 Fine-grained lenient score of (first run) concepts using
custom Flair embeddings model, according to precision, recall and
F1, sorted by frequency (#). Results are presented in comparison
to IAA (micro avg. F1).
Concept Name            Prec.  Recall  F1     #      IAA
Micro F1-Score          84.0   82.0    83.0
Macro F1-Score          81.0   77.0    79.0
Medical condition       88.0   92.0    90.0   8953   87.39
Measurement             77.0   83.0    80.0   5429   62.13
Body part               82.0   85.0    83.0   3410   74.20
Treatment               81.0   82.0    82.0   4379   79.24
DiagLab Procedure       87.0   65.0    75.0   3209   66.54
State of health         94.0   87.0    90.0   4025   83.32
Process                 86.0   67.0    76.0   2716   79.41
Medication              89.0   90.0    90.0   3169   93.83
Time information        89.0   73.0    80.0   3103   46.75
Local specification     84.0   79.0    82.0   1716   63.83
Biological chemistry    60.0   94.0    73.0   1363   71.13
Biological parameter    68.0   66.0    67.0   966    60.22
Dosing                  93.0   85.0    88.0   1203   74.78
Person                  91.0   97.0    94.0   1265   85.26
Medical specification   40.0   31.0    35.0   850    38.68
Medical device          74.0   55.0    63.0   370    89.98
Body Fluid              91.0   78.0    84.0   164    70.09
Table 11 Fine-grained score (first run) of relations using “custom
word + concept embedding + relative offset” model, according to
precision, recall and F1, sorted by frequency (#). Results are
presented in comparison to IAA (micro avg. F1).
Relation Name       Prec.  Recall  F1     #      IAA
Micro F1-Score      0.77   0.92    0.84
Macro F1-Score      0.74   0.91    0.82
rel-has-measure     0.77   0.95    0.85   3810   62.15
rel-hasState        0.83   0.92    0.87   2860   76.58
rel-has-time-info   0.75   0.93    0.83   2302   41.24
rel-is-located      0.75   0.84    0.79   2160   56.38
rel-involves        0.86   0.95    0.90   2015   85.79
rel-shows           0.50   0.77    0.61   1192   65.87
rel-has-dosing      0.85   0.97    0.90   1156   84.97
rel-is-specified    0.78   0.99    0.87   628    39.19
rel-examines        0.60   0.89    0.72   381    57.70
hand outperform the specialized custom word embeddings
while keeping the model size below 250 MB, whereas the
pure word embedding approaches always exceed 1 GB. The
best performing model overall combines custom word,
Flair and Pool embeddings, but it is also the largest
model at nearly 6 GB.
Table 10 shows the results of the custom Flair em-
beddings model on concept level. The table also shows
the overall (including training and development) fre-
quency of each concept in the dataset. As we can see,
the distribution of the concepts is very unbalanced.
Often multi-class approaches have problems dealing
with unbalanced data. In our case the classifier can
deal relatively well with the situation. Note, we also
trained single classifiers for each concept, which led to
marginal improvements in most cases. On the other
hand this approach is not feasible for a real use-case,
as each model needs to be loaded into the working
memory, for only a slight overall improvement.
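With such an unbalanced distribution, micro- and macro-averaged F1 can differ noticeably: micro averaging pools the counts of all classes, so frequent concepts dominate, whereas macro averaging gives every concept the same weight. The sketch below uses invented counts for two hypothetical classes to show the effect. It does not reproduce the numbers of Table 10.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score expressed directly in true positives, false positives, false negatives."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def micro_macro_f1(counts):
    """`counts` maps a class name to a (tp, fp, fn) tuple."""
    macro = sum(f1_from_counts(*c) for c in counts.values()) / len(counts)
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1_from_counts(tp, fp, fn)
    return micro, macro


# Invented counts: one frequent, well-recognized class and one rare, hard class.
toy_counts = {
    "frequent_concept": (90, 10, 10),
    "rare_concept": (1, 1, 3),
}
micro_f1, macro_f1 = micro_macro_f1(toy_counts)
```

In this toy setting the micro average stays close to the score of the frequent class, while the macro average is pulled down by the rare one.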
Moreover, Table 10 presents the micro-averaged F1 IAA
scores of the annotators. In most cases the IAA is
below the score of the classifier; only in some cases,
such as Medication, is the IAA higher. For Medical
specification, for instance, the IAA is very low, which
is also reflected in the results of the machine
learning model.
Relation Extraction
Table 9 shows the relation extraction results. The
baseline model with default word embeddings (Nguyen and
Grishman [61]) yields an F1 score of 68.0. Using the
custom word embeddings does not lead to any improvement
but increases the model size from 2.9 GB to 4.7 GB.
Supplementing the two models with concept embeddings
considerably increases performance to 83.0 and 84.0 F1
for default and custom word embeddings, respectively.
This supports our hypothesis that concept information
is beneficial to relation extraction from clinical text, as
the context lacks important linguistic information.
Table 11 shows the detailed results (first run) of the
default word + concept embeddings + relative offsets
model, including the overall frequency of the different
relations. Similarly to the concepts, the distribution of
the relations is unbalanced. However, all relations can
be detected very well, often with an F1 above 0.8.
Moreover, Table 11 presents the micro-averaged F1 IAA
scores of the annotators. As in Table 10, the IAA tends
to be below the results of the machine learning model.
Notably, while the relation rel-is-specified achieves
quite good results with an F1 score of 87, the IAA is
only about 39. It is likely that the disagreements
between the annotators regarding Medical specification
have a strong influence on that.
In addition, more than 50% of the data was annotated
by one particular annotator. This might have a strong
influence on the overall annotations, and the model
was probably able to learn this annotation style very
well, which is reflected in the performance of that
relation.
Medical Text Processing Workbench
Given the previous experiments we select the best
models according to score and model size. Generally
we prefer a small sized model with a score slightly
below the best performing system. Moreover, as the
evaluation was carried out within a 5-fold cross valida-
tion, resulting in five different models, we always pick
the first model for the workbench. Overall the follow-
ing models will be used: POS Tagger (Default Flair,
run 1), concept detection (Custom Flair, run 1) and
relation extraction (Default Word+Concept+Relative
Offset, run 1).
The description of the workbench, to run the
models out of the box, can be found on http:// The models themselves can be
downloaded here:
mEx-Docker-Deployment. Note, in order to use the
models, a user agreement must be signed first.
Testing Concept Detection on additional Datasets
The experiments above have been conducted on clini-
cal text of the nephrology domain. All documents have
been written within one hospital. Due to the restricted
topic and due to the limited number of authors writing
those reports, the data might be very homogeneous.
Therefore, models trained on this data might not be
suitable for other biomedical/clinical texts
in German.
In order to explore this, we carried out a small proof
of concept. To do so, we tested our concept detection
on two additional biomedical text datasets in German,
namely (a) GGPONC [17], a dataset of clinical prac-
tice guidelines and (b) a set of posts published in a
German health forum, taken from the TLC corpus [18].
In both cases we applied the model to 600 sentences of
each dataset. The automatically generated labels were
then corrected (if needed) by the main annotator of
our nephrology corpus. Please note, we wanted to find
out if our concept detection can work in general also
on different text, given our annotation schema. So the
annotator examined the given labels, but was not too
strict about each single label.
Regarding micro F-score, the agreement between classifier
and annotator was 0.851 on GGPONC and 0.868
on the TLC forum data. These values are surprisingly
good, also in comparison to the original concept
detection results on the nephrology dataset. Correcting
pre-annotated data instead of annotating from scratch
might have influenced the result, as might accepting
‘suggested labels’ that would otherwise not have been
annotated. Overall the outcomes serve
as a proof of concept, rather than a solid evaluation.
However, the results indicate that our models might
be a good first choice to process biomedical text in
German. Data can be shared upon request.
Discussion and Analysis
We carried out experiments with different setups on
three different tasks using clinical text in German.
In most cases, we observed customized word embed-
dings to provide better results compared to their de-
fault counterpart. This performance gain, however,
comes at the expense of increased model size, partic-
ularly for POS tagging and concept detection. More-
over, within the first two experiments we could see that
the models using only Flair embeddings perform quite
well and tend to have a smaller model size. The re-
sults show that character embedding-based approaches
seem to perform well on clinical text. A reason for that
might be the characteristics of some words (medica-
tion names often contain letters like ‘x’ and ‘y’ [62])
and possibly the features of the Greek and Latin origin
of words (e.g. words ending with ‘-itis’ or ‘-iasis’ might
refer to a disease).
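The intuition about shared suffixes can be illustrated with explicit character n-grams. This is a simplification: Flair's character language model does not use explicit n-grams, and the function name, the boundary markers and the example words are assumptions chosen for illustration.

```python
def char_ngrams(word, n=3):
    """Set of character n-grams of a lower-cased word with boundary markers."""
    padded = f"<{word.lower()}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


# Two disease names that differ in their stem but share the suffix '-itis'.
shared = char_ngrams("Nephritis") & char_ngrams("Gastritis")
```

Both words share the n-grams that make up the suffix, including the word-final marker, so a character-based model can relate an unseen disease name to known ones even if the full word never occurred in training.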
For applying and running such text processing mod-
els on the fly in clinical care, smaller models might be
favoured, in order to be not too dependent on compu-
tational power and working memory. This means that
the simple Flair embedding models would probably be
the better choice for a real clinical use case. For
relation extraction, in contrast, we relied on an
implementation which does not integrate Flair or character
embeddings. Therefore the favoured models still have
a size of 2.9 GB. It would make sense to switch to a
more efficient architecture with a smaller model size.
Moreover, we analyzed the predictions of the model.
Generally, the clinical data of our corpus is rather
homogeneous. Even though a large range of health related
problems can be described in the documents, they are
all of the nephrology domain and always of the same
department. Therefore, we frequently observed similar
text patterns for sentences, mentioning similar medical
problems, treatments or medications. Regarding POS
tagging, the fact that the dataset is relatively small,
and contains similar sentences (or often a similar sen-
tence structure) might have been the reason for the
good results.
The concept detection certainly also benefited from
frequently re-occurring concepts. On the other hand
the same words could be annotated differently, depend-
ing on context, but also depending on the annotator.
This particularly made the strict matching more
challenging. Analyzing the falsely predicted concepts, we
found various cases where the classifier attached
a label which was not necessarily wrong. Interestingly,
our concept detection showed very promising results
when applied to two different biomedical datasets in
German.
Conclusion
In this work, we described an annotated corpus of Ger-
man nephrology reports and further created a collec-
tion of German biomedical text to train customized
word embeddings for clinical text. Based on this data
in combination with existing methods and tools, we
created and evaluated a set of German clinical text
processing models: a part-of-speech tagger, a concept
detector, and a relation extractor. To provide resources
for the clinical text processing community, we com-
bined the best performing models into a medical infor-
mation extraction workbench, which we made publicly
available for free use.
This research was supported by the German Research Foundation
(Deutsche Forschungsgemeinschaft, DFG) through the project KEEPHA
(442445488), by the German Federal Ministry of Education and Research
(BMBF) through the projects BIFOLD (01IS18025E), and the European
Union’s Horizon 2020 research and innovation program under grant
agreement No 780495 (BigMedilytics). MM and WD are participants in the
BIH Charité Digital Clinician Scientist Program funded by the Charité
Universitätsmedizin Berlin, and the Berlin Institute of Health at Charité
NLP - Natural Language Processing; NER - Named Entity Recognition; RE
- Relation Extraction; POS - Part-of-Speech Tagging; HDT - Hamburg
Dependency Treebank; IAA - Inter-Annotator Agreement;
Availability of data and materials
The description of the mEx workbench can be found on The mEx models themselves can be downloaded
Ethics approval and consent to participate
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Authors’ contributions
All co-authors are justifiably credited with authorship, according to the
authorship criteria. In detail: RR- coordination, planning, development of
annotation schema, writing, analysis; LR- development of annotation
schema, annotations, evaluation, writing corpus section; AA- technical
development, writing technical section; SM- critical revision of manuscript;
OM- annotations, writing corpus section; MM- annotations, writing corpus
section; CA- technical development, writing technical section; DS- technical
support, data preparation; FH- development of annotation schema; MN-
discussions, editing, revision of the manuscript; WD- discussions, editing,
revision of the manuscript; KB- planning, critical revision of manuscript. All
authors read and approved the final manuscript.
Author details
1Speech and Language Technology, German Research Center for Artificial
Intelligence (DFKI), Berlin, Germany. 2Department of Nephrology and
Medical Intensive Care, Charité Universitätsmedizin Berlin, Berlin,
Germany. 3Business Unit IT, Charité Universitätsmedizin Berlin, Berlin,
Germany. 4Participant in the digital clinician scientist programme, Berlin
Institute of Health, Berlin, Germany.
1. Savova, G.K., Masanz, J.J., Ogren, P.V., Zheng, J., Sohn, S.,
Kipper-Schuler, K.C., Chute, C.G.: Mayo clinical text analysis and
knowledge extraction system (ctakes): architecture, component
evaluation and applications. Journal of the American Medical
Informatics Association 17(5), 507–513 (2010)
2. Aronson, A.R., Lang, F.-M.: An overview of metamap: historical
perspective and recent advances. Journal of the American Medical
Informatics Association 17(3), 229–236 (2010)
3. Friedman, C., Alderson, P.O., Austin, J.H., Cimino, J.J., Johnson,
S.B.: A general natural-language text processor for clinical radiology.
Journal of the American Medical Informatics Association 1(2),
161–174 (1994)
4. Xu, H., Stenner, S.P., Doan, S., Johnson, K.B., Waitman, L.R.,
Denny, J.C.: Medex: a medication information extraction system for
clinical narratives. Journal of the American Medical Informatics
Association 17(1), 19–24 (2010)
5. Sohn, S., Wagholikar, K.B., Li, D., Jonnalagadda, S.R., Tao, C.,
Komandur Elayavilli, R., Liu, H.: Comprehensive temporal information
detection from clinical text: medical events, time, and tlink
identification. Journal of the American Medical Informatics Association
20(5), 836–842 (2013)
6. Wang, Y., Wang, L., Rastegar-Mojarad, M., Moon, S., Shen, F., Afzal,
N., Liu, S., Zeng, Y., Mehrabi, S., Sohn, S., Liu, H.: Clinical
information extraction applications: A literature review. Journal of
Biomedical Informatics 77, 34–49 (2018)
7. Uzuner, Ö., Solti, I., Cadag, E.: Extracting medication information
from clinical text. Journal of the American Medical Informatics
Association 17(5), 514–518 (2010)
8. Stubbs, A., Kotfila, C., Xu, H., Uzuner, Ö.: Identifying risk factors for
heart disease over time: Overview of 2014 i2b2/uthealth shared task
track 2. Journal of biomedical informatics 58, 67–77 (2015)
9. Suominen, H., Salanterä, S., Velupillai, S., Chapman, W.W., Savova,
G., Elhadad, N., Pradhan, S., South, B.R., Mowery, D.L., Jones, G.J.,
et al.: Overview of the share/clef ehealth evaluation lab 2013. In:
International Conference of the Cross-Language Evaluation Forum for
European Languages, pp. 212–231 (2013). Springer
10. Kelly, L., Goeuriot, L., Suominen, H., Schreck, T., Leroy, G., Mowery,
D.L., Velupillai, S., Chapman, W.W., Martinez, D., Zuccon, G., et al.:
Overview of the share/clef ehealth evaluation lab 2014. In:
International Conference of the Cross-Language Evaluation Forum for
European Languages, pp. 172–191 (2014). Springer
11. Pradhan, S., Elhadad, N., Chapman, W., Manandhar, S., Savova, G.:
Semeval-2014 task 7: Analysis of clinical text. In: Proceedings of the
8th International Workshop on Semantic Evaluation (SemEval 2014),
pp. 54–62 (2014)
12. Bethard, S., Savova, G., Chen, W.-T., Derczynski, L., Pustejovsky, J.,
Verhagen, M.: Semeval-2016 task 12: Clinical tempeval. In:
Proceedings of the 10th International Workshop on Semantic
Evaluation (SemEval-2016), pp. 1052–1062 (2016)
13. Névéol, A., Robert, A., Grippo, F., Morgand, C., Orsi, C., Pelikan, L.,
Ramadier, L., Rey, G., Zweigenbaum, P.: Clef ehealth 2018 multilingual
information extraction task overview: Icd10 coding of death certificates
in french, hungarian and italian. In: CLEF (Working Notes) (2018)
14. Jiménez-Zafra, S.M., Díaz, N.P.C., Morante, R., Martín-Valdivia,
M.T.: Neges 2018: Workshop on negation in spanish. Procesamiento
del Lenguaje Natural 62, 21–28 (2019)
15. Névéol, A., Dalianis, H., Velupillai, S., Savova, G., Zweigenbaum, P.:
Clinical natural language processing in languages other than english:
opportunities and challenges. Journal of biomedical semantics 9(1), 12
16. Dörendahl, A., Leich, N., Hummel, B., Schönfelder, G., Grune, B.:
Overview of the clef ehealth 2019 multilingual information extraction
17. Borchert, F., Lohr, C., Modersohn, L., Langer, T., Follmann, M.,
Sachs, J.P., Hahn, U., Schapranow, M.-P.: Ggponc: A corpus of
german medical text with rich metadata based on clinical practice
guidelines. In: Proceedings of the 11th International Workshop on
Health Text Mining and Information Analysis, pp. 38–48 (2020)
18. Seiffe, L., Marten, O., Mikhailov, M., Schmeier, S., Möller, S., Roller,
R.: From witch’s shot to music making bones-resources for medical
laymen to technical language and vice versa. In: Proceedings of the
12th Language Resources and Evaluation Conference, pp. 6185–6192
19. Lohr, C., Buechel, S., Hahn, U.: Sharing copies of synthetic clinical
corpora without physical distribution—a case study to get around iprs
and privacy constraints featuring the german jsyncc corpus. In:
Proceedings of the Eleventh International Conference on Language
Resources and Evaluation (LREC 2018) (2018)
20. Hellrich, J., Matthies, F., Faessler, E., Hahn, U.: Sharing Models and
Tools for Processing German Clinical Texts. MIE 2015 - Digital
Healthcare Empowering Europeans 210, 734–738 (2015)
21. Wermter, J., Hahn, U.: Really, is medical sublanguage that different?
experimental counter-evidence from tagging medical and newspaper
corpora. In: Medinfo, pp. 560–564 (2004)
22. Chapman, W.W., Bridewell, W., Hanbury, P., Cooper, G.F., Buchanan,
B.G.: A simple algorithm for identifying negated findings and diseases
in discharge summaries. Journal of Biomedical Informatics 34(5),
301–310 (2001)
23. Chapman, W.W., Hilert, D., Velupillai, S., Kvist, M., Skeppstedt, M.,
Chapman, B.E., Conway, M., Tharp, M., Mowery, D.L., Deleger, L.:
Extending the negex lexicon for multiple languages. Studies in health
technology and informatics 192, 677 (2013)
24. Cotik, V., Roller, R., Xu, F., Uszkoreit, H., Budde, K., Schmidt, D.:
Negation detection in clinical reports written in German. In:
Proceedings of the Fifth Workshop on Building and Evaluating
Resources for Biomedical Text Mining (BioTxtM2016), pp. 115–124.
The COLING 2016 Organizing Committee, Osaka, Japan (2016)
25. Kara, E., Zeen, T., Gabryszak, A., Budde, K., Schmidt, D., Roller, R.:
A Domain-adapted Dependency Parser for German Clinical Text. In:
Proceedings of the 14th Conference on Natural Language Processing
(KONVENS 2018), Vienna, Austria (2018)
26. Oleynik, M., Kreuzthaler, M., Schulz, S.: Unsupervised abbreviation
expansion in clinical narratives. MedInfo 245, 539–543 (2017)
27. Lohr, C., Eder, E., Hahn, U.: Pseudonymization of PHI items in
German clinical reports. In: Public Health and Informatics (2021)
28. Kittner, M., Lamping, M., Rieke, D.T., Götze, J., Bajwa, B., Jelas, I.,
Rüter, G., Hautow, H., Sänger, M., Habibi, M., et al.: Annotation and
initial evaluation of a large annotated german oncological corpus.
JAMIA open 4(2), 025 (2021)
29. Frei, J., Kramer, F.: Gernermed: An open german medical ner model.
Software Impacts 11, 100212 (2022)
30. Henry, S., Buchan, K., Filannino, M., Stubbs, A., Uzuner, O.: 2018
n2c2 shared task on adverse drug events and medication extraction in
electronic health records. Journal of the American Medical Informatics
Association 27(1), 3–12 (2020)
31. Shrestha, M.: Development of a language model for medical domain.
PhD thesis, Hochschule Rhein-Waal (2021)
32. German-MedBERT on Hugging Face. Accessed:
33. Seuss, H., Dankerl, P., Ihle, M., Grandjean, A., Hammon, R., Kaestle,
N., Fasching, P.A., Maier, C., Christoph, J., Sedlmayr, M., et al.:
Semi-automated de-identification of german content sensitive reports
for big data analytics. In: RöFo-Fortschritte auf dem Gebiet der
Röntgenstrahlen und der Bildgebenden Verfahren, vol. 189, pp.
661–671 (2017). ©Georg Thieme Verlag KG
34. Beckers Abkürzungslexikon Medizinischer Begriffe.
Accessed: 2021-01-10
35. Roller, R., Uszkoreit, H., Xu, F., Seiffe, L., Mikhailov, M., Staeck, O.,
Budde, K., Halleck, F., Schmidt, D.: A fine-grained corpus annotation
schema of german nephrology records. In: Proceedings of the Clinical
Natural Language Processing Workshop (ClinicalNLP), pp. 69–77. The
COLING 2016 Organizing Committee, Osaka, Japan (2016)
36. Stenetorp, P., Pyysalo, S., Topić, G., Ohta, T., Ananiadou, S., Tsujii,
J.: brat: a web-based tool for NLP-assisted text annotation. In:
Proceedings of the Demonstrations Session at EACL 2012. Association
for Computational Linguistics, Avignon, France (2012)
37. Hripcsak, G., Rothschild, A.S.: Agreement, the f-measure, and
reliability in information retrieval. Journal of the American medical
informatics association 12(3), 296–298 (2005)
38. Seiffe, L.: Linguistic Modeling for Text Analytic Tasks for German
Clinical Texts. Master’s thesis, TU Berlin (2018)
39. Schiller, A., Teufel, S., Thielen, C.: Guidelines für das Tagging
deutscher Textcorpora mit STTS. Universitäten Stuttgart und
Tübingen (1999)
40. Nivre, J., De Marneffe, M.-C., Ginter, F., Goldberg, Y., Hajic, J.,
Manning, C.D., McDonald, R., Petrov, S., Pyysalo, S., Silveira, N., et
al.: Universal dependencies v1: A multilingual treebank collection. In:
Proceedings of the Tenth International Conference on Language
Resources and Evaluation (LREC’16), pp. 1659–1666 (2016)
41. Foth, K.A., Köhn, A., Beuck, N., Menzel, W.: Because size does
matter: The hamburg dependency treebank. In: Chair), N.C.C.,
Choukri, K., Declerck, T., Loftsson, H., Maegaard, B., Mariani, J.,
Moreno, A., Odijk, J., Piperidis, S. (eds.) Proceedings of the Ninth
International Conference on Language Resources and Evaluation
(LREC’14). European Language Resources Association (ELRA),
Reykjavik, Iceland (2014)
42. Med1 Forum. Accessed: 2021-01-10; Note,
the Med1 forum does not exist anymore, see med2 instead.
43. Deutsches Medizin Forum.
Accessed: 2021-01-10
44. Spiegel. Accessed:
45. Aerzte-Blatt. Accessed: 2021-01-10
46. NetDoktor. Accessed: 2021-01-10
47. Onmeda. Accessed: 2021-01-10
48. German PubMed Abstracts.
Accessed: 2021-01-10
49. eDocTrainer. Accessed: 2021-01-10
50. Siewert, J.R., Stein, H.J., Allgöwer, M.: Chirurgie: mit integriertem
fallquiz - 40 alle nach neuer ao. Springer (2009)
51. Poeck, K., Hacke, W.: Neurologie. Springer (2006)
52. Hautmann, R., Gschwend, J.E.: Urologie. Springer (2014)
53. Walter, P., Plange, N.: Basiswissen Augenheilkunde. Springer, ???
54. Lenarz, T., Boenninghaus, H.G.: Hals-nasen-ohren-heilkunde. Springer
55. Ziegenfuß, T.: Notfallmedizin. Springer (2016)
56. Goebeler, M., Hamm, H.: Basiswissen dermatologie. Springer (2017)
57. Arolt, V., Reimer, C., Dilling, H.: Basiswissen psychiatrie und
psychotherapie. Springer (2011)
58. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word
vectors with subword information. Transactions of the association for
computational linguistics 5, 135–146 (2017)
59. Akbik, A., Blythe, D., Vollgraf, R.: Contextual string embeddings for
sequence labeling. In: Proceedings of the 27th International Conference
on Computational Linguistics, pp. 1638–1649. Association for
Computational Linguistics, Santa Fe, New Mexico, USA (2018)
60. Akbik, A., Bergmann, T., Vollgraf, R.: Pooled contextualized
embeddings for named entity recognition. In: Proceedings of the 2019
Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, Volume 1
(Long and Short Papers), pp. 724–728. Association for Computational
Linguistics, Minneapolis, Minnesota (2019)
61. Nguyen, T.H., Grishman, R.: Relation extraction: Perspective from
convolutional neural networks. In: Proceedings of the 1st Workshop on
Vector Space Modeling for Natural Language Processing, pp. 39–48
62. Wick, J.Y.: What’s in a drug name?: A rose might smell as sweet by
any other name, but the process of naming the growing number of
medications has become quite complex and serious. Journal of the
American Pharmacists Association 44(1), 12–14 (2004)
ResearchGate has not been able to resolve any citations for this publication.
Full-text available
Recent advancements in natural language processing (NLP) have been achieved by the use of increasingly complex neural networks. In clinical context, NLP is a key technique to access highly relevant information from unstructured texts such as clinical notes. We evaluate the feasibility of training our neural model GERNERMED on annotated German training data generated by automated translation from a public English dataset. The work guides other researchers about the use of machine-translation methods for dataset acquisition. Due to the public origin of the dataset, our trained software can be used by fellow researchers without any legal access restrictions.
Full-text available
We describe the adaptation of a non-clinical pseudonymization system, originally developed for a German email corpus, for clinical use. This tool replaces previously identified Protected Health Information (PHI) items as carriers of privacy-sensitive information (original names for people, organizations, places, etc.) with semantic type-conformant, yet, fictitious surrogates. We evaluate the generated substitutes for grammatical correctness, semantic and medical plausibility and find particularly low numbers of error instances (less than 1%) on all of these dimensions.
Full-text available
Objective We present the Berlin-Tübingen-Oncology corpus (BRONCO), a large and freely available corpus of shuffled sentences from German oncological discharge summaries annotated with diagnosis, treatments, medications, and further attributes including negation and speculation. The aim of BRONCO is to foster reproducible and openly available research on Information Extraction from German medical texts. Materials and Methods BRONCO consists of 200 manually deidentified discharge summaries of cancer patients. Annotation followed a structured and quality-controlled process involving 2 groups of medical experts to ensure consistency, comprehensiveness, and high quality of annotations. We present results of several state-of-the-art techniques for different IE tasks as baselines for subsequent research. Results The annotated corpus consists of 11 434 sentences and 89 942 tokens, annotated with 11 124 annotations for medical entities and 3118 annotations of related attributes. We publish 75% of the corpus as a set of shuffled sentences, and keep 25% as held-out data set for unbiased evaluation of future IE tools. On this held-out dataset, our baselines reach depending on the specific entity types F1-scores of 0.72–0.90 for named entity recognition, 0.10–0.68 for entity normalization, 0.55 for negation detection, and 0.33 for speculation detection. Discussion Medical corpus annotation is a complex and time-consuming task. This makes sharing of such resources even more important. Conclusion To our knowledge, BRONCO is the first sizable and freely available German medical corpus. Our baseline results show that more research efforts are necessary to lift the quality of information extraction in German medical texts to the level already possible for English.
Many people share information on social media or in forums, such as the food they eat, the sports activities they do, or the events they have attended. This also applies to information about a person's health status. The information we share online reveals, directly or indirectly, details about our lifestyle and health situation and thus provides a valuable data resource. If we can take advantage of that data, applications can be created that enable, e.g., the detection of possible risk factors of diseases or adverse drug reactions of medications. However, as most people are not medical experts, the language used might be more descriptive than the precise medical expressions that medical professionals use. To detect and use this relevant information, laymen language has to be translated and/or linked to the corresponding medical concept. This work presents baseline data sources in order to address this challenge for German. We introduce a new data set which annotates medical laymen and technical expressions in a patient forum, along with a set of medical synonyms and definitions, and present first baseline results on the data.
Recent advances in language modeling using recurrent neural networks have made it viable to model language as distributions over characters. By learning to predict the next character on the basis of previous characters, such models have been shown to automatically internalize linguistic concepts such as words, sentences, subclauses and even sentiment. In this paper, we propose to leverage the internal states of a trained character language model to produce a novel type of word embedding which we refer to as contextual string embeddings. Our proposed embeddings have the distinct properties that they (a) are trained without any explicit notion of words and thus fundamentally model words as sequences of characters, and (b) are contextualized by their surrounding text, meaning that the same word will have different embeddings depending on its contextual use. We conduct a comparative evaluation against previous embeddings and find that our embeddings are highly useful for downstream tasks: across four classic sequence labeling tasks we consistently outperform the previous state-of-the-art. In particular, we significantly outperform previous work on English and German named entity recognition (NER), allowing us to report new state-of-the-art F1-scores on the CoNLL03 shared task. We release all code and pre-trained language models in a simple-to-use framework to the research community, to enable reproduction of these experiments and application of our proposed embeddings to other tasks:
In this work, we present a syntactic parser specialized for German clinical data. Our model, trained on a small gold standard nephrological dataset, outperforms the default German model of Stanford CoreNLP in parsing nephrology documents with respect to LAS (74.64 vs. 42.15). Moreover, retraining the default model via domain adaptation to nephrology leads to further improvements on nephrology data (78.96). We also show that our model performs well on fictitious clinical data from other subdomains (69.69).
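For readers unfamiliar with the metric, the labeled attachment score (LAS) quoted in this abstract is the percentage of tokens for which both the predicted head and the predicted dependency label match the gold standard. The following is a minimal sketch of that definition, not the evaluation code used by the authors; the `(head index, label)` tuple representation and the function name `labeled_attachment_score` are assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: one (head index, dependency label) pair per token.
Arc = Tuple[int, str]


def labeled_attachment_score(gold: List[Arc], pred: List[Arc]) -> float:
    """Return LAS in percent: share of tokens with correct head AND label."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted parses must cover the same tokens")
    if not gold:
        return 0.0
    # A token counts as correct only if head and label both match.
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    return 100.0 * correct / len(gold)


# Toy four-token sentence: the last token has the right head but the wrong label.
gold = [(2, "det"), (3, "nsubj"), (0, "root"), (3, "obj")]
pred = [(2, "det"), (3, "nsubj"), (0, "root"), (3, "obl")]
las = labeled_attachment_score(gold, pred)
```

In the toy sentence three of the four tokens are fully correct, so the score is 75.0. Real evaluations usually aggregate over all tokens of a test set and often exclude punctuation, which this sketch does not do.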
Background: Natural language processing applied to clinical text or aimed at a clinical outcome has been thriving in recent years. This paper offers the first broad overview of clinical Natural Language Processing (NLP) for languages other than English. Recent studies are summarized to offer insights and outline opportunities in this area. Main body: We envision three groups of intended readers: (1) NLP researchers leveraging experience gained in other languages, (2) NLP researchers faced with establishing clinical text processing in a language other than English, and (3) clinical informatics researchers and practitioners looking for resources in their languages in order to apply NLP techniques and tools to clinical practice and/or investigation. We review work in clinical NLP in languages other than English. We classify these studies into three groups: (i) studies describing the development of new NLP systems or components de novo, (ii) studies describing the adaptation of NLP architectures developed for English to another language, and (iii) studies focusing on a particular clinical application. Conclusion: We show the advantages and drawbacks of each method, and highlight the appropriate application context. Finally, we identify major challenges and opportunities that will affect the impact of NLP on clinical practice and public health studies in a context that encompasses English as well as other languages.
The lack of publicly accessible text corpora is a major obstacle for progress in natural language processing. For medical applications, unfortunately, all language communities other than English are low-resourced. In this work, we present GGPONC (German Guideline Program in Oncology NLP Corpus), a freely distributable German language corpus based on clinical practice guidelines for oncology. This corpus is one of the largest ever built from German medical documents. Unlike clinical documents, clinical guidelines do not contain any patient-related information and can therefore be used without data protection restrictions. Moreover, GGPONC is the first corpus for the German language covering diverse conditions in a large medical subfield and provides a variety of metadata, such as literature references and evidence levels. By applying and evaluating existing medical information extraction pipelines for German text, we are able to draw comparisons for the use of medical language to other corpora, medical and non-medical ones.
Objective: This article summarizes the preparation, organization, evaluation, and results of Track 2 of the 2018 National NLP Clinical Challenges shared task. Track 2 focused on extraction of adverse drug events (ADEs) from clinical records and evaluated 3 tasks: concept extraction, relation classification, and end-to-end systems. We perform an analysis of the results to identify the state of the art in these tasks, learn from it, and build on it. Materials and methods: For all tasks, teams were given raw text of narrative discharge summaries, and in all the tasks, participants proposed deep learning-based methods with hand-designed features. In the concept extraction task, participants used sequence labelling models (bidirectional long short-term memory being the most popular), whereas in the relation classification task, they also experimented with instance-based classifiers (namely support vector machines and rules). Ensemble methods were also popular. Results: A total of 28 teams participated in task 1, with 21 teams in tasks 2 and 3. The best performing systems set a high performance bar with F1 scores of 0.9418 for concept extraction, 0.9630 for relation classification, and 0.8905 for end-to-end. However, the results were much lower for concepts and relations of Reasons and ADEs. These were often missed because local context is insufficient to identify them. Conclusions: This challenge shows that clinical concept extraction and relation classification systems have a high performance for many concept types, but significant improvement is still required for ADEs and Reasons. Incorporating the larger context or outside knowledge will likely improve the performance of future systems.