GottBERT: a pure German Language Model
Raphael Scheible¹, Fabian Thomczyk¹, Patric Tippmann¹, Victor Jaravine¹ and Martin Boeker¹
{raphael.scheible,thomczyk,tippmann,victor.zharavin,martin.boeker}@imbi.uni-freiburg.de
¹Institute of Medical Biometry and Statistics, Medical Center, Faculty of Medicine, University of Freiburg
Abstract
Lately, pre-trained language models advanced the field of natural language processing (NLP). The introduction of Bidirectional Encoder Representations from Transformers (BERT) and its robustly optimized version RoBERTa had a significant impact and increased the relevance of pre-trained models. Research in this field initially focused on English data, followed by models trained on multilingual text corpora. However, current research shows that multilingual models are inferior to monolingual models. As no German single-language RoBERTa model has been published yet, we introduce one in this work (GottBERT). The German portion of the OSCAR data set was used as text corpus. In an evaluation, we compare its performance on the two Named Entity Recognition (NER) tasks CoNLL 2003 and GermEval 2014 as well as on the text classification tasks GermEval 2018 (fine and coarse) and GNAD with existing German single-language BERT models and two multilingual ones. GottBERT was pre-trained following the original RoBERTa procedure using fairseq. All downstream tasks were trained using hyperparameter presets taken from the benchmark of German BERT, and the experiments were set up utilizing FARM. Performance was measured by the F1 score. GottBERT was successfully pre-trained on a 256-core TPU pod using the RoBERTa BASE architecture. Even without extensive hyperparameter optimization, GottBERT already outperformed all other tested German and multilingual models in all NER tasks and in one text classification task. In order to support the German NLP field, we publish GottBERT under the AGPLv3 license.
1 Introduction
The computation of contextual pre-trained word representations is the current trend of neural networks in natural language processing (NLP). This trend has a long history, starting with non-contextualized word representations, of which the most prominent are word2vec (Goldberg and Levy, 2014), GloVe (Pennington et al., 2014) and fastText (Joulin et al., 2016, 2017; Bojanowski et al., 2017; Mikolov et al., 2018), and evolving to deep contextualized models such as ELMo (Alsentzer et al., 2019) and flair (Akbik et al., 2019). Most recently, the field of NLP experienced remarkable progress through the use of Transformer-based approaches (Vaswani et al., 2017). Especially Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019) impacted the field and was subsequently robustly optimized to RoBERTa (Liu et al., 2019). These Transformer-based approaches use large-scale pre-trained language models. For application, such models are fine-tuned by supervised training on the specific downstream task, leading to performance improvements on many tasks. The language model itself, on the other hand, is computed in an unsupervised manner. Training requires large amounts of raw text and strong hardware such as hundreds of GPUs (Martin et al., 2020) or TPUs (You et al., 2020). Initially, most of the research took place in English, followed by multilingual approaches (Conneau et al., 2019). Although multilingual approaches were trained on large texts in many languages, they were outperformed by single-language models (de Vries et al., 2019; Martin et al., 2020; Le et al., 2020; Delobelle et al., 2020). Single-language models trained on the Open Super-large Crawled ALMAnaCH coRpus (OSCAR) (Ortiz Suárez et al., 2020) showed good performance due to the size and variance of the OSCAR (Martin et al., 2020; Delobelle et al., 2020). Following this ongoing trend, we pre-train the first German RoBERTa single-language model on the German portion of the OSCAR: the German OSCAR text trained BERT (GottBERT). In an evaluation, we compare its performance on the two named entity recognition tasks CoNLL 2003 and GermEval 2014, as well as on the two text classification tasks GermEval 2018 and GNAD, with existing German single-language BERT models and two multilingual ones.
2 Related Work
Most recently, Transformer-based models have widely impacted the field of NLP. From neural translation (Ott et al., 2018; Ng et al., 2019) to generative language models such as GPT2 (Radford et al., 2019), remarkable performance gains were achieved. With BERT, an approach facilitating pre-trained Transformer-based models was introduced. Fine-tuned on downstream tasks, BERT-based approaches improved the performance of several NLP tasks (Devlin et al., 2019; Liu et al., 2019). BERT models, though, were first released as single-language models in English based on 16GB of raw text and as the multilingual model mBERT based on Wikipedia dumps in about 100 languages (Devlin, 2018). These models were followed by single-language models for several languages: BERTje (de Vries et al., 2019) for Dutch, FinBERT (Virtanen et al., 2019) for Finnish, German BERT¹, and a German BERT from the MDZ Digital Library team at the Bavarian State Library, to which we refer as dbmz BERT in this paper². German BERT was trained on 12GB of raw text data based on the German Wikipedia (6GB), the OpenLegalData dump (2.4GB) and news articles (3.6GB). dbmz BERT used as source data a German Wikipedia dump, the EU Bookshop corpus, Open Subtitles, CommonCrawl, ParaCrawl and News Crawl, which sums up to a dataset of 16GB. With the release of RoBERTa, a new standard for raw text size was set, as it was trained on 160GB of raw English text. Further, RoBERTa enhances the original BERT approach by removing segment embeddings and next sentence prediction and by improved hyperparameters. Additionally, instead of using wordpiece (Schuster and Nakajima, 2012) tokenization, RoBERTa utilizes GPT2's byte pair encoding (BPE) (Radford et al., 2019), with the benefit that language-specific tokenizers are not required. Unlike mBERT, the multilingual XLM-RoBERTa (Conneau et al., 2019) was trained on 2.5TB of filtered CommonCrawl data. CamemBERT is a French RoBERTa model that was trained on the OSCAR and uses sentencepiece (Kudo and Richardson, 2018) BPE. Further, its authors pre-trained one model with 4GB of the French OSCAR portion and another model with 4GB of the French Wikipedia. The comparison of these models on downstream tasks shows that high text variance leads to better results. UmBERTo³ is an Italian RoBERTa model designed similarly to CamemBERT. RobBERT, the Dutch single-language RoBERTa, was trained on 39GB of the Dutch portion of the OSCAR and outperformed BERTje. A more recent version of RobBERT showed the performance gains of language-specific BPE compared to the English-based GPT2 BPE on downstream tasks. Most recently, FlauBERT (Le et al., 2020) for French was released, trained on 71GB of data. Its authors cleaned a 270GB corpus of mixed sources by filtering out meaningless content and by Unicode normalization. Data was pre-tokenized by moses (Koehn et al., 2007) and encoded by fastBPE⁴, which is an implementation of Sennrich et al. (2016). Following the approach of utilizing the OSCAR, we computed the German OSCAR text trained BERT (GottBERT). However, the drawback of BERT approaches is their computational power requirement. Multiple GPUs or TPUs were used for pre-training. All listed RoBERTa-based models were computed on GPUs, whereas GottBERT is the first published RoBERTa model pre-trained on TPUs.
¹ https://deepset.ai/german-bert
² https://github.com/dbmdz/berts#german-bert
³ https://github.com/musixmatchresearch/umberto
⁴ https://github.com/glample/fastBPE
3 GottBERT
GottBERT is based on the robustly optimized BERT architecture RoBERTa and was pre-trained on the German portion of the OSCAR using the fairseq framework (Ott et al., 2019).
Training Data
The GottBERT model is trained on the German portion of the OSCAR, a recently published large multilingual text corpus extracted from Common Crawl. The German data portion of the OSCAR measures 145GB of text, containing approximately 21.5 billion words in approximately 459 million documents (one document per line).
Pre-processing
Originally, RoBERTa uses GPT2 (Radford et al., 2019) byte pair encoding to segment the input into subword units. Therefore, no pre-tokenization is required, and thus no language-specific tokenizer such as moses (Koehn et al., 2007) has to be used. Its original vocabulary was computed on English data. For GottBERT, we computed a vocabulary of 52k subword tokens based on 40GB of randomly sampled documents of the German OSCAR portion. Compared to the original GPT2 tokenizer, which was trained on English data, this leads to a 40% smaller size of the binary data which are fed into fairseq (Ott et al., 2019). Furthermore, according to Delobelle et al. (2020), it leads to a performance increase.
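To illustrate the vocabulary computation, the following minimal sketch trains a 52k byte-level BPE vocabulary with the Hugging Face tokenizers library. This is only an illustration, not the GPT2/fairseq tooling actually used for GottBERT; the input file name, the min_frequency setting and the special tokens are assumptions.

# Illustrative sketch: computing a 52k byte-level BPE vocabulary on sampled
# German OSCAR text. File name and settings below are assumptions.
import os
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["oscar_de_sample.txt"],   # placeholder for the ~40GB sampled documents
    vocab_size=52_000,               # vocabulary size used for GottBERT
    min_frequency=2,                 # assumption, not stated in the paper
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
os.makedirs("gottbert-vocab", exist_ok=True)
tokenizer.save_model("gottbert-vocab")  # writes vocab.json and merges.txt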
Pre-training
Using fairseq, we pre-trained GottBERT on a 256-core TPU pod using the RoBERTa BASE architecture. We trained the model for 100k update steps using a batch size of 8k. A 10k-iteration warmup of the learning rate to a peak of 0.0004 was applied, from which the learning rate polynomially decayed to zero.
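The schedule described above, a linear warmup over 10k updates to a peak learning rate of 0.0004 followed by a polynomial decay to zero at 100k updates, can be sketched as follows. The decay power of 1.0 is an assumption (the default of fairseq's polynomial_decay scheduler) and is not stated in the text.

# Sketch of the learning-rate schedule: linear warmup to the peak, then
# polynomial decay (power=1.0 assumed) to zero at the final update.
PEAK_LR = 4e-4
WARMUP_UPDATES = 10_000
TOTAL_UPDATES = 100_000

def learning_rate(step: int, power: float = 1.0) -> float:
    if step < WARMUP_UPDATES:
        return PEAK_LR * step / WARMUP_UPDATES
    remaining = (TOTAL_UPDATES - step) / (TOTAL_UPDATES - WARMUP_UPDATES)
    return PEAK_LR * max(0.0, remaining) ** power

for step in (0, 5_000, 10_000, 55_000, 100_000):
    print(step, learning_rate(step))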
Downstream Tasks
Based on the pre-trained BERT models, several downstream tasks were trained. The Framework for Adapting Representation Models (FARM)⁵ already comes with pre-configured experiments for German language. Originally, these experiments were set up to benchmark German BERT. We adopted this set of hyperparameters without additional grid search optimization. FARM is based on Hugging Face's Transformers library (Wolf et al., 2019). As we trained GottBERT with fairseq, we converted GottBERT into the Hugging Face format.
⁵ https://github.com/deepset-ai/FARM
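A minimal sketch of loading such a converted checkpoint through the Hugging Face interface that FARM builds on is shown below. The local directory name is a placeholder, not an official model identifier, and the classification head is freshly initialized rather than taken from the pre-trained model.

# Sketch: loading a converted RoBERTa checkpoint with transformers.
# "./gottbert-base" is a placeholder path to the converted fairseq checkpoint.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_dir = "./gottbert-base"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir, num_labels=2)

inputs = tokenizer("Das ist ein Beispielsatz.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (1, 2): logits of the randomly initialized head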
Named Entity Recognition
We evaluated GottBERT on two NER tasks. One was the German part of the CoNLL 2003 shared task (Tjong Kim Sang and De Meulder, 2003). It contains three main entity classes and one class for other, miscellaneous entities. As measurement we used the harmonic mean of precision and recall, F1. The second NER task was GermEval 2014 (Benikova et al., 2014). It extends the CoNLL 2003 shared task by fine-grained labels and embedded markables. Fine-grained labels allow the indication of NER subtypes common in German, namely derivations and parts, e.g. “Mann” → “männlich” and “Mann” → “mannhaft”. In order to recognize nested NEs, embedded markables are required. Specifically, this was realized by annotating main classes as well as two levels of subclasses. Performance was measured by the use of an adapted F1 evaluation metric (Benikova et al., 2014), which considers the equality of labels and spans (text passages) and additionally the levels in the class hierarchy.
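For the CoNLL-style evaluation, span-level F1 can be computed as sketched below with the seqeval package. This illustrates only the standard entity-level metric, not the adapted GermEval 2014 metric with nested levels, and the label sequences are toy data.

# Sketch: standard entity-level (span-based) F1 for CoNLL-style NER.
from seqeval.metrics import f1_score

y_true = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "B-ORG", "O"]]
y_pred = [["B-PER", "I-PER", "O", "O"],     ["O", "B-ORG", "O"]]

print(f1_score(y_true, y_pred))  # span-level F1 over all entity types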
Text Classification
GermEval task 2018 (Risch et al., 2018) is a text classification task that contains two subtasks of different granularity: the coarse-grained binary classification of German tweets and the fine-grained classification of the same tweets into four different classes. Based on the One Million Posts Corpus (Schabus et al., 2017), which is intended to test performance on German language, the 10k German News Articles Dataset (10kGNAD) topic classification task⁶ was created. The dataset contains approximately 10k news articles of an Austrian newspaper which are to be classified into 9 categories. As both classification datasets do not provide a pre-defined validation set, we used 90% of the original training set for training and 10% for validation. Data was split randomly. For evaluation, we computed the mean of the F1 scores of each class/category.
⁶ https://tblock.github.io/10kGNAD
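The metric described here, the unweighted mean of per-class F1 scores (macro F1), could be computed as in the following sketch. The scikit-learn call and the toy labels (the GermEval 2018 coarse classes) are illustrative, not the exact evaluation code used.

# Sketch: macro-averaged F1, i.e. the unweighted mean of per-class F1 scores.
from sklearn.metrics import f1_score

y_true = ["OFFENSE", "OTHER", "OTHER", "OFFENSE", "OTHER"]
y_pred = ["OFFENSE", "OTHER", "OFFENSE", "OFFENSE", "OTHER"]

print(f1_score(y_true, y_pred, average="macro"))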
4 Results
In order to evaluate the performance, each downstream task was run 10 times using different seeds. As measure, the F1 scores of the experiments on the test set were used. The reported score is the best of the 10 runs of the respective experiment for each trained model, where the best run is selected based on the validation set. We compared GottBERT's performance with four other models, listed in Table 1.
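This selection procedure can be summarized by a small sketch; the runs structure and the numbers in it are purely illustrative.

# Sketch: pick the run with the best validation F1 and report its test F1.
runs = [
    {"seed": s, "val_f1": val, "test_f1": test}
    for s, (val, test) in enumerate([(0.85, 0.84), (0.87, 0.86), (0.86, 0.85)])
]

best = max(runs, key=lambda r: r["val_f1"])
print(f"reported test F1: {best['test_f1']:.2f} (seed {best['seed']})")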
Named Entity Recognition
For the two tested NER tasks, CoNLL 2003 and GermEval 2014, GottBERT outperformed all other models, followed by dbmz BERT (see Table 2). Obviously, the amount and characteristics of the data in the German portion of the OSCAR are beneficial. The third place goes to XLM RoBERTa. Interestingly, mBERT outperforms German BERT, although mBERT's amount of German data can be assumed to be smaller compared to German BERT. Notably, mBERT performs an exponentially smoothed weighting of the data in order to balance the variance in data size across all the different languages. Consequently, even if we knew the data size of the German portion of mBERT, a direct comparison would not be possible.
Model        | Type    | #Languages | Data Size             | Data Source
GottBERT     | RoBERTa | 1          | 145GB                 | OSCAR
dbmz BERT    | BERT    | 1          | 16GB                  | Wikipedia, EU Bookshop corpus⁷, Open Subtitles, Common-, Para-, NewsCrawl
mBERT cased  | BERT    | 104        | unknown               | Wikipedia
German BERT  | BERT    | 1          | 12GB                  | news articles, Open Legal Data⁸, Wikipedia
XLM RoBERTa  | RoBERTa | 100        | 2.5TB (66.6GB German) | CommonCrawl, Wikipedia
Table 1: The models used in our experiments, with additional information about their pre-training and architecture. Unfortunately, for mBERT we did not find an estimate of the data size.
Model        | CoNLL 2003 | GermEval 2014
GottBERT     | 83.57      | 86.84
dbmz BERT    | 82.30      | 85.82
mBERT cased  | 81.20      | 85.39
German BERT  | 81.18      | 85.03
XLM RoBERTa  | 81.36      | 85.41
Table 2: F1 scores of NER experiments based on the test set. Best score out of 10 runs (selection based on validation set).
Text Classification
Table 3 shows the F1 scores of the classification tasks. In GermEval 2018 coarse and 10kGNAD, dbmz BERT outperforms the other models, followed by German BERT. GottBERT only performs best in GermEval 2018 fine. In 10kGNAD, even XLM RoBERTa outperforms GottBERT. In general, the two RoBERTa-based models do not seem to reach their full potential, which might be due to suboptimal hyperparameters.
5 Conclusion
In this work we present the German single-language RoBERTa-based model GottBERT, which was computed on 145GB of plain text. GottBERT is the first German single-language RoBERTa-based model. Even without extensive hyperparameter optimization, GottBERT already outperformed all other tested models in three out of five downstream tasks. As the authors of German BERT released the hyperparameters within the FARM framework, we assume the parameters for the downstream tasks are probably not optimal for RoBERTa-based models. Consequently, further extensive hyperparameter optimization of the downstream tasks might lead to better results for GottBERT. Dodge et al. (2020) give insights into hyperparameter optimization, its complexity and its effects in relation to BERT. Further, they present a solution to lower costs. Due to the required effort, we currently leave this open as future work and release GottBERT in Hugging Face and fairseq format to the community under the AGPLv3 license.
Acknowledgments
This work was supported by the German Ministry for Education and Research (BMBF FKZ 01ZZ1801B) and supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC). We would like to thank Ian Graham for constructive criticism of the manuscript and Louis Martin for helpful email correspondence. A special thanks goes to Myle Ott, who implemented the TPU feature in fairseq and intensively supported us in getting our computation running. Finally, we would like to expressly thank the people behind the scenes who essentially made this work possible: Frank Werner, Georg Koch, Friedlinde Bühler and Jochen Knaus of our internal IT team, Philipp Munz and Christian Wiedemann of Wabion GmbH and, last but not least, Nora Limbourg, the Google Cloud Customer Engineer assigned to us.
Model        | GermEval 2018 coarse | GermEval 2018 fine | 10kGNAD
GottBERT     | 76.39                | 51.25              | 89.20
dbmz BERT    | 77.32                | 50.97              | 90.86
mBERT cased  | 72.87                | 44.78              | 88.72
German BERT  | 77.23                | 49.28              | 90.66
XLM RoBERTa  | 75.15                | 45.63              | 89.30
Table 3: F1 scores of text classification experiments based on the test set. Best score out of 10 runs (selection based on validation set).
References
Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter, and Roland Vollgraf. 2019. FLAIR: An Easy-to-Use Framework for State-of-the-Art NLP. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 54–59, Minneapolis, Minnesota. Association for Computational Linguistics.
Emily Alsentzer, John R. Murphy, Willie Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, and Matthew B. A. McDermott. 2019. Publicly Available Clinical BERT Embeddings. arXiv:1904.03323 [cs].
Darina Benikova, Chris Biemann, Max Kisselew, and Sebastian Padó. 2014. GermEval 2014 Named Entity Recognition Shared Task: Companion Paper. Proceedings of the KONVENS GermEval Shared Task on Named Entity Recognition, pages 104–112.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5:135–146.
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised Cross-lingual Representation Learning at Scale. CoRR, abs/1911.02116.
Pieter Delobelle, Thomas Winters, and Bettina Berendt. 2020. RobBERT: a Dutch RoBERTa-based Language Model. arXiv:2001.06286 [cs].
Jacob Devlin. 2018. Multilingual BERT Readme Document.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and Noah Smith. 2020. Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping. arXiv:2002.06305 [cs].
Yoav Goldberg and Omer Levy. 2014. word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv:1402.3722 [cs, stat].
Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.
Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of Tricks for Efficient Text Classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431. Association for Computational Linguistics.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic. Association for Computational Linguistics.
Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.
Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, and Didier Schwab. 2020. FlauBERT: Unsupervised Language Model Pre-training for French. arXiv:1912.05372 [cs].
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692 [cs].
Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric de la Clergerie, Djamé Seddah, and Benoît Sagot. 2020. CamemBERT: a Tasty French Language Model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7203–7219, Online. Association for Computational Linguistics.
Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in Pre-Training Distributed Word Representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, and Sergey Edunov. 2019. Facebook FAIR's WMT19 News Translation Task Submission. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 314–319, Florence, Italy. Association for Computational Linguistics.
Pedro Javier Ortiz Suárez, Laurent Romary, and Benoît Sagot. 2020. A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1703–1714, Online. Association for Computational Linguistics.
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. arXiv:1904.01038 [cs].
Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. 2018. Scaling Neural Machine Translation. arXiv:1806.00187 [cs].
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.
Julian Risch, Eva Krebs, Alexander Löser, Alexander Riese, and Ralf Krestel. 2018. Fine-Grained Classification of Offensive Language. In Proceedings of GermEval 2018 (co-located with KONVENS), pages 38–44.
Dietmar Schabus, Marcin Skowron, and Martin Trapp. 2017. One Million Posts: A Data Set of German Online Discussions. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 1241–1244, Tokyo, Japan.
Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5149–5152.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. arXiv:1508.07909 [cs].
Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, CONLL '03, pages 142–147, Edmonton, Canada. Association for Computational Linguistics.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
Antti Virtanen, Jenna Kanerva, Rami Ilo, Jouni Luoma, Juhani Luotolahti, Tapio Salakoski, Filip Ginter, and Sampo Pyysalo. 2019. Multilingual is not enough: BERT for Finnish. arXiv:1912.07076 [cs].
Wietse de Vries, Andreas van Cranenburgh, Arianna Bisazza, Tommaso Caselli, Gertjan van Noord, and Malvina Nissim. 2019. BERTje: A Dutch BERT Model. arXiv:1912.09582 [cs].
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art Natural Language Processing. CoRR, abs/1910.03771.
Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. 2020. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes. arXiv:1904.00962 [cs, stat].