GottBERT: a pure German Language Model
Raphael Scheible1, Fabian Thomczyk1, Patric Tippmann1, Victor Jaravine1 and Martin Boeker1
{raphael.scheible,thomczyk,tippmann,victor.zharavin,martin.boeker}
@imbi.uni-freiburg.de
1 Institute of Medical Biometry and Statistics, Medical Center, Faculty of Medicine, University of Freiburg
Abstract
Recently, pre-trained language models have advanced the field of natural language processing (NLP). The introduction of Bidirectional Encoder Representations from Transformers (BERT) and its optimized version RoBERTa has had a significant impact and increased the relevance of pre-trained models. Research in this field initially focused on English data, followed by models trained on multilingual text corpora. However, current research shows that multilingual models are inferior to monolingual models. To date, no German single-language RoBERTa model has been published; we introduce such a model (GottBERT) in this work. The German portion of the OSCAR data set served as the text corpus. In an evaluation we compare its performance on the two Named Entity Recognition (NER) tasks CoNLL 2003 and GermEval 2014 as well as on the text classification tasks GermEval 2018 (fine and coarse) and GNAD with existing German single-language BERT models and two multilingual ones. GottBERT was pre-trained following the original RoBERTa pre-training procedure using fairseq. All downstream tasks were trained using hyperparameter presets taken from the benchmark of German BERT. The experiments were set up using FARM. Performance was measured by the F1 score. GottBERT was successfully pre-trained on a 256-core TPU pod using the RoBERTa BASE architecture. Even without extensive hyperparameter optimization, GottBERT already outperformed all other tested German and multilingual models in all NER tasks and in one text classification task. In order to support the German NLP field, we publish GottBERT under the AGPLv3 license.
1 Introduction
The computation of contextual pre-trained word representations is the current trend of neural networks in natural language processing (NLP). This trend has a long history, starting with non-contextualized word representations, of which the most prominent are word2vec (Goldberg and Levy, 2014), GloVe (Pennington et al., 2014) and fastText (Joulin et al., 2016, 2017; Bojanowski et al., 2017; Mikolov et al., 2018), and evolving to deep contextualized models such as ELMo (Alsentzer et al., 2019) and flair (Akbik et al., 2019). Most recently, the field of NLP experienced remarkable progress through Transformer-based (Vaswani et al., 2017) approaches. In particular, Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019) impacted the field and was subsequently robustly optimized into RoBERTa (Liu et al., 2019). These Transformer-based approaches use large-scale pre-trained language models. For application, such models are fine-tuned by supervised training on a specific downstream task, leading to performance improvements in many tasks, whereas the language model itself is pre-trained in an unsupervised manner. Pre-training requires large amounts of raw text and powerful hardware such as hundreds of GPUs (Martin et al., 2020) or TPUs (You et al., 2020). Initially, most of the research took place in English, followed by multilingual approaches (Conneau et al., 2019). Although multilingual approaches were trained on large texts in many languages, they were outperformed by single-language models (de Vries et al., 2019; Martin et al., 2020; Le et al., 2020; Delobelle et al., 2020). Single-language models trained on the Open Super-large Crawled ALMAnaCH coRpus (OSCAR) (Ortiz Suárez et al., 2020) showed good performance due to the size and variance of the OSCAR (Martin et al., 2020; Delobelle et al., 2020). Following this ongoing trend, we pre-train the first German RoBERTa single-language model with the German portion of the OSCAR: the German OSCAR text trained BERT (GottBERT). In an evaluation we compare its performance on the two named entity recognition tasks CoNLL 2003 and GermEval 2014, as well as on the two text classification tasks GermEval 2018 and GNAD, with existing German single-language BERT models and two multilingual ones.
2 Related Work
Most recently, Transformer-based models have widely impacted the field of NLP. From neural translation (Ott et al., 2018; Ng et al., 2019) to generative language models such as GPT-2 (Radford et al., 2019), remarkable performance gains were achieved. With BERT, an approach facilitating pre-trained Transformer-based models was introduced. Fine-tuned on downstream tasks, BERT-based approaches improved the performance of several NLP tasks (Devlin et al., 2019; Liu et al., 2019). BERT models, though, were first released as a single-language English model based on 16GB of raw text and as the multilingual model mBERT based on Wikipedia dumps in about 100 languages (Devlin, 2018). These models were followed by single-language models for several languages: BERTje (de Vries et al., 2019) for Dutch, FinBERT (Virtanen et al., 2019) for Finnish, German BERT (https://deepset.ai/german-bert), and a German BERT from the MDZ Digital Library team at the Bavarian State Library (https://github.com/dbmdz/berts#german-bert), to which we refer as dbmz BERT in this paper. German BERT was trained on 12GB of raw text based on the German Wikipedia (6GB), the OpenLegalData dump (2.4GB) and news articles (3.6GB). dbmz BERT used a German Wikipedia dump, the EU Bookshop corpus, Open Subtitles, CommonCrawl, ParaCrawl and News Crawl as source data, which sums up to a dataset of 16GB. With the release of RoBERTa, a new standard for raw text size was set, as it was trained on 160GB of raw English text. Further, RoBERTa enhances the original BERT approach by removing segment embeddings and next-sentence prediction and by improving hyperparameters. Additionally, instead of using wordpiece (Schuster and Nakajima, 2012) tokenization, RoBERTa utilizes GPT-2's byte pair encoding (BPE) (Radford et al., 2019), with the benefit that language-specific tokenizers are not required. Unlike mBERT, the multilingual XLM-RoBERTa (Conneau et al., 2019) was trained on 2.5TB of filtered CommonCrawl data. CamemBERT is a French RoBERTa model that was trained on the OSCAR and uses sentencepiece (Kudo and Richardson, 2018) BPE. Further, the authors pre-trained one model with 4GB of the French OSCAR portion and another model with 4GB of the French Wikipedia. The comparison of these models on downstream tasks shows that high text variance leads to better results. UmBERTo (https://github.com/musixmatchresearch/umberto) is an Italian RoBERTa model designed similarly to CamemBERT. RobBERT, the Dutch single-language RoBERTa, was trained on 39GB of the Dutch portion of the OSCAR and outperformed BERTje. A more recent version of RobBERT showed the performance gains of language-specific BPE compared to the English-based GPT-2 BPE in downstream tasks. Most recently, FlauBERT (Le et al., 2020) for French was released, trained on 71GB of data. The authors cleaned a 270GB corpus of mixed sources by filtering out meaningless content and applying Unicode normalization. Data was pre-tokenized by moses (Koehn et al., 2007) and encoded by fastBPE (https://github.com/glample/fastBPE), which is an implementation of Sennrich et al. (2016). Following the approach of utilizing the OSCAR, we computed the German OSCAR text trained BERT (GottBERT). However, the drawback of BERT approaches is their computational power requirement: multiple GPUs or TPUs were used for pre-training. All listed RoBERTa-based models were computed on GPUs, whereas GottBERT is the first published RoBERTa model pre-trained on TPUs.
3 GottBERT
GottBERT is based on the robustly optimized BERT architecture RoBERTa and was pre-trained on the German portion of the OSCAR using the fairseq framework (Ott et al., 2019).
Training Data
The GottBERT model is trained on the German portion of the OSCAR, a recently published large multilingual text corpus extracted from Common Crawl. The German portion of the OSCAR comprises 145GB of text, containing approximately 21.5 billion words in approximately 459 million documents (one document per line).
Pre-processing
Originally, RoBERTa uses GPT-2 (Radford et al., 2019) byte pair encoding to segment the input into subword units. Therefore, no pre-tokenization is required, and thus no language-specific tokenizer such as moses (Koehn et al., 2007) must be used. Its original vocabulary was computed on English data. For GottBERT we computed a vocabulary of 52k subword tokens based on 40GB of randomly sampled documents from the German portion of the OSCAR. Compared to the original GPT-2 tokenizer, which was trained on English data, this leads to a 40% smaller size of the binary data fed into fairseq (Ott et al., 2019). Furthermore, according to Delobelle et al. (2020), it leads to a performance increase.
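To illustrate this step, the following sketch builds a comparable 52k byte-level BPE vocabulary with the Hugging Face tokenizers library. This is only an approximation of the procedure: GottBERT's vocabulary was computed with the GPT-2 BPE tooling used by fairseq, and the input file name, minimum frequency and special tokens below are assumptions made for illustration.

# Illustrative sketch only: the actual GottBERT vocabulary was built with the
# GPT-2 BPE tooling used by fairseq; here the same idea is approximated with
# the Hugging Face "tokenizers" library.
from tokenizers import ByteLevelBPETokenizer

# "oscar_de_sample.txt" is a hypothetical file holding the randomly sampled
# German OSCAR documents (one document per line); it is not part of the paper.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["oscar_de_sample.txt"],
    vocab_size=52_000,  # 52k subword tokens, as described above
    min_frequency=2,    # assumed cut-off, not stated in the paper
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model(".")  # writes vocab.json and merges.txt to the current directory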
Pre-training
Using fairseq, we pre-trained GottBERT on a 256-core TPU pod using the RoBERTa BASE architecture. We trained the model for 100k update steps with a batch size of 8k. A 10k-iteration warmup of the learning rate to a peak of 0.0004 was applied, from which the learning rate polynomially decayed to zero.
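As a minimal sketch, the schedule described above can be written as follows. The decay power of 1.0 (i.e. linear decay) is an assumption; the paper only states that the decay is polynomial.

# Learning-rate schedule sketch: linear warmup over 10k updates to a peak of
# 4e-4, then polynomial decay to zero at update 100k.
PEAK_LR = 4e-4
WARMUP_UPDATES = 10_000
TOTAL_UPDATES = 100_000
POWER = 1.0  # assumed decay power (not stated in the paper)

def learning_rate(step: int) -> float:
    if step < WARMUP_UPDATES:
        return PEAK_LR * step / WARMUP_UPDATES
    remaining = (TOTAL_UPDATES - step) / (TOTAL_UPDATES - WARMUP_UPDATES)
    return PEAK_LR * max(0.0, remaining) ** POWER

# learning_rate(10_000) == 4e-4 and learning_rate(100_000) == 0.0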
Downstream Tasks
Based on the pre-trained BERT models, several downstream tasks were trained. The Framework for Adapting Representation Models (FARM, https://github.com/deepset-ai/FARM) already comes with pre-configured experiments for the German language. Originally, these experiments were set up to benchmark German BERT. We adopted this set of hyperparameters without additional grid-search optimization. FARM is based on Hugging Face's Transformers library (Wolf et al., 2019). As we trained GottBERT with fairseq, we converted GottBERT into the Hugging Face format.
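Once converted, the checkpoint can be used like any other Hugging Face model. The following sketch shows such a use outside of FARM; the local path is a placeholder, and the label count of 9 (BIO tags for the four CoNLL 2003 entity classes plus the outside tag) is an illustrative assumption.

# Sketch of loading the converted checkpoint with Hugging Face Transformers;
# the experiments in this paper were actually run through FARM.
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_path = "./gottbert-huggingface"  # placeholder path to the converted checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForTokenClassification.from_pretrained(model_path, num_labels=9)

inputs = tokenizer("Angela Merkel besuchte Freiburg.", return_tensors="pt")
outputs = model(**inputs)  # one NER logit vector per subword token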
Named Entity Recognition
We evaluated GottBERT on two NER tasks. One was the German part of the CoNLL 2003 shared task (Tjong Kim Sang and De Meulder, 2003). It contains three main entity classes and one class for other, miscellaneous entities. As measurement we used the harmonic mean of precision and recall, the F1 score (see the formula below). The second NER task was GermEval 2014 (Benikova et al., 2014). It extends the CoNLL 2003 shared task by fine-grained labels and embedded markables. Fine-grained labels allow the indication of NER subtypes common in German, namely derivations and parts, e.g. "Mann" → "männlich" and "Mann" → "mannhaft". In order to recognize nested named entities, embedded markables are required. Specifically, this was realized by annotating main classes as well as two levels of subclasses. Performance was measured by an adapted F1 evaluation metric (Benikova et al., 2014), which considers the equality of labels and spans (text passages) and additionally the levels in the class hierarchy.
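For reference, the standard F1 score used for CoNLL 2003 is defined as

F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}

The adapted GermEval 2014 metric additionally takes labels, spans and hierarchy levels into account, as described by Benikova et al. (2014).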
Text Classification
GermEval task 2018 (Risch et al., 2018) is a text classification task that contains two subtasks of different granularity: the coarse-grained binary classification of German tweets and the fine-grained classification of the same tweets into four different classes. Based on the One Million Posts Corpus (Schabus et al., 2017), which is intended to test performance on German-language data, the 10k German News Articles Dataset (10kGNAD, https://tblock.github.io/10kGNAD) topic classification task was created. The dataset contains approximately 10k news articles of an Austrian newspaper, which are to be classified into 9 categories. As both classification datasets do not provide a pre-defined validation set, we used 90% of the original training set for training and 10% for validation. Data was split randomly. For evaluation we computed the mean of the F1 scores of each class/category.
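A minimal sketch of this evaluation protocol, using scikit-learn instead of the FARM pipeline actually employed; the texts, labels and predictions below are made-up placeholders that merely resemble 10kGNAD topics.

# Random 90/10 train/validation split and macro F1 (the mean of per-class F1 scores).
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Placeholder data standing in for the 10kGNAD training articles and their categories.
texts = ["Artikel %d" % i for i in range(1, 11)]
labels = ["Sport", "Kultur", "Sport", "Web", "Kultur", "Sport", "Web", "Kultur", "Sport", "Web"]

train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.1, shuffle=True, random_state=42
)

# After fine-tuning on the training split and predicting on the test set,
# the reported score is the macro-averaged F1 over all categories:
gold = ["Sport", "Kultur", "Web", "Sport"]
pred = ["Sport", "Kultur", "Web", "Kultur"]
print(f1_score(gold, pred, average="macro"))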
4 Results
In order to evaluate the performance, each downstream task was run 10 times using different seeds. As measure, the F1 score on the test set was used. For each model, the reported score is the best of the 10 runs of the respective experiment, where the selection of the best run is based on the validation set. We compared GottBERT's performance with the four other models listed in Table 1.
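A minimal sketch of this selection scheme; the seeds and scores below are made-up placeholders, not results from the paper.

# Each task is run 10 times with different seeds; the test F1 of the run with
# the best validation F1 is reported (three placeholder runs shown here).
runs = [
    {"seed": 1, "val_f1": 85.1, "test_f1": 84.9},
    {"seed": 2, "val_f1": 86.0, "test_f1": 85.4},
    {"seed": 3, "val_f1": 85.7, "test_f1": 85.9},
]
best_run = max(runs, key=lambda r: r["val_f1"])
print(best_run["seed"], best_run["test_f1"])  # report the test score of that run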
Named Entity Recognition
For the two tested NER tasks, CoNLL 2003 and GermEval 2014, GottBERT outperformed all other models, followed by dbmz BERT (see Table 2). Evidently, the amount and characteristics of the data in the German portion of the OSCAR are beneficial. The third place goes to XLM RoBERTa. Interestingly, mBERT outperforms German BERT, although mBERT's amount of German data can be assumed to be smaller than that of German BERT. Notably, mBERT performs an exponentially smoothed weighting of the data in order to balance the variance in data size across the different languages. Consequently, even if we knew the size of the German portion of mBERT's training data, a direct comparison would not be possible.
Model | Type | #Languages | Data Size | Data Source
GottBERT | RoBERTa | 1 | 145GB | OSCAR
dbmz BERT | BERT | 1 | 16GB | Wikipedia, EU Bookshop corpus, Open Subtitles, Common-, Para- and NewsCrawl
mBERT cased | BERT | 104 | unknown | Wikipedia
German BERT | BERT | 1 | 12GB | news articles, Open Legal Data, Wikipedia
XLM RoBERTa | RoBERTa | 100 | 2.5TB (66.6GB German) | CommonCrawl, Wikipedia

Table 1: The models used in our experiments, with additional information about their pre-training data and architecture. Unfortunately, for mBERT we did not find any estimate of the data size.
Model | CoNLL 2003 | GermEval 2014
GottBERT | 83.57 | 86.84
dbmz BERT | 82.30 | 85.82
mBERT cased | 81.20 | 85.39
German BERT | 81.18 | 85.03
XLM RoBERTa | 81.36 | 85.41

Table 2: F1 scores of the NER experiments on the test set. Best score out of 10 runs (selection based on the validation set).
Text Classification
Table 3 shows the F1 scores of the classification tasks. In GermEval 2018 coarse and 10kGNAD, dbmz BERT outperforms the other models, followed by German BERT. GottBERT only performs best in GermEval 2018 fine. In 10kGNAD, even XLM RoBERTa outperforms GottBERT. In general, the two RoBERTa-based models do not seem to reach their full potential, which might be due to sub-optimal hyperparameters.
5 Conclusion
In this work we present GottBERT, the first German single-language RoBERTa-based model, computed on 145GB of plain text. Even without extensive hyperparameter optimization, GottBERT already outperformed all other tested models in three out of five downstream tasks. As the authors of German BERT released their hyperparameters within the FARM framework, we assume these downstream-task hyperparameters are probably not optimal for RoBERTa-based models. Consequently, further extensive hyperparameter optimization of the downstream tasks might lead to better results for GottBERT. Dodge et al. (2020) give insights into hyperparameter optimization, its complexity and its effects in relation to BERT, and further present an approach to lower its cost. Due to the required effort, we currently leave this open as future work and release GottBERT in Hugging Face and fairseq format to the community under the AGPLv3 license.
Acknowledgments
This work was supported by the German Ministry for Education and Research (BMBF FKZ 01ZZ1801B) and supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC). We would like to thank Ian Graham for constructive criticism of the manuscript and Louis Martin for his helpful email correspondence. A special thanks goes to Myle Ott, who implemented the TPU feature in fairseq and intensively supported us in getting our computation running. Finally, we would like to thank the people behind the scenes who essentially made this work possible: Frank Werner, Georg Koch, Friedlinde Bühler and Jochen Knaus of our internal IT team, Philipp Munz and Christian Wiedemann of Wabion GmbH and, last but not least, Nora Limbourg, the Google Cloud customer engineer assigned to us.
Model | GermEval 2018 coarse | GermEval 2018 fine | 10kGNAD
GottBERT | 76.39 | 51.25 | 89.20
dbmz BERT | 77.32 | 50.97 | 90.86
mBERT cased | 72.87 | 44.78 | 88.72
German BERT | 77.23 | 49.28 | 90.66
XLM RoBERTa | 75.15 | 45.63 | 89.30

Table 3: F1 scores of the text classification experiments on the test set. Best score out of 10 runs (selection based on the validation set).
References
Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter, and Roland Vollgraf. 2019. FLAIR: An Easy-to-Use Framework for State-of-the-Art NLP. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 54–59, Minneapolis, Minnesota. Association for Computational Linguistics.

Emily Alsentzer, John R. Murphy, Willie Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, and Matthew B. A. McDermott. 2019. Publicly Available Clinical BERT Embeddings. arXiv:1904.03323 [cs].

Darina Benikova, Chris Biemann, Max Kisselew, and Sebastian Padó. 2014. GermEval 2014 Named Entity Recognition Shared Task: Companion Paper. Proceedings of the KONVENS GermEval Shared Task on Named Entity Recognition, pages 104–112.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5:135–146.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. CoRR, abs/1911.02116.

Pieter Delobelle, Thomas Winters, and Bettina Berendt. 2020. RobBERT: a Dutch RoBERTa-based Language Model. arXiv:2001.06286 [cs].

Jacob Devlin. 2018. Multilingual BERT Readme Document.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and Noah Smith. 2020. Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping. arXiv:2002.06305 [cs].

Yoav Goldberg and Omer Levy. 2014. word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv:1402.3722 [cs, stat].

Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of Tricks for Efficient Text Classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic. Association for Computational Linguistics.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.

Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, and Didier Schwab. 2020. FlauBERT: Unsupervised Language Model Pre-training for French. arXiv:1912.05372 [cs].

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692 [cs].

Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric de la Clergerie, Djamé Seddah, and Benoît Sagot. 2020. CamemBERT: a Tasty French Language Model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7203–7219, Online. Association for Computational Linguistics.

Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in Pre-Training Distributed Word Representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, and Sergey Edunov. 2019. Facebook FAIR's WMT19 News Translation Task Submission. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 314–319, Florence, Italy. Association for Computational Linguistics.

Pedro Javier Ortiz Suárez, Laurent Romary, and Benoît Sagot. 2020. A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1703–1714, Online. Association for Computational Linguistics.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. arXiv:1904.01038 [cs].

Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. 2018. Scaling Neural Machine Translation. arXiv:1806.00187 [cs].

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.

Julian Risch, Eva Krebs, Alexander Löser, Alexander Riese, and Ralf Krestel. 2018. Fine-Grained Classification of Offensive Language. In Proceedings of GermEval 2018 (co-located with KONVENS), pages 38–44.

Dietmar Schabus, Marcin Skowron, and Martin Trapp. 2017. One Million Posts: A Data Set of German Online Discussions. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 1241–1244, Tokyo, Japan.

Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5149–5152. ISSN: 2379-190X.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. arXiv:1508.07909 [cs].

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, CONLL '03, pages 142–147, Edmonton, Canada. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Antti Virtanen, Jenna Kanerva, Rami Ilo, Jouni Luoma, Juhani Luotolahti, Tapio Salakoski, Filip Ginter, and Sampo Pyysalo. 2019. Multilingual is not enough: BERT for Finnish. arXiv:1912.07076 [cs].

Wietse de Vries, Andreas van Cranenburgh, Arianna Bisazza, Tommaso Caselli, Gertjan van Noord, and Malvina Nissim. 2019. BERTje: A Dutch BERT Model. arXiv:1912.09582 [cs].

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. CoRR, abs/1910.03771.

Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. 2020. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes. arXiv:1904.00962 [cs, stat].