Graphical Abstract
BERT models for Brazilian Portuguese: pretraining, evaluation and tokenization analysis
FC Souza, RF Nogueira, RA Lotufo
[Graphical abstract diagram: mBERT or English BERT checkpoints and a Portuguese tokenizer built from Portuguese Wikipedia feed the pretraining of BERTimbau on Brazilian web pages (brWaC); BERTimbau is then fine-tuned and evaluated on the labeled datasets HAREM I (NER) and ASSIN2 (STS & RTE), followed by a tokenization analysis.]
Highlights
BERT models for Brazilian Portuguese: pretraining, evaluation and tokenization analysis
FC Souza, RF Nogueira, RA Lotufo
Release of pretrained BERT models for Brazilian Portuguese trained on the brWaC corpus
State-of-the-art performance on three Portuguese NLP tasks: Sentence Textual Similarity (ASSIN2), Recognizing Textual Entailment (ASSIN2) and Named Entity Recognition (HAREM I)
Tokenization analysis reveals a strong correlation between task performance and subword splits and opens future research directions
BERT models for Brazilian Portuguese: pretraining, evaluation and
tokenization analysis
FC Souza (a,c), RF Nogueira (a,b,c) and RA Lotufo (a,c)
(a) School of Electrical and Computer Engineering - UNICAMP, Cidade Universitária, Campinas, SP 13083-852, Brazil
(b) University of Waterloo, 200 University Ave W, Waterloo, ON N2L 3G1, Canada
(c) Neuralmind Inteligência Artificial, Cidade Universitária, Campinas, SP 13083-898, Brazil
ARTICLE INFO
Keywords:
Language model
BERT
Sentence textual similarity
Recognizing textual entailment
Named entity recognition
ABSTRACT
Recent advances in language representation using neural networks have made it viable to transfer the learned internal states of large pretrained language models (LMs) to downstream natural language processing (NLP) tasks. This transfer learning approach improves the overall performance on many tasks and is highly beneficial when labeled data is scarce, making pretrained LMs valuable resources especially for languages with few annotated training examples. In this work, we train BERT (Bidirectional Encoder Representations from Transformers) models for Brazilian Portuguese, which we nickname BERTimbau. We evaluate our models on three downstream NLP tasks: sentence textual similarity, recognizing textual entailment, and named entity recognition. Our models improve the state of the art in all of these tasks, outperforming Multilingual BERT and confirming the effectiveness of large pretrained LMs for Portuguese. We release our models to the community hoping to provide strong baselines for future NLP research: https://github.com/neuralmind-ai/portuguese-bert.
1. Introduction
Transfer learning, where a model is first trained on a source task and then fine-tuned on tasks of interest, has changed the landscape of natural language processing (NLP) applications in recent years. The strategy of fine-tuning a large pretrained language model (LM) has been widely adopted and has achieved state-of-the-art performance on a variety of NLP tasks [11,35,37,58]. Aside from bringing performance improvements, transfer learning reduces the amount of labeled data needed for supervised learning on downstream tasks [15,32].
Pretraining these large language models, however, requires huge amounts of unlabeled data and computational resources, with reports of models being trained using thousands of GPUs or TPUs and hundreds of gigabytes of raw textual data [25,37]. Early on, this resource barrier limited the availability of these models to English, Chinese, and multilingual variants.
BERT [11], which uses the Transformer architecture [53], is one of the most widely adopted models, along with derived models such as RoBERTa [25] and ALBERT [23]. Despite the release of a multilingual BERT model1 (mBERT) trained on 104 languages, much effort has been devoted to pretraining monolingual BERT and BERT-derived models on single languages, such as French [26], Dutch [10,54], Spanish [4], Italian [34], and others [2,20,28]. Even though it is infeasible to train monolingual models for every language, these works are motivated by the superior performance on downstream tasks and resource efficiency of monolingual models compared to mBERT [16].
Large pretrained LMs can be valuable assets especially for languages that have few annotated resources but abundant unlabeled data, such as Portuguese. With that in mind, we train BERT models for Brazilian Portuguese, which we nickname BERTimbau, using data from brWaC [55], a large and diverse corpus of web pages. We evaluate our models on three NLP tasks: sentence textual similarity, recognizing textual entailment, and named entity recognition. We also compare BERTimbau's and mBERT's tokenizers on these tasks and analyze their implications for task performance. BERTimbau improves the state of the art on these tasks over multilingual models and previous monolingual approaches, confirming the effectiveness of large pretrained LMs for Portuguese. This work extends our previous work [48], in which we performed the pretraining and evaluation of BERTimbau models. In this work, we conduct a tokenization analysis of the BERTimbau and mBERT tokenizers and their impacts on downstream task performance, which reveals a correlation between word segmentation and task performance.
1 https://github.com/google-research/bert/blob/master/multilingual.md
Moreover, we provide more methodological detail, especially regarding vocabulary generation and the pretraining and fine-tuning stages, a more detailed description of the evaluation datasets and tasks, and a discussion of related and future work. We make BERTimbau models available to the community2 through open-source libraries to provide strong baselines for future research and to empower transfer learning in NLP applications in scenarios with limited labeled data or insufficient data to train a model from scratch.
The paper is organized as follows: in Section 2, we present related work. In Section 3, we briefly describe the BERTimbau architecture and the pretraining procedure, including the pretraining data, the vocabulary generation, and the pretraining objectives. In Section 4, we describe the downstream tasks and datasets used to evaluate our models and the evaluation procedures. In Section 5, we describe our experiments and present and analyze our results. Then, in Section 6, we compare the tokenizations and analyze their impact on the evaluation tasks. Lastly, we make our final remarks in Section 7.
2. Related Work
Transfer learning is a widespread technique to reduce training data requirements and improve performance of
machine learning models [30]. Sequential inductive transfer learning consists of training a model on a source dataset
and task and then transferring the learned general-purpose features to a new target dataset and task [59]. This transfer
is generally done by initializing all or part of the target model’s parameters with the pretrained weights of the source
model. These pretrained weights can be kept frozen – and used in a feature extraction approach – or be fine-tuned
along with new parameters of the target model when training on the final task [33].
Before the advent of pretrained LMs, most transfer learning was performed by first learning vector representations for predefined vocabularies in an unsupervised manner on large corpora and later reusing these representations as input to models that were trained from scratch. Classic word embeddings [27,31] consist of static non-contextualized word-level representations that capture semantic and syntactic features. More recently, contextual embeddings, such as ELMo [32] and Flair Embeddings [1], leverage the internal states of language models to extract richer word representations in context.
Deep transfer learning techniques for NLP emerged by successfully fine-tuning large pretrained LMs with general-
purpose architectures, such as the Transformer [53], replacing task-specific models. Language modeling pretraining
is shown to resemble a multitask objective that allows zero-shot learning on many tasks [36]. This pretraining stage
benefits from diverse texts and can be further improved by additional pretraining with unlabeled data of downstream
tasks’ domains [14].
Pretrained LMs can be monolingual or multilingual. Initial works focused primarily on English monolingual LMs
[15,35], while BERT [11] also released a multilingual model (mBERT) that is trained on 104 languages. While
multilingual models can be applied to any compatible language and allow cross-lingual transfer, they suffer from the
curse of multilinguality [8,7]: as more languages are added to a fixed-capacity model, the overall monolingual and
transfer performances degrade, especially for low-resource languages that are underrepresented in the training data and
vocabulary. This effect can be mitigated by increasing the model capacity, which results in larger compute resource
requirements.
Similar to BERTimbau, much effort has been devoted to pretraining monolingual BERT and BERT-derived mod-
els for other languages, such as French [26], Dutch [10,54], Spanish [4], Italian [34], and many others [2,20,28].
These works usually follow a common recipe and hyperparameters from the original works for the model pretrain-
ing, since the high cost of pretraining hinders much experimentation and optimization. The main differences between
BERTimbau and the other works are the pretraining corpus and choices of model initialization, tokenizer, vocabulary
generation, evaluation tasks and datasets. For an in-depth comparison of these works, we refer readers to AMMUS [16], a survey of monolingual LMs.
For the Portuguese language, recent works have mainly explored and compared contextual embedding techniques. ELMo and Flair Embeddings trained on large Portuguese corpora achieve state-of-the-art results on the IberLEF 2019 and HAREM named entity recognition (NER) tasks [5,44,9]. A comparison of ELMo and multilingual BERT in a contextual embeddings setup shows superior performance of Portuguese ELMo on the semantic textual similarity task when no fine-tuning is used [41]. Regarding pretrained LMs for specific domains, BioBERTpt [45], an mBERT fine-tuned on clinical and biomedical texts in Portuguese, achieves state-of-the-art results on SemClinBr NER tasks. BERTaú [12], a
2 https://github.com/neuralmind-ai/portuguese-bert
BERT model trained from scratch on Portuguese texts of the banking domain, achieves state-of-the-art performance
on private in-domain information retrieval, sentiment analysis and NER tasks.
3. BERTimbau
In this section, we present a high-level overview of BERT’s architecture and describe the procedures to pretrain
BERT models for Brazilian Portuguese.
3.1. BERT overview
BERT (Bidirectional Encoder Representations from Transformers) [11] is a language model based on the Trans-
former Encoder [53] architecture. BERT’s main contribution is establishing a framework to pretrain deep bidirectional
word representations that are jointly conditioned on both left and right contexts, in contrast to preceding works on lan-
guage modeling that employ unidirectional LMs. For example, OpenAI GPT [35] pretrains a Transformer Decoder
using left-to-right language modeling, and ELMo [32] concatenates representations from independent left-to-right and
right-to-left language models. BERT demonstrates that bidirectionality is important for both sentence-level and token-
level tasks by improving the state of the art on several benchmarks. The unidirectionality limitation of earlier LMs is overcome by using a modified language modeling objective called Masked Language Modeling, which resembles a denoising objective and which we detail later in this section.
A Transformer Encoder consists of a stack of $L$ identical layers $\mathcal{L}_i : \mathbb{R}^{N \times H} \to \mathbb{R}^{N \times H}$ of hidden dimension $H$ that are applied consecutively:
$$\mathcal{E}(x) = \mathcal{L}_L(\mathcal{L}_{L-1}(\cdots \mathcal{L}_2(\mathcal{L}_1(x)) \cdots)), \qquad (1)$$
where $N$ is the length of the input sequence in tokens. Each layer contains a multi-head self-attention sublayer and a feedforward sublayer, which is an MLP network with two layers.
BERT can receive as input a single sentence or a sentence pair.3 An input sequence is generated by packing the input sentences using two special tokens, [CLS] and [SEP]. A single sentence of tokens $(x_1, \dots, x_N)$ is represented as
$$\text{[CLS]}\; x_1 \cdots x_N \;\text{[SEP]}, \qquad (2)$$
and a sentence pair $(x_1, \dots, x_N)$ and $(y_1, \dots, y_M)$, as
$$\text{[CLS]}\; x_1 \cdots x_N \;\text{[SEP]}\; y_1 \cdots y_M \;\text{[SEP]}, \qquad (3)$$
where $x_i$ and $y_i$ are tokens of a vocabulary of size $V$, and $N$, $M$ are the number of tokens in each sentence. The [SEP] token is simply a separator that marks the end of a sentence.
For the purposes of this work, BERT can be seen as a black-box model that embeds and maps a sequence of tokens $\mathbf{x}$ into a sequence of encoded token representations:
$$(x_1, \dots, x_N) \mapsto (\mathbf{c}, \mathbf{T}_1, \dots, \mathbf{T}_N), \qquad (4)$$
where $\mathbf{T}_i \in \mathbb{R}^H$ is the encoded representation of the $i$-th token $x_i$ in the sequence $\mathbf{x}$, and $\mathbf{c} \in \mathbb{R}^H$ is the encoded output of the [CLS] token, which is used as an aggregate representation of the entire sequence for sequence-level tasks. To apply BERT to a task of interest, the representations $\mathbf{c}$ or $\mathbf{T}_i$ are used as inputs to a task-specific model, which can be as simple as a linear transformation. We refer readers to Vaswani et al. [53] and Devlin et al. [11] for further details on the Transformer and BERT architectures.
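For illustration only, the mapping of Eq. 4 can be reproduced with the Hugging Face transformers library. This is a minimal sketch, not the pretraining code used in this work, and the model identifier is an assumption about where a released BERTimbau checkpoint is hosted.

```python
# Minimal sketch: obtaining c and the T_i from a pretrained checkpoint with the
# Hugging Face transformers library. The model identifier below is assumed.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "neuralmind/bert-base-portuguese-cased"  # assumed identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

inputs = tokenizer("Tinha uma pedra no meio do caminho.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

T = outputs.last_hidden_state  # shape (1, N, H): encoded representations T_1..T_N
c = T[:, 0]                    # output at the [CLS] position, used for sequence-level tasks
```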
BERT usage in downstream tasks is composed of two stages: pretraining and fine-tuning. In the pretraining stage,
the model is trained from scratch on self-supervised tasks to learn useful representations 𝐜and 𝐓𝑖. This stage is
computationally intensive and has to be performed only once. In the fine-tuning stage, a task specific model is attached
to the pretrained BERT and the whole model is further trained on the task of interest.
3 We borrow the notation of the original work: a "sentence" can be any arbitrary contiguous text span, rather than a linguistic sentence. A "sequence", in its turn, refers to the input token sequence, which can be composed of one or two sentences.
In the pretraining stage, BERT is trained on two self-supervised tasks: Masked Language Modeling (MLM) and
Next Sentence Prediction (NSP). Each pretraining example is generated by concatenating two sentences A, with tokens $(a_1, \dots, a_N)$, and B, with tokens $(b_1, \dots, b_M)$, as shown in Eq. 3, where $N$ and $M$ are the sentences' lengths. Given a sentence A from the corpus, 50% of the time sentence B is the sentence that follows A, forming a contiguous piece of text, and 50% of the time B is a random sentence sampled from a distinct document of the corpus. This choice defines the ground-truth label for the NSP task:
$$y_{\mathrm{NSP}} = \mathbb{1}(\text{B is the continuation of A}), \qquad (5)$$
where $\mathbb{1}(\cdot)$ is the indicator function.
For MLM, each example is then corrupted by first selecting a random set of positions (integers from 1 to $N = |x|$), $\mathbf{m} = [m_1, \dots, m_K]$, with $K = 0.15\,N$. The token at each of these positions is then replaced by one of three options: a special [MASK] token with 80% probability, a random token from the vocabulary with 10% probability or, otherwise, the original token is kept. In this selection step we use whole word masking: if a token from a word composed of multiple subword units is chosen to be corrupted, all the other subword units of that word are also corrupted. The original tokens at the positions $m_i$ are saved and serve as labels for the MLM task:
$$y_{\mathrm{MLM}} = [x_{m_1}, \dots, x_{m_K}]. \qquad (6)$$
The final pretraining example can be represented as a tuple $(x_{\mathrm{corrupted}}, \mathbf{m}, y_{\mathrm{MLM}}, y_{\mathrm{NSP}})$.
The corrupted sequences $x_{\mathrm{corrupted}}$ are used as inputs to BERT and the output encoded representations are used as inputs for the pretraining task heads that are attached during this pretraining stage: the NSP head uses the $\mathbf{c}$ vector and the MLM head uses the $\mathbf{T}_i$ vectors of the masked positions. Both heads perform a classification task and the complete model is optimized by minimizing the sum of the cross-entropy losses of the two tasks.
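As a rough illustration of the corruption step described above, the sketch below applies whole word masking over WordPiece-style tokens (continuation pieces start with "##"). It is a simplified stand-in for the original data-generation code, with a toy vocabulary for the random-replacement case.

```python
import random

TOY_VOCAB = ["uma", "pedra", "no", "meio", "do", "caminho"]  # stand-in for the real vocabulary

def whole_word_mask(tokens, mask_rate=0.15):
    """Corrupt a token sequence for MLM with whole word masking (simplified sketch)."""
    # Group wordpiece indices into whole words; "##" marks a continuation piece.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])

    target = max(1, round(mask_rate * len(tokens)))  # roughly 15% of positions
    corrupted, labels, masked = list(tokens), {}, 0
    for word in random.sample(words, len(words)):    # iterate words in random order
        if masked >= target:
            break
        for i in word:                               # corrupt every piece of the chosen word
            labels[i] = tokens[i]                    # the original token is the MLM label
            r = random.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"
            elif r < 0.9:
                corrupted[i] = random.choice(TOY_VOCAB)
            # otherwise the original token is kept
        masked += len(word)
    return corrupted, labels

print(whole_word_mask(["uma", "pedra", "no", "meio", "do", "caminho", "##s"]))
```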
3.2. Vocabulary generation
BERT uses WordPiece [46], a subword tokenization technique that is not open-source. As an alternative, we
generate a cased Portuguese vocabulary of 30,000 subword units using the SentencePiece library [19] with the BPE
algorithm [47] and 2,000,000 random sentences from Portuguese Wikipedia, which contains close to 1 million articles.
The resulting vocabulary is then converted to WordPiece format for compatibility with original BERT code.
To convert the generated SentencePiece vocabulary to WordPiece format, we follow BERT’s tokenization rules.
Firstly, all BERT special tokens ([CLS],[MASK],[SEP], and [UNK]) and all punctuation characters of mBERT’s
vocabulary are added to the Portuguese vocabulary. Then, since BERT splits the text at whitespace and punctuation
prior to applying WordPiece tokenization in the resulting chunks, each SentencePiece token that contains punctuation
characters is split at these characters, the punctuations are removed and the resulting subword units are added to the
vocabulary.4 Finally, subword units that do not start with SentencePiece's meta character "▁" are prefixed with "##", and the "▁" symbol is removed from the remaining tokens.
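The sketch below condenses this procedure using the SentencePiece Python API. The input file name is a placeholder, the punctuation-splitting step is omitted, and the conversion details are a simplified illustration rather than the exact scripts used in this work.

```python
import sentencepiece as spm

# Train a cased BPE vocabulary of 30,000 pieces on sampled Wikipedia sentences
# (the input file name is a placeholder).
spm.SentencePieceTrainer.train(
    input="ptwiki_sampled_sentences.txt",
    model_prefix="pt_bpe",
    vocab_size=30000,
    model_type="bpe",
)

# Convert SentencePiece pieces to WordPiece conventions: the meta character "▁" marks
# pieces that start a word and is dropped; remaining pieces receive the "##" prefix.
vocab = ["[CLS]", "[MASK]", "[SEP]", "[UNK]"]        # BERT special tokens listed above
with open("pt_bpe.vocab", encoding="utf-8") as f:
    for line in f:
        piece = line.split("\t")[0]
        if piece.startswith("<"):                    # skip control symbols such as <unk>
            continue
        if piece.startswith("▁"):
            if len(piece) > 1:
                vocab.append(piece[1:])              # word-initial piece
        else:
            vocab.append("##" + piece)               # word-continuation piece

with open("vocab.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(vocab))
```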
3.3. Pretraining data
For pretraining data, we use the brWaC [55] corpus (Brazilian Web as Corpus), a crawl of Brazilian webpages
which contains 2.68 billion tokens from 3.53 million documents and is the largest open Portuguese corpus to date.
On top of its size, brWaC is composed of whole documents and its methodology ensures high domain diversity and
content quality, which are desirable features for BERT pretraining.
We use only the document body (ignoring the titles) and we apply a single post-processing step on the data to remove mojibakes5 and remnant HTML tags using the ftfy library [49]. The final processed corpus has 17.5GB of raw text. We split the corpus into chunks of 50MB and generate pretraining examples independently for each file as described in Section 3.1, with a duplication factor of 10. That is, we run example generation 10 times for each 50MB file, producing distinct sentence pairs for the NSP task and token masks for the MLM task. For a maximum sequence length of 128 tokens, a total of 4.29 × 10^8 examples are generated, and, for maximum length 512, a total of 1.58 × 10^8 examples.
4 Splitting at punctuation implies that no subword unit can contain both punctuation and non-punctuation characters. Also, there cannot be two punctuation characters inside a single subword unit.
5 Mojibake is a kind of text corruption that occurs when strings are decoded using the incorrect character encoding. For example, the word "codificação" becomes "codificaÃ§Ã£o" when encoded in UTF-8 and decoded using ISO-8859-1.
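As an illustration of the post-processing step above, a minimal cleaning function might look like the sketch below. The regular expression for leftover HTML tags is a simplification and not the exact rule used in this work.

```python
import re
import ftfy

def clean_document(text: str) -> str:
    """Simplified post-processing: fix mojibake with ftfy and drop remnant HTML tags."""
    text = ftfy.fix_text(text)            # e.g. "codificaÃ§Ã£o" -> "codificação"
    text = re.sub(r"<[^>]+>", " ", text)  # crude removal of leftover HTML tags
    return re.sub(r"\s+", " ", text).strip()

print(clean_document("Exemplo de codificaÃ§Ã£o <br> com tag remanescente."))
```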
4. Evaluation
Once pretrained, we evaluate our models on 3 downstream NLP tasks: Sentence Textual Similarity (STS), Recog-
nizing Textual Entailment (RTE), and Named Entity Recognition (NER). These tasks were chosen so as to assess both sentence- and token-level settings and because of the availability of labeled datasets and strong, comparable baselines. To
evaluate BERTimbau on downstream tasks, we remove the MLM and NSP classification heads used during pretraining
stage and attach a relevant head required for each task. We then fine-tune our models on each task or pair of tasks.
Similar to pretraining, sentence-level tasks are performed on the encoded representation of the [CLS] special token,
𝐜, and token-level tasks use the encoded representation of each relevant token, 𝐓𝑖.
4.1. Sentence Textual Similarity and Recognizing Textual Entailment
Sentence Textual Similarity (STS) is a regression task that measures the degree of semantic equivalence between
two sentences in a numeric scale. Recognizing Textual Entailment (RTE), also known as Natural Language Inference
(NLI), is a classification task of predicting if a given premise sentence entails a hypothesis sentence.
We use the dataset of the ASSIN2 shared task [38], which contains 10,000 sentence pairs with STS and RTE annotations. The dataset is composed of 6,500 training, 500 validation and 3,000 test examples.
STS labels are continuous values on a scale of 1 to 5, where a pair of sentences with completely different meanings has a label value of 1 and semantically equivalent sentences have a value of 5. STS performance is evaluated using Pearson's correlation as the primary metric and Mean Squared Error (MSE) as the secondary metric.
RTE labels are simply Entailment and None. RTE performance is evaluated using macro F1-score as the primary metric and accuracy as the secondary metric. Examples from the ASSIN2 dataset can be seen in Table 1, which contains sentence pairs and their corresponding gold labels for both tasks.
Given an example with a premise sentence and a hypothesis sentence, we concatenate the two sentences as in Eq. 3 and feed the sequence into BERTimbau. We attach two independent linear layers on top of BERTimbau in a multitask scheme to predict the STS relatedness score and the RTE classification. We train using the MSE loss for STS and the cross-entropy loss for RTE. The final loss is the sum of both losses with equal weight.
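A minimal sketch of this multitask head is shown below, with hypothetical module and parameter names (the hidden size matches BERTimbau Base). The heads read the [CLS] output $\mathbf{c}$, and the final loss is the equally weighted sum of the MSE and cross-entropy losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StsRteHead(nn.Module):
    """Two independent linear heads over the [CLS] representation (illustrative sketch)."""

    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.sts = nn.Linear(hidden_size, 1)   # STS relatedness score (regression)
        self.rte = nn.Linear(hidden_size, 2)   # RTE logits: Entailment vs. None

    def forward(self, cls_output, sts_target=None, rte_target=None):
        sts_pred = self.sts(cls_output).squeeze(-1)
        rte_logits = self.rte(cls_output)
        loss = None
        if sts_target is not None and rte_target is not None:
            # Equal-weight sum of the MSE (STS) and cross-entropy (RTE) losses.
            loss = F.mse_loss(sts_pred, sts_target) + F.cross_entropy(rte_logits, rte_target)
        return sts_pred, rte_logits, loss

# Usage sketch: head = StsRteHead(); head(c, sts_scores, rte_labels)
```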
4.2. Named Entity Recognition
Named Entity Recognition (NER) is the task of identifying text spans that mention named entities (NEs) and classifying them into predefined categories, such as person, organization, and location. Given a sequence of tokens $(x_1, \dots, x_N)$, a NER system has to output triples $(t_s, t_e, k)$, where $t_s, t_e \in \{1, \dots, N\}$ are the start and end token indices of an entity, respectively, and $k$ is a named entity class. We cast NER as a sequence labeling task that performs unified entity identification and classification using the IOB2 tagging scheme [51].
We use the Golden Collections of the First HAREM evaluation contests [43], First HAREM and MiniHAREM,
as train and test sets, respectively, following previous works [6,42]. Both datasets contain multidomain documents
annotated with 10 NE classes: Person, Organization, Location, Value, Date, Title, Thing, Event, Abstraction, and
Other. Table 2 shows some dataset samples.
We employ the datasets on two distinct scenarios: a Total scenario that considers all 10 classes, and a Selective
scenario that includes only 5 classes (Person, Organization, Location, Value, and Date). Table 3 presents some dataset statistics. We set aside 7% of the First HAREM documents as a holdout validation set.
To perform NER, we experiment with two architectures. In the simplest architecture, we attach a linear classifier
layer on top of BERTimbau to predict the tag of each token independently. The model is trained using cross-entropy
loss. We compute predictions and losses only for the first wordpiece of each token, ignoring word continuations, as
depicted in Figure 1.
Since Linear-Chain Conditional Random Fields (CRF) [21] is widely adopted to enforce sequential classification
in sequence labeling tasks [6,22,42], we also experiment with employing a CRF layer after the linear layer. We refer
readers to Lample et al. [22] for a detailed explanation on CRF loss formulation and decoding procedure.
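The sketch below illustrates the simpler (linear, no CRF) variant: a per-token tag classifier over the BERT outputs in which the loss is computed only at positions that start a word, as described above. It is an illustration rather than the exact implementation; the CRF variant would replace the per-token cross-entropy with a CRF log-likelihood over the same scores.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearNerHead(nn.Module):
    """Per-token tag classifier scoring only first wordpieces (illustrative sketch)."""

    def __init__(self, hidden_size: int, num_tags: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_tags)

    def forward(self, sequence_output, first_piece_mask, labels=None):
        # sequence_output: (batch, seq_len, hidden) encoded tokens from BERT.
        # first_piece_mask: 1 for tokens that start a word, 0 for "##" continuations.
        logits = self.classifier(sequence_output)
        loss = None
        if labels is not None:
            active = first_piece_mask.view(-1).bool()
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1))[active],
                labels.view(-1)[active],
            )
        return logits, loss
```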
NER performance is evaluated using the CoNLL 2003 [52] evaluation script,6 which computes entity-level precision, recall, and micro F1-score on exact matches. In other words, precision is the percentage of named entities predicted by the model that are correct, recall is the percentage of corpus entities that were correctly predicted, and F1-score is the harmonic mean of precision and recall.
6 https://www.clips.uantwerpen.be/conll2002/ner/bin/conlleval.txt
Table 1
Five samples of the ASSIN2 dataset. Each sample is composed of a sentence pair and its gold STS relatedness score (a continuous value from 1 to 5) and RTE label (Entailment or None). Actual dataset examples are in Portuguese, with English translations provided by us.
Gold STS/RTE Sentence pair
5.0 / Entailment
A: Os meninos estão de pé na frente do carro, que está queimando.
B: Os meninos estão de pé na frente do carro em chamas.
English translation:
A: The boys are standing in front of the car, which is burning.
B: The boys are standing in front of the burning car.
4.0 / Entailment
A: O campo verde para corrida de cavalos está completamente cheio de jóqueis.
B: Os jóqueis estão correndo a cavalos no campo, que é completamente verde.
English translation:
A: The green field for horse races is completely full of Jockeys.
B: The Jockeys are racing horses on the field, which is completely green.
3.0 / Entailment
A: A gruta com interior rosa está sendo escalada por quatro crianças do Oriente Médio, três meninas
e um menino.
B: Um grupo de crianças está brincando em uma estrutura colorida.
English translation:
A: Four middle eastern children, three girls and one boy, are climbing on the grotto with a pink
interior.
B: A group of kids is playing in a colorful structure.
2.0 / None
A: Não tem nenhuma pessoa descascando uma batata.
B: Uma pessoa está fritando alguma comida.
English translation:
A: There is no one peeling a potato.
B: A person is frying some food.
1.0 / None
A: Um cachorro está correndo no chão.
B: A menina está batucando suas unhas.
English translation:
A: A dog is running on the ground.
B: The girl is tapping her fingernails.
4.3. Document context and max context evaluation for token-level tasks
In token-level tasks such as NER, we use document context for input examples instead of sentence context to take advantage of longer contexts when encoding token representations from BERT. Following the approach of Devlin et al. [11] on the SQuAD dataset, examples longer than $S$ tokens are broken into spans of length up to $S$ using a stride of $D$ tokens. Each span is used as a separate example during training. During evaluation, however, a single token $T_i$ can be present in up to $W = \lceil S/D \rceil$ spans $s_j$, and so may have up to $W$ distinct predictions $y_{i,j}$. Each token's final prediction is taken from the span where the token is closest to the central position, that is, the span where it has the most contextual information. Figure 1 illustrates this procedure.
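The sketch below illustrates the span-splitting and maximum-context selection with hypothetical helper functions; it follows the description above rather than the exact implementation.

```python
def split_into_spans(tokens, max_len=512, stride=128):
    """Break a long tokenized document into overlapping spans (sketch)."""
    spans, start = [], 0
    while True:
        spans.append((start, tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break
        start += stride
    return spans

def pick_max_context(spans, predictions):
    """Keep, for each token position, the prediction from the span where the token is
    closest to the span center, i.e. has the most surrounding context.
    predictions[j][k] is the prediction for the k-th token of span j."""
    best = {}
    for j, (start, span_tokens) in enumerate(spans):
        center = (len(span_tokens) - 1) / 2
        for k in range(len(span_tokens)):
            pos, score = start + k, -abs(k - center)  # higher score = closer to center
            if pos not in best or score > best[pos][0]:
                best[pos] = (score, predictions[j][k])
    return [best[pos][1] for pos in sorted(best)]

# Tiny example: a 10-token document, spans of 6 tokens and stride 3.
spans = split_into_spans(list(range(10)), max_len=6, stride=3)
preds = [[f"span{j}" for _ in toks] for j, (_, toks) in enumerate(spans)]
print(pick_max_context(spans, preds))
```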
5. Experiments
In this section, we present the experimental setup and results for BERT pretrainings and evaluation tasks. We
conduct additional experiments to explore the usage of BERTimbau as a fixed extractor of contextual embeddings and
also assess the impact of the long pretraining stage.
Table 2
FirstHAREM dataset samples. Gold named entities are enclosed by brackets with sub-
scripted labels. Actual dataset examples are in Portuguese, with English translations pro-
vided by us.
A onça, ou jaguar, é um mamífero ([Panthera]THING onca), da ordem dos carnívoros, família dos felídeos, encontrado
em todo o continente americano, dos [EUA]LOC à [Argentina]LOC e em todo o [Brasil]LOC .
English translation: The jaguar is a mammal ([Panthera]THING onca), of the order of carnivores, family of
felids, found throughout the American continent, from the [USA]LOC to [Argentina]LOC and throughout [Brazil]LOC.
[Almeida Henriques]PER ([A.H.]PER): O [CEC]ORG foi criado numa lógica de unir as associações da [Região Centro]LOC,
quer sejam industriais, quer sejam comerciais, quer sejam agrícolas.
English translation: [Almeida Henriques]PER ([A.H.]PER): The [CEC]ORG was created in a logic of uniting the
associations of the Center Region, whether industrial, commercial or agricultural.
Entre os mais importantes destacam-se o de [Shanta Durga]TITLE e o de [Shri Munguesh]TITLE, construidos há [400
anos]VALUE.
English translation: Among the most important are [Shanta Durga]TITLE and [Shri Munguesh]TITLE, built [400
years]VALUE ago.
Para aqueles que vão participar do processo seletivo, o professor de [Direito Previdenciário]ABS [Fábio Zambite]PER
uma dica importante: os candidatos devem estudar com bastante atenção o [Decreto 3.048/99]TITLE, que aprova o
[Regulamento da Previdência Social]TITLE.
English translation: For those who are going to participate in the selection process, Professor of [Social Security Law]ABS
[Fábio Zambite]PER gives an important tip: candidates must carefully study [Decree 3.048/99]TITLE, which approves
the [Social Security Regulation]TITLE.
[A Mulher no Inicio do Novo Século]EVENT
Dia [15 de Maio]TIME, pelas [9.30H]TIME , no [Cine-Teatro Caridade]LOC, em [Moura]LOC irá realizar-se um Fórum
intitulado [A Mulher no Inicio do Novo Século]ABS, tendo como organização a [Câmara Municipal de Moura]ORG e a
colaboração da [Associação de Mulheres do Concelho de Moura]ORG .
English translation: [Women at the Beginning of the New Century]EVENT
On the [15th of May]TIME, at [9.30 am]TIME , at the [Cine-Teatro Caridade]LOC, in [Moura]LOC, a Forum entitled
[Women at the Beginning of the New Century]ABS will take place, organized by the [Moura City Council]ORG with the
collaboration of the [Moura’s Women Association Board]ORG.
[Touro]OTHER é o signo seguinte. O sol o visita entre [21 de abril]TIME e [21 de maio]TIME, domicílio de [Vênus]THING .
English translation: [Taurus]OTHER is the next sign. The sun visits him between [April 21]TIME and [May 21]TIME, home
of [Venus]THING.
5.1. Pretrainings
Following the original BERT work [11], we train BERTimbau models of two sizes: Base ($L$ = 12 layers, $H$ = 768, 12 attention heads, and 110M parameters) and Large ($L$ = 24 layers, $H$ = 1024, 16 attention heads, and 330M parameters). The maximum sentence length is set to $S$ = 512 tokens. We train cased models only, since we focus on
general purpose models and capitalization is relevant for tasks like named entity recognition [6,11].
The models are pretrained for 1,000,000 steps. We use a learning rate of 1e-4, with warmup over the first 10,000 steps followed by a linear decay of the learning rate over the remaining steps.
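For reference, the warmup-then-linear-decay schedule can be written as the small function below; it is a sketch of the schedule just described, not the actual optimizer code.

```python
def pretraining_lr(step, base_lr=1e-4, warmup_steps=10_000, total_steps=1_000_000):
    """Linear warmup to base_lr over warmup_steps, then linear decay to zero (sketch)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

# Peak at the end of warmup; roughly half the peak midway through the decay.
print(pretraining_lr(10_000), pretraining_lr(505_000))
```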
For BERTimbau Base models, the weights are initialized with the checkpoint of Multilingual BERT Base (dis-
carding the word embeddings and the MLM head weights that are from a different vocabulary). We use a batch size of
128 and sequences of 512 tokens for the entire training. This training takes 4 days on a TPU v3-8 instance and performs
about 8 epochs over the pretraining data.
For BERTimbau Large, the weights are initialized with the checkpoint of English BERT Large (again discarding
the word embeddings and MLM head weights that are from a different vocabulary), because Multilingual BERT is
Table 3
Dataset statistics for the HAREM I corpora. The Tokens column refers to whitespace and
punctuation tokenization.
Dataset        Documents    Tokens    Entities (Selective)    Entities (Total)
First HAREM    129          95,585    4,151                   5,017
MiniHAREM      128          64,853    3,018                   3,642
Figure 1: Illustration of the proposed method for the NER task described in 4.3. Given an
input document, the text is tokenized using WordPiece [56] and the tokenized document
is split into overlapping spans of the maximum length using a fixed stride ($D$ = 3 in the example). Maximum context tokens of each span are marked in bold. The spans are fed into BERT and then into the classification model, producing a sequence of tag scores for each span. The scores of subtoken entries (starting with ##) are removed from the spans and the remaining tag scores are passed to the CRF layer, if it is employed; otherwise
the highest tag scores are used independently. The maximum context tokens are selected
and concatenated to form the final predicted tags.
only available at Base size. Since it is a bigger model with longer training time, we follow the instructions of Devlin
et al. [11] and use sequences of 128 tokens in batches of size 256 for the first 900,000 steps and then sequences of 512
tokens and batch size 128 for the last 100,000 steps. This training takes 7 days on a TPU v3-8 instance and performs
about 6 epochs over the training data.
Training loss curves for both pretrainings are shown in Figure 2. It can be seen that there is a sharp decrease in loss
over the initial 100k steps, which can be interpreted as the models learning the word embeddings that are initialized
randomly. The losses slowly decrease afterwards until the end of the training. A steep decrease in BERTimbau Large
loss can be noticed in step 900,000, which marks the beginning of the pretraining using sequences of 512 tokens.
Note that the calculation of the number of epochs takes into account the duplication factor of 10 used when generating the input examples. This means that over these epochs the same sentence is seen with a different masking and sentence pair each time, which is effectively equivalent to the dynamic example generation proposed by RoBERTa [25].
5.2. Fine-tunings on evaluation tasks
For all evaluation experiments, we fine-tune the complete model and we use a learning rate schedule of warmup
over the first 10% steps followed by linear decay of the learning rate over the remaining steps. Similar to pretraining,
we use BERT's AdamW optimizer implementation with $\beta_1$ = 0.9, $\beta_2$ = 0.999 and L2 weight decay of 0.01. We
perform early stopping and select the best model on the validation set of each dataset.
Figure 2: Training loss curves for BERTimbau Base and BERTimbau Large pretrainings.
Smoothed curves are exponential moving averages with smoothing factor $\alpha$ = 0.95. A
sharp decrease in BERTimbau Large training loss can be noticed after step 900,000, when
training begins using sequences of 512 tokens.
Table 4
Test scores for STS and RTE tasks on ASSIN2 dataset. We compare our models
to the best published results. Best scores in bold. Reported values are the average of multiple runs (standard deviation in parentheses) with different random seeds. Star (*) denotes primary metrics. Rows 1 and 2 use ensemble techniques; row 3 uses extra supervised training data.
Row  Model                                               STS                          RTE
                                                         Pearson (*)    MSE           F1 (*)      Accuracy
1    mBERT + RoBERTa-Large-en (Averaging) [40]           0.83           0.91          84          84.8
2    mBERT + RoBERTa-Large-en (Stacking) [40]            0.785          0.59          88.3        88.3
3    mBERT (STS) and mBERT-PT (RTE) [39]                 0.826          0.52          87.6        87.6
4    USE+Features (STS) and mBERT+Features (RTE) [13]    0.800          0.39          86.6        86.6
5    mBERT+Features [13]                                 0.817          0.47          86.6        86.6
6    mBERT (ours)                                        0.809 (0.004)  0.58 (0.04)   86.8 (0.4)  86.8 (0.4)
7    BERTimbau Base                                      0.836 (0.007)  0.58 (0.03)   89.2 (0.8)  89.2 (0.8)
8    BERTimbau Large                                     0.852 (0.003)  0.50 (0.03)   90.0 (0.4)  90.0 (0.4)
5.3. Sentence Textual Similarity and Recognizing Textual Entailment results
For this experiment, we train BERTimbau Base with learning rate of 4e-5 and batch size 32 for 10 epochs, and
BERTimbau Large with learning rate of 1e-5, batch size 8 for 5 epochs. We also train mBERT to compare it to
BERTimbau models. mBERT is trained using learning rate of 1e-5 and batch size 8 for 10 epochs.
5.3.1. Results
Our results for both tasks are shown in Table 4. We compare our results to the best-performing submissions to
official ASSIN2. All compared works employ mBERT or a Transformer-based architecture in their approaches. In the
following paragraphs, we refer to each work using their corresponding row numbers in Table 4.
Table 5
Results of NER task (Precision, Recall and micro F1-score) on the test set (MiniHAREM).
Best results in bold. Reported values are the average of multiple runs (standard deviation in parentheses) with different random seeds. Star (*) denotes primary metrics.
Row Architecture Total scenario Selective scenario
Prec. Rec. F1 (*) Prec. Rec. F1 (*)
1 CharWNN [42] 67.2 63.7 65.4 74.0 68.7 71.2
2 LSTM-CRF [6] 72.8 68.0 70.3 78.3 74.4 76.3
3 BiLSTM-CRF+FlairBBP [44] 74.9 74.4 74.6 83.4 81.2 82.3
4 mBERT 71.6 (1.1) 72.7 (0.4) 72.2 (0.6) 77.0 (0.8) 78.8 (0.7) 77.9 (0.6)
5 mBERT + CRF 74.1 (0.7) 72.2 (0.8) 73.1 (0.7) 80.1 (0.4) 78.3 (0.6) 79.2 (0.2)
6 BERTimbau Base 76.8 (0.8) 77.1 (0.7) 77.2 (0.7) 81.9 (0.7) 82.7 (0.4) 82.2 (0.4)
7 BERTimbau Base + CRF 78.5 (0.8) 76.8 (1.1) 77.6 (0.8) 84.6 (1.2) 81.6 (1.0) 83.1 (1.1)
8 BERTimbau Large 77.9 (0.4) 78.0 (0.3) 77.9 (0.3) 81.3 (0.9) 82.2 (0.9) 81.7 (0.7)
9 BERTimbau Large + CRF 79.6 (0.8) 77.4 (1.0) 78.5 (0.8) 84.9 (0.9) 82.5 (0.6) 83.7 (0.2)
BERTimbau models achieve the best results on the primary metrics of both STS and RTE tasks, with the large
model performing significantly better than the base variant. The previous highest scores (rows 1 and 2) for both STS
Pearson’s correlation and RTE F1 score are from ensemble techniques that combine mBERT fine-tuned on original
ASSIN2 data and an English RoBERTa-Large fine-tuned on ASSIN2 data automatically translated to English. The
averaging ensemble uses 2 models and the stacking ensemble uses 10 distinct fine-tuned models (5-fold stacking, which results in 5 trained mBERT and 5 trained RoBERTa models). While this approach shows an interesting application of
English models to Portuguese tasks, our BERTimbau models achieve higher performance using a single model and,
hence, demand lower compute resources in both fine-tuning and inference stages.
Our implementation using mBERT (row 6) presents lower performance compared to the BERTimbau models, which highlights the benefits of BERTimbau's Portuguese pretraining. For the STS task, we note that mBERT achieves the same MSE as BERTimbau Base, even though its Pearson correlation is lower. Compared to our mBERT baseline, other works achieve better performance by using extra supervised training data and further pretraining of mBERT on Portuguese data (row 3), and also by combining it with hand-designed features (rows 4 and 5).
5.4. NER task
In this section, we refer to the two architectures defined in Section 4.2 as BERT and BERT-CRF. Long examples are broken into spans using a stride of $D$ = 128, as explained in Section 4.3.
The model parameters are divided into two groups with different learning rates: 5e-5 for the BERT model and 1e-3 for
the classifier. We train the models for up to 50 epochs using a batch size of 16. Models with CRF are trained for up to
15 epochs.
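A sketch of this two-group setup with PyTorch's AdamW optimizer is shown below; the parameter-name matching is an assumption about how the head parameters are named, not the exact training code.

```python
import torch

def build_ner_optimizer(model, bert_lr=5e-5, classifier_lr=1e-3, weight_decay=0.01):
    """Two parameter groups, BERT encoder vs. task head, with different learning rates
    (sketch). Assumes head parameter names contain 'classifier' or 'crf'."""
    bert_params, head_params = [], []
    for name, param in model.named_parameters():
        if "classifier" in name or "crf" in name:
            head_params.append(param)
        else:
            bert_params.append(param)
    return torch.optim.AdamW(
        [{"params": bert_params, "lr": bert_lr},
         {"params": head_params, "lr": classifier_lr}],
        weight_decay=weight_decay,
    )
```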
In addition to BERTimbau Base and Large, we also train mBERT to compare monolingual versus multilingual
model performances. mBERT is fine-tuned with the same hyperparameters.
It is common in NER for the vast majority of tokens not to belong to named entities (and thus have the tag label "O"). To deal with this class imbalance, we initialize the classifier's bias term for the "O" tag with a value of 6 in order to promote better stability in early training [24]. We also use a weight of 0.01 for "O" tag losses.
When evaluating, we produce valid predictions by removing all invalid tag transitions for the IOB2 scheme, such
as “I-” tags coming directly after “O” tags or after an “I-” tag of a different class. This post-processing step trades off
recall for a possibly higher precision.
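One simple way to apply this post-processing is the small function below; it is a sketch that turns invalid "I-" continuations into "O", matching the recall-for-precision trade-off described above.

```python
def fix_iob2(tags):
    """Drop invalid IOB2 transitions: an 'I-X' tag must follow 'B-X' or 'I-X' (sketch)."""
    fixed, prev = [], "O"
    for tag in tags:
        if tag.startswith("I-") and prev not in (f"B-{tag[2:]}", f"I-{tag[2:]}"):
            tag = "O"  # invalid continuation: discard it
        fixed.append(tag)
        prev = tag
    return fixed

print(fix_iob2(["O", "I-PER", "B-LOC", "I-LOC", "I-ORG"]))
# -> ['O', 'O', 'B-LOC', 'I-LOC', 'O']
```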
5.4.1. Results
The main results of our NER experiments are presented in Table 5. We compare the performances of our models
on the two scenarios (total and selective) defined in Section 4.2 to results of previous works. The models of rows 1 to 3
show the progress of neural network approaches for this dataset over the recent years. The previous best result (row 3),
achieved by BiLSTM-CRF+FlairBBP model, uses Portuguese Flair Embeddings, which are contextual embeddings
Table 6
NER performances (Precision, Recall and F1-score) on the test set (MiniHAREM) using
BERTimbau as contextual embeddings in a feature-based approach. Star (*) denotes
primary metrics.
Architecture Total scenario Selective scenario
Prec. Rec. F1 (*) Prec. Rec. F1 (*)
mBERT + LSTM-CRF 74.7 69.7 72.1 80.6 75.0 77.7
BERTimbau Base + LSTM-CRF 78.3 73.2 75.6 84.5 78.7 81.6
BERTimbau Large + LSTM-CRF 77.4 72.4 74.8 83.0 77.8 80.3
extracted from character-level language models [1].
Our best model, BERTimbau Large + CRF (row 9), outperforms the best published results, improving the F1-score by 3.9 points on the total scenario and by 1.4 points on the selective scenario. Interestingly, Flair embeddings outperform BERT models on English NER [1,11].
There is a large performance gap between BERTimbau and mBERT, which reinforces the advantages of monolingual models pretrained on multidomain data over mBERT, which is trained only on Wikipedia articles. This result is in line with other monolingual BERT works.
The CRF layer consistently brings performance improvements in F1 in all settings. However, the F1 increases are driven by a large boost in precision that is often associated with lower recall. It is worth noting that, without CRF, BERTimbau Large shows a close but inferior performance to the Base variant on the selective scenario. This result suggests that a more controlled fine-tuning scheme might be required in some cases, such as partial layer unfreezing or discriminative fine-tuning [33] (using lower learning rates for lower layers), given that it is a higher capacity model trained on little data.
5.5. BERTimbau as contextual embeddings
In this experiment, we evaluate BERTimbau as a fixed extractor of contextual embeddings that we use as input
features to train a downstream model on the NER task. In other words, BERTimbau’s weights are kept frozen and only
the downstream model’s weights are optimized during training, in contrast to the fine-tuning experiments where the
complete model is optimized. This setup can be interesting in lower resource scenarios in which several tasks are to
be performed on the same input text: the extraction of contextual embeddings —which is the most expensive stage,
—can be computed once and then shared across several smaller task-specific models.
In this feature-based approach, we train a BiLSTM-CRF model with 1 layer and 100 hidden units followed by a linear classifier layer for up to 50 epochs. Instead of using only the hidden representation of BERT's last encoder layer, the representation $\mathbf{T}_i$ of each token is taken as the sum of its representations from the last 4 layers, as proposed by the original work [11]:
$$\mathbf{T}_i^{\mathrm{fixed}} = \mathbf{T}_i^{(L)} + \mathbf{T}_i^{(L-1)} + \mathbf{T}_i^{(L-2)} + \mathbf{T}_i^{(L-3)}, \qquad (7)$$
where $\mathbf{T}_i^{(j)} \in \mathbb{R}^H$ is the output of the $j$-th Transformer Encoder layer at the $i$-th position and $L$ is the number of layers. The resulting architecture resembles the BiLSTM-CRF model [22] but uses BERT contextual embeddings instead of fixed word embeddings.
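A sketch of extracting the fixed representations of Eq. 7 with the transformers library is shown below; the model identifier is assumed, and this is an illustration rather than the training code of the BiLSTM-CRF.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "neuralmind/bert-base-portuguese-cased"  # assumed identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
model.eval()  # weights stay frozen in the feature-based setup

inputs = tokenizer("BERTimbau como extrator de embeddings.", return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).hidden_states  # embedding layer + one tensor per encoder layer

# Eq. (7): sum of the outputs of the last four encoder layers as fixed token features.
T_fixed = torch.stack(hidden_states[-4:]).sum(dim=0)  # shape (1, N, H)
```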
5.5.1. Results
We present the results in Table 6. Models using the feature-based approach perform significantly worse than those using the fine-tuning approach. The performance gap is much larger than the values reported for NER in English [11,33], reaching up to 2 points for BERTimbau Base and 3.5 points for BERTimbau Large, although it could probably be reduced by further hyperparameter tuning.
In this setup, BERTimbau Base+BiLSTM-CRF achieves similar performances to BiLSTM-CRF+FlairBBP (row
3 of Table 5), which also uses contextual embeddings and a similar architecture. BERTimbau shows a slightly lower
F1-score in the Selective scenario but higher F1-score in the Total scenario.
Figure 3: Performance of BERTimbau Base on the NER task using intermediate checkpoints of the pretraining stage (235k, 505k, 700k and 1,000k steps), shown as dev F1 score versus pretraining steps for the Selective and Total scenarios. Scores reported on the validation set.
It is worth mentioning that BERTimbau models in this feature-based approach achieve better performances than a
fine-tuned mBERT on this same task. While BERTimbau Large is the highest performer when fine-tuned, we observe
that it experiences performance degradation when used in this feature-based approach, performing worse than the
smaller Base variant but still better than mBERT.
5.6. Impact of pretraining steps
To assess the impact of the number of pretraining steps on the performance of downstream tasks, we repeat part of
the NER fine-tuning experiment (Section 5.4) using intermediate checkpoints of BERTimbau Base pretraining proce-
dure. We train BERT models (without CRF) using the checkpoints of steps 235k, 505k and 700k, which correspond to
23.5%, 50.5% and 70% of the complete pretraining of 1000k steps, respectively. All models are trained with the same
hyperparameters and experimental setup described in Section 5.4.
The results are shown in Figure 3. Performance on the downstream task increases non-linearly with pretraining steps, with diminishing returns as pretraining progresses. This is an expected result, as the test performance on pretraining tasks has been shown to follow a power law in the number of pretraining steps [17].
6. Tokenization analysis
One possible advantage of a monolingual BERT over multilingual BERT can be related to the WordPiece tokenizer
vocabulary. The vocabulary size is a hyperparameter that limits the number of distinct recognizable tokens, which
affects the size of the input token embedding matrix. Most monolingual BERT models have vocabulary sizes in the
range of 30,000 to 50,000 tokens [11,25,23,4]. In comparison, mBERT has a vocabulary of 120,000 tokens, which
has to encompass tokens of over 100 languages and a variety of alphabets. When considering the usage of mBERT
on a single specific language, the effective vocabulary size is usually much smaller than a monolingual equivalent,
resulting in longer tokenized sequences. This happens because, in smaller vocabularies generated by BPE, only very
frequent words will be present as individual tokens, causing the tokenization of most words to be composed of multiple
subword units.
Considering that dot-product attention layers have quadratic complexity, which imposes limitations on the input sequence size of BERT and Transformer models in general [53], a more efficient tokenization that produces shorter sequences allows inputting a larger textual context in a sequence of maximum length $S$. This limitation is often encountered in sequence-level tasks such as classification of long documents [50].
One can also hypothesize that a tokenization that often breaks words into multiple subword units imposes a harder
task on the model, since instead of receiving an embedding vector that readily represents the original word, the model
will receive several vectors — one vector for each subword unit — that will have to be combined inside the model to
form a complete word representation.
Figure 4: Distribution of ASSIN2 test set examples binned by subtoken count in the tokenization of the concatenated premise and hypothesis texts, for BERTimbau and mBERT (left). A subtoken is any word continuation token that starts with "##". The bin at $x$ = 0 contains examples whose premise and hypothesis tokenizations are composed of only whole words. The histogram on the right is a clipped version that aggregates the right tail of the distributions into the $x$ = 10 and $x$ = 15 bins for BERTimbau and mBERT, respectively.
Figure 5: Metrics of RTE task on ASSIN2 test set examples computed separately for each
bin of the distribution of the right side of Figure 4.
In this experiment, we analyze the tokenizations produced by BERTimbau’s and mBERT’s tokenizers and com-
pare the produced tokenized sequences for the evaluation tasks’ datasets. We compare the sequence lengths for each
downstream task dataset and assess how subword unit tokenization may affect the performance of each task.
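The basic comparison can be reproduced along the lines of the sketch below; "bert-base-multilingual-cased" is the public mBERT checkpoint, while the BERTimbau identifier is an assumption.

```python
from transformers import AutoTokenizer

pt_tok = AutoTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")  # assumed id
m_tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")            # public mBERT

def subtoken_count(tokenizer, text):
    """Number of word-continuation pieces ('##...') produced for the given text."""
    return sum(tok.startswith("##") for tok in tokenizer.tokenize(text))

sentence = "Os meninos estão de pé na frente do carro em chamas."
print(subtoken_count(pt_tok, sentence), subtoken_count(m_tok, sentence))
```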
6.1. Tokenization effects on Sentence Textual Similarity and Recognizing Textual Entailment tasks
To investigate the effects of the tokenization on the RTE and STS tasks, we tokenize the ASSIN2 test set examples using the BERTimbau and mBERT tokenizers. The examples are then binned by the subtoken count in the premise and hypothesis texts' tokenizations, as can be seen in Figure 4. The mBERT tokenizer produces a median of 7 subtokens per example, while BERTimbau has a distribution skewed to the lower end and a median value of 3.5. Less than 1% of the examples tokenized using the multilingual vocabulary are composed of only whole words, while this proportion is 7.5% for the Portuguese tokenizer. Since both distributions have a right tail of low-proportion bins, the right tails are clipped to the bins $x$ = 10 for BERTimbau and $x$ = 15 for mBERT, as shown in the right side of Figure 4, and these bins are used for the following metrics analysis.
For each example bin of the distribution, we compute the evaluation metrics for the RTE and STS tasks to see how performance varies as tokenizations break words into more pieces, as shown in Figures 5 and 6. For the RTE task, there is almost no variation in F1-score and accuracy for BERTimbau as the subtoken count increases. mBERT appears to show performance degradation beginning at the 9-subtoken bin, even though the 15+ bin recovers the performance of the lower bins. It is noticeable that mBERT performs on par with BERTimbau in the lower subtoken bins, and its
Figure 6: Metrics of STS task on ASSIN2 test set examples computed separately for each
bin of the distribution of the right side of Figure 4. Computation of Pearson’s correlation
uses global mean values for ground-truth and prediction similarity scores.
Figure 7: Distribution of ground-truth entities of Mini HAREM dataset by number of words
and presence of subtoken (word continuation token that starts with “##”), for BERTimbau
and mBERT.
global metrics are affected by the higher bins with worse performance, which comprise over 30% of the test set. Similar conclusions can be drawn for the STS task metrics, with BERTimbau metrics showing less overall variation while mBERT shows degradation in the higher bins. We argue, however, that it is not possible to draw precise conclusions, since bins at the distribution tails may have as few as 30 to 100 examples and, as such, the metrics of these bins can be dominated by the presence of easy or hard examples or of rarer words.
6.2. Tokenization effects on Named Entity Recognition
Given that Named Entity Recognition is a token-level task and the metrics are computed at the entity level, we take a distinct approach and analyze how the tokenization of the entities' words may affect the model performance. Thus, we compare the tokenizations using the BERTimbau and mBERT vocabularies on the ground-truth and predicted entities of the test dataset, Mini HAREM. The compared models are the BERT-CRF architecture on the Total scenario.
Figure 8: Metrics of BERTimbau Large (BERT-CRF) on Named Entity Recognition task
on the test dataset (Mini HAREM, Total scenario) binned by entity word count.
When casting NER as a sequence tagging problem, it is intuitively expected that longer entities might be harder to predict accurately. This comes from the fact that an entity composed of more words is encoded as a longer tag sequence and any incorrectly predicted tag yields a wrong entity prediction, which hurts both recall and precision. Considering
this proposition, we bin the ground-truth and predicted entities by word count and, inside each bin, we distinguish
between entities whose tokenization contains only whole words or contains at least one subtoken, as shown in Figure 7
for the ground-truth entities. It is worth emphasizing the definition of a word in this context: we consider as words any sequences of characters produced by splitting a text at whitespace and punctuation characters, considering each punctuation character as a separate word. The length of an entity in words is independent of the vocabulary. Each
word is then tokenized into one or multiple subword units by WordPiece tokenization using the model vocabulary,
affecting the presence of subtoken or not.
As can be seen in Figure 7, the proportion of entities that have subtokens inside each bin is very similar between
BERTimbau and mBERT vocabularies. This is not unexpected, since it is a general NER dataset and entities often
contain proper names which are commonly rarer words. Even though the Portuguese vocabulary contains a larger set
of common Portuguese proper names than the multilingual vocabulary, the opposite holds for foreign proper names,
for instance, and these effects roughly balance each other out in this case.
We separately compute NER metrics (F1-score, Precision and Recall) for both analyzed models for each entity bin of Figure 7, as shown for BERTimbau in Figure 8 and for mBERT in Figure 9. By observing the dashed lines of the F1-score and recall plots, it can be seen that the proposition that longer entities show worse performance holds for both models. The most relevant finding is that both models show worse performance in detecting entities that have at least one subtoken, across all entity lengths, compared to entities that are composed of only whole words. The degree of performance degradation with increasing entity length is also higher for entities that contain subtokens.
We did not expect the presence of subtokens inside the entities to impact the task performance in such a strong and consistent manner. When defining the NER task, it is often stated that the class of a named entity depends not only on the entity itself, but also heavily on its surrounding words [57]. For instance, consider the act of replacing a person's or organization's name in a sentence by another one of the same class. This replacement should not necessarily make the entity easier or harder to detect, as long as the meaning and syntax of the sentence are preserved, since the surrounding context might give enough information to infer the entity class nonetheless. However, the observed performance degradation suggests that the choice of the replacement entity might affect the model's ability to detect it, depending on whether or not it contains subtokens.
6.3. Discussion
Our experiments reveal a correlation between word segmentation into multiple subwords and task performance, with higher degradation in the case of mBERT. Analyzing the impacts of tokenization on these models is an area of further research that has not been much explored. SciBERT [3], a BERT trained on scientific articles in English, shows that having an in-domain vocabulary is beneficial, but its authors argue that larger benefits come from fine-tuning BERT on in-domain data rather than from a better-suited vocabulary. Both factors are present in BERTimbau, since it is trained on more (and more diverse) Portuguese data than mBERT.
Figure 9: Metrics of mBERT (BERT-CRF) on Named Entity Recognition task on the test
dataset (Mini HAREM, Total scenario) binned by entity word count.
However, we believe there are other factors that, along with the tokenization method, might be hurting the
performance of these models and can be subjects of research. While subword unit tokenization is a more robust
alternative to word-level tokenization, allowing the representation of out-of-vocabulary words with much smaller
vocabularies, there is performance degradation when tokenization deviates from the ideal scenario in which words are
kept intact. We hypothesize that the BERT architecture is not capable of easily reconstructing a high-quality word
representation from multiple independent subword units. Future work could experiment with techniques such as subword
regularization [18] to encourage the learning of compositionality across multiple subword units.
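As a minimal sketch of what subword regularization could look like, assuming a SentencePiece unigram model trained on Portuguese text (the model file name is hypothetical, and BERT's WordPiece tokenizer would need to be replaced or adapted for such sampling to be used during pretraining):

```python
# Sketch: sampling-based segmentation with a SentencePiece unigram model.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="pt_unigram.model")  # hypothetical model file

word = "composicionalidade"
for _ in range(3):
    # enable_sampling draws a different segmentation on each call, exposing the model
    # to several subword compositions of the same word during training.
    print(sp.encode(word, out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1))
```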
7. Conclusion
In this work, we advance the study of deep learning models for NLP in Portuguese, especially the usage of pretrained
language models in a transfer learning approach. We train BERT models for Brazilian Portuguese and evaluate their
performances on three downstream NLP tasks.
In the pretraining stage, we use Wikipedia articles to generate a Portuguese vocabulary and then leverage millions
of webpages from the brWaC corpus as unlabeled data to train Portuguese BERT models on self-supervised objectives.
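For illustration, the vocabulary generation step could be reproduced along the following lines with the Hugging Face tokenizers library; the file names and hyperparameters are illustrative and may differ from the exact configuration used for BERTimbau.

```python
# Sketch: training a cased WordPiece vocabulary from plain-text Wikipedia dumps.
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(lowercase=False, strip_accents=False)  # cased setup
tokenizer.train(
    files=["ptwiki-00.txt", "ptwiki-01.txt"],   # hypothetical plain-text Wikipedia files
    vocab_size=30000,                           # illustrative vocabulary size
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model(".", "bertimbau")          # writes bertimbau-vocab.txt
```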
In the evaluation stage, we fine-tune the models in a supervised manner on downstream tasks in two distinct experiments.
In the first experiment, we fine-tune our BERTimbau models on the ASSIN2 dataset to jointly solve the Sentence
Textual Similarity (STS) and Recognizing Textual Entailment (RTE) tasks. BERTimbau achieves state-of-the-art
performance in both tasks, surpassing Multilingual BERT (mBERT) and previously published results in the literature,
which comprise both Portuguese-specific models and multilingual approaches.
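A hedged sketch of such a joint setup is shown below: a single encoder with a regression head for STS and a classification head for RTE on the [CLS] representation. Head sizes, label conventions, and loss weighting are illustrative and may differ from the configuration used in the paper.

```python
# Sketch: one BERTimbau encoder shared by an STS regression head and an RTE classifier.
import torch
from torch import nn
from transformers import AutoModel

class JointSTSRTEModel(nn.Module):
    def __init__(self, model_name="neuralmind/bert-base-portuguese-cased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.sts_head = nn.Linear(hidden, 1)    # similarity score (e.g., in [1, 5])
        self.rte_head = nn.Linear(hidden, 2)    # entailment / non-entailment logits

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]       # [CLS] token representation
        return self.sts_head(cls).squeeze(-1), self.rte_head(cls)
```

During fine-tuning, a mean-squared-error loss on the similarity output and a cross-entropy loss on the entailment logits would be combined, for instance by simple summation.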
In the second experiment, we fine-tune BERTimbau on the named entity recognition (NER) task using the FirstHAREM
and MiniHAREM datasets. We experiment with two NER architectures: plain BERT and BERT-CRF. Again, our best
model achieves state-of-the-art results and shows a large performance improvement over mBERT and the previous
best published result, which uses Portuguese Flair embeddings in a contextual embeddings setup, especially in the hardest
Total scenario that considers all 10 named entity classes.
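For reference, a minimal BERT-CRF sketch using the third-party pytorch-crf package is given below; the projection layer and the handling of subword positions are simplified and may differ from the paper's implementation.

```python
# Sketch: token-level emissions from BERT projected to tag scores and decoded by a CRF.
import torch
from torch import nn
from torchcrf import CRF
from transformers import AutoModel

class BertCRF(nn.Module):
    def __init__(self, num_tags, model_name="neuralmind/bert-base-portuguese-cased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.proj = nn.Linear(self.encoder.config.hidden_size, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        emissions = self.proj(hidden)
        mask = attention_mask.bool()
        if tags is not None:                                  # training: negative log-likelihood
            return -self.crf(emissions, tags, mask=mask, reduction="mean")
        return self.crf.decode(emissions, mask=mask)          # inference: best tag sequences
```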
In the three tasks, BERTimbau Large performs slightly better than BERTimbau Base, as expected of a higher-capacity
model pretrained and fine-tuned on small labeled datasets. Hence, the choice of model size depends primarily on the
trade-off between performance and compute resources, especially during inference.
In additional experiments, we assess the usage of BERTimbau in a contextual embeddings setup by freezing its
weights and training BERTimbau-BiLSTM-CRF models on the NER task. Even though there is a notable performance
drop, we show that contextual embeddings from BERTimbau Base outperform fine-tuned mBERT models, which can
be a lower-compute alternative for limited-resource scenarios. We also validate the necessity of a long pretraining
stage, which has been reported for English and other languages, by evaluating intermediate pretraining checkpoints
of our Portuguese models on the NER task. Models pretrained for longer show better performance on the end task,
even though pretraining had already started from pretrained mBERT and English BERT checkpoints.
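A minimal sketch of the frozen contextual-embeddings setup, with illustrative layer sizes, could look as follows; only the downstream BiLSTM (and a CRF on top of it) would be trained.

```python
# Sketch: BERTimbau as a frozen feature extractor feeding a trainable BiLSTM tagger.
import torch
from torch import nn
from transformers import AutoModel

encoder = AutoModel.from_pretrained("neuralmind/bert-base-portuguese-cased")
for param in encoder.parameters():
    param.requires_grad = False                  # freeze: no weight updates in the encoder

bilstm = nn.LSTM(encoder.config.hidden_size, 256, batch_first=True, bidirectional=True)

def contextual_features(input_ids, attention_mask):
    with torch.no_grad():                        # no gradients flow through the encoder
        hidden = encoder(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state
    features, _ = bilstm(hidden)                 # these features feed a CRF or softmax layer
    return features
```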
Lastly, we compare BERTimbau's Portuguese vocabulary to mBERT's multilingual vocabulary by looking at the
tokenizations produced on the evaluation tasks. The Portuguese vocabulary produces shorter tokenized sentences,
which corresponds to (1) keeping more words intact as a single token and (2) breaking words into a lower average
number of subword units per word. We analyze how tokenizing words into multiple subword units might affect
model performance on the end tasks by binning task examples by tokenization statistics and computing evaluation
metrics separately for each group. For NER in particular, we notice the models are worse at detecting named entities
that contain at least one subword unit than entities composed only of whole words. While this phenomenon could be
related to the presence of rarer words in these examples, we hypothesize that the BERT architecture is not fully capable
of reconstructing word representations from several subword units, since it relies heavily on positional embeddings,
which might not be sufficient. Further experiments and analyses can be performed to better understand these issues,
such as exploring other evaluation tasks, vocabulary generation algorithms, and regularization techniques, or looking
for alternatives to the simple positional embedding.
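For illustration, the tokenization statistics discussed above can be computed along the following lines; the sample sentence is a toy example, whereas in the paper the statistics are computed over the evaluation task datasets.

```python
# Sketch: fraction of intact words and average subword units per word for two vocabularies.
from transformers import AutoTokenizer

sentences = ["O desempenho dos modelos foi avaliado em Campinas."]   # toy sample
for name in ["neuralmind/bert-base-portuguese-cased", "bert-base-multilingual-cased"]:
    tok = AutoTokenizer.from_pretrained(name)
    pieces = [tok.tokenize(w) for s in sentences for w in s.split()]
    intact = sum(len(p) == 1 for p in pieces) / len(pieces)
    per_word = sum(len(p) for p in pieces) / len(pieces)
    print(f"{name}: {intact:.0%} words intact, {per_word:.2f} subwords per word")
```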
Regarding multilingual models, we note that mBERT is one of the first works in this area. More recent models,
such as XLM [8] and XLM-R [7], can be experimented with in future work. These models propose new training
procedures that allow greater knowledge sharing across languages without sacrificing much per-language performance
while avoiding vocabulary dilution. Even though Portuguese is heavily present on the internet and, as such, has
enough unlabeled data to train large monolingual language models, the cross-lingual transfer enabled by multilingual
models can be extremely beneficial for Portuguese: labeled datasets of other languages can be leveraged to alleviate
the annotated-data limitation commonly faced by NLP researchers and developers.
8. Acknowledgements
R Lotufo acknowledges the support of the Brazilian government through the CNPq Fellowship ref. 310828/2018-0.
We would like to thank Google Cloud for research credits.
References
[1] Akbik, A., Blythe, D., Vollgraf, R., 2018. Contextual string embeddings for sequence labeling, in: COLING 2018, 27th International Confer-
ence on Computational Linguistics, pp. 1638–1649.
[2] Baly, F., Hajj, H., et al., 2020. Arabert: Transformer-based model for arabic language understanding, in: Proceedings of the 4th Workshop on
Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection, pp. 9–15.
[3] Beltagy, I., Lo, K., Cohan, A., 2019. SciBERT: A pretrained language model for scientific text, in: Proceedings of the 2019 Conference on
Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-
IJCNLP), Association for Computational Linguistics, Hong Kong, China. pp. 3615–3620. URL: https://aclanthology.org/D19-1371,
doi:10.18653/v1/D19-1371.
[4] Canete, J., Chaperon, G., Fuentes, R., Ho, J.H., Kang, H., Pérez, J., 2020. Spanish pre-trained bert model and evaluation data. Pml4dc at
ICLR 2020, 2020.
[5] de Castro, P.V.Q., da Silva, N.F.F., da Silva Soares, A., 2019. Contextual representations and semi-supervised named entity recognition for
portuguese language., in: IberLEF@ SEPLN, pp. 411–420.
[6] Castro, P.V.Q.d., Silva, N.F.F.d., Soares, A.d.S., 2018. Portuguese named entity recognition using lstm-crf, in: Villavicencio, A., Moreira,
V., Abad, A., Caseli, H., Gamallo, P., Ramisch, C., Gonçalo Oliveira, H., Paetzold, G.H. (Eds.), Computational Processing of the Portuguese
Language, Springer International Publishing, Cham. pp. 83–92.
[7] Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., Stoyanov, V., 2020.
Unsupervised cross-lingual representation learning at scale, in: Proceedings of the 58th Annual Meeting of the Association for Computational
Linguistics, Association for Computational Linguistics, Online. pp. 8440–8451. URL: https://aclanthology.org/2020.acl-main.
747, doi:10.18653/v1/2020.acl-main.747.
[8] Conneau, A., Lample, G., 2019. Cross-lingual language model pretraining. Advances in Neural Information Processing Systems 32, 7059–
7069.
[9] Consoli, B., Vieira, R., 2019. Multidomain contextual embeddings for named entity recognition, in: Proceedings of the Iberian Languages
Evaluation Forum, pp. 434–441.
[10] Delobelle, P., Winters, T., Berendt, B., 2020. RobBERT: a Dutch RoBERTa-based Language Model, in: Findings of the Associa-
tion for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online. pp. 3255–3265. URL: https:
//aclanthology.org/2020.findings-emnlp.292, doi:10.18653/v1/2020.findings- emnlp.292.
[11] Devlin, J., Chang, M.W., Lee, K., Toutanova, K., 2019. BERT: Pre-training of deep bidirectional transformers for language understanding,
in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota. pp. 4171–4186. URL:
https://aclanthology.org/N19-1423, doi:10.18653/v1/N19- 1423.
[12] Finardi, P., Viegas, J.D., Ferreira, G.T., Mansano, A.F., Carid’a, V.F., 2021. Bertaú: Itaú bert for digital customer service. ArXiv
abs/2101.12015.
[13] Fonseca, E., Alvarenga, J.P.R., 2020. Wide and deep transformers applied to semantic relatedness and textual entailment, in: [29]. pp. 68–76.
URL: http://ceur-ws.org/Vol-2583/.
[14] Gururangan, S., Marasović, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., Smith, N.A., 2020. Don’t stop pretraining: Adapt language
models to domains and tasks, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association
for Computational Linguistics, Online. pp. 8342–8360. URL: https://aclanthology.org/2020.acl-main.740, doi:10.18653/v1/
2020.acl-main.740.
[15] Howard, J., Ruder, S., 2018. Universal language model fine-tuning for text classification, in: Proceedings of the 56th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers), pp. 328–339.
[16] Kalyan, K.S., Rajasekharan, A., Sangeetha, S., 2021. Ammus : A survey of transformer-based pretrained models in natural language process-
ing. ArXiv abs/2108.05542.
[17] Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., Amodei, D., 2020. Scaling laws
for neural language models. arXiv preprint arXiv:2001.08361 .
[18] Kudo, T., 2018. Subword regularization: Improving neural network translation models with multiple subword candidates, in: Proceedings
of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational
Linguistics, Melbourne, Australia. pp. 66–75. URL: https://aclanthology.org/P18- 1007, doi:10.18653/v1/P18-1007.
[19] Kudo, T., Richardson, J., 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text process-
ing, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for
Computational Linguistics, Brussels, Belgium. pp. 66–71. URL: https://aclanthology.org/D18-2012, doi:10.18653/v1/D18-2012.
[20] Kuratov, Y., Arkhipov, M., 2019. Adaptation of deep bidirectional multilingual transformers for russian language. arXiv preprint
arXiv:1905.07213 .
[21] Lafferty, J.D., McCallum, A., Pereira, F.C.N., 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence
data, in: Proceedings of the Eighteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco,
CA, USA. p. 282–289.
[22] Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C., 2016. Neural architectures for named entity recognition, in: Pro-
ceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Tech-
nologies, Association for Computational Linguistics, San Diego, California. pp. 260–270. URL: https://aclanthology.org/N16- 1030,
doi:10.18653/v1/N16-1030.
[23] Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., Soricut, R., 2020. ALBERT: A lite BERT for self-supervised learning of lan-
guage representations, in: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020,
OpenReview.net. URL: https://openreview.net/forum?id=H1eA7AEtvS.
[24] Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P., 2017. Focal loss for dense object detection, in: Proceedings of the IEEE international
conference on computer vision, pp. 2980–2988.
[25] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V., 2019. Roberta: A robustly
optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 .
[26] Martin, L., Muller, B., Ortiz Suárez, P.J., Dupont, Y., Romary, L., de la Clergerie, É., Seddah, D., Sagot, B., 2020. CamemBERT: a tasty French
language model, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational
Linguistics, Online. pp. 7203–7219. URL: https://aclanthology.org/2020.acl-main.645, doi:10.18653/v1/2020.acl- main.
645.
[27] Mikolov, T., Chen, K., Corrado, G., Dean, J., 2013. Efficient estimation of word representations in vector space, in: Bengio, Y., LeCun, Y.
(Eds.), 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track
Proceedings. URL: http://arxiv.org/abs/1301.3781.
[28] Nguyen, D.Q., Tuan Nguyen, A., 2020. PhoBERT: Pre-trained language models for Vietnamese, in: Findings of the Association for Com-
putational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online. pp. 1037–1042. URL: https://aclanthology.
org/2020.findings-emnlp.92, doi:10.18653/v1/2020.findings- emnlp.92.
[29] Oliveira, H.G., Real, L., Fonseca, E. (Eds.), 2020. Proceedings of the ASSIN 2 Shared Task: Evaluating Semantic Textual Similarity and
Textual Entailment in Portuguese, Extended Semantic Web Conference. number 2583 in CEUR Workshop Proceedings. URL: http://
ceur-ws.org/Vol- 2583/.
[30] Pan, S.J., Yang, Q., 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22, 1345–1359. doi:10.
1109/TKDE.2009.191.
[31] Pennington, J., Socher, R., Manning, C., 2014. Glove: Global vectors for word representation, in: Proceedings of the 2014 Conference on
Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar. pp. 1532–1543. URL:
https://www.aclweb.org/anthology/D14-1162, doi:10.3115/v1/D14- 1162.
[32] Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L., 2018. Deep contextualized word representations, in:
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long Papers), pp. 2227–2237.
[33] Peters, M.E., Ruder, S., Smith, N.A., 2019. To tune or not to tune? adapting pretrained representations to diverse tasks, in: Proceedings of
the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pp. 7–14.
[34] Polignano, M., Basile, P., De Gemmis, M., Semeraro, G., Basile, V., 2019. Alberto: Italian bert language understanding model for nlp
challenging tasks based on tweets, in: 6th Italian Conference on Computational Linguistics, CLiC-it 2019, CEUR. pp. 1–6.
[35] Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., 2018. Improving language understanding with unsupervised learning. Technical
Report. Technical report, OpenAI.
[36] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., 2019. Language models are unsupervised multitask learners. OpenAI
Blog 1, 9.
[37] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J., 2020. Exploring the limits of transfer learning
with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 140:1–140:67. URL: http://jmlr.org/papers/v21/20- 074.html.
[38] Real, L., Fonseca, E., Gonçalo Oliveira, H., 2020. The assin 2 shared task: A quick overview, in: Computational Processing of the Por-
tuguese Language: 14th International Conference, PROPOR 2020, Evora, Portugal, March 2–4, 2020, Proceedings, Springer-Verlag, Berlin,
Heidelberg. p. 406–412. URL: https://doi.org/10.1007/978- 3-030- 41505-1_39, doi:10.1007/978- 3-030- 41505-1_39.
[39] Rodrigues, R., Couto, P., Rodrigues, I., 2020a. Ipr: The semantic textual similarity and recognizing textual entailment systems, in: [29]. pp.
39–47. URL: http://ceur-ws.org/Vol-2583/.
[40] Rodrigues, R., da Silva, J., Castro, P., Felix, N., Soares, A., 2020b. Multilingual transformer ensembles for portuguese natural language tasks,
in: [29]. pp. 27–38. URL: http://ceur-ws.org/Vol-2583/.
[41] Rodrigues, R.C., Rodrigues, J., de Castro, P.V.Q., da Silva, N.F.F., Soares, A., 2020c. Portuguese language models and word embeddings:
Evaluating on semantic similarity tasks, in: Quaresma, P., Vieira, R., Aluísio, S., Moniz, H., Batista, F., Gonçalves, T. (Eds.), Computational
Processing of the Portuguese Language, Springer International Publishing, Cham. pp. 239–248.
[42] dos Santos, C., Guimarães, V., 2015. Boosting named entity recognition with neural character embeddings, in: Proceedings of the Fifth Named
Entity Workshop, Association for Computational Linguistics, Beijing, China. pp. 25–33. URL: https://aclanthology.org/W15-3904,
doi:10.18653/v1/W15-3904.
[43] Santos, D., Seco, N., Cardoso, N., Vilela, R., 2006. HAREM: An advanced NER evaluation contest for Portuguese, in: Proceedings of the Fifth
International Conference on Language Resources and Evaluation (LREC’06), European Language Resources Association (ELRA), Genoa,
Italy. pp. 1986–1991. URL: http://www.lrec-conf.org/proceedings/lrec2006/pdf/59_pdf.pdf.
[44] Santos, J., Consoli, B., dos Santos, C., Terra, J., Collonini, S., Vieira, R., 2019. Assessing the impact of contextual embeddings for portuguese
named entity recognition, in: 8th Brazilian Conference on Intelligent Systems, BRACIS, Bahia, Brazil, October 15-18, pp. 437–442.
[45] Schneider, E.T.R., de Souza, J.V.A., Knafou, J., Oliveira, L.E.S.e., Copara, J., Gumiel, Y.B., Oliveira, L.F.A.d., Paraiso, E.C., Teodoro,
D., Barra, C.M.C.M., 2020. BioBERTpt - a Portuguese neural language model for clinical named entity recognition, in: Proceedings of
the 3rd Clinical Natural Language Processing Workshop, Association for Computational Linguistics, Online. pp. 65–72. URL: https:
//www.aclweb.org/anthology/2020.clinicalnlp-1.7.
[46] Schuster, M., Nakajima, K., 2012. Japanese and Korean voice search, in: 2012 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), IEEE. pp. 5149–5152.
[47] Sennrich, R., Haddow, B., Birch, A., 2016. Neural machine translation of rare words with subword units, in: Proceedings of the 54th Annual
Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Berlin,
Germany. pp. 1715–1725. URL: https://www.aclweb.org/anthology/P16-1162, doi:10.18653/v1/P16-1162.
[48] Souza, F., Nogueira, R., Lotufo, R., 2020. Bertimbau: Pretrained bert models for brazilian portuguese, in: Brazilian Conference on Intelligent
Systems, Springer. pp. 403–417.
[49] Speer, R., 2019. ftfy. Zenodo. URL: https://doi.org/10.5281/zenodo.2591652, doi:10.5281/zenodo.2591652. version 5.5.
[50] Sun, C., Qiu, X., Xu, Y., Huang, X., 2019. How to fine-tune bert for text classification?, in: China National Conference on Chinese Compu-
tational Linguistics, Springer. pp. 194–206.
[51] Tjong, E.F., Sang, K., Veenstra, J., 1999. Representing text chunks, in: Ninth Conference of the European Chapter of the Association for
Computational Linguistics, Association for Computational Linguistics, Bergen, Norway. pp. 173–179. URL: https://www.aclweb.org/
anthology/E99-1023.
[52] Tjong Kim Sang, E.F., De Meulder, F., 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition,
in: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 142–147. URL: https://www.aclweb.
org/anthology/W03-0419.
[53] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I., 2017. Attention is all you need, in:
Advances in neural information processing systems, pp. 5998–6008.
[54] Vries, W.d., Cranenburgh, A.v., Bisazza, A., Caselli, T., Noord, G.v., Nissim, M., 2019. BERTje: A Dutch BERT Model. arXiv preprint
arXiv:1912.09582 URL: http://arxiv.org/abs/1912.09582.
[55] Wagner Filho, J.A., Wilkens, R., Idiart, M., Villavicencio, A., 2018. The brWaC corpus: A new open resource for Brazilian Portuguese, in:
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources
Association (ELRA), Miyazaki, Japan. pp. 4339–4344. URL: https://aclanthology.org/L18- 1686.
[56] Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al., 2016. Google’s
neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 URL: http:
//arxiv.org/abs/1609.08144. version 2.
[57] Yadav, V., Bethard, S., 2018. A survey on recent advances in named entity recognition from deep learning models, in: Proceedings of the
27th International Conference on Computational Linguistics, pp. 2145–2158.
[58] Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.R., Le, Q.V., 2019. Xlnet: Generalized autoregressive pretraining for
language understanding, in: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (Eds.), Advances
in Neural Information Processing Systems, Curran Associates, Inc. URL: https://proceedings.neurips.cc/paper/2019/file/
dc6a7e655d7e5840e66733e9ee67cc69-Paper.pdf.
[59] Yosinski, J., Clune, J., Bengio, Y., Lipson, H., 2014. How transferable are features in deep neural networks?, in: Advances in neural information
processing systems, pp. 3320–3328.
Fábio Souza is a research developer at NeuralMind (Brazil), where he builds solutions to extract and structure information from legal documents
using AI. He received a B.Sc. in Electrical Engineering from UNICAMP (Brazil) in 2017 and an M.Sc. degree from UNICAMP, where he worked
on Deep Learning applications for Natural Language Processing for the Portuguese language under the supervision of prof. Roberto Alencar Lotufo
and the co-supervision of Dr. Rodrigo Nogueira.
Rodrigo Nogueira is a post-doctoral researcher at the University of Waterloo (Canada), an adjunct professor at UNICAMP (Brazil), and a senior
research scientist at NeuralMind (Brazil). He holds a Ph.D. from New York University (NYU), where he worked on the intersection of Deep
Learning, Natural Language Processing, and Information Retrieval under the supervision of prof. Kyunghyun Cho. He has an M.Sc. degree from
UNICAMP, where, together with prof. Roberto Alencar Lotufo, he developed an award-winning algorithm for detecting fake fingerprints.
Roberto A. Lotufo is the CTO and co-founder of NeuralMind. He received the B.S. degree from the Instituto Tecnologico de Aeronautica, Brazil, in 1978,
and the Ph.D. degree in Electrical Engineering from the University of Bristol, U.K., in 1990. He is a full collaborating professor at the School of
Electrical and Computer Engineering, University of Campinas (Unicamp), Brazil, where he has worked since 1981. His principal research interests
are in the areas of Deep Learning for Image and Document Processing and Analysis. He was awarded the Zeferino Vaz Academic Recognition in
2011 from the University of Campinas. He was a member of the winning team of the 2015 Fingerprint Liveness Detection Competition and of the 2020 São
Paulo State public call "Use of AI algorithms for the diagnosis of COVID-19 using CT and chest x-ray images".