FinancialBERT - A Pretrained Language Model for Financial Text Mining
Ahmed Rachid Hazourli
Textual data in the financial domain is becoming increasingly important as the number of financial documents grows rapidly. With the
progress in natural language processing (NLP), extracting valuable information from text has gained popularity among researchers, and deep learning
has boosted the development of effective financial text mining models and made significant breakthroughs in various Natural Language
Processing tasks. State-of-the-art models such as BERT (Devlin et al., 2019), developed by Google and pre-trained on large-scale unlabeled
texts from Wikipedia, have shown their effectiveness by achieving good results on general domain data. However, these models are not
effective enough on finance-specific language and semantics, limiting the accuracy that financial data scientists can expect from their
NLP models. In this paper, we introduce FinancialBERT, a domain-specific language representation model pre-trained on large-scale
financial corpora that can enhance NLP research in the financial sector. With almost the same architecture across tasks, FinancialBERT
largely outperforms BERT and other state-of-the-art models on the Sentiment Analysis task when pre-trained on financial corpora.
Our pre-trained model FinancialBERT is freely available at: https://huggingface.co/ahmedrachid/FinancialBERT.
Keywords: Natural Language Processing, BERT, Language Model, Pretrained Model, Sentiment Analysis, Financial Language
In recent years, Deep Neural Networks have revolutionized
the development of intelligent systems in many fields,
especially in Natural Language Processing, where
state-of-the-art neural network architectures have significantly
improved many NLP tasks. These results are achieved
thanks to unsupervised pre-training of language models on
large text collections based on deep learning techniques
such as Long Short-Term Memory (LSTM) and Transformers.
As the amount of textual content generated in the financial
domain is growing at an exponential rate, natural
language processing is becoming a strategic tool for
financial analysis. Such textual data is a valuable source
of knowledge; however, applying state-of-the-art models
to financial text mining has limitations. Firstly, word
embeddings or representations such as ELMo (Peters et
al., 2018), Word2Vec (Mikolov et al., 2013) and BERT
(Devlin et al., 2019) are trained on general domain texts,
so it is hard to estimate their performance on financial
datasets. Secondly, word distributions differ between the
general and financial domains.
BERT achieves great results on various NLP tasks.
Adapting it to the financial domain could potentially
achieve high performance by building a model capable of
understanding financial language and producing more accurate
word embeddings, which can ultimately improve the performance
of downstream tasks such as text classification,
topic modelling, automatic summarization and sentiment
analysis.
2. Related Work
Unlike traditional word embeddings, where a word is represented
by a single vector, language models
such as BERT (Devlin et al., 2019) and ELMo (Peters et al.,
2018) return contextualized embeddings for each word token,
which can be fed into downstream tasks. These models
are trained on general domain corpora and are easy to
fine-tune for downstream tasks.
The goal of this work is to test the hypothesized advantages
of fine-tuning pre-trained language models. Thus,
we pre-train FinancialBERT, a finance domain-specific
BERT model, on large financial communication corpora
including financial news, corporate reports and earnings
call transcripts.
The main contributions of this paper are the following:
1. Introduce and release FinancialBERT, a new finance
domain-specific BERT-base model. We achieve state-of-the-art
results on the Financial PhraseBank dataset.
2. Perform extensive experimentation to investigate the
performance of ﬁne-tuning versus task-speciﬁc archi-
tectures atop frozen embeddings, and the effect of hav-
ing an in-domain vocabulary. Then, evaluate on a ﬁ-
nancial corpus for sentiment analysis to show the ef-
fectiveness of our approach.
3. Most importantly, we make publicly available both the
pre-trained FinancialBERT and our fine-tuned Sentiment
Analysis model. We expect these resources
to boost NLP research and applications for ﬁnance,
since ﬁne-tuning pre-trained Transformer-based lan-
guage models for particular downstream tasks is the
In this section, we will present our FinancialBERT imple-
mentation that has the same structure as BERT, after giving
a brief background on relevant neural architectures. Then,
we describe in detail the pre-training and ﬁne-tuning pro-
cess of FinancialBERT.
With the advent of deep learning and its application in NLP,
researchers began applying Recurrent Neural Networks
(RNNs) and Convolutional Neural Networks (CNNs) for
text classiﬁcation. The current state-of-the-art in text classi-
ﬁcation typically involves a purely attentional architecture,
the Transformer architecture (Vaswani et al., 2017).
The Bidirectional Encoder Representations from Transformers
(BERT) model architecture (Devlin et al., 2019) is based
on a multilayer bidirectional Transformer. It is pre-trained
on large textual corpora in an unsupervised way. The attention
mechanism (Vaswani et al., 2017) of the Transformer
allows it to obtain contextual word embeddings. BERT (Devlin
et al., 2019) was trained on two parallel tasks:
1. Masked Language Modeling (MLM): instead of pre-
dicting the next word given previous ones, BERT (De-
vlin et al., 2019) masks a randomly selected 15% of
all tokens and learns to predict them, and hence can be
used for learning bidirectional representations. Thus,
it learns to produce token-level embeddings.
2. Next Sentence Prediction (NSP): given a pair of sentences,
the model predicts whether or not the two actually follow
each other in the original text. It learns this from the
embedding of the special token [CLS] (class), and thereby
produces sentence-level embeddings.
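To make the MLM objective concrete, the following is a minimal sketch of the masking step in plain Python. The 15% selection rate comes from the description above, and the 80/10/10 replacement rule (mask token, random token, unchanged token) follows Devlin et al. (2019). The function name `mask_tokens`, the toy vocabulary, the fixed seed and the choice to pick an exact number of positions (rather than sampling each position independently) are assumptions made for illustration only.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted tokens, labels); labels are None where nothing is predicted."""
    rng = random.Random(seed)
    # Select roughly 15% of the positions (at least one) as prediction targets.
    n_targets = max(1, round(len(tokens) * mask_prob))
    targets = set(rng.sample(range(len(tokens)), n_targets))
    corrupted, labels = [], []
    for i, tok in enumerate(tokens):
        if i not in targets:
            corrupted.append(tok)
            labels.append(None)
            continue
        labels.append(tok)  # the model must recover the original token
        r = rng.random()
        if r < 0.8:
            corrupted.append(MASK)               # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(tok)                # 10%: keep the token unchanged
    return corrupted, labels

tokens = "the company reported a sharp rise in quarterly operating profit".split()
vocab = sorted(set(tokens))  # toy vocabulary built from the sentence itself
corrupted, labels = mask_tokens(tokens, vocab)
```

Because the targets can come from both the left and the right context of any position, the loss on `labels` pushes the encoder towards bidirectional representations.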
3.3. Model Architecture
The original English BERT was pre-trained on two generic
corpora, English Wikipedia and BooksCorpus, with a total
of 3.3B words. BERT (Devlin et al., 2019) has two versions:
1. BERT-BASE: with 12 layers of stacked Transformers,
each of 768 hidden units, 12 attention heads,
110M parameters (L=12, H=768, A=12, Total Parameters=110M).
2. BERT-LARGE: with 24 layers, each of 1024 hidden
units, 16 attention heads, 340M parameters (L=24,
H=1024, A=16, Total Parameters=340M).
Both architectures were trained either on “cased” texts, which keep
character casing, or on “uncased” texts, which are converted to lowercase.
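As a rough check on the parameter totals quoted above, the count can be reproduced from L and H alone. The sketch below assumes the layout of the released uncased English BERT: a 30,522-entry vocabulary, 512 positions, 2 segment types, a feed-forward size of 4H, and a pooler layer. These values and the function name `bert_param_count` are assumptions of this illustration; the number of attention heads A does not change the count.

```python
def bert_param_count(num_layers, hidden, vocab_size=30522, max_positions=512,
                     type_vocab=2, ffn_mult=4):
    """Count the weights of a BERT encoder with pooler (no task-specific head)."""
    layer_norm = 2 * hidden  # scale and bias
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Self-attention: query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    ffn_size = ffn_mult * hidden
    feed_forward = (hidden * ffn_size + ffn_size) + (ffn_size * hidden + hidden) + layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + num_layers * (attention + feed_forward) + pooler

base = bert_param_count(num_layers=12, hidden=768)    # BERT-BASE
large = bert_param_count(num_layers=24, hidden=1024)  # BERT-LARGE
print(f"{base / 1e6:.1f}M, {large / 1e6:.1f}M")       # 109.5M, 335.1M
```

Under these assumptions the totals are slightly below the rounded 110M and 340M figures, which is consistent with those figures being approximate.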
4. Pre-training FinancialBERT
In this section, we ﬁrst describe our ﬁnancial corpora, the
details of the BERT training procedure, and ﬁnally the spe-
ciﬁc task we examine.
4.1. Financial Corpora
As a general purpose language representation model, BERT
was pre-trained on English Wikipedia and BooksCorpus.
However, financial domain texts contain a considerable
number of domain-specific terms. In this
work, we pre-train FinancialBERT on a large corpus of
representative financial texts:
1. TRC2-financial: Thomson Reuters Text Research
Collection (TRC2) corpus comprises 1,800,370 news
stories that were published by Reuters covering the pe-
riod between 2008 and 2010.
2. Bloomberg News: 400,000 financial articles published
by Bloomberg between 2006 and 2013.
3. Corporate Reports: a rich source of information,
as they often disclose important new statements and
provide a comprehensive overview of the company’s
business and financial condition. These documents
are available in the EDGAR database, as the Securities
and Exchange Commission (SEC) requires all publicly
traded companies to file annual reports (10-K) and
quarterly reports (10-Q). We retrieved 154,354 annual
10-K reports from 1996 to 2015 and
37,646 quarterly 10-Q reports. Then, we filtered on
sections and decided to use only “Risk Factors” (Sec-
tion 1A) and “Management Discussion and Analysis
of Financial Conditions and Results of Operations”
4. Earnings Call Transcripts: we obtained 42,156 earn-
ings call transcripts. They are teleconferences, or we-
bcasts between the management of a public company,
analysts, investors, and the media to discuss the com-
pany’s ﬁnancial results during a given reporting pe-
riod, such as a quarter or a ﬁscal year. An earnings call
is usually preceded by an earnings report, which con-
tains summary information on ﬁnancial performance
for the period.
Corpus                      Number of words   Domain
English Wikipedia           2.5B              General
BooksCorpus                 0.8B              General
TRC2-financial              0.29B             Financial
Bloomberg News              0.2B              Financial
Corporate Reports           2.2B              Financial
Earnings Call Transcripts   0.7B              Financial
Table 1: Size of text corpora.
The text corpora used for pre-training FinancialBERT
have a total size of 3.39 billion tokens and are listed above in
Table 1. The pre-training corpora of each model are described in
Table 2.
For better performance, we initialized FinancialBERT with
the pre-trained BERT model (footnote 4) provided by Devlin et al.
(2019), which was trained on the Wikipedia + BooksCorpus
corpora with a total of 3.3 billion tokens.
4 The pre-trained weights are made public by the creators of BERT.
The code and weights can be found here: https://github.
BERT Wikipedia + BooksCorpus
FinancialBERT TRC2 + Bloomberg News + Corporate Reports + Earnings Call Transcripts
Table 2: Description of pre-training text corpora.
BERT uses WordPiece (Wu et al., 2016) with a 30,000 to-
ken vocabulary for unsupervised tokenization of the input
text. With WordPiece tokenization, any new words can be
represented by frequent subwords.
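The way a new word is decomposed into frequent subwords can be sketched with the greedy longest-match-first procedure used at tokenization time. The toy vocabulary and the function name `wordpiece_tokenize` below are assumptions for illustration; the real vocabulary has about 30,000 entries and is learned from the corpus beforehand, a step this sketch does not cover.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # shorten the candidate and try again
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"over", "##draft", "##s", "re", "##finan", "##cing", "loan"}
print(wordpiece_tokenize("overdrafts", toy_vocab))   # ['over', '##draft', '##s']
print(wordpiece_tokenize("refinancing", toy_vocab))  # ['re', '##finan', '##cing']
```

A financial term missing from the vocabulary is therefore still representable, at the cost of being split into several pieces.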
We found that using the uncased vocabulary results in slightly
better performance on downstream tasks.
4.3. Implementation Details
In our work, we use the Transformers library from Hugging
Face in Python. For pre-training we mainly used the
parameters recommended for BERT: the default
BERT optimizer (Adam with weight decay), the
recommended learning rate of 5e-5, a batch size of 32, a
dropout rate of 0.1 and a maximum sequence length of 512.
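For readers unfamiliar with Adam with weight decay, a single scalar update can be sketched as follows. Only the learning rate of 5e-5 is taken from the text; the values of `beta1`, `beta2`, `eps` and `weight_decay` are assumed here, and the sketch omits bias correction, which some implementations apply. It illustrates the update rule and is not the exact optimizer code used in our runs.

```python
import math

def adam_weight_decay_step(param, grad, m, v, lr=5e-5, beta1=0.9, beta2=0.999,
                           eps=1e-6, weight_decay=0.01):
    """One scalar update of Adam with decoupled weight decay; returns (param, m, v)."""
    m = beta1 * m + (1.0 - beta1) * grad         # running mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad  # running mean of squared gradients
    update = m / (math.sqrt(v) + eps)
    update += weight_decay * param               # decay enters the update, not the loss
    return param - lr * update, m, v

# Hypothetical starting values for a single weight.
param, m, v = adam_weight_decay_step(param=1.0, grad=0.5, m=0.0, v=0.0)
```

Adding the decay term to the update, rather than to the loss, keeps it from being rescaled by the running gradient statistics.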
Data preprocessing and training BERT on financial
corpora took significant computational resources. Our
entire pre-training procedure took 23 days of computational
runtime using a single Nvidia GeForce RTX 2060 6GB
GPU. We believe that releasing our pre-trained model
FinancialBERT will be useful to financial researchers,
who can use it on downstream tasks without needing
significant computational resources.
5. Experimental Evaluation
In this section, we describe experiments on the Sentiment
Analysis task to evaluate the effectiveness of our pre-trained
language model.
5.1. Sentiment Analysis
Sentiment analysis and opinion mining is the ﬁeld of study
that analyzes people’s opinions, sentiments, evaluations,
attitudes, and emotions from written language. It is one
of the most active research areas in natural language
processing and is also studied in the ﬁnancial domain.
Financial sentiment analysis differs from the general one
because its aim is to guess how the market will react to news and
other textual data.
Sentiment analysis can be performed with one of two
approaches: unsupervised or supervised. Sentiment can be
positive, negative or neutral, and NLP algorithms can be used
to evaluate which of these a series of words reflects.
The unsupervised approach is rule-based: it counts the number of
positive and negative words based on a dictionary such as
Loughran and McDonald (2011). The supervised approach
trains a classification model using traditional
machine learning or deep learning methods.
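The rule-based approach can be sketched in a few lines. The word lists below are toy examples chosen for illustration and are NOT the Loughran and McDonald (2011) dictionary; the function name `lexicon_sentiment` and the tie-breaking rule (equal counts give neutral) are likewise assumptions.

```python
import re

# Toy word lists for illustration only; they are NOT the Loughran and McDonald lists.
POSITIVE = {"gain", "growth", "improved", "profit", "strong"}
NEGATIVE = {"decline", "loss", "impairment", "weak", "lawsuit"}

def lexicon_sentiment(text):
    """Label a text by comparing counts of positive and negative dictionary words."""
    words = re.findall(r"[a-z]+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(lexicon_sentiment("Operating profit improved on strong demand."))   # positive
print(lexicon_sentiment("The company booked an impairment loss."))        # negative
print(lexicon_sentiment("The annual meeting will be held in Helsinki."))  # neutral
```

Such counting ignores context and negation, which is one motivation for the supervised approach used in the rest of this section.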
The main sentiment analysis dataset used in this paper is
Financial PhraseBank (footnote 5) from Malo et al. (2014).
Financial PhraseBank consists of 4845 English articles that
were categorised by sentiment class and annotated by
16 researchers with a financial background. The sentiment
label is either positive, neutral or negative. The
dataset is available in four possible configurations depending
on the percentage of agreement among annotators, as shown
in Table 3.
Agreement Level   Positive   Negative   Neutral   # of articles
100%              25.2%      13.4%      61.4%     2262
75% - 99%         26.6%      9.8%       63.6%     1191
66% - 74%         36.7%      12.3%      50.9%     765
50% - 65%         31.1%      14.4%      54.5%     627
Total             28.1%      12.4%      59.4%     4845
Table 3: Description of Financial PhraseBank dataset.
We chose to use the whole dataset (at least 50% agreement):
80% of the articles form the training set, 10% the test set and the
remaining 10% the validation set, as shown in Table 4.
Dataset                Metric          Train   Dev   Test
Financial PhraseBank   Accuracy + F1   3876    484   485
Table 4: Sentiment Analysis task evaluation metrics, and
train, dev, test sets sizes.
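The set sizes in Table 4 follow from an 80/10/10 split of the 4845 articles, as the sketch below shows. The shuffling, the seed and the function name `split_indices` are assumptions for illustration; the exact procedure used to draw the split is not described here.

```python
import random

def split_indices(n_examples, train_pct=80, dev_pct=10, seed=42):
    """Shuffle example indices and cut them into train, dev and test parts."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    n_train = n_examples * train_pct // 100  # integer arithmetic avoids float rounding
    n_dev = n_examples * dev_pct // 100
    train = indices[:n_train]
    dev = indices[n_train:n_train + n_dev]
    test = indices[n_train + n_dev:]         # the remainder goes to the test set
    return train, dev, test

train, dev, test = split_indices(4845)
print(len(train), len(dev), len(test))  # 3876 484 485
```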
5.3. Fine-tuning FinancialBERT
Sentiment analysis is a natural language processing classification
task: we train a model that predicts a sentiment
label based on an article as input.
Our approach has two successive steps, the
pre-training phase of FinancialBERT and the
fine-tuning phase. We first conducted unsupervised
pre-training on the large financial corpus and then applied
supervised fine-tuning on downstream NLP tasks.
In our work, we use the same fine-tuning architecture as
Devlin et al. (2019), adding a dense layer after the last
hidden state of the [CLS] token. This is the recommended
practice for using BERT for any classification task. Then,
the classifier network is trained on the labeled sentiment
dataset. We use cross-entropy loss as the loss function.
We used a batch size of 32, a maximum sequence length of
512, a learning rate of 2e-5 and 5 epochs for fine-tuning.
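The classification head described above amounts to a dense layer over the [CLS] vector, a softmax over the three labels, and a cross-entropy loss. The sketch below uses a toy 4-dimensional vector in place of the 768-dimensional hidden state, and all weights and names are hypothetical values chosen for illustration, not trained parameters.

```python
import math

def classify_cls(cls_vector, weights, bias):
    """Dense layer over the [CLS] vector followed by a softmax over the classes."""
    logits = [sum(w * x for w, x in zip(row, cls_vector)) + b
              for row, b in zip(weights, bias)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, gold_index):
    """Negative log-likelihood of the gold class."""
    return -math.log(probs[gold_index])

LABELS = ["negative", "neutral", "positive"]
cls_vector = [0.2, -0.1, 0.4, 0.3]  # toy stand-in for the H=768 hidden state
weights = [[0.1, 0.0, -0.2, 0.1],   # one row of weights per class
           [0.0, 0.3, 0.1, -0.1],
           [0.4, -0.2, 0.5, 0.2]]
bias = [0.0, 0.0, 0.0]

probs = classify_cls(cls_vector, weights, bias)
loss = cross_entropy(probs, LABELS.index("positive"))
predicted = LABELS[probs.index(max(probs))]
```

During fine-tuning, the gradient of this loss updates both the dense layer and the encoder weights beneath it.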
Table 5 presents the sentiment analysis results
as a classification report on the test set.
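The quantities in such a classification report can be computed from gold and predicted labels as sketched below. The label lists are toy data for illustration and are not taken from our test set; the function name `sentiment_report` is also an assumption.

```python
def sentiment_report(gold, pred, labels):
    """Per-class precision, recall, F1 and support, plus macro and weighted F1."""
    report = {}
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": tp + fn}
    total = sum(r["support"] for r in report.values())
    macro_f1 = sum(r["f1"] for r in report.values()) / len(labels)  # unweighted mean
    weighted_f1 = sum(r["f1"] * r["support"] for r in report.values()) / total
    return report, macro_f1, weighted_f1

# Toy labels for illustration only; they are not taken from the test set.
gold = ["negative", "neutral", "neutral", "positive", "positive", "neutral"]
pred = ["negative", "neutral", "positive", "positive", "positive", "neutral"]
report, macro_f1, weighted_f1 = sentiment_report(
    gold, pred, ["negative", "neutral", "positive"])
```

The macro average treats the three classes equally, while the weighted average scales each class by its support, which matters here because the neutral class dominates the dataset.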
Our fine-tuned FinancialBERT (footnote 6) clearly outperforms two
common baselines, BERT-base (Devlin et al., 2019) and
FinBERT (Yang et al., 2020), a financial domain-specific
BERT model.
class          precision   recall   f1-score   support
negative       0.96        0.97     0.97       58
neutral        0.98        0.99     0.98       279
positive       0.98        0.97     0.97       148
macro avg      0.97        0.98     0.98       485
weighted avg   0.98        0.98     0.98       485
Table 5: Experimental Results on Financial PhraseBank
5 The dataset can be found here: https://www.
6 Our fine-tuned model is available at:
FinancialBERT achieved better performance than the
state-of-the-art model on Financial PhraseBank, which
demonstrates its effectiveness in sentiment analysis. We
obtained higher Accuracy (0.12 higher) and F1 score
(0.13 higher) than the previous state-of-the-art model FinBERT, as
shown in Table 6.
Model                             Accuracy   F1-score
BERT-base (Devlin et al., 2019)   0.84       0.83
FinBERT (Yang et al., 2020)       0.87       0.85
FinancialBERT (ours)              0.99       0.98
Table 6: Performance of different BERT models on the
financial sentiment analysis task.
As expected, these results highlight the importance of
pre-training on financial corpora, which improves performance
on the downstream financial sentiment
analysis task.
We presented FinancialBERT, a new pre-trained language
model for financial communications, which has been
trained on large corpora and can be fine-tuned for
multiple NLP tasks. Requiring minimal task-specific architectural
modification, our model achieves state-of-the-art
performance on the Sentiment Analysis task, significantly
outperforming the other compared models.
With the release of FinancialBERT, we hope ﬁnancial
practitioners and researchers can beneﬁt from our model
without the necessity of the signiﬁcant computational
resources required to train the model.
Future directions include: further exploration of domain-
speciﬁc pre-training strategies and incorporating more
tasks in ﬁnancial NLP such as Named Entity Recognition
(NER) and Question-Answering tasks.
7. Bibliographical References
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2019). BERT: Pre-training of deep bidirectional
transformers for language understanding.
Malo, P., Sinha, A., Korhonen, P., Wallenius, J., and Takala,
P. (2014). Good debt or bad debt: Detecting semantic
orientations in economic texts. Journal of the Associa-
tion for Information Science and Technology, 65.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013).
Efficient estimation of word representations in vector
space.
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark,
C., Lee, K., and Zettlemoyer, L. (2018). Deep contextu-
alized word representations.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017).
Attention is all you need.
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M.,
Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey,
K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, Ł.,
Gouws, S., Kato, Y., Kudo, T., Kazawa, H.,
Stevens, K., Kurian, G., Patil, N., Wang, W., Young, C.,
Smith, J., Riesa, J., Rudnick, A., Vinyals, O., Corrado,
G., Hughes, M., and Dean, J. (2016). Google’s neural
machine translation system: Bridging the gap between
human and machine translation.
Yang, Y., Uy, M. C. S., and Huang, A. (2020). FinBERT:
A pretrained language model for financial communications.