Does sentiment help in asset pricing? A novel approach
using large language models and market-based labels
Francesco Audrino, Jule Schüttler, and Fabio Sigrist
July 25, 2024
Abstract
We present a novel approach to sentiment analysis in financial markets by using a state-of-
the-art large language model, a market data-driven labeling approach, and a large dataset
consisting of diverse financial text sources including earnings call transcripts, newspapers, and
social media tweets. Based on our approach, we define a predictive high-low sentiment asset
pricing factor which is significant in explaining cross-sectional asset pricing for U.S. stocks.
Further, we find that a long/short equal-weighted portfolio yields an average annualized
return of 35.56% and an annualized Sharpe ratio of 2.21, remaining substantially profitable
even when transaction costs are considered. A comparison with an alternative financial
sentiment analysis tool (FinBERT) underscores the superiority of our data-driven labeling
approach over traditional human-annotated labeling.
Faculty of Mathematics and Statistics, University of St.Gallen, Switzerland. francesco.audrino@unisg.ch
Faculty of Mathematics and Statistics, University of St.Gallen, Switzerland. jule.schuettler@unisg.ch
Seminar for Statistics, ETH Zurich, Switzerland. fabio.sigrist@stat.math.ethz.ch
1 Introduction
Sentiment has been recognized as a pivotal factor in understanding phenomena observed
in financial markets. Consequently, the integration of sentiment into economic or financial
models has been a longstanding practice. In the past, proxy variables for sentiment were
constructed either based on financial variables or based on surveys and questionnaires of in-
vestors. However, such proxies come with certain drawbacks, including limitations in terms
of their frequency and extent of availability, as well as a lack of reflection of the overall
sentiment among investors. In recent years, the digitization of communication media has
established a new and widely accessible data source for sentiment. The extraction of senti-
ment from textual data is performed by tools from the field of Natural Language Processing
(NLP). The rise in data availability and advancements in deep learning have led to signifi-
cant achievements in NLP over recent years. In reflection of this, the NLP tools employed
for sentiment extraction have also evolved. For a long time, sentiment classification relied on
a lexicon-based approach. This approach requires lists of relevant individual words or word
combinations which are associated with different modes of sentiment (e.g., negative or positive
sentiment) or sentiment scores. Sentiment of a text is determined by counting the occurrence
of the words or word combinations predefined in the lexicon. While this approach is straight-
forward and efficient, it comes with obvious shortcomings. Besides the issue that there is
no universal polarity grading system, a lexicon-based approach ignores the semantic relation be-
tween words, the context, and the sentence structure - all of which are crucial factors for
determining the overall sentiment.
With the introduction of deep-learning-based approaches, models’ ability to understand language in its full complexity has improved significantly. The advent of the BERT (Bidirec-
tional Encoder Representations from Transformers) model introduced by Devlin et al. (2018)
marked a groundbreaking milestone in this field. It was followed by numerous variants, such
as the RoBERTa model (Robustly Optimized BERT Pretraining Approach) by Liu et al.
(2019) and the DeBERTa model (Decoding-enhanced Bidirectional Encoder Representations
from Transformers) by He et al. (2021), each building upon the foundational principles of
BERT to enhance and extend its capabilities. Today, these BERT-based models and their
variants are part of the large language models (LLMs) that represent the state-of-the-art in
NLP, demonstrating unprecedented capabilities in understanding and processing language.
Despite these achievements, challenges remain. Specifically, when applied to a downstream
task such as sentiment analysis, even the state-of-the-art LLMs rely on labeled text data.
While unlabeled text data is abundant, labeled text data remains scarce. Labels are typically
generated through human annotation, a process that is not only costly but also subjective
and thus noisy. Furthermore, evidence from the literature indicates that sentiment measures
constructed using various sentiment extraction methods exhibit low correlation with each
other, which can result in distinctly different outcomes in predictive analyses. Consequently,
the overall predictive information of sentiment measures is relatively low (Audrino et al., 2020;
Ballinari and Behrendt, 2021). The latter can be interpreted, among other explanations, as
an indication that existing sentiment analysis methods, relying on human-annotated labels,
do not constitute the optimal approach for extracting all relevant information from text for
predictive analysis.
We address these existing challenges by developing a sentiment classification approach
where labels are derived in a data-driven manner. Precisely, we label all text items linked
to a specific company with the next day’s excess return of that company over the prediction
of the Fama-French five-factor model. Opting for excess returns rather than raw returns,
we acknowledge that sentiment is not the only driver for stock returns and instead take into
consideration the well-known Fama-French five factors. This approach offers two significant
advantages: firstly, our labels are not dependent on human annotations, making the process
less costly and subjective. Secondly, by directly training a model to comprehend the relation-
ship between investor sentiment towards a company and its next day’s return, we streamline
the standard two-step procedure of separately performing sentiment and predictive analysis
into a single step.
The past literature on the application of data-driven labeling is limited. Existing work
mainly applies dictionary-based approaches (Jegadeesh and Wu, 2013; Manela and Moreira,
2017; Ke et al., 2019; Garcia et al., 2020; Barbaglia et al., 2022). The only approaches
utilizing state-of-the-art LLMs are presented in Salbrechter (2021) and Chen et al. (2022).
In Salbrechter (2021), a RoBERTa model is pre-trained from scratch on a dataset comprising
financial news articles from Refinitiv, specifically those associated with companies listed in
the S&P 500. Instead of introducing a new model, Chen et al. (2022) provide a comparison
of the investment performance of various state-of-the-art models, including the Open Pre-
trained Transformers (OPT) by Zhang et al. (2022) and BERT variation models.
Our contribution is twofold: We are the first to use a pre-trained DeBERTa model and to fine-tune it further on our specific dataset. This approach allows us to employ the concept of transfer learning. Transfer learning describes the leveraging of knowledge gained from one
task or domain to enhance the performance of another, consequently reducing the need for
extensive training data and computation, both of which are critical elements in NLP. We
enhance the model’s architecture by incorporating an additional input feature that repre-
sents the source of the text item. This augmentation allows the model to receive information
about the origin of each text input. The resulting modified and fine-tuned model is termed
SMARTyBERT (Sentiment Model with Additional Regressor for Text type BERT). Our sec-
ond contribution lies in the creation of an extensive dataset containing financial text collected
from diverse sources. Specifically, we consider all companies whose stocks belong to the CRSP
universe during the period from 2012 to 2019. For each of these companies, we collect daily
text data from three sources: transcripts from earnings conference calls, headlines from news
articles published by Bloomberg, and social media tweets from StockTwits that were tagged
with the corresponding company ticker symbol. This allows us to capture sentiment from
various investor types as different sources represent different segments of the investors’ com-
munity. While each of these sources has been individually utilized for sentiment extraction,
to the best of our knowledge, there is no existing work that combines all three text sources.
We apply our SMARTyBERT model to examine whether the sentiment signal it generates
serves as a significant factor in explaining cross-sectional asset prices. The analysis begins
by using the sentiment predictions generated by SMARTyBERT to construct high and low
sentiment portfolios, with the sentiment predictions serving as the assignment variable for
portfolio construction. Our findings reveal that low sentiment is consistently followed by
negative next-day excess returns, while no such evidence is found for high sentiment. Sub-
sequently, we compute a high-low sentiment (HLS) factor as the return spread between the
high and low sentiment portfolios. When assessing the relevance of the HLS factor within
multifactor asset pricing, our results indicate that the HLS factor is significant at the 1%
level, even when controlling for the Fama-French five factors plus the momentum factor.
Finally, we analyze the performance of an out-of-sample trading strategy that involves
taking a long position in the high sentiment portfolio and a short position in the low senti-
ment portfolio, both with and without transaction costs. For the long/short equal-weighted
portfolio we obtain an annualized average return of 35.56% and an annualized Sharpe ratio
of 2.21 when transaction costs are excluded. When transaction costs are taken into account,
the annualized average return of the long/short portfolio decreases to 33.21%, with a result-
ing annualized Sharpe ratio of 2.06, indicating that the strategy remains highly profitable.
Figure 1 illustrates the cumulative log returns including transaction costs. It demonstrates
the profitability of the long/short strategy, with this profitability primarily coming from the
short leg. Comparing these results to the performance of the FinBERT model by Huang
et al. (2023) yields interesting insights. The architecture of the FinBERT model corre-
sponds to the BERT model, but it is pre-trained on a financial text corpus and fine-tuning
is done on 10,000 manually annotated (positive, negative, neutral) sentences from analyst
reports. When applying FinBERT, we observe an annualized average return of -2.84% for
the long/short equal-weighted portfolio, along with an annualized Sharpe ratio of -0.17 with-
out considering transaction costs. When accounting for transaction costs, the annualized
average return further declines to -6.51%, and the annualized Sharpe ratio drops to -0.40.
Considering the cumulative log returns of a long/short strategy based on the predictions of
FinBERT, Figure 1 illustrates that while the short leg is slightly profitable, the losses of the
long leg result in an overall loss for the long/short strategy. These results demonstrate that
our data-driven labeling approach is superior to the human-annotated labeling approach.
The remainder of the paper is organized as follows: Section 2 offers a review of the liter-
ature. The data gathering process and the pre-processing steps are discussed in Section 3.
Section 4 provides an explanation of the model architecture and training procedure. Out-
of-sample classification results are presented in Section 5 and the financial application is
presented in Section 6. Section 7 concludes.
Figure 1: Cumulative log returns of the long/short, long and short equal-weighted port-
folios of SMARTyBERT and FinBERT with transaction costs.
2 Literature Review
The early literature examining the influence of sentiment signals on financial markets usually
relies on lexicon-based approaches. Tetlock (2007) is among the first to quantitatively measure the interaction between media and the stock market, based on news articles from the
Wall Street Journal. The Harvard-IV dictionary is applied to quantify pessimism. He finds
that elevated levels of media pessimism are associated with a subsequent decrease in market
prices, which is later followed by a reversion to fundamental values. Bollen et al. (2011)
apply the OpinionFinder lexicon to identify collective mood states on Twitter and analyze
how these different mood states correlate with the value of the Dow Jones Industrial Average
(DJIA) over time. Their findings suggest that the predictive accuracy of DJIA forecasts can
be significantly enhanced through the incorporation of certain public mood states. Loughran
and McDonald (2011) find that the Harvard Dictionary misclassifies words commonly used
in financial texts, leading them to develop alternative word lists, specifically designed for
finance applications. They provide evidence that certain word lists are related to market reactions surrounding the 10-K filing date, trading volume, unexpected earnings, and subsequent volatility in stock returns. Other approaches involve support vector machines (SVMs), as in Manela and Moreira (2017). The authors develop a news-based measure of uncertainty (NVIX) and investigate what type of uncertainty drives aggregate stock market risk premia. The NVIX is constructed from one- and two-word n-grams from Wall Street Journal articles, which they use to predict the implied volatility index. Applying an SVM overcomes the problem that the vector of regression coefficients corresponding to the n-gram frequencies is of higher dimension than the length of their training time series.
The limitation of all these approaches is that they do not account for the semantic rela-
tions between words. This changes with the introduction of word2vec by Mikolov et al. (2013)
which can be used to create word embeddings. It is a neural network model that takes a
large corpus of text as input and computes vector representations for unique words of the cor-
pus. The model is trained such that semantically and syntactically similar words are located
close to each other in the vector space. The combination of word2vec embeddings with
deep-learning-based approaches represents a significant improvement in sentiment analysis.
For instance, Severyn and Moschitti (2015) use word2vec embeddings with deep convolu-
tional neural networks to achieve state-of-the-art results in sentiment analysis of tweets from
Twitter.
The next major development in NLP was initiated with the introduction of the BERT
(Bidirectional Encoder Representations from Transformers) model by Devlin et al. (2018)
which brought forth two significant advancements. Firstly, it adopted the transformer ar-
chitecture from machine translation, renowned for its superior ability to model long-term
dependencies. Secondly, BERT introduced the Masked Language Modeling (MLM) task, a
groundbreaking approach where a random 15% of all tokens are masked, and the model’s task
is to predict them using a bi-directional attention mechanism, as introduced by Vaswani et al.
(2017). The MLM task is particularly advantageous as it does not rely on labeled data during
pre-training. Instead, the model is trained on an extensive corpus of unlabeled data. This
pre-training phase enables the model to learn the representation of semantic information.
The subsequent fine-tuning process, which tailors the model for specific tasks like sentiment
classification, necessitates labeled data. However, the model can leverage its acquired knowl-
edge about semantic representation from the pre-training phase, requiring considerably less
labeled data. This utilization of prior knowledge is referred to as transfer learning.
Recognizing the prevalence of specialized terminology in diverse fields, Araci and Genc
(2019) introduced FinBERT. FinBERT is an LLM based on the BERT model that is
adapted to the financial domain. This is achieved by further pretraining the BERT model on
a financial corpus comprising news articles from Reuters. For fine-tuning, the model utilizes
the financial phrase bank dataset created by Malo et al. (2014). The dataset comprises 4,845
randomly selected sentences extracted from financial news which were manually annotated
by 16 finance experts. In an independent study, Yang et al. (2020) pre-train a distinct
BERT model using a financial corpus comprising Corporate Report 10-K and 10-Q filings,
earnings call transcripts, and analyst reports. Subsequently, the model undergoes fine-tuning
on 10,000 manually annotated sentences extracted from analyst reports.
Despite the numerous technical advancements, it remains an issue that even state-of-the-
art LLMs rely on labeled data in the training process when applied to downstream tasks
such as sentiment classification. Only a limited number of contributions in the literature address this issue by applying a data-driven approach. The majority of these approaches
are based on dictionary methods: Jegadeesh and Wu (2013) conduct content analysis using
an approach based on a word list that categorizes words as positive and negative. As a novel
contribution, they introduce a weighting scheme for each word in the list, where the weights
are determined based on the market reaction to 10-K filings. Ke et al. (2019) introduce
the Sentiment Extraction via Screening and Topic Modeling (SESTM) model, which is a
sentiment scoring model learned from the joint behavior of article text and stock returns.
They apply the model to a trading strategy that involves taking a long position in the 50
stocks with the highest sentiment scores and a short position in the 50 stocks with the lowest
sentiment scores. Results indicate that this strategy outperforms a similar strategy based on
RavenPack sentiment scores. Garcia et al. (2020) create a new dictionary consisting of both
unigrams and bigrams, which is derived from earnings call transcripts using a multinomial
inverse regression model. Labels (positive or negative) are assigned based on stock price
reactions.
The first approach, which utilizes state-of-the-art LLMs, is presented in Salbrechter (2021)
which introduces the FinNewsBERT model. Instead of leveraging the power of transfer
learning with a pre-trained LLM, the authors opt to pre-train a RoBERTa model from
scratch. They argue that using a pre-trained model could introduce a look-ahead bias, as
it is trained on data until 2018. The dataset used consists of financial news articles from
Refinitiv spanning from 1996 to 2020 and is associated with companies listed in the S&P
500. The FinNewsBERT further extends the RoBERTa model by incorporating a feature
indicating the topic of the news articles and integrating a deep neural network classifier.
Labels are derived as the mean value of the three-day excess return over the CAPM divided
by the standard deviation. Based on the distribution of these values, positive and negative
sentiment labels are determined. They implement a trading strategy based on the sentiment
predictions generated by FinNewsBERT and report an average return per trade of 24.06 bps.
Instead of introducing a new model, Chen et al. (2022) provide a comparison of the in-
vestment performance of various state-of-the-art models, including the Open Pre-trained
Transformers (OPT), RoBERTa, BERT, and FinBERT models. Their analysis is based on
global news data from the Thomson Reuters Real-time News Feed (RTRS) and the Third Party
Archive (3PTY) spanning from 1996 to 2019 and labels are derived based on three-day re-
turns. Sentiment prediction is performed in a two-step procedure: first, a pre-trained model
is used without any fine-tuning to transform text data into a vector representation. This ap-
proach is called zero-shot prediction, and in contrast to fine-tuning, the model’s parameters
do not get adjusted to the specific dataset. In a second step the resulting vectors are passed
to the econometric model. Utilizing the sentiment predictions in a trading strategy, they
obtain a Sharpe ratio of 4.51 for the equal-weighted long-short portfolio for the benchmark
OPT model. The issue of a potential look-ahead bias - as the pre-trained models they use are
trained on data that extends beyond the out-of-sample prediction period - is not addressed.
3 Data
We examine companies with stocks in the CRSP universe from 2012 to 2019 focusing on
ordinary shares that are denoted by the share codes 10 or 11. This provides us with 7,835
unique ticker symbols. We gather textual news data associated with each of these ticker
symbols in the considered time period. To capture sentiment from various investor types,
we utilize three different sources for data gathering: earnings call transcripts, headlines
(and if available, subheadlines) from news articles published by Bloomberg, and tweets from
StockTwits.
3.1 Earnings Call Transcripts
Earnings call transcripts are obtained by scraping Seeking Alpha,1 which is a financial news and analysis website. This involves the use of a web scraper API that we purchased from Rapid API.2 Scraping is performed in a two-step procedure: First, we scrape the iden-
tification numbers (IDs) of all associated earnings call transcripts for each ticker symbol.
Subsequently, these IDs are utilized as query parameters to scrape the corresponding earn-
ings call transcripts. As the earnings call transcripts are retrieved in an unstructured format,
additional transformation is necessary. The first line of the transcript is used to identify
the ticker symbol, company name, quarter, and the year when the conference call occurred.
This information is then mapped to the unique identifiers utilized in the CRSP database,
including the Permco, Permno, CUSIP and NCUSIP. In total, we obtain 98,366 earnings call
transcripts, corresponding to 4,986 unique ticker symbols. For 1,204 earnings call transcripts,
the ticker could not be identified. Non-identification of the ticker symbols occurs if the first
line of the transcript varies too strongly from the common structure - rendering the algorithm
unable to apply its logic for identifying the ticker symbol. Moreover, for 5,152 transcripts,
no unique identifiers could be found. This is the case if the earnings call transcript is tagged
with a ticker symbol that does not exist at that point in time, for instance, if there was a
change in ticker symbols but the document is still tagged with the old ticker symbol.
Furthermore, we have to consider that the length of a text sequence that can be passed
1https://seekingalpha.com
2Rapid API is an online platform that serves as a marketplace for APIs, facilitating developers in con-
necting to and managing various APIs. https://rapidapi.com.
to the model is restricted. Earnings call transcripts need to be split into shorter segments
to comply with the model’s maximum sequence length. To shorten the transcript while pre-
serving as much context as possible, the splitting is done as follows: First, we divide the
transcript into a company session part and a question-and-answer (QA) part. The company
session encompasses the segment where the company’s executives present and discuss the
financial results. The text from the company session is broken down into shorter segments,
each corresponding to different speakers. These segments are further recursively split by
identifying the halfway point of the text sequences and then dividing at the end of a sentence
until each sequence no longer exceeds the specified limit of 1,000 characters. In total, we obtain 1,770,212 text sequences originating from company sessions. The QA session comprises the
part where investors and analysts can engage with the company’s executives and manage-
ment. The session is split such that each question and its corresponding answers form one
text sequence. We obtain a total of 2,389,311 sequences.
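To illustrate the recursive splitting rule described above, the following sketch shows one possible implementation; the exact sentence-boundary detection is not specified in the text, so the ". " heuristic, the fallback to a hard split, and the function name are our assumptions.

```python
def split_recursively(text: str, max_chars: int = 1000) -> list[str]:
    """Recursively split a speaker segment at the sentence end closest to its
    midpoint until every piece is at most max_chars long (sketch of the rule above)."""
    if len(text) <= max_chars:
        return [text]
    midpoint = len(text) // 2
    cut = text.find(". ", midpoint)          # first sentence boundary at or after the midpoint
    if cut == -1 or cut >= len(text) - 2:
        cut = midpoint                       # no usable boundary: fall back to a hard split
    else:
        cut += 1                             # keep the period with the left half
    left, right = text[:cut].strip(), text[cut:].strip()
    return split_recursively(left, max_chars) + split_recursively(right, max_chars)

# toy example: a ~2,000-character speaker segment is split into pieces of <= 1,000 characters
segment = "We delivered strong revenue growth this quarter. " * 40
pieces = split_recursively(segment)
assert all(len(p) <= 1000 for p in pieces)
```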
3.2 News Articles
News article headlines combined with subheadlines that give a short description of the body
are obtained from newsfilter.io.3 The website collects and presents news articles from various
sources and offers an API for sale. The only source covering the entire time span from 2012 to
2019 is the news provider Bloomberg. We scrape all headlines and, if available, subheadlines
associated with any of the ticker symbols, combining them into one text sequence to provide
greater context. Subsequently, we map these sequences with the unique identifiers provided
by the CRSP database. We obtain 29,626 headlines. As the length of the headlines, including
subheadlines, remains within the maximum sequence length allowed by the model, there is
no need for further splitting the text sequences.
3.3 Tweets
StockTwits4 is a social media platform designed for traders and investors to share real-
time information about stocks and financial markets. Using a developer API, we obtain
all tweets from StockTwits posted within the specified time period. In the next step, we
3https://newsfilter.io
4https://stocktwits.com
filter for tweets tagged with a ticker symbol from the considered companies and add unique
identifiers. Additionally, we filter for tweets created by users with more than 300 followers
to ensure a wider reach. To maintain a balance with our other data sources, we opt to use
only a random subset of around 9% of all the tweets that we obtain for our analysis. This
approach yields a sample of 2,210,918 tweets. There is no need for further splitting tweets
into shorter sequences as their length remains within the model’s maximum sequence length.
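As a simple illustration of the filtering steps just described (ticker in the considered universe, more than 300 followers, a roughly 9% random subsample), consider the following sketch; the column names, the toy data, and the fixed random seed are illustrative assumptions.

```python
import pandas as pd

# Toy stand-ins for the raw StockTwits dump and the CRSP ticker universe
tweets = pd.DataFrame({
    "ticker":    ["AAPL", "XYZ", "MSFT", "AAPL"],
    "followers": [1200, 50, 450, 800],
    "body":      ["bullish on $AAPL", "random chatter", "$MSFT earnings beat", "$AAPL guidance cut"],
})
crsp_tickers = {"AAPL", "MSFT"}

filtered = tweets[tweets["ticker"].isin(crsp_tickers) & (tweets["followers"] > 300)]
subsample = filtered.sample(frac=0.09, random_state=0)   # keep a random ~9% of the filtered tweets
```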
3.4 Return Data
We retrieve daily returns data from the CRSP database and adjust for delistings following
the approach by Bali et al. (2016). Further, we subtract the corresponding risk-free rate at
each point in time. Data on the daily Fama-French 5 factors is retrieved from Ken French’s
data library 5.
3.5 Sentiment Label Generation
Since sentiment classification is a supervised learning task, labeled data is required. However,
the text data we gather does not come with labels. A common approach to addressing this
issue is to train a sentiment classification model using available labeled datasets, where the
labels are derived through human annotation. Subsequently, the trained model is applied
to perform sentiment classification on the task-specific unlabeled dataset. This approach
has several shortcomings: Firstly, available labeled datasets are rare, and those that exist
are rather small. In the financial domain, frequently used datasets include the Financial
PhraseBank dataset from Malo et al. (2014) and the FiQA Sentiment dataset (Maia et al.,
2018). The Financial Phrasebank comprises 4,845 English sentences extracted from financial
news, annotated by 16 individuals with backgrounds in finance and business. Annotators
were tasked with assigning labels based on their assessments of how the information in each
sentence could impact the company’s stock price. The FiQA dataset was developed for the
WWW ’18 conference’s financial opinion mining and question-answering challenge. It com-
prises 1,174 financial news headlines and tweets, each accompanied by its corresponding sen-
timent score. Beyond the size limitations of these datasets, the process of human annotation
is subjective and, therefore, prone to noise. Another shortcoming is that text samples con-
5https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html
tained in the labeled dataset often deviate from the task-specific unlabeled dataset, thereby
significantly decreasing the classification performance of the model.
To overcome these shortcomings, we instead apply a data-driven approach to derive labels
for our text data. Specifically, following the idea of Ke et al. (2019), Salbrechter (2021),
and Chen et al. (2022), labels are obtained from return data. This approach streamlines
the standard two-step procedure of separately performing sentiment classification and return
prediction into a single step. Intuitively speaking, we aim for a model to learn the relationship
between investor sentiment towards a company extracted from financial text data and the
next-day return of that company. Thus, we use the text data associated with a company as
the feature and next-day return data of that company as the label for our model. Naturally,
we do not assume that sentiment is the sole driver of cross-sectional returns but acknowledge
the presence of other well-known factors - such as the Fama-French five factors. Thus, we do
not directly generate labels with the next-day returns - but instead with the excess return over
the prediction of the Fama-French five-factor model. Specifically, we calculate the next-day
excess returns ($ret_{excess}$) as

$$ret_{excess}(i,t) = ret(i,t) - rf(t) - \beta_{MKT}(i,t)\,MKT(t) - \beta_{SMB}(i,t)\,SMB(t) - \beta_{HML}(i,t)\,HML(t) - \beta_{CMA}(i,t)\,CMA(t) - \beta_{RMW}(i,t)\,RMW(t),$$
where the coefficients $\beta_{MKT}, \beta_{SMB}, \beta_{HML}, \beta_{CMA}, \beta_{RMW}$ denote the estimated loadings of company $i$ on each factor defined in Table 1. The coefficients are estimated using a rolling window estimation with a duration of 360 days and a minimum of 200 observations.
symbol              denotation
ret_excess(i,t)     excess return of company i at time t
ret(i,t)            return of company i at time t
rf(t)               risk-free rate at time t
MKT(t)              excess market return over the risk-free rate at time t
SMB(t)              return spread between small and large market capitalization at time t
HML(t)              return spread between high and low book-to-market ratio at time t
CMA(t)              return spread between conservative and aggressive investing at time t
RMW(t)              return spread between high and low profitability at time t

Table 1: Fama-French five-factor model.
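The following sketch illustrates how these next-day excess returns could be computed; the column names, the use of ordinary least squares with an intercept, and the interpretation of the 360-day window as trailing trading observations are our assumptions rather than the paper's exact implementation.

```python
import numpy as np
import pandas as pd

def rolling_ff5_excess_returns(stock: pd.DataFrame, factors: pd.DataFrame,
                               window: int = 360, min_obs: int = 200) -> pd.Series:
    """For each day t, estimate the stock's FF5 betas on a trailing window of past
    observations and subtract the factor-implied return from the realized excess
    return (a sketch of the labeling target defined above). `stock` holds the
    columns ['ret', 'rf']; `factors` holds ['MKT', 'SMB', 'HML', 'RMW', 'CMA'],
    both indexed by date."""
    cols = ["MKT", "SMB", "HML", "RMW", "CMA"]
    y = (stock["ret"] - stock["rf"]).to_numpy()
    X = factors[cols].to_numpy()
    out = pd.Series(np.nan, index=stock.index)
    for t in range(len(y)):
        y_win, X_win = y[max(0, t - window):t], X[max(0, t - window):t]  # strictly past data
        if len(y_win) < min_obs:
            continue
        design = np.column_stack([np.ones(len(y_win)), X_win])
        beta = np.linalg.lstsq(design, y_win, rcond=None)[0][1:]          # drop the intercept
        out.iloc[t] = y[t] - X[t] @ beta                                  # excess over the FF5 prediction
    return out
```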
Since we consider close-to-close returns, the assignment of text items to next-day excess
returns is done as described in the following. Each text item is accompanied by a timestamp
indicating the date and time of publication. If a text item is published on day t before 4 pm New York time, it is assigned to day t; otherwise, it is assigned to day t + 1. In case the assigned day happens to be a non-business day, the text item is instead assigned to the subsequent business day, as return data is only available for business days. Each text item associated with company i and published on day t is labeled with the excess return of company i on day t + 1. Previous approaches in the literature, such as Ke et al. (2019),
Salbrechter (2021), and Chen et al. (2022), create sentiment labels based on the mean of
three-day returns around the publication date, recognizing that news articles often report
on events from the previous day. In contrast, we choose to use the next-day returns for
the following reasons. Firstly, unlike the other approaches, our dataset comprises not only
news articles but also includes earnings call transcripts and tweets, which constitute the
majority. While news articles may indeed report on events from previous days, earnings
call transcripts capture the event itself. Additionally, StockTwits is a platform that enables
investors to exchange information in real time. Secondly, our goal is to construct a model that
captures the predictive part of sentiment. Consequently, we solely utilize next-day returns
and do not consider information from past or current returns.
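A minimal sketch of this assignment rule (4 pm New York cutoff and roll-forward to the next business day) is shown below; the use of a precomputed business-day index and the assumption that timestamps are already expressed in New York time are ours.

```python
from datetime import datetime, time, timedelta
import pandas as pd

NYSE_CLOSE = time(16, 0)   # 4 pm; timestamps are assumed to already be in New York time

def assigned_return_date(published: datetime, business_days: pd.DatetimeIndex) -> pd.Timestamp:
    """Map a text item's publication timestamp to the trading day t whose next-day
    (t+1) excess return labels it, following the rule described above (a sketch)."""
    day = pd.Timestamp(published.date())
    if published.time() >= NYSE_CLOSE:
        day += timedelta(days=1)
    while day not in business_days:          # roll forward to the next business day
        day += timedelta(days=1)
    return day

bdays = pd.bdate_range("2012-01-01", "2019-12-31")
assigned_return_date(datetime(2015, 7, 3, 17, 30), bdays)   # Friday after the close -> Monday 2015-07-06
```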
Precedent literature has demonstrated the effectiveness of transforming the initial continu-
ous label into a discrete variable. Ke et al. (2019) consider only the sign of the stock’s return
as the response variable in their SESTM estimation procedure, arguing that this approach
allows for the application of marginal screening. Drawing from this methodology, Chen et al.
(2022) employ the sign of the stock returns to assign a binary sentiment label to news ar-
ticles. Salbrechter (2021) compute z-values from the idiosyncratic mean returns. Based on
the density distribution of these z-values, they implement thresholds to classify text items as
positive or negative. We follow these approaches and transform the initial continuous labels
into discrete ones.
In line with the principles of empirical asset pricing, this is done by incorporating a sorting
procedure in the derivation of labels. Specifically, for each day, we consider the cross-sectional
distribution of excess returns across the entire CRSP universe, winsorized at the 1% level. Sub-
sequently, companies with excess returns lower than one standard deviation from the mean
are labeled as low sentiment, those with excess returns higher than one standard deviation
above the mean are labeled with high sentiment, and the remaining companies are labeled
with middle sentiment. We employ the labels low, middle, and high as opposed to the commonly used negative, neutral, and positive to emphasize that we do not assume the necessity of both negative and positive sentiment on each day. Instead, sentiment is assessed in a relative manner between companies. We encode the labels low, middle, and high as 0, 1, and 2,
respectively.
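To make the discretization concrete, the sketch below assigns the encoded labels 0, 1, and 2 for one day's cross-section; whether the one-standard-deviation thresholds are applied to the winsorized or the raw excess returns is not spelled out above, so applying them to the winsorized values is our assumption.

```python
import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize

def daily_sentiment_labels(excess_returns: pd.Series) -> pd.Series:
    """Map one day's cross-section of next-day excess returns to labels
    0 (low), 1 (middle), 2 (high), winsorizing at the 1% level (a sketch)."""
    x = pd.Series(np.asarray(winsorize(excess_returns.to_numpy(), limits=(0.01, 0.01))),
                  index=excess_returns.index)
    mu, sigma = x.mean(), x.std()
    labels = pd.Series(1, index=x.index)     # default: middle sentiment
    labels[x < mu - sigma] = 0               # low sentiment
    labels[x > mu + sigma] = 2               # high sentiment
    return labels
```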
3.6 Descriptive Statistics
We summarize our data by providing descriptive statistics. Table 2 shows the total number
of text sequences and their distribution across the four text categories, represented in absolute
values and percentages. While the distribution of text sequences originating from company
sessions, QA sessions, and tweets appears balanced, it is noteworthy that there are relatively
few headlines. We will consider this discrepancy in our subsequent analysis. Table 3 shows
the label distribution across low, middle, and high sentiment represented in absolute values
and percentages. In line with our discretization approach, we observe that 72.33% of all text
sequences are labeled as middle sentiment, while 14.38% are labeled as low sentiment, and
13.30% are labeled as high sentiment. Figure 2 and Figure 3 illustrate the distribution of
text sequences across various sources and the labels per year. The distributions of both text
sequences and sentiment labels remain relatively stable over time.
text sequences headlines company session QA session tweets
6,400,067 29,626 1,770,212 2,389,311 2,210,918
(0.0046) (0.2766) (0.3733) (0.3455)
Table 2: Total number of text sequences and their distribution across various sources,
represented in absolute values and percentages.
low sentiment middle sentiment high sentiment
920,098 4,629,023 850,946
(0.1438) (0.7233) (0.1330)
Table 3: Total number of labels and their distribution across low, middle, and high
sentiment represented in absolute values and percentages.
Figure 2: Distribution of text sequences across various sources per year.
Figure 3: Distribution of low, middle, and high sentiment labels per year.
4 Model
4.1 The Transformer Architecture
The underlying architecture of most modern LLMs is based on the Transformer model. The
original Transformer introduced by Vaswani et al. (2017) is founded on the encoder-decoder
architecture. This architecture comprises two components: The encoder, where an input
sequence of tokens is converted into a sequence of embedding vectors, the so-called hidden
states, and the decoder, which utilizes the hidden states to iteratively generate an output
sequence. The original Transformer architecture was geared towards sequence-to-sequence
tasks, such as machine translation. However, both the encoder and decoder blocks were
subsequently adapted as standalone models. The encoder-only models convert an input
sequence of text into a comprehensive numerical representation, well-suited for tasks such
as text classification or named entity recognition. BERT and its variants fall within this
category of architectures. The decoder-only models autoregressively continue the sequence of a given
text prompt by iteratively predicting the most probable next word. The family of GPT
models belongs to this category.
Figure 4: The architecture of the transformer’s encoder as used in the BERT model.
Figure source: Tunstall et al. (2022).
Figure 4 illustrates the architecture of a transformer’s encoder. In the first step, the input
text needs to be tokenized. Tokenization denotes the process of breaking down a string into
the atomic units used in the model. There are different approaches to splitting words into
subunits. BERT uses WordPiece, a subword tokenization algorithm. The tokenized text is
represented as one-hot vectors called token encodings, where the size of the tokenizer vo-
cabulary determines the dimension of the token encodings. BERT uses a vocabulary size
of 30,522 unique tokens. In the next step, these token encodings are mapped to lower-
dimensional fixed-size vectors known as token embeddings. For instance, in the BERT model, each token is represented as a 768-dimensional vector. To capture the sequential nature
of text, the token embeddings are augmented with positional embeddings, which contain
positional information for each token. The token embeddings are then passed through the
transformer’s encoder, which consists of multiple encoder layers stacked on top of each other.
The purpose of the encoder stack is to process and update input embeddings to produce
representations that encode contextual information. For instance the word "bank" will be
updated to refer to a financial institution instead of the side of a river if the word "money"
is close to it. The key components for this updating process are the multi-head self-attention
layer and the fully connected feed-forward layer, both of which are present in each encoder
layer. In the multi-head self-attention layer, the self-attention mechanisms take place. The
attention mechanism, conceptualized by Bahdanau et al. (2014), allows neural networks to
assign varying amounts of weight or "attention" to each element in a sequence, specifically to
each token embedding. The fundamental concept of self-attention involves moving away from
fixed embeddings for individual tokens. Instead, it leverages the entire sequence to compute
a weighted average for each embedding. Embeddings generated through this approach are re-
ferred to as contextualized embeddings. Subsequently, the attention-weighted representations
generated by the self-attention layer are passed to the feed-forward layer. The feed-forward
layer introduces non-linearity which is commonly done by applying an activation function.
This enables the model to effectively capture complex patterns and dependencies in the data.
Finally, a hidden state representation for each token is obtained.
With the introduction of the Transformer, a new modeling paradigm emerged - dispensing
with the traditional recurrent structures found in LSTMs and RNNs, which were previously
employed for sequence modeling. While recurrent structures face limitations in terms of
parallelization and memory constraints, the combination of the self-attention mechanism
with the feed-forward neural network in the Transformer architecture has paved the way for
numerous recent breakthroughs in NLP.
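As a concrete illustration of the encoder pipeline described above, the following snippet tokenizes a sentence with a BERT-family checkpoint from the HuggingFace transformers library and returns one contextualized hidden-state vector per token; the checkpoint name is chosen purely for illustration and is not the model used in this paper.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Tokenize a sentence and obtain one contextualized hidden-state vector per token.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The company deposited cash at the bank.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

hidden_states = outputs.last_hidden_state   # shape: (1, number_of_tokens, 768)
print(hidden_states.shape)
```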
4.2 Transfer Learning: Pre-training and Fine-tuning
The introduction of transfer learning in NLP stands as a significant factor contributing to
the unprecedented power of transformers. Transfer learning involves training a model on one
task, known as pre-training, and subsequently adapting or fine-tuning it for a new task. This
enables the model to leverage knowledge acquired from the original task. Architecturally,
this process entails dividing the model into a body and a head. During pre-training, the
body’s weights learn broad features of the source domain, serving as an initialization for a
new model tailored to the new task. The head represents a task-specific neural network. In
comparison to traditional supervised learning, this approach often yields high-quality models
that can be trained more efficiently across various downstream tasks, requiring less labeled
data. Although transfer learning had long been a standard approach in computer vision,
its equivalent pretraining process in NLP was not initially clear. With the introduction of
ULMFiT by Howard and Ruder (2018), a comprehensive framework for adapting pretrained
LSTM models to various tasks was presented.
The BERT model is pre-trained on two tasks: Masked Language Modeling (MLM) and the
Next Sentence Prediction (NSP) task. During the MLM task, 15% of tokens within the input
sequence are randomly masked, and the goal is to predict these masked tokens. The NSP task
involves predicting whether a given pair of sentences in a document are consecutive or not.
As both MLM and NSP are unsupervised learning tasks, the utilization of unlabeled data
for training is feasible, leveraging the abundance of available resources. The BERT model
was pretrained on BookCorpus, a dataset comprising 11,038 unpublished books, and English
Wikipedia.
During the fine-tuning process, the pretrained model undergoes additional training tailored
to a specific downstream task. For example, in sentiment classification, the model’s head
would be a neural network for classification. Fine-tuning entails comprehensive end-to-end
training, adjusting not only the weights of the head but also those of the model’s body to
suit the characteristics of the specific dataset. For this part of the training labeled data
is required. However, thanks to pretraining, a higher level of prediction performance can
be attained with significantly less labeled data compared to models that are trained from
scratch. The process of fine-tuning is illustrated in Figure 5.
4.3 SMARTyBERT
4.3.1 Pretraining
The architecture of SMARTyBERT is based on the DeBERTa model. DeBERTa is a next-
generation version derived from the BERT model and was the first model to surpass human
performance on the SuperGLUE benchmark (Wang et al., 2019), which consists of several subtasks used to measure natural language understanding performance.

Figure 5: The blue part corresponds to the body and the orange part to the head of the model. During fine-tuning, the entire model is trained. Figure source: Tunstall et al. (2022).

The two novel components that improve the performance of the DeBERTa model over the BERT model and its variants, including RoBERTa, are the disentangled attention mechanism and the enhanced
mask decoder. In the self-attention mechanism, each token is represented by a single vector,
incorporating information about both the token’s content and position through element-wise
embedding summation. However, this approach leads to a potential information loss as the
model might struggle to distinguish whether the importance of a specific embedded vector
component is derived from the word itself or its position. This issue is addressed in the
disentangled attention mechanism. Here, every token is represented by two vectors encoding
its content and position, respectively. Subsequently, attention weights among tokens are cal-
culated using disentangled matrices based on both their contents and relative positions. This
way, the relations between the content and position of tokens are explicitly taken into account.
Furthermore, the pretraining of the DeBERTa model is based on the scheme initially adopted
from RoBERTa. RoBERTa employs larger batches and more extensive training data, omit-
ting the NSP task, resulting in a significant enhancement in performance compared to the
original BERT model. Building upon these advancements, DeBERTa introduces an addi-
tional enhancement to the pretraining process with the incorporation of the enhanced mask
decoder. The enhanced mask decoder takes into account that not only the relative position
but also the absolute position of a word is important, particularly during decoding. Thus, an
absolute position embedding is added just before the softmax layer of the token decoding head.
4.3.2 Fine-tuning
For our SMARTyBERT model, we utilize a pretrained DeBERTa-base model, fine-tuning it
on our specific dataset with labels derived as detailed in Section 3.5. In the literature, some
approaches involve training LLMs from scratch. For instance, Salbrechter (2021) pretrain a
RoBERTa model using a maximum of 4 GB of data, which comprises domain-specific financial
news articles. In contrast, we deliberately choose to use the pretrained DeBERTa instead of
training a model from scratch to optimally leverage the power of transfer learning. DeBERTa
is pretrained on a total of 78 GB of data, including English Wikipedia (12 GB), Book Corpus
(6 GB), which encompasses free books written by unpublished authors, OPENWEBTEXT
(38 GB), containing public Reddit content, and STORIES (31 GB), a subset of the Common
Crawl. Considering the computational costs of pretraining a model on such a massive dataset,
this is not only unfeasible for most researchers but also goes against the spirit of transfer
learning, which aims to leverage pre-trained weights and knowledge. In particular, we make
use of the already established comprehensive understanding of language of the pretrained
DeBERTa model, while the domain-adapted learning specific to our dataset occurs during
the fine-tuning. Concerning the issue of a look-ahead bias which might potentially occur since
the pretraining data spans until and including 2017, we argue that firstly, the text sources are
not from the financial domain. Thus, we expect that the majority of text contains information
unrelated to the financial market. Secondly, in the fine-tuning procedure, we avoid creating
a look-ahead bias by splitting the sample such that fine-tuning is always done on the dataset
that precedes the dataset used for prediction; see details in Section 4.3.3. Further, if there
was a look-ahead bias in the data, the out-of-sample performance of SMARTyBERT should
significantly decrease after 2017, which is not the case, as we will demonstrate in Section 5 (see footnote 6).
To adjust SMARTyBERT to the specification of our dataset, we further modify the archi-
tecture of the original DeBERTa by incorporating an additional regressor. This adaptation
is motivated by the composition of our dataset, which encompasses text items from vari-
ous sources, including earnings call transcripts, headlines from newspapers, and tweets from
StockTwits. By providing the model with information about the origin of the text item,
6 We select a BERT variant for our analysis because its pretraining data ends before the end of our sample period, enabling us to examine a potential look-ahead bias within the subset from 2018-2019. In contrast, newer
LLMs such as the OPT model are pretrained on news data extending beyond 2019, thereby not providing a
subset that excludes future data for testing potential look-ahead biases.
our goal is to enable the model to learn how different sources are associated with distinct
language styles. Computationally, this is achieved by incorporating an additional regressor,
a one-hot encoded vector representing the source of each item. This vector is concatenated
with the hidden state representation before being passed to the classification head.
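A minimal sketch of this architectural modification is given below; the use of the first token's hidden state as the sequence representation, the absence of dropout, and the head dimensions are our simplifying assumptions rather than a reproduction of the exact implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class SourceAugmentedClassifier(nn.Module):
    """Sketch of the SMARTyBERT idea: concatenate a one-hot source indicator
    (headline / company session / QA session / tweet) to the encoder's sequence
    representation before the single-layer classification head."""

    def __init__(self, encoder_name="microsoft/deberta-base", num_sources=4, num_labels=3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.head = nn.Linear(hidden + num_sources, num_labels)

    def forward(self, input_ids, attention_mask, source_onehot):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0, :]                  # first-token representation
        features = torch.cat([cls, source_onehot], dim=-1)    # append the source indicator
        return self.head(features)                            # logits for low / middle / high
```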
4.3.3 Sample Split
Considering that financial markets are inherently dynamic and non-stationary, we employ a
rolling window estimation. Specifically, we select a window size of 32 months, utilizing the
first 24 months as the training set, the subsequent 4 months for validation, and the final 4
months for the test set. After applying this data processing, we obtain 16 time windows on
which we train the model and assess its performance. Each window, on average, contains
2,217,528 text sequences. As demonstrated in Table 4, in each window, on average, 72.4% of text sequences correspond to the training sample, 13.6% to the validation sample, and 13.9% to the test sample. Detailed information for each window is provided in Appendix A.
text sequences training validation test
2,217,528 1,606,377 301,806 309,123
(0.7244) (0.1361) (0.1394)
Table 4: Average total number of text sequences and average of training, validation and
test sequences per window represented in absolute values and percentages.
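The rolling window scheme can be sketched as follows; the four-month step between consecutive windows and the exact start date are illustrative assumptions we make, not values stated above.

```python
import pandas as pd

def rolling_windows(start="2012-01-01", end="2019-12-31",
                    train_months=24, val_months=4, test_months=4, step_months=4):
    """Sketch of the 32-month rolling windows: 24 months of training, 4 of
    validation, and 4 of testing, rolled forward through the sample period."""
    windows, t0 = [], pd.Timestamp(start)
    end_ts = pd.Timestamp(end)
    while True:
        train_end = t0 + pd.DateOffset(months=train_months)
        val_end = train_end + pd.DateOffset(months=val_months)
        test_end = val_end + pd.DateOffset(months=test_months)
        if test_end > end_ts + pd.DateOffset(days=1):
            break
        windows.append({"train": (t0, train_end), "val": (train_end, val_end),
                        "test": (val_end, test_end)})
        t0 = t0 + pd.DateOffset(months=step_months)
    return windows

wins = rolling_windows()
print(len(wins), wins[0]["test"])   # window count and first test period under these assumptions
```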
4.3.4 Training
We train our model using an NVIDIA A100 80 GB GPU. The weights of the model are initialized
from the "microsoft/deberta-base" checkpoint using the HuggingFace transformer library for
Python. In line with the pretraining of DeBERTa, we apply a byte-pair encoding tokenizer
with a maximum limit of 512 tokens. To address the strong class imbalance in our dataset,
we employ class weights during training. These weights are computed by taking the inverse
class frequency and multiplying it by the total number of observations in the training set,
then dividing by the number of unique classes. For each window, we fine-tune the model for
three epochs with a batch size of 64. We utilize a cross-entropy loss function, an AdamW
optimizer, and a cosine learning rate scheduler. The number of warm-up steps corresponds
to 15% of the training data, and the initial learning rate is set to 2e-05. The best model from
the three epochs is saved, where "best" refers to the model with the highest F1 score. The
classification head comprises a simple single-layer feed-forward neural network.
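The class-weight computation described above can be sketched as follows; passing the weights to the cross-entropy loss is shown purely for illustration, with a toy label vector standing in for the training labels.

```python
import numpy as np
import torch

def class_weights(labels):
    """Inverse class frequency times the number of training observations,
    divided by the number of classes (the weighting rule described above)."""
    labels = np.asarray(labels)
    _, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(counts) * counts)
    return torch.tensor(weights, dtype=torch.float)

# toy usage with the weighted cross-entropy loss mentioned above
w = class_weights([0, 1, 1, 1, 1, 1, 1, 1, 2, 1])
loss_fn = torch.nn.CrossEntropyLoss(weight=w)
```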
5 Out-of-Sample Classification Results
Table 5 presents the prediction performance of SMARTyBERT on the test set, which is
strictly out-of-sample across all windows. We observe an accuracy of 54.09% and an F1
score7 of 0.4063. In particular, we observe that for the subset of windows from January 2018
until December 2019, the performance of SMARTyBERT slightly improves. Thus, there is
no indication of a look-ahead bias in our model. For comparison, Table 5 additionally reports
the prediction performance of the classic DeBERTa model. To clarify, this model is fine-tuned
on the same dataset as SMARTyBERT, with the only difference being that DeBERTa does
not receive the additional input feature indicating the source of the text item. Comparing
the performance of the two models, we observe that SMARTyBERT slightly outperforms
DeBERTa in both accuracy and F1 score. This suggests that providing the model with
additional information about the origin of the text marginally improves its predictive power.
Detailed information on the performance for each individual window for both models is
provided in the appendix B. For a better interpretation of these results, we further consider
a baseline model that randomly predicts low,middle, or high sentiment, with probabilities
corresponding to the relative frequency of each class label in the validation data. Such a
random baseline model attains an accuracy of 0.5547 and an F1 score of 0.3331. While at
first glance the random baseline model seems to outperform SMARTyBERT and DeBERTa
in accuracy, with a value of 0.5547, it is important to note that this high value primarily
stems from the high accuracy of the middle sentiment class. However, for the application
of the model, it is crucial that the high and low sentiment classes are predicted correctly.
Table 6 provides the accuracy and F1 scores for the high and low sentiment classes only, for
all windows from January 2018 to December 2019. This highlights that both SMARTyBERT
and DeBERTa clearly outperform the random baseline model. In particular, SMARTyBERT
achieves the highest accuracy and F1 score for the low sentiment class, while DeBERTa is
the best-performing model for the high sentiment class.
7The F1 score is computed as a macro F1 score, where the F1 score for each class is calculated individually,
and then the arithmetic mean of these F1 scores is taken.
model time period accuracy F1 score
SMARTyBERT May 2014-December 2019 0.5409 0.4063
SMARTyBERT January 2018-December 2019 0.5436 0.4133
DeBERTa May 2014-December 2019 0.5382 0.4061
random baseline May 2014-December 2019 0.5547 0.3331
Table 5: Prediction performance of SMARTyBERT, DeBERTa, and a random baseline
model on the test set.
high sentiment low sentiment
accuracy F1 score accuracy F1 score
SMARTyBERT 0.3010 0.1542 0.3669 0.1789
DeBERTa 0.3172 0.1605 0.3564 0.1752
random baseline 0.1357 0.0796 0.1471 0.0854
Table 6: Prediction performance of SMARTyBERT, DeBERTa, and a random baseline
model for the high and low class.
5.1 Contribution of Different Sources to Performance
Since SMARTyBERT is fine-tuned on text sequences from various sources, we investigate the
differential contributions of each source to the model’s prediction performance. To accomplish
this, predictions specific to each source were filtered from all predictions, and accuracy and
F1 score were computed on the test set for each individual source. Results are presented
in Table 7. Remarkably, we observe that both accuracy and F1 score are lowest for text
sequences originating from company sessions, followed by QA sessions, tweets, and, finally,
the highest accuracy and F1 score are obtained for headlines. Note that, while the overall accuracy is a frequency-weighted average of the per-source accuracies, the F1 score does not decompose in this way. At first glance, this suggests
that headlines have the highest contribution to the performance metrics and are therefore
the most relevant. However, an examination of the accuracy of each source stratified across
the different sentiment labels leads to different conclusions. Figure 6 reports the results.
Remarkably, the accuracy of headlines is particularly high for the middle sentiment class, with
close to 99% prediction accuracy, while it is extremely poor for the low and high sentiment
classes, each with less than 2% accuracy. In contrast, the accuracy of company sessions and
QA sessions is higher than the overall accuracy for the high sentiment class and only slightly
lower than the overall accuracy for the low sentiment class. For tweets, the accuracy is higher
than the overall accuracy in the low sentiment class and slightly lower in the high sentiment
class. Given that the predictions generated by SMARTyBERT are employed in a trading
strategy involving the construction of portfolios based on high and low sentiment, particular
emphasis is placed on the accuracy of these sentiment classifications. Thus, we assume that
incorporating earnings call transcripts and tweets contributes to more accurate asset pricing
and enhances the trading strategy. It is noteworthy that the relative frequency of headlines
is very low. Consequently, the poor performance of the model regarding headlines may be
attributed to a small number of observations, which prevents the model from effectively learning the high and low sentiment classes for this source.
source accuracy F1 score relative frequency
company session 0.4653 0.5077 0.2561
QA session 0.4915 0.5322 0.3497
tweets 0.6310 0.6592 0.3904
headlines 0.9220 0.8956 0.0038
Table 7: Accuracy and F1 score for each source and its relative frequency in the test
set.
Figure 6: Accuracy of each source regarding the different sentiment labels.
6 Financial Application
In this section, we analyze whether the sentiment signal generated by our SMAR-
TyBERT model is a significant factor in explaining cross-sectional asset prices. We base
our analysis on the assumption that low sentiment is followed by negative next-day excess
returns, while high sentiment is followed by positive next-day excess returns. We begin by
using the sentiment predictions generated by SMARTyBERT as the assignment variable for
portfolio construction and examine the performance of excess returns in the resulting high
and low sentiment portfolios. We then compute a high-low sentiment (HLS) factor as the
return spread between the high and low sentiment portfolios and assess its relevance within
multifactor asset pricing. Finally, we analyze the performance of a trading strategy that
involves taking a long position in the high sentiment portfolio and a short position in the
low sentiment portfolio, both with and without trading costs. Our results are strictly out-of-
sample, utilizing test data from multiple three-month periods covering May 2014 to December
2019; see Section 4.3.3.
6.1 Aggregation
Since we potentially observe multiple text items from various sources per company per day, we
need to aggregate these sentiment predictions to obtain an overall sentiment class prediction
for each company per day. Aggregation is performed by taking the mean over the median
prediction of each source for each company per day. If the mean is less than or equal to
0.5, the company is classified as low sentiment. If the mean is equal to 2.0, indicating
that the median prediction of each text source was 2.0, the company is classified as high
sentiment. Otherwise, the company is classified as middle sentiment. These thresholds have
been selected through fine-tuning on the validation set. By equal-weighted averaging across
all sources, we account for the fact that there are fewer headlines than tweets and earnings
call transcripts. As a result, we derive a daily ternary sentiment label (low, middle, or high)
for all companies for which we observe text data on the previous day.
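The aggregation rule can be summarized in a minimal sketch. It assumes per-text predictions coded
as 0 (low), 1 (middle), and 2 (high) in a DataFrame with hypothetical columns 'date', 'ticker',
'source', and 'pred'; the thresholds are those stated above.

```python
import pandas as pd

def aggregate_daily_sentiment(preds: pd.DataFrame) -> pd.Series:
    """Daily company-level sentiment label from per-text predictions (0/1/2)."""
    # median prediction per source for each company and day
    per_source = preds.groupby(["date", "ticker", "source"])["pred"].median()
    # equal-weighted mean across the available sources
    mean_score = per_source.groupby(["date", "ticker"]).mean()

    def to_label(x: float) -> str:
        # thresholds tuned on the validation set, as described in the text
        if x <= 0.5:
            return "low"
        if x == 2.0:
            return "high"
        return "middle"

    return mean_score.apply(to_label)
```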
6.2 Excess Return Prediction
Our analysis is based on the assumption that low sentiment is followed by negative next-day
excess returns, while high sentiment is followed by positive next-day excess returns. Thus, in
the following section, we aim to investigate whether a portfolio consisting of stocks predicted
to have high sentiment exhibits positive next-day excess returns, while a portfolio consisting
of stocks predicted to have low sentiment exhibits negative next-day excess returns. This
further serves as a check to determine if the SMARTyBERT model is able to capture the
relationship between sentiment and next-day excess returns as intended with our labeling
approach.
We construct portfolios as follows. We use the aggregated sentiment predictions as the
assignment variable: all stocks classified as low are assigned to the low portfolio, and all stocks
classified as high are assigned to the high portfolio. We then investigate the performance of
the excess returns in the resulting high, low, and high-low sentiment portfolios, respectively.
The high-low portfolio is constructed as the difference between the excess returns of the high
and low portfolios. In particular, to assess whether the excess returns in the high and low
portfolios are statistically different from each other, we run a regression of the high, low, and
the high-low portfolios on a constant, employing Newey-West standard errors with $4(T/100)^{2/9}$
lags. We observe a high-low portfolio for 1,427 days, with an average of 140 companies in
the low portfolio and 74 companies in the high portfolio. Results for the equal-weighted
and value-weighted high-low portfolio and their high and low legs are displayed in Table
8. Starting with the results for the equal-weighted portfolio, we find that the mean of the
high-low portfolio is 0.0017. The p-value indicates significance at the 1% level. Hence, we
can conclude that the excess returns in the high and low portfolios are significantly different
from each other. Analyzing the difference in more detail, we observe that the mean excess
return in the low portfolio is negative, specifically -0.0021 and statistically significant at
the 1% level. This aligns with our preceding assumption that low sentiment is followed by
negative next-day excess returns. However, in the high portfolio, we also observe a negative
mean excess return of -0.0004, which is significant at the 5% level. This finding contradicts
our initial assumption that high sentiment is followed by positive next-day excess returns.
Finally, comparing the mean excess return of the low and high portfolio, we observe that
while they are both negative, the low portfolio exhibits a significantly stronger negative
trend. Consequently, the high-low portfolio has a positive mean excess return. Continuing
the analysis with the results for the value-weighted portfolio, we observe that neither the
high-low nor the high or low portfolio has a mean excess return that is significantly different
from 0 at the 5% level. Thus, we can conclude that the effect of sentiment is restricted to
small-cap stocks. This finding is consistent with the results in Ke et al. (2019), where it is
reported that news article sentiment is a stronger predictor of future returns for small stocks.
                        equal-weighted                          value-weighted
             high-low        low          high        high-low        low          high
             0.0017***   -0.0021***    -0.0004**        0.0000      -0.0002      -0.0002
             (0.000)      (0.000)       (0.000)        (0.000)      (0.000)      (0.000)

Standard errors in parentheses. * p < 0.10, ** p < 0.05, *** p < 0.01

Table 8: Mean daily excess returns of equal-weighted and value-weighted high, low,
         and high-low portfolios.
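As an illustration, the mean test described above can be sketched as follows; the input array of
daily portfolio excess returns and the rounding of the lag rule to an integer are assumptions on
our part.

```python
import numpy as np
import statsmodels.api as sm

def mean_with_nw_se(excess_returns: np.ndarray):
    """Regress daily excess returns on a constant with Newey-West (HAC) errors."""
    T = len(excess_returns)
    lags = int(np.floor(4 * (T / 100) ** (2 / 9)))   # 4*(T/100)^(2/9) lags
    X = np.ones((T, 1))                              # constant only
    res = sm.OLS(excess_returns, X).fit(cov_type="HAC", cov_kwds={"maxlags": lags})
    return res.params[0], res.bse[0], res.pvalues[0]
```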
6.3 Empirical Multifactor Asset Pricing
Next, we compute a HLS factor as the return spread between the high and low sentiment
equal-weighted portfolios using raw returns over the risk-free rate rather than excess returns
as in the previous section. To assess the relevance of such a HLS factor within multifac-
tor asset pricing, we regress the HLS factor against the Fama-French five factors plus the
momentum factor. Momentum, akin to sentiment, is a behavioral factor known to signifi-
cantly impact asset prices. By incorporating momentum we ensure that the significance of
the HLS factor is not confounded by momentum, thereby validating the distinct contribution
of sentiment to asset pricing dynamics. The regression is contemporaneous, with all factors
measured on day t, while the HLS factor is formed predictively using text items from day
t-1. For this analysis, we consider all factors on a monthly basis rather than a daily basis, as
monthly data is less noisy (a regression using daily returns yields the same conclusions). The
results are presented in Table 9. When regressing the HLS
factor on a constant only, we obtain an alpha of 0.0378, which is significant at the 1% level.
Incorporating the Fama-French five factors plus the momentum factor yields an alpha of
0.0338, also significant at the 1% level. Therefore, we conclude that the HLS factor contains
unique information for cross-sectional asset prices that is not explained by the Fama-French
five factors and the momentum factor.
   Alpha        MKT        SMB        HML        CMA        RMW        MOM        R2
   0.0378***                                                                      0.0
   (0.005)
   0.0338***   -0.0094     0.1976    -0.3811     0.4255     0.8574**    0.0001    0.112
   (0.005)     (0.152)    (0.227)    (0.258)    (0.371)    (0.417)     (0.000)

Standard errors in parentheses. * p < 0.10, ** p < 0.05, *** p < 0.01

Table 9: Regression of the HLS factor on the Fama-French five factors and the momen-
         tum factor using monthly returns.
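A minimal sketch of this factor regression is given below. It assumes a DataFrame of monthly
returns with hypothetical columns 'HLS', 'MKT', 'SMB', 'HML', 'CMA', 'RMW', and 'MOM' over the
out-of-sample period; plain OLS standard errors are used here, as the error treatment for this
regression is not restated in the text.

```python
import pandas as pd
import statsmodels.api as sm

def hls_factor_regression(df: pd.DataFrame):
    """Regress the monthly HLS factor on the Fama-French five factors plus momentum."""
    X = sm.add_constant(df[["MKT", "SMB", "HML", "CMA", "RMW", "MOM"]])
    res = sm.OLS(df["HLS"], X).fit()
    return res  # res.params['const'] corresponds to the alpha reported in Table 9
```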
6.4 Performance of a Sentiment-Based Trading Strategy
Next, we investigate the financial performance of a trading strategy based on the predictions
of SMARTyBERT. Specifically, we examine a trading strategy that takes a short position in
the low portfolio and a long position in the high portfolio. Figure 7 demonstrates the cumu-
lative log returns of equally-weighted portfolios, excluding transaction costs, from May 2014
to December 2019. It includes both the long and short portfolios, as well as the long/short
portfolio, which is defined as the aggregate of the returns from the long and short portfolios.
The plot illustrates the profitability of the long/short trading strategy which is driven by the
short leg. These results align with our findings in Section 6.2, where we observe significant
negative next-day excess returns for low sentiment, but no significant positive next-day ex-
cess returns for high sentiment. For comparison, Figure 7 also provides the performance of
FinBERT. Using the Hugging Face Transformers library, we initialize the weights for the model
from the checkpoint ’yiyanghkust/finbert-tone’ and apply it to our dataset using zero-shot
prediction. While the architecture of FinBERT is similar to SMARTyBERT, the key dif-
ference between the two models lies in the labeling approach. We fine-tune SMARTyBERT
with labels derived from the excess return, and FinBERT is fine-tuned with labels that are
derived through human annotation, as described in Section 2. FinBERT yields small profits
for the short leg; however, the long leg incurs substantial losses, resulting in an overall loss
for the long/short portfolio. This demonstrates the effectiveness of our data-driven labeling
approach, leading to superior prediction performance in terms of financial gains. Table 10
summarizes the results once more. Starting with the results obtained without transaction
costs for the long/short, short, and long portfolios, SMARTyBERT achieves an annualized
mean return of 35.56%, 37.44%, and -1.88%, respectively. The annualized Sharpe ratios are
as follows: 2.21 for the long/short portfolio, 1.54 for the short portfolio, and -0.08 for the
long portfolio. FinBERT achieves annualized mean returns of -2.84%, 5.09%, and -7.93% for the
long/short, short, and long portfolios, respectively, resulting in annualized Sharpe ratios of
-0.17, 0.31, and -0.34 (see Table 10). Results for value-weighted portfolios are presented in
Appendix C in Table 16.
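For reference, the zero-shot application of the FinBERT checkpoint mentioned above can be
sketched in a few lines using the Transformers pipeline API; the example sentence and the
indicated output format are purely illustrative.

```python
from transformers import pipeline

# Zero-shot sentiment prediction with the pre-trained FinBERT checkpoint.
finbert = pipeline("text-classification", model="yiyanghkust/finbert-tone")
print(finbert("The company raised its full-year guidance."))
# e.g. [{'label': 'Positive', 'score': ...}]
```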
Figure 7: Cumulative log returns of the long/short, long and short portfolio of SMAR-
TyBERT and FinBERT without transaction costs.
SMARTyBERT FinBERT
long/short short long long/short short long
without transaction costs
return 0.3556 0.3744 -0.0188 -0.0284 0.0509 -0.0793
standard deviation 0.1612 0.2423 0.2327 0.1645 0.1655 0.2358
Sharpe ratio 2.21 1.54 -0.08 -0.17 0.31 -0.34
with transaction costs
return 0.3321 0.3676 -0.0355 -0.0651 0.0497 -0.1148
standard deviation 0.1612 0.2423 0.2326 0.1645 0.1655 0.2358
Sharpe ratio 2.06 1.52 -0.15 -0.40 0.30 -0.49
Table 10: Annualized mean return, standard deviation, and Sharpe ratio of SMARTy-
BERT and FinBERT using equal-weighted portfolio construction, with and
without transaction costs.
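The annualized figures in Table 10 can be obtained from daily portfolio returns as in the sketch
below; the 252-trading-day convention and the use of the sample standard deviation are
assumptions on our part, as the exact annualization is not spelled out in the text.

```python
import numpy as np

def annualized_stats(daily_returns: np.ndarray, periods_per_year: int = 252):
    """Annualized mean return, standard deviation, and Sharpe ratio."""
    ann_return = daily_returns.mean() * periods_per_year
    ann_std = daily_returns.std(ddof=1) * np.sqrt(periods_per_year)
    return ann_return, ann_std, ann_return / ann_std
```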
Moreover, we evaluate the financial performance of the trading strategy when accounting
for transaction costs. Following the approach by Bollerslev et al. (2018), we estimate
transaction costs by using the median bid-ask spread for each asset over the past nine months.
Figure 8 illustrates the cumulative monthly log returns with and without transaction costs.
We consider monthly instead of daily log returns to demonstrate the impact of transaction
costs. As shown in Figure 8, when transaction costs are considered, a long/short strategy
based on the predictions of SMARTyBERT experiences a slight decrease in profitability but
remains profitable. In contrast, a long/short strategy based on the predictions of FinBERT
incurs further losses. In particular, we observe that transaction costs are lower for the short
portfolio compared to the long portfolio. Considering the results in Table 10, the annualized
mean return of the short portfolio declines by 0.68%, resulting in an annualized mean return of
36.76%, while the annualized mean return of the long portfolio decreases by 1.67%, leading to
a return of -3.55%. This discrepancy arises because the average daily share changes per stock
are lower for the low portfolio compared to the high portfolio, resulting in lower daily trans-
action costs per stock, as demonstrated in Figure 9. This is due to the tendency of stocks
in the short portfolio to remain classified as low sentiment for multiple consecutive days.
Additionally, there are more stocks predicted as low (140) than those predicted as high (74).
Consequently, stocks in the low portfolio are held for longer periods, necessitating smaller
daily adjustments in share positions rather than full buys or sells. In contrast, stocks in the
long portfolio are often held for just one day, leading to larger daily adjustments in share
positions and consequently higher transaction costs. The effect is even more pronounced for
FinBERT, as the discrepancy in the average number of stocks predicted as high and low
sentiment is much greater: on average, 16 stocks are predicted as high sentiment, while 551
stocks are predicted as low sentiment. Consequently, when transaction costs are included,
the annualized mean return of the short portfolio only marginally decreases from 5.09% to
4.97%, while the annualized mean return of the long portfolio significantly decreases from
-7.93% to -11.48%.
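A minimal sketch of the spread-based cost proxy is given below. The quote columns ('date',
'ticker', 'bid', 'ask') and the use of the relative spread are assumptions for illustration, and
how the estimated spread is charged per trade is not restated here.

```python
import pandas as pd

def median_past_spread(quotes: pd.DataFrame, asof: pd.Timestamp) -> pd.Series:
    """Median relative bid-ask spread per stock over the previous nine months."""
    start = asof - pd.DateOffset(months=9)
    past = quotes[(quotes["date"] >= start) & (quotes["date"] < asof)].copy()
    past["rel_spread"] = (past["ask"] - past["bid"]) / ((past["ask"] + past["bid"]) / 2.0)
    # one cost estimate per stock, based only on quotes observed before `asof`
    return past.groupby("ticker")["rel_spread"].median()
```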
Figure 8: Cumulative monthly log returns of the long/short, long and short portfolios
of SMARTyBERT and FinBERT with and without transaction costs (TAC).
Figure 9: Average daily changes in shares per stock and average daily transaction costs
per stock for the high and low portfolios generated by SMARTyBERT.
7 Conclusion
We present a novel approach to sentiment analysis within financial markets, addressing the
critical NLP challenge of data labeling through a data-driven methodology. By deriving
sentiment labels based on the next day’s excess returns relative to the Fama-French five-
factor model, we eliminate reliance on human annotations and streamline the sentiment and
predictive analysis into a single step.
Our contributions to the field are twofold. First, we compile a comprehensive dataset span-
ning from 2012 to 2019, which includes diverse financial texts for all companies listed in the
CRSP universe. This dataset encompasses earnings conference call transcripts, Bloomberg
news headlines, and tweets from StockTwits, allowing us to capture sentiment from vari-
ous investor perspectives. Second, we develop SMARTyBERT, a fine-tuned version of the
pre-trained DeBERTa model. By employing transfer learning, we leverage the pre-existing
knowledge embedded in DeBERTa and enhance it with an additional feature representing
the text source. This modification enables the model to adapt to different linguistic styles
across our text sources.
Our financial application of SMARTyBERT reveals that low sentiment consistently predicts neg-
ative next-day excess returns, while high sentiment does not show a similar effect. Moreover,
the HLS factor is significant at the 1% level in multifactor asset pricing. Additionally,
our analysis of trading strategies demonstrates substantial financial gains. The long/short
equal-weighted portfolio strategy yields an average annualized return of 35.56% and a Sharpe
ratio of 2.21 before transaction costs, and 33.21% and 2.06 respectively after accounting for
transaction costs, indicating strong profitability. In contrast, the FinBERT model, which re-
lies on human-annotated data, significantly underperforms, with negative annualized returns
and Sharpe ratios both with and without considering transaction costs. These findings un-
derscore the superiority of our data-driven labeling approach in enhancing financial sentiment
analysis and trading performance.
Possible future research could focus on the investigation of how the HLS factor and related
trading strategy behave during particular time periods (for example, periods characterized by
market turmoil) and/or whether the same importance of the different sources is maintained
in other markets and geographic regions.
References
Araci, D. F. and Genc, Z. (2019). Financial sentiment analysis with pre-trained language
models. arXiv preprint arXiv:1908.10063.
Audrino, F., Sigrist, F., and Ballinari, D. (2020). The impact of sentiment and attention
measures on stock market volatility. International Journal of Forecasting, 36(2):334–357.
Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning
to align and translate. arXiv preprint arXiv:1409.0473.
Bali, T. G., Engle, R. F., and Murray, S. (2016). Empirical asset pricing: The cross section
of stock returns. John Wiley & Sons.
Ballinari, D. and Behrendt, S. (2021). How to gauge investor behavior? a comparison of
online investor sentiment measures. Digital Finance, 3(2):169–204.
Barbaglia, L., Consoli, S., and Manzan, S. (2022). Forecasting with economic news. Journal
of Business & Economic Statistics, pages 1–12.
Bollen, J., Mao, H., and Zeng, X. (2011). Twitter mood predicts the stock market. Journal
of computational science, 2(1):1–8.
Bollerslev, T., Hood, B., Huss, J., and Pedersen, L. H. (2018). Risk everywhere: Modeling
and managing volatility. The Review of Financial Studies, 31(7):2729–2773.
Chen, Y., Kelly, B. T., and Xiu, D. (2022). Expected returns and large language models.
Available at SSRN.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep
bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Garcia, D., Hu, X., and Rohrer, M. (2020). The colour of finance words. Available at SSRN
3630898.
He, P., Liu, X., Gao, J., and Chen, W. (2021). Deberta: Decoding-enhanced bert with
disentangled attention. In International Conference on Learning Representations.
Howard, J. and Ruder, S. (2018). Universal language model fine-tuning for text classification.
arXiv preprint arXiv:1801.06146.
Huang, A. H., Wang, H., and Yang, Y. (2023). Finbert: A large language model for extracting
information from financial text. Contemporary Accounting Research, 40(2):806–841.
Jegadeesh, N. and Wu, D. (2013). Word power: A new approach for content analysis. Journal
of financial economics, 110(3):712–729.
Ke, Z. T., Kelly, B. T., and Xiu, D. (2019). Predicting returns with text data. Technical
report, National Bureau of Economic Research.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer,
L., and Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach.
arXiv preprint arXiv:1907.11692.
Loughran, T. and McDonald, B. (2011). When is a liability not a liability? textual analysis,
dictionaries, and 10-ks. The Journal of finance, 66(1):35–65.
Maia, M., Handschuh, S., Freitas, A., Davis, B., McDermott, R., Zarrouk, M., and Balahur,
A. (2018). WWW'18 open challenge: Financial opinion mining and question answering. In
Companion Proceedings of The Web Conference 2018, pages 1941–1942.
Malo, P., Sinha, A., Korhonen, P., Wallenius, J., and Takala, P. (2014). Good debt or bad
debt: Detecting semantic orientations in economic texts. Journal of the Association for
Information Science and Technology, 65(4):782–796.
Manela, A. and Moreira, A. (2017). News implied volatility and disaster concerns. Journal
of Financial Economics, 123(1):137–162.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word
representations in vector space. arXiv preprint arXiv:1301.3781.
Salbrechter, S. (2021). Financial news sentiment learned by bert: A strict out-of-sample
study. Available at SSRN 3971880.
Severyn, A. and Moschitti, A. (2015). Twitter sentiment analysis with deep convolutional
neural networks. In Proceedings of the 38th international ACM SIGIR conference on re-
search and development in information retrieval, pages 959–962.
Tetlock, P. C. (2007). Giving content to investor sentiment: The role of media in the stock
market. The Journal of finance, 62(3):1139–1168.
Tunstall, L., Von Werra, L., and Wolf, T. (2022). Natural language processing with transformers.
O'Reilly Media, Inc.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and
Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing
systems, 30.
Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., and
Bowman, S. (2019). Superglue: A stickier benchmark for general-purpose language under-
standing systems. Advances in neural information processing systems, 32.
Yang, Y., Uy, M. C. S., and Huang, A. (2020). Finbert: A pretrained language model for
financial communications. arXiv preprint arXiv:2006.08097.
Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li,
X., Lin, X. V., et al. (2022). Opt: Open pre-trained transformer language models. arXiv
preprint arXiv:2205.01068.
A Sample Split
Detailed descriptive statistics of the data in each window.
window text sequences training validation test
January 2012-August 2014 1’797’701 0.6754 0.1600 0.1645
May 2012-December 2014 1’856’263 0.7018 0.1593 0.1389
September 2012-April 2015 2’081’700 0.6905 0.1238 0.1857
January 2013-August 2015 2’233’866 0.6587 0.1627 0.1785
May 2013-December 2015 2’230’383 0.6927 0.1689 0.1384
September 2013-April 2016 2’264’923 0.7190 0.1298 0.1512
January 2014-August 2016 2’317’387 0.7088 0.1393 0.1519
May 2014-December 2016 2’303’987 0.7256 0.1470 0.1274
September 2014-April 2017 2’355’398 0.7293 0.1194 0.1513
January 2015-August 2017 2’370’783 0.7385 0.1239 0.1377
May 2015-December 2017 2’320’612 0.7526 0.1402 0.1072
September 2015-April 2018 2’230’138 0.7606 0.1110 0.1284
January 2016-August 2018 2’316’550 0.7409 0.1237 0.1354
May 2016-December 2018 2’258’063 0.7574 0.1382 0.1044
September 2016-April 2019 2’225’037 0.7711 0.1055 0.1234
January 2017-August 2019 2’288’220 0.7412 0.1202 0.1386
May 2017-December 2019 2’246’968 0.7512 0.1413 0.1075
Table 11: Share of sequences in the training, validation, and test sample for each win-
dow.
window headlines company session QA session tweets tweets share
January 2012-August 2014 0.0062 0.2868 0.3669 0.3434 0.3000
May 2012-December 2014 0.0059 0.2887 0.3640 0.3434 0.3000
September 2012-April 2015 0.0052 0.2784 0.3530 0.3434 0.3000
January 2013-August 2015 0.0048 0.2827 0.3636 0.3434 0.2600
May 2013-December 2015 0.0046 0.2778 0.3622 0.3434 0.2300
September 2013-April 2016 0.0044 0.2716 0.3612 0.3434 0.2100
January 2014-August 2016 0.0043 0.2778 0.3744 0.3434 0.1800
May 2014-December 2016 0.0042 0.2722 0.3733 0.3434 0.1650
September 2014-April 2017 0.0038 0.2649 0.3729 0.3434 0.1500
January 2015-August 2017 0.0037 0.2767 0.3939 0.3434 0.1290
May 2015-December 2017 0.0037 0.2708 0.3863 0.3434 0.1180
September 2015-April 2018 0.0039 0.2767 0.3920 0.3434 0.0952
January 2016-August 2018 0.0037 0.2799 0.3927 0.3434 0.0850
May 2016-December 2018 0.0038 0.2780 0.3866 0.3434 0.0830
September 2016-April 2019 0.0039 0.2775 0.3823 0.3434 0.0740
January 2017-August 2019 0.0038 0.2842 0.3876 0.3434 0.0690
May 2017-December 2019 0.0042 0.2807 0.3789 0.3434 0.0678
Table 12: Source distribution in each window.
window low sentiment middle sentiment high sentiment
January 2012-August 2014 0.1336 0.7439 0.1225
May 2012-December 2014 0.1369 0.7363 0.1267
September 2012-April 2015 0.1364 0.7345 0.1291
January 2013-August 2015 0.1379 0.7323 0.1298
May 2013-December 2015 0.1384 0.7302 0.1314
September 2013-April 2016 0.1365 0.7302 0.1333
January 2014-August 2016 0.1371 0.7297 0.1331
May 2014-December 2016 0.1362 0.7318 0.1320
September 2014-April 2017 0.1375 0.7298 0.1327
January 2015-August 2017 0.1404 0.7262 0.1334
May 2015-December 2017 0.1455 0.7207 0.1338
September 2015-April 2018 0.1483 0.7159 0.1358
January 2016-August 2018 0.1511 0.7111 0.1378
May 2016-December 2018 0.1553 0.7063 0.1384
September 2016-April 2019 0.1567 0.7043 0.1390
January 2017-August 2019 0.1592 0.6999 0.1409
May 2017-December 2019 0.1584 0.7000 0.1416
Table 13: Label distribution in each window.
B Out-of-Sample Classification Results
window accuracy F1 score
May 2014-August 2014 0.480542 0.377624
September 2014-December 2014 0.529213 0.385838
January 2015-April 2015 0.544451 0.395403
May 2015-August 2015 0.536634 0.394630
September 2015-December 2015 0.545592 0.394091
January 2016-April 2016 0.548959 0.398383
May 2016-August 2016 0.527539 0.402531
September 2016-December 2016 0.559156 0.415694
January 2017-April 2017 0.573052 0.428323
May 2017-August 2017 0.523571 0.402863
September 2017-December 2017 0.565782 0.427782
January 2018-April 2018 0.569696 0.417944
May 2018-August 2018 0.538421 0.408197
September 2018-December 2018 0.547163 0.414951
January 2019-April 2019 0.559697 0.419280
May 2019-August 2019 0.503404 0.398134
September 2019-December 2019 0.550280 0.423606
Table 14: SMARTyBERT prediction performance on the test set for each window.
window accuracy F1 score
May 2014-August 2014 0.485113 0.380561
September 2014-December 2014 0.539969 0.389157
January 2015-April 2015 0.556260 0.398303
May 2015-August 2015 0.533685 0.393462
September 2015-December 2015 0.555373 0.398376
January 2016-April 2016 0.540914 0.395976
May 2016-August 2016 0.511712 0.397058
September 2016-December 2016 0.562476 0.414626
January 2017-April 2017 0.563260 0.428719
May 2017-August 2017 0.516029 0.399968
September 2017-December 2017 0.547971 0.423981
January 2018-April 2018 0.553441 0.415085
May 2018-August 2018 0.539489 0.407916
September 2018-December 2018 0.544477 0.416880
January 2019-April 2019 0.552343 0.418384
May 2019-August 2019 0.504325 0.399274
September 2019-December 2019 0.548810 0.429073
Table 15: DeBERTa prediction performance on the test set for each window.
C Financial Application
SMARTyBERT FinBERT
long-short short long long-short short long
return -0.0108 -0.0785 0.0677 -0.0272 -0.1296 0.1024
std 0.1826 0.2246 0.2381 0.1261 0.1409 0.1859
Sharpe ratio -0.06 -0.35 0.28 -0.22 -0.92 0.55
Table 16: Annualized mean return, standard deviation, and Sharpe ratio of SMARTy-
BERT and FinBERT using value-weighted portfolio construction.