arXiv:2408.06585v1 [cs.CE] 13 Aug 2024
SSAAM: Sentiment Signal-based Asset Allocation
Method with Causality Information
Rei Taguchi
School of Engineering
The University of Tokyo
Tokyo, Japan
s5abadiee@g.ecc.u-tokyo.ac.jp
Hiroki Sakaji
School of Engineering
The University of Tokyo
Tokyo, Japan
sakaji@sys.t.u-tokyo.ac.jp
Kiyoshi Izumi
School of Engineering
The University of Tokyo
Tokyo, Japan
izumi@sys.t.u-tokyo.ac.jp
Abstract—This study demonstrates whether financial text is
useful for tactical asset allocation using stocks by using natural
language processing to create polarity indexes in financial news.
In this study, we performed clustering of the created polarity
indexes using the change-point detection algorithm. In addition,
we constructed a stock portfolio and rebalanced it at each
change point utilizing an optimization algorithm. Consequently,
the asset allocation method proposed in this study outperforms
the comparative approach. This result suggests that the polarity
index helps construct the equity asset allocation method.
Index Terms—Financial news, MLM scoring, causal inference,
change-point detection, portfolio optimization
I. INTRODUCTION
This study proposes that financial text can be useful for
tactical asset allocation methods using equities. This study
focuses on the point at which stock and portfolio prices change
rapidly due to external factors, that is, the point of regime
change. Regimes in finance theory refer to invisible market
states, such as expansion, recession, bulls, and bears. In this
study, we specifically drew on the two studies presented below.
Wood et al. [1] used a change-point detection module to
capture regime changes and created a simple and expressive
model. Ito et al. [2] developed a method for switching invest-
ment strategies in response to market conditions. In this study,
we go one step further and focus on how to measure future
regime changes. If the information on future regime changes
(i.e., future changes in the market environment) is known,
active management with a higher degree of freedom becomes
possible. However, there are certain limitations in calculating
future regimes using only traditional financial time-series data.
Therefore, this study constructs an investment strategy based
on a combination of alternative data that has been attracting
attention in recent years and financial time-series data.
In this study, we hypothesized the following:
•Portfolio performance can be improved by switching be-
tween risk-minimizing and return-maximizing optimiza-
tion strategies according to the change points created by
the polarity index.
The contributions of this study are as follows:
• We demonstrate that estimating regime change points from financial text supports active management, and we propose a highly expressive asset allocation framework.
The framework of this study consists of the following four
steps.
• Step 1 (Creating polarity index): Score financial news titles using MLM scoring. Quartiles are then calculated from the scores, and a three-value classification into positive, negative, and neutral is performed according to the quartile range. The classified values are aggregated daily.
• Step 2 (Demonstration of leading effects): We use statistical causal inference to demonstrate whether financial news has leading effects on a stock portfolio, using the polarity index created in Step 1 and a portfolio combining the 10 stocks. The algorithm used is VAR-LiNGAM.
• Step 3 (Change point detection): Having verified in Step 2 that the polarity index has leading effects, we calculate the regime change points of the polarity index using a change-point detection algorithm. The algorithm used is the binary segmentation search method.
• Step 4 (Portfolio optimization): Portfolio optimization is performed based on the change points detected in Step 3. The algorithm used is EVaR optimization.
II. METHOD
A. Creating polarity index
This study used pseudo-log-likelihood scores (PLLs) to
create polarity indices. PLLs are scores based on probabilistic
language models proposed by Salazar et al. [3]. Because
masked language models (MLMs) are pre-trained by predicting words in both directions, their sentence probabilities cannot be handled by conventional left-to-right probabilistic language models. However, PLLs can determine the naturalness of sentences at a high level because they are represented by the sum of the log-likelihoods of the conditional probabilities obtained when each word is masked and predicted in turn. Token \psi_t is replaced by [MASK] and predicted from the remaining tokens \Psi_{\setminus t} = [\psi_1, ..., \psi_{t-1}, \psi_{t+1}, ..., \psi_{|\Psi|}], where t indexes the tokens. \Theta is the model parameter, and P_{\mathrm{MLM}}(\cdot) denotes the probability of each sentence token. BERT (Devlin et al. [4]) is selected as the MLM.

PLL(\Psi) := \sum_{t=1}^{|\Psi|} \log_2 P_{\mathrm{MLM}}(\psi_t \mid \Psi_{\setminus t}; \Theta)    (1)
After pre-processing, the financial news text is scored with PLLs one sentence at a time. Quartile ranges¹ were then calculated over the sentence-level scores. The table below illustrates the polarity classification method.
TABLE I
POLARITY CLASSIFICATION METHOD

Classification Method                   Sentiment Score
3rd quartile < PLLs                     1 (positive)
1st quartile ≤ PLLs ≤ 3rd quartile      0 (neutral)
PLLs < 1st quartile                     -1 (negative)
The scores are aggregated chronologically according to the title column of the financial news.
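The classification and daily aggregation of Section II-A can be sketched as follows. This is a minimal illustration: the `pll` and `date` column names and the use of a daily sum as the aggregate are assumptions, since the paper only states that values are "aggregated daily".

```python
import pandas as pd

def classify_polarity(scores: pd.Series) -> pd.Series:
    """Map per-sentence PLL scores to {-1, 0, +1} by quartile range (Table I)."""
    q1, q3 = scores.quantile(0.25), scores.quantile(0.75)

    def label(s):
        if s > q3:
            return 1    # positive
        if s < q1:
            return -1   # negative
        return 0        # neutral

    return scores.map(label)

def daily_polarity_index(df: pd.DataFrame) -> pd.Series:
    """Aggregate sentence-level polarity labels into a daily index.

    `df` is assumed to have a 'date' column and a 'pll' column holding
    one PLL score per news title (column names are illustrative).
    """
    df = df.assign(polarity=classify_polarity(df["pll"]))
    return df.groupby("date")["polarity"].sum()
```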
B. Demonstration of leading effects
In this study, we used VAR-LiNGAM to demonstrate the leading (precedence) effects. VAR-LiNGAM is a statistical causal inference model proposed by Hyvärinen et al. [5]. The causal model inferred by VAR-LiNGAM is as follows:
x(t) = \sum_{\tau=1}^{T} B_\tau x(t-\tau) + e(t)    (2)
where x(t) is the vector of the variables at time t and \tau is the time delay. T is the maximum lag. In addition, B_\tau is a coefficient matrix that represents the causal relationships from the variables x(t-\tau), and e(t) denotes the disturbance term. VAR-LiNGAM was implemented using the following procedure: first, a VAR (vector auto-regressive) model is fitted to the causal relationships among variables from the lagged times to the current time; second, for the causal relationships among variables at the current time, LiNGAM inference is performed on the residuals of the VAR model. This study uses this procedure to confirm whether financial news precedes the stock portfolio.
C. Change point detection
Binary segmentation search (Bai [6]; Fryzlewicz [7]) is a greedy sequential algorithm; the notation here follows Truong et al. [8]. The operation is greedy in the sense that it searches for the single change point with the lowest sum of costs. The signal is then split in two at the detected change point, and the same operation is repeated on the resulting sub-signals until a stopping criterion is reached. Binary segmentation search is expressed in Algorithm 1.
We define a signal y = {y_s}_{s=1}^{S} that follows a multivariate non-stationary stochastic process with S samples. L refers to the list of change points, and s denotes the value of a change point. G refers to an array of gains to be computed. Given the signal y, the (b-a)-sample-long sub-signal {y_s}_{s=a+1}^{b} (1 ≤ a < b ≤ S) is simply denoted y_{a,b}. Hats represent calculated values. Other notations are explained in the algorithm's comments.

¹ Arranging the data in ascending order, the value at the 1/4 position is called the 1st quartile, the value at the 2/4 position the 2nd quartile, and the value at the 3/4 position the 3rd quartile. (3rd quartile - 1st quartile) is called the interquartile range.
Algorithm 1 Binary Segmentation Search
Input: signal y = {y_s}_{s=1}^{S}, cost function c(·), stopping criterion.
Initialize L ← {}.  ⊲ Estimated breakpoints
repeat
  k ← |L|.  ⊲ Number of breakpoints
  s_0 ← 0 and s_{k+1} ← S.  ⊲ Dummy variables
  if k > 0 then
    Denote by s_i (i = 1, ..., k) the elements (in ascending order) of L, i.e., L = {s_1, ..., s_k}.
  end if
  Initialize G, a (k+1)-long array.  ⊲ List of gains
  for i = 0, ..., k do
    G[i] ← c(y_{s_i, s_{i+1}}) - min_{s_i < s < s_{i+1}} [ c(y_{s_i, s}) + c(y_{s, s_{i+1}}) ].
  end for
  î ← argmax_i G[i]
  ŝ ← argmin_{s_î < s < s_{î+1}} [ c(y_{s_î, s}) + c(y_{s, s_{î+1}}) ].  ⊲ Estimated change point
  L ← L ∪ {ŝ}
until stopping criterion is met.
Output: set L of estimated breakpoint indexes.
D. Portfolio optimization
The entropy value at risk (EVaR) is a coherent risk measure
that is the upper bound between the value at risk (VaR)
and conditional value at risk (CVaR) derived from Chernoff’s
inequality (Ahmadi-Javid [9]; Ahmadi-Javid [10]). EVaR has
the advantage of being computationally tractable compared to
other risk measures, such as CVaR, when incorporated into
stochastic optimization problems (Ahmadi-Javid [10]). EVaR
is defined as follows.
\mathrm{EVaR}_{\alpha}(X) := \min_{z>0} \left\{ z \ln\left( \frac{1}{\alpha} M_X\!\left(\frac{1}{z}\right) \right) \right\}    (3)

where X is a random variable, M_X is its moment-generating function, \alpha denotes the significance level, and z is an auxiliary variable.
A general convex programming framework for the EVaR is
proposed by Cajas [11]. In this study, we switch between the
following two optimization strategies depending on the regime
classified in Section II-C.
• Minimize risk optimization: a convex optimization problem with constraints imposed to minimize EVaR given a level of expected return \hat{\mu}:

  minimize    q + z \ln(1/(T\alpha))
  subject to  \mu w^\top \ge \hat{\mu}
              \sum_{i=1}^{N} w_i = 1
              z \ge \sum_{j=1}^{T} u_j
              (-r_j w^\top - q, z, u_j) \in K_{\exp}  (\forall j = 1, ..., T)
              w_i \ge 0  (\forall i = 1, ..., N)    (4)

• Maximize return optimization: a convex optimization problem imposed to maximize expected return given an EVaR level \widehat{\mathrm{EVaR}}:

  maximize    \mu w^\top
  subject to  q + z \ln(1/(T\alpha)) \le \widehat{\mathrm{EVaR}}
              \sum_{i=1}^{N} w_i = 1
              z \ge \sum_{j=1}^{T} u_j
              (-r_j w^\top - q, z, u_j) \in K_{\exp}  (\forall j = 1, ..., T)
              w_i \ge 0  (\forall i = 1, ..., N)    (5)

where q, z, and u are auxiliary variables, K_{\exp} is the exponential cone, and T is the number of observations. w is a vector of weights for the N assets, r is the matrix of returns, and \mu is the mean vector of asset returns.
III. EXPERIMENTS & RESULTS
A. Dataset description
This study calculates signals for portfolio rebalancing and tactical asset allocation that actively pursue alpha, based on the assumption that financial news precedes the equity portfolio.
Two types of data were used.
• Stock Data: We used daily stock data provided by Yahoo! Finance². The stocks are the components of the NYSE FANG+ Index: Facebook, Apple, Amazon, Netflix, Google, Microsoft, Alibaba, Baidu, NVIDIA, and Tesla. Adjusted closing prices were used. The time period for this data is January 2015 through December 2019.
• Financial News Data: We used the daily historical financial news archive provided by Kaggle³, a data analysis platform. This data represents the historical news archive of U.S. stocks listed on the NYSE/NASDAQ for the past 12 years and was confirmed to contain information on the ten stocks above. The data consists of 9 columns and 221,513 rows; the title and release date columns were used in this study. The time period for this data is January 2015 through December 2019.
²https://finance.yahoo.com/
³https://www.kaggle.com/
B. Preparation for backtesting
The polarity index is created as presented in Section II-A. The financial news data were pre-processed before creating the polarity index: both the financial news and the stock data are in daily units, and to match the periods, rows with blanks in either dataset were dropped. Once the polarity index is created, the next step is to build a stock portfolio by summing the adjusted closing prices of the 10 stocks; the investment ratio is set uniformly across all stocks. Next, we apply VAR-LiNGAM (Section II-B) to perform causal inference; the Python library ruptures (Truong et al. [8]) was used for the change-point detection step of Section II-C. The causal inference results are as follows:
TABLE II
CAUSAL INFERENCE IN VAR-LINGAM

Direction (Causal Graph)         Value
Index(t-1) → Index(t)            0.39
Index(t-1) → Portfolio(t)        0.11
Portfolio(t-1) → Portfolio(t)    1.00
The values in Table II are elements of the adjacency matrix; the lower display threshold was set to 0.05. The results in the table show that the polarity index leads the equity portfolio. The Python library LiNGAM (Hyvärinen et al. [5]) was used.
C. Backtesting scenarios
In this study, the following rebalancing timings were merged and backtested. The Python libraries vectorbt (Polakow [12]) and Riskfolio-Lib (Cajas [13]) were used for backtesting. In addition to EVaR optimization, CVaR optimization and the mean-variance model were used as comparative optimization algorithms. The number of regimes was set to 5 and 10, and the regular rebalancing intervals were 30, 90, and 180 days. The backtesting methodology was as follows, with CPD-EVaR++ positioned as the proposed strategy and CPD-EVaR+ as the runner-up strategy.
• CPD-EVaR++ (proposed): Change-point rebalancing using risk-minimization and return-maximization EVaR optimization + regular-interval rebalancing strategy
• CPD-EVaR+: Change-point rebalancing using risk-minimization and unconstrained EVaR optimization + regular-interval rebalancing strategy
• EVaR: EVaR optimization, regular-interval rebalancing strategy
• CVaR: CVaR optimization, regular-interval rebalancing strategy
• MV: Mean-variance optimization, regular-interval rebalancing strategy
The binary determination of whether the polarity index within each regime shows an upward or a downward trend is made by examining the divided regimes. MinRiskOpt (Section II-D, Eq. (4)) is assigned to an upward trend, and MaxReturnOpt (Section II-D, Eq. (5)) is assigned to a downward trend.
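The trend test and strategy assignment can be sketched as follows. Judging the trend by the sign of each segment's net change is our reading, since the paper does not specify the exact rule.

```python
import numpy as np

def regime_strategies(index: np.ndarray, bkps: list) -> list:
    """Assign an optimization strategy to each detected regime.

    For each segment between consecutive change points, the trend is
    judged by comparing the last and first values of the polarity
    index within the segment (an assumed, simple binary rule).
    """
    strategies, start = [], 0
    for end in bkps:  # bkps ends with len(index), as ruptures returns it
        seg = index[start:end]
        trend_up = seg[-1] >= seg[0]
        strategies.append("MinRiskOpt" if trend_up else "MaxReturnOpt")
        start = end
    return strategies
```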
D. Evaluation by backtesting
The following metrics were employed to assess the portfolio
performance.
• Total Return (TR): TR refers to the total return earned from investing in an investment product within a given period. The formula is: TR = valuation amount + cumulative distributions received + cumulative amount sold - cumulative amount bought. This study does not incorporate taxes or trading commissions.
• Maximum Drawdown (MDD): MDD refers to the rate of decline from the peak asset value. The formula is: MDD = (trough value - peak value) / peak value.
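The two metrics can be sketched on an equity curve as follows. The TR sketch is a price-only simplification of the paper's formula (it omits the distribution and buy/sell terms).

```python
import numpy as np

def total_return(equity: np.ndarray) -> float:
    """Total return in percent over the backtest (no taxes or
    commissions, matching the paper's simplification)."""
    return (equity[-1] / equity[0] - 1.0) * 100.0

def max_drawdown(equity: np.ndarray) -> float:
    """Maximum drawdown in percent: (trough - peak) / peak,
    taken over all running peaks of the equity curve."""
    peaks = np.maximum.accumulate(equity)
    return float(((equity - peaks) / peaks).min() * 100.0)
```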
TABLE III
BACKTESTING (SSAAM)

Rebalance   Regime   Algorithm     TR [%]     MDD [%]
30-days     5        CPD-EVaR++    810.9915   26.8629
30-days     5        CPD-EVaR+     594.7410   26.8629
30-days     10       CPD-EVaR++    485.5201   45.0235
30-days     10       CPD-EVaR+     392.1392   42.4803
90-days     5        CPD-EVaR++    535.7349   27.6386
90-days     5        CPD-EVaR+     410.8530   27.6386
90-days     10       CPD-EVaR++    417.8354   27.7646
90-days     10       CPD-EVaR+     373.5849   27.7646
180-days    5        CPD-EVaR++    152.0988   27.3924
180-days    5        CPD-EVaR+     131.2210   27.3924
180-days    10       CPD-EVaR++    169.2992   25.3050
180-days    10       CPD-EVaR+     232.4513   25.3050
TABLE IV
BACKTESTING (COMPARISON)

Rebalance   Algorithm   TR [%]     MDD [%]
30-days     EVaR        587.9630   46.6651
30-days     CVaR        558.7446   44.4532
30-days     MV          527.2827   42.9851
90-days     EVaR        500.1421   44.9860
90-days     CVaR        496.7423   44.0592
90-days     MV          459.1195   42.7358
180-days    EVaR        353.2412   44.7714
180-days    CVaR        382.9451   44.2525
180-days    MV          360.4298   42.8165
IV. DISCUSSION & CONCLUSION
Table III shows that the more frequent the regular rebalancing, the higher the total return. In addition, the maximum drawdowns hovered between 25% and 45%, which is considered acceptable to the average systematic trader. The experiment was conducted separately with five regimes and with ten regimes: the total return was higher with five regimes, whereas the maximum drawdown was almost the same in both settings. Moreover, as hypothesized, CPD-EVaR++, which combines risk-minimization and return-maximization operations, performed better than the other strategies. Under this method, the best practice for managing equity portfolios is therefore to use CPD-EVaR++ with irregular rebalancing under five regimes, in addition to regular rebalancing every 30 days.
Table IV reports backtests of the comparison methods using the same parameters as in Table III. The results show that EVaR optimization performed better than the other algorithms, similar to the results of Cajas [11]. This may be because the computational efficiency of EVaR in stochastic optimization problems is higher than that of other risk measures, such as CVaR.
This study demonstrates the utility of financial text in asset
allocation with equity portfolios. In the future, we would like
to develop a tactical asset allocation strategy that mixes stocks
and other asset classes, such as bonds. In the future, we would
also like to apply this research to monetary policy and other
macroeconomic analyses.
ACKNOWLEDGMENT
This work was supported by the JST-Mirai Program Grant
Number JPMJMI20B1, Japan. The authors declare that the
research was conducted without any commercial or financial
relationships that could be construed as potential conflicts of
interest.
REFERENCES
[1] Kieran Wood, Stephen Roberts, and Stefan Zohren. Slow momentum with fast reversion: A trading strategy using deep learning and change-point detection. The Journal of Financial Data Science, 4(1):111–129, Dec. 2021.
[2] Masatake Ito, Kabun Jo, and Norio Hibiki. Application of asset allocation models in practice and mutual fund design [in Japanese]. Operations Research as a Management Science, 66(10):683–689, 2021.
[3] Julian Salazar, Davis Liang, Toan Q. Nguyen, and Katrin Kirchhoff.
Masked language model scoring. In Proceedings of the 58th Annual
Meeting of the Association for Computational Linguistics, pages 2699–
2712, Online, July 2020. Association for Computational Linguistics.
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding, 2019.
[5] Aapo Hyvärinen, Kun Zhang, Shohei Shimizu, and Patrik O. Hoyer. Estimation of a structural vector autoregression model using non-Gaussianity. Journal of Machine Learning Research, 11(5), 2010.
[6] Jushan Bai. Estimating multiple breaks one at a time. Econometric
theory, 13(3):315–352, 1997.
[7] Piotr Fryzlewicz. Wild binary segmentation for multiple change-point
detection. The Annals of Statistics, 42(6):2243–2281, 2014.
[8] Charles Truong, Laurent Oudre, and Nicolas Vayatis. Selective review of
offline change point detection methods. Signal Processing, 167:107299,
2020.
[9] A. Ahmadi-Javid. An information-theoretic approach to constructing
coherent risk measures. In 2011 IEEE International Symposium on
Information Theory Proceedings, pages 2125–2127, 2011.
[10] Amir Ahmadi-Javid. Entropic value-at-risk: A new coherent risk mea-
sure. Journal of Optimization Theory and Applications, 155(3):1105–
1123, 2012.
[11] Dany Cajas. Entropic portfolio optimization: a disciplined convex
programming framework. Available at SSRN 3792520, 2021.
[12] Oleg Polakow. vectorbt (1.4.2), 2022.
[13] Dany Cajas. Riskfolio-lib (3.0.0), 2022.