AlphaStock: A Buying-Winners-and-Selling-Losers Investment Strategy using Interpretable Deep Reinforcement Attention Networks
Jingyuan Wang¹,⁴, Yang Zhang¹, Ke Tang², Junjie Wu³,⁴,*, Zhang Xiong¹
1. MOE Engineering Research Center of Advanced Computer Application Technology, School of Computer Science and Engineering, Beihang University, Beijing, China
2. Institute of Economics, School of Social Sciences, Tsinghua University, Beijing, China
3. Beijing Key Laboratory of Emergency Support Simulation Technologies for City Operations, School of Economics and Management, Beihang University, Beijing, China
4. Beijing Advanced Innovation Center for BDBC, Beihang University, Beijing, China
* Corresponding author.
ABSTRACT
Recent years have witnessed the successful marriage of finance innovations and AI techniques in various finance applications including quantitative trading (QT). Despite great research efforts devoted to leveraging deep learning (DL) methods for building better QT strategies, existing studies still face serious challenges especially from the side of finance, such as the balance of risk and return, the resistance to extreme loss, and the interpretability of strategies, which limit the application of DL-based strategies in real-life financial markets. In this work, we propose AlphaStock, a novel reinforcement learning (RL) based investment strategy enhanced by interpretable deep attention networks, to address the above challenges. Our main contributions are summarized as follows: i) We integrate deep attention networks with a Sharpe ratio-oriented reinforcement learning framework to achieve a risk-return balanced investment strategy; ii) We suggest modeling interrelationships among assets to avoid selection bias and develop a cross-asset attention mechanism; iii) To our best knowledge, this work is among the first to offer an interpretable investment strategy using deep reinforcement learning models. The experiments on long-period U.S. and Chinese markets demonstrate the effectiveness and robustness of AlphaStock over diverse market states. It turns out that AlphaStock tends to select as winners the stocks with high long-term growth, low volatility, high intrinsic value, and recent undervaluation.
CCS CONCEPTS
• Applied computing → Economics; • Computing methodologies → Reinforcement learning; Neural networks.
KEYWORDS
Investment Strategy, Reinforcement Learning, Deep Learning,
Interpretable Prediction
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
KDD '19, August 4–8, 2019, Anchorage, AK, USA
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6201-6/19/08...$15.00
https://doi.org/10.1145/3292500.3330647
ACM Reference Format:
Jingyuan Wang, Yang Zhang, Ke Tang, Junjie Wu, and Zhang Xiong. 2019. AlphaStock: A Buying-Winners-and-Selling-Losers Investment Strategy using Interpretable Deep Reinforcement Attention Networks. In The 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '19), August 4–8, 2019, Anchorage, AK, USA. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3292500.3330647
1 INTRODUCTION
Given their ability to handle large-scale transactions and offer rational decision-making, quantitative trading (QT) strategies have long been adopted in financial institutions and hedge funds and have achieved spectacular successes. Traditional QT strategies are usually based on specific financial logics. For instance, the momentum phenomenon found by Jegadeesh and Titman in the stock market [14] was used to build momentum strategies. The mean reversion strategy [20] proposed by Poterba and Summers assumes that asset prices tend to move to their averages over time, so the bias of asset prices from their means can be used to select investment targets. The multi-factor strategy [7] uses factor-based asset valuations to select assets. Most of these traditional QT strategies, though equipped with solid financial theories, can only leverage some specific characteristic of financial markets, and therefore might be vulnerable to complex markets with diverse states.
In recent years, deep learning (DL) has emerged as an effective way to extract multi-aspect characteristics from complex financial signals. Many supervised deep neural networks have been proposed in the literature to predict asset prices using various factors, such as frequency of prices [11], economic news [12], social media [27], and financial events [4, 5]. Deep neural networks are also adopted in reinforcement learning (RL) frameworks to enhance traditional shallow investment strategies [3, 6, 16]. Despite the rich studies above, applying DL to real-life financial markets still faces several challenges:
Challenge 1: Balancing return and risk. Most existing supervised deep learning models in finance focus on price prediction without risk awareness, which is not in line with fundamental investment principles and may lead to suboptimal performance [8]. While some RL-based strategies [8, 17] have considered this problem, how to adopt state-of-the-art DL approaches into risk-return-balanced RL frameworks is not yet well studied.
Challenge 2: Modeling interrelationships among assets. Many financial tools in the market can be used to derive risk-aware profits from the interrelationships among assets, such as hedging, arbitrage, and the BWSL strategy used in this work. However, existing DL/RL-based investment strategies have paid little attention to this important information.
Challenge 3: Interpreting investment strategies. There is a long-standing voice arguing that DL-based systems are "unexplainable black boxes" and therefore cannot be used in crucial applications like medicine, investment and military [9]. RL-based strategies with deep structures make it even worse. How to extract interpretable rules from DL-enabled strategies remains an open problem.
In this paper, we propose AlphaStock, a novel reinforcement learning based strategy using deep attention networks, to overcome the above challenges. AlphaStock is essentially a buying-winners-and-selling-losers (BWSL) strategy for stock assets. It consists of three components. The first is a Long Short-Term Memory with History state Attention (LSTM-HA) network, which is used to extract asset representations from multiple time series. The second component is a Cross-Asset Attention Network (CAAN), which fully models the interrelationships among assets as well as the asset price rising prior. The third is a portfolio generator, which gives the investment proportion of each asset according to the winner scores output by the attention networks. We use an RL framework to optimize our model towards a return-risk-balanced objective, i.e., maximizing the Sharpe ratio. In this way, the merit of representation learning via deep attention models and the merit of risk-return balancing via Sharpe-ratio-targeted reinforcement learning are integrated naturally. Moreover, to gain interpretability for AlphaStock, we propose a sensitivity analysis method to unveil how our model selects an asset to invest in according to its multi-aspect features.
Extensive experiments on long-period U.S. stock markets demonstrate that our AlphaStock strategy outperforms state-of-the-art competitors in terms of a variety of evaluation measures. In particular, AlphaStock shows excellent adaptability to diverse market states (enabled by RL and the Sharpe ratio) and an exceptional ability for extreme loss control (enabled by CAAN). Extended experiments on Chinese stock markets further confirm the superiority and robustness of AlphaStock. Interestingly, the interpretation analysis reveals that AlphaStock selects assets by following the principle of "selecting as winners the stocks with high long-term growth, low volatility, high intrinsic value, and recent undervaluation".
2 PRELIMINARIES
In this section, we rst introduce the nancial concepts used
throughout this paper, and then formally dene our problem.
2.1 Basic Financial Concepts
Definition 1 (Holding Period). A holding period is the minimum time unit for investing in an asset. We divide the time axis into sequential holding periods of fixed length, such as one day or one month. We call the starting time of the $t$-th holding period time $t$.
Definition 2 (Sequential Investment). A sequential investment is a sequence of holding periods. For the $t$-th holding period, a strategy uses original capital to invest in assets at time $t$, and gets profits (which could be negative) at time $t+1$. The capital plus profits of the $t$-th holding period are used as the original capital of the $(t+1)$-th holding period.
Definition 3 (Asset Price). The price of an asset is defined as a time series $p^{(i)} = \{p^{(i)}_1, p^{(i)}_2, \ldots, p^{(i)}_t, \ldots\}$, where $p^{(i)}_t$ denotes the price of asset $i$ at time $t$.
In this work, we use a stock as the asset type to describe our model, which could be extended to other types of assets by taking asset specificities and transaction rules into consideration.
Definition 4 (Long Position). A long position is the trading operation that first buys an asset at time $t_1$ and then sells it at $t_2$. The profit of a long position during the period from $t_1$ to $t_2$ for asset $i$ is $u_i \left( p^{(i)}_{t_2} - p^{(i)}_{t_1} \right)$, where $u_i$ is the buying volume of asset $i$.

In a long position, traders expect an asset to rise in price, so they buy the asset first and wait for the price to rise to earn profits.
Definition 5 (Short Position). A short position is the trading operation that first sells an asset at $t_1$ and then buys it back at $t_2$. The profit of a short position during the period from $t_1$ to $t_2$ for asset $i$ is $u_i \left( p^{(i)}_{t_1} - p^{(i)}_{t_2} \right)$, where $u_i$ is the selling volume of asset $i$.

A short position is the reverse of a long position. Traders' expectation in a short position is that the price will drop, so they sell at a price higher than the price at which they buy the asset back later. In the stock market, a short position trader borrows stocks from a broker and sells them at $t_1$. At $t_2$, the trader buys the sold stocks back and returns them to the broker.
Definition 6 (Portfolio). Given an asset pool with $I$ assets, a portfolio is defined as a vector $\mathbf{b} = (b^{(1)}, \ldots, b^{(i)}, \ldots, b^{(I)})$, where $b^{(i)}$ is the proportion of the investment on asset $i$, with $\sum_{i=1}^{I} b^{(i)} = 1$.

Assume we have a collection of portfolios $\{\mathbf{b}_1, \ldots, \mathbf{b}_j, \ldots, \mathbf{b}_J\}$. The investment on portfolio $\mathbf{b}_j$ is $M^{(j)}$, with $M^{(j)} \geq 0$ when taking a long position on $\mathbf{b}_j$, and $M^{(j)} \leq 0$ when taking a short position. We then have the following important definition.
Definition 7 (Zero-investment Portfolio). A zero-investment portfolio is a collection of portfolios that has a net total investment of zero when the portfolios are assembled. That is, for a zero-investment portfolio containing $J$ portfolios, the total investment $\sum_{j=1}^{J} M^{(j)} = 0$.
For instance, an investor may borrow $1,000 worth of stocks in one set of companies and sell them as a short position, and then use the proceeds of the short selling to purchase $1,000 worth of stocks in another set of companies as a long position. The assembly of the long and short positions is a zero-investment portfolio. Note that while the name is "zero-investment", there still exists a budget constraint that limits the overall worth of stocks that can be borrowed from the broker. Also, we ignore real-world transaction costs for simplicity.
2.2 The BWSL Strategy
In this paper, we adopt the buying-winners-and-selling-losers (BWSL) strategy for stock trading [14], the key of which is to buy the assets with a high price rising rate (winners) and sell those with a low price rising rate (losers). We execute the BWSL strategy as a zero-investment portfolio consisting of two portfolios: a long portfolio for buying winners and a short portfolio for selling losers. Given a sequential investment with $T$ periods, we denote the short portfolio for the $t$-th period as $\mathbf{b}^-_t$ and the long portfolio as $\mathbf{b}^+_t$, $t = 1, \ldots, T$.
At time $t$, given a budget constraint $\tilde{M}$, we borrow the "loser" stocks from brokers according to the investment proportions in $\mathbf{b}^-_t$. The volume of stock $i$ that we can borrow is
$$u^{-(i)}_t = \tilde{M} \cdot b^{-(i)}_t / p^{(i)}_t, \quad (1)$$
where $b^{-(i)}_t$ is the proportion of stock $i$ in $\mathbf{b}^-_t$. Next, we sell the "loser" stocks we borrowed and get the money $\tilde{M}$. After that, we use $\tilde{M}$ to buy the "winner" stocks according to the long portfolio $\mathbf{b}^+_t$. The volume of stock $i$ that we can buy at time $t$ is
$$u^{+(i)}_t = \tilde{M} \cdot b^{+(i)}_t / p^{(i)}_t. \quad (2)$$
The money $\tilde{M}$ we use to buy winner stocks is the proceeds of short selling, so the net investment on the portfolio $\{\mathbf{b}^+_t, \mathbf{b}^-_t\}$ is zero.

At the end of the $t$-th holding period, we sell the stocks in the long portfolio. The money we get is the proceeds of selling these stocks at the new prices at $t+1$, i.e.,
$$M^+_t = \sum_{i=1}^{I} u^{+(i)}_t\, p^{(i)}_{t+1} = \sum_{i=1}^{I} \tilde{M} \cdot b^{+(i)}_t \frac{p^{(i)}_{t+1}}{p^{(i)}_t}. \quad (3)$$
Next, we buy the stocks in the short portfolio back and return them to the broker. The money we spend on buying the short stocks is
$$M^-_t = \sum_{i=1}^{I} u^{-(i)}_t\, p^{(i)}_{t+1} = \sum_{i=1}^{I} \tilde{M} \cdot b^{-(i)}_t \frac{p^{(i)}_{t+1}}{p^{(i)}_t}. \quad (4)$$
The ensemble prot earned by the long and short portfolios is
Mt=M+
tM
t
. Let
z(i)
t=p(i)
t+1/p(i)
t
denote the price rising rate of
stock
i
in the
t
-th holding period. Then, the rate of return of the
ensemble portfolio is calculated as
Rt=Mt
˜
M
=
I
i=1
b+(i)
tz(i)
t
I
i=1
b−(i)
tz(i)
t.(5)
Insight I. As shown in Eq. (5), a positive profit, i.e., $R_t > 0$, means the average price rising rate of stocks in the long portfolio is higher than that in the short portfolio, i.e.,
$$\sum_{i=1}^{I} b^{+(i)}_t z^{(i)}_t > \sum_{i=1}^{I} b^{-(i)}_t z^{(i)}_t. \quad (6)$$
A profitable BWSL strategy must ensure that the stocks in the portfolio $\mathbf{b}^+$ have a higher average price rising rate than the stocks in $\mathbf{b}^-$. That is to say, even if the prices of all stocks in the market are falling, as long as we can ensure that the prices of stocks in $\mathbf{b}^+$ fall more slowly than those in $\mathbf{b}^-$, we can still earn profits. On the contrary, even if the prices of all stocks are rising, if the stocks in $\mathbf{b}^-$ rise faster than those in $\mathbf{b}^+$, our strategy still loses money. This characteristic implies that the absolute price rise or fall of stocks is not the main concern of our strategy; rather, the relative price relations among stocks are much more important. As a consequence, we must design a mechanism to describe the interrelationships of stock prices in our model for the BWSL strategy.
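To make Eqs. (1)-(5) concrete, the following is a minimal NumPy sketch of one holding period of the zero-investment BWSL portfolio. The function and variable names (bwsl_period_return, b_long, b_short) are our own illustration, not from the paper.

```python
import numpy as np

def bwsl_period_return(b_long, b_short, p_t, p_t1):
    """One holding period of the zero-investment BWSL portfolio, Eqs. (1)-(5).

    b_long, b_short: investment proportions (each sums to 1) over I stocks.
    p_t, p_t1: stock prices at the start and end of the period.
    """
    z = p_t1 / p_t                      # price rising rates z_t^{(i)}
    # Rate of return of the ensemble portfolio, Eq. (5); the budget M~
    # cancels out, so R_t depends only on proportions and rising rates.
    return b_long @ z - b_short @ z

# Toy example with I = 4 stocks.
p_t  = np.array([10.0, 20.0, 5.0, 8.0])
p_t1 = np.array([11.0, 19.0, 5.5, 7.6])
b_long  = np.array([0.5, 0.0, 0.5, 0.0])   # buy winners
b_short = np.array([0.0, 0.5, 0.0, 0.5])   # sell losers
print(bwsl_period_return(b_long, b_short, p_t, p_t1))  # 0.15: winners outpace losers
```

Note that the return can be positive even when every price falls, as long as the long portfolio falls more slowly, which is exactly the point of Insight I.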
2.3 Optimization Objective
In order to ensure that our strategy considers both the return and the risk of an investment, we adopt the Sharpe ratio, a risk-adjusted return measure developed by the Nobel laureate William F. Sharpe [21] in 1994, to measure the performance of our strategy.
[Figure: Stock History States (per-stock feature sequences over the look-back window) feed parallel LSTM-HA networks; their representations feed the CAAN; the winner scores feed the Portfolio Generator.]
Figure 1: The framework of the AlphaStock model.
Definition 8 (Sharpe Ratio). The Sharpe ratio is the average return in excess of the risk-free return per unit of volatility. Given a sequential investment that contains $T$ holding periods, its Sharpe ratio is calculated as
$$H_T = \frac{A_T - \Theta}{V_T}, \quad (7)$$
where $A_T$ is the average rate of return per period of the investment, $V_T$ is the volatility that is used to measure the risk of the investment, and $\Theta$ is a risk-free return rate, such as the return rate of a bank deposit.
Given a sequential investment with $T$ holding periods, $A_T$ is calculated as
$$A_T = \frac{1}{T} \sum_{t=1}^{T} \left( R_t - TC_t \right), \quad (8)$$
where $TC_t$ is the transaction cost in the $t$-th period. The volatility $V_T$ in Eq. (7) is defined as
$$V_T = \sqrt{\frac{\sum_{t=1}^{T} \left( R_t - \bar{R}_t \right)^2}{T}}, \quad (9)$$
where $\bar{R}_t = \sum_{t=1}^{T} R_t / T$ is the average of $R_t$.
For a $T$-period investment, the optimization objective of our strategy is to generate the long and short portfolio sequences $B^+ = \{\mathbf{b}^+_1, \ldots, \mathbf{b}^+_T\}$ and $B^- = \{\mathbf{b}^-_1, \ldots, \mathbf{b}^-_T\}$ that maximize the Sharpe ratio of the investment:
$$\arg\max_{\{B^+, B^-\}} H_T\left( B^+, B^- \right). \quad (10)$$
Insight II. The Sharpe ratio evaluates the performance of a strategy from both the profit and the risk perspectives. This profit-risk balance requires that our model not only focus on maximizing the return rate $R_t$ for each period, but also consider the long-term volatility of $R_t$ across all periods of an investment. In other words, designing a far-sighted, steady investment strategy is more valuable than a short-sighted strategy with short-term high profits.
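As a worked illustration of Eqs. (7)-(9), here is a small sketch of our own; the arguments theta and tc are assumed inputs.

```python
import numpy as np

def sharpe_ratio(R, tc, theta=0.0):
    """Sharpe ratio H_T of a T-period investment, Eqs. (7)-(9).

    R: array of per-period rates of return R_t (Eq. (5)).
    tc: array of per-period transaction costs TC_t.
    theta: risk-free return rate Theta.
    """
    A_T = np.mean(R - tc)        # Eq. (8): average net return per period
    V_T = np.std(R)              # Eq. (9): population std of R_t (divide by T)
    return (A_T - theta) / V_T   # Eq. (7)

R  = np.array([0.02, -0.01, 0.03, 0.01])
tc = np.full_like(R, 0.001)
print(sharpe_ratio(R, tc))
```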
3 THE ALPHASTOCK MODEL
In this section, we propose a reinforcement learning (RL) based model called AlphaStock to implement a BWSL strategy with the Sharpe ratio defined in Eq. (7) as the optimization objective. As shown in Fig. 1, AlphaStock contains three components. The first component is an LSTM with History state Attention network (LSTM-HA). For each stock $i$, we use the LSTM-HA model to extract a stock representation $r^{(i)}$ from its history states $X^{(i)}$. The second component is a Cross-Asset Attention Network (CAAN) that describes the interrelationships among the stocks. The CAAN takes as input the representations $r^{(i)}$ of all stocks and estimates a winner score $s^{(i)}$ for every stock, which indicates the degree to which stock $i$ is a winner. The third component is a portfolio generator, which calculates the investment proportions in $\mathbf{b}^+$ and $\mathbf{b}^-$ according to the scores $s^{(i)}$ of all stocks. We use reinforcement learning to optimize the three components end to end as a whole, where the Sharpe ratio of a sequential investment is maximized in a far-sighted way.
3.1 Raw Stock Features
The stock features used in our model fall into two categories. The first category is the trading features, which describe the trading information of a stock. At time $t$, the trading features include:

Price Rising Rate (PR): the price rising rate of a stock during the last holding period, defined as $p^{(i)}_t / p^{(i)}_{t-1}$ for stock $i$.

Fine-grained Volatility (VOL): a holding period can be further divided into many sub-periods. We set one month as a holding period in our experiments, so a sub-period can be a trading day. VOL is defined as the standard deviation of the prices over all sub-periods from $t-1$ to $t$.

Trade Volume (TV): the total quantity of the stock traded from $t-1$ to $t$. It reflects the market activity of a stock.

The second category is the company features, which describe the financial condition of the company that issues a stock. At time $t$, the company features include:

Market Capitalization (MC): for stock $i$, the product of the price $p^{(i)}_t$ and the outstanding shares of the stock.

Price-earnings Ratio (PE): the ratio of the market capitalization of a company to its annual earnings.

Book-to-market Ratio (BM): the ratio of the book value of a company to its market value.

Dividend (Div): the reward from the company's earnings to stock holders during the $(t-1)$-th holding period.

Since the values of these features are not on the same scale, we standardize them into Z-scores.
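A minimal sketch (our own; the column order and the axis of standardization are assumptions, since the paper does not specify them) of assembling and Z-scoring these features:

```python
import numpy as np

def zscore(x, axis=0):
    """Standardize features to zero mean and unit variance (Z-scores)."""
    return (x - x.mean(axis=axis, keepdims=True)) / (x.std(axis=axis, keepdims=True) + 1e-8)

# features: (num_stocks, num_features) for one holding period; columns
# assumed to be [PR, VOL, TV, MC, PE, BM, Div].
features = np.random.rand(1000, 7)
x_t = zscore(features)   # one standardized history-state vector per stock
```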
3.2 Stock Representation Extraction
The performance of a stock has close relations with its history states. In the AlphaStock model, we propose a Long Short-Term Memory with History state Attention (LSTM-HA) model to learn the representation of a stock from its history features.

The sequential representation. In the LSTM-HA network, we use the vector $\tilde{x}_t$ to denote the history state of a stock at time $t$, which consists of the stock features given in Section 3.1. We call the last $K$ historical holding periods at time $t$, i.e., the period from time $t-K$ to time $t$, the look-back window of $t$. The history states of a stock in the look-back window are denoted as a sequence $X = \{x_1, \ldots, x_k, \ldots, x_K\}$,¹ where $x_k = \tilde{x}_{t-K+k}$. Our model uses a Long Short-Term Memory (LSTM) network [10] to recursively encode $X$ into a vector as
$$h_k = \mathrm{LSTM}\left( h_{k-1}, x_k \right), \quad k \in [1, K], \quad (11)$$
where $h_k$ is the hidden state encoded by the LSTM at step $k$. The hidden state $h_K$ at the last step is used as a representation of the stock. It contains the sequential dependence among the elements of $X$.

¹ We also use $X$ to denote the matrix $(x_k)$; the two definitions are interchangeable.
The history state attention. The representation $h_K$ can fully exploit the sequential dependence of the elements of $X$, but the global and long-range dependences within $X$ are not effectively modeled. Therefore, we adopt a history state attention to enhance $h_K$ using all intermediate hidden states $h_k$. Specifically, following the standard attention mechanism [22], the history-state-attention enhanced representation, denoted as $r$, is calculated as
$$r = \sum_{k=1}^{K} \mathrm{ATT}\left( h_K, h_k \right) h_k, \quad (12)$$
where $\mathrm{ATT}(\cdot, \cdot)$ is an attention function defined as
$$\mathrm{ATT}\left( h_K, h_k \right) = \frac{\exp(\alpha_k)}{\sum_{k'=1}^{K} \exp(\alpha_{k'})}, \qquad \alpha_k = w \cdot \tanh\left( W^{(1)} h_k + W^{(2)} h_K \right). \quad (13)$$
Here, $w$, $W^{(1)}$ and $W^{(2)}$ are parameters to learn.

For the $i$-th stock at time $t$, the history-state-attention enhanced representation is denoted as $r^{(i)}_t$. It contains both the sequential and global dependences of stock $i$'s history states from time $t-K+1$ to time $t$. In our model, the representation vectors of all stocks are extracted by the same LSTM-HA network. The parameters $w$, $W^{(1)}$, $W^{(2)}$ and those of the LSTM network in Eq. (11) are shared by all stocks. In this way, the representations extracted by LSTM-HA are relatively stable and general for all stocks rather than tailored to a particular one.

Remark. A major advantage of LSTM-HA is that it learns both the sequential and global dependences from stock history states. Compared with existing studies that only use a recurrent neural network to extract the sequential dependence in history states [3, 17] or directly stack history states as an input vector of an MLP [16] to learn the global dependence, our model describes stock histories more comprehensively. It is worth mentioning that LSTM-HA is also an open framework. Representations learned from other types of information sources, such as news, events and social media [4, 12, 27], could also be concatenated or attended with $r^{(i)}_t$.
3.3 Winners and Losers Selection
In traditional RL-based strategy models, the investment portfolio is often directly generated from the stock representations through a softmax normalization [3, 6, 16]. The drawback of this type of method is that it does not fully exploit the interrelationships among stocks, which however are very important for the BWSL strategy, as analyzed in Insight I of Section 2.2. In light of this, we propose a Cross-Asset Attention Network (CAAN) to describe the interrelationships among stocks.
The basic CAAN model. The CAAN model adopts the self-attention mechanism proposed in Ref. [24] to model the interrelationships among stocks. Specifically, given the stock representation $r^{(i)}$ (we omit time $t$ without loss of generality), we calculate a query vector $q^{(i)}$, a key vector $k^{(i)}$ and a value vector $v^{(i)}$ for stock $i$ as
$$q^{(i)} = W^{(Q)} r^{(i)}, \quad k^{(i)} = W^{(K)} r^{(i)}, \quad v^{(i)} = W^{(V)} r^{(i)}, \quad (14)$$
where $W^{(Q)}$, $W^{(K)}$ and $W^{(V)}$ are parameters to learn. The interrelationship of stock $j$ to stock $i$ is modeled by using the query $q^{(i)}$ of stock $i$ to query the key $k^{(j)}$ of stock $j$, i.e.,
$$\beta_{ij} = \frac{q^{(i)\top} \cdot k^{(j)}}{\sqrt{D_k}}, \quad (15)$$
where $D_k$ is a re-scaling parameter set following Ref. [24]. Then, we use the normalized interrelationships $\{\beta_{ij}\}$ as weights to sum the values $\{v^{(j)}\}$ of the other stocks into an attention score:
$$a^{(i)} = \sum_{j=1}^{I} \mathrm{SATT}\left( q^{(i)}, k^{(j)} \right) \cdot v^{(j)}, \quad (16)$$
where the self-attention function $\mathrm{SATT}(\cdot, \cdot)$ is a softmax normalization of the interrelationships $\beta_{ij}$, i.e.,
$$\mathrm{SATT}\left( q^{(i)}, k^{(j)} \right) = \frac{\exp\left( \beta_{ij} \right)}{\sum_{j'=1}^{I} \exp\left( \beta_{ij'} \right)}. \quad (17)$$
We use a fully connected layer to transform the attention vector $a^{(i)}$ into a winner score as
$$s^{(i)} = \mathrm{sigmoid}\left( w^{(s)\top} \cdot a^{(i)} + e^{(s)} \right), \quad (18)$$
where $w^{(s)}$ and $e^{(s)}$ are the connection weights and the bias to learn. The winner score $s^{(i)}_t$ indicates the degree of stock $i$ being a winner in the $t$-th holding period. A stock with a higher score is more likely to be a winner.
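A sketch of the basic CAAN in PyTorch, our own illustration of Eqs. (14)-(18):

```python
import math
import torch
import torch.nn as nn

class CAAN(nn.Module):
    """Basic Cross-Asset Attention Network, Eqs. (14)-(18) (illustrative sketch)."""
    def __init__(self, rep_dim, att_dim):
        super().__init__()
        self.WQ = nn.Linear(rep_dim, att_dim, bias=False)   # W^(Q), Eq. (14)
        self.WK = nn.Linear(rep_dim, att_dim, bias=False)   # W^(K)
        self.WV = nn.Linear(rep_dim, att_dim, bias=False)   # W^(V)
        self.score = nn.Linear(att_dim, 1)                  # w^(s), e^(s), Eq. (18)
        self.scale = math.sqrt(att_dim)                     # D_k re-scaling, Eq. (15)

    def forward(self, r):                   # r: (num_stocks, rep_dim)
        q, k, v = self.WQ(r), self.WK(r), self.WV(r)
        beta = q @ k.t() / self.scale       # Eq. (15): pairwise interrelationships
        att = torch.softmax(beta, dim=-1)   # Eq. (17): normalize over stocks j
        a = att @ v                         # Eq. (16): attention vectors a^(i)
        return torch.sigmoid(self.score(a)).squeeze(-1)  # Eq. (18): winner scores

scores = CAAN(rep_dim=64, att_dim=64)(torch.randn(1000, 64))
```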
Incorporating the price rising rank prior. In the basic CAAN, the interrelationships modeled by Eq. (15) are directly learned from data. In fact, we can use prior knowledge to help our model learn the stock interrelationships. We use $c^{(i)}_{t-1}$ to denote the rank of the price rising rate of stock $i$ in the last holding period (from $t-1$ to $t$). Inspired by methods for modeling positional information in the NLP field, we use the relative positions of stocks on the coordinate axis of $c^{(i)}_{t-1}$ as prior knowledge of the stock interrelationships. Specifically, given two stocks $i$ and $j$, we calculate their discrete relative distance on the coordinate axis of $c^{(i)}_{t-1}$ as
$$d_{ij} = \left\lfloor \frac{\left| c^{(i)}_{t-1} - c^{(j)}_{t-1} \right|}{Q} \right\rfloor, \quad (19)$$
where $Q$ is a preset quantization coefficient. We use a lookup matrix $L = (l_1, \ldots, l_L)$ to represent each discretized value of $d_{ij}$. Using $d_{ij}$ as the index, the corresponding column vector $l_{d_{ij}}$ is an embedding vector of the relative distance $d_{ij}$.

For a pair of stocks $i$ and $j$, we calculate a prior relation coefficient $\psi_{ij}$ using $l_{d_{ij}}$ as
$$\psi_{ij} = \mathrm{sigmoid}\left( w^{(L)\top} l_{d_{ij}} \right), \quad (20)$$
where $w^{(L)}$ is a learnable parameter. The relationship between $i$ and $j$ estimated by Eq. (15) is then rewritten as
$$\beta_{ij} = \frac{\psi_{ij}\, q^{(i)\top} \cdot k^{(j)}}{\sqrt{D_k}}. \quad (21)$$
In this way, the relative positions of stocks in the price rising rate rank are introduced as a weight to enhance or weaken the attention coefficient. Stocks with similar historical price rising rates will have a stronger interrelationship in the attention and thus similar winner scores.
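The rank prior of Eqs. (19)-(21) could be grafted onto the CAAN sketch above roughly as follows; this is again our own illustration, and Q, the number of bins, and the embedding size are assumed hyper-parameters:

```python
import torch
import torch.nn as nn

# Assumed hyper-parameters for illustration.
Q, num_bins, emb_dim = 4, 64, 16
L = nn.Embedding(num_bins, emb_dim)       # lookup matrix of distance embeddings
wL = nn.Linear(emb_dim, 1, bias=False)    # w^(L) in Eq. (20)

def prior_weights(c):
    """Prior relation coefficients psi_ij from price-rising ranks c, Eqs. (19)-(20)."""
    d = ((c[:, None] - c[None, :]).abs() / Q).floor().long()  # Eq. (19)
    d = d.clamp(max=num_bins - 1)
    return torch.sigmoid(wL(L(d))).squeeze(-1)                # Eq. (20): psi_ij

# Inside CAAN.forward, Eq. (21) would replace the plain scores with:
#   beta = prior_weights(c) * (q @ k.t()) / self.scale
c = torch.randperm(1000).float()          # ranks of last-period price rising rates
psi = prior_weights(c)                    # (1000, 1000) prior coefficients
```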
Remark. As shown in Eq. (16), for each stock $i$, the winner score $s^{(i)}$ is calculated according to the attention over all other stocks. In this way, the interrelationships among all stocks are involved in CAAN. This special attention mechanism meets the model design requirement of Insight I in Section 2.2.
3.4 Portfolio Generator
Given the winner scores $\{s^{(1)}, \ldots, s^{(i)}, \ldots, s^{(I)}\}$ of the $I$ stocks, our AlphaStock model generally buys the stocks with high winner scores and sells those with low winner scores. Specifically, we first sort the stocks in descending order of their winner scores and obtain the sequence number $o^{(i)}$ for each stock $i$. Let $G$ denote the preset size of the portfolios $\mathbf{b}^+$ and $\mathbf{b}^-$. If $o^{(i)} \in [1, G]$, stock $i$ enters the long portfolio $\mathbf{b}^+$, with the investment proportion calculated as
$$b^{+(i)} = \frac{\exp\left( s^{(i)} \right)}{\sum_{o^{(j)} \in [1, G]} \exp\left( s^{(j)} \right)}. \quad (22)$$
If $o^{(i)} \in (I-G, I]$, stock $i$ enters $\mathbf{b}^-$ with the proportion
$$b^{-(i)} = \frac{\exp\left( 1 - s^{(i)} \right)}{\sum_{o^{(j)} \in (I-G, I]} \exp\left( 1 - s^{(j)} \right)}. \quad (23)$$
The remaining stocks are not selected, for lack of clear buy/sell signals.

For simplicity, we can use one vector to record all the information of the two portfolios. That is, we form the vector $\mathbf{b}^c$ of length $I$, with $b^{c(i)} = b^{+(i)}$ if $o^{(i)} \in [1, G]$, $b^{c(i)} = b^{-(i)}$ if $o^{(i)} \in (I-G, I]$, and $b^{c(i)} = 0$ otherwise, $i = 1, \ldots, I$. In what follows, we use $\mathbf{b}^c$ and $\{\mathbf{b}^+, \mathbf{b}^-\}$ interchangeably as the output of our AlphaStock model for clarity.
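A NumPy sketch of Eqs. (22)-(23), our own illustration; G is the preset portfolio size:

```python
import numpy as np

def generate_portfolios(s, G):
    """Long/short proportions b+ and b- from winner scores s, Eqs. (22)-(23)."""
    I = len(s)
    order = np.argsort(-s)                 # stocks sorted by descending winner score
    top, bottom = order[:G], order[I - G:]
    b = np.zeros(I)                        # the combined vector b^c
    b[top] = np.exp(s[top]) / np.exp(s[top]).sum()                   # Eq. (22)
    b[bottom] = np.exp(1 - s[bottom]) / np.exp(1 - s[bottom]).sum()  # Eq. (23)
    return b                               # long weights, short weights, zeros

s = np.random.rand(1000)
bc = generate_portfolios(s, G=250)         # G set to 1/4 of the stocks, as in Sec. 5.1
```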
3.5 Optimization via Reinforcement Learning
We frame the AlphaStock strategy as an RL game with discrete agent actions to optimize the model parameters, where a $T$-period investment is modeled as a state-action-reward trajectory $\pi$ of an RL agent, i.e., $\pi = \{state_1, action_1, reward_1, \ldots, state_t, action_t, reward_t, \ldots, state_T, action_T, reward_T\}$. The $state_t$ is the history market state observed at $t$, which is expressed as $\mathcal{X}_t = (X^{(i)}_t)$. The $action_t$ is an $I$-dimensional binary vector, of which the element $action^{(i)}_t = 1$ when the agent invests in stock $i$ at $t$, and $0$ otherwise.² According to $state_t$, the agent has a probability $\Pr(action^{(i)}_t = 1)$ of investing in stock $i$, which is determined by AlphaStock as
$$\Pr\left( action^{(i)}_t = 1 \,\middle|\, \mathcal{X}_t, \theta \right) = \frac{1}{2} G^{(i)}\left( \mathcal{X}_t, \theta \right) = \frac{1}{2} b^{c(i)}_t, \quad (24)$$
where $G^{(i)}(\mathcal{X}_t, \theta)$ is the part of AlphaStock that generates $b^{c(i)}_t$, $\theta$ denotes the model parameters, and the factor $1/2$ ensures $\sum_{i=1}^{I} \Pr(action^{(i)}_t = 1) = 1$. Let $H_\pi$ denote the Sharpe ratio of $\pi$; then $reward_t$ is the contribution of $action_t$ to $H_\pi$, with $\sum_{t=1}^{T} reward_t = H_\pi$.
For all possible $\pi$, the average reward of the RL agent is
$$J(\theta) = \int_\pi H_\pi \Pr(\pi \mid \theta)\, d\pi, \quad (25)$$
where $\Pr(\pi \mid \theta)$ is the probability of generating $\pi$ from $\theta$. The objective of the RL model optimization is then to find the optimal parameters $\theta^* = \arg\max_\theta J(\theta)$.
² In the RL game, the actions of an agent are discrete, with the probability $b^{c(i)}_t / 2$ indicating whether to invest in stock $i$. In real investments, we allocate capital to stock $i$ according to the continuous proportion $b^{c(i)}_t$. This approximation is for the sake of problem solving.
We use the gradient ascent approach to iteratively optimize $\theta$ at round $\tau$ as $\theta_\tau = \theta_{\tau-1} + \eta \nabla J(\theta)\big|_{\theta = \theta_{\tau-1}}$, where $\eta$ is a learning rate. Given a training dataset that contains $N$ trajectories $\{\pi_1, \ldots, \pi_n, \ldots, \pi_N\}$, $\nabla J(\theta)$ can be approximately calculated as [23]
$$\nabla J(\theta) = \int_\pi H_\pi \Pr(\pi \mid \theta)\, \nabla \log \Pr(\pi \mid \theta)\, d\pi \approx \frac{1}{N} \sum_{n=1}^{N} H_{\pi_n} \sum_{t=1}^{T_n} \sum_{i=1}^{I} \nabla_\theta \log \Pr\left( action^{(i)}_t = 1 \,\middle|\, \mathcal{X}^{(n)}_t, \theta \right). \quad (26)$$
The gradient $\nabla_\theta \log \Pr(action^{(i)}_t = 1 \mid \mathcal{X}^{(n)}_t, \theta) = \nabla_\theta \log G^{(i)}(\mathcal{X}^{(n)}_t, \theta)$ is calculated by the back-propagation algorithm.
In order to ensure that the proposed model can beat the market, we introduce the threshold method [23] into our reinforcement learning. The gradient $\nabla J(\theta)$ in Eq. (26) is then rewritten as
$$\nabla J(\theta) = \frac{1}{N} \sum_{n=1}^{N} \left( H_{\pi_n} - H_0 \right) \sum_{t=1}^{T_n} \sum_{i=1}^{I} \nabla_\theta \log G^{(i)}\left( \mathcal{X}^{(n)}_t, \theta \right), \quad (27)$$
where the threshold $H_0$ is set to the Sharpe ratio of the overall market. In this way, the gradient ascent only encourages parameters that can outperform the market.
Remark. Eq. (27) uses $(H_{\pi_n} - H_0)$ to integrally weight the gradients $\nabla_\theta \log G$ of all holding periods in $\pi_n$. The reward is not given to any isolated step in $\pi_n$ but to all steps in $\pi_n$. This feature of our model meets the far-sightedness requirement of Insight II in Section 2.2.
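A condensed sketch of one training iteration under Eqs. (24) and (27); this is our own illustration: `model` stands for the composition of LSTM-HA, CAAN and the portfolio generator, and the episode data layout is assumed.

```python
import torch

def train_step(model, optimizer, episodes, H0):
    """One REINFORCE-style update following Eq. (27) (illustrative sketch).

    episodes: list of T-period trajectories, each a list of (X_t, actions_t, R_t)
    with X_t the history states of all stocks, actions_t the sampled I-dim 0/1
    action vector, and R_t the realized period return. H0: market Sharpe ratio.
    """
    loss = torch.zeros(())
    for episode in episodes:
        log_probs, returns = [], []
        for X_t, actions_t, R_t in episode:
            bc_t = model(X_t)                        # G(X_t, theta) -> b^c_t
            log_p = torch.log(bc_t / 2 + 1e-12)      # log Pr(action=1 | X_t), Eq. (24)
            log_probs.append((log_p * actions_t).sum())
            returns.append(float(R_t))
        R = torch.as_tensor(returns)
        H_pi = R.mean() / (R.std() + 1e-12)          # Sharpe ratio H_pi of the episode
        # Gradient ascent on (H_pi - H0) * sum of log-probs: minimize the negative.
        loss = loss - (H_pi - H0) * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```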
4 MODEL INTERPRETATION
In the AlphaStock model, the LSTM-HA and CAAN networks cast the raw stock features into winner scores. The final investment portfolios are directly generated from the winner scores. A natural follow-up question is: what kind of stocks would be selected as winners by AlphaStock? To answer this question, we propose a sensitivity analysis method [1, 25, 26] to interpret how the history features of a stock influence its winner score in our model.

We use $s = \mathcal{F}(X)$ to express the function from the history features $X$ of a stock to its winner score $s$. In our model, $s = \mathcal{F}(X)$ is the combined network of LSTM-HA and CAAN. We use $x_q$ to denote an element of $X$, which is the value of one feature (defined in Section 3.1) at a particular time period of the look-back window, e.g., the price rising rate of a stock three months ago.
Given the history state $X$ of a stock, the influence of $x_q$ on its winner score $s$, i.e., the sensitivity of $s$ to $x_q$, is expressed as
$$\delta_{x_q}(X) = \lim_{\Delta x_q \to 0} \frac{\mathcal{F}(X) - \mathcal{F}\left( x_q + \Delta x_q, X_{\neg x_q} \right)}{x_q - \left( x_q + \Delta x_q \right)} = \frac{\partial \mathcal{F}(X)}{\partial x_q}, \quad (28)$$
where $X_{\neg x_q}$ denotes the elements of $X$ except $x_q$.
For all possible stock states in a market, the average influence of the stock state feature $x_q$ on the winner score $s$ is
$$\bar{\delta}_{x_q} = \int_{D_X} \Pr(X)\, \delta_{x_q}(X)\, d\sigma, \quad (29)$$
where $\Pr(X)$ is the probability density function of $X$, and $\int_{D_X} \cdot\, d\sigma$ is an integral over all possible values of $X$. According to the Law of Large Numbers, given a dataset that contains the history states of $I$ stocks in $N$ holding periods, $\bar{\delta}_{x_q}$ is approximated as
$$\bar{\delta}_{x_q} = \frac{1}{I \times N} \sum_{n=1}^{N} \sum_{i=1}^{I} \delta_{x_q}\left( X^{(i)}_n \,\middle|\, X^{(\neg i)}_n \right), \quad (30)$$
where $X^{(i)}_n$ is the history state of the $i$-th stock in the $n$-th holding period, and $X^{(\neg i)}_n$ denotes the history states of the other stocks that are concurrent with the history state of the $i$-th stock.

We use $\bar{\delta}_{x_q}$ to measure the overall influence of a stock feature $x_q$ on the winner score. A positive value of $\bar{\delta}_{x_q}$ indicates that our model tends to take a stock as a winner when $x_q$ is large, and vice versa. For example, in the experiments to follow, we obtain $\bar{\delta} < 0$ for the fine-grained volatility feature, which means that our model tends to select low-volatility stocks as winners.
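Eq. (30) can be estimated with automatic differentiation; below is a minimal sketch of our own, assuming `model` maps the history states of all stocks in a period to their winner scores:

```python
import torch

def average_sensitivity(model, dataset):
    """Estimate the average influence of each feature on winner scores, Eq. (30).

    dataset: iterable of history-state tensors X with shape
    (num_stocks, K, num_features), one per holding period.
    Returns a (K, num_features) tensor of mean dF/dx_q values.
    """
    total, count = 0.0, 0
    for X in dataset:
        X = X.clone().requires_grad_(True)
        s = model(X)                        # winner scores s = F(X) for all stocks
        s.sum().backward()                  # gradients w.r.t. every element x_q
        # Accumulate over stocks; cross-stock gradient terms through CAAN are
        # folded in here for brevity.
        total = total + X.grad.sum(dim=0)
        count += X.shape[0]
    return total / count                    # average sensitivity per (period, feature)
```

A positive entry means the model favors winners with a large value of that feature at that look-back offset; a negative entry means the opposite.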
5 EXPERIMENT
In this section, we empirically evaluate our AlphaStock model on data from the U.S. markets. Data from the Chinese stock markets are also used for a robustness check.
5.1 Data and Experimental Setup
The U.S. stock market data used in our experiments are obtained from Wharton Research Data Services (WRDS)³. The time range of the data is from Jan. 1970 to Dec. 2016. This long time range covers several well-known market events, such as the dot-com bubble from 1995 to 2000 and the subprime mortgage crisis from 2007 to 2009, which enables evaluation over diverse market states. The stocks come from four markets: NYSE, NYSE American, NASDAQ, and NYSE Arca. The number of valid stocks is more than 1,000 per year. We use the data from Jan. 1970 to Jan. 1990 as the training and validation set, and the rest as the test set.

In the experiments, the holding period is set to one month, and the number of holding periods $T$ in an investment is set to 12, i.e., the Sharpe ratio reward is calculated every 12 months for RL. The look-back window size $K$ is set to 12, i.e., we look back on the 12-month history states of stocks. The size $G$ of the portfolios is set to 1/4 of the number of all stocks.

³ https://wrds-web.wharton.upenn.edu/wrds/
5.2 Baseline Methods
AlphaStock is compared with a number of baselines, including:
Market: the uniform buy-and-hold strategy [13];
Cross Sectional Momentum (CSM) [15] and Time Series Momentum (TSM) [18]: two classic momentum strategies;
Robust Median Reversion (RMR): a recently reported reversion strategy [13];
Fuzzy Deep Direct Reinforcement (FDDR): a recently reported RL-based BWSL strategy [3];
AlphaStock-NC (AS-NC): the AlphaStock model without the CAAN, where the outputs of LSTM-HA are directly used as the inputs of the portfolio generator;
AlphaStock-NP (AS-NP): the AlphaStock model without the price rising rank prior, i.e., using the basic CAAN.
The baselines TSM/CSM/RMR represent traditional financial strategies: TSM and CSM are based on the momentum logic, and RMR is based on the reversion logic. FDDR represents the state-of-the-art RL-based BWSL strategy. AS-NC and AS-NP are used as contrasts to verify the effectiveness of the CAAN and the price rising rank prior. The Market is used to indicate the state of the market.
5.3 Evaluation Measures
The most standard evaluation measure for investment strategies is cumulative wealth, which is defined as
$$CW_T = \prod_{t=1}^{T} \left( R_t + 1 - TC \right), \quad (31)$$
where $R_t$ is the rate of return defined in Eq. (5) and the transaction cost $TC$ is set to 0.1% in our experiments following Ref. [3].

The preferences of different investors vary. Therefore, we also use some other evaluation measures, including:

1) Annualized Percentage Rate (APR) is an annualized average of the return rate. It is defined as $APR_T = A_T \times N_Y$, where $N_Y$ is the number of holding periods in a year.

2) Annualized Volatility (AVOL) is an annualized average of the volatility. It is defined as $AVOL_T = V_T \times \sqrt{N_Y}$ and is used to measure the average risk of a strategy during a unit time period.

3) Annualized Sharpe Ratio (ASR) is the risk-adjusted annualized return based on APR and AVOL, defined as $ASR_T = APR_T / AVOL_T$.

4) Maximum DrawDown (MDD) is the maximum loss from a peak to a trough of a portfolio before a new peak is attained. It is another way to measure investment risk. The formal definition of MDD is
$$MDD_T = \max_{\tau \in [1,T]} \frac{\max_{t \in [1,\tau]} APR_t - APR_\tau}{\max_{t \in [1,\tau]} APR_t}. \quad (32)$$

5) Calmar Ratio (CR) is the risk-adjusted APR based on the maximum drawdown. It is calculated as $CR_T = APR_T / MDD_T$.

6) Downside Deviation Ratio (DDR) measures the downside risk of a strategy as the average of returns when they fall below a minimum acceptable return (MAR). It is the risk-adjusted APR based on the downside deviation. The formal definition of DDR is
$$DDR_T = \frac{APR_T}{\text{Downside Deviation}} = \frac{APR_T}{\sqrt{\mathbb{E}\left[ \min\left( R_t, MAR \right)^2 \right]}}, \quad t \in [1, T]. \quad (33)$$
In our experiments, the MAR is set to zero.
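A NumPy sketch of these measures (our own, under the annualization conventions above, with MAR = 0; MDD is computed on the wealth curve rather than the APR series for simplicity):

```python
import numpy as np

def evaluate(R, tc=0.001, n_year=12, theta=0.0, mar=0.0):
    """Evaluation measures of Section 5.3 for monthly returns R (illustrative)."""
    cw = np.cumprod(R + 1 - tc)                       # Eq. (31): cumulative wealth
    apr = np.mean(R - tc) * n_year                    # annualized percentage rate
    avol = np.std(R) * np.sqrt(n_year)                # annualized volatility
    asr = (apr - theta) / avol                        # annualized Sharpe ratio
    peak = np.maximum.accumulate(cw)
    mdd = np.max((peak - cw) / peak)                  # maximum drawdown, cf. Eq. (32)
    cr = apr / mdd                                    # Calmar ratio
    dd = np.sqrt(np.mean(np.minimum(R, mar) ** 2))    # downside deviation
    ddr = apr / dd                                    # Eq. (33)
    return dict(CW=cw[-1], APR=apr, AVOL=avol, ASR=asr, MDD=mdd, CR=cr, DDR=ddr)

print(evaluate(np.random.normal(0.01, 0.05, 120)))
```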
5.4 Performance in U.S. Markets
Fig. 2 compares the cumulative wealth of AlphaStock and the baselines. In general, the performance of AlphaStock (AS) is much better than that of the other baselines, which verifies the effectiveness of our model. Some interesting observations are highlighted as follows:

1) AlphaStock performs better than AlphaStock-NP, and AlphaStock-NP performs better than AlphaStock-NC, which indicates that the stock rank priors and the interrelationships modeled by CAAN are very helpful for the BWSL strategy.

2) FDDR is also a deep RL investment strategy, which extracts fuzzy representations of stocks using a recurrent deep neural network. In our experiments, AlphaStock-NC performs better than FDDR, indicating the advantage of our LSTM-HA network in stock representation learning.
[Figure: cumulative wealth curves from 1990 to 2015 for AS, AS-NP, AS-NC, FDDR, RMR, CSM, TSM and Market; x-axis: Year, y-axis: Cumulative Wealth, 0 to 5.]
Figure 2: The cumulative wealth in U.S. markets.
Table 1: Performance comparison on U.S. markets.
APR AVOL ASR MDD CR DDR
Market 0.042 0.174 0.239 0.569 0.073 0.337
TSM 0.047 0.223 0.210 0.523 0.090 0.318
CSM 0.044 0.096 0.456 0.126 0.350 0.453
RMR 0.074 0.134 0.551 0.098 1.249 0.757
FDDR 0.063 0.056 1.141 0.070 0.900 2.028
AS-NC 0.101 0.052 1.929 0.068 1.492 1.685
AS-NP 0.133 0.065 2.054 0.033 3.990 4.618
AS 0.143 0.067 2.132 0.027 5.296 6.397
3) The TSM strategy performs well in the bull market but very poorly in the bear market (the financial crises in 2003 and 2008), while RMR shows the opposite behavior. This implies that traditional financial strategies can only adapt to a certain type of market state, lacking an effective forward-looking mechanism. This defect is greatly alleviated by the RL strategies, including AlphaStock and FDDR, which perform much more stably across different market states.
The performances evaluated by the other measures are listed in Table 1. For AVOL and MDD, a lower value indicates better performance, while the opposite holds for the other measures. As shown in Table 1, AlphaStock, AlphaStock-NP and AlphaStock-NC outperform the other baselines on all measures, confirming the effectiveness and robustness of our strategy. The performances of AlphaStock, AlphaStock-NP and AlphaStock-NC are close in terms of ASR, which might be because all of these models are optimized to maximize the Sharpe ratio. The profits of AlphaStock and AlphaStock-NP measured by APR are higher than that of AlphaStock-NC, at the cost of slightly higher volatility.

More interestingly, the performance of AlphaStock measured by MDD, CR and DDR is much better than that of AlphaStock-NP. Similar results can be observed by comparing the MDD, CR and DDR of AlphaStock-NP and AlphaStock-NC. These three measures indicate the extreme loss in an investment, i.e., the maximum drawdown and the returns below the minimum acceptable threshold. The results suggest that the extreme loss control abilities of the three models rank as AlphaStock > AlphaStock-NP > AlphaStock-NC, which highlights the contribution of the CAAN component and the price rising rank prior. Indeed, CAAN with the price rising rank prior fully exploits the ranking relationships among stocks. This mechanism can protect our strategy from the error of "buying losers and selling winners", and therefore can greatly reduce extreme losses in investments. In summary, AlphaStock is a very competitive strategy for investors with different types of preferences.
[Figure: bar charts of the average influence $\bar{\delta}_{x_q}$ on winner scores; panels (a) Price Rising and (b) Trade Volume over history months -12 to -1, (c) Fine-grained Volatility over history months (all negative), and (d) Company Features MC, PE, BM, DIV.]
Figure 3: Influence of history trading features on winner scores.
Table 2: Performance comparison on Chinese markets.
APR AVOL ASR MDD CR DDR
Market 0.037 0.260 0.141 0.595 0.062 0.135
TSM 0.078 0.420 0.186 0.533 0.147 0.225
CSM 0.023 0.392 0.058 0.633 0.036 0.064
RMR 0.079 0.279 0.282 0.423 0.186 0.289
FDDR 0.084 0.152 0.553 0.231 0.365 0.801
AS-NC 0.104 0.113 0.916 0.163 0.648 1.103
AS-NP 0.122 0.105 1.163 0.136 0.895 1.547
AS 0.125 0.103 1.220 0.135 0.296 1.704
5.5 Performance in Chinese Markets
To further test the robustness of our model, we run back-test experiments of our model and the baselines on the Chinese stock markets, which contain two exchanges: the Shanghai Stock Exchange (SSE) and the Shenzhen Stock Exchange (SZSE). The data are obtained from the WIND database⁴. The stocks are the RMB-priced ordinary shares (A-shares), and the total number of stocks used in the experiment is 1,131. The time range of our data is from Jun. 2005 to Dec. 2018, with the period from Jun. 2005 to Dec. 2011 used as the training/validation set and the rest as the test set. Since short selling is not allowed in the Chinese markets, we only use the $\mathbf{b}^+$ portfolio in this experiment.

⁴ http://www.wind.com.cn/en/Default.html
The experimental results are given in Table 2. From the table, we can see that AlphaStock, AlphaStock-NP and AlphaStock-NC again perform better than the other baselines. This verifies the effectiveness of our model on the Chinese markets. Further comparing Table 2 with Table 1, it turns out that the risk of our model measured by AVOL and MDD in the Chinese markets is higher than that in the U.S. markets. This might be attributable to the market imperfections of emerging countries like China, with more speculative capital but less effective governance. The lack of a short-selling mechanism also contributes to the imbalance of market forces. The AVOL and MDD of the Market and the other baselines in the Chinese markets are also higher than those in the U.S. markets. Compared with these baselines, the risk control ability of our model is still competitive. To sum up, the experimental results in Table 2 indicate the robustness of our model in emerging markets.
5.6 Investment Strategy Interpretation
Here, we try to interpret the underlying investment logics of AlphaStock, which is crucial for practitioners to better understand this model. To this end, we use $\bar{\delta}_{x_q}$ in Eq. (30) to measure the influence of the stock features defined in Section 3.1 on AlphaStock's winner selection. Figures 3(a)-3(b) plot the influences of the trading features. The vertical axis denotes the influence strength indicated by $\bar{\delta}_{x_q}$, and the horizontal axis denotes how many months before the trading time. For example, the bar indexed by "-12" on the horizontal axis in Fig. 3(a) denotes the influence of the stock price rising rate (PR) at the time of twelve months ago.
As shown in Fig. 3(a), the influence of the history price rising rate is heterogeneous along the time axis. The PR in long-term months, i.e., 9 to 11 months before, has a positive influence on winner scores, but for the short-term months, i.e., 1 to 8 months before, the influence becomes negative. This result indicates that our model tends to buy stocks with a long-term rapid price increase (valid excellence) or with a short-term rapid price retracement (overly undervalued). This implies that AlphaStock behaves like a mixed strategy of long-term momentum and short-term reversion. Moreover, since price rising is usually accompanied by frequent stock trading, Fig. 3(b) shows that the $\bar{\delta}_{x_q}$ of trade volume (TV) has a tendency similar to that of the price rising rate (PR). Finally, as shown in Fig. 3(c), the volatilities (VOL) have a negative influence on winner scores for all history months. This means that our model tends to select low-volatility stocks as winners, which indeed explains why AlphaStock can adapt to diverse market states.
Fig. 3(d) further exhibits the average influences of the different company features on the winner score, i.e., the $\bar{\delta}_{x_q}$ averaged over all history months. It turns out that Market Capitalization (MC), Price-earnings Ratio (PE), and Book-to-market Ratio (BM) have positive influences. These three features are important valuation factors for a listed company, which indicates that AlphaStock tends to select companies with sound fundamental values. In contrast, dividends mean that a part of the company's value is returned to shareholders, which could reduce the intrinsic value of a stock. That is why the influence of Dividends (DIV) is negative in our model.

To sum up, while AlphaStock is an AI-enabled investment strategy, the interpretation analysis proposed in Section 4 can help extract investment logics from AlphaStock. Specifically, AlphaStock suggests selecting as winners the stocks with high long-term growth, low volatility, high intrinsic value, and recent undervaluation.
6 RELATED WORKS
Our work is related to the following research directions.

Financial Investment Strategy: Classic financial investment strategies include momentum, mean reversion, and multi-factor models. In the first work on BWSL [14], Jegadeesh and Titman found that "momentum" could be used to select winners and losers. The momentum strategy buys assets that have had high returns over a past period as winners, and sells those that have had poor returns over the same period. Classic momentum strategies include the Cross Sectional Momentum (CSM) [15] and the Time Series Momentum (TSM) [18]. The mean reversion strategy [20] assumes that asset prices always return to their means over a past period, so it buys assets priced under their historical means and sells those priced above. The multi-factor model [7] uses factors to compute a valuation for each asset and buys/sells the assets priced under/above their valuations. Most of these financial investment strategies can only exploit a certain factor of financial markets and thus might fail in complex market environments.
Deep Learning in Finance: In recent years, deep learning approaches have begun to be applied in financial areas. In the literature, L. Zhang et al. proposed to exploit frequency information to predict stock prices [11]. News and social media were used for price prediction in Refs. [12, 27]. Information about events and corporation relationships was used to predict stock prices in Refs. [2, 4]. Most of these works focus on price prediction rather than end-to-end investment portfolio generation as we do.
Reinforcement Learning in Finance: The RL approaches used in investment strategies fall into two categories: value-based and policy-based [8]. Value-based approaches learn a critic to describe the expected outcomes of markets given trading actions. Typical value-based approaches in investment strategies include Q-learning [19] and deep Q-learning [16]. A defect of value-based approaches is that the market environment is too complex to be approximated by a critic. Therefore, policy-based approaches are considered more suitable for financial markets [8]. The AlphaStock model also belongs to this category. A classic policy-based RL algorithm for investment strategies is Recurrent Reinforcement Learning (RRL) [17]. The FDDR model [3] extends the RRL framework using deep neural networks. In the Investor-Imitator model [6], a policy-based deep RL framework was proposed to imitate the behaviors of different types of investors. Compared with RRL and its deep learning extensions, which focus on exploiting sequential dependence in financial signals, our AlphaStock model pays more attention to the interrelationships among assets. Moreover, deep RL approaches are often hard to deploy in real-life applications due to their unexplainable deep network structures. The interpretation tools offered by our model can alleviate this problem.
7 CONCLUSIONS
In this paper, we proposed an RL-based deep attention network to implement a BWSL strategy called AlphaStock. We also designed a sensitivity analysis method to interpret the investment logics of our model. Compared with existing RL-based investment strategies, AlphaStock fully exploits the interrelationships among stocks and opens a door to solving the "black box" problem of using deep learning models in financial markets. The back-testing and simulation experiments on the U.S. and Chinese stock markets showed that AlphaStock performs much better than the competing strategies. Interestingly, AlphaStock suggests buying stocks with high long-term growth, low volatility, high intrinsic value, and recent undervaluation.
ACKNOWLEDGMENTS
J. Wang's work was partially supported by the National Natural Science Foundation of China (NSFC) (61572059), the Science and Technology Project of Beijing (Z181100003518001), and the CETC Union Fund (6141B08080401). Y. Zhang's work was partially supported by the National Key Research and Development Program of China under Grant (2017YFC0820405) and the Fundamental Research Funds for the Central Universities. K. Tang's work was partially supported by the National Social Sciences Foundation of China (No. 14BJL028). J. Wu's work was partially supported by NSFC (71725002, 71531001, U1636210).
REFERENCES
[1] Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. 2018. Sanity checks for saliency maps. In NIPS'18. 9525–9536.
[2] Yingmei Chen, Zhongyu Wei, and Xuanjing Huang. 2018. Incorporating Corporation Relationship via Graph Convolutional Neural Networks for Stock Price Prediction. In CIKM'18. ACM, 1655–1658.
[3] Yue Deng, Feng Bao, Youyong Kong, Zhiquan Ren, and Qionghai Dai. 2017. Deep direct reinforcement learning for financial signal representation and trading. IEEE TNNLS 28, 3 (2017), 653–664.
[4] Xiao Ding, Yue Zhang, Ting Liu, and Junwen Duan. 2015. Deep learning for event-driven stock prediction. In IJCAI'15. 2327–2333.
[5] Xiao Ding, Yue Zhang, Ting Liu, and Junwen Duan. 2016. Knowledge-driven event embedding for stock prediction. In COLING'16. 2133–2142.
[6] Yi Ding, Weiqing Liu, Jiang Bian, Daoqiang Zhang, and Tie-Yan Liu. 2018. Investor-Imitator: A Framework for Trading Knowledge Extraction. In KDD'18. ACM, 1310–1319.
[7] Eugene F. Fama and Kenneth R. French. 1996. Multifactor explanations of asset pricing anomalies. J. Finance 51, 1 (1996), 55–84.
[8] Thomas G. Fischer. 2018. Reinforcement learning in financial markets - a survey. Technical Report. FAU Discussion Papers in Economics.
[9] Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. 2018. A survey of methods for explaining black box models. ACM Computing Surveys (CSUR) 51, 5 (2018), 93.
[10] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (1997), 1735–1780.
[11] Hao Hu and Guo-Jun Qi. 2017. State-Frequency Memory Recurrent Neural Networks. In ICML'17. 1568–1577.
[12] Ziniu Hu, Weiqing Liu, Jiang Bian, Xuanzhe Liu, and Tie-Yan Liu. 2018. Listening to chaotic whispers: A deep learning framework for news-oriented stock trend prediction. In WSDM'18. ACM, 261–269.
[13] Dingjiang Huang, Junlong Zhou, Bin Li, Steven C. H. Hoi, and Shuigeng Zhou. 2016. Robust median reversion strategy for online portfolio selection. IEEE TKDE 28, 9 (2016), 2480–2493.
[14] Narasimhan Jegadeesh and Sheridan Titman. 1993. Returns to buying winners and selling losers: Implications for stock market efficiency. J. Finance 48, 1 (1993), 65–91.
[15] Narasimhan Jegadeesh and Sheridan Titman. 2002. Cross-sectional and time-series determinants of momentum returns. RFS 15, 1 (2002), 143–157.
[16] Olivier Jin and Hamza El-Saawy. 2016. Portfolio Management using Reinforcement Learning. Technical Report. Stanford University.
[17] John Moody, Lizhong Wu, Yuansong Liao, and Matthew Saffell. 1998. Performance functions and reinforcement learning for trading systems and portfolios. Journal of Forecasting 17, 5-6 (1998), 441–470.
[18] Tobias J. Moskowitz, Yao Hua Ooi, and Lasse Heje Pedersen. 2012. Time series momentum. J. Financial Economics 104, 2 (2012), 228–250.
[19] Ralph Neuneier. 1995. Optimal Asset Allocation using Adaptive Dynamic Programming. In NIPS'95.
[20] James M. Poterba and Lawrence H. Summers. 1988. Mean reversion in stock prices: Evidence and implications. J. Financial Economics 22, 1 (1988), 27–59.
[21] William F. Sharpe. 1994. The Sharpe ratio. JPM 21, 1 (1994), 49–58.
[22] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In NIPS'14. 3104–3112.
[23] Richard S. Sutton and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. MIT Press.
[24] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS'17. 5998–6008.
[25] Jingyuan Wang, Qian Gu, Junjie Wu, Guannan Liu, and Zhang Xiong. 2016. Traffic speed prediction and congestion source exploration: A deep learning method. In ICDM'16. IEEE, 499–508.
[26] Jingyuan Wang, Ze Wang, Jianfeng Li, and Junjie Wu. 2018. Multilevel wavelet decomposition network for interpretable time series analysis. In KDD'18. ACM, 2437–2446.
[27] Yumo Xu and Shay B. Cohen. 2018. Stock movement prediction from tweets and historical prices. In ACL'18, Vol. 1. 1970–1979.
... The need for explainability in financial decision-making drives the development of Explainable Artificial Intelligence (XAI) techniques [12][13][14], whose incorporation in the context of DRL enhance the high performance of DRL agents. But, as we will show in section 2 (state of the art), despite the growing literature that analyzes the application of DRL to PM, the literature on the explainability of DRL algorithms applied to PM is very scarce and underdeveloped, with only four recent studies [15][16][17][18], to the best of our knowledge. Moreover, these four published DRL explainability methods in PM, only offer explanations of the model in training time, not being able to monitor the predictions done by the agent in the trading time. ...
... The application of DRL in financial PM is gaining popularity in the recent years, mainly due to the rise of computing power and architectures that enables a reasonable estimation of the rewards distribution of the actions with respect to the states given by the training process of the agents with respect to financial data [23]. But the literature on the explainability of DRL algorithms applied to PM is very scarce and underdeveloped, with only four recent studies [15][16][17][18], to the best of our knowledge. In this section, we show a detailed state-of-the-art of DRL and XDRL applied to financial PM to show the research gap in the literature to which our work responds. ...
... But despite the growing literature that analyzes the application of DRL to PM (as we have just reviewed above), the literature on the explainability of DRL algorithms applied to PM is very scarce and underdeveloped, with only four recent studies [15][16][17][18], to the best of our knowledge. Guan and Liu (2021) [17] provide an empirical approach of explainable DRL for the PM task in response to the challenge of understanding a DRL-based trading strategy because of the black-box nature of deep neural networks. ...
Article
Full-text available
Financial portfolio management investment policies computed quantitatively by modern portfolio theory techniques like the Markowitz model rely on a set of assumptions that are not supported by data in high volatility markets such as the technological sector or cryptocurrencies. Hence, quantitative researchers are looking for alternative models to tackle this problem. Concretely, portfolio management (PM) is a problem that has been successfully addressed recently by Deep Reinforcement Learning (DRL) approaches. In particular, DRL algorithms train an agent by estimating the distribution of the expected reward of every action performed by an agent given any financial state in a simulator, also called gymnasium. However, these methods rely on Deep Neural Networks model to represent such a distribution, that although they are universal approximator models, capable of representing this distribution over time, they cannot explain its behaviour, given by a set of parameters that are not interpretable. Critically, financial investors policies require predictions to be interpretable, to assess whether they follow a reasonable behaviour, so DRL agents are not suited to follow a particular policy or explain their actions. In this work, driven by the motivation of making DRL explainable, we developed a novel Explainable DRL (XDRL) approach for PM, integrating the Proximal Policy Optimization (PPO) DRL algorithm with the model agnostic explainable machine learning techniques of feature importance, SHAP and LIME to enhance transparency in prediction time. By executing our methodology, we can interpret in prediction time the actions of the agent to assess whether they follow the requisites of an investment policy or to assess the risk of following the agent’s suggestions. We empirically illustrate it by successfully identifying key features influencing investment decisions, which demonstrate the ability to explain the agent actions in prediction time. We propose the first explainable post hoc PM financial policy of a DRL agent.
... Deep learning models have powerful representation learning capabilities, making them particularly well suited to handling the intricate data relations in real financial markets. Early methods treat stock price or return series as ordinary time series and employ regular sequential deep learning models [1] such as RNNs [26], LSTMs [50], and Transformers [52]. Subsequently, specially designed neural networks were proposed to handle the characteristics of financial data, such as distribution shifts [61], stochasticity [24], and multiple trading patterns [30]. ...
... Moreover, in the stock-level factor analysis, we adopt a novel method to exploit cointegration relations, which is particularly innovative as this type of stock relationship is rarely considered. Most works focus on designing specialized neural networks [50,51] to model stock relations, or incorporate additional information about stock relations, such as collective investments of funds [1,28] and textual media about firm relevance [3]. At the market level, traditional methods use stock market indexes to represent the overall market state. ...
Preprint
Recent years have witnessed the encounter of deep learning and quantitative trading, which has achieved great success in stock investment. Numerous deep learning-based models have been developed for forecasting stock returns, leveraging the powerful representation capabilities of neural networks to identify patterns and factors influencing stock prices. These models can effectively capture general patterns in the market, such as stock price trends, volume-price relationships, and time variations. However, the impact of special irrationality factors -- such as market sentiment, speculative behavior, market manipulation, and psychological biases -- has not been fully considered in existing deep stock forecasting models, owing to their relative abstraction and the lack of explicit labels and data descriptions. To fill this gap, we propose UMI, a Universal multi-level Market Irrationality factor model to enhance stock return forecasting. The UMI model learns factors that reflect irrational market behaviors at both the individual-stock and overall-market levels. At the stock level, UMI constructs an estimated rational price for each stock, which is cointegrated with the stock's actual price. The discrepancy between the actual and rational prices serves as a factor indicating stock-level irrational events. Additionally, we define market-level irrational behaviors as anomalous synchronous fluctuations of stocks within a market. Using two self-supervised representation learning tasks, i.e., sub-market comparative learning and market synchronism prediction, the UMI model incorporates market-level irrationalities into a market representation vector, which is then used as the market-level irrationality factor.
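As a rough, self-contained illustration of the stock-level idea (our own simplified rendering, not the UMI model itself): regress a stock's price on peer prices to form an estimated rational price, and treat the residual as the irrationality signal.

import numpy as np

rng = np.random.default_rng(1)
T = 500
peers = rng.normal(size=(T, 3)).cumsum(axis=0)             # peer price paths
true_beta = np.array([0.6, 0.3, 0.1])
price = peers @ true_beta + rng.normal(scale=0.5, size=T)  # cointegrated price

# OLS estimate of the rational-price combination
beta, *_ = np.linalg.lstsq(peers, price, rcond=None)
rational_price = peers @ beta
irrationality_factor = price - rational_price              # mean-reverting residual
print(irrationality_factor[-5:])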
... However, both methods overlook the correlations between different stocks, where the price fluctuation of one stock may affect the trend of a group of related stocks. Recently, Wang et al. [12] alleviated this limitation by introducing attention mechanisms across different stocks. Yet this modeling of stock correlation is based on the similarity of local regions in price sequences, leading to information loss over long time spans. ...
Article
Full-text available
In the field of artificial intelligence, the portfolio management problem has received widespread attention. Portfolio models based on deep reinforcement learning enable intelligent investment decision-making. However, most models only consider modeling the temporal information of stocks, neglecting the correlations between stocks and the impact of overall market risk. Moreover, their trading strategies are often singular and fail to adapt to dynamic changes in the trading market. To address these issues, this paper proposes a Deep Reinforcement Learning Portfolio Model based on Mixture of Experts (MoEDRLPM). First, a spatio-temporal adaptive embedding matrix is designed, and temporal and spatial self-attention mechanisms are employed to extract the temporal information and correlations of stocks. Second, the currently optimal expert is dynamically selected from the mixed expert pool through a router; the selected expert makes decisions that are aggregated into the portfolio weights. Next, market index data are utilized to model the current market risk and determine investment capital ratios. Finally, deep reinforcement learning is employed to optimize the portfolio strategy. This approach generates diverse trading strategies in response to dynamic changes in the market environment. The proposed model is tested on the SSE50 and CSI300 datasets. Results show that the total returns of this model increase by 12% and 8%, respectively, while the Sharpe Ratios improve by 64% and 51%.
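A toy sketch of the router-plus-experts pattern the abstract describes (the layer sizes and hard top-1 routing are our own illustrative assumptions, not the MoEDRLPM architecture):

import torch
import torch.nn as nn

class TinyMoEPortfolio(nn.Module):
    def __init__(self, state_dim, n_assets, n_experts=4):
        super().__init__()
        self.router = nn.Linear(state_dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Linear(state_dim, n_assets) for _ in range(n_experts)
        )

    def forward(self, state):
        gate = torch.softmax(self.router(state), dim=-1)   # score each expert
        top = gate.argmax(dim=-1)                          # best expert per sample
        logits = torch.stack(
            [self.experts[int(i)](s) for i, s in zip(top, state)]
        )
        return torch.softmax(logits, dim=-1)               # long-only portfolio weights

model = TinyMoEPortfolio(state_dim=16, n_assets=10)
weights = model(torch.randn(2, 16))
print(weights.sum(dim=-1))  # each row sums to 1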
... Jiang et al. [14] introduced a Deep Reinforcement Learning framework aimed at maximizing profit, under assumptions on market liquidity and transactions. Subsequent models like AlphaStock [23], HRPM [24], and Smart Trader [29] incorporated more realistic trading scenarios, addressing issues such as slippage and the interrelationships among assets, and designed specific model structures tailored to these issues. FinRL [16] was provided as an open-access library offering diverse market environments and multiple RL algorithms for various portfolio tasks. ...
... Reinforcement learning (RL) has been a hot research field in recent years; it mainly studies how agents should take actions while interacting with an environment so as to maximize cumulative rewards. It has found many significant applications in finance, particularly in constructing portfolios (Wang et al. [38], Wang et al. [39]), capturing single-asset trading signals (Gao et al. [40], Deng et al. [41], Almahdi and Yang [42]), option hedging (Buehler et al. [43], Kolm and Ritter [44]), and optimal execution. Using the Deep Q-Network (DQN) algorithm, Nevmyvaka gave empirical demonstrations of RL in modern US financial markets [45]. ...
Preprint
The optimal execution problem has long been an actively studied research issue, and many reinforcement learning (RL) algorithms have been proposed for it. In this article, we consider the execution problem of targeting the volume weighted average price (VWAP) and propose a relaxed stochastic optimization problem with an entropy regularizer to encourage more exploration. We derive an explicit formula for the optimal policy, which is Gaussian distributed, with its mean value being the solution to the original problem. Extending the framework of continuous RL to processes with jumps, we provide theoretical proofs for two RL algorithms: the first minimizes the martingale loss function, which leads to optimal parameter estimates in the mean-square sense, and the second uses the martingale orthogonality condition. In addition to the RL algorithms, we also propose another learning algorithm, the adaptive dynamic programming (ADP) algorithm, and verify the performance of all of them in two different environments across different random seeds. Convergence of all algorithms is verified in both environments, with a larger advantage in the environment with stronger price impact. ADP is a good choice when the agent fully understands the environment and can estimate the parameters well. On the other hand, the RL algorithms do not require any model assumptions or parameter estimation, and are able to learn directly from interactions with the environment.
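The "relaxed problem with an entropy regularizer" follows the standard exploratory-control pattern; a generic form (not necessarily the paper's exact formulation) is

    \max_{\pi}\; \mathbb{E}\!\left[\int_{0}^{T}\!\int_{\mathcal{A}} \big( r(t, s_t, a) - \lambda \ln \pi_t(a) \big)\, \pi_t(a)\, \mathrm{d}a\, \mathrm{d}t \right],

where pointwise maximization over densities gives \pi_t^*(a) \propto \exp\!\big( r(t, s_t, a) / \lambda \big). When the reward is concave and quadratic in a, this exponential is exactly a Gaussian density, consistent with the abstract's claim that the optimal policy is Gaussian with mean equal to the solution of the original, unrelaxed problem.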
Article
Quantitative investment (abbreviated as “quant” in this paper) is an interdisciplinary field combining financial engineering, computer science, mathematics, statistics, etc. Quant has become one of the mainstream investment methodologies over the past decades, and has experienced three generations: quant 1.0, trading by mathematical modeling to discover mis-priced assets in markets; quant 2.0, shifting the quant research pipeline from small “strategy workshops” to large “alpha factories”; quant 3.0, applying deep learning techniques to discover complex nonlinear pricing rules. Despite its advantage in prediction, deep learning relies on extremely large data volume and labor-intensive tuning of “black-box” neural network models. To address these limitations, in this paper, we introduce quant 4.0 and provide an engineering perspective for next-generation quant. Quant 4.0 has three key differentiating components. First, automated artificial intelligence (AI) changes the quant pipeline from traditional hand-crafted modeling to state-of-the-art automated modeling and employs the philosophy of “algorithm produces algorithm, model builds model, and eventually AI creates AI.” Second, explainable AI develops new techniques to better understand and interpret investment decisions made by machine learning black boxes, and explains complicated and hidden risk exposures. Third, knowledge-driven AI supplements data-driven AI such as deep learning and incorporates prior knowledge into modeling to improve investment decisions, in particular for quantitative value investing. Putting all these together, we discuss how to build a system that practices the quant 4.0 concept. We also discuss the application of large language models in quantitative finance. Finally, we propose 10 challenging research problems for quant technology, and discuss potential solutions, research directions, and future trends.
Article
Full-text available
Engaging in investment activities plays a crucial and strategic role in fostering the growth of businesses and ensuring their resilience in the market. This involvement entails expenditures on acquiring assets, embracing technological advancements, expanding production capacities, and conducting research and development, among various other domains. Collectively, these aspects form the foundation for the sustained success of an organization over the long term. This thesis explores leveraging machine learning techniques to forecast key business parameters, including investments and their impact on the financial health of the company. In this research, we explored a variety of time series models and identified that both the Random Forest Regressor and Decision Tree Regressor models deliver superior accuracy, showcasing identical RMSE values of 88.36 on the validation dataset. Furthermore, the CatBoost and LightGBM models exhibited praiseworthy performance, registering RMSE values of 92.47 and 104.69, respectively. These findings highlight the robust performance of the Random Forest Regressor and Decision Tree Regressor, emphasizing their capability to provide accurate predictions. It is noted that both models are distinguished by high accuracy in time series forecasting, and the choice between them should take into account the trade-offs between computational efficiency and model interpretability. These results allow us to propose practical strategies for managing investment resources to ensure the sustainable development and prosperity of the enterprise in the long term.
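As a minimal, self-contained sketch of the kind of comparison reported above (synthetic data and hyperparameters are illustrative, not the thesis's setup): build lag features, fit a Random Forest Regressor, and score RMSE on a held-out tail of the series.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
series = np.sin(np.arange(600) / 20.0) + rng.normal(scale=0.1, size=600)

lags = 12
X = np.stack([series[i:i + lags] for i in range(len(series) - lags)])
y = series[lags:]
split = int(0.8 * len(X))  # chronological split, no shuffling

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X[:split], y[:split])
pred = model.predict(X[split:])
rmse = np.sqrt(np.mean((pred - y[split:]) ** 2))
print(f"validation RMSE: {rmse:.3f}")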
Preprint
Full-text available
In recent years, there has been a growing trend of applying Reinforcement Learning (RL) in financial applications. This approach has shown great potential to solve decision-making tasks in finance. In this survey, we present a comprehensive study of the applications of RL in finance and conduct a series of meta-analyses to investigate the common themes in the literature, such as the factors that most significantly affect RL's performance compared to traditional methods. Moreover, we identify challenges including explainability, Markov Decision Process (MDP) modeling, and robustness that hinder the broader utilization of RL in the financial industry and discuss recent advancements in overcoming these challenges. Finally, we propose future research directions, such as benchmarking, contextual RL, multi-agent RL, and model-based RL to address these challenges and to further enhance the implementation of RL in finance.
Conference Paper
Full-text available
Recent years have witnessed the unprecedented rise of time series from almost all kinds of academic and industrial fields. Various types of deep neural network models have been introduced to time series analysis, but the important frequency information still lacks effective modeling. In light of this, in this paper we propose a wavelet-based neural network structure called the multilevel Wavelet Decomposition Network (mWDN) for building frequency-aware deep learning models for time series analysis. mWDN preserves the advantage of multilevel discrete wavelet decomposition in frequency learning while enabling the fine-tuning of all parameters under a deep neural network framework. Based on mWDN, we further propose two deep learning models, called Residual Classification Flow (RCF) and multi-frequency Long Short-Term Memory (mLSTM), for time series classification and forecasting, respectively. The two models take all or part of the mWDN-decomposed sub-series at different frequencies as input, and resort to the back-propagation algorithm to learn all the parameters globally, which enables the seamless embedding of wavelet-based frequency analysis into deep learning frameworks. Extensive experiments on 40 UCR datasets and a real-world user-volume dataset demonstrate the excellent performance of our mWDN-based time series models. In particular, we propose an importance analysis method for mWDN-based models, which successfully identifies the time-series elements and mWDN layers that are crucially important to time series analysis. This indicates the interpretability advantage of mWDN, and can be viewed as an in-depth exploration of interpretable deep learning.
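For readers unfamiliar with the underlying operation, the sketch below shows the classical multilevel discrete wavelet decomposition that mWDN builds on, using PyWavelets; mWDN itself differs by making the decomposition filters trainable inside the network.

import numpy as np
import pywt

rng = np.random.default_rng(3)
t = np.arange(1024)
series = np.sin(t / 8.0) + 0.5 * np.sin(t / 64.0) + rng.normal(scale=0.2, size=t.size)

# Three-level decomposition: one approximation plus three detail sub-series,
# each capturing a different frequency band of the input.
coeffs = pywt.wavedec(series, wavelet="db4", level=3)
for name, c in zip(["approx(l3)", "detail(l3)", "detail(l2)", "detail(l1)"], coeffs):
    print(name, c.shape)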
Article
Full-text available
In recent years many accurate decision support systems have been constructed as black boxes, that is, as systems that hide their internal logic from the user. This lack of explanation constitutes both a practical and an ethical issue. The literature reports many approaches aimed at overcoming this crucial weakness, sometimes at the cost of sacrificing accuracy for interpretability. The applications in which black-box decision systems can be used are various, and each approach is typically developed to provide a solution for a specific problem, as a consequence delineating, explicitly or implicitly, its own definition of interpretability and explanation. The aim of this paper is to provide a classification of the main problems addressed in the literature with respect to the notion of explanation and the type of black-box system. Given a problem definition, a black-box type, and a desired explanation, this survey should help researchers find the proposals most useful for their own work. The proposed classification of approaches to opening black-box models should also be useful for putting the many open research questions in perspective.
Conference Paper
Full-text available
Stock trend prediction plays a critical role in seeking maximized profit from stock investment. However, precise trend prediction is very difficult given the highly volatile and non-stationary nature of the stock market. The explosion of information on the Internet, together with advances in natural language processing and text mining techniques, has enabled investors to unveil market trends and volatility from online content. Unfortunately, the quality, trustworthiness, and comprehensiveness of online content related to the stock market vary drastically, and a large portion consists of low-quality news, comments, or even rumors. To address this challenge, we imitate the learning process of human beings facing such chaotic online news, driven by three principles: sequential content dependency, diverse influence, and effective and efficient learning. In this paper, to capture the first two principles, we design a Hybrid Attention Network to predict the stock trend based on the sequence of recent related news. Moreover, we apply a self-paced learning mechanism to implement the third principle. Extensive experiments on real-world stock market data demonstrate the effectiveness of our approach.
Conference Paper
In this paper, we propose to incorporate information about the related corporations of a target company for its stock price prediction. We first construct a graph including all involved corporations based on investment facts from the real market, and learn a distributed representation for each corporation via node embedding methods applied to the graph. Two approaches are then explored to utilize information from related corporations: a pipeline model, and a joint model based on graph convolutional neural networks. Experiments on data collected from the stock market in Mainland China show that the representation learned by our model is able to capture relationships between corporations, and that prediction models incorporating related corporations' information make more accurate predictions on the stock market.
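A minimal numerical sketch of the graph-convolution step such joint models rely on, H' = ReLU(D^{-1/2}(A + I)D^{-1/2} H W); the toy graph and features here are our own assumptions, not the paper's data.

import numpy as np

rng = np.random.default_rng(5)
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)      # investment links between corporations
A_hat = A + np.eye(3)                       # add self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
H = rng.normal(size=(3, 4))                 # per-corporation node embeddings
W = rng.normal(size=(4, 4))                 # learnable weights (random here)
H_next = np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)
print(H_next.shape)                         # (3, 4): updated corporation features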
Conference Paper
Stock trading is a popular investment approach in the real world. However, lacking sufficient domain knowledge and experience, common investors find it very difficult to analyze thousands of stocks manually. Algorithmic investment provides a rational alternative that formulates human knowledge as a trading agent. However, it still requires well-built knowledge and experience to design effective trading algorithms in such a volatile market. Fortunately, various kinds of historical trading records are easy to obtain in this big-data era, so it is invaluable to extract the trading knowledge hidden in the data to help people make better decisions. In this paper, we propose a reinforcement-learning-driven Investor-Imitator framework that formalizes trading knowledge by imitating an investor's behavior with a set of logic descriptors. In particular, to instantiate specific logic descriptors, we introduce the Rank-Invest model, which keeps the logic descriptors diverse by learning to optimize different evaluation metrics. In the experiments, we first simulate three types of investors, representing different degrees of information disclosure that we may meet in the real market. By learning to imitate these investors, the Investor-Imitator can empirically recover the inherent trading logic of the target investor, and the extracted interpretable knowledge can help us better understand and construct trading portfolios. The experimental results sufficiently demonstrate the design purpose of the Investor-Imitator, making it an applicable and meaningful intelligent trading framework in financial investment research.
Article
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
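The mechanism at the Transformer's core is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V; a direct NumPy rendering:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # weighted sum of values

rng = np.random.default_rng(4)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)    # (5, 8)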
Article
Online portfolio selection has attracted increasing attention from data mining and machine learning communities in recent years. An important theory in financial markets is mean reversion, which plays a critical role in some state-of-the-art portfolio selection strategies. Although existing mean reversion strategies have been shown to achieve good empirical performance on certain datasets, they seldom carefully deal with noise and outliers in the data, leading to suboptimal portfolios, and consequently yielding poor performance in practice. In this paper, we propose to exploit the reversion phenomenon by using robust L1-median estimators, and design a novel online portfolio selection strategy named "Robust Median Reversion" (RMR), which constructs optimal portfolios based on the improved reversion estimator. We examine the performance of the proposed algorithms on various real markets with extensive experiments. Empirical results show that RMR can overcome the drawbacks of existing mean reversion algorithms and achieve significantly better results. Finally, RMR runs in linear time, and thus is suitable for large-scale real-time algorithmic trading applications.
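The robust L1-median (geometric median) at the heart of RMR can be computed with Weiszfeld's algorithm; a minimal sketch, with window data and tolerances chosen purely for illustration:

import numpy as np

def l1_median(points, iters=100, eps=1e-9):
    # points: (n_days, n_assets) window of price vectors
    mu = points.mean(axis=0)
    for _ in range(iters):
        d = np.linalg.norm(points - mu, axis=1)
        w = 1.0 / np.maximum(d, eps)                 # down-weight outlier days
        mu_next = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(mu_next - mu) < eps:
            break
        mu = mu_next
    return mu

window = np.array([[10.0, 5.1],
                   [10.2, 5.0],
                   [ 9.9, 5.2],
                   [30.0, 5.0]])   # one outlier day
print(l1_median(window))           # stays near the bulk of the window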
Article
Can we train the computer to beat experienced traders at financial asset trading? In this paper, we try to address this challenge by introducing a recurrent deep neural network (NN) for real-time financial signal representation and trading. Our model is inspired by two biologically related learning concepts, deep learning (DL) and reinforcement learning (RL). In the framework, the DL part automatically senses the dynamic market condition for informative feature learning. Then, the RL module interacts with the deep representations and makes trading decisions to accumulate the ultimate rewards in an unknown environment. The learning system is implemented in a complex NN that exhibits both deep and recurrent structures. Hence, we propose a task-aware backpropagation-through-time method to cope with the gradient vanishing issue in deep training. The robustness of the neural system is verified on both the stock and the commodity futures markets under broad testing conditions.