Optimizing the Pairs-Trading Strategy Using Deep Reinforcement Learning with Trading and Stop-Loss Boundaries

Many researchers have tried to optimize pairs trading as the numbers of opportunities for arbitrage profit have gradually decreased. Pairs trading is a market-neutral strategy; it profits if the given condition is satisfied within a given trading window, and if not, there is a risk of loss. In this study, we propose an optimized pairs-trading strategy using deep reinforcement learning—particularly with the deep Q-network—utilizing various trading and stop-loss boundaries. More specifically, if spreads hit trading thresholds and reverse to the mean, the agent receives a positive reward. However, if spreads hit stop-loss thresholds or fail to reverse to the mean after hitting the trading thresholds, the agent receives a negative reward. The agent is trained to select the optimum level of discretized trading and stop-loss boundaries given a spread to maximize the expected sum of discounted future profits. Pairs are selected from stocks on the S&P 500 Index using a cointegration test. We compared our proposed method with traditional pairs-trading strategies which use constant trading and stop-loss boundaries. We find that our proposed model is trained well and outperforms traditional pairs-trading strategies.
Research Article
Taewook Kim 1,2 and Ha Young Kim 3
1Qra Technologies, Inc., Ttukseom-ro 1-gil, Sungdong-gu, Seoul 04778, Republic of Korea
2Department of Financial Engineering, Ajou University, Worldcupro 206, Yeongtong-gu, Suwon 16499, Republic of Korea
3Graduate School of Information, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul, 03722, Republic of Korea
Correspondence should be addressed to Ha Young Kim; haimkgetup@gmail.com
Received 6 February 2019; Revised 14 April 2019; Accepted 11 June 2019; Published 12 November 2019
Guest Editor: Benjamin M. Tabak
Copyright © Taewook Kim and Ha Young Kim. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction
Pairs trading is a method for obtaining arbitrage profit when there is a statistical difference between two stocks with similar characteristics that are cointegrated or highly correlated. This is possible because the spread formed by two such stocks tends to revert to its mean in the long run []. In the early days, pairs-trading methods were popular because of the opportunity to obtain arbitrage profit [–]. However, as many investors, including hedge funds, sought these arbitrage opportunities by executing the pairs-trading strategy, its profitability began to deteriorate [, ]. To overcome these shortcomings, significant research has been conducted to improve the pairs-trading strategy [–].
The mechanism of pairs trading is as follows. First, a pair of stocks with similar trends is identified. Second, a regression method such as ordinary least squares (OLS), total least squares (TLS), or an error correction model (ECM) is used to calculate the spread of these stocks. Finally, if the spread hits preset boundaries, investors open a portfolio that takes a long position in the undervalued stock and a short position in the overvalued stock. Subsequently, if the spread reverts to the mean, investors close the portfolio by taking the positions opposite to those of the open portfolio. In this case, the investor obtains an arbitrage profit by executing this strategy. However, there is a risk that the spread does not revert to the mean. In such a situation, investors are at high risk because they cannot close the portfolio. By setting a stop-loss boundary, investors can hedge this risk [–].
Many researchers have applied various statistical methods to improve the efficiency and performance of pairs trading. In particular, they focused on using the spread as a trading signal. The study in [] collected pairs of stocks by minimizing the sum of squared deviations between the two stocks and then executed the trading strategy if the difference between the pairs was twice the standard deviation of the spread. They used normalized US stock price data from  to  to test the profitability of pairs trading. The study in [] used the cointegration approach to protect the pairs-trading strategy from severe losses. They applied an OLS method to create a spread and set various conditions that translated into trading actions. From these models, they achieved a trading strategy with a minimum level of profits protected from risk of loss. The results showed about an % annualized excess return over the entire period. The research in [] compared the distance and cointegration approaches on high-frequency and daily datasets to check whether pairs trading is profitable for Norwegian seafood companies. The performance was similar between the two approaches. Reference [] used a Kalman filter to calculate the spread, which was then used as a high-frequency trading signal, on the shares constituting the KOSPI  Index. He found that the pairs-trading strategy's performance was significant on the KOSPI and was better around the daily market opening and closing. Moreover, [] optimized a pairs-trading system as a stochastic control problem. They used the Ornstein-Uhlenbeck process to calculate the spread as a trading signal and tested their model with simulated data; the results showed that their strategy performs well. In addition, [] suggested the Ornstein-Uhlenbeck process to produce a market microstructure noise used as a trading signal in the pairs-trading strategy. The performance was better under this method than under traditional estimators such as ARIMA(,) and maximum likelihood. Reference [] applied a cointegration method to Chinese commodity futures from  to  to check whether pairs trading was suitable in that market. They used OLS regression to create spreads from the pairs. Furthermore, [] applied a cointegration test to assorted pairs of stocks and a vector error-correction model to create a trading signal.
Hindawi, Complexity, Volume 2019, Article ID 3582516, 20 pages, https://doi.org/10.1155/2019/3582516
It is important to set a boundary to optimize the pairs-trading strategy. This boundary is a criterion for deciding whether to execute a pairs-trading strategy. If a low boundary is set, the strategy will be executed many times, but profits will be lower; if a high boundary is set, investors will get high returns when the strategy is executed. However, all this assumes that mean reversion occurs. If the spread does not return to the average within the specified trading window, losses will be incurred. If a low boundary is set, the loss will be small. However, if the strategy is executed with a high boundary, the loss will increase. Therefore, the performance of pairs trading depends on how the boundary is set. Reference [] suggested taking a minimum-profit condition, which could be efficient in reducing losses in a pairs-trading system. They set a trading rule with diverse open conditions: for example, if the spread is above ., ., ., ., and . standard deviations. They used the daily closing prices from January , , to August , , of two stocks, the Australia New Zealand Bank and the Adelaide Bank. The results showed that, as the open condition value decreases, the number of trades and profits increases. Also, [] suggested optimal preset boundaries calculated from estimated parameters for the average trade duration, intertrade interval, and number of trades and used them to maximize the minimum total profit. They used the daily closing price data from January , , to June , , of seven pairs of stocks on the Australian Stock Exchange. The results showed that their proposed method was efficient in making profits using the pairs-trading strategy. Reference [] examined whether the pairs-trading strategy could be applied to the daily return of Chinese commodity futures from  to  using three methods: classical, closed-loop, and dynamic stop-loss. The closed-loop method takes only a stop-profit barrier which executes the strategy and does not consider the risk if spreads revert to the mean. The classical method adds stop-loss boundaries to the closed-loop method. The dynamic stop-loss method uses a variety of stop-profit and stop-loss barriers to fit the spreads if the spread is larger than the standard deviation, which is set using criteria based on the historical average of spreads. The results showed that these methods obtained an annualized return of over %, especially the closed-loop method, which yielded the highest profit of .%. In addition, [] experimented with fixed optimal threshold selection, conditional volatility, percentile, spectral analysis, and neural network thresholds in the pairs-trading strategy. Of these, the neural network threshold outperformed all other strategies.
Following the success of reinforcement learning, demonstrated by its performance at Atari games [], many researchers have attempted to apply this algorithm to financial trading systems. Reference [] proposed a deep Q-trading system using reinforcement learning methods. They applied Q-learning to a trading system to trade automatically. They set a delta price using data from the past  days, used a discrete action space of three actions (buy, hold, and sell), and used long-term profit as a reward. They used daily data from January , , to December , , of the Hang Seng Index and the S&P  Index. The experimental results showed that their proposed method outperformed buy-and-hold strategies and recurrent reinforcement learning methods. Reference [] proposed three steps to apply reinforcement learning to the financial trading system. First, they reduced the relative replay size to fit financial trading. Second, they proposed an action-augmentation technique that provides more feedback from the action to the agent. Third, they used long sequences as reinforcement data to conduct recurrent neural network training. The experimental data comprised tick-by-tick data of  forex currency pairs from January  to December . The results showed that the action-augmentation technique yielded more profit than an epsilon-greedy policy. Reference [] used an N-armed bandit problem to optimize the pairs-trading strategy. They took the spread using an error-correction model and found the parameters using a grid-search algorithm. They compared their proposed model with a constant-parameter model, which was similar to a traditional pairs-trading strategy. They used intraday one-minute data of some stocks in the FactSet database from June  to January . The performance of their proposed model was better than that of the constant-parameter model.
We investigate not only whether a dynamic boundary based on the spread in each trading window can achieve higher profit than the fixed boundary used in the traditional pairs-trading strategy, but also whether deep reinforcement learning methods can be trained to follow this mechanism. To this end, we propose a new method to optimize the pairs-trading strategy using deep reinforcement learning, especially deep Q-networks (DQN), since the pairs-trading strategy can be thought of as a game. After a portfolio position is opened, the profit is determined by whether the portfolio is closed or reaches the stop-loss position. Therefore, if we set this strategy as a game by setting boundaries which are optimized for the spreads in the trading window, we can achieve more profit than traditional pairs-trading strategies. In particular, we set the pairs-trading system to be a kind of game and obtain the optimal boundaries, trading thresholds, and stop-loss thresholds according to the calculated spread. The reason for this construction is that a portfolio that is opened and then closed within the trading window on the calculated spread is unconditionally profitable. If the portfolio reaches the stop-loss boundary or does not converge to the mean, losses may occur. We therefore set the DQN to learn by positively rewarding it if it takes a closed position and negatively rewarding it if it reaches the stop-loss or exit thresholds. We conducted the following experiments to verify that our proposed method is optimized compared to the conventional method. First, we used different spreads calculated using OLS and TLS to see how the results differ depending on the spread used for input. Second, depending on the formation window and trading window, the spread and hedge ratio will vary. We therefore set a total of six window sizes and selected the optimal window size, that is, the one with the best performance. Finally, we compared the proposed method with the traditional pairs-trading strategy using the test data with the optimal window size. In this experiment, we use the daily adjusted closing prices from January , , to July , , of  stocks in the S&P 500 Index. Experimental results show that our proposed method outperforms the traditional pairs-trading strategy across all the pairs. In addition, we can confirm that the performance measure varies according to the spread.
The main contributions of this study are as follows. First, we propose a novel method to optimize the pairs-trading strategy using deep reinforcement learning, especially deep Q-networks with trading and stop-loss boundaries. The experimental results show that our method can be applied in the pairs-trading system and also to various other fields, including finance and economics, when there is a need to optimize a rule-based strategy to be more efficient. Second, we propose an optimized dynamic boundary based on the spread in each trading window. Our proposed method outperforms the traditional pairs-trading strategy, which sets a fixed boundary. Last, we find that our method outperforms the traditional pairs-trading strategy in all pairs based on constituent stocks of the S&P 500. Since our method selects optimal boundaries based on spreads, it can be applied to other stock markets such as the KOSPI, Nikkei, and Hang Seng. It should be noted that the present work is part of a Master's thesis [].
The rest of this paper is organized as follows. Section 2 explains the technical background. Section 3 describes the materials and methods. Section 4 shows the results and provides a discussion of the experiments. Section 5 provides our conclusions.
2. Technical Background
2.1. The Traditional Pairs-Trading Strategy. Pairs trading is a representative market-neutral trading strategy which simultaneously longs an undervalued stock and shorts an overvalued stock. This strategy is a form of statistical arbitrage trading that assumes the movements of the prices of the two assets will be similar to previous trends []. It follows the assumption that asset prices will return to the long-term equilibrium. This strategy started from the idea that arbitrage opportunities exist when the price gap between two assets expands to or past a certain level. It is also based on the belief that historical price movements will not change significantly in the future.
In Figure 1, the graph drawn in blue is a spread made of two cointegrated stocks, the red lines are the trading boundaries, and the green lines are the stop-loss boundaries. When this spread reaches the trading boundaries, the portfolio is opened, and it is closed only when the spread returns to the average. However, losses are incurred when the spread reaches the stop-loss boundaries after the portfolio is opened and does not return to the average. Furthermore, after the portfolio is opened, if the trading signal does not revert to the mean during the trading window, the portfolio is closed by force; this is called the exit position of the portfolio.
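As an illustration only, the open, close, stop-loss, and exit rules described above can be sketched in Python for a single trading window of a Z-scored spread. The function name `classify_window`, the outcome labels, and the choice not to open a position when the spread is already beyond the stop-loss boundary are assumptions made for this sketch, not details taken from the study.

```python
def classify_window(z_scores, trade_bound, stop_bound):
    """Classify one trading window of a Z-scored spread.

    Returns (outcome, open_index, end_index), where outcome is
    'none' (never opened), 'close' (spread reverted to the mean),
    'stop_loss' (stop-loss boundary hit), or 'exit' (window ended
    while the portfolio was still open).
    """
    open_idx = None
    side = 0  # +1 if opened above the mean, -1 if opened below it
    for i, z in enumerate(z_scores):
        if open_idx is None:
            # Assumption: open only between the trading and stop-loss boundaries.
            if trade_bound <= abs(z) < stop_bound:
                open_idx = i
                side = 1 if z > 0 else -1
            continue
        if side * z >= stop_bound:   # spread moved further away from the mean
            return "stop_loss", open_idx, i
        if side * z <= 0:            # spread crossed back over the mean
            return "close", open_idx, i
    if open_idx is None:
        return "none", None, None
    return "exit", open_idx, len(z_scores) - 1


closed = classify_window([0.2, 1.1, 0.6, -0.1], trade_bound=1.0, stop_bound=2.0)
stopped = classify_window([0.0, -1.2, -1.8, -2.3], trade_bound=1.0, stop_bound=2.0)
exited = classify_window([0.0, 1.5, 1.2, 0.9], trade_bound=1.0, stop_bound=2.0)
```

The three example calls cover the three outcomes that matter later for the reward: a closed portfolio, a stop-loss, and a forced exit at the end of the window.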
2.1.1. The Cointegration Test. There are many approaches for pair selection, such as the distance approach [, –], the cointegration approach [, , ], and the stochastic approach [, ]. In this study, we use the cointegration approach to choose pairs which have a long-term equilibrium. Generally, a linear combination of nonstationary variables is also nonstationary. Assume that $x_t$ and $y_t$ have unit roots; as previously mentioned, the linear combination of these variables is then generally nonstationary.
$$x_t \sim I(1), \quad y_t \sim I(1) \tag{1}$$
$$y_t = \alpha + \beta x_t + \varepsilon_t \tag{2}$$
However, the relationship can be stationary if the nonstationary variables are cointegrated. In this case, the regression must be checked to determine whether it is a spurious regression or a cointegrating one. Johansen's method is widely used to test for cointegration []. In this method, the number of cointegration relations and the parameters of the model are estimated and tested using maximum likelihood estimation (MLE). Since all variables are regarded as endogenous variables, there is no need to select dependent variables, and multiple cointegration relationships can be identified. In addition, we use MLE to estimate the cointegration relation with the vector autoregression model and to determine the cointegration coefficient based on the likelihood-ratio test. An advantage of this method is therefore that, beyond merely testing for cointegration, it allows various hypothesis tests related to the estimation of cointegration parameters and the setting of other models when cointegration is present.
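To make the distinction between a nonstationary price and a stationary cointegrated combination concrete, the following Python sketch simulates two series that share one random-walk trend. This is not Johansen's test; it is only a toy illustration, and the function names, the simulated coefficients (2.0 and 1.5), and the use of the lag-1 autocorrelation as a rough persistence measure are all assumptions made for this example.

```python
import random


def simulate_cointegrated_pair(n, alpha=2.0, beta=1.5, seed=0):
    """Simulate x_t ~ I(1) and y_t = alpha + beta * x_t + noise."""
    rng = random.Random(seed)
    x, y, level = [], [], 0.0
    for _ in range(n):
        level += rng.gauss(0.0, 1.0)  # shared random-walk trend
        x.append(level)
        y.append(alpha + beta * level + rng.gauss(0.0, 1.0))
    return x, y


def lag1_autocorr(series):
    """Sample lag-1 autocorrelation (close to 1 for a random walk)."""
    mean = sum(series) / len(series)
    num = sum((a - mean) * (b - mean) for a, b in zip(series[:-1], series[1:]))
    den = sum((a - mean) ** 2 for a in series)
    return num / den


x, y = simulate_cointegrated_pair(2000)
spread = [yi - 2.0 - 1.5 * xi for xi, yi in zip(x, y)]
rho_x = lag1_autocorr(x)            # highly persistent, unit-root-like
rho_spread = lag1_autocorr(spread)  # near zero: the spread is stationary
```

Each simulated series wanders without a fixed mean, while the linear combination `spread` fluctuates around zero, which is the property that the cointegration test is meant to detect in real price data.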
Figure 1: The traditional pairs-trading strategy. [Plot of the Z-score of the trading signal by year, 1991 to 2009, with the trading boundaries and the stop-loss boundaries.]
2.2. Spread Calculation
2.2.1. Ordinary Least Squares. In regression analysis, OLS is widely used to estimate parameters by minimizing the sum of the squared errors []. Assume that $x_i$, $y_i$, and $\varepsilon_i$ are an independent variable, a dependent variable, and an error term, respectively. We can estimate $\beta$ from the following equations by taking the partial derivative of the sum of squared errors with respect to $\beta$ and setting it to zero:
$$y_i = \beta x_i + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma_\varepsilon^2) \tag{3}$$
$$\sum_{i=1}^{n} \left( y_i - \beta x_i \right)^2 \tag{4}$$
$$\hat{\beta} = \left( \sum_{i=1}^{n} x_i' x_i \right)^{-1} \sum_{i=1}^{n} x_i' y_i \tag{5}$$
The $\hat{\beta}$ value obtained from equation (5) is used for the number of stock orders. The $\varepsilon$ value is also used as a trading signal through Z-scoring, in the state composed of the formation-window size.
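A minimal Python sketch of this calculation, under the assumption of a single regressor without intercept as in equation (3), is given below. The helper names `ols_slope` and `z_scores`, the toy data, and the use of the population standard deviation in the Z-score are illustrative choices, not the authors' implementation.

```python
import statistics


def ols_slope(x, y):
    """Closed-form OLS slope for y_i = beta * x_i + eps_i (no intercept)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def z_scores(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Toy data standing in for one formation window.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

beta_hat = ols_slope(x, y)                               # hedge ratio
residuals = [yi - beta_hat * xi for xi, yi in zip(x, y)]  # spread
signal = z_scores(residuals)                             # trading signal
```

For the toy data the slope is 59.7 / 30 = 1.99, and the standardized residuals play the role of the trading signal.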
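2.2.2. Total Least Squares. TLS estimates parameters by minimizing the sum of squared orthogonal distances between the observations and the regression line []. Since the orthogonal distance does not change when the X and Y coordinates are exchanged, the value of $\beta$ is calculated consistently. In the TLS method, the observed values of $x_i$ and $y_i$ have the following error terms:
$$y_i = y_i^* + e_i, \quad e_i \sim N(0, \sigma_e^2) \tag{6}$$
$$x_i = x_i^* + u_i, \quad u_i \sim N(0, \sigma_u^2) \tag{7}$$
where $y_i^*$ and $x_i^*$ are true values and $e_i$ and $u_i$ are error terms following independent identical distributions. It is assumed that the true values are linearly related, as in equation (8). For convenience, we denote the error variance ratio by $\delta$ in equation (10):
$$y_i^* = \beta_0 + \beta_1 x_i^* \tag{8}$$
$$y_i = \beta_0 + \beta_1 x_i^* + e_i, \quad e_i \sim N(0, \sigma_e^2) \tag{9}$$
$$\delta = \frac{\operatorname{var}(y_i \mid x_i^*)}{\operatorname{var}(x_i \mid x_i^*)} = \frac{\sigma_e^2}{\sigma_u^2} \tag{10}$$
The orthogonal regression estimator (the case $\delta = 1$) is calculated by minimizing the sum of squared deviations in both the $y$ and $x$ directions in equation (11), which gives the slope estimate in equation (12), where $s_{XX}^2$ and $s_{YY}^2$ denote the sample variances of $x$ and $y$ and $s_{XY}$ denotes their sample covariance:
$$\sum_{i=1}^{n} \left[ \left( y_i - (\beta_0 + \beta_1 x_i^*) \right)^2 + \left( x_i - x_i^* \right)^2 \right] \tag{11}$$
$$\hat{\beta}_1 = \frac{s_{YY}^2 - s_{XX}^2 + \left[ \left( s_{YY}^2 - s_{XX}^2 \right)^2 + 4 s_{XY}^2 \right]^{1/2}}{2 s_{XY}} \tag{12}$$
The $\hat{\beta}_1$ value obtained from equation (12) is used in the same way as that obtained from equation (5), and the $\varepsilon$ value is also used as a trading signal through the Z-score in the state composed of the formation-window size.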
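The closed-form orthogonal regression slope can be sketched in a few lines of Python. The function name `tls_slope` and the toy data are assumptions for illustration; the sketch covers only the equal-error-variance case and requires a nonzero sample covariance.

```python
import math


def tls_slope(x, y):
    """Orthogonal-regression slope for equal error variances (delta = 1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    var_x = sum((xi - mx) ** 2 for xi in x) / n
    var_y = sum((yi - my) ** 2 for yi in y) / n
    cov_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
    diff = var_y - var_x
    return (diff + math.sqrt(diff ** 2 + 4.0 * cov_xy ** 2)) / (2.0 * cov_xy)


# Exactly linear data recover the true slope.
slope_exact = tls_slope([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])

# Exchanging the roles of x and y gives the reciprocal slope, which is the
# coordinate-symmetry property noted in the text.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 3.9, 6.1, 8.0, 9.8]
slope_yx = tls_slope(x, y)
slope_xy = tls_slope(y, x)
```

Unlike OLS, regressing y on x and x on y here describe the same line, so the hedge ratio does not depend on which stock is treated as the dependent variable.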
2.3. Reinforcement Learning and the Deep Q-Network. The idea of reinforcement learning is to find an optimal policy which maximizes the expected sum of discounted future rewards []. These rewards come from selecting the optimal value of each action, called the optimal Q-value. Reinforcement learning basically solves the problem defined by the Markov decision process (MDP). An MDP consists of a tuple $(S, A, P, R, \gamma)$, where $S$ is a finite set of states, $A$ is a finite set of actions, $P$ is a state transition probability matrix, $R$ is a reward function, and $\gamma$ is a discount factor. In the environment, the agent observes state $s_t$ at time $t$ and selects action $a_t$. As a result, environmental feedback is provided to the agent in the form of reward $r_t$ and next state $s_{t+1}$. An action is selected by the action-value function $Q^{\pi}(s, a)$ that represents the expected sum of discounted future rewards.
$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi} \left[ \sum_{i=t}^{T} \gamma^{i-t} r_i \,\middle|\, s_t, a_t \right] \tag{13}$$
From this action-value function $Q^{\pi}(s_t, a_t)$, we find an optimal action-value function $Q^{*}(s_t, a_t)$, following an optimal policy which maximizes the expected sum of discounted future rewards.
$$Q^{*}(s_t, a_t) = \max_{\pi} Q^{\pi}(s_t, a_t) \tag{14}$$
This optimal action-value function can be formulated as the Bellman equation.
$$Q^{*}(s_t, a_t) = \max_{a_{t+1}} \left[ r_t + \gamma Q^{*}(s_{t+1}, a_{t+1}) \right] \tag{15}$$
The DQN uses a nonlinear function approximator to estimate the action-value function. This network is trained by minimizing a sequence of loss functions $L_t(\theta_t)$, which changes with each sequence of $t$. The weight $\theta_t$ is updated as the sequence progresses:
$$L_t(\theta_t) = \mathbb{E}_{(s,a) \sim \rho(\cdot)} \left[ \left( y_t - Q(s_t, a_t; \theta_t) \right)^2 \right] \tag{16}$$
$$y_t = \max_{a_{t+1}} \left[ r_t + \gamma Q(s_{t+1}, a_{t+1}; \theta_{t-1}) \mid s_t, a_t \right] \tag{17}$$
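The target and loss in equations (16) and (17) can be illustrated numerically with a short Python sketch. The Q-values below are made-up numbers standing in for network outputs, and the function names `td_target` and `squared_td_loss` as well as the discount factor of 0.9 are assumptions for this example.

```python
def td_target(reward, next_q_values, gamma, terminal=False):
    """Bootstrapped target: r_t + gamma * max_a Q(s_{t+1}, a), or r_t if terminal."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)


def squared_td_loss(q_taken, targets):
    """Mean squared difference between targets and the Q-values of the actions taken."""
    return sum((y - q) ** 2 for q, y in zip(q_taken, targets)) / len(targets)


gamma = 0.9
# Each transition: (reward, Q-values of the next state, terminal flag).
transitions = [
    (1.0, [0.5, 2.0, 1.0], False),
    (-0.5, [0.0, 0.0, 0.0], True),
]
targets = [td_target(r, nq, gamma, done) for r, nq, done in transitions]
loss = squared_td_loss([2.5, 0.0], targets)
```

Here the first target is 1.0 + 0.9 × 2.0 = 2.8 and the second is the raw reward of the terminal transition; a gradient step on the network weights would reduce the resulting loss.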
3. Materials and Methods
3.1. Data. In this study, 50 stocks from the S&P 500 Index were selected based on their trading volume and market capitalization. To carry out the experiment, the data must cover the same period. Therefore, only the stocks whose data cover the corresponding period were retained, leaving a total of  stocks. Table 1 lists the stock names, their ticker abbreviations, and their respective sectors. We collected the adjusted daily closing prices using Thomson Reuters' database. The period of the training dataset is from January , , to December , , comprising  data points; the test dataset covers the period from January , , to July , , comprising  data points. From these datasets, pairs of stocks are selected during the training dataset period using the cointegration test.
3.2. Selecting Pairs Using the Cointegration Test. It is necessary to pair stocks which have long-run statistical relationships or similar price movements. The degree to which two stocks have had similar price movements can be measured through the correlation value. Furthermore, the long-term equilibrium of a pair of stocks is an important characteristic for the execution of pairs trading. In this study, we used the cointegration approach to select pairs of stocks. Through Johansen's method, we selected  pairs of stocks that have long-run equilibria. Table 2 shows the resulting pairs of stocks that were identified based on t-statistics, and Figure 2 shows price movements of the cointegrated stocks XOM and CVX. Using this dataset, we verify whether our proposed method performs better than the traditional pairs-trading method.
3.3. Trading Signal. After selecting the pairs, it is necessary to extract the signal for trading. To extract signals, we opt for the OLS or TLS methods. First, because the stock price follows a random walk [], we need to ensure that it follows the $I(1)$ process through the augmented Dickey-Fuller test. Subsequently, the $I(0)$ process should be created using the logarithmic difference in stock prices, which is then applied to the OLS and TLS methods. In equation (18), $\alpha_1$ is a constant value, $\beta_1$ is a hedge ratio (which is used as trading size), $\varepsilon_t$ is the error term, and $\log P_{A,t}$ and $\log P_{B,t}$ are the logarithmic differences in the stock prices of $A$ and $B$ at time $t$. We convert values of $\varepsilon_t$ into a Z-score used as a trading signal. For example, if the trading signal reaches the threshold, we short one share of the overvalued stock (represented as $\log P_{A,t}$) and long $\beta_1$ shares of the undervalued stock (represented as $\log P_{B,t}$). The hedge ratio is determined based on the window size. We set a total of six discrete window sizes to obtain the optimal window size for the experiment. Trading windows are constituted using half of the formation-window size. The spread obtained here is used as a state when applying reinforcement learning (i.e., as an input of the DQN).
$$\log P_{B,t} = \alpha_1 + \beta_1 \log P_{A,t} + \varepsilon_t \tag{18}$$
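The relation between formation and trading windows can be sketched as follows. The text only states that the trading window is half the formation window; the way consecutive windows advance (by one trading window at a time) and the standardization of the trading window with formation-window statistics are assumptions of this sketch, as are the function names.

```python
import statistics


def split_windows(series, formation):
    """Yield (formation_window, trading_window) pairs; trading = formation // 2."""
    trading = formation // 2
    windows = []
    start = 0
    while start + formation + trading <= len(series):
        form = series[start:start + formation]
        trade = series[start + formation:start + formation + trading]
        windows.append((form, trade))
        start += trading  # assumption: roll forward by one trading window
    return windows


def z_score_with(reference, values):
    """Standardize `values` using the mean and std of `reference`."""
    mean = statistics.mean(reference)
    std = statistics.pstdev(reference)
    return [(v - mean) / std for v in values]


spread = [float(i) for i in range(12)]  # placeholder spread series
windows = split_windows(spread, formation=4)
first_form, first_trade = windows[0]
first_signal = z_score_with(first_form, first_trade)
```

Using only formation-window statistics to standardize the trading window avoids look-ahead: the signal at each trading step depends solely on information available before that window starts.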
3.4. Proposed Method: Optimized Pairs-Trading Strategy Using the DQN Method. In this study, we optimize the pairs-trading strategy by treating it as a type of game and using the DQN. We attempt to implement an optimal pairs-trading strategy by taking optimal trading and stop-loss boundaries that correspond to the given spread, since performance depends on how trading and stop-loss boundaries are set in pairs trading []. Figure 3 shows the mechanism of our proposed pairs-trading strategy. Through the cointegration test, we identify pairs and, using regression analysis, obtain a hedge ratio used as trading volume and a spread used as a trading signal and state. In the case of the DQN, two hidden layers are set up, and the number of neurons, chosen through trial and error, is half of the input size. Action values consist of the six discrete spaces in Table . Each value of $a_t$ has values for trading and stop-loss boundaries.
A pairs-trading system can make a profit if the spread touches the threshold and returns to the average such that the portfolio is closed in each trading window. On the other hand, if the trading boundary is touched and the stop-loss boundary is then reached, the system tries to minimize losses by stopping trades. If the spread touches the trading boundary but fails to return to the average, the strategy may end up with a profit or a loss. In this study, the pairs-trading strategy is therefore considered as a kind of game; closing a portfolio yields a positive reward, and a portfolio that reaches its stop-loss threshold yields a negative reward. Although an exited portfolio may generate a positive profit, losses may also occur, and it is therefore set to yield a negative reward. We set the rewards for the other conditions (such as maintaining the portfolio or not opening a portfolio) to zero so as to concentrate on the close, stop-loss, and exit positions.
$$p_t = \left| V_{A,t} \times \frac{P_{A,t'} - P_{A,t}}{P_{A,t}} + V_{B,t} \times \frac{P_{B,t'} - P_{B,t}}{P_{B,t}} \right| \quad (t < t') \tag{19}$$
$$r_t = \begin{cases} 1000 \times p_t & \text{if the portfolio is closed} \\ -1000 \times p_t & \text{if the portfolio reaches the stop-loss threshold} \\ -500 \times p_t & \text{if the portfolio is exited} \end{cases} \tag{20}$$
T:estocksontheS&PIndexusedinthisstudy.
No. Ticker Stock Sector
AAPL Apple Inc. Technology
MSFT Microso Corporation Technology
BRKa Berkshire Hathaway Inc. Financial Services
JPM JPMorgan Chase & Co. Financial Services
JNJ Johnson & Johnson Healthcare
XOM Exxon Mobil Corporation Energy
BAC Bank of America Corporation Financial Services
WFC Wells Fargo & Company Financial Services
WMT Walmart Inc. Consumer Defensive
 UNH UnitedHealth Group Incorporated Healthcare
 CVX Chevron Corporation Energy
 T AT&T Inc. Communication Services
 PFE Pzer Inc. Healthcare
 ADBE Adobe Systems Incorporated Technology
 MCD McDonald’s Corporation Consumer Cyclical
 MDT Medtronic plc Healthcare
 MMM M Company Industrials
 HON Honeywell International Inc. Industrials
 GE General Electric Company Industrials
 ABT Abbott Laboratories Healthcare
 MO Altria Group, Inc. Consumer Defensive
 UNP Union Pacic Corporation Industrials
 TXN Texas Instruments Incorporated Technology
 UTX United Technologies Corporation Industrials
 LLY Eli Lilly and Company Healthcare
Figure 2: Cointegrated stock price movements. [Line plot of the prices of XOM and CVX by year, 1989 to 2017.]
We x the values of portfolio close, stop-loss, and exit
to +, , and , respectively. When we update
the Q-values, we must consider the reward as a signicant
component of eciently training the DQN. We therefore set
the reward value to have a range similar to that of the Q-
value. Additionally, we included the corresponding prot or
loss value to reect that weight aer the trading ended. In
equation (), V𝐴,𝑡 and V𝐵,𝑡 are the stock orders of stocks and
at time ,𝐴,𝑡 and 𝐵,𝑡 are the stock prices of and at time
,and𝐴,𝑡󸀠and 𝐵,𝑡󸀠are the stock prices of and at time 󸀠.
Algorithmshowstheprocessofourproposedmethod.
Before we start our proposed method, we set a replay memory
and batch size and select pairs using the cointegration test.
At each epoch, we initialized total prot to .. In the
training scheme, we set a state which has spreads within
the formation window and select actions which are used as
Complexity
50 constituent stocks of the S&P
500 Index
Filter out pairs based on trading volume, liquidity and
the cointegration test
Obtain a reward Environment
Construct pairs of stocks
Preprocess dataset
using OLS or TLS Select max Q-value
Q_values
Outputs of DQN
Deep Q-Network
Agent
SpreadHedge ratio
Inputs of DQN
F : Steps for proposed pairs-trading strategy using the DQN method.
Initialize replay memory 𝐷and batch size 𝑁
Initialize deep Q-network
Select pairs using cointegration test
() For each epoch do
() Prot = .
() For steps t = , ...until end of training data set do
() Calculate spreads using OLS or TLS methods
() Obtain initial state by converting spread to Z-score based on formation window 𝑠𝑡
() Using epsilon-greedy method, select a random action 𝑎𝑡
() Otherwise select 𝑎𝑡=𝑎𝑟𝑔𝑚𝑎𝑥𝑎𝑄(𝑠𝑡,𝑎)
() Execute traditional pairs-trading strategy based on the action selected
() Obtain reward 𝑟𝑡by performing the pairs-trading strategy
() Set next state 𝑠𝑡+1
() Store transition (𝑠𝑡,𝑎𝑡,𝑟𝑡,𝑠𝑡+1)in 𝐷
() Sample minibatch of transition (𝑠𝑡,𝑎𝑡,𝑟𝑡,𝑠𝑡+1)from 𝐷.
() 𝑦𝑡=
𝑟𝑡𝑖𝑓 𝑠𝑡+1=𝑡𝑒𝑟𝑚𝑖𝑛𝑎𝑙
𝑟𝑡+𝛾𝑚𝑎𝑥𝑎󸀠𝑄𝑠𝑡+1,𝑎󸀠𝑖𝑓 𝑠𝑡+1=𝑛𝑜𝑛 𝑡𝑒𝑟𝑚𝑖𝑛𝑎𝑙
() Update Q-network by performing a gradient descent step on {𝑦𝑡𝑄(𝑠𝑡,𝑎)}2
() End
() End
A : Optimized pairs-trading system using DQN.
trading and stop-loss boundaries. roughout the trading
window, we executed a strategy similar to a traditional pairs-
trading strategy using the action selected. Aer executing
the strategy, we obtain a reward based on the results of the
portfolio. Finally, for the Q-learning process, we update the
Q-networks by performing a gradient descent step.
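The control flow of this training scheme (epsilon-greedy action selection, replay memory, minibatch sampling, and the target update) can be sketched in plain Python. To keep the example self-contained, a lookup table replaces the neural network, so this is tabular Q-learning with experience replay rather than a DQN; the toy environment, the function and class names, and all hyperparameter values are assumptions for illustration.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity buffer of (s, a, r, s_next, terminal) transitions."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size, rng):
        k = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), k)


def epsilon_greedy(q_values, epsilon, rng):
    """Random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


def train(env_step, n_states, n_actions, steps, gamma=0.9, lr=0.1,
          epsilon=0.1, batch_size=8, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]  # table instead of a network
    memory = ReplayMemory(capacity=1000)
    state = 0
    for _ in range(steps):
        action = epsilon_greedy(q[state], epsilon, rng)
        reward, next_state, terminal = env_step(state, action, rng)
        memory.store((state, action, reward, next_state, terminal))
        for s, a, r, s2, done in memory.sample(batch_size, rng):
            target = r if done else r + gamma * max(q[s2])
            q[s][a] += lr * (target - q[s][a])  # step toward the target
        state = 0 if terminal else next_state
    return q


def toy_env(state, action, rng):
    """Hypothetical one-state environment: each action is one trading window."""
    rewards = [1.0, -1.0, 0.2]  # e.g. three candidate boundary settings
    return rewards[action], 0, True


q_table = train(toy_env, n_states=1, n_actions=3, steps=500)
best_action = max(range(3), key=q_table[0].__getitem__)
```

In the proposed method, each action would correspond to one pair of trading and stop-loss boundaries, the state would be the Z-scored spread of the formation window, and the table update would be replaced by a gradient descent step on the network weights.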
3.5. Performance Measure. We check our experiment results based on profit, maximum drawdown, and the Sharpe ratio. Profit is commonly used as a performance measure for trading strategies. It is calculated as the sum of returns taking into consideration trading cost. Since many trades can increase total profit, it is necessary to determine the total profit taking into consideration transaction costs depending on trading volume. In this study, we set a trading cost of  bp; equation (21) is almost the same as equation (19), but it does not include the absolute value, and $c$ is the trading cost. Maximum drawdown (MDD) represents the maximum cumulative loss from the highest to the lowest values of the portfolio during a given investment period, where $X(t)$ is the value of the portfolio and $T$ is the terminal time value. The Sharpe ratio is an indicator of the degree of excess profits from investing in risky assets used in evaluating portfolios []. In equation (23), $R_p$ is the expected sum of portfolio returns, $R_f$ is the risk-free rate, which we set to , and $\sigma_p$ is the standard deviation of portfolio returns.
$$\text{Profit} = \sum_{t=1}^{T} \left[ V_{1,t} \frac{P_{1,t'} - P_{1,t}}{P_{1,t}} + V_{2,t} \frac{P_{2,t'} - P_{2,t}}{P_{2,t}} - c \ast \left( V_{1,t} + V_{2,t} \right) \right] \tag{21}$$
$$\text{MDD}(T) = \max_{\tau \in (0,T)} \left[ \max_{t \in (0,\tau)} X(t) - X(\tau) \right] \tag{22}$$
Complexity
T : Summary statistics for pairs veried using cointegration
tests.
No. Pairs t-statistic Correlation
MSFT/JPM.∗∗ .
MSFT/TXN .∗∗ .
BRKa/ABT .∗∗ .
BRKa/UTX.∗∗ .
JPM/T.∗∗ .
JPM/HON.∗∗∗ .
JPM/GE.∗∗ .
JNJ/WFC.∗∗ .
XOM/CVX.∗∗∗ .
 HON/TXN .∗∗∗ .
 GE/TXN .∗∗ .
Note: ∗∗∗and ∗∗ denote a rejection of the null hypothesis at the %and
%signicance levels, respectively.
= 𝑝−𝑓
𝑝()
4. Results and Discussion
We use the stock pair XOM and CVX, which rejects the null
hypothesis at the % significance level, to verify whether our
proposed model is trained well. The sizes of the formation
window and the trading window are selected based on the
performance results with the training dataset. From these
results, we select an optimized window size and, using the test
dataset, compare our proposed model with traditional pairs
trading, which takes a constant set of actions.
4.1. Training Results. To find the optimum window size
for the optimized pairs-trading system, we experimented
with six cases. We performed the experiments based on
six window sizes, and the results for each window size are
calculated by averaging the top- results for a total of  pairs.
From Tables  and , we find that the best performance
is obtained when the formation and trading windows are 
and , respectively, based on the profit generated by both the
OLS and TLS methods. When we trained our networks, we
set a positive reward for taking more closed positions and
fewer stop-loss and exit positions. The lowest ratio of closed
positions to open positions occurs when the formation and
trading windows are  and  days (.). In contrast, the highest
ratio of closed positions occurs when the formation and
trading windows are  and  days (.). However,
the highest profits are reported when the formation and trading
windows are  and  days. This can be explained by
the ratio of stop-loss portfolios. When the formation and
trading window sizes are  and  days, the ratio of portfolio
stop-loss positions is ., but for the formation and trading
window sizes of  and  days it is .. This result
indicates that it is important to reduce the number of stop-loss
positions while increasing the number of closed positions. In
addition, the trading signals made with the TLS method are better
than those made with the OLS method for all six of the discrete
window sizes. The reason lies in the difference
between the hedge ratios of the two methods. In OLS, one
side is taken as the reference and the relative change of the
other side is estimated. Since the assumption is that there is
no error component on the reference side and there is an error
only on the other side, the hedge ratio varies depending on the
side used as the reference. In TLS, however, the hedge ratio is
the same regardless of which side is used as the reference. For this
reason, the experimental results confirm that the TLS method
is better able to determine when to execute the pairs-trading
strategy. Based on these results, we use the optimum window
size when we verify our proposed method on the test dataset.
However, we first need to ensure that the proposed model
is well-trained.
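The difference between the two hedge ratios can be checked numerically. The sketch below is illustrative only: the synthetic price series, the function names, and the use of the whole sample as the formation window for the Z-score are assumptions, not the authors' setup. It estimates the OLS slope from covariances and the TLS (orthogonal regression) slope from the first right singular vector of the centered data.

```python
import numpy as np

def ols_hedge_ratio(x, y):
    """Slope of the OLS regression of y on x (error assumed only in y)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.dot(xc, yc) / np.dot(xc, xc))

def tls_hedge_ratio(x, y):
    """Slope of the TLS fit of y on x (errors allowed in both series)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    centered = np.column_stack([x - x.mean(), y - y.mean()])
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]  # direction of largest variance of the point cloud
    return float(direction[1] / direction[0])

def zscore_spread(x, y, beta):
    """Spread y - beta * x standardized by its own mean and standard deviation."""
    spread = np.asarray(y, dtype=float) - beta * np.asarray(x, dtype=float)
    return (spread - spread.mean()) / spread.std()

# Synthetic, hypothetical price series for illustration.
rng = np.random.default_rng(0)
x = 100.0 + np.cumsum(rng.normal(size=500))
y = 1.5 * x + rng.normal(scale=2.0, size=500)

ols_y_on_x, ols_x_on_y = ols_hedge_ratio(x, y), ols_hedge_ratio(y, x)
tls_y_on_x, tls_x_on_y = tls_hedge_ratio(x, y), tls_hedge_ratio(y, x)
z = zscore_spread(x, y, tls_y_on_x)
```

For OLS, the product of the two slopes equals the squared correlation and is therefore below 1 whenever the fit is imperfect, so the implied hedge ratio depends on the reference side. For TLS, the two slopes are exact reciprocals, so the hedge ratio is consistent whichever stock is used as the reference.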
It is important to check whether our reinforcement
learning algorithm is trained well. Reference [] suggested
that a steadily increasing average of Q-values is evidence that
the DQN is learning well. Figure (a) shows the average
Q-values of HON and TXN as training progressed. We find
that the average Q-values steadily increased, indicating that
our proposed model is properly trained. In addition, we
provide a positive reward when the portfolio closes and a
negative reward when the portfolio reaches the stop-loss
threshold or exits. Figure (b) shows the ratio of the number
of portfolio positions as training progressed. The ratio of
closed to open portfolio positions increased, and the ratio
of portfolios reaching their stop-loss thresholds to open
portfolio positions decreased. We also find that the ratio of
portfolio exits to open portfolio positions slightly increased.
It is possible that the rewards given for an open portfolio
position compared to those given for a closed portfolio
position are relatively small. The DQN is therefore trained
to prevent portfolios from reaching their stop-loss thresholds
(the more important objective) over exiting them. This result
can also serve as a basis for judging whether the proposed
model is being trained properly.
Tables  and  present the performance results of XOM
and CVX in the training dataset. We call our proposed
model pairs-trading DQN (PTDQN) and refer to traditional pairs
trading with constant action values as pairs trading with
action 0 (PTA0) to pairs trading with action 5 (PTA5). From
this result, we can confirm that our proposed method is
more profitable than the constant pairs-trading strategies.
In addition, the TLS method has a higher
profitability than the OLS method. From PTA0 to
PTA5, the trading boundary and the stop-loss boundary
grow larger, and the numbers of open portfolios, closed
portfolios, and portfolios that reached their stop-loss
thresholds are reduced. In other words, there is less
opportunity for profit, but the probability of loss is also
reduced. It is important not only to take a lot of closed
positions, but also to take the best action to open and close
the portfolio. For example, if a portfolio is opened and closed
by a boundary corresponding to action  within the same spread
and if a portfolio is opened and closed by a boundary
corresponding to action , the corresponding profit is
different. Assuming that the mean reversion is certain
to occur, if we take the maximum boundary condition to
open a portfolio, we will obtain a larger profit than when
we take a smaller boundary condition. We can see that the
PTDQN returns are higher than those of the strategy with the
highest return among the traditional pairs-trading strategies
that take a constant action. Figures – show the changes in
trading and stop-loss boundaries and the highest profit for
constant action when applying the DQN method during the
training period using OLS and TLS.

Table: Setting a discrete action space.
Action              A0     A1     A2     A3     A4     A5
Trading boundary    ±0.5   ±1.0   ±1.5   ±2.0   ±2.5   ±3.0
Stop-loss boundary  ±2.5   ±3.0   ±3.5   ±4.0   ±4.5   ±5.0

Table: Results of applying the DQN method to each window size using OLS.
Columns: formation window, trading window, MDD, Sharpe ratio, profit,
and the numbers of open portfolios, closed portfolios, stop-loss
portfolios, and portfolio exits.

Table: Results of applying the DQN method to each window size using TLS.
Columns: formation window, trading window, MDD, Sharpe ratio, profit,
and the numbers of open portfolios, closed portfolios, stop-loss
portfolios, and exited portfolios.

Table: Average top- performance results for XOM and CVX using OLS within the training period.
Columns: model, MDD, Sharpe ratio, profit, and the numbers of open
portfolios, closed portfolios, stop-loss portfolios, and exited
portfolios. Rows: PTDQN and PTA0 to PTA5.

Table: Average top- performance results for XOM and CVX using TLS within the training period.
Columns: model, MDD, Sharpe ratio, profit, and the numbers of open
portfolios, closed portfolios, stop-loss portfolios, and exited
portfolios. Rows: PTDQN and PTA0 to PTA5.

Figure: Verification that our proposed model is well-trained with HON and TXN using TLS. (a) Average of Q-values versus epoch. (b) Ratio of portfolios (close, stop, exit) versus epoch.
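To make the role of the discrete actions concrete, the sketch below maps each action to its trading and stop-loss boundaries (in Z-score units, following the action-space table) and classifies the outcome of one trading window. The function name, the outcome labels, and the exact opening and closing rules (open when the Z-score lies between the two boundaries, close when it returns to or crosses zero) are simplifying assumptions for illustration, not the authors' trading rules.

```python
# Action index -> (trading boundary, stop-loss boundary), in Z-score units.
BOUNDARIES = {a: (0.5 + 0.5 * a, 2.5 + 0.5 * a) for a in range(6)}

def run_trading_window(zscores, action):
    """Classify one trading window as 'closed', 'stop_loss', 'exit', or 'no_trade'."""
    trade_b, stop_b = BOUNDARIES[action]
    position_open = False
    side = 0  # +1 if opened above the mean, -1 if opened below it
    for z in zscores:
        if not position_open:
            # Open once the spread hits the trading boundary (but not the stop-loss one).
            if trade_b <= abs(z) < stop_b:
                position_open = True
                side = 1 if z > 0 else -1
        else:
            if abs(z) >= stop_b:
                return "stop_loss"   # spread diverged further
            if z * side <= 0:
                return "closed"      # spread reverted to the mean
    # Window ended: an open position is forced out, otherwise nothing was traded.
    return "exit" if position_open else "no_trade"

# The same Z-score path gives different outcomes under different actions.
path = [0.2, 1.2, 3.4]
outcome_a1 = run_trading_window(path, action=1)  # boundaries ±1.0 / ±3.0
outcome_a3 = run_trading_window(path, action=3)  # boundaries ±2.0 / ±4.0
```

With action 1 the position opens at 1.2 and is stopped out at 3.4, whereas with action 3 it only opens at 3.4 and is still open when the window ends, which illustrates why the choice of boundary matters for a given spread.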
Figures  and  show comparisons of PTDQN and
PTA using the TLS method. Figure  consists of the spread
and the trading and stop-loss boundaries. We find that the trading
and stop-loss boundaries take different values in PTDQN,
showing that it has learned to find the optimal boundary
according to each spread. In contrast to PTDQN, PTA
in Figure  has constant trading and stop-loss boundaries.
Figures  and  exhibit the same features we see in Figures
 and . The difference between these methods lies in the
spreads: different results can be obtained depending on the
spreads used. Constructing better spreads can therefore improve
performance.
Figures  and  present the profit corresponding to
DQN and constant actions using TLS and OLS. Reference
[] suggested that an average value over multiple trials
should be presented to show the reproducibility of deep
reinforcement learning, because results may differ owing to
high variance across trials and random seeds. We
therefore conducted five trials with different random seeds.
The profit graph of DQN shows the average profit of
these trials, with the filled region spanning the maximum and
minimum profit values. We can see that PTDQN had a higher
profit than the traditional pairs-trading strategies during
the training period. This means that, even with the same
spread, profit changes as the boundaries
are changed. In other words, finding the optimal boundary
for the spread is an important factor in optimizing the
profitability of pairs trading.

Figure: An example of optimizing pairs trading using PTDQN based on a training scheme using TLS. Two panels of Z-score versus year, showing the trading signal, trading boundary, and stop-loss boundary.

Figure: An example of PTA based on a training scheme using TLS. Z-score versus year, showing the trading signal, trading boundary, and stop-loss boundary.
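The averaging over random seeds described above can be sketched as follows. This is only an illustration: `run_trial` is a hypothetical stand-in that generates a random-walk profit path instead of training a DQN, and the path length and drift are made-up values.

```python
import numpy as np

def run_trial(seed, n_steps=250):
    """Hypothetical stand-in for one training run: a random-walk profit path."""
    rng = np.random.default_rng(seed)
    return 1.0 + np.cumsum(rng.normal(0.001, 0.01, size=n_steps))

def summarize_trials(profit_paths):
    """Return the per-step mean, minimum, and maximum across trials."""
    paths = np.asarray(profit_paths, dtype=float)  # shape: (n_trials, n_steps)
    return paths.mean(axis=0), paths.min(axis=0), paths.max(axis=0)

# Five trials with different random seeds, as in the text.
paths = [run_trial(seed) for seed in range(5)]
mean_path, low, high = summarize_trials(paths)
```

The mean path corresponds to the plotted average profit, and the `low` and `high` arrays bound the filled region; with matplotlib, such a band can be drawn with `fill_between`.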
4.2. Test Results. Tables  and  show the average
performance measures of each pair tested by applying the top-
trained models. We can see that the constant action with
the highest return differs across pairs, and that profit is
higher with the TLS method than with the OLS method for all
pairs, as shown above. We also find that PTDQN
performs better than traditional pairs-trading strategies.
The pair with the highest profit using the proposed method is
HON and TXN (.); it also shows the biggest difference
between the DQN method and the optimal constant action
(.). We find that the proposed method has a higher
Sharpe ratio in all pairs except for MO and UTX when the
TLS method is used. If we added the Sharpe ratio to
the total profit as an objective function, we could build a more
optimized pairs-trading system. Based on these results, we
can confirm the robustness of our proposed method for our
dataset. The proposed method can be applied to other pairs
of stocks found in other global markets.

Figure: An example of optimizing PTDQN based on a training scheme using OLS. Z-score versus year, showing the trading signal, trading boundary, and stop-loss boundary.

Figure: An example of PTA based on a training scheme using OLS. Z-score versus year, showing the trading signal, trading boundary, and stop-loss boundary.

Figure: Average top- profits generated by PTDQN and traditional pairs-trading strategies (PTA0 to PTA5) using TLS in training periods. Profit versus year.

Figure: Average top- profits generated by PTDQN and traditional pairs-trading strategies (PTA0 to PTA5) using OLS in training periods. Profit versus year.
In Figure , we can see that our proposed method,
PTDQN, outperforms the traditional pairs-trading strategies
that have constant actions in the test dataset. The crucial aspect
of this method is that, for each spread, it selects the boundary
that yields the highest profit among the constant actions, each
of which acts as a constant boundary. The profit trend is
therefore the same as that of the traditional pairs-trading
strategies; however, because the boundaries with the highest
profit for each spread are combined, PTDQN achieves a higher
profit than the traditional pairs-trading strategies. This method can
therefore be applied in various fields when there is a need
to optimize the efficiency of a rule-based strategy [, ].
In this study, we consider the spread and the boundaries to be the
important factors of a pairs-trading strategy. We therefore
optimized the pairs-trading strategy with various trading and
stop-loss boundaries using deep reinforcement learning, and
our method outperforms rule-based strategies. Optimizing
key parameters in rule-based methods can thus improve their
performance.
Pairs trading uses two stocks that share the same
trend. However, this relationship can break down due to various
factors such as economic issues and company risk. In this
situation, the spread between the two stocks becomes extremely
large. Although this situation cannot be avoided, we hedge this
risk by taking a dynamic boundary. In this sense, taking the
lowest stop-loss boundary is the best choice, since the position
can then be exited with the least loss. By taking the dynamic
boundary using the deep reinforcement learning method, we can
see that not only are profits increased, but losses are also
minimized compared with taking a fixed boundary.
5. Conclusions
We propose a novel approach to optimizing the pairs-trading
strategy using a deep reinforcement learning method,
specifically deep Q-networks. We posed two key research
questions. First, if we set a dynamic boundary based on
the spread in each trading window, can it achieve a higher profit
than the traditional pairs-trading strategy? Second, can a
deep reinforcement learning method be trained
to follow this mechanism? To investigate these questions,
we collected pairs selected using the cointegration test. We
experimented with how the results varied according to the
spread and the method used. We therefore set different
spreads using the OLS and TLS methods as the input of the DQN
and the trading signal. To conduct this experiment, we set
up a formation window and a trading window. The hedge
ratio, which is an important factor in determining how much
stock to take, depends on these windows. We therefore applied
the OLS and TLS methods and experimented to find the
optimal window size by varying the formation window and
the trading window.
Tables  and  show the average performance values of
the formation windows and trading windows in the training
dataset. The results show that performance was higher for all
six window sizes when TLS spreads were used than when OLS
spreads were used. In addition, profitability gradually increases
as the formation and trading windows decrease for both the TLS
and OLS methods. The reason is that, although
the ratio of closed positions is the lowest for the formation
and trading windows we selected, the ratio of stop-loss
positions is also the lowest compared with the other
formation and trading windows. This means that reducing
stop-loss positions is as important as increasing
closed positions for making a profit. Using the optimal
window size, we then checked whether our DQN was properly
trained. Over the epochs, we find that the average Q-value
steadily increased, the ratio of closed portfolios increased,
and the ratio of portfolios that reached their stop-loss
thresholds decreased, confirming that our DQN is trained
well. Based on these results, we find that our proposed model
with a formation window of  and
a trading window of  had results that were superior to
those of traditional pairs-trading strategies in the
out-of-sample dataset. In Figure , we can see that the profit path of
PTDQN is similar to those of PTA0 to PTA5, but better than that from
other methods. This shows that taking dynamic boundaries
based on our method is efficient in optimizing the pairs-trading
strategy. During periods of economic uncertainty, managing
pairs-trading strategies, including our proposed method, can be
risky. However, our reward function penalizes cases in which
the spread suddenly becomes large, and because the network is
trained to maximize the expected sum of future rewards, it
learns to guard against this situation by choosing a smaller
stop-loss boundary. Therefore, our proposed method can reduce
the risk when economic risks appear, compared with the
traditional pairs-trading strategy with a fixed boundary.

Figure: Average top- profits of PTDQN and PTA0 to PTA5 using TLS with the test dataset. Profit versus year for each pair: (a) MSFT/JPM, (b) MSFT/TXN, (c) BRKa/ABT, (d) BRKa/UTX, (e) JPM/T, (f) JPM/HON, (g) JPM/GE, (h) JNJ/WFC, (i) XOM/CVX, (j) HON/TXN, (k) GE/TXN.
From the experimental results, we show that our method
can be applied in the pairs-trading system. It can also be applied
in various fields, including finance and economics, when
there is a need to optimize the efficiency of a rule-based
strategy. Furthermore, we find that our method outperforms
the traditional pairs-trading strategy in all pairs based on
constituent stocks in the S&P 500. This study
focused only on spreads formed by two stocks that have a
long-term equilibrium pattern. Since our method selects
optimal boundaries based on spreads, it can be applied,
given appropriate cointegrated pairs, to other stock markets
such as KOSPI, Nikkei, and Hang Seng.
In future work, we can develop our proposed model as
follows. First, as profit was set as the objective function in this
study, the performance of the model is lower than that of
traditional pairs trading when based on other performance
measures. It may therefore be possible to create a
better-optimized pairs-trading strategy by including these other
performance indicators as part of the objective function. Second,
we can use other statistical methods, such as the Kalman filter and
error-correction models, to construct more diverse spreads. Finally,
it may be possible to create a more optimized pairs-trading strategy
by continuously changing the discrete set of window sizes
and boundaries. We will address these issues in future
studies.
Data Availability
The data used to support the findings of this study have
been deposited in the figshare repository (DOI: ./
m.figshare.).
Disclosure
The funders had no role in the study design, data collection
and analysis, decision to publish, or preparation of
the manuscript. This work represents a part of the study
conducted as a Master Thesis in Financial Engineering during
 and  at the University of Ajou, Republic of Korea.

Table: Average top- performance results of the proposed method and the traditional pairs-trading strategy in the out-of-sample dataset
using TLS.
Columns: pairs, model, MDD, Sharpe ratio, profit, and the numbers of
open portfolios, closed portfolios, stop-loss portfolios, and exited
portfolios. Rows: PTDQN and PTA0 to PTA5 for each of the pairs
MSFT/JPM, MSFT/TXN, BRKa/ABT, BRKa/UTX, JPM/T, JPM/HON, JPM/GE,
JNJ/WFC, XOM/CVX, HON/TXN, GE/TXN, and MO/UTX.

Table: Average top- performance results of the proposed method and the traditional pairs-trading strategy in the out-of-sample dataset
using OLS.
Columns: pairs, model, MDD, Sharpe ratio, profit, and the numbers of
open portfolios, closed portfolios, stop-loss portfolios, and exited
portfolios. Rows: PTDQN and PTA0 to PTA5 for each of the pairs
MSFT/JPM, MSFT/TXN, BRKa/ABT, BRKa/UTX, JPM/T, JPM/HON, JPM/GE,
JNJ/WFC, XOM/CVX, HON/TXN, GE/TXN, and MO/UTX.
Conflicts of Interest
The authors declare that there are no conflicts of interest
regarding the publication of this paper.
Acknowledgments
This work was supported by the National Research Foundation
of Korea (NRF) grant funded by the Korea Government
(MSIT: Ministry of Science and ICT) (No. NRF-
RCB).
References
[1] E. Gatev, W. N. Goetzmann, and K. G. Rouwenhorst, “Pairs trading: performance of a relative-value arbitrage rule,” Yale ICF Working Paper No. 08-03, https://ssrn.com/abstract = or http://dx.doi.org/./ssrn..
[2] R. J. Elliott, J. van der Hoek, and W. P. Malcolm, “Pairs trading,” Quantitative Finance.
[3] S. Andrade, V. Di Pietro, and M. Seasholes, “Understanding the profitability of pairs trading.”
[4] G. Hong and R. Susmel, “Pairs-trading in the Asian ADR market,” Univ. Houston, Unpubl. Manuscr.
[5] E. Gatev, W. N. Goetzmann, and K. G. Rouwenhorst, “Pairs trading: performance of a relative-value arbitrage rule,” Review of Financial Studies.
[6] B. Do and R. Faff, “Does simple pairs trading still work?” Financial Analysts Journal.
[7] S. Mudchanatongsuk, J. A. Primbs, and W. Wong, “Optimal pairs trading: A stochastic control approach,” in Proceedings of the 2008 American Control Conference, ACC, USA, June.
[8] A. Tourin and R. Yan, “Dynamic pairs trading using the stochastic control approach,” Journal of Economic Dynamics & Control.
[9] Z. Zeng and C. Lee, “Pairs trading: optimal thresholds and profitability,” Quantitative Finance.
[10] S. Fallahpour, H. Hakimian, K. Taheri, and E. Ramezanifar, “Pairs trading strategy optimization using the reinforcement learning method: a cointegration approach,” Soft Computing.
[11] P. Nath, “High frequency pairs trading with U.S. treasury securities: risks and rewards for hedge funds,” SSRN Electronic Journal.
[12] T. Leung and X. Li, “Optimal mean reversion trading with transaction costs and stop-loss exit,” International Journal of Theoretical and Applied Finance.
[13] E. Ekström, C. Lindberg, and J. Tysk, “Optimal liquidation of a pairs trade,” in Advanced Mathematical Methods for Finance, Springer, Heidelberg.
[14] Y. Lin, M. McCrae, and C. Gulati, “Loss protection in pairs trading through minimum profit bounds: A cointegration approach,” Journal of Applied Mathematics and Decision Sciences.
[15] A. Mikkelsen, “Pairs trading: the case of Norwegian seafood companies,” Applied Economics.
[16] K. Kim, “Performance analysis of pairs trading strategy utilizing high frequency data with an application to KOSPI  Equities,” SSRN Electronic Journal.
[17] V. Holý and P. Tomanová, Estimation of Ornstein-Uhlenbeck Process Using Ultra-High-Frequency Data with Application to Intraday Pairs Trading Strategy.
[18] D. Chen, J. Cui, Y. Gao, and L. Wu, “Pairs trading in Chinese commodity futures markets: an adaptive cointegration approach,” Accounting & Finance.
[19] H. Puspaningrum, Y. Lin, and C. M. Gulati, “Finding the optimal pre-set boundaries for pairs trading strategy based on cointegration technique,” Journal of Statistical Theory and Practice.
[20] A. A. Roa, “Pairs trading: optimal threshold strategies.”
[21] V. Mnih, K. Kavukcuoglu, D. Silver et al., “Playing atari with deep reinforcement learning,” https://arxiv.org/abs/.
[22] Y. Wang, D. Wang, S. Zhang, Y. Feng, S. Li, and Q. Zhou, “Deep Q-trading,” http://cslt.riit.tsinghua.edu.cn/.
[23] C.-Y. Huang, “Financial trading as a game: a deep reinforcement learning approach,” https://arxiv.org/abs/.
[24] T. Kim, Optimizing the pairs trading strategy using Deep reinforcement learning [M.S. thesis], Ajou University, Suwon, Republic of Korea.
[25] B. Do, R. Faff, and K. Hamza, “A new approach to modeling and estimation for pairs trading,” in Proceedings of the 2006 Financial Management Association European Conference.
[26] R. D. Dittmar, C. J. Neely, and P. A. Weller, “Is technical analysis in the foreign exchange market profitable? A genetic programming approach,” Journal of Financial and Quantitative Analysis.
[27] H. Rad, R. K. Low, and R. Faff, “The profitability of pairs trading strategies: distance, cointegration and copula methods,” Quantitative Finance.
[28] S. Johansen, “Statistical analysis of cointegration vectors,” Journal of Economic Dynamics and Control.
[29] M. H. Kutner, C. J. Nachtsheim, J. Neter, and W. Li, “Applied linear statistical models.”
[30] G. H. Golub and C. F. Van Loan, “An analysis of the total least squares problem,” SIAM Journal on Numerical Analysis.
[31] R. S. Sutton and A. G. Barto, “Introduction to reinforcement learning,” Learning.
[32] E. F. Fama, “Random walks in stock market prices,” Financial Analysts Journal.
[33] W. F. Sharpe, “The Sharpe ratio,” The Journal of Portfolio
Management,.
[] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup,
and D. Meger, “Deep reinforcement learning that matters,” in
Proceedings of the irthy-Second AAAI Conference On Artificial
Intelligence (AAAI),.
[] Y.H.Li,X.M.Lu,andN.C.Kar,“Rule-basedcontrolstrategy
with novel parameters optimization using NSGA-II for power-
split PHEV operation cost minimization,” IEEE Transactions on
Vehicular Technology,vol.,no.,pp.,.
[] L. Dymova, P. Sevastianov, and K. Kaczmarek, “A stock trading
expert system based on the rule-base evidential reasoning using
Level  Quotes,ExpertSystemswithApplications,vol.,no.,
pp. –, .
... In prior efforts to set the thresholds, referred to as actions in a reinforcement learning (RL) framework, Fallahpour et al. [5] and Kim and Kim [22] use a limited action set with 6 or 39 actions, respectively, which represent significant limits on investment performance. Therefore, we consider a much larger set of about 2800 open and stop-loss threshold recommendations determined by the maximum price deviations during the training set to cover all possible trading scenarios. ...
... However, this naive mechanism cannot capture various properties of different stock pairs, so it is outperformed by other approaches, in terms of our experimental results. Kim and Kim [22] instead use a deep Q-network (DQN) and heuristically set six overly simplistic actions, which significantly limits the profitability of their approach. In addition, they train each PTS-eligible stock pair with a DQN, which necessitates a large number of DQNs. ...
... The motivation of our proposed RLM is that it reduces the number of candidate thresholds to eliminate unconverged training without significantly harming the investment performance. In addition, Fallahpour et al. [5], Kim and Kim [22], Brim [1], and Kim et al. [21] train each reinforcement learning model with the sequential trading data of a specific stock pair. Such a design causes these papers to focus on trading on selected stock pairs since it is impractical to train the many models needed to cover all possible PTS-eligible stock pairs. ...
Article
A pairs trading strategy (PTS) constructs and monitors a stationary portfolio by shorting (longing) when the portfolio is adequately over- (under-)priced measured by a predetermined open threshold. We close this position to earn the price differences when the portfolio’s value reverts back to the mean level. When the portfolio is significantly over- (under-)priced measured by another predetermined stop-loss threshold, we close the position to stop loss. This paper develops a two-stage deep learning method to improve the investment performance of a PTS. Note that the literature executes a PTS by selecting the best trigger threshold (a combination of open and stop-loss thresholds) from a restricted, heuristically-determined set of trigger thresholds. Such a design significantly degrades investment performance. However, selecting the best threshold from all possible thresholds yields a non-converged training problem. To resolve this dilemma, we propose in the first stage of our method a representative label mechanism by which to construct a set of candidate trigger thresholds based on all possible thresholds and then train a deep learning (DL) model to select the best from the set. Experiments demonstrate that the proposed first-stage method avoids the non-converged training problem and outperforms most state-of-the-art methods. To further reduce the trading risk, the second stage trains another DL with the profitability of each trade labeled by executing the PTS with trigger thresholds recommended in the first-stage mechanism to remove unprofitable trades. Compared to models that indirectly judge profitability by price movement similarity without considering the quality of the recommended trigger thresholds, our model produces higher win rates and average profits. 
Furthermore, we find that training with the PTS portfolio value process exhibiting time invariance clearly outperforms training with only time-varying stock/return processes, even though the latter training set contains more information. This is because unpredictable changes in market trends cause the model to learn time-varying patterns from the training set that may not apply to the testing set.
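The open, close, and stop-loss rule described in the abstract above can be illustrated with a minimal sketch. This is not the method of any of the cited papers: the function name `run_threshold_rule`, the tuple layout of the result, and the assumption that the spread is already a stationary series centered on `mean` are all hypothetical choices made for illustration. Transaction costs and position sizing are ignored.

```python
from typing import List, Tuple

Trade = Tuple[int, int, str, float]  # (entry index, exit index, outcome, pnl)


def run_threshold_rule(
    spread: List[float],
    open_thr: float,
    stop_thr: float,
    mean: float = 0.0,
) -> List[Trade]:
    """Apply one fixed (open, stop-loss) threshold pair to a spread series.

    Hypothetical helper for illustration only. The spread is assumed to be
    stationary around `mean`; a position still open at the end is ignored.
    """
    if not 0 < open_thr < stop_thr:
        raise ValueError("expected 0 < open_thr < stop_thr")

    trades: List[Trade] = []
    side = 0          # -1: short the spread, +1: long the spread, 0: flat
    entry_i = -1
    entry_val = 0.0

    for i, value in enumerate(spread):
        dev = value - mean
        if side == 0:
            # Open only when the deviation has reached the open threshold
            # but has not already breached the stop-loss threshold.
            if open_thr <= abs(dev) < stop_thr:
                side = -1 if dev > 0 else 1
                entry_i, entry_val = i, value
        elif dev * side >= 0:
            # The spread has reverted to (or crossed) the mean: take profit.
            trades.append((entry_i, i, "close", side * (value - entry_val)))
            side = 0
        elif abs(dev) >= stop_thr:
            # The deviation widened past the stop-loss threshold: cut the loss.
            trades.append((entry_i, i, "stop_loss", side * (value - entry_val)))
            side = 0

    return trades
```

For example, calling `run_threshold_rule([0.0, 1.2, 0.6, -0.1, -1.5, -2.6, 0.0], 1.0, 2.5)` opens a short position at index 1 that closes with a profit at index 3, then opens a long position at index 4 that is stopped out with a loss at index 5. A learning-based approach such as those discussed above would choose `open_thr` and `stop_thr` per trading window rather than fixing them in advance.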
... [13] used cointegration method to select trading pairs, and adopted Q-Learning [48] to select optimal trading parameters. Kim and Kim introduced a deep Q-network [32] to select the best trading threshold for cointegration approaches [22]. [28] proposed to detect structural changes and improve reinforcement learning trading methods. ...
Preprint
Pair trading is one of the most effective statistical arbitrage strategies which seeks a neutral profit by hedging a pair of selected assets. Existing methods generally decompose the task into two separate steps: pair selection and trading. However, the decoupling of two closely related subtasks can block information propagation and lead to limited overall performance. For pair selection, ignoring the trading performance results in the wrong assets being selected with irrelevant price movements, while the agent trained for trading can overfit to the selected assets without any historical information of other assets. To address it, in this paper, we propose a paradigm for automatic pair trading as a unified task rather than a two-step pipeline. We design a hierarchical reinforcement learning framework to jointly learn and optimize two subtasks. A high-level policy would select two assets from all possible combinations and a low-level policy would then perform a series of trading actions. Experimental results on real-world stock data demonstrate the effectiveness of our method on pair trading compared with both existing pair selection and trading methods.
... Among the other alternative approaches for stock price prediction, time series decomposition-based statistical and econometric approaches are also quite popular [27][28][29][30][31][32][33][34][35][36][37][38][39][40]. Over the last few years, reinforcement learning has been extensively used in the robust and accurate prediction of stock prices and portfolio design [41][42][43][44][45][46][47][48]. The classical mean-variance optimization approach is the most well-known method for portfolio optimization [49][50][51][52][53][54][55][56][57][58][59][60]. ...
Preprint
Portfolio optimization is a classical convex optimization problem that has posed a difficult challenge for the research community of finance and investment analysis. The optimization problem becomes particularly difficult due to the volatile nature of the stock prices, as a robust estimation of their future values, in most cases, becomes very challenging. In this paper, three ratio-maximization approaches to the mean-variance portfolio design are proposed. The three ratios are the Sharpe ratio, the Sortino ratio, and the Calmar ratio. The three design methods are applied to the stocks chosen from seven sectors of the National Stock Exchange (NSE) of India. The portfolios are backtested over the training and the test period, and those fetching the maximum cumulative returns for the majority among the seven sectors are identified. Very useful insights on the return on investment for important sectors of the NSE are found in the results.
Research Proposal
This is the proposal of the chapter titled "Portfolio Optimization Using Deep Reinforcement Learning and Hierarchical Risk Parity Approaches". The chapter proposal has been accepted in the following book. Title: Data Analytics and Computational Intelligence: Novel Models, Algorithms and Applications. Editors: Dr. Laura CRUZ-REYES, Tecnológico Nacional de México (Mexico); Dr. Bernabé DORRONSORO, Universidad de Cádiz (Spain); Dr. Gilberto RIVERA, Universidad Autónoma de Ciudad Juárez (Mexico); Dr. Alejandro ROSETE, Universidad Tecnológica de La Habana "José Antonio Echeverría" (Cuba). Publishers: Studies in Big Data Series, Springer. Expected Publication: Second half of 2023.
Presentation
This is the presentation of my paper titled "A Comparative Analysis of Portfolio Optimization Using Reinforcement Learning and Hierarchical Risk Parity Approaches". The paper has been accepted for oral presentation and publication in the proceedings of the 9th International Conference on Business Analytics and Intelligence (BAICONF'22). The conference will be organized at Indian Institute of Management, Bangalore, India, from December 15 to 17, 2022.
Presentation
This is the presentation of my paper titled "Optimum Pair-Trading Strategies for Stocks Using Cointegration-Based Approach". The paper has been accepted for oral presentation and publication in the proceedings of the 20th IEEE OTIS International Conference on Information Technology (OCIT'22). The conference will be organized in Bhubaneswar, India, from December 14 to 16, 2022.
Article
This paper develops a pairs trading strategy via unsupervised learning. Unlike conventional pairs trading strategies that identify pairs based on return time series, we identify pairs by incorporating firm characteristics as well as price information. Firm characteristics are revealed to provide important information for pair identification and significantly improve the performance of the pairs trading strategy. Applied to the US stock market from January 1980 to December 2020, the long-short portfolio constructed via the agglomerative clustering earns a statistically significant annualized mean return of 24.8% and a Sharpe ratio of 2.69. The strategy remains profitable after accounting for transaction costs and removing stocks below 20% NYSE-size quantile. A host of robustness tests confirm that the results are not driven by data snooping.
Preprint
This is an extended abstract of the paper that has been accepted for oral presentation and publication in the proceedings of the 9th International Conference on Business Analytics and Intelligence (BAICONF'22). The conference will be organized at the Indian Institute of Management Bangalore (IIMB), Bangalore, India, during December 15-17, 2022.
Presentation
This is the presentation of my paper (paper id: 267) which has been accepted for oral presentation and publication in the proceedings of the 2nd Asian International Conference on Innovation in Technology (ASIANCON). The conference will be organized in Pune, India, during August 26-27, 2022. The paper will be listed in the IEEE Xplore.
Article
In recent years, significant progress has been made in solving challenging problems across various domains using deep reinforcement learning (RL). Reproducing existing work and accurately judging the improvements offered by novel methods is vital to maintaining this rapid progress. Unfortunately, reproducing results for state-of-the-art deep RL methods is seldom straightforward. In particular, non-determinism in standard benchmark environments, combined with variance intrinsic to the methods, can make reported results difficult to interpret. Without significance metrics and tighter standardization of experimental reporting, it is difficult to determine whether improvements over the prior state-of-the-art are meaningful. In this paper, we investigate challenges posed by reproducibility, proper experimental techniques, and reporting procedures. We illustrate the variability in reported metrics and results when comparing against common baselines, and suggest guidelines to make future results in deep RL more reproducible. We aim to spur discussion about how to ensure continued progress in the field, by minimizing wasted effort stemming from results that are non-reproducible and easily misinterpreted.
Article
Recent studies show that the popularity of the pairs trading strategy has been growing, which may pose a problem as the opportunities to trade become much smaller. Therefore, the optimization of the pairs trading strategy has gained widespread attention among high-frequency traders. In this paper, using reinforcement learning, we examine the optimum level of pairs trading specifications over time. More specifically, the reinforcement learning agent chooses the optimum level of parameters of pairs trading to maximize the objective function. Results are obtained by applying a combination of the reinforcement learning method and the cointegration approach. We find that boosting pairs trading specifications by using the proposed approach significantly outperforms the previous methods. Empirical results based on comprehensive intraday data obtained from S&P500 constituent stocks confirm the efficiency of our proposed method.
Article
Motivated by the industry practice of pairs trading, we study the optimal timing strategies for trading a mean-reverting price spread. An optimal double stopping problem is formulated to analyze the timing to start and subsequently liquidate the position subject to transaction costs. Modeling the price spread by an Ornstein-Uhlenbeck process, we apply a probabilistic methodology and rigorously derive the optimal price intervals for market entry and exit. As an extension, we incorporate a stop-loss constraint to limit the maximum loss. We show that the entry region is characterized by a bounded price interval that lies strictly above the stop-loss level. As for the exit timing, a higher stop-loss level always implies a lower optimal take-profit level. Both analytical and numerical results are provided to illustrate the dependence of timing strategies on model parameters such as transaction costs and stop-loss level.
Article
Pairs trading is a speculative investment strategy based on relative mispricing between a pair of stocks. Essentially, the strategy involves choosing a pair of stocks that historically move together. By taking a long-short position on this pair when they diverge, a profit will be made when they next converge to the mean by unwinding the position. Literature on this topic is rare due to its proprietary nature. Where it does exist, the strategies are either ad hoc or applicable to special cases only, with little theoretical verification. This paper analyzes these existing methods in detail and proposes a general approach to modeling relative mispricing for pairs trading purposes, with reference to the mainstream asset pricing theory. Several estimation techniques are discussed and tested for state space formulation, with Expectation Maximization producing stable results. Initial empirical evidence shows clear mean reversion behavior in selected pairs' relative pricing.
Article
This study comprehensively examines pairs trading in Chinese commodity futures markets, which, although less researched, represents an important scenario for analysing commodity price behaviour. Based on a sample of daily future returns from 2006 to 2016, we propose a cointegration model that employs an adaptive learning process, and we show that our model yields an average annualised return of 26.94 percent before trading costs, using a closed-loop strategy. Our results are robust to various tests, including parameter uncertainty, holding period constraints, trading period selection and trading costs.
Article
In this article, I investigate the performance of a pairs trading strategy on 18 seafood company stocks traded in the Norwegian consumer goods sector on the Oslo Stock Exchange. I apply both high-frequency and daily data from January 2005 to December 2014. I use two approaches – a distance approach and a cointegration approach – and compare the results. For both the distance and the cointegration approaches, nonconvergence of the pairs is high, which may indicate that more fundamental information about the companies traded should be accounted for. None of the strategies evaluated had significant profits after accounting for transaction costs. It therefore remains unclear which approach is best suited for pairs selection. Using high-frequency data yielded empirical distributions that were symmetrical and had a lower degree of leptokurtosis compared to the daily data.
Article
We examine and compare the performance of three different pairs trading strategies (the distance, cointegration, and copula methods) on the US equity market from 1962 to 2014 using a time-varying series of trading costs. Using various performance measures, we conclude that the cointegration strategy performs as well as the distance method. However, the copula method shows relatively poor performance. In particular, the distance, cointegration, and copula methods show a mean monthly excess return of 36, 33, and 5 bps after transaction costs and 88, 83, and 43 bps before transaction costs. In recent years, the distance and cointegration methods have presented fewer trading opportunities, whereas this frequency remains stable for the copula method. While the liquidity factor is negatively correlated to all strategies' returns, we find no evidence of their correlation to market excess returns. All strategies show positive and significant alphas after accounting for various risk factors.
Article
One of the major considerations in the automotive industry is the reduction of hybrid electric vehicle fuel consumption and operation cost. This paper is the first to use the nondominated sorting genetic algorithm-II (NSGA-II) for power-split plug-in hybrid electric vehicle (PHEV) applications. The NSGA-II, one of the most efficient multiobjective genetic algorithms (MOGAs), simultaneously optimized operation cost, including gasoline and electricity consumption. The Pareto optimal solutions are discussed for the parameter calibrations of the rule-based control strategy as a useful guide in PHEV development, particularly in the earlier phases. The optimized operation cost at the different power-split device (PSD) gear ratios is used to determine the ideal PSD gear ratio to further minimize the operation cost. To validate the proposed strategy, dynamic PSD and powertrain models of PHEV are developed in the numerical analysis. The two typically different driving cycles, namely, the Urban Dynamometer Driving Schedule (UDDS) and the Highway Fuel Economic Drive Schedule (HWFET), with different numbers of driving cycles, are used for control strategy optimization.