Content available from Complexity
Research Article
Optimizing the Pairs-Trading Strategy Using Deep
Reinforcement Learning with Trading and Stop-Loss Boundaries
Taewook Kim 1,2 and Ha Young Kim 3
1Qra Technologies, Inc., Ttukseom-ro 1-gil, Sungdong-gu, Seoul 04778, Republic of Korea
2Department of Financial Engineering, Ajou University, Worldcupro 206, Yeongtong-gu, Suwon 16499, Republic of Korea
3Graduate School of Information, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul, 03722, Republic of Korea
Correspondence should be addressed to Ha Young Kim; haimkgetup@gmail.com
Received 6 February 2019; Revised 14 April 2019; Accepted 11 June 2019; Published 12 November 2019
Guest Editor: Benjamin M. Tabak
Copyright © Taewook Kim and Ha Young Kim. This is an open access article distributed under the Creative Commons
Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is
properly cited.
Many researchers have tried to optimize pairs trading as the number of opportunities for arbitrage profit has gradually decreased.
Pairs trading is a market-neutral strategy; it profits if the given condition is satisfied within a given trading window, and if not,
there is a risk of loss. In this study, we propose an optimized pairs-trading strategy using deep reinforcement learning, particularly
the deep Q-network, utilizing various trading and stop-loss boundaries. More specifically, if spreads hit trading thresholds
and reverse to the mean, the agent receives a positive reward. However, if spreads hit stop-loss thresholds or fail to reverse to the
mean after hitting the trading thresholds, the agent receives a negative reward. The agent is trained to select the optimum level
of discretized trading and stop-loss boundaries given a spread to maximize the expected sum of discounted future profits. Pairs
are selected from stocks on the S&P 500 Index using a cointegration test. We compared our proposed method with traditional
pairs-trading strategies which use constant trading and stop-loss boundaries. We find that our proposed model is trained well and
outperforms traditional pairs-trading strategies.
1. Introduction
Pairs trading is a method for obtaining arbitrage profit when
there is a statistical difference between two stocks with similar
characteristics that are cointegrated or highly correlated. This
is possible because the spread formed by two such stocks
is mean-reverting in the long run []. In
the early days, pairs-trading methods were popular because of
the opportunity to obtain arbitrage profit [–]. However, as
many investors including hedge funds sought these arbitrage
opportunities by executing the pairs-trading strategy, its
profitability began to deteriorate [, ]. To overcome these
shortcomings, significant research has been conducted to
improve the pairs-trading strategy [–].
The mechanism of pairs trading is as follows. First, a pair
of stocks with similar trends is identified. Second, regression
analysis such as ordinary least squares (OLS), total least
squares (TLS), or an error correction model (ECM) is used
to calculate the spread of these stocks. Finally, if the spread
hits preset boundaries, investors open a portfolio which
takes a long position on the undervalued stock and shorts
the overvalued stock. Subsequently, if the spread reverses
to the mean, investors close the portfolio by taking the
positions opposite to those of the open portfolio. In this case, the
investor obtains an arbitrage profit by executing this strategy.
However, there is a risk when the spread does not reverse
to the mean. In such a situation, investors are at high risk
because they cannot close the portfolio. By setting a stop-loss
boundary, investors can hedge the risk [–].
Many researchers have applied various statistical methods
to improve the efficiency and performance of pairs trading.
In particular, they focused on using the spread as a trading
signal. The study in [] collected pairs of stocks based on
minimizing the sum of squared deviations between the two
stocks and then executed the trading strategy if the difference
between the pairs is twice the standard deviation of the
spread. They used normalized US stock price data from
to to test the profitability of pairs trading. The
Hindawi
Complexity
Volume 2019, Article ID 3582516, 20 pages
https://doi.org/10.1155/2019/3582516
study in [] used the cointegration approach to protect the
pairs-trading strategy from severe losses. They applied an
OLS method to create a spread and set various conditions
that translated into trading actions. From these models, they
achieved a trading strategy with a minimum level of profits
protected from risk of loss. The results showed about an
% annualized excess return over the entire period. The
research in [] compared the distance and cointegration
approaches on high-frequency and daily datasets to
check whether pairs trading is profitable for Norwegian seafood
companies. The performance was similar between the two approaches.
Reference [] used a Kalman filter to calculate the spread, which
was then used as a high-frequency trading signal, on the
shares constituting the KOSPI Index. He found that the
pairs-trading strategy's performance was significant on the
KOSPI and was better during daily market conditions at
market opening and closing. Moreover, [] optimized a pairs-trading
system as a stochastic control problem. They used the
Ornstein-Uhlenbeck process to calculate the spread as a trading
signal and tested their model with simulated data; the results
showed that their strategy performs well. In addition, []
suggested the Ornstein-Uhlenbeck process to make a market
microstructure noise used as a trading signal in the pairs-trading
strategy. The performance is better under this method than
in traditional estimators such as ARIMA(,) and maximum
likelihood. Reference [] applied a cointegration method
to Chinese commodity futures from to to check
whether pairs trading was suitable in that market. They used
OLS regression to create spreads from the pairs. Furthermore,
[] applied a cointegration test to assorted pairs of stocks and
a vector error-correction model to create a trading signal.
It is important to set a boundary to optimize the pairs-trading
strategy. This boundary is a criterion for deciding
whether to execute a pairs-trading strategy. If a low boundary
is set, many trades will be executed, but profits per trade will be
lower; if a high boundary is set, investors will get high returns
when the strategy is executed. However, all this assumes
that mean reversion occurs. If the spread does not return
to the average in the specified trading window, losses will
be incurred. If a low boundary is set, the loss will be small.
However, if the strategy is executed with a high boundary, the
loss will increase. Therefore, the performance of pairs trading
depends on how the boundary is set. Reference [] suggested
taking a minimum-profit condition, which could be efficient
to reduce losses in a pairs-trading system. They set a trading
rule with a diverse open condition: for example, if the spread
is above ., ., ., ., and . standard deviations. They
used the daily closing prices from January , , to August
, , of two stocks, the Australia New Zealand Bank
and the Adelaide Bank. The results showed that, as the open
condition value decreases, the number of trades and profits
increases. Also, [] suggested optimal preset boundaries
calculated from estimated parameters for the average trade
duration, intertrade interval, and number of trades and used
them to maximize the minimum total profit. They used the
daily closing price data from January , , to June , ,
of seven pairs of stocks on the Australian Stock Exchange.
The results showed that their proposed method was efficient
in making profits using the pairs-trading strategy. Reference
[] examined whether the pairs-trading strategy could be
applied to the daily return of Chinese commodity futures
from to using three methods: classical, closed-loop,
and dynamic stop-loss. The closed-loop method takes
only a stop-profit barrier which executes the strategy and
does not consider the risk if spreads revert to the mean. The
classical method adds stop-loss boundaries to the closed-loop
method. The dynamic stop-loss method uses a variety
of stop-profit and stop-loss barriers to fit the spreads if the
spread is larger than the standard deviation, which is set using
criteria based on the historical average of spreads. The results
showed that these methods obtained an annualized return of
over %, especially the closed-loop method, which yielded
the highest profit of .%. In addition, [] experimented
with fixed optimal threshold selection, conditional volatility,
percentile, spectral analysis, and neural network thresholds in
the pairs-trading strategy. Of these, the neural network threshold
outperformed all other strategies.
Following the success of reinforcement learning, demonstrated
by its performance at Atari games [],
many researchers have attempted to apply this algorithm
to the financial trading system. Reference [] proposed a
deep Q-trading system using reinforcement learning methods.
They applied Q-learning to a trading system to trade
automatically. They set a delta price using data from the past
days, had three discrete action spaces (buy, hold, and
sell), and used long-term profit as a reward. They used daily
data from January , , to December , , of the
Hang Seng Index and the S&P 500 Index. The experimental
results showed that their proposed method outperformed
buy-and-hold strategies and recurrent reinforcement learning
methods. Reference [] proposed three steps to apply
reinforcement learning to the financial trading system. First,
they reduced relative replay size to fit financial trading.
Second, they proposed an action-augmentation technique
that provides more feedback from the action to the agent.
Third, they used long sequences as reinforcement data to
conduct recurrent neural network training. The experimental
data comprised tick-by-tick data of forex currency pairs
from January to December . The results showed that
the action-augmentation technique yielded more profit than
an epsilon-greedy policy. Reference [] used an N-armed
bandit problem to optimize the pairs-trading strategy. They
took the spread using an error-correction model and found
the parameters using a grid-search algorithm. They compared
their proposed model with a constant parameter model,
which was similar to a traditional pairs-trading strategy. They
used intraday one-minute data of some stocks in the FactSet
database from June to January . The performance of
their proposed model was better than the constant-parameter
model.
We investigate not only whether a dynamic boundary based
on the spread in each trading window can achieve
higher profit than the fixed boundary used in the traditional
pairs-trading strategy, but also whether it is possible to train deep
reinforcement learning methods to follow this mechanism.
To this end, we propose a new method to optimize the
pairs-trading strategy using deep reinforcement learning,
especially deep Q-networks, since the pairs-trading strategy can
be thought of as a game. After opening a portfolio position,
the profit is determined by whether the portfolio is closed or reaches the
stop-loss position. Therefore, if we set this strategy as a game by
setting boundaries which are optimized for the spreads in each trading
window, we can achieve more profit than traditional pairs-trading
strategies. In particular, we set the pairs-trading system
to be a kind of game and obtain the optimal boundaries,
trading thresholds, and stop-loss thresholds according to the
calculated spread. The reason for this construction is that if
the portfolio is opened and then closed within the trading window in
the calculated spread, it will be unconditionally profitable.
If the portfolio reaches the stop-loss
boundary or does not converge to the mean, losses may occur.
We therefore set the DQN to learn by positively rewarding it
if it takes a closed position and negatively rewarding it if it
reaches the stop-loss or exit thresholds. We conducted the
following experiments to verify that our proposed method
is optimized compared to the conventional method. First,
we used different spreads calculated using OLS and TLS to
see how the results differ depending on the spread used
for input. Second, depending on the formation window and
trading window, the spread and hedge ratio will vary.
We therefore set a total of six window sizes for selecting the
optimal window size which had the best performance. Finally,
we compared the proposed method with the traditional pairs-trading
strategy using the test data with the optimal window
size. In this experiment, we use the daily adjusted closing
prices from January , , to July , , of stocks
in the S&P 500 Index. Experimental results show that our
proposed method outperforms the traditional pairs-trading
strategy across all the pairs. In addition, we can confirm that
the performance measure varies according to the spread.
The main contributions of this study are as follows. First,
we propose a novel method to optimize the pairs-trading strategy
using deep reinforcement learning, especially deep Q-networks
with trading and stop-loss boundaries. The experimental
results show that our method can be applied in the
pairs-trading system and also to various other fields, including
finance and economics, when there is a need to optimize a
rule-based strategy to be more efficient. Second, we propose
an optimized dynamic boundary based on the spread in
each trading window. Our proposed method outperforms the
traditional pairs-trading strategy, which sets a fixed boundary.
Last, we find that our method outperforms the traditional pairs-trading
strategy in all pairs based on constituent stocks in the
S&P 500. Since our method selects optimal boundaries based
on spreads, it can be applied to other stock markets such as
KOSPI, Nikkei, and Hang Seng. It should be noted that the
present work is a part of the Master thesis [].
The rest of this paper is organized as follows. Section 2
explains the technical background. Section 3 describes the
materials and methods. Section 4 shows the results and
provides a discussion of the experiments. Section 5 provides
our conclusions to this study.
2. Technical Background
2.1. The Traditional Pairs-Trading Strategy. Pairs trading
is a representative market-neutral trading strategy which
simultaneously longs an undervalued stock and shorts an
overvalued stock. This strategy is a form of statistical arbitrage
trading that assumes the movements of the prices of the
two assets will be similar to previous trends []. It follows
the assumption that asset prices will return to the long-term
equilibrium. This strategy started from the idea that arbitrage
opportunities exist when the price gap between two assets
expands to or past a certain level. It is also based on the belief
that historical price movements will not change significantly
in the future.
In Figure , the graph drawn in blue is a spread made of
two stocks that are cointegrated, the red lines are the trading
boundaries, and the green lines are the stop-loss boundaries.
When this spread reaches the trading boundaries, the portfolio
is opened and only closed when the spread returns to
the average. However, losses are incurred when prices reach
the stop-loss boundaries after the portfolio is opened and do
not return to the average. Furthermore, after the portfolio is
opened, if the trading signal does not revert to the mean during
the trading window, the portfolio is closed by force; this is
called the exit position of the portfolio.
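The open, close, stop-loss, and exit rules described above can be illustrated with a small sketch. This is not the authors' implementation: the function name `run_trading_window`, the outcome labels, and the rule that a portfolio is opened only while the Z-score lies between the trading and stop-loss boundaries are assumptions made for illustration.

```python
def run_trading_window(zscores, trade_bound, stop_bound):
    """Classify one trading window of spread Z-scores.

    Returns "no_trade", "close", "stop_loss", or "exit".
    Hypothetical helper: names and the exact opening rule are assumptions.
    """
    side = 0  # +1 if opened above the mean, -1 if opened below, 0 if flat
    for z in zscores:
        if side == 0:
            # Open when the spread reaches a trading boundary
            # (assumed: but has not already passed the stop-loss boundary).
            if trade_bound <= abs(z) < stop_bound:
                side = 1 if z > 0 else -1
            continue
        # Stop-loss: the spread keeps diverging on the side where we opened.
        if side * z >= stop_bound:
            return "stop_loss"
        # Close: the spread has returned to (or crossed) the mean.
        if side * z <= 0:
            return "close"
    # Still open at the end of the window: forced exit.
    return "exit" if side else "no_trade"


if __name__ == "__main__":
    print(run_trading_window([0.2, 1.1, 0.6, -0.1], trade_bound=1.0, stop_bound=2.0))
```

With a trading boundary of 1.0 and a stop-loss boundary of 2.0, a window that touches 1.1 and then falls below zero is classified as a close, while one that never reverts is classified as an exit.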
2.1.1. The Cointegration Test. There are many approaches
for pair selection such as the distance approach [, –],
the cointegration approach [, , ], and the stochastic
approach [, ]. In this study, we use the cointegration
approach to choose pairs which have long-term equilibrium.
Generally, a linear combination of nonstationary variables is
also nonstationary. Assume that $x_t$ and $y_t$ have
unit roots; as previously mentioned, the linear combination
of these variables is then nonstationary.

$$x_t \sim I(1), \quad y_t \sim I(1)$$

$$y_t = \alpha + \beta x_t + \varepsilon_t$$

However, it can be a stationary relationship if the nonstationary
variables are cointegrated. In this case, this regression
must be checked to determine whether it is a spurious
regression or cointegrated. Johansen's method is widely used
to test for cointegration []. In this method, the number
of cointegration relations and the parameters of the model
are estimated and tested using maximum likelihood estimation
(MLE). Since all variables are regarded as endogenous
variables, there is no need to select dependent variables
and multiple cointegration relationships are identified. In
addition, we use MLE to estimate the cointegration relation
with the vector autoregression model and to determine
the cointegration coefficient based on the likelihood-ratio
test. There is therefore an advantage in performing various
hypothesis tests related to the estimation of cointegration
parameters and the setting of other models when there is
cointegration, and not merely to test for cointegration.
2.2. Spread Calculation
Figure: The traditional pairs-trading strategy. [Plot of the Z-score of the trading signal by year, with the trading boundary and the stop-loss boundary.]

2.2.1. Ordinary Least Squares. In regression analysis, OLS is
widely used to estimate parameters by minimizing the sum
of the squared errors []. Assume that $x_i$, $y_i$, and $\varepsilon_i$ are
an independent variable, a dependent variable, and an error
term. We can estimate $\beta$ from the following equations by
taking a partial derivative:

$$y_i = \beta x_i + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma_{\varepsilon}^2)$$

$$\sum_{i=1}^{n} (y_i - \beta x_i)^2$$

$$\hat{\beta} = \Big(\sum_{i=1}^{n} x_i x_i\Big)^{-1} \sum_{i=1}^{n} x_i y_i$$

The $\hat{\beta}$ value obtained from equation () is used for the number
of stock orders. The epsilon value is also used as a trading
signal through Z-scoring, in the state composed of the
formation-window size.
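As a rough illustration of the OLS step, the sketch below estimates the slope of a no-intercept regression and converts the residuals into Z-scores. The function names and the sample data are hypothetical, and using the population standard deviation for the Z-score is an assumption rather than something stated in the text.

```python
from statistics import mean, pstdev


def ols_hedge_ratio(x, y):
    """Slope of the no-intercept regression y = beta * x + eps."""
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    return sxy / sxx


def zscore_signal(x, y):
    """Return (beta, z) where z holds the Z-scored residuals of the fit."""
    beta = ols_hedge_ratio(x, y)
    resid = [yi - beta * xi for xi, yi in zip(x, y)]
    mu, sd = mean(resid), pstdev(resid)  # assumed: population std. dev.
    return beta, [(e - mu) / sd for e in resid]


if __name__ == "__main__":
    # Illustrative numbers only, not data from the study.
    beta, z = zscore_signal([1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8])
    print(beta, z)
```

In this sketch the slope plays the role of the hedge ratio and the Z-scored residuals play the role of the trading signal within one formation window.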
2.2.2. Total Least Squares. TLS estimates parameters by
minimizing the sum of the squared orthogonal distances between
the observed points and the regression line []. Since the orthogonal
distance does not change when the X and Y coordinates are
exchanged, the value of $\beta$ is calculated consistently. In the TLS
method, the observed values of $x_i$ and $y_i$ have the following
error terms:

$$y_i = Y_i + e_i, \quad e_i \sim N(0, \sigma_e^2)$$

$$x_i = X_i + u_i, \quad u_i \sim N(0, \sigma_u^2)$$

where $X_i$ and $Y_i$ are true values and $e_i$ and $u_i$ are error
terms following independent identical distributions. It is
assumed that there is a linear combination of true values. For
convenience, we represent the error variance ratio in equation
():

$$Y_i = \beta_0 + \beta_1 X_i$$

$$y_i = \beta_0 + \beta_1 X_i + e_i, \quad e_i \sim N(0, \sigma_e^2)$$

$$\lambda = \frac{\operatorname{var}(y_i \mid X_i)}{\operatorname{var}(x_i \mid X_i)} = \frac{\sigma_e^2}{\sigma_u^2}$$

The orthogonal regression estimator is calculated by minimizing
the sum of squared distances in equation ():

$$\sum_{i=1}^{n} \Big\{ \big[ y_i - (\beta_0 + \beta_1 X_i) \big]^2 + (x_i - X_i)^2 \Big\}$$

$$\hat{\beta}_1 = \frac{s_{YY}^2 - s_{XX}^2 + \big[ (s_{YY}^2 - s_{XX}^2)^2 + 4 s_{XY}^2 \big]^{1/2}}{2 s_{XY}}$$

The $\hat{\beta}_1$ value obtained from equation () is used in the same
way as that obtained from equation () and the epsilon value
is also used as a trading signal through the Z-score in the state
composed of the formation-window size.
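The symmetry property of orthogonal regression can be checked numerically. The sketch below uses the standard closed-form orthogonal-regression slope, treating the $s$ terms as centered sums of squares and cross-products and assuming equal error variances; whether this matches the authors' exact estimator is an assumption. The function names and data are hypothetical.

```python
from math import sqrt
from statistics import mean


def _moments(x, y):
    """Centered sums of squares and cross-products."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxx, syy, sxy


def tls_slope(x, y):
    """Orthogonal-regression slope of y on x (requires sxy != 0)."""
    sxx, syy, sxy = _moments(x, y)
    return (syy - sxx + sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)


def ols_slope(x, y):
    """Ordinary least-squares slope of y on x (with intercept)."""
    sxx, _, sxy = _moments(x, y)
    return sxy / sxx


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    y = [1.2, 1.9, 3.2, 3.8, 5.1]
    # TLS slopes are reciprocal when the roles of x and y are swapped;
    # OLS slopes are not (their product is the squared correlation).
    print(tls_slope(x, y) * tls_slope(y, x))
    print(ols_slope(x, y) * ols_slope(y, x))
```

The product of the two TLS slopes is one, so the implied hedge ratio does not depend on which stock is taken as the reference, whereas the product of the two OLS slopes is below one whenever the fit is not perfect.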
2.3. Reinforcement Learning and the Deep Q-Network. The
idea of reinforcement learning is to find an optimal policy
which maximizes the expected sum of discounted future
rewards []. These rewards come from selecting the optimal
value of each action, called the optimal Q-value. Reinforcement
learning basically solves the problem defined by
the Markov decision process (MDP). It consists of a tuple
$(S, A, P, R, \gamma)$, where $S$ is a finite set of states, $A$ is a finite set
of actions, $P$ is a state transition probability matrix, $R$ is a
reward function, and $\gamma$ is a discount factor. In the environment,
the agent observes state $s_t$ at time $t$ and selects action $a_t$.
As a result of this sequence, environmental feedback
is provided to the agent in the form of reward $r_t$ and next
state $s_{t+1}$. An action is selected by the action-value function
$Q^{\pi}(s, a)$ that represents the expected sum of discounted
future rewards.

$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\Big[ \sum_{i=t}^{T} \gamma^{i-t} r_i \,\Big|\, s_t, a_t \Big]$$

From this action-value function $Q^{\pi}(s_t, a_t)$, we find an optimal
action-value function $Q^{*}(s_t, a_t)$, following an optimal policy
which maximizes the expected sum of discounted future
rewards.

$$Q^{*}(s_t, a_t) = \max_{\pi} Q^{\pi}(s_t, a_t)$$

This optimal action-value function can be formulated as the
Bellman equation.

$$Q^{*}(s_t, a_t) = \max_{a_{t+1}} \big[ r_t + \gamma Q^{*}(s_{t+1}, a_{t+1}) \big]$$

The DQN uses a nonlinear function approximator to estimate
the action-value function. This network is trained by minimizing
a sequence of loss functions $L_t(\theta_t)$, which changes
with each step $t$. The weight $\theta_t$ is updated as the
sequence progresses:

$$L_t(\theta_t) = \mathbb{E}_{(s,a) \sim \rho(\cdot)}\Big[ \big( y_t - Q(s_t, a_t; \theta_t) \big)^2 \Big]$$

$$y_t = \max_{a_{t+1}} \big[ r_t + \gamma Q(s_{t+1}, a_{t+1}; \theta_{t-1}) \,\big|\, s_t, a_t \big]$$
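To make the target and loss concrete, the following sketch computes the temporal-difference target and the mean squared loss for a small minibatch using plain Python lists in place of a neural network. The function names and the tuple layout of a batch item are assumptions; the terminal-state handling mirrors the rule given in the algorithm listing of this paper.

```python
def td_target(reward, next_q_values, gamma, terminal):
    """y = r for terminal transitions, else r + gamma * max_a' Q(s', a')."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)


def squared_td_loss(batch, gamma):
    """Mean of (y - Q(s, a))^2 over a minibatch.

    Each item is (q_sa, reward, next_q_values, terminal), where q_sa is the
    current estimate Q(s, a) and next_q_values are the estimates Q(s', .).
    """
    losses = [
        (td_target(r, next_q, gamma, done) - q_sa) ** 2
        for q_sa, r, next_q, done in batch
    ]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    batch = [
        (2.0, 1.0, [0.5, 2.0], False),  # non-terminal transition
        (0.0, 1.0, [], True),           # terminal transition
    ]
    print(squared_td_loss(batch, gamma=0.9))
```

In a full DQN the values passed in as `q_sa` and `next_q_values` would come from the network with parameters $\theta_t$ and $\theta_{t-1}$, and the loss would be minimized by gradient descent.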
3. Materials and Methods
3.1. Data. In this study, stocks from the S&P 500 Index
were selected based on their trading volume and market
capitalization. To carry out the experiment, the data must
cover the same period. Therefore, corresponding stocks were
selected, leaving a total of stocks. Table represents the
dataset of stock names, abbreviations of those stocks, and
their respective sectors. We collected the adjusted daily
closing prices using Thomson Reuters' database. The period
of the training dataset is from January , , to December
, , comprising data points; the test dataset covers
the period from January , , to July , , comprising
data points. From these datasets, a pair of stocks will
be selected during the training dataset period using the
cointegration test.
3.2. Selecting Pairs Using the Cointegration Test. It is necessary
to pair stocks which have long-run statistical relationships
or similar price movements. It is possible to determine the
degree to which two stocks have had similar price movements
through the correlation value. Furthermore, the long-term
equilibrium of a pair of stocks is an important characteristic
for the execution of pairs trading. In this study, we used
the cointegration approach to select pairs of stocks. Through
Johansen's method, we selected pairs of stocks that have
long-run equilibria. Table shows the resulting pairs of stocks
that were identified based on t-statistics and Figure shows
price movements of the cointegrated stocks XOM and CVX.
Using this dataset, we will verify whether our proposed
method has better performance than the traditional pairs-trading
method.
3.3. Trading Signal. After selecting the pairs, it is necessary
to extract the signal for trading. To extract signals, we opt
for the OLS or TLS methods. First, because the stock price
follows a random walk [], we need to ensure that it follows
the $I(1)$ process through the augmented Dickey-Fuller test.
Subsequently, the $I(0)$ process should be created using the
logarithmic difference in stock prices, which is then applied to
the OLS and TLS methods. In equation (), $\alpha_1$ is a constant
value, $\beta_1$ is a hedge ratio (which is used as trading size), $\varepsilon_t$
is the error term, and $\log P_{A,t}$ and $\log P_{B,t}$ are the logarithmic
differences in the stock prices of $A$ and $B$ at time $t$. We convert
values of $\varepsilon_t$ into a Z-score used as a trading signal. For
example, if the trading signal reaches the threshold, we short
one share of the overvalued stock (represented as $\log P_{A,t}$)
and long $\beta_1$ shares of the undervalued stock (represented
as $\log P_{B,t}$). The hedge ratio is determined based on the
window size. We set a total of six discrete window sizes to
obtain the optimal window size for the experiment. Trading
windows are constituted using half of the formation-window
size. The spread obtained here is used as a state when applying
reinforcement learning (i.e., as an input of the DQN).

$$\log P_{B,t} = \alpha_1 + \beta_1 \log P_{A,t} + \varepsilon_t$$
3.4. Proposed Method: Optimized Pairs-Trading Strategy Using
the DQN Method. In this study, we optimize the pairs-trading
strategy as a type of game using the DQN. We attempt
to implement an optimal pairs-trading strategy by taking
optimal trading and stop-loss boundaries that correspond to
the given spread, since performance depends on how trading
and stop-loss boundaries are set in pairs trading []. Figure
shows the mechanism of our proposed pairs-trading strategy.
Through the cointegration test, we identify pairs and,
using regression analysis, obtain a hedge ratio used as trading
volume and a spread used as a trading signal and state. In the
case of the DQN, two hidden layers are set up and the number
of neurons, determined through trial and error, is half of the input size.
Action values consist of the six discrete spaces
in Table . Each value of $a_t$ has values for trading and stop-loss
boundaries.
A pairs-trading system can make a profit if the spread
touches the threshold and returns to the average such that the
portfolio is closed in each trading window. On the other hand,
if the trading boundary is touched and the stop-loss boundary
is reached, the system tries to minimize losses by stopping
trades. If the spread touches the trading boundary but fails to
return to the average, the strategy may end up with a profit
or a loss. In this study, the pairs-trading strategy is therefore
considered as a kind of game; closing a portfolio yields a positive
reward and a portfolio that reaches its stop-loss threshold
yields a negative reward. Although an exited portfolio may
possibly generate a positive profit, there is also a possibility
that losses will occur and it is therefore set to yield a negative
reward. We set the reward for the other conditions (such as maintaining
the portfolio or not opening a portfolio) to zero so as
to concentrate on the close, stop-loss, and exit positions.
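$$p_t = \left| V_{A,t} \times \frac{P_{A,t'} - P_{A,t}}{P_{A,t}} + V_{B,t} \times \frac{P_{B,t'} - P_{B,t}}{P_{B,t}} \right|$$

$$r_t = \begin{cases} 1000 \times p_t & \text{if the portfolio is closed} \\ -1000 \times p_t & \text{if the portfolio reaches the stop-loss} \\ -500 \times p_t & \text{if the portfolio is exited} \end{cases}$$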
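The reward rule can be sketched as a small lookup. The scale factors 1000, −1000, and −500 are taken from the reward equation; everything else here is an assumption for illustration, including the function names, the use of signed order sizes (negative for the short leg), the open/close price arguments, and the placement of the absolute value around the combined return.

```python
# Scale factors from the reward equation; outcome labels are hypothetical.
REWARD_SCALE = {"close": 1000.0, "stop_loss": -1000.0, "exit": -500.0}


def spread_return(v_a, v_b, p_a_open, p_a_close, p_b_open, p_b_close):
    """Absolute combined return of the two legs (assumed reading).

    v_a and v_b are signed order sizes: negative for a short position.
    """
    leg_a = v_a * (p_a_close - p_a_open) / p_a_open
    leg_b = v_b * (p_b_close - p_b_open) / p_b_open
    return abs(leg_a + leg_b)


def reward(outcome, p):
    """Scaled reward; any other outcome (hold, no trade) yields zero."""
    return REWARD_SCALE.get(outcome, 0.0) * p


if __name__ == "__main__":
    # Illustrative prices only: short one share of A, long 1.2 shares of B.
    p = spread_return(-1.0, 1.2, 100.0, 95.0, 50.0, 52.0)
    print(reward("close", p), reward("stop_loss", p), reward("exit", p))
```

Scaling the return keeps the reward in a range comparable to the Q-values, which is the motivation the text gives for the fixed multipliers.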
Table: The stocks on the S&P 500 Index used in this study.
No.  Ticker  Stock                              Sector
1    AAPL    Apple Inc.                         Technology
2    MSFT    Microsoft Corporation              Technology
3    BRKa    Berkshire Hathaway Inc.            Financial Services
4    JPM     JPMorgan Chase & Co.               Financial Services
5    JNJ     Johnson & Johnson                  Healthcare
6    XOM     Exxon Mobil Corporation            Energy
7    BAC     Bank of America Corporation        Financial Services
8    WFC     Wells Fargo & Company              Financial Services
9    WMT     Walmart Inc.                       Consumer Defensive
10   UNH     UnitedHealth Group Incorporated    Healthcare
11   CVX     Chevron Corporation                Energy
12   T       AT&T Inc.                          Communication Services
13   PFE     Pfizer Inc.                        Healthcare
14   ADBE    Adobe Systems Incorporated         Technology
15   MCD     McDonald's Corporation             Consumer Cyclical
16   MDT     Medtronic plc                      Healthcare
17   MMM     3M Company                         Industrials
18   HON     Honeywell International Inc.       Industrials
19   GE      General Electric Company           Industrials
20   ABT     Abbott Laboratories                Healthcare
21   MO      Altria Group, Inc.                 Consumer Defensive
22   UNP     Union Pacific Corporation          Industrials
23   TXN     Texas Instruments Incorporated     Technology
24   UTX     United Technologies Corporation    Industrials
25   LLY     Eli Lilly and Company              Healthcare
Figure: Cointegrated stock price movements. [Plot of the prices of XOM and CVX by year.]
We fix the values of portfolio close, stop-loss, and exit
to +1000, −1000, and −500, respectively. When we update
the Q-values, we must consider the reward as a significant
component of efficiently training the DQN. We therefore set
the reward value to have a range similar to that of the Q-value.
Additionally, we included the corresponding profit or
loss value to reflect that weight after the trading ended. In
equation (), $V_{A,t}$ and $V_{B,t}$ are the stock orders of stocks $A$ and
$B$ at time $t$, $P_{A,t}$ and $P_{B,t}$ are the stock prices of $A$ and $B$ at time
$t$, and $P_{A,t'}$ and $P_{B,t'}$ are the stock prices of $A$ and $B$ at time $t'$.
Algorithm shows the process of our proposed method.
Before we start our proposed method, we set a replay memory
and batch size and select pairs using the cointegration test.
At each epoch, we initialized total profit to ..  In the
training scheme, we set a state which has spreads within
the formation window and select actions which are used as
Figure: Steps for proposed pairs-trading strategy using the DQN method. [Flowchart boxes: 50 constituent stocks of the S&P 500 Index; filter out pairs based on trading volume, liquidity, and the cointegration test; construct pairs of stocks; preprocess dataset using OLS or TLS; inputs of DQN (spread, hedge ratio); agent (Deep Q-Network); outputs of DQN (Q-values); select max Q-value; environment; obtain a reward.]
Initialize replay memory D and batch size N
Initialize deep Q-network
Select pairs using cointegration test
For each epoch do
    Profit = .
    For steps t = , ... until end of training data set do
        Calculate spreads using OLS or TLS methods
        Obtain initial state s_t by converting spread to Z-score based on formation window
        Using epsilon-greedy method, with probability epsilon select a random action a_t
        Otherwise select a_t = argmax_a Q(s_t, a)
        Execute traditional pairs-trading strategy based on the action selected
        Obtain reward r_t by performing the pairs-trading strategy
        Set next state s_{t+1}
        Store transition (s_t, a_t, r_t, s_{t+1}) in D
        Sample minibatch of transitions (s_t, a_t, r_t, s_{t+1}) from D
        Set y_t = r_t if s_{t+1} is terminal
            y_t = r_t + γ max_a Q(s_{t+1}, a) if s_{t+1} is non-terminal
        Update Q-network by performing a gradient descent step on (y_t − Q(s_t, a_t))^2
    End
End
Algorithm: Optimized pairs-trading system using DQN.
trading and stop-loss boundaries. Throughout the trading
window, we executed a strategy similar to a traditional pairs-trading
strategy using the action selected. After executing
the strategy, we obtain a reward based on the results of the
portfolio. Finally, for the Q-learning process, we update the
Q-networks by performing a gradient descent step.
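Two building blocks of the training loop, epsilon-greedy action selection and the replay memory, can be sketched with the standard library alone. The class and function names are hypothetical, and a plain list of Q-values stands in for the network output; this is an illustration under those assumptions, not the authors' code.

```python
import random
from collections import deque


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


class ReplayMemory:
    """Fixed-capacity store of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity):
        # A bounded deque silently drops the oldest transition when full.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        """Uniformly sample a minibatch without replacement."""
        k = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), k)


if __name__ == "__main__":
    memory = ReplayMemory(capacity=1000)
    q_values = [0.1, 0.7, 0.3, 0.0, 0.2, 0.4]  # one value per boundary pair
    action = epsilon_greedy(q_values, epsilon=0.1)
    memory.store(state=[0.3, 1.2], action=action, reward=98.0, next_state=[1.2, 0.4])
    print(action, memory.sample(batch_size=32))
```

Each action index would map to one of the six discrete pairs of trading and stop-loss boundaries described in the text.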
3.5. Performance Measure. We check our experiment results
based on profit, maximum drawdown, and the Sharpe ratio.
Profit is commonly used as a performance measure for
trading strategies. It is calculated as the sum of returns
taking into consideration trading cost. Since many trades can
increase total profit, it is necessary to determine the total
profit taking into consideration transaction costs depending
on trading volume. In this study, we set a trading cost of bp;
equation () is almost the same as equation (), but it does
not include the absolute value, and $c$ is the trading cost. Maximum
drawdown represents the maximum cumulative loss from the
highest to the lowest values of the portfolio during a given
investment period, where $X(t)$ is the value of the portfolio and
$T$ is the terminal time value. The Sharpe ratio is an indicator
of the degree of excess profits from investing in risky assets
used in evaluating portfolios []. In equation (), $R_p$ is the
expected sum of portfolio returns and $R_f$ is the risk-free rate;
we set this value to , and $\sigma_p$ is the standard deviation of
portfolio returns.

$$\text{Profit} = \sum_{t=1}^{T} \left[ V_{1,t} * \frac{P_{1,t'} - P_{1,t}}{P_{1,t}} + V_{2,t} * \frac{P_{2,t'} - P_{2,t}}{P_{2,t}} - c * (V_{1,t} + V_{2,t}) \right]$$

$$\text{MDD}(T) = \max_{\tau \in (0,T)} \left[ \max_{t \in (0,\tau)} X(t) - X(\tau) \right]$$
Table: Summary statistics for pairs verified using cointegration tests.
No.  Pairs      t-statistic  Correlation
1    MSFT/JPM   −.**         .
2    MSFT/TXN   −.**         .
3    BRKa/ABT   −.**         .
4    BRKa/UTX   −.**         .
5    JPM/T      −.**         .
6    JPM/HON    −.***        .
7    JPM/GE     −.**         .
8    JNJ/WFC    −.**         .
9    XOM/CVX    −.***        .
10   HON/TXN    −.***        .
11   GE/TXN     −.**         .
Note: *** and ** denote a rejection of the null hypothesis at the % and
% significance levels, respectively.
$$\text{Sharpe ratio} = \frac{R_p - R_f}{\sigma_p}$$
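Maximum drawdown and the Sharpe ratio can be computed from a series of portfolio values and returns as in the sketch below. The function names are hypothetical. Using the mean of per-period returns for $R_p$, the sample standard deviation for $\sigma_p$, and a default risk-free rate of zero are assumptions, since the value used in the study is not legible in this text.

```python
from statistics import mean, stdev


def max_drawdown(values):
    """Largest drop from a running peak to a later value."""
    peak = values[0]
    worst = 0.0
    for v in values:
        peak = max(peak, v)             # highest portfolio value so far
        worst = max(worst, peak - v)    # deepest fall from that peak
    return worst


def sharpe_ratio(returns, risk_free=0.0):
    """(mean return - risk-free rate) / sample std. dev. of returns.

    risk_free=0.0 is a hypothetical default, not the paper's setting.
    """
    return (mean(returns) - risk_free) / stdev(returns)


if __name__ == "__main__":
    # Illustrative numbers only, not results from the study.
    print(max_drawdown([100.0, 120.0, 90.0, 110.0, 80.0, 130.0]))
    print(sharpe_ratio([0.01, 0.03, 0.02]))
```

The single pass in `max_drawdown` tracks the running peak, which corresponds to the inner maximum over $t \in (0, \tau)$ in the drawdown equation.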
4. Results and Discussion
We use the stock pair XOM and CVX, which rejects the null
hypothesis at the % significance level, to verify whether our
proposed model is trained well. The lengths of the window
sizes such as the formation window and trading window
are selected from the performance results with the training
dataset. From these results, we select an optimized window
size and compare our proposed model with traditional pairs
trading, which takes a constant set of actions with the test
dataset.
4.1. Training Results. To find the optimum window size
for the optimized pairs-trading system, we experimented
with six window sizes; the results for each window size are
calculated by averaging the top- results for a total of pairs.
From Tables and , we find that the best performance
is obtained when the formation and trading windows are
and , respectively, based on the profit generated by both the
OLS and TLS methods. When we trained our networks, we
set a positive reward for taking more closed positions and
fewer stop-loss and exit positions. The lowest ratio of closed
positions to open positions occurs when the formation and
trading windows are and days (.). In contrast, the highest
ratio of closed positions occurs when the formation and
trading windows are and days (.). However,
the highest profits are reported when the formation and trading
windows are and days. This can be explained by
the ratio of the number of stop-loss portfolios.
The formation and trading window sizes are and days
and the ratio of portfolio stop-loss positions is ., but the
formation and trading window sizes are .. This result
indicates that it is important to reduce the number of stop-loss
positions while increasing the number of closed positions. In
addition, the trading signals made with the TLS method are better
than those made with the OLS method for all six of the discrete
window sizes. The reason lies in the difference
between the hedge ratios of the two methods. In OLS, one side
is taken as the reference and the relative change of the other
side is estimated. Since OLS assumes that there is no error
component on the reference side and that the error lies only
on the other side, the hedge ratio varies depending on the side
used as the reference. In TLS, however, the fitted relationship,
and hence the hedge ratio, is consistent regardless of which side
is used as the reference. For this reason, the experimental
results confirm that the TLS method is better able to determine
when to execute the pairs-trading strategy. From these results,
we take the optimum window size when we verify our proposed
method on the test dataset. However, we first need to ensure
that the proposed model is well trained.
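The asymmetry between the two hedge ratios described above can be illustrated with a small sketch. It assumes synthetic price series and hypothetical helper names (`ols_hedge_ratio`, `tls_hedge_ratio`); TLS is computed here through a singular value decomposition of the centered data, which is one standard way to obtain an orthogonal-regression slope and not necessarily the procedure used in this study.

```python
import numpy as np


def _centered(x, y):
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return x - x.mean(), y - y.mean()


def ols_hedge_ratio(x, y):
    """Slope from regressing y on x; all error is attributed to y."""
    xc, yc = _centered(x, y)
    return float(xc @ yc / (xc @ xc))


def tls_hedge_ratio(x, y):
    """Slope of the orthogonal-regression line; error is allowed in both series."""
    xc, yc = _centered(x, y)
    # The right singular vector belonging to the smallest singular value
    # is normal to the fitted line a*x + b*y = 0, so the slope is -a/b.
    _, _, vt = np.linalg.svd(np.column_stack([xc, yc]), full_matrices=False)
    a, b = vt[-1]
    return float(-a / b)


# Synthetic, illustrative data only (not the pairs used in the study).
rng = np.random.default_rng(0)
x = 50.0 + np.cumsum(rng.normal(size=500))
y = 1.5 * x + rng.normal(scale=2.0, size=500)

print("OLS y on x:", ols_hedge_ratio(x, y), "| 1 / (OLS x on y):", 1.0 / ols_hedge_ratio(y, x))
print("TLS y on x:", tls_hedge_ratio(x, y), "| 1 / (TLS x on y):", 1.0 / tls_hedge_ratio(y, x))
```

With noisy data the two OLS figures differ, because the product of the two OLS slopes equals the squared correlation. The two TLS figures agree, since the TLS slope of x on y is the reciprocal of the TLS slope of y on x.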
It is important to check whether our reinforcement
learning algorithm is trained well. Reference [] suggested
that a steadily increasing average of Q-values is evidence that
the DQN is learning well. Figure (a) shows the average
Q-values of HON and TXN as training progressed. We find
that the average Q-values steadily increased, indicating that
our proposed model is properly trained. In addition, we
provide a positive reward when the portfolio closes and a
negative reward when the portfolio reaches the stop-loss
threshold or exits. Figure (b) shows the ratio of the number
of portfolio positions as training progressed. The ratio of
closed to open portfolio positions increased and the ratio
of portfolios reaching their stop-loss thresholds to open
portfolio positions decreased. We also find that the ratio of
portfolio exits to open portfolio positions slightly increased.
It is possible that the rewards given for an open portfolio
position are relatively small compared to those given for a
closed portfolio position. The DQN is therefore trained
to prioritize preventing portfolios from reaching their stop-loss
thresholds (the more important objective) over avoiding exits.
This result can also serve as a basis for judging whether the
proposed model is being trained properly.
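One common form of this diagnostic is to fix a set of held-out states and track the mean of the maximum predicted Q-value over that set after each epoch. The sketch below assumes a hypothetical `q_function` that maps a batch of states to an array of shape (n_states, n_actions); the toy linear Q-function stands in for a trained network and is not the model used in this study.

```python
import numpy as np


def average_max_q(q_function, held_out_states):
    """Mean over held-out states of max_a Q(s, a)."""
    q_values = np.asarray(q_function(held_out_states), dtype=float)
    if q_values.ndim != 2:
        raise ValueError("q_function must return shape (n_states, n_actions)")
    return float(q_values.max(axis=1).mean())


# Toy stand-in for a trained network (assumption, for illustration only).
WEIGHTS = np.array([[1.0, 2.0, 3.0],
                    [4.0, 0.0, -1.0]])


def toy_q(states):
    return np.asarray(states, dtype=float) @ WEIGHTS


states = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
print("average max Q:", average_max_q(toy_q, states))
```

Evaluating the same fixed states at every epoch makes the curve comparable across epochs, which is what allows a steady rise to be read as a sign of learning.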
Tables and present the performance results of XOM
and CVX in the training dataset. We call our proposed
model pairs-trading DQN (PTDQN) and refer to traditional pairs
trading with constant action values as pairs trading with
action 0 (PTA0) through pairs trading with action 5 (PTA5). From
these results, we can confirm that our proposed method is
more profitable than the constant pairs-trading strategies.
In addition, the TLS method is more profitable than the OLS
method. From PTA0 to PTA5, the trading boundary and the
stop-loss boundary grow larger, and the numbers of open and
closed portfolios and of portfolios that reached their stop-loss
thresholds are reduced. In other words, there is less opportunity
for profit, but the probability of loss is also reduced. It is
important not only to take many closed positions, but also to
take the best action to open and close the portfolio. For
example, if a portfolio is
Table: Setting a discrete action space.
Action              A0    A1    A2    A3    A4    A5
Trading boundary    ±0.5  ±1.0  ±1.5  ±2.0  ±2.5  ±3.0
Stop-loss boundary  ±2.5  ±3.0  ±3.5  ±4.0  ±4.5  ±5.0
Table: Results of applying the DQN method to each window size using OLS.
Formation window | Trading window | MDD | Sharpe ratio | Profit | Number of open portfolios | Number of closed portfolios | Number of stop-loss portfolios | Number of portfolio exits
−. . .
−. . .
−. . .
−. . .
−. . .
−. . .
Table: Results of applying the DQN method to each window size using TLS.
Formation window | Trading window | MDD | Sharpe ratio | Profit | Number of open portfolios | Number of closed portfolios | Number of stop-loss portfolios | Number of exited portfolios
−. . .
−. . .
−. . .
−. . .
−. . .
−. . .
Table: Average top- performance results for XOM and CVX using OLS within the training period.
Model | MDD | Sharpe ratio | Profit | Number of open portfolios | Number of closed portfolios | Number of stop-loss portfolios | Number of exited portfolios
PTDQN −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. −. .
Table: Average top- performance results for XOM and CVX using TLS within the training period.
Model | MDD | Sharpe ratio | Profit | Number of open portfolios | Number of closed portfolios | Number of stop-loss portfolios | Number of exited portfolios
PTDQN −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. . .
Figure: Verification that our proposed model is well-trained with HON and TXN using TLS. (a) Average of Q-values by epoch. (b) Ratio of portfolios (close, stop, exit) by epoch.
opened and closed by a boundary corresponding to action
within the same spread and if a portfolio is opened and closed
by a boundary corresponding to action , the corresponding
profit is different. Assuming that the mean reversion is certain
to occur, if we take the maximum boundary condition to
open a portfolio, we will obtain a larger profit than when
we take a smaller boundary condition. We can see that the
PTDQN returns are higher than those of the strategy with the
highest return among the traditional pairs-trading strategies that
take a constant action. Figures – show the changes in trading
and stop-loss boundaries and the highest profit for constant
action when applying the DQN method during the training
period using OLS and TLS.
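The effect of the chosen boundary on a single trading window can be sketched as follows. The boundary pairs are those of the discrete action space in this paper, and the event rules (open when the z-score reaches the trading boundary, close when it returns to the mean, stop out at the stop-loss boundary, exit if the window ends with the position still open) follow the description given here. The function name, the outcome labels, and the handling of only the first position per window are illustrative assumptions.

```python
# (trading boundary, stop-loss boundary) in z-score units, indexed by action.
ACTIONS = [(0.5, 2.5), (1.0, 3.0), (1.5, 3.5), (2.0, 4.0), (2.5, 4.5), (3.0, 5.0)]


def window_outcome(zscores, action):
    """Classify the first position in one trading window.

    Returns (label, entry_z) with label in
    {"none", "closed", "stop_loss", "exit"}.
    """
    trade, stop = ACTIONS[action]
    side = 0          # 0: flat, +1: opened above +trade, -1: opened below -trade
    entry = None
    for z in zscores:
        if side == 0:
            if abs(z) >= trade:
                side = 1 if z > 0 else -1
                entry = z
            continue
        signed = side * z
        if signed >= stop:       # spread diverged further: stop-loss
            return "stop_loss", entry
        if signed <= 0.0:        # spread reverted to the mean: close
            return "closed", entry
    return ("exit", entry) if side != 0 else ("none", None)


if __name__ == "__main__":
    reverting = [0.2, 0.8, 1.6, 0.9, -0.1]   # illustrative z-score path
    for a in range(len(ACTIONS)):
        print(a, window_outcome(reverting, a))
```

On the same reverting path, a wider trading boundary opens the position further from the mean (a larger potential profit) or not at all, which is the trade-off the agent has to learn.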
Figures and show comparisons of PTDQN and
PTA using the TLS method. Figure consists of the spread,
trading, and stop-loss boundaries. We find that the trading
and stop-loss boundaries take different values in PTDQN,
showing that it has learned to find the optimal boundary
according to each spread. In contrast to PTDQN, PTA
in Figure has constant trading and stop-loss boundaries.
Figures and exhibit the same features we see in Figures
and . The difference between these methods lies in the
spreads: different results can be obtained depending on the
spreads used. Constructing better spreads can therefore improve
performance.
Figures and show the profit corresponding to
DQN and constant actions using TLS and OLS. Reference
[] suggested that an average value over multiple trials
should be presented to show the reproducibility of deep
reinforcement learning because results may differ owing to
high variance across trials and random seeds. We
therefore conducted five trials with different random seeds.
The profit graph of DQN shows the average profit of
these trials, with the filled region spanning the maximum and
minimum profit values. We can see that PTDQN had a higher
profit than the traditional pairs-trading strategies during
Figure: An example of optimizing pairs trading using PTDQN based on a training scheme using TLS (z-score by year; trading signal, trading boundary, and stop-loss boundary).
Figure: An example of PTA based on a training scheme using TLS (z-score by year; trading signal, trading boundary, and stop-loss boundary).
the training period. This means that, even with the same
spread, profit changes as the boundaries are changed. In other
words, finding the optimal boundary for the spread is an
important factor in optimizing the profitability of pairs trading.
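The averaging over trials with a filled minimum-to-maximum region can be sketched in a few lines. The function name `profit_band` and the array layout (one row per random seed, one column per time step) are assumptions, and the numbers below are made-up placeholders rather than results from this study.

```python
import numpy as np


def profit_band(profit_paths):
    """Per-step mean, minimum, and maximum across trials.

    profit_paths: array-like of shape (n_trials, n_steps).
    """
    p = np.asarray(profit_paths, dtype=float)
    if p.ndim != 2:
        raise ValueError("expected shape (n_trials, n_steps)")
    return p.mean(axis=0), p.min(axis=0), p.max(axis=0)


# Placeholder paths for three hypothetical seeds (not study data).
paths = [[1.0, 1.2, 1.5],
         [1.0, 1.1, 1.3],
         [1.0, 1.3, 1.7]]
mean, low, high = profit_band(paths)
print("mean:", mean, "min:", low, "max:", high)
```

The mean would be drawn as the line and the region between `low` and `high` shaded, so a reader can judge how sensitive the result is to the random seed.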
4.2. Test Results. Tables and show the average performance
measures of each pair tested by applying the top-
trained models. We can see that the constant action with
the highest returns differs for each pair, and that the TLS
method yields higher profit than the OLS method for all pairs,
as shown above. We also find that PTDQN performs
better than traditional pairs-trading strategies.
The pair with the highest profit using the proposed method is
HON and TXN (.); it also shows the biggest difference
between the DQN method and the optimal constant action
(.). We find that the proposed method has a higher
Sharpe ratio in all pairs except for MO and UTX when the
Figure: An example of optimizing PTDQN based on a training scheme using OLS (z-score by year; trading signal, trading boundary, and stop-loss boundary).
Figure: An example of PTA based on a training scheme using OLS (z-score by year; trading signal, trading boundary, and stop-loss boundary).
Figure: Average top- profits generated by PTDQN and traditional pairs-trading strategies (PTA0 to PTA5) using TLS in training periods (profit by year).
Figure: Average top- profits generated by PTDQN and traditional pairs-trading strategies (PTA0 to PTA5) using OLS in training periods (profit by year).
TLS method is used. If we add the Sharpe ratio to
the total profit as an objective function, we can build a more
optimized pairs-trading system. Based on these results, we
can confirm the robustness of our proposed method for our
dataset. The proposed method can be applied to other pairs
of stocks found in other global markets.
In Figure , we can see that our proposed method,
PTDQN, outperforms the traditional pairs-trading strategies
that have constant actions in the test dataset. The crucial aspect
of this method is the selection of the optimal boundary in the
spread, that is, the constant action (equivalent to a constant
boundary) that makes the highest profit. Therefore, the trend is
the same as that of traditional pairs-trading strategies; however,
when the optimal boundaries that have the highest profit in the
spread are combined, PTDQN is found to have higher profit
than traditional pairs-trading strategies. This method can
therefore be applied in various fields when there is a need
to optimize the efficiency of a rule-based strategy [, ].
In this study, we consider the spread and boundaries to be the
important factors of a pairs-trading strategy. Therefore, we tried
to optimize the pairs-trading strategy with various trading and
stop-loss boundaries using deep reinforcement learning, and
our method outperforms rule-based strategies. Optimizing
key parameters in rule-based methods can thus improve their
performance.
Pairs trading uses two stocks that share the same
trend. However, this relationship can break down due to various
factors such as economic issues and company risk. In this
situation, the spread between the two stocks becomes extremely
large. Although this situation cannot be avoided, we hedge this
risk by taking a dynamic boundary. In this sense, taking the
lowest stop-loss boundary is the best choice since the position
can be exited with the least loss. By taking the dynamic boundary
using the deep reinforcement learning method, we can see that not
only are profits increased, but losses are also minimized
compared with taking a fixed boundary.
5. Conclusions
We propose a novel approach to optimize the pairs-trading
strategy using a deep reinforcement learning method,
specifically deep Q-networks. Two key research
questions are posed. First, if we set a dynamic boundary based on
a spread in each trading window, can it achieve higher profit
than the traditional pairs-trading strategy? Second, can a deep
reinforcement learning method be trained
to follow this mechanism? To investigate these questions,
we collected pairs selected using the cointegration test. We
experimented with how the results varied according to the
spread and the method used. We therefore set different
spreads using the OLS and TLS methods as the input of the DQN
and the trading signal. To conduct this experiment, we set
up a formation window and a trading window. The hedge
ratio, which is an important factor in determining how much
stock to take, depends on these window sizes. We therefore applied
the OLS and TLS methods and experimented to find the
optimal window size by varying the formation window and
the trading window.
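The role of the two windows can be sketched as follows: the hedge ratio and the spread scale are estimated on the formation window only, then held fixed to compute the z-scored spread over the trading window that follows. This is a sketch under stated assumptions: the function name, the use of an OLS fit via `numpy.polyfit`, and the synthetic series are illustrative, and the exact spread construction of this study is not reproduced here.

```python
import numpy as np


def trading_window_zscore(x, y, start, formation, trading):
    """Return (hedge_ratio, z) for one formation/trading window pair."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    split = start + formation
    end = split + trading
    if len(x) != len(y) or end > len(x):
        raise ValueError("series too short or of unequal length")

    # Estimate hedge ratio (slope) and intercept on the formation window only.
    beta, alpha = np.polyfit(x[start:split], y[start:split], 1)
    residual = y[start:split] - (alpha + beta * x[start:split])
    sigma = residual.std(ddof=1)   # OLS residuals already have zero mean

    # Apply the frozen parameters to the subsequent trading window.
    spread = y[split:end] - (alpha + beta * x[split:end])
    return float(beta), spread / sigma


# Synthetic, illustrative series (not the study's data).
x = np.arange(40.0)
y = 2.0 * x + 1.0 + 0.1 * np.sin(x)
beta, z = trading_window_zscore(x, y, start=0, formation=30, trading=10)
print("hedge ratio:", beta)
print("z-scores:", z)
```

Freezing the parameters at the end of the formation window keeps the trading-window signal free of look-ahead, and shorter or longer windows change both the estimated hedge ratio and the scale of the z-score.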
Tables and show the average performance values of
the formation windows and trading windows in the training
dataset. The results show that, for all six window sizes,
performance was higher when TLS spreads were used than when
OLS spreads were used. In addition, we can see that profitability
gradually increases as the estimation windows and trading windows
of methods using TLS and OLS decrease. The reason is that,
although the ratio of closed portfolios is the lowest for the
formation and trading windows we set, the ratio of stop-loss
portfolios is also the lowest compared with the other
formation and trading windows. This means that reducing the
number of stop-loss portfolios is as important as increasing
the number of closed portfolios to make a profit. Using the optimal
window size, we then check whether our DQN is properly
trained. At each epoch, we find that the average Q-value
steadily increased, the ratio of closed portfolios increased,
and the ratio of portfolios that reached their stop-loss
thresholds decreased, confirming that our DQN is trained
well. Based on these results, we find that our proposed model
using the test dataset with a formation window of and
a trading window of had results that were superior to
those of traditional pairs-trading strategies in the out-of-sample
dataset. In Figure , we can see that the profit path of
PTDQN is similar to those of PTA0 to PTA5, but better than that from
Figure: Average top- profits of PTDQN and PTA0 to PTA5 using TLS with the test dataset (profit by year). Panels: (a) MSFT/JPM, (b) MSFT/TXN, (c) BRKa/ABT, (d) BRKa/UTX, (e) JPM/T, (f) JPM/HON, (g) JPM/GE, (h) JNJ/WFC, (i) XOM/CVX, (j) HON/TXN, (k) GE/TXN.
other methods. This shows that taking dynamic boundaries
based on our method is efficient in optimizing the pairs-trading
strategy. During periods of economic uncertainty, managing
pairs-trading strategies, including our proposed method, can be
risky. However, our reward function penalizes the case in which
the spread suddenly becomes large, and our network is trained to
prevent this situation by taking a smaller stop-loss boundary since
it is trained to maximize the expected sum of future rewards.
Therefore, our proposed method can reduce the risk when
economic risks appear, compared with the traditional pairs-trading
strategy with a fixed boundary.
From the experimental results, we show that our method
can be applied in the pairs-trading system. It can be applied
in various fields, including finance and economics, when
there is a need to optimize the efficiency of a rule-based
strategy. Furthermore, we find that our method outperforms
the traditional pairs-trading strategy in all pairs based on
constituent stocks of the S&P index. The study
focused only on spreads made by two stocks that have
long-term equilibrium patterns. Since our method selects
optimal boundaries based on spreads, it can be applied
to appropriately selected cointegrated pairs in other stock
markets such as KOSPI, Nikkei, and Hang Seng.
In future work, we can develop our proposed model as
follows. First, as profit was set as the objective function in this
study, the performance of the model is lower than that of
traditional pairs trading on some other performance measures. It
may therefore be possible to create a better-optimized pairs-trading
strategy by including these other performance
indicators in the objective function. Second, we can
use other statistical methods such as the Kalman filter and
error-correction models to construct diversified spreads. Finally,
it is possible to create a more optimized pairs-trading strategy
by making the discrete sets of window sizes
and boundaries continuous. We will address these issues in future
studies.
Data Availability
The data used to support the findings of this study have
been deposited in the figshare repository (DOI: ./m.figshare.).
Disclosure
The funders had no role in the study design, data collection
and analysis, decision to publish, or preparation of
the manuscript. This work represents a part of the study
Table: Average top- performance results of the proposed method and the traditional pairs-trading strategy in the out-of-sample dataset using TLS.
Pairs | Model | MDD | Sharpe ratio | Profit | Number of open portfolios | Number of closed portfolios | Number of stop-loss portfolios | Number of exited portfolios
MSFT/JPM
PTDQN −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. . .
MSFT/TXN
PTDQN −. . .
PTA −. −. .
PTA −. −. .
PTA −. . .
PTA −. −. .
PTA −. . .
PTA −. . .
BRKa/ABT
PTDQN −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. . .
BRKa/UTX
PTDQN −. . .
PTA −. −. .
PTA −. . .
PTA −. −. .
PTA −. . .
PTA −. . .
PTA −. . .
JPM/T
PTDQN −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. −. .
PTA −. −. .
PTA −. . .
JPM/HON
PTDQN −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. . .
JPM/GE
PTDQN −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. . .
JNJ/WFC
PTDQN −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. . .
XOM/CVX
PTDQN −. . .
PTA −. . .
PTA −. . .
PTA −. −. .
PTA −. −. .
PTA −. −. .
PTA −. . .
HON/TXN
PTDQN −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. . .
GE/TXN
PTDQN −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. −. .
PTA −. −. .
MO/UTX
PTDQN −. . .
PTA −. −. .
PTA −. . .
PTA −. −. .
PTA −. . .
PTA −. . .
PTA −. . .
Table: Average top- performance results of the proposed method and the traditional pairs-trading strategy in the out-of-sample dataset using OLS.
Pairs | Model | MDD | Sharpe ratio | Profit | Number of open portfolios | Number of closed portfolios | Number of stop-loss portfolios | Number of exited portfolios
MSFT/JPM
PTDQN −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. −. .
PTA −. −. .
PTA −. . .
MSFT/TXN
PTDQN −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. −. .
PTA −. . .
PTA −. . .
BRKa/ABT
PTDQN −. . .
PTA −. −. .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. . .
BRKa/UTX
PTDQN −. . .
PTA −. −. .
PTA −. −. .
PTA −. −. .
PTA −. . .
PTA −. . .
PTA −. −. .
JPM/T
PTDQN −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. −. .
JPM/HON
PTDQN −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. −. .
PTA −. −. .
JPM/GE
PTDQN −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. −. .
JNJ/WFC
PTDQN −. . .
PTA −. −. .
PTA −. −. .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. . .
XOM/CVX
PTDQN −. . .
PTA −. −. .
PTA −. . .
PTA −. −. .
PTA −. . .
PTA −. −. .
PTA −. −. .
HON/TXN
PTDQN −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. . .
GE/TXN
PTDQN −. . .
PTA −. . .
PTA −. . .
PTA −. . .
PTA −. −. .
PTA −. −. .
PTA −. . .
MO/UTX
PTDQN −. . .
PTA −. . .
PTA −. −. .
PTA −. −. .
PTA −. . .
PTA −. . .
PTA −. . .
conducted as a Master's thesis in Financial Engineering during
and at Ajou University, Republic of Korea.
Conflicts of Interest
The authors declare that there are no conflicts of interest
regarding the publication of this paper.
Acknowledgments
This work was supported by the National Research Foundation
of Korea (NRF) grant funded by the Korea Government
(MSIT: Ministry of Science and ICT) (No. NRF-RCB).
References
[] E. Gatev, W. N. Goetzmann, and K. G. Rouwenhorst, “Pairs trading: performance of a relative-value arbitrage rule,” Yale ICF Working Paper No. 08-03, , https://ssrn.com/abstract= or http://dx.doi.org/./ssrn..
[] R. J. Elliott, J. van der Hoek, and W. P. Malcolm, “Pairs trading,” Quantitative Finance, vol. , no. , pp. –, .
[] S. Andrade, V. Di Pietro, and M. Seasholes, “Understanding the profitability of pairs trading,” .
[] G. Hong and R. Susmel, “Pairs-trading in the Asian ADR market,” Univ. Houston, Unpubl. Manuscr., .
[] E. Gatev, W. N. Goetzmann, and K. G. Rouwenhorst, “Pairs trading: performance of a relative-value arbitrage rule,” Review of Financial Studies, vol. , no. , pp. –, .
[] B. Do and R. Faff, “Does simple pairs trading still work?” Financial Analysts Journal, vol. , no. , pp. –, .
[] S. Mudchanatongsuk, J. A. Primbs, and W. Wong, “Optimal pairs trading: a stochastic control approach,” in Proceedings of the 2008 American Control Conference, ACC, pp. –, USA, June .
[] A. Tourin and R. Yan, “Dynamic pairs trading using the stochastic control approach,” Journal of Economic Dynamics & Control, vol. , no. , pp. –, .
[] Z. Zeng and C. Lee, “Pairs trading: optimal thresholds and profitability,” Quantitative Finance, vol. , no. , pp. –, .
[] S. Fallahpour, H. Hakimian, K. Taheri, and E. Ramezanifar, “Pairs trading strategy optimization using the reinforcement learning method: a cointegration approach,” Soft Computing, vol. , no. , pp. –, .
[] P. Nath, “High frequency pairs trading with U.S. treasury securities: risks and rewards for hedge funds,” SSRN Electronic Journal, .
[] T. Leung and X. Li, “Optimal mean reversion trading with transaction costs and stop-loss exit,” International Journal of Theoretical and Applied Finance, vol. , no. , .
[] E. Ekström, C. Lindberg, and J. Tysk, “Optimal liquidation of a pairs trade,” in Advanced Mathematical Methods for Finance, pp. –, Springer, Heidelberg, .
[] Y. Lin, M. McCrae, and C. Gulati, “Loss protection in pairs trading through minimum profit bounds: a cointegration approach,” Journal of Applied Mathematics and Decision Sciences, vol. , pp. –, .
[] A. Mikkelsen, “Pairs trading: the case of Norwegian seafood companies,” Applied Economics, vol. , no. , pp. –, .
[] K. Kim, “Performance analysis of pairs trading strategy utilizing high frequency data with an application to KOSPI Equities,” SSRN Electronic Journal, p. , .
[] V. Holý and P. Tomanová, Estimation of Ornstein-Uhlenbeck Process Using Ultra-High-Frequency Data with Application to Intraday Pairs Trading Strategy, .
[] D. Chen, J. Cui, Y. Gao, and L. Wu, “Pairs trading in Chinese commodity futures markets: an adaptive cointegration approach,” Accounting & Finance, vol. , no. , pp. –, .
[] H. Puspaningrum, Y. Lin, and C. M. Gulati, “Finding the optimal pre-set boundaries for pairs trading strategy based on cointegration technique,” Journal of Statistical Theory and Practice, vol. , no. , pp. –, .
[] A. A. Roa, “Pairs trading: optimal threshold strategies,” .
[] V. Mnih, K. Kavukcuoglu, D. Silver et al., “Playing Atari with deep reinforcement learning,” https://arxiv.org/abs/., .
[] Y. Wang, D. Wang, S. Zhang, Y. Feng, S. Li, and Q. Zhou, “Deep Q-trading,” , http://cslt.riit.tsinghua.edu.cn/.
[] C.-Y. Huang, “Financial trading as a game: a deep reinforcement learning approach,” , https://arxiv.org/abs/..
[] T. Kim, Optimizing the pairs trading strategy using deep reinforcement learning [M.S. thesis], Ajou University, Suwon, Republic of Korea, .
[] B. Do, R. Faff, and K. Hamza, “A new approach to modeling and estimation for pairs trading,” in Proceedings of the 2006 Financial Management Association European Conference, .
[] R. D. Dittmar, C. J. Neely, and P. A. Weller, “Is technical analysis in the foreign exchange market profitable? A genetic programming approach,” Journal of Financial and Quantitative Analysis, vol. , p. , .
[] H. Rad, R. K. Low, and R. Faff, “The profitability of pairs trading strategies: distance, cointegration and copula methods,” Quantitative Finance, vol. , no. , pp. –, .
[] S. Johansen, “Statistical analysis of cointegration vectors,” Journal of Economic Dynamics and Control, vol. , no. -, pp. –, .
[] M. H. Kutner, C. J. Nachtsheim, J. Neter, and W. Li, “Applied linear statistical models,” .
[] G. H. Golub and C. F. Van Loan, “An analysis of the total least squares problem,” SIAM Journal on Numerical Analysis, vol. , no. , pp. –, .
[] R. S. Sutton and A. G. Barto, “Introduction to reinforcement learning,” Learning, .
[] E. F. Fama, “Random walks in stock market prices,” Financial Analysts Journal, vol. , no. , pp. –, .
[] W. F. Sharpe, “The Sharpe ratio,” The Journal of Portfolio Management, .
[] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, “Deep reinforcement learning that matters,” in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI), .
[] Y. H. Li, X. M. Lu, and N. C. Kar, “Rule-based control strategy with novel parameters optimization using NSGA-II for power-split PHEV operation cost minimization,” IEEE Transactions on Vehicular Technology, vol. , no. , pp. –, .
[] L. Dymova, P. Sevastianov, and K. Kaczmarek, “A stock trading expert system based on the rule-base evidential reasoning using Level Quotes,” Expert Systems with Applications, vol. , no. , pp. –, .