Content available from Complexity

Research Article

Optimizing the Pairs-Trading Strategy Using Deep Reinforcement Learning with Trading and Stop-Loss Boundaries

Taewook Kim 1,2 and Ha Young Kim 3

1Qra Technologies, Inc., Ttukseom-ro 1-gil, Sungdong-gu, Seoul 04778, Republic of Korea

2Department of Financial Engineering, Ajou University, Worldcupro 206, Yeongtong-gu, Suwon 16499, Republic of Korea

3Graduate School of Information, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul, 03722, Republic of Korea

Correspondence should be addressed to Ha Young Kim; haimkgetup@gmail.com

Received 6 February 2019; Revised 14 April 2019; Accepted 11 June 2019; Published 12 November 2019

Guest Editor: Benjamin M. Tabak

Copyright © Taewook Kim and Ha Young Kim. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Many researchers have tried to optimize pairs trading as the number of opportunities for arbitrage profit has gradually decreased. Pairs trading is a market-neutral strategy; it profits if the given condition is satisfied within a given trading window, and if not, there is a risk of loss. In this study, we propose an optimized pairs-trading strategy using deep reinforcement learning, particularly the deep Q-network (DQN), with various trading and stop-loss boundaries. More specifically, if spreads hit trading thresholds and revert to the mean, the agent receives a positive reward. However, if spreads hit stop-loss thresholds or fail to revert to the mean after hitting the trading thresholds, the agent receives a negative reward. The agent is trained to select the optimum level of discretized trading and stop-loss boundaries given a spread to maximize the expected sum of discounted future profits. Pairs are selected from stocks on the S&P 500 Index using a cointegration test. We compared our proposed method with traditional pairs-trading strategies that use constant trading and stop-loss boundaries. We find that our proposed model is trained well and outperforms traditional pairs-trading strategies.

1. Introduction

Pairs trading is a method for obtaining arbitrage profit when there is a statistical difference between two stocks with similar characteristics that are cointegrated or highly correlated. This is possible because the spread formed by the two stocks tends to revert to its mean in the long run []. In the early days, pairs-trading methods were popular because of the opportunity to obtain arbitrage profit [–]. However, as many investors, including hedge funds, sought these arbitrage opportunities by executing the pairs-trading strategy, its profitability began to deteriorate [, ]. To overcome these shortcomings, significant research has been conducted to improve the pairs-trading strategy [–].

The mechanism of pairs trading is as follows. First, a pair of stocks with similar trends is identified. Second, a regression method such as ordinary least squares (OLS), total least squares (TLS), or an error correction model (ECM) is used to calculate the spread of these stocks. Finally, if the spread hits preset boundaries, investors open a portfolio that takes a long position in the undervalued stock and a short position in the overvalued stock. Subsequently, if the spread reverts to the mean, investors close the portfolio by taking the positions opposite to those used to open it. In this case, the investor obtains an arbitrage profit by executing this strategy. However, there is a risk when the spread does not revert to the mean. In such a situation, investors are at high risk because they cannot close the portfolio. By setting a stop-loss boundary, investors can hedge this risk [–].

Many researchers have applied various statistical methods to improve the efficiency and performance of pairs trading. In particular, they focused on using the spread as a trading signal. The study in [] collected pairs of stocks by minimizing the sum of squared deviations between the two stocks and then executed the trading strategy if the difference between the pairs was twice the standard deviation of the spread. They used normalized US stock price data from to to test the profitability of pairs trading. The study in [] used the cointegration approach to protect the pairs-trading strategy from severe losses. They applied an OLS method to create a spread and set various conditions that translated into trading actions. From these models, they achieved a trading strategy with a minimum level of profits protected from risk of loss. The results showed an annualized excess return of about % over the entire period. The research in [] compared the distance and cointegration approaches on both high-frequency and daily datasets to check whether pairs trading is profitable for Norwegian seafood companies. The performance of the two approaches was similar. Reference [] used a Kalman filter to calculate the spread, which was then used as a high-frequency trading signal, on the shares constituting the KOSPI Index. The author found that the pairs-trading strategy's performance was significant on the KOSPI and was better at market opening and closing. Moreover, [] optimized a pairs-trading system as a stochastic control problem. They used the Ornstein-Uhlenbeck process to calculate the spread as a trading signal and tested their model with simulated data; the results showed that their strategy performs well. In addition, [] suggested using the Ornstein-Uhlenbeck process to model market microstructure noise as a trading signal in the pairs-trading strategy. The performance was better under this method than with traditional estimators such as ARIMA(,) and maximum likelihood. Reference [] applied a cointegration method to Chinese commodity futures from to to check whether pairs trading was suitable in that market. They used OLS regression to create spreads from the pairs. Furthermore, [] applied a cointegration test to assorted pairs of stocks and a vector error-correction model to create a trading signal.

Hindawi
Complexity
Volume 2019, Article ID 3582516, 20 pages
https://doi.org/10.1155/2019/3582516

It is important to set a boundary to optimize the pairs-trading strategy. This boundary is the criterion for deciding whether to execute a pairs-trading strategy. If a low boundary is set, many trades will be executed, but profits will be lower; if a high boundary is set, investors will earn high returns when the strategy is executed. However, all this assumes that mean reversion occurs. If the spread does not return to the average within the specified trading window, losses will be incurred. If a low boundary is set, the loss will be small; if the strategy is executed with a high boundary, the loss will be larger. Therefore, the performance of pairs trading depends on how the boundary is set. Reference [] suggested imposing a minimum-profit condition, which could be efficient in reducing losses in a pairs-trading system. They set a trading rule with a range of open conditions: for example, opening when the spread is above ., ., ., ., and . standard deviations. They used the daily closing prices from January , , to August , , of two stocks, the Australia New Zealand Bank and the Adelaide Bank. The results showed that, as the open condition value decreases, the number of trades and the profits increase. Also, [] suggested optimal preset boundaries calculated from estimated parameters for the average trade duration, intertrade interval, and number of trades and used them to maximize the minimum total profit. They used the daily closing price data from January , , to June , , of seven pairs of stocks on the Australian Stock Exchange. The results showed that their proposed method was efficient in making profits using the pairs-trading strategy. Reference [] examined whether the pairs-trading strategy could be applied to the daily returns of Chinese commodity futures from to using three methods: classical, closed-loop, and dynamic stop-loss. The closed-loop method takes only a stop-profit barrier which executes the strategy and does not consider the risk if spreads revert to the mean. The classical method adds stop-loss boundaries to the closed-loop method. The dynamic stop-loss method uses a variety of stop-profit and stop-loss barriers to fit the spreads if the spread is larger than the standard deviation, which is set using criteria based on the historical average of spreads. The results showed that these methods obtained an annualized return of over %, especially the closed-loop method, which yielded the highest profit of .%. In addition, [] experimented with fixed optimal threshold selection, conditional volatility, percentile, spectral analysis, and neural network thresholds in the pairs-trading strategy. Of these, the neural network threshold outperformed all other strategies.

Following the success of reinforcement learning, demonstrated by its performance at Atari games [], many researchers have attempted to apply this algorithm to financial trading systems. Reference [] proposed a deep Q-trading system using reinforcement learning methods. They applied Q-learning to a trading system so that it trades automatically. They set a delta price using data from the past days, had three discrete actions (buy, hold, and sell), and used long-term profit as the reward. They used daily data from January , , to December , , of the Hang Seng Index and the S&P 500 Index. The experimental results showed that their proposed method outperformed buy-and-hold strategies and recurrent reinforcement learning methods. Reference [] proposed three steps to apply reinforcement learning to financial trading systems. First, they reduced the relative replay size to fit financial trading. Second, they proposed an action-augmentation technique that provides more feedback from the action to the agent. Third, they used long sequences as reinforcement data to conduct recurrent neural network training. The experimental data comprised tick-by-tick data of forex currency pairs from January to December . The results showed that the action-augmentation technique yielded more profit than an epsilon-greedy policy. Reference [] used an N-armed bandit problem to optimize the pairs-trading strategy. They computed the spread using an error-correction model and found the parameters using a grid-search algorithm. They compared their proposed model with a constant-parameter model, which was similar to a traditional pairs-trading strategy. They used intraday one-minute data of some stocks in the FactSet database from June to January . The performance of their proposed model was better than that of the constant-parameter model.

We investigate not only whether a dynamic boundary based on the spread in each trading window can achieve higher profit than the fixed boundary used in the traditional pairs-trading strategy, but also whether deep reinforcement learning methods can be trained to follow this mechanism. To this end, we propose a new method to optimize the pairs-trading strategy using deep reinforcement learning, especially deep Q-networks, since the pairs-trading strategy can be thought of as a game. After a portfolio position is opened, the profit depends on whether the portfolio ends in a closed or a stop-loss position. Therefore, if we set this strategy up as a game by setting boundaries that are optimized for the spreads in each trading window, we can achieve more profit than traditional pairs-trading strategies. In particular, we set the pairs-trading system to be a kind of game and obtain the optimal boundaries, that is, the trading thresholds and stop-loss thresholds, according to the calculated spread. The reason for this construction is that if the portfolio is opened and closed within the trading window of the calculated spread, it will be unconditionally profitable. If the portfolio reaches the stop-loss boundary or does not converge to the mean, losses may occur. We therefore set the DQN to learn by positively rewarding it if it takes a closed position and negatively rewarding it if it reaches the stop-loss or exit thresholds. We conducted the following experiments to verify that our proposed method is optimized compared to the conventional method. First, we used different spreads calculated using OLS and TLS to see how the results differ depending on the spread used as input. Second, since the spread and hedge ratio vary with the formation window and trading window, we set a total of six window sizes and selected the window size with the best performance. Finally, we compared the proposed method with the traditional pairs-trading strategy using the test data with the optimal window size. In this experiment, we use the daily adjusted closing prices from January , , to July , , of stocks in the S&P 500 Index. Experimental results show that our proposed method outperforms the traditional pairs-trading strategy across all the pairs. In addition, we can confirm that the performance measure varies according to the spread.

The main contributions of this study are as follows. First, we propose a novel method to optimize the pairs-trading strategy using deep reinforcement learning, especially deep Q-networks with trading and stop-loss boundaries. The experimental results show that our method can be applied to the pairs-trading system and also to various other fields, including finance and economics, when there is a need to make a rule-based strategy more efficient. Second, we propose an optimized dynamic boundary based on the spread in each trading window. Our proposed method outperforms the traditional pairs-trading strategy, which sets a fixed boundary. Last, we find that our method outperforms the traditional pairs-trading strategy in all pairs based on constituent stocks of the S&P 500. Since our method selects optimal boundaries based on spreads, it can be applied to other stock markets such as the KOSPI, Nikkei, and Hang Seng. It should be noted that the present work is part of the Master's thesis [].

The rest of this paper is organized as follows. Section 2 explains the technical background. Section 3 describes the materials and methods. Section 4 shows the results and provides a discussion of the experiments. Section 5 provides our conclusions.

2. Technical Background

2.1. The Traditional Pairs-Trading Strategy. Pairs trading is a representative market-neutral trading strategy which simultaneously longs an undervalued stock and shorts an overvalued stock. This strategy is a form of statistical arbitrage trading that assumes the movements of the prices of the two assets will be similar to previous trends []. It follows the assumption that asset prices will return to the long-term equilibrium. This strategy started from the idea that arbitrage opportunities exist when the price gap between two assets expands to or past a certain level. It is also based on the belief that historical price movements will not change significantly in the future.

In Figure 1, the graph drawn in blue is a spread made of two stocks that are cointegrated, the red lines are the trading boundaries, and the green lines are the stop-loss boundaries. When this spread reaches the trading boundaries, the portfolio is opened and is closed only when the spread returns to the average. However, losses are incurred when the spread reaches the stop-loss boundaries after the portfolio is opened and does not return to the average. Furthermore, if the trading signal does not revert to the mean during the trading window after the portfolio is opened, the portfolio is closed by force; this is called the exit position of the portfolio.
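The rule described for Figure 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `run_trading_window`, the outcome labels, and the assumption that the boundaries are symmetric around a zero-mean Z-score (with the stop-loss boundary wider than the trading boundary) are hypothetical choices made for the example.

```python
from typing import Sequence, Tuple


def run_trading_window(z: Sequence[float], trade: float, stop: float) -> Tuple[str, int, int]:
    """Classify one trading window of Z-scored spread values.

    Assumes symmetric boundaries +/-trade and +/-stop with stop > trade > 0
    (an assumption of this sketch). Returns (outcome, open_index, end_index),
    where outcome is "none", "close", "stop_loss", or "exit".
    """
    open_idx = None
    side = 0  # +1 if opened above the mean, -1 if opened below it
    for i, value in enumerate(z):
        if open_idx is None:
            # Open the portfolio when the spread touches a trading boundary.
            if abs(value) >= trade:
                open_idx = i
                side = 1 if value > 0 else -1
            continue
        # The spread moved further away and touched the stop-loss boundary.
        if side * value >= stop:
            return "stop_loss", open_idx, i
        # The spread came back to (or crossed) the mean: close the portfolio.
        if side * value <= 0.0:
            return "close", open_idx, i
    if open_idx is None:
        return "none", -1, -1
    # Still open at the end of the window: forced exit.
    return "exit", open_idx, len(z) - 1
```

For instance, with a trading boundary of 1.5 and a stop-loss boundary of 3.0, a window whose Z-score rises to 1.6 and later falls below zero ends in a closed position, while one that stays between the mean and the stop-loss boundary until the window ends is an exit.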

2.1.1. The Cointegration Test. There are many approaches for pair selection, such as the distance approach [, –], the cointegration approach [, , ], and the stochastic approach [, ]. In this study, we use the cointegration approach to choose pairs that have a long-term equilibrium. Generally, a linear combination of nonstationary variables is also nonstationary. Assume that $x_t$ and $y_t$ have unit roots; as mentioned above, a linear combination of these variables is then generally nonstationary.

$$x_t \sim I(1), \quad y_t \sim I(1) \tag{1}$$

$$y_t = \alpha + \beta x_t + \varepsilon_t \tag{2}$$

However, the relationship can be stationary if the nonstationary variables are cointegrated. In this case, the regression must be checked to determine whether it is a spurious regression or a cointegrating one. Johansen's method is widely used to test for cointegration []. In this method, the number of cointegration relations and the parameters of the model are estimated and tested using maximum likelihood estimation (MLE). Since all variables are regarded as endogenous, there is no need to select dependent variables, and multiple cointegration relationships can be identified. In addition, we use MLE to estimate the cointegration relation with the vector autoregression model and to determine the cointegration coefficient based on the likelihood-ratio test. The method therefore has the advantage that, beyond merely testing for cointegration, it supports various hypothesis tests related to the estimation of cointegration parameters and the specification of other models when cointegration is present.

Figure 1: The traditional pairs-trading strategy (Z-score of the trading signal by year, with the trading and stop-loss boundaries).

2.2. Spread Calculation

2.2.1. Ordinary Least Squares. In regression analysis, OLS is widely used to estimate parameters by minimizing the sum of the squared errors []. Assume that $x_i$, $y_i$, and $\varepsilon_i$ are an independent variable, a dependent variable, and an error term. We can estimate $\beta$ from the following equations by taking a partial derivative:

$$y_i = \beta x_i + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma_{\varepsilon}^2) \tag{3}$$

$$\sum_{i=1}^{n} (y_i - \beta x_i)^2 \tag{4}$$

$$\hat{\beta} = \left( \sum_{i=1}^{n} x_i x_i \right)^{-1} \sum_{i=1}^{n} x_i y_i \tag{5}$$

The value $\hat{\beta}$ obtained from equation (5) is used for the number of stock orders. The error term $\varepsilon$ is also used as a trading signal after Z-scoring, in the state composed of the formation-window size.
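As a worked illustration of the OLS step, the following sketch estimates the no-intercept slope by the closed-form sum ratio and converts the residuals into a Z-score signal. The function names and the use of the population standard deviation are assumptions made for this example; the source does not specify them.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def ols_slope_no_intercept(x: Sequence[float], y: Sequence[float]) -> float:
    """Closed-form OLS slope for y = beta * x + eps (no intercept)."""
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    return sxy / sxx


def zscore_signal(x: Sequence[float], y: Sequence[float]) -> Tuple[float, List[float]]:
    """Return (hedge ratio, Z-scored residuals) over one formation window."""
    beta = ols_slope_no_intercept(x, y)
    residuals = [yi - beta * xi for xi, yi in zip(x, y)]
    mu = mean(residuals)
    sd = pstdev(residuals)  # population standard deviation (assumption)
    if sd == 0.0:
        raise ValueError("residuals are constant; Z-score is undefined")
    return beta, [(e - mu) / sd for e in residuals]
```

The returned slope plays the role of the hedge ratio, and the standardized residuals play the role of the trading signal compared against the boundaries.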

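2.2.2. Total Least Squares. TLS estimates parameters to minimize the sum of the measured distance and the vertical distance between regression lines []. Since the vertical distance does not change when the X and Y coordinates are changed, the value of $\beta$ is calculated consistently. In the TLS method, the observed values of $y_i$ and $x_i$ have the following error terms:

$$y_i = y_i^{*} + e_i, \quad e_i \sim N(0, \sigma_e^2) \tag{6}$$

$$x_i = x_i^{*} + u_i, \quad u_i \sim N(0, \sigma_u^2) \tag{7}$$

where $y_i^{*}$ and $x_i^{*}$ are true values and $e_i$ and $u_i$ are error terms following independent identical distributions. It is assumed that there is a linear combination of the true values. For convenience, we represent the error variance ratio as $\lambda$ in equation (10):

$$y_i^{*} = \beta_0 + \beta_1 x_i^{*} \tag{8}$$

$$y_i = \beta_0 + \beta_1 x_i^{*} + e_i, \quad e_i \sim N(0, \sigma_e^2) \tag{9}$$

$$\lambda = \frac{\operatorname{var}(y_i \mid x_i^{*})}{\operatorname{var}(x_i \mid x_i^{*})} = \frac{\sigma_e^2}{\sigma_u^2} \tag{10}$$

The orthogonal regression estimator is calculated by minimizing the sum of the measured distance and the vertical distance between regression lines in equation (11):

$$\sum_{i=1}^{n} \left[ \left( y_i - \beta_0 - \beta_1 x_i^{*} \right)^2 + \lambda \left( x_i - x_i^{*} \right)^2 \right] \tag{11}$$

$$\hat{\beta}_1 = \frac{\sigma_{YY}^2 - \lambda \sigma_{XX}^2 + \left[ \left( \sigma_{YY}^2 - \lambda \sigma_{XX}^2 \right)^2 + 4 \lambda \sigma_{XY}^2 \right]^{1/2}}{2 \sigma_{XY}} \tag{12}$$

Here $\sigma_{YY}^2$ and $\sigma_{XX}^2$ denote the sample variances of $y$ and $x$, and $\sigma_{XY}$ denotes their sample covariance. The value $\hat{\beta}_1$ obtained from equation (12) is used in the same way as that obtained from equation (5), and the error term is also used as a trading signal through the Z-score in the state composed of the formation-window size.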
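The closed-form slope of the orthogonal (errors-in-variables) regression can be checked numerically. The sketch below assumes the standard Deming form of the estimator, with the error variance ratio `lam` defaulting to 1 (pure orthogonal regression); the function name and default are illustrative assumptions, not values taken from the source.

```python
import math
from typing import Sequence


def deming_slope(x: Sequence[float], y: Sequence[float], lam: float = 1.0) -> float:
    """Errors-in-variables slope; lam is the ratio var(y error) / var(x error)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x) / n          # sample variance of x
    syy = sum((yi - my) ** 2 for yi in y) / n          # sample variance of y
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n  # covariance
    if sxy == 0.0:
        raise ValueError("zero covariance; slope is undefined")
    d = syy - lam * sxx
    return (d + math.sqrt(d * d + 4.0 * lam * sxy * sxy)) / (2.0 * sxy)
```

With `lam = 1`, swapping the roles of the two series yields the reciprocal slope, which is the symmetry property that distinguishes this estimator from OLS, where the slope depends on which series is treated as the regressor.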

2.3. Reinforcement Learning and the Deep Q-Network. The idea of reinforcement learning is to find an optimal policy that maximizes the expected sum of discounted future rewards []. These rewards come from selecting the optimal value of each action, called the optimal Q-value. Reinforcement learning basically solves the problem defined by the Markov decision process (MDP). It consists of a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where $\mathcal{S}$ is a finite set of states, $\mathcal{A}$ is a finite set of actions, $\mathcal{P}$ is a state transition probability matrix, $\mathcal{R}$ is a reward function, and $\gamma$ is a discount factor. In the environment, the agent observes state $s_t$ at time $t$ and selects action $a_t$. As a result, the environment provides feedback to the agent in the form of reward $r_t$ and next state $s_{t+1}$. An action is selected by the action-value function $Q^{\pi}(s, a)$, which represents the expected sum of discounted future rewards.

$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi} \left[ \sum_{i=t}^{T} \gamma^{i-t} r_i \,\middle|\, s_t, a_t \right] \tag{13}$$

From this action-value function $Q^{\pi}(s_t, a_t)$, we find an optimal action-value function $Q^{*}(s_t, a_t)$, following an optimal policy which maximizes the expected sum of discounted future rewards.

$$Q^{*}(s_t, a_t) = \max_{\pi} Q^{\pi}(s_t, a_t) \tag{14}$$

This optimal action-value function can be formulated as the Bellman equation.

$$Q^{*}(s_t, a_t) = \max_{a_{t+1}} \left[ r_t + \gamma Q^{*}(s_{t+1}, a_{t+1}) \right] \tag{15}$$

The DQN uses a nonlinear function approximator to estimate the action-value function. This network is trained by minimizing a sequence of loss functions $L_t(\theta_t)$, which changes with each step $t$. The weight $\theta_t$ is updated as the sequence progresses:

$$L_t(\theta_t) = \mathbb{E}_{(s,a) \sim \rho(\cdot)} \left[ \left( y_t - Q(s_t, a_t; \theta_t) \right)^2 \right] \tag{16}$$

$$y_t = \max_{a_{t+1}} \left[ r_t + \gamma Q(s_{t+1}, a_{t+1}; \theta_{t-1}) \,\middle|\, s_t, a_t \right] \tag{17}$$
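To make the target and loss concrete, the sketch below computes the temporal-difference target and the mean squared loss over a small batch using plain Python callables in place of a neural network. The function names, the tuple layout of a transition, and the use of a separate "previous" Q-function are assumptions of this example.

```python
from typing import Callable, List, Sequence, Tuple

# (state, action index, reward, next state, terminal flag); layout is illustrative.
Transition = Tuple[int, int, float, int, bool]


def td_target(reward: float, next_q_values: Sequence[float], gamma: float, terminal: bool) -> float:
    """Target y = r for terminal transitions, else r + gamma * max_a' Q(s', a')."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)


def mean_squared_td_loss(
    batch: List[Transition],
    q: Callable[[int], Sequence[float]],
    q_prev: Callable[[int], Sequence[float]],
    gamma: float,
) -> float:
    """Average of (y - Q(s, a))^2, with y computed from the previous parameters."""
    total = 0.0
    for state, action, reward, next_state, terminal in batch:
        y = td_target(reward, q_prev(next_state), gamma, terminal)
        total += (y - q(state)[action]) ** 2
    return total / len(batch)
```

In a DQN the callables would be forward passes of the network with the current and the previous weights; a gradient step on this loss then moves the current estimate toward the target.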

3. Materials and Methods

3.1. Data. In this study, stocks from the S&P 500 Index were selected based on their trading volume and market capitalization. To carry out the experiment, the data must cover the same period. Therefore, only stocks with data for the whole period were selected, leaving a total of stocks. Table 1 presents the stock names, their ticker abbreviations, and their respective sectors. We collected the adjusted daily closing prices from Thomson Reuters' database. The training dataset covers the period from January , , to December , , comprising data points; the test dataset covers the period from January , , to July , , comprising data points. From these datasets, pairs of stocks are selected over the training period using the cointegration test.

3.2. Selecting Pairs Using the Cointegration Test. It is necessary to pair stocks which have long-run statistical relationships or similar price movements. The degree to which two stocks have had similar price movements can be determined through the correlation value. Furthermore, the long-term equilibrium of a pair of stocks is an important characteristic for the execution of pairs trading. In this study, we used the cointegration approach to select pairs of stocks. Through Johansen's method, we selected pairs of stocks that have long-run equilibria. Table 2 shows the resulting pairs of stocks that were identified based on t-statistics, and Figure 2 shows the price movements of the cointegrated stocks XOM and CVX. Using this dataset, we verify whether our proposed method performs better than the traditional pairs-trading method.

3.3. Trading Signal. After selecting the pairs, it is necessary to extract the signal for trading. To extract signals, we opt for the OLS or TLS methods. First, because the stock price follows a random walk [], we need to ensure that it follows the $I(1)$ process through the augmented Dickey-Fuller test. Subsequently, the $I(0)$ process should be created using the logarithmic difference in stock prices, which is then applied to the OLS and TLS methods. In equation (18), $\alpha_1$ is a constant value, $\beta_1$ is a hedge ratio (which is used as trading size), $\varepsilon_t$ is the error term, and $\log P_{A,t}$ and $\log P_{B,t}$ are the logarithmic differences in the stock prices of $A$ and $B$ at time $t$. We convert the values of $\varepsilon_t$ into a Z-score, which is used as a trading signal. For example, if the trading signal reaches the threshold, we short one share of the overvalued stock (represented as $\log P_{A,t}$) and long $\beta_1$ shares of the undervalued stock (represented as $\log P_{B,t}$). The hedge ratio is determined based on the window size. We set a total of six discrete window sizes to obtain the optimal window size for the experiment. Each trading window is half the size of the formation window. The spread obtained here is used as a state when applying reinforcement learning (i.e., as an input of the DQN).

$$\log P_{B,t} = \alpha_1 + \beta_1 \log P_{A,t} + \varepsilon_t \tag{18}$$
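The window layout can be illustrated with a short sketch. It assumes that each formation window is immediately followed by a trading window of half its length and that consecutive windows advance by one trading window; the stepping rule, the function names, and the choice to standardize the trading-window spread with formation-window statistics are assumptions of this example, not details given in the source.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def window_slices(n_obs: int, formation: int) -> List[Tuple[slice, slice]]:
    """Return (formation, trading) slice pairs; trading length is formation // 2."""
    trading = formation // 2
    pairs = []
    start = 0
    while start + formation + trading <= n_obs:
        pairs.append(
            (slice(start, start + formation),
             slice(start + formation, start + formation + trading))
        )
        start += trading  # advance by one trading window (assumption)
    return pairs


def zscore_with_formation_stats(formation_spread: Sequence[float],
                                trading_spread: Sequence[float]) -> List[float]:
    """Standardize the trading-window spread with formation-window mean and std."""
    mu = mean(formation_spread)
    sd = pstdev(formation_spread)
    if sd == 0.0:
        raise ValueError("formation-window spread is constant")
    return [(value - mu) / sd for value in trading_spread]
```

Applied to a series of 10 observations with a formation window of 4, this produces three formation and trading window pairs, each trading window covering the 2 observations that follow its formation window.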

3.4. Proposed Method: Optimized Pairs-Trading Strategy Using the DQN Method. In this study, we optimize the pairs-trading strategy by treating it as a type of game and applying the DQN. We attempt to implement an optimal pairs-trading strategy by taking the optimal trading and stop-loss boundaries that correspond to the given spread, since performance depends on how the trading and stop-loss boundaries are set in pairs trading []. Figure 3 shows the mechanism of our proposed pairs-trading strategy. Through the cointegration test, we identify pairs and, using regression analysis, obtain a hedge ratio used as trading volume and a spread used as a trading signal and state. For the DQN, two hidden layers are set up, and the number of neurons, chosen through trial and error, is half of the input size. Action values consist of the six discrete spaces in Table 3. Each value of $a_t$ corresponds to a pair of trading and stop-loss boundaries.

A pairs-trading system can make a profit if the spread touches the threshold and returns to the average such that the portfolio is closed in each trading window. On the other hand, if the trading boundary is touched and the stop-loss boundary is then reached, the system tries to minimize losses by stopping trades. If the spread touches the trading boundary but fails to return to the average, the strategy may end up with a profit or a loss. In this study, the pairs-trading strategy is therefore considered as a kind of game; closing a portfolio yields a positive reward, and a portfolio that reaches its stop-loss threshold yields a negative reward. Although an exited portfolio may generate a positive profit, there is also a possibility that losses will occur, and it is therefore set to yield a negative reward. We set the reward for the other conditions (such as maintaining the portfolio or not executing the portfolio) to zero so as to concentrate on the close, stop-loss, and exit positions.

$$\mathrm{TP}_t = V_{A,t} \times \left| \frac{P_{A,t'} - P_{A,t}}{P_{A,t}} \right| + V_{B,t} \times \left| \frac{P_{B,t'} - P_{B,t}}{P_{B,t}} \right|, \quad t < t' \tag{19}$$

$$r_t = \begin{cases} 1000 \times \mathrm{TP}_t & \text{if the portfolio is closed} \\ -1000 \times \mathrm{TP}_t & \text{if the portfolio reaches the stop-loss threshold} \\ -500 \times \mathrm{TP}_t & \text{if the portfolio is exited} \end{cases} \tag{20}$$

Table 1: The stocks on the S&P 500 Index used in this study.

| No. | Ticker | Stock | Sector |
|---|---|---|---|
| 1 | AAPL | Apple Inc. | Technology |
| 2 | MSFT | Microsoft Corporation | Technology |
| 3 | BRKa | Berkshire Hathaway Inc. | Financial Services |
| 4 | JPM | JPMorgan Chase & Co. | Financial Services |
| 5 | JNJ | Johnson & Johnson | Healthcare |
| 6 | XOM | Exxon Mobil Corporation | Energy |
| 7 | BAC | Bank of America Corporation | Financial Services |
| 8 | WFC | Wells Fargo & Company | Financial Services |
| 9 | WMT | Walmart Inc. | Consumer Defensive |
| 10 | UNH | UnitedHealth Group Incorporated | Healthcare |
| 11 | CVX | Chevron Corporation | Energy |
| 12 | T | AT&T Inc. | Communication Services |
| 13 | PFE | Pfizer Inc. | Healthcare |
| 14 | ADBE | Adobe Systems Incorporated | Technology |
| 15 | MCD | McDonald's Corporation | Consumer Cyclical |
| 16 | MDT | Medtronic plc | Healthcare |
| 17 | MMM | 3M Company | Industrials |
| 18 | HON | Honeywell International Inc. | Industrials |
| 19 | GE | General Electric Company | Industrials |
| 20 | ABT | Abbott Laboratories | Healthcare |
| 21 | MO | Altria Group, Inc. | Consumer Defensive |
| 22 | UNP | Union Pacific Corporation | Industrials |
| 23 | TXN | Texas Instruments Incorporated | Technology |
| 24 | UTX | United Technologies Corporation | Industrials |
| 25 | LLY | Eli Lilly and Company | Healthcare |

Figure 2: Cointegrated stock price movements (prices of XOM and CVX by year).

We fix the reward scales of portfolio close, stop-loss, and exit to +1000, −1000, and −500, respectively. When we update the Q-values, we must consider the reward as a significant component of efficiently training the DQN. We therefore set the reward value to have a range similar to that of the Q-value. Additionally, we included the corresponding profit or loss value $\mathrm{TP}_t$ to reflect that weight after the trading ended. In equation (19), $V_{A,t}$ and $V_{B,t}$ are the stock orders of stocks $A$ and $B$ at time $t$, $P_{A,t}$ and $P_{B,t}$ are the stock prices of $A$ and $B$ at time $t$, and $P_{A,t'}$ and $P_{B,t'}$ are the stock prices of $A$ and $B$ at time $t'$.
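The reward scheme can be written as a small function. This is a sketch under stated assumptions: the names `trade_profit`, `REWARD_SCALE`, and `reward`, the outcome labels, and the reading of the trade profit as the order-weighted absolute price change between opening and ending the trade are illustrative interpretations of the text, not code from the authors.

```python
# Reward scales for the three terminal outcomes, as stated in the text.
REWARD_SCALE = {"close": 1000.0, "stop_loss": -1000.0, "exit": -500.0}


def trade_profit(v_a: float, p_a_open: float, p_a_end: float,
                 v_b: float, p_b_open: float, p_b_end: float) -> float:
    """Order-weighted sum of absolute relative price changes of both legs."""
    leg_a = v_a * abs((p_a_end - p_a_open) / p_a_open)
    leg_b = v_b * abs((p_b_end - p_b_open) / p_b_open)
    return leg_a + leg_b


def reward(outcome: str, profit: float) -> float:
    """Scaled reward; any other outcome (hold, no trade) is rewarded with zero."""
    return REWARD_SCALE.get(outcome, 0.0) * profit
```

Because the trade profit is an absolute magnitude, the sign of the reward is carried entirely by the outcome: a closed position is always rewarded, and stop-loss and exit positions are always penalized, the latter at half the scale.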

Algorithm 1 shows the process of our proposed method. Before we start, we set a replay memory and batch size and select pairs using the cointegration test. At each epoch, we initialize the total profit to .. In the training scheme, we set a state which contains the spreads within the formation window and select actions which are used as trading and stop-loss boundaries. Throughout the trading window, we execute a strategy similar to a traditional pairs-trading strategy using the action selected. After executing the strategy, we obtain a reward based on the results of the portfolio. Finally, for the Q-learning process, we update the Q-network by performing a gradient descent step.

Figure 3: Steps for the proposed pairs-trading strategy using the DQN method. Pairs are filtered from 50 constituent stocks of the S&P 500 Index based on trading volume, liquidity, and the cointegration test; pairs of stocks are constructed and the dataset is preprocessed using OLS or TLS; the spread and hedge ratio are the inputs of the DQN agent; the Q-values are the outputs of the DQN, and the action with the maximum Q-value is selected; the environment returns a reward.

Algorithm 1: Optimized pairs-trading system using DQN.

    Initialize replay memory $D$ and batch size $N$
    Initialize deep Q-network
    Select pairs using cointegration test
    For each epoch do
        Profit = .
        For steps t = , ... until end of training data set do
            Calculate spreads using OLS or TLS methods
            Obtain initial state $s_t$ by converting spread to Z-score based on formation window
            Using epsilon-greedy method, with probability $\epsilon$ select a random action $a_t$
            Otherwise select $a_t = \arg\max_a Q(s_t, a)$
            Execute traditional pairs-trading strategy based on the action selected
            Obtain reward $r_t$ by performing the pairs-trading strategy
            Set next state $s_{t+1}$
            Store transition $(s_t, a_t, r_t, s_{t+1})$ in $D$
            Sample minibatch of transitions $(s_t, a_t, r_t, s_{t+1})$ from $D$
            Set $y_t = r_t$ if $s_{t+1}$ is terminal; $y_t = r_t + \gamma \max_a Q(s_{t+1}, a)$ if $s_{t+1}$ is non-terminal
            Update Q-network by performing a gradient descent step on $(y_t - Q(s_t, a_t))^2$
        End
    End
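Two building blocks of the training loop, epsilon-greedy action selection and the replay memory, can be sketched with the standard library alone. The boundary values in `ACTIONS` are placeholders invented for this example (the paper's action table is not reproduced here), and the class and function names are likewise hypothetical.

```python
import random
from collections import deque
from typing import List, Sequence, Tuple

# Hypothetical (trading boundary, stop-loss boundary) pairs in Z-score units.
# These six values are placeholders, not the values used in the paper.
ACTIONS: List[Tuple[float, float]] = [
    (0.5, 2.5), (1.0, 3.0), (1.5, 3.5), (2.0, 4.0), (2.5, 4.5), (3.0, 5.0),
]


def select_action(q_values: Sequence[float], epsilon: float, rng: random.Random) -> int:
    """With probability epsilon pick a random action index, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


class ReplayMemory:
    """Fixed-capacity buffer of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity: int) -> None:
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first

    def push(self, transition: tuple) -> None:
        self.buffer.append(transition)

    def sample(self, batch_size: int, rng: random.Random) -> list:
        """Uniformly sample a minibatch without replacement."""
        k = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), k)
```

The selected index would be mapped through `ACTIONS` to the boundaries used for the next trading window, and each resulting transition would be pushed to the memory before a minibatch is sampled for the gradient step.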

3.5. Performance Measure. We evaluate our experimental results based on profit, maximum drawdown, and the Sharpe ratio. Profit is commonly used as a performance measure for trading strategies. It is calculated as the sum of returns taking trading cost into consideration. Since many trades can increase total profit, it is necessary to determine the total profit net of transaction costs, which depend on trading volume. In this study, we set a trading cost of bp; equation (21) is almost the same as equation (19), but it does not include the absolute value, and $c$ is the trading cost. Maximum drawdown represents the maximum cumulative loss from the highest to the lowest value of the portfolio during a given investment period, where $X(t)$ is the value of the portfolio and $T$ is the terminal time. The Sharpe ratio is an indicator of the degree of excess profit from investing in risky assets and is used in evaluating portfolios []. In equation (23), $R_p$ is the expected sum of portfolio returns and $R_f$ is the risk-free rate; we set this value to , and $\sigma_p$ is the standard deviation of portfolio returns.

$$\text{Profit} = \sum_{t=1}^{T} \left[ V_{1,t} \times \frac{P_{1,t'} - P_{1,t}}{P_{1,t}} + V_{2,t} \times \frac{P_{2,t'} - P_{2,t}}{P_{2,t}} - c \times \left( V_{1,t} + V_{2,t} \right) \right] \tag{21}$$

$$\text{MDD}(T) = \max_{\tau \in (0,T)} \left[ \max_{t \in (0,\tau)} X(t) - X(\tau) \right] \tag{22}$$

Table 2: Summary statistics for pairs verified using cointegration tests.

| No. | Pairs | t-statistic | Correlation |
|---|---|---|---|
| 1 | MSFT/JPM | −.** | . |
| 2 | MSFT/TXN | −.** | . |
| 3 | BRKa/ABT | −.** | . |
| 4 | BRKa/UTX | −.** | . |
| 5 | JPM/T | −.** | . |
| 6 | JPM/HON | −.*** | . |
| 7 | JPM/GE | −.** | . |
| 8 | JNJ/WFC | −.** | . |
| 9 | XOM/CVX | −.*** | . |
| 10 | HON/TXN | −.*** | . |
| 11 | GE/TXN | −.** | . |

Note: *** and ** denote a rejection of the null hypothesis at the % and % significance levels, respectively.

$$\text{Sharpe ratio} = \frac{R_p - R_f}{\sigma_p} \tag{23}$$
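The two risk measures are simple to compute from a series of portfolio values or returns. In the sketch below, the function names are hypothetical, the mean of the per-period returns stands in for the expected portfolio return, the sample standard deviation is used, and the default risk-free rate of zero is an assumption of this example rather than a value taken from the text.

```python
from statistics import mean, stdev
from typing import Sequence


def max_drawdown(values: Sequence[float]) -> float:
    """Largest drop from a running peak of the portfolio value to a later value."""
    peak = values[0]
    worst = 0.0
    for value in values:
        peak = max(peak, value)          # highest value seen so far
        worst = max(worst, peak - value)  # drop from that peak to the current value
    return worst


def sharpe_ratio(returns: Sequence[float], risk_free: float = 0.0) -> float:
    """(mean return - risk-free rate) / sample standard deviation of returns."""
    return (mean(returns) - risk_free) / stdev(returns)
```

For a value path of 100, 120, 90, 110, 80, the running peak is 120 and the lowest later value is 80, so the maximum drawdown is 40.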


4. Results and Discussion

We use the stock pair XOM and CVX, which rejects the null hypothesis at the % significance level, to verify whether our proposed model is trained well. The window sizes, namely the formation window and the trading window, are selected from the performance results on the training dataset. From these results, we select an optimized window size and, on the test dataset, compare our proposed model with traditional pairs trading, which takes a constant set of actions.

4.1. Training Results. To find the optimum window size for the optimized pairs-trading system, we experimented with six window sizes; the results for each window size are calculated by averaging the top- results for a total of pairs. From Tables and , we find that the best performance is obtained when the formation and trading windows are and , respectively, based on the profit generated by both the OLS and TLS methods. When we trained our networks, we set a positive reward for taking more closed positions and fewer stop-loss and exit positions. The lowest ratio of closed positions to open positions occurs when the formation and trading windows are and days (.). In contrast, the highest ratio of closed positions occurs when the formation and trading windows are and days (.). However, the highest profits are reported when the formation and trading windows are and days. This can be explained by the ratio of stop-loss portfolios. When the formation and trading window sizes are and days, the ratio of stop-loss positions is ., but for the formation and trading window sizes of , it is .. This result indicates that it is important to reduce the number of stop-loss positions while increasing the number of closed positions. In addition, the trading signals made with the TLS method are better than those made with the OLS method for all six window sizes. The reason lies in the difference between the hedge ratios of the two methods. In OLS, one side is taken as the reference and the relative change of the other side is estimated. Since OLS assumes that there is no error component on the reference side and an error only on the other side, the hedge ratio varies depending on the side used as the reference. In TLS, however, the hedge ratio is the same regardless of which side is used as the reference. For this reason, the experimental results confirm that the TLS method is better able to determine when to execute the pairs-trading strategy. From these results, we take the optimum window size when we verify our proposed method on the test dataset. However, we first need to ensure that the model we proposed is well-trained.
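The difference between the two hedge ratios can be checked numerically. The sketch below is illustrative only: the synthetic prices, the function names, and the choice of a singular value decomposition to obtain the TLS (orthogonal regression) slope are assumptions, not the authors' code. It shows that the two OLS slopes obtained by swapping the reference side are not reciprocals of each other, whereas the two TLS slopes are.

```python
import numpy as np


def ols_hedge_ratio(reference, other):
    """OLS slope of `other` on `reference`; errors are assumed only in `other`."""
    x = np.asarray(reference, dtype=float)
    y = np.asarray(other, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / (xc @ xc))


def tls_hedge_ratio(reference, other):
    """TLS (orthogonal regression) slope; errors are allowed in both series."""
    x = np.asarray(reference, dtype=float)
    y = np.asarray(other, dtype=float)
    centered = np.column_stack([x - x.mean(), y - y.mean()])
    # The first right singular vector gives the direction of the best-fit line.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    dx, dy = vt[0]
    return float(dy / dx)


# Synthetic prices sharing a common random-walk trend (not the paper's data).
rng = np.random.default_rng(0)
common = 100.0 + rng.normal(size=500).cumsum()
price_a = common + rng.normal(scale=1.0, size=500)
price_b = 1.5 * common + rng.normal(scale=1.0, size=500)

ols_product = ols_hedge_ratio(price_a, price_b) * ols_hedge_ratio(price_b, price_a)
tls_product = tls_hedge_ratio(price_a, price_b) * tls_hedge_ratio(price_b, price_a)

print(ols_product)  # equals the squared correlation, so it is below 1
print(tls_product)  # the two TLS ratios are reciprocals of each other
```

Because the TLS line minimizes perpendicular distances, swapping the two series only swaps the axes of the same fitted line, which is why the implied hedge does not depend on the reference side.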

It is important to check whether our reinforcement learning algorithm is trained well. Reference [] suggested that a steadily increasing average of Q-values is evidence that the DQN is learning well. Figure (a) shows the average Q-values of HON and TXN as training progressed. We find that the average Q-values steadily increased, indicating that our proposed model is properly trained. In addition, we provide a positive reward when the portfolio closes and a negative reward when the portfolio reaches the stop-loss threshold or exits. Figure (b) shows the ratio of the number of portfolio positions as training progressed. The ratio of closed to open portfolio positions increased and the ratio of portfolios reaching their stop-loss thresholds to open portfolio positions decreased. We also find that the ratio of portfolio exits to open portfolio positions slightly increased. It is possible that the rewards given for an open portfolio position compared to those given for a closed portfolio position are relatively small. The DQN is therefore trained to prevent portfolios from reaching their stop-loss thresholds (the more important objective) rather than to prevent exits. This result can also serve as a basis for judging whether the proposed model is being trained properly.
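The two diagnostics discussed above can be computed per epoch as in the following sketch. It assumes that the "average Q-value" is the mean, over a fixed set of states, of the largest predicted action value (one common choice), and that each ratio is taken relative to the number of opened portfolios; the function names and inputs are hypothetical.

```python
import numpy as np


def average_max_q(q_values):
    """Mean over states of the largest action value.

    `q_values` has shape (n_states, n_actions).
    """
    q = np.asarray(q_values, dtype=float)
    return float(q.max(axis=1).mean())


def position_ratios(n_open, n_closed, n_stop_loss, n_exit):
    """Share of opened portfolios that were closed, stopped out, or exited."""
    if n_open == 0:
        return {"close": 0.0, "stop": 0.0, "exit": 0.0}
    return {
        "close": n_closed / n_open,
        "stop": n_stop_loss / n_open,
        "exit": n_exit / n_open,
    }


if __name__ == "__main__":
    print(average_max_q([[1.0, 2.0], [3.0, 0.0]]))
    print(position_ratios(n_open=10, n_closed=6, n_stop_loss=3, n_exit=1))
```

Logging both quantities after every epoch gives curves of the kind described for the average Q-value and the portfolio ratios.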

Tables and represent the performance results of XOM and CVX in the training dataset. We call our proposed model pairs-trading DQN (PTDQN) and refer to traditional pairs trading with constant action values as pairs trading with action 0 (PTA0) to pairs trading with action 5 (PTA5). From this result, we can confirm that our proposed method is more profitable than the constant pairs-trading strategies. In addition, we can see that the TLS method has higher profitability than the OLS method. From PTA0 to PTA5, the trading boundary and the stop-loss boundary grow larger, and the numbers of open portfolios, closed portfolios, and portfolios that reached their stop-loss thresholds are reduced. In other words, there is less opportunity for profit, but the probability of loss is also reduced. It is important not only to take a lot of closed positions, but also to take the best action to open and close the portfolio. For example, if a portfolio is


Table : Setting a discrete action space.

Action               A0     A1     A2     A3     A4     A5
Trading boundary     ±0.5   ±1.0   ±1.5   ±2.0   ±2.5   ±3.0
Stop-loss boundary   ±2.5   ±3.0   ±3.5   ±4.0   ±4.5   ±5.0
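The action space in the table above can be written as a lookup, together with a simplified rule for how one trading window is resolved. This is a sketch under stated assumptions, not the paper's exact procedure: it assumes the boundaries are applied to the z-score of the spread, that actions are indexed from zero, that at most one portfolio is opened per window, and that a portfolio closes when the z-score returns to or crosses the mean. The function name `window_outcome` is hypothetical.

```python
# Boundaries from the action-space table (assumed to be in z-score units).
TRADING_BOUNDARIES = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
STOP_LOSS_BOUNDARIES = [2.5, 3.0, 3.5, 4.0, 4.5, 5.0]


def window_outcome(z_scores, action):
    """Classify one trading window as 'none', 'closed', 'stop_loss' or 'exit'."""
    trade = TRADING_BOUNDARIES[action]
    stop = STOP_LOSS_BOUNDARIES[action]
    side = 0  # +1 if opened above the mean, -1 if opened below, 0 while flat
    for z in z_scores:
        if side == 0:
            # Open only when the spread is beyond the trading boundary
            # but has not already passed the stop-loss boundary.
            if trade <= abs(z) < stop:
                side = 1 if z > 0 else -1
        elif z * side <= 0:
            return "closed"      # spread reverted to the mean
        elif abs(z) >= stop:
            return "stop_loss"   # spread diverged past the stop-loss boundary
    # Still open at the end of the window: forced exit.
    return "exit" if side != 0 else "none"


if __name__ == "__main__":
    print(window_outcome([0.2, 1.2, 0.6, -0.1], action=1))
    print(window_outcome([0.2, 1.2, 3.5], action=1))
    print(window_outcome([0.2, 1.2, 3.5], action=3))
```

The same z-score path can end in a stop-loss under one action and remain open under another, which is the kind of trade-off the agent is asked to resolve when it picks a boundary pair for each window.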

Table : Results of applying the DQN method to each window size using OLS.

Formation window   Trading window   MDD   Sharpe ratio   Profit   No. of open portfolios   No. of closed portfolios   No. of stop-loss portfolios   No. of portfolio exits

−. . .

−. . .

−. . .

−. . .

−. . .

−. . .

Table : Results of applying the DQN method to each window size using TLS.

Formation window   Trading window   MDD   Sharpe ratio   Profit   No. of open portfolios   No. of closed portfolios   No. of stop-loss portfolios   No. of exited portfolios

−. . .

−. . .

−. . .

−. . .

−. . .

−. . .

Table : Average top- performance results for XOM and CVX using OLS within the training period.

Model   MDD   Sharpe ratio   Profit   No. of open portfolios   No. of closed portfolios   No. of stop-loss portfolios   No. of exited portfolios

PTDQN −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. −. .

Table : Average top- performance results for XOM and CVX using TLS within the training period.

Model   MDD   Sharpe ratio   Profit   No. of open portfolios   No. of closed portfolios   No. of stop-loss portfolios   No. of exited portfolios

PTDQN −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. . .

Figure : Verification that our proposed model is well-trained with HON and TXN using TLS. (a) Average of Q-values versus epoch. (b) Ratio of portfolios (close, stop, exit) versus epoch.

opened and closed by a boundary corresponding to action within the same spread and if a portfolio is opened and closed by a boundary corresponding to action , the corresponding profit is different. Assuming that the mean reversion is certain to occur, if we take the maximum boundary condition to open a portfolio, we will obtain a larger profit than when we take a smaller boundary condition. We can see that the PTDQN returns are higher than those of the best-performing traditional pairs-trading strategy with a constant action. Figures – show the changes in trading and stop-loss boundaries and the highest profit for constant action when applying the DQN method during the training period using OLS and TLS.

Figures and show comparisons of PTDQN and PTA using the TLS method. Figure consists of the spread, trading, and stop-loss boundaries. We find that the trading and stop-loss boundaries take different values in PTDQN, showing that it has learned to find the optimal boundary according to each spread. In contrast to PTDQN, PTA in Figure has constant trading and stop-loss boundaries. Figures and exhibit the same features we see in Figures and . The difference between these methods lies in the spreads: different results can be obtained depending on the spreads used. Making better spreads can therefore improve performance.

Figures and represent the profit corresponding to DQN and constant actions using TLS and OLS. Reference [] suggested that an average value over multiple trials should be presented to show the reproducibility of deep reinforcement learning because results may differ owing to high variance across trials and random seeds. We therefore conducted five trials with different random seeds. The profit graph of DQN represents the average profit of these trials, and the filled region spans the maximum and minimum profit values. We can see that PTDQN had a higher profit than the traditional pairs-trading strategies during

Figure : An example of optimizing pairs trading using PTDQN based on a training scheme using TLS. (Z-score versus year, 1990 to 2008; series: trading signal, trading boundary, stop-loss boundary.)

Figure : An example of PTA based on a training scheme using TLS. (Z-score versus year, 1990 to 2008; series: trading signal, trading boundary, stop-loss boundary.)

the training period. This means that, even with the same spread, profit changes as the boundaries are changed. In other words, finding the optimal boundary for the spread is an important factor in optimizing the profitability of pairs trading.
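The averaging over random seeds described above, with a band between the minimum and maximum profit, can be sketched as follows. The function name and the array layout (one row per trial, one column per time step) are assumptions for illustration.

```python
import numpy as np


def summarize_trials(profit_paths):
    """Return the mean, minimum, and maximum profit path across trials.

    `profit_paths` has shape (n_trials, n_steps), one row per random seed.
    """
    paths = np.asarray(profit_paths, dtype=float)
    return paths.mean(axis=0), paths.min(axis=0), paths.max(axis=0)


if __name__ == "__main__":
    # Two short, made-up profit paths standing in for runs with different seeds.
    mean_path, low, high = summarize_trials([[1.0, 2.0], [3.0, 4.0]])
    print(mean_path, low, high)
```

The mean path corresponds to the plotted DQN profit line, and the `low` and `high` arrays bound the filled region around it.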

4.2. Test Results. Tables and show the average performance measures of each pair tested by applying the top- trained models. We can see that the constant action with the highest returns differs for each pair, and that the TLS method yields a higher profit than the OLS method for all pairs, as shown above. We also find that PTDQN performs better than traditional pairs-trading strategies. The pair with the highest profit using the proposed method is HON and TXN (.); it also shows the biggest difference between the DQN method and the optimal constant action (.). We find that the proposed method has a higher Sharpe ratio in all pairs except for MO and UTX when the

Figure : An example of optimizing PTDQN based on a training scheme using OLS. (Z-score versus year, 1990 to 2008; series: trading signal, trading boundary, stop-loss boundary.)

Figure : An example of PTA based on a training scheme using OLS. (Z-score versus year, 1990 to 2008; series: trading signal, trading boundary, stop-loss boundary.)

Figure : Average top- profits generated by PTDQN and traditional pairs-trading strategies using TLS in training periods. (Profit versus year, 1990 to 2008; series: PTDQN and PTA0 to PTA5.)

Figure : Average top- profits generated by PTDQN and traditional pairs-trading strategies using OLS in training periods. (Profit versus year, 1990 to 2008; series: PTDQN and PTA0 to PTA5.)

TLS method is used. If we add the Sharpe ratio to the total profit as an objective function, we can build a more optimized pairs-trading system. Based on these results, we can confirm the robustness of our proposed method for our dataset. The proposed method can be applied to other pairs of stocks found in other global markets.

In Figure , we can see that our proposed method, PTDQN, outperforms the traditional pairs-trading strategies that have constant actions in the test dataset. The crucial aspect of this method is the selection of the optimal boundary in each spread, namely the boundary of the constant action that makes the highest profit. Therefore, the trend is the same as that of traditional pairs-trading strategies; however, when the optimal boundaries that have the highest profit in each spread are combined, PTDQN is found to have a higher profit than traditional pairs-trading strategies. This method can therefore be applied in various fields when there is a need to optimize the efficiency of a rule-based strategy [, ]. In this study, we consider the spread and the boundaries to be the important factors of a pairs-trading strategy. Therefore, we tried to optimize the pairs-trading strategy with various trading and stop-loss boundaries using deep reinforcement learning, and our method outperforms rule-based strategies. Optimizing key parameters in rule-based methods can thus improve their performance.

Pairs trading uses two stocks that share the same trend. However, this relationship can break down owing to various factors such as economic issues and company risk. In this situation, the spread between the two stocks becomes extremely large. Although this situation cannot be avoided, we hedge this risk by taking a dynamic boundary. In this sense, taking the lowest stop-loss boundary is the best choice since the position can then be closed with the least loss. By taking the dynamic boundary using the deep reinforcement learning method, we can see that not only are profits increased, but losses are also minimized as compared to taking a fixed boundary.

5. Conclusions

We propose a novel approach to optimize the pairs-trading strategy using a deep reinforcement learning method, specifically deep Q-networks. Two key research questions are posed. First, if we set a dynamic boundary based on the spread in each trading window, can it achieve a higher profit than the traditional pairs-trading strategy? Second, can a deep reinforcement learning method be trained to follow this mechanism? To investigate these questions, we collected pairs selected using the cointegration test. We experimented with how the results varied according to the spread and the method used. We therefore set different spreads using the OLS and TLS methods as the input of the DQN and the trading signal. To conduct this experiment, we set up a formation window and a trading window. The hedge ratio, which is an important factor in determining how much stock to take, depends on the window size. We therefore applied the OLS and TLS methods and experimented to find the optimal window size by varying the formation window and the trading window.

Tables and show the average performance values of the formation windows and trading windows in the training dataset. The results show that, for all six window sizes, performance was higher when TLS spreads were used than when OLS spreads were used. In addition, we can see that profitability gradually increases as the formation windows and trading windows of the methods using TLS and OLS decrease. The reason is that although the ratio of closed positions is the lowest for the formation and trading windows we selected, the ratio of stop-loss positions is also the lowest compared with the other formation and trading windows. This means that reducing stop-loss positions is as important as increasing closed positions in making a profit. Using the optimal window size, we then check whether our DQN is properly trained. At each epoch, we find that the average Q-value steadily increased, the ratio of closed portfolios increased, and the ratio of portfolios that reached their stop-loss thresholds decreased, confirming that our DQN is trained well. Based on these results, we find that our proposed model using the test dataset with a formation window of and a trading window of had results that were superior to those of traditional pairs-trading strategies in the out-of-sample dataset. In Figure , we can see that the profit path of PTDQN is similar to those of PTA0 to PTA5, but better than that from

Figure : Average top- profits of PTDQN and PTA0 to PTA5 using TLS with the test dataset. Panels: (a) MSFT/JPM, (b) MSFT/TXN, (c) BRKa/ABT, (d) BRKa/UTX, (e) JPM/T, (f) JPM/HON, (g) JPM/GE, (h) JNJ/WFC, (i) XOM/CVX, (j) HON/TXN, (k) GE/TXN. Each panel plots profit versus year (2009 to 2018).

other methods. This shows that taking dynamic boundaries based on our method is efficient in optimizing the pairs-trading strategy. During periods of economic uncertainty, managing pairs-trading strategies, including our proposed method, can be risky. However, our reward function penalizes cases in which the spread suddenly becomes large, and our network is trained to prevent this situation by taking a lower stop-loss boundary, since it is trained to maximize the expected sum of future rewards. Therefore, our proposed method can reduce the risk when economic risks appear, compared with the traditional pairs-trading strategy with a fixed boundary.

From the experimental results, we show that our method can be applied in the pairs-trading system. It can be applied in various fields, including finance and economics, when there is a need to optimize the efficiency of a rule-based strategy. Furthermore, we find that our method outperforms the traditional pairs-trading strategy in all pairs based on constituent stocks in S&P . The study focused only on spreads formed by two stocks that have long-term equilibrium patterns. Since our method selects optimal boundaries based on spreads, it can be applied to other stock markets such as KOSPI, Nikkei, and Hang Seng, provided that appropriate cointegrated pairs are selected.

In future work, we can develop our proposed model as follows. First, as profit was set as the objective function in this study, the performance of the model is lower than that of traditional pairs trading when judged by other performance measures. It may therefore be possible to create a better-optimized pairs-trading strategy by including these other performance indicators as part of the objective function. Second, we can use other statistical methods such as the Kalman filter and error-correction models to obtain diversified spreads. Finally, it is possible to create a more optimized pairs-trading strategy by making the discrete sets of window sizes and boundaries continuous. We will address these issues in future studies.

Data Availability

The data used to support the findings of this study have been deposited in the figshare repository (DOI: ./m.figshare.).

Disclosure

The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. This work represents a part of the study


Table : Average top- performance results of the proposed method and the traditional pairs-trading strategy in the out-of-sample dataset using TLS.

Pairs   Model   MDD   Sharpe ratio   Profit   No. of open portfolios   No. of closed portfolios   No. of stop-loss portfolios   No. of exited portfolios

MSFT/JPM

PTDQN −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. . .

MSFT/TXN

PTDQN −. . .

PTA −. −. .

PTA −. −. .

PTA −. . .

PTA −. −. .

PTA −. . .

PTA −. . .

BRKa/ABT

PTDQN −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. . .

BRKa/UTX

PTDQN −. . .

PTA −. −. .

PTA −. . .

PTA −. −. .

PTA −. . .

PTA −. . .

PTA −. . .

JPM/T

PTDQN −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. −. .

PTA −. −. .

PTA −. . .

JPM/HON

PTDQN −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. . .


JPM/GE

PTDQN −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. . .

JNJ/WFC

PTDQN −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. . .

XOM/CVX

PTDQN −. . .

PTA −. . .

PTA −. . .

PTA −. −. .

PTA −. −. .

PTA −. −. .

PTA −. . .

HON/TXN

PTDQN −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. . .

GE/TXN

PTDQN −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. −. .

PTA −. −. .

MO/UTX

PTDQN −. . .

PTA −. −. .

PTA −. . .

PTA −. −. .

PTA −. . .

PTA −. . .

PTA −. . .


Table : Average top- performance results of the proposed method and the traditional pairs-trading strategy in the out-of-sample dataset using OLS.

Pairs   Model   MDD   Sharpe ratio   Profit   No. of open portfolios   No. of closed portfolios   No. of stop-loss portfolios   No. of exited portfolios

MSFT/JPM

PTDQN −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. −. .

PTA −. −. .

PTA −. . .

MSFT/TXN

PTDQN −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. −. .

PTA −. . .

PTA −. . .

BRKa/ABT

PTDQN −. . .

PTA −. −. .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. . .

BRKa/UTX

PTDQN −. . .

PTA −. −. .

PTA −. −. .

PTA −. −. .

PTA −. . .

PTA −. . .

PTA −. −. .

JPM/T

PTDQN −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. −. .

JPM/HON

PTDQN −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. −. .

PTA −. −. .


JPM/GE

PTDQN −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. −. .

JNJ/WFC

PTDQN −. . .

PTA −. −. .

PTA −. −. .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. . .

XOM/CVX

PTDQN −. . .

PTA −. −. .

PTA −. . .

PTA −. −. .

PTA −. . .

PTA −. −. .

PTA −. −. .

HON/TXN

PTDQN −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. . .

GE/TXN

PTDQN −. . .

PTA −. . .

PTA −. . .

PTA −. . .

PTA −. −. .

PTA −. −. .

PTA −. . .

MO/UTX

PTDQN −. . .

PTA −. . .

PTA −. −. .

PTA −. −. .

PTA −. . .

PTA −. . .

PTA −. . .


conducted as a Master's thesis in Financial Engineering during and at Ajou University, Republic of Korea.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea Government (MSIT: Ministry of Science and ICT) (No. NRF-RCB).

References

[] E. Gatev, W. N. Goetzmann, and K. G. Rouwenhorst, “Pairs trading: performance of a relative-value arbitrage rule,” Yale ICF Working Paper No. 08-03, , https://ssrn.com/abstract= or http://dx.doi.org/./ssrn..

[] R. J. Elliott, J. van der Hoek, and W. P. Malcolm, “Pairs trading,” Quantitative Finance, vol. , no. , pp. –, .

[] S. Andrade, V. Di Pietro, and M. Seasholes, “Understanding the profitability of pairs trading,” .

[] G. Hong and R. Susmel, “Pairs-trading in the Asian ADR market,” Univ. Houston, Unpubl. Manuscr., .

[] E. Gatev, W. N. Goetzmann, and K. G. Rouwenhorst, “Pairs trading: performance of a relative-value arbitrage rule,” Review of Financial Studies, vol. , no. , pp. –, .

[] B. Do and R. Faff, “Does simple pairs trading still work?” Financial Analysts Journal, vol. , no. , pp. –, .

[] S. Mudchanatongsuk, J. A. Primbs, and W. Wong, “Optimal pairs trading: A stochastic control approach,” in Proceedings of the 2008 American Control Conference, ACC, pp. –, USA, June .

[] A. Tourin and R. Yan, “Dynamic pairs trading using the stochastic control approach,” Journal of Economic Dynamics & Control, vol. , no. , pp. –, .

[] Z. Zeng and C. Lee, “Pairs trading: optimal thresholds and profitability,” Quantitative Finance, vol. , no. , pp. –, .

[] S. Fallahpour, H. Hakimian, K. Taheri, and E. Ramezanifar, “Pairs trading strategy optimization using the reinforcement learning method: a cointegration approach,” Soft Computing, vol. , no. , pp. –, .

[] P. Nath, “High frequency pairs trading with U.S. treasury securities: risks and rewards for hedge funds,” SSRN Electronic Journal, .

[] T. Leung and X. Li, “Optimal mean reversion trading with transaction costs and stop-loss exit,” International Journal of Theoretical and Applied Finance, vol. , no. , .

[] E. Ekström, C. Lindberg, and J. Tysk, “Optimal liquidation of a pairs trade,” in Advanced Mathematical Methods for Finance, pp. –, Springer, Heidelberg, .

[] Y. Lin, M. McCrae, and C. Gulati, “Loss protection in pairs trading through minimum profit bounds: A cointegration approach,” Journal of Applied Mathematics and Decision Sciences, vol. , pp. –, .

[] A. Mikkelsen, “Pairs trading: the case of Norwegian seafood companies,” Applied Economics, vol. , no. , pp. –, .

[] K. Kim, “Performance analysis of pairs trading strategy utilizing high frequency data with an application to KOSPI Equities,” SSRN Electronic Journal, p. , .

[] V. Holý and P. Tomanová, Estimation of Ornstein-Uhlenbeck Process Using Ultra-High-Frequency Data with Application to Intraday Pairs Trading Strategy, .

[] D. Chen, J. Cui, Y. Gao, and L. Wu, “Pairs trading in Chinese commodity futures markets: an adaptive cointegration approach,” Accounting & Finance, vol. , no. , pp. –, .

[] H. Puspaningrum, Y. Lin, and C. M. Gulati, “Finding the optimal pre-set boundaries for pairs trading strategy based on cointegration technique,” Journal of Statistical Theory and Practice, vol. , no. , pp. –, .

[] A. A. Roa, “Pairs trading: optimal threshold strategies,” .

[] V. Mnih, K. Kavukcuoglu, D. Silver et al., “Playing atari with deep reinforcement learning,” https://arxiv.org/abs/., .

[] Y. Wang, D. Wang, S. Zhang, Y. Feng, S. Li, and Q. Zhou, “Deep Q-trading,” , http://cslt.riit.tsinghua.edu.cn/.

[] C.-Y. Huang, “Financial trading as a game: a deep reinforcement learning approach,” , https://arxiv.org/abs/..

[] T. Kim, Optimizing the pairs trading strategy using Deep reinforcement learning [M.S. thesis], Ajou University, Suwon, Republic of Korea, .

[] B. Do, R. Faff, and K. Hamza, “A new approach to modeling and estimation for pairs trading,” in Proceedings of the 2006 Financial Management Association European Conference, .

[] R. D. Dittmar, C. J. Neely, and P. A. Weller, “Is technical analysis in the foreign exchange market profitable? A genetic programming approach,” Journal of Financial and Quantitative Analysis, vol. , p. , .

[] H. Rad, R. K. Low, and R. Faff, “The profitability of pairs trading strategies: distance, cointegration and copula methods,” Quantitative Finance, vol. , no. , pp. –, .

[] S. Johansen, “Statistical analysis of cointegration vectors,” Journal of Economic Dynamics and Control, vol. , no. -, pp. –, .

[] M. H. Kutner, C. J. Nachtsheim, J. Neter, and W. Li, “Applied linear statistical models,” .

[] G. H. Golub and C. F. Van Loan, “An analysis of the total least squares problem,” SIAM Journal on Numerical Analysis, vol. , no. , pp. –, .

[] R. S. Sutton and A. G. Barto, “Introduction to reinforcement learning,” Learning, .

[] E. F. Fama, “Random walks in stock market prices,” Financial Analysts Journal, vol. , no. , pp. –, .

[] W. F. Sharpe, “The sharpe ratio,” The Journal of Portfolio Management, .

[] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, “Deep reinforcement learning that matters,” in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI), .

[] Y. H. Li, X. M. Lu, and N. C. Kar, “Rule-based control strategy with novel parameters optimization using NSGA-II for power-split PHEV operation cost minimization,” IEEE Transactions on Vehicular Technology, vol. , no. , pp. –, .

[] L. Dymova, P. Sevastianov, and K. Kaczmarek, “A stock trading expert system based on the rule-base evidential reasoning using Level Quotes,” Expert Systems with Applications, vol. , no. , pp. –, .
