Developing & Backtesting Systematic Trading Strategies


Abstract

Analysts and portfolio managers face many challenges in developing new systematic trading systems. This paper provides a detailed, repeatable process to aid in evaluating new ideas, developing those ideas into testable hypotheses, measuring results in comparable ways, and avoiding and measuring the ever-present risks of over-fitting.
Brian G. Peterson
updated 14 June 2017
Back-testing. I hate it - it’s just optimizing over history. You never see a bad
back-test. Ever. In any strategy. - Josh
Constraints, Benchmarks, and Objectives
Essentially, all models are wrong, but some are useful. - George Box (1987)
It is important to understand what you are trying to achieve before
you set out to achieve it.
Without a set of clearly defined goals there is great potential to
accept a strategy or backtest that is really incompatible with the
(previously unstated) business objectives and constraints. Worse, it
can lead to adjusting your goal to try to follow the backtest, which
can culminate in all sorts of bad decision making, and also increases
the probability of erroneously accepting an overfitted backtest.
Understanding your business constraints (and advantages)
Think about the constraints, costs, and advantages that you are under
first, before formulating a hypothesis or writing a strategy specification.
Many strategists skip this step. Don’t. It is a waste of time and
energy to design a strategy for a constraint set that is not available to
you. Some examples include:
• execution platform
• access and proximity to markets
• products you can trade
• data availability and management
• business structure
• capital available
• fees/rebates
Choosing a benchmark
The choice of what benchmarks to measure yourself against provides
both opportunities for better understanding your performance and
risks of benchmark chasing and overfitting. Mutual fund managers
have it easy, or hard, because of the “Morningstar” categorization
of styles and benchmarks. Mutual fund managers have benchmarks
clearly defined, and can be easily measured against what are often
exchange traded products or well known indices.
A good suite of benchmarks for your strategy should reflect your
business objectives and constraints, as well as help you measure
performance over time.
What constitutes a good benchmark for a trading strategy?
There are several categories of potential benchmarks:
1. archetypal strategies
In many cases, you can use a well-known archetypal strategy as
a benchmark. For example, a momentum strategy could use an
MA-cross strategy as a benchmark.
If you are comparing two or more candidate indicators or sig-
nal processes, you can also use one as a benchmark for the other
(more on this later).
2. alternative indices
EDHEC, Barclays, HFR, and others offer alternative investment
style indices. One of these may be a good fit as a benchmark for
your strategy. Be aware that they tend to have a downward bias
imposed by averaging, or are even averages of averages. So ‘beat-
ing the benchmark’ in your backtest would be a bare minimum
first step if using one of these as your benchmark. As an example,
many CTAs are trend followers, so a momentum strategy in
commodities may often be fairly compared to a CTA index.
3. custom tracking portfolios
In some ways an extension of (1.) above, you can create your
own benchmark by creating a custom tracking portfolio. As the
most common example, a cap-weighted index is really a strategy
archetype. The tracking portfolio for a capitalization-weighted in-
dex invests in a basket of securities using market capitalization as
the weights for the portfolio. This portfolio is then rebalanced on
the schedule defined in the index specification (typically quarterly
or annually). Components may be added or removed following
some rule at these rebalancing periods, or on some other abnormal
event such as a bankruptcy. Other custom tracking portfolios or
synthetic strategies may also be appropriate for measuring your
strategy against, depending on what edge(s) the strategy hopes to
capture, and from what universe of investible products.
4. market observables
Market observable benchmarks are perhaps the simplest avail-
able benchmarks, and can tie in well to business objectives and to
Some easily observable benchmarks in this category:
• $/day, %/day
• $/contract
• % daily volume
• % open interest
measuring performance against your benchmark
Good benchmarks help to provide insight into the drivers of strategy
Choosing more than one benchmark can help to guard against
targeting the benchmark (a form of overfitting). Identifying specific
limitations of your benchmarks, for example:
lack of goodness of fit,
a different investible universe,
lack of stationarity,
or specific representative features of the benchmark
can also assist in refining the choice of benchmarks and avoiding
taking them too seriously.
No matter what benchmarks you pick, if you are going to share
these benchmarks with outside investors or other members of your
team, you need to be careful that you understand and disclose the
limitations of the benchmark(s) that you’ve chosen.
Failure to disclose known divergences from a benchmark you will
be measured against may look like a limitation rather than a feature,
so be sure to explain why differences should exist.
You will be measured by others against these benchmarks for
performance, style drift, etc. so failure to disclose known limitations
of your benchmark(s) will waste everyone’s time.
Large bodies of research have been built up over time about how
to measure performance against such benchmarks. TrackingError,
SharpeRatio, FactorAnalytics, and more are well represented in R.
Choosing an Objective
When measuring results against objectives, start by making sure the objectives are
correct. - Ben Horowitz (2014)
In strategy creation, it is very dangerous to start without a clear objective.
Market observable benchmarks (see above) may form the core of
the first objectives you use to evaluate strategy ideas.
Your business objective states the types of returns you require
for your capital, your tail risk objectives, the amount of leverage
you intend to or are willing to use, and your drawdown constraints
(which are closely related to the leverage you intend to employ).
Some of the constraints on your business objective may be dictated
by the constraints your business structure has (see above).
For example:
• Leverage constraints generally have a hard limit imposed by the
entity you are using to access the market, whether that is a
broker/dealer, a 106.J membership, a leased seat, a clearing
member, or a customer relationship.
• Drawdown constraints have hard limits dictated by the leverage
you intend to employ: 10:1 leverage imposes a 10% hard drawdown
constraint, 4:1 imposes a 25% drawdown constraint, and so on.
• Often, there will also be certain return objectives below which a
strategy is not worth doing.
Ideally, the business objectives for the strategy will be specified
as ranges, with minimum acceptable, desired target, and maximum
acceptable or plausible targets.
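The leverage arithmetic and the idea of objectives expressed as ranges can be sketched in a few lines. This is an illustrative Python sketch only (the paper itself works in R); `max_drawdown_limit` and `ObjectiveRange` are hypothetical names introduced here, the numeric range is made up, and treating values above the maximum as a warning sign is an assumption about how a "maximum plausible" bound might be used.

```python
from dataclasses import dataclass


def max_drawdown_limit(leverage: float) -> float:
    """Hard drawdown limit implied by leverage: a loss of 1/leverage
    on gross exposure consumes all of the capital supporting it."""
    if leverage <= 0:
        raise ValueError("leverage must be positive")
    return 1.0 / leverage


@dataclass
class ObjectiveRange:
    """A business objective expressed as a range (hypothetical helper)."""
    name: str
    minimum: float   # minimum acceptable value
    target: float    # desired target
    maximum: float   # maximum acceptable or plausible value

    def classify(self, observed: float) -> str:
        if observed < self.minimum:
            return "reject"
        if observed > self.maximum:
            # Assumption: a result above the plausible maximum is flagged
            # for review rather than celebrated.
            return "implausible"
        return "meets target" if observed >= self.target else "acceptable"


print(max_drawdown_limit(10))   # 0.1  (10:1 leverage, 10% drawdown limit)
print(max_drawdown_limit(4))    # 0.25 (4:1 leverage, 25% drawdown limit)

# Illustrative numbers only: annualized return between 8% and 40%.
annual_return = ObjectiveRange("annualized return", 0.08, 0.15, 0.40)
print(annual_return.classify(0.05))   # reject
print(annual_return.classify(0.12))   # acceptable
print(annual_return.classify(0.90))   # implausible
```

Writing the ranges down as data before any backtest is run makes it harder to quietly move the goalposts afterwards.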
Once you have these objectives described, you need to formulate
them so that they may be easily used to programmatically evaluate
your strategy.
(Note: FIXME we need a version of constrained_objective() from PortA in)
Formulate your business objectives in the same form com-
monly used in portfolio optimization:
maximize some measure of return
subject to
minimizing one or more measures of risk
The most commonly used objective of this type is the Sharpe Ratio,
which is most often calculated as mean return over volatility.
Many other modified Sharpe-style ratios also exist:
• Information Ratio: annualized return over annualized risk
• Profit factor: gross profits divided by gross losses
• Sortino Ratio: (annualized return - risk free) over downside volatility
• Calmar Ratio: annualized return over max drawdown
• return over expected tail loss
All of these ratios have the properties that you want to maximize
them in optimization, and that they are roughly comparable across
different trading systems.
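A few of these ratios can be computed directly from a return series. The sketch below is illustrative Python using only the standard library, not the R implementations the paper refers to; the function names, the 252-period annualization, and the per-period `risk_free` argument are assumptions made for the example.

```python
import math
from statistics import mean, stdev


def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualized mean excess return over annualized volatility."""
    excess = [r - risk_free for r in returns]
    return mean(excess) / stdev(excess) * math.sqrt(periods_per_year)


def profit_factor(pnl):
    """Gross profits divided by gross losses."""
    gross_profit = sum(p for p in pnl if p > 0)
    gross_loss = -sum(p for p in pnl if p < 0)
    return math.inf if gross_loss == 0 else gross_profit / gross_loss


def max_drawdown(returns):
    """Largest peak-to-trough decline of the compounded equity curve."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst


def calmar_ratio(returns, periods_per_year=252):
    """Annualized (geometric) return over maximum drawdown."""
    growth = math.prod(1.0 + r for r in returns)
    annualized = growth ** (periods_per_year / len(returns)) - 1.0
    return annualized / max_drawdown(returns)


# Made-up daily returns, for illustration only.
daily = [0.010, -0.020, 0.015, 0.005, -0.010, 0.020]
print("Sharpe:", round(sharpe_ratio(daily), 2))
print("Profit factor:", round(profit_factor(daily), 2))
print("Max drawdown:", round(max_drawdown(daily), 4))
print("Calmar:", round(calmar_ratio(daily), 2))
```

Because each function returns a single number to be maximized, any of them can serve as the objective in a parameter search, subject to the overfitting cautions in the rest of the paper.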
Additional properties of the trading system that you may target
when evaluating strategy performance but which we would not put
into the category of business requirements include:
• correlation to perfect profit
• slope of the equity curve
• linearity of the equity curve
• trade statistics etc.
A strategy conceptualized as a diversifier for an existing suite may
well have different business objectives from a strategy designed to
enter a new market or replicate published research or take advantage
of a transient market opportunity. As such, it is further important to
be aware that business objectives are likely to differ from one strategy
to another even inside the same organization or on the same team.
Understanding your constraints, benchmarks, and objectives prior
to embarking on hypothesis generation or strategy creation and eval-
uation will anchor your goals. Defining these in advance will allow
you to quickly reject ideas that are incompatible with your goals
and requirements. Proceeding without understanding the milieu in
which the proposed strategy will operate risks overfitting and sloppy
analysis, and should be avoided whenever possible.
Hypothesis Driven Development
Far better an approximate answer to the right question, which is often vague, than
an exact answer to the wrong question, which can always be made precise. - John
Tukey (1962) p. 13
Strategy creation is expensive in time (for research) and money (for
implementation into a live trading environment).
To maximize return on that investment, we follow a hypothesis
driven approach that allows us to confirm ideas quickly, reject fail-
ures cheaply, and avoid many of the dangers of overfitting.
Ideas generated via a brainstorming session are a useful starting
point, but they are typically not testable on their own.
creating good (testable) hypotheses
To create a testable idea (a hypothesis), we need to:
• formulate the idea as a declarative conjecture
• make sure the conjecture is predictive
• define the expected outcome of that conjecture
• describe means of verifying (testing) the expected outcome
Many ideas will fail the process at this point. The idea will be in-
triguing or interesting, but not able to be formulated as a conjecture.
Or, once formatted as a conjecture, you won’t be able to formulate
an expected outcome. Often, the conjecture is a statement about the
nature of markets. This may be an interesting, and even verifiable
point, but not be a prediction about market behavior. So a good con-
jecture also makes a prediction, or an observation that amounts to a
prediction. The prediction must specify either a cause, or an observ-
able state that precedes the predicted outcome.
Finally, a good hypothesis must specify its own test(s). It should
describe how the hypothesis will be verified. A non-testable “work-
ing hypothesis” is a starting point for further investigation, but the
goal has to be to specify tests. In some cases, it is sufficient to de-
scribe a test of the prediction. We’ll cover specific tests in depth later
in this document. In statistical terms, we need the hypothesis to have
a testable dependent variable. In other cases it may be necessary to
describe a statistical test to differentiate the hypothesized prediction
from other measurable effects, or from noise, or factor interactions.
Specified tests should identify measurable things that can help iden-
tify the (spurious) appearance of a valid hypothesis.
A good/complete hypothesis statement includes:
• what is being analyzed (the subject)
• the dependent variable(s) (the output/result/prediction)
• the independent variables (inputs into the model)
• the anticipated possible outcomes, including direction or comparison
• how you will validate or refute each hypothesis
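One way to keep a hypothesis statement honest is to record its parts as structured data and check that none is missing. The following is a minimal Python sketch of that idea; the `Hypothesis` class and its field names are hypothetical and simply mirror the checklist above, and the example hypothesis is invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    subject: str                 # what is being analyzed
    dependent_variables: list    # the output / result / prediction
    independent_variables: list  # inputs into the model
    expected_outcomes: list      # anticipated outcomes, with direction
    tests: list = field(default_factory=list)  # how it will be validated or refuted

    def missing_parts(self):
        """Names of the required parts that are still empty."""
        parts = {
            "subject": self.subject,
            "dependent_variables": self.dependent_variables,
            "independent_variables": self.independent_variables,
            "expected_outcomes": self.expected_outcomes,
            "tests": self.tests,
        }
        return [name for name, value in parts.items() if not value]

    def is_testable(self):
        return not self.missing_parts()


# An incomplete statement: no inputs and no test have been specified.
vague = Hypothesis("some strategy idea", ["profit"], [], ["positive"])
print(vague.missing_parts())   # ['independent_variables', 'tests']

# A more complete (invented) statement.
complete = Hypothesis(
    subject="front-month futures contract",
    dependent_variables=["next-day return"],
    independent_variables=["20-day momentum"],
    expected_outcomes=["positive momentum precedes positive return"],
    tests=["sign test on forward returns conditional on the signal"],
)
print(complete.is_testable())  # True
```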
Most strategy ideas will be rejected during hypothesis cre-
ation and testing.
Many ideas, thoughts, and observations about the state of the
markets lack sufficient structure to become a testable hypothesis.
While it is possible that these ideas are indeed true, unless you can
quantify the idea into a (dis)provable conjecture, it can’t be used to
create a quantitative strategy with any degree of confidence. Far from
being a negative thing, this is a good use of your time, and guards
against prematurely moving on with a fundamentally questionable
or just plain erroneous idea. The faster an untestable idea can be set
aside for more verifiable ideas, the better.
A big computer, a complex algorithm and a long time does not equal science. - Robert
Proceeding without a hypothesis risks ruin. Many strategists
will still try to skip robust hypotheses in favor of sloppy ones
such as “I hypothesize that this strategy idea will make money”.
While the value of testing a more rigorous hypothesis should be
clear, it may be more difficult to see the risks imposed by having no
hypothesis or a sloppy or incomplete hypothesis. An incomplete or
otherwise deficient hypothesis at this stage will create a strong desire
to “refine” the hypothesis later by adding new explanatory state-
ments. This is called an “ad hoc hypothesis”, where new hypotheses
are piled on top of old to better “explain” observation (Carroll (2011),
p. 7 or online; see also “rule burden”, below). Engaging in the creation
of ad hoc hypotheses risks ruin, as the in sample (IS) explanatory
power of the model will seem to go up, while the out of sample
(OOS) power goes down. It also invites going over the data multiple
times while the model is “refined”, introducing progressively more
data snooping bias.
Social scientists have coined the term HARKing (Kerr 1998) for
“hypothesizing after the results are known”. This needs to be separated
from true scientific inference, which develops an a priori
hypothesis from examining raw or unmassaged data. Defining a hypothesis
should proceed from data to question to testable conjecture
about a dependent variable. Rejecting a hypothesis is more valuable
than inventing one after the fact (and avoids losing money on a
random or over-fit model).
Defining the strategy
A trading strategy or investment approach is more than just a testable
hypothesis. After creating and confirming a hypothesis, you need to specify
the strategy. The R package quantstrat formalizes the strategy structure
into filters, indicators, signals, and rules.
Filters help to select the instruments to trade.
They may be part of the formulated hypothesis, or they may be
market characteristics that allow the rest of the strategy to trade bet-
ter. In fundamental equity investing, some strategies consist only
of filters. For example, the StarMine package that was bought by
Thomson Reuters defines quantitative stock screens based on technicals
or fundamentals.
(Note: a modern, free alternative may be found at)
Many analysts will expand or shrink their
investible universe based on screens. Lo’s Variance Ratio is another
measure often used as a filter to turn the strategy on or off for par-
ticular instruments (but can also be used as an indicator, since it is
Indicators are quantitative values derived from market data.
Examples of indicators include moving averages, RSI, MACD,
volatility bands, channels, factor models, and theoretical pricing
models. They are calculated in advance on market data (or on other
indicators). Indicators have no knowledge of current positions or
One risk to watch out for when defining your indicators is confus-
ing the indicator with the strategy. The indicator is a description of
reality, a model that describes some aspect of the market, or a theo-
retical price. An indicator, on its own, is not a strategy. The strategy
depends on interactions with the rest of its components to be fully
Signals describe the interaction between filters, market data, and indicators.
Signal processes describe the desire for an action, but the strategy
may choose not to act or may not be actionable at the time. They
include ‘classic’ things like crossovers and thresholds, as well as more
complicated interactions between pricing models, other indicators,
and other signal processes. It is our experience that signals may often
interact with other signals. The final signal is a composite of multiple
things. The combination of intermediate signals will increase the
likelihood of actually placing or modifying orders based on the final signal.
Like indicators, signals are calculated in advance on market data
while doing research or a backtest. In production, they are typically
calculated in a streaming fashion.
As stated above, it is important to separate the desire for action
from the action itself. The signal happens only in its own context.
In very simple strategies, signal and action may be the same, but in
most real strategies the separation serves to make clear what analysis
can be done to support a decision, and what needs to be postponed
until a path-dependent decision is warranted.
Rules make path-dependent actionable decisions.
Rules, in both research/backtesting and in production, are what
take input from market data, indicators, and signals, and actually
take action. Entry, Exits, Risk, Profit Taking, and Portfolio Rebalanc-
ing are all rule processes. Rules are path dependent, meaning that
they are aware of the current market state, the current portfolio, all
working orders, and any other data available at the time the rule is
being evaluated. No action is instantaneous, so rules also have a cost
in time.
There must be some lag between observation (indicator or signal),
decision (rule), and action (order entry or modification).
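In a backtest, the lag between observation and action can be made explicit by shifting the signal before applying it to returns. The sketch below is illustrative Python, not quantstrat code; `apply_signal_with_lag` is a hypothetical helper, and a one-bar delay is assumed as the minimum realistic lag.

```python
def apply_signal_with_lag(signals, returns, lag=1):
    """Strategy returns when a signal observed at bar t can only be
    acted on `lag` bars later (position held during bar t + lag)."""
    if lag < 1:
        raise ValueError("lag must be at least 1 bar to avoid look-ahead")
    if len(signals) != len(returns):
        raise ValueError("signals and returns must have equal length")
    # The position during bar t is the signal observed at bar t - lag;
    # the first `lag` bars are flat because nothing has been observed yet.
    positions = [0] * lag + list(signals[:-lag])
    return [p * r for p, r in zip(positions, returns)]


signals = [1, 1, -1, 0]                # +1 long, -1 short, 0 flat
returns = [0.01, 0.02, -0.03, 0.01]    # per-bar returns, made up

print(apply_signal_with_lag(signals, returns, lag=1))
# [0.0, 0.02, -0.03, -0.01]
```

Multiplying `signals` by `returns` without the shift would credit the strategy with the return of the very bar on which the signal was computed, which is a form of look-ahead.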
strategy specification document
The strategy specification document is a complete description of the
strategy. A draft covering all components of the strategy should be
completed before any code is written. It describes the business con-
text and objectives, the hypothesis and tests for that hypothesis, and
all the components of the strategy itself (filters, indicators, signals
and rules). It also needs to describe details such as initial parameter
choices and why those choices were made, and any data require-
ments (tick, bar, L2, external data, etc.).
The strategy specification needs to be written at a level sufficient
for the research analyst to write code from. Equally important, it
should be written for review by the investment and risk committees.
One of the key reasons for writing the specification before begin-
ning coding, beyond clarity and rigor of thinking, is that it provides
another curb against overfitting. Changes to the specification should
be deliberate, recorded and indicate that something was incomplete
or erroneous in the initial specification. The strategy designer will
learn from success and failure along the development path, and the
outcomes of these incremental experiments should be clearly re-
flected in the strategy specification and its revision history.
Specification changes need to be conscious and recorded.
Loosely adjusting the specification as you go increases the likelihood
of an overfit and invalid test.
Creating the Strategy
Creating the strategy code for backtesting or production is outside
the scope of this document. In R you’ll likely use quantstrat, and
when transitioning to production, you’ll use whatever execution
platforms are supported by your organization or team.
Developing and testing the strategy in stages using the
tools described in the following sections makes the strategy devel-
opment process less expensive, more rigorous, and less subject to
over-fitting. This is in contrast to the process described by many of
the references in this document which focus only on evaluating the
entire strategy at once, and modifying the whole strategy before do-
ing another analytical trial.
Evaluating the Strategy
No matter how beautiful your theory, no matter how clever you are or what your
name is, if it disagrees with experiment, it’s
wrong. - Richard P. Feynman (1965)
How do you evaluate the backtest?
Conceptually, the goal should continue to be to reject or fix failures
quickly, before getting to the point of something that risks feeling too
expensive to throw away. We want to evaluate the backtest against
our hypothesis and business objectives. The earlier in the strategy
framework we can identify and correct problems, the less likely we
are to fall victim to common errors in backtesting:
Look Ahead Biases are introduced when an analysis directly uses
knowledge of future events. They are most commonly found when
using ‘corrected’ or ‘final’ numbers in a backtest rather than the
correct ‘vintage’ of data. Other common examples are using the
mean of the entire series, or using only surviving securities in an
analysis performed after several candidate securities have been de-
listed. (see Baquero, Ter Horst, and Verbeek 2005) Look ahead bias
is relatively easily corrected for in backtests by subsetting the data at
each step so that indicators, signals, and rules only have access to the
data that would have been contemporaneously available. Guarding
against vintage or survivorship problems requires slightly more
work, as the analyst must be aware that the data being used for the
analysis comes in vintages.
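The "mean of the entire series" example of look ahead bias, and its correction by subsetting the data at each step, can be illustrated with a short Python sketch. The function names are hypothetical and the price series is made up; the point is only the contrast between a full-sample statistic and a point-in-time one.

```python
from statistics import mean


def demean_full_sample(prices):
    """BIASED: every point is compared with the mean of the entire
    series, which includes data from the future."""
    m = mean(prices)
    return [p - m for p in prices]


def demean_expanding(prices):
    """Point-in-time: at step t only prices[0..t] are visible."""
    return [p - mean(prices[: t + 1]) for t, p in enumerate(prices)]


prices = [10.0, 11.0, 12.0, 13.0, 14.0]

print(demean_full_sample(prices))  # [-2.0, -1.0, 0.0, 1.0, 2.0]
print(demean_expanding(prices))    # [0.0, 0.5, 1.0, 1.5, 2.0]
```

In the biased version the first observation already "knows" that prices will end higher; in the expanding version each value depends only on data that would have been available at the time.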
Data Mining Bias arises from testing multiple configurations and
parameters of a trading system without carefully controlling for the
introduction of bias. It is typically caused by brute force searching
of the parameter space without a driving hypothesis for why those
parameters would be good candidates. (see Aronson 2006, Chapter 6,
Correcting for data mining bias is covered in Aronson, and con-
sists mostly in designing the tests such that you do not pollute your
data. Walk forward analysis (covered later) is a key component of
this, as are adjustments and measurements of data mining effects
after they occur.
Data snooping is the process by which knowledge of the data set
contaminates your statistical testing (your backtest) because you have
already looked at the data. In an innocuous form, it could occur
simply because you have market knowledge. (see Smith, n.d.) Tukey
called this “uncomfortable science” because of the bias that it can
introduce, but knowledge of how markets work, and what things
have worked in the past is an essential part of strategy development.
What you need to be cautious of is making changes to the strategy
to overcome specific deficiencies you see in testing. These types of
changes, while sometimes unavoidable and even desirable, increase
the probability of overfitting.
Sullivan, Timmermann, and White (1999) and Hansen (2005) pro-
vide tests derived from White’s Reality Check (which is patented in
White 2000), to measure the potential impact of data snooping on the
backtested results.
One deficiency in the strategy literature is that it has been
moving towards utilizing parameter optimization, machine learning,
or statistical classification methods (such as random forests or neural
networks) as a silver bullet for strategy development. While a full
treatment is beyond the scope of this paper, it is worth mentioning
that the model assessment and selection techniques of the statistical
classification and machine learning literature are valid in the
context of evaluating trading strategies (see e.g. Hastie, Tibshirani,
and Friedman 2009, chap. 7 and 8). We advocate extensive use of
model validation tests, tests for effective number of parameters, and
careful use of training and cross validation tests in every part of
strategy evaluation. Of particular utility is the application of these
techniques to each component of the strategy, in turn, rather than or
before testing the entire strategy model. (see, e.g. Kuhn and Johnson
Declaring the objectives, hypotheses, and expected outcome(s)
of the experiment (backtest) before any strategy
code is written or run, and evaluating against those goals on an ongoing
basis, will guard against many of the error types described above
by discarding results that are not in line with the stated hypotheses.
Evaluating Each Component of the Strategy
Maintain alertness in each particular instance of particular ways in which our
knowledge is incomplete. - John Tukey
(1962) p. 14
It is important to evaluate each component of the strategy separately.
If we wish to evaluate whether our hypotheses about the market are
correct, it does not make sense to first build a strategy with many
moving parts and meticulously fit it to the data until after all the
components have been evaluated for their own “goodness of fit”.
The different components of the strategy, from filters, through in-
dicators, signals, and different types of rules, are all trying to express
different parts of the strategy’s hypothesis and business objectives.
Our goal, at every stage, should be to confirm that each individual
component of the strategy is working: adding value, improving the
prediction, validating the hypothesis, etc. before moving on to the
next component.
There are several reasons why it is important to test components
Testing individually guards against overfitting. As de-
scribed in the prior section, one of the largest risks of overfitting
comes from data snooping. Rejecting an indicator, signal process, or
other strategy component as quickly in the process as possible guards
against doing too much work fitting a poorly conceived strategy to
the data.
Tests can be specific to the technique. In many cases, specific
indicators, statistical models, or signal processes will have test meth-
ods that are tuned to that technique. These tests will generally have
better power to detect a specific effect in the data. General tests, such
as p-values or t-tests, may also be valuable, but their interpretation
may vary from technique to technique, or they may be inappropriate
for certain techniques.
It is more efficient. The most expensive thing an analyst has is
time. Building strategies is a long, intensive process. By testing indi-
vidual components you can reject a badly-formed specification. Re-
using components with known positive properties increases chances
of success on a new strategy. In all cases, this is a more efficient use
of time than going all the way through the strategy creation process
only to reject it at the end.
Evaluating Indicators
In many ways, evaluating indicators in a vacuum is harder than
evaluating other parts of the strategy. It is nearly impossible if you
cannot express a theory of action for the indicator.
Why? This is true because the indicator is not just some analytical
output that has no grounding (at least we hope not). A good indica-
tor is describing some measurable aspect of reality: a theoretical “fair
value” price, or the impact of a factor on that price, or turning points
of the series, or slope.
If we have a good conceptualization of the hypothesized properties
of the indicator, we can construct what the signal processing field
calls the ‘symmetric filter’ on historical data. These are often opti-
mized ARMA or Kalman-like processes that filter out all the ‘noise’
to capture some pre-determined features of the series. They’re (usu-
ally) useless for prediction, but are very descriptive of past behavior.
It is also possible to do this empirically if you can describe the
properties in ways that can be readily programmed. For example,
if the indicator is supposed to indicate turning points, and is tuned
to a mean duration of ten periods, then you can look for all turning
points that precede a short term trend centered around ten periods.
The downside of this type of empirical analysis is that it is inherently
a one-off process, developed for a specific indicator.
How is this useful in evaluating an indicator? Once you have con-
structed the symmetric filter, you can evaluate the degree to which
the indicator matches perfect hindsight. Using kernel methods, clus-
tering, mean squared error, or distance measures we can evaluate the
degree of difference between the indicator and the symmetric filter. A
low degree of difference (probably) indicates a good indicator for the
features that the symmetric filter has been tuned to pick up on.
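As a rough illustration of comparing an indicator with a symmetric filter, the Python sketch below uses a centered moving average as a stand-in for the hindsight filter and a trailing moving average as the causal indicator, then measures their difference by mean squared error. This is a deliberately simple substitute for the optimized ARMA or Kalman-like filters mentioned above; the function names and the price series are assumptions made for the example.

```python
def trailing_mean(series, window):
    """Causal indicator: average of the last `window` points
    (None until enough history exists)."""
    out = []
    for t in range(len(series)):
        if t + 1 < window:
            out.append(None)
        else:
            out.append(sum(series[t + 1 - window : t + 1]) / window)
    return out


def centered_mean(series, half_width):
    """Symmetric filter: average of points on both sides of t. It uses
    future data, so it describes the past but cannot be traded."""
    out = []
    for t in range(len(series)):
        if t < half_width or t + half_width >= len(series):
            out.append(None)
        else:
            window = series[t - half_width : t + half_width + 1]
            out.append(sum(window) / len(window))
    return out


def mean_squared_error(a, b):
    """MSE over the points where both series are defined."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    return sum((x - y) ** 2 for x, y in pairs) / len(pairs)


prices = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]   # a steady trend, made up
indicator = trailing_mean(prices, window=3)
hindsight = centered_mean(prices, half_width=1)

print(mean_squared_error(indicator, hindsight))  # 1.0
```

On this trending series the trailing average lags the centered one by exactly one unit at every point, so the error is 1.0; a better indicator for this feature would score closer to zero.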
The danger of this approach comes if you’ve polluted your
data by going over it too many times. Specifically, if you train the
indicator on the entire data set, then of course it has the best possi-
ble fit to the data. Tests for look ahead bias and data snooping bias
can help to detect and guard against that, but the most important
guard has to be the analyst paying attention to what they are doing,
and constructing the indicator function or backtest to not use non-
contemporaneously available data. Ideally, polluted training periods
will be strictly limited and near the beginning of the historical series,
or something like walk forward will be used (sparingly) to choose
parameters so that the parameter choosing process is run over pro-
grammatically chosen subsets. Failure to exercise care here leads
almost inevitably to overfitting (and poor out of sample results).
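Running the parameter choice over programmatically chosen subsets can be sketched as a rolling split of the data in which each test window lies strictly after its training window. The Python generator below is a minimal, hypothetical illustration of that idea; it is not the paper's walk forward procedure, and the window sizes are arbitrary.

```python
def walk_forward_splits(n_obs, train_size, test_size):
    """Yield (train_indices, test_indices) for rolling windows in which
    every test window lies strictly after its training window."""
    start = 0
    while start + train_size + test_size <= n_obs:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size   # roll forward by one test window


for train, test in walk_forward_splits(n_obs=10, train_size=4, test_size=2):
    print(list(train), list(test))
# [0, 1, 2, 3] [4, 5]
# [2, 3, 4, 5] [6, 7]
# [4, 5, 6, 7] [8, 9]
```

Parameters would be chosen on each training window only and then evaluated on the following test window, so no evaluation ever uses data the parameter search has already seen.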
An indicator is, at its core, a measurement used to make a prediction.
This means that the broader literature on statistical predictors
is valid. Many techniques have been developed by statisticians and
other modelers to improve the predictive value of model inputs. See
Kuhn and Johnson (2013) or Hastie, Tibshirani, and Friedman (2009). Input scaling,
de-trending, centering, de-correlating, and many other techniques may
all be applicable. The correct adjustments or transformations will
depend on the nature of the specific indicator.
Luckily, the statistics literature is also full of diagnostics to help
determine which methods to apply, and what their impact is. You do
need to remain cognizant of what you give up in each of these cases
in terms of interpretability, trace-ability, or microstructure of the data.
It is also important to be aware of the structural dangers of bars.
Many indicators are constructed on periodic or “bar” data. Bars are
not a particularly stable analytical unit, and are often sensitive to ex-
act starting or ending time of the bar, or to the methodologies used to
calculate the components (open, high, low, close, volume, etc.) of the
bar. Further, you don’t know the ordering of events, such as whether the high
came before the low. To mitigate these dangers, it is important to test
the robustness of the bar generating process itself, e.g. by varying the
start time of the first bar. We will almost never run complete strategy
tests on bar data, preferring to generate the periodic indicator, and
then apply the signal and rules processes to higher frequency data.
In this way the data used to generate the indicator is just what is re-
quired by the indicator model, with more realistic market data used
for generating signals and for evaluating rules and orders.
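As a minimal sketch of the bar robustness test described above, the Python below rebuilds fixed-length bars from the same price series at every possible start offset and compares a simple indicator across the shifted bar sets. The price series is simulated, and `make_bars` and `last_sma` are hypothetical helpers introduced only for illustration; a real test would use your own tick data and your own indicator.

```python
import random

def make_bars(prices, bar_len, offset=0):
    """Aggregate prices into OHLC bars of bar_len observations each,
    skipping the first `offset` observations to shift the bar boundaries.
    A trailing incomplete bar is dropped."""
    bars = []
    for start in range(offset, len(prices) - bar_len + 1, bar_len):
        chunk = prices[start:start + bar_len]
        bars.append({"open": chunk[0], "high": max(chunk),
                     "low": min(chunk), "close": chunk[-1]})
    return bars

def last_sma(bars, n):
    """Simple moving average of the last n bar closes (toy indicator)."""
    closes = [b["close"] for b in bars[-n:]]
    return sum(closes) / len(closes)

# Simulated random-walk prices (assumption: evenly spaced observations).
random.seed(42)
prices = [100.0]
for _ in range(999):
    prices.append(prices[-1] + random.gauss(0, 0.1))

bar_len = 10
# Same data, same indicator, only the bar start changes.
values = [last_sma(make_bars(prices, bar_len, off), 5)
          for off in range(bar_len)]
spread = max(values) - min(values)
print(f"indicator range across {bar_len} bar offsets: {spread:.4f}")
```

A range that is large relative to the indicator's typical move would suggest that the indicator is sensitive to where the bars happen to start.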
The analysis on indicators described here provides another oppor-
tunity to reject your hypothesis, and go back to the drawing board.
Evaluating Signals
From one or more indicators, you usually proceed to examining the
effectiveness of the signal generating process. One of the challenges
in evaluating the full backtest is that the backtest needs to make
multiple assumptions about fills. We can evaluate signals without
making any assumptions about the execution that accompanies the
desire to express a particular view.
A signal is a directional prediction for a point in time.
The point in time indicated for the signal will generally vary based
on some parameterization of the indicator and/or signal function(s).
These may be plotted using a forward looking boxplot which lines
up returns from the signal forward by using the signal as t0, and
then displaying a boxplot of period returns from t1 through tn. Without any
assumptions about execution, we have constructed a distribution of
forward return expectations from the signal generating process.
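The data behind such a forward looking boxplot can be sketched as follows. This is an illustrative Python example only: the prices are simulated, the moving-average crossing is a placeholder signal, the returns are measured cumulatively from t0 (one possible reading of "period returns"), and `forward_returns` and `summarize` are hypothetical names.

```python
import random
import statistics

def forward_returns(prices, signal_idx, horizon):
    """For each signal time t0, collect the return from t0 to t0+k for
    k = 1..horizon. Returns {k: [returns across all signals]}."""
    out = {k: [] for k in range(1, horizon + 1)}
    for t0 in signal_idx:
        for k in range(1, horizon + 1):
            if t0 + k < len(prices):
                out[k].append(prices[t0 + k] / prices[t0] - 1.0)
    return out

def summarize(returns):
    """Quartiles and mean: the numbers a boxplot would display."""
    q1, med, q3 = statistics.quantiles(returns, n=4)
    return {"n": len(returns), "mean": statistics.mean(returns),
            "q1": q1, "median": med, "q3": q3}

# Simulated prices (assumption) and a toy signal: price crosses above
# its trailing 10-period mean, using only data available at time t.
random.seed(7)
prices = [100.0]
for _ in range(499):
    prices.append(prices[-1] * (1 + random.gauss(0, 0.01)))

signals = []
for t in range(11, len(prices)):
    prev_mean = statistics.mean(prices[t - 11:t - 1])
    curr_mean = statistics.mean(prices[t - 10:t])
    if prices[t - 1] <= prev_mean and prices[t] > curr_mean:
        signals.append(t)

dist = forward_returns(prices, signals, horizon=5)
for k, rets in dist.items():
    if len(rets) >= 2:  # quantiles need at least two observations
        s = summarize(rets)
        print(f"t{k}: n={s['n']} q1={s['q1']:+.4f} "
              f"median={s['median']:+.4f} q3={s['q3']:+.4f}")
```

No execution assumptions enter this calculation: only the signal times and the subsequent prices are used.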
Many analyses can be done with this data. You can take a distribu-
tion of overall return expectations (quantiles, mean, median, standard
deviation, and standard errors of these estimates), generate a condi-
tional return distribution, examine a linear or non-linear fit between
the period expectations, and compare the expectations of multiple
different input parameters. When comparing input parameter expec-
tations, you should see ‘clusters’ of similar positive and/or negative
return expectations in similar or contiguous parameter combinations.
Existence of these clusters indicates what Tomasini and Jaekle (2009)
refer to as a ‘stable region’ for the parameters (see parameter opti-
mization below). A random assortment of positive expectations is a
bad sign, and should lead to reviewing whether your hypotheses and
earlier steps are robust.
The signals generated by the backtest(s) can also be classified
empirically. You can measure statistics such as the number of
entry and exit signals, the distribution of the period between entries
and exits, degree of overlap between entry and exit signals, and so on.
Given the empirical data about the signal process, it is possible to
develop a simulation of statistically similar processes. A simulation
of this sort is described by Aronson (2006) for use in evaluating entire
strategy backtest output (more on this below), but it is also useful at
this earlier stage. These randomly simulated signal processes should
display (random) distributions of returns, which can be used to as-
sess return bias introduced by the historical data. If you parameterize
the statistical properties of the simulation, you can compare these
parameterized simulations to the expectation generated from each
parameterization of the strategy’s signal generating process.
It is important to separate a parameterized and conditional simu-
lation from an unconditional simulation. Many trading system Monte
Carlo tools utilize unconditional sampling. They then erroneously
declare that a good system must be due to luck because its expecta-
tions are far in the right-hand tail of the unconditional distribution.
One of the only correct inferences that you can draw from an
unconditional simulation would be a mean return bias of the historical data.
It would make little sense to simulate a system that scalps every
tick when you are evaluating a system with three signals per week.
It is critical to center the simulation parameters around the statistical
properties of the system you are trying to evaluate. The simulated
signal processes should have nearly indistinguishable frequency
parameters (number of entry/exit signals, holding period, etc.) from the
process you are evaluating, to ensure that you are really looking at
comparable things.
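A conditional simulation of this kind might be sketched as below. This is not Aronson's procedure, only a minimal illustration under stated assumptions: the prices are simulated, the "observed" signal process is a stand-in, and the simulation matches just two frequency parameters (the number of signals and a fixed holding period). A fuller version would also match the spacing and overlap of signals. `trade_returns` and `simulate_matched` are hypothetical names.

```python
import random
import statistics

def trade_returns(prices, entries, hold):
    """Return from each entry time to `hold` periods later."""
    return [prices[t + hold] / prices[t] - 1.0
            for t in entries if t + hold < len(prices)]

def simulate_matched(prices, n_signals, hold, n_sims, rng):
    """Random entry times with the same number of signals and the same
    holding period as the process being evaluated. Returns the mean
    trade return of each simulated signal process."""
    last_entry = len(prices) - hold - 1
    means = []
    for _ in range(n_sims):
        entries = rng.sample(range(last_entry + 1), n_signals)
        means.append(statistics.mean(trade_returns(prices, entries, hold)))
    return means

rng = random.Random(11)
prices = [100.0]
for _ in range(999):
    prices.append(prices[-1] * (1 + rng.gauss(0.0002, 0.01)))

# Stand-in for the strategy's observed signal process (assumption).
hold = 5
observed_entries = list(range(20, 980, 32))  # 30 entries
observed_mean = statistics.mean(trade_returns(prices, observed_entries, hold))

sim_means = simulate_matched(prices, len(observed_entries), hold, 500, rng)
# Share of matched random processes that did no better than the observed one.
rank = sum(m <= observed_mean for m in sim_means) / len(sim_means)
print(f"observed mean {observed_mean:+.5f}, "
      f"simulated mean {statistics.mean(sim_means):+.5f}, rank {rank:.2f}")
```

The mean of the simulated distribution gives a rough estimate of the return bias in the historical data for processes of this frequency.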
Because every signal is a prediction, when analyzing signal processes
we can begin to fully apply the literature on model specification
and testing of predictions. From the simplest available methods
such as mean squared model error or kernel distance from an ideal
process, through extensive evaluation as suggested for Akaike’s
Information Criterion (AIC), Bayesian Information Criterion (BIC),
effective number of parameters, cross validation of Hastie, Tibshirani,
and Friedman (2009), and including time series specific models such
as the data driven “revealed performance” approach of Racine and
Parmeter (2012): all available tools from the forecasting literature
should be considered for evaluating proposed signal processes.
It should be clear that evaluating the signal generating process
offers multiple opportunities to re-evaluate assumptions about the
method of action of the strategy, and to detect information bias or
luck before moving on to (developing and) testing the rest of the strategy.
Evaluating Rules
By the time you get to this stage, you should have experimental con-
firmation that the indicator(s) and signal process(es) provide sta-
tistically significant information about the instruments you are ex-
amining. If not, stop and go back and reexamine your hypotheses.
Assuming that you do have both a theoretical and empirical basis on
which to proceed, it is time to define the strategy’s trading rules.
Much of the work involved in evaluating “technical trading rules”
described in the literature is really an evaluation of signal processes,
described in depth above. Rules should refine the way the strategy
‘listens’ to signals, producing path-dependent actions based on the
current state of the market, your portfolio, and the indicators and
signals. Separate from whether a signal has predictive power or not,
as described above, evaluation of rules is an evaluation of the actions
taken in response to the rule.
entry rules
Most signal processes designed by analysts and described in the lit-
erature correspond to trade entry. Described another way, every entry
rule will likely be tightly coupled to a signal (possibly a composite
signal). If the signal (prediction) has a positive expectation, then the
rule should have potential to make money. It is most likely valuable
with any proposed system to test both aggressive and passive entry
order rules.
If the system makes money in the backtest with a passive entry
on a signal, but loses money with an aggressive entry which crosses
the bid-ask spread, or requires execution within a very short time
frame after the signal, the likelihood that the strategy will work in
production is greatly reduced.
Conversely, if the system is relatively insensitive in the backtest
to the exact entry rule from the signal process, there will likely be a
positive expectation for the entry in production.
Another analysis of entry rules that may be carried out both on
the backtest and in post-trade analysis is to extract the distribution
of duration between entering the order and getting a fill. Differences
between the backtest and production will provide you with infor-
mation to calibrate the backtest expectations. Information from post
trade analysis will provide you with information to calibrate your
execution and microstructure parameters.
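The aggressive versus passive comparison, and the time-to-fill measurement, can be illustrated with a small sketch. Everything here is an assumption made for the example: quotes are `(bid, ask)` tuples on an evenly spaced clock, the passive fill rule is deliberately conservative (the ask must come down to our resting bid), and `aggressive_buy` and `passive_buy` are hypothetical helpers rather than any platform's API.

```python
def aggressive_buy(quotes, t):
    """Cross the spread: buy immediately at the ask at time t."""
    return quotes[t][1]

def passive_buy(quotes, t, max_wait):
    """Rest a limit order at the bid seen at time t. Conservative fill
    assumption: we are filled only if a later ask is at or below our
    limit within max_wait steps. Returns (fill_price, wait) or None if
    the order would have been canceled unfilled."""
    limit = quotes[t][0]
    for wait in range(1, max_wait + 1):
        if t + wait < len(quotes) and quotes[t + wait][1] <= limit:
            return limit, wait
    return None

# Hand-built (bid, ask) quotes for illustration only.
quotes = [(99.9, 100.0), (99.8, 99.9), (99.9, 100.0),
          (100.1, 100.2), (100.2, 100.3)]
signal_time = 0
exit_mid = (quotes[-1][0] + quotes[-1][1]) / 2  # mark the exit at the last mid

aggr_price = aggressive_buy(quotes, signal_time)
print(f"aggressive entry P&L: {exit_mid - aggr_price:+.2f}")

passive = passive_buy(quotes, signal_time, max_wait=3)
if passive is None:
    print("passive entry: never filled, trade missed")
else:
    price, wait = passive
    print(f"passive entry P&L: {exit_mid - price:+.2f} after waiting {wait} step(s)")
```

Collecting the `wait` values over all signals gives the backtest's distribution of time to fill, which can then be compared against the same distribution from production fills.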
You can also analyze how conservative or aggressive your back-
test fill assumptions are by analyzing how many opportunities you
may have had to trade at the order price after entering the order but
before the price changes, or how many shares or contracts traded at
your order price before you would have moved or canceled the order.
exit rules
There are two primary classes of exit rules, signal based and empirical;
evaluation and risks for these two groupings are different. Exit
rules, whether being driven by the same signal process as the entry
rules, or based on empirical evidence from the backtest, are often the
difference between a profitable and an unprofitable strategy.
Signal driven exit rules may follow the same signal process as
the entry rules (e.g. band and midpoints models, or RSI-style over-
bought/oversold models), or they may have their own signal process.
For example, a process that seeks to identify slowing or reversal of
momentum may be different from a process to describe the forma-
tion and direction of a trend. When evaluating signal based exits, the
same process from above for testing aggressive vs. passive order logic
is likely valuable. Additionally, testing trailing order logic after the
signal may “let the winners run” in a statistically significant manner.
Empirical exit rules are usually identified after initial tests of other
types of exits, or after parameter optimization (see below). They
include classic risk stops (see below) and profit targets, as well as
trailing take profits or pullback stops. Empirical profit rules are
usually identified using the outputs of things like Maximum Adverse
Excursion (MAE) / Maximum Favorable Excursion (MFE), for example:
• MFE shows that trades that have advanced x% or ticks are unlikely
  to advance further, so the trade should be taken off
• a path fit is still strongly positive, even though the signal process
  indicates to be on the lookout for an exit opportunity, so a trailing
  take profit may be in order
See more on MAE/MFE below.
risk rules
There are several types of risk rules that may be tested in the back-
test, and the goals of adding them should be derived from your
business objectives. By the time you get to adding risk rules in the
backtest, the strategy should be built on a confirmed positive expec-
tation for the signal process, signal-based entry and exit rules should
have been developed and confirmed, and any empirical profit taking
rules should have been evaluated. There are two additional “risk”
rule types that should commonly be considered for addition to the
strategy: empirical risk stops and business constraints.
Empirical risk stops are generally placed such that they would cut
off the worst losing trades. This is very strategy dependent. You may
be able to get some hints from the signal analysis described above.
Some examples:
• a path dependent fit is strongly negative after putting on the trade
• the return slope is negative after x periods
Or, you may place an empirical risk stop based on what you see
after doing parameter optimization or post trade analysis (see tools
below) to cut off the trade after there is a very low expectation of a
winning trade, for example:
• x ticks or % against the entry price
• in the bottom quantile of MAE
The business constraint example we most often see in practice is
that of a drawdown constraint in a levered portfolio. The drawdown
constraint is a clear business requirement, imposed by the leverage
in the book. It is possible to impose drawdown constraints utilizing
a risk rule that would flatten the position after a certain loss. While
it is common to use catastrophic loss rules in strategies, we have
had better luck controlling portfolio drawdowns via diversified asset
allocation and portfolio optimization than via risk stops.
Bailey and López de Prado (forthcoming, 2014) examine the impact
of drawdown based risk rules, and adjustments to order sizing
based on drawdowns or other changes in account equity. They
develop an autoregressive model for estimating the potential for
drawdowns even in high-Sharpe or short history investments. They
further model the likely time to recover as investment sizes are varied
following the drawdown.
It is also possible to model risk rules such as hedges or option
insurance, but these are best done in asset allocation and money
management as an overlay for the strategy or a portfolio of strategies,
and not in the individual strategy itself. Rules to model orthogonal
portfolio hedges or option insurance significantly complicate the
task of evaluating the impact of the rules designed for the regular
functioning of the strategy.
As with other parts of strategy development and testing, it is again
important to formulate a hypothesis about the effects of exit rules
that you plan to add, and to document their intended effects, and the
business reasons you’ve chosen for adding them. This is especially
important if you are adding rules after parameter optimization. Done
poorly, exit or risk rules added after parameter optimization are just
cherry picking the best returns, while if done properly they may be
used to take advantage of what you’ve learned about the statistical
expectations for the system, and may improve out of sample corre-
spondence to the in-sample periods.
order sizing
We tend to do minimal order sizing in backtests. Most of our back-
tests are run on “one lot” or equivalent portfolios. The reason is that
the backtest is trying to confirm the hypothetical validity of the trad-
ing system, not really trying to optimize execution beyond a certain
point. How much order sizing makes sense in a backtest is depen-
dent on the strategy and your business objectives. Typically, we will
do “leveling” or “pyramiding” studies while backtesting, starting
from our known-profitable business practices.
We will not do “compounding” or “percent of equity” studies, be-
cause our goal in backtesting is to confirm the validity of the positive
expectation and rules embodied in the system, not to project revenue
opportunities. Once we have a strategy model we believe in, and es-
pecially once we have been able to calibrate the model on live trades,
we will tend to drive order sizing from microstructure analysis, post
trade analysis of our execution, and portfolio optimization.
rule burden
Beware of rule burden. Too many rules will make a backtest look
excellent in-sample, and may even work in walk forward analysis,
but are very dangerous in production. One clue that you are over-
fitting by adding too many rules is that you add a rule or rules to
the backtest after running an exhaustive analysis and being disap-
pointed in the results. This is introducing data snooping bias. Some
data snooping is unavoidable, but if you’ve run multiple parameter
optimizations on your training set or (worse) multiple walk forward
analyses, and then added rules after each run, you’re likely introduc-
ing dangerous biases in your output.
Parameter optimization[10]
[10] Every trading system is in some form an optimization. (Tomasini)
Parameter optimization is important for indicators, signals, rules, and
complete trading systems. Care needs to be taken to do this as safely
as possible. That care needs to start with clearly defined objectives, as
has already been stated. The goal of parameter optimization should
be to locate a parameter combination that most closely matches the
hypotheses and objectives. To avoid overfitting, it is also critical to
avoid outliers, looking for stable regions of both in and out of sample
performance. (Tomasini and Jaekle 2009, pp. 49–56)
When a parameter or set of parameters is robust, it will have a few
key properties:
• small parameter changes lead to small changes in P&L and objective
  expectations
• out of sample deterioration is not large, on average (see walk
  forward optimization)
• parameter choices have a sound theoretical or economic basis
• parameter variation should produce correlated differences in multiple
  objectives
For small numbers of parameters, it may be sufficient to graph
parameters against your objective, use quantiles, or examine neigh-
bors around a candidate parameter set. When there are more than
a couple of parameters, or more than one objective, the inference to
find a “stable region” is somewhat more difficult. Quantiles of the
parameter sets lined up against the objectives are still useful, but you
may need to cluster these values to find the best intersection, or ap-
ply formal global optimization solvers with a penalized objective to
locate the best combination of parameters.
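Examining neighbors around a candidate parameter set can be sketched as follows. The grid of objective values is invented for illustration (rows and columns stand for two hypothetical parameters), and `neighborhood_score` is a hypothetical helper. It is one simple way to prefer a plateau over an isolated spike, and is not the method of Tomasini and Jaekle.

```python
def neighborhood_score(grid, i, j):
    """Mean objective over a cell and its adjacent cells (up to 3x3)."""
    vals = [grid[a][b]
            for a in range(max(0, i - 1), min(len(grid), i + 2))
            for b in range(max(0, j - 1), min(len(grid[0]), j + 2))]
    return sum(vals) / len(vals)

# Invented objective values: a plateau in the middle of the grid and a
# single isolated spike in the bottom-right corner.
grid = [
    [0.1, 0.2, 0.1, 0.0, 0.0],
    [0.2, 1.0, 1.1, 0.9, 0.0],
    [0.1, 1.1, 1.2, 1.0, 0.1],
    [0.0, 0.9, 1.0, 0.9, 0.0],
    [0.0, 0.0, 0.1, 0.0, 2.5],
]

cells = [(i, j) for i in range(len(grid)) for j in range(len(grid[0]))]
best_raw = max(cells, key=lambda c: grid[c[0]][c[1]])
best_stable = max(cells, key=lambda c: neighborhood_score(grid, *c))

print("best single cell:", best_raw)
print("best neighborhood:", best_stable)
```

Choosing by the raw maximum picks the spike, whose neighbors are poor; choosing by neighborhood picks the center of the plateau, where small parameter changes lead to small changes in the objective.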
Limiting the degrees of freedom in your parameter optimization
is another way to guard against overfitting and data mining bias.
Most real strategies have only a few influential parameters, out
of many other “fine tuning” parameters. When backtesting, focus
on the major drivers. These major drivers should be known in ad-
vance, from the theory of the strategy. If they are not obvious from
the theory of the strategy, first work on refining the hypothesis so
its premises may be tested. If that is not possible or is still insuffi-
cient, effective parameter testing (Hastie, Tibshirani, and Friedman
2009) can help to identify the major drivers in a small training and
testing set from the beginning of the data. By limiting the degrees
of freedom in this way, you limit the opportunities to cherry pick
lucky combinations, and statistically the required adjustment for data
mining bias is reduced, as are the number of trials (resulting in faster
parameter searches).
Incremental evaluation of rules is another advanced form of
parameter optimization usually paired with more advanced global
optimization solvers or machine learning. Incremental rule evalu-
ation runs the strategy with different combinations of rules. In a
normal evaluation of this type, rules are separated by type (entry,
exit, risk, rebalance, and so on), and various combinations are tried,
starting with minimal enter/exit pairs, and adding incrementally to
complexity. These types of trials can be very informative as to what
adds value to the strategy.
If not carefully controlled, they can also result in significant out of
sample deterioration (see walk forward analysis).
Choosing the best parameter set should be a function of your
business objectives. In the simplest case, your selection algorithm
should maximize or minimize your business objective. In more com-
plex cases, you will likely have multiple business objectives, and will
utilize a penalized multi-objective optimization to choose the best
combination. In an ideal case, you will also have a view on slope and
linearity of the equity curve, described by something like Kestner’s
K-Ratio (2003), which is really the t-test on the linear model, or by
other descriptive statistics for goodness of fit or model choice (see
Ripley 2004 for an overview).
Walk Forward Analysis
A logical extension of parameter optimization, walk forward analysis
utilizes a rolling or expanding window as ‘in-sample’ (IS) for param-
eterization and another period as ‘out of sample’ (OOS) using the
chosen parameters.
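The window arithmetic for rolling and expanding (anchored) walk forward can be sketched in a few lines. The function name `walk_forward_windows` and the half-open index convention are assumptions made for this illustration; window lengths are in observations.

```python
def walk_forward_windows(n_obs, is_len, oos_len, anchored=False):
    """Return a list of (is_start, is_end, oos_start, oos_end) index
    ranges, half-open like Python slices. Rolling windows keep a fixed
    in-sample length; anchored windows always start at observation 0.
    Out of sample periods do not overlap."""
    windows = []
    is_end = is_len
    while is_end + oos_len <= n_obs:
        is_start = 0 if anchored else is_end - is_len
        windows.append((is_start, is_end, is_end, is_end + oos_len))
        is_end += oos_len
    return windows

for w in walk_forward_windows(10, is_len=4, oos_len=2):
    print("rolling ", w)
for w in walk_forward_windows(10, is_len=4, oos_len=2, anchored=True):
    print("anchored", w)
```

Parameters would be chosen on each in-sample slice and then applied, unchanged, to the out of sample slice that follows it.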
Most strategies do not have stable parameters through
time. The world’s top investors, in all styles and frequencies, ad-
just their expectations, outlook, and approach over time as market
conditions change. Warren Buffett or Ray Dalio say that they follow
the same “strategy” over decades, but they still adapt to changing conditions.
Walk forward analysis allows the parameterization of the strategy
to change over time as market conditions change.
Walk forward analysis guards against data mining bias.
At first, this may be counter-intuitive, as walk forward analysis is a
data mining process.
Barring regime shift (more on this later), the strategist generally
assumes local stationarity in market conditions. If we are conducting
parameter optimization in alignment with our previously stated busi-
ness objectives (supplemented by statistical measures of goodness
of fit), then utilizing walk forward analysis describes the strategy
parameterization that would have been most appropriate for the
business at a point in time. If you are truly in a region of local sta-
tionarity in market characteristics, then you are choosing optimal
parameters for your out of sample period without looking ahead and
incurring bias.
Another guard against data snooping bias is to have your walk
forward periods be realistic. If the strategy is hard to parameterize,
or uses hard to get or periodic data, then your walk forward analysis
should reflect these difficulties.
For example, you should make sure that you:
• Choose between anchored walk forward or rolling walk forward
  based on whether the theoretical properties of your indicator(s)
  would benefit from that.
• Only reparameterize the free parameters for the whole strategy
  on a frequency that makes sense for your strategy and business.
• Indicators that adapt to changes in market data automatically
  are probably safer than doing frequent parameter searches of the
  recent past.
• Don’t vary the length of in and out of sample periods across multiple
  tests, unless you are using such variations to model out of
  sample deterioration.
• If you use macroeconomic data released periodically (monthly or
  quarterly), your in and out of sample periods should reflect when
  information arrives to change how the strategy operates.
• Understand whether windowing effects would strongly influence
  the results. If windowing effects are present, consider an anchored
  walk forward.
In addition to the choosing mechanism used for parameter anal-
ysis, you may additionally evaluate turnover and correspondence
between in and out of sample performance. On turnover, you may
choose to stay with the parameter set you are already running, or the
set with the lowest turnover, if the performance of this zero or low
turnover parameter set is within the error bounds of the estimation.
If you are utilizing a population-based optimizer, it makes sense to
evaluate the top performers from the in-sample period in the out of
sample period, and utilize the ones with the least drift as part of your
starting population for your next (non-overlapping) walk forward period.
Walk forward analysis should be used sparingly. One risk of
using walk forward analysis is that you can introduce data snooping
biases if you apply it multiple times with differing goals, looking for
the best outcomes.[11] As with objective and hypothesis generation,
you should be clear about the definition of success before performing
the test and polluting your data set with prior knowledge.
[11] See the cross validation section, below, for additional cautionary
notes about correct use of walk forward analysis.
Regime Analysis
Parameter optimization and walk forward analysis as-
sume local stationarity.
How do you know if this is a reasonable assumption?
Regime models attempt to provide this information. Many fi-
nancial time series exhibit distinct states, with specific properties
differing sharply between these different ‘regimes’. There are two primary
classes of models that are used for detecting regimes: Markov
switching models and change point models. In the context of strat-
egy development, it can make sense to both test for regime behaviors
and develop a plan for managing regime change. If your data has
well-defined regimes, and you can model these, you may find that
very different parameter sets are appropriate for the strategy in those
regimes, or that no parameter set produces reasonable results out
of sample in particularly challenging regimes, or that no change is
necessary. In any case, once you have a strategy model you believe in,
doing some regime analysis may provide some additional marginal benefit.
Evaluating the whole system[12]
[12] Net profit as a sole evaluation method ignores many of the
characteristics important to this decision. - Robert Pardo (2008)
Pardo says (pp. 202–209):
Some of the key characteristics of a robust trading strategy are:
1. A relatively even distribution of trades
2. A relatively even distribution of trading profit
3. Relative balance between long and short profit
4. A large group of contiguous, profitable strategy parameters in
the optimization
5. Acceptable trading performance in a wide range of markets
6. Acceptable risk
7. Relatively stable winning and losing runs
8. A large and statistically valid number of trades
9. A positive performance trajectory
Evaluating Trades
Entire books have been written extolling the virtues or lamenting the
problems of one performance measure over another. We have chosen
to take a rather inclusive approach both to trade and P&L based
measures and to return based measures (covered later). Generally,
we run as many metrics as we can, and look for consistently good
metrics across all common return and cash based measures. Trade
and P&L based measures have an advantage of being precise and
reconcilable to clearing statements, but disadvantages of not being
easily comparable between products, after compounding, etc.
What is a trade, anyway?
The word “trade” has different meanings in different parts and
types of strategy analysis. Certainly a single transaction is a “trade”.
When speaking of “trades” in “trade statistics” you usually mean a
pair or more of transactions that both open and close a position.
There are several ways of calculating the round trip “trades”. The
most common are FIFO, tax lots, flat to flat, and flat to reduced.
1. FIFO
FIFO is “first in, first out”, and pairs entry and exit transactions
by time priority. We generally do not calculate statistics on FIFO
because it is impossible to match P&L to clearing statements; very
few institutional investors will track to FIFO. FIFO comes from
accounting for physical inventory, where old (first) inventory is
accounted for in the first sales of that inventory. It can be very
difficult to calculate a cost basis if the quantities of your orders
vary, or you get a large number of partial fills, as any given closing
fill may be an amalgam of multiple opening fills.
2. tax lots
Tax lot “trades” pair individual entry and exit transactions to-
gether to gain some tax advantage such as avoiding short term
gains, harvesting losses, lowering the realized gain, or shifting
the realized gain or loss to another tax period or tax jurisdiction.
This type of analysis can be very beneficial, though it will be very
dependent on the goals of the tax lot construction.
3. flat to flat
Flat to flat “trade” analysis marks the beginning of the trade from
the first transaction to move the position off zero, and marks the
end of the “trade” with the transaction that brings the position
back to zero, or “flat”. It will match brokerage statements of realized
P&L when the position is flat and average cost of open
positions always, so it is easy to reconcile production trades using
this methodology. One advantage of the flat to flat methodology is
that there are no overlapping “trades” so doing things like boot-
strap, jackknife, or Monte Carlo analysis on those trades is not
terribly difficult. The challenge of using flat to flat is that it will
not work well for strategies that are rarely flat; if a strategy adjusts
its position up or down, levels into trades, etc. then getting useful
aggregate statistics from flat to flat analysis may be difficult.
4. flat to reduced
The flat to reduced methodology marks the beginning of the
“trade” from the first transaction to increase the position, and
ends a “trade” when the position is reduced at all, going closer
to zero. Brokerage accounting practices match this methodology
exactly: they will adjust the average cost of the open position as
positions get larger or further from flat, and will realize gains
whenever the position gets smaller (closer to flat). Flat to reduced
is our preferred methodology most of the time. This methodology
works poorly for bootstrap or other trade resampling methodolo-
gies, as some of the aggregate statistics are highly intercorrelated
because multiple “trades” share the same entry point. You need to
be aware of the skew in some statistics that may be created by this
methodology and either correct for it or not use those statistics.
Specific problems will depend on the strategy and trading pattern.
The flat to reduced method is a good default choice for “trade”
definition because it is the easiest to reconcile to statements, other
execution and analysis software, and accounting and regulatory
5. increased to reduced
An analytically superior alternative to FIFO is “increased to re-
duced”. This approach marks the beginning of a “trade” any time
the position increases (gets further from flat/zero) and marks the
end of a trade when the position is decreased (gets smaller in ab-
solute terms, closer to flat), utilizing average cost for the cost basis.
Typically, flat to flat periods will be extracted, and then broken
into pieces matching each reduction to an increase, in expanding
order from the first reduction. This is “time and quantity” priority,
and is analytically more repeatable and manageable than FIFO. If
you have a reason for utilizing FIFO-like analysis, consider using
“increased to reduced” instead. This methodology is sometimes
called “average cost FIFO”, and there is an average cost LIFO
variant as well. In contrast to traditional FIFO, realized P&L on a
per-period basis will match the brokerage statements using this
methodology, as realized P&L is taken from the same average cost
basis. Be aware that the round-turn “trade” statistics may not be
fully reconcilable to a brokerage statement, as the separation into
round-turn “trades” is purely an analytical methodology.
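To make the difference between two of these definitions concrete, the sketch below splits the same transactions into flat to flat and flat to reduced "trades". It is a simplified illustration under stated assumptions: transactions are `(signed_quantity, price)` pairs in time order, the contract multiplier is 1, there are no fees, and no single transaction crosses through zero (such a fill would need to be split first). Both function names are hypothetical.

```python
def flat_to_flat(txns):
    """One 'trade' per excursion away from zero and back to zero.
    Returns the realized P&L of each round trip."""
    trades, pos, cash = [], 0, 0.0
    for qty, price in txns:
        pos += qty
        cash -= qty * price      # buys spend cash, sells receive it
        if pos == 0:             # back to flat: close out the trade
            trades.append(cash)
            cash = 0.0
    return trades

def flat_to_reduced(txns):
    """One 'trade' every time the position is reduced, with P&L
    realized against the average cost of the open position."""
    trades, pos, avg = [], 0, 0.0
    for qty, price in txns:
        if pos == 0 or (pos > 0) == (qty > 0):
            # Increasing (or opening): update the average cost.
            avg = (avg * abs(pos) + price * abs(qty)) / (abs(pos) + abs(qty))
        else:
            # Reducing: realize P&L on the closed quantity.
            side = 1 if pos > 0 else -1
            trades.append(side * abs(qty) * (price - avg))
        pos += qty
    return trades

# Buy 2 at 100, then scale out at 102 and 104.
txns = [(2, 100.0), (-1, 102.0), (-1, 104.0)]
print("flat to flat:   ", flat_to_flat(txns))
print("flat to reduced:", flat_to_reduced(txns))
```

The total realized P&L is the same under both definitions; only the number of "trades" and their individual P&L differ, which is why per-trade statistics depend on the definition chosen.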
Aggregate trade statistics are calculated on the entire test period.
Basically all modern backtesting and execution platforms can
calculate and report aggregated trade statistics for the entire test
period. Typically, you will be looking at these statistics to confirm
that the performance of the strategy is as expected for its class, and
in-line with the expectations you established from the components of
the strategy.
Common trade statistics:
• number of transactions and “trades” performed
• gross and net trading profits/losses (P&L)
• mean/median trading P&L per trade
• standard deviation of trade P&L
• largest winning/losing trade
• percent of positive/negative trades
• Profit Factor: absolute value ratio of gross profits over gross losses
• mean/median P&L of profitable/losing trades
• annualized Sharpe-like ratio
• max drawdown
• start-trade drawdown (Fitschen 2013, 185)
• win/loss ratios of winning over losing trade P&L (total/mean/median)
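Several of these statistics can be computed directly from a list of per-trade P&L values. The sketch below is a minimal illustration with invented numbers; `trade_stats` is a hypothetical helper, and the drawdown here is measured on cumulative closed-trade P&L only, not on marked-to-market equity.

```python
import math
import statistics

def trade_stats(pnl):
    """Aggregate statistics from a list of per-trade P&L values."""
    wins = [p for p in pnl if p > 0]
    losses = [p for p in pnl if p < 0]
    gross_profit, gross_loss = sum(wins), sum(losses)

    # Max drawdown of cumulative trade P&L (a non-positive number).
    equity, peak, max_dd = 0.0, 0.0, 0.0
    for p in pnl:
        equity += p
        peak = max(peak, equity)
        max_dd = min(max_dd, equity - peak)

    return {
        "n_trades": len(pnl),
        "net_pnl": sum(pnl),
        "mean": statistics.mean(pnl),
        "median": statistics.median(pnl),
        "stdev": statistics.stdev(pnl),
        "largest_win": max(pnl),
        "largest_loss": min(pnl),
        "pct_positive": len(wins) / len(pnl),
        "profit_factor": (abs(gross_profit / gross_loss)
                          if gross_loss != 0 else math.inf),
        "max_drawdown": max_dd,
    }

pnl = [5.0, -2.0, 3.0, -4.0, 6.0]  # invented per-trade P&L
for name, value in trade_stats(pnl).items():
    print(f"{name:>14}: {value}")
```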
Some authors advocate testing the strategy against “perfect profit”,
the potential profit of buying every upswing and selling every down-
swing. Such a “perfect profit” number is meaningless in almost all
circumstances, as it relies on a number of assumptions of periodic-
ity, execution technology, etc. that are certain to be false. While it
would be possible to create an “ideal profit” metric for a given strat-
egy which would have similar stochastic entry/exit properties to the
strategy being evaluated, our assertion is that this type of simulation
is more appropriately situated with analysis of the signal process, as
described above. It is certainly possible to extend the signal analysis
as described above to a given fixed set of rules, but we believe this is
an invitation to overfitting, and prefer to only perform that kind of
speculative analysis inside the structure of a defined experimental de-
sign such as parameter optimization, walk forward analysis, or k-fold
cross validation on strategy implementations. Leave the simulated
data for much earlier in the process, when confirming the power of the
strategy components.
The goal of many trading strategies is to produce a smoothly
upward sloping equity curve. To compare the output of your strategy
to an idealized representation, most methodologies utilize linear
models. The linear fit of the equity curve may be tested for slope,
goodness of fit, and confidence bounds. Kestner (2003) proposes
the K-Ratio, which is based on the t-statistic of the slope of the linear model. Pardo
(2008, 202–9) emphasizes the number of trades, and the slope of the
resulting curve. You can extend the analysis beyond the simple linear
model to generalized linear models or generalized additive models,
which can provide more robust fits that do not vary as much with
small changes to the inputs.
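As a minimal sketch of the linear fit described above, the following regresses the equity curve on the observation index by ordinary least squares and reports the slope, the goodness of fit, and the slope's t-statistic (from which confidence bounds follow). It assumes evenly spaced observations; the function name `equity_curve_fit` is illustrative only.

```python
import math

def equity_curve_fit(equity):
    """OLS fit of equity[i] = a + b*i; returns slope, intercept, R^2, slope t-stat."""
    n = len(equity)
    if n < 3:
        raise ValueError("need at least three points")
    x_bar = (n - 1) / 2
    y_bar = sum(equity) / n
    sxx = sum((i - x_bar) ** 2 for i in range(n))
    sxy = sum((i - x_bar) * (y - y_bar) for i, y in enumerate(equity))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual and total sums of squares
    sse = sum((y - (intercept + slope * i)) ** 2 for i, y in enumerate(equity))
    sst = sum((y - y_bar) ** 2 for y in equity)
    r_squared = 1 - sse / sst if sst > 0 else float("nan")
    # Standard error of the slope and the corresponding t-statistic
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    t_stat = slope / se_slope if se_slope > 0 else float("inf")
    return {"slope": slope, "intercept": intercept,
            "r_squared": r_squared, "t_stat": t_stat}

fit = equity_curve_fit([100, 102, 101, 105, 107, 106, 110])
```

Equity curve observations are serially dependent, so the t-statistic here is best read as a relative smoothness measure across strategies rather than as a formal significance test.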
Dangers of aggregate statistics . . .
hiding the most common outcomes
focusing on extremes
not enough trades or history for validity
colinearities of flat to reduced
Per-Trade statistics can provide insight for fine-tuning.
Analyzing individual trades can give you insight into how the
strategy performs on a very fine-grained level.
Especially for path dependent behaviors such as:
developing a distribution of trade performances and volatilities,
validating the signal expectations, which were inherently forward looking,
quick dips or run-ups immediately after the trade is put on,
frequent drawdowns from peak before taking the trade off,
positive or negative excursions while the trade is on.
Analyzing excursions can provide empirical support from
the backtest for setting risk stops or profit taking levels.
Maximum Adverse Excursion (MAE) or Maximum Favorable Ex-
cursion (MFE) show how far down (or up) every trade went during
the course of its life-cycle. You can capture information on how many
trades close near their highs or lows, as well as evaluating points
at which the P&L for the trade statistically just isn’t going to get any
better, or isn’t going to recover. While not useful for all strategies,
many strategies will have clear patterns that may be incorporated
into risk or profit rules.
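A minimal sketch of the excursion calculation for a single trade follows, assuming you have the entry price and the sequence of marks observed while the trade was open. Excursions are expressed in price units per unit of position; the function name `excursions` and the `side` convention are illustrative.

```python
def excursions(entry_price, prices, side=1):
    """Return (MAE, MFE) in price units for one trade.

    side is +1 for a long trade and -1 for a short trade; prices are the
    marks observed while the trade is open (must be non-empty).
    """
    open_pnl = [side * (p - entry_price) for p in prices]
    mae = min(0.0, min(open_pnl))  # worst open P&L per unit, never above zero
    mfe = max(0.0, max(open_pnl))  # best open P&L per unit, never below zero
    return mae, mfe

long_mae, long_mfe = excursions(100.0, [99.0, 98.5, 101.0, 103.0, 102.0])
short_mae, short_mfe = excursions(100.0, [99.0, 98.5, 101.0, 103.0, 102.0], side=-1)
```

Collecting these pairs across all trades, together with each trade's final P&L, gives the scatter from which candidate stop or profit-taking levels can be read.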
You can also separate the individual trades into quantiles (or other
slices) by any statistic that you have available. The different distribu-
tions, paths, and properties of these quantiles will frequently provide
insight into methods of action in the strategy, and can lead to further
strategy development.
It is important when evaluating MAE/MFE to do this type of anal-
ysis in your test set. One thing that you want to test out of sample is
whether the MAE threshold is stable over time. You want to avoid, as
with other parts of the strategy, going over and “snooping” the data
for the entire test period, or all your target instruments.
Post Trade Analysis
Post trade analysis offers an opportunity to calibrate the things you
learned in the backtest, and generate more hypotheses for improving
the strategy. Analyzing fills may proceed using all the tools described
earlier in this document. Additionally, you now have enough data
with which to model slippage from the model prices, as well as any
slippage (positive or negative) from the other backtest statistics.
One immediate benefit for post trade analysis is that you already
have all the tools to evaluate performance; they have been applied to
every component of the strategy. Tests and output of all the analyses
are already specified, and may be compared directly.
Some analysis is unique to post trade analysis. Direct reconciliation
between the backtest and production requires that both the
theoretical and production output have all the same information
available for modeled prices, orders, and execution reports. Once in
the same format, it is possible to analyze variances in:
modeled theoretical price (these should match in most cases)
order timing and prices
P&L
evolution of the position, including partial fill analysis
Backtests tend to model complete fills, and use deliberately conser-
vative fill assumptions. One outcome of this is that many strategies
can receive more or earlier fills in production. If you’ve modeled
your signal expectations, these may be run over the production pe-
riod as well, and you can develop a model for when and how com-
pletely fills occur in production after the signals and rules of the
model would have wanted to be filled. It is sometimes also reasonable
to make a limited adjustment to the (overly) conservative fill assumptions
of the backtest so that they more closely match reality. You can also model
slippages at this point, in price, quantity, and time.
Our practice is to impose haircuts on the research model based
on production results. All negative results, whether the fault of the
model or of execution, will be used to lower the backtest expectations
on future trials. All positive production results better than the model
will be appropriately celebrated, but will not be used to ‘upward
adjust’ the model.
Reconciliation of the differences will often lead to improvements in
the strategy, or its production execution.
Microstructure Analysis
Microstructure can not be easily tested in a backtest.
The nature of backtesting is that you want to try many combinations
As such, the data you use for a backtest is almost always at least
minimally reduced from full L2 market data. If you have a desire or
a need to formulate an opinion about how the strategy will perform
given market microstructure, this will be very hard to simulate, even
from full L2 data. While simulators exist for L2 data that are quite
accurate, most of those are stochastic, very computationally intensive,
and need to be run hundreds or thousands of times with minor
variations in assumptions and timing.
Some things can be known from the L2 data without a simulator:
distribution or composition of the book
likelihood that the market will turn to a new price
total shown size for a given level at a certain point in time
To refine expectations of execution, you have to gather real trade data.
By doing post trade analysis (see above) you can line up orders
and fills with the L1 or L2 book data, dealing with things like jitter
and skew in the timestamps. You can then draw inferences about
likelihoods of getting filled, market impact, how your MAE/MFE
and other trade statistics relate to the backtest.
When much of the overall net performance of the strategy depends
on the execution, the only way to refine that expectation is to
gather real data on the target system, or to use an informative prior
constructed from a similar system that was or is actually traded in
production. When you know that the signal process has a positive
expectation, and a reasonable backtest also shows a positive expec-
tation, one of the best ways to adjust the theoretical expectation is to
calibrate it against the output of post trade analysis.
Evaluating Returns
Returns create a standard mechanism for comparing multiple strate-
gies or managers using all the tools of portfolio and risk analysis
that have been developed for investment and portfolio management.
Returns based analysis is most valuable at daily or lower periodicities
(often requiring aggregating intraday transaction performance to
daily returns), is easily comparable across assets or strategies, and
has many analytical techniques not available for trades and P&L. Returns
are not easily reconcilable back to the trades because of
the transformation done to get to returns, changes in scale over time,
and possible aggregation to lower or regular frequencies different
from the trading frequency.
Choice of the denominator matters. In a retail account, capital
is a specific number and is usually not in question. Accounting for
additions and withdrawals is straightforward. In a back test or in an
institutional account, the choice of denominator is less clear. Several
options exist, any or all of which may be reasonable. The first is to
use a fixed denominator. This has the advantage of being simple,
and assumes that gains are not reinvested. Another related option
is to use a fixed starting capital. This option is as simple as the fixed
denominator, but assumes that gains are reinvested into the strategy.
A third option is to look at return on exchange margin. Return on
margin has the challenge of knowing what the required margin was,
which requires access to things like a SPAN engine, or to historical
portfolio reports from live trading of the strategy.
Should we use cash or percent returns? If the strategy uti-
lizes the same order sizing in contract or cash terms throughout
the tested period, then cash P&L should work fine. If instead the
“money management” of the strategy reinvests gains or changes or-
der sizes based on account equity, then you will need to resample
percent returns, or your results will not be unbiased. The cash P&L
would exhibit a (presumably upward) bias from the portfolio growth
in the backtest portfolio over the course of the test if the strategy is
reinvesting. Conversely, if the strategy does not change order sizes
during the test (e.g. fixed 1-lot sizing), then using percent returns
versus account equity will show a downward bias in returns, as ac-
count equity grows, but order sizes remain constant. In this case, you
may choose to use a fixed/constant denominator to get to simple
“returns”, which would be the equivalent of saying that any cash
generated by the strategy was withdrawn.
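The effect of the denominator choice can be seen in a small sketch. It assumes a strategy with fixed order sizing that earns the same cash P&L each day, and compares returns on a constant denominator with returns on growing account equity; both function names and the capital figure are illustrative.

```python
def returns_fixed_denominator(daily_pnl, capital):
    """Percent returns against a constant denominator (gains treated as withdrawn)."""
    return [p / capital for p in daily_pnl]

def returns_on_running_equity(daily_pnl, starting_capital):
    """Percent returns against start-of-day equity (gains left in the account)."""
    equity, out = starting_capital, []
    for p in daily_pnl:
        out.append(p / equity)
        equity += p
    return out

pnl = [1000.0] * 5  # fixed 1-lot style P&L: the same cash gain each day
fixed = returns_fixed_denominator(pnl, 100_000.0)
running = returns_on_running_equity(pnl, 100_000.0)
```

Here `fixed` is 1% every day, while `running` declines from 1% as equity grows even though the strategy's cash performance is unchanged, which is the downward bias described above.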
Many analytical techniques presume percent return-based analysis
rather than analysis on cash P&L. Some examples include:
tail risk measures
volatility analysis
factor analysis
factor model Monte Carlo
style analysis
applicability to asset allocation (see below)
Comparing strategies in return space can also be a good
reason to use percent returns rather than cash. When comparing
strategies, working in return-space may allow for disparate strategies
to be placed on a similar footing. Things like risk measures often
make more sense when described against their percent impact on
capital, for example.
While it may be tempting to do all of the analysis of a trading
strategy in only cash P&L, or only in return space, it is valuable to
analyze almost every strategy in both ways, as the two approaches
provide different insight.
Rebalancing and asset allocation
Once the strategy is basically complete, and preferably in production,
then you can work at marginal improvement to return and risk via
rebalancing and asset allocation. We prefer to do this analysis on real
trade data (see Post Trade Analysis) rather than backtest data, when
Asset allocation should support your business objectives for return
and risk. This may be a good time to revisit the business objective
and make sure the business objective is suitable for optimization. If
the business objectives are formulated as described above in the sec-
tion on business objectives, this should require no additional work.
The Kelly Criterion, and the closely related optimal f and Leverage
Space Portfolio Model (LSPM) of Vince (2009) seek to choose an opti-
mal bet size given uncertainty of the outcome of any single bet. Kelly
and optimal f come out of betting, where the goal is to determine the
optimal bet size given some edge and a certain capital pool. LSPM
extends this model to a portfolio allocation context given a joint
distribution of drawdowns. All three of these models seek to max-
imize final wealth, the first two without regard to path, and LSPM
while limiting the joint probability of some drawdown to a business-
acceptable threshold. The most difficult challenge in utilizing LSPM
is in estimating the joint probability and magnitude of drawdowns,
making the method increasingly difficult for larger portfolios of trad-
ing strategies.
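For intuition about what "optimal bet size given some edge" means, here is a minimal sketch of the textbook Kelly fraction for a simple binary bet. This is the single-bet betting case only, not optimal f or LSPM, and it assumes a known win probability and payoff ratio, which are never known exactly for a trading strategy. The function names are illustrative.

```python
import math

def kelly_fraction(p_win, payoff_ratio):
    """Kelly fraction for a binary bet.

    With probability p_win you win payoff_ratio per unit staked,
    otherwise you lose the stake: f* = p - (1 - p) / b.
    """
    return p_win - (1.0 - p_win) / payoff_ratio

def expected_log_growth(f, p_win, payoff_ratio):
    """Expected log growth of capital per bet when staking fraction f."""
    return p_win * math.log(1 + f * payoff_ratio) + (1 - p_win) * math.log(1 - f)

f_star = kelly_fraction(0.55, 1.0)            # 55% win rate at even odds
g_star = expected_log_growth(f_star, 0.55, 1.0)
g_over = expected_log_growth(0.20, 0.55, 1.0)  # betting twice the Kelly fraction
```

In this example, staking twice the Kelly fraction turns a positive expected growth rate into a slightly negative one, which is one reason estimation error in the edge matters so much when sizing.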
There are some particular challenges in using portfolio optimiza-
tion for trading strategies. Once you have real trade data, and have
aggregated it to a suitable time frame (typically daily), that daily cash
return for the strategy or an instrument or configuration within the
strategy is essentially a synthetic instrument return. When doing
portfolio allocation among multiple choices, you ideally want a large
number of observations. An order of magnitude more observations
for each asset, strategy, or configuration than the number of synthetic
assets you’ll have in the optimization is usually near the minimum
stable observations set. So 60 observations for each of six synthetic
assets in the portfolio, for example.
What if you don’t have enough data? Let’s suppose you want
500 observations for each of 50 synthetic assets based on the rule
of thumb above. That is approximately two years of daily returns.
This number of observations would likely produce a high degree
of confidence if you had been running the strategy on 50 synthetic
assets for two years in a stable way. If you want to allocate capital
in alignment with your business objectives before you have enough
data, you can do a number of things:
use the data you have and re-optimize frequently to check for stability and add more data
use higher frequency data, e.g. hourly instead of daily
use a technique such as Factor Model Monte Carlo (Jiang 2007, Zivot 2011, 2012) to construct equal histories
optimize over fewer assets, requiring a smaller history
Fewer assets? Suppose you have three strategies, each with 20 con-
figurations. Using the rule of thumb above, you’d want a minimum
of 600 days of production results. If instead you use the aggregate
return of each of the three strategies, you could likely get a direc-
tionally correct optimization result from 30 or so observations, or
a month and a half of daily data with all three strategies running.
More data is generally better, if available. This compares well in prac-
tice to the three years of monthly returns (36 observations) often used
for fund of hedge funds investments.
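The arithmetic behind the examples above follows directly from the order-of-magnitude rule of thumb. A small sketch, assuming a multiple of ten observations per synthetic asset and roughly 252 trading days per year (the 252 figure is our assumption for the conversion, not a number from the text):

```python
def min_observations(n_assets, multiple=10):
    """Rule-of-thumb minimum observations per asset for a stable optimization."""
    return multiple * n_assets

def approx_years_of_daily_data(n_obs, trading_days_per_year=252):
    """Convert a count of daily observations into approximate calendar years."""
    return n_obs / trading_days_per_year

# Synthetic asset counts used in the text: 3 aggregated strategies, 6 assets,
# 50 assets, and 3 strategies x 20 configurations = 60.
needed = {n: min_observations(n) for n in (3, 6, 50, 60)}
years_for_50 = approx_years_of_daily_data(needed[50])
```

This gives 30, 60, 500, and 600 observations respectively, with 500 daily observations being a little under two years.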
What if you have lots of strategies and configurations?
Suppose we have many strategies, each with many configurations or
traded assets. In the likely case that you don’t have enough data to
optimize over all the configurations as in the example above, you can
optimize over just the aggregate strategy returns as described above.
At this point your most mature strategy or strategies may very well
have enough data for optimization separately. This opens the way
for what is called layered objectives and optimization. You may have
different business objectives for a single strategy, e.g. the objectives
for a market maker and a medium term trend follower are different.
In this case, it is preferable to optimize the configurations for a single
strategy, generating an OOS return for the strategy on a daily scale
that may be used as the input to the multi-strategy optimization
described above. This layered optimization is well-supported in R.
How often should you rebalance? Most academic literature
on optimal portfolio allocation seems to operate on an assumption of
continuous rebalancing. Such an assumption is rarely warranted in
practice due to:
cost of rebalancing trades
time to perform complex portfolio optimization
data management required to be ready to do optimization
uncertainty in conversion from weights to order sizes (see below)
conversion of optimization output into strategy parameters
It typically makes sense to choose a less frequent rebalance period.
Periods which may make sense for your production portfolio include,
but are not limited to:
calendar periods such as weeks or months
periods when cash is added or withdrawn
infrequently enough to measure out of sample deterioration with
reasonable statistical confidence
when new strategies leave small scale incubation and are ready to
be scaled up to larger production sizes
at a frequency efficient for your available computing resources
Also discuss:
rebalancing implications
discuss implications of Factor Model Monte Carlo
techniques for backing out from weights to capital to order sizes
Probability of Overfitting13
13 We should recognize the reality that any simulated (backtest) performance presented
to us likely overstates future prospects. By how much? - Antti Ilmanen (2011) p. 112
This entire paper has been devoted to avoiding overfit-
ting. At the end of the modeling process, we still need to evaluate
how likely it is that the model may be overfit, and develop a haircut
for the model that may identify how much out of sample deteriora-
tion could be expected.
With all of the methods described in this section, it is important
to note that you are no longer measuring performance; that was
covered in prior sections. At this point, we are measuring statistical
error, and developing or refuting the level of confidence which may
be appropriate for the backtest results. We have absolutely fitted the
strategy to the data, but is it over-fit? With what confidence and error
estimates can we identify the fitting?
out of sample deterioration
Figure 1: source:Aronson
We need to measure the degradation that occurs OOS in a consistent
linear measurement via lm
measurement and reconciliation of trade stats
application of OOS deterioration measurement to Walk Forward
and k-folds
It can be very informative to consider all of the IS and OOS pe-
riods from walk forward analysis, and apply the OOS degradation
measurements to these periods. Over time, you can hope to learn
about modifications to your objective function or strategy develop-
ment process which would decrease OOS degradation.
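One simple way to make the degradation measurement consistent is to pair the IS and OOS value of the same metric for each walk forward period and fit a line through the pairs. The sketch below is one possible formulation, not a prescribed method: it reports the regression slope and intercept of OOS on IS performance and an average haircut, and assumes the mean IS metric is non-zero and the IS values are not all identical. The function name `oos_deterioration` is illustrative.

```python
def oos_deterioration(is_perf, oos_perf):
    """Regress OOS performance on IS performance across walk forward periods.

    A slope below 1 (or a negative intercept) suggests that better in-sample
    results are not fully carried out of sample.
    """
    n = len(is_perf)
    x_bar = sum(is_perf) / n
    y_bar = sum(oos_perf) / n
    sxx = sum((x - x_bar) ** 2 for x in is_perf)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(is_perf, oos_perf))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Average fraction of IS performance lost out of sample
    mean_haircut = 1 - y_bar / x_bar
    return {"slope": slope, "intercept": intercept, "mean_haircut": mean_haircut}

result = oos_deterioration([1.0, 2.0, 3.0], [0.5, 1.0, 1.5])
```

In this toy input every OOS period retains exactly half of its IS performance, so the slope and the mean haircut are both 0.5.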
resampled trades
Figure 2: source:Tomasini
Tomasini (2009, 104–9) describes a basic resampling mechanism for
trades. The period returns for all “flat to flat” trades in the backtest
(and the flat periods with period returns of zero) are sampled
without replacement. After all trades or flat periods have been
sampled, a new time series is constructed by applying the original
index to the resampled returns. This gives a number of series which
will have the same mean and net return as the original backtest, but
differing drawdowns and tail risk measures. This allows confidence
bounds to be placed on the drawdown and tail risk statistics based
on the resampled returns.
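A minimal sketch of this kind of resampling follows. It assumes each trade or flat period is represented as a block (a list) of additive period returns, shuffles the blocks without replacement, and collects the maximum drawdown of each reshuffled path. The names `max_drawdown`, `resample_blocks`, and `resampled_drawdowns` are illustrative, and additive (non-compounded) returns are an assumption made to keep the net return identical across samples.

```python
import random

def max_drawdown(returns):
    """Largest peak-to-trough decline of the cumulative additive return path."""
    peak = cum = worst = 0.0
    for r in returns:
        cum += r
        peak = max(peak, cum)
        worst = max(worst, peak - cum)
    return worst

def resample_blocks(blocks, rng):
    """Shuffle trade/flat blocks without replacement and flatten to one series."""
    order = blocks[:]
    rng.shuffle(order)
    return [r for block in order for r in block]

def resampled_drawdowns(blocks, n_samples=1000, seed=42):
    """Sorted max drawdowns over n_samples reshuffled return paths."""
    rng = random.Random(seed)
    return sorted(max_drawdown(resample_blocks(blocks, rng))
                  for _ in range(n_samples))

# Each inner list is one flat-to-flat trade or one flat period (zeros).
blocks = [[0.01, 0.02], [0.0], [-0.03, 0.01], [0.0, 0.0], [0.02]]
dd = resampled_drawdowns(blocks, n_samples=200, seed=1)
```

Quantiles of the sorted list give empirical confidence bounds against which the drawdown of the original backtest path can be placed.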
This analysis can be extended to resample without replacement
only the duration and quantity of the trades. These entries, exits,
and flat periods are then applied to the original price data. This
will generate a series of returns which are potentially very different
from the original backtest.14 In this model, average trade duration,
percent time in market, and percent time long and short will remain
the same, but all the performance statistics will vary based on the
new path of the returns. This should allow a much more coherent
placement of the chosen strategy configuration versus other strategy
configurations with similar time series statistical properties.
14 We are applying these methodologies here to gain or refute confidence in
backtested results. All of these analytical methods may also be applied to post
trade analysis to gain insight into real trades and execution processes.
Burns (2006) describes a number of tests that may be applied
to the resampled returns to evaluate the key question of skill ver-
sus luck. You should be able to determine via the p-value control
statistics some confidence that the strategy is the product of skill
rather than luck as well as the strength of the predictive power of
the strategy in a manner similar to that obtained earlier on the sig-
nal processes. You can also apply all the analysis that was utilized
on evaluating the strategy and its components along the way to the
resampled returns, to see where in the distribution of statistics the
chosen strategy configuration fits.
What if your strategy is never flat? These resampling methods pre-
sume flat to flat construction, where trades are marked from flat
points in the backtest. Things get substantially more complicated if
the strategy is rarely flat. Consider, for example, a long only trend follower
which adds and removes position around a core bias position
over very long periods of time, or a market making strategy which
adds and reduces positions around some carried inventory. In these
cases constructing resampled returns that are statistically similar to
the original strategy is very difficult. One potential choice is to make
use of the “increased to reduced” trade definition methodology. If
you are considering only one instrument, or a series of independent
instruments, the increased to reduced methodology can be appropriately
applied.15
15 One of the only cases in which a FIFO style methodology provides useful
statistical information.
If the instruments are dependent or highly correlated, the diffi-
culty level of drawing useful statistical inference goes up yet again.
In the case of dependent or highly correlated instruments or posi-
tions, you may need to rely on portfolio-level Monte Carlo analysis
rather than attempting to resample trades. While it may be useful
to learn if the dependent structure adds to overall returns over the
random independent resampled case, it is unlikely to give you much
insight into improving the strategy, or much confidence about the
strategy configuration that has made it all the way to these final analyses.
what can we learn from resampling methods?
what would be incorrect inferences?
Monte Carlo
If there is strong correlation or dependence between instruments in
the backtest, then you will probably have to resample or perform
Monte Carlo analysis from portfolio-level returns, rather than trades.
You lose the ability to evaluate trade statistics, but will still be able to
assess the returns of the backtest against the sampled portfolios for
your primary business objectives and benchmarks. As was discussed
in more detail under Evaluating Signals, it is important that any re-
sampled data preserve the autocorrelation structure of the original
data to the degree possible.
White’s Reality Check
White’s Data Mining Reality Check from White (2000) (usually referred
to as DMRC, or just “White’s Reality Check”, WRC) is a bootstrap
based test which compares the strategy returns to a benchmark.
The ideas were expanded in Hansen (2005). It creates a set of bootstrap
returns and then checks, via absolute or mean squared error,
what the chances are that the model could have been the result of random
selection. It applies a p-value test between the bootstrap distribution
and the backtest results to determine whether the results of the
backtest appear to be statistically significant.
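The following is a simplified sketch of a reality-check style test, not a reference implementation. It assumes the performance measure is the mean excess return over the benchmark (rather than an error measure), uses a stationary bootstrap with geometrically distributed block lengths to retain some serial dependence, and compares the best observed strategy against the bootstrap distribution of the centered maximum. The function names, the default mean block length, and the number of bootstrap draws are all illustrative choices; the studentized refinement of Hansen (2005) is not implemented.

```python
import math
import random

def stationary_bootstrap_indices(n, mean_block, rng):
    """Index sequence with geometric block lengths and circular wrap-around."""
    idx = [rng.randrange(n)]
    while len(idx) < n:
        if rng.random() < 1.0 / mean_block:
            idx.append(rng.randrange(n))   # start a new block at a random point
        else:
            idx.append((idx[-1] + 1) % n)  # continue the current block
    return idx

def reality_check_pvalue(excess, n_boot=500, mean_block=10, seed=0):
    """Bootstrap p-value for H0: no tested strategy beats the benchmark.

    excess[k][t] is the return of strategy k minus the benchmark at time t.
    All tested strategies should be included, not only the selected one.
    """
    rng = random.Random(seed)
    n = len(excess[0])
    means = [sum(s) / n for s in excess]
    observed = math.sqrt(n) * max(means)
    exceed = 0
    for _ in range(n_boot):
        idx = stationary_bootstrap_indices(n, mean_block, rng)
        # Centered maximum over strategies for this bootstrap draw
        boot = max(math.sqrt(n) * (sum(s[i] for i in idx) / n - m)
                   for s, m in zip(excess, means))
        if boot > observed:
            exceed += 1
    return exceed / n_boot

# Toy data: one strategy with a consistently positive excess return.
strong = [[0.01 + 0.001 * ((t % 3) - 1) for t in range(200)]]
p_strong = reality_check_pvalue(strong, n_boot=200)
```

A small p-value suggests the best strategy's outperformance is unlikely to be an artifact of having searched over many candidates; the result is sensitive to the block length and to whether every trial actually run is included in `excess`.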
cross validation
Cross validation is a widely used statistical technique for model
In its classical statistical form, the data is cut in half, the model is
trained on one half, and then the trained model is tested on the half
of the data that was “held out”. As such, this type of model will
often be referred to as a single hold out cross validation model. In time
series analysis, the classical formulation is often modified to use a
smaller hold-out period than half the data, in order to create a larger
training set. Challenges with single hold out cross validation in-
clude that one out of sample set is hard to draw firm inferences from,
and that the OOS period may additionally be too short to generate
enough signals and trades to gain confidence in the model.
In many ways, walk forward analysis is related to cross validation.
The OOS periods in walk forward analysis are effectively validation
sets as in cross validation. You can and should measure the out of
sample deterioration of your walk forward model between the IS
performance and the OOS performance of the model. One advantage
of walk forward is that it allows parameters to change with the data.
One disadvantage is that there is a temptation to make the OOS pe-
riods for walk forward analysis rather small, making it very difficult
to measure deterioration from the training period. Another potential
disadvantage is that the IS periods are overlapping, which can be ex-
pected to create autocorrelation among the parameters. This autocor-
relation is mixed from an analysis perspective. A degree of parameter
stability is usually considered an advantage. The IS periods are not
all independent draws from the data, and the OOS periods will later
be used as IS periods, so any analytical technique that assumes i.i.d.
observations should be viewed at least with skepticism.
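The overlap between IS periods, and the fact that each OOS period later becomes part of an IS period, is easy to see when the windows are written out. A minimal sketch, assuming fixed-length windows measured in observations and stepping forward by one OOS length each time; the function name and the `anchored` option are illustrative.

```python
def walk_forward_windows(n_obs, is_len, oos_len, anchored=False):
    """Return (is_start, is_end, oos_start, oos_end) tuples; ends are exclusive.

    With anchored=True the IS window always starts at observation 0 and
    grows; otherwise it rolls forward with a fixed length.
    """
    windows = []
    start = 0
    while start + is_len + oos_len <= n_obs:
        is_start = 0 if anchored else start
        is_end = start + is_len
        windows.append((is_start, is_end, is_end, is_end + oos_len))
        start += oos_len
    return windows

rolling = walk_forward_windows(10, is_len=4, oos_len=2)
anchored = walk_forward_windows(10, is_len=4, oos_len=2, anchored=True)
```

In the rolling case the second IS window (observations 2 to 5) reuses half of the first IS window and all of the first OOS window, which is why parameter estimates from adjacent windows should not be treated as independent.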
k-fold cross validation improves on the classical single hold-out
OOS model by randomly dividing the sample of size T into sequential
sub-samples of size T/k (Hastie, Tibshirani, and Friedman 2009).
Alternately or additionally, something like a jackknife or block boot-
strap may be used to maintain some of the serial structure while
providing more samples. The strategy model is fit using the fitting
procedure (e.g. parameter search and/or walk forward analysis)
on the entire set of k-1 samples, in turn, and then tested on the k-th
sample which was held out. This procedure is repeated using each k
sample as the holdout. There is some work which must be done on
price-based vs return-based time series to make the series continu-
ous, e.g. the series must be turned into simple differences and then
back converted into a price series after the folds have been recom-
bined, and an artificial time index must be imposed. One weakness
with k-fold cross validation for trading strategies resides in the fact
that the length of each T/k section must be long enough to have a reasonable
error bound on the objectives in the OOS segments (Bailey,
Borwein, López de Prado, et al. 2014, p. 17). While we think that this
is important to note, in the same manner that it is always important
to understand the statistical error bounds of your calculations, it is
not a fatal flaw.
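The fold construction itself can be sketched briefly. This version assumes contiguous (sequential) folds so that serial structure within each fold is kept, and returns index lists only; stitching the training folds into a continuous price series, as described above, is a separate step not shown here. The function name `sequential_folds` is illustrative.

```python
def sequential_folds(n_obs, k):
    """Split range(n_obs) into k contiguous folds.

    Returns a list of (train_indices, test_indices), one per held-out fold.
    Fold sizes differ by at most one observation.
    """
    base, extra = divmod(n_obs, k)
    bounds, start = [], 0
    for j in range(k):
        end = start + base + (1 if j < extra else 0)
        bounds.append((start, end))
        start = end
    splits = []
    for lo, hi in bounds:
        test = list(range(lo, hi))
        train = [i for i in range(n_obs) if i < lo or i >= hi]
        splits.append((train, test))
    return splits

splits = sequential_folds(10, 3)
```

Note that for every fold except the last, the training indices include observations that come after the test fold, which is one source of the concern about applying k-fold methods to time series.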
Some question exists whether k-fold cross validation is appropriate
for time series in the same way that it is for categorical or panel data.
Rob Hyndman addresses this directly here16 and here17. What he
describes as “forecast evaluation with a rolling origin” is essentially
Walk Forward Analysis. One important takeaway from Prof. Hyn-
dman’s treatment of the subject is that it is important to define the
expected result and tests to measure forecast accuracy before perform-
ing the (back)test. Then, all the tools of forecast evaluation may be
applied to evaluate how well your forecast is doing out of sample,
and whether you are likely to have overfit your model.
linear models such as Bailey, Borwein, López de Prado, et al. (2014) and
Bailey and López de Prado (2014)
modifying existing expectations
track the number of trials
how do you define a ‘trial’?
CSCV sampling
combinatorially symmetric cross validation: “generate S/2 testing
sets of size T/2 by recombining the S slices of the overall sample
of size T.” (Bailey, Borwein, López de Prado, et al. 2014, 17)
prior remarks about overlapping periods, parameter autocorrela-
tion, and i.i.d. assumptions apply here as well
Harvey and Liu (2015), Harvey and Liu (2014), Harvey and Liu
(2013) look at Type I vs Type II error in evaluating backtests and
look at appropriate haircuts based on this.
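The CSCV sampling noted above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: it assumes the sample is cut into S equal contiguous slices, that every way of choosing S/2 slices serves as a training set with the remaining slices as the test set, that performance is measured by a simple (non-annualized) Sharpe-like ratio, and that the probability of overfitting is estimated as the share of splits in which the in-sample winner ranks at or below the median out of sample. All function names are illustrative.

```python
import math
from itertools import combinations

def sharpe(xs):
    """Mean over sample standard deviation (not annualized)."""
    n = len(xs)
    m = sum(xs) / n
    var = sum((x - m) ** 2 for x in xs) / (n - 1)
    return m / math.sqrt(var) if var > 0 else 0.0

def cscv_pbo(returns, n_slices):
    """Estimate the probability of backtest overfitting over all trials.

    returns[k][t] is the period return of trial (configuration) k at time t;
    n_slices (S) must be even. Observations beyond S equal slices are dropped.
    """
    n_trials = len(returns)
    size = len(returns[0]) // n_slices
    slices = [range(s * size, (s + 1) * size) for s in range(n_slices)]
    logits = []
    for train_ids in combinations(range(n_slices), n_slices // 2):
        test_ids = [s for s in range(n_slices) if s not in train_ids]
        train_idx = [t for s in train_ids for t in slices[s]]
        test_idx = [t for s in test_ids for t in slices[s]]
        is_perf = [sharpe([r[t] for t in train_idx]) for r in returns]
        oos_perf = [sharpe([r[t] for t in test_idx]) for r in returns]
        best = max(range(n_trials), key=is_perf.__getitem__)
        # Relative OOS rank of the IS winner, strictly between 0 and 1
        rank = sum(1 for v in oos_perf if v <= oos_perf[best])
        omega = rank / (n_trials + 1)
        logits.append(math.log(omega / (1 - omega)))
    return sum(1 for lam in logits if lam <= 0) / len(logits)

# Toy data: three trials with identical paths but different constant edges.
pattern = [0.01, -0.005, 0.02, 0.0]
trials = [[x + shift for x in pattern * 4] for shift in (0.02, 0.0, -0.02)]
pbo = cscv_pbo(trials, n_slices=4)
```

Because the toy trials differ only by a persistent edge, the in-sample winner is also the out-of-sample winner in every split and the estimate is zero; real sets of parameter trials will typically produce a value well above that. The estimate is only meaningful if `returns` includes every trial that was run, which is why tracking the number of trials matters.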
data mining bias
data mining bias and cross validation from Aronson (2006)
Cawley and Talbot (2010)
Keogh and Kasetty (2003)
I would like to thank my team, John Bollinger, George Coyle, Ilya
Kipnis, and Stephen Rush for insightful comments and questions on
various drafts of this paper. All remaining errors or omissions should
be attributed to the author.
All views expressed in this paper should be viewed as those of
Brian Peterson, and do not necessarily reflect the opinions or policies
of DV Trading or DV Asset Management.
This document rendered into LaTeX using rmarkdown (Xie (2014)) via
the rmarkdown::tufte_handout template. The BibTeX bibliography
file is managed via JabRef.
©2014-2016 Brian G. Peterson
The most recently published version of this document may be
found at
Aronson, David. 2006. Evidence-Based Technical Analysis: Applying the
Scientific Method and Statistical Inference to Trading Signals. Wiley.
Bailey, David H, and Marcos López de Prado. 2014. “The Deflated
Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting and
Non-Normality.” Journal of Portfolio Management, Forthcoming. http:
———. forthcoming, 2014. “Drawdown-Based Stop-Outs and the
’Triple Penance’ Rule.” Journal of Risk.
Bailey, David H, Jonathan M Borwein, Marcos López de Prado,
and Qiji Jim Zhu. 2014. “Pseudomathematics and Financial Charlatanism:
The Effects of Backtest Over Fitting on Out-of-Sample Performance.”
Notices of the AMS 61 (5): 458–71.
Bailey, David H, Jonathan M Borwein, Marcos López de Prado,
and Qiji Jim Zhu. 2014. “The Probability of Backtest Overfitting.”
Baquero, Guillermo, Jenke Ter Horst, and Marno Verbeek. 2005.
“Survival, Look-Ahead Bias, and Persistence in Hedge Fund Performance.”
Journal of Financial and Quantitative Analysis 40 (03). Cambridge
Univ Press: 493–517.
Box, George E.P., and Norman R. Draper. 1987. Empirical Model-Building
and Response Surfaces. John Wiley & Sons.
Burns, Patrick. 2006. “Random Portfolios for Evaluating Trading
Carroll, Robert. 2011. The Skeptic’s Dictionary: A Collection of Strange
Beliefs, Amusing Deceptions, and Dangerous Delusions. John Wiley &
Cawley, Gavin C, and Nicola LC Talbot. 2010. “On over-Fitting
in Model Selection and Subsequent Selection Bias in Performance
Evaluation.” The Journal of Machine Learning Research 11. JMLR. org:
Diedesch, Josh. 2014. “2014 Forty Under Forty.” Chief Investment
Officer. California State Teachers’ Retirement System. http://www.
Feynman, Richard P, Robert B Leighton, Matthew Sands, and EM
Hafner. 1965. The Feynman Lectures on Physics. Vols. 1-3.
Fitschen, Keith. 2013. Building Reliable Trading Systems. John Wiley
& Sons, Inc.
Hansen, Peter R. 2005. “A Test for Superior Predictive Ability.”
Journal of Business and Economic Statistics.
Harvey, Campbell R., and Yan Liu. 2013. “Multiple Testing in
Economics.” SSRN.
———. 2014. “Evaluating Trading Strategies.” Journal of Portfolio
Management 40 (5): 108–18.
———. 2015. “Backtesting.” SSRN.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The
Elements of Statistical Learning: Data Mining, Inference, and Prediction.
Second Edition. Springer.
Horowitz, Ben. 2014. The Hard Thing About Hard Things: Building a
Business When There Are No Easy Answers. HarperCollins.
Ilmanen, Antti. 2011. Expected Returns: An Investor’s Guide to Harvesting
Market Rewards. John Wiley & Sons.
Keogh, Eamonn, and Shruti Kasetty. 2003. “On the Need for Time
Series Data Mining Benchmarks: A Survey and Empirical Demonstration.”
Data Mining and Knowledge Discovery 7 (4). Springer: 349–71.
Kerr, Norbert L. 1998. “HARKing: Hypothesizing After the Results
Are Known.” Personality and Social Psychology Review 2 (3). Sage
Publications: 196–217.
Kestner, Lars. 2003. Quantitative Trading Strategies: Harnessing the
Power of Quantitative Techniques to Create a Winning Trading Program.
Kuhn, Max, and Kjell Johnson. 2013. Applied Predictive Modeling.
Pardo, Robert. 2008. The Evaluation and Optimization of Trading
Strategies. Second ed. John Wiley & Sons.
Racine, Jeffrey S, and Christopher F Parmeter. 2012. “Data-Driven
Model Evaluation: A Test for Revealed Performance.” McMaster Uni-
Ripley, Brian D. 2004. “Selecting Amongst Large Classes of Mod-
els.” Methods and Models in Statistics: In Honor of Professor John Nelder,
Smith, Martha K. n.d. “Common Misteaks Mistakes in Using
Statistics: Spotting and Avoiding Them - Data Snooping.” https:
Sullivan, Ryan, Allan Timmermann, and Halbert White. 1999.
“Data Snooping, Technical Trading Rule Performance, and the Bootstrap.”
The Journal of Finance 54 (5): 1647–91.
Tomasini, Emilio, and Urban Jaekle. 2009. Trading Systems: A New
Approach to System Development and Portfolio Optimisation.
Tukey, John W. 1962. “The Future of Data Analysis.” The Annals
of Mathematical Statistics. JSTOR, 1–67.
Vince, Ralph. 2009. The Leverage Space Trading Model: Reconciling
Portfolio Management Strategies and Economic Theory. John Wiley &
White, Halbert L. 2000. “System and Method for Testing Prediction
Models and/or Entities.” Google Patents.
Xie, Yihui. 2014. “R Markdown — Dynamic Documents for R.”