THE JOURNAL OF FINANCE VOL. LXXIX, NO. 1 FEBRUARY 2024
The Virtue of Complexity in Return Prediction
BRYAN KELLY, SEMYON MALAMUD, and KANGYING ZHOU*
ABSTRACT
Much of the extant literature predicts market returns with “simple” models that use
only a few parameters. Contrary to conventional wisdom, we theoretically prove that
simple models severely understate return predictability compared to “complex” mod-
els in which the number of parameters exceeds the number of observations. We em-
pirically document the virtue of complexity in U.S. equity market return prediction.
Our findings establish the rationale for modeling expected returns through machine
learning.
THE FINANCE LITERATURE HAS RECENTLY seen rapid advances in return
prediction methods borrowing from the machine learning canon. The pri-
mary economic-use case of these predictions has been portfolio construction.
While a number of papers document significant empirical gains in portfolio
performance through the use of machine learning, there is little theoretical
*Bryan Kelly is at Yale School of Management, AQR Capital Management, and NBER. Semyon
Malamud is at Swiss Finance Institute, EPFL, and CEPR, and is a consultant to AQR. Kangy-
ing Zhou is at Yale School of Management. We are grateful for helpful comments from Cliff As-
ness; Kobi Boudoukh; Daniel Buncic; James Choi; Frank Diebold; Egemen Eren; Paul Goldsmith-
Pinkham; Amit Goyal; Ron Kaniel (discussant); Stefan Nagel (Editor); Andreas Neuhierl (discus-
sant); Matthias Pelster (discussant); Olivier Scaillet (discussant); Christian Schlag (discussant);
Akos Toeroek; Hui Wang (discussant); Guofu Zhou (discussant); seminar participants at AQR,
Yale, Vienna University of Economics and Business, Philadelphia Fed, Bank for International
Settlements, NYU Courant, and EPFL; and conference participants at the Macro Finance So-
ciety, Adam Smith Asset Pricing Conference, SFS Cavalcade North America Conference, Hong
Kong Conference for Fintech, AI, and Big Data in Business, Wharton Jacobs-Levy Conference,
Research Symposium on Finance and Economics, China International Risk Forum, Stanford SITE
New Frontiers in Asset Pricing, and XXI Symposium. We are especially grateful to Mohammad
Pourmohammadi for suggesting several essential improvements to our proofs and technical con-
ditions. AQR Capital Management is a global investment management firm, which may or may
not apply similar investment techniques or methods of analysis as described herein. The views
expressed here are those of the authors and not necessarily those of AQR. Semyon Malamud
gratefully acknowledges support from the Swiss Finance Institute and the Swiss National Science
Foundation, Grant 100018_192692. We have read The Journal of Finance’s disclosure policy and
have no conflicts of interest to disclose.
Correspondence: Bryan Kelly, Yale School of Management, AQR Capital Management, and
NBER; 165 Whitney Ave., New Haven, CT 06511; e-mail: bryan.kelly@yale.edu.
This is an open access article under the terms of the Creative Commons Attribution-Non
Commercial-NoDerivs License, which permits use and distribution in any medium, provided the
original work is properly cited, the use is non-commercial and no modifications or adaptations are
made.
DOI: 10.1111/jofi.13298
© 2023 The Authors. The Journal of Finance published by Wiley Periodicals LLC on behalf of
American Finance Association.
understanding of return forecasts and portfolios formed from heavily parame-
terized models.
We provide a theoretical analysis of such “machine learning portfolios.” Our
analysis can be summarized by the following thought experiment. Imagine
there is a true predictive model of the form
$$R_{t+1} = f(G_t) + \epsilon_{t+1}, \qquad (1)$$
where R is an asset return, G is a fixed set of predictive signals, and f is
a smooth function. The predictors G may be known to the analyst, but the
prediction function f is unknown. Rather than futilely guessing the functional form,
the analyst relies on the universal approximation rationale (see, for example,
Hornik, Stinchcombe, and White (1990)), that f can be approximated with a
sufficiently wide neural network,
$$f(G_t) \approx \sum_{i=1}^{P} S_{i,t}\,\beta_i,$$
where $S_{i,t} = \tilde{f}(w_i' G_t)$ is a known nonlinear activation function with known
weights $w_i$ and P is sufficiently large.¹ As a result, (1) takes the form
$$R_{t+1} = \sum_{i=1}^{P} S_{i,t}\,\beta_i + \tilde{\epsilon}_{t+1}. \qquad (2)$$
The training sample for this regression has a fixed number of data points, T,
and the analyst must decide on the “complexity,” or the number of features P, to
use in their approximating model. A simple model, one with P ≪ T, will have
low variance thanks to parsimonious parameterization but will be a coarse
approximator of f, while a high-complexity model (P > T) has better approx-
imation potential but may be poorly behaved and will require shrinkage/bias.
Our central research question therefore is, what level of model complexity (i.e.,
which P) should the analyst opt for? Does the approximation improvement
from large P justify the statistical costs (higher variance and/or higher bias)?
Answer: We prove that expected out-of-sample forecast accuracy and port-
folio performance are strictly increasing in model complexity when appropri-
ate shrinkage is applied (indeed, we derive the optimal degree of shrinkage
to maximize expected out-of-sample model performance). The analyst should
always use the largest approximating model that she can compute. In other
words, when the true data-generating process (DGP) is unknown, the approxi-
mation gains achieved through model complexity dominate the statistical costs
of heavy parameterization. The interpretation is not necessarily that asset re-
turns are subject to a large number of fundamental driving forces. Rather,
even when the driving variables (G_t) have low dimension, complex models
¹ Assuming known weights $w_i$ is innocuous, as the universal approximation result applies even
if weights are randomly generated (Rahimi and Recht (2007)). Our empirical analysis uses the
random Fourier feature (RFF) method of Rahimi and Recht (2007) to generate features as in (2).
better leverage the information content of G_t by more accurately approximating
the unknown and likely nonlinear prediction function.
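The random-feature construction in equation (2) can be sketched in a few lines. The code below is our own illustration, not the authors' implementation: it draws Gaussian weights once, holds them fixed, and maps a stand-in for the raw predictors G_t into P sine/cosine features in the spirit of Rahimi and Recht (2007). All names and dimensions are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

T, d, P = 240, 15, 1000          # observations, raw predictors, generated features
G = rng.standard_normal((T, d))  # stand-in for the 15 Goyal-Welch predictors

# Random Fourier features: S_{i,t} = sin/cos(gamma * w_i' G_t) with w_i ~ N(0, I).
# The weights are drawn once and then held fixed, so the model stays linear in beta.
gamma = 2.0
W = rng.standard_normal((d, P // 2))
Z = gamma * G @ W
S = np.hstack([np.sin(Z), np.cos(Z)])  # T x P matrix of generated features

print(S.shape)  # (240, 1000)
```

Because the nonlinearity sits inside fixed, randomly drawn weights, increasing P enriches the approximation of f while keeping the estimation problem a linear regression of R on S.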
To provide intuitive characterizations of forecast and portfolio behavior
in complex models, our theoretical environment has two simplifying aspects.
First, the machine learning models we study are restricted to high-dimensional
linear models. As suggested by equation (2), this sacrifices little generality as
a number of recent papers establish an equivalence between high-dimensional
linear models and more sophisticated models such as deep neural networks
(Jacot, Gabriel, and Hongler (2018), Allen-Zhu, Li, and Song (2019), Hastie
et al. (2022)). In fact, equation (2) is a neural network with one hidden layer
with Pneurons and fixed input weights. Second, we focus on a single risky
asset. Prediction is therefore isolated to the time-series dimension, and the
portfolio optimization problem reduces to market timing.² These two sim-
plifications make our key findings more accessible, yet neither is critical for
our conclusions.
To provide a baseline for our findings, consider the well-known deficiency
of ordinary least squares (OLS) prediction in high dimensions. As the num-
ber of regressors, P, approaches the number of data points, T, the expected
out-of-sample R² tends to negative infinity. An immediate implication is that
a portfolio strategy attempting to use OLS return forecasts in such a setting
will have divergent variance. In turn, its expected out-of-sample Sharpe ra-
tio collapses to zero. The intuition behind this is simple: When the number of
regressors is similar to the number of data points, the regressor covariance
matrix is unstable, and its inversion induces wild variation in coefficient esti-
mates and forecasts. This is commonly interpreted as overfitting: With P = T,
the regression exactly fits the training data and performs poorly out-of-sample.
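This OLS breakdown is easy to see in a small simulation. The toy DGP below is our own, not the paper's empirical setup: holding T = 100 fixed, out-of-sample R² deteriorates sharply as the number of regressors approaches the number of training observations.

```python
import numpy as np

rng = np.random.default_rng(1)
T, T_test = 100, 10_000

def oos_r2(P):
    # Toy DGP: R_{t+1} = S_t' beta + eps, with modest predictability.
    beta = 0.3 * rng.standard_normal(P) / np.sqrt(P)
    S_tr, S_te = rng.standard_normal((T, P)), rng.standard_normal((T_test, P))
    R_tr = S_tr @ beta + rng.standard_normal(T)
    R_te = S_te @ beta + rng.standard_normal(T_test)
    b_ols, *_ = np.linalg.lstsq(S_tr, R_tr, rcond=None)  # plain OLS fit
    resid = R_te - S_te @ b_ols
    return 1 - resid @ resid / (R_te @ R_te)             # out-of-sample R^2

for P in (5, 50, 90, 99):
    print(P, oos_r2(P))  # R^2 becomes sharply negative as P approaches T
```

The near-singular regressor covariance at P close to T inflates the coefficient estimates, so the forecast variance, and hence the out-of-sample error, explodes.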
We are particularly interested in the behavior of portfolios in the high model
complexity regime, where the number of predictors exceeds the number of ob-
servations (P > T).³ In this case, standard regression logic no longer holds
because the regressor inverse covariance matrix is not defined. However, the
pseudo-inverse is defined and it corresponds to a limiting ridge regression with
infinitesimal shrinkage, or the “ridgeless” limit. An emerging statistics and
machine learning literature shows that, in the high-complexity regime, ridge-
less regression can achieve accurate out-of-sample forecasts despite fitting the
training data perfectly.4
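A minimal numeric illustration of the ridgeless limit, on our own simulated data: with P > T, the Moore-Penrose pseudoinverse yields the minimum-norm coefficients that fit the training sample exactly, and ridge regression with vanishing shrinkage converges to the same solution.

```python
import numpy as np

rng = np.random.default_rng(2)
T, P = 50, 200                      # high-complexity regime: P > T
S = rng.standard_normal((T, P))
R = rng.standard_normal(T)

# Ridgeless coefficients: minimum-norm interpolator via the pseudoinverse.
b_ridgeless = np.linalg.pinv(S) @ R

# Ridge with infinitesimal shrinkage approaches the same coefficients.
lam = 1e-8
b_ridge = np.linalg.solve(S.T @ S + lam * T * np.eye(P), S.T @ R)

print(np.max(np.abs(S @ b_ridgeless - R)))    # ~0: training data fit exactly
print(np.max(np.abs(b_ridge - b_ridgeless)))  # ~0: ridgeless = small-lambda ridge
```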
² The single-asset time-series case is economically important in its own right. It coincides with
predictive regression for the market return, which has been the primary method for investigating
a central organizing question of asset pricing: How much do discount rates vary over time? While
our analysis can be applied to a panel of many assets, the roles of covariances in asset returns and
signals across stocks complicate the theory.
³ The statistics and machine learning community often refers to P > T as the “high-dimensional”
or “overparameterized” regime. We avoid terminology like “overparameterized” and “overfit” as it
suggests the model uses too many parameters, which is not necessarily the case. For example, the
true DGP may be highly complex (i.e., P is large relative to T) and thus a correctly specified model
would require P > T. When an empirical model has the same specification as the true model, we
prefer to call it correctly parameterized as opposed to overparameterized.
⁴ This seemingly counterintuitive phenomenon is sometimes called “benign overfit” (Bartlett
et al. (2020), Tsigler and Bartlett (2023)).
We analyze related phenomena in the context of return prediction and port-
folio optimization. We establish the striking theoretical result that market
timing strategies based on ridgeless least-squares predictions generate posi-
tive Sharpe ratio improvements for arbitrarily high levels of model complex-
ity. Stated more plainly, when the true DGP is highly complex (i.e., has many
more parameters than there are training data observations), one might think
that a timing strategy based on ridgeless regression is bound to fail. After all,
it exactly fits the training data with zero error. Surprisingly, this intuition is
wrong. We prove that strategies based on extremely high-dimensional models
can thrive out-of-sample and outperform strategies based on simpler models
under fairly general conditions.
Our theoretical analysis delivers a number of additional conclusions. First, it
shows that the out-of-sample R² from a prediction model is an incomplete mea-
sure of its economic value. A market timer can generate significant economic
profits even when the predictive R² is negative. The reason is that the R² is
heavily influenced by the variance of forecasts.⁵ A very low out-of-sample R²
indicates a highly volatile timing strategy. But the properties of least squares
imply that the expected out-of-sample return of a timing strategy is always
positive. So, as long as the timing variance is not too high (R² is not too nega-
tive), the timing Sharpe ratio can be substantial.
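A deliberately scale-inflated forecast makes the point concrete. This toy example is ours, echoing the scale-error argument of footnote 5: the forecast has the right sign but ten times the true coefficient, which wrecks the out-of-sample R² while leaving the timing Sharpe ratio positive.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 1_000_000
s = rng.standard_normal(T)
R = 0.1 * s + rng.standard_normal(T)   # true DGP: modest predictability

forecast = 1.0 * s                     # right sign, badly inflated scale
neg_r2 = 1 - np.mean((R - forecast) ** 2) / np.mean(R ** 2)

timing = forecast * R                  # returns of the timing strategy pi_t = forecast
sharpe = timing.mean() / timing.std()

print(neg_r2)   # strongly negative: the scale error dominates the R^2
print(sharpe)   # positive: the direction of the forecast is still right
```

Population values for this DGP are R² ≈ −0.79 and a per-period Sharpe ratio near 0.10, so the forecast is economically useful despite being a poor point predictor.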
Second, we study two theoretical cases, one for correctly specified models and
one for misspecified models. The correctly specified case develops the behavior
of timing portfolios when the true DGP varies from simple to complex, holding
the data size fixed. This is valuable for developing a general understanding
of machine learning portfolios for various DGPs. But the correct model spec-
ification is unrealistic—it is unlikely that we ever have a predictor data set
that nests all relevant conditioning information, and it is also unlikely that
we use information in the proper functional form. Our main theoretical results
pertain to misspecified models, and this analysis coincides with the thought
experiment above. In practice, when we vary the empirical model specification
from simple to complex, we change how accurately the model approximates a
fixed DGP.
Third, while the results discussed so far refer primarily to the case of
ridgeless regression, we show that machine learning portfolios tend to incre-
mentally benefit from moving away from the ridgeless limit by introducing
nontrivial shrinkage. The bias induced by heavier ridge shrinkage lowers the
expected returns to market timing, but the associated variance reduction reins
⁵ That is, R² is not just about predictive correlation. Consider a simple model with a single
predictor and a coefficient estimate many times larger than the true value. This scale error will
tend to drive the R² negative, but it will not affect the correlation between the model fits and the
true conditional expectation. The R² is negative only because the variance of the fits is off. Related,
Rapach, Strauss, and Zhou (2010) show that mean square forecast error (MSE) decomposes into a
scale-free (correlation) component and a scale-dependent component. It is the scale-free component
that is important for trading strategy performance. Leitch and Tanner (1991), Cenesizoglu and
Timmermann (2012), and Rapach and Zhou (2013) also emphasize the importance of evaluating
return prediction models based on their economic value in terms of trading strategy performance.
in the volatility of the strategy. The Sharpe ratio tends to benefit from higher
shrinkage because the variance reduction overwhelms the deterioration in expected
timing returns. This is especially true when P ≈ T, where the behavior
of ridgeless regression is most vulnerable.
From a technical standpoint, we characterize the behavior of portfolios in the
high-complexity regime using asymptotic analysis, as the model’s size grows
with the number of observations at a fixed rate (T → ∞ and P/T → c > 0).
When P → ∞, the regular asymptotic results, such as laws of large numbers
and central limit theorems, do not hold. Such analysis requires the apparatus
of random matrix theory, on which we draw heavily to derive our results. Con-
ceptually, this delivers an approximation of how a machine learning model be-
haves as we gradually increase the number of parameters holding the amount
of data fixed.
We conduct an extensive empirical analysis that demonstrates the virtues of
model complexity in a canonical asset pricing problem: predicting the aggre-
gate U.S. equity market return.⁶ In particular, we study market timing strate-
gies based on predictions from very simple models with a single parameter to
extremely complex models with over 10,000 parameters (applied to training
samples with as few as 12 monthly observations). The data inputs to our mod-
els are 15 standard predictor variables from the finance literature compiled by
Goyal and Welch (2008). To map our data analysis to the theory, we require a
method that smoothly transitions from low- to high-complexity models while
holding the underlying information set fixed. The random feature method of
Rahimi and Recht (2007) is ideal for this. We use it to construct expanding
neural network architectures that take the Goyal and Welch (2008) predictors
as inputs and maintain the core ridge regression structure of our theory.
We find extraordinary agreement between empirical patterns and our theo-
retical predictions. Over the standard Center for Research in Security Prices
(CRSP) sample from 1926 to 2020, out-of-sample market timing Sharpe ratio
improvements (relative to market buy-and-hold) reach roughly 0.47 per annum
with t-statistics near 3.0. This is despite the fact that the out-of-sample pre-
dictive R2is substantially negative for the vast majority of models, consistent
with the theoretical argument that predictive R2is inappropriate for judging
the economic benefit of a machine learning model.
Timing positions from high-complexity models are remarkable. They behave
similarly to long-only strategies, following the Campbell and Thompson (2008)
recommendation to impose a nonnegativity constraint on expected market re-
turns. But our models learn this behavior as opposed to being handed a con-
straint. Moreover, machine learning strategies learn to divest leading up to
National Bureau of Economic Research (NBER) recessions, successfully doing
so in 14 out of 15 recessions in our test sample on a purely out-of-sample basis.
⁶ Surveys of this large literature include Koijen and Van Nieuwerburgh (2011), Cochrane (2011),
and Rapach and Zhou (2022). For early machine learning approaches to market return prediction,
see Rapach, Strauss, and Zhou (2010) and Kelly and Pruitt (2013).
This paper relates most closely to emerging literature that studies the theo-
retical properties of machine learning models. A number of recent papers show
that linear models combined with random matrix theory help characterize the
behavior of neural networks trained by gradient descent.⁷ In particular, wide
neural networks (many nodes in each layer) are effectively kernel regressions,
and “early stopping” in neural network training is closely related to ridge regu-
larization (Ali, Kolter, and Tibshirani (2019)). Recent research also emphasizes
the phenomenon of benign overfit and “double descent,” in which expected fore-
cast error drops in the high-complexity regime.⁸
In this literature, the paper closest to ours is Hastie et al. (2022), who de-
rive nearly optimal error bounds in finite samples for bias and risk in the
ridge(less) regression under very general conditions.⁹ They are also the first
to introduce misspecified models in which some of the signals may be unob-
servable. In this paper, we focus on the (easier) asymptotic regime. We use
a different method of proof and relax some of the technical conditions on the
distributions of signals, using recent results of Yaskov (2016). In particular,
we allow for nonuniformly positive-definite covariance matrices. Most impor-
tantly, instead of focusing on the prediction model forecast error variance, we
characterize the expected out-of-sample returns, volatility, and Sharpe
ratios of market timing strategies based on machine learning predictions. As
in Hastie et al. (2022), our key interest is in the misspecified model. While
Hastie et al. (2022) focus on a specific form of misspecification and its ridge-
less limit, we derive general expressions for asymptotic expected returns and
volatility in terms of signal correlations.
Our paper also relates closely to a growing empirical literature that uses ma-
chine learning methods to analyze stock returns. The state-of-the-art market
return prediction uses high-dimensional models with shrinkage and demon-
strates robust out-of-sample predictive power. Rapach, Strauss, and Zhou
(2010) use predictors from Goyal and Welch (2008) and forecast combination
methods (which they show exert a strong shrinkage effect). Ludvigson and Ng
(2007) and Kelly and Pruitt (2013) use principal components regression and
partial least squares, respectively, to leverage large predictor sets for market
return prediction and achieve shrinkage through dimension reduction. Dong
et al. (2022) use 100 long-short “anomaly” portfolios to forecast the market re-
turn using a variety of forecasting strategies to implement shrinkage (more
generally, see the recent survey by Rapach and Zhou (2022)). An emerging lit-
erature uses machine learning methods to forecast large panels of individual
stock returns or portfolios, including Rapach and Zhou (2020), Kozak, Nagel,
and Santosh (2020), Freyberger, Neuhierl, and Weber (2020), Gu, Kelly, and
Xiu (2020), and Chen, Pelger, and Zhu (2023) (also see the survey by Kelly
⁷ See, for example, Jacot, Gabriel, and Hongler (2018), Hastie et al. (2022), Du et al. (2018), Du
et al. (2019), and Allen-Zhu, Li, and Song (2019).
⁸ See, for example, Spigler et al. (2019), Belkin et al. (2019), Belkin, Rakhlin, and Tsybakov
(2019), Belkin, Hsu, and Xu (2020), and Bartlett et al. (2020).
⁹ See also Richards, Mourtada, and Rosasco (2021), who obtain less general results in an asymp-
totic setting (as in our paper).
and Xiu (2022)). Our paper offers theoretical justification for the successes of
machine learning prediction documented in the asset pricing literature. Our
theoretical results call for researchers to consider even larger information sets
and higher dimensional approximations to further improve return forecasts (a
rationale justified by our empirical analysis). Finally, our paper is related to
Martin and Nagel (2022) and Da, Nagel, and Xiu (2022), who examine market
efficiency implications of the high-dimensional prediction problem faced by in-
vestors, to Fan et al. (2022) who touch upon the “double descent” phenomenon
in their analysis of structural machine learning models, and to financial econo-
metrics applications of random matrix theory such as Fan, Fan, and Lv (2008),
Ledoit and Wolf (2020), and Fan, Guo, and Zheng (2022).
The paper is organized as follows. In Section I, we lay out the theoretical
environment. Section II presents the foundational results from random matrix
theory from which we derive our main theoretical results. Section III charac-
terizes the behavior of machine learning portfolios in the correctly specified
setting and emphasizes the intuition behind the portfolio benefits of high-
complexity prediction models. Section IV extends these results to the more
practically relevant setting of misspecified models. We present our main em-
pirical results in Section V. Section VI concludes. The Internet Appendix con-
tains a variety of supplementary theoretical results and empirical robustness
analyses.10 We invite readers that are primarily interested in the qualitative
theoretical points and the empirical analysis to skip the technical material of
Sections Iand II.
I. Environment
This section describes our modeling assumptions and outlines the criteria
we use to evaluate machine learning portfolios.
A. Asset Dynamics
ASSUMPTION 1: There is a single asset whose excess return behaves according
to
$$R_{t+1} = S_t'\beta + \varepsilon_{t+1}, \qquad (3)$$
with ε_{t+1} independent and identically distributed (i.i.d.), E[ε_{t+1}] = E[ε³_{t+1}] = 0,
E[ε²_{t+1}] = σ², E[ε⁴_{t+1}] < ∞,¹¹ and S_t a P-vector of predictor variables. Without
loss of generality, everywhere in the sequel, we normalize σ² = 1.
Assumption 1 establishes the basic return-generating process. Most notably,
conditional expected returns depend on a potentially high-dimensional infor-
10 The Internet Appendix is available in the online version of the article on The Journal of
Finance website.
11 The assumption of zero skewness simplifies the analytical expressions but does not affect
our results.
mation set embodied by the predictors, S. The interpretation of this assump-
tion is not necessarily that asset returns are subject to many fundamental
driving forces. Instead, it espouses the machine learning perspective discussed
in the introduction: The DGP’s functional form is unknown but may be approx-
imated with richly parameterized models using a high-dimensional nonlinear
expansion S of some underlying feature set.
The covariance structure of S plays a central role in the behavior of machine
learning predictions and portfolios. Assumption 2 imposes basic regularity con-
ditions on this covariance.
ASSUMPTION 2: There exist independent random vectors $X_t \in \mathbb{R}^P$ whose first
four moments are finite, and a symmetric, P-dimensional positive semidefinite matrix
Σ such that
$$S_t = \Sigma^{1/2} X_t.$$
Furthermore, E[X_{i,t}] = E[X³_{i,t}] = E[X_{i₁}X_{i₂}X_{i₃}] = 0 and E[X²_{i,t}] = 1, for all
i, i₁, i₂, i₃ = 1, …, P, and E[X_{i₁}X_{i₂}X_{i₃}X_{i₄}] = 0 whenever at least three indices
among i₁, i₂, i₃, i₄ are different. Furthermore, the fourth moments E[X⁴_{i,t}] are
uniformly bounded.
As we show below, the theoretical properties of machine learning portfolios
depend heavily on the distribution of eigenvalues of Σ. We are interested in
limiting behavior in the high model complexity regime, that is, as P, T → ∞,
with P/T → c > 0. Assumption 3 ensures that estimates of Σ are well behaved
in this limit.
ASSUMPTION 3: We use λ_k(A), k = 1, …, P, to denote the eigenvalues of an
arbitrary matrix A. In the limit as P → ∞, the spectral distribution F of the
eigenvalues of Σ,
$$F(x) = \frac{1}{P}\sum_{k=1}^{P} 1_{\{\lambda_k(\Sigma) \le x\}}, \qquad (4)$$
converges to a nonrandom probability distribution H supported on [0, +∞).¹²
Furthermore, Σ is uniformly bounded as P → ∞. We use
$$\psi_k = \lim_{P\to\infty} P^{-1}\operatorname{tr}(\Sigma^k), \quad k \ge 1,$$
to denote the asymptotic moments of the eigenvalues of Σ.
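The objects in Assumption 3 are straightforward to compute for any finite Σ. A small sketch, with an illustrative diagonal covariance of our own choosing:

```python
import numpy as np

P = 500
# An example covariance with a decaying spectrum (our choice, purely illustrative).
eigvals = 1.0 / (1.0 + np.arange(P) / 50.0)
Sigma = np.diag(eigvals)

# Empirical spectral distribution F(x) = (1/P) * #{k : lambda_k(Sigma) <= x}.
def F(x):
    return float(np.mean(eigvals <= x))

# Finite-P analogue of the asymptotic moments psi_k = lim P^{-1} tr(Sigma^k).
def psi(k):
    return float(np.trace(np.linalg.matrix_power(Sigma, k)) / P)

print(F(0.5), psi(1), psi(2))
```

As P grows with the same spectral shape, F converges to a fixed distribution H and the finite-P traces converge to the ψ_k moments that drive the results below.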
Our last assumption governs the behavior of the true predictive coefficient,
β.
ASSUMPTION 4: We assume that β = β_P is random, β = (β_i)_{i=1}^P ∈ ℝ^P, with i.i.d.
coordinates β_i that are independent¹³ of S and R, and such that E[β] = 0,
E[ββ'] = P^{-1} b_{*,P} I for some constant b_{*,P} = E[‖β‖²],¹⁴ which satisfies b_{*,P} → b_* in
probability for some b_* > 0. Furthermore, E[β⁴_i] ≤ K P^{-2} for some K > 0.
¹² If zero is in the support of H, then Σ is strictly degenerate, meaning that some signals are re-
dundant.
¹³ The assumption of a random coefficient vector β is related to that in Gagliardini, Ossola, and
Scaillet (2016).
The randomness of β in Assumption 4 allows us to characterize the predic-
tion and portfolio problem for generic predictive coefficients. The assumption
that β is mean zero is inconsequential; we could allow for a nonzero mean
and restate our analysis in terms of variances rather than second moments.
The assumption E[ββ'] = P^{-1} b_{*,P} I imposes that the predictive content of sig-
nals is rotationally symmetric, that is, predictability is uniformly distributed
across signals. This may seem restrictive, as commonly used return predictors
would not satisfy Assumption 4. But it is closely aligned with the structure
of feed-forward neural networks, in which raw features are mixed and non-
linearly propagated into final generated features whose ordering is essentially
randomized by the initialization step of network training. Intuitively, we ex-
pect (and later confirm empirically) that the random-feature methodology that
we use in our empirical analysis satisfies Assumption 4.¹⁵
When β is random and rotationally symmetric, we can focus on average port-
folio behavior across signals, which implies that only the traces of the relevant
matrices matter, as opposed to entire matrices (which are the source of techni-
cal intractability). The proportionality of E[ββ'] to P^{-1}, and likewise the finite
limiting ℓ₂ norm of β, controls the “true” Sharpe ratio. It ensures that Sharpe
ratios of timing strategies remain bounded as the number of predictors grows.
In other words, our setting is one with many signals, each contributing a little
bit of predictability.
A key aspect of our paper, and one rooted in Assumptions 2 and 4, is that
realized out-of-sample expected returns are independent of the specific real-
ization of β. This is due to a law of large numbers in the P → ∞ limit and is
guaranteed by the following lemma.¹⁶
LEMMA 1: As P → ∞, we have
$$\beta' A_P \beta - P^{-1} b_* \operatorname{tr}(A_P) \to 0$$
in probability for any bounded sequence of matrices A_P. In particular,
β'Σβ → b_*ψ₁.
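Lemma 1 is a concentration statement, and it can be checked numerically under the Assumption 4 scaling E[β_i²] = b_*/P. The sketch below is ours, using a diagonal test matrix so that the quadratic form is cheap to evaluate at large P:

```python
import numpy as np

rng = np.random.default_rng(4)
P, b_star = 20_000, 2.0

# beta has i.i.d. coordinates with E[beta_i^2] = b_star / P (Assumption 4 scaling).
beta = rng.standard_normal(P) * np.sqrt(b_star / P)

a = rng.uniform(0.0, 1.0, P)      # diagonal of a bounded test matrix A_P
quad = np.sum(a * beta ** 2)      # beta' A_P beta for diagonal A_P
limit = b_star * a.sum() / P      # P^{-1} b_star tr(A_P)

print(quad, limit)  # close for large P
```

The gap between the two quantities shrinks like $P^{-1/2}$, which is exactly why out-of-sample moments end up independent of the particular realization of β.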
¹⁴ This identity follows because b_{*,P} = tr E[ββ'] = E[tr(ββ')] = E[‖β‖²].
¹⁵ From a technical standpoint, it is possible to derive explicit expressions for portfolio per-
formance without this assumption, but the expressions become more complex. In this case, the
asymptotic behavior depends on the distribution of projections of β on the eigenvectors of Σ (the
signals’ principal components). See Hastie et al. (2022). In particular, when β is concentrated on
the top principal components, the phenomenon of benign overfit emerges (Bartlett et al. (2020),
Tsigler and Bartlett (2023)) and the optimal ridge regularization is zero. We leave this generaliza-
tion for future research.
¹⁶ It is possible to use the results in Hastie et al. (2022) to extend our analysis to generic β
distributions. We leave this important direction for future research.
B. Timing Strategies and Performance Evaluation
We study timing-strategy returns, defined as
$$R^\pi_{t+1} = \pi_t R_{t+1},$$
where π_t is a timing weight that scales the position in the asset up and down
to exploit time variation in the asset’s expected returns.
We are interested in timing strategies that optimize the unconditional
Sharpe ratio,
$$SR = \frac{E[R^\pi_{t+1}]}{\sqrt{E[(R^\pi_{t+1})^2]}}. \qquad (5)$$
While there are other possible performance criteria, we focus on this one for
its simplicity and ubiquity. It is implied by the quadratic utility function at
the foundation of mean-variance portfolio theory. Academics and real-world
investors rely nearly universally on the unconditional Sharpe ratio when eval-
uating empirical trading strategies. The use of centered versus uncentered sec-
ond moment in the denominator is without loss of generality.17
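The claim that using the uncentered second moment in (5) is without loss of generality amounts to the uncentered ratio being a monotone transform of the familiar centered one, so either ranks strategies identically. A quick numeric check of ours on simulated strategy returns:

```python
import numpy as np

rng = np.random.default_rng(5)
x = 0.2 + rng.standard_normal(100_000)   # toy strategy returns: mean 0.2, sd 1

sr_uncentered = x.mean() / np.sqrt(np.mean(x ** 2))  # definition (5)
sr_centered = x.mean() / x.std()                     # usual Sharpe ratio
mapped = sr_centered / np.sqrt(1 + sr_centered ** 2) # monotone mapping

print(sr_uncentered, mapped)  # agree up to floating point
```

The identity follows from E[x²] = Var(x) + E[x]², so maximizing one ratio maximizes the other.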
Our analysis centers on the following timing-strategy functional form:
$$\pi_t(\beta) = S_t'\beta. \qquad (6)$$
This strategy takes positions equal to the asset’s conditional expected re-
turn. Note that this timing strategy optimizes the conditional Sharpe ra-
tio. It achieves the same Sharpe ratio as the conditional Markowitz solution,
π_t^{Cond. MV} = E_t[R_{t+1}]/Var_t[R_{t+1}] = S_t'β, according to equation (3). While strategy
π_t is conditionally mean-variance efficient, it is not the optimizer of the uncon-
ditional objective in (5), which takes the form π_t^{Uncond. MV} = S_t'β/(1 + (S_t'β)²).¹⁸
In the proof of Proposition 1 in the Internet Appendix, we show that π_t in equa-
tion (6) and π_t^{Uncond. MV} are equal up to third-order terms.¹⁹ We study π_t = S_t'β
for the simplicity of its linearity in both β and S_t, but note that our conclusions
are identical for π_t^{Uncond. MV} because, in the limit as P → ∞, the normalization
factor 1 + (S_t'β)² converges to a constant.²⁰
Proposition 1 states the behavior of timing strategy π_t = S_t'β when T → ∞
and P/T → 0 (i.e., when the predictive parameter β is known).
¹⁷ Define $\widetilde{SR} = E[R^\pi_{t+1}] / \sqrt{\operatorname{Var}(R^\pi_{t+1})}$. Direct calculation yields $SR = \widetilde{SR} / \sqrt{1 + \widetilde{SR}^2}$.
¹⁸ See Hansen and Richard (1987), Ferson and Siegel (2001), and Abhyankar, Basu, and Stremme
(2012).
¹⁹ In particular, the Sharpe ratio in equation (5) is less than one due to the Cauchy-Schwarz
inequality. We show that the difference in Sharpe ratios for π_t versus π_t^{Uncond. MV} is on the order of
the Sharpe ratio cubed.
²⁰ By a version of Lemma 1, 1 + (S_t'β)² → 1 + b_*ψ₁.
PROPOSITION 1 (Infinite Sample): The unconditional first and second moments
of returns to the infeasible market timing strategy π_t = S_t'β are
$$E[\pi_t R_{t+1}] \to b_*\psi_1 > 0 \quad\text{and}\quad E\big[(\pi_t R_{t+1})^2\big] \to 3(b_*\psi_1)^2 + b_*\psi_1.$$
The infeasible market-timing Sharpe ratio is
$$SR \to \frac{1}{\sqrt{3 + (b_*\psi_1)^{-1}}} < \frac{1}{3^{1/2}}. \qquad (7)$$
For comparison, under Assumptions 1 to 4, the unconditional first and sec-
ond moments of the untimed asset return are (see Lemma 1)
$$E[R_{t+1}] = 0, \quad\text{and}\quad E[R^2_{t+1}] \to 1 + b_*\psi_1.$$
That is, our assumptions imply that the untimed asset has a Sharpe ratio of
zero. This is just a normalization so that any positive market timing Sharpe
ratio can be interpreted as pure excess performance arising from timing ability.
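Proposition 1's limits can be checked by Monte Carlo in the special case of Gaussian signals with Σ = I (our own simulation, in which b_*ψ₁ reduces to β'β):

```python
import numpy as np

rng = np.random.default_rng(6)
P, T = 50, 200_000
b_psi1 = 1.0                              # target value of b_* psi_1

beta = np.full(P, np.sqrt(b_psi1 / P))    # beta' beta = b_psi1 (Sigma = I)
S = rng.standard_normal((T, P))           # Gaussian signals
R = S @ beta + rng.standard_normal(T)     # returns as in equation (3)

timing = (S @ beta) * R                   # pi_t * R_{t+1}
sr = timing.mean() / np.sqrt(np.mean(timing ** 2))
theory = 1 / np.sqrt(3 + 1 / b_psi1)      # equation (7)

print(timing.mean(), sr, theory)
```

With b_*ψ₁ = 1, the simulated mean timing return should sit near 1 and the uncentered Sharpe ratio near the theoretical value 0.5.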
C. Relating Predictive Accuracy to Portfolio Performance
We are ultimately interested in understanding the portfolio properties of a
feasible timing strategy, $\hat\pi_t = S_t'\hat\beta$. This is, of course, intimately tied to the pre-
diction accuracy of the estimator $\hat\beta$, summarized by its expected MSE on an
independent test sample. This is the fundamental notion of estimator “risk”
from statistical theory, although we use the term “MSE” here to avoid confu-
sion with portfolio riskiness. We can write MSE as
$$MSE(\hat\beta) = E\Big[\big(R_{t+1} - S_t'\hat\beta\big)^2 \,\big|\, \hat\beta\Big] = E[R^2_{t+1}] - 2\underbrace{E[\hat\pi_t R_{t+1} \,|\, \hat\beta]}_{\substack{\text{Timing}\\ \text{Expected Return}}} + \underbrace{E[\hat\pi_t^2 \,|\, \hat\beta]}_{\substack{\text{Timing}\\ \text{Leverage}}}. \qquad (8)$$
In other words, the higher the strategy's expected return, the lower the MSE. And the larger the positions—or "leverage"—of the strategy, the larger the MSE. A timing strategy with a higher expected return corresponds to more predictive power, while higher leverage gives the strategy higher variance. Interestingly, these two objects, the expected return and leverage of the timing strategy, appear repeatedly throughout our analysis. The expected return/leverage trade-off in (8) is a financial decomposition of MSE analogous to its statistical decomposition into a bias/variance trade-off.
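The decomposition in (8) is an algebraic identity in any test sample, so it can be confirmed numerically. In this sketch the DGP and the (deliberately noisy) estimate $\hat\beta$ are arbitrary placeholders, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
P, T = 20, 5_000

beta_hat = 0.1 * rng.standard_normal(P)          # any estimate of beta
S = rng.standard_normal((T, P))                  # test-sample signals
R = S @ (0.05 * np.ones(P)) + rng.standard_normal(T)

pi_hat = S @ beta_hat                            # timing position
mse = np.mean((R - pi_hat) ** 2)

# Null second moment, minus twice the timing expected return,
# plus the timing leverage -- exactly the terms in equation (8).
decomp = np.mean(R**2) - 2 * np.mean(pi_hat * R) + np.mean(pi_hat**2)

print(abs(mse - decomp))  # agrees to floating-point precision
```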
Note that a strategy $\pi_t = \beta' S_t$ based on the infeasible true $\beta$ satisfies $E[\pi_t R_{t+1}] = \beta'\Sigma\beta = E[\pi_t^2]$.²¹ In this case, the MSE collapses to $E[R^2_{t+1}] - E[\pi_t R_{t+1}]$ and is minimized, meaning that the leverage taken is exactly justified by the predictive benefits of the strategy. This can also be stated in terms
21 Indeed, $E[(\beta'S_t)^2] = E[\beta' S_t S_t' \beta] = \beta'\Sigma\beta$.
of the infeasible $R^2$ based on equation (3) and Lemma 1:
$$R^2 = \frac{\beta'\Sigma\beta}{\beta'\Sigma\beta + 1} \to \frac{b\psi_{*,1}}{b\psi_{*,1}+1}.$$
Thus, there is a monotonic mapping from the infeasible timing-strategy expected return to the true $R^2$, and from the infeasible Sharpe ratio to the true $R^2$ (see equation (7)).
II. Machine Learning and Random Matrices

The central premise of machine learning is that large data sets can be used in flexible model specifications to improve prediction. This can be understood in the environment above by considering the regime in which the number of predictors, $P$, is large, perhaps even larger than $T$. Our main objective is thus to understand the behavior of optimal timing portfolios as the prediction model becomes increasingly complex, that is, as $P\to\infty$. Because this involves estimating infinite-dimensional parameters, traditional large-$T$ asymptotics do not apply, and hence we instead resort to random matrix theory. In this section, we discuss the ridge estimator and present random matrix theory results at the foundation of our theoretical characterization of high-complexity timing strategies.
A. Least Squares Estimation

Throughout, we analyze (regularized) least-squares estimators taking the form
$$\hat\beta(z) = \Big(zI + T^{-1}\sum_t S_t S_t'\Big)^{-1} \frac{1}{T}\sum_t S_t R_{t+1}$$
for a given ridge shrinkage parameter, $z$. The ridge-regularized form is necessary for characterizing $\hat\beta(z)$ in the high-complexity regime, $P/T \to c > 1$, although we will see that it also has important implications for the behavior of $\hat\beta(z)$ when $P/T < 1$.²²
Consider first the OLS estimator, $\hat\beta(0)$. As $P$ approaches $T$ from below, the denominator of the least-squares estimator approaches singularity. This produces explosive variance of $\hat\beta(0)$ and, in turn, explosive forecast error variance. As $P \to T$, the model begins to fit the data with zero error, so a common
22 One could alternatively analyze "sparse" least-squares models that combine shrinkage with variable selection (e.g., based on LASSO). First, recent evidence of Giannone, Lenza, and Primiceri (2021) suggests that sparsity of predictive relationships in economics and finance is likely an illusion. Second, our empirical focus is on nonparametric models that seek to approximate a generic nonlinear function as a linear combination of generated features, and sparsity in the generated feature space is difficult to identify (see, for example, Ghorbani et al. (2020)). Third, analysis with $\ell_1$ shrinkage is significantly more taxing from a theoretical standpoint. We thus leave sparse least-squares models to future research.
interpretation of the explosive variance of $\hat\beta(0)$ is an insidious overfit that does not generalize out-of-sample.
When $P$ moves beyond $T$, there are more parameters than observations and the least-squares problem has multiple solutions. A particularly interesting solution invokes the Moore-Penrose pseudo-inverse, $\big(T^{-1}\sum_t S_t S_t'\big)^{+}\frac{1}{T}\sum_t S_t R_{t+1}$.²³ This solution is equivalent to the ridge estimator as the shrinkage parameter approaches zero:
$$\hat\beta(0^+) = \lim_{z\to 0^+}\Big(zI + T^{-1}\sum_t S_t S_t'\Big)^{-1}\frac{1}{T}\sum_t S_t R_{t+1}.$$
The solution $\hat\beta(0^+)$ is often referred to as the "ridgeless" regression estimator. When $P<T$, OLS is the ridgeless estimator. At $P=T$, there is still a unique least-squares solution, but the model can exactly fit the training data (for this reason, $P=T$ is called the "interpolation boundary"). When $P>T$, the ridgeless estimator is one of many solutions that exactly fit the training data, but among these, it is the only solution that achieves the minimum $\ell_2$ norm $\|\hat\beta(z)\|$ (Hastie et al. (2022)). The machine learning literature has recently devoted substantial attention to understanding ridgeless regression in the high-complexity regime. The counterintuitive insight from this literature is that, beyond the interpolation boundary, allowing the model to become more complex in fact regularizes the behavior of least-squares regression despite using infinitesimal shrinkage. We explore the implications of this idea for market timing in the subsequent sections.
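A small numerical sketch (not from the paper) illustrates these claims: with $P>T$, the small-$z$ ridge estimator approaches the pseudo-inverse solution, that solution interpolates the training data exactly, and it has the smallest $\ell_2$ norm among exact-fit solutions.

```python
import numpy as np

rng = np.random.default_rng(2)
T, P = 60, 120                            # beyond the interpolation boundary

S = rng.standard_normal((T, P))
R = rng.standard_normal(T)
Sigma_hat = S.T @ S / T                   # sample signal covariance
target = S.T @ R / T

def beta_ridge(z):
    return np.linalg.solve(z * np.eye(P) + Sigma_hat, target)

beta_ridgeless = np.linalg.pinv(Sigma_hat) @ target   # Moore-Penrose solution

# (1) Small z approximates the ridgeless limit beta(0+).
assert np.allclose(beta_ridge(1e-8), beta_ridgeless, atol=1e-5)

# (2) The ridgeless solution fits the training data exactly when P > T.
assert np.allclose(S @ beta_ridgeless, R, atol=1e-6)

# (3) Adding a null-space direction of S gives another exact-fit
#     solution, but with a strictly larger l2 norm.
null_dir = np.linalg.svd(S)[2][-1]        # right-singular vector with zero singular value
beta_other = beta_ridgeless + 0.5 * null_dir
assert np.allclose(S @ beta_other, R, atol=1e-6)
assert np.linalg.norm(beta_other) > np.linalg.norm(beta_ridgeless)
print("ridgeless norm:", round(np.linalg.norm(beta_ridgeless), 3))
```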
B. The Role of Random Matrix Theory

We analyze the behavior of $\hat\beta(z)$ and associated market-timing strategies in the limit as $P\to\infty$. This is possible due to a remarkable connection between ridge regression and random matrix theory.
In regression analysis, the sample covariance matrix of signals, $\hat\Sigma := T^{-1}\sum_t S_t S_t'$, naturally plays a central role. But no general characterization exists for the behavior of $\hat\Sigma$ in the limit as $P,T\to\infty$. However, the tools of random matrix theory characterize one aspect of $\hat\Sigma$—the distribution of its eigenvalues. Fortunately, as we show, the prediction and portfolio performance properties of least-squares estimators rely only on the eigenvalue distribution of $\hat\Sigma$. Thus, random matrix theory facilitates a rich understanding of machine learning portfolios. Here, we elaborate on the core results from random matrix theory that we build upon.
First, to understand the central role of $\hat\Sigma$'s eigenvalue distribution in determining the limiting behavior of the least-squares estimator, suppose for the moment that we could replace $\hat\Sigma$ with its true unobservable signal covariance,
23 Recall that the Moore-Penrose pseudo-inverse $A^+$ of a matrix $A$ is defined as $A^+ = \lim_{z\to 0^+}(zI + A'A)^{-1}A' = \lim_{z\to 0^+} A'(zI + AA')^{-1}$.
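The two limits in footnote 23 are easy to confirm numerically for a generic rectangular matrix (a quick check, not part of the paper's argument):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 7))           # a generic rectangular matrix
z = 1e-8                                  # small ridge standing in for z -> 0+

pinv = np.linalg.pinv(A)
via_AtA = np.linalg.solve(z * np.eye(7) + A.T @ A, A.T)   # (zI + A'A)^{-1} A'
via_AAt = A.T @ np.linalg.inv(z * np.eye(4) + A @ A.T)    # A' (zI + AA')^{-1}

print(np.max(np.abs(pinv - via_AtA)), np.max(np.abs(pinv - via_AAt)))
```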
$\Sigma$. For any symmetric matrix $\Sigma$, a convenient matrix identity states that
$$\frac{1}{P}\operatorname{tr}(\Sigma - zI)^{-1} = \frac{1}{P}\sum_{i=1}^{P}\big(\lambda_i(\Sigma) - z\big)^{-1},$$
where $\lambda_i(\Sigma)$ are the eigenvalues of $\Sigma$. Using formula (4), we can rewrite this identity as
$$\frac{1}{P}\operatorname{tr}(\Sigma - zI)^{-1} = \int \frac{1}{x-z}\, dF_{\Sigma}(x),\quad z<0.$$
From this identity, we immediately see the fundamental connection between ridge regularization and the distribution of eigenvalues for $\Sigma$. The right-hand side quantity is the Stieltjes transform of the eigenvalue distribution of $\Sigma$, denoted $F_\Sigma$. By Assumption 3, this distribution is well behaved when $P\to\infty$ and converges to a nonrandom distribution $H$. We therefore have
$$m(z) := \int \frac{1}{x-z}\, dH(x) = \lim_{P\to\infty}\frac{1}{P}\operatorname{tr}(\Sigma - zI)^{-1}. \tag{9}$$
The function $m(z)$ is the limiting Stieltjes transform of the eigenvalue distribution of $\Sigma$. Equation (9) is a powerful step toward understanding the least-squares estimator in the machine learning regime (and hence machine learning predictions and portfolios). It states that key properties of the limiting inverse of the ridge-regularized signal covariance matrix can be characterized entirely if we know $\Sigma$'s eigenvalue distribution.
The problem, of course, is that the true $\Sigma$ is unobservable. We only observe its sample counterpart, $\hat\Sigma$, and thus, we only have empirical access to the Stieltjes transform of $\hat\Sigma$'s eigenvalues. The empirical counterpart to the unobservable $m(z)$ is
$$m(z;c) := \lim_{P\to\infty}\frac{1}{P}\operatorname{tr}(\hat\Sigma - zI)^{-1}.$$
In traditional finite-$P$ statistics, we would have convergence between the sample covariance $\hat\Sigma$ and the true covariance $\Sigma$ as $T\to\infty$. One might be tempted to think that $\lim_{P\to\infty}\frac{1}{P}\operatorname{tr}\big((\hat\Sigma - zI)^{-1}\big)$ and $\lim_{P\to\infty}\frac{1}{P}\operatorname{tr}\big((\Sigma - zI)^{-1}\big)$ also converge as $T\to\infty$. But this is not the case. The limiting eigenvalue distributions of $\hat\Sigma$ and $\Sigma$ remain divergent in the limit as $T\to\infty$ if $P/T\to c>0$. Here, we see a first glimpse of the complexity of machine learning and how random matrix theory can help us understand it. In the Internet Appendix (see Theorem 2), we show how $m(-z;c)$ can be computed from $m(-z)$ using results of Silverstein and Bai (1995) and Bai and Zhou (2008). In particular, $m(-z;c) > m(-z;0) = m(-z)$ for all $c>0$.²⁴ The next result shows that,
24 Theorem 2 in the Internet Appendix is a generalized version of the Marčenko and Pastur (1967) theorem that accommodates non-i.i.d. $S_t$. When signals are i.i.d. with $\Sigma = I$ and $m(-z) = (1+z)^{-1}$,
remarkably, if we constrain ourselves to linear ridge regression estimators, all asymptotic expressions depend only on $m(-z;c)$ and do not require $m$.²⁵

PROPOSITION 2: We have
$$\lim_{T\to\infty}\frac{1}{T}\operatorname{tr}\big((zI + \hat\Sigma)^{-1}\Sigma\big) \to \xi(z;c) \tag{10}$$
almost surely, where
$$\xi(z;c) = \frac{1 - z\, m(-z;c)}{c^{-1} - 1 + z\, m(-z;c)}.$$
The quantity $\operatorname{tr} E\big[(zI + \hat\Sigma)^{-1}\Sigma\big]$ appears in virtually every expression we analyze to describe portfolio behavior. It depends on the interaction between the sample and true signal covariance matrix and arises in the computation of both the expected return and leverage of the timing strategy (see equation (8)). One might imagine, therefore, that we need to know the limiting eigenvalue distribution of both matrices (or their Stieltjes transforms, $m$ and $m(\cdot;c)$) to describe $\operatorname{tr} E\big[(zI + \hat\Sigma)^{-1}\Sigma\big]$. Proposition 2 shows that this is not the case—we only need to know the empirical version, $m(-z;c)$. This is a powerful result. It will allow us to quantify the expected out-of-sample behavior of machine learning portfolios based only on the eigenvalue distribution of the sample signal covariance $\hat\Sigma$ (which is observable) without requiring that we know the eigenvalues of $\Sigma$.²⁶
We refer to the constant $c$ as "model complexity," which (as the preceding results show) plays a critical role in understanding model behavior. It describes the limiting ratio of predictors to data points: $P/T\to c$. When $T$ grows at a faster rate than the number of predictors (i.e., $c\to 0$), the limiting eigenvalue distributions of $\hat\Sigma$ and $\Sigma$ converge: $m(-z;0) = m(-z)$. As $c$ becomes positive, these distributions fail to converge, and their divergence is wider for larger $c$. It is, therefore, clear that the behavior of the least-squares estimator in the machine learning regime will differ from the true coefficient, even when $T\to\infty$, as long as $c>0$. As a result, machine learning portfolios will suffer relative to the infeasible performance in Proposition 1 despite abundant data. However, while machine learning portfolios underperform the infeasible strategy, they can continue to generate substantial trading gains. This is true even in the ridgeless case. Additional ridge shrinkage can boost performance even further. We precisely characterize these behaviors in the following sections.
Marčenko and Pastur (1967) show that
$$m(-z;c) = \frac{-\big((1-c)+z\big) + \sqrt{\big((1-c)+z\big)^2 + 4cz}}{2cz}.$$
By direct calculation, the expression above is indeed the unique positive solution to (IA4) when $m(-z) = (1+z)^{-1}$. While the eigenvalue distributions of the sample and true covariance matrices do not coincide, Theorem 2 describes the precise nonlinear way they relate to each other. In particular, when $P>T$, the matrix $\hat\Sigma$ has $P-T$ zero eigenvalues and therefore $P^{-1}\operatorname{tr}\big((zI+\hat\Sigma)^{-1}\big)$ contains a singular part, $P^{-1}(P-T)z^{-1} = (1-c^{-1})z^{-1}$.
25 It is possible to develop nonlinear shrinkage estimators analogous to those developed by Ledoit and Wolf (2020) for covariance matrices. Such estimators would require knowledge of the true eigenvalue distribution of $\Sigma$, which can be recovered from $m(-z;c)$ using equation (IA4).
26 Heuristically, $E[\hat\Sigma] = \Sigma$ and hence $\operatorname{tr} E\big[(zI+\hat\Sigma)^{-1}\Sigma\big] \approx \operatorname{tr} E\big[(zI+\hat\Sigma)^{-1}\hat\Sigma\big]$. However, random matrix corrections make the true relationship nonlinear.
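Both the closed-form Marčenko-Pastur transform in footnote 24 and the limit in Proposition 2 can be checked by simulation. The sketch below is illustrative, assuming i.i.d. standard normal signals so that $\Sigma = I$; it compares the empirical Stieltjes transform of $\hat\Sigma$'s eigenvalues with the closed form, and the trace in (10) with $\xi(z;c)$.

```python
import numpy as np

rng = np.random.default_rng(4)
T, P, z = 1_000, 2_000, 0.5            # complexity c = P/T = 2
c = P / T

S = rng.standard_normal((T, P))        # i.i.d. signals, Sigma = I
lam = np.linalg.eigvalsh(S.T @ S / T)  # eigenvalues of Sigma_hat

# Empirical Stieltjes transform at -z (includes the P - T zero eigenvalues).
m_emp = np.mean(1.0 / (lam + z))

# Closed-form Marcenko-Pastur value (footnote 24).
m_mp = (-((1 - c) + z) + np.sqrt(((1 - c) + z) ** 2 + 4 * c * z)) / (2 * c * z)

# Proposition 2: (1/T) tr((zI + Sigma_hat)^{-1} Sigma) -> xi(z;c); Sigma = I here.
xi_emp = np.sum(1.0 / (lam + z)) / T
xi_th = (1 - z * m_mp) / (1 / c - 1 + z * m_mp)

print(round(m_emp, 4), round(m_mp, 4), round(xi_emp, 4), round(xi_th, 4))
```

With these inputs, both pairs agree to a few decimal places even at this moderate dimension.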
III. Prediction and Performance in the Machine Learning Regime

In this section, we analyze correctly specified models. We present the theoretical characterizations of machine learning models in terms of prediction accuracy and portfolio performance. We then illustrate their behavior in a calibrated theoretical setting.
A. Expected Out-of-Sample $R^2$

To understand a model's prediction accuracy in the high-complexity regime, we study its limiting MSE, defined as
$$\mathrm{MSE}(z;c) = \lim_{T,P\to\infty,\, P/T\to c} E\Big[\big(R_{t+1} - S_t'\hat\beta(z)\big)^2 \,\Big|\, \hat\beta(z)\Big]. \tag{11}$$
Notably, while $\hat\beta(z)$ is random and depends on the sample realization, we show below that the limit in (11) is nonrandom. The arguments $z$ and $c$ are central to understanding the limiting predictive ability of least squares. Respectively, they describe the extent of ridge shrinkage and the complexity of the DGP (and thus of the correctly specified model).
In finance and economics, it is common to state predictive performance in terms of $R^2$ rather than MSE. We denote the limiting out-of-sample $R^2$ as
$$R^2(z;c) = 1 - \frac{\mathrm{MSE}(z;c)}{\lim_{T,P\to\infty} E[R^2_{t+1}]},$$
where $\lim E[R^2_{t+1}]$ is the null MSE when $\beta = 0$.
In Section I.C above, we discuss the infeasible maximum $R^2$, or
$$R^2(0;0) = \frac{b\psi_{*,1}}{1+b\psi_{*,1}}.$$
This corresponds to a data-rich environment ($c=0$, so observations vastly outnumber parameters) and OLS regression ($z=0$). The $R^2(0;0)$ is the benchmark for evaluating the loss of predictive accuracy due to high model complexity, even when data are abundant. Specifically, the $R^2$ of the least-squares estimator in the machine learning regime behaves as follows.
PROPOSITION 3: In the limit as $T,P\to\infty$ and $P/T\to c$, we have
$$E(z;c) = \lim E\big[\hat\pi_t R_{t+1} \mid \hat\beta(z)\big] = b\,\nu(z;c),$$
$$L(z;c) = \lim E\big[\hat\pi_t^2 \mid \hat\beta(z)\big] = b\,\hat\nu(z;c) - c\,\nu'(z;c), \tag{12}$$
$$R^2(z;c) = \frac{2E(z;c) - L(z;c)}{1+b\psi_{*,1}},$$
where
$$\nu(z;c) = \psi_{*,1} - c^{-1} z\,\xi(z;c) = \lim P^{-1}\operatorname{tr}\big(\hat\Sigma(zI+\hat\Sigma)^{-1}\Sigma\big) > 0,$$
$$\nu'(z;c) = -c^{-1}\big(\xi(z;c) + z\,\xi'(z;c)\big) = -\lim P^{-1}\operatorname{tr}\big(\hat\Sigma(zI+\hat\Sigma)^{-2}\Sigma\big) < 0,$$
$$\hat\nu(z;c) = \nu(z;c) + z\,\nu'(z;c) = \lim P^{-1}\operatorname{tr}\big(\hat\Sigma^2(zI+\hat\Sigma)^{-2}\Sigma\big) > 0.$$
As we show in the Internet Appendix, these limits exist in probability. Furthermore, $R^2(z;c)$ is monotone increasing in $z$ for $z < z_* = c/b$, and decreasing in $z$ for $z > z_*$. The $R^2(z;c)$ attains its maximum at $z_* = c/b$, where it is positive and given by
$$R^2(z_*;c) = R^2(0;0) - \frac{\xi(z_*;c)}{1+b\psi_{*,1}} = \frac{b\,\nu(z_*;c)}{1+b\psi_{*,1}} > 0.$$
In the ridgeless limit, assuming $H(0^+) = 0$, we have
$$R^2(0;c) = R^2(0;0) - (1+b\psi_{*,1})^{-1}\begin{cases}(c^{-1}-1)^{-1}, & c<1\\ \mu(c), & c>1\end{cases} \tag{13}$$
for some $\mu(c) > 0$ with $\mu(1^+) = +\infty$. Lastly, we have
$$\lim_{c\to\infty} R^2(0;c) = 0 > \lim_{c\to 1} R^2(0;c) = -\infty. \tag{14}$$
When the prediction model is complex ($c>0$), the limiting eigenvalues of $\hat\Sigma$ and $\Sigma$ diverge, and this unambiguously reduces the predictive $R^2$ relative to the infeasible best, $R^2(0;0)$. Intuitively, because the frictionless $R^2(0;0)$ is fixed, as $c$ increases, the investor must learn the same amount of predictability but spread across many sources, and this dimensionality expansion hinders statistical inference. The degradation in predictive accuracy due to complexity can be so severe that the expected out-of-sample $R^2$ becomes extremely negative, particularly in the ridgeless case. Shrinkage can mitigate this and help preserve accuracy amid complexity. Shrinkage controls variance but introduces bias. Proposition 3 points out that the amount of shrinkage that optimizes the bias-variance trade-off is $z_* = c/b$.²⁷ More complex settings benefit from heavier shrinkage, while settings with a higher signal-to-noise ratio (higher $b$) benefit from lighter shrinkage (see, for example, Hastie et al. (2022)). The quantities $E$ and $L$ are the limiting out-of-sample expected return and leverage of the timing strategy. Proposition 3 shows that these are the main determinants of the out-of-sample $R^2$.
Figure 1 illustrates the theoretical behavior of the least-squares estimator derived in Proposition 3. The plots set $\Sigma$ to the identity matrix and fix $b = 0.2$
27 Note that the optimal shrinkage must be inferred from an estimate of $b$. Our theoretical and empirical results indicate a general insensitivity of prediction and timing-strategy performance to the choice of $z$ in the high-complexity regime. As a result, simple shrinkage selection methods like cross-validation tend to perform well.
Figure 1. Expected out-of-sample $R^2$ and norm of least-squares coefficient. This figure shows the limiting out-of-sample $R^2$ and $\|\hat\beta\|$ norm as a function of $c$ and $z$ from Proposition 3, assuming $\Sigma$ is the identity matrix and $b = 0.2$.
(recall that $\sigma^2$ is normalized to one). The left panel draws the expected out-of-sample $R^2$ as a function of model complexity $c$ (shown on the x-axis) and ridge penalty $z$ (different curves). In this calibration, the infeasible maximum predictive $R^2$ (that uses the true parameter values) is the dotted red line and provides a reference point. Throughout the paper, we refer to plots like these, which describe the model's performance as a function of model complexity, as "VoC curves."
The black line shows the $R^2$ in the ridgeless limit. When $c \le 1$, the ridgeless limit corresponds to exactly $z=0$ (i.e., OLS). On this side of $c=1$, predictive accuracy deteriorates rapidly as model complexity increases. This captures the well-known property that OLS suffers when the number of predictors is large relative to the number of data points. As $c\to 1$, the denominator of the OLS estimator approaches singularity, and the expected out-of-sample $R^2$ dives. To the right of $c=1$, the number of predictors exceeds the sample size, and the "ridgeless" case is defined as the limit as $z\to 0$ (i.e., when the least-squares denominator is calculated via the pseudo-inverse of $\hat\Sigma$). Counterintuitively, the $R^2$ begins to rise as model complexity increases.²⁸
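Under this calibration ($\Sigma = I$, $b = 0.2$), the VoC curves can be traced directly from the closed forms: the Marčenko-Pastur expression for $m(-z;c)$ gives $\xi$, then $\nu$, $\nu'$, $\hat\nu$, and finally $E$, $L$, and $R^2$ via Proposition 3. The sketch below uses a numerical derivative for $\xi'$ and a small $z$ as a stand-in for the ridgeless limit.

```python
import numpy as np

b, psi1 = 0.2, 1.0                     # Figure 1 calibration, Sigma = I

def m_mp(z, c):
    # Marcenko-Pastur Stieltjes transform at -z (footnote 24).
    return (-((1 - c) + z) + np.sqrt(((1 - c) + z) ** 2 + 4 * c * z)) / (2 * c * z)

def xi(z, c):
    m = m_mp(z, c)
    return (1 - z * m) / (1 / c - 1 + z * m)

def r2(z, c, h=1e-6):
    # Proposition 3: nu, nu', nu_hat, then E (return), L (leverage), R^2.
    nu = psi1 - z * xi(z, c) / c
    nu_p = -(xi(z, c) + z * (xi(z + h, c) - xi(z - h, c)) / (2 * h)) / c
    nu_hat = nu + z * nu_p
    E, L = b * nu, b * nu_hat - c * nu_p
    return (2 * E - L) / (1 + b * psi1)

cs = [0.2, 0.9, 1.5, 5.0, 20.0]
print("ridgeless:", [round(r2(1e-4, c), 2) for c in cs])
print("tuned z* :", [round(r2(c / b, c), 3) for c in cs])
```

The ridgeless curve dives as $c\to 1$ and then recovers toward zero from below, while at the optimal shrinkage $z_* = c/b$ the $R^2$ is positive for every $c$, mirroring the left panel of Figure 1.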
The reason is that, while there are many equivalent $\beta$ solutions that exactly fit²⁹ the training data when $c>1$, ridgeless regression selects the solution with the smallest norm. As complexity increases, there are more solutions for ridgeless regression to search over, and thus it can find smaller and smaller betas that still exactly fit the training data. This acts as shrinkage, biasing the beta estimate toward zero. Due to this bias, the forecast variance drops, improving the $R^2$. In other words, despite $z\to 0$, the ridgeless solution still regularizes the least-squares estimator, and more so, the larger is $c$. This property of
28 This is an illustration of what the statistics literature refers to as benign overfitting.
29 That is, $\beta'S_t = R_{t+1}$ for all $t \in \{1,\ldots,T\}$.
ridgeless least squares is a newly documented phenomenon in the statistics literature and an emerging topic of research.³⁰ It shows that even in very simple DGPs, one may be able to improve the accuracy of return forecasts by pushing model dimensionality well beyond sample size.
The remaining curves in Figure 1 show how the out-of-sample $R^2$ is affected by nontrivial ridge shrinkage. Allowing $z>0$ improves $R^2$ except at very low levels of complexity. This is again a manifestation of the bias-variance trade-off. When $z>0$, the norm of $\hat\beta$ is controlled, and the associated variance reduction outweighs the effects of bias when the model is complex.
It is useful to place our analysis thus far in the context of the literature. Some formulas of Propositions 2 and 3 have been established in papers on random matrix theory (e.g., Ledoit and Péché (2011)). Hastie et al. (2022) prove an analog of Proposition 3 allowing for arbitrary $\beta$ and expressing all quantities in terms of the distribution of projections of $\beta$ onto the eigenvectors of $\Sigma$ (see also Wu and Xu (2020)). Furthermore, they establish nonasymptotic bounds on the rate of convergence. However, both Hastie et al. (2022) and Wu and Xu (2020) require that $\Sigma$ is strictly positive definite. By contrast, in our data analysis, we find that $\Sigma$ is nearly degenerate. Richards, Mourtada, and Rosasco (2021) also allow for more general $\beta$ structures and $\Sigma$ matrices, but require that $X_t$ be i.i.d. Gaussian, and Dobriban and Wager (2018) require that $X_t$ be i.i.d. This is clearly not applicable to the RFFs used in our empirical analysis (or any other nonlinear signal transformations). In contrast to these papers, we establish our results under much weaker conditions on the distribution of $X_{i,t}$ across $i$. This is important for practical applications, where neither the independence of $X_t$ nor equality (or boundedness) of their higher moments can be guaranteed. Lastly, the novel techniques that we develop allow us to characterize the out-of-sample performance of misspecified models. While Hastie et al. (2022) study misspecification in certain stylized examples, we derive more general results allowing for generic covariance structures. To the best of our knowledge, our characterization is new in the literature (see Section IV).
Our main theoretical contribution is in the following sections, where we derive portfolio performance properties.
B. Expected Out-of-Sample Market Timing Performance

We analyze the behavior of market timing based on the least-squares estimate,
$$\hat\pi_t(z) = \hat\beta(z)' S_t.$$
Formula (12) derives the expected return of this strategy. The following proposition characterizes the expected out-of-sample risk-return trade-off of market timing in the high-complexity regime.
30 See Spigler et al. (2019), Belkin et al. (2019), Belkin, Rakhlin, and Tsybakov (2019), Belkin, Hsu, and Xu (2020), and Hastie et al. (2022).
Figure 2. Expected out-of-sample risk and return of market timing. This figure shows the limiting out-of-sample expected return and volatility of the market timing strategy as a function of $c$ and $z$ from Proposition 3, assuming $\Sigma$ is the identity matrix and $b = 0.2$.
PROPOSITION 4: In the limit when $P,T\to\infty$ and $P/T\to c$, the limiting second moment of the market timing strategy is
$$V(z;c) := \lim E\big[(\hat\pi_t(z) R_{t+1})^2 \,\big|\, \hat\beta\big] = 2\big(E(z;c)\big)^2 + \big(1+b\psi_{*,1}\big)L(z;c)$$
in probability, with $E$ and $L$ given in (12). As a result, the Sharpe ratio satisfies
$$SR(z;c) = \frac{E(z;c)}{\sqrt{V(z;c)}} = \frac{1}{\sqrt{2 + \big(1+b\psi_{*,1}\big)\dfrac{L(z;c)}{(E(z;c))^2}}}. \tag{15}$$
Furthermore, we have
(i) $E(z;c)$ is monotone decreasing in $z$ and hence $0 < E(z;c) < E(0;c) < E(0;0)$, and
(ii) $SR(z;c)$ is monotone increasing in $z$ for $z < z_* = c/b$ and monotone decreasing in $z$ for $z > z_*$. Thus, the maximal Sharpe ratio is given by
$$SR(z_*;c) = \frac{1}{\sqrt{2 + \big(1+b\psi_{*,1}\big)\dfrac{1}{b\,\nu(z_*;c)}}} < SR(0;0), \tag{16}$$
where $E(0;0)$ and $SR(0;0)$ are the infeasible market-timing expected return and Sharpe ratio from Proposition 1.
The left panel of Figure 2 plots the expected out-of-sample return and the right panel plots the expected out-of-sample volatility based on Propositions 3 and 4, using the same calibration as Figure 1. Again, the ridgeless case is in black. The expected returns of least-squares timing strategies are always positive because they are quadratic in beta. When $c<1$ (i.e., in the OLS case), the ridgeless timing strategy achieves the true expected return even though the
Figure 3. Expected out-of-sample Sharpe ratio of market timing. This figure shows the limiting out-of-sample Sharpe ratio of the market timing strategy as a function of $c$ and $z$ from Proposition 3, assuming $\Sigma$ is the identity matrix and $b = 0.2$.
corresponding $R^2$ is significantly negative in much of this range. The fact that the out-of-sample expected return is unimpaired reflects the unbiasedness of OLS, while the declining $R^2$ reflects the increasing forecast variance as $c$ rises toward one. The return volatility of the timing strategy is likewise increasing in $c$ for $c\in[0,1]$ due to the rising forecast variance and maxes out at $c=1$.
When $c>1$, the ridgeless expected return begins to deteriorate. This is more subtle and is related to the rising $R^2$ discussed above. When model complexity is high, the multiplicity of least-squares solutions allows ridgeless regression to find a low-norm beta that exactly fits the training data. So, even though $z\to 0$, the ridgeless beta is biased, and the expected return of the strategy falls. At the same time, the volatility of the strategy falls.
The other expected return and volatility curves show that the bias induced by a nontrivial ridge penalty eats into the timing strategy even for $c<1$. But the bright side of this attenuation is a reduction in the strategy's riskiness. For relatively high shrinkage levels like $z=1$, the volatility of the timing strategy drops even below that of the infeasible best strategy while maintaining a meaningfully positive expected return.
The net effect of these expected return and volatility behaviors is summarized by the market timing strategy's expected out-of-sample Sharpe ratio, given in Proposition 4. The calibrated Sharpe ratio is shown in Figure 3. Recall that the buy-and-hold Sharpe ratio is normalized to zero. The key implication of Proposition 4 is that despite the sometimes massively negative predictive $R^2$, the ridgeless Sharpe ratio is everywhere positive, even for extreme levels of model complexity. At $c=1$, the Sharpe ratio drops to near zero, not because the strategy is unprofitable (it remains maximally profitable in an expected return sense) but because its volatility explodes.
Another interesting aspect of Figure 3 is that the Sharpe ratio benefits from nontrivial ridge shrinkage regardless of model complexity. Shrinkage is most
valuable near $c=1$, where it reins in volatility substantially more than it reduces expected return. At both low levels of complexity ($c \approx 0$) and high levels of complexity ($c \gg 1$), the Sharpe ratio is relatively insensitive to $z$.
Proposition 4 also implies that when the model is correctly specified, the shrinkage that optimizes the expected out-of-sample $R^2$ also optimizes the Sharpe ratio. This is convenient because it means that one can focus on tuning the prediction model and be confident that the tuned $z$ will optimize timing performance. Two caveats, however, are in order. The first is that this statement applies to the Sharpe ratio, so if investors judge their performance with other criteria, then other levels of shrinkage may be optimal. For example, a risk-neutral investor prefers ridgeless regression despite its comparatively poor performance in $R^2$. Second, this statement requires correct specification. If the empirical model is misspecified, the optimal amount of shrinkage can differ depending on whether the objective is to maximize the out-of-sample $R^2$ or the Sharpe ratio.
C. A Note on $R^2$

At this point, we already see that a timing strategy with negative $R^2$ can have high average out-of-sample returns and thus positive out-of-sample Sharpe ratios.³¹ More plainly, the positivity of out-of-sample $R^2$ is not a necessary condition for an economically valuable timing strategy. The least-squares timing strategies in our framework all have strictly positive out-of-sample expected return and Sharpe ratio regardless of shrinkage or model complexity (despite having enormously negative $R^2$ in many cases).
This is an important contrast versus the mapping from $R^2$ to the timing Sharpe ratio proposed by Campbell and Thompson (2008), which is an often-used heuristic for interpreting the economic benefits of a predictive $R^2$. Their mapping is a population mapping, meaning that it corresponds to the special case of an analyst using a correctly specified model with $c=0$ (i.e., infinitely more data than parameters). In contrast, our analysis characterizes expected out-of-sample $R^2$ and Sharpe ratios for generic $c$, even with misspecified models (see Section IV).
Out-of-sample $R^2$ and Sharpe ratio measurements serve different purposes. The $R^2$ helps evaluate forecast accuracy, while the Sharpe ratio is appropriate for evaluating the economic value of forecasts in asset allocation contexts. Much of the empirical literature on return prediction and market timing focuses its evaluations on out-of-sample predictive $R^2$ (see, for example, Goyal and Welch (2008)). Proposition 4 ensures that we can worry less about the positivity of out-of-sample $R^2$ from a prediction model and focus more on the out-of-sample performance of timing strategies based on those predictions.
31 To see this in a simple example, consider a model with one predictor and imagine estimating a predictive coefficient that happens to be a large scalar multiple of the truth. In this case, the $R^2$ will be pushed negative, but the predictions will be perfectly correlated with the true expected return. Thus, the expected return of the timing strategy will be positive. Furthermore, because the Sharpe ratio is independent of scale effects, this timing strategy's Sharpe ratio will equal the actual Sharpe ratio of the DGP.
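A small Monte Carlo makes the point concrete. This is a sketch under simple assumptions ($\Sigma = I$, Gaussian signals, $b = \beta'\beta = 0.5$, $c = 1.5$), not calibrated to the paper's empirics: a ridgeless-trained timing strategy posts a sharply negative out-of-sample $R^2$ yet a clearly positive Sharpe ratio, and shrinking toward $z_* = c/b$ improves both.

```python
import numpy as np

rng = np.random.default_rng(5)
T, P, b = 400, 600, 0.5                      # complexity c = P/T = 1.5
beta = rng.standard_normal(P)
beta *= np.sqrt(b) / np.linalg.norm(beta)

def sample(n):
    S = rng.standard_normal((n, P))
    return S, S @ beta + rng.standard_normal(n)

S, R = sample(T)                             # training data
S_oos, R_oos = sample(20_000)                # independent test data

def evaluate(beta_hat):
    pi = S_oos @ beta_hat                    # out-of-sample timing positions
    r2 = 1 - np.mean((R_oos - pi) ** 2) / np.mean(R_oos**2)
    sr = np.mean(pi * R_oos) / np.sqrt(np.mean((pi * R_oos) ** 2))
    return r2, sr

# Ridgeless (minimum-norm) fit vs. ridge at z* = c/b.
beta_ridgeless = np.linalg.lstsq(S, R, rcond=None)[0]
Sigma_hat = S.T @ S / T
beta_ridge = np.linalg.solve((P / T / b) * np.eye(P) + Sigma_hat, S.T @ R / T)

print("ridgeless:", evaluate(beta_ridgeless))  # negative R^2, positive SR
print("ridge z* :", evaluate(beta_ridge))      # both measures improve markedly
```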
IV. Machine Learning and Model Misspecification

So far we have studied the behavior of machine learning portfolios as a function of the complexity of the true DGP while assuming we have the correctly specified model. Under correct specification, the complexity comparative statics in Figures 1 to 3 change both the empirical and the true model as we vary $c$, and thus, these theoretical comparative statics cannot be taken to the data. Nevertheless, theory grounded on correct model specification is powerful for developing a conceptual understanding of machine learning portfolios.
A more empirically relevant theoretical setting would consider a single true DGP. It would then consider empirical models that are always a misspecified approximation to this DGP. Finally, it would make comparisons by increasing the complexity of the empirical model to achieve an increasingly accurate approximation of the true DGP. We develop this theory now.
We consider a true DGP with $P$ predictors. We consider an expanding set of empirical models to approximate the DGP. Each model is indexed by $P_1 = 1,\ldots,P$ and corresponds to an economic agent observing only a subset of the signals, $S^{(1)}_t = (S_{i,t})_{i=1}^{P_1}$. We use $S^{(2)}_t = (S_{i,t})_{i=P_1+1}^{P}$ to denote the remaining unobserved signals. The signal covariance matrix corresponding to this partition is
$$\Sigma = \begin{pmatrix}\Sigma_{1,1} & \Sigma_{1,2}\\ \Sigma_{1,2}' & \Sigma_{2,2}\end{pmatrix}.$$
Naturally, misspecified estimator behavior depends on the correlation structure of observed and unobserved signals captured by the off-diagonal blocks of $\Sigma$.
We make the following technical assumption, which ensures that estimators in the machine learning regime have well-behaved limits.

ASSUMPTION 5: For any sequence $P_1\to\infty$ such that $P_1/P \to q > 0$, the eigenvalue distribution of the matrix $\Sigma_{1,1}$ converges to a nonrandom probability distribution $H(x;q)$. We say that signals are sufficiently mixed if $H(x;q)$ is independent of $q$. We also use
$$\psi_{*,k}(q) = \lim_{P_1\to\infty} P_1^{-1}\operatorname{tr}\big(\Sigma_{1,1}^k\big),\quad k\ge 1,$$
to denote asymptotic moments of the eigenvalues of $\Sigma_{1,1}$.
In a misspecified model, the (regularized) least-squares estimator is
$$\hat\beta(z;q) = \big(zI + \hat\Sigma_{1,1}\big)^{-1}\frac{1}{T}\sum_t S^{(1)}_t R_{t+1} \in \mathbb{R}^{P_1},$$
where
$$\hat\Sigma_{1,1} = T^{-1}\sum_t S^{(1)}_t \big(S^{(1)}_t\big)' \in \mathbb{R}^{P_1\times P_1}.$$
We also introduce the following auxiliary objects:
$$\xi_{2,1}(z;cq;q) = \lim_{T\to\infty} T^{-1}\operatorname{tr} E\big[(zI+\hat\Sigma_{1,1})^{-1}\Sigma_{1,2}\Sigma_{1,2}'\big] \ge 0, \tag{17}$$
$$\hat\xi_{2,1}(z;cq;q) = \lim_{T\to\infty} T^{-1}\operatorname{tr} E\big[(zI+\hat\Sigma_{1,1})^{-1}\Sigma_{1,1}(zI+\hat\Sigma_{1,1})^{-1}\Sigma_{1,2}\Sigma_{1,2}'\big] \ge 0.$$
The quantities in (17) account for covariances between observed and unobserved signals. While the existence of the limits in (17) cannot be guaranteed in general, the expectations are uniformly bounded for $z>0$ (as the matrices are uniformly bounded for $z>0$). Hence, by passing to a subsequence of $T,P$, we can always assume that the limits in (17) exist. In the Internet Appendix, we show that these limits actually exist for a class of correlation structures.
With the additional assumptions for the misspecified setting in place, we have the following analog of Propositions 2, 3, and 4.

PROPOSITION 5: In the limit $T,P,P_1\to\infty$, $P/T\to c$, and $P_1/P\to q\in(0,1]$,
$$\lim_{T\to\infty}\frac{1}{T}\operatorname{tr}\big((zI+\hat\Sigma_{1,1})^{-1}\Sigma_{1,1}\big) \to \xi(z;cq;q)$$
in probability, where
$$\xi(z;cq;q) = \frac{1 - z\,m(-z;cq;q)}{(cq)^{-1} - 1 + z\,m(-z;cq;q)}$$
and
$$m(-z;cq;q) = \lim P_1^{-1}\operatorname{tr}\big((zI+\hat\Sigma_{1,1})^{-1}\big).$$
Furthermore,
$$\nu(z;cq;q) = \psi_{*,1}(q) - (qc)^{-1} z\,\xi(z;cq;q) > 0,$$
$$\nu'(z;cq;q) = -(qc)^{-1}\big(\xi(z;cq;q) + z\,\xi'(z;cq;q)\big) < 0,$$
$$\hat\nu(z;cq;q) = \nu(z;cq;q) + z\,\nu'(z;cq;q) > 0.$$
In addition, we have
(i) The expected return on the market timing strategy converges in probability to
$$E(z;cq;q) := \lim E\big[\hat\pi_t(z) R_{t+1} \,\big|\, \hat\beta\big] = bq\,\nu(z;cq;q) + \frac{(cq)^{-1}\,\xi_{2,1}(z;cq;q)}{1+\xi(z;cq;q)}.$$
(ii) Expected leverage converges in probability to
$$L(z;cq;q):=\lim E\!\left[\pi_t(z)^2\mid\hat\beta\right]=qb_*\,\hat\nu(z;cq;q)-cq\left(1+b_*[\psi_{*,1}(1)-q\psi_{*,1}(q)]\right)\nu'(z;cq;q)+\Delta(z;cq;q),$$
where
$$\Delta(z;cq;q)=\frac{b_*}{qc}\,\frac{\xi'_{2,1}(z;cq;q)+2\left(1+\xi(z;cq;q)\right)\nu(z;cq;q)\,\xi_{2,1}(z;cq;q)}{\left(1+\xi(z;cq;q)\right)^2}.$$
(iii) R² converges in probability to
$$R^2(z;cq;q)=\frac{2E(z;cq;q)-L(z;cq;q)}{1+b_*\psi_{*,1}(1)}.\tag{18}$$
(iv) The second moment of the market timing strategy converges in probability to
$$V(z;cq;q):=\lim E\!\left[(\pi_t(z)R_{t+1})^2\right]=2\left(E(z;cq;q)\right)^2+\left(1+b_*\psi_{*,1}(1)\right)L(z;cq;q).$$
(v) And, as a result, the Sharpe ratio satisfies
$$SR(z;cq;q)=\frac{E(z;cq;q)}{\sqrt{V(z;cq;q)}}=\frac{1}{\sqrt{2+\left(1+b_*\psi_{*,1}(1)\right)L(z;cq;q)/\left(E(z;cq;q)\right)^2}}.$$
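The limiting objects in Proposition 5 are straightforward to probe by simulation. The sketch below is our illustration with an identity Ψ and hypothetical dimensions: for a correctly specified block (q = 1) with Ψ = I, the statistic (1/T) tr((zI + Ψ̂)⁻¹Ψ) reduces to c·m̂, which can be checked against the stated formula for ξ.

```python
import numpy as np

rng = np.random.default_rng(1)
T, P = 500, 1000  # complexity c = P/T = 2
z = 1.0
c = P / T

S = rng.standard_normal((T, P))  # signals with identity covariance Psi = I
psi_hat = S.T @ S / T            # sample covariance
eigs = np.linalg.eigvalsh(psi_hat)

m_hat = np.mean(1.0 / (z + eigs))                    # m(-z) ~ P^{-1} tr((zI + Psi_hat)^{-1})
xi_lhs = c * m_hat                                   # (1/T) tr((zI + Psi_hat)^{-1} Psi), Psi = I
xi_rhs = (1 - z * m_hat) / (1 / c - 1 + z * m_hat)   # xi(z; c) from Proposition 5

# The two quantities agree up to finite-sample error
```

The agreement tightens as P and T grow with P/T held fixed, reflecting the self-averaging of trace statistics.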
In general, the behavior of the quantities in Proposition 5 depends in a complex fashion on the correlations between observable and unobservable signals, as captured by the quantities in (17). When both quantities in (17) are zero, the expressions simplify significantly. It is straightforward to show that both quantities in (17) are zero if the matrix Ψ_{1,2}Ψ_{2,1} has a uniformly bounded trace. For example, this is the case when Ψ_{1,2} has a finite, uniformly bounded rank as P, P_1 → ∞ (due to, say, a finite-dimensional factor structure in the signals). We thus obtain the following result.
PROPOSITION 6: Suppose that tr(Ψ_{1,2}Ψ_{2,1}) = o(P).32 Then, ξ_{2,1} = ξ'_{2,1} = 0. Furthermore,
(i) E(z; cq; q) is monotone decreasing in z, and hence 0 < E(z; cq; q) < E(0; cq; q) < E(0; c; 1).
(ii) Both R²(z; cq; q) and SR(z; cq; q) are monotone increasing in z for z < z_* = c(1 + b_*(ψ_{*,1}(1) − qψ_{*,1}(q)))/b_* and monotone decreasing in z for z > z_*.
(iii) And in the ridgeless limit as z → 0, we have
$$E(0;cq;q)=b_*q\left(\psi_{*,1}(q)-(cq)^{-2}\,m'(cq;q)^{-1}\,1_{q>1/c}\right),$$
$$L(0;cq;q)=E(0;cq;q)+\left(1+b_*\left(\psi_{*,1}(1)-q\psi_{*,1}(q)\right)\right)\begin{cases}\left((cq)^{-1}-1\right)^{-1}, & q<1/c,\\[2pt] \tilde\mu(cq;q), & q>1/c,\end{cases}$$
$$V(0;cq;q)=2\left(E(0;cq;q)\right)^2+\left(1+b_*\psi_{*,1}(1)\right)L(0;cq;q),$$
$$SR(0;cq;q)=\frac{E(0;cq;q)}{\sqrt{V(0;cq;q)}}$$
for some m′(cq; q) > 0 and some μ̃(cq; q) < 0 with μ̃(1+; c) = −∞. In particular, if Ψ is proportional to the identity matrix, Ψ = ψ_{*,1} I, then
$$E(0;cq;q)=b_*\psi_{*,1}\min\{q,c^{-1}\}\tag{19}$$
is constant for q > 1/c.

32 This is the case, for example, when Ψ_P = D_P + Q_P, where lim sup_{P→∞} rank(Q_P) < ∞, while D_P are diagonal matrices and D_P, Q_P are uniformly bounded. In this case, we can replace Ψ_P with D_P in all expressions. Perhaps more tangibly, this condition obtains when the signals satisfy a finite-dimensional factor structure. Furthermore, if the signals have similar idiosyncratic variances, they satisfy the necessary mixing condition.
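To make formula (19) concrete, the following sketch (our illustration, using the Section IV calibration b_* = 0.2, ψ_{*,1} = 1, c = 10) traces the ridgeless expected return as the fraction of observed signals q varies:

```python
import numpy as np

b_star, psi_1, c = 0.2, 1.0, 10.0  # calibration used in Figures 4 to 6

def expected_return_ridgeless(q):
    """Equation (19): E(0; cq; q) = b_* * psi_{*,1} * min(q, 1/c) for identity Psi."""
    return b_star * psi_1 * min(q, 1.0 / c)

qs = np.linspace(0.01, 1.0, 100)
curve = [expected_return_ridgeless(q) for q in qs]
# The curve rises linearly until cq = 1 (q = 0.1) and is exactly flat thereafter at b_* * psi_1 / c
```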
The comparative statics of Section III.B highlight how, even when the empirical model is correctly specified, complexity hinders the model's ability to home in on the true DGP because there is not enough data to support the model's heavy parameterization. That analysis shows that when models are correctly specified, the best performance (in terms of R² and Sharpe ratio) comes from simple models. Naturally, a small correctly specified model will converge on the truth faster than a large correctly specified model. But this is not a very helpful comparison.
The fundamental difference in this section is that while raising cq brings the
usual statistical challenges of heavy parameterization without much data, the
added complexity also brings the benefit of improving the empirical model’s
approximation of the true DGP. A simple model will tend to suffer from poor
approximation and thus fare poorly in terms of both statistical metrics like R² and portfolio metrics like the expected return and Sharpe ratio. Thus, our
misspecification analysis tackles the most important question about high com-
plexity: Does the improvement in approximation justify the statistical cost of
heavy parameterization when it comes to out-of-sample forecast and portfolio
performance? The answer is yes, as established by the following theorem.
THEOREM 1 (Virtue of Complexity): Suppose that signals are sufficiently mixed (so that H(x; q) does not depend on q) and tr(Ψ_{1,2}Ψ_{2,1}) = o(P). Then, with the optimal amount of shrinkage z_*, the Sharpe ratio SR(z_*(q; c); cq; q) and R²(z_*(q; c); cq; q) are strictly monotone increasing and concave in q ∈ [0, 1].
Figures 4, 5, and 6 illustrate the behavior of misspecified machine learning predictions and portfolios derived in Proposition 5. In this calibration, the true unknown DGP is assumed to have a complexity of c = 10. We continue to calibrate Ψ as the identity and b_* = 0.2. We analyze the behavior of approximating empirical models that range in complexity from very simple (cq ≈ 0, and thus severely misspecified) to highly complex (q = 1, cq = 10, and thus correctly specified).

Figure 4. Expected out-of-sample prediction accuracy from misspecified models. This figure shows the limiting out-of-sample R² and β̂ norm as a function of c and z from Proposition 6, assuming Ψ is the identity matrix, b_* = 0.2, and the complexity of the true model is c = 10.

Figure 5. Expected out-of-sample risk and return from misspecified models. This figure shows the limiting out-of-sample expected return and volatility of the market timing strategy as a function of c and z from Proposition 6, assuming Ψ is the identity matrix, b_* = 0.2, and the complexity of the true model is c = 10.

The left panel of Figure 4 shows the expected out-of-sample
R². The cost of misspecification for low c is seen as a downward shift in the R² relative to Figure 1. The challenges of model complexity highlighted in previous sections play an important role here as well. Intermediate levels of complexity (cq ≈ 1) dilate the size of beta estimates (Figure 4, right panel), driving down the R² and inflating portfolio volatility (Figure 5, right panel). These effects abate once again for cq > 1 due to the implicit regularization of high-complexity ridgeless regression, just as in the earlier analysis. More generally, the patterns for R², the β̂ norm, and portfolio volatility are qualitatively similar to those in Figure 1.

Figure 6. Expected out-of-sample Sharpe ratio from misspecified models. This figure shows the limiting out-of-sample Sharpe ratio of the market timing strategy as a function of c and z from Proposition 6, assuming Ψ is the identity matrix, b_* = 0.2, and the complexity of the true model is c = 10.
The most important difference compared to Figure 1 is the pattern for the out-of-sample expected return of the market timing strategy (Figure 5, left panel). Expected returns are now low for simple strategies due to their poor approximation of the DGP. Increasing model complexity monotonically increases expected timing returns. In the ridgeless case, the benefit of added complexity reaches its maximum of E(0; 1; c⁻¹) = b_*ψ_{*,1}c⁻¹ when cq = 1. A surprising fact is that the ridgeless expected return is exactly flat as complexity rises beyond cq = 1, in which case the benefits of incremental improvements in DGP approximation are exactly offset by the gradually rising bias of ridgeless shrinkage; see formula (19).
This new fact that the expected return rises monotonically with model com-
plexity in the misspecified setting induces a similar pattern in the out-of-
sample Sharpe ratio, shown in Figure 6. Rather than decreasing in complexity
as we saw in the correctly specified setting, the expected return improvement
from additional complexity leads the Sharpe ratio to also increase with com-
plexity. Consistent with Theorem 1, this is particularly true with nontrivial
ridge shrinkage but is even true in the ridgeless case as long as cq is suffi-
ciently far from unity. In summary, in the realistic case of misspecified em-
pirical models, complexity is a virtue. It improves the expected out-of-sample
market timing performance in terms of both expected return and Sharpe ratio.
It is instructive to compare our findings with the phenomenon of double descent, whereby, absent regularization, out-of-sample MSE has a nonmonotonic pattern in model complexity (Belkin et al. (2019), Hastie et al. (2022)). The mirror image of double descent in MSE is the "double ascent" behavior of the ridgeless Sharpe ratio (Figure 6). As Theorem 1 shows, Sharpe ratio double ascent is an artifact of insufficient shrinkage. With the right amount of shrinkage, complexity becomes a virtue even in the low-complexity regime (when cq < 1): The hump disappears, and "double ascent" turns into "permanent ascent."
V. Virtue of Complexity: Empirical Evidence From Market Timing
In this section, we present direct empirical analogs to the theoretical comparative statics for misspecified models in Section IV.
A. Data
Our empirical investigation centers on a cornerstone of empirical asset pric-
ing research—forecasting the aggregate stock market return. To make the con-
clusions from this analysis as easy to digest as possible, we perform our anal-
ysis in a conventional setting with conventional data. Our forecast target is
the monthly excess return of the CRSP value-weighted index. The information
set we use for prediction consists of the 15 predictor variables from Goyal and Welch (2008), available monthly over the sample from 1926 to 2020.33
We volatility-standardize returns and predictors using backward-looking
standard deviations that preserve the out-of-sample nature of our forecasts.
Returns are standardized by their trailing 12-month return standard deviation
(to capture their comparatively fast-moving conditional volatility).34 In contrast, predictors are standardized using an expanding-window historical standard deviation (given the much higher persistence of most predictors). We require 36 months of data to ensure enough stability in our initial predictor standardization, so the final sample we bring to our analysis begins in 1930. We perform this standardization to align the empirical analysis with our homoskedastic theoretical setting; none of our findings are sensitive to variations in how the standardizations are implemented.
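The two standardization schemes can be sketched as follows. This is our illustration of the mechanics (trailing 12-month uncentered standard deviation for returns, expanding-window standard deviation for predictors), not the paper's code; the simulated series is hypothetical.

```python
import numpy as np
import pandas as pd

def standardize_returns(r, window=12):
    """Divide returns by their trailing uncentered standard deviation, lagged one
    month so the scale is known before the return is realized."""
    trailing_sd = np.sqrt((r ** 2).rolling(window).mean()).shift(1)
    return r / trailing_sd

def standardize_predictor(x, min_periods=36):
    """Divide a predictor by its lagged expanding-window standard deviation,
    requiring 36 months of history."""
    return x / x.expanding(min_periods).std().shift(1)

rng = np.random.default_rng(0)
r = pd.Series(0.05 * rng.standard_normal(120))
r_std = standardize_returns(r)
x_std = standardize_predictor(r)
```

The one-period lag in both scalings is what preserves the out-of-sample nature of the forecasts.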
B. Random Fourier Features
We seek models that take the form of equation (3).35 To evaluate our theory, we also seek a framework that allows us to smoothly transition from low-complexity models to high-complexity models. To do so, we adopt an influential methodology from the machine learning literature known as RFF (Rahimi and Recht (2007), Rahimi and Recht (2008)).36 Let G_t denote our 15 × 1 vector of predictors. The RFF methodology converts G_t into a pair of new signals,
$$S_{i,t}=\left(\sin\!\left(\gamma\,\omega_i'G_t\right),\ \cos\!\left(\gamma\,\omega_i'G_t\right)\right),\qquad \omega_i\ \overset{\text{i.i.d.}}{\sim}\ N(0,I),\tag{20}$$
where S_{i,t} uses the vector ω_i to form a random linear combination of G_t, which is then fed through the trigonometric functions.37 The advantage of RFF is that, for a fixed set of input data G_t, we can create an arbitrarily large (or small) set of features based on the information in G_t through the nonlinear transformation in (20). If one desires a very low-dimensional model in (3), say P = 2, one can generate a single pair of RFFs. For a very high-dimensional model, say P = 10,000, one can instead draw many random weight vectors ω_i, i = 1, ..., 5,000. The larger the number of random features, the richer the approximation that (3) provides to the general functional form E[R_{t+1}|G_t] = f(G_t), where f is some smooth nonlinear function. Indeed, the RFF approach is a wide two-layer neural network with fixed weights in the first layer (in the form of ω_i) and optimized weights in the second layer (in the form of the regression estimates for β).

33 This list includes (using mnemonics from their paper): dfy, infl, svar, de, lty, tms, tbl, dfr, dp, dy, ltr, ep, b/m, and ntis, as well as one lag of the market return. Most of these variables are based on market prices and are available at month end. Our date convention for inflation is that used by Goyal, Welch, and Zafirov (2023) and the data set graciously provided by Amit Goyal. Note that while inflation for month t is typically reported two weeks into month t+1, the Goyal, Welch, and Zafirov (2023) date convention views the price data upon which the official inflation statistic is based as part of the time t information set. We show in Internet Appendix Figure IA12 and Table IA2 that our results are essentially unchanged if we exclude inflation from our analysis.
34 For returns, we calculate the standard deviation from the uncentered second moment due to the noisiness of estimating mean monthly returns in short windows.
35 As in equation (3), we exclude the intercept from our regressions. If we include a constant as an additional regressor in our high-complexity regressions, the associated intercept is shrunken so heavily that it has no effect on the results reported in Table I.
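A minimal sketch of the construction in (20) (our illustrative code; the function and variable names are ours, and the simulated predictor panel is hypothetical):

```python
import numpy as np

def random_fourier_features(G, n_features, gamma=2.0, seed=0):
    """Map a T x d predictor panel G into n_features RFFs: sine/cosine pairs of
    random linear combinations gamma * omega_i' G_t with omega_i ~ iid N(0, I)."""
    T, d = G.shape
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((d, n_features // 2))  # one omega_i per sin/cos pair
    proj = gamma * (G @ omega)
    return np.hstack([np.sin(proj), np.cos(proj)])     # T x n_features

G = np.random.default_rng(1).standard_normal((120, 15))  # 15 predictors, as in the text
S_rff = random_fourier_features(G, n_features=200)
```

Growing `n_features` while holding G fixed is exactly the dial that moves the model from low to high complexity.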
C. Out-of-Sample Performance
To conduct the empirical analogue of the theoretical analysis in Figures 4, 5, and 6, we consider one-year, five-year, and 10-year rolling training windows (T = 12, 60, or 120) and a large set of RFFs (as high as P = 12,000). These choices are guided by our desire to investigate the role of model complexity, defined in the empirical analysis as c = P/T. The advantages of short training samples like T = 12 are that (i) we can reach extreme levels of model complexity with smaller P, and thus less computing burden, and (ii) they show that the virtue of complexity can be enjoyed in small samples. But none of our conclusions are sensitive to this choice, as we document all of the same patterns for training windows of T = 60 and 120.
36 Rahimi and Recht (2007) describe how RFF approximation accuracy improves as one increases the level of model complexity. In the limit of zero complexity (P, T → ∞, P/T → 0), RFF regression approximates any sufficiently smooth nonlinear function arbitrarily well. Subsequent papers (see, for example, Rudi and Rosasco (2017)) further characterize rates of convergence. The case of nonzero complexity is less well understood. Recent results (Mei and Montanari (2022), Mei, Misiakiewicz, and Montanari (2022), Ghorbani et al. (2020)) show that, for nonzero complexity, random features methods cannot learn the true function and only learn its projection on a specific functional subspace.
37 The parameter γ controls the Gaussian kernel bandwidth in the generation of RFFs. Random features can be generated in several ways (for a survey, see Liu et al. (2021)). Our choice of functional form in (20) is guided by Sutherland and Schneider (2015), who document tighter error bounds for this functional approximation relative to some alternative random feature formulations. We find, however, that our results are insensitive to using other random feature schemes.
To draw "VoC curves" along the lines of Figures 4, 5, and 6, we estimate a sequence of out-of-sample predictions and trading strategies for various degrees of model complexity, ranging from P = 2 to P = 12,000, and various degrees of ridge shrinkage, ranging from log₁₀(z) = −3, ..., 3. One repetition of our analysis proceeds as follows:
(i) Generate 12,000 RFFs according to (20) with bandwidth parameter γ.38
(ii) Fix a model defined by the number of features P ∈ {2, ..., 12,000} and ridge shrinkage parameter log₁₀(z) ∈ {−3, ..., 3}. The set of predictors S_t for regression (3) corresponds to the first P RFFs from (i).
(iii) Given the model in (ii), and fixing a training window T ∈ {12, 60, 120}, conduct a recursive out-of-sample prediction and market timing strategy. For each t ∈ {T, ..., 1,091}, estimate (3) using training observations {(R_t, S_{t−1}), ..., (R_{t−T+1}, S_{t−T})}.39 Then, from the estimated regression coefficient, construct the out-of-sample return forecast β̂'S_t and timing strategy return β̂'S_t R_{t+1}.
(iv) From the sequence of out-of-sample predictions and strategy returns in (iii), calculate the average ‖β̂‖² across training samples, the out-of-sample R², and the out-of-sample average return, volatility, and Sharpe ratio of the timing strategy.40
The inherent randomness of RFFs means that estimates of out-of-sample per-
formance tend to be noisy for models with low P. We therefore repeat the anal-
ysis steps from (i) to (iv) 1,000 times with independent draws of the RFFs, and
then average the performance statistics across repetitions.
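Steps (i) through (iv) can be condensed into the following stylized sketch. It is our illustration on simulated data; the actual analysis additionally volatility-standardizes the RFFs and averages over 1,000 RFF draws, and the names are ours.

```python
import numpy as np

def timing_backtest(S, R, T_train, z):
    """Rolling ridge regression of R_{t+1} on S_t; the position is pi_t = beta_hat' S_t
    and the strategy return is pi_t * R_{t+1}."""
    n, P = S.shape
    preds, actual, strat = [], [], []
    for t in range(T_train, n - 1):
        S_tr = S[t - T_train:t]            # S_{t-T}, ..., S_{t-1}
        R_tr = R[t - T_train + 1:t + 1]    # R_{t-T+1}, ..., R_t
        beta = np.linalg.solve(z * np.eye(P) + S_tr.T @ S_tr / T_train,
                               S_tr.T @ R_tr / T_train)
        pred = S[t] @ beta                 # out-of-sample forecast of R_{t+1}
        preds.append(pred)
        actual.append(R[t + 1])
        strat.append(pred * R[t + 1])      # timing strategy return
    preds, actual, strat = map(np.array, (preds, actual, strat))
    r2 = 1 - np.sum((actual - preds) ** 2) / np.sum(actual ** 2)  # OOS R^2 vs zero forecast
    sharpe = strat.mean() / strat.std() * np.sqrt(12)             # annualized, monthly data
    return r2, sharpe

rng = np.random.default_rng(0)
n, P = 600, 100
S = rng.standard_normal((n, P))
R = np.concatenate([[0.0], 0.05 * S[:-1, 0]]) + 0.1 * rng.standard_normal(n)
r2, sharpe = timing_backtest(S, R, T_train=12, z=1e3)
```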
The VoC curves in Figures 7 and 8 plot out-of-sample prediction and market timing performance as a function of model complexity and ridge shrinkage for the case T = 12. The wide range of complexity that we consider (e.g., c ∈ [0, 1,000] when T = 12) can make the plots difficult to read. To better visualize the results while emphasizing both behavior near the interpolation boundary and behavior for extreme complexity, we break the x-axis at an intermediate value of c.
The first conclusion from these figures is that the out-of-sample empirical behavior of machine learning predictions is a strikingly close match to the VoC curves predicted by our theory. In particular, compare the empirical results of Figure 7 to the theoretical results under model misspecification from Figure 4. The beta estimates and out-of-sample R² demonstrate explosiveness at the interpolation boundary and recovery in the high-complexity regime. Figures IA1 and IA2 (reported in the Internet Appendix in the interest of space) document identical patterns for training windows of 60 and 120 months.
38 We set γ = 2. Our results are generally insensitive to γ, as discussed in Section V.F.
39 Prior to estimation, we volatility-standardize the training-sample RFFs {S_{t−1}, ..., S_{t−T}} and out-of-sample RFFs S_t by their standard deviations in the training sample.
40 Our empirical R² calculation is one minus the ratio of out-of-sample forecast error variance to out-of-sample realized return variance. Our empirical Sharpe ratio calculation uses the centered standard deviation in the denominator.
Figure 7. Out-of-sample market timing performance (T = 12). This figure shows the out-of-sample prediction accuracy and portfolio performance estimates for the empirical analysis described in Section V.C. The training window is T = 12 months and the RFF count P (or cT) ranges from 2 to 12,000 with γ = 2.
Extreme behavior at the interpolation boundary makes it difficult to fully appreciate the patterns in R². Figure IA3 in the Internet Appendix provides more detail by plotting the out-of-sample R², zooming in on the range [−10%, 1%]. Here, we see more clearly that high complexity and regularization together produce a positive out-of-sample R². In this plot, regularization comes in two forms: directly, through higher z, and more subtly, through higher c (which allows ridgeless regression to find solutions with a small β̂ norm). For large z, the R² is almost everywhere positive for all training windows.
The most intriguing aspect of Figure 7 is the clear increasing pattern in out-of-sample expected returns as model complexity rises. For z = 10⁻³, which roughly approximates the ridgeless case, we see a nearly linear upward trend in average returns as c rises from zero to one. Beyond c = 1, the ridgeless expected return is nearly flat, just as predicted by equation (19) in Proposition 6. For higher levels of ridge shrinkage, the rise in expected return is more gradual and continues into the range of extreme model complexity.
Figure 8. Out-of-sample market timing performance (T = 12). This figure shows the out-of-sample prediction accuracy and portfolio performance estimates for the empirical analysis described in Section V.C. The training window is T = 12 months and the RFF count P (or cT) ranges from 2 to 12,000 with γ = 2. Alphas are versus a static position in the volatility-standardized market portfolio.
Internet Appendix Figures IA1 and IA2 again document an identical expected
return pattern for longer training windows.
The increasing pattern in out-of-sample expected return and the decreasing
pattern in volatility above c=1 translate into a generally increasing pattern
in the out-of-sample market-timing Sharpe ratio, shown in Figure 8. The ex-
ception is a brief dip near c=1 at low levels of regularization as the spike in
variance compresses the Sharpe ratio. For high complexity, the Sharpe ratio
generally exceeds 0.4.
In our theoretical setting, we normalize the expected return of the untimed asset to zero. This is, of course, not the case for the U.S. market return. Therefore, to adjust for buy-and-hold market exposure, we calculate the out-of-sample alpha, alpha t-statistic, and information ratio (IR) of the timing strategy return via time-series regression on the untimed market. Figure 8 shows that the market timing alpha and IR inherit the same patterns as the average return and Sharpe ratio. In the high-complexity regime, we find IRs around 0.3 and significant alpha t-statistics ranging from 2.6 to 2.9, depending on the amount of ridge shrinkage. Figure 9 repeats this analysis for training windows of 60 and 120 months, where we find similar IRs of roughly 0.25 with t-statistics above 2.0 for high-complexity models.

Figure 9. Out-of-sample market timing performance (T = 60, 120). This figure shows the out-of-sample prediction accuracy and portfolio performance estimates for the empirical analysis described in Section V.C. The training window is T = 60 or 120 months and the RFF count P (or cT) ranges from 2 to 12,000 with γ = 2. Alphas are versus a static position in the volatility-standardized market portfolio.
What do market timing strategies look like in the high-complexity regime? Figure 10 plots π̂(z, c) for the highest complexity and shrinkage configurations of our empirical model (P = 12,000 and z = 10³, averaged across 1,000 sets of random feature weights). The three lines correspond to training windows of 12, 60, and 120 months. Positions show the same patterns for all training windows; their time-series correlations are 90% (T = 12 with T = 60), 87% (T = 12 with T = 120), and 97% (T = 60 with T = 120).41 The plot shows six-month moving averages of raw positions for better readability (our trading results are based on the raw positions and not the moving averages).

41 While the time-series patterns in positions are the same for all training windows, the scale of positions is smaller for longer training windows. This is because the "leverage" of a strategy is driven by the norm of beta, which is typically smaller for larger T.

Figure 10. Market timing positions. This figure shows the out-of-sample market timing positions for the empirical analysis described in Section V.C. The training window is T = 12, 60, or 120 months with P = 12,000, z = 10³, and γ = 2. Positions are averaged across 1,000 sets of random feature weights. Plots show the six-month moving average of positions to improve readability.
The timing positions in Figure 10 are remarkable. First, they show that the
high-complexity strategy is long-only at heart. Negative bets are infrequent
and small relative to positive bets. The machine learning model thus heeds
the guidance of Campbell and Thompson (2008) “that many predictive regres-
sions beat the historical average return, once weak restrictions are imposed on
the signs of coefficients and return forecasts.” However, unlike Campbell and
Thompson (2008), the machine seems to learn this rule without being given an
explicit constraint.42
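For reference, the Campbell and Thompson (2008) restriction mentioned here amounts to truncating return forecasts at zero before forming positions; a one-line sketch of our own illustration:

```python
import numpy as np

def campbell_thompson(forecasts):
    """Sign restriction: replace negative return forecasts with zero."""
    return np.maximum(forecasts, 0.0)

positions = campbell_thompson(np.array([0.8, -0.3, 1.2, -0.1]))  # -> [0.8, 0.0, 1.2, 0.0]
```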
Second, the machine learning strategy learns to divest leading up to reces-
sions. NBER recession dates are shown in the gray-shaded regions. For 14 out
of 15 recessions in our test sample, the timing strategy substantially reduces
its position in the market before the recession (the exception is the eight-month
recession of 1945). And it does this on a purely out-of-sample basis.
D. Comparison with Goyal and Welch (2008)
Our results seem at odds with the primary conclusion of Goyal and Welch (2008). These authors argue that the enterprise of market return prediction, which has occupied a large amount of attention in the asset pricing literature for decades, is by and large a failed endeavor: "these models seem unstable, as diagnosed by their out-of-sample predictions and other statistics; and these models would not have helped an investor with access only to available information to profitably time the market." But we use the same predictive information as in that paper. What is the source of the discrepancy?

42 Strictly imposing the Campbell and Thompson (2008) constraint boosts the Sharpe ratio from 0.47 to 0.54 in the T = 12 case, from 0.42 to 0.50 for T = 60, and from 0.41 to 0.49 for T = 120.
The conclusions of Goyal and Welch (2008) are based on their findings of consistently negative out-of-sample prediction R². They do not analyze the performance of timing strategies based on expected returns or Sharpe ratios.43 We
revisit their analysis with a focus on timing strategy performance using the
same recursive out-of-sample prediction scheme as in the analysis of Figures 7
and 8. We use rolling 12-, 60-, and 120-month training windows (Panels A,
B, and C, respectively), and we focus on a version of what Goyal and Welch
(2008) call the “kitchen sink” regression. Our implementation uses 15 monthly
predictors in a linear ridgeless regression.44
The first finding of Table I is that we confirm the conclusions of Goyal and Welch (2008). Note that with monthly data, a model with 15 regressors already has nontrivial complexity even for long training windows, and for the 12-month training window, its complexity even exceeds one. Monthly return forecasts using linear ridgeless regression behave egregiously. The monthly out-of-sample R² from ridgeless regression (z = 0+) is large and negative at less than −100% (−9,764% to be precise!). The timing strategy based on these predictions is also poor. The Sharpe ratio is −0.11 and is insignificantly different from zero. This perhaps seems not so terrible given the wildness of the forecasts, but that is because the strategy's volatility is so high. Its maximum loss is 98 standard deviations. In light of our theoretical analysis, this agreement with the conclusions of Goyal and Welch (2008) is perhaps unsurprising. With P = 15 and T = 12, this analysis takes place near the interpolation boundary. Thus, forecasts and timing-strategy returns are expected to be highly volatile, as our estimates confirm. In Panels B and C, we repeat the analysis with longer training windows (T = 60 and 120). Longer training windows lead to less variable ridgeless regression estimates, producing higher (though still negative) R², and improving the Sharpe ratio.
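For completeness, the ridgeless (z = 0+) estimator used here corresponds to the minimum-norm least-squares solution, the limit of ridge as z → 0. A minimal sketch (our illustration, with the kitchen sink dimensions P = 15 and T = 12 but simulated data):

```python
import numpy as np

rng = np.random.default_rng(0)
T, P = 12, 15  # 15 predictors, 12-month window: P > T
S = rng.standard_normal((T, P))
R_next = rng.standard_normal(T)

# Minimum-norm interpolating solution: the z -> 0+ limit of ridge regression
beta_ridgeless = np.linalg.pinv(S) @ R_next

residuals = S @ beta_ridgeless - R_next  # with P > T the training data are fit exactly
```

Near the interpolation boundary this solution is extremely sensitive to the training sample, which is why the z = 0+ forecasts are so volatile.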
Our theoretical analysis suggests that, in circumstances like the linear kitchen sink where the regression takes place near the interpolation boundary, the benefits from additional ridge shrinkage are potentially large. We therefore reestimate the Goyal and Welch (2008) kitchen sink regression with the same range of ridge parameters that we used in our machine learning models. The R² from even heavily regularized regressions can remain negative,
43 Updating the original Goyal and Welch (2008) analysis, Goyal, Welch, and Zafirov (2023) provide some evidence of timing-strategy performance for market return predictors.
44 To remain consistent with our other analyses, the forecast target is the monthly market return standardized by its rolling 12-month volatility. We continue to refer to this as "the market" throughout. As discussed in the robustness section, our results across the board are generally insensitive to, and our conclusions entirely unaffected by, whether we work with the raw or volatility-standardized market return.
Table I
Comparison with Goyal and Welch (2008)

This table reports the out-of-sample prediction accuracy and portfolio performance estimates for high-complexity timing-strategy returns with c = 1,000 and z = 10³ in Section V.C ("Nonlinear"), averaged across 1,000 sets of random feature weights, compared with the linear kitchen sink model of Goyal and Welch (2008) ("Linear") with shrinkage of z = 0+ (ridgeless) and z = 10³. The forecast target is the monthly market return standardized by its rolling 12-month volatility. We report strategy Sharpe ratios (with average return t-statistics) and information ratios versus the market and versus the linear model with z = 10³ (with alpha t-statistics). The panels correspond to training windows of 12, 60, or 120 months. "Max Loss" is in standard deviation units.

Model      Shrinkage  R²       SR     t     IR v. Mkt  t     IR v. Linear  t     Max Loss  Skew

Panel A: 12-month training window
Linear     z = 0+     <−100%   −0.11  −1.0  0.16       1.6                       98.5      −0.9
Linear     z = 10³    −3.8%    0.46   4.4   0.33       3.1                       2.4       0.1
Nonlinear  z = 10³    0.6%     0.47   4.5   0.31       2.9   0.26          2.5   1.2       2.5

Panel B: 60-month training window
Linear     z = 0+     −96.6%   0.00   0.0   0.07       0.6                       35.8      −11.1
Linear     z = 10³    −0.5%    0.44   4.1   0.10       0.9                       1.4       0.3
Nonlinear  z = 10³    0.5%     0.42   3.9   0.25       2.3   0.27          2.5   0.5       1.7

Panel C: 120-month training window
Linear     z = 0+     −26.6%   0.20   1.8   0.14       1.2                       15.4      −6.5
Linear     z = 10³    −0.1%    0.49   4.4   0.13       1.2                       0.8       0.9
Nonlinear  z = 10³    0.3%     0.41   3.7   0.24       2.2   0.24          2.2   0.3       0.9
as seen in the out-of-sample R² of −3.8% when z = 10³. However, with this much shrinkage, the benefits of market timing become large. The annualized out-of-sample Sharpe ratio of the strategy is 0.46 with a t-statistic of 4.4. This performance is not due to static market exposure. In the column "IR v. Mkt," we report performance after regressing on the volatility-standardized market return. The linear model with z = 10³ has an IR of 0.33 (t = 3.1) versus the market. Shrinkage also produces a more attractive maximum loss and skewness. These patterns align with the behavior predicted by our theoretical analysis. Near the interpolation boundary, models can seem defective in terms of R², yet they can nonetheless confer large economic benefits to investors. In Panels B and C, we see that shrinkage also benefits performance amid longer training windows. For T = 120, the linear strategy Sharpe ratio is 0.49 for z = 10³ (the alpha versus the market is insignificant, however).
The "Nonlinear" model in Table I refers to the machine learning timing strategy with c = 1,000 and z = 10³ (averaged across 1,000 sets of random weight draws). In Panel A, the out-of-sample R² is 0.6% per month, with a Sharpe ratio of 0.47 and an IR of 0.31 versus the market. It also has a significant IR of 0.26 (t = 2.5) versus the best linear strategy (z = 10³). One of the most attractive aspects of the machine learning strategy is its low downside risk. Its worst month was a loss of 1.23 standard deviations, and its skewness is positive, 2.48. These attractive tail risk properties of the machine learning model are not reflected in the Sharpe ratio. Still, they would be an important utility boost for investors who care about non-Gaussian risks. Note that the machine learning strategy accomplishes this using the identical information set as the linear strategy; it exploits this information in a high-dimensional, nonlinear way. Using longer training windows (Panels B and C) leads to the same conclusions.

Figure 11. Variable importance. This figure shows the variable importance (VI) for the ith predictor, defined as the change in performance (out-of-sample R² or Sharpe ratio) moving from the full model with 15 variables to the reestimated model using 14 variables (excluding variable i).
E. Variable Importance
These results above beg the question: how can such large models learn pre-
dictive patterns in training windows as short as 12 months, particularly when
several raw predictors are highly persistent (e.g., dividend yield and T-bill
rate)? The short answer is that a number of the 15 raw predictors are, in fact,
highly variable over short horizons, and these variables are the most impor-
tant contributors to the performance of the high-complexity model. To shed
more detailed light on this answer, we analyze the contribution of each vari-
able to overall model performance. We reestimate the machine learning model
omitting each of the 15 predictor variables one by one. We calculate “variable
importance” (VI) for the ith predictor as the change in performance (defined as
out-of-sample R2or Sharpe ratio) moving from the full model with 15 variables
to the reestimated model using 14 variables (excluding variable i).
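As a sketch, this leave-one-out reestimation loop can be written as follows (illustrative Python; the hypothetical `fit_and_evaluate` stands in for the paper's full random-features estimation and out-of-sample evaluation pipeline):

```python
import numpy as np

def variable_importance(X, y, fit_and_evaluate):
    """Leave-one-predictor-out variable importance (VI).

    X : (T, K) matrix of K raw predictors; y : (T,) next-month returns.
    fit_and_evaluate : callable returning an out-of-sample performance
    metric (e.g., R^2 or Sharpe ratio) for a given predictor matrix.
    VI_i = full-model performance minus performance excluding predictor i.
    """
    full_perf = fit_and_evaluate(X, y)
    vi = np.empty(X.shape[1])
    for i in range(X.shape[1]):
        X_drop = np.delete(X, i, axis=1)  # reestimate without predictor i
        vi[i] = full_perf - fit_and_evaluate(X_drop, y)
    return vi
```

A positive VI_i then means that dropping predictor i hurts out-of-sample performance.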
Figure 11 plots the results for the 12-month training window (with P =
12,000, z = 10⁻³, and averaged across 1,000 sets of random feature weights).
The three most important variables are also the three predictors with the
highest average variation in 12-month windows (i.e., the least persistent
The Virtue of Complexity in Return Prediction 497
predictors).45 Excluding the lagged market return (“lag mkt”), long-term bond
return (“ltr”), or default return (“dfr”) from the random features model reduces
the out-of-sample monthly prediction R² by 1.9%, 1.3%, and 0.8%, respectively.
In other words, the complex model is particularly adept at leveraging informa-
tion in short-horizon fluctuations among predictors. The VI calculations tell
the same story whether we measure it in terms of R² (bars) or Sharpe ratio (line).
VI helps us identify which of the 15 predictors are the most dominant infor-
mation sources. But our results further show that the key differentiator of the
high-complexity model is its ability to extract nonlinear prediction effects. The
first evidence of this is its alpha versus the linear model shown in Table I. The
linear model has access to the same predictors, but incorporating nonlineari-
ties generates significant alpha over the linear model.
The VI results show that some linear predictors have impressive individ-
ual performance. To show that machine learning performance is not driven
by these simple linear effects, Internet Appendix Table IA1 reports IRs of the
machine learning strategy versus the linear univariate timing strategy of each
predictor (the univariate timing strategy is defined as the product of a predictor
at time t with the market return at t + 1).
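The univariate timing rule in this parenthetical can be sketched as (illustrative; `g` and `mkt` are assumed predictor and market excess return series):

```python
import numpy as np

def univariate_timing_returns(g, mkt):
    """Univariate timing-strategy return: predictor at t times market at t+1.

    g, mkt : length-T arrays. Returns the length-(T-1) series g_t * mkt_{t+1}.
    """
    g = np.asarray(g, dtype=float)
    mkt = np.asarray(mkt, dtype=float)
    return g[:-1] * mkt[1:]
```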
The machine learning model has a large and highly significant IR over ev-
ery linear strategy. We also calculate its IR versus all 15 univariate strategies
simultaneously (“All”).46 In this case, we find an IR of 0.32 (t = 2.9), providing
further direct evidence for the nonlinear benefits of complexity.
Naturally, interpretation is a challenge for complex nonlinear models. In-
ternet Appendix Figure IA5 makes progress in this direction by illustrating
the nonlinear prediction patterns associated with each of the 15 predictors. To
trace the impact of predictor i on expected returns, we fix the prediction model
estimated from a given training sample and fix the values of all variables other
than i at their values at the time of the forecast. Next, we vary the value of the
ith predictor from its full-sample min (corresponding to −1 in the plots) to its
full-sample max (corresponding to +1) and record how the return prediction
varies. We then average this prediction response function across all training
windows and plot the result.
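This response-tracing procedure can be sketched as follows (illustrative; the hypothetical `predict` callable stands in for the estimated random-features model):

```python
import numpy as np

def response_curve(predict, x_now, i, lo, hi, n=21):
    """Trace the return forecast as predictor i moves from lo to hi.

    predict : callable mapping a predictor vector to a return forecast;
    x_now   : predictor values at the forecast date (all others held fixed).
    Returns the grid of values for predictor i and the forecasts.
    """
    grid = np.linspace(lo, hi, n)
    preds = []
    for v in grid:
        x = np.array(x_now, dtype=float)
        x[i] = v  # vary only predictor i
        preds.append(predict(x))
    return grid, np.array(preds)
```

Averaging such curves across training windows yields the plotted response functions.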
The figure illustrates a few interesting patterns. First, we see that when cer-
tain indicators of macroeconomic risk are at their lowest (in particular, stock
market variance “svar” and credit spreads on risky corporate debt “dfy”), the
machine learning model forecasts positive returns. However, once these vari-
ables reach even moderate levels, the return prediction drops to zero. This is
45 Figure IA4 in the Internet Appendix reports the average variation of each predictor in 12-
month training windows.
46 We cannot run in-sample versus all 15 univariate strategies simultaneously because this
would be equivalent to using the in-sample tangency portfolio of the 15 timing strategies as a
benchmark. This is not an apples-to-apples comparison because the machine learning strategy is
out-of-sample, so it should be benchmarked to a similarly out-of-sample strategy. To this end, we
build the out-of-sample tangency portfolio of the 15 timing strategies (scaled to have an expected
volatility of 20%) using an expanding window. We use this combined strategy as the regressor
when calculating alpha for the “all” case.
consistent with the time-series pattern in Figure 10, which shows that tim-
ing positions (i.e., expected returns) drop to zero heading into recessions. In
fact, all predictors demonstrate a similar “risk on/risk off” predictive pattern
in which certain values trigger positive market bets; otherwise, they advocate
positions near zero.
F. The Extent of Nonlinearity and Other Robustness
It is interesting to note that the linear strategy and the nonlinear machine
learning strategy each have beneficial performance relative to buy-and-hold.
Yet, they are distinct from each other (e.g., the nonlinear strategy has signif-
icant alpha versus the linear strategy). The parameter γ controls the degree
of nonlinearity in the RFF approximation. It turns out that the linear kitchen
sink regression is equivalent to an RFF model in the limit when γ → 0. In
particular, note that
sin(γω_i′G_t) = γω_i′G_t + O(γ²),  cos(γω_i′G_t) = 1 + O(γ²).   (21)
Suppose for simplicity that we only have the sin features. Then, defining
Ω = P^(−1/2)(ω_i)_{i=1}^P ∈ ℝ^(15×P), we have that the model is equivalent to a model with
random linear features, S_t = Ω′G_t.47
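A minimal sketch of this random feature construction and its small-γ linear limit (illustrative; `omega` plays the role of the random weights ω_i drawn by the researcher):

```python
import numpy as np

def random_fourier_features(G, omega, gamma):
    """Build sin/cos random Fourier features from raw predictors.

    G : (T, 15) predictors; omega : (15, P/2) random Gaussian weights.
    Returns the (T, P) matrix [sin(gamma*G@omega), cos(gamma*G@omega)].
    As gamma -> 0, sin(gamma*w'G) ~ gamma*w'G, so the sin block reduces
    to a random linear rotation of G (the kitchen-sink limit).
    """
    proj = gamma * G @ omega
    return np.hstack([np.sin(proj), np.cos(proj)])
```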
This raises the question: is there an optimal degree of nonlinearity? In gen-
eral, the answer is no. In the high-complexity regime, different choices of γ
deliver different approximations of the true DGP, with none strictly domi-
nating the others. Mei, Misiakiewicz, and Montanari (2022) show that high
model complexity poses an insurmountable obstacle for any random feature
regression—it is impossible to learn the “true” dependency R_{t+1} = f(G_t) + ε_{t+1}
when the model is complex. In this case, different random feature generators
recover different aspects (projections) of the truth on different subspaces. As a
result, we would expect linear and nonlinear random features to contain com-
plementary information. This is clearly reflected in the results of Table I.48
We assess robustness of our results to various degrees of nonlinearity (γ =
0.5 or 1, versus γ = 2 in our main analysis) in Section VI of the Internet Ap-
pendix. We also investigate the effect of excluding volatility standardization of
the market return. The brief summary of these analyses is that our conclusions
are robust to each variation in empirical design.
Next, we analyze the robustness of our main findings in subsamples. We
report model performance splitting the test sample into halves, as shown in
Internet Appendix Figures IA9, IA10, and IA11 for training windows T=12,
60, and 120, respectively. The left side of each figure reports machine learning
timing-strategy out-of-sample performance from 1930 to 1974, and the right
47 See Proposition IA1 in Section V of the Internet Appendix.
48 Relatedly, the machine learning model and the linear kitchen sink (with z = 10⁻³) have alpha
versus each other, suggesting that there are benefits to model averaging. For example, an equal-
weighted average of the two strategies (after they are rescaled to have the same volatility) pro-
duces a Sharpe ratio of 0.53 and a significant IR versus the market of 0.37.
side from 1975 to 2020. The figures show that the patterns of out-of-sample
timing-strategy performance with respect to complexity and shrinkage do not
depend on the subsample. Average out-of-sample returns rise monotonically
with complexity and decrease with ridge shrinkage; volatility abates when we
move past the interpolation boundary and is further dampened by shrinkage.
IRs rise with complexity and are fairly insensitive to shrinkage. In the interest
of space, we do not plot the out-of-sample R² or the β̂ norm, but these also
follow patterns identical to those for the full sample.
While the patterns are the same across subsamples, the magnitudes differ.
Average returns in the second sample are about half as large as in the first
sample. But volatilities are roughly the same, so IRs are about half as large
in the second sample. This is consistent with the machine’s trading patterns
plotted in Figure 10. Starting around 1968, the machine finds notably fewer
buying opportunities and, when it does, takes smaller positions than in the
earlier sample.
Finally, we compare the performance of the machine learning model with
a 12-month training window to a 12-month time-series momentum strategy
(Moskowitz, Ooi, and Pedersen (2012)). If regressors are highly persistent, they
will appear roughly static in a typical 12-month window. In this case, forecasts
from a high-complexity regression will behave very similarly to time-series
momentum.49 In Section VII of the Internet Appendix, we explain this issue
in more detail. We also show that our results are not driven by this “short
window and persistent regressor” mechanism. Instead, as emphasized in Sec-
tion V.E, our machine learning model performance is driven by relatively high-
frequency fluctuations among the predictors. We also show that the machine
learning timing strategy has economically large and statistically significant
alpha over time-series momentum.
VI. Conclusion
The field of asset pricing is in the midst of a boom in research applications us-
ing machine learning. The asset management industry is experiencing a paral-
lel boom in adopting machine learning to improve portfolio construction. How-
ever, the properties of portfolios based on such richly parameterized models
are not well understood.
In this paper, we offer new theoretical insights into the expected out-of-
sample behavior of machine learning portfolios. Building on recent advances
in the theory of high-complexity models from the machine learning literature,
we demonstrate a theoretical “virtue of complexity” for investment strategies
derived from machine learning models. Contrary to conventional wisdom, we
prove that market timing strategies based on ridgeless least squares generate
positive Sharpe ratio improvements for arbitrarily high levels of model com-
plexity. In other words, the performance of machine learning portfolios can
be theoretically improved by pushing model parameterization far beyond the
49 We are grateful to the editor for pointing this out.
number of training observations, even when minimal regularization is applied.
We provide a rigorous foundation for this behavior rooted in techniques from
random matrix theory. We complement these technical developments with in-
tuitive descriptions of the key statistical mechanisms.
In addition to establishing the virtue of complexity, we demonstrate that
out-of-sample R2from a prediction model is generally a poor measure of its
economic value. We prove that a market timing model can earn large economic
profits when R² is large and negative. This naturally recommends that the fi-
nance profession focus less on evaluating models in terms of forecast accuracy
and more on evaluating in economic terms, for example, based on the Sharpe
ratio of the associated strategy. We compare and contrast the implications of
model complexity for machine learning portfolio performance in correctly spec-
ified versus misspecified models.
Finally, we compare theoretically predicted behavior to the empirical be-
havior of machine learning–based trading strategies. The theoretical virtue
of complexity aligns remarkably closely with patterns in real-world data. In
a canonical empirical finance application—market return prediction and con-
comitant market timing strategies—we find out-of-sample IRs on the order
of 0.3 relative to a market buy-and-hold strategy, and these improvements are
highly statistically significant. The emerging strategies have some remarkable
attributes, behaving as long-only strategies that divest the market leading up
to recessions. Our high-complexity models learn this behavior without guid-
ance from researcher priors or modeling constraints.
Our results are not a license to add arbitrary predictors to a model. Instead,
we recommend (i) including all plausibly relevant predictors and (ii) using rich
nonlinear models rather than simple linear specifications. Doing so confers
prediction and portfolio benefits, even when training data are scarce, particu-
larly when accompanied by prudent shrinkage. Even when the number of raw
predictors is small, gains are achieved using those predictors in highly param-
eterized nonlinear prediction models.
This recommendation clashes with the philosophy of parsimony frequently
espoused by economists and famously articulated by the statistician George
Box:
Since all models are wrong, the scientist cannot obtain a ‘correct’ one by
excessive elaboration. On the contrary, following William of Occam he
should seek an economical description of natural phenomena. Just as the
ability to devise simple but evocative models is the signature of the great
scientist so overelaboration and overparameterization is often the mark of
mediocrity. (Box (1976))
Our theoretical analysis (along with that of Belkin et al. (2019), Hastie et al.
(2022), and Bartlett et al. (2020), among others) shows the flaw in this view—
Occam’s razor may instead be Occam’s blunder. Theoretically, we show that a
small model is preferable only if it is correctly specified. But as Box (1976)
emphasizes, models are never correctly specified. The logical conclusion is
that large models are preferable under fairly general conditions. The machine
learning literature demonstrates the preferability of large models in a wide
range of real-world prediction tasks. Our results indicate that the same is
likely true in finance and economics.
Our findings point to a number of interesting directions for future work,
such as studying the theoretical behavior of high-complexity models in cross-
sectional trading strategies and more extensive empirical investigation into
the virtue of complexity across different asset markets.
Initial submission: June 21, 2022; Accepted: December 16, 2022
Editors: Stefan Nagel, Philip Bond, Amit Seru, and Wei Xiong
REFERENCES
Abhyankar, Abhay, Devraj Basu, and Alexander Stremme, 2012, The optimal use of return pre-
dictability: An empirical study, Journal of Financial and Quantitative Analysis 47, 973–1001.
Ali, Alnur, J. Zico Kolter, and Ryan J. Tibshirani, 2019, A continuous-time view of early stopping
for least squares regression, in Kamalika Chaudhuri and Masashi Sugiyama, eds., Proceed-
ings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS),
volume 89, 1370–1378 (Naha, Okinawa, Japan), PMLR.
Allen-Zhu, Zeyuan, Yuanzhi Li, and Zhao Song, 2019, A convergence theory for deep learning via
over-parameterization, in Kamalika Chaudhuri and Ruslan Salakhutdinov, eds., Proceedings
of the 36th International Conference on Machine Learning, 242–252 (Long Beach, California),
PMLR 97.
Bai, Zhidong, and Wang Zhou, 2008, Large sample covariance matrices without independence
structures in columns, Statistica Sinica 18, 425–442.
Bartlett, Peter L., Philip M. Long, Gábor Lugosi, and Alexander Tsigler, 2020, Benign overfitting
in linear regression, Proceedings of the National Academy of Sciences 117, 30063–30070.
Belkin, Mikhail, Daniel Hsu, Siyuan Ma, and Soumik Mandal, 2019, Reconciling modern machine-
learning practice and the classical bias–variance trade-off, Proceedings of the National
Academy of Sciences 116, 15849–15854.
Belkin, Mikhail, Daniel Hsu, and Ji Xu, 2020, Two models of double descent for weak features,
SIAM Journal on Mathematics of Data Science 2, 1167–1180.
Belkin, Mikhail, Alexander Rakhlin, and Alexandre B. Tsybakov, 2019, Does data interpolation
contradict statistical optimality? In Kamalika Chaudhuri and Masashi Sugiyama, eds., Pro-
ceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AIS-
TATS), volume 89, 1611–1619 (Naha, Okinawa, Japan), PMLR.
Box, George E. P., 1976, Science and statistics, Journal of the American Statistical Association 71,
791–799.
Campbell, John Y., and Samuel B. Thompson, 2008, Predicting excess stock returns out of sample:
Can anything beat the historical average? Review of Financial Studies 21, 1509–1531.
Cenesizoglu, Tolga, and Allan Timmermann, 2012, Do return prediction models add economic
value? Journal of Banking & Finance 36, 2974–2987.
Chen, Luyang, Markus Pelger, and Jason Zhu, 2023, Deep learning in asset pricing, Management
Science, Articles in Advance, 1–37.
Cochrane, John H., 2011, Presidential address: Discount rates, Journal of Finance 66, 1047–1108.
Da, Rui, Stefan Nagel, and Dacheng Xiu, 2022, The statistical limit of arbitrage, Working paper,
Chicago Booth.
Dobriban, Edgar, and Stefan Wager, 2018, High-dimensional asymptotics of prediction: Ridge re-
gression and classification, The Annals of Statistics 46, 247–279.
Dong, Xi, Yan Li, David E. Rapach, and Guofu Zhou, 2022, Anomalies and the expected market
return, Journal of Finance 77, 639–681.
Du, Simon, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai, 2019, Gradient descent finds
global minima of deep neural networks, in Kamalika Chaudhuri and Ruslan Salakhutdinov,
eds., Proceedings of the 36th International Conference on Machine Learning, 1675–1685 (Long
Beach, California), PMLR.
Du, Simon S., Xiyu Zhai, Barnabas Poczos, and Aarti Singh, 2018, Gradient descent provably
optimizes over-parameterized neural networks, Working paper, arXiv, Cornell University.
Fan, Jianqing, Yingying Fan, and Jinchi Lv, 2008, High dimensional covariance matrix estimation
using a factor model, Journal of Econometrics 147, 186–197.
Fan, Jianqing, Jianhua Guo, and Shurong Zheng, 2022, Estimating number of factors by adjusted
eigenvalues thresholding, Journal of the American Statistical Association 117, 852–861.
Fan, Jianqing, Zheng Tracy Ke, Yuan Liao, and Andreas Neuhierl, 2022, Structural deep learning
in conditional asset pricing, Working paper, SSRN.
Ferson, Wayne E., and Andrew F. Siegel, 2001, The efficient use of conditioning information in
portfolios, Journal of Finance 56, 967–982.
Freyberger, Joachim, Andreas Neuhierl, and Michael Weber, 2020, Dissecting characteristics non-
parametrically, Review of Financial Studies 33, 2326–2377.
Gagliardini, Patrick, Elisa Ossola, and Olivier Scaillet, 2016, Time-varying risk premium in large
cross-sectional equity data sets, Econometrica 84, 985–1046.
Ghorbani, Behrooz, Song Mei, Theodor Misiakiewicz, and Andrea Montanari, 2020, When do neu-
ral networks outperform kernel methods? Advances in Neural Information Processing Systems
33, 14820–14830.
Giannone, Domenico, Michele Lenza, and Giorgio E. Primiceri, 2021, Economic predictions with
big data: The illusion of sparsity, Econometrica 89, 2409–2437.
Goyal, Amit, and Ivo Welch, 2008, A comprehensive look at the empirical performance of equity
premium prediction, Review of Financial Studies 21, 1455–1508.
Goyal, Amit, Ivo Welch, and Athanasse Zafirov, 2023, A comprehensive 2021 look at the empirical
performance of equity premium prediction II, Working paper, Swiss Finance Institute.
Gu, Shihao, Bryan Kelly, and Dacheng Xiu, 2020, Empirical asset pricing via machine learning,
Review of Financial Studies 33, 2223–2273.
Hansen, Lars Peter, and Scott F. Richard, 1987, The role of conditioning information in deducing
testable restrictions implied by dynamic asset pricing models, Econometrica 55, 587–613.
Hastie, Trevor, Andrea Montanari, Saharon Rosset, and Ryan J. Tibshirani, 2022, Surprises in
high-dimensional ridgeless least squares interpolation, The Annals of Statistics 50, 949–986.
Hornik, Kurt, Maxwell Stinchcombe, and Halbert White, 1990, Universal approximation of an un-
known mapping and its derivatives using multilayer feedforward networks, Neural Networks
3, 551–560.
Jacot, Arthur, Franck Gabriel, and Clément Hongler, 2018, Neural tangent kernel: Convergence
and generalization in neural networks, Advances in Neural Information Processing Systems
31.
Kelly, Bryan, and Seth Pruitt, 2013, Market expectations in the cross-section of present values,
Journal of Finance 68, 1721–1756.
Kelly, Bryan, and Dacheng Xiu, 2022, Financial machine learning, Working paper, Yale.
Koijen, Ralph, and Stijn Van Nieuwerburgh, 2011, Predictability of returns and cash flows, Annual
Review of Financial Economics 3, 467–491.
Kozak, Serhiy, Stefan Nagel, and Shrihari Santosh, 2020, Shrinking the cross-section, Journal of
Financial Economics 135, 271–292.
Ledoit, Olivier, and Sandrine Péché, 2011, Eigenvectors of some large sample covariance matrix
ensembles, Probability Theory and Related Fields 151, 233–264.
Ledoit, Olivier, and Michael Wolf, 2020, Analytical nonlinear shrinkage of large-dimensional co-
variance matrices, The Annals of Statistics 48, 3043–3065.
Leitch, Gordon, and J. Ernest Tanner, 1991, Economic forecast evaluation: Profits versus the con-
ventional error measures, American Economic Review 580–590.
Liu, Fanghui, Xiaolin Huang, Yudong Chen, and Johan A. K. Suykens, 2021, Random features
for kernel approximation: A survey on algorithms, theory, and beyond, IEEE Transactions on
Pattern Analysis and Machine Intelligence 44, 7128–7148.
Ludvigson, Sydney C., and Serena Ng, 2007, The empirical risk–return relation: A factor analysis
approach, Journal of Financial Economics 83, 171–222.
Marčenko, Vladimir A., and Leonid Andreevich Pastur, 1967, Distribution of eigenvalues for some
sets of random matrices, Mathematics of the USSR-Sbornik 1, 457.
Martin, Ian W. R., and Stefan Nagel, 2022, Market efficiency in the age of big data, Journal of
Financial Economics 145, 154–177.
Mei, Song, Theodor Misiakiewicz, and Andrea Montanari, 2022, Generalization error of random
feature and kernel methods: Hypercontractivity and kernel matrix concentration, Applied and
Computational Harmonic Analysis 59, 3–84.
Mei, Song, and Andrea Montanari, 2022, The generalization error of random features regres-
sion: Precise asymptotics and the double descent curve, Communications on Pure and Applied
Mathematics 75, 667–766.
Moskowitz, Tobias J., Yao Hua Ooi, and Lasse Heje Pedersen, 2012, Time series momentum,
Journal of Financial Economics 104, 228–250.
Rahimi, Ali, and Benjamin Recht, 2007, Random features for large-scale kernel machines, Ad-
vances in Neural Information Processing Systems 20.
Rahimi, Ali, and Benjamin Recht, 2008, Weighted sums of random kitchen sinks: Replacing mini-
mization with randomization in learning, Advances in Neural Information Processing Systems
21.
Rapach, David, and Guofu Zhou, 2013, Forecasting stock returns, in Graham Elliott and Allan
Timmermann, eds., Handbook of Economic Forecasting, volume 2, 328–383 (Elsevier).
Rapach, David, and Guofu Zhou, 2022, Asset pricing: Time-series predictability, Oxford Research
Encyclopedia of Economics and Finance.
Rapach, David E., Jack K. Strauss, and Guofu Zhou, 2010, Out-of-sample equity premium pre-
diction: Combination forecasts and links to the real economy, Review of Financial Studies 23,
821–862.
Rapach, David E., and Guofu Zhou, 2020, Time-series and cross-sectional stock return forecasting:
New machine learning methods, Machine Learning for Asset Management: New Developments
and Financial Applications 1–33.
Richards, Dominic, Jaouad Mourtada, and Lorenzo Rosasco, 2021, Asymptotics of ridge (less) re-
gression under general source condition, in Arindam Banerjee and Kenji Fukumizu, eds., Pro-
ceedings of the 24th International Conference on Artificial Intelligence and Statistics (AIS-
TATS), 3889–3897 (San Diego, California, USA), PMLR.
Rudi, Alessandro, and Lorenzo Rosasco, 2017, Generalization properties of learning with random
features, Advances in Neural Information Processing Systems 30.
Silverstein, Jack W., and Z. D. Bai, 1995, On the empirical distribution of eigenvalues of a class of
large dimensional random matrices, Journal of Multivariate Analysis 54, 175–192.
Spigler, Stefano, Mario Geiger, Stéphane d’Ascoli, Levent Sagun, Giulio Biroli, and Matthieu
Wyart, 2019, A jamming transition from under-to over-parametrization affects generalization
in deep learning, Journal of Physics A: Mathematical and Theoretical 52, 474001.
Sutherland, Danica J., and Jeff Schneider, 2015, On the error of random fourier features, Proceed-
ings of the Thirty-First Conference on Uncertainty in Artificial Intelligence 862–871.
Tsigler, Alexander, and Peter L. Bartlett, 2023, Benign overfitting in ridge regression, Journal of
Machine Learning Research 24, 123–131.
Wu, Denny, and Ji Xu, 2020, On the optimal weighted ℓ2 regularization in overparameterized
linear regression, Advances in Neural Information Processing Systems 33, 10112–10123.
Yaskov, Pavel, 2016, A short proof of the Marchenko–Pastur theorem, Comptes Rendus Mathema-
tique 354, 319–322.
Supporting Information
Additional Supporting Information may be found in the online version of this
article at the publisher’s website:
Appendix S1: Internet Appendix.
Replication Code.
... The integration of machine learning methods into financial prediction has emerged as one of the most active areas of research in empirical asset pricing (Kelly et al. 2024, Gu et al. 2020, Bianchi et al. 2021, Chen et al. 2024, Feng et al. 2020). The appeal is clear: while financial markets generate increasingly high-dimensional data, traditional econometric methods remain constrained by limited sample sizes and the curse of dimensionality. ...
... The pioneering work of Kelly et al. (2024) has significantly advanced our theoretical understanding by establishing rigorous conditions under which complex machine learning models can outperform traditional approaches in financial prediction. Their theoretical framework, grounded in random matrix theory, demonstrates that the conventional wisdom about overfitting may not apply in high-dimensional settings, revealing a genuine 'virtue of complexity' under appropriate conditions. ...
... To bridge this theory-practice gap, I conduct comprehensive numerical validation of the kernel approximation breakdown across realistic parameter spaces that span the configurations used in recent high-dimensional financial prediction studies (Kelly et al. 2024, Nagel 2025. The numerical analysis examines how within-sample standardization destroys the theoretical Gaussian kernel convergence that underlies existing RFF frameworks, quantifying the magnitude of approximation errors under practical implementation choices. ...
Preprint
Full-text available
Recent advances in machine learning have shown promising results for financial prediction using large, over-parameterized models. This paper provides theoretical foundations and empirical validation for understanding when and how these methods achieve predictive success. I examine three key aspects of high-dimensional learning in finance. First, I prove that within-sample standardization in Random Fourier Features implementations fundamentally alters the underlying Gaussian kernel approximation, replacing shift-invariant kernels with training-set dependent alternatives. Second, I derive sample complexity bounds showing when reliable learning becomes information-theoretically impossible under weak signal-to-noise ratios typical in finance. Third, VC-dimension analysis reveals that ridgeless regression's effective complexity is bounded by sample size rather than nominal feature dimension. Comprehensive numerical validation confirms these theoretical predictions, revealing systematic breakdown of claimed theoretical properties across realistic parameter ranges. These results show that when sample size is small and features are high-dimensional, observed predictive success is necessarily driven by low-complexity artifacts, not genuine high-dimensional learning.
... Bianchi et al. (2021) state that the success of neural networks in bond price predictions is due to the model's ability to capture complex nonlinearities in the data. This is echoed by Kelly et al. (2024), who establish a theoretical underpinning to the observation that machine learning models, such as our neural network, outperform linear models. A single study to date by Kim et al. (2021) applies a battery of machine learning methods to predict corporate bond yield spreads. ...
Article
Full-text available
This study is the first to examine the real estate-specific determinants of REIT bond risk premia. Using a dataset of 33,857 U.S. REIT bond yield spreads and 24 explanatory variables, we predict REIT bond yield spreads with a non-parametric artificial neural network algorithm and interpret the model’s predictions using the explainable machine learning method Accumulated Local Effect Plots (ALE). We report evidence of a direct real estate factor for U.S. REIT bond yield spreads proxied by real estate market total return and REIT property type. In addition, we find a property-type diversification risk premium for REIT bonds, indicating that there is no economic benefit in the form of lower cost of bond debt for most property-type diversification at the REIT-level. We argue that this is due to higher management and valuation complexity of diversified REIT portfolios. This study’s findings have relevant implications for REIT portfolio strategy and REIT capital structure decisions, as we show that specialized REITs generally have lower bond debt costs compared to diversified REITs. Moreover, a better understanding of the drivers influencing REIT bond risk premia helps investors to effectively manage bond portfolio risks.
... In the same vein, financial as well as medical data, well-known for their non-stationary nature and variability across multiple frequency scales, present significant analytical challenges [23,24,25]. These data are often considered complex processes akin to multifractional or even multifractal systems. ...
Article
In this paper, we introduce a novel and advanced multiscale approach to Granger causality testing, achieved by integrating Variational Mode Decomposition (VMD) with traditional statistical causality methods. Our approach decomposes complex time series data into intrinsic mode functions (IMFs), each representing a distinct frequency scale, thus enabling a more precise and granular analysis of causal relationships across multiple scales. By applying Granger causality tests to the stationary IMFs, we uncover causal patterns that are often concealed in aggregated data, providing a more comprehensive understanding of the underlying system dynamics. This methodology is implemented in a Python-based software package, featuring an intuitive, user-friendly interface that enhances accessibility for both researchers and practitioners. The integration of VMD with Granger causality significantly enhances the flexibility and robustness of causal analysis, making it particularly effective in fields such as finance, engineering, and medicine, where data complexity is a significant challenge. Extensive empirical studies, including analyses of cryptocurrency data, biomedical signals, and simulation experiments, validate the effectiveness of our approach. Our method demonstrates a superior ability to reveal hidden causal interactions, offering greater accuracy and precision than leading existing techniques.
Preprint
The performance of the data-dependent neural tangent kernel (NTK; Jacot et al. (2018)) associated with a trained deep neural network (DNN) often matches or exceeds that of the full network. This implies that DNN training via gradient descent implicitly performs kernel learning by optimizing the NTK. In this paper, we propose instead to optimize the NTK explicitly. Rather than minimizing empirical risk, we train the NTK to minimize its generalization error using the recently developed Kernel Alignment Risk Estimator (KARE; Jacot et al. (2020)). Our simulations and real data experiments show that NTKs trained with KARE consistently match or significantly outperform the original DNN and the DNN- induced NTK (the after-kernel). These results suggest that explicitly trained kernels can outperform traditional end-to-end DNN optimization in certain settings, challenging the conventional dominance of DNNs. We argue that explicit training of NTK is a form of over-parametrized feature learning.
Article
This study examines the predictive power of incident-based Environmental, Social and Governance (ESG) risk for Eurozone stock market returns using a forecast combination method. We find that our constructed indicator shows significant return predictability from both a statistical and an economic perspective, with an out-of-sample certainty equivalent return (CER) gain of 4.55% and a Sharpe ratio of 0.43, consistently outperforming the mean benchmark. Moreover, we find that the predictive power is concentrated in non-expansion periods. We attribute this mechanism to the firm's fundamentals, cash flow, and discount rate channels. Our findings highlight the value of ESG information for investors.
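The abstract does not spell out the combination scheme, but a standard baseline in this literature is the equal-weight combination of expanding-window univariate predictive regressions, evaluated against the historical-mean benchmark. A minimal sketch on simulated data (all names and parameters are ours, not the paper's):

```python
import numpy as np

def combination_forecast(R, Z, window):
    """Equal-weight combination of expanding-window univariate
    predictive-regression forecasts: predict R[t] from each Z[t-1, j]."""
    T, k = Z.shape
    fc = np.full(T, np.nan)
    for t in range(window, T):
        per_predictor = []
        for j in range(k):
            X = np.column_stack([np.ones(t - 1), Z[:t - 1, j]])
            b = np.linalg.lstsq(X, R[1:t], rcond=None)[0]
            per_predictor.append(b[0] + b[1] * Z[t - 1, j])
        fc[t] = np.mean(per_predictor)
    return fc

rng = np.random.default_rng(2)
T = 300
z = rng.standard_normal((T, 4))                       # 4 candidate predictors
r = np.empty(T)
r[0] = rng.standard_normal()
r[1:] = 0.2 * z[:-1, 0] + rng.standard_normal(T - 1)  # only z[:, 0] predicts

fc = combination_forecast(r, z, window=100)
idx = np.arange(100, T)
bench = np.array([r[:t].mean() for t in idx])         # historical-mean benchmark
r2_oos = 1 - np.sum((r[idx] - fc[idx]) ** 2) / np.sum((r[idx] - bench) ** 2)
print(f"out-of-sample R^2 vs historical mean: {r2_oos:.3f}")
```

Averaging across forecasts shrinks each noisy univariate slope toward zero, which is why combination forecasts tend to be more stable out of sample than any single predictive regression.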
Article
We provide the first systematic evidence on the link between long-short anomaly portfolio returns, a cornerstone of the cross-sectional literature, and the time-series predictability of the aggregate market excess return. Using 100 representative anomalies from the literature, we employ a variety of shrinkage techniques (including machine learning, forecast combination, and dimension reduction) to efficiently extract predictive signals in a high-dimensional setting. We find that long-short anomaly portfolio returns evince statistically and economically significant out-of-sample predictive ability for the market excess return. The predictive ability of anomaly portfolio returns appears to stem from asymmetric limits of arbitrage and overpricing correction persistence.
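One of the shrinkage routes mentioned above, dimension reduction, can be sketched concisely: extract a principal component from a panel of anomaly returns and use its lag in a predictive regression for the market. The data and names below are our own simulation, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(3)
T, k = 240, 100                                # months, anomaly portfolios
common = rng.standard_normal(T)                # latent factor in anomaly returns
anoms = 0.8 * common[:, None] + rng.standard_normal((T, k))
mkt = np.empty(T)
mkt[0] = rng.standard_normal()
mkt[1:] = 0.3 * common[:-1] + rng.standard_normal(T - 1)  # factor predicts market

# dimension reduction: first principal component of the anomaly panel
Ac = anoms - anoms.mean(axis=0)
_, _, Vt = np.linalg.svd(Ac, full_matrices=False)
pc1 = Ac @ Vt[0]

# predictive regression of the market excess return on the lagged component
X = np.column_stack([np.ones(T - 1), pc1[:-1]])
beta, *_ = np.linalg.lstsq(X, mkt[1:], rcond=None)
corr = np.corrcoef(pc1[:-1], mkt[1:])[0, 1]
print(f"slope on lagged PC1: {beta[1]:.4f}, predictive correlation: {corr:.3f}")
```

With 100 predictors and 240 observations, an unrestricted kitchen-sink regression would be badly overfit; compressing the panel to one component first is the point of the shrinkage step.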
Article
Our paper reexamines whether 29 variables from 26 papers published after Goyal and Welch (2008a), as well as the original 17 variables, were useful in predicting the equity premium in-sample and out-of-sample as of the end of 2021. Our samples include the original periods in which these variables were identified but end later. More than one-third of these new variables no longer have empirical significance even in-sample. Of those that do, half have poor out-of-sample performance. A small number of variables still perform reasonably well both in-sample and out-of-sample. (JEL G3, G4)
Article
We use deep neural networks to estimate an asset pricing model for individual stock returns that takes advantage of the vast amount of conditioning information, keeps a fully flexible form, and accounts for time variation. The key innovations are to use the fundamental no-arbitrage condition as criterion function to construct the most informative test assets with an adversarial approach and to extract the states of the economy from many macroeconomic time series. Our asset pricing model outperforms out-of-sample all benchmark approaches in terms of Sharpe ratio, explained variation, and pricing errors and identifies the key factors that drive asset prices. This paper was accepted by Agostino Capponi, finance. Supplemental Material: The online appendix and data are available at https://doi.org/10.1287/mnsc.2023.4695 .
Article
Interpolators, estimators that achieve zero training error, have attracted growing attention in machine learning, mainly because state-of-the-art neural networks appear to be models of this type. In this paper, we study minimum ℓ2-norm ("ridgeless") interpolation least squares regression, focusing on the high-dimensional regime in which the number of unknown parameters p is of the same order as the number of samples n. We consider two different models for the feature distribution: a linear model, where the feature vectors x_i ∈ ℝ^p are obtained by applying a linear transform to a vector of i.i.d. entries, x_i = Σ^{1/2} z_i (with z_i ∈ ℝ^p); and a nonlinear model, where the feature vectors are obtained by passing the input through a random one-layer neural network, x_i = φ(W z_i) (with z_i ∈ ℝ^d, W ∈ ℝ^{p×d} a matrix of i.i.d. entries, and φ an activation function acting componentwise on W z_i). We recover, in a precise quantitative way, several phenomena that have been observed in large-scale neural networks and kernel machines, including the "double descent" behavior of the prediction risk and the potential benefits of overparametrization.
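The ridgeless interpolator the abstract refers to is easy to exhibit numerically: when p > n, infinitely many coefficient vectors fit the data exactly, and the pseudoinverse selects the one of minimum ℓ2 norm. A minimal sketch on toy data of our own:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 50, 200                                 # more parameters than observations
X = rng.standard_normal((n, p))
y = X @ (rng.standard_normal(p) / np.sqrt(p)) + 0.1 * rng.standard_normal(n)

# minimum l2-norm ("ridgeless") interpolator: the pseudoinverse picks the
# least-norm solution among the infinitely many that fit the data exactly
beta_hat = np.linalg.pinv(X) @ y
print("max training residual:", np.abs(X @ beta_hat - y).max())

# it coincides with the ridge solution in the limit lambda -> 0
ridge_limit = X.T @ np.linalg.solve(X @ X.T + 1e-8 * np.eye(n), y)
print("matches ridge limit:", np.allclose(beta_hat, ridge_limit, atol=1e-5))
```

The ridge-limit identity, beta_hat = Xᵀ(XXᵀ)⁻¹y for full-row-rank X, is what makes "ridgeless" an apt name: the interpolator is the endpoint of the ridge regularization path.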
Article
For a certain scaling of the initialization of stochastic gradient descent (SGD), wide neural networks (NNs) have been shown to be well approximated by reproducing kernel Hilbert space (RKHS) methods. Recent empirical work showed that, for some classification tasks, RKHS methods can replace NNs without a large loss in performance. On the other hand, two-layer NNs are known to encode richer smoothness classes than RKHS, and we know of special examples for which SGD-trained NNs provably outperform RKHS. This is true even in the wide-network limit, for a different scaling of the initialization. How can we reconcile the above claims? For which tasks do NNs outperform RKHS? If covariates are nearly isotropic, RKHS methods suffer from the curse of dimensionality, while NNs can overcome it by learning the best low-dimensional representation. Here we show that this curse of dimensionality becomes milder if the covariates display the same low-dimensional structure as the target function, and we precisely characterize this tradeoff. Building on these results, we present the spiked covariates model, which can capture in a unified framework both behaviors observed in earlier work. We hypothesize that such a latent low-dimensional structure is present in image classification. We test this hypothesis numerically by showing that specific perturbations of the training distribution degrade the performance of RKHS methods much more significantly than that of NNs.
Article
Consider the classical supervised learning problem: we are given data (y_i, x_i), i ≤ n, with y_i a response and x_i ∈ X a covariate vector, and try to learn a model f̂ : X → ℝ to predict future responses. Random feature methods map the covariate vector x_i to a point φ(x_i) in a higher-dimensional space ℝ^N via a random featurization map φ. We study the use of random feature methods in conjunction with ridge regression in the feature space ℝ^N. This can be viewed as a finite-dimensional approximation of kernel ridge regression (KRR), or as a stylized model for neural networks in the so-called lazy training regime. We define a class of problems satisfying certain spectral conditions on the underlying kernels and a hypercontractivity assumption on the associated eigenfunctions. These conditions are verified by classical high-dimensional examples. Under these conditions, we prove a sharp characterization of the error of random feature ridge regression. In particular, we address two fundamental questions: (1) What is the generalization error of KRR? (2) How big should N be for the random feature approximation to achieve the same error as KRR? In this setting, we prove that KRR is well approximated by a projection onto the top ℓ eigenfunctions of the kernel, where ℓ depends on the sample size n. We show that the test error of random feature ridge regression is dominated by its approximation error and is larger than the error of KRR as long as N ≤ n^{1−δ} for some δ > 0. We characterize this gap. For N ≥ n^{1+δ}, random features achieve the same error as the corresponding KRR, and further increasing N does not lead to a significant change in test error.
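The N-versus-n tradeoff described above can be seen in a small simulation. The sketch below uses random Fourier features (the Rahimi-Recht construction) for an RBF kernel as one concrete instance of the random featurization map φ; all names, data, and parameter values are our own, and the gap to exact KRR shrinks as N grows:

```python
import numpy as np

rng = np.random.default_rng(5)
d, n, n_test, gamma, lam = 5, 300, 200, 0.5, 1e-2
Xtr, Xte = rng.standard_normal((n, d)), rng.standard_normal((n_test, d))
ytr = np.sin(Xtr[:, 0]) + Xtr[:, 1] ** 2 + 0.1 * rng.standard_normal(n)

def rbf(A, B):                                 # k(x, x') = exp(-gamma ||x - x'||^2)
    return np.exp(-gamma * ((A[:, None] - B[None]) ** 2).sum(-1))

# exact kernel ridge regression predictions on the test points
alpha = np.linalg.solve(rbf(Xtr, Xtr) + lam * n * np.eye(n), ytr)
krr_pred = rbf(Xte, Xtr) @ alpha

def rff_gap(N):
    """Mean squared gap between N-feature ridge and exact KRR predictions."""
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, N))  # random Fourier features
    b = rng.uniform(0, 2 * np.pi, N)
    phi = lambda X: np.sqrt(2 / N) * np.cos(X @ W + b)
    theta = np.linalg.solve(phi(Xtr).T @ phi(Xtr) + lam * n * np.eye(N),
                            phi(Xtr).T @ ytr)
    return np.mean((phi(Xte) @ theta - krr_pred) ** 2)

for N in (50, 5000):
    print(f"N = {N:5d}: gap to KRR = {rff_gap(N):.5f}")
```

Here N plays exactly the role it does in the abstract: for N much smaller than n the feature approximation error dominates, while for N well above n the random feature predictor is essentially indistinguishable from KRR.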
Article
We show that classical learning methods interpolating the training data can achieve optimal rates for the problems of nonparametric regression and prediction with square loss.