Empirical Asset Pricing via Ensemble Gaussian Process Regression∗
Damir Filipović†  Puneet Pasricha‡
January 2, 2025
Abstract
We introduce an ensemble learning method based on Gaussian Process Regression (GPR) for
predicting conditional expected stock returns given stock-level and macro-economic information.
Our ensemble learning approach significantly reduces the computational complexity inherent
in GPR inference and lends itself to general online learning tasks. We conduct an empirical
analysis on a large cross-section of US stocks from 1962 to 2016. We find that our method
dominates existing machine learning models statistically and economically in terms of out-of-
sample R-squared and Sharpe ratio of prediction-sorted portfolios. Exploiting the Bayesian
nature of GPR, we introduce the mean-variance optimal portfolio with respect to the prediction
uncertainty distribution of the expected stock returns. It appeals to an uncertainty averse
investor and significantly dominates the equal- and value-weighted prediction-sorted portfolios,
which outperform the S&P 500.
Keywords: empirical asset pricing, Gaussian process regression, portfolio selection, ensemble
learning, machine learning, firm characteristics
JEL classification: C11, C14, C52, C55, G11, G12
∗We benefited from discussions with Markus Pelger, Dacheng Xiu, Semyon Malamud, and seminar and conference partici-
pants at Platform for Advanced Scientific Computing (PASC) Conference 2021, SIAM Conference on Financial Mathematics
and Engineering 2021 (FM21), Euler Institute Research Seminar at University of Lugano, and SFI Research Days 2022. We
also thank the editor and two anonymous reviewers for helpful comments.
†École Polytechnique Fédérale de Lausanne and Swiss Finance Institute, Email: damir.filipovic@epfl.ch
‡Indian Institute of Technology Ropar, Email: puneet.pasricha@iitrpr.ac.in
1 Introduction
A central problem of empirical asset pricing is the prediction of conditional expected stock returns
given the information set of market participants.1 Conditional expected returns are notoriously
hard to predict, for a number of reasons. First, financial markets are very noisy and exhibit a low
signal-to-noise ratio when compared to other domains such as computer vision, which is arguably
due to the efficiency of the markets. Second, the full information set is not observable and likely too
complex to model. The literature has accumulated a non-exhaustive list of predictive stock-level
characteristics and macro-economic variables, which continuous to increase. Third, the relation
between predictors and returns is evidently non-linear and time-varying due to the dynamically
evolving economic conditions, which further complicates the problem.
Over the years, a wide range of methods have been proposed to improve the prediction perfor-
mance, from traditional statistical methods (Welch and Goyal [2008], Koijen, Moskowitz, Pedersen,
and Vrugt [2018] etc.) to the modern machine learning methods (Gu, Kelly, and Xiu [2020], Chen,
Pelger, and Zhu [2022], Gu, Kelly, and Xiu [2021] etc.). Interest in machine learning methods for
empirical asset pricing has grown tremendously in academic finance and the industry. The leading
example in this vein is Gu et al. [2020], who outline a new research agenda marrying machine
learning with empirical asset pricing. They compare several machine learning methods for predict-
ing stock returns, namely linear regression, generalized linear models with penalization, random
forests, and neural networks. Other recent studies, such as Chen et al. [2022], Gu et al. [2021],
further explore machine learning methods for empirical asset pricing, which are primarily based on
neural networks. These articles provide empirical evidence that neural networks can significantly
improve the predictive performance over traditional statistical methods and thus improve our em-
pirical understanding of stock returns. However, focusing exclusively on prediction performance
often overlooks an equally important aspect: the uncertainty inherent in these predictions. Quanti-
fying uncertainty plays a significant role in finance when it comes to making decisions. Capturing
uncertainties is crucial and has significant implications because the quality of our predictions di-
rectly impacts the quality of applications we build upon these predictions, for example, portfolio
selection, hedging, or speculation.
This paper adopts a broader perspective, stepping back from the traditional “horse race” to
improve prediction accuracy. Instead, it leverages kernel methods in machine learning, particu-
larly Gaussian process regression (GPR), for predicting conditional expected stock returns given
observable stock-level characteristics and macroeconomic variables. As a Bayesian non-parametric
method, the predictions from a GPR model naturally come along with a predictive distribution.
In particular, it gives confidence intervals for the predicted conditional expected returns. As a
novel application, we harness these uncertainty estimates by incorporating them into portfolio construction, showcasing the methodological advantages of GPR over conventional machine learning models,
which often give point estimates of return without any quantification of uncertainty.2
1 Conditional expected returns in excess of the risk-free rate are also referred to as the risk premium.
2 Standard error estimates for some machine learning algorithms are available in the literature. For instance, Giordano, Rocca, and Perna [2002] investigate using the AR-Sieve bootstrap method to estimate the standard error of the sampling distribution of the neural network predictive values in a regression model with dependent errors. Farrell, Liang, and Misra [2021] use a semi-parametric framework to provide non-asymptotic high-probability bounds for neural network predictions. Estimates of the standard errors of predictions from random forests and LASSO are obtained by Wager, Hastie, and Efron [2014] and Casella, Ghosh, Gill, and Kyung [2010], respectively. Although these articles attempt to estimate prediction uncertainties, they are either computationally expensive or lack the theoretical rigor of GPR.
Our paper makes several methodological and empirical contributions. First, while Gu et al.
[2020] study generalized linear models with nonlinear transformations of the original features, they
restrict these transformations to second-order splines. As such, they miss the full range of kernel ridge
regression, which comes with a strong mathematical framework and constitutes a powerful machine
learning method that works well on medium data sizes. Gaussian processes provide an alternative
view on kernel ridge regression, by modeling a distribution over functions and performing inference directly in function space.3 We bridge this gap in Gu et al. [2020] and establish a link between two significant and growing research areas, kernel methods in machine learning and empirical asset pricing in financial economics. Thus, we contribute to the emerging literature on machine learning for empirical asset pricing.

3 Gaussian processes are powerful mathematical objects that have enjoyed success in many practical applications. They have very close connections to other regression techniques, such as kernel ridge regression, support vector machines and linear regression with radial basis functions. Gaussian processes provide a mathematical framework for many well-known models, including Bayesian linear models, spline models, and large neural networks (under suitable conditions), see Williams and Rasmussen [2006].
Second, as mentioned earlier, being a Bayesian method, the predictions from a GPR model
come along with a predictive distribution. As a novel application, we harness these uncertainty
estimates. We find that incorporating uncertainty in portfolio construction leads to substantial
statistical and economic improvements in terms of out-of-sample R-squared and Sharpe ratio of
prediction-sorted portfolios, respectively. More precisely, we first use the predictive covariance
matrix to construct a minimum uncertainty-weighted (UW) decile portfolio in the spirit of a global
minimum variance portfolio. We find that the UW portfolio delivers a significantly higher out-of-sample predictive pooled R-squared, $R^2_{\text{pool}}$, than the two traditional portfolios, namely the equal-weighted (EW) and value-weighted (VW) portfolios. Motivated by this finding, we further exploit the
predictive covariance matrix and introduce two new portfolios. The first is a prediction-weighted (PW) portfolio, originally proposed by Kaniel, Lin, Pelger, and Van Nieuwerburgh [2022], which takes advantage of the ranking (to form decile portfolios) and the relative strengths of the predictions.
The second is a prediction-uncertainty-weighted (PUW) portfolio in the spirit of the mean-variance
optimal portfolio that gives more weight to stocks with higher predicted returns and minimizes
uncertainty at the same time, which appeals to an uncertainty averse investor. We find that PW
and PUW portfolios generate large economic gains, in terms of Sharpe ratio, compared to EW and
VW portfolios.
Third, a well-known issue with implementing GPR is the need to invert the kernel matrix re-
peatedly for the computation of the marginal log-likelihood function, which has a fundamental time
complexity of the order $O(N^3)$, where $N$ is the size of the training sample. This limitation pro-
hibits, both in time and memory space, using GPR for large datasets. To tackle this computational
bottleneck of GPR, we introduce an easy-to-implement ensemble learning method in the spirit of
the mixture-of-experts approach. Specifically, we partition the large training sample into subsets,
apply individual GPRs on all subsets in parallel, and obtain a predictive distribution conditional on
the full training data by mixing the predictive distributions over the subsets. This also addresses
one of the concerns associated with neural networks in financial applications. Neural networks
thrive in data-rich environments. With so many parameters to learn, they require massive training
data and are computationally costly to train. Much of the literature on neural networks in em-
pirical asset pricing focuses on monthly returns of a large cross-section of stocks spanning several
decades. It is questionable whether neural networks would perform reasonably when restricted to
smaller training sets, such as single industry sectors, or the S&P 500. Such samples only have a few
thousand data points, which seems unsuitable for neural networks. What’s more, neural networks
hardly adapt to online learning, so one has to retrain the whole network when new data arrives.4
In other words, neural networks are computationally intensive and hence not scalable in scenarios
where data comes in sequentially, as with financial data. In contrast, our ensemble learning ap-
proach offers several benefits: (i) it allows a straightforward parallel implementation of GPR on
small training subsets, thus reducing the computational cost; (ii) it scales well with sample size
and naturally lends itself to an online learning framework; and (iii) its data-driven mixing weight
scheme takes into account the non-stationarity and heteroscedasticity present in the financial data.

4 Gu et al. [2020] re-fit the neural network once every year.
Fourth, our empirical analysis confirms the insights from the growing literature that machine
learning methods have excellent potential to predict conditional expected returns. In particular, we
show that a simple GPR model with very few hyperparameters outperforms the benchmark models,
both statistically and economically, in terms of the out-of-sample R-squared and Sharpe ratio of the
prediction-sorted portfolios, respectively. Concretely, we conduct an extensive empirical analysis,
investigating monthly returns of a large cross-section of US stocks from 1962 to 2016. Our features
include 94 time-varying stock-specific characteristics. We compare the performance of our GPR model with a non-linear kernel against linear benchmark models, specifically ensemble GPR with an affine kernel, ensemble linear regression, and standard linear regression. We find that our model
outperforms, out-of-sample, these benchmarks in predicting individual stock returns. In particular,
our model generates an $R^2_{\text{pool}}$ of 0.96%, compared to 0.74% for the ensemble GPR model with affine kernel, 0.61% for the ensemble linear regression model, and 0.37% for the one-batch linear regression.
We also evaluate the predictive performance of our model based on two alternative metrics. The
first is the time average of the monthly R-squared, $R^2_{\text{avg}}$, which gives equal weight to every month, in contrast to $R^2_{\text{pool}}$, which places more weight on months with larger cross-sections. The second is the information coefficient (IC), which quantifies the model's ability to differentiate the relative performance among stocks, disregarding the absolute levels of predictions. We find that $R^2_{\text{avg}}$ and IC are 0.58% and 7.3%, respectively, for our model.
We also assess our model’s predictive performance at the portfolio level. In particular, we form
decile portfolios (bottom D1 to top D10) sorted on out-of-sample stock return predictions from our
model. Our UW portfolios achieve a higher R-squared than EW and VW portfolios for each decile.
Further, when assessed over the grand panel of all decile portfolios, UW generates an $R^2_{\text{pool}}$ of 12.8%, compared to 8.17% and 2.94% for EW and VW. The more pronounced predictive power of the
compared to 8.17% and 2.94% for EW and VW. The more pronounced predictive power of the
UW portfolio shows that reducing prediction uncertainty is economically significant. We also find
that the economic gains from the decile portfolios constructed using our predictions are large in
terms of Sharpe ratio. For example, the long-short portfolio from PUW with uncertainty-aversion value 20 yields an annualized out-of-sample Sharpe ratio of 3.66, which is comparable to the 3.68 from PW, however at a significantly lower volatility. Moreover, it outperforms the Sharpe ratios of 3.18
(EW) and 1.12 (VW), which confirms that reducing prediction uncertainty matters economically.
Further, our prediction uncertainty based portfolios outperform the corresponding portfolios from
the linear benchmark models.
We find that the most important features are related to liquidity, which includes variables such
as bid-ask spread (baspread), dollar volume (dolvol), turnover volatility (SD turn), Amihud illiq-
uidity (ill), and recent price trends, including short-term reversal (mom1m and mom6m), stock
momentum (mom12m), momentum change (chmom), and long-term reversal (mom36m). In gen-
eral, we find that our model is inclusive and extracts predictive information from a wide range of
features. We also investigate the cross-sectional heterogeneity in predicted returns and prediction
uncertainty. We find that stocks with high predicted returns tend to be less liquid, and stocks with
high prediction uncertainty are the ones with limits to arbitrage frictions and that exhibit extreme
illiquidity. Stocks with higher predicted returns have a higher 6-month (mom6m) and 12-month
(mom12m) momentum and a lower 1-month (mom1m) momentum, which suggests momentum over
longer horizons and reversal over short horizons.
In sum, our paper confirms the great potential of machine learning for predicting conditional
expected returns. We contribute to this understanding by adding prediction uncertainty, which
greatly improves the performance of prediction-sorted portfolios.
Our paper contributes to the fast emerging literature on machine learning for empirical asset
pricing.5 In their pioneering work, Gu et al. [2020] conduct a comparative study of several machine learning methods to predict the grand panel of individual stock returns from the US markets and demonstrate the advantages of machine learning methods over traditional approaches. Similar
studies are performed for European stock markets, Drobetz and Otto [2021], and bond markets, Bianchi, Büchner, and Tamoni [2021]. Gu et al. [2021] use an autoencoder neural network to
demonstrate that imposing economic structure on a machine learning algorithm can substantially
improve the estimation. In the same spirit, Chen et al. [2022] use deep neural networks with
the fundamental no-arbitrage condition as a criterion function to estimate an asset pricing model
for individual stock returns. These articles, however, exclude the important class of kernel-based
models. We bridge this gap and show that our simple ensemble method based on GPR dominates
the performance of their best benchmark models, which are based on neural networks. Our model
leads to a better out-of-sample predictive R-squared, and taking the estimates of the prediction accuracy into account further leads to better portfolio performance in terms of Sharpe ratio.

5 The literature on traditional empirical asset pricing can essentially be divided into two broad categories: time-series and cross-sectional prediction models. For the former see, e.g., Welch and Goyal [2008], Koijen et al. [2018] and references therein. Cross-sectional models aim at explaining differences in the cross-section of stock returns and do so by regressing returns on stock-level characteristics, e.g., past returns, turnover, etc., and macro-economic variables. See, e.g., Fama and French [2008] and references therein. The main limitation of these traditional regression models is their inability to incorporate a large number of features and non-linear dependencies.
The adoption of GPR in finance is only recent, although GPR has demonstrated much success outside of finance under the name of kriging. Williams and Rasmussen [2006] provide an extensive
background on GPR models and highlight its applications in various fields. For instance, they
emphasize that Gaussian processes can be viewed as a Bayesian non-parametric generalization of
well-known econometrics techniques. In particular, a time series model, AR(p), is a discrete-time equivalent of a Gaussian process model with a Matérn covariance function with an appropriate hyperparameter choice. Han, Zhang, and Wang [2016] combine Gaussian process state space models with stochastic volatility models and propose a GPR stochastic volatility (GPRSV) model to predict the volatility of stock returns. They also present an adjusted Markov chain Monte Carlo method to estimate their model and demonstrate, through empirical analysis, its superior predictive performance over the traditional GARCH and stochastic volatility models. De Spiegeleer, Madan, Reyners, and
Schoutens [2018] show how GPR can be deployed to approximate the derivative pricing function,
for instance, pricing of exotic options under advanced models. They also study the application of
GPR in the fitting of sophisticated Greek curves and implied volatility surfaces. Their numerical
findings suggest that GPR could deliver a speed-up of a factor of several magnitudes relative to
other benchmark methods for the respective problems. Cousin, Maatouk, and Rullière [2016] introduce shape-constrained Gaussian processes for yield-curve and credit default swap (CDS) curve interpolation. Filipović, Pelger, and Ye [2022b] and Filipović, Pelger, and Ye [2022a] introduce an economically motivated kernel ridge regression method for estimating the yield and return curves of treasury bonds. Our paper seems to be the first that applies GPR to predicting conditional expected stock returns.
Our paper also contributes to the literature on scalable GPR. Our ensemble learning method
complements existing approaches, such as inducing point methods (Quinonero-Candela and Rasmussen [2005]) and the subset of regressors (SoR) approach (Silverman [1985]), which approximate the kernel matrix using a reduced set of inducing points. Unlike inducing point methods, our ensemble
learning approach does not rely on selecting a subset of points. Instead, it leverages the diversity
of multiple Gaussian processes applied to different subsets of the data. Alternative approaches
include Kronecker and Toeplitz methods, which exploit data structure to enable computationally
efficient modeling for gridded or separable datasets (Wilson [2014]). These methods were fur-
ther advanced in Wilson and Nickisch [2015], where the authors introduce Kernel Interpolation
for Scalable Structured Gaussian Processes (KISS-GP), combining the inducing point framework
with structure-exploiting techniques. This approach is extended in Wilson, Hu, Salakhutdinov, and
Xing [2016] by integrating it with neural networks. More recently, Gardner, Pleiss, Weinberger,
Bindel, and Wilson [2018] utilize GPU acceleration to perform efficient matrix computations for
large-scale GPR. This GPU-based acceleration can also enhance our ensemble method, where each
individual Gaussian process model is implemented to take advantage of GPU computing.
The paper is organized as follows. Section 2 introduces the model framework and our ensemble
learning approach. Section 3 contains our empirical analysis. Section 4 concludes. The appendix
contains background material on Gaussian processes, on the kernel selection, and additional results.
2 Methodology
Consider a financial market consisting of assets $i$ in discrete time $t = 0, 1, 2, \ldots$, where $t$ represents the end of a month. More specifically, we denote by $I_{t+1}$ the index set of assets $i$ that exist during the period $[t, t+1]$. At any time $t$, for any asset $i \in I_{t+1}$, we observe the vector of predictor variables $x = x_{i,t}$ with values in a feature space $\mathcal{X}$, consisting of asset $i$-specific characteristics and common macro-economic variables.
We denote by $r_{i,t+1}$ the excess log return (henceforth simply referred to as "return" if there is no risk of confusion) of asset $i$ over the period $[t, t+1]$. Following Gu et al. [2020], we describe it by an additive prediction error model,
\[ r_{i,t+1} = E_t(r_{i,t+1}) + \epsilon_{i,t+1}, \tag{1} \]
where $E_t$ denotes the conditional expectation given the information available at time $t$. The errors $\epsilon_{i,t+1}$ are due to market imperfections and idiosyncratic noise. We assume that the conditional expected return is given by a common function $f : \mathcal{X} \to \mathbb{R}$,
\[ E_t(r_{i,t+1}) = f(x_{i,t}), \quad \text{for all } i \in I_{t+1} \text{ and } t. \tag{2} \]
At any given time $t$, our goal is to learn $f$ from past data $\mathcal{D} = \{(x_{i,j-1}, r_{i,j}),\ i \in I_j,\ j = 1, 2, \ldots, t\}$, and then predict next period's conditional expected returns, $\hat{r}_{i,t+1} = f(x_{i,t})$, for $i \in I_{t+1}$. We assume that the function $f$ depends neither directly on the index $i$ nor on time $t$, so that it can be learned efficiently using all instances in the panel data set $\mathcal{D}$.
2.1 Ensemble learning method based on Gaussian process regression
We use Gaussian process regression (GPR) to learn the function $f$ in (1) and (2). Thereto we assume that $f$ is a Gaussian process with some pre-specified prior mean function $m(\cdot)$ and covariance function, or kernel, $k(\cdot,\cdot)$. We assume that the errors $\epsilon_{i,t+1}$ are i.i.d. Gaussian random variables with mean 0 and variance $\sigma^2_\epsilon$, independent of $f$. This means that the joint distribution of the past returns $r_{1:t} = \{r_{i,j},\ i \in I_j,\ j = 1, 2, \ldots, t\}$ and the conditional expected returns $f(\mathbf{x}_t) = \{f(x_{i,t}),\ i \in I_{t+1}\}$ is Gaussian of the form
\[
\begin{pmatrix} r_{1:t} \\ f(\mathbf{x}_t) \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} m(\mathbf{x}_{0:t-1}) \\ m(\mathbf{x}_t) \end{pmatrix},\ \begin{pmatrix} k(\mathbf{x}_{0:t-1}, \mathbf{x}_{0:t-1}) + \sigma^2_\epsilon I & k(\mathbf{x}_{0:t-1}, \mathbf{x}_t) \\ k(\mathbf{x}_t, \mathbf{x}_{0:t-1}) & k(\mathbf{x}_t, \mathbf{x}_t) \end{pmatrix} \right), \tag{3}
\]
where $I$ is the identity matrix, and $\mathbf{x}_{0:t-1} = \{x_{i,j-1},\ i \in I_j,\ j = 1, 2, \ldots, t\}$ and $\mathbf{x}_t = \{x_{i,t},\ i \in I_{t+1}\}$ denote the arrays of features observed by time $t-1$ and at $t$, respectively. Here, for a function $g : \mathcal{X} \to \mathbb{R}$ and an array $\mathbf{x} = \{x_i\}$ of points in $\mathcal{X}$, we denote by $g(\mathbf{x}) = \{g(x_i)\}$ the corresponding array of function values. In particular, $k(\mathbf{x}_{0:t-1}, \mathbf{x}_{0:t-1})$ is the $N \times N$ matrix of the covariances evaluated at all pairs of the past features $\mathbf{x}_{0:t-1}$, where $N = \sum_{j=1}^{t} |I_j|$ denotes the size of the sample. Similarly, $k(\mathbf{x}_{0:t-1}, \mathbf{x}_t)$ is the $N \times |I_{t+1}|$ matrix of the covariances evaluated at all pairs of the past features $\mathbf{x}_{0:t-1}$ and current features $\mathbf{x}_t$.
The predictive (posterior) distribution of $f$ given the data $\mathcal{D}$ is again Gaussian, with mean and covariance functions given by
\[
\hat{m}(x) = m(x) + k(x, \mathbf{x}_{0:t-1}) \left( k(\mathbf{x}_{0:t-1}, \mathbf{x}_{0:t-1}) + \sigma^2_\epsilon I \right)^{-1} (r_{1:t} - m(\mathbf{x}_{0:t-1})), \tag{4}
\]
\[
\hat{k}(x, x') = k(x, x') - k(x, \mathbf{x}_{0:t-1}) \left( k(\mathbf{x}_{0:t-1}, \mathbf{x}_{0:t-1}) + \sigma^2_\epsilon I \right)^{-1} k(\mathbf{x}_{0:t-1}, x'). \tag{5}
\]
The mean $\hat{m}(\mathbf{x}_t)$ is then our prediction of the conditional expected return vector $r_{t+1}$, and $\hat{k}(\mathbf{x}_t, \mathbf{x}_t)$ represents the covariance of the Bayesian uncertainty of our prediction.
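For concreteness, the following Python sketch computes the posterior mean and covariance in (4) and (5) for given kernel matrices. The interface and the Cholesky-based solve are our own illustrative choices; the paper does not prescribe an implementation.

```python
import numpy as np

def gp_posterior(K_train, K_cross, K_test, y_train, sigma2_eps=1e-10):
    """Posterior mean and covariance of a GP, following (4)-(5) with m = 0.

    K_train : (N, N) kernel matrix k(x_{0:t-1}, x_{0:t-1})
    K_cross : (N, M) kernel matrix k(x_{0:t-1}, x_t)
    K_test  : (M, M) kernel matrix k(x_t, x_t)
    y_train : (N,) observed excess log returns r_{1:t}
    """
    A = K_train + sigma2_eps * np.eye(len(y_train))
    # Solve with a Cholesky factorization rather than forming A^{-1} explicitly.
    L = np.linalg.cholesky(A)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    post_mean = K_cross.T @ alpha                 # eq. (4) with zero prior mean
    V = np.linalg.solve(L, K_cross)               # L^{-1} k(x_{0:t-1}, x_t)
    post_cov = K_test - V.T @ V                   # eq. (5)
    return post_mean, post_cov
```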
In our empirical analysis, we assume that the errors $\epsilon_{i,t+1}$ are economically non-significant, with a small fixed variance of $\sigma^2_\epsilon = 10^{-10}$. This may seem at odds with the anecdotal low signal-to-noise ratio of stock returns. But we can argue that the Bayesian uncertainty of our predictor also adds noise to the signal, which here is the true conditional expected return. On the other hand, adding $\sigma^2_\epsilon I$ to the kernel matrix $k(\mathbf{x}_{0:t-1}, \mathbf{x}_{0:t-1})$ nevertheless has a regularizing effect, as the latter may be ill-conditioned. We further set the prior mean function $m(\cdot)$ equal to zero. This is motivated by the empirical fact that zero predictions perform better than the historical mean of excess log returns, see Gu et al. [2020]. Moreover, it is well documented in the literature that a zero prior mean usually works well for GPR, see De Spiegeleer et al. [2018], Williams and Rasmussen [2006].
The kernel $k(\cdot,\cdot)$ of the prior distribution of $f$ in (3) depends on hyperparameters, which are estimated from the training data by maximizing the marginal log-likelihood, given by
\[
\log p(r_{1:t} \mid \mathbf{x}_{0:t-1}) = -\frac{N \log(2\pi)}{2} - \frac{r_{1:t}^\top \left( k(\mathbf{x}_{0:t-1}, \mathbf{x}_{0:t-1}) + \sigma^2_\epsilon I \right)^{-1} r_{1:t}}{2} - \frac{\log\det\left( k(\mathbf{x}_{0:t-1}, \mathbf{x}_{0:t-1}) + \sigma^2_\epsilon I \right)}{2}. \tag{6}
\]
A known challenge in GPR is that the computation of the log-likelihood function (6) involves repeated inversion of the regularized kernel matrix, which takes time of the order $O(N^3)$. This is only feasible for small $N$ (less than several thousand), which is not the case for the problem at hand, for which $N$ is of the order of millions.
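As an illustration of this estimation step, the sketch below fits the hyperparameters of a single GPR by maximizing (6), using the $\gamma$-exponential kernel that is selected later in Section 3.2. The log/logit parametrization of the hyperparameters and the choice of optimizer are our assumptions, not the paper's.

```python
import numpy as np
from scipy.optimize import minimize

def gamma_exp_kernel(X1, X2, sigma, ell, gamma):
    # K10: sigma^2 * exp(-(||x - x'|| / ell)^gamma)
    d = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=-1)
    return sigma**2 * np.exp(-((d / ell) ** gamma))

def neg_log_marginal_likelihood(theta, X, y, sigma2_eps=1e-10):
    # theta = (log sigma, log ell, unconstrained parameter mapped to gamma in (0, 2))
    sigma, ell = np.exp(theta[0]), np.exp(theta[1])
    gamma = 2.0 / (1.0 + np.exp(-theta[2]))
    K = gamma_exp_kernel(X, X, sigma, ell, gamma) + sigma2_eps * np.eye(len(y))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # negative of eq. (6): quadratic term + log-determinant + constant
    return 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * len(y) * np.log(2 * np.pi)

def fit_hyperparameters(X, y):
    res = minimize(neg_log_marginal_likelihood, np.zeros(3), args=(X, y),
                   method="L-BFGS-B")
    return res.x
```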
We tackle the computational bottleneck of GPR and introduce an ensemble learning approach in the spirit of the mixture-of-experts method.6 Thereto, we partition the training data into subsets,
apply a GPR on each subset in parallel, and obtain a predictive distribution conditional on the
full training data by mixing the predictive distributions over the subsets. In contrast to ad hoc
partitioning schemes used in the literature, such as random or clustering-based partitioning, we exploit the fact that our training data is naturally divided into monthly subsets. That is, we treat the data from each month $j = 1, 2, \ldots, t$ as a training subset on which we train an individual Gaussian process $f^{(j)}$. Specifically, we estimate hyperparameters by maximizing the log-likelihood function (6), and obtain the predictive Gaussian distribution of $f^{(j)}$ with mean and covariance functions $\hat{m}^{(j)}(\cdot)$ and $\hat{k}^{(j)}(\cdot,\cdot)$ as in (4) and (5), with the training data $\{\mathbf{x}_{0:t-1}, r_{1:t}\}$ replaced by $\{\mathbf{x}_{j-1}, r_j\}$.

6 Alternative ensemble learning approaches proposed in the literature include the product of GP experts in Ng and Deisenroth [2014], the generalised product of experts in Cao and Fleet [2014], the Bayesian Committee Machine in Tresp [2000], the robust Bayesian Committee Machine in Deisenroth and Ng [2015], and Distributed Kriging (DISK) in Guhaniyogi, Li, Savitsky, and Srivastava [2017]. The product of experts approach obtains the joint prediction by the product of all predictions from trained GPR models, while the generalized product of experts approach adds flexibility by assigning weights to the contributions from independent GPR models, thus increasing or reducing their importance. These approaches are further generalized in the Bayesian Committee Machine and the robust Bayesian Committee Machine, where the GP priors are explicitly incorporated when combining predictions. In contrast to these product of experts approaches, Distributed Kriging obtains the combined predictions as the Wasserstein barycenter of the subset posterior distributions.
Finally, we obtain the predictive distribution conditional on the full training data by mixing the individual predictive Gaussian distributions of $f^{(j)}(\mathbf{x}_t)$ using some weights $w_j \ge 0$ with $\sum_j w_j = 1$. The mean vector and covariance matrix of this Gaussian mixture distribution are given by
\[
\hat{r}_{t+1} = \sum_j w_j \, \hat{m}^{(j)}(\mathbf{x}_t), \tag{7}
\]
\[
\hat{\Sigma}_{t+1} = \sum_j w_j \, \hat{M}^{(j)}_{t+1} - \hat{r}_{t+1} \hat{r}_{t+1}^\top, \tag{8}
\]
where $\hat{M}^{(j)}_{t+1} = \hat{k}^{(j)}(\mathbf{x}_t, \mathbf{x}_t) + \hat{m}^{(j)}(\mathbf{x}_t)\,\hat{m}^{(j)}(\mathbf{x}_t)^\top$ denotes the second order moment matrix of $f^{(j)}(\mathbf{x}_t)$. The mean $\hat{r}_{t+1}$ is our ensemble prediction of the conditional expected return vector $r_{t+1}$, and $\hat{\Sigma}_{t+1}$ represents the covariance of the Bayesian uncertainty of our prediction.
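A minimal sketch of this mixing step, assuming the posterior mean and covariance of each monthly expert on the current cross-section have already been computed (e.g., with a routine like gp_posterior above):

```python
import numpy as np

def mix_gaussian_experts(means, covs, weights):
    """Mean and covariance, eqs. (7)-(8), of the mixture over monthly experts.

    means   : list of (M,) posterior mean vectors, one per training month
    covs    : list of (M, M) posterior covariance matrices
    weights : list of mixing weights w_j >= 0 summing to one
    """
    r_hat = sum(w * m for w, m in zip(weights, means))                     # eq. (7)
    # Mix the second moments M^(j) = k^(j) + m^(j) m^(j)^T, then re-center.
    second = sum(w * (C + np.outer(m, m)) for w, m, C in zip(weights, means, covs))
    Sigma_hat = second - np.outer(r_hat, r_hat)                            # eq. (8)
    return r_hat, Sigma_hat
```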
For our empirical analysis, we implement the following two mixing weight schemes:
(i) Equal weights: we select the $K_t \le t$ most recent training months (more details on the choice of $K_t$ follow in the next subsection), set $w_j = 0$ for $j \le t - K_t$, and apply equal weights $w_j = 1/K_t$ to the GPR models $j = t - K_t + 1, \ldots, t-1, t$. The sums in (7) and (8) are effectively over $j = t - K_t + 1, \ldots, t-1, t$.
(ii) MSE weights: as above, we select the $K_t \le t$ most recent training months and set $w_j = 0$ for $j \le t - K_t$. We also hold out the most recent training month, $w_t = 0$, which we call the calibration month, and define the weights for the remaining $K_t - 1$ months as inversely proportional to the mean squared error
\[
\mathrm{MSE}_j = \frac{1}{|I_t|} \sum_{i \in I_t} \left( r_{i,t} - \hat{r}^{(j)}_{i,t} \right)^2
\]
of GPR model $j$ on the calibration month $t$, with corresponding predicted returns $\hat{r}^{(j)}_{i,t} = \hat{m}^{(j)}(x_{i,t-1})$. The MSE weights are thus given by
\[
w_j = \frac{1/\mathrm{MSE}_j}{\sum_{s=t-K_t+1}^{t-1} 1/\mathrm{MSE}_s}, \qquad j = t - K_t + 1, \ldots, t-2, t-1.
\]
That is, the smaller $\mathrm{MSE}_j$, the larger the weight $w_j$ we give to GPR model $j$. The sums in (7) and (8) are effectively over $j = t - K_t + 1, \ldots, t-2, t-1$ (see the sketch below).
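The inverse-MSE weights of scheme (ii) can be computed along the following lines; the function name and dict-based interface are ours:

```python
import numpy as np

def mse_weights(r_calib, preds_by_month):
    """Mixing weights proportional to 1/MSE_j on the calibration month t.

    r_calib        : (M,) realized returns in the calibration month
    preds_by_month : dict {j: (M,) expert j's predictions for that month}
    """
    inv_mse = {j: 1.0 / np.mean((r_calib - p) ** 2) for j, p in preds_by_month.items()}
    total = sum(inv_mse.values())
    return {j: v / total for j, v in inv_mse.items()}
```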
Our ensemble learning approach has several advantages. First, it offers a substantial computational speed-up, due to a straightforward parallel implementation of the individual GPR models on subsets of the training data, compared to training a GPR model on the full training data. Second, the flexibility that each GPR model can have its own set of optimal hyperparameters (the parametric kernel is the same for each model) allows us to account for the non-stationarity and heteroscedasticity present in the financial data. Third, our approach scales well with sample size and provides an online learning framework. This is in contrast to other sophisticated machine learning algorithms, which are hard to re-train every month due to their high computational costs. Specifically, we only need to train one additional GPR model on the newly observed data from month $t+1$ and combine it with the already trained GPR models on months $1, 2, \ldots, t$ to obtain a mixed predictive distribution for the conditional expected returns $r_{t+2}$.
2.2 Sample splitting: training, validation and test samples
The process of estimating hyperparameters, predicting, and evaluating the predictions requires
the modeller to partition the full sample into training, validation and test samples. To achieve
this goal, we conduct an empirical analysis of rolling and expanding schemes on the training and validation samples, and then use the scheme that performs best in terms of prediction accuracy on the validation sample for our out-of-sample test data analysis.7
The underlying idea of the rolling scheme is to gradually shift the training and validation
samples forward in time to include more recent data and exclude the oldest data points such that
a fixed size of the rolling window is maintained. At each rolling step, one re-fits the model on the
prevailing training and validation samples and obtains the predictions on the next test data, thus
resulting in a sequence of performance measures, i.e., one corresponding to each window. Although
this approach has the benefit that it can potentially leverage more recent data for predictions,
it can significantly impact the performance of the model if the excluded data contains essential
information, e.g., a financial crisis period. The expanding scheme also gradually includes more
recent data points in the training and validation samples. But in contrast to the rolling scheme it
retains the entire history in the training sample. In terms of the mixing weight schemes (i) and (ii),
we set $K_t = K$ for the rolling scheme, where $K$ is a fixed constant, and $K_t = t$ for the expanding scheme.

7 Another scheme, known as the fixed scheme, divides the sample into fixed training, validation, and test data, estimates the model once from the training and validation samples, and makes predictions on the test sample. Although the fixed scheme is not very expensive in terms of computational cost, it fails to capture the changes in the behaviour of the data over time, thus affecting the model's performance.
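To make the two schemes concrete, here is a small sketch of the training-window selection; the helper is purely illustrative:

```python
def window_months(t, scheme, K=96):
    """Training months used before predicting month t+1.

    Months are labeled 1, ..., t as in the text; "rolling" keeps the K
    most recent months (K_t = K), "expanding" keeps all of them (K_t = t).
    """
    if scheme == "rolling":
        return list(range(max(1, t - K + 1), t + 1))
    return list(range(1, t + 1))
```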
2.3 Predictive performance evaluation
We evaluate the model performance in predicting conditional expected returns using three measures. The first is the predictive out-of-sample pooled R-squared,
\[
R^2_{\text{pool}} = 1 - \frac{\sum_{t \in T_3} \sum_{i \in I_t} (r_{i,t} - \hat{r}_{i,t})^2}{\sum_{t \in T_3} \sum_{i \in I_t} r_{i,t}^2},
\]
where $T_3$ denotes the collection of test months. $R^2_{\text{pool}}$ provides a metric for the grand panel-level performance of the model by pooling the prediction errors across stocks and over time. Being a pooled performance measure, $R^2_{\text{pool}}$ places more weight on months with a comparatively larger cross-section of stocks. However, the size of the cross-section varies considerably across the sample, as shown in Figure 1. A monthly rebalancing portfolio manager is more concerned with the average monthly predictive performance.
Therefore, we also consider a second performance measure, the predictive out-of-sample average R-squared,
\[
R^2_{\text{avg}} = \frac{1}{|T_3|} \sum_{t \in T_3} R^2_t,
\]
where $R^2_t$ denotes the R-squared for the predictions in month $t$,
\[
R^2_t = 1 - \frac{\sum_{i \in I_t} (r_{i,t} - \hat{r}_{i,t})^2}{\sum_{i \in I_t} r_{i,t}^2}.
\]
Both measures, $R^2_{\text{pool}}$ and $R^2_{\text{avg}}$, compare our model predictions against the naive forecast of zero excess log returns, and not against the historical mean excess log returns. This is because the latter is known to predict excess log returns worse than zero by a large margin, see Gu et al. [2020].
Investors can use our model to construct portfolios based on the predicted relative performance
of the stocks. For example, a long-short investor will go long in top-ranked stocks and short
in bottom-ranked stocks to earn the difference between the relative returns of the two buckets
of stocks. The performance metrics $R^2_{\text{pool}}$ and $R^2_{\text{avg}}$ measure the extent to which the levels of predicted excess returns differ from realized excess returns, and thus are not necessarily suitable for a long-short investor. Therefore, we consider a third performance measure, the information coefficient, defined as the average
\[
\mathrm{IC} = \frac{1}{|T_3|} \sum_{t \in T_3} \rho_t
\]
of the cross-sectional Spearman's rank correlation coefficients between the realized excess returns and predictions,
\[
\rho_t = 1 - \frac{6 \sum_{i \in I_t} d_i^2}{|I_t|\,(|I_t|^2 - 1)},
\]
where $d_i$ is the difference in ranks between the $i$th largest elements of $\{\hat{r}_{j,t}\}_{j \in I_t}$ and $\{r_{j,t}\}_{j \in I_t}$.
The IC, originally proposed by Ambachtsheer [1974], is a widely used performance measure in
investment management to measure predictive ability. It disregards the absolute levels, is less
sensitive to outliers, and quantifies the model’s ability to differentiate the relative performance
among stocks.
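The three measures can be computed along the following lines, assuming the realized and predicted returns are stored as one array per test month (our layout):

```python
import numpy as np
from scipy.stats import spearmanr

def r2_pool(r, r_hat):
    # pooled R-squared over the grand panel of test months
    num = sum(((rt - pt) ** 2).sum() for rt, pt in zip(r, r_hat))
    den = sum((rt ** 2).sum() for rt in r)
    return 1.0 - num / den

def r2_avg(r, r_hat):
    # equal-weighted time average of the monthly R-squared
    return np.mean([1.0 - ((rt - pt) ** 2).sum() / (rt ** 2).sum()
                    for rt, pt in zip(r, r_hat)])

def information_coefficient(r, r_hat):
    # average cross-sectional Spearman rank correlation
    return np.mean([spearmanr(pt, rt)[0] for rt, pt in zip(r, r_hat)])
```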
3 An empirical study of US equities
This section contains our empirical analysis. Section 3.1 describes the data and the steps we follow to prepare it for the empirical analysis. Section 3.2 discusses model selection, where we conduct a comparison study on the validation sample to select the model parameters for the out-of-sample analysis, namely the sample splitting scheme, the mixing weight scheme used to create the ensemble, and the kernel. Section 3.3 contains the performance results of our model using statistical and economic criteria. Section 3.4 focuses on cross-sectional insights into the model performance, such as variable importance, the relationship between the features and the predicted returns, and the relationship between the features and the prediction uncertainty. Section 3.5 discusses residual analysis.
3.1 Data
We consider the monthly returns of approximately 30,000 individual stocks from the three major
stock exchanges in the US, namely, NYSE, AMEX and NASDAQ. The data is collected from
CRSP over a period spanning 55 years from February 1962 to December 2016. We use the monthly
Treasury bill rate as a proxy for the risk-free rate to determine the excess simple return of a stock.
The literature on empirical asset pricing has constructed a large collection of features that help predict future stock returns. The conditioning information we use includes 94 stock-specific characteristics that are considered in Gu et al. [2020] and Gu et al. [2021].8 The stock-level characteristics pertain to several categories, including past returns, investment, profitability, value, trading frictions, etc. Thereof, 61 characteristics are updated annually, 13 are updated quarterly, and 20 are
updated on a monthly basis. Since most characteristics are lagged in the sense that there is a delay
in their release to the public, we follow the common conventions concerning the usage of these
characteristics to avoid a forward-looking bias. More precisely, we assume that there is a lag of
at most one month, four months and six months in the monthly, quarterly and annually reported
characteristics respectively. Consequently, we predict the returns rt+1 over the period [t, t + 1]
as a function of the most recent publicly available characteristics at t. That is, we use the most
recent monthly, quarterly and annual characteristics at the end of months $t-1$, $t-4$ and $t-6$, respectively.

8 Data is available on the homepage of Dacheng Xiu (https://dachxiu.chicagobooth.edu/).

Figure 1: This figure shows the size of the cross-section of stocks in each month of the sample. The full sample is split into the training sample (green), Feb 1962 to Dec 1981, the validation sample (yellow), Jan 1982 to Dec 1986, and the test sample (red), Jan 1987 to Dec 2016.
To prepare the data for the empirical analysis, we apply transformations to the stock-specific characteristics. This is common practice in machine learning, since different features have different absolute scales, and some of them are highly skewed and leptokurtic. To address the difference in scale and remove the influence of outliers, at any time $t$ we standardize the non-missing values of each specific characteristic by subtracting the cross-sectional mean and dividing by the cross-sectional standard deviation. We then replace the missing observations by zero.
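A sketch of this preprocessing step in pandas, assuming the characteristics are stored in a panel indexed by (month, stock); the layout is our assumption:

```python
import pandas as pd

def standardize_cross_section(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize each characteristic within each month's cross-section.

    df has a MultiIndex (month, stock) and one column per characteristic.
    """
    z = df.groupby(level="month").transform(lambda x: (x - x.mean()) / x.std())
    return z.fillna(0.0)  # missing observations are replaced by zero
```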
3.2 Model selection
We divide the data into three consecutive non-overlapping samples while maintaining the temporal
ordering of the data, as shown in Figure 1. The training sample, Feb 1962 to Dec 1981, is used
to estimate the hyperparameters of the Gaussian process. The validation sample, Jan 1982 to Dec
1986, is used for model selection, that is, the kernel function, the mixing weight scheme, the sample
splitting scheme (rolling or expanding), and the size of the rolling window if we adopt a rolling
scheme. The test sample, Jan 1987 to Dec 2016, is then used to evaluate the performance of the
selected model.
Figure 2: This figure describes the mechanism of the rolling scheme with training window, including
a calibration month, of length K.
For model selection, we follow a data-driven approach and select the configuration, among all
possible combinations, that performs best on the validation sample in terms of prediction accuracy,
as measured by R2
pool. More specifically, we implement our ensemble method for both, rolling and
expanding, schemes. In more detail, we apply the rolling scheme for different possible training
window lengths K= 2,...,239. That is, for the MSE weighting scheme (ii) and K= 2, we use Nov
1981 and Dec 1981 as training and calibration months, respectively, for the first test month, Jan
1982.9Likewise, we choose Oct 1986 and Nov 1986, respectively, for the last test month, Dec 1986.
Similarly, for the maximal possible K= 239, we use Feb 1962 to Nov 1981 as training months and
Dec 1981 as calibration month for the test month Jan 1982, and shift each by one month for the
next test month. This is illustrated in Figure 2. We also apply the expanding scheme, using the
full training sample available for each test month. More specifically, we use Feb 1962 to Nov 1981
as training months and Dec 1981 as a calibration month for the test month Jan 1982, while we use
Feb 1962 to Oct 1986 as training months and Nov 1986 as calibration month for the last test month
Dec 1986. We follow a similar procedure for the equal weighting scheme (i) but without holding
out a calibration month.
In preliminary experiments reported in the appendix with various kernels in predicting excess
log returns over the validation period, we observe that the following three kernels outperformed
the others in terms of $R^2_{\text{pool}}$,
\[
K_5(x, x') = \sigma^2 \left(1 + \frac{\|x - x'\|}{2\alpha\ell^2}\right)^{-\alpha}, \qquad
K_8(x, x') = \sigma^2 \left(1 + \frac{\|x - x'\|^2}{2\alpha\ell^2}\right)^{-\alpha}, \qquad
K_{10}(x, x') = \sigma^2 \exp\left(-\left(\frac{\|x - x'\|}{\ell}\right)^{\gamma}\right),
\]
where $\sigma, \alpha, \ell > 0$ and $\gamma \in (0, 2]$ are hyperparameters.10 We therefore report the validation analysis only for these kernels, on simple excess returns. We note that $\sigma$ is not relevant for the predictive mean in (4), but it is significant in quantifying the Bayesian uncertainty of our predictions, see (5).
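For reference, here is a sketch of the three kernels as functions of the matrix of pairwise Euclidean distances (the function names are ours):

```python
import numpy as np

def k5_linear_quadratic(d, sigma, alpha, ell):
    # K5: sigma^2 (1 + ||x - x'|| / (2 alpha ell^2))^(-alpha)
    return sigma**2 * (1.0 + d / (2.0 * alpha * ell**2)) ** (-alpha)

def k8_rational_quadratic(d, sigma, alpha, ell):
    # K8: sigma^2 (1 + ||x - x'||^2 / (2 alpha ell^2))^(-alpha)
    return sigma**2 * (1.0 + d**2 / (2.0 * alpha * ell**2)) ** (-alpha)

def k10_gamma_exponential(d, sigma, ell, gamma):
    # K10: sigma^2 exp(-(||x - x'|| / ell)^gamma), with gamma in (0, 2]
    return sigma**2 * np.exp(-((d / ell) ** gamma))
```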
Figure 3 shows $R^2_{\text{pool}}$ over the validation sample for the MSE- and equal-weighting schemes against varying lengths $K$ of the training window, for the three kernels selected in the preliminary studies. The solid lines correspond to the rolling scheme, while the dotted lines represent $R^2_{\text{pool}}$ for the expanding scheme, which does not depend on $K$. There are several findings. First, we observe that both sample splitting and mixing weight schemes generate positive $R^2_{\text{pool}}$ for $K$ large enough ($K > 30$). Second, the MSE-weighted ensemble yields a larger $R^2_{\text{pool}}$ than the equal-weighted one. A possible explanation is the temporal non-stationarity of financial data. There are non-observable changing regimes such that the regime prevailing at any given month $j$ is not explicitly captured by the observable features $\mathbf{x}_{j-1}$. But it is implicitly captured by the trained Gaussian process $f^{(j)}$ through its fitted hyperparameters and predictive distribution, given data $\{\mathbf{x}_{j-1}, r_j\}$. The MSE-weighting scheme in turn gives more weight to a training month $j$ the closer its regime is to the current regime prevailing in the calibration month. This results in more accurate predictions than equal weighting. Third, it is striking that incorporating the full available training sample, that is, the expanding scheme, worsens the predictive performance, as the dotted lines are below the solid lines for both mixing weight schemes. This finding is in line with the bias-variance trade-off in machine learning, as our ensemble model complexity grows with the size of the training window. Fourth, for the rolling scheme, $R^2_{\text{pool}}$ is relatively more volatile for small $K$ ($K < 50$) and becomes stable, providing considerably more consistent performance, for large $K$ ($K > 50$). Further, for $K > 50$, we observe a peak around $K = 100$, which motivates us to choose $K = 96$ as the window length for the rolling scheme.11 Finally, the performance of the three kernels is similar to each other. We select $K_{10}$ as the best kernel for the out-of-sample test. To conclude, based on the performance in the validation sample, we select the rolling scheme with training window length $K = 96$, the MSE-weighting scheme and the gamma-exponential kernel for the out-of-sample test.
10 The kernels $K_8$ and $K_{10}$ are the rational quadratic and $\gamma$-exponential kernels, standard kernels from the literature. The kernel $K_5$ is a modified version of the rational quadratic kernel; we call it the linear quadratic kernel.
11 The NBER's business cycle dating committee maintains a chronology of US business cycles, available on https://www.nber.org/research/data/us-business-cycle-expansions-and-contractions. During our sample period, the average length of a business cycle is around eight years, which is in line with our choice of $K = 96$.
Figure 3: This figure presents $R^2_{\text{pool}}$ over the validation sample, Jan 1982 to Dec 1986, for the MSE- and equal-weighting schemes against the length of the training window.
3.3 Model performance
Having selected the model configuration, we now conduct our empirical analysis over the test
sample. We first train K−1 = 95 individual monthly GPR models, Jan 1979 to Nov 1986,
compute MSE weights on the first calibration month, Dec 1986, and use these weights to get the
predictive distribution for the returns in the first test month, Jan 1987. We proceed by induction
and train one additional GPR model on Dec 1986, combine it with the already trained GPR models
on the previous 94 months, Feb 1979 to Nov 1986, compute MSE weights on the next calibration
month, Jan 1987, and predict returns on Feb 1987. We repeat this online learning procedure until
the full test sample is exhausted, consisting of 360 test months until Dec 2016. We evaluate the
predictive performance, both statistically and economically, in the following subsections.
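Schematically, the online loop looks as follows; all helpers (including mse_weights from Section 2.1) are the hypothetical sketches introduced earlier, and the interfaces are ours:

```python
def online_forecasts(months, fit_gpr, predict_mean, realized, K=96):
    """One new expert per month; inverse-MSE weights on a held-out calibration month.

    fit_gpr(j)         -> trained GPR expert for month j
    predict_mean(e, j) -> expert e's predicted returns for month j
    realized[j]        -> realized returns of month j
    """
    experts, forecasts = {}, {}
    for idx, t in enumerate(months[:-1]):
        experts[t] = fit_gpr(t)                          # train on the new month only
        window = months[max(0, idx - K + 1): idx + 1]    # K most recent training months
        calib, active = window[-1], window[:-1]          # hold out the calibration month
        if not active:
            continue
        w = mse_weights(realized[calib],
                        {j: predict_mean(experts[j], calib) for j in active})
        nxt = months[idx + 1]                            # ensemble prediction, eq. (7)
        forecasts[nxt] = sum(w[j] * predict_mean(experts[j], nxt) for j in active)
    return forecasts
```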
3.3.1 Predictive performance across all stocks
We first evaluate the predictive performance of our ensemble GPR model, henceforth referred
to as E-GPR (γ-exp), across all stocks and benchmark it against three linear models. The first
benchmark, E-GPR (affine), is an ensemble GPR model with an affine kernel function, defined as
\[
K(x, y) = c_0 + c_1 x^\top y,
\]
where $c_0, c_1 > 0$ are hyperparameters. The second benchmark, ensemble linear regression (E-LR),
adopts the ensemble GPR framework by fitting a separate linear regression for each training month. The third benchmark is standard linear regression (LR), fitted using all available training data for each test month.
Figure 4: This figure shows the evolution of $R^2_{\text{pool}}$ (solid lines) and $R^2_{\text{avg}}$ (dotted lines) over an expanding test subsample for our model, E-GPR ($\gamma$-exp), against the linear benchmark models, E-GPR (affine), E-LR and LR. The shaded periods indicate NBER recessions.
Model             R^2_pool (%)   R^2_avg (%)
E-GPR (γ-exp)         0.96           0.58
E-GPR (affine)        0.74           0.42
E-LR                  0.61           0.18
LR                    0.37           0.003

Table 1: Comparison of predictive performance across all stocks, i.e., R^2 (%), among the various models.
Figure 4 shows the out-of-sample performance in predicting the returns across time. It illustrates the evolution of $R^2_{\text{pool}}$ (solid lines) and $R^2_{\text{avg}}$ (dotted lines) across an expanding test sample for our model and the linear benchmark models.12 For example, $R^2_{\text{pool}}$ ($R^2_{\text{avg}}$) in Jan 2000 is the pooled R-squared (average R-squared) evaluated on the test sample, Jan 1987 to Jan 2000.
Further, Table 1 presents the final values of $R^2_{\text{pool}}$ and $R^2_{\text{avg}}$ for each model over the entire test sample. These final values correspond to the endpoints of the curves depicted in Figure 4.
There are several observations. The positive values of $R^2_{\text{pool}}$ and $R^2_{\text{avg}}$ over the expanding test subsample indicate that our model outperforms the zero predictions consistently over time. Moreover, our model achieves an $R^2_{\text{pool}}$ of 0.96% over the full test sample.13 The value $R^2_{\text{avg}} = 0.58$% over the full test sample further confirms the superior performance of our model. To assure that small stocks do not drive this unprecedented predictive performance, i.e., that our model is not
simply picking up small-scale inefficiencies driven by illiquidity, we also measured $R^2_{\text{pool}}$ on two test subsamples. The first consists of the top-1,000 stocks and the second of the bottom-1,000 stocks by market capitalization each month. The values of $R^2_{\text{pool}}$ for the two subsamples are 0.73% and 1.34%, respectively. This indicates that our model is capable of capturing the systematic structure in large-cap as well as small-cap stocks, although the performance is slightly better for the small-cap stocks.

12 We also plot the time series of the monthly R-squared, $R^2_t$, from our model in Figure A.3 in the appendix.
13 This is substantially greater than the corresponding numbers, 0.4% and 0.58%, for the neural network models in Gu et al. [2020] (Table 1 on page 2250) and Gu et al. [2021] (Table 2 on page 11), respectively.
Compared to the linear benchmark models, we observe a significant improvement in prediction accuracy when employing the non-linear kernel instead of the affine kernel. Our model, E-GPR ($\gamma$-exp), consistently outperforms the linear benchmarks in terms of both $R^2_{\text{pool}}$ and $R^2_{\text{avg}}$. Figure 4 also illustrates the temporal evolution of the contribution of non-linearity to predictive performance, highlighting that the ranking of model performance presented in Table 1 remains consistent throughout the sample period. This consistency reinforces the robustness of our model across different time horizons. Additionally, the improved $R^2$ of ensemble linear regression (E-LR) compared to standard linear regression (LR) underscores the benefits of our proposed ensemble learning approach.
Figure 5 shows the time series of monthly Spearman's rank correlations $\rho_t$ between the predicted and realized returns. The flat line gives the information coefficient, IC, our third performance measure. It is evident that there is substantial variation over time in the ability of our model to differentiate relative performance between stocks, as can be seen from correlation coefficients ranging from -23.2% to 31.61%. The information coefficient equals 7.3% and is significantly greater than zero at the 95% confidence level. Remarkably, we observe that our model performs equally well during the NBER recession months, shown as the shaded periods.
3.3.2 Predictive performance across sorted portfolios
So far, our model’s predictive performance assessment has been based on individual stock returns.
Next, we analyze the predictive ability of our model at the portfolio level. Why should we assess
the portfolio-level predictions when our model is optimized for predicting individual stock returns?
We know that one of the main applications of predicting stock returns is to construct portfolios. A
model performing better at predicting stock returns need not provide accurate predictions at the portfolio level. Therefore, assessing portfolio forecasts gives us an additional measure to evaluate the predictive ability of our model.
Given our predictions at the beginning of each test month, we sort stocks into deciles, which
we denote by D1, D2, . . . , D10, where D1 corresponds to the lowest predicted returns and D10
corresponds to the largest predicted returns. Within each decile, we then construct three different
portfolios. The first is equal weighted (EW), and the second is value weighted (VW) by market
capitalization. These are the standard portfolios in the empirical asset pricing literature. For the
third portfolio, we minimize the variance of the Bayesian uncertainty of our predictions by solving the following optimization problem,
\[
\min_{w \in \mathcal{W}} \; w^\top \hat{\Sigma}_{t+1} w, \tag{9}
\]
for the predictive covariance matrix $\hat{\Sigma}_{t+1}$, and where $\mathcal{W} = \{w;\ \sum_{j \in D} w_j = 1,\ w_j \ge 0,\ j \in D\}$ denotes the feasible set of portfolio weights for the respective decile $D$. We call it the uncertainty-weighted (UW) portfolio. The goal of studying the predictive performance of the UW portfolio is to examine the role of accuracy estimates in portfolio selection and, finally, to study its economic contribution when we construct a risk-adjusted portfolio below.

Figure 5: This figure shows the evolution of Spearman's rank correlation $\rho_t$ between the realized and predicted returns over the test sample. The flat line gives the information coefficient. The shaded periods indicate NBER recessions.
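Returning to problem (9): it is a small quadratic program over the simplex of each decile's weights. A generic sketch using scipy follows; the solver choice is ours, as the paper does not specify one.

```python
import numpy as np
from scipy.optimize import minimize

def uw_weights(Sigma):
    """Minimum-uncertainty weights within one decile, problem (9).

    Sigma: (M, M) predictive covariance matrix of the decile's stocks.
    """
    M = Sigma.shape[0]
    res = minimize(lambda w: w @ Sigma @ w,
                   np.full(M, 1.0 / M),                        # start from equal weights
                   bounds=[(0.0, 1.0)] * M,                    # long-only
                   constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0},
                   method="SLSQP")
    return res.x
```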
Table 2 shows the out-of-sample predictive performance of the EW and VW portfolios from our ensemble GPR model with $\gamma$-exponential kernel. We also compare to the performance of the EW and VW portfolios from the benchmark models. We report the $R^2_{\text{pool}}$ between the predicted and realized excess returns of the decile portfolios over the test sample, along with the $R^2_{\text{pool}}$ pooled over all deciles, for each of the two portfolio strategies. The latter provides a grand portfolio-level assessment of portfolio predictions against the realized portfolio returns.
Table 3 compares the same measures for the UW portfolios from our ensemble GPR model with $\gamma$-exponential kernel to those from the ensemble GPR model with affine kernel. Note that UW portfolios are a unique feature of GPR models and thus are not available for E-LR and LR. It is evident from Table 2 that the out-of-sample predictive performance at the portfolio level aligns very closely with the results on prediction performance at the individual stock level reported earlier. That is, the GPR-based models outperform both variants of the linear regression model (E-LR and LR) across all portfolio strategies. Remarkably, the UW portfolios significantly outperform the EW and VW portfolios in terms of predictive performance across all deciles.
Further, the UW portfolio yields a positive grand panel $R^2_{\text{pool}}$ of 12.80%, in contrast to the EW and VW portfolios, which generate a grand panel $R^2_{\text{pool}}$ of 8.17% and 2.94%, respectively. These findings reveal that prediction uncertainties matter.
Equal-Weighted
                  D1     D2     D3     D4     D5     D6     D7     D8     D9    D10    All
E-GPR (γ-exp)    2.12   0.56   1.52   3.20   5.28   6.71   8.60  11.05  13.33  22.08   8.17
E-GPR (affine)   2.25   2.24   3.04   4.12   4.72   5.38   6.17   6.55   8.55  15.21   6.66
E-LR             0.42   0.68   1.33   2.27   3.14   3.70   5.19   6.10   9.36  16.35   5.29
LR              -2.08  -1.88  -0.52   0.68   1.60   2.15   2.47   2.44   4.41  10.29   2.93

Value-Weighted
                  D1     D2     D3     D4     D5     D6     D7     D8     D9    D10    All
E-GPR (γ-exp)   -1.28  -1.06  -0.07   0.83   2.34   3.80   5.33   8.08   8.68  10.52   2.94
E-GPR (affine)   1.17   2.27   3.12   4.68   4.85   6.05   6.15   5.62   6.64   2.18   3.77
E-LR            -2.33  -1.16  -0.41   0.20   0.86   1.00   1.80   0.97   2.39  -0.68  -0.18
LR              -5.42  -2.65  -0.59   1.05   2.04   2.30   2.14   1.80   2.07  -3.66  -0.81
Table 2: In this table, we report the out-of-sample predictive performance, measured by $R^2_{\text{pool}}$ in percentage points, across the sorted portfolios obtained using equal weighting and value weighting. We compare the performance of our ensemble GPR model with $\gamma$-exponential kernel to the ensemble GPR model with affine kernel and different variations of the linear regression model. In each panel, the first ten columns (D1 to D10) report $R^2_{\text{pool}}$ for each decile, while the last column (All) reports $R^2_{\text{pool}}$ calculated over the grand panel of all deciles.
Uncertainty-Weighted Portfolios
                  D1     D2     D3     D4     D5     D6     D7     D8     D9    D10    All
E-GPR (γ-exp)    6.39   1.90   1.92   4.56   6.89  10.54  14.10  17.62  22.83  31.85  12.80
E-GPR (affine)  -0.21  -0.46   0.35   0.67   0.61   3.02   1.23   2.66   6.72  -0.45   1.49

Table 3: In this table, we report the out-of-sample predictive performance, measured by $R^2_{\text{pool}}$ in percentage points, across the sorted portfolios obtained using the uncertainty-weighted strategy. We compare the performance of our ensemble GPR model with $\gamma$-exponential kernel to the ensemble GPR model with affine kernel. The first ten columns (D1 to D10) report $R^2_{\text{pool}}$ for each decile, while the last column (All) reports $R^2_{\text{pool}}$ calculated over the grand panel of all deciles.
3.3.3 Economic performance of sorted portfolios
Next, we assess whether the improved statistical performance of our prediction-sorted portfolios
translates into better economic performance. On top of the EW and VW portfolios discussed in
the previous subsection, we introduce two additional portfolios.
First, for the prediction-weighted (PW) portfolio we assign weights, within each decile, to the
stocks based on their predicted returns. The goal is to take advantage of the relative strength of
the prediction signal in addition to the rankings. Specifically, for each of the top five deciles (D6
to D10), we subtract the smallest predicted return within the decile from each predicted return, to aim at the maximal predicted return. Similarly, for each of the bottom five deciles (D1 to D5), we subtract each predicted return from the largest predicted return within that decile, to aim at the minimal predicted return. We then normalize the level-adjusted predicted returns such that they sum up to 1 within each decile.
More specifically, for any stock $i$ we define the level-adjusted predicted return
\[
\hat{s}_{i,t+1} = \begin{cases} \hat{r}_{i,t+1} - \min_{j \in D_i} \hat{r}_{j,t+1}, & \text{if } i \text{ lies in the top 5 deciles, D6 to D10}, \\ \max_{j \in D_i} \hat{r}_{j,t+1} - \hat{r}_{i,t+1}, & \text{if } i \text{ lies in the bottom 5 deciles, D1 to D5}, \end{cases} \tag{10}
\]
where $\hat{r}_{j,t+1}$ are the predicted returns in the decile $D_i$ containing stock $i$. These level-adjusted predicted returns are then normalized to obtain the portfolio weights,
\[
w_i = \frac{\hat{s}_{i,t+1}}{\sum_{j \in D_i} \hat{s}_{j,t+1}}.
\]
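A sketch of the within-decile PW weights implied by (10) and the normalization above:

```python
import numpy as np

def pw_weights(r_hat, top_decile):
    """Prediction-weighted portfolio weights within one decile, eq. (10).

    r_hat      : (M,) predicted returns of the stocks in the decile
    top_decile : True for deciles D6-D10, False for D1-D5
    """
    s = r_hat - r_hat.min() if top_decile else r_hat.max() - r_hat
    return s / s.sum()
```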
Second, the prediction-uncertainty-weighted (PUW) portfolio is motivated by the findings of the previous subsection that the UW portfolio yields a substantially higher $R^2_{\text{pool}}$ (for each decile) than the EW and VW portfolios. We utilize the predictive covariance matrix and the predicted returns to construct a mean-variance type portfolio for an uncertainty-averse investor. We aim to maximize the portfolio return by investing in stocks whose predicted returns are both high and precise. More precisely, we combine (9) and (10) by solving the following optimization problem,
$$\max_{w \in \mathcal{W}} \; w^\top \hat{s}_{t+1} - \frac{\zeta}{2}\, w^\top \hat{\Sigma}_{t+1} w, \qquad (11)$$
where $\hat{s}_{t+1} = \{\hat{s}_{j,t+1},\, j \in D\}$ is the vector of level-adjusted predicted returns in the respective decile $D$, and $\zeta$ is the uncertainty-aversion parameter.
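Problem (11) is a standard quadratic program. As a hedged illustration, the sketch below solves it under the additional assumption that the weight set $\mathcal{W}$ is the long-only simplex within each decile, which is not spelled out above; the function and variable names are ours.

```python
import numpy as np
from scipy.optimize import minimize

def puw_weights(s_hat: np.ndarray, Sigma_hat: np.ndarray, zeta: float) -> np.ndarray:
    """Solve max_w w's - (zeta/2) w'Sigma w over the simplex, cf. (11)."""
    n = s_hat.size
    neg_obj = lambda w: -(w @ s_hat - 0.5 * zeta * w @ Sigma_hat @ w)
    neg_grad = lambda w: -(s_hat - zeta * Sigma_hat @ w)
    res = minimize(
        neg_obj,
        np.full(n, 1.0 / n),                 # start from equal weights
        jac=neg_grad,
        bounds=[(0.0, 1.0)] * n,             # long-only
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
        method="SLSQP",
    )
    return res.x
```

As ζ grows, the variance penalty dominates and the solution approaches minimum-uncertainty (UW-type) weights, consistent with the convergence of PUW to UW reported below.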
Figure 6 shows the cumulative excess log returns of the VW decile portfolios D1 to D10. It
also shows the cumulative excess log returns of the S&P 500 for comparison. A clear pattern
emerges. First, realized returns of the decile portfolios essentially increase monotonically from D1
to D10. Portfolios based on higher predictions have higher subsequent returns. In particular, the
D1 portfolio and D10 portfolio clearly separate. Second, the D10 portfolio outperforms the S&P
500 by a large margin. Similar patterns emerge for the EW, PW, and PUW20 decile portfolios,
as shown in Figures A.4 to A.7 in the appendix. These findings indicate that our sorted portfolio
strategies can consistently dissect the market into high- and low-performing stocks.
In Figure 7, we plot the cumulative excess log returns of the top (D10) and bottom (D1)
decile portfolios for EW, VW, UW, PW, PUW20, and the S&P 500. The PUW D10 portfolio
yields a cumulative excess log return of 10.63. The PUW D1 portfolio, on the other hand, yields a cumulative excess log return of -7.08. The corresponding numbers for the PW portfolio are 20.14 and -8.67, and for the EW portfolio 11.73 and -8.67, respectively. The large difference in cumulative returns from the PW portfolios shows that our model is not only good at dissecting the stock market into deciles but also at predicting the relative return levels within a decile. Further, the uncertainty-averse PUW portfolio dominates the performance of the VW by a large margin in both directions, top and bottom, and compares favorably to that of the EW. These findings show that exploiting the Bayesian nature of GPR by incorporating uncertainty in the predictions can significantly improve
Figure 6: This figure shows the out-of-sample cumulative excess log returns of the VW decile
portfolios sorted based on our predicted returns. It also shows the S&P 500. The shaded periods
indicate NBER recessions.
portfolio performance. Interestingly, the bottom decile portfolio performance is essentially flat in
the post-2000 sample, except for PUW. This observation has also been reported in Gu et al. [2020].
We also construct a zero-net-investment portfolio that is long D10 and short D1 for all five
portfolio strategies. Figure 8 shows their cumulative excess log returns. The qualitative conclusions
are identical to what we observed in Figure 7.
We evaluate the performance of decile portfolios, D1 to D10, and the long-short portfolio LS
from our five portfolio strategies, namely EW, VW, PW, UW and PUW, in terms of their predicted
monthly returns, the average realized monthly returns, their standard deviations, and Sharpe ratios.
These metrics are annualized and are calculated based on simple excess returns, and are collected
in Table 4. We observe that realized portfolio returns generally increase monotonically with our
model’s predictions. The average realized monthly returns, their standard deviations, and Sharpe
ratios for the S&P 500 are 0.054, 0.150 and 0.360, respectively. It is evident that each of our portfolios generates a higher Sharpe ratio than the S&P 500 index. The Sharpe ratio of 3.18 of the LS portfolio from EW outperforms the Sharpe ratio of 1.12 of the corresponding portfolio from the VW strategy, an observation coinciding with the findings of Gu et al. [2020]. Further, taking into account the levels of predictions, the PW portfolio outperforms the EW portfolio in terms of Sharpe ratio, but it exhibits a higher standard deviation of 25% compared to 16% for the EW.
This reconfirms that our model is good at predicting the ranks of stock returns as well as their
levels relative to other stocks within the same decile. The best long-short strategy comes from the
PUW20 portfolio, which gives an average annualized return of 53% with an annualized volatility of 15%,
Figure 7: This figure shows the out-of-sample cumulative excess log returns of the D10 (solid lines) and D1 (dashed lines) portfolios for EW, VW, PW, UW and PUW (ζ = 20), sorted based on our predicted returns. It also shows the S&P 500. The shaded periods indicate NBER recessions.
Figure 8: This figure shows the cumulative excess log returns of the long-short portfolios for EW,
VW, PW and PUW, obtained by taking a long position in the D10 and short position in D1
portfolios. It also shows the S&P 500. The shaded periods indicate NBER recessions.
amounting to an annualized Sharpe ratio of 3.66. The annualized Sharpe ratio of the LS portfolio
from the PW strategy is 3.68, slightly higher than that from PUW20 but it comes at the expense of
a high standard deviation of 25%. Nonetheless, PW could potentially be a suitable choice for a risk-seeking investor. The UW portfolio also generates a high Sharpe ratio of 2.97, with an annualized standard deviation of just 11%. The UW portfolio is an ideal choice for a risk-averse investor who can still beat the market while taking the least risk. As we increase ζ in the PUW portfolios, we approach the performance of the UW portfolio, as expected. These findings suggest that taking into account the uncertainty estimates of the predictions significantly improves the performance of the EW and VW portfolios, typically studied in the literature.14
Next, we investigate the factors contributing to the improved performance of our portfolios,
aiming to identify whether this enhancement stems from the non-linearity of the Gaussian process
regression or from the ensemble learning approach. For that, we compare the performance of the
portfolios from our ensemble GPR model with γ-exponential kernel to the corresponding portfolios
from the ensemble GPR model with affine kernel, the ensemble linear regression model, and the
standard linear regression model. The results for the benchmark models are reported in Tables 5
and 6. An important implication emerges from a comparison of the portfolio performance in Table 5 with that in Table 4. It demonstrates that the non-linear kernel outperforms the affine kernel
in several aspects: dissecting the cross-section (EW and VW portfolios), accurately predicting
stock return levels (PW portfolios), and achieving high precision in return predictions (UW and
PUW portfolios). While the performance of the PW portfolio with affine kernel is comparable to
that with γ-exponential kernel, the uncertainty-based portfolios, UW and PUW, exhibit significant
improvements when using the non-linear kernel over the affine kernel. Thus, it suggests that the
affine kernel is not able to deliver good estimates of the prediction uncertainty, as is also evident
from the performance of the UW portfolio in Table 5 where the annualized standard deviation of
the LS portfolio is 43%, much higher than 11% of the corresponding portfolio resulting from the
γ-exponential kernel. Furthermore, the performance of the EW, VW and PW portfolios of the
GPR with affine kernel (Table 5) is similar to the corresponding portfolios from linear regression
(Table 6). The performance of the ensemble linear regression is similar to that of the standard
regression model. In summary, these observations suggest that the improved portfolio performance
of our ensemble GPR model with γ-exponential kernel stems from several aspects: the non-linearity,
the ensemble learning approach, and the prediction uncertainty estimates.
We also examine how the contributions from these factors, the non-linearity, the ensemble learn-
ing approach, and the prediction uncertainty estimates, change over time. To this end, we plot
the cumulative excess log-return of long-short portfolios from these models in Figures A.8 to A.10
in the appendix. In Figure A.8 for EW long-short portfolios, E-GPR (γ-exp) consistently out-
performs the linear benchmark models, demonstrating the significant contribution of non-linearity
in improved portfolio performance. Additionally, E-GPR (affine) and E-LR perform comparably,
14Our performance compares favorably to Gu et al. [2020], who report Sharpe ratios of 1.0 and 0.8 for EW and
VW D10 portfolios, -0.4 and -0.19 for EW and VW D1 portfolios, and 2.36 (2.63) and 1.2 (1.53) for EW and VW
long-short portfolios (in Gu et al. [2021]).
EW VW PW
Pred Avg Std SR Pred Avg Std SR Pred Avg Std SR
D1 -0.27 -0.18 0.29 -0.62 -0.24 -0.10 0.29 -0.35 -0.37 -0.29 0.31 -0.93
D2 -0.11 -0.03 0.23 -0.14 -0.11 -0.03 0.22 -0.12 -0.13 -0.04 0.24 -0.18
D3 -0.06 0.01 0.19 0.04 -0.05 0.01 0.18 0.04 -0.06 0.00 0.19 0.00
D4 -0.02 0.04 0.17 0.25 -0.02 0.04 0.17 0.22 -0.02 0.04 0.17 0.23
D5 0.02 0.06 0.16 0.38 0.02 0.05 0.15 0.32 0.01 0.06 0.16 0.37
D6 0.04 0.08 0.16 0.49 0.04 0.06 0.15 0.44 0.05 0.08 0.16 0.50
D7 0.07 0.10 0.16 0.60 0.07 0.08 0.15 0.52 0.08 0.10 0.16 0.62
D8 0.11 0.12 0.16 0.76 0.11 0.11 0.15 0.69 0.12 0.13 0.17 0.78
D9 0.16 0.16 0.18 0.90 0.15 0.12 0.16 0.71 0.17 0.17 0.19 0.92
D10 0.29 0.37 0.26 1.46 0.23 0.19 0.21 0.89 0.43 0.69 0.36 1.93
LS 0.50 0.50 0.16 3.18 0.42 0.24 0.21 1.12 0.75 0.93 0.25 3.68
UW PUW1 PUW10
Pred Avg Std SR Pred Avg Std SR Pred Avg Std SR
D1 -0.22 -0.15 0.17 -0.93 -0.72 -0.72 0.58 -1.24 -0.45 -0.40 0.25 -1.56
D2 -0.11 -0.05 0.13 -0.36 -0.15 -0.08 0.18 -0.43 -0.13 -0.06 0.14 -0.44
D3 -0.05 -0.01 0.12 -0.10 -0.07 -0.01 0.14 -0.06 -0.06 -0.02 0.12 -0.14
D4 -0.02 0.02 0.11 0.19 -0.03 0.03 0.12 0.25 -0.02 0.02 0.11 0.19
D5 0.02 0.04 0.10 0.39 0.00 0.05 0.12 0.39 0.01 0.04 0.11 0.37
D6 0.04 0.06 0.10 0.63 0.06 0.07 0.11 0.63 0.05 0.07 0.10 0.65
D7 0.07 0.09 0.11 0.85 0.09 0.10 0.12 0.85 0.08 0.09 0.11 0.88
D8 0.11 0.11 0.11 1.02 0.12 0.13 0.12 1.06 0.12 0.12 0.11 1.12
D9 0.15 0.15 0.11 1.29 0.18 0.18 0.15 1.24 0.17 0.16 0.12 1.37
D10 0.23 0.24 0.13 1.76 0.91 2.01 1.42 1.41 0.40 0.50 0.24 2.06
LS 0.39 0.34 0.11 2.97 1.57 2.67 1.48 1.80 0.80 0.84 0.26 3.27
PUW20 PUW100 PUW250
Pred Avg Std SR Pred Avg Std SR Pred Avg Std SR
D1 -0.32 -0.27 0.19 -1.41 -0.23 -0.16 0.17 -0.98 -0.22 -0.16 0.17 -0.95
D2 -0.12 -0.06 0.14 -0.40 -0.11 -0.05 0.14 -0.37 -0.11 -0.05 0.14 -0.36
D3 -0.06 -0.02 0.12 -0.13 -0.05 -0.01 0.12 -0.11 -0.05 -0.01 0.12 -0.10
D4 -0.02 0.02 0.11 0.19 -0.02 0.02 0.11 0.19 -0.02 0.02 0.11 0.19
D5 0.01 0.04 0.10 0.38 0.02 0.04 0.10 0.39 0.02 0.04 0.10 0.39
D6 0.05 0.07 0.10 0.64 0.05 0.06 0.10 0.63 0.04 0.06 0.10 0.63
D7 0.08 0.09 0.11 0.86 0.07 0.09 0.11 0.85 0.07 0.09 0.11 0.85
D8 0.11 0.12 0.11 1.09 0.11 0.11 0.11 1.04 0.11 0.11 0.11 1.03
D9 0.16 0.16 0.11 1.36 0.15 0.15 0.11 1.31 0.15 0.15 0.11 1.30
D10 0.29 0.32 0.16 2.00 0.24 0.24 0.14 1.81 0.23 0.24 0.13 1.78
LS 0.56 0.53 0.15 3.66 0.41 0.35 0.12 3.08 0.40 0.34 0.11 3.01
Table 4: In this table, we compare the economic performance of prediction sorted portfolios over the
30-year out-of-sample testing period for the ensemble GPR model with γ-exponential kernel. We
compare the performance of the decile portfolios corresponding to EW, VW, PW, UW and PUW
strategies. We report the performance of PUW portfolios for different values, {1, 10, 20, 100, 250},
of ζ, the uncertainty-aversion parameter. We also compare the long-short portfolios. For each
portfolio, we report the predicted monthly returns (“Pred”), the average realized monthly returns
(“Avg”), their standard deviations (“Std”), and Sharpe ratios (“SR”). We calculate these measures
using realized simple excess returns of the portfolios over the test sample. The values of “Avg”,
“Std” and “SR” for the S&P 500 are 0.054, 0.150 and 0.360 respectively. All measures are annual-
ized.
EW VW PW
Pred Avg Std SR Pred Avg Std SR Pred Avg Std SR
D1 -0.19 -0.09 0.25 -0.36 -0.17 -0.04 0.24 -0.17 -0.32 -0.16 0.29 -0.55
D2 -0.04 0.01 0.20 0.06 -0.04 0.02 0.19 0.10 -0.06 0.00 0.21 0.01
D3 0.01 0.04 0.18 0.20 0.01 0.05 0.17 0.28 0.00 0.03 0.19 0.19
D4 0.05 0.06 0.17 0.36 0.05 0.07 0.15 0.48 0.04 0.06 0.17 0.34
D5 0.08 0.08 0.16 0.46 0.08 0.07 0.15 0.47 0.07 0.07 0.17 0.45
D6 0.11 0.09 0.17 0.55 0.11 0.09 0.15 0.63 0.11 0.10 0.17 0.58
D7 0.14 0.11 0.17 0.66 0.14 0.10 0.15 0.68 0.14 0.11 0.17 0.66
D8 0.17 0.13 0.18 0.73 0.17 0.11 0.16 0.71 0.18 0.14 0.18 0.75
D9 0.23 0.18 0.21 0.88 0.23 0.14 0.18 0.81 0.24 0.20 0.22 0.91
D10 0.38 0.34 0.27 1.30 0.33 0.16 0.21 0.72 0.52 0.50 0.31 1.60
LS 0.51 0.38 0.16 2.36 0.44 0.14 0.18 0.80 0.79 0.60 0.22 2.73
UW PUW1 PUW10
Pred Avg Std SR Pred Avg Std SR Pred Avg Std SR
D1 -0.13 -0.06 0.36 -0.16 -0.71 -0.66 0.79 -0.83 -0.53 -0.32 0.60 -0.54
D2 -0.03 0.09 0.27 0.32 -0.08 -0.08 0.34 -0.23 -0.07 0.02 0.30 0.08
D3 0.01 0.05 0.21 0.22 -0.01 0.00 0.31 0.01 -0.01 0.05 0.30 0.15
D4 0.05 0.01 0.21 0.03 0.03 0.04 0.21 0.18 0.04 0.05 0.23 0.23
D5 0.08 0.05 0.20 0.26 0.07 0.06 0.20 0.30 0.07 0.05 0.21 0.24
D6 0.11 0.13 0.23 0.57 0.12 0.05 0.21 0.26 0.12 0.09 0.23 0.37
D7 0.14 0.07 0.19 0.38 0.15 0.03 0.26 0.12 0.15 0.06 0.25 0.25
D8 0.17 0.11 0.24 0.47 0.20 0.17 0.23 0.76 0.19 0.15 0.20 0.72
D9 0.22 0.26 0.31 0.83 0.27 0.25 0.34 0.73 0.26 0.24 0.31 0.77
D10 0.30 0.16 0.31 0.52 0.96 1.12 1.27 0.89 0.73 0.36 0.71 0.51
LS 0.37 0.16 0.43 0.37 1.61 1.73 1.42 1.22 1.21 0.63 0.94 0.67
PUW20 PUW100 PUW250
Pred Avg Std SR Pred Avg Std SR Pred Avg Std SR
D1 -0.39 -0.18 0.55 -0.32 -0.17 -0.17 0.41 -0.42 -0.14 -0.09 0.37 -0.23
D2 -0.07 -0.06 0.28 -0.23 -0.05 0.07 0.28 0.26 -0.04 0.10 0.27 0.35
D3 0.00 0.01 0.22 0.04 0.01 0.08 0.21 0.35 0.01 0.07 0.20 0.33
D4 0.04 0.04 0.22 0.18 0.04 0.01 0.20 0.05 0.05 -0.01 0.21 -0.03
D5 0.07 0.06 0.20 0.31 0.08 0.07 0.20 0.34 0.08 0.06 0.20 0.30
D6 0.12 0.11 0.24 0.46 0.11 0.12 0.23 0.51 0.11 0.13 0.23 0.55
D7 0.15 0.04 0.25 0.17 0.14 0.05 0.20 0.27 0.14 0.05 0.19 0.29
D8 0.19 0.14 0.22 0.62 0.18 0.10 0.23 0.44 0.17 0.10 0.24 0.40
D9 0.25 0.30 0.32 0.92 0.23 0.26 0.30 0.85 0.22 0.26 0.31 0.84
D10 0.60 0.28 0.60 0.48 0.34 0.20 0.33 0.61 0.32 0.13 0.32 0.42
LS 0.94 0.41 0.81 0.51 0.45 0.32 0.47 0.68 0.40 0.17 0.44 0.38
Table 5: In this table, we compare the economic performance of prediction sorted portfolios over the
30-year out-of-sample testing period for the ensemble GPR model with affine kernel. We compare
the performance of the decile portfolios corresponding to EW, VW, PW, UW and PUW strategies.
We report the performance of PUW portfolios for different values, {1, 10, 20, 100, 250}, of ζ, the
uncertainty-aversion parameter. We also compare the long-short portfolios. For each portfolio, we
report the predicted monthly returns (“Pred”), the average realized monthly returns (“Avg”), their
standard deviations (“Std”), and Sharpe ratios (“SR”). We calculate these measures using realized
simple excess returns of the portfolios over the test sample. The values of “Avg”, “Std” and “SR”
for the S&P 500 are 0.054, 0.150 and 0.360 respectively. All measures are annualized.
Linear Regression (LR)
EW VW PW
Pred Avg Std SR Pred Avg Std SR Pred Avg Std SR
D1 -0.19 -0.09 0.25 -0.36 -0.17 -0.04 0.24 -0.17 -0.27 -0.13 0.26 -0.52
D2 -0.04 0.01 0.20 0.06 -0.04 0.02 0.19 0.10 -0.06 0.02 0.20 0.09
D3 0.01 0.04 0.18 0.20 0.01 0.05 0.17 0.28 -0.01 0.05 0.18 0.31
D4 0.05 0.06 0.17 0.36 0.05 0.07 0.15 0.48 0.03 0.07 0.17 0.40
D5 0.08 0.08 0.16 0.46 0.08 0.07 0.15 0.47 0.06 0.07 0.16 0.44
D6 0.11 0.09 0.17 0.55 0.11 0.09 0.15 0.63 0.10 0.09 0.17 0.52
D7 0.14 0.11 0.17 0.66 0.14 0.10 0.15 0.68 0.13 0.10 0.18 0.59
D8 0.17 0.13 0.18 0.73 0.17 0.11 0.16 0.71 0.17 0.12 0.20 0.64
D9 0.23 0.18 0.21 0.88 0.23 0.14 0.18 0.81 0.23 0.19 0.23 0.81
D10 0.38 0.34 0.27 1.30 0.33 0.16 0.21 0.72 0.49 0.49 0.32 1.53
LS 0.51 0.38 0.16 2.36 0.44 0.14 0.18 0.80 0.70 0.57 0.22 2.63
Ensemble Linear Regression (E-LR)
EW VW PW
Pred Avg Std SR Pred Avg Std SR Pred Avg Std SR
D1 -0.19 -0.09 0.25 -0.36 -0.17 -0.04 0.24 -0.17 -0.33 -0.16 0.31 -0.51
D2 -0.04 0.01 0.20 0.06 -0.04 0.02 0.19 0.10 -0.06 0.01 0.22 0.04
D3 0.01 0.04 0.18 0.20 0.01 0.05 0.17 0.28 0.00 0.04 0.20 0.21
D4 0.05 0.06 0.17 0.36 0.05 0.07 0.15 0.48 0.04 0.06 0.18 0.33
D5 0.08 0.08 0.16 0.46 0.08 0.07 0.15 0.47 0.07 0.08 0.17 0.47
D6 0.11 0.09 0.17 0.55 0.11 0.09 0.15 0.63 0.11 0.09 0.16 0.57
D7 0.14 0.11 0.17 0.66 0.14 0.10 0.15 0.68 0.14 0.11 0.17 0.69
D8 0.17 0.13 0.18 0.73 0.17 0.11 0.16 0.71 0.18 0.14 0.17 0.79
D9 0.23 0.18 0.21 0.88 0.23 0.14 0.18 0.81 0.23 0.19 0.20 0.97
D10 0.38 0.34 0.27 1.30 0.33 0.16 0.21 0.72 0.49 0.45 0.29 1.56
LS 0.51 0.38 0.16 2.36 0.44 0.14 0.18 0.80 0.76 0.56 0.22 2.57
Table 6: In this table, we compare the economic performance of prediction sorted portfolios over the
30-year out-of-sample testing period for the standard linear regression model (LR) and ensemble
linear regression model (E-LR). We compare the performance of the decile portfolios corresponding
to EW, VW, and PW strategies. We also compare the long-short portfolios. For each portfolio,
we report the predicted monthly returns (“Pred”), the average realized monthly returns (“Avg”),
their standard deviations (“Std”), and Sharpe ratios (“SR”). We calculate these measures using
realized simple excess returns of the portfolios over the test sample. The values of “Avg”, “Std”
and “SR” for the S&P 500 are 0.054, 0.150 and 0.360 respectively. All measures are annualized.
with both achieving slightly better results than LR. A similar pattern is observed for VW long-short portfolios, underscoring the robustness of non-linear approaches. Moreover, ensemble models
exhibit considerably better performance than LR, further confirming the importance of adopting
an ensemble learning approach. When incorporating uncertainty, the PUW portfolio from our model surpasses the PUW portfolio from E-GPR (affine) for both levels of uncertainty aversion (ζ = 1 and ζ = 20), as well as the PW portfolios from the linear benchmark models, as shown in Figure A.10.15
We end this section with a disclaimer noting that the performance of the above portfolios does not take into account any transaction costs, which, when considered, could offset the gains.16 Importantly, our dataset contains highly illiquid stocks with extremely small market capitalization. These stocks are thus unlikely to be accessible to investors and may incur significant transaction costs due to high bid-ask spreads and low liquidity. Hence, an investor might not be able to exploit the higher gains by shorting the D1 portfolio. But this applies to the benchmarks in the literature as well.
3.4 Cross-sectional insights
This section focuses on cross-sectional insights of the model performance. Section 3.4.1 analyzes
the contribution of individual features to the model’s performance. Section 3.4.2 examines the
relationship between the features and the return predictions, while Section 3.4.3 examines how the
features relate to the prediction uncertainty.
3.4.1 Variable importance
We analyze the relative importance of individual features based on their contribution to the performance of our model. Following the approach in Gu et al. [2020], we rank the features based on a variable importance metric, denoted $VI_j$ for any feature $j$. $VI_j$ is defined as the reduction in pooled R-squared resulting from assigning all values of feature $j$ to zero, while keeping the estimates of the remaining model parameters unchanged. Specifically, we re-obtain the predictions for each test month using the pre-trained GPR models from the training period, utilizing all but the $j$th feature, which is set to zero. Figure 9 reports the resultant importance of the top twenty features, normalized to sum to one.17 Beyond these, variable importance hovers near zero. The total contribution by the top twenty features is 96.89%.
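The following sketch illustrates this variable importance computation. It assumes a fitted predictor exposing a `.predict(X)` method and a pooled R-squared computed against a zero benchmark, as in Gu et al. [2020]; these interface details are ours and only indicative.

```python
import numpy as np

def r2_pool(y: np.ndarray, y_hat: np.ndarray) -> float:
    """Pooled R-squared against a zero benchmark (assumed definition)."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum(y ** 2)

def variable_importance(model, X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """VI_j: reduction in pooled R-squared when feature j is set to zero,
    keeping the trained model fixed."""
    base = r2_pool(y, model.predict(X))
    vi = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        Xj = X.copy()
        Xj[:, j] = 0.0                       # zero out feature j only
        vi[j] = base - r2_pool(y, model.predict(Xj))
    vi = np.clip(vi, 0.0, None)              # negligible negative values set to zero
    return vi / vi.sum()                     # normalize importances to sum to one
```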
Following Gu et al. [2020], we have grouped the features into four categories. The first category
is based on recent price trends, which includes variables such as short-term reversal (mom1m and
mom6m), stock momentum (mom12m), momentum change (chmom), long-term reversal (mom36m),
recent maximum return (maxret) and industry momentum (indmom). The second category con-
sists of liquidity variables, which include bid-ask spread (baspread), dollar volume (dolvol), turnover
15These levels of uncertainty aversion were chosen based on their superior Sharpe ratios.
16Avramov, Cheng, and Metzker [2020] pointed out that transaction costs could significantly deteriorate the per-
formance of machine learning portfolios due to high turnover or extreme positions.
17There were features for which the reduction in $R^2_{\text{pool}}$ (%) was negative but negligibly close to zero. We replaced the variable importance for such features with zero.
Figure 9: Top twenty features by variable importance, averaged over all training samples. The
variable importance of these twenty features is normalized to sum to one.
Category Importance (%)
Liquidity Variables 41.78
Recent Price Trends 41.02
Risk Measures 10.91
Valuation Ratios and Fundamental Signals 6.29
Table 7: Variable Importance by Categories
volatility (std turn), Amihud illiquidity (ill), turnover (turn), and cash. The third category con-
sists of risk measures, which include total and idiosyncratic return volatility (retvol, idiovol). The
fourth category includes valuation ratios and fundamental signals consisting of features such as
R&D to market capitalization (rd mev), asset growth (agr), change in shares outstanding (chcsho)
and leverage (lev). We also report the category-wise contributions of these features in Table 7, which shows that our model is inclusive: it extracts predictive information from a wide range of features, assigning significant importance to all four categories rather than depending on any single one.
3.4.2 Association between features and predicted returns
Next, we investigate the cross-sectional heterogeneity in predicted returns and prediction uncer-
tainty, and relate them to the above top twenty features.
To analyze the relation between predicted returns and the features, in each of the test months we divide the cross-section into deciles based on the predicted returns. That is, given our predictions in each test month, we sort stocks into deciles, which we denote by D1, D2, . . . , D10, where D1 corresponds to the stocks with the lowest predicted returns and D10 to the stocks with the highest predicted returns. Within each decile, we find the mean of each of the top twenty features, resulting in a vector of length 360 (the number of test months) for each feature. Since the
cross-sectional mean of each feature is standardized to zero, we also conduct t-tests to determine
whether the mean feature values within each decile are significantly different from zero. We report
the results of the t-test in Table A.1 in the appendix.
Figure 10 presents the mean values of the top twenty features, categorized into four groups,
averaged across the testing period for each decile. In the panel on liquidity variables, we observe that
stocks with the highest predicted returns tend to be less liquid, as indicated by higher values of the
bid-ask spread (baspread) and illiquidity (ill). Additionally, higher predicted returns are associated
with lower values of turnover (turn) and dollar volume (dolvol), further reinforcing the observation
that less liquid stocks are linked to higher predicted returns. For momentum variables, we observe
an increasing trend in 6-month (mom6m) and 12-month (mom12m) momentum, coupled with a
declining trend in 1-month (mom1m) momentum. This pattern suggests a positive momentum effect
over longer horizons and a reversal effect over shorter horizons. Further, the stocks in deciles D1 and D10 tend to carry higher levels of risk, as depicted by a U-shaped pattern in the risk measures retvol and idiovol, a finding that aligns with the risk-return trade-off: stocks with higher expected returns, positive or negative, carry higher risk. The bottom right panel, valuation
ratios and fundamentals, shows that our model predicts higher returns for growth-oriented and R&D-intensive firms, as depicted by the rising trends in rd mve and agr. In contrast, chcsho exhibits a declining trend, suggesting higher predicted returns for firms with less equity dilution.
In essence, the above findings also give insights into the shape of the predictive function as
they reveal how our model translates features to predictions. That is, analyzing the behavior
of features across deciles based on predicted returns offers an understanding of the structural
relationship and feature interactions that govern the model’s predictive behavior. In conclusion,
we observe significant heterogeneity in the features across deciles, suggesting that predicted returns
are influenced by a diverse set of factors.
3.4.3 Association between features and prediction uncertainty
To analyze the relation between prediction uncertainty and the features, we repeat the above exercise. Given our prediction uncertainties in each test month, we sort stocks into deciles, which we denote by D1, D2, . . . , D10, where D1 corresponds to the stocks with the lowest prediction uncertainty and D10 to the stocks with the highest prediction uncertainty.
Figure 11 shows the mean value of each of the top twenty features, categorized into four groups,
averaged over the testing period for each decile. We report the results of the t-test in Table A.2
in the appendix. There are several interesting observations. We observe that the stocks with high prediction uncertainty are indeed the ones subject to limits-to-arbitrage frictions and that exhibit
Figure 10: This figure shows the relationship between the features and predicted returns. Horizontal
axes show cross-sectional deciles of predicted returns (D1 is lowest, D10 is highest). Vertical axes
show time averages of conditional means of feature values given deciles.
Figure 11: This figure shows the relationship between the features and prediction uncertainty.
Horizontal axes show cross-sectional deciles of prediction uncertainty (D1 is lowest, D10 is highest).
Vertical axes show time averages of conditional means of feature values given deciles.
extreme illiquidity, as depicted by a high bid-ask spread (baspread), high illiquidity (ill), and low dollar volume (dolvol). Further, the stocks with high uncertainty have high values of the risk measures, namely return volatility (retvol) and idiosyncratic volatility (idiovol), low asset growth (agr), relatively higher leverage, and a high maximum return (maxret). In contrast, the other recent price trends exhibit no clear association with prediction uncertainty.
3.5 Residual diagnostics
Leveraging the inherent ability of the GPR model to quantify prediction uncertainty, we present another use case. Following Gu et al. [2021], we consider a set of pre-specified portfolios, defined as
$$w_t^\top = (Z_t^\top Z_t)^{-1} Z_t^\top,$$
where $Z_t$ is the $|I_t| \times 94$ matrix whose $i$th row consists of the 94 features corresponding to the $i$th asset at time $t$. Given the portfolio weights $w_t$ at the beginning of each test month, we obtain the
Figure 12: Empirical cumulative distribution function of the standardized residuals of the pre-specified feature-sorted portfolios.
realized portfolio returns
$$R_{t+1} = w_t^\top r_{t+1}.$$
The vector $R_{t+1}$ represents a collection of portfolios that are dynamically rebalanced based on features. We also obtain the predicted portfolio returns, $w_t^\top \hat{r}_{t+1}$, and the prediction uncertainty in terms of the variance, $w_t^\top \hat{\Sigma}_{t+1} w_t$, thanks to our posterior uncertainty estimates that allow us to calculate portfolio variances. For each of the 360 test months and the 94 portfolios, we compute the standardized residuals
$$\eta_{t+1} = \frac{R_{t+1} - w_t^\top \hat{r}_{t+1}}{\sqrt{w_t^\top \hat{\Sigma}_{t+1} w_t}},$$
resulting in a balanced panel of size 94 × 360. According to standard GPR model assumptions, the standardized residuals, $\eta_{t+1}$, should follow a standard normal distribution. A normality test rejects this hypothesis, which, however, is due to the heavy tails of the residuals. In
fact, Figure 12 presents the empirical cumulative distribution function (ECDF) of the grand panel
of standardized residuals. Remarkably, the empirical median is exactly zero. Approximately 75%
of the residuals fall within one standard deviation, and 94.1% lie within two standard deviations,
which aligns with the two-sigma rule.
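As an illustration, the following sketch computes the standardized residuals of the 94 pre-specified portfolios for one test month from the GPR outputs; all names are ours, and the inputs are assumed to be NumPy arrays.

```python
import numpy as np

def standardized_residuals(Z_t, r_next, r_hat_next, Sigma_hat_next):
    """Standardized residuals eta_{t+1} of the pre-specified portfolios.

    Z_t            : (n, 94) feature matrix at time t
    r_next         : (n,)    realized excess returns at t+1
    r_hat_next     : (n,)    GPR predicted returns for t+1
    Sigma_hat_next : (n, n)  GPR predictive covariance for t+1
    """
    W = np.linalg.solve(Z_t.T @ Z_t, Z_t.T)               # rows are w_t' = (Z'Z)^{-1} Z'
    R = W @ r_next                                        # realized portfolio returns
    R_hat = W @ r_hat_next                                # predicted portfolio returns
    var = np.einsum("ij,jk,ik->i", W, Sigma_hat_next, W)  # w' Sigma w per portfolio
    return (R - R_hat) / np.sqrt(var)
```

Stacking these 94-vectors over the 360 test months yields the balanced panel analyzed above.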
We also conduct statistical tests to determine whether the mean and variance of the standardized
residuals are 0 and 1, respectively. To this end, for each portfolio, we calculate the empirical mean
and variance of the residuals over the 360 months. Subsequently, we perform t-tests to assess
whether the average of the portfolio means and the average of the portfolio variances significantly
deviate from 0 and 1, respectively. This approach allows us to evaluate the consistency of the
Min 1st Qu. Median Mean 3rd Qu. Max p-value
Portfolio Means -0.244 -0.026 0.002 0.001 0.031 0.187 0.910
Portfolio Variances 0.592 0.816 0.908 1.025 1.113 3.344 0.537
Grand Panel of Residuals -30.352 -0.479 -0.001 0.001 0.482 20.191 0.898
Table 8: Summary statistics of the means and variances of the standardized residuals of the 94 portfolios, and the p-values of the corresponding t-tests.
residuals with the expected properties under the normality assumption. The first two rows in Table 8 present the summary statistics of the means and variances of the standardized residuals of the 94 portfolios, along with the p-values of the corresponding t-tests, which clearly show that we fail to reject the null hypothesis of mean 0 and variance 1. We also test whether the mean of the grand panel of the standardized residuals is significantly different from 0; the third row in Table 8 presents the corresponding results. A p-value of 0.898 shows that this is not the case. These findings
highlight our model’s capability to reliably predict portfolio returns as well as quantify prediction
uncertainties up to a two-sigma level.
4 Conclusion
The out-of-sample prediction of conditional expected stock returns remains a central challenge in
empirical asset pricing. In this paper, we introduce a novel ensemble GPR method to predict
conditional expected returns. While we do not claim that our simple method is the best approach
for all situations, we find that it outperforms the benchmarks by significant margins in terms of
R-squared. Exploiting the Bayesian nature of GPR, we also model and quantify the prediction
uncertainty, which leads to significant economic gains in terms of the performance of uncertainty-
weighted prediction-sorted portfolios. Our ensemble learning approach reduces the computational
complexity inherent in GPR and addresses the non-stationarity and heteroscedasticity in financial
data. As such, it lends itself to a variety of online learning tasks to be explored in future re-
search. Another direction of future research consists of exploiting kernel methods beyond Gaussian
processes for the modeling of statistical financial risk, such as in Filipovi´c and Pasricha [2022].
References
K. P. Ambachtsheer. Profit potential in an “almost efficient” market. The Journal of Portfolio
Management, 1(1):84–87, 1974.
D. Avramov, S. Cheng, and L. Metzker. Machine learning versus economic restrictions: Evidence
from stock return predictability. Available at SSRN 3450322, 2020.
D. Bianchi, M. Büchner, and A. Tamoni. Bond risk premiums with machine learning. The Review
of Financial Studies, 34(2):1046–1089, 2021.
Y. Cao and D. J. Fleet. Generalized product of experts for automatic and principled fusion of
Gaussian process predictions. arXiv preprint arXiv:1410.7827, 2014.
G. Casella, M. Ghosh, J. Gill, and M. Kyung. Penalized regression, standard errors, and Bayesian
lassos. Bayesian Analysis, 5(2):369–411, 2010.
L. Chen, M. Pelger, and J. Zhu. Deep learning in asset pricing. Management Science, Forthcoming,
2022.
A. Cousin, H. Maatouk, and D. Rullière. Kriging of financial term-structures. European Journal
of Operational Research, 255(2):631–648, 2016.
J. De Spiegeleer, D. B. Madan, S. Reyners, and W. Schoutens. Machine learning for quantitative
finance: fast derivative pricing, hedging and fitting. Quantitative Finance, 18(10):1635–1643,
2018.
M. Deisenroth and J. W. Ng. Distributed Gaussian processes. In International Conference on
Machine Learning, pages 1481–1490. PMLR, 2015.
W. Drobetz and T. Otto. Empirical asset pricing via machine learning: evidence from the European
stock market. Journal of Asset Management, 22(7):507–538, 2021.
E. F. Fama and K. R. French. Dissecting anomalies. The Journal of Finance, 63(4):1653–1678,
2008.
M. H. Farrell, T. Liang, and S. Misra. Deep neural networks for estimation and inference. Econo-
metrica, 89(1):181–213, 2021.
D. Filipović and P. Pasricha. Copula process models for financial risk management. Swiss Finance
Institute Working Paper, 2022.
D. Filipović, M. Pelger, and Y. Ye. Shrinking the term structure. Swiss Finance Institute Research
Paper, (22-61), 2022a.
D. Filipović, M. Pelger, and Y. Ye. Stripping the discount curve–a robust machine learning ap-
proach. Swiss Finance Institute Research Paper, (22-24), 2022b.
J. Gardner, G. Pleiss, K. Q. Weinberger, D. Bindel, and A. G. Wilson. GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. Advances in Neural Information Processing Systems, 31, 2018.
F. Giordano, M. L. Rocca, and C. Perna. Standard error estimation in neural network regression
models: the AR-Sieve bootstrap approach. In Neural Nets WIRN Vietri-01, pages 201–206.
Springer, 2002.
S. Gu, B. Kelly, and D. Xiu. Empirical Asset Pricing via Machine Learning. The Review of
Financial Studies, 33(5):2223–2273, 2020.
S. Gu, B. Kelly, and D. Xiu. Autoencoder asset pricing models. Journal of Econometrics, 222(1):
429–450, 2021.
R. Guhaniyogi, C. Li, T. D. Savitsky, and S. Srivastava. A divide-and-conquer Bayesian approach
to large-scale kriging. arXiv preprint arXiv:1712.09767, 2017.
J. Han, X.-P. Zhang, and F. Wang. Gaussian process regression stochastic volatility model for
financial time series. IEEE Journal of Selected Topics in Signal Processing, 10(6):1015–1028,
2016.
R. Kaniel, Z. Lin, M. Pelger, and S. Van Nieuwerburgh. Machine-learning the skill of mutual fund
managers. Technical report, National Bureau of Economic Research, 2022.
R. S. Koijen, T. J. Moskowitz, L. H. Pedersen, and E. B. Vrugt. Carry. Journal of Financial
Economics, 127(2):197–225, 2018.
J. W. Ng and M. P. Deisenroth. Hierarchical mixture-of-experts model for large-scale Gaussian
process regression. arXiv preprint arXiv:1412.3078, 2014.
J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. The Journal of Machine Learning Research, 6:1939–1959, 2005.
B. W. Silverman. Some aspects of the spline smoothing approach to non-parametric regression
curve fitting. Journal of the Royal Statistical Society: Series B (Methodological), 47(1):1–21,
1985.
V. Tresp. A Bayesian committee machine. Neural Computation, 12(11):2719–2741, 2000.
S. Wager, T. Hastie, and B. Efron. Confidence intervals for random forests: The jackknife and the
infinitesimal jackknife. The Journal of Machine Learning Research, 15(1):1625–1651, 2014.
I. Welch and A. Goyal. A comprehensive look at the empirical performance of equity premium
prediction. The Review of Financial Studies, 21(4):1455–1508, 2008.
C. K. Williams and C. E. Rasmussen. Gaussian processes for machine learning. MIT Press, Cambridge, MA, 2006.
A. Wilson and H. Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In International Conference on Machine Learning, pages 1775–1784. PMLR, 2015.
A. G. Wilson. Covariance kernels for fast automatic pattern discovery and extrapolation with
Gaussian processes. PhD thesis, University of Cambridge, Cambridge, UK, 2014.
A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing. Deep kernel learning. In Artificial Intelligence and Statistics, pages 370–378. PMLR, 2016.
A Gaussian Process Regression
In this appendix we give an introduction to Gaussian process regression. For more background and
theory we refer the reader to Williams and Rasmussen [2006].
A.1 Gaussian Processes
A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution. More specifically, let $\mathcal{X}$ be a non-empty set. A random function $f: \mathcal{X} \to \mathbb{R}$ is a Gaussian process (GP) with mean function $m(\cdot)$ and covariance function, or kernel, $k(\cdot,\cdot)$, if for any finite set $\mathbf{x} = (x_1, x_2, \ldots, x_n) \subset \mathcal{X}$ the random vector
$$f(\mathbf{x}) = (f(x_1), f(x_2), \ldots, f(x_n))^\top$$
follows a multivariate normal distribution $N(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}^\top))$ with mean vector
$$m(\mathbf{x}) = (m(x_1), m(x_2), \ldots, m(x_n))^\top$$
and covariance matrix
$$k(\mathbf{x}, \mathbf{x}^\top) = \left(k(x_i, x_j)\right)_{i,j=1}^{n}.$$
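As a minimal illustration of this definition, the following sketch draws sample paths of a zero-mean GP evaluated on a finite grid. The squared exponential kernel used here is our choice for the example; the empirical analysis in the main text uses the γ-exponential kernel.

```python
import numpy as np

def sq_exp_kernel(x: np.ndarray, y: np.ndarray, sigma: float = 1.0, ell: float = 0.5):
    """Squared exponential kernel k(x, x') = sigma^2 exp(-(x - x')^2 / (2 ell^2))."""
    d = x[:, None] - y[None, :]
    return sigma**2 * np.exp(-d**2 / (2 * ell**2))

x = np.linspace(0.0, 1.0, 100)                    # finite set of evaluation points
m = np.zeros_like(x)                              # zero mean function
K = sq_exp_kernel(x, x) + 1e-10 * np.eye(x.size)  # jitter for numerical stability
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(m, K, size=3)   # three draws of f(x) ~ N(m, K)
```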
A.2 Training GPR
To fit the hyperparameters by maximum likelihood, we compute the partial derivatives of the marginal likelihood function (6) w.r.t. the hyperparameters,
$$\frac{\partial}{\partial \theta_j} \log p(\mathbf{y} \mid X, \theta) = \frac{1}{2} \mathbf{y}^\top K_y^{-1} \frac{\partial K_y}{\partial \theta_j} K_y^{-1} \mathbf{y} - \frac{1}{2} \operatorname{tr}\!\left(K_y^{-1} \frac{\partial K_y}{\partial \theta_j}\right) = \frac{1}{2} \operatorname{tr}\!\left(\left(K_y^{-1} \mathbf{y} \mathbf{y}^\top K_y^{-1} - K_y^{-1}\right) \frac{\partial K_y}{\partial \theta_j}\right), \qquad (12)$$
where we write $K_y = K + \sigma_\epsilon^2 I$. We then find the optimal hyperparameters $\hat{\theta}$ by any gradient-based optimizer using the gradient given in (12).
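As a hedged sketch, the following implements the gradient (12) with respect to the length-scale of the squared exponential kernel (K7 in Appendix B), one hyperparameter at a time. In practice all hyperparameters would be optimized jointly, and the explicit inverse would be replaced by Cholesky solves for numerical stability.

```python
import numpy as np

def loglik_grad_ell(X, y, sigma=1.0, ell=0.5, noise=0.1):
    """Gradient (12) of the log marginal likelihood w.r.t. the length-scale ell,
    for the squared exponential kernel with noise variance noise^2."""
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # squared distances
    Kf = sigma**2 * np.exp(-D2 / (2 * ell**2))                  # noise-free kernel K
    Ky = Kf + noise**2 * np.eye(len(y))                         # K_y = K + sigma_eps^2 I
    Ky_inv = np.linalg.inv(Ky)
    alpha = Ky_inv @ y                                          # alpha = K_y^{-1} y
    dKy_dell = Kf * D2 / ell**3                                 # dK_y / d ell
    return 0.5 * np.trace((np.outer(alpha, alpha) - Ky_inv) @ dKy_dell)

# One gradient-ascent step on ell (any gradient-based optimizer works):
# ell += learning_rate * loglik_grad_ell(X, y, sigma, ell, noise)
```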
B Choice of kernel
A critical ingredient in the success of Gaussian process regression is the choice of kernel, as it encodes our prior assumptions about the function being modeled. It also drives the model's potential to capture the relationship between input features and the output variable by governing the flexibility, smoothness and generalization ability of the model. In preliminary experiments, we evaluated various kernels for predicting excess log returns over the validation period, to arrive at a guided choice of the optimal kernel. We experimented with the following ten kernels,

(i) $K_1(x, x') = \sigma^2 (1 + \alpha\|x\|)(1 + \alpha\|x'\|)\left(1 + \frac{\|x - x'\|^2}{2\alpha\ell^2}\right)^{-\alpha}$

(ii) $K_2(x, x') = \sigma^2 \sqrt{1 + \alpha\|x\|^2}\,\sqrt{1 + \alpha\|x'\|^2}\left(1 + \frac{\|x - x'\|^2}{2\alpha\ell^2}\right)^{-\alpha}$

(iii) $K_3(x, x') = \sigma^2 \sqrt{1 + \alpha\|x\|^2}\,\sqrt{1 + \alpha\|x'\|^2}\,\exp\left(-\frac{\|x - x'\|^2}{2\ell^2}\right)$

(iv) $K_4(x, x') = \sigma^2 (1 + \alpha\|x\|)(1 + \alpha\|x'\|)\exp\left(-\frac{\|x - x'\|}{\ell}\right)$

(v) $K_5(x, x') = \sigma^2 \left(1 + \frac{\|x - x'\|}{2\alpha\ell^2}\right)^{-\alpha}$

(vi) $K_6(x, x') = \sigma^2 (1 + \alpha\|x\|)(1 + \alpha\|x'\|)\left(1 + \frac{\|x - x'\|}{\beta}\right)^{-1}$

(vii) $K_7(x, x') = \sigma^2 \exp\left(-\frac{\|x - x'\|^2}{2\ell^2}\right)$

(viii) $K_8(x, x') = \sigma^2 \left(1 + \frac{\|x - x'\|^2}{2\alpha\ell^2}\right)^{-\alpha}$

(ix) $K_9(x, x') = \sigma^2 \exp\left(-\frac{\|x - x'\|}{\ell}\right)$

(x) $K_{10}(x, x') = \sigma^2 \exp\left(-\left(\frac{\|x - x'\|}{\ell}\right)^{\gamma}\right)$

where $\sigma, \alpha, \ell > 0$ and $\gamma \in (0, 2]$. Here, the kernels $K_7$ to $K_{10}$ are the standard kernels: the squared exponential kernel, the rational quadratic kernel, the exponential kernel and the $\gamma$-exponential kernel, respectively. The kernels $K_1$ to $K_6$ are variations of the four standard kernels, defined with the aim of introducing a feature-dependent variance function, utilizing the fact that any non-degenerate covariance kernel can be factorized as $k(x, x') = \sqrt{v(x)\,v(x')}\,\rho(x, x')$, where $v: \mathcal{X} \to (0, \infty)$ is a function and $\rho$ is a kernel on $\mathcal{X}$ with $\rho(x, x) = 1$. This factorization can always be achieved by setting $v(x) = k(x, x)$ and $\rho(x, x') = k(x, x')/\sqrt{v(x)\,v(x')}$, so that $v(x)$ is the variance of $f(x)$, and $\rho(x, x')$ the linear correlation of $f(x)$ and $f(x')$.
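For illustration, the sketch below implements the γ-exponential kernel $K_{10}$ and the factorization used to build the variance-modulated kernels $K_1$ to $K_6$; the function names are ours.

```python
import numpy as np

def gamma_exp_kernel(X, Y, sigma=1.0, ell=1.0, gamma=1.5):
    """gamma-exponential kernel K10, with gamma in (0, 2]."""
    d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)  # pairwise distances
    return sigma**2 * np.exp(-((d / ell) ** gamma))

def variance_modulated_kernel(rho, v, X, Y):
    """Build k(x, x') = sqrt(v(x) v(x')) rho(x, x') from a correlation kernel
    rho with rho(x, x) = 1 and a positive variance function v."""
    return np.sqrt(v(X)[:, None] * v(Y)[None, :]) * rho(X, Y)
```

For example, $K_4$ obtains by combining the exponential correlation $\rho(x, x') = \exp(-\|x - x'\|/\ell)$ with the variance function $v(x) = \sigma^2 (1 + \alpha\|x\|)^2$.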
Figures A.1 and A.2 show $R^2_{\text{pool}}$ over the validation sample for the MSE- and equal-weighting schemes against varying lengths $K$ of the training window in predicting log excess returns. It is evident from both plots that both weighting schemes generate positive $R^2_{\text{pool}}$ for $K$ large enough ($K > 50$), with $K_2$ being an exception in the equal-weighting scheme. Further, the best performing kernels are $K_5$, $K_8$ and $K_{10}$. To conclude, based on the performance in the validation sample on log excess returns, we select these three kernels for further analysis.
C Additional results
This appendix includes supplementary figures and tables referenced in the main text.
Figure A.1: This figure presents $R^2_{\text{pool}}$ over the validation sample, Jan 1982 to Dec 1986, for the MSE-weighting scheme against the length of the training window for the ten kernels under consideration.
Figure A.2: This figure presents $R^2_{\text{pool}}$ over the validation sample, Jan 1982 to Dec 1986, for the equal-weighting scheme against the length of the training window for the ten kernels under consideration.
Figure A.3: This figure shows the evolution of $R^2_t$ over the test sample. The shaded periods indicate NBER recessions.
Figure A.4: This figure shows the cumulative excess log returns of equal weighted (EW) decile
portfolios sorted based on our predicted returns. It also shows the S&P 500. The shaded periods
indicate NBER recessions.
Figure A.5: This figure shows the cumulative excess log returns of prediction weighted (PW) decile
portfolios sorted based on our predicted returns. It also shows the S&P 500. The shaded periods
indicate NBER recessions.
Figure A.6: This figure shows the cumulative excess log returns of uncertainty-weighted (UW)
decile portfolios sorted based on our predicted returns. It also shows the S&P 500. The shaded
periods indicate NBER recessions.
Figure A.7: This figure shows the cumulative excess log returns of the prediction-uncertainty-weighted (PUW), with ζ = 20, decile portfolios sorted based on our predicted returns. It also shows the S&P 500. The shaded periods indicate NBER recessions.
Figure A.8: This figure shows the cumulative excess log returns of equal-weighted (EW) long-short
portfolio sorted based on the predicted returns from our GPR model (E-GPR (γ-exp)) and linear
benchmark models (E-GPR (affine), E-LR and LR). The shaded periods indicate NBER recessions.
Figure A.9: This figure shows the cumulative excess log returns of value-weighted (VW) long-short
portfolio sorted based on the predicted returns from our GPR model (E-GPR (γ-exp)) and linear
benchmark models (E-GPR (affine), E-LR and LR). The shaded periods indicate NBER recessions.
Figure A.10: This figure shows the cumulative excess log returns of prediction-weighted (PW) and
prediction-uncertainty-weighted (PUW) long-short portfolio sorted based on the predicted returns
from our GPR model (E-GPR (γ-exp)) and linear benchmark models (E-GPR (affine), E-LR and
LR). The shaded periods indicate NBER recessions.
D1 D2 D3 D4 D5 D6 D7 D8 D9 D10
mom1m 0.844 0.270 0.145 0.071 0.015 -0.042 -0.105 -0.187 -0.308 -0.703
baspread 0.732 0.158 -0.089 -0.218 -0.274 -0.289 -0.272 -0.213 -0.080 0.545
dolvol -0.107 0.004 0.040 0.077 0.116 0.135 0.141 0.093 -0.020 -0.480
std turn 0.328 0.054 -0.059 -0.110 -0.125 -0.120 -0.093 -0.048 0.036 0.132
ill -0.005 -0.058 -0.085 -0.102 -0.110 -0.107 -0.100 -0.074 -0.005 0.645
retvol 0.984 0.207 -0.077 -0.225 -0.291 -0.313 -0.304 -0.257 -0.143 0.417
mom6m -0.548 -0.215 -0.088 -0.013 0.038 0.082 0.130 0.183 0.255 0.178
idiovol 0.609 0.201 -0.056 -0.201 -0.266 -0.281 -0.261 -0.192 -0.038 0.484
turn 0.475 0.185 0.026 -0.053 -0.086 -0.103 -0.101 -0.102 -0.091 -0.156
mom12m -0.562 -0.286 -0.152 -0.065 0.011 0.073 0.144 0.223 0.330 0.284
chmom 0.125 0.086 0.072 0.055 0.029 0.006 -0.020 -0.053 -0.091 -0.209
rd mve -0.033 -0.045 -0.052 -0.054 -0.054 -0.048 -0.036 -0.011 0.044 0.290
zerotrade -0.059 0.029 0.030 0.009 -0.015 -0.031 -0.046 -0.046 -0.029 0.157
mom36m 0.089 0.080 0.043 0.025 0.023 0.020 0.022 0.003 -0.046 -0.259
agr -0.418 -0.233 -0.086 -0.016 0.026 0.059 0.087 0.118 0.164 0.300
maxret 1.070 0.236 -0.039 -0.184 -0.251 -0.282 -0.287 -0.266 -0.196 0.200
cash 0.141 0.063 -0.006 -0.044 -0.066 -0.072 -0.072 -0.049 0.003 0.101
chcsho 0.270 0.155 0.067 0.018 -0.014 -0.037 -0.061 -0.087 -0.122 -0.188
indmom -0.481 -0.307 -0.211 -0.125 -0.039 0.044 0.137 0.233 0.345 0.406
lev -0.083 -0.065 -0.039 -0.028 -0.016 -0.002 0.018 0.048 0.076 0.091
Table A.1: Source of heterogeneity (based on predicted returns). Bold indicates that the null hypothesis of a zero mean is not rejected at the 1% level; otherwise, it is rejected at the 1% level.
D1 D2 D3 D4 D5 D6 D7 D8 D9 D10
mom1m 0.003 0.042 0.038 0.031 0.027 0.026 0.021 0.009 -0.016 -0.182
baspread -0.764 -0.484 -0.423 -0.335 -0.222 -0.088 0.075 0.285 0.610 1.344
dolvol -0.243 0.150 0.246 0.229 0.172 0.098 0.023 -0.066 -0.181 -0.428
std turn -0.339 -0.168 -0.179 -0.148 -0.099 -0.033 0.042 0.134 0.250 0.534
ill -0.217 -0.195 -0.182 -0.159 -0.127 -0.083 -0.014 0.095 0.246 0.636
retvol -0.779 -0.496 -0.424 -0.329 -0.219 -0.085 0.076 0.284 0.602 1.369
mom6m 0.013 0.056 0.057 0.045 0.036 0.029 0.026 0.012 -0.011 -0.263
idiovol -0.787 -0.602 -0.533 -0.403 -0.256 -0.085 0.115 0.355 0.733 1.462
turn -0.407 -0.255 -0.213 -0.155 -0.091 -0.010 0.080 0.188 0.312 0.547
mom12m 0.004 0.065 0.062 0.049 0.039 0.029 0.032 0.022 -0.006 -0.296
chmom 0.008 0.013 0.018 0.010 0.007 0.004 0.000 -0.007 -0.004 -0.048
rd mve -0.004 -0.088 -0.123 -0.117 -0.097 -0.077 -0.045 0.009 0.116 0.426
zerotrade -0.084 -0.018 -0.034 -0.017 0.001 0.012 0.033 0.047 0.042 0.019
mom36m -0.037 0.027 0.041 0.047 0.053 0.058 0.059 0.045 -0.028 -0.265
agr 0.005 0.109 0.118 0.103 0.086 0.060 0.021 -0.050 -0.162 -0.290
maxret -0.659 -0.425 -0.364 -0.284 -0.191 -0.079 0.058 0.235 0.508 1.200
cash -0.021 -0.306 -0.268 -0.186 -0.112 -0.040 0.037 0.131 0.279 0.485
chcsho -0.008 -0.148 -0.137 -0.098 -0.074 -0.039 -0.006 0.053 0.141 0.316
indmom -0.155 -0.006 0.026 0.020 0.028 0.035 0.038 0.033 0.013 -0.031
lev 0.006 0.103 0.050 -0.001 -0.035 -0.058 -0.069 -0.057 -0.041 0.102
Table A.2: Source of heterogeneity (based on prediction uncertainty). Bold indicates that the null hypothesis of a zero mean is not rejected at the 1% level; otherwise, it is rejected at the 1% level.