A Modified CTGAN-Plus-Features Based Method for Optimal Asset
Allocation
José-Manuel Peña (a), Fernando Suárez (a), Omar Larré (a), Domingo Ramírez (a), Arturo
Cifuentes (b)
(a) Fintual Administradora General de Fondos S.A., Santiago, Chile; Fintual, Inc.
(b) Clapes UC, Pontificia Universidad Católica de Chile, Santiago, Chile.
Contact email: research@fintual.com
ARTICLE HISTORY
Compiled May 17, 2024
ABSTRACT
We propose a new approach to portfolio optimization that utilizes a unique com-
bination of synthetic data generation and a CVaR-constraint. We formulate the
portfolio optimization problem as an asset allocation problem in which each asset
class is accessed through a passive (index) fund. The asset-class weights are deter-
mined by solving an optimization problem which includes a CVaR-constraint. The
optimization is carried out by means of a Modified CTGAN algorithm which incor-
porates features (contextual information) and is used to generate synthetic return
scenarios, which, in turn, are fed into the optimization engine. For contextual infor-
mation we rely on several points along the U.S. Treasury yield curve. The merits of
this approach are demonstrated with an example based on ten asset classes (cover-
ing stocks, bonds, and commodities) over a fourteen-and-a-half-year period (January
2008-June 2022). We also show that the synthetic generation process is able to cap-
ture well the key characteristics of the original data, and the optimization scheme
results in portfolios that exhibit satisfactory out-of-sample performance. We also
show that this approach outperforms the conventional equal-weights (1/N) asset
allocation strategy and other optimization formulations based on historical data
only.
KEYWORDS
Asset allocation; Portfolio optimization; Portfolio selection; Synthetic data;
Synthetic returns; Machine learning; Features; Contextual information; GAN;
CTGAN; neural networks
1. Motivation and Previous Work
The portfolio selection problem—how to spread a given budget among several invest-
ment options—is probably one of the oldest problems in applied finance. Until 1952,
when Harry Markowitz published his famous portfolio selection paper, the issue was
tackled with a mix of gut feeling, intuition, and whatever could pass for common
sense at the moment. A distinctive feature of these approaches was that they were, in
general, qualitative in nature.
Markowitz’s pioneering work showed that the portfolio selection problem was in
essence an optimization problem that could be stated within the context of a well-
defined mathematical framework (Markowitz, 1952). The key ideas behind this frame-
work (e.g., the importance of diversification, the tradeoff between risk and return, and
the efficient frontier) have survived well the test of time. Not only that, Markowitz’s
paper triggered a voluminous amount of research on this topic that was quantitative
in nature, marking a significant departure from the past.
However, notwithstanding the merits of Markowitz’s approach (also known as mean-
variance or MV portfolios), its implementation has been problematic. First, estimating
the coefficients of the correlation-of-returns matrix—the essential backbone of the MV
formulation—is a problem that still lacks a practical solution. For example, DeMiguel,
Garlappi, and Uppal (2009) concluded that in the case of a portfolio of 25 assets,
estimating the entries of the correlation matrix with an acceptable level of accuracy
would require more than 200 years of monthly data. A second drawback of Markowitz’s
formulation, more conceptual than operational, is that it relies on the standard de-
viation of returns to describe risk. However, the standard deviation, since it focuses
on dispersion, is not a good proxy for risk, for it really captures uncertainty—a subtle
but significant difference (Friedman, Isaac, James, and Sunder (2014)). Anyhow, the
fact of the matter is that during the second part of the previous century most research
efforts were aimed at devising practical strategies to implement the MV formulation.
Needless to say, success in these efforts has been mixed at best and these days most
practitioners have moved beyond the original MV formulation, which only remains
popular within some outdated academic circles. Kolm, Tütüncü, and Fabozzi (2014)
summarize well the challenges associated with the implementation of Markowitz’s ap-
proach. Pagnoncelli, Ramírez, Rahimian, and Cifuentes (2022) provide a brief overview
of the different techniques that have attempted to reconcile the implementation of the
MV formulation with reality.
John Bogle, who founded the Vanguard Group (an asset management company)
and is recognized as the father of index investing, is another pioneer whose main idea
was revolutionary at the time and remains influential today. In 1975 he intro-
duced a concept known as passive investment. He thought that a fund whose goal was
to beat the market would necessarily have high costs, and hence investors would be
better served by a low-cost fund that would simply mimic the market by replicating
a relevant index (Bogle, 2018; Thune, 2022). This innovation, highly controversial at
the time, has been validated by empirical evidence as study after study has demon-
strated that trying to beat the market (in the context of liquid and public markets)
is a fool’s errand (e.g., Elton, Gruber, and de Souza (2019); Fahling, Steurer, Sauer,
et al. (2019); Sharpe (1991); Walden (2015)). But Bogle’s idea had another important
ramification that made the portfolio selection problem more tractable: it shifted the
emphasis from asset selection to asset allocation. More to the point, before the exis-
tence of index funds, an investor who wanted exposure to, say, the U.S. stock market
—leaving aside the shortcomings of the MV formulation for a moment— faced an
insurmountable large optimization problem (at least 500 choices if one restricts the
feasible set to the stocks in the S&P 500 index). Today, the same investor can gain
exposure to a much more diversified portfolio— for example, a portfolio made up of
U.S. stocks, emerging markets stocks, high yield bonds, and commodities— simply by
choosing an index fund in each of these markets and concentrating instead on estimating
the proper asset allocation percentages. In short, a much smaller optimization prob-
lem (Amenc, Martellini, et al. (2001); Gutierrez, Pagnoncelli, Valladão, and Cifuentes
(2019); Ibbotson (2010)).
In any event, this switch from asset selection to asset allocation, plus a number
of innovations that emerged at the end of the last century and have gained wide
acceptance in recent years, have changed the portfolio selection landscape in important
ways. Among these innovations we identify the following:
(1) The Conditional-Value-at-Risk or CVaR has established itself as the risk metric
of choice. A key advantage is that it captures much better than the standard de-
viation the so-called tail risk (the danger of extreme events). A second advantage
is that, by focusing on losses rather than volatility of returns, it is better aligned
with the way investors express their risk preferences (Rockafellar and Uryasev
(2000), Rockafellar and Uryasev (2002)). A third advantage is that in the case
of the discretization and linearization of the portfolio optimization problem, as
we will see in the following section, the CVaR places no restrictions on the type
of probability distribution that can be used to model the returns.
(2) The benefit of relying on synthetic data to simulate realistic scenarios is cru-
cial for solving stochastic optimization problems such as the one described by
Markowitz. As mentioned by Fabozzi, Fabozzi, López de Prado, and Stoyanov
(2021), a financial modeler looking at past returns data, for example, only sees
the outcome from a single realized path (one returns time series) but remains at
a loss regarding the stochastic (data generating) process behind such time series.
Additionally, any effort aimed at generating realistic synthetic data must cap-
ture the actual marginal and joint distributions of the data, that is, all the other
possible returns time histories that could have occurred but were not observed.
Fortunately, recent advances in neural networks and machine learning—for ex-
ample, an algorithm known as Generative Adversarial Networks or GAN—have
proven effective to this end in a number of applications (Goodfellow et al., 2014).
Moreover, a number of authors have explored the use of GAN-based algorithms
in portfolio optimization problems, albeit, within the scope of a framework dif-
ferent than the one discussed in this paper (e.g., Lu and Yi (2022); Mariani et
al. (2019); Pun, Wang, and Wong (2020); Takahashi, Chen, and Tanaka-Ishii
(2019)). Lommers, Harzli, and Kim (2021) and Eckerli and Osterrieder (2021) pro-
vide a very good overview of the challenges and opportunities faced by machine
learning in general and GANs in particular when applied to financial research.
(3) There is a consensus among practitioners that the joint behavior of a group
of assets can fluctuate between discrete states, known as market regimes, that
represent different economic environments (Hamilton (1988, 1989); Schaller and
Norden (1997)). Considering this observation, realistic synthetic data generators
(SDGs) must be able to account for this phenomenon. In other words, they must
be able to generate data belonging to different market regimes, according to a
multi-mode random process.
(4) The incorporation of features (contextual information) to the formulation of
many optimization problems has introduced important advantages. For example,
Ban and Rudin (2019) showed that adding features to the classical newsven-
dor problem resulted in solutions with much better out-of-sample performance
compared to more traditional approaches. Other authors have also validated
the effectiveness of incorporating features to other optimization problems (e.g.,
Bertsimas and Kallus (2020); Chen, Owen, Pixton, and Simchi-Levi (2022); Hu,
Kallus, and Mao (2022); See and Sim (2010)).
With that as background, our aim is to propose a method to tackle the portfolio se-
lection problem based on an asset allocation approach. Specifically, we assume that our
investor has a medium- to long-term horizon and has access to a number of liquid and
public markets in which he/she will participate via an index fund. Thus, the problem
reduces to estimating the appropriate portfolio weights assuming that the rebalancing
is not done very frequently. "Frequently," of course, is a term subject to interpretation.
In this study we assume that the rebalancing is done once a year. Rebalancing daily,
weekly, or even monthly, would clearly defeat the purpose of passive investing while
creating excessive trading costs, that ultimately could affect performance.
Our approach is based on a Markowitz-inspired framework but with a CVaR-based
risk constraint instead. More importantly, we rely on synthetic returns data gener-
ated with a Modified Conditional GAN approach which we enhance with contextual
information (in our case, the U.S. Treasury yield curve). In a sense, our approach
follows the spirit of Pagnoncelli et al. (2022), but it differs in several important ways
and brings with it important advantages—including performance—a topic we discuss
in more detail later in this paper. In summary, our goals are twofold. First, we seek
to propose an effective synthetic data generation algorithm; and second, we seek to
combine such an algorithm with contextual information to propose an asset allocation
method that should, ideally, yield acceptable out-of-sample performance.
In the next section we formulate the problem at hand more precisely, then we
describe in detail the synthetic data generating process, and we follow with a numerical
example. The final section presents the conclusions.
2. Problem Formulation
Consider the case of an investor who has access to $n$ asset classes, each represented
by a suitable price index. We define the portfolio optimization problem as an asset
allocation problem in which the investor seeks to maximize the return by selecting the
appropriate exposure to each asset class while keeping the overall portfolio risk below
a predefined tolerance level.
The notion of risk in the context of financial investments has been widely discussed
in the literature, specifically, the advantages and disadvantages of using different risk
metrics. In our formulation, and in agreement with current best practices, we have
chosen the Conditional-Value-at-Risk (CVaR) as a suitable risk metric. Considering
that most investors focus on avoiding losses rather than volatility (especially medium- to
long-term investors) the CVaR represents a better choice than the standard deviation
of returns. Moreover, the CVaR (unlike the Value-at-Risk or VaR) has some attractive
features, namely, it is convex and coherent (i.e., it satisfies the sub-additive condition)
(Pflug, 2000) in the sense of Artzner, Delbaen, Eber, and Heath (1999).
Let $\mathbf{x} \in \mathbb{R}^n$ be the decision vector of weights that specify the asset class allocation
and $\mathbf{r} \in \mathbb{R}^n$ the return of each asset class in a given period¹. The underlying probability
distribution of $\mathbf{r}$ will be assumed to have a density, which we denote by $\pi(\mathbf{r})$. Thus,
the expected return of the portfolio can be expressed as the weighted average of the
expected return of each asset class, that is,

$$E(\mathbf{x}^T\mathbf{r}) = \sum_{i=1}^{n} x_i\, E[r_i]. \qquad (1)$$

Let $\alpha \in (0,1)$ be a set level of confidence and $\Lambda \in \mathbb{R}$ the risk tolerance of the
investor. We can state the optimal portfolio (asset) allocation problem for a long-only
investor as the following optimization problem:

$$\begin{aligned}
\underset{\mathbf{x} \in \mathbb{R}^n}{\text{maximize}} \quad & E(\mathbf{x}^T\mathbf{r}) \\
\text{s.t.} \quad & \mathrm{CVaR}_{\alpha}(-\mathbf{x}^T\mathbf{r}) \leq \Lambda \\
& \sum_{i=1}^{n} x_i = 1 \\
& \mathbf{x} \geq 0.
\end{aligned} \qquad (2)$$

¹We use boldface font for vectors to distinguish them from scalars.
Recall that the CVaR of a random variable $X$, with a predefined level of confidence
$\alpha$, can be expressed as the expected value of the realizations of $X$ that exceed the
corresponding VaR. More formally,

$$\mathrm{CVaR}_{\alpha}(X) = \frac{1}{1-\alpha} \int_{0}^{1-\alpha} \mathrm{VaR}_{\gamma}(X)\, d\gamma, \qquad (3)$$

where $\mathrm{VaR}_{\gamma}(X)$ denotes the Value-at-Risk of the distribution for a given confidence
$\gamma$, defined using the cumulative distribution function $F_X(x)$ as

$$\mathrm{VaR}_{\gamma}(X) = \inf\,\{x \in \mathbb{R} \mid F_X(x) > 1-\gamma\}. \qquad (4)$$
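To make these definitions concrete, the short sketch below (our own illustration, not code from the paper) estimates VaR and CVaR empirically from a sample of returns, using the common convention that both are computed on losses (negative returns) and that CVaR is the mean loss beyond the VaR threshold; the sample and confidence level are hypothetical.

```python
import numpy as np

def empirical_var_cvar(returns, alpha=0.95):
    """Estimate VaR and CVaR of losses (negative returns) at confidence alpha."""
    losses = -np.asarray(returns)            # convert returns to losses
    var = np.quantile(losses, alpha)         # VaR: alpha-quantile of the loss distribution
    cvar = losses[losses >= var].mean()      # CVaR: average loss beyond the VaR threshold
    return var, cvar

# Hypothetical example: 10,000 simulated annual portfolio returns
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.07, scale=0.15, size=10_000)
var95, cvar95 = empirical_var_cvar(sample, alpha=0.95)
print(f"VaR(95%) = {var95:.3f}, CVaR(95%) = {cvar95:.3f}")
```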
2.1. Discretization and linearization
In this section, we will explain how the asset allocation problem (2), a nonlinear
optimization problem due to the CVaR-based constraint, can be restated as a linear
programming problem and why this formulation is useful in practice.
The foundational concept was established by Rockafellar and Uryasev (2000), who
demonstrated that solving the optimization problem (2) is equivalent to solving the
following optimization problem:
$$\begin{aligned}
\underset{(\mathbf{x},\zeta) \in \mathbb{R}^n \times \mathbb{R}}{\text{minimize}} \quad & -E(\mathbf{x}^T\mathbf{r}) \\
\text{s.t.} \quad & \zeta + (1-\alpha)^{-1} \int_{\mathbb{R}^n} [-\mathbf{x}^T\mathbf{r} - \zeta]^{+}\, \pi(\mathbf{r})\, d\mathbf{r} \leq \Lambda \\
& \sum_{i=1}^{n} x_i = 1 \\
& \mathbf{x} \geq 0.
\end{aligned} \qquad (5)$$

Here, the dummy variable $\zeta \in \mathbb{R}$ is introduced to serve as a “threshold for losses”.
When $\pi$ has a discrete density $\pi_j$, with $j = 1, \ldots, m$ representing the probabilities of
occurrence associated with each of the return vectors or scenarios $\mathbf{r}_1, \ldots, \mathbf{r}_m$, the problem
(5) can be reformulated as:
$$\begin{aligned}
\underset{(\mathbf{x},\zeta) \in \mathbb{R}^n \times \mathbb{R}}{\text{minimize}} \quad & -E(\mathbf{x}^T\mathbf{r}) \\
\text{s.t.} \quad & \zeta + (1-\alpha)^{-1} \sum_{j=1}^{m} [-\mathbf{x}^T\mathbf{r}_j - \zeta]^{+}\, \pi_j \leq \Lambda \\
& \sum_{i=1}^{n} x_i = 1 \\
& \mathbf{x} \geq 0.
\end{aligned} \qquad (6)$$
Finally, introducing dummy variables $z_j$, for $j = 1, \ldots, m$, and explicitly writing
$E(\mathbf{x}^T\mathbf{r})$ as $\mathbf{x}^T R \boldsymbol{\pi}$, where $R \in \mathbb{R}^{n \times m}$ denotes the return variable's range based on
the density vector $\boldsymbol{\pi}$, the problem in (6) can be restated as the following linear pro-
gramming problem:

$$\begin{aligned}
\underset{(\mathbf{x},\mathbf{z},\zeta) \in \mathbb{R}^n \times \mathbb{R}^m \times \mathbb{R}}{\text{maximize}} \quad & \mathbf{x}^T R \boldsymbol{\pi} \\
\text{s.t.} \quad & \zeta + \frac{1}{1-\alpha}\, \boldsymbol{\pi}^T \mathbf{z} \leq \Lambda \\
& \mathbf{z} \geq -\mathbf{x}^T R - \zeta \\
& \sum_{i=1}^{n} x_i = 1 \\
& \mathbf{x}, \mathbf{z} \geq 0.
\end{aligned} \qquad (7)$$
A comprehensive explanation (including all the formal proofs) regarding the derivation
of the equivalent continuous formulation (5) and its subsequent discretization and lin-
earization, (6) and (7), can be found in Rockafellar and Uryasev (2000). An in-depth
analysis of these equivalent formulations–including practical examples, and a discus-
sion of the general problem setting incorporating transaction costs, value constraints,
liquidity constraints, and limits on position–are provided in Krokhmal, Uryasev, and
Palmquist (2002).
This discretized and linear formulation (7) has the advantage that it can be handled
with a number of widely available linear optimization solvers. Moreover, the discretiza-
tion allows us to use sampled data from the relevant probability distribution of $\mathbf{r}$ in
combination with the appropriate discrete probability density function $\boldsymbol{\pi}$.
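As an illustration of how (7) can be handed to an off-the-shelf solver, the sketch below sets up the discretized CVaR-constrained allocation with the cvxpy modeling library (our choice; the paper only states that widely available linear solvers can be used). The scenario matrix, probabilities, and risk limit are hypothetical placeholders, and the loss is written as the negative portfolio return following Rockafellar and Uryasev (2000).

```python
import cvxpy as cp
import numpy as np

def cvar_allocation(R, pi, alpha=0.95, risk_limit=0.15):
    """Solve the discretized CVaR-constrained allocation problem (7).

    R  : (m, n) matrix of return scenarios (scenarios are rows here, the transpose of the paper's R)
    pi : (m,) vector of scenario probabilities (sums to one)
    """
    m, n = R.shape
    x = cp.Variable(n)      # asset-class weights
    z = cp.Variable(m)      # auxiliary per-scenario excess losses
    zeta = cp.Variable()    # loss threshold (the VaR proxy)

    objective = cp.Maximize(pi @ (R @ x))    # expected portfolio return
    constraints = [
        zeta + (1.0 / (1.0 - alpha)) * (pi @ z) <= risk_limit,   # CVaR constraint
        z >= -R @ x - zeta,                                      # z_j >= loss_j - zeta
        z >= 0,
        cp.sum(x) == 1,
        x >= 0,
    ]
    cp.Problem(objective, constraints).solve()
    return x.value

# Hypothetical example with 500 scenarios and 10 asset classes
rng = np.random.default_rng(1)
R = rng.normal(0.08, 0.2, size=(500, 10))
pi = np.full(500, 1 / 500)
weights = cvar_allocation(R, pi, alpha=0.95, risk_limit=0.15)
print(np.round(weights, 3))
```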
In this study, as well as in practice, $R$ generally represents a sampled distribution
of returns, and the vector $\boldsymbol{\pi}$ determines the weights for each of the $m$ sampled re-
turn vectors (scenarios). For instance, in the simple case of random samples without
replacement from a set of historical returns, $\boldsymbol{\pi}$ is naturally defined as $\pi_j = 1/m$ for
$j \in \{1, \ldots, m\}$. Notice, however, that the weights $\boldsymbol{\pi}$ can be modified to adjust the
formulation for a case in which features (contextual information) are added to the
optimization problem. Assume, for example, that we are incorporating a sample of
$l$ features $F \in \mathbb{R}^{l \times m}$. In this case, we redefine $\boldsymbol{\pi}$ to reflect the different importance
attributed to samples based on the similarity between their corresponding features
and those of the current state of the world (economic environment).
More formally, if $\mathbf{f}_1$ and $\mathbf{f}_2$ represent normalized² vectors of economic features, we
define a distance $d(\cdot,\cdot)$ as

$$d(\mathbf{f}_1,\mathbf{f}_2) = \sqrt{(\mathbf{f}_1-\mathbf{f}_2)^T(\mathbf{f}_1-\mathbf{f}_2)}. \qquad (8)$$

Let $\mathbf{d}_{\mathbf{f}}^{-1}$ be the inverse distance vector of $F$ to a given vector $\mathbf{f}$, defined by
$[\mathbf{d}_{\mathbf{f}}^{-1}]_q = \frac{1}{d(\mathbf{f},\mathbf{f}_q)}$ for each $q \in [1, \ldots, m]$, $\mathbf{f}_q$ being the feature vector of sample $q$. We define the density vector $\boldsymbol{\pi}_{\mathbf{f}}$
as the normalization of the inverse distance vector to $\mathbf{f}$, that is,

$$\boldsymbol{\pi}_{\mathbf{f}} = \frac{\mathbf{d}_{\mathbf{f}}^{-1}}{\mathbf{1}^T \mathbf{d}_{\mathbf{f}}^{-1}}. \qquad (9)$$
For the purpose of this study we will sample $R$ and $F$ based on a suitable data gen-
erator (note that they are not independently sampled, but simultaneously sampled),
together with the corresponding weighting scheme (either $\boldsymbol{\pi}$ or the modified density
$\boldsymbol{\pi}_{\mathbf{f}}$), and then we will cast the optimization (asset allocation) problem according to
the discretized linear framework described in (7).
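A minimal sketch of the feature-weighting step in (8)-(9) follows, under the assumption that each column of $F$ holds one sample's (already Z-scored) feature vector; variable names are ours and the data are hypothetical.

```python
import numpy as np

def feature_weights(F, f_today, eps=1e-12):
    """Scenario weights pi_f from inverse Euclidean distances in feature space.

    F       : (l, m) matrix of normalized feature samples (one column per scenario)
    f_today : (l,) normalized feature vector describing the current environment
    """
    diffs = F - f_today[:, None]               # (l, m) differences to every sample
    dist = np.sqrt((diffs ** 2).sum(axis=0))   # Euclidean distance d(f_today, f_q)
    inv = 1.0 / (dist + eps)                   # inverse distances (eps avoids division by zero)
    return inv / inv.sum()                     # normalize so the weights sum to one

# Hypothetical example: 8 yield-curve features, 500 scenarios
rng = np.random.default_rng(2)
F = rng.standard_normal((8, 500))
pi_f = feature_weights(F, rng.standard_normal(8))
print(pi_f.sum(), pi_f.max())
```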
3. Synthetic Data Generation
In principle, generating random samples from a given probability density function is a
relatively straightforward task. In practice, however, there are two major limitations
that prevent finance researchers and practitioners from relying on such a simple exercise.
First, and as mentioned before, a financial analyst has only the benefit of knowing
a single path (one sample outcome) generated by an unknown stochastic process, that
is, a multidimensional historical returns time series produced by an unknown data
generating process (DGP) (Tu and Zhou (2004)).
The second limitation is the non-stationary nature of the stochastic processes un-
derlying all financial variables. More to the point, financial systems are dynamic and
complex, characterized by conditions and mechanisms that vary over time, due to both
endogenous effects and external factors (e.g., regulatory changes, geopolitical events).
Not surprisingly, the straight reliance on historical data to generate representative sce-
narios, or, alternatively, attempts based on conventional (fixed constant) parametric
models to generate such scenarios have been disappointing.
Therefore, given these considerations, our approach consists of using machine learn-
ing techniques to generate synthetic data based on recent historical data. More pre-
cisely, the idea is to generate (returns) samples based on a market-regime aware genera-
tive modeling method known as Conditional Tabular Generative Adversarial Networks
(CTGAN). CTGANs automatically learn and discover patterns in historical data, in
an unsupervised mode, to generate realistic synthetic data that mimic the unknown
DGP. Then, we use these generated (synthetic) data to feed the discretized optimiza-
tion problem described in (7).
In brief, our goal is to develop a process that, given a historical dataset $D^h$ consisting
of $R^h$ asset returns and $F^h$ features (both $m_h$ samples), could train a synthetic data
generator (SDG) to create realistic (that is, market-regime-aware) synthetic return
datasets on demand ($D^s$). Figure 1 summarizes this concept visually.

²We say normalized in the sense of transforming the variables into something comparable between them. For
simplicity, in our study we used a zero mean normalization (Z-score).
Figure 1.: The Synthetic Data Generation Schema. (A synthetic data generator (SDG) is trained on the historical dataset $D^h = [R^h\, F^h]$ and then used to generate a synthetic dataset $D^s = [R^s\, F^s]$.)
3.1. Conditional Tabular Generative Adversarial Networks (CTGAN)
Recent advances in machine learning and neural networks, specifically, the devel-
opment of Generative Adversarial Networks (GAN) that can mix continuous and
discrete (tabular) data to generate regime-aware samples, are particularly useful in
financial engineering applications. A good example is the method proposed by Xu,
Skoularidou, Cuesta-Infante, and Veeramachaneni (2019). These authors introduced a
neural network architecture for Conditional Tabular Generative Adversarial Networks
(CTGAN) to generate synthetic data. This approach presents several advantages,
namely, it can create a realistic synthetic data generating process that can capture,
in our case, the complex relationships between asset returns and features, while being
sensitive to the existence of different market regimes.
In general terms, the architecture of a CTGAN differs from that of a standard GAN
in several ways (a brief usage sketch follows this list):
• CTGAN models the dataset as a conditional process, where the continuous vari-
ables are defined by a conditional distribution dependent on the discrete vari-
ables, and each combination of discrete variables defines a state that determines
the univariate and multivariate distributions of the continuous variables.
• To avoid the problem of class imbalances in the training process, CTGAN intro-
duces the notions of a conditional generator and a training-by-sampling process.
The conditional generator decomposes the probability distribution of a sample
as the aggregation of the conditional distributions given all the possible discrete
values for a selected variable. Given this decomposition, a conditional generator
can be trained considering each specific discrete state, allowing the possibility of
a training-by-sampling process that can select states evenly for the conditional
generator and avoid poor representation of low-frequency states.
• CTGAN improves the normalization of the continuous columns by employing mode-
specific normalization. For each continuous variable, the model uses Variational
Gaussian Mixture models to identify the different modes of its univariate dis-
tribution and decompose each sample using the normalized value based on the
most likely mode and a one-hot vector defined by the mode used. This process
improves the suitability of the dataset for training, converting it into a bounded
vector representation that is easier for the network to process.
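For orientation, the snippet below shows how a plain CTGAN can be fitted and sampled with the open-source ctgan package that accompanies Xu et al. (2019). The toy dataframe, column names, and epoch count are placeholders of our own, and the exact API may differ slightly across package versions; this is not the authors' configuration.

```python
import numpy as np
import pandas as pd
from ctgan import CTGAN   # open-source implementation accompanying Xu et al. (2019)

# Hypothetical training table: two daily return columns, one yield feature, and a
# discrete "regime" column identifying the market state of each sample.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "spx_ret": rng.normal(0.0003, 0.010, 1250),
    "ndx_ret": rng.normal(0.0004, 0.013, 1250),
    "y10": rng.normal(0.02, 0.005, 1250),
    "regime": rng.choice(["calm", "stressed"], 1250),
})

model = CTGAN(epochs=10)                       # tiny epoch count just to keep the sketch fast
model.fit(data, discrete_columns=["regime"])   # learns the conditional, mode-aware distribution
synthetic = model.sample(500)                  # draw 500 synthetic scenarios
print(synthetic.head())
```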
3.2. A Modified CTGAN-plus-features method
To enhance the capacity of generating state-aware synthetic data (scenarios) based on
the CTGAN architecture, we use an unsupervised method to generate discrete market
regimes or states. Our approach is based on identifying clusters of samples exhibiting
similar characteristics in terms of asset returns and features, and we finally use the
cluster identifier as the state-defining variable employed by the CTGAN model.
A full discussion of how to generate a market regime (or state) aware identification
model goes beyond the scope of this study. Suffice to say that in this case we relied
on well-known methods from the machine learning literature for dimensionality
reduction, such as t-SNE (short for t-distributed Stochastic Neighbor Embedding),
and density-based clustering, such as HDBSCAN (short for Hierarchical Density-Based
Spatial Clustering of Applications with Noise) (Campello, Moulavi, & Sander, 2013).
Additionally, to reduce the noise generated by trivially-correlated assets (like the S&P
500 and the Nasdaq 100, for example), we first decompose the asset returns based on
their principal components using a PCA technique (where the number of dimensions
is equal to the number of asset classes).
In summary, the synthetic data generating process, which is described schematically
in Figure 2, consists of the following steps (a code sketch of the pipeline follows the list):
(1) Start with a historical dataset $D^h$ consisting of $R^h$ asset returns and $F^h$ features
(both $m_h$ samples from the same periods).
(2) The dataset is orthogonalized in all its principal components using PCA to avoid
forcing the model to estimate the dependency of highly correlated assets such as
equity indexes with major overlaps. The eigenvectors are stored to reverse the
projection on the synthetically generated dataset.
(3) Generate a discrete vector $C$ assigning a cluster identifier to each sample. The
process to generate the clusters consists of two steps:
(a) Reduce the dimensionality of the dataset $D^h$ to 2 using t-SNE.
(b) Apply HDBSCAN on the 2-dimensional projection of $D^h$.
(4) Train a CTGAN using as continuous variables the PCA-transformed dataset
($D^h_{pca}$) and the vector $C$ as an extra discrete column of the dataset.
(5) Generate $m_s$ synthetic samples using the trained CTGAN ($D^s_{pca}$).
(6) Reverse the projection from the PCA space to its original space in the synthetic
dataset $D^s_{pca}$ using the stored eigenvectors, obtaining a new synthetic dataset $D^s$
of $m_s$ samples.
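A compact sketch of steps (1)-(6) under stated assumptions: it uses scikit-learn's PCA and t-SNE, the hdbscan package, and the ctgan package; the function name, hyperparameters (e.g., min_cluster_size, epochs), and column labels are illustrative choices of ours rather than the authors' exact configuration.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import hdbscan
from ctgan import CTGAN

def generate_synthetic(D_h: pd.DataFrame, m_s: int = 500) -> pd.DataFrame:
    """Modified CTGAN-plus-features generator: PCA -> t-SNE/HDBSCAN clustering -> CTGAN."""
    # (2) Orthogonalize the (already normalized) dataset on all its principal components.
    pca = PCA(n_components=D_h.shape[1])
    D_pca = pca.fit_transform(D_h.values)

    # (3) Cluster a 2-D t-SNE projection with HDBSCAN to obtain a regime label per sample.
    embedding = TSNE(n_components=2).fit_transform(D_h.values)
    clusters = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(embedding)

    # (4) Train CTGAN on the PCA scores plus the cluster id as a discrete column.
    train = pd.DataFrame(D_pca, columns=[f"pc{i}" for i in range(D_pca.shape[1])])
    train["cluster"] = clusters.astype(str)
    model = CTGAN(epochs=1500)
    model.fit(train, discrete_columns=["cluster"])

    # (5) Sample synthetic PCA-space scenarios, then (6) reverse the projection.
    synth = model.sample(m_s)
    D_s = pca.inverse_transform(synth.drop(columns="cluster").values)
    return pd.DataFrame(D_s, columns=D_h.columns)
```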
Figure 2.: The Modified CTGAN-plus-Features Data Generating Process. (Start with a historical dataset $D^h = [R^h\, F^h] \in \mathbb{R}^{n+l,\, m_h}$; normalize and orthogonalize via PCA to obtain $D^h_{pca} := \mathrm{PCA}([R^h\, F^h])$; run the clustering process to obtain $C^h \in \mathbb{N}^{m_h}_k$; train a CTGAN with continuous $= D^h_{pca}$ and discrete $= C^h$; generate synthetic samples $D^s_{pca} \in \mathbb{R}^{n+l,\, m_s}$; reverse the PCA using the original eigenvectors to obtain $\mathrm{PCA}^{-1}(D^s_{pca}) = [R^s\, F^s] = D^s$.)
4. Example of Application
The following example will help to assess the merits of our approach vis-à-vis other
alternative asset allocation schemes. Consider the case of an investor who has access
to ten asset classes (a diverse assortment of stocks, bonds and commodities) based on
the indices described in Table 1. We further assume that the investor has a medium- to
long-term horizon and that he/she will be rebalancing his/her portfolio (recalculating
the asset allocation weights) once a year, which for simplicity we assume that is done
at the beginning of the calendar year (January). We consider the period January 2003-
June 2022, a time span for which we have gathered daily returns data corresponding
to all the indices listed in Table 1. Finally, we assume that the investor will rely on
a 5-year lookback period to, first, generate synthetic returns data (via the Modified
CTGAN approach outlined in the previous section), and then, would rely on the linear
optimization framework described in (7) to determine the asset allocation weights.
Table 1.: Indices Employed in the Asset Allocation Example
Asset Class Bloomberg Ticker Name
US Equities SPX S&P 500 Index
US Equities Tech NDX Nasdaq 100 Index
Global Equities MXWO Total Stock Market Index
EM Equities MXEF Emerging Markets Stock Index
High Yield IBOXHY High Yield Bonds Index
Investment Grade IBOXIG Liquid Investment Grade Index
EM Debt JPEIDIVR Emerging Markets Bond Index
Commodities BCOMTR Bloomberg Commodity Index
Long-term Treasuries I01303US Long-Term Treasury Index
Short-term Treasuries LT01TRUU Short-Term Treasury Index
4.1. Feature selection
As mentioned before, incorporating features into an optimization problem can greatly
improve the out-of-sample performance of the solutions. Financial markets offer a huge
number of options for contextual information. The list is long and includes macroe-
conomic indicators, such as GDP, consumer confidence indices, or retail sales volume.
Since our intention is to incorporate an indicator that could describe the state of the
economy at several specific times, we argue that the Treasury yield curve (or more
precisely, the interest rates corresponding to different maturities) is a suitable choice
for several reasons. First, the yield curve is very dynamic as it quickly reflects changes
in market conditions, as opposed to other indicators which are calculated on a monthly
or weekly basis and take more time to adjust. Second, its computation is “error-free”
in the sense that it is not subject to ambiguous interpretations or subjective definitions
such as the unemployment rate or construction spending. And third, it summarizes
the overall macroeconomic environment—not just one aspect of it—while offering some
implicit predictions regarding the direction the economy is moving. In fact, both the
empirical evidence and much of the academic literature, support the view that the
yield curve (also known as the term structure of interest rates) is a useful tool for esti-
mating the likelihood of a future recession, pricing financial assets, guiding monetary
policy, and forecasting economic growth. A discussion of the yield curve with reference
to its information content is beyond the scope of this paper. However, a number of
studies have covered this issue extensively (e.g., Bauer, Mertens, et al. (2018); Estrella
and Trubin (2006); Evgenidis, Papadamou, and Siriopoulos (2020); Kumar, Stauver-
mann, and Vu (2021)). For the purpose of this example we use the U.S. yield curve
tenors specified in Table 2. In other words, we use eight features, and each feature
corresponds to the interest rate associated with a different maturity.
Table 2.: Features (Index Returns) Used in the Asset Allocation Example
Bloomberg Ticker Maturity
FDTR 0 Months (Fed funds rate)
I02503M 3 Months
I02506M 6 Months
I02501Y 1 Year
I02502Y 2 Years
I02505Y 5 Years
I02510Y 10 Years
I02530Y 30 Years
4.2. Synthetic Data Generation Process (SDGP) Validation
Given the paramount importance played by the synthetic data generation process
(SDGP) in our approach, it makes sense, before solving any optimization problem,
to investigate whether the CTGAN model actually generates suitable scenarios (or
data samples). In other words, to explore if the quality of the SDGP is appropriate
to mimic the unknown stochastic process behind the historical data. Although the
inner structure of the actual stochastic process is unknown, one can always compare
the similarity between the input and output distributions. In short, we can check
whether their marginal and joint multivariate distributions are similar, and verify that the synthetic
samples are not an exact copy of the (original) training samples. To perform this
comparison, we trained the CTGAN using historical data from the 2017-2022 period
(5 years).
Figure 3.: Pair-Plot comparison of synthetic versus original data, annual returns.
Figures 3 and 4 both hint that the synthetic data actually display the same char-
acteristics of the original data. However, and notwithstanding the compelling visual
evidence, it is possible to make a more quantitative assessment to validate the SDGP.
To this end, we can perform two comparisons. First, we can compare for each vari-
able (e.g., U.S. equities returns) the corresponding marginal distribution based on the
original and synthetic data to see if they are indeed similar. And second, for each pair
of variables, we can compare the corresponding joint distributions.
Figure 4.: Pair-Plot comparison of synthetic versus original data features, annual
yields.
Table 3 reports the results of the Kolmogorov-Smirnov test (KS-test) (Massey,
1951), which seeks to determine whether both samples (original and synthetic) come
from the same distribution. The null hypothesis (i.e., that both samples come from the
same distribution) cannot be rejected. Notice that the table reports the complement
score, that is, a value of 1 refers to two identical distributions while 0 signals two
different distributions. The average value is 0.87, suggesting that in all cases the
original and synthetic distributions are very similar in nature.
Table 3.: Kolmogorov-Smirnov Test: Comparison Between Original and Synthetic Re-
turns and Interest Rates Distributions
Variable KS-test Score Variable KS-test Score
US Equities 91.89% Fed Funds Rate 89.21%
US Equities Tech 86.30% 3 Months Treasury 82.85%
Global Equities 94.52% 6 Months Treasury 82.58%
EM equities 92.66% 1 Year Treasury 84.44%
High Yield 93.53% 2 Years Treasury 86.41%
Investment Grade 85.87% 5 Years Treasury 84.61%
EM Debt 86.47% 10 Years Treasury 85.87%
Commodities 76.61% 30 Years Treasury 85.21%
Long-term Treasuries 88.11%
Short-term Treasuries 80.55%
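The complement score reported in Table 3 can be reproduced with scipy's two-sample Kolmogorov-Smirnov test, as in the sketch below (our own illustration; the sample arrays are hypothetical placeholders for one variable's original and synthetic observations).

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_complement(original: np.ndarray, synthetic: np.ndarray) -> float:
    """Return 1 - KS statistic: 1 means identical empirical distributions, 0 maximal discrepancy."""
    result = ks_2samp(original, synthetic)
    return 1.0 - result.statistic

# Hypothetical example for one variable (e.g., US Equities annual returns)
rng = np.random.default_rng(3)
orig = rng.normal(0.08, 0.18, size=1250)
synth = rng.normal(0.08, 0.19, size=500)
print(f"KS complement score: {ks_complement(orig, synth):.2%}")
```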
In order to verify that the synthetic samples preserve the relationship that existed
between the variables in the original data, we compared the joint distributions based on
the original and synthetic datasets. To this end, we compared the degree of similarity of
the correlation matrices determined by each sample. Specifically, for any two variables,
say, for example, US Equities and Commodities, we would expect the correlation
between them to be similar in both the original and synthetic datasets. Figure (5)
shows, for all possible paired comparisons, the value of a correlation similarity index.
This index is defined as 1 minus the absolute value of the difference between both
(original and synthetic data) correlations. A value of 1 indicates identical values; a
value of 0 indicates a maximum discrepancy. The values shown in Figure (5) (the
lowest is 0.83) evidence a high level of agreement.
Figure 5.: Correlation similarity comparison between the correlation matrices of the
original and the synthetic data.
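A sketch of the correlation similarity index of Figure 5 follows, under the assumption that both datasets are available as (samples x variables) arrays with matching columns; the arrays and names are ours.

```python
import numpy as np

def correlation_similarity(original: np.ndarray, synthetic: np.ndarray) -> np.ndarray:
    """Element-wise 1 - |corr_original - corr_synthetic| for two (samples x variables) arrays."""
    corr_orig = np.corrcoef(original, rowvar=False)
    corr_synth = np.corrcoef(synthetic, rowvar=False)
    return 1.0 - np.abs(corr_orig - corr_synth)

# Hypothetical check: the minimum entry flags the largest correlation discrepancy
rng = np.random.default_rng(4)
cov = [[1.0, 0.5, 0.2], [0.5, 1.0, 0.3], [0.2, 0.3, 1.0]]
orig = rng.multivariate_normal(np.zeros(3), cov, size=1000)
synth = orig + rng.normal(0, 0.1, size=orig.shape)
print(correlation_similarity(orig, synth).min())
```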
A more nuanced comparison between the characteristics of the original (historical)
dataset and the synthetic dataset can be accomplished by looking at the clusters. In
other words, the different market regimes identified during the data generation process.
This comparison can be carried out in two steps. First, we computed the correlation
between the distribution of data points across clusters in the original dataset and
their counterparts in the synthetic dataset. The number of synthetic samples drawn
from each cluster followed a distribution that closely mirrors the distribution of clus-
ters identified in the original dataset (44 clusters in total), having a correlation of
97.2%. This high degree of agreement can be attributed to the CTGAN’s training
process, wherein the probability distribution for the conditional variables is explic-
itly learned, facilitating an accurate replication of the original dataset’s structural
characteristics. And second, we refined the KS-test, partitioning both the original and
synthetic datasets based on their respective clusters. This allowed us to compare the
similarities between samples where the original and synthetic data originated from the
same cluster versus those from different clusters. The results of this exercise, displayed
in Figure 6, reveal that the synthetic data conditioned on the same cluster as the
original data typically yielded the highest KS-test scores compared to data generated
from other clusters. This finding provides further evidence of the effectiveness of the
cluster-based approach to produce synthetic data that replicates not only the broad
characteristics of the original dataset but also the key elements of all the different
market regimes.
Figure 6.: Pair-Plot comparison of synthetic versus original data average KS-Test
across all dimensions, divided by cluster. Values are scaled by the maximum KS-test
score of each row
In conclusion, based on the previous results we can state with confidence that the
CTGAN does create data samples congruent with the original dataset, effectively pre-
serving both marginal and joint distributions. Furthermore, our results highlight a
tangible improvement in the quality of data generation attributable to the incorpora-
tion of the clustering process. Having validated the SDGP, the next step is to assess
the merits of the optimization approach itself.
4.3. Testing strategy
In order to better assess the performance of our approach, i.e., (Modified) CTGAN
with features, which we denote as GwF, we compare it with four additional asset
allocation strategies, as indicated below. In short, we test five strategies, namely:
(i) CTGAN without features (Gw/oF)
(ii) CTGAN with features (GwF)
(iii) Historical data without features (Hw/oF)
(iv) Historical data with features (HwF)
(v) Equal Weights (EW)
The historical-data strategies, unlike the CTGAN-based strategies, are based on
direct sampling from historical data. We also utilize the Equal-Weight (EW) strat-
egy, known as the 1/N strategy, which assigns equal weights to all asset classes. This
approach is chosen precisely because it does not depend on any predetermined risk
constraint or measure, nor does it rely on historical data. Its effectiveness is not con-
tingent on the assumptions required by other strategies that use measures like CVaR
to bound risk. Despite its simplicity, this seemingly naive strategy has generally per-
formed surprisingly well, often outperforming many variations of Mean-Variance (MV)
strategies. A comprehensive evaluation of the EW strategy’s performance can be found
in the work of DeMiguel et al. (2009), which underscores its utility as a useful bench-
mark. Indeed, we contend that any strategy failing to outperform the EW strategy
likely has little to offer and is unlikely to be of practical relevance.
Figure 7.: Sequence of 5-year Overlapping Windows (timeline spanning year 0 through year 8).
The optimization model to decide the asset allocation weights is run once a year (in
January), based on 5-year lookback periods. In essence, the optimization is based on a
sequence of overlapping windows as shown in Figure (7). Hence, the first optimization
is based on data from the January 2003-December 2007 period. And the merits
of this asset-class selection (out-of-sample performance) are evaluated a year later, in
January 2009 (backtesting). Then, a second optimization is run based on the Jan-
uary 2004-December 2008 period data, and its performance is evaluated, this time, in
January 2010. This backtesting process is repeated until reaching the January 2017-
December 2021 period. Note that this last weight selection is tested over a shorter
time-window (January 2022-June 2022). Also, each optimization problem is solved for
several CVaR limits, ranging from 7.5% to 30%, to capture the preferences of investors
with different risk-tolerance levels. Additionally, given that the proposed procedure is
non-deterministic (mainly because of the synthetic nature of the returns generated
when using CTGAN) each optimization is run 5 times for each CVaR tolerance level
($\Lambda$). This allows us to test the stability of the results. Finally, note that in the cases
with no features the density vector $\boldsymbol{\pi} \in \mathbb{R}^m$ is defined as $\pi_j = \frac{1}{m}$ for $j \in \{1, ..., m\}$.
In summary, the testing strategy is really a sequence of fourteen backtesting exer-
cises starting in January 2009, and performed annually, until January 2022, plus one
final test done in July 2022 (based on a 6-month window, January 2022-June 2022).
This process is summarized in a schematic fashion in Figure 8.
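The testing loop can be written schematically as follows, reusing the illustrative helpers sketched earlier (generate_synthetic, feature_weights, cvar_allocation). The data access, date handling, and the omission of the Z-score normalization of the features are simplifications of ours; this is a rough outline of the GwF leg of the procedure summarized in Figure 8, not the authors' code.

```python
import pandas as pd

def backtest_gwf(returns: pd.DataFrame, features: pd.DataFrame,
                 rebalance_dates, lookback_years: int = 5, risk_limit: float = 0.15):
    """Yearly rebalanced GwF backtest: 5-year lookback, synthetic scenarios, feature weights."""
    weights_by_year = {}
    for t in rebalance_dates:                                   # e.g., the first trading day of each January
        start = t - pd.DateOffset(years=lookback_years)
        window = returns.loc[start:t].join(features.loc[start:t])

        D_s = generate_synthetic(window, m_s=500)               # synthetic scenarios (returns + features)
        R_s = D_s[returns.columns].values                       # (m, n) synthetic returns
        F_s = D_s[features.columns].values.T                    # (l, m) synthetic features
        pi_f = feature_weights(F_s, features.loc[t].values)     # weight scenarios by feature similarity

        weights_by_year[t] = cvar_allocation(R_s, pi_f, alpha=0.95, risk_limit=risk_limit)
    return weights_by_year
```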
4.4. Performance metrics
Comparing the performance of investment strategies over long time-horizons (an asset
allocation scheme is ultimately an investment strategy) is a multidimensional exercise
that should take into account several factors, namely, returns, risk, level of trading,
degree of portfolio diversification, etc.
To this end, we consider four metrics (figures of merits) to carry out our comparisons.
These comparisons are based on the performance (determined via backtesting) over
the January 2008-June 2022 period, in all, 14.5 years.
We consider the following metrics:
(1) Returns: Returns constitute the quintessential performance yardstick. Since we
are dealing with a medium- to long-term horizon investor, the cumulative return
over this 14.5-year period, expressed in annualized form, is the best metric to
assess returns.
(2) Risk: Since we have formulated the optimization problem based on a CVaR
constraint, it makes sense to check the CVaR ex post. A gross violation of the
CVaR limit should raise concerns regarding the benefits of the strategy.
(3) Transaction costs: Notwithstanding the fact that rebalancing is done once a
year, transaction costs, at least in theory, could be significant. Portfolio rotation
is a good proxy to assess the impact of transaction costs (which, if excessive,
could negatively affect returns). The level of portfolio rotation, on an annual
basis, can be expressed as
$$\text{rotation} = \frac{\sum_{t=2}^{14} \sum_{i=1}^{10} |w_{i,t} - w_{i,t-1}|}{14} \qquad (10)$$
where the $w$'s are the asset allocation weights. A static portfolio results in a value
equal to 0; increasing values of this metric are associated with increasing levels of
portfolio rotation.
(4) Diversification: Most investors aim at having a diversified portfolio. (Recall
that a frequent criticism of the conventional MV-approach is that it often yields
corner solutions based on portfolios heavily concentrated on a few assets.) To
measure the degree of diversification, we follow Pagnoncelli et al. (2022) and rely
on the complementary Herfindahl–Hirschman (HH) Index. A value of 0 for the
index reflects a portfolio concentrated on a single asset. On the other hand, a
value approaching 1 corresponds to a fully diversified portfolio (all assets share
the same weight). (A computational sketch of the rotation and HH metrics is
given after this list.)
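Both metrics are straightforward to compute from the yearly weight matrix; a sketch follows. The code is ours, the weight matrix is a placeholder, and the rescaling of the complementary HH index (so that an equal-weight portfolio scores exactly 1, matching the EW column of Figure 9) is our assumption about the normalization used.

```python
import numpy as np

def rotation(weights: np.ndarray) -> float:
    """Average yearly turnover: sum of |w_{i,t} - w_{i,t-1}| over assets and rebalances, divided by 14
    following the normalization in (10). weights: (T, n), one row per yearly rebalance."""
    return np.abs(np.diff(weights, axis=0)).sum() / 14

def hh_complement(weights_t: np.ndarray) -> float:
    """Complementary Herfindahl-Hirschman index of one allocation: 0 = single asset, 1 = equal weights."""
    n = weights_t.size
    hh = (weights_t ** 2).sum()          # plain HH index, between 1/n and 1
    return (1 - hh) / (1 - 1 / n)        # rescaled so equal weights give exactly 1 (our assumption)

# Hypothetical example with 15 yearly allocations across 10 asset classes
rng = np.random.default_rng(5)
W = rng.dirichlet(np.ones(10), size=15)
print(rotation(W), hh_complement(W[0]))
```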
Figure 8.: Overview of Backtesting Method. (Loop over all rebalance days $t \in T$: start with a set of returns and features, $R^v \in \mathbb{R}^{n,\, m_v}$ and $F^v \in \mathbb{R}^{l,\, m_v}$; obtain the present-day features $\mathbf{f}_t \in \mathbb{R}^l$; compute the Euclidean distance between the normalized present-day features and the sample features, $d(\mathbf{f}_1,\mathbf{f}_2) = \sqrt{(\mathbf{f}_1-\mathbf{f}_2)^T(\mathbf{f}_1-\mathbf{f}_2)}$, and the resulting weights $\boldsymbol{\pi}_{\mathbf{f}_t} = \mathbf{d}_{\mathbf{f}_t}^{-1} / (\mathbf{1}^T \mathbf{d}_{\mathbf{f}_t}^{-1})$; finally, run the optimization for different risk-tolerance levels with the appropriate weights for each scenario: $\mathbf{x}_{Gw/oF} = \text{Optimization}(R^s_t, \tfrac{1}{m})$, $\mathbf{x}_{GwF} = \text{Optimization}(R^s_t, \boldsymbol{\pi}_{\mathbf{f}_t})$, $\mathbf{x}_{Hw/oF} = \text{Optimization}(R^h_t, \tfrac{1}{m})$, $\mathbf{x}_{HwF} = \text{Optimization}(R^h_t, \boldsymbol{\pi}_{\mathbf{f}_t})$.)
4.5. Performance comparison
For comparison purposes, all numerical experiments were run on a MacBook Pro 14
with an M1 Pro chip and 16 GB of RAM. All the strategies were run without the use
of a dedicated GPU to be able to perform a fair comparison across strategies.
The strategies were backtested using a 5-year window of daily historical scenarios
as input. In the case of the CTGAN-based strategies (Gw/oF and GwF) all the 5-year
window historical scenarios were used as input for the Data Generating Process, then,
a sample of 500 synthetic scenarios was used to solve the optimization problem. In the
case of the historical-based strategies (Hw/oF and HwF) the inputs were a sub-sample
of 500 historical scenarios which were used to solve the optimization problem. In the
case of the EW strategy there is no such input or sub-sampling since the strategy does
not depend on any scenarios: the weights are always the same.
Regarding the historical-based strategies (Hw/oF and HwF), the running time was
on average 0.001 seconds per rebalance cycle. The running time for the CTGAN-based
strategies (Gw/oF and GwF) was on average 203.5 seconds per rebalance cycle. Since
all strategies were run using only the CPU and no GPU-accelerated hardware, the
CTGAN-based strategies were slower, owing to the greater number of operations
required to train a GAN-based architecture.
Figure 9 shows the values of all relevant metrics.
We start with the returns. First, the benefits of including features (contextual infor-
mation) in the optimization process are evident: both, the GwF and HwF approaches,
outperform by far their non-features counterparts. The difference in performance
is more manifest as the CVaR limit increases. Intuitively, this makes sense: stricter
risk limits tend to push the solutions towards cash-based instruments, which, in turn,
exhibit returns that are less dependent on the economic environment, and thus, the
benefits of the information-content embedded in the features is diminished. Note also
that all strategies (except for the EW) deliver, more or less, monotonically increas-
ing returns as the CVaR limit is relaxed. Additionally, it is worth mentioning that
a naive visual inspection might suggest that GwF only outperforms HwF by a fairly
small margin. Take the case of CVaR = 0.25, for example; the difference between
16.78% and 15.65% might appear as innocuous. Over a 14.5-year period, however, it
is significant. More clearly: an investor who initially contributed $100 to the GwF strategy
will end up with $948; the investor who adopted the HwF strategy will end
up with only $823. We should be careful not to jump to conclusions regarding the
merits of including features in asset allocation problems. However, our results strongly
suggest that the benefits of incorporating features to the optimization framework can
be substantial. Finally, the EW strategy clearly underperforms compared to all other
strategies.
We now turn to the CVaR (ex post). Again, the benefits of including features are
clear as they always decrease the risk compared to the non-features options. Also
noticeably, including features (see HwF and GwF) always yields solutions that never
violate the CVaR limit established ex ante. It might seem surprising that the CVaR-ex-
post value does not increase monotonically as the CVaR limit (actually Λ, based on the
notation used in (2)) increases, especially in the GwF and HwF cases. We attribute
this situation to the fact that the CVaR-restriction was probably not active when the
optimization reached a solution.
In terms of diversification (HH Index), all in all, all strategies display fairly similar
diversification levels. Two comments are in order. First, relaxing the risk limit (higher
CVaR) naturally results in lower diversification as the portfolios tend to move to
higher-yielding assets, which are, in general, riskier. And second, it might appear that
the overall diversification level is low (values of the HH Index below 0.20 in most
cases). That sentiment, however, would be misplaced: these are portfolios made up,
not of individual assets, but indices, and thus, they are inherently highly diversified.
Lastly, we examine trading expenses. It might be difficult from Figure 9(d), Rotation,
to gauge their impact on returns. To estimate rigorously the potential impact
of trading expenses on returns, in all cases, we proceed as follows. Table 4 shows, for
different asset classes (based on some commonly traded and liquid ETFs), representa-
tive bid-ask spreads. This information, in combination with the rotation levels shown
in Figure 9(d), can be used to estimate the trading expenses on a per annum basis (shown
in Table 5). Finally, Table 6 shows the returns after correcting for trading expenses.
A comparison between these returns and those shown in Figure 9(a) proves that trading
expenses have no significant impact on returns.
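The correction behind Tables 5 and 6 can be approximated as in the sketch below: per-asset turnover multiplied by a fraction of the quoted bid-ask spread. The half-spread charge is a common convention, but the authors' exact convention is not spelled out, so the helper, the turnover profile, and the resulting numbers are illustrative rather than a reproduction of the tables.

```python
import numpy as np

def annual_trading_cost_bp(turnover_by_asset: np.ndarray, spreads_bp: np.ndarray) -> float:
    """Approximate annual trading expense in basis points.

    turnover_by_asset : (n,) average yearly turnover per asset class (fraction of portfolio traded)
    spreads_bp        : (n,) average 30-day bid-ask spreads in basis points (as in Table 4)
    Assumes each trade costs half the quoted spread (a common, but not the only, convention).
    """
    return float(turnover_by_asset @ (0.5 * spreads_bp))

# Hypothetical example: spreads from Table 4, made-up per-asset turnover profile
spreads = np.array([0.36, 0.52, 0.54, 2.69, 1.35, 0.96, 5.66, 14.1, 1.03, 1.25])
turnover = np.full(10, 0.03)              # 3% of the portfolio rotated per asset per year
cost_bp = annual_trading_cost_bp(turnover, spreads)
net_return = 0.1678 - cost_bp / 10_000    # subtract the cost from a gross annualized return
print(f"{cost_bp:.2f} bp -> net return {net_return:.4%}")
```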
In summary, all things considered, features-based strategies outperform their ver-
sions with no features, and, more importantly, GwF clearly outperforms HwF, most
evidently in terms of returns, the variable investors care about the most. The EW strategy,
which had done surprisingly well against MV-based portfolios, emerges as the clear
loser, by far.
Figure 9.: Key Metrics for All Strategies
(a) Annualized Returns
CVaR Gw/oF GwF Hw/oF HwF EW
0.075 12.54% 13.50% 12.90% 12.74% 7.89%
0.1 11.96% 13.30% 12.73% 12.98% 7.89%
0.125 12.46% 14.94% 13.04% 13.67% 7.89%
0.15 13.84% 15.43% 13.20% 14.03% 7.89%
0.175 12.95% 15.18% 14.04% 14.08% 7.89%
0.2 13.02% 15.21% 13.57% 14.71% 7.89%
0.225 12.51% 16.22% 13.26% 15.20% 7.89%
0.25 13.19% 16.78% 13.31% 15.65% 7.89%
0.275 13.60% 17.36% 13.59% 16.45% 7.89%
0.3 13.87% 17.77% 14.90% 16.64% 7.89%
(b) CVaR Ex-post
CVaR Gw/oF GwF Hw/oF HwF EW
0.075 10.20% 4.56% 7.34% 6.58% 5.33%
0.10 10.75% 7.31% 8.76% 5.51% 5.33%
0.125 10.01% 4.68% 8.65% 4.21% 5.33%
0.15 8.62% 4.71% 8.59% 3.87% 5.33%
0.175 10.21% 5.90% 7.22% 3.42% 5.33%
0.20 10.29% 5.41% 8.60% 3.49% 5.33%
0.225 10.77% 5.11% 9.97% 3.62% 5.33%
0.25 10.15% 4.12% 9.98% 3.89% 5.33%
0.275 10.71% 6.18% 9.98% 4.24% 5.33%
0.3 10.41% 4.67% 7.61% 4.71% 5.33%
(c) HH Index
CVaR Gw/oF GwF Hw/oF HwF EW
0.075 0.18 0.25 0.17 0.19 1
0.10 0.15 0.23 0.18 0.18 1
0.125 0.17 0.20 0.19 0.18 1
0.15 0.18 0.18 0.19 0.16 1
0.175 0.17 0.16 0.20 0.16 1
0.20 0.18 0.15 0.19 0.17 1
0.225 0.17 0.18 0.18 0.18 1
0.25 0.17 0.12 0.18 0.18 1
0.275 0.16 0.10 0.17 0.18 1
0.3 0.16 0.12 0.17 0.17 1
(d) Rotation
CVaR Gw/oF GwF Hw/oF HwF EW
0.075 14.35 28.91 9.38 24.53 0
0.10 13.98 34.76 10.15 26.96 0
0.125 13.92 26.23 10.43 25.95 0
0.15 14.20 26.60 10.49 28.48 0
0.175 14.95 26.34 10.25 26.80 0
0.20 14.07 31.25 10.95 24.43 0
0.225 15.46 24.12 12.01 23.76 0
0.25 15.19 39.00 12.15 21.53 0
0.275 14.42 19.78 12.61 19.00 0
0.3 15.03 19.45 10.91 17.01 0
Table 4.: Trading Expenses by Asset Class
Asset Class Selected ETF Average 30 Day Bid-Ask Spread (Basis Points)
US equities SPY US 0.36
US equities tech QQQ US 0.52
Global equities VT US 0.54
EM equities EEM US 2.69
US high yield HYG US 1.35
US inv. grade LQD US 0.96
EM debt PCY US 5.66
Commodities COMT US 14.1
Long term treasuries TLT US 1.03
Short term treasuries BIL US 1.25
Table 5.: Annualized Transaction Expenses
(Basis Points)
CVaR Gw/oF GwF Hw/oF HwF EW
0.075 0.54 1.32 0.19 1.52 0
0.10 0.43 1.50 0.23 1.72 0
0.125 0.44 1.20 0.23 1.64 0
0.15 0.47 1.61 0.23 1.82 0
0.175 0.53 1.31 0.24 1.76 0
0.20 0.47 1.46 0.25 1.56 0
0.225 0.49 0.99 0.28 1.49 0
0.25 0.53 1.51 0.28 1.30 0
0.275 0.45 0.80 0.30 1.07 0
0.30 0.45 0.85 0.27 0.93 0
Table 6.: Annualized Returns
(Net of Transaction Expenses)
CVaR Gw/oF GwF Hw/oF HwF EW
0.075 12.53% 13.49% 12.90% 12.72% 7.89%
0.1 11.96% 13.28% 12.73% 12.96% 7.89%
0.125 12.46% 14.93% 13.04% 13.65% 7.89%
0.15 13.84% 15.41% 13.20% 14.01% 7.89%
0.175 12.94% 15.17% 14.04% 14.06% 7.89%
0.2 13.02% 15.20% 13.57% 14.69% 7.89%
0.225 12.51% 16.21% 13.26% 15.19% 7.89%
0.25 13.18% 16.76% 13.31% 15.64% 7.89%
0.275 13.60% 17.35% 13.59% 16.44% 7.89%
0.3 13.87% 17.76% 14.90% 16.63% 7.89%
4.6. Discussion of results and some considerations regarding potential
statistical biases
Broadly speaking, presenting a model that outperforms a benchmark is not an insur-
mountable task. In this case, we have presented a model (strategy or method) that
both generates realistic synthetic data and delivers satisfactory out-of-sample perfor-
mance. Given this situation, reasonable readers might ask themselves: How well would
the model proposed perform under circumstances different from those described in the
example selected by the authors? Did the authors fine-tune the value of some critical
parameters in order to present their results in the best possible light? Do the results
suffer from any form of selection bias? Overfitting and other statistical biases are com-
mon problems that affect many novel strategies and methods. Is there any indication
of overfitting in this case? The following considerations are aimed at mitigating these
concerns.
First, in reference to a potential model selection bias: the synthetic data gen-
eration approach we have presented is based on a Modified CTGAN model. We also
considered two other potential choices for synthetic data generation, and we discarded
them both. One was the NORTA (Normal to Anything) algorithm, a method based on
the Gaussian copula that can generate vectors given a certain interdependence struc-
ture. This method has been successfully used in some financial applications (Pagnon-
celli et al., 2022) and delivered good out-of-sample performance. Unfortunately, this
algorithm requires performing the Cholesky decomposition of the correlation matrix, a
computational exercise of order $O(n^3)$, which makes the process computationally very
expensive when one has many indices (ten in our case) combined with several features
(eight in our example). In short, computationally speaking, NORTA was no match
for CTGAN. A second alternative we considered, and decided not to explore, was the
CopulaGAN method, a variation of GAN in which a copula-based algorithm is used to
preprocess the data before applying the GAN model. This method is relatively new,
and there is a lack of both academic literature and practical experience to make a
strong case for CopulaGAN versus CTGAN. Hence, we also decided not to test it in
our study.
Second, in reference to overfitting and selection bias: like most neural networks,
CTGAN relies on a set of hyperparameters. To avoid overfitting, we excluded any
hyperparameter-tuning process. In fact, we maintained the number of layers, dimen-
sions, and architecture of the CTGAN model proposed in (Xu et al., 2019), which
also matched the default values of the model library. The only parameters that were
modified were the learning rate, reduced to $10^{-4}$ from $2 \times 10^{-4}$, and the number of
epochs (increased from 300 to 1500). These values proved to yield stable results across
all runs. It is important to mention that smoothing the learning rate and increasing
the number of epochs does not affect the optimal solution, but guarantees a closer
convergence at the expense of a higher (but still tolerable) computational cost. In the
case of the remaining components of our proposed Synthetic Data Generation Process,
namely t-SNE, PCA, and HDBSCAN, we also decided not to tune any parameters,
relying instead on the original implementations as they are.
Third, in reference to the lookback period (five years) and rebalancing period (one
year), we did not test other lookback periods. However, previous experience suggests
that the optimal length for lookback periods should be between three and five years
(Gutierrez et al. (2019)). A period less than three years does not offer enough vari-
ability to capture key elements of the DGP, while periods longer than five years bring
the risk of sampling from a ”different universe” as financial markets are subject to ex-
ogenous conditions (e.g., regulation) that change over time. In other words, sampling
returns from too far in the past could bring elements into the modeling process that
may not reflect current market dynamics. Additionally, we did not test rebalancing
periods different than one year. Rebalancing periods much shorter than one year prob-
ably do not make sense in the context of passive investment, which is the philosophy
behind the investment approach we are advocating. And from a practical point of
view, most investors would not entertain a rebalancing period less frequent than once
a year since in general people evaluate their investment priorities on a yearly basis.
In brief, we hope that these additional explanations will be helpful in evaluating the
relevance of our results and dispelling any major concerns related to potential biases.
5. Conclusions
Several conclusions emerge from this study. The most important is that the synthetic
data generating approach suggested (based on a Modified CTGAN method enhanced
with contextual information) seems very promising. First, it generates data (in this
case returns) that capture well the essential character of historical data. And second,
such data, when used in conjunction with the CVaR-based optimization framework
described in (7), yields portfolios with satisfactory out-of-sample performance.
Additionally, the example also emphasizes the benefits of incorporating contextual
information. Recall that both the GwF and HwF methods clearly outperformed their
non-features counterparts. Also, the fact that the GwF approach outperformed the
HwF approach highlights both the shortcomings of methods based only on historical
data and the relevance of including scenarios that, even though they have not occurred, are
“feasible” given the nature of the historical data. This element, we think, is critical
to achieve a good out-of-sample performance.
However, even though the example covered a challenging period for the financial markets (the subprime and COVID crises) and considered a broad set of assets (stocks, bonds, and commodities), the results should be interpreted with restraint: as an invitation to explore certain topics in more detail, rather than as grounds for absolute statements about the merits of the methods we have presented. Two topics that deserve further exploration are: (i) the benefits of using features other than the different tenors of the yield curve, or, perhaps, using the yield curve in combination with other data (e.g., market volatility, liquidity indices, currency movements); and (ii) applying the proposed synthetic data generation method to financial variables other than returns, for example, bond default rates or exchange rates. We leave these challenges for future research efforts.
Disclosure statement
The authors report there are no competing interests to declare.
Data availability statement
The code and data that support the findings of this study are openly available on GitHub at https://github.com/chuma9615/ctgan-portfolio-research. Historical data were obtained from Bloomberg.
References
Amenc, N., Martellini, L., et al. (2001). It’s time for asset allocation. Journal of Financial
Transformation,3, 77–88.
Artzner, P., Delbaen, F., Eber, J.-M., & Heath, D. (1999). Coherent measures of risk. Math-
ematical finance ,9(3), 203–228.
Ban, G.-Y., & Rudin, C. (2019). The big data newsvendor: Practical insights from machine
learning. Operations Research ,67 (1), 90–108.
Bauer, M. D., Mertens, T. M., et al. (2018). Information in the yield curve about future
recessions. FRBSF Economic Letter ,20 , 1–5.
Bertsimas, D., & Kallus, N. (2020). From predictive to prescriptive analytics. Management
Science,66 (3), 1025–1044.
Bogle, J. C. (2018). Stay the course: the story of Vanguard and the index revolution. John
Wiley & Sons.
Campello, R. J., Moulavi, D., & Sander, J. (2013). Density-based clustering based on hi-
erarchical density estimates. In Pacific-Asia conference on knowledge discovery and data
mining (pp. 160–172).
Chen, X., Owen, Z., Pixton, C., & Simchi-Levi, D. (2022). A statistical learning approach to
personalization in revenue management. Management Science,68 (3), 1923–1937.
DeMiguel, V., Garlappi, L., & Uppal, R. (2009). Optimal versus naive diversification: How
inefficient is the 1/n portfolio strategy? The review of Financial studies,22 (5), 1915–1953.
Eckerli, F., & Osterrieder, J. (2021). Generative adversarial networks in finance: an overview.
arXiv preprint arXiv:2106.06364 .
Elton, E. J., Gruber, M. J., & de Souza, A. (2019). Are passive funds really superior invest-
ments? an investor perspective. Financial Analysts Journal,75 (3), 7–19.
Estrella, A., & Trubin, M. (2006). The yield curve as a leading indicator: Some practical
issues. Current issues in Economics and Finance,12 (5).
Evgenidis, A., Papadamou, S., & Siriopoulos, C. (2020). The yield spread’s ability to forecast
economic activity: What have we learned after 30 years of studies? Journal of Business
Research,106 , 221–232.
Fabozzi, F. J., Fabozzi, F. A., López de Prado, M., & Stoyanov, S. V. (2021). Asset manage-
ment: Tools and issues. World Scientific.
Fahling, E. J., Steurer, E., Sauer, S., et al. (2019). Active vs. passive funds—an empirical
analysis of the german equity market. Journal of Financial Risk Management ,8(2), 73.
Friedman, D., Isaac, R. M., James, D., & Sunder, S. (2014). Risky curves: On the empirical
failure of expected utility. Routledge.
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S.,
. . . Bengio, Y. (2014). Generative adversarial networks. arXiv. Retrieved from
https://arxiv.org/abs/1406.2661
Gutierrez, T., Pagnoncelli, B., Valladão, D., & Cifuentes, A. (2019). Can
asset allocation limits determine portfolio risk–return profiles in DC pension
schemes? Insurance: Mathematics and Economics,86 , 134-144. Retrieved from
https://www.sciencedirect.com/science/article/pii/S0167668718301331
Hamilton, J. D. (1988). Rational-expectations econometric analysis of changes
in regime: An investigation of the term structure of interest rates. Jour-
nal of Economic Dynamics and Control ,12 (2), 385-423. Retrieved from
https://www.sciencedirect.com/science/article/pii/0165188988900474
Hamilton, J. D. (1989). A new approach to the economic analysis of nonstationary time
series and the business cycle. Econometrica,57 (2), 357–384. Retrieved 2022-11-15, from
http://www.jstor.org/stable/1912559
Hu, Y., Kallus, N., & Mao, X. (2022). Fast rates for contextual linear optimization. Manage-
ment Science.
Ibbotson, R. G. (2010). The importance of asset allocation. Financial Analysts Journal,
66 (2), 18-20. Retrieved from https://doi.org/10.2469/faj.v66.n2.4
Kolm, P. N., Tütüncü, R., & Fabozzi, F. J. (2014). 60 years of
portfolio optimization: Practical challenges and current trends. Euro-
pean Journal of Operational Research,234 (2), 356-371. Retrieved from
https://www.sciencedirect.com/science/article/pii/S0377221713008898 (60
years following Harry Markowitz’s contribution to portfolio theory and operations
research)
Krokhmal, P., Uryasev, S., & Palmquist, J. (2002). Portfolio optimization with conditional
value-at-risk objective and constraints (Vol. 4) (No. 2). Infopro Digital Risk (IP) Limited.
Kumar, R. R., Stauvermann, P. J., & Vu, H. T. T. (2021). The relationship between yield
curve and economic activity: An analysis of G7 countries. Journal of Risk and Financial
Management,14 (2), 62.
Lommers, K., Harzli, O. E., & Kim, J. (2021). Confronting machine learning with financial
research. The Journal of Financial Data Science ,3(3), 67–96.
Lu, J., & Yi, S. (2022). Autoencoding conditional GAN for portfolio allocation diversification.
arXiv preprint arXiv:2207.05701 .
Mariani, G., Zhu, Y., Li, J., Scheidegger, F., Istrate, R., Bekas, C., & Malossi, A. C. I. (2019).
Pagan: Portfolio analysis with generative adversarial networks. arXiv. Retrieved from
https://arxiv.org/abs/1909.10578
Markowitz, H. (1952). Portfolio selection. The Journal of Finance,7(1), 77–91. Retrieved
2022-10-20, from http://www.jstor.org/stable/2975974
Massey, F. J. (1951). The Kolmogorov-Smirnov test for goodness of fit. Journal
of the American Statistical Association ,46 (253), 68–78. Retrieved 2022-11-25, from
http://www.jstor.org/stable/2280095
Pagnoncelli, B. K., Ram´ırez, D., Rahimian, H., & Cifuentes, A. (2022). A synthetic data-plus-
features driven approach for portfolio optimization. Computational Economics. Retrieved
from https://doi.org/10.1007/s10614-022-10274-2
Pflug, G. C. (2000). Some remarks on the Value-at-Risk and the Conditional Value-at-Risk.
In Probabilistic constrained optimization (pp. 272–281). Springer.
Pun, C. S., Wang, L., & Wong, H. Y. (2020). Financial thought experiment: A GAN-based
approach to vast robust portfolio selection. In Proceedings of the 29th international joint
conference on artificial intelligence (ijcai’20).
Rockafellar, R. T., & Uryasev, S. (2000). Optimization of conditional value-at-risk. Journal
of Risk,2(3), 21–41.
Rockafellar, R. T., & Uryasev, S. (2002). Conditional Value-at-Risk for general loss distribu-
tions. Journal of banking & finance,26 (7), 1443–1471.
Schaller, H., & Norden, S. V. (1997). Regime switching in stock mar-
ket returns. Applied Financial Economics ,7(2), 177-191. Retrieved from
https://doi.org/10.1080/096031097333745
See, C.-T., & Sim, M. (2010). Robust approximation to multiperiod inventory management.
Operations research ,58 (3), 583–594.
Sharpe, W. F. (1991). The arithmetic of active management. Financial Analysts Journal,
47 (1), 7–9.
Takahashi, S., Chen, Y., & Tanaka-Ishii, K. (2019). Modeling financial time-series with
generative adversarial networks. Physica A: Statistical Mechanics and its Applications,
527 , 121261.
Thune, K. (2022, Jan). How and why John Bogle started Vanguard. Retrieved from
www.thebalancemoney.com/how-and-why-john-bogle-started-vanguard-2466413
Tu, J., & Zhou, G. (2004). Data-generating process uncertainty: What difference does it make
in portfolio decisions? Journal of Financial Economics,72 (2), 385-421. Retrieved from
https://www.sciencedirect.com/science/article/pii/S0304405X03002472
Walden, M. L. (2015). Active versus passive investment management of state pension plans:
implications for personal finance. Journal of Financial Counseling and Planning,26 (2),
160–171.
Xu, L., Skoularidou, M., Cuesta-Infante, A., & Veeramachaneni, K. (2019). Model-
ing tabular data using conditional GAN. CoRR,abs/1907.00503 . Retrieved from
http://arxiv.org/abs/1907.00503