Quantitative Finance, 2024
https://doi.org/10.1080/14697688.2024.2329194
A modified CTGAN-plus-features-based method
for optimal asset allocation
JOSÉ-MANUEL PEÑA, FERNANDO SUÁREZ, OMAR LARRÉ, DOMINGO RAMÍREZ†* and ARTURO CIFUENTES
†Fintual Administradora General de Fondos S.A., Santiago, Chile; Fintual, Inc.
‡Clapes UC, Pontificia Universidad Católica de Chile, Santiago, Chile
*Corresponding author. Email: research@fintual.com
(Received 15 February 2023; accepted 22 February 2024; published online 8 April 2024)
We propose a new approach to portfolio optimization that utilizes a unique combination of synthetic
data generation and a CVaR-constraint. We formulate the portfolio optimization problem as an asset
allocation problem in which each asset class is accessed through a passive (index) fund. The asset-
class weights are determined by solving an optimization problem which includes a CVaR-constraint.
The optimization is carried out by means of a Modified CTGAN algorithm which incorporates fea-
tures (contextual information) and is used to generate synthetic return scenarios, which, in turn, are
fed into the optimization engine. For contextual information, we rely on several points along the
U.S. Treasury yield curve. The merits of this approach are demonstrated with an example based
on 10 asset classes (covering stocks, bonds, and commodities) over a fourteen-and-a-half-year period
(January 2008–June 2022). We also show that the synthetic generation process is able to capture
well the key characteristics of the original data, and the optimization scheme results in portfolios
that exhibit satisfactory out-of-sample performance. We also show that this approach outperforms
the conventional equal-weights (1/N) asset allocation strategy and other optimization formulations
based on historical data only.
Keywords: Asset allocation; Portfolio optimization; Portfolio selection; Synthetic data; Synthetic
returns; Machine learning; Features; Contextual information; GAN; CTGAN; Neural networks
1. Motivation and previous work
The portfolio selection problem—how to spread a given bud-
get among several investment options—is probably one of the
oldest problems in applied finance. Until 1952, when Harry
Markowitz published his famous portfolio selection paper, the
issue was tackled with a mix of gut feeling, intuition, and
whatever could pass for common sense at the moment. A dis-
tinctive feature of these approaches was that they were, in
general, qualitative in nature.
Markowitz’s pioneering work showed that the portfolio
selection problem was in essence an optimization problem
that could be stated within the context of a well-defined math-
ematical framework (Markowitz 1952). The key ideas behind
this framework (e.g. the importance of diversification, the
tradeoff between risk and return, and the efficient frontier)
have survived well the test of time. Not only that, Markowitz’s
paper triggered a voluminous amount of research on this topic
that was quantitative in nature, marking a significant departure
from the past.
However, notwithstanding the merits of Markowitz’s
approach (also known as mean-variance or MV portfolios),
its implementation has been problematic. First, estimating the
coefficients of the correlation-of-returns matrix—the essential
backbone of the MV formulation—is a problem that still lacks
a practical solution. For example, DeMiguel et al. (2009) con-
cluded that in the case of a portfolio of 25 assets, estimating
the entries of the correlation matrix with an acceptable level
of accuracy would require more than 200 years of monthly
data. A second drawback of Markowitz’s formulation, more
conceptual than operational, is that it relies on the standard
deviation of returns to describe risk. However, the standard
deviation, since it focuses on dispersion, is not a good proxy
for risk for it really captures uncertainty—a subtle but signif-
icant difference (Friedman et al. 2014). Anyhow, the fact of
the matter is that during the second part of the previous cen-
tury, most research efforts were aimed at devising practical
strategies to implement the MV formulation. Needless to say,
success in these efforts has been mixed at best and these days
most practitioners have moved beyond the original MV for-
mulation, which only remains popular within some outdated
academic circles. Kolm et al. (2014) summarize well the chal-
lenges associated with the implementation of Markowitz’s
approach. Pagnoncelli et al. (2022) provide a brief overview
of the different techniques that have attempted to reconcile the
implementation of the MV formulation with reality.
John Bogle, who founded the Vanguard Group (an asset
management company) and is recognized as the father of
index investing, is another pioneer whose main idea was rev-
olutionary at the time and remains influential until today.
In 1975 he introduced a concept known as passive invest-
ment. He thought that a fund whose goal was to beat
the market would necessarily have high costs, and hence
investors would be better served by a low-cost fund that
would simply mimic the market by replicating a relevant
index (Bogle 2018, Thune 2022). This innovation, highly
controversial at the time, has been validated by empirical
evidence as study-after-study has demonstrated that trying
to beat the market (in the context of liquid and public mar-
kets) is a fool’s errand (e.g. Sharpe 1991, Walden 2015, Elton
et al. 2019, Fahling et al. 2019). But Bogle’s idea had
another important ramification that made the portfolio selec-
tion problem more tractable: it shifted the emphasis from asset
selection to asset allocation. More to the point, before the exis-
tence of index funds, an investor who wanted exposure to,
say, the U.S. stock market—leaving aside the shortcomings
of the MV formulation for a moment—faced an insurmount-
able large optimization problem (at least 500 choices if one
restricts the feasible set to the stocks in the S&P 500 index).
Today, the same investor can gain exposure to a much more
diversified portfolio—for example, a portfolio made up of
U.S. stocks, emerging markets stocks, high-yield bonds, and
commodities—simply by choosing an index fund in each
of these markets and concentrating instead on estimating the
proper asset allocation percentages. In short, a much smaller
optimization problem (Amenc and Martellini 2001, Ibbot-
son 2010, Gutierrez et al. 2019).
In any event, this switch from asset selection to asset allo-
cation, plus a number of innovations that emerged at the
end of the last century and have gained wide acceptance in
recent years, have changed the portfolio selection landscape
in important ways. Among these innovations, we identify the
following:
(1) The Conditional-Value-at-Risk or CVaR has established itself as the risk metric of choice. A key advantage is that it captures the so-called tail risk (the danger of extreme events) much better than the standard deviation. A second advantage is that, by focusing on losses rather than the volatility of returns, it is better aligned with the way investors express their risk preferences (Rockafellar and Uryasev 2000, 2002). A third advantage is that in the case of the discretization and linearization of the portfolio optimization problem, as we will see in the following section, the CVaR places no restrictions on the type of probability distribution that can be used to model the returns.
(2) The benefits of relying on synthetic data to simulate
realistic scenarios are crucial for solving stochastic
optimization problems such as the one described by
Markowitz. As mentioned by Fabozzi et al. (2021),
a financial modeler looking at past returns data,
for example, only sees the outcome from a single
realized path (one returns time series) but remains
at a loss regarding the stochastic (data generating)
process behind such time series. Additionally, any
effort aimed at generating realistic synthetic data
must capture the actual marginal and joint distri-
butions of the data, that is, all the other possible
returns time histories that could have occurred but
were not observed. Fortunately, recent advances in
neural networks and machine learning—for example,
an algorithm known as Generative Adversarial Net-
works or GAN—have proven effective to this end in
a number of applications (Goodfellow et al. 2014).
Moreover, a number of authors have explored the use
of GAN-based algorithms in portfolio optimization
problems, albeit, within the scope of a framework dif-
ferent than the one discussed in this paper (e.g. Lu and
Yi 2022, Pun et al. 2020, Takahashi et al. 2019, Mari-
ani et al. 2019). Lommers et al. (2021) and Eckerli and
Osterrieder (2021) provide a very good overview of the
challenges and opportunities faced by machine learn-
ing in general and GANs in particular when applied to
financial research.
(3) There is a consensus among practitioners that the
joint behavior of a group of assets can fluctuate
between discrete states, known as market regimes, that
represent different economic environments (Hamil-
ton 1988,1989, Schaller and Norden 1997). Consider-
ing this observation, realistic synthetic data generators
(SDGs) must be able to account for this phenomenon.
In other words, they must be able to generate data
belonging to different market regimes, according to a
multi-mode random process.
(4) The incorporation of features (contextual information)
into the formulation of many optimization problems
has introduced important advantages. For example,
Ban and Rudin (2019) showed that adding features to the classical newsvendor problem resulted in solutions with much better out-of-sample performance compared to more traditional approaches. Other
authors have also validated the effectiveness of incor-
porating features to other optimization problems (e.g.
Hu et al. 2022, Bertsimas and Kallus 2020, Chen
et al. 2022, See and Sim 2010).
With that as background, our aim is to propose a method to
tackle the portfolio selection problem based on an asset allo-
cation approach. Specifically, we assume that our investor has
a medium- to long-term horizon and has access to a number
of liquid and public markets in which he/she will participate
via an index fund. Thus, the problem reduces to estimating
the appropriate portfolio weights assuming that the rebal-
ancing is not done very frequently. Frequently, of course,
is a term subject to interpretation. In this study, we assume
that the rebalancing is done once a year. Rebalancing daily,
weekly, or even monthly, would clearly defeat the purpose of
passive investing while creating excessive trading costs, that
ultimately could affect performance.
Our approach is based on a Markowitz-inspired frame-
work but with a CVaR-based risk constraint instead. More
importantly, we rely on synthetic returns data generated with
a Modified Conditional GAN approach which we enhance
with contextual information (in our case, the U.S. Treasury
yield curve). In a sense, our approach follows the spirit of
Pagnoncelli et al. (2022), but it differs in several impor-
tant ways and brings with it important advantages—including
performance—a topic we discuss in more detail later in this
paper. In summary, our goals are twofold. First, we seek to
propose an effective synthetic data generation algorithm; and
second, we seek to combine such an algorithm with contextual
information to propose an asset allocation method that should
yield, ideally, acceptable out-of-sample performance.
In the next section, we formulate the problem at hand
more precisely, then we describe in detail the synthetic data-
generating process, and we follow with a numerical example.
The final section presents the conclusions.
2. Problem formulation
Consider the case of an investor who has access to n asset
classes, each represented by a suitable price index. We define
the portfolio optimization problem as an asset allocation prob-
lem in which the investor seeks to maximize the return by
selecting the appropriate exposure to each asset class while
keeping the overall portfolio risk below a predefined tolerance
level.
The notion of risk in the context of financial investments
has been widely discussed in the literature, specifically, the
advantages and disadvantages of using different risk metrics.
In our formulation, and in agreement with current best prac-
tices, we have chosen the Conditional-Value-at-Risk (CVaR)
as a suitable risk metric. Considering that most investors focus
on avoiding losses rather than volatility (especially medium-
to long-term investors) the CVaR represents a better choice
than the standard deviation of returns. Moreover, the CVaR
(unlike the Value-at-Risk or VaR) has some attractive fea-
tures, namely, it is convex and coherent (i.e. it satisfies the
sub-additive condition) (Pflug 2000) in the sense of Artzner
et al. (1999).
Let x ∈ R^n be the decision vector of weights that specify the asset class allocation and r ∈ R^n the return of each asset class in a given period. The underlying probability distribution of r will be assumed to have a density, which we denote by π(r). Thus, the expected return of the portfolio can be expressed as the weighted average of the expected return of each asset class, that is,

$$E(\mathbf{x}^T\mathbf{r}) = \sum_{i=1}^{n} x_i\, E[r_i]. \qquad (1)$$
Let α ∈ (0, 1) be a set level of confidence and ℓ ∈ R the risk tolerance of the investor. We can state the optimal portfolio (asset) allocation problem for a long-only investor as the following optimization problem (we use boldface font for vectors to distinguish them from scalars):

$$
\begin{aligned}
\underset{\mathbf{x}\in\mathbb{R}^n}{\text{maximize}} \quad & E(\mathbf{x}^T\mathbf{r}) \\
\text{s.t.} \quad & \mathrm{CVaR}_\alpha(\mathbf{x}^T\mathbf{r}) \le \ell, \\
& \textstyle\sum_{i=1}^{n} x_i = 1, \\
& \mathbf{x} \ge 0.
\end{aligned}
\qquad (2)
$$
Recall that the CVaR of a random variable X, with a predefined level of confidence α, can be expressed as the expected value of X in the cases in which X exceeds the corresponding VaR. More formally,

$$\mathrm{CVaR}_\alpha(X) = \frac{1}{1-\alpha}\int_0^{1-\alpha} \mathrm{VaR}_\gamma(X)\, d\gamma, \qquad (3)$$

where VaR_γ(X) denotes the Value-at-Risk of the distribution for a given confidence γ, defined using the cumulative distribution function F_X(x) as

$$\mathrm{VaR}_\gamma(X) = -\inf\{x \in \mathbb{R} \mid F_X(x) > 1-\gamma\}. \qquad (4)$$
2.1. Discretization and linearization
In this section, we will explain how the asset allocation prob-
lem (2), a nonlinear optimization problem due to the CVaR-
based constraint, can be restated as a linear programming
problem and why this formulation is useful in practice.
The foundational concept was established by Rockafellar
and Uryasev (2000), who demonstrated that solving the opti-
mization problem (2) is equivalent to solving the following
optimization problem:
$$
\begin{aligned}
\underset{(\mathbf{x},\zeta)\,\in\,\mathbb{R}^n\times\mathbb{R}}{\text{minimize}} \quad & -E(\mathbf{x}^T\mathbf{r}) \\
\text{s.t.} \quad & \zeta + (1-\alpha)^{-1}\int_{\mathbf{r}\in\mathbb{R}^n}\left[-\mathbf{x}^T\mathbf{r}-\zeta\right]^{+}\pi(\mathbf{r})\,d\mathbf{r} \le \ell, \\
& \textstyle\sum_{i=1}^{n} x_i = 1, \\
& \mathbf{x} \ge 0.
\end{aligned}
\qquad (5)
$$

Here, the dummy variable ζ ∈ R is introduced to serve as a 'threshold for losses'. When π has a discrete density π_j, with j = 1, ..., m representing the probabilities of occurrence associated with each of the return vectors or scenarios r_1, ..., r_m, the problem (5) can be reformulated as:
$$
\begin{aligned}
\underset{(\mathbf{x},\zeta)\,\in\,\mathbb{R}^n\times\mathbb{R}}{\text{minimize}} \quad & -E(\mathbf{x}^T\mathbf{r}) \\
\text{s.t.} \quad & \zeta + (1-\alpha)^{-1}\sum_{j=1}^{m}\left[-\mathbf{x}^T\mathbf{r}_j-\zeta\right]^{+}\pi_j \le \ell, \\
& \textstyle\sum_{i=1}^{n} x_i = 1, \\
& \mathbf{x} \ge 0.
\end{aligned}
\qquad (6)
$$
Finally, introducing dummy variables z_j, for j = 1, ..., m, and explicitly writing E(x^T r) as x^T R π, where R ∈ R^{n×m} denotes the matrix of sampled return scenarios (one column per scenario) associated with the density vector π, the problem in (6) can be restated as the following linear programming problem:

$$
\begin{aligned}
\underset{(\mathbf{x},\mathbf{z},\zeta)\,\in\,\mathbb{R}^n\times\mathbb{R}^m\times\mathbb{R}}{\text{maximize}} \quad & \mathbf{x}^T R\,\boldsymbol{\pi} \\
\text{s.t.} \quad & \zeta + \frac{1}{1-\alpha}\,\boldsymbol{\pi}^T\mathbf{z} \le \ell, \\
& \mathbf{z} \ge -R^T\mathbf{x} - \zeta\mathbf{1}, \\
& \textstyle\sum_{i=1}^{n} x_i = 1, \\
& \mathbf{x},\, \mathbf{z} \ge 0.
\end{aligned}
\qquad (7)
$$
A comprehensive explanation (including all the formal
proofs) regarding the derivation of the equivalent contin-
uous formulation (5) and its subsequent discretization and
linearization, (6) and (7), can be found in Rockafellar and
Uryasev (2000). An in-depth analysis of these equivalent
formulations–including practical examples, and a discussion
of the general problem-setting incorporating transaction costs,
value constraints, liquidity constraints, and limits on position–
are provided in Krokhmal et al. (2001).
This discretized and linear formulation (7) has the advan-
tage that it can be handled with a number of widely avail-
able linear optimization solvers. Moreover, the discretization
allows us to use sampled data from the relevant probability
distribution of r in combination with the appropriate discrete
probability density function π.
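To make the discretized formulation concrete, the following is a minimal sketch of problem (7) written with Python and the cvxpy modeling library. The choice of library, the variable names, and the default values of α and ℓ are ours, not taken from the authors' repository; any LP solver would do.

```python
import numpy as np
import cvxpy as cp

def cvar_allocation(R, pi, alpha=0.95, ell=0.15):
    """Solve the discretized CVaR-constrained allocation problem (7).

    R     : (n, m) array, column j holds the return scenario r_j
    pi    : (m,) array of scenario probabilities (sums to 1)
    alpha : CVaR confidence level
    ell   : CVaR tolerance (risk limit)
    """
    n, m = R.shape
    x = cp.Variable(n, nonneg=True)      # asset-class weights
    z = cp.Variable(m, nonneg=True)      # auxiliary loss variables
    zeta = cp.Variable()                 # loss threshold

    scenario_returns = R.T @ x           # x^T r_j for every scenario j
    constraints = [
        zeta + (1.0 / (1.0 - alpha)) * (pi @ z) <= ell,
        z >= -scenario_returns - zeta,   # z_j >= -x^T r_j - zeta
        cp.sum(x) == 1,
    ]
    objective = cp.Maximize(pi @ scenario_returns)   # expected return x^T R pi
    cp.Problem(objective, constraints).solve()
    return x.value, zeta.value
```

In the no-features case the weights are simply π_j = 1/m; with features, π is replaced by the distance-based density π_f described next.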
In this study, as well as in practice, R generally represents a sampled distribution of returns, and the vector π determines the weights for each of the m sampled return vectors (scenarios). For instance, in the simple case of random samples without replacement from a set of historical returns, π is naturally defined as π_j = 1/m for j ∈ {1, ..., m}. Notice, however, that the weights π can be modified to adjust the formulation for a case in which features (contextual information) are added to the optimization problem. Assume, for example, that we are incorporating a sample of l features F ∈ R^{l×m}. In this case, we redefine π to reflect the different importance attributed to samples based on the similarity between their corresponding features and those of the current state of the world (economic environment).
More formally, if f_1 and f_2 represent normalized vectors of economic features, we define a distance d(·,·) as

$$d(\mathbf{f}_1, \mathbf{f}_2) = (\mathbf{f}_1 - \mathbf{f}_2)^T(\mathbf{f}_1 - \mathbf{f}_2). \qquad (8)$$

Let d_f^{-1} be the inverse distance vector of F to a given vector f, defined by [d_f^{-1}]_q = 1/d(f, f_q) for each q ∈ {1, ..., m}, f_q being a row of F. We define the density vector π_f as the normalization of the inverse distance vector to f, that is,

$$\boldsymbol{\pi}_{\mathbf{f}} = \frac{\mathbf{d}_{\mathbf{f}}^{-1}}{\mathbf{1}^T\mathbf{d}_{\mathbf{f}}^{-1}}. \qquad (9)$$

We say normalized in the sense of transforming the variables into something comparable between them. For simplicity, in our study we used a zero-mean normalization (Z-score).
For the purpose of this study, we will sample R and F based on a suitable data generator (note that they are not independently sampled, but simultaneously sampled), together with the corresponding weighting scheme (either π or the modified density π_f), and then we will cast the optimization (asset allocation) problem according to the discretized linear framework described in (7).
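As an illustration of equations (8) and (9), the feature-based weights can be computed as follows (a sketch; the small eps guard against exactly zero distances is our addition):

```python
import numpy as np

def feature_weights(F, f_now, eps=1e-12):
    """Scenario weights pi_f from equations (8)-(9).

    F     : (m, l) array of normalized (z-scored) feature vectors, one row per scenario
    f_now : (l,) normalized feature vector describing the current economic environment
    Returns a probability vector of length m that up-weights scenarios whose
    features are close to f_now.
    """
    diff = F - f_now                           # (m, l)
    d = np.einsum('ij,ij->i', diff, diff)      # (f_now - f_q)^T (f_now - f_q) per scenario
    inv_d = 1.0 / (d + eps)                    # inverse distances
    return inv_d / inv_d.sum()                 # normalize into a density vector pi_f
```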
3. Synthetic data generation
In principle, generating random samples from a given proba-
bility density function is a relatively straightforward task. In
practice, however, there are two major limitations that prevent
finance researchers and practitioners from relying on such
simple exercise.
First, and as mentioned before, a financial analyst has only
the benefit of knowing a single path (one sample outcome)
generated by an unknown stochastic process, that is, a mul-
tidimensional historical returns time series produced by an
unknown data generating process (DGP) (Tu and Zhou 2004).
The second limitation is the non-stationary nature of the
stochastic processes underlying all financial variables. More
to the point, financial systems are dynamic and complex, char-
acterized by conditions and mechanisms that vary over time,
due to both endogenous effects and external factors (e.g. reg-
ulatory changes, geopolitical events). Not surprisingly, the
straight reliance on historical data to generate representative
scenarios, or, alternatively, attempts based on conventional
(fixed constant) parametric models to generate such scenarios
have been disappointing.
Therefore, given these considerations, our approach con-
sists of using machine learning techniques to generate syn-
thetic data based on recent historical data. More precisely,
the idea is to generate (returns) samples based on a market-
regime aware generative modeling method known as Condi-
tional Tabular Generative Adversarial Networks (CTGAN).
CTGANs automatically learn and discover patterns in histor-
ical data, in an unsupervised mode, to generate realistic syn-
thetic data that mimic the unknown DGP. Then, we use these
generated (synthetic) data to feed the discretized optimization
problem described in (7).
In brief, our goal is to develop a process that, given a historical dataset D_h consisting of R_h asset returns and F_h features (both with m_h samples), could train a synthetic data generator (SDG) to create realistic (that is, market-regime-aware) synthetic return datasets D_s on demand. Figure 1 summarizes this concept visually.
3.1. Conditional tabular generative adversarial networks
(CTGAN)
Recent advances in machine learning and neural networks,
specifically, the development of Generative Adversarial Net-
works (GAN) that can mix continuous and discrete (tabular)
data to generate regime-aware samples, are particularly use-
ful in financial engineering applications. A good example
is the method proposed by Xu et al. (2019). These authors
introduced a neural network architecture for Conditional Tabular Generative Adversarial Networks (CTGAN) to generate synthetic data. This approach presents several advantages, namely, it can create a realistic synthetic data generating process that can capture, in our case, the complex relationships between asset returns and features, while being sensitive to the existence of different market regimes.

Figure 1. The synthetic data generation schema.
In general terms, the architecture of a CTGAN differs from
that of a standard GAN in several ways:
• CTGAN models the dataset as a conditional process, where the continuous variables are defined by a conditional distribution dependent on the discrete variables, and each combination of discrete variables defines a state that determines the uni- and multivariate distributions of the continuous variables.

• To avoid the problem of class imbalances in the training process, CTGAN introduces the notions of a conditional generator and a training-by-sampling process. The conditional generator decomposes the probability distribution of a sample as the aggregation of the conditional distributions given all the possible discrete values for a selected variable. Given this decomposition, a conditional generator can be trained considering each specific discrete state, allowing the possibility of a training-by-sampling process that can select states evenly for the conditional generator and avoid poor representation of low-frequency states.

• CTGAN improves the normalization of the continuous columns by employing mode-specific normalization. For each continuous variable, the model uses Variational Gaussian Mixture models to identify the different modes of its univariate distribution and decomposes each sample into a normalized value based on the most likely mode and a one-hot vector identifying that mode. This process improves the suitability of the dataset for training, converting it into a bounded vector representation that is easier for the network to process (a small sketch of this normalization is given after this list).
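To illustrate the last point, the sketch below approximates mode-specific normalization for a single continuous column using scikit-learn's variational Gaussian mixture. The reference implementation lives inside the ctgan package, so treat this only as an illustration of the idea; the cap of ten modes and the 4-standard-deviation scaling follow our reading of Xu et al. (2019).

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def mode_specific_normalize(column, max_modes=10):
    """Simplified mode-specific normalization for one continuous column.

    Returns, for every value, the normalized scalar within its most likely
    mode plus a one-hot vector indicating that mode.
    """
    x = np.asarray(column, dtype=float).reshape(-1, 1)
    vgm = BayesianGaussianMixture(
        n_components=max_modes,
        weight_concentration_prior=1e-3,   # prunes modes that are not needed
        max_iter=200,
    ).fit(x)
    modes = vgm.predict(x)                                     # most likely mode per value
    means = vgm.means_.ravel()
    stds = np.sqrt(vgm.covariances_).ravel()
    scalars = (x.ravel() - means[modes]) / (4 * stds[modes])   # bounded value within the mode
    one_hot = np.eye(max_modes)[modes]                         # mode indicator vector
    return scalars, one_hot
```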
3.2. A modified CTGAN-plus-features method
To enhance the capacity of generating state-aware synthetic
data (scenarios) based on the CTGAN architecture, we use an
unsupervised method to generate discrete market regimes or
states. Our approach is based on identifying clusters of sam-
ples exhibiting similar characteristics in terms of asset returns
and features, and we finally use the cluster identifier as the
state-defining variable employed by the CTGAN model.
A full discussion of how to generate a market regime
(or state) aware identification model goes beyond the scope
of this study. Suffice to say that in this case, we relied
on well-known methods from the machine learning litera-
ture for dimensionality reduction such as t-SNE (short for t-distributed Stochastic Neighbor Embedding) and density-based clustering such as HDBSCAN (short for Hierarchical Density-Based Spatial Clustering of Applications with Noise) (Campello et al. 2013). Additionally, to reduce the noise gen-
erated by trivially-correlated assets (like the S&P 500 and
the Nasdaq 100, for example), we first decompose the asset
returns based on their principal components using a PCA tech-
nique (where the number of dimensions is equal to the number
of asset classes).
In summary, the synthetic data-generating process, which is
described schematically in figure 2, consists of the following
steps:
(1) Start with a historical dataset D_h consisting of R_h asset returns and F_h features (both m_h samples from the same periods).
(2) The dataset is orthogonalized in all its principal components using PCA to avoid forcing the model to estimate the dependency of highly correlated assets such as equity indexes with major overlaps. The eigenvectors are stored to reverse the projection on the synthetically generated dataset.
(3) Generate a discrete vector C assigning a cluster identifier to each sample. The process to generate the clusters consists of two steps:
    (a) Reduce the dimensionality of the dataset D_h to 2 using t-SNE.
    (b) Apply HDBSCAN on the 2-dimensional projection of D_h.
(4) Train a CTGAN using as continuous variables the PCA-transformed dataset (D_h^pca) and the vector C as an extra discrete column of the dataset.
(5) Generate m_s synthetic samples using the trained CTGAN (D_s^pca).
(6) Reverse the projection from the PCA space to its original space in the synthetic dataset D_s^pca using the stored eigenvectors, obtaining a new synthetic dataset D_s of m_s samples. (A code sketch of these steps follows.)
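A compact sketch of steps (1)-(6), assuming the open-source ctgan, scikit-learn, and hdbscan packages. Column names, the minimum cluster size, the epoch count, and the choice to orthogonalize the whole table (rather than the return columns only) are illustrative assumptions, and the exact APIs may differ across package versions.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import hdbscan                   # pip install hdbscan
from ctgan import CTGAN          # pip install ctgan

def generate_synthetic_scenarios(D_h: pd.DataFrame, n_samples=500, epochs=1500):
    """Sketch of the modified CTGAN-plus-features generator.
    D_h holds one row per historical sample (asset returns plus features)."""
    # (2) Project onto all principal components; keep the transform to undo it later.
    pca = PCA(n_components=D_h.shape[1])
    D_pca = pca.fit_transform(D_h.values)

    # (3) Cluster identifiers: t-SNE down to 2 dimensions, then HDBSCAN.
    embedding = TSNE(n_components=2).fit_transform(D_h.values)
    clusters = hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(embedding)

    # (4) Train the CTGAN on the PCA-transformed columns plus the discrete cluster column.
    train = pd.DataFrame(D_pca, columns=[f"pc_{i}" for i in range(D_pca.shape[1])])
    train["cluster"] = clusters.astype(str)
    model = CTGAN(epochs=epochs)
    model.fit(train, discrete_columns=["cluster"])

    # (5) Sample synthetic rows and (6) reverse the PCA projection.
    synth = model.sample(n_samples)
    D_s = pca.inverse_transform(synth.drop(columns="cluster").values)
    return pd.DataFrame(D_s, columns=D_h.columns)
```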
4. Example of application
The following example will help to assess the merits of our
approach vis-à-vis other alternative asset allocation schemes.
Consider the case of an investor who has access to ten asset
classes (a diverse assortment of stocks, bonds and commodi-
ties) based on the indices described in table 1. We further
assume that the investor has a medium- to long-term horizon
and that he/she will be rebalancing his/her portfolio (recal-
culating the asset allocation weights) once a year, which for
simplicity we assume is done at the beginning of the calendar year (January). We consider the period January 2003–June 2022, a time span for which we have gathered daily returns data corresponding to all the indices listed in table 1.

Figure 2. The modified CTGAN-plus-features data generating process.

Table 1. Indices employed in the asset allocation example.

Asset class              Bloomberg ticker   Name
US Equities              SPX                S&P 500 Index
US Equities Tech         NDX                Nasdaq 100 Index
Global Equities          MXWO               Total Stock Market Index
EM Equities              MXEF               Emerging Markets Stock Index
High Yield               IBOXHY             High Yield Bonds Index
Investment Grade         IBOXIG             Liquid Investment Grade Index
EM Debt                  JPEIDIVR           Emerging Markets Bond Index
Commodities              BCOMTR             Bloomberg Commodity Index
Long-term Treasuries     I01303US           Long-Term Treasury Index
Short-term Treasuries    LT01TRUU           Short-Term Treasury Index
Finally, we assume that the investor will rely on a 5-year lookback period to, first, generate synthetic returns data (via the Modified CTGAN approach outlined in the previous section), and then rely on the linear optimization framework described in (7) to determine the asset allocation weights.
4.1. Feature selection
As mentioned before, incorporating features into an optimiza-
tion problem can greatly improve the out-of-sample perfor-
mance of the solutions. Financial markets offer a huge number
of options for contextual information. The list is long and
includes macroeconomic indicators, such as GDP, consumer
confidence indices, or retail sales volume. Since our inten-
tion is to incorporate an indicator that could describe the state
of the economy at several specific times, we argue that the
Treasury yield curve (or more precisely, the interest rates cor-
responding to different maturities) is a suitable choice for
several reasons. First, the yield curve is very dynamic as it
quickly reflects changes in market conditions, as opposed to
other indicators which are calculated on a monthly or weekly
basis and take more time to adjust. Second, its computation
is ‘error-free’ in the sense that is not subject to ambiguous
interpretations or subjective definitions such as the unemploy-
ment rate or construction spending. And third, it summarizes
the overall macroeconomic environment—not just one aspect
of it—while offering some implicit predictions regarding the
direction the economy is moving. In fact, both the empiri-
cal evidence and much of the academic literature, support the
view that the yield curve (also known as the term structure of
interest rates) is a useful tool for estimating the likelihood of
a future recession, pricing financial assets, guiding monetary
policy, and forecasting economic growth. A discussion of the
yield curve with reference to its information content is beyond
the scope of this paper. However, a number of studies have
covered this issue extensively (e.g. Kumar et al. 2021, Bauer
and Mertens 2018, Evgenidis et al. 2020, Estrella and Tru-
bin 2006). For the purpose of this example we use the U.S.
yield curve tenors specified in table 2. In other words, we use
eight features, and each feature corresponds to the interest rate
associated with a different maturity.
4.2. Synthetic data generation process (SDGP) validation
Given the paramount importance played by the synthetic
data generation process (SDGP) in our approach, it makes
sense, before solving any optimization problem, to investi-
gate whether the CTGAN model actually generates suitable
scenarios (or data samples). In other words, to explore if the
quality of the SDGP is appropriate to mimic the unknown
stochastic process behind the historical data. Although the
inner structure of the actual stochastic process is unknown, one can always compare the similarity between the input and output distributions. In short, we can compare whether their single and joint multivariate distributions are similar, and check that the synthetic samples are not an exact copy of the (original) training samples. To perform this comparison, we trained the CTGAN using historical data from the 2017–2022 period (5 years).

Table 2. Features (Index Returns) used in the asset allocation example.

Bloomberg Ticker   Maturity
FDTR               0 Months (Fed funds rate)
I02503M            3 Months
I02506M            6 Months
I02501Y            1 Year
I02502Y            2 Years
I02505Y            5 Years
I02510Y            10 Years
I02530Y            30 Years

Table 3. Kolmogorov–Smirnov test: comparison between original and synthetic returns and interest rates distributions.

Variable                  KS-test Score      Variable               KS-test Score
US Equities               91.89%             Fed Funds Rate         89.21%
US Equities Tech          86.30%             3 Months Treasury      82.85%
Global Equities           94.52%             6 Months Treasury      82.58%
EM Equities               92.66%             1 Year Treasury        84.44%
High Yield                93.53%             2 Years Treasury       86.41%
Investment Grade          85.87%             5 Years Treasury       84.61%
EM Debt                   86.47%             10 Years Treasury      85.87%
Commodities               76.61%             30 Years Treasury      85.21%
Long-term Treasuries      88.11%
Short-term Treasuries     80.55%
Figures 3 and 4 both hint that the synthetic data actually
display the same characteristics of the original data. How-
ever, and notwithstanding the compelling visual evidence, it
is possible to make a more quantitative assessment to vali-
date the SDGP. To this end, we can perform two comparisons.
First, we can compare for each variable (e.g. U.S. equities
returns) the corresponding marginal distribution based on the
original and synthetic data to see if they are indeed similar.
And second, for each pair of variables, we can compare the
corresponding joint distributions.
Table 3 reports the results of the Kolmogorov–Smirnov test (KS-test) (Massey 1951), which seeks to determine whether both samples (original and synthetic) come from the same distribution. The null hypothesis (i.e. that both samples come from the same distribution) cannot be rejected. Notice that
the table reports the complement score, that is, a value of 1
refers to two identical distributions while 0 signals two differ-
ent distributions. The average value is 0.87, suggesting that
in all cases, both the original and synthetic distributions, are
very similar in nature.
In order to verify that the synthetic samples preserve the
relationship that existed between the variables in the orig-
inal data, we compared the joint distributions based on the
original and synthetic datasets. To this end, we compared the
degree of similarity of the correlation matrices determined
by each sample. Specifically, for any two variables, say, for
example, US Equities and Commodities, we would expect the
correlation between them to be similar in both, the original
and synthetic datasets. Figure 5 shows, for all possible paired comparisons, the value of a correlation similarity index. Such an index is defined as 1 minus the absolute value of the differ-
ence between both (original and synthetic data) correlations.
A value of 1 indicates identical values; a value of 0 indicates
a maximum discrepancy. The values shown in figure 5(the
lowest is 0.83) evidence a high level of agreement.
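Both checks are straightforward to reproduce. A minimal sketch follows; our reading of the 'complement score' as 1 minus the KS statistic is an assumption.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_similarity(original_col, synthetic_col):
    """Complement of the two-sample Kolmogorov-Smirnov statistic:
    1.0 means the two marginal distributions are indistinguishable."""
    stat, _ = ks_2samp(original_col, synthetic_col)
    return 1.0 - stat

def correlation_similarity(original, synthetic):
    """Pairwise similarity between the correlation matrices of the original
    and synthetic datasets: 1 minus the absolute difference of correlations."""
    c_orig = np.corrcoef(original, rowvar=False)
    c_syn = np.corrcoef(synthetic, rowvar=False)
    return 1.0 - np.abs(c_orig - c_syn)
```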
A more nuanced comparison between the characteristics of
the original (historical) dataset and the synthetic dataset can
be accomplished by looking at the clusters, in other words, at the different market regimes that were identified during the data
generation process. This comparison can be carried out in two
steps. First, we computed the correlation between the distribu-
tion of data points across clusters in the original dataset and
their counterparts in the synthetic dataset. The number of syn-
thetic samples drawn from each cluster followed a distribution
that closely mirrors the distribution of clusters identified in
the original dataset (44 clusters in total), having a correlation
of 97.2%. This high degree of agreement can be attributed to
the CTGAN’s training process, wherein the probability dis-
tribution for the conditional variables are explicitly learned,
facilitating an accurate replication of the original dataset’s
structural characteristics. And second, we refined the KS-test,
partitioning both the original and synthetic datasets based on
their respective clusters. This allowed us to compare the simi-
larities between samples where the original and synthetic data
originated from the same cluster versus those from different
clusters. The results of this exercise, displayed in figure 6,
reveal that the synthetic data conditioned on the same clus-
ter as the original data typically yielded the highest KS-test
scores compared to data generated from other clusters. This
finding provides further evidence of the effectiveness of the
cluster-based approach to produce synthetic data that repli-
cates not only the broad characteristics of the original dataset
but also all the key elements of all the different market
regimes.
In conclusion, based on the previous results we can state
with confidence that the CTGAN does create data samples
congruent with the original dataset, effectively preserving
both marginal and joint distributions. Furthermore, our results
highlight a tangible improvement in the quality of data gen-
eration attributable to the incorporation of the clustering pro-
cess. Having validated the SDGP, the next step is to assess the
merits of the optimization approach itself.
4.3. Testing strategy
In order to better assess the performance of our approach, i.e.
(Modified) CTGAN with features, which we denote as GwF,
we compare it with four additional asset allocation strategies,
as indicated below. In short, we test five strategies, namely:
(i) CTGAN without features (Gw/oF)
(ii) CTGAN with features (GwF)
(iii) Historical data without features (Hw/oF)
(iv) Historical data with features (HwF)
(v) Equal Weights (EW)

Figure 3. Pair-plot comparison of synthetic versus original data, annual returns.
The historical-data strategies, unlike the CTGAN-based
strategies, are based on direct sampling from historical data.
We also utilize the Equal-Weight (EW) strategy, known as
the 1/N strategy, which assigns equal weights to all asset
classes. This approach is chosen precisely because it does not
depend on any predetermined risk constraint or measure, nor
does it rely on historical data. Its effectiveness is not con-
tingent on the assumptions required by other strategies that
use measures like CVaR to bound risk. Despite its simplic-
ity, this seemingly naive strategy has generally performed
surprisingly well, often outperforming many variations of
Mean-Variance (MV) strategies. A comprehensive evaluation
of the EW strategy’s performance can be found in the work
of DeMiguel et al. (2009), which underscores its utility as a
useful benchmark. Indeed, we contend that any strategy fail-
ing to outperform the EW strategy likely has little to offer and
is unlikely to be of practical relevance.
The optimization model to decide the asset allocation
weights is run once a year (in January), based on 5-year
lookback periods. In essence, the optimization is based on
a sequence of overlapping windows as shown in figure 7.
Hence, the first optimization is based on data from the January
2003-December 2007 period. And the merits of this asset-
class selection (out-of-sample performance) are evaluated a
year later, in January 2009 (backtesting). Then, a second opti-
mization is run based on the January 2004–December 2008
period data, and its performance is evaluated, this time, in Jan-
uary 2010. This backtesting process is repeated until reaching
the January 2017–December 2021 period. Note that this last
weight selection is tested over a shorter time-window (January
2022–June 2022). Also, each optimization problem is solved
for several CVaR limits, ranging from 7.5% to 30%, to cap-
ture the preferences of investors with different risk-tolerance
levels. Additionally, given that the proposed procedure is non-
deterministic (mainly because of the synthetic nature of the
returns generated when using CTGAN) each optimization is
run 5 times for each CVaR tolerance level (). This allows
us to test the stability of the results. Finally, note that in the
cases with no features, the density vector πRmis defined
as πj=1
m.forj∈{1, ...,m}.
In summary, the testing strategy is really a sequence of
fourteen backtesting exercises starting in January 2009, and
performed annually, until January 2022, plus, one final test
done in July 2022 (based on a 6-month window, January 2022–June 2022). This process is summarized in a schematic fashion in figure 8.

Figure 4. Pair-plot comparison of synthetic versus original data features, annual yields.
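For clarity, the rolling-window schedule and the grid of experiments just described can be written down as follows (a sketch; the dates and the 0.025 step of the CVaR grid are read off the text and tables 5 and 6):

```python
import datetime as dt

LOOKBACK_YEARS = 5
cvar_limits = [round(0.075 + 0.025 * k, 3) for k in range(10)]   # 7.5%, 10%, ..., 30%
runs_per_setting = 5                                             # repeated runs (CTGAN is stochastic)

schedule = []
for year in range(2008, 2023):                            # rebalance every January, 2008-2022
    train_start = dt.date(year - LOOKBACK_YEARS, 1, 1)    # e.g. Jan 2003 for the 2008 rebalance
    train_end = dt.date(year - 1, 12, 31)                 # e.g. Dec 2007
    rebalance_date = dt.date(year, 1, 1)
    # performance is evaluated one year later (June 2022 caps the last, shorter window)
    eval_end = min(dt.date(year + 1, 1, 1), dt.date(2022, 6, 30))
    schedule.append((train_start, train_end, rebalance_date, eval_end))
```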
4.4. Performance metrics
Comparing the performance of investment strategies over
long time horizons (an asset allocation scheme is ultimately
an investment strategy) is a multidimensional exercise that
should take into account several factors, namely, returns, risk,
level of trading, degree of portfolio diversification, etc.
To this end, we consider four metrics (figures of merits)
to carry out our comparisons. These comparisons are based
on the performance (determined via backtesting) over the
January 2008-June 2022 period, in all, 14.5 years.
We consider the following metrics:
(1) Returns: Returns constitute the quintessential perfor-
mance yardstick. Since we are dealing with a medium-
to long-term horizon investor, the cumulative return
over this 14.5-year period, expressed in annualized
form, is the best metric to assess returns.
(2) Risk: Since we have formulated the optimization
problem based on a CVaR constraint, it makes sense to
check the CVaR ex post. A gross violation of the CVaR
limit should raise concerns regarding the benefits of the
strategy.
(3) Transaction costs: Notwithstanding the fact that
rebalancing is done once a year, transaction costs, at
least in theory, could be significant. Portfolio rotation
is a good proxy to assess the impact of transaction costs
(which, if excessive, could negatively affect returns).
The level of portfolio rotation, on an annual basis, can be expressed as

$$\text{rotation} = \frac{\sum_{t=2}^{14}\sum_{i=1}^{10} |w_{i,t} - w_{i,t-1}|}{14}, \qquad (10)$$

where the w_{i,t} are the asset allocation weights. A static portfolio results in a value equal to 0; increasing values of this metric are associated with increasing levels of portfolio rotation (a code sketch of this and the following metric is given after this list).
(4) Diversification: Most investors aim at having a diver-
sified portfolio. (Recall that a frequent criticism to the
conventional MV-approach is that it often yields cor-
ner solutions based on portfolios heavily concentrated
on a few assets.) To measure the degree of diversifica-
tion, we follow Pagnoncelli et al. (2022), and rely on the complementary Herfindahl–Hirschman (HH) Index. A value of 0 for the index reflects a portfolio concentrated on a single asset. On the other hand, a value approaching 1 corresponds to a fully diversified portfolio (all assets share the same weight).

Figure 5. Correlation similarity comparison between the correlation matrices of the original and the synthetic data.
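A sketch of the rotation and diversification computations; the complementary HH index is computed here as 1 minus the sum of squared weights, one common definition that is consistent with the description above.

```python
import numpy as np

def annual_rotation(W):
    """Equation (10): average yearly turnover.
    W is a (T, n) array of asset-allocation weights, one row per rebalance date."""
    return np.abs(np.diff(W, axis=0)).sum() / (W.shape[0] - 1)

def complementary_hh_index(w):
    """Complementary Herfindahl-Hirschman index of a weight vector w:
    0 for a single-asset portfolio, approaching 1 for an equal-weight portfolio."""
    w = np.asarray(w, dtype=float)
    return 1.0 - np.sum(w ** 2)
```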
4.5. Performance comparison
For comparison purposes, all numerical experiments were run
on a MacBook Pro 14 with an M1 Pro chip and 16 GB of
RAM. All the strategies were run without the use of a ded-
icated GPU to be able to perform a fair comparison across
strategies.
The strategies were backtested using a 5-year window of
daily historical scenarios as input. In the case of the CTGAN-
based strategies (Gw/oF and GwF), all the 5-year-window historical scenarios were used as input for the Data Generating Process; then, a sample of 500 synthetic scenarios was used to solve the optimization problem. In the case of the historical-based strategies (Hw/oF and HwF) the inputs were a sub-sample of 500 historical scenarios which were used to solve the optimization problem. In the case of the EW strategy there is no such input or sub-sampling since the strategy is not dependent on any scenarios: the weights are always the same.
Regarding the historical-based strategies (Hw/oF and HwF)
the running time was on average 0.001 seconds per rebal-
ance cycle. The running time for the CTGAN-based strategies
(Gw/oF and GwF) was on average 203.5 seconds per rebal-
ance cycle. Given that all strategies were run using only CPU
and not GPU-accelerated hardware the CTGAN-based strate-
gies were slower to run given the greater number of operations
used to train a GAN-based architecture.
Figure 9 shows the values of all relevant metrics.
We start with the returns. First, the benefits of including
features (contextual information) in the optimization process
are evident: both the GwF and HwF approaches outperform by far their non-feature counterparts. The difference
in performance is more manifest as the CVaR limit increases.
Intuitively, this makes sense: stricter risk limits tend to push
the solutions towards cash-based instruments, which, in turn,
exhibit returns that are less dependent on the economic envi-
ronment, and thus, the benefits of the information-content
embedded in the features are diminished. Note also that all
strategies (except for the EW) deliver, more or less, mono-
tonically increasing returns as the CVaR limit is relaxed.
Additionally, it is worth mentioning that a naive visual inspec-
tion might suggest that GwF only outperforms HwF by a fairly
small margin. Take the case of CVaR = 0.25, for example;
the difference between 16.78% and 15.65% might appear as
innocuous. Over a 14.5-year period, however, it is significant. More clearly: an investor who contributed $100 to the GwF strategy initially will end up with $948; the investor who adopted the HwF strategy will end up with only $823.

Figure 6. Pair-plot comparison of synthetic versus original data average KS-test across all dimensions, divided by cluster. Values are scaled by the maximum KS-test score of each row.

Figure 7. Sequence of 5-year overlapping windows.

Figure 8. Overview of backtesting method.
We should be careful not to jump to conclusions regarding
the merits of including features in asset allocation problems.
However, our results strongly suggest that the benefits of
incorporating features to the optimization framework can be
substantial. Finally, the EW strategy clearly underperforms
compared to all other strategies.
We now turn to the CVaR (ex post). Again, the benefits
of including features are clear as they always decrease the
risk compared to the non-feature options. Also noticeably,
including features (see HwF and GwF) always yields solu-
tions that never violate the CVaR limit established ex ante.
It might seem surprising that the CVaR-ex post value does not increase monotonically as the CVaR limit (ℓ, in the notation used in (2)) increases, especially in the GwF and HwF cases. We attribute this situation to the fact that the CVaR-restriction was probably not active when the optimization reached a solution.
In terms of diversification (HH Index), all in all, all
strategies display fairly similar diversification levels. Two
comments are in order. First, relaxing the risk limit (higher
CVaR) naturally results in lower diversification as the port-
folios tend to move to higher-yielding assets, which are, in
general, riskier. And second, it might appear that the over-
all diversification level is low (values of the HH Index below
0.20 in most cases). That sentiment, however, would be mis-
placed: these are portfolios made up, not of individual assets,
but indices, and thus, they are inherently highly diversified.
Lastly, we examine trading expenses. It might be difficult from figure 9(d), Rotation, to gauge its impact on returns. To estimate rigorously the potential impact of trading expenses on returns, we proceed as follows in all cases. Table 4 shows, for different asset classes (based on some commonly traded and liquid ETFs), representative bid-ask spreads. This information, in combination with the rotation levels shown in figure 9(d), can be used to estimate the trading expenses on a per annum basis (shown in table 5). Finally, table 6 shows the returns after correcting for trading expenses.
A comparison between these returns and those shown in figure
9(a) proves that trading expenses have no significant impact
on returns.
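The exact cost model is not spelled out in the text; one plausible reading, charging half the quoted spread on every traded weight change, is sketched below (the spread values are those of table 4, listed in the same asset-class order).

```python
import numpy as np

# Representative 30-day bid-ask spreads from table 4, in basis points.
spread_bp = np.array([0.36, 0.52, 0.54, 2.69, 1.35, 0.96, 5.66, 14.1, 1.03, 1.25])

def annual_trading_cost_bp(weight_changes):
    """weight_changes: (T-1, n) array of |w_{i,t} - w_{i,t-1}| per rebalance.
    Returns an average yearly cost estimate in basis points (assumed cost model:
    each traded weight change pays half the quoted spread)."""
    per_rebalance = weight_changes @ (spread_bp / 2.0)
    return per_rebalance.mean()
```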
In summary, all things considered, features-based strategies outperform their versions with no features, and, more important, GwF clearly outperforms HwF, most evidently in terms of returns, the variable investors care about the most. The EW strategy, which had done surprisingly well against MV-based portfolios, emerges as the clear loser, by far.
4.6. Discussion of results and some considerations
regarding potential statistical biases
Broadly speaking, presenting a model that outperforms a
benchmark is not an insurmountable task. In this case, we
have presented a model (strategy or method) that both gen-
erates realistic synthetic data and delivers satisfactory out-of-
sample performance. Given this situation, reasonable readers
might ask themselves: How well would the model proposed
perform under circumstances different from those described
in the example selected by the authors? Did the authors fine-
tune the value of some critical parameters in order to present
their results in the best possible light? Do the results suffer
from any form of selection bias? Overfitting and other sta-
tistical biases are common problems that affect many novel
strategies and methods. Is there any indication of overfit-
ting in this case? The following considerations are aimed at
mitigating these concerns.
First, in reference to a potential model selection bias: the synthetic data generation approach we have presented is based on a Modified CTGAN model. We also considered two other potential choices for synthetic data generation, and we discarded them both. One was the NORTA (Normal to Anything) algorithm, a method based on the Gaussian copula that can generate vectors given a certain interdependence structure. This method has been successfully used in some financial applications (Pagnoncelli et al. 2022) and delivered good out-of-sample performance. Unfortunately, this algorithm requires performing the Cholesky decomposition of the correlation matrix, a computational exercise of order
O(n^3), which makes the process computationally very expensive when one has many indices (ten in our case) combined with several features (eight in our example). In short, computationally speaking, NORTA was no match for CTGAN. A second alternative we considered, and decided not to explore, was the CopulaGAN method, a variation of GAN in which a copula-based algorithm is used to preprocess the data before applying the GAN model. This method is relatively new, and there is a lack of both academic literature and practical experience to make a strong case for CopulaGAN versus CTGAN. Hence, we also decided not to test it in our study.

Figure 9. Key metrics for all strategies.

Table 4. Trading expenses by asset class.

Asset Class               Selected ETF    Average 30-Day Bid-Ask Spread (Basis Points)
US equities               SPY US           0.36
US equities tech          QQQ US           0.52
Global equities           VT US            0.54
EM equities               EEM US           2.69
US high yield             HYG US           1.35
US inv. grade             LQD US           0.96
EM debt                   PCY US           5.66
Commodities               COMT US         14.1
Long-term treasuries      TLT US           1.03
Short-term treasuries     BIL US           1.25
Second, in reference to overfitting and selection bias: like most neural networks, CTGAN relies on a set of
hyperparameters. To avoid overfitting, we excluded any hyperparameter-tuning process. In fact, we maintained the number of layers, dimensions, and architecture of the CTGAN model proposed by Xu et al. (2019), which also matched the default values of the model library. The only parameters that were modified were the learning rate, reduced to 10^{-4} from 2 × 10^{-4}, and the number of epochs (increased from 300 to 1500). These values proved to yield stable results across all runs. It is important to mention that smoothing the learning rate and increasing the number of epochs does not affect the optimal solution but guarantees a closer convergence at the expense of a higher (but still tolerable) computational cost.

Table 5. Annualized transaction expenses (basis points).

CVaR     Gw/oF   GwF    Hw/oF   HwF    EW
0.075    0.54    1.32   0.19    1.52   0
0.10     0.43    1.50   0.23    1.72   0
0.125    0.44    1.20   0.23    1.64   0
0.15     0.47    1.61   0.23    1.82   0
0.175    0.53    1.31   0.24    1.76   0
0.20     0.47    1.46   0.25    1.56   0
0.225    0.49    0.99   0.28    1.49   0
0.25     0.53    1.51   0.28    1.30   0
0.275    0.45    0.80   0.30    1.07   0
0.30     0.45    0.85   0.27    0.93   0

Table 6. Annualized returns (net of transaction expenses).

CVaR     Gw/oF     GwF       Hw/oF     HwF       EW
0.075    12.53%    13.49%    12.90%    12.72%    7.89%
0.1      11.96%    13.28%    12.73%    12.96%    7.89%
0.125    12.46%    14.93%    13.04%    13.65%    7.89%
0.15     13.84%    15.41%    13.20%    14.01%    7.89%
0.175    12.94%    15.17%    14.04%    14.06%    7.89%
0.2      13.02%    15.20%    13.57%    14.69%    7.89%
0.225    12.51%    16.21%    13.26%    15.19%    7.89%
0.25     13.18%    16.76%    13.31%    15.64%    7.89%
0.275    13.60%    17.35%    13.59%    16.44%    7.89%
0.3      13.87%    17.76%    14.90%    16.63%    7.89%

In the case of the remaining components of our proposed Synthetic Data Generation Process, namely TSNE, PCA, and
HDBSCAN, we also decided not to tune any parameters,
relying instead on the original implementations as they are.
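For reference, a sketch of how these two changes map onto the open-source ctgan implementation; the argument names are those of recent versions of the package and may differ in others, and everything else keeps its library default.

```python
from ctgan import CTGAN

# Library defaults (architecture, dimensions, batch size) are kept; only the
# learning rates and the number of epochs change, as described above.
model = CTGAN(
    generator_lr=1e-4,        # reduced from the default 2e-4
    discriminator_lr=1e-4,    # reduced from the default 2e-4
    epochs=1500,              # increased from the default 300
)
```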
Third, in reference to the lookback period (five years) and
rebalancing period (one year), we did not test other lookback
periods. However, previous experience suggests that the opti-
mal length for lookback periods should be between three and
five years (Gutierrez et al. 2019). A period less than three
years does not offer enough variability to capture key ele-
ments of the DGP, while periods longer than five years bring
the risk of sampling from a ‘different universe’ as financial
markets are subject to exogenous conditions (e.g. regulation)
that change over time. In other words, sampling returns from
too far in the past could bring elements into the modeling
process that may not reflect current market dynamics. Addi-
tionally, we did not test rebalancing periods different than
one year. Rebalancing periods much shorter than one year
probably do not make sense in the context of passive invest-
ment, which is the philosophy behind the investment approach
we are advocating. And from a practical point of view, most
investors would not entertain a rebalancing period less fre-
quent than once a year since in general people evaluate their
investment priorities on a yearly basis.
In brief, we hope that these additional explanations will
be helpful in evaluating the relevance of our results and
dispelling any major concerns related to potential biases.
5. Conclusions
Several conclusions emerge from this study. The most impor-
tant is that the synthetic data-generating approach suggested
(based on a Modified CTGAN method enhanced with con-
textual information) seems very promising. First, it generates
data (in this case returns) that capture well the essential char-
acter of historical data. And second, such data, when used
in conjunction with the CVaR-based optimization framework
described in (7), yields portfolios with satisfactory out-of-
sample performance.
Additionally, the example also emphasizes the benefits of incorporating contextual information. Recall that both the GwF and HwF methods clearly outperformed their non-features counterparts. Also, the fact that the GwF approach outperformed the HwF approach highlights both the shortcomings of methods based only on historical data and the relevance of including scenarios that, even though they have not occurred, are 'feasible' given the nature of the historical data. This element, we think, is critical to achieve a good out-of-sample performance.
However, notwithstanding the fact that the example pre-
sented captured a challenging period for the financial markets
(subprime and COVID crises), and considered a broad set of
assets (stocks, bonds, and commodities), the results should be
interpreted with restraint. That is, as an invitation to explore
in more detail certain topics, rather than falling into the temp-
tation of making absolute statements about the merits of the
methods we have presented. In fact, two topics that deserve
further exploration are (i) the benefits of using alternatives
other than the different tenors of the yield curve as fea-
tures, or, perhaps, using the yield curve in combination with
other data (e.g. market volatility, liquidity indices, currency
movements); and (ii) the use of the synthetic data generating
method we proposed applied to financial variables other than
returns, for example, bond default rates, or, exchange rates.
We leave these challenges for future research efforts.
Disclosure statement
No potential conflict of interest was reported by the author(s).
Data availability statement
The code and data that support the findings of this study are openly available in GitHub at https://github.com/chuma9615/ctgan-portfolio-research. Historical data were obtained from Bloomberg.
ORCID
José-Manuel Peña http://orcid.org/0009-0007-0889-2276
Fernando Suárez http://orcid.org/0009-0002-8175-8261
Omar Larré http://orcid.org/0009-0003-2181-226X
Domingo Ramírez http://orcid.org/0000-0003-0743-5667
References
Amenc, N. and Martellini, L., It’s time for asset allocation. J. Financ.
Transf., 2001, 3, 77–88.
Artzner, P., Delbaen, F., Eber, J.M. and Heath, D., Coherent measures
of risk. Math. Finance, 1999, 9(3), 203–228.
Ban, G.Y. and Rudin, C., The big data newsvendor: Practical insights
from machine learning. Oper. Res., 2019, 67(1), 90–108.
Bauer, M.D. and Mertens, T.M., Information in the yield curve about
future recessions. FRBSF Econ. Lett., 2018, 20, 1–5.
Bertsimas, D. and Kallus, N., From predictive to prescriptive analyt-
ics. Manag. Sci., 2020, 66(3), 1025–1044.
Bogle, J.C., Stay the Course: The Story of Vanguard and the Index
Revolution, 2018 (John Wiley & Sons: Hoboken, NJ).
Campello, R.J., Moulavi, D. and Sander, J., Density-based clustering
based on hierarchical density estimates. In Pacific-Asia Confer-
ence on Knowledge Discovery and Data Mining, pp. 160–172,
2013.
Chen, X., Owen, Z., Pixton, C. and Simchi-Levi, D., A statisti-
cal learning approach to personalization in revenue management.
Manag. Sci., 2022, 68(3), 1923–1937.
A modified CTGAN-plus-features-based method for optimal asset allocation 15
DeMiguel, V., Garlappi, L. and Uppal, R., Optimal versus naive
diversification: How inefficient is the 1/N portfolio strategy? Rev.
Financ. Stud., 2009, 22(5), 1915–1953.
Eckerli, F. and Osterrieder, J., Generative adversarial networks in
finance: An overview, 2021. arXiv preprint arXiv: 2106.06364.
Elton, E.J., Gruber, M.J. and de Souza, A., Are passive funds really
superior investments? An investor perspective. Financ. Anal. J.,
2019, 75(3), 7–19.
Estrella, A. and Trubin, M., The yield curve as a leading indicator:
Some practical issues. Curr. Issues Econ. Finance, 2006, 12(5).
Evgenidis, A., Papadamou, S. and Siriopoulos, C., The yield spread’s
ability to forecast economic activity: What have we learned after
30 years of studies? J. Bus. Res., 2020, 106, 221–232.
Fabozzi, F.J., Fabozzi, F.A., López de Prado, M. and Stoyanov,
S.V., Asset Management: Tools and Issues, pp. 1–7, 2021 (World
Scientific: Singapore).
Fahling, E.J., Steurer, E. and Sauer, S., Active vs. passive funds—
An empirical analysis of the German equity market. J. Financ.
Risk Manag., 2019, 8(2), 73.
Friedman, D., Isaac, R.M., James, D. and Sunder, S., Risky Curves:
On the Empirical Failure of Expected Utility, 2014 (Routledge:
New York, NY).
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley,
D., Ozair, S., Courville, A. and Bengio, Y., Generative adver-
sarial networks, 2014. arXiv. Retrieved from https://arxiv.org/
abs/1406.2661.
Gutierrez, T., Pagnoncelli, B., Valladão, D. and Cifuentes, A., Can
asset allocation limits determine portfolio risk–return profiles in
DC pension schemes? Insur. Math. Econ., 2019, 86, 134–144.
Retrieved from https://www.sciencedirect.com/science/article/
pii/S0167668718301331.
Hamilton, J.D., Rational-expectations econometric analysis of
changes in regime: An investigation of the term structure of
interest rates. J. Econ. Dyn. Control, 1988, 12(2–3), 385–423.
Retrieved from https://www.sciencedirect.com/science/article/pii/
0165188988900474.
Hamilton, J.D., A new approach to the economic analysis of non-
stationary time series and the business cycle. Econometrica, 1989,
57(2), 357–384.
Hu, Y., Kallus, N. and Mao, X., Fast rates for contextual linear
optimization. Manag. Sci., 2022, 68(6), 3975–4753, iv–v.
Ibbotson, R.G., The importance of asset allocation. Financ. Anal.
J., 2010, 66(2), 18–20. Retrieved from https://doi.org/10.2469/faj.
v66.n2.4
Kolm, P.N., Tütüncü, R. and Fabozzi, F.J., 60 Years of portfo-
lio optimization: Practical challenges and current trends. Eur. J.
Oper. Res., 2014, 234(2), 356–371. Retrieved from https://www.
sciencedirect.com/science/article/pii/S0377221713008898 (60
years following Harry Markowitz’s contribution to portfolio the-
ory and operations research).
Krokhmal, P., Uryasev, S. and Palmquist, J., Portfolio optimization
with conditional value-at-risk objective and constraints. J. Risk,
2001, 4(2), 43–68.
Kumar, R.R., Stauvermann, P.J. and Vu, H.T.T., The relationship
between yield curve and economic activity: An analysis of G7
countries. J. Risk Financ. Manag., 2021, 14(2), 62.
Lommers, K., Harzli, O.E. and Kim, J., Confronting machine learn-
ing with financial research. J. Financ. Data Sci., 2021, 3(3),
67–96.
Lu, J. and Yi, S., Autoencoding conditional GAN for portfo-
lio allocation diversification, 2022. arXiv preprint arXiv:2207.
05701.
Mariani, G., Zhu, Y., Li, J., Scheidegger, F., Istrate, R., Bekas, C.
and Malossi, A.C.I., Pagan: Portfolio analysis with generative
adversarial networks, 2019. arXiv. Retrieved from https://arxiv.
org/abs/1909.10578.
Markowitz, H., Portfolio selection. J. Finance, 1952, 7(1), 77–
91. Retrieved 2022-10-20, from http://www.jstor.org/stable/2975
974.
Massey, F.J., The Kolmogorov–Smirnov test for goodness of fit. J.
Am. Stat. Assoc., 1951, 46(253), 68–78. Retrieved 2022-11-25,
from http://www.jstor.org/stable/2280095.
Pagnoncelli, B.K., Ramírez, D., Rahimian, H. and Cifuentes, A.,
A synthetic data-plus-features driven approach for portfolio opti-
mization. Comput. Econ., 2022. Retrieved from https://doi.org/
10.1007/s10614-022-10274-2
Pflug, G.C., Some remarks on the value-at-risk and the conditional
value-at-risk. In Probabilistic Constrained Optimization, pp. 272–
281, 2000 (Springer).
Pun, C.S., Wang, L. and Wong, H.Y., Financial thought experiment:
A GAN-based approach to vast robust portfolio selection. In Pro-
ceedings of the 29th International Joint Conference on Artificial
Intelligence (IJCAI’20), 2020.
Rockafellar, R.T. and Uryasev, S., Optimization of conditional value-
at-risk. J. Risk, 2000, 2(3), 21–41.
Rockafellar, R.T. and Uryasev, S., Conditional value-at-risk for
general loss distributions. J. Bank. Finance, 2002, 26(7), 1443–
1471.
Schaller, H. and Norden, S.V., Regime switching in stock market
returns. Appl. Financ. Econ., 1997, 7(2), 177–191. https://doi.org/
10.1080/096031097333745
See, C.T.. and Sim, M., Robust approximation to multiperiod inven-
tory management. Oper. Res., 2010, 58(3), 583–594.
Sharpe, W.F., The arithmetic of active management. Financ. Anal.
J., 1991, 47(1), 7–9.
Takahashi, S., Chen, Y. and Tanaka-Ishii, K., Modeling financial
time-series with generative adversarial networks. Phys. A, 2019,
527, 121261.
Thune, K., How and why John Bogle started vanguard, 2022.
Retrieved from www.thebalancemoney.com/how-and-why-john-
bogle-started-vanguard-2466413.
Tu, J. and Zhou, G., Data-generating process uncertainty: What dif-
ference does it make in portfolio decisions? J. Financ. Econ.,
2004, 72(2), 385–421. Retrieved from https://www.sciencedirect.
com/science/article/pii/S0304405X03002472.
Walden, M.L., Active versus passive investment management of
state pension plans: Implications for personal finance. J. Financ.
Couns. Plan., 2015, 26(2), 160–171.
Xu, L., Skoularidou, M., Cuesta-Infante, A. and Veeramachaneni,
K., Modeling tabular data using Conditional GAN, 2019. CoRR
abs/1907.00503. Retrieved from http://arxiv.org/abs/1907.00
503.