ADAPTIVE PROJECTED RESIDUAL NETWORKS FOR LEARNING
PARAMETRIC MAPS FROM SPARSE DATA∗
THOMAS O'LEARY-ROSEBERRY†, XIAOSONG DU‡, ANIRBAN CHAUDHURI†,
JOAQUIM R. R. A. MARTINS‡, KAREN WILLCOX§, AND OMAR GHATTAS¶

∗This research was partially funded by the U.S. Department of Energy under ARPA-E award DE-AR0001208 and ASCR awards DE-SC0019303 and DE-SC0021239; and the U.S. Department of Defense under MURI award FA9550-21-1-0084.
†Oden Institute for Computational Engineering & Sciences, The University of Texas at Austin, Austin, TX (tom@oden.utexas.edu, anirbanc@oden.utexas.edu).
‡Department of Aerospace Engineering, University of Michigan (xsdu@umich.edu, jrram@umich.edu).
§Oden Institute for Computational Engineering & Sciences, Department of Aerospace Engineering and Engineering Mechanics, The University of Texas at Austin, Austin, TX (kwillcox@oden.utexas.edu).
¶Oden Institute for Computational Engineering & Sciences, Department of Mechanical Engineering, and Department of Geological Sciences, The University of Texas at Austin, Austin, TX (omar@oden.utexas.edu).
Abstract.
We present a parsimonious surrogate framework for learning high dimensional parametric maps
from limited training data. The need for parametric surrogates arises in many applications that
require repeated queries of complex computational models. These applications include such “outer-
loop” problems as Bayesian inverse problems, optimal experimental design, and optimal design and
control under uncertainty, as well as real time inference and control problems. Many high dimensional
parametric mappings admit low dimensional structure, which can be exploited by mapping-informed
reduced bases of the inputs and outputs. Exploiting this property, we develop a framework for learn-
ing low dimensional approximations of such maps by adaptively constructing ResNet approximations
between reduced bases of their inputs and outputs. Motivated by recent approximation theory for
ResNets as discretizations of control flows, we prove a universal approximation property of our pro-
posed adaptive projected ResNet framework, which motivates a related iterative algorithm for the
ResNet construction. This strategy represents a confluence of the approximation theory and the
algorithm since both make use of sequentially minimizing flows. In numerical examples we show that
these parsimonious, mapping-informed architectures are able to achieve remarkably high accuracy
given few training data, making them a desirable surrogate strategy to be implemented for minimal
computational investment in training data generation.
1. Introduction. We seek to construct surrogates for parametric mappings $m \mapsto q$, where $m \in \mathbb{R}^{d_M}$ are model parameters (input data) distributed according to a probability distribution $\nu(m)$, and $q \in \mathbb{R}^{d_Q}$ are quantities of interest (output data) that depend on the inputs $m$. The need for parametric surrogates arises in many problems in computational
science, engineering, and machine learning. In a first class of problems, “outer-loop”
problems, complex computational models need to be evaluated repeatedly for differing
values of input parameters, making their solutions intractable if one is constrained
to only use a high fidelity model. These “outer-loop” problems include uncertainty
quantification, Bayesian inversion, Bayesian optimal experimental design, and opti-
mization under uncertainty. Another class of problems requiring accurate surrogates
is real-time decision making about complex systems; this includes early warning sys-
tems, and real-time inference.
We are specifically interested in settings arising from discretizations of physical models; however, our proposed methodology can be applied to generic input–output data pairs. In the setting of discretized physical models (e.g., partial differential equations (PDEs)), inputs and outputs often represent approximations of infinite dimensional field quantities. Often $\mathbb{R}^{d_M}$ represents a discretized approximation of
an infinite dimensional Banach space. The vector quantity of interest $q(m) \in \mathbb{R}^{d_Q}$ depends on $m$ implicitly through a (typically nonlinear) state equation $R(u, m) = 0$, where the state variable $u$ is an element of $\mathbb{R}^{d_U}$, another finite dimensional discretization of an infinite dimensional Banach space. Notably, in this setting the nominal discretization dimensions of the model, $d_M$ and $d_Q$, may be much higher than the "information dimension" of the parametric mapping $m \mapsto q$. This key observation leads to the notion of bypassing the dependence on the discretization dimension by instead parametrizing the model in reduced bases of the inputs and outputs, which has been a successful strategy in scalable learning of parametric mappings [4, 38].
Outside of special cases where one can take advantage of analytical information about the functional form of $q(m)$, surrogates are typically constructed and trained via empirical risk minimization over training data $\{(m_i, q(m_i)) \mid m_i \sim \nu\}_{i=1}^{N}$. The question that we are concerned with can be stated as follows: in settings where the use case of a parametric model is likely to amortize the costs of training data generation, what is the most information that we can extract from a given training dataset?
Our proposed methodology for this is to construct projected residual neural net-
works (ResNets) which learn low dimensional nonlinear mappings from a reduced basis
for the input to a reduced basis for the output. The low dimensional residual neural
networks we propose are constructed and trained adaptively, one layer at a time. This
ResNet strategy allows us to construct surrogate models with the weight dimension independent of $d_M$ and $d_Q$ by making use of reduced bases. The adaptive training strategy
is motivated by issues associated with neural network training, and empirically leads
to better generalizability. Likewise the compressed ResNet layers allow one to start
with a small amount of nonlinearity in a neural network and adapt this sequentially
until significant overfitting is observed in the empirical risk minimization procedure.
The strategy is highly scalable since it seeks to define neural network architectures
that scale with the dimension of the information content of the map $m \mapsto q$, and not with the nominal discretization dimensions $d_M$ and $d_Q$.
In order to motivate algorithms for the adaptive training and construction of the
projected ResNet, we conceive of the ResNet as a time discretization of a limiting
control flow. Recent approximation theory [21] shows that these flows are universal approximators of continuous functions on compact sets. Using this framework we prove that ResNet discretizations of control flows between reduced bases are universal approximators of $L^2$-integrable parametric functions on compact sets (i.e., there exist sufficient breadth and depth to achieve a given accuracy). From an approximation theory standpoint, this allows us to construct sequential minimizing flows that move input data to target data slowly via an ODE system. From a neural network training point of view, this motivates the sequential construction and training of these ResNets to realize better practical performance in realistic settings than dense feedforward networks or end-to-end trained ResNets.
In numerical experiments we consider two problems: one from parametric aero-
dynamic wing design optimization, and another from parametric PDE regression. In
the former, no input dimension reduction is used since the inputs (flight constraints)
are low dimensional, while the outputs are compressed. In the latter case both in-
puts and outputs are compressed, and derivative based dimension reduction is used
to compress the inputs. Both experiments demonstrate that this general strategy can
outperform typical neural network construction strategies in terms of generalization
accuracy; these lean models can attain high generalizability for few training data, and
can be adapted to increase representation power as more training data are available.
In general this adaptive projected ResNet framework allows for an architectural strategy that can be fine tuned to a specific problem: in terms of dimension reduction,
nonlinearity, and learning “up to the noise”. Empirical results demonstrate that this
strategy is robust to depth, breadth, and the availability of training data; while tra-
ditional neural networks may suffer due to too much depth, or too few training data,
our proposed method performs reliably well.
1.1. Relevant work. A recent topic of interest in scientific machine learning is
the deployment of neural networks as parametric surrogates [4,6,13,20,22,24,31,
32,38]. Of particular relevance to this work are projection based parametric neural
surrogates which seek to parametrize high dimensional maps by use of linear and
nonlinear dimension reduction strategies [4,6,13,31,38].
The use of low rank residual network layers for non-spatial data has been pro-
posed in various works before, but has not been widely adopted [33,47]. Adaptive
training of ResNets has been proposed in recent works [7,11], and has performed
well. Very relevant to the view of the neural networks as discretizations of ODEs
are the following publications [8,21,41], which provide theoretical insight into the
approximation capabilities of ResNets and relations with ODEs.
1.2. Contributions. The contributions of this work are both an approximation theoretic motivation of a class of projected ResNet models for parametric mappings, as well as practical algorithms for their construction. In Theorem 2.2 we posit the existence of a projected ResNet that can achieve arbitrarily good approximation of an $L^2(\mathbb{R}^{d_M}, \nu; \mathbb{R}^{d_Q})$ parametric function on a compact set $K \subset\subset \mathbb{R}^{d_M}$. This view of the projected ResNet approximating function class motivates the algorithmic contribution of this work: the construction of these approximating functions via time continuation and time discretization refinement, realized through adaptive construction of a low dimensional ResNet parametrized between reduced bases of the inputs and the outputs.
2. Approximation by Projected ResNet. In this section, we motivate the
use of projected ResNet architectures for approximating high dimensional parametric
mappings in settings where limited training data are available. We do this by first
reviewing input-output dimension reduced ridge functions, i.e. representing high di-
mensional maps by surrogates restricted to reduced bases of the inputs and outputs
simultaneously. We then discuss the approximation of these dimension reduced ridge
functions by ResNet. Approximation results are derived based on the connection be-
tween ResNet and discretization of control flow ODE systems, which under certain
basic assumptions that many ResNet meet, are universal approximators of continu-
ous functions on compact sets [21]. We extend this theory to derive a bound for the
depth complexity of a ResNet approximation of a reduced basis mapping based on
the desired accuracy, the size of the compact set, and the complexity of the associated
approximating control flow. In this sense these projected ResNet can be thought of
as “universal approximators” for parametric input-output maps.
2.1. Input–Output Reduced Ridge Functions. In this section we recount
some theory regarding input-output projected ridge functions, which are discussed in
[38], and based on results from [27, 39, 42, 48]. For a more detailed discussion, see [38].
Projected ridge functions are a means of exploiting low-dimensionality of an input-
output map in architecting a parsimonious model. Projected ridge functions restrict
the inputs and outputs of the map to reduced bases. Appropriately chosen reduced
bases can embed key information of the map into the surrogate in order to make up
for limited data in empirical risk minimization problems. These ridge functions are
useful for analyzing representation power of the projected neural networks.
A ridge function is a composition of two functions $g \circ V_{r_M}$, where $V_{r_M}: \mathbb{R}^{d_M} \to \mathbb{R}^{r_M}$ is a linear mapping (a matrix in $\mathbb{R}^{r_M \times d_M}$), and $g: \mathbb{R}^{r_M} \to \mathbb{R}^{d_Q}$ is a measurable function. The ridge function mitigates the dependence on the input dimension $d_M$ by restricting the mapping to an $r_M$ dimensional subspace of $\mathbb{R}^{d_M}$. The dependence on the output dimension $d_Q$ can additionally be mitigated by employing a reduced basis for the output. In this case we modify the range of $g$ to be $\mathbb{R}^{r_Q}$, and employ an affine transformation from $\mathbb{R}^{r_Q}$ to $\mathbb{R}^{d_Q}$ (a reduced basis $\Phi_{r_Q} \in \mathbb{R}^{d_Q \times r_Q}$ and an affine shift $b \in \mathbb{R}^{d_Q}$). The result is an input-output ridge function
(2.1)    $q_r(m) = \Phi_{r_Q} g_r(V_{r_M}^T m) + b$,
which mitigates high dimensional dependence ($d_M$, $d_Q$) linearly via reduced bases of dimensions ($r_M$, $r_Q$). Here $g_r: \mathbb{R}^{r_M} \to \mathbb{R}^{r_Q}$ is a low dimensional function that learns a mapping between coefficients of the input basis and coefficients of the output basis. The nonlinearity of the mapping is approximated in a low dimensional mapping between $\mathbb{R}^{r_M}$ and $\mathbb{R}^{r_Q}$.
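As a concrete illustration, the following is a minimal NumPy sketch of evaluating an input-output ridge function of the form (2.1). The dimensions, the randomly generated bases, and the placeholder map `g_r` are illustrative assumptions only; in practice the bases come from AS/KLE and POD, and $g_r$ is the learned low dimensional map.

```python
import numpy as np

# Illustrative dimensions; in practice d_M, d_Q come from the discretization
# and r_M, r_Q from the spectral decay of the basis-defining operators.
d_M, d_Q = 1000, 200
r_M, r_Q = 16, 16

# Placeholder reduced bases and affine shift (in practice from AS/KLE and POD).
V_rM = np.linalg.qr(np.random.randn(d_M, r_M))[0]    # input reduced basis, d_M x r_M
Phi_rQ = np.linalg.qr(np.random.randn(d_Q, r_Q))[0]  # output reduced basis, d_Q x r_Q
b = np.zeros(d_Q)

def g_r(z):
    """Placeholder low dimensional map R^{r_M} -> R^{r_Q}; learned in practice."""
    return np.tanh(z)

def q_r(m):
    """Input-output ridge function q_r(m) = Phi_rQ g_r(V_rM^T m) + b, cf. (2.1)."""
    return Phi_rQ @ g_r(V_rM.T @ m) + b

m = np.random.randn(d_M)
print(q_r(m).shape)  # (200,)
```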
Finding an appropriate ridge function $q_r$ to approximate a given parametric map $q$ involves several choices: the input reduced basis, the output reduced basis, and the choice of low dimensional functional approximation. Each decision depends on the choice of norm used to measure approximation errors. For our purposes we consider least-squares regression problems and the corresponding representation errors:
(2.2)    $\mathbb{E}_\nu[\|q - q_r\|^2_{\ell^2(\mathbb{R}^{d_Q})}] = \int_{\mathbb{R}^{d_M}} \|q(m) - q_r(m)\|^2_{\ell^2(\mathbb{R}^{d_Q})} \, d\nu(m)$,
which is the norm corresponding to the function space $L^2(\mathbb{R}^{d_M}, \nu; \mathbb{R}^{d_Q})$. We consider reduced bases that come from eigenbases of symmetric positive definite operators. Examples of these for input dimension reduction are the Karhunen–Loève expansion (KLE) [42] and active subspace (AS) [48]; for the output, proper orthogonal decomposition (POD) [27] is a canonical example. For each of these reduced bases, approximation errors can be bounded by trailing eigenvalues associated with these operators.
KLE exploits low dimensional structure in the input parameter distribution $\nu(m)$ by restricting the random variable $m$ to the $r_M$ dominant eigenvectors of the covariance matrix of $\nu$, $C = V^{(KLE)} \mathrm{diag}(c) (V^{(KLE)})^T$, where $(c_i, v_i)_{i \geq 1}$ are the eigenpairs ordered by descending eigenvalues $c_i$. KLE is useful in situations where the output uncertainty in $q$ is inherited in a straightforward way from the input parameter distribution $\nu$. AS exploits low dimensional geometric information about the map $m \mapsto q$ as revealed by the Jacobian of the map, $\nabla q(m)$. AS builds a basis that takes into account both the sensitivity of the mapping and the parameter uncertainty. This is done via construction of the eigenvector basis for the following generalized eigenvalue problem:
(2.3)    $\mathbb{E}_\nu[\nabla q^T \nabla q] v_i = \lambda_i^{(AS)} C^{-1} v_i$,
where $(\lambda_i^{(AS)}, v_i)_{i \geq 1}$ are the generalized eigenpairs sorted such that $\lambda_i \geq \lambda_j$ for $i < j$.
In [48], bounds are established for ridge functions based on both KLE and AS. In both cases bounds are derived of the form
(2.4)    $\mathbb{E}_\nu\left[\|q - q_{r_M}\|^2_{\ell^2(\mathbb{R}^{d_Q})}\right] \leq C_{\mathrm{input}} \sum_{i=r_M+1}^{d_M} \lambda_i^{(\mathrm{input})}$,
where $q_{r_M}$ is a conditional expectation ridge function in which the conditional expectation is taken with respect to the sigma-algebra generated by the input basis:
(2.5)    $q_{r_M}(m) = \mathbb{E}_\nu[q \mid \sigma(V^{(\mathrm{input})}_{r_M})](m)$.
These bounds establish that parametric mappings can be well approximated by low dimensional ridge functions when there is significant decay in the eigenvalues $\lambda_i^{(\mathrm{input})}$ and the prefactor constant $C_{\mathrm{input}}$ is not too large. In the case of KLE, the constant is $C_{\mathrm{input}} = L_q^2$ and $\lambda_i^{(\mathrm{input})} = c_i$. This bound takes into account the worst case amplifications of the map by making use of the Lipschitz constant of the map, $L_q$. In the case of AS, the constant is $C_{\mathrm{input}} = 1$ and $\lambda_i^{(\mathrm{input})} = \lambda_i^{(AS)}$. The AS bound does not involve worst case amplifications via a Lipschitz constant because the sensitivities of the map are taken directly into account via the construction of the input reduced basis. Indeed, reduced basis neural networks using AS outperform identically architected reduced basis networks using KLE for parametric PDE regression problems [38]. The superior performance of the AS reduced basis comes at additional cost; however, this cost requires only marginally more (linear) computation than the nonlinear data generation when the Jacobians are computed at training data locations. This can be done efficiently using adjoints [38], or sophisticated reverse-mode automatic differentiation when $d_M \gg d_Q$ [3].
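As an illustration of how the AS basis can be obtained in practice, the following is a minimal SciPy sketch of a Monte Carlo approximation of the generalized eigenvalue problem (2.3) from sample Jacobians. The sample Jacobians and the input covariance used here are random placeholders, not the adjoint-computed quantities described above.

```python
import numpy as np
from scipy.linalg import eigh

# Monte Carlo approximation of the AS generalized eigenvalue problem (2.3):
#   E_nu[ grad q^T grad q ] v_i = lambda_i C^{-1} v_i.
d_M, d_Q, N, r_M = 50, 10, 200, 8
jacobians = np.random.randn(N, d_Q, d_M)         # placeholder samples of grad q(m_i), each d_Q x d_M
C = np.eye(d_M)                                  # placeholder input covariance

H = sum(J.T @ J for J in jacobians) / N          # sample average of grad q^T grad q
lam, V = eigh(H, np.linalg.inv(C))               # solves H v = lambda C^{-1} v; eigenvalues ascending
lam, V = lam[::-1], V[:, ::-1]                   # reorder to descending eigenvalues
V_AS = V[:, :r_M]                                # rank r_M active subspace basis
print(lam[:r_M])
```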
For the output reduced basis we consider POD, which is the eigenvector basis of the averaged outer product of the output: $\Phi \Lambda_Q \Phi^T = \mathbb{E}_\nu[q q^T]$, where $(\lambda^Q_i, \Phi_i)_{i \geq 1}$ are the eigenpairs. POD is the optimal rank $r_Q$ linear basis for approximation of functions in $L^2(\mathbb{R}^{d_M}, \nu; \mathbb{R}^{d_Q})$; that is, it is the minimizer of
(2.6)    $\min_{P_{r_Q} \in \mathbb{R}^{d_Q \times r_Q}} \mathbb{E}_\nu\left[\|q - P_{r_Q} P_{r_Q}^T q\|^2_{\ell^2(\mathbb{R}^{d_Q})}\right]$,
and the truncation error is given by $\mathbb{E}_\nu\left[\|q - \Phi_{r_Q} \Phi_{r_Q}^T q\|^2_{\ell^2(\mathbb{R}^{d_Q})}\right] = \sum_{i=r_Q+1}^{d_Q} \lambda^Q_i$; see [27, 39]. Since we are concerned with approximations in a least-squares setting, POD is the optimal choice of reduced basis for the output. Combining these input and output reduced bases, we can derive an approximation error for an input-output projected ridge function, which will guide our architectural choices for projected neural networks.
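Analogously, the following is a minimal NumPy sketch of computing a sample-based POD basis for the outputs, together with the trailing-eigenvalue estimate of the truncation error; the output samples are random placeholders standing in for high fidelity model evaluations.

```python
import numpy as np

# Sample-based POD: dominant eigenvectors of a Monte Carlo estimate of E_nu[q q^T].
d_Q, N, r_Q = 200, 1000, 32
Q = np.random.randn(d_Q, N)                      # placeholder output samples q(m_i) as columns

H = (Q @ Q.T) / N                                # sample estimate of E_nu[q q^T]
lam, Phi = np.linalg.eigh(H)                     # eigenvalues in ascending order
lam, Phi = lam[::-1], Phi[:, ::-1]               # reorder descending
Phi_rQ = Phi[:, :r_Q]                            # rank r_Q POD basis
truncation_error_estimate = lam[r_Q:].sum()      # trailing eigenvalue sum, cf. the POD bound
print(Phi_rQ.shape, truncation_error_estimate)
```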
Proposition 2.1 (Input–Output Ridge Function Error Bound; analogous to Proposition 2.2 in [38]). Let $\nu = \mathcal{N}(\bar{m}, C)$ be a Gaussian distribution for the input parameter $m$, and let the columns of $P_{r_M} \in \mathbb{R}^{d_M \times r_M}$ be a rank $r_M$ reduced basis for the input such that the associated conditional expectation ridge function satisfies the truncation bound
(2.7)    $\mathbb{E}_\nu\left[\|q - q_{r_M}\|^2_{\ell^2(\mathbb{R}^{d_Q})}\right] \leq C_{\mathrm{input}} \sum_{i=r_M+1}^{d_M} \lambda_i^{(\mathrm{input})}$
for some $\lambda_i^{(\mathrm{input})} \geq \lambda_j^{(\mathrm{input})} \geq 0$ for $i < j$. Define the rank $r_Q$ POD decomposition for $q_{r_M}$ as follows:
(2.8)    $\mathbb{E}_\nu[q_{r_M} q_{r_M}^T]_{r_Q} = \widehat{\Phi}_{r_Q} \widehat{D}_{r_Q} \widehat{\Phi}_{r_Q}^T$.
Then the following bound holds for the input-output projected conditional expectation ridge function:
(2.9)    $\mathbb{E}_\nu\left[\|q - \widehat{\Phi}_{r_Q} \widehat{\Phi}_{r_Q}^T q_{r_M}\|^2_{\ell^2(\mathbb{R}^{d_Q})}\right] \leq C_{\mathrm{input}} \sum_{i=r_M+1}^{d_M} \lambda_i^{(\mathrm{input})} + \sum_{i=r_Q+1}^{d_Q} \lambda^Q_i$.
The bound shows how the choices of input and output subspaces are interrelated; conceptually, the first and most important decision is the choice of input basis, since one cannot well approximate a functional relationship if too many important parameters are thrown out. Both cases bound the approximation using the POD decay of the conditional expectation ridge function $q_{r_M}(m)$ instead of the output itself, $q(m)$. The spectral convergence of the POD for $q_{r_M}$ to the POD for $q$ can be seen from the following bound: let $\|q - q_{r_M}\|_2 = \|e\|_2 \leq \epsilon$, then
\begin{align}
\|qq^T - q_{r_M} q_{r_M}^T\|_2 &= \|qq^T - (q - e)(q - e)^T\|_2 \nonumber \\
&= \|qe^T + eq^T - ee^T\|_2 \nonumber \\
&\leq \|qe^T\|_2 + \|eq^T\|_2 + \|ee^T\|_2 \nonumber \\
&\leq 2L\epsilon + \epsilon^2. \tag{2.10}
\end{align}
The corresponding bound on $\mathbb{E}_\nu[\|qq^T - q_{r_M} q_{r_M}^T\|_2]$ holds due to the linearity of expectation. The
error bounds in Proposition 2.1 establish that when the truncation errors in the input and output reduced representations decay rapidly, the parametric mapping can be well approximated by an input-output reduced conditional expectation ridge function on the joint reduced bases. The choice of ranks for the mapping is guided by the
spectral decay of the operators associated with the construction of those bases. These
bounds motivate the approximation of high dimensional maps by neural networks
that are restricted to learn mappings between reduced subspaces of the inputs and
outputs, such as in [4,20,38].
2.2. Control Flows and Projected ResNet. In this section we motivate the
use of residual neural networks (ResNet) for the approximation of the low dimensional
mapping between the input and output reduced bases. ResNets, among other types of neural networks, are universal approximators for $L^p$-measurable functions on compact sets [10, 16, 21, 23, 25]. We favor the use of ResNets over other networks principally due to their ability to be trained and constructed sequentially [7, 11], but also due to their potentially good approximation powers for sparse representations of the weights [23]. With projected ridge function surrogates as our target, we consider the approximation of functions $q_r: \mathbb{R}^r \to \mathbb{R}^r$. We take $r = r_M = r_Q$ for simplicity, and because we can require it to meet specific error tolerances (2.9). Alternatively, this dimension compatibility condition can be met via additional prolongation or restriction operators.
A residual neural network can be thought of as an Explicit Euler approximation of a control flow ODE (often referred to as a neural ODE) [8, 21, 41]. The control flow approximation temporally moves input data to output data via an appropriate choice of right-hand side for the ODE system
\begin{align}
\frac{dz}{dt} &= \phi(z, t), \tag{2.11a} \\
z(0) &= m_r = V_r^T m. \tag{2.11b}
\end{align}
The goal is to find a right-hand side $\phi$ such that the generated flow moves arbitrary input data $m_r \in \mathbb{R}^r$ (viewed as initial conditions) approximately to the associated target output data $q_r \in \mathbb{R}^r$ in the eventual time limit $\lim_{t \to \infty} z(t) = q_r(m_r)$. Typically the right-hand side has the form
(2.12)    $\phi(z, t) = w_1(t)\, \sigma(t)(w_0(t) z + b(t)),$
where for each time $t$, $w_1(t) \in \mathbb{R}^{r \times k}$, $w_0(t) \in \mathbb{R}^{k \times r}$, $b(t) \in \mathbb{R}^k$, where $k \leq r$ is an architecture parameter, and $\sigma(t)$ is a nonlinear function that is applied element-wise. When the dynamics are time integrated using Explicit Euler, this system becomes a ResNet with $l$ layers:
(2.13)    $z_{k+1} = z_k + \Delta t \, w_{1k} \sigma_k(w_{0k} z_k + b_k)$,
where $\{(w_{1k}, w_{0k}, b_k)\}_{k=1}^{l}$ are the layer-wise coefficient arrays, and $\sigma_k$ are the chosen activation functions. ResNets trained via empirical risk minimization can be thought of as approximations of control flows that move the sample input data to the sample output data; the right-hand sides result from a minimization procedure attempting to control the approximation.
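To make the correspondence concrete, the following is a minimal NumPy sketch of a forward pass through the Explicit Euler discretization (2.13). The dimensions, weights, and time step are illustrative placeholders, and softplus stands in for the generic activation $\sigma_k$.

```python
import numpy as np

def resnet_forward(z0, layers, dt=1.0):
    """Explicit Euler / ResNet forward pass, cf. (2.13):
    z_{k+1} = z_k + dt * w1_k @ softplus(w0_k @ z_k + b_k)."""
    softplus = lambda x: np.log1p(np.exp(x))
    z = z0
    for (w1, w0, b) in layers:
        z = z + dt * w1 @ softplus(w0 @ z + b)
    return z

r, k, depth = 10, 4, 5                           # reduced dimension, layer rank, number of layers
rng = np.random.default_rng(0)
layers = [(0.1 * rng.standard_normal((r, k)),    # w1_k in R^{r x k}
           0.1 * rng.standard_normal((k, r)),    # w0_k in R^{k x r}
           np.zeros(k))                          # b_k in R^k
          for _ in range(depth)]
print(resnet_forward(rng.standard_normal(r), layers).shape)  # (10,)
```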
The representation capability of the control flow is related to the time horizon $T$ as well as the magnitude of the right-hand side of the control ODE. These two quantities together represent how "far" input data need to be moved to match output data, i.e., the quantity
(2.14)    $\int_0^T \|w_1(t)\, \sigma(t)(w_0(t) z + b(t))\|_{\ell^2(\mathbb{R}^r)} \, dt$.
The representation capability of the ResNet depends on the time horizon, the magnitude of the residual perturbation, and additionally the time discretization parameter $\Delta t$. The representation capabilities of ResNets (and the connection to the representation powers of control flows) are related to the Barron function spaces, the elements of which can be represented to arbitrary accuracy by ReLU ResNets [12]. The Barron norm essentially measures how far data must be moved from the input representation to the output representation by a control flow, and is directly related to the time horizons and magnitudes of the right-hand sides for control flow approximations. Note that ResNets naturally subsume the $\Delta t$ discretization coefficients into the scaling of the layer weights, and therefore have the ability to dilate time. This makes it unclear how to specifically enforce a fine time discretization when constructing a ResNet. In order to constrain the flow to be smooth and "fine" in the time discretization parameter, one can impose constraints on the norm of each layer's perturbation. However, since the approximation power of the ResNet is ultimately related to how far the control flow moves points (equation (2.14)), restrictions on layer perturbations then require many layers. This is consistent with the asymptotic approximation of the control flow: in the limit as $\Delta t \to 0$, each layer's contribution converges uniformly to zero, and the required depth tends to infinity.
In [21], the authors show that a sufficient condition for the class of functional ap-
proximations by generating flows (2.11) to be universal approximators of continuous
functions on compact sets is that the functional representation of the right-hand side of the ODE satisfies two conditions that are both highly relevant to ResNets. The first condition is a restricted affine invariance property (the function space is closed under affine operations). The second condition is that the closure of the class contains a "well-function". A well-function is a function whose nullspace contains an open bounded set. This property makes it so that the update in the neural ODE (2.11) can leave entire regions of $\mathbb{R}^r$ unchanged while modifying certain other portions. This allows for the general construction of control flows that work by moving local regions of the input to their targets in the output, and thus universal approximation of continuous functions on compact sets. For example, ReLU is a well-function because it maps the entire negative half-line to zero. Many other popular activation functions in machine learning are well-functions. The universal approximation result in [21] states that there exists a finite time ($T > 0$) control flow that can get arbitrarily close to approximating any continuous function on a compact set $K \subset\subset \mathbb{R}^r$. We now state a representation error bound for input-output projected ResNet approximations of parametric mappings. Without loss of generality we consider zero-mean mappings $m \mapsto q$, since we can always subtract the mean from the map (or equivalently have the mean be the last-layer bias in the output representation).
Theorem 2.2 (Representation Error Bound for Reduced Basis ResNet). Given a parametric mapping $q \in L^2(\mathbb{R}^{d_M}, \nu; \mathbb{R}^{d_Q})$ that can be approximated by restriction to rank $r < \min\{d_M, d_Q\}$ reduced bases of the inputs and outputs with truncation error given by
(2.15)    $\mathbb{E}_\nu\left[\|q(m) - \widehat{\Phi}_r q_r(V_r^T m)\|_{\ell^2(\mathbb{R}^{d_Q})}\right] \leq \zeta_r$,
where $q_r: \mathbb{R}^r \to \mathbb{R}^r$ is a ridge function that approximates $q$ between the reduced basis for the inputs $V_r$ and the orthonormal reduced basis for the output $\widehat{\Phi}_r$. Then for any compact set $K \subset\subset \mathbb{R}^{d_M}$ there exists an input-output projected ResNet $f_r(V_r^T m, w)$ such that
(2.16)    $\int_{K \subset\subset \mathbb{R}^{d_M}} \|q(m) - \widehat{\Phi}_r f_r(V_r^T m, w)\|_{\ell^2(\mathbb{R}^{d_Q})} \, d\nu(m) \leq 2\zeta_r$,
and the depth of the ResNet is $O\!\left(\frac{T e^T |K|}{\zeta_r}\right)$, where $T$ is the time horizon for the associated control flow mapping that is the continuous time analogue of the ResNet.
See Appendix A for the proof. Additionally noted in the appendix are conditions under which the depth complexity can be reduced to $O\!\left(\frac{T |K|}{\zeta_r}\right)$, which is related to details of the construction of the control flow being approximated via Explicit Euler.
Since the truncation errors can be made arbitrarily small by the choice of rank for the basis, this result establishes that high dimensional parametric mappings restricted to compact sets can be approximated arbitrarily well by projected ResNets, and establishes an approximation bound that depends on the truncation error resulting from the projections, as well as the complexity of the ResNet approximation of the reduced ridge function. In particular, when the high dimensional map can be well approximated for $r \ll \min\{d_M, d_Q\}$, the weight complexity of the ResNet can be reduced significantly compared to overparametrized networks. As the result states, the ResNet depth is inversely proportional to the desired accuracy, and proportional to the size $|K|$ of the desired region where the map is to be approximated, as well as to a complexity upper bound for the approximation of the restricted nonlinear map via a control flow with time horizon $T$. Inner regular measures can be approximated arbitrarily well on compact sets; that is, for any $\epsilon > 0$, there exists a compact set $K$ such that the measure of $K^c$ is bounded by $\epsilon$. This makes the result sufficiently general; however, extending this result to all of $\mathbb{R}^{d_M}$ is tricky since the depth complexity for the ResNet is a function of $|K|$. More sophisticated bounds could make use of results in concentration of measure in order to mitigate the dependence on $|K|$.
The result hinges on the existence of a finite time horizon approximating control flow for the target function. Unfortunately, the result does not say anything in general about how large $T$ is as a function of the complexity of the target function $q_r$. In the case of 1D monotonic functions, a bound is given in [21] in terms of the regularity of the target function; when the total variation of the Jacobian of the map is smaller, less time is needed to move the input data space to the output data space. In future work, we hope these approximation rates can be extended to high dimensional non-monotonic functions; one would expect that smoother functions are easier to approximate.
2.3. Balancing the various errors. The result in Theorem 2.2 gives an approximation bound for the representation capabilities of ResNets in representing parametric maps. Two major hurdles remain for realizing such approximations in practice: the first is the approximation of the stochastic integral via Monte Carlo samples $\{(m_i, q(m_i)) \mid m_i \sim \nu\}_{i=1}^{N_{\mathrm{data}}}$, and the second is the nonconvex empirical risk minimization problem for training the neural network. The approximation of the stochastic integral via Monte Carlo will be a fundamental limitation in practice for many parametric mappings, since the dominant costs will almost always be associated with the generation of training data, and not with the construction and training of neural networks. In dealing with this issue we seek to balance the different approximation errors. The approximating neural network is not directly learning the true parametric mapping, but instead instances of the mapping on finite training data. Defining $\mathrm{misfit}(m_i) = \|q(m_i) - \widehat{\Phi}_r f_r(V_r^T m_i, w)\|_{\ell^2(\mathbb{R}^{d_Q})}$, we can include the Monte Carlo error in the representation via the triangle inequality:
\begin{equation}
\int_{K \subset\subset \mathbb{R}^{d_M}} \|q(m) - \widehat{\Phi}_r f_r(V_r^T m, w)\|_{\ell^2(\mathbb{R}^{d_Q})} \, d\nu(m) \leq \underbrace{\zeta_r}_{\text{truncation error}} + \underbrace{\zeta_r}_{\text{ResNet error}} + \underbrace{\frac{\mathrm{std}_\nu(\mathrm{misfit})}{\sqrt{N_{\mathrm{train}}}}}_{\text{Monte Carlo error}}. \tag{2.17}
\end{equation}
In order to achieve arbitrary accuracy we need all three of these errors to tend to
zero uniformly. In the pre-asymptotic regime we strive for the errors to be balanced.
The fundamental representation limitation becomes the availability of training data,
so we seek to balance the representation errors (Theorem 2.2) specifically against the Monte Carlo errors (the level of the noise). If the noise level is lower than the representation
power, then we have left generalization accuracy on the table. If the noise level is
significantly higher than the representation capability, then we risk learning stochastic
fluctuations in the training data that are not relevant to the true parametric mapping
(i.e. overfitting).
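The following is a minimal NumPy sketch of the kind of bookkeeping this balancing suggests: comparing an eigenvalue-based estimate of the truncation error with a sample estimate of the Monte Carlo (noise) level in (2.17). All arrays here are illustrative placeholders.

```python
import numpy as np

# Compare a trailing-eigenvalue estimate of the truncation error with a sample
# estimate of the Monte Carlo error; placeholder spectra and misfits only.
input_eigs = np.logspace(0, -6, 200)            # placeholder AS/KLE eigenvalues
output_eigs = np.logspace(0, -6, 200)           # placeholder POD eigenvalues
r = 32
truncation_estimate = np.sqrt(input_eigs[r:].sum() + output_eigs[r:].sum())

N_train = 500
misfits = np.abs(np.random.randn(N_train))      # placeholder validation misfits ||q - q_NN||
monte_carlo_estimate = misfits.std() / np.sqrt(N_train)

# Heuristic: stop enriching the surrogate (rank, depth) once the representation
# errors are on the order of the noise level.
print(truncation_estimate, monte_carlo_estimate)
```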
In practice the projection errors are likely the easiest to control since they can be
estimated easily via approximation of trailing eigenvalues using randomized methods
[28]. For subspaces such as AS, KLE and POD where the approximation errors can
be estimated directly, we can decide upon a proper choice of $r$ via inspection of the spectral decay of the associated operators; this dimension is associated with the "intrinsic dimensionality" of the parametric map. The most difficult error to control
is the ResNet error due to its association with a nonconvex optimization problem. In
what follows we motivate a practical algorithm for adaptive construction and training
of these ResNet that attempts to ease the nonconvexity of the associated empirical
risk minimization training problem.
3. Adaptive Construction and Training of Projected ResNet. Deep neu-
ral network training is notoriously difficult; deeper networks have greater representa-
tion power, but their practical performance hinges on the availability of training data;
many strategies in deep learning such as overparametrization require large amounts
of training data for their success. Motivated by achieving good generalizability for
limited training data, we propose an adaptive training and construction strategy that sequentially adds layers to a ResNet up to the noise level of the training data. This
strategy combines the construction of the ResNet and its training into one iterative process. Unlike much of deep learning, this strategy represents a confluence between the approximation theory and the training, since in this case both make use of sequentially minimizing approximating flows.
The strategy allows a tractable path to constructing a deep network (with high approximation power), while also mitigating issues associated with the training problem in the few-data regime. Note that other neural networks such as dense feedforward networks cannot be constructed and trained in this way, since additional nonlinear compositions will distort the previous approximation significantly; only ResNet-like architectures can be built and trained in this fashion.
In the first offline step the user chooses an appropriate set of reduced bases for the inputs and outputs; without loss of generality take $r = r_M = r_Q \ll \min\{d_M, d_Q\}$.¹ Truncation strategies using AS, KLE, and POD allow us to decide the dimension of the reduced basis based on the spectral decay of the operators associated with the construction of these bases. So for these bases the proper dimensionality can be inferred ahead of time, or based on a user defined tolerance such as a spectral decay rate or randomized error estimators. What is left are the architectural decisions for the ResNet. Since we seek a parsimonious representation of the mapping, whose representation power is tied to what can be learned from a sparse dataset, we use a sparse representation for the ResNet layers in which the matrices $w_{1k} \in \mathbb{R}^{r \times r_k}$, $w_{0k} \in \mathbb{R}^{r_k \times r}$ are low rank ($r_k < r$) [33]. The algorithm then constructs a residual network one layer at a time by adding a low rank projected residual network unit:
(3.1)    $z_{k+1} = z_k + w_{1k} \sigma(w_{0k} z_k + b_k)$.
For small $\|w_{1k} \sigma(w_{0k} z_k + b_k)\|$, layer $k+1$ represents a small nonlinear perturbation of the representation of the map. The layer can either be trained separately, or in combination with the prior layers. The former builds a flow where future times are conditioned on previous times via minimization of a loss function. The latter seeks to build a long discretized flow via temporal refinement. Training one layer at a time has the benefit of solving a sequence of simple optimization problems. Training all at once has the added benefit of superior representation capability. As a heuristic for detecting an imbalance between the representation power and the stochasticity of the data, one can terminate the iterative construction of the ResNet in depth by monitoring discrepancies between training accuracy and validation accuracy.
In order to avoid overfitting, we propose using a small number of optimization iterations for each sequential training, followed by one additional end-to-end training once the terminal depth is found. Based on experimentation we have found that stochastic gradient based methods work well for the individual layer trainings during construction from initial guesses; the result of this process serves as a good initial guess for the final end-to-end training, for which second order methods perform well [35, 36] in extracting extra generalization accuracy from a good initial guess (e.g., transfer learning). We summarize the adaptive training algorithm below.

¹Otherwise one can add an upsampling or downsampling layer to the beginning or end of the network to enforce dimensional compatibility for the ResNet layer.
Algorithm 3.1 Adaptive construction of Projected ResNet
1: Given reduced bases for inputs and outputs: $V_r$, $\Phi_r$
2: $z_0(m) = V_r^T m$ (truncated representation of input data)
3: $k = 0$
4: while validation accuracy is increasing do
5:     Train Projected ResNet $f(m) = \Phi_r z_{k+1}(m) + b_Q$,
6:         where $z_{k+1}(m) = z_k(m) + w_{k+1,2}\, \sigma(w_{k+1,1} z_k(m) + b_{k+1})$
7:     $k = k + 1$
8: end while
9: One final end-to-end training (using a second order optimizer)
10: return Projected ResNet
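As a concrete illustration of Algorithm 3.1, the following is a minimal TensorFlow/Keras sketch of the adaptive layer-by-layer construction loop, assuming precomputed reduced bases. The class and variable names, the synthetic data, the layer rank, and the epoch counts are illustrative assumptions rather than the hIPPYflow/hessianlearn implementation used in the paper; here each newly appended layer is trained jointly with the prior layers, one of the two options described above.

```python
import numpy as np
import tensorflow as tf

# Illustrative reduced bases and synthetic data; in practice V_r comes from
# AS/KLE, Phi_r from POD, and (m_data, q_data) from the high fidelity model.
d_M, d_Q, r = 100, 200, 16
V_r = np.linalg.qr(np.random.randn(d_M, r))[0].astype("float32")
Phi_r = np.linalg.qr(np.random.randn(d_Q, r))[0].astype("float32")
m_data = np.random.randn(512, d_M).astype("float32")
q_data = np.random.randn(512, d_Q).astype("float32")

class LowRankResidualLayer(tf.keras.layers.Layer):
    """Low rank residual unit z <- z + softplus(z w0 + b) w1, cf. (3.1).
    Weights act on batched row vectors, so w0 is r x rank and w1 is rank x r."""
    def __init__(self, rank=4, **kwargs):
        super().__init__(**kwargs)
        self.rank = rank
    def build(self, input_shape):
        r = int(input_shape[-1])
        self.w0 = self.add_weight(name="w0", shape=(r, self.rank), initializer="glorot_uniform")
        self.w1 = self.add_weight(name="w1", shape=(self.rank, r), initializer="zeros")
        self.b = self.add_weight(name="b", shape=(self.rank,), initializer="zeros")
    def call(self, z):
        return z + tf.nn.softplus(z @ self.w0 + self.b) @ self.w1

V_r_tf, Phi_r_tf = tf.constant(V_r), tf.constant(Phi_r)

def build_model(residual_layers):
    m = tf.keras.Input(shape=(d_M,))
    z = tf.keras.layers.Lambda(lambda x: x @ V_r_tf)(m)                   # z_0 = V_r^T m
    for layer in residual_layers:                                         # reused layers keep their weights
        z = layer(z)
    q = tf.keras.layers.Lambda(lambda x: x @ tf.transpose(Phi_r_tf))(z)   # decode with Phi_r (shift b_Q omitted)
    return tf.keras.Model(m, q)

# Adaptive construction: append a low rank layer, train briefly (jointly with
# the prior layers), and stop when validation loss stops improving.
residual_layers, best_val = [], np.inf
while True:
    residual_layers.append(LowRankResidualLayer(rank=4))
    model = build_model(residual_layers)
    model.compile(optimizer="adam", loss="mse")
    hist = model.fit(m_data, q_data, validation_split=0.2, epochs=5, verbose=0)
    val = hist.history["val_loss"][-1]
    if val >= best_val:
        residual_layers.pop()        # discard the layer that did not help
        break
    best_val = val
# A final end-to-end training pass would follow here; the paper uses a second
# order optimizer (LRSFN) for this step.
print("selected depth:", len(residual_layers))
```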
The analysis and algorithmic framework are easily adaptable to any setting where
an input-output map needs to be learned from data, and a reduced basis methodology
is appropriate for either the input or output representation spaces, or both. In the
parametric PDE setting it is useful to do dimension reduction on both inputs and
outputs.
4. Numerical Experiments. In this section we demonstrate the benefits of
adaptively constructed projected ResNet for two tasks: inverse aerodynamic shape
design and parametric PDE inference. We study the accuracy of the network as a
function of the truncation dimension $r$, the depth, and the amount of data available during training. We compare against different dense neural network strategies as in [38], where the hidden layer dimension is determined by either the output dimension $d_Q$ or the truncation dimension $r$ of the projected network. The former we refer to as "full dense", and the latter "truncated dense". We also compare against the same ResNet architecture trained end-to-end (without the adaptive procedure); these strategies represent baselines for these regression tasks. In all cases we compare all of the architectural strategies for a fixed total number of optimization epochs (the sum of the epochs used in the adaptive trainings), with access to the exact same data and the same optimizers. This does not take into account that the computational costs of optimization for the low dimensional models are much lower; that is an added bonus.
In this work we use the Adam optimizer with default settings for the step size and momentum parameters. This is to avoid specific hyperparameter tuning that would make the work less general. For the first set of results we only use Adam, since the problem is low dimensional. For the higher dimensional second set of numerical results we use a combination of Adam and Low Rank Saddle Free Newton (LRSFN), due to its ability to improve upon the performance of first order optimizers given a good guess [36]. We note that we do not compare against projected dense strategies as in [38], since in that work an inexact Newton CG optimizer was required to achieve suitable accuracy. In experiments, projected dense networks were unable to attain suitable generalization accuracies when using Adam. One of the major selling points of the projected ResNet over the projected dense strategy, beyond adaptability, is robustness to the choice of the optimizer. Code for the parametric PDE surrogates is in the hIPPYflow library [37], which builds on the hIPPYlib library [45], which implements scalable adjoint-based methods for PDE-based inverse problems using FEniCS [2]. Code for the LRSFN optimizer can be found in the hessianlearn library [34], a library for Hessian based stochastic optimization in TensorFlow [1].
4.1. Inverse Aerodynamic Shape Optimization. In this section, we detail
the aerodynamic wing design problem setup formulated by the Aerodynamic Design
Optimization Discussion Group (ADODG) of the American Institute of Aeronautics
and Astronautics (AIAA). The AIAA ADODG formulates a series of benchmark cases
which provide a foundation for rational assessment of the multitudinous aerodynamic
design optimization approaches to problems of interest. In particular, we generalize
the ADODG Case 3, i.e., drag minimization of a rectangular wing in inviscid sub-
sonic lifting flow, in the following ways: we generalize the inviscid flow to viscous flow, which is a more challenging test case and requires solving the Reynolds-averaged Navier–Stokes (RANS) equations; and we incorporate free-form deformation (FFD) control points as design variables in addition to a twist angle distribution. We expand each of the flight conditions and design requirements from a single value to a uniform parameter distribution $\nu(m)$, delineated in Table 1. This then defines our task for parametric regression: the mapping from design constraints $m$ to drag-optimal geometry $q(m)$. Additional details about the software and data used in this application can be found in Appendix B.
4.1.1. Problem Description. We set up the inverse aerodynamic design problem by following ADODG Case 3, drag minimization of a rectangular wing (Figure 1) with the NACA 0012 airfoil as the wing section in inviscid subsonic flow. For the purpose of testing the proposed algorithm in practical and challenging applications, we consider viscous flow by solving the RANS equations. We consider 200 FFD control points to parameterize the wing geometry, with each of 10 wing sections having 10 FFD control points on the upper surface and 10 on the lower surface (Figure 2). In addition, we use 10 twist variables which are evenly distributed along the $z$ axis (span-wise direction) within the range of [0.0, 3.0]. The angle of attack is used as a dummy variable to satisfy the target lift coefficient constraint. In this section we focus on regression for the FFD control points only; since these data share geometric similarity, the POD basis admits low rank structure. The twist and angle of attack are very low dimensional and do not lend themselves well to dimension reduction; a shallow dense network is sufficient for that task. Since the flight condition is at low speed, varying the Mach number will not significantly affect the aerodynamic performance. Therefore, we extend the other design requirements for the demonstration purposes of the proposed algorithm. We set the target lift coefficient within the range of [0.2, 0.4], vary the lower bound of the moment coefficient by ±20% to make a range of [−0.11, −0.074], set the lower bound of the internal volume as 0.8–1.0 with respect to the baseline, and set the Reynolds number within the range of $[1 \times 10^5, 1 \times 10^7]$. We summarize the wing design problem below in Table 1.
4.1.2. Results. In this aerodynamic wing design problem the input dimension
is small so we do not use any input dimension reduction. Instead we only reduce
the 200 dimensional output, using POD. Thus the truncation errors in the analysis
only apply to the output representation. For these numerical results we use a 10
dimensional POD basis for the output representation. As a first step in the projected ResNet, we prolongate the 4 dimensional input data to the 10 dimensional representation using a dense matrix. The ResNet then learns the mapping between the prolongated input data representation and the 10 dimensional POD output basis. For each layer of the projected ResNet we use rank 4.
Fig. 1: The rectangular wing has a chord of $c = 1$, half wingspan of $3c$, a wing-tip cap of $0.06c$, and no twist angle, dihedral angle, or sweep angle.

Fig. 2: (a) FFD box used for optimization; (b) FFD control points. A total of 200 (10 sections × 20 FFD control points per section) geometric variables provide sufficient design flexibility.

We compare the adaptive projected ResNet strategy (AR) with an end-to-end trained identical projected ResNet (ER). We compare these two strategies to a truncated dense (TD) strategy that has hidden layer dimensions that are the same as the
POD basis rank, as well as a full space dense (FD) overparametrized strategy, where
the hidden layer dimensions are the same as the output dimension (200). We study the effects of the availability of training data and the depth of the neural networks. All networks use the same inner layer activation function (softplus). We compare the effects of the amount of training data, and repeat runs over 20 different independent shufflings of a larger dataset. We define generalization accuracy with respect to the $\ell^1(\mathbb{R}^{d_Q})$ norm, which is a common choice for aerodynamic shape optimization problems:
(4.1)    $\ell^1 \text{ Accuracy} = \mathbb{E}_\nu\left[1 - \frac{\|q - f\|_{\ell^1(\mathbb{R}^{d_Q})}}{\|q\|_{\ell^1(\mathbb{R}^{d_Q})}}\right]$
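For concreteness, the following is a minimal NumPy sketch of how a sample estimate of the normalized accuracy (4.1) (and its $\ell^2$ analogue (4.3) used later) can be computed from held-out test data; the test arrays are random placeholders.

```python
import numpy as np

def normalized_accuracy(q_true, q_pred, ord=1):
    """Sample estimate of E_nu[1 - ||q - f||_p / ||q||_p], cf. (4.1) and (4.3)."""
    num = np.linalg.norm(q_true - q_pred, ord=ord, axis=1)
    den = np.linalg.norm(q_true, ord=ord, axis=1)
    return np.mean(1.0 - num / den)

# Illustrative placeholder test data (N_test x d_Q).
q_true = np.random.randn(512, 200)
q_pred = q_true + 0.05 * np.random.randn(512, 200)
print(normalized_accuracy(q_true, q_pred, ord=1))   # l1 accuracy estimate
print(normalized_accuracy(q_true, q_pred, ord=2))   # l2 accuracy estimate
```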
In order to emphasize the relative strengths of each network architecture with limited
training data we run the trainings for 40 different instances of training data. We
report median generalization accuracies ±30%, as well as the maximum accuracies
in some cases.

Table 1: 3D wing optimization problem formulation.

             Function or variable            Description                      Quantity
minimize     C_D                             Drag coefficient
w.r.t.       x                               FFD control points               200
             λ                               Twist angle                      9
             α                               Angle of attack                  1
                                             Total design variables           210
subject to   C_L* = 0.2625                   Lift-coefficient constraint      1
             C_M > −0.092                    Moment-coefficient constraint    1
             0.8 V_0 ≤ V ≤ 1.5 V_0           Internal volume constraint       1
             0.8 t_base ≤ t ≤ 1.5 t_base     Thickness constraints            100
                                             Total constraints                103
Conditions   M = 0.5                         Mach number
             Re = 1 × 10^6                   Reynolds number

We begin by studying a comparison of the different networks for the depth 5 case.
In Figure 3 we see that the adaptive ResNet strategy outperforms the dense strategies in the low data limit; however, the full space dense network is able to perform about as well once around 100 data become available. The truncated dense networks and the end-to-end ResNet perform worse; as will become apparent in further results, the adaptive training routine consistently outperforms the end-to-end training.
Fig. 3: Depth 5 comparison of networks: $\ell^1$ testing accuracy (left: median, right: maximum) vs. number of training data, for AR (|w| = 510), ER (|w| = 510), FD (|w| = 202000), and TD (|w| = 2690).
Figure 4 shows similar results as we double the depth to 10. Again the adaptive ResNet outperforms all of the other models. As the depth increases the dense networks get harder to train, and the end-to-end ResNet tends to perform better.
Fig. 4: Depth 10 comparison of networks: $\ell^1$ testing accuracy vs. number of training data. Left: AR (|w| = 980), ER (|w| = 980), FD (|w| = 403000), TD (|w| = 3240); right: depth 5 (|w| = 510), depth 10 (|w| = 980), final AR (|w| = 980), final ER (|w| = 980).
Figure 5 shows the degradation of performance for the dense networks as more depth is added, while the performance of the ResNet is about the same. This shows that adding additional depth may not add much once the informed modes in the data are well represented, but that the ResNet strategy is a much safer way to add depth than using dense feedforward networks for parametric inference.
Fig. 5: Depth 20 comparison of networks: $\ell^1$ testing accuracy vs. number of training data. Left: AR (|w| = 1920), ER (|w| = 1920), FD (|w| = 805000), TD (|w| = 4340); right: depth 10 (|w| = 980), depth 20 (|w| = 1920), final AR (|w| = 1920), final ER (|w| = 1920).
These numerical results show that the adaptive ResNet strategy can outperform other conventional black-box neural network strategies when limited data are available, and additionally show that it does not suffer from the degraded performance that dense feedforward networks exhibit as more depth is added. In the next example, which has high dimensional inputs and outputs, we see more significant improvements over baseline networks.
4.2. Parametric Helmholtz Regression. For our last numerical example, we
consider parametric Helmholtz regression. Accurate surrogates for PDE parameter-
to-observable mappings can be used to make tractable the solution of high dimen-
sional inference and optimization problems, such as Bayesian inference and optimal experimental design. We consider a 2D Helmholtz parameter-to-observable problem similar to that of [38]. The parameter enters nonlinearly as a log-prefactor of the wavenumber $k$. The parameter-to-observable map is formulated as follows:
\begin{align}
-\Delta u - (k e^m)^2 u &= f \quad \text{in } \Omega, \tag{4.2a} \\
&\text{PML boundary condition on } \partial\Omega \setminus \Gamma_{\mathrm{top}}, \tag{4.2b} \\
\nabla u \cdot n &= 0 \quad \text{on } \Gamma_{\mathrm{top}}, \tag{4.2c} \\
q(m) = B u(m) &= [u(x^{(i)}, m)] \quad \text{at } x^{(i)} \in \Omega, \tag{4.2d} \\
\Omega &= (0, 3)^2. \tag{4.2e}
\end{align}
The random parameter $m$ is sampled from a Matérn Gaussian distribution with zero mean and covariance given by the inverse of an elliptic PDE operator: $C = (I - 5.0\, \nabla \cdot (\Theta \nabla))^{-2}$. The matrix $\Theta \in \mathbb{R}^{2 \times 2}$ introduces spatial anisotropy to the random field. The mesh used for this problem is $128 \times 128$, and the model parameter $m$ is represented by linear finite elements, making the input dimension $16{,}641 = (129)^2$. The problem has 100 observation points of a 2D wave field, so the output dimension is 200. The perfectly matched layer (PML) boundary conditions on the sides and bottom of the domain allow us to simulate a semi-infinite domain [9]; only the top surface can reflect the waves. The right hand side contains a single source point located at $(0.775, 2.85)$, and the observation points are located in a grid in $(0.575, 0.975) \times (2.75, 2.95)$; none of the observations take place at the source. The wavenumber for this problem is $9.118$. This is a notoriously difficult problem due to the highly oscillatory nature of the wave field solution of the Helmholtz equation, as well as the nonlinear dependence on the model parameter $m$. The problem is highly relevant to inference and optimization problems related to imaging and subsurface modeling.
As with the last example, we compare with truncated dense networks using the same breadth as the projected ResNet, and a full space dense network which has hidden layer dimensions of 200. In this case we also compare against identically architected projected ResNets which use active subspace and Karhunen–Loève expansion for the input dimension reduction. For output dimension reduction we employ POD in all cases. Projected networks using active subspace are referred to as derivative informed projected networks (DIPNet) [38]; in this case we refer to the ResNet version as DIPResNet, and the KLE analogue as KLEResNet. As with the last example we use layer ranks of 4 for the ResNet, and use softplus activation for all of the nonlinear layers. In this example we use the normalized $\ell^2$ accuracy to compare the different network strategies:
(4.3)    $\ell^2 \text{ Accuracy} = \mathbb{E}_\nu\left[1 - \frac{\|q - f\|_{\ell^2(\mathbb{R}^{d_Q})}}{\|q\|_{\ell^2(\mathbb{R}^{d_Q})}}\right]$.
We compare the different strategies as a function of the amount of training data
seen, as well as the depth. For this case we set aside 512 data for testing accuracy.
Additionally, validation data that are 25% the size of the training data are used for validation accuracy during the adaptive training of the ResNet. As noted before, we train the networks using Adam as they are being constructed, and then perform
an additional last optimization using LRSFN, which increases the generalization ac-
curacy; for a discussion of numerical results that provide empirical justification of this
strategy, see the Appendix.
We begin by considering a case of depth 5, starting with breadth $r = 32$ and increasing it to enrich the basis function representation. Figure 6 shows comparisons of the ResNet with the other dense models. The left part of Figure 6 shows that the adaptively trained DIPResNet outperformed all other models until the overparametrized full dense model begins outperforming the breadth 32, depth 5 adaptively trained DIPResNet around 1500 training data; this is an expected result in machine learning, that overparametrized networks should win in the high data limit (note however that the full dense strategy is not scalable and will suffer with mesh refinement; see [38]). The right plot compares the performance of the DIPResNet (which uses the active subspace basis) vs. the KLEResNet. This plot shows that while the two architectures initially have similar performance characteristics in the extremely low training data limit, the DIPResNet significantly outperforms the KLEResNet as more training data become available. Additionally, though, the adaptive KLEResNet outperforms the end-to-end trained DIPResNet until about 4,250 training data are available. This plot demonstrates that both the map informed reduced basis (AS) and the adaptive training strategy help improve accuracy when few training data are available.
Fig. 6: Depth 5, breadth 32 comparison of networks: $\ell^2$ testing accuracy vs. number of training data. Left: adaptive DIPResNet (|w| = 8060), end-to-end DIPResNet (|w| = 8060), full dense (|w| = 3529400), truncated dense (|w| = 543368); right: adaptive DIPResNet, end-to-end DIPResNet, adaptive KLEResNet, end-to-end KLEResNet (all |w| = 8060).
Since the Helmholtz problem is highly oscillatory, it makes sense that larger reduced basis representations would be needed to faithfully capture these higher order effects. Figures 7 and 8 demonstrate the effects of enriching both the input and output basis representations (as well as increasing the intermediate hidden neuron representations in the truncated dense network); the only constant between all of these figures is the full dense network. As more basis functions are added to the inputs and outputs, the adaptively trained DIPResNet reliably outperforms the overparametrized full dense network. The effect of increasing the breadth is to shift the accuracy curves up, as one might expect. Note that due to convergence issues the full dense accuracy is undefined for fewer than 500 training data; this demonstrates the robustness of the ResNet architecture to optimizer issues.
Fig. 7: Depth 5, breadth 48 comparison of networks: $\ell^2$ testing accuracy vs. number of training data. Left: adaptive DIPResNet (|w| = 11980), end-to-end DIPResNet (|w| = 11980), full dense (|w| = 3529400), truncated dense (|w| = 818024); right: adaptive DIPResNet, end-to-end DIPResNet, adaptive KLEResNet, end-to-end KLEResNet (all |w| = 11980).
Fig. 8: Depth 5, breadth 64 comparison of networks: $\ell^2$ testing accuracy vs. number of training data. Left: adaptive DIPResNet (|w| = 16296), end-to-end DIPResNet (|w| = 16296), full dense (|w| = 3529400), truncated dense (|w| = 1094728); right: adaptive DIPResNet, end-to-end DIPResNet, adaptive KLEResNet, end-to-end KLEResNet (all |w| = 16296).
In the next set of numerical experiments, we study the effects of increasing the depth of the networks; in particular, we are interested in the robustness of the architectural strategies to added depth. Below in Figures 9 and 10 we can see that the ResNet strategies are robust to adding more depth, while the full generic network tends to suffer. Interestingly, the end-to-end trained ResNets are able to perform better as more depth is added, but the dense networks are not.
Fig. 9: Depth 10, breadth 32 comparison of networks: $\ell^2$ testing accuracy vs. number of training data. Left: adaptive DIPResNet (|w| = 9520), end-to-end DIPResNet (|w| = 9520), full dense (|w| = 3730400), truncated dense (|w| = 548648); right: adaptive DIPResNet, end-to-end DIPResNet, adaptive KLEResNet, end-to-end KLEResNet (all |w| = 9520).
Fig. 10: Depth 10, breadth 48 comparison of networks: $\ell^2$ testing accuracy vs. number of training data. Left: adaptive DIPResNet (|w| = 14160), end-to-end DIPResNet (|w| = 14160), full dense (|w| = 3730400), truncated dense (|w| = 829784); right: adaptive DIPResNet, end-to-end DIPResNet, adaptive KLEResNet, end-to-end KLEResNet (all |w| = 14160).
The issues with overparametrization can be seen acutely in Figure 11, where as more depth is added the ResNet strategies (both adaptive and one-shot training) consistently maintain accuracy, while the dense strategies' performance deteriorates significantly. This plot suggests that the projected ResNet architectures are less susceptible to the depth "peaking phenomenon" [17], where for fixed training data the accuracy of a network improves initially as depth is added, but eventually the performance begins to deteriorate. Note that in all cases the training error reaches zero, so this suggests that the projected neural networks may be better suited to maintaining generalization accuracy, in comparison to the other dense strategies.
Fig. 11: Depth 20, breadth 64 comparison of networks: $\ell^2$ testing accuracy vs. number of training data, for adaptive DIPResNet (|w| = 24946), end-to-end DIPResNet (|w| = 24946), full dense (|w| = 4132400), and truncated dense (|w| = 1157128).
5. Summary and Conclusion. In this work, we propose an adaptive projected
ResNet surrogate strategy for learning high dimensional parametric maps from limited
training data. This type of surrogate can achieve high accuracy for limited training
data, and can help make possible the accurate solution of high dimensional outer-loop problems using a limited number of queries of an expensive high fidelity direct numerical simulation.
Many high dimensional parametric mappings admit low dimensional structure due to physics, sparse observations of field quantities, geometry of the data, concentration of measure, or other mathematical structure. This inspires a highly scalable surrogate strategy: parametrization of the full map between mapping-informed reduced bases of both the inputs and outputs [4, 38]. In this work we motivate, both theoretically and algorithmically, an architectural strategy for learning the reduced nonlinear mapping via adaptively trained projected ResNets. We begin by proving a universal approximation property of this class of functions. The error bounds that we derive are based on theory for input-output reduced ridge function approximations of high dimensional parametric mappings, and on recent theory establishing a connection between ResNets and control flow function approximations [21]. This theoretical construction of the approximation capabilities of ResNets motivates our adaptive algorithm for constructing these surrogates.
Two numerical examples demonstrate the efficacy of these adaptively trained projected ResNets, one from aerodynamic shape optimization and another from parametric PDE inference. The results show that the projected ResNets can achieve superior generalization accuracy relative to other conventional deep learning methods, while having much smaller weight spaces. Additionally, the numerical results demonstrate that the adaptively constructed ResNets uniformly outperformed the exact same architectures trained end-to-end, showing the benefit of merging the construction and optimization processes, as the theory suggests.
Conventional wisdom in machine learning suggests that the best strategy for reconstructing input-output mappings is to build overparametrized neural networks whose weight dimensions are much larger than the cardinality of the available training data; the associated large configuration spaces give the optimizer more leeway to fit the data. The typical setting for machine learning, however, is "big data," where large training datasets are available for the empirical risk minimization problem (or the model is already well initialized, e.g., via transfer learning). In the setting that we are concerned with, the opposite seems to hold: when one is left with few samples of a very high dimensional map, it can be a liability to overparametrize.
A last benefit worth noting is the computational economy of our proposed approach. Improving the computational economy of neural networks (e.g., via "pruning" [5]) is an issue of recent concern; in order to scale machine learning models to energy and memory constrained environments, parsimonious network surrogates are needed. This approach attempts to build an already "pruned" model from scratch.
Appendix A. Proof of Theorem 2.2.
Let $\epsilon > 0$ be arbitrary, let $\chi_K$ denote the characteristic function of the compact set $K \subset\subset \mathbb{R}^{d_M}$, and let $K_r \subset\subset \mathbb{R}^r$ be the restriction of $K$ to the reduced basis $V_r \in \mathbb{R}^{d_M \times r}$, i.e., $K_r = \{V_r^T m \,|\, m \in K\}$. Note that since we are in finite dimensions we can work with probability densities for $\nu(m)$, which we refer to as $\pi(m)$.
Since $q \in L^2(\mathbb{R}^{d_M}, \nu; \mathbb{R}^{d_Q})$, the ridge function approximation $\Phi_r q_r$ is as well if $\zeta_r < \infty$. Since $\|\Phi_r\|_{\ell^2(\mathbb{R}^{d_Q \times r})} = 1$, $q_r$ is in $L^2(\mathbb{R}^{d_M}, \nu; \mathbb{R}^r)$, and can be arbitrarily well approximated by a continuous function $q_r^{\mathrm{cont}} \in C^1(\mathbb{R}^r)$ when restricted to $K$, since continuous functions are dense in $L^2$. Let $q_r^{\mathrm{cont}}$ be such that $\mathbb{E}_\nu[\|q_r - q_r^{\mathrm{cont}}\|_{\ell^2(\mathbb{R}^r)} \chi_{K_r}] < \frac{\epsilon}{3}$.
Proposition 4.11 in [21] states that any continuous function $q_r^{\mathrm{cont}} \in C(\mathbb{R}^r)$ can be approximated arbitrarily well by a finite time control flow representation, so long as the set of right hand sides $\mathcal{F}$ for the control flow is closed under affine operations, and the closure of this set under the topology of compact convergence contains a well-function. The restricted affine invariance requires that if $f \in \mathcal{F}$, then $Df(A\cdot + b)$ is also in $\mathcal{F}$, for $D, A$ diagonal matrices in $\mathbb{R}^{r \times r}$ with diagonal entries $d_i = \pm 1$ and $a_i \leq 1$, and $b \in \mathbb{R}^r$ arbitrary. We note that the family of right hand sides associated with the continuous analogue of the ResNet satisfies this property. The well-function property requires that the activation function used in the ResNet can be used to build a function that is arbitrarily close to zero when restricted to an open bounded set (e.g., ReLU, sigmoid, tanh, etc.). For a lengthier discussion of these requirements see section 2 in [21]. Proposition 4.11 along with these properties guarantees that there exists a finite time $T < \infty$ control flow mapping of the form
(A.1a) $\dfrac{dz}{dt} = w_1(t)\,\sigma(w_0(t)z + b(t))$,
(A.1b) $z(0) = V_r^T m$,
(A.1c) $\xi_T(m) = z(T; m)$,
such that $\int_{K_r} \|\xi_T - q_r^{\mathrm{cont}}\|_{\ell^2(\mathbb{R}^r)}\, dm < \frac{\epsilon}{3}$. Note that the mapping is effectively $z : (0, T] \times \mathbb{R}^r \to \mathbb{R}^r$, but is parametrized over $\mathbb{R}^{d_M}$ via $V_r^T m$. We can extend this result to integration with respect to the probability density function $\nu$ by noting that, by Cauchy--Schwarz,
(A.2) $\displaystyle\int_K \|\xi_T - q_r^{\mathrm{cont}}\|_{\ell^2(\mathbb{R}^r)}\, d\nu(m) = \int_K \|\xi_T - q_r^{\mathrm{cont}}\|_{\ell^2(\mathbb{R}^r)}\, \pi(m)\, dm \leq \int_K \|\xi_T - q_r^{\mathrm{cont}}\|_{\ell^2(\mathbb{R}^r)}\, dm$.
The system (A.1) can be approximated to arbitrary precision via an explicit Euler discretization with time step $\Delta t$, which yields the ResNet
(A.3) $z_{k+1} = z_k + \Delta t\, w_{1k}\, \sigma(w_{0k} z_k + b_k)$.
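To make the correspondence between the discretized control flow and a ResNet concrete, the following minimal NumPy sketch iterates the recursion (A.3) for a generic choice of weights. The dimensions, activation, and random weights are purely illustrative placeholders and are not the trained quantities used in the paper.

```python
import numpy as np

def resnet_flow(z0, W0, W1, b, dt, sigma=np.tanh):
    """Iterate the explicit Euler / ResNet recursion
    z_{k+1} = z_k + dt * W1_k @ sigma(W0_k @ z_k + b_k)   (cf. (A.3)).

    W0, W1, b are lists of per-layer (per time step) weights."""
    z = z0.copy()
    for W0k, W1k, bk in zip(W0, W1, b):
        z = z + dt * W1k @ sigma(W0k @ z + bk)
    return z

# Illustrative usage: r-dimensional reduced state, `depth` layers, T = depth * dt.
r, depth, dt = 4, 10, 0.1
rng = np.random.default_rng(0)
W0 = [rng.standard_normal((r, r)) for _ in range(depth)]
W1 = [rng.standard_normal((r, r)) for _ in range(depth)]
b = [rng.standard_normal(r) for _ in range(depth)]
z0 = rng.standard_normal(r)          # plays the role of V_r^T m
zT = resnet_flow(z0, W0, W1, b, dt)  # approximates xi_T(m) = z(T; m)
```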
Assuming the right hand side of (A.1a) is Lipschitz with bound $L_{\mathrm{ResNet}}$, and that the true solution $z(t; m)$ of (A.1) is itself twice differentiable for all $t \in (0, T)$ with $\max_{t \in (0, T]} \left\|\frac{\partial^2 z}{\partial t^2}\right\|_{\ell^2(\mathbb{R}^r)} \leq M$ for all $m \in K$, then the global truncation error for the explicit Euler approximation can be bounded by
(A.4) $\displaystyle\int_K \|\xi_T^{\mathrm{E.E.}}(m) - \xi_T(m)\|_{\ell^2(\mathbb{R}^r)}\, d\nu(m) \leq \frac{e^{L_{\mathrm{ResNet}} T} - 1}{2 L_{\mathrm{ResNet}}}\, M\, \Delta t\, |K| \leq \frac{\epsilon}{3}$,
see [40]. The requirement on $\Delta t$ becomes
(A.5) $\Delta t \leq \dfrac{2\epsilon\, L_{\mathrm{ResNet}}}{3 M |K| \left(e^{L_{\mathrm{ResNet}} T} - 1\right)}$.
For homogeneous time steps we have $\Delta t = \frac{T}{\mathrm{depth}}$, which gives the following bound:
(A.6) $\mathrm{depth} \geq \dfrac{3 M |K| T \left(e^{L_{\mathrm{ResNet}} T} - 1\right)}{2 \epsilon\, L_{\mathrm{ResNet}}}$.
Combining all of these results together we can bound as follows:
(A.7) $\mathbb{E}_\nu\big[\|\xi_T^{\mathrm{E.E.}} - q_r\|_{\ell^2(\mathbb{R}^r)}\big] \leq \mathbb{E}_\nu\big[\|\xi_T^{\mathrm{E.E.}} - \xi_T\|_{\ell^2(\mathbb{R}^r)} + \|\xi_T - q_r^{\mathrm{cont}}\|_{\ell^2(\mathbb{R}^r)} + \|q_r^{\mathrm{cont}} - q_r\|_{\ell^2(\mathbb{R}^r)}\big] \leq \frac{\epsilon}{3} + \frac{\epsilon}{3} + \frac{\epsilon}{3}$.
The final result comes from setting $\epsilon = \zeta_r$ (since $\epsilon$ was arbitrary), and then noting that $\xi_T^{\mathrm{E.E.}}(m, w) = f_r(m, w)$,
(A.8) $\displaystyle\int_K \|q(m) - \widehat{\Phi}_r f_r(V_r^T m, w)\|_{\ell^2(\mathbb{R}^{d_Q})}\, d\nu(m) \leq \int_K \|q(m) - \widehat{\Phi}_r q_r(V_r^T m)\|_{\ell^2(\mathbb{R}^{d_Q})} + \|\widehat{\Phi}_r \big(q_r(V_r^T m) - f_r(V_r^T m, w)\big)\|_{\ell^2(\mathbb{R}^{d_Q})}\, d\nu(m)$,
and
(A.9) $\|\widehat{\Phi}_r \big(q_r(V_r^T m) - f_r(V_r^T m, w)\big)\|_{\ell^2(\mathbb{R}^{d_Q})} \leq \|\widehat{\Phi}_r\|_{\ell^2(\mathbb{R}^{d_Q \times r})}\, \|q_r(V_r^T m) - f_r(V_r^T m, w)\|_{\ell^2(\mathbb{R}^r)} \leq \|q_r(V_r^T m) - f_r(V_r^T m, w)\|_{\ell^2(\mathbb{R}^r)}$,
since $\widehat{\Phi}_r$ is orthonormal.
Note additionally that the $e^{T}$ factor in the depth complexity comes only from the global truncation error bound for explicit Euler, which is conservative. In [21] the control flows constructed are actually piecewise constant in time, which is sufficient since simple functions are dense in the continuous functions. In that case explicit Euler is an exact time integrator, and the depth complexity can be reduced to $\mathcal{O}\!\left(\frac{T|K|}{\epsilon}\right)$. The bound we give above, however, is more general since it covers discretizations of continuous-in-time control flows.
Appendix B. Numerical Results Appendix.
B.1. MACH-Aero Design Framework. The MACH design framework targets Multidisciplinary design optimization of Aircraft Configurations with High fidelity, while MACH-Aero (https://github.com/mdolab/MACH-Aero) applies MACH to aerodynamic design optimization. Aerodynamic design with MACH-Aero sets up an optimization problem through the Python interface pyOptSparse [46], starts with an initial design q0 under specified design requirements m, and uses gradient information to find the optimal airfoil design (control points and angle of attack). The steps are as follows; a schematic sketch of the loop is given below. (1) A baseline design volume mesh is generated using pyHyp [43]; this mesh is deformed for any given value of the design variables. (2) The optimizer SNOPT (Sparse Nonlinear OPTimizer) [14] updates the design variables and sends the new design to the geometry parameterization. (3) The geometry parameterization module pyGeo [19] performs the geometry deformation, and computes the values of the geometric constraints and the corresponding gradients. (4) The volume mesh deformation module generates the deformed mesh based on the deformed geometry. (5) The CFD module (ADflow [26] or DAFoam [15]) computes high fidelity forward and adjoint flow fields on the deformed mesh, and sends the objective function and constraint values, along with the computed gradient information, back to the optimizer. The process is iterated until an optimal design q(m) is found that satisfies the optimality and feasibility conditions.
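The following minimal sketch illustrates only the structure of this design loop; it is not MACH-Aero code. A quadratic objective with an analytic gradient stands in for the CFD forward/adjoint solve, a plain gradient step stands in for the SNOPT update, and the geometry and mesh deformation steps are omitted. All function names are hypothetical placeholders, not actual library APIs.

```python
import numpy as np

def solve_flow_and_adjoint(q, m):
    """Stand-in for step (5): return an objective and its gradient for design q
    at "requirements" m (here the minimizer of a toy quadratic)."""
    residual = q - m
    return 0.5 * residual @ residual, residual

def optimizer_update(q, gradient, step=0.5):
    """Stand-in for step (2), the SNOPT design-variable update."""
    return q - step * gradient

def design_loop(m, q0, max_iters=100, tol=1e-8):
    """Iterate the gradient-based design loop described above until convergence."""
    q = q0.copy()
    for _ in range(max_iters):
        objective, gradient = solve_flow_and_adjoint(q, m)
        if np.linalg.norm(gradient) < tol:
            break
        q = optimizer_update(q, gradient)
    return q  # plays the role of the optimal design q(m)

m = np.array([0.3, -0.1, 0.7])   # toy "design requirements"
q0 = np.zeros(3)                 # toy initial design
q_opt = design_loop(m, q0)
```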
B.2. Aerodynamic Dataset Split. Latin hypercube sampling (LHS) [29], an advanced version of Monte Carlo sampling, has been widely used in various engineering fields. In this work, we generate the original data set using LHS, infill more samples using augmented LHS (augLHS) [44], and extract subsets from the whole data set using conditioned LHS (cLHS) [30].
Instead of generating completely random samples, LHS divides the space $[0,1]^M$ into $N$ evenly distributed sub-intervals in each dimension, where $M$ is the dimension of the design-variable space and $N$ is the number of samples to be generated. Taking $M = 2$ as an example, this division produces an $N \times N$ grid. LHS sweeps through the columns searching for empty rows, selects a random one among them, and generates a value (within the sub-interval) for each column. The generated values are used as cumulative distribution function (CDF) values of the corresponding pre-assigned probability distributions, such as uniform and Gaussian distributions; the inverse CDF values are the sampled parameters.
Infill sampling using augLHS enlarges a small dataset while maintaining the characteristics of LHS. In particular, augLHS follows the same sweeping principle as LHS, except that it divides the original design-variable space into a $(N + N_{\mathrm{new}}) \times (N + N_{\mathrm{new}})$ grid, where $N_{\mathrm{new}}$ is the number of new samples to be added to the small dataset; the remaining steps are the same as for LHS. Extracting data samples while maintaining the LHS characteristics plays an important role in comparing the various neural network surrogates on extracted small datasets. We view cLHS as an inverse process of augLHS: it divides the original design-variable space into an $N_{\mathrm{sub}} \times N_{\mathrm{sub}}$ grid, where $N_{\mathrm{sub}} < N$ is the number of samples to be extracted. In this work, we use the built-in functions for the LHS, augLHS, and cLHS methods within the R programming language [18]. For the purpose of reproducibility, we set the random seed for LHS and augLHS to an arbitrary number, 42, and set the cLHS random seeds to the 50 integers in [0, 49] to extract small data sets for comparing the neural network surrogate approaches. Specifically, we generated 500 LHS data samples, infilled 500 more samples using augLHS, and then let cLHS handle the data extraction.
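As a point of reference only, the basic LHS mechanics described above (stratified unit-hypercube samples mapped through the inverse CDFs of pre-assigned marginals) can be sketched with SciPy as follows. This is an illustrative alternative to the R packages actually used in this work; the dimension, sample count, seed, and marginal distributions below are arbitrary.

```python
import numpy as np
from scipy.stats import qmc, norm, uniform

M, N = 2, 10                        # design-variable dimension and number of samples
sampler = qmc.LatinHypercube(d=M, seed=42)
u = sampler.random(n=N)             # stratified samples in [0, 1]^M, one per row

# Interpret the unit-hypercube values as CDF values of pre-assigned marginals
# and map them back through the inverse CDFs:
x0 = uniform(loc=-1.0, scale=2.0).ppf(u[:, 0])   # uniform marginal on [-1, 1]
x1 = norm(loc=0.0, scale=0.5).ppf(u[:, 1])       # Gaussian marginal
samples = np.column_stack([x0, x1])              # the sampled design variables
```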
B.3. Two Step Optimization. We use an Adam optimizer with default settings for a small number of epochs during each adaptive layer training (the total number of epochs sums to 50), and then perform one final end-to-end training with LRSFN [36] for 50 "epoch equivalent" neural network sweeps (i.e., forward and backward passes). In this case we allow all intermediate ResNet layers (and the output layer) to be trained at each adaptive step, so that Adam has "converged" over this process. Below we compare the results of performing this last training stage with Adam and with LRSFN using Hessian rank r = 40. We used a gradient batch size of 2 for Adam, and a gradient batch size of 4 with a Hessian batch size of 2 for LRSFN; both optimizers use a fixed step size of α = 1e−3. Representative results are shown in the figures below, and they are consistent with the pattern seen across configurations.
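Before the comparison figures, the following rough Keras sketch illustrates the two-stage structure of this procedure: a first training pass with Adam, followed by a second, final refinement pass. The actual LRSFN refinement is performed with the hessianlearn package [34], whose API is not reproduced here; a small-step Adam pass is substituted purely as a placeholder for that second stage, and the toy data, architecture, and hyperparameters are illustrative only.

```python
import numpy as np
import tensorflow as tf

# Toy data standing in for reduced-basis input/output pairs (r_in = r_out = 8).
rng = np.random.default_rng(0)
X = rng.standard_normal((128, 8)).astype("float32")
Y = np.tanh(X @ rng.standard_normal((8, 8)).astype("float32"))

# A small ResNet-style model acting in the reduced space.
inputs = tf.keras.Input(shape=(8,))
z = inputs
for _ in range(5):                                    # depth-5 stack of residual blocks
    h = tf.keras.layers.Dense(8, activation="tanh")(z)
    z = tf.keras.layers.Add()([z, tf.keras.layers.Dense(8)(h)])
outputs = tf.keras.layers.Dense(8)(z)
model = tf.keras.Model(inputs, outputs)

# Stage 1: Adam with default settings for the (adaptive) training epochs.
model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")
model.fit(X, Y, batch_size=2, epochs=50, verbose=0)

# Stage 2: final end-to-end refinement. The paper uses LRSFN (hessianlearn [34]);
# here a fixed-step Adam pass is substituted purely as a placeholder.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss="mse")
model.fit(X, Y, batch_size=4, epochs=50, verbose=0)
```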
Fig. 12: Depth 5 Breadth 64 Optimizer Comparison. [ℓ2 testing accuracy versus the number of training data. Left: DIPResNet trained with Adam versus DIPResNet refined with LRSFN (|w| = 16296). Right: KLEResNet trained with Adam versus KLEResNet refined with LRSFN (|w| = 16296).]
Fig. 13: Depth 10 Breadth 48 Optimizer Comparison. [ℓ2 testing accuracy versus the number of training data. Left: DIPResNet trained with Adam versus DIPResNet refined with LRSFN (|w| = 14160). Right: KLEResNet trained with Adam versus KLEResNet refined with LRSFN (|w| = 14160).]
REFERENCES
[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., TensorFlow: A system for large-scale machine learning, in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.
[2] M. S. Alnæs, J. Blechta, J. Hake, A. Johansson, B. Kehlet, A. Logg, C. Richardson,
J. Ring, M. E. Rognes, and G. N. Wells,The fenics project version 1.5, Archive of
Numerical Software, 3 (2015), https://doi.org/10.11588/ans.2015.100.20553.
[3] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind,Automatic differenti-
ation in machine learning: a survey, Journal of machine learning research, 18 (2018).
[4] K. Bhattacharya, B. Hosseini, N. B. Kovachki, and A. M. Stuart,Model reduction and
neural networks for parametric pdes, arXiv preprint arXiv:2005.03180, (2020).
[5] D. Blalock, J. J. G. Ortiz, J. Frankle, and J. Guttag,What is the state of neural network
pruning?, arXiv preprint arXiv:2003.03033, (2020).
[6] K. Bollinger and H. Schaeffer,Reduced order modeling using shallow relu networks with
grassmann layers, arXiv preprint arXiv:2012.09940, (2020).
[7] K. H. R. Chan, Y. Yu, C. You, H. Qi, J. Wright, and Y. Ma,Redunet: A white-box deep
network from the principle of maximizing rate reduction, arXiv preprint arXiv:2105.10446,
(2021).
[8] R. T. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud,Neural ordinary differential
equations, arXiv preprint arXiv:1806.07366, (2018).
[9] W. Chew and Q. Liu,Perfectly matched layers for elastodynamics: a new absorbing boundary
condition, Journal of Computational Acoustics, 4 (1996), pp. 341–359.
[10] G. Cybenko,Approximation by superpositions of a sigmoidal function, Mathematics of control,
signals and systems, 2 (1989), pp. 303–314.
[11] C. Dong, L. Liu, Z. Li, and J. Shang,Towards adaptive residual network training: A neural-
ode perspective, in International conference on machine learning, PMLR, 2020, pp. 2616–
2626.
[12] W. E, C. Ma, and L. Wu,Barron spaces and the compositional function spaces for neural
network models, arXiv preprint arXiv:1906.08039, (2019).
[13] S. Fresca and A. Manzoni,Pod-dl-rom: enhancing deep learning-based reduced order models
for nonlinear parametrized pdes by proper orthogonal decomposition, Computer Methods
in Applied Mechanics and Engineering, 388 (2022), p. 114181.
[14] P. E. Gill, W. Murray, and M. A. Saunders,SNOPT: An SQP algorithm for large-scale
constrained optimization, SIAM Journal of Optimization, 12 (2002), pp. 979–1006, https:
//doi.org/10.1137/S1052623499350013.
[15] P. He, C. A. Mader, J. R. R. A. Martins, and K. J. Maki,DAFoam: An open-source adjoint
framework for multidisciplinary design optimization with OpenFOAM, AIAA Journal, 58
(2020), https://doi.org/10.2514/1.J058853.
[16] K. Hornik,Approximation capabilities of multilayer feedforward networks, Neural networks,
4 (1991), pp. 251–257.
[17] G. Hughes,On the mean accuracy of statistical pattern recognizers, IEEE transactions on
information theory, 14 (1968), pp. 55–63.
[18] R. Ihaka,R : Past and future history, Technical Report, (1998), https://doi.org/10.1.1.331.299.
[19] G. K. W. Kenway, G. J. Kennedy, and J. R. R. A. Martins,A CAD-free approach to
high-fidelity aerostructural optimization, in Proceedings of the 13th AIAA/ISSMO Multi-
disciplinary Analysis Optimization Conference, Fort Worth, TX, September 2010. AIAA
2010-9231.
[20] N. Kovachki, Z. Li, B. Liu, K. Azizzadenesheli, K. Bhattacharya, A. Stuart, and
A. Anandkumar,Neural operator: Learning maps between function spaces, arXiv pre-
print arXiv:2108.08481, (2021).
[21] Q. Li, T. Lin, and Z. Shen,Deep learning via dynamical systems: An approximation perspec-
tive, arXiv preprint arXiv:1912.10382, (2019).
[22] Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, and
A. Anandkumar,Neural operator: Graph kernel network for partial differential equa-
tions, arXiv preprint arXiv:2003.03485, (2020).
[23] H. Lin and S. Jegelka,Resnet with one-neuron hidden layers is a universal approximator,
arXiv preprint arXiv:1806.10909, (2018).
[24] L. Lu, P. Jin, and G. E. Karniadakis,Deeponet: Learning nonlinear operators for identifying
differential equations based on the universal approximation theorem of operators, arXiv
preprint arXiv:1910.03193, (2019).
[25] Z. Lu, H. Pu, F. Wang, Z. Hu, and L. Wang, The expressive power of neural networks: A view from the width, in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 6232–6240.
[26] C. A. Mader, G. K. W. Kenway, A. Yildirim, and J. R. R. A. Martins,ADflow—an
open-source computational fluid dynamics solver for aerodynamic and multidisciplinary
optimization, Journal of Aerospace Information Systems, (2020), https://doi.org/10.2514/
1.I010796.
[27] A. Manzoni, F. Negri, and A. Quarteroni,Dimensionality reduction of parameter-
dependent problems through proper orthogonal decomposition, Annals of Mathematical
Sciences and Applications, 1 (2016), pp. 341–377.
[28] P.-G. Martinsson and J. A. Tropp,Randomized numerical linear algebra: Foundations and
algorithms, Acta Numerica, 29 (2020), pp. 403–572.
[29] M. D. McKay, R. J. Beckman, and W. J. Conover,A comparison of three methods for
selecting values of input variables in the analysis of output from a computer code, Tech-
nometrics, 21 (1979), pp. 239–245.
[30] B. Minasny and A. Mcbratney,A conditioned latin hypercube method for sampling in the
presence of ancillary information, Computers & Geosciences, 32 (2006), pp. 1378–1388,
https://doi.org/10.1016/j.cageo.2005.12.009.
[31] N. H. Nelsen and A. M. Stuart,The random feature model for input-output maps between
banach spaces, arXiv preprint arXiv:2005.10224, (2020).
[32] H. V. Nguyen and T. Bui-Thanh,Model-constrained deep learning approaches for inverse
problems, arXiv preprint arXiv:2105.12033, (2021).
[33] T. O’Leary-Roseberry,Efficient and dimension independent methods for neural network
surrogate construction and training, PhD thesis, 2020.
[34] T. O’Leary-Roseberry,hessianlearn: Stochastic Nonconvex Optimization in TensorFlow
and keras, 2021, https://doi.org/10.5281/zenodo.4608644,https://github.com/tomoleary/
hessianlearn.
[35] T. O’Leary-Roseberry, N. Alger, and O. Ghattas,Inexact Newton methods for stochas-
tic nonconvex optimization with applications to neural network training, arXiv preprint
arXiv:1905.06738, (2019).
[36] T. O’Leary-Roseberry, N. Alger, and O. Ghattas,Low rank saddle free Newton: A scalable
method for stochastic nonconvex optimization, arXiv preprint arXiv:2002.02881, (2020).
[37] T. O’Leary-Roseberry and U. Villa,hippyflow: Dimension reduced surrogate construc-
tion for parametric PDE maps in Python, 2021, https://doi.org/10.5281/zenodo.4608729,
https://github.com/hippylib/hippyflow.
[38] T. O’Leary-Roseberry, U. Villa, P. Chen, and O. Ghattas,Derivative-informed projected
neural networks for high-dimensional parametric maps governed by pdes, Computer Meth-
ods in Applied Mechanics and Engineering, 388 (2022), p. 114199.
[39] A. Quarteroni, A. Manzoni, and F. Negri,Reduced basis methods for partial differential
equations: an introduction, vol. 92, Springer, 2015.
[40] A. Quarteroni, R. Sacco, and F. Saleri,Numerical mathematics, vol. 37, Springer Science
& Business Media, 2010.
[41] L. Ruthotto and E. Haber,Deep neural networks motivated by partial differential equations,
Journal of Mathematical Imaging and Vision, 62 (2020), pp. 352–364.
[42] C. Schwab and R. A. Todor, Karhunen–Loève approximation of random fields by generalized fast multipole methods, Journal of Computational Physics, 217 (2006), pp. 100–122.
[43] N. Secco, G. K. W. Kenway, P. He, C. A. Mader, and J. R. R. A. Martins,Efficient mesh
generation and deformation for aerodynamic shape optimization, AIAA Journal, (2021),
https://doi.org/10.2514/1.J059491.
[44] M. Stein,Large sample properties of simulations using latin hypercube sampling, Technomet-
rics, 29 (1987), pp. 143–151, https://doi.org/10.1080/00401706.1987.10488205.
[45] U. Villa, N. Petra, and O. Ghattas,hIPPYlib: An Extensible Software Framework for
Large-Scale Inverse Problems Governed by PDEs; Part I: Deterministic Inversion and
Linearized Bayesian Inference, Transactions on Mathematical Software, in print (2020),
https://arxiv.org/abs/1909.03948.
[46] N. Wu, G. Kenway, C. A. Mader, J. Jasa, and J. R. R. A. Martins,pyoptsparse: A python
framework for large-scale constrained nonlinear optimization of sparse systems, Journal of
Open Source Software, 5 (2020), p. 2564, https://doi.org/10.21105/joss.02564.
[47] A. Yaguchi, T. Suzuki, S. Nitta, Y. Sakata, and A. Tanizawa,Scalable deep neural networks
via low-rank matrix factorization, (2019).
[48] O. Zahm, P. G. Constantine, C. Prieur, and Y. M. Marzouk,Gradient-based dimension
reduction of multivariate vector-valued functions, SIAM Journal on Scientific Computing,
42 (2020), pp. A534–A558.