The Annals of Applied Statistics
2015, Vol. 9, No. 3, 1141–1168
DOI: 10.1214/15-AOAS850
© Institute of Mathematical Statistics, 2015
CALIBRATING A LARGE COMPUTER EXPERIMENT
SIMULATING RADIATIVE SHOCK HYDRODYNAMICS
BY ROBERT B. GRAMACY, DEREK BINGHAM, JAMES PAUL HOLLOWAY,
MICHAEL J. GROSSKOPF, CAROLYN C. KURANZ, ERICA RUTTER,
MATT TRANTHAM AND R. PAUL DRAKE
University of Chicago, Simon Fraser University and University of Michigan
We consider adapting a canonical computer model calibration apparatus,
involving coupled Gaussian process (GP) emulators, to a computer experi-
ment simulating radiative shock hydrodynamics that is orders of magnitude
larger than what can typically be accommodated. The conventional approach
calls for thousands of large matrix inverses to evaluate the likelihood in an
MCMC scheme. Our approach replaces that costly ideal with a thrifty take
on essential ingredients, synergizing three modern ideas in emulation, cali-
bration and optimization: local approximate GP regression, modularization,
and mesh adaptive direct search. The new methodology is motivated both by
necessity—considering our particular application—and by recent trends in
the supercomputer simulation literature. A synthetic data application allows
us to explore the merits of several variations in a controlled environment and,
together with results on our motivating real-data experiment, lead to notewor-
thy insights into the dynamics of radiative shocks as well as the limitations of
the calibration enterprise generally.
1. Introduction. Rapid increases in computational power have made com-
puter models (or simulators) commonplace as a way to explore complex physical
systems, particularly as an alternative to expensive field work or physical exper-
imentation. Computer models typically idealize the phenomenon being studied,
inducing bias, while simultaneously having more parameters than correspond to
known/controlled quantities in the field. Those extra “knobs” must be adjusted to
make the simulator match reality. Computer model calibration involves finding
values of such inputs, so that simulations agree with data observed in physical ex-
periments to the extent possible, and accounting for any biases in predictions based
on new simulations.
Here, we are interested in computer model calibration for experiments on ra-
diative shocks. These are challenging to simulate because both hydrodynamic
and radiation transport elements are required to describe the physics. The Uni-
versity of Michigan’s Center for Radiative Shock Hydrodynamics (CRASH) is
tasked with modeling a particular high-energy laser radiative shock system. The
Received October 2014; revised June 2015.
Key words and phrases. Emulator, tuning, nonparametric regression, big data, local Gaussian pro-
cess, mesh adaptive direct search (MADS), modularization.
CRASH team developed a code outputting a space–time field that describes the
evolution of a shock given specified initial conditions (the inputs), and has col-
lected outputs for almost 27,000 such cases. The code has two inputs involved in
addressing known deficiencies in the mathematical model, but which don’t directly
correspond to physical conditions. Our goal is to find values for these inputs, by
calibrating the simulator to a limited amount of field data available from an ear-
lier study, while simultaneously learning relationships governing the signal shared
between simulated and field processes in order to make predictions under novel
physical regimes.
Kennedy and O’Hagan (2001) were the first to propose a statistical framework
for such situations: a hierarchical model linking noisy field measurements from
the physical system to the potentially biased output of a computer model run with
the “true” (but unknown) value of any calibration parameters not controlled in the
field. The backbone of the framework is a pair of coupled Gaussian process (GP)
priors for (a) simulator output and (b) bias. The hierarchical nature of the model,
paired with Bayesian posterior inference, allows both data sources (simulated and
field) to contribute to joint estimation of all unknowns.
The GP is a popular prior for deterministic computer model output [Sacks et al.
(1989)]. In that context, GP predictors are known as surrogate models or em-
ulators, and they have many desirable accuracy and coverage properties. How-
ever, their computational burden severely limits the size of training data sets—
to as few as 1000 input–output pairs in many common setups—and that burden
is compounded when emulators are nested inside larger frameworks, as in com-
puter model calibration. Consequently, new methodology is required when there
are moderate to large numbers of computer model trials, which is increasingly
common in the simulation literature [e.g., Kaufman et al. (2011), Paciorek et al.
(2013)].
Calibrating the radiative shock experiment requires a thriftier apparatus along
several dimensions: to accommodate large simulation data, but also to recognize
and exploit a massive discrepancy between the relative sizes of computer and field
data sets. First, we modularize the model fitting [Liu, Bayarri and Berger (2009)]
and construct the emulator using only the simulator outputs, that is, ignoring the in-
formation from field data at that stage. Unlike Liu, Bayarri and Berger, who argued
for modularization on philosophical grounds, we do this for purely computational
reasons. Second, we insert a local approximate GP [Gramacy and Apley (2015)] in
place of the traditional GP emulator. We argue that the locality of the approxima-
tion is particularly handy in the calibration context which only requires predictions
at a small number of field data sites. Finally, we illustrate how mesh adaptive direct
search [Audet and Dennis (2006)]—acting as glue between the computer model,
bias and noisy field data observations—can quickly provide good values of cali-
bration parameters and, as a byproduct, enough useful distributional information
to replace an expensive posterior sampling.
The remainder of the paper is outlined as follows. Section 2 describes the radiative shock application and our goals in more detail. Section 3 then reviews the canonical calibration apparatus with a focus on limitations and remedies, including approximate GP emulation. Section 4 outlines the recipe designed to meet the goals of the project. Illustrations on synthetic data are provided in Section 5, demonstrating proof of concept, exploring variations and discussing limitations. We return to the motivating example in Section 6 equipped with a new arsenal. The paper concludes with a brief discussion in Section 7.
2. Calibrating simulated radiative shocks. The CRASH team is interested
in studying shocks where radiation from shocked matter dominates the energy
transport and results in a complex evolutionary structure. These so-called radia-
tive shocks arise in practice from astrophysical phenomena (e.g., supernovae) and
other high-temperature systems [e.g., see Drake et al. (2011), McClarren et al.
(2011)]. Our particular work, here, involves a large suite of simulation output and
a small set of twenty field observations from radiative shock experiments. Our goal
is to calibrate the simulator and to predict features of radiative shocks in novel set-
tings.
The field experiments were conducted at the Omega laser facility at the Univer-
sity of Rochester [Boehly et al. (1997)]. A high-energy laser was used to irradiate
a beryllium disk located at the front end of a xenon (Xe) filled tube [Figure 1(a)],
launching a high-speed shock wave into the tube. It is said to be a radiative shock
if the energy flux emitted by the hot shocked material is equal to or larger than
the flux of kinetic energy into the shock. Each physical observation is a radiograph
image [Figure 1(b)], and the quantity of interest for us is the shock location: the distance traveled at a predetermined time.

FIG. 1. (a) Sketch of the apparatus used in the radiative shock experiments. A high-energy laser is used to ignite the beryllium disk on the right, creating a shock wave that travels through the xenon-filled tube. (b) Radiograph image of a radiative shock experiment.

The experimental (input) variables are listed in the first column of Table 1, and the ranges or values used in the field experiment (the design) are in the final column. The first three variables specify the thickness of the beryllium disk, the xenon
fill pressure in the tube and the observation time for the radiograph image. The next four variables are related to the geometry of the tube and the shape of the apparatus at its front end. Most of the physical experiments were performed on circular shock tubes with a small diameter (in the area of 575 microns), and the remaining experiments were conducted on circular tubes with a diameter of 1150 microns or with different nozzle configurations. The aspect ratio describes the shape of the tube (circular or oval). In our field experiments the aspect ratios are all 1, indicating a circular tube. Our predictive exercise involves extrapolating to oval-shaped tubes with an aspect ratio of 2. Finally, the laser energy is specified in Joules.

TABLE 1
Design and calibration variables and input ranges for computer experiments 1 (CE1) and 2 (CE2)
and the field experiments. A single value means that the variable was constant for all simulation runs

Input                         CE1              CE2              Field design
Design variables
  Be thickness (microns)      [18, 22]         21               21
  Xe fill pressure (atm)      [1.100, 1.2032]  [0.852, 1.46]    [1.032, 1.311]
  Time (nano-seconds)         [5, 27]          [5.5, 27]        6 values in [13, 28]
  Tube diameter (microns)     575              [575, 1150]      {575, 1150}
  Taper length (microns)      500              [460, 540]       500
  Nozzle length (microns)     500              [400, 600]       500
  Aspect ratio                1                [1, 2]           1
  Laser energy (J)            [3600, 3990]                      [3750.0, 3889.6]
  Effective laser energy (J)                   [2156.4, 4060]
Calibration parameters
  Electron flux limiter       [0.04, 0.10]     0.06
  Energy scale factor         [0.40, 1.10]     [0.60, 1.00]

The effective laser energy is the laser energy × energy scale factor.
Explaining the inputs listed in the remaining rows of Table 1 requires some
details on the computer simulations. Two simulation suites were performed, sepa-
rately, on supercomputers at Lawrence Livermore and Los Alamos National Lab-
oratories, and we combine them for our calibration exercise. The second and third
columns of the table reveal differing input ranges in the two computer experiments
(denoted CE1 and CE2, resp.). Briefly, CE1 explores the input region for small,
circular tubes, whereas CE2 investigates a similar input region, but also varies the
tube diameter and nozzle geometry. Both input plans were derived from Latin Hy-
percube samples [LHSs, McKay, Beckman and Conover (1979)]. The thickness
of the beryllium disk could be held constant in CE2 thanks to improvements in
manufacturing in the time in between simulation campaigns.
The computer simulator required two further inputs which could not be con-
trolled in the field, that is, two calibration parameters: the electron flux limiter and
the laser energy scale factor. The electron flux limiter is an unknown constant in-
volved in predicting the amount of heat transferred between cells of a space–time
mesh used by the code. It was held constant in CE2 because in CE1 the outputs
were found to be relatively insensitive to this input. The laser energy scale factor
accounts for discrepancies between the amounts of energy transferred to the shock
in the simulations and experiments, respectively. To explain, in the physical sys-
tem the laser energy for a shock is recorded by a technician. However, things are
a little more complicated for the simulations. Before running CE1, it was felt that
the simulated shock would be driven too far down the tube for any specified laser
energy. Instead, the effective laser energy—the laser energy actually entered into
the code—was constructed from two input variables, laser energy and a scale fac-
tor. For CE1 these two inputs were varied over the ranges specified in the second
column of Table 1. CE2 used effective laser energy directly.
Our analysis uses both laser energy and the laser energy scale factor, which
is treated as a calibration parameter. If the scale factor “calibrates” to one, then
there was no need to down-scale the laser energy in the first experiment. Using
both data sources requires reconciling the designs of the two experiments. To that
end, we expand the CE2 design by gridding values of laser energy scale factor
and pairing them with values of laser energy deduced from effective laser energy
values from the original design. When gridding, we constrained the scale factors to be less than one, but no smaller than the value which, when divided into the effective laser energy, implies a laser energy of 5000 Joules. Under those restrictions, an otherwise uniform grid with 100 settings of the scale factor yields a total of 26,458 input–output combinations, combining CE1 and expanded CE2 sets, to use in the calibration exercise. Figure 2 shows the design over laser
energy and energy scale factor.

FIG. 2. Marginal design for laser energy and energy scale factor from both experiments.

Our overarching goals here are three-fold: (a) design a calibration apparatus that can cope with data sizes like those described above, check that we understand its behavior in controlled settings (synthetic data), and determine how best to deploy
it for our real data (exploratory analysis); (b) determine the settings of the two-dimensional calibration parameter, note if down-scaling was necessary in CE1, and gain an understanding of the extent to which the field data are informative about settings for either parameter; (c) obtain (via a particular setting of the calibration parameter) a high-quality predictor for field data measurements in novel input conditions. In Section 6.3 we describe a (distribution of) input setting(s) of interest to the CRASH team, for which field data have been collected, which we use to benchmark our own predictions. Since this experiment is for an oval-shaped tube, the predictions rely heavily on the computer model output to make an extrapolation, as the field training data observations involved only circular tubes.
3. Elements of computer model calibration. As explained above, the radiative shock experiment involves runs of a deterministic computer model M at a large set of inputs, N_M = 26,458, and a much smaller number, N_F = 20, of observations from a physical or field experiment F. In what follows we refer to the inputs shared by M and F as design variables, and denote them by x. The remaining (two in our case) calibration parameters required for M are labeled as u, so that M takes inputs (x, u). A primary goal is to predict the result of new field data experiments, via M, which means first finding a good u. Below we outline the elements involved in such an endeavor, with the focus on limitations and remedies.
3.1. Hierarchical models and modularization. Kennedy and O'Hagan (2001, hereafter KOH) proposed a Bayesian framework for coupling M and F. Let y^F(x) denote a field observation under conditions x, and y^M(x, u) the (deterministic) output of a computer model run under conditions x and calibration inputs u. KOH represent the real process R as the computer model output at the best setting of the calibration parameters, u*, plus a discrepancy term acknowledging that there can be systematic disagreement between model and truth. In symbols, y^R(x) = y^M(x, u*) + b(x).[1] The field observations connect reality with data:

    y^F(x) = y^R(x) + ε = y^M(x, u*) + b(x) + ε,  where ε ~ N(0, σ_ε²), i.i.d.    (1)

The unknowns are u*, σ_ε² and the bias b(·). KOH propose a Gaussian process (GP) prior for b(·), which we review in detail in the following subsection. Known information or restrictions on u-values can be specified via a prior p(u), or otherwise a default/uniform prior can be used. Reference priors are typical for σ_ε².
If evaluating the computer model is fast, then inference is made rather straightforward using residuals between computer model outputs and field observations, y^F(x) − y^M(x, u), which can be computed at will for any u [Higdon et al. (2004)].

[1] We choose b(x) for the discrepancy term and casually refer to it as “bias” throughout even though the actual bias, y^M(x, u*) − y^R(x), which is a property of M not R, would actually work out to −b(x).
However, running the computer model is usually time consuming, as is indeed the case in our example. In such situations it is useful to use an emulator or surrogate model in place of y^M(·,·). An emulator is a fitted model ŷ^M(·,·) trained on a set of N_M simulations of M run over a design of (x, u)-input values. KOH recommend a GP prior for y^M. Rather than performing inference for y^M separately, using just the N_M runs as is typical of a computer experiment in isolation [e.g., Morris, Mitchell and Ylvisaker (1993)], they recommend inference joint with b(·), u and σ_ε², using both field observations and runs of the computer model. From a Bayesian perspective this is the coherent thing to do: infer all unknowns jointly given all data.

It is also practical when M is very slow, giving small N_M, and, moreover, even a small number N_F of field data observations can be highly informative about the emulator ŷ^M(·,·) in that setting. But, more generally, this approach is fraught with computational challenges. Coupled b(·) and y^M(·,·) lead to parameter identification and MCMC mixing issues, and emulation demands substantial computational effort in larger N_M contexts, even when applied in isolation. These challenges are all compounded under coupling.
Liu, Bayarri and Berger (2009) propose going “back to basics” by fitting the emulator ŷ^M(·,·) independently, using only the N_M simulations. Inference for the rest of the KOH calibration apparatus is still joint, for all parameters given ŷ^M and field data y^F. Their argument for this so-called modularization is philosophical, and is a response to previous work outlining how fully Bayesian joint inference in the KOH framework unproductively confounds emulator uncertainty with bias discrepancy [Santner, Williams and Notz (2003)]. Our justification for entertaining modularized calibration is different: decoupling has computational advantages. Since our N_M ≫ N_F, a small amount of field data cannot substantively enhance the quality of the emulator obtained under joint inference. In other words, we don't lose much by modularizing. However, despite simplifying many matters, a modularized approach would still require large-N_M emulation for our application, and is therefore no panacea.
3.2. Gaussian process emulation and sparse/local approximation. Gaussian process (GP) regression is canonical for emulating computer experiments [Santner, Williams and Notz (2003)]. The reasons are many, but, as we shall see, computational tractability is not one of them. Technically, the GP is a prior over functions between x ∈ R^p and Y ∈ R such that any finite collection of Y-values (at those x's) is multivariate normal (MVN). Therefore, it is defined by a mean vector μ and covariance matrix Σ, and these values may be specified in terms of hyperparameters and x-values. Homoskedasticity and stationarity are common simplifying assumptions in emulator applications. Often μ is constant/zero and Σ = τ²K has constant scale τ² and correlations K defined only in terms of displacements x − x′.
Performing GP regression requires applying the same logic, conditionally on data D_N = (X_N, Y_N) = ([x_1, ..., x_N], [y_1, ..., y_N]). Given values of any hyperparameters, the predictive distribution for Y(x) at new x's is directly available from MVN conditionals. Integrating out τ² under a reference prior [see, e.g., Gramacy and Polson (2011)] yields a Student-t with

    mean   μ(x | D_N, K_N) = k(x)ᵀ K_N⁻¹ Y_N,                          (2)
    and scale  σ²(x | D_N, K_N) = ψ [K(x, x) − k(x)ᵀ K_N⁻¹ k(x)] / N,  (3)

and N degrees of freedom, where k(x) is the N-vector whose ith component is K_θ(x, x_i), defining the correlation function given hyperparameters θ; K_N is an N × N matrix whose entries are K_θ(x_i, x_j); and ψ = Y_Nᵀ K_N⁻¹ Y_N. Inference for θ can proceed by maximizing (e.g., Newton-schemes based on derivatives of) the likelihood,

    p(Y | K_θ) = Γ[N/2] / ((2π)^{N/2} |K_N|^{1/2}) × (ψ/2)^{−N/2},     (4)

or via the posterior p(Y | K_θ) p(θ) in Bayesian schemes.
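To make equations (2)–(4) concrete, here is a minimal R sketch assuming an isotropic Gaussian correlation K_θ(x, x′) = exp(−||x − x′||²/θ); the function and variable names are ours, for illustration only, and the dense solve is the O(N³) bottleneck discussed next.

gp.fit <- function(X, Y, theta) {
  K <- exp(-as.matrix(dist(X))^2 / theta)      # N x N correlation matrix K_N
  Ki <- solve(K)                               # the O(N^3) bottleneck
  psi <- drop(t(Y) %*% Ki %*% Y)               # psi = Y' K_N^{-1} Y
  N <- length(Y)
  ldetK <- as.numeric(determinant(K)$modulus)  # log |K_N|
  ## log of equation (4): marginal likelihood with tau^2 integrated out
  llik <- lgamma(N/2) - (N/2)*log(2*pi) - 0.5*ldetK - (N/2)*log(psi/2)
  list(X=X, Y=Y, Ki=Ki, psi=psi, theta=theta, llik=llik)
}

gp.pred <- function(fit, x) {                  # Student-t moments, equations (2)-(3)
  k <- exp(-colSums((t(fit$X) - x)^2) / fit$theta)       # k(x)
  mu <- drop(t(k) %*% fit$Ki %*% fit$Y)                  # equation (2)
  N <- length(fit$Y)
  s2 <- fit$psi * (1 - drop(t(k) %*% fit$Ki %*% k)) / N  # equation (3), K(x,x) = 1
  c(mean=mu, scale=s2, df=N)
}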
Observe that prediction and inference (even sampling from the GP prior) require decomposing an N × N matrix to obtain K_N⁻¹ and |K_N|. Thus, for most choices of K_θ(·,·) and point-inference schemes, data sizes N are limited to the low thousands. Bayesian approaches are even further limited, as orders of magnitude more likelihood evaluations (and matrix inversions) are typically required, for example, for MCMC. Assuming stationarity can also sometimes be too restrictive, and unfortunately relaxation usually requires even more computation [e.g., Ba and Joseph (2012), Paciorek and Schervish (2006), Schmidt and O'Hagan (2003)].
A key demand on the emulator in almost any computer modeling context, but especially for calibration, is that inference and prediction (at any/many x) be fast relative to running new simulations (at x). Otherwise, why bother emulating? As
computers have become faster, computer experiments have become bigger, limit-
ing the viability of standard GP emulation. Sparsity is a recurring theme in recent
searches for emulators with larger capability [e.g., Eidsvik et al. (2014), Haaland
and Qian (2011), Kaufman et al. (2011), Sang and Huang (2012)], allowing de-
compositions of large covariance matrices to be either avoided entirely, be built up
sequentially, or be carried out using fast sparse-matrix libraries.
In this paper we use a recent sparse GP methodology developed by Gramacy and Apley (2015). They provide a localized approach to GP inference/prediction that is ideal for calibration, where the full inferential scheme (either KOH or modular) only requires ŷ^M(x, u) for (x, u)-values coinciding with field-data x-values, and u-values along the search path for û, as we describe in Section 4. The idea is to focus expressly on the prediction problem at an input x. In what follows we use x generically, rather than (x, u), as inputs to ŷ^M. The local GP scheme acknowledges that data input locations in X_N which are far from x have vanishingly small impact on the predictive equations (2)–(3). This is used as the basis of a search for locations X_n(x) ⊂ X_N which minimize Bayesian mean squared prediction error (MSPE). The search is performed in a greedy fashion, giving an approximate solution to the local design problem, and paired with efficient updates to the local GP approximation as new data points are added into the local design. Building a predictor in this way, ultimately using equations (2)–(3) with a data subset D_n(x), can be performed in O(n³), a substantial savings if n ≪ N. Pragmatically, one can choose n as large as computational constraints allow.

Gramacy and Apley (2015) show empirically that these MSPE-based local designs lead to predictors which are more accurate than nearest neighbor—using the nearest X_N values to x—which is known to be suboptimal [Stein, Chi and Welty (2004), Vecchia (1988)]. They also extend the scheme to provide local inference of the correlation structure, and thereby fit a globally nonstationary model. All calculations are independent for each x, so local inference and prediction on a dense set of x ∈ X can be trivially parallelized, accommodating emulation for designs of size N = 10⁶ in under an hour [Gramacy, Niemi and Weiss (2014)]. An implementation is provided in an R package called laGP [Gramacy (2013)]. However, independent calculations for each x—while providing for nonstationarity and parallelization—yield a discontinuous global predictive surface, which can present challenges in our calibration context.
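As a small illustration, a local approximate GP emulator might be called from the laGP package as in the sketch below; the data here are synthetic stand-ins for the simulation suite, and argument defaults may differ across package versions.

library(laGP)

N <- 26458; p <- 4
X <- matrix(runif(N*p), ncol=p)           # stand-in for the (x, u) design
y <- apply(X, 1, function(z) sin(2*pi*z[1]) + z[2]*z[3] - z[4]^2)

XX <- matrix(runif(20*p), ncol=p)         # predictive locations, e.g., field sites
## local designs of size end=50 built greedily by MSPE-style (ALC) search;
## one independent (parallelizable) calculation per row of XX
fit <- aGP(X, y, XX, start=6, end=50, method="alc", verb=0)
cbind(mean=fit$mean, var=fit$var)[1:5, ]  # local predictive moments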
4. Proposed method. What we propose is thriftier than KOH in three ways, and thriftier than the modularized version in two ways: It (a) modularizes the KOH hierarchical model; (b) deploys local approximate GP modeling for the emulator ŷ^M(x, u); and (c) performs maximum a posteriori (point) inference for u via the induced fits for the bias b̂(x) under a GP prior. Given a value for the calibration parameter, u, the rest of the scheme involves a cascade of straightforward Newton-style maximizing calculations. Below we describe an objective function which, when optimized, performs the desired calibration, giving an estimated value û for u. We then discuss how to predict Y^F(x) at new x-values given û and the data.
4.1. Calibration as optimization. Let the field data be denoted as D^F_{N_F} = (X^F_{N_F}, Y^F_{N_F}), where X^F_{N_F} is the design matrix of N_F field data inputs, paired with an N_F-vector of y^F observations, Y^F_{N_F}. Similarly, let D^M_{N_M} = ([X^M_{N_M}, U_{N_M}], Y^M_{N_M}) be the N_M computer model input–output combinations with column-combined x- and u-design(s) and y^M-outputs. Then, with an emulator ŷ^M(·, u) trained on D^M_{N_M}, let Ŷ^{M|u}_{N_F} ≡ ŷ^M(X^F_{N_F}, u) denote a vector of N_F emulated output y-values at the X^F locations obtained under a setting, u, of the calibration parameter. With local approximate GP modeling, each ŷ^{M|u}_j-value therein, for j = 1, ..., N_F, is obtained independently (and in parallel) via local sub-design X_{n_M}(x^F_j, u) ⊂ [X^M_{N_M}, U_{N_M}] and locally inferred hyperparameters θ̂_j ≡ θ̂(D_{n_M}(x^F_j, u)). The size of the local sub-design, n_M, is a fidelity parameter. Larger n_M values provide more faithful (compared to a full GP) emulation at greater computational expense. Finally, denote the N_F-vector of fitted discrepancies as Ŷ^{B|u}_{N_F} = Y^F_{N_F} − Ŷ^{M|u}_{N_F}.
Given these quantities, the quality of a particular u can be measured by the joint probability density of observing Y^F_{N_F} at inputs X^F_{N_F}. We obtain this from the best fitting GP regression model trained on data D^B_{N_F}(u) = (X^F_{N_F}, Ŷ^{B|u}_{N_F}), emitting estimator b̂ for the bias given u.[2] Values of u which lead to a higher probability of observing Y^F_{N_F} under the GP prior for b(·), modeling the discrepancy between computer model emulations and field data, are preferred. We therefore suggest finding û to maximize that probability, while simultaneously maximizing over the parameterization of b(·), via hyperparameters θ_b, by solving the following optimization problem:

    û = argmax_u { p(u) × max_{θ_b} p_b(θ_b | D^B_{N_F}(u)) }.    (5)

Here p(u) is a prior for u and p_b(θ_b | ...) is a shorthand for our bias “fit” b̂: the marginalized posterior under a GP prior with lengthscale hyperparameters θ_b and noise parameter σ_ε². It is computationally feasible to use a full, rather than approximate, GP for b(·) since N_F is small. The “inner” max_{θ_b} can be performed using Newton-like methods with closed-form derivatives with respect to the lengthscale θ_b. The “outer” max_u is discussed shortly.
Algorithm 1 represents the “inner” max portion of (5) in pseudocode for a more detailed second look. In our implementation, steps 1–5 in the code are automated by applying a wrapper routine in the laGP package, called aGP, which loops over each element j of the predictive grid, performing local design, inference for θ̂_j and subsequent prediction stages, in parallel via OpenMP. With N_F and n_M small relative to N_M, the execution of the “for”-loop is extremely fast. In our examples to follow (Sections 5–6), we use a local neighborhood size of n_M = 50. Steps 8–9 are implemented by functions of the same names in the laGP package.

The GP model for b(·), fit in step 8, estimates a nugget parameter (in addition to lengthscale θ̂_b) to capture the noise term σ_ε² in (1), whereas the local approximate ones used for emulation, in step 3, do not. For situations where bias is known to be very small/zero, it is sensible to entertain a degenerate GP prior for b(·) with an identity correlation matrix. In that case, step 8 in Algorithm 1 is skipped and step 9 reduces to evaluating a predictive density under an i.i.d. normal likelihood with μ = 0, that is, only averaging over σ_ε². Note that Algorithm 1 works with log probabilities for numerical stability, while equation (5) is represented in terms of unlogged quantities.
[2] Note that D^B_{N_F}(u) tacitly depends on hyperparameters θ̂_j since it is defined through local GP emulation.
Algorithm 1  Calculating the p_b(θ_b | D^B_{N_F}(u)) term in equation (5)

Require: Calibration parameter u, fidelity parameter n_M, computer data D^M_{N_M}, and field data D^F_{N_F}.
1: for j = 1, ..., N_F do
2:    I ← laGP(x^F_j, u | n_M, D^M_{N_M})    {get indices of local design}
3:    θ̂_j ← mleGP(D^M_{N_M}[I])    {local MLE of correlation parameter(s)}
4:    ŷ^{M|u}_j ← muGP(x^F_j | D^M_{N_M}[I], θ̂_j)    {predictive mean emulation following equation (2)}
5: end for
6: Ŷ^{B|u}_{N_F} ← Y^F_{N_F} − Ŷ^{M|u}    {vectorized bias calculation}
7: D^B_{N_F}(u) ← (X^F_{N_F}, Ŷ^{B|u}_{N_F})    {create data for estimating b̂(·)|u}
8: θ̂_b ← mleGP(D^B_{N_F}(u))    {full GP estimate of b̂(·)|u}
9: return llikGP(θ̂_b, D^B_{N_F}(u))    {the objective value of the mleGP call above}
4.2. Derivative-free optimization of the calibration objective. We turn now to the “outer” max_u in (5), thinking of the “inner” max_{θ_b} as an objective which can be evaluated following Algorithm 1. The discrete nature of independent local design searches for ŷ^M(x^F_j, u) ensures that this objective is not continuous in u. In fact, as we illustrate in our empirical work, it can look “noisy,” although it is in fact deterministic. This means that optimization with derivatives—even numerically approximated ones—is fraught with challenges. We opt for a derivative-free approach [see, e.g., Conn, Scheinberg and Vicente (2009)].

Specifically, we use an implementation of the mesh adaptive direct search (MADS) algorithm [Audet and Dennis (2006)] called NOMAD [Le Digabel (2011)], via an interface for R provided by the crs package [Racine and Nie (2012)]. MADS proceeds by successive pairs of search and poll steps, trying inputs to the objective function on a sequence of meshes which are refined in such a way as to guarantee convergence to a local optimum under weak regularity conditions; for more details see Audet and Dennis (2006). Direct, or so-called pattern search, methods such as these have become popular for many challenging optimization problems where derivative information is either not available or where approximations to derivatives may lead to unstable numerical behavior. We are not the first to use MADS/NOMAD in the context of computer modeling. MacDonald, Ranjan and Chipman (2012) used it to search for the smallest nugget leading to numerically stable matrix decompositions for near-interpolating GP emulation. Our use is novel in the calibration context.

As MADS is a local solver, NOMAD requires initialization. We recommend choosing a starting u-value from evaluations on a small random space-filling design; however, in our experiments (e.g., Section 5), starting at the center of the space performs almost as well.
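The outer search might then be invoked as in the sketch below, assuming the snomadr() interface from the crs package (exact argument names can vary by version) and the calib.obj() sketch above. NOMAD minimizes, so we negate the log posterior; the Beta(2, 2) prior on each coordinate of u anticipates Section 5.

library(crs)

neg.lpost <- function(u)                  # NOMAD minimizes, so negate
  -(sum(dbeta(u, 2, 2, log=TRUE)) + calib.obj(u, XF, YF, XM, YM))

out <- snomadr(eval.f=neg.lpost, n=2, x0=c(0.5, 0.5),
               lb=c(0, 0), ub=c(1, 1), bbin=c(0, 0), bbout=0,
               opts=list("MAX_BB_EVAL"=100))
uhat <- out$solution                      # the estimate in equation (5)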
4.3. Predictions for field data. Posterior predictive samples of Y^F(x)|û, representing the empirical distribution of field-data observations at a novel x given a calibrated computer model using û, can be obtained by running backward through the KOH model (1) with estimated quantities b̂(x) and ŷ^M(x, û). That is, obtaining a predictive sample at x involves executing the following steps in sequence:

    Y^M ~ Ŷ^M(x | θ̂(x))   via local GP under equations (2)–(4) with data D_{n_M}(x),    (6)
    Y^b ~ b̂(x | θ̂_b)      via full GP under equations (2)–(3) with data D^b̂_{N_F}(û),  (7)
    Y^F = Y^M + Y^b        combining computer model, bias and noise.                      (8)

On the left, above, we abuse notation somewhat and let estimated emulator and bias processes “stand in” for their corresponding predictive equations. Pointers to those equations are provided on the right. In an unbiased version, the zero-mean Student-t draws in (7) are equivalent to GP ones with nugget-augmented diagonal correlation matrix K = diag(1 + σ_ε²), with both scale τ² and noise σ_ε² terms integrated out. Equation (6) reminds that local GP emulation depends on both local design and locally estimated lengthscales.
Again consider Algorithm 1 for a second look. Field prediction involves first running back through steps 2–4 to obtain a local design and correlation parameter [implementing equation (6)], parallelized for potentially many x; then performing steps 7–9 using saved D^b̂_{N_F} and θ̂_b from the optimization [equation (7)]. However, rather than evaluate a predictive probability, instead save the moments of the predictive density (step 9) at the new x locations. These can then be combined with the computer model emulation(s) obtained in step 4, thus “de-biasing” the computer model output to get a distribution for Y^F(x)|û, that is, undoing step 6. Ideally, the full Student-t predictive density would be used here, in step 4, leading to a sum of Student-t random variables [equation (8)] for ŷ^M(x, û) and b̂(x) comprising y^F(x)|û. However, if N_F, n_M ≳ 30, summing normals suffices, meaning no sampling is necessary.

As a sum of random samples from a convolution of two GP predictive distributions, the resulting field predictions account for many uncertainties, arising from both noise observed in the field and from model quantities estimated from both data sources. Still, it is important to clarify that some uncertainties are overlooked in this approach. The biggest omission is uncertainty in û. Monte Carlo alternatives to optimizing u, such as posterior sampling or the bootstrap, are always an option. But these might not be good value considering identification issues known to plague KOH-style calibration [Loeppky, Bingham and Welch (2006)]. Our empirical work shows that predictions under û retain many desirable accuracy and uncertainty attributes, despite (or in spite of) such clearly evident concerns. When u is a primary goal, we later show how NOMAD evaluations can be salvaged to approximate a (log) posterior surface, and that these largely agree with a much more expensive bootstrap alternative. Finally, deploying point-estimates (e.g., MAP) for lengthscales and other hyperparameters, like θ̂_b and θ̂(x), is a common “empirical Bayes” practice. With local GP emulation, overlooking such uncertainties is one of many deliberate acts of pragmatism, including that of local design search. Since local GPs overestimate uncertainty relative to full-data counterparts [see, e.g., Gramacy and Haaland (2015)], a measure of conservatism is organically built in.
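Under the normal approximation just described, field prediction amounts to adding local GP emulator moments to full GP bias moments. A sketch, reusing the placeholder names and laGP calls from the earlier sketches (with yB denoting the fitted discrepancies at the field sites under û):

predict.field <- function(xnew, uhat, XM, YM, XF, yB, nM=50) {
  xref <- matrix(c(xnew, uhat), nrow=1)
  em <- laGP(xref, start=6, end=nM, X=XM, Z=YM, method="alc")  # equation (6)
  gpi <- newGP(XF, yB, d=0.1, g=0.1, dK=TRUE)                  # bias GP, equation (7)
  jmleGP(gpi)                                                  # lengthscale + nugget MLE
  bias <- predGP(gpi, matrix(xnew, nrow=1), lite=TRUE)
  deleteGP(gpi)
  c(mean=em$mean + bias$mean, var=em$s2 + bias$s2)             # combine as in (8)
}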
5. Illustrations. In this section we entertain variations on a synthetic data-generating mechanism akin to one described most recently by Goh et al. (2013), who adapted an example from Bastos and O'Hagan (2009). It uses two-dimensional field data inputs x, and two-dimensional calibration parameters u, both residing in the unit cube. The computer model is specified as follows:

    y^M(x, u) = (1 − e^{−1/(2 x_2)}) × (1000 u_1 x_1³ + 1900 x_1² + 2092 x_1 + 60) / (100 u_2 x_1³ + 500 x_1² + 4 x_1 + 20).    (9)

The field data is generated as

    y^F(x) = y^M(x, u*) + b(x) + ε,    (10)
    where b(x) = (10 x_1² + 4 x_2²) / (50 x_1 x_2 + 10)  and  ε ~ N(0, 0.5²), i.i.d.,

using u* = (0.2, 0.1). We keep this setup, however, we diverge from previous uses in the size and generation of the input designs, and the number of field data replicates.
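A direct R translation of (9)–(10), useful for replicating this setup:

yM <- function(x, u)                           # equation (9)
  (1 - exp(-1/(2*x[2]))) *
    (1000*u[1]*x[1]^3 + 1900*x[1]^2 + 2092*x[1] + 60) /
    (100*u[2]*x[1]^3 + 500*x[1]^2 + 4*x[1] + 20)

b <- function(x)                               # discrepancy in equation (10)
  (10*x[1]^2 + 4*x[2]^2) / (50*x[1]*x[2] + 10)

yF <- function(x, ustar=c(0.2, 0.1), sd=0.5)   # noisy field observation
  yM(x, ustar) + b(x) + rnorm(1, 0, sd)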
Our simulation study is broken into two regimes, considering biased and unbiased variations, and is designed (i) to explore the efficacy of the proposed approach; (ii) to investigate performance in different scenarios (with/without bias, unreplicated and replicated experiments, etc.); and (iii) to motivate alternatives for our real data analysis in Section 6. Both simulation regimes involve 100 Monte Carlo (MC) repetitions and proceed as follows.

Each repetition uses a two-dimensional LHS of size 50 (on the unit cube) for the field data design, with three variations on the number of replicates, {1, 2, 10}, for each unique design variable setting, x, leading to N_F ∈ {50, 100, 500} random realizations of Y^F. The computer model design begins with a four-dimensional LHS of size 10,000. It is then augmented with simulation trials that are aligned with the field data design. We take 10 points per input in the field data, differing only in the u-values: the 500 total (x_1, x_2)-values are paired with a two-dimensional LHS (also of size 500) of (u_1, u_2)-values. Combining with the second LHS, this gives N_M = 10,500 random (x_1, x_2, u_1, u_2) locations for the deterministic simulation of Y^M.
In each MC repetition, a NOMAD search for û is initialized with the best value found on a maxmin design of size 20, which is obtained by searching stochastically over a two-dimensional LHS of size 200. Vague independent Beta(2, 2) priors on each component of u discourage the solver from finding solutions that lie on the boundary of the search space. Finally, a two-dimensional LHS of size 1000 is used to generate an out-of-sample validation set of y^F values without noise, that is, ε_x = 0. Root mean-squared errors (RMSEs) and estimates û of u* are our main metrics of comparison.
In addition to varying the number of replicates, our comparators include variations on the calibration apparatus and emulation of y^M. For example, we compare our local approximate modular approach (Section 4) to versions using the true calibration value, u*, a random value in the two-dimensional unit cube, u_r, and combinations of those where y^M is used directly—that is, assuming free computer simulation, and thus bypassing the emulator ŷ^M. On the suggestion of a referee, we also include GP predictors derived from the field data Y^F_{N_F} only, bypassing the computer model and calibration parameter(s) entirely. Together, these alternatives allow us to explore how the error in our estimates decomposes at each level of the approximate modularized calibration.
5.1. Unbiased calibration. Figure 3 summarizes results from our first regime: generating field data without bias, that is, setting b(·) = 0 in equation (10), and fitting the model bias-free, that is, only estimating σ_ε². Consider the top left panel first, which shows boxplots of RMSEs arranged by numbers of replicates (three groups of six from left to right), and then by the use of an emulator ŷ^M or not (subgroups of three within the six). Observe that a random calibration parameter, u_r (labeled as “urand,” the middle boxplot in each group of three), gives poor predictions of y^F. By contrast, using the correct u* with y^M directly (labeled “u*-M,” fourth boxplot in each group of six), that is, not emulating via ŷ^M, leads to nearly perfect prediction. Contrasting with the corresponding “u*-Mhat” boxplots (first in each group of six) reveals the relative “cost” of emulating via ŷ^M with u*. Together, the “urand” and “u*” variations span the best and worst alternatives. Distinctions between the rest are more nuanced.

The third and sixth boxplots (from the left) show RMSEs obtained with û via a single field data replicate. RMSEs obtained under y^M or ŷ^M are very similar, with the former being slightly better. This indicates that the local approximate GP emulator is doing a good job as a surrogate for y^M. The story is similar for two replicates, giving slightly lower RMSEs (boxplots 9 and 12), as expected. Ten replicates (15 and 18) lead to greater differentiation between y^M and ŷ^M results, implying more replicates provide a more accurate and lower variance estimate û. Considering how bad things can get (“urand”), all of the other estimates are quite good relative to the best possible (“u*-M” and “u*-Mhat”).

The top left panel does not include a boxplot for the predictor based on fitting a GP to the field data only—the comparator recommended by the referee.
FIG. 3. Comparison on unbiased data, 100 MC replicates. The top left panel shows RMSE to the true response on hold-out sets, and the bottom left shows the corresponding standard deviations. The top right panel shows three examples of the chosen calibration parameter(s) û, and the bottom right shows 1-d density estimates of û_a conditional on the true value u*_b of the other coordinate. True u* values are shown as dashed blue lines, with a blue triangle positioned at their intersection. The boxplot axes and scatter plot legend entries indicate whether u is estimated (“uhat”) or the true value is used (“u*”). Field data sets with 1, 2 and 10 replicates at each design location are shown, arranged in three groups of six along the x-axis in the left panels; estimators using ŷ^M (“Mhat”) and y^M (“M”) are grouped in threes. Whiskers of the “urand” boxplots are truncated to improve visualization.
We chose not to include these because of how they would adversely affect the scale of the y-axis. The summary statistics (min, inter-quartile range, and max) are as follows: (0.44, 0.56, 0.73, 1.13) for one repetition, (0.31, 0.44, 0.59, 0.96) for two, and (0.22, 0.30, 0.45, 0.97) for ten. These are pairwise dominated by every other comparator (with the same number of replicates), including those based on random u_r. Clearly, the computer model/emulator is the key to good prediction.
The bottom left panel shows estimated predictive standard deviations (SDs) for each variation, whose corresponding RMSEs are directly above. SDs are calculated by factoring in the predictive variances from both stages: emulation uncertainty (if any), plus bias/noise components. The random calibration parameter, u_r, gives the greatest uncertainty, which is reassuring given its poor RMSEs. Uncertainties coming from ŷ^M and y^M are very similar.

The top right panel shows estimated û-values for three representative cases. The others follow these trends and are omitted to reduce clutter. In all three the û-values found are along a straight line going through the true value u* = (0.2, 0.1). This is the case whether emulating with ŷ^M or using y^M directly, although we observe that when there are more replicates, or when y^M is used directly, the points cluster more tightly to the line and more densely near u*. We conclude that there is a ridge in the integrated likelihood for u, giving equal density to combinations (e.g., in ratio) of u_1 and u_2 values.

This is confirmed in the bottom right panel, which shows (MC average) densities for one u-coordinate conditional on the true value of the other. The pull of our prior, toward the center of the space, is visible in both panes, but is far weaker when one of the coordinates is fixed. Further simulation (not shown) reveals that, in this situation, weaker u-priors move estimates closer to the true u*; however, uniform priors can yield û-values on the boundary, particularly near u_2 = 0.2. Also, observe that the posterior evaluations appear “noisy.” This is an artifact of the discrete nature of the local design search underlying ŷ^M(x, u). The objective surface is in fact deterministic. Smoothly varying values of the calibration parameter(s) may cause abrupt changes in the local design, and lead to abrupt (if small) changes to local emulation and ultimately to the posterior probabilities being maximized, motivating the NOMAD solver.
To wrap up with timing, we report that the most expensive comparator (“uhat-
Mhat-10”) took between 159 and 388 seconds, averaging 232 seconds, over all 100
repetitions on a 16-core Intel Sandy Bridge 2.6 GHz Xeon machine. That large
range is due to variation in the number of NOMAD optimization steps required,
spanning 11 to 33, averaging 18.
5.2. Biased calibration. Figure 4 shows a similar suite of results for the full, biased, setup described in equations (9)–(10), modeled with a GP prior on b(·). At a quick glance one notices the following: (1) the û estimates (top right) are far from the true u* for all calibration alternatives considered; (2) the random setting u_r isn't much worse than the other options (top left). Looking more closely, however, we can see that the û versions are performing the best in each section of the chart(s). These are giving the lowest RMSEs (top left) and the lowest SDs (bottom left). They are doing even better than with the true u* setting. So while we are not able to recover the true u*, we nonetheless predict the field data better with the values we do find. Our modularized approximate calibration method is excelling at one task, prediction of y^F, possibly at the expense of another, estimating u*.

FIG. 4. Comparison on biased data, 100 MC replicates. The explanation of the panels is the same as for Figure 3.
The explanation is nuanced. The bias (10) is not well approximated by a stationary process, and neither is (9) for that matter. But our fitted b̂ assumes stationarity, so there is clearly a mismatch with (10). The local approximate GP emulator does allow for adaptivity of correlation structure over the input space, and thus can accommodate a degree of nonstationarity in the computer model (9). That explains why our emulations were very good, but not perfect, in the unbiased case (Figure 3). In this biased case, the full posterior distribution, inferring both full and local GPs, is using the flexibility of the joint modeling apparatus to trade off responsibility, in effect exploiting a lack of identifiability in the model, which is a popular tactic in nonstationary modeling (further discussion in Section 7). It is tuning û to obtain an emulator that better copes with a stationary discrepancy, resulting in a less parsimonious and larger magnitude estimate of b, but one for which b̂(·) + ŷ^M(·, û) gives good predictions of y^F(·). Meanwhile, the local GP is faced with a more demanding emulation task.
Again, we chose not to show boxplots for the field-data-only comparator in the figure because they would distort the y-axis scale. The summary statistics are as follows: (0.44, 0.57, 0.71, 1.13) for one repetition, (0.35, 0.46, 0.61, 1.23) for two, and (0.20, 0.29, 0.47, 0.88) for ten. These are similar to the values obtained for the unbiased case, but it is important to note that they are not directly comparable since the data-generating mechanisms are different—the former does not augment with equation (10).

Time-wise, the most expensive comparator (“uhat-Mhat-10”) took between 538 and 1700 seconds, averaging 1049 seconds, over all 100 repetitions. The number of NOMAD optimization steps was similar to the unbiased case, ranging from 11 to 32, averaging 18. The main difference in computational cost compared to the unbiased case was due to estimating the GP correlation structure for b̂, requiring O(N_F³) computations for N_F = 500.
6. Calibrated prediction for radiative shocks. We return now to our moti-
vating example, having proposed a thrifty framework for calibration and explored
its behavior in several variations on a representative benchmark problem. Our ex-
perimental setup for calibration and prediction is similar to the one described in
Section 5. In particular, we again entertain both biased and unbiased alternatives,
being unsure about the extent of bias in the simulator relative to the field data.
One substantial distinction, however, between our synthetic data and the radiative
shock experiment, concerns the input space and the local isotropy assumptions
underlying our local approximate GP emulator. This wasn’t an issue in our previ-
ous experiments since the inputs were in the unit cube, and the responses (9)–(10)
varied by similar magnitude(s) within that range.
The radiative shock experiment involves a larger input space with disparate units (Table 1); therefore, we augment biased and unbiased variations with pairings of two different types of preprocessing of the inputs. Our first type of preprocessing simply scales all inputs to lie in the unit 10-cube, mimicking our synthetic experiment. We call this the “isotropic” case, since all input directions share a common lengthscale. In the local GP emulator, ŷ^M, local isotropy does not preclude global anisotropy or even nonstationarity. However, the discrepancy b̂ has global reach, so isotropy can be restrictive; still, with only twenty field data observations, isotropy has the virtue of parsimony.
In a second version we rescale those inputs by a crude estimate of the global lengthscale obtained from small random subsets of the computer model run data. Specifically, we randomly sample 1000 elements of the full 26,458-run design, in 100 replications, and save the maximum a posteriori estimate of a separable lengthscale hyperparameter from a Gaussian correlation function. The distribution of those lengthscales is summarized in Table 2.
TABLE 2
Summary of estimated lengthscales from a separable power correlation function applied 100 times
to a random subsample of size 1000 from the full 26,458-run design

       Be      Laser   Xe      Aspect  Nozzle  Taper   Tube            Elect.  Energy
       thick   energy  press   ratio   length  length  diam    Time    flux    scale
                                                                       limit   factor
25%    0.17    1.94    3.26    2.68    3.54    3.15    3.26    0.51    0.51    2.48
50%    0.64    2.11    3.65    2.94    3.85    3.57    3.55    0.69    0.88    2.73
75%    1.05    2.33    4.07    3.25    4.20    3.95    3.77    0.91    1.35    2.98
Observe that while some inputs (the middle ones: Xe pressure, aspect ratio, nozzle length, taper length, tube diameter) might cope well with a common lengthscale, the analysis suggests others require faster decay. Be thickness, time and electron flux limiter benefit from lengthscales roughly 4× shorter than those above; laser energy and energy scale factor, almost 2×. We entertain dividing the (already cube-scaled) inputs by square roots of median lengthscales to circumvent the limits of isotropy in estimating both ŷ^M and b̂.
Finally, a few other small changes from Section 5 are worth noting. We initialize the search for û, a two-vector comprising the electron flux limiter and energy scale factor, with a larger space-filling design (of size 200 compared to 20). Since we are not performing a Monte Carlo experiment with hundreds of repetitions, we can afford a more conservative, computationally costly, search. When estimating the discrepancy b̂, we apply a GP model to the subset of inputs which actually vary over more than two values in the field data (laser energy, Xe pressure, time); see the final column in Table 1. We drop tube diameter, which has only two unique settings; however, the results aren't much changed when it is included.
6.1. Exploratory analysis. Before providing results based on a full calibra-
tion, in the four variations described above, we report on an exploratory analysis
concentrated on stressing aspects of the full framework—emulation, bias, calibra-
tion and prediction—with the aim of gaining insight into what differences might
be expected under those variations, if any.
The first aspect is a sensitivity analysis to see which inputs have substantial im-
pact on the response, with a local GP emulator under both isotropic and separable
preprocessing regimes. Average main effect functions are computed for each input
[Sobol (1993)] and displayed in Figure 5. Each panel of the plot gives the emulator
response curve for an input, averaged over the remaining inputs. Observe in Figure 5 that both preprocessing specifications give essentially the same results. The most influential inputs, marginally, are laser energy, time and laser energy scale factor. The code is relatively less sensitive to the others, on average. Foreshadowing somewhat, our prediction exercise in Section 6.3 involves inputs with an aspect ratio of 2. Since there are no field data runs with that setting (see Table 1), even calibrated predictions would be relying primarily on the emulated computer model to make an extrapolation. The emulator shows a negligible effect for that input, so we can rest assured that predictions in this unsampled regime are not wildly different from where the models were trained.

FIG. 5. Main effects plots for the emulated simulation runs.
We next report on a leave-one-out study to assess the predictive ability of the four variations on our calibration methodology, and to gain confidence that it is capturing variability in the input space and between simulation and field data. In turn, each of the twenty field observations is deleted, models are fit to the remaining observations and (all) simulations, and the deleted observation is predicted. The left panel of Figure 6 indicates that all four methods are performing well, with none obviously dominating the others in terms of predictive means. Paired t-tests fail to detect differences in mean predictive ability among all pairs of comparators. The right panel shows 95% credible intervals from those predictions, after subtracting off the true values. Here there may be some differences between the methods visually. For example, the biased predictors seem to have the smallest intervals, on average, which makes sense considering what we understand about the data-generating mechanism. However, a Bartlett test of unequal variances fails to reject the null that all four predictors have the same variance. This may be due to the small sample size of twenty.
FIG. 6. Leave-one-out predictions for the radiative shock location field data versus true values (left), and with error bars after subtracting out the true value (right).
6.2. Model calibration. We turn now to a full analysis of the calibration exercise in four variations. The image plots in Figure 7 show the log posterior surface interpolated from all evaluations of the objective (Algorithm 1), combining the initial design and NOMAD searches. The intersecting lines indicate the û's thus found, and the open circles are estimates obtained under a parametric bootstrap, discussed in more detail shortly. The unbiased experiments took about 20 minutes to run on a 4-core hyperthreaded machine, whereas the biased ones took fifteen. That ordering would seem paradoxical, since the biased models have more quantities to estimate; however, NOMAD convergence was faster for the biased version, requiring fewer iterations to navigate the posterior surface in search of û.

FIG. 7. Profile log-likelihood surfaces for the calibration parameters, electron flux limiter and energy scale factor, in four setups. Clockwise from top left (MAP setting indicated by intersecting lines): isotropic unbiased; isotropic biased; separable biased; separable unbiased. Open circles show estimates obtained under parametric bootstrap resampling.
Several observations are noteworthy. All four variations reveal that the posterior surface is much flatter for the electron flux limiter than for energy scale factor, as expected. There is consensus on a value of scale factor between 0.75 and 0.8, meaning that scaling the laser energy in CE1 was indeed helpful. The separable models, biased or unbiased, largely agree on a setting of the electron flux limiter; however, the isotropic versions disagree with that setting and disagree among themselves. We attribute this divergence to the scales estimated in preprocessing from Table 2. Estimating a bias adds fidelity to the model, bringing estimates closer to those obtained in the separable version(s), providing further illustration (augmenting the discussion in Section 5.2) of the dual role of the discrepancy estimates in the calibration framework.

As in our synthetic examples, observe that we obtain a “noisy” profile of the log posterior in a search for û, although the objective is technically deterministic. When the data are highly informative about good û, leading to a peaked surface, the noise is negligible. However, when it is flatter, the noise is evident. Figure 8 shows both cases via a slice through the surface(s), fixing the electron flux limiter at its midway value. Being a more flexible model, with weaker identification, the
biased setup yields a much shallower log posterior surface. In the figure this is
revealed by the right-hand yaxes in both plots, compared to the left-hand ones.
Correspondingly, the red dots for biased posterior values are noisier. The shal-
lower and “noisier” surface may explain why NOMAD stopped earlier—possibly
prematurely—in the biased setup.
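The slice itself is just a loop over a grid. In the sketch below, obj() is a stand-in for the calibration objective of Algorithm 1; a jittered toy surface mimics how local re-approximation injects “noise” into an otherwise deterministic objective.

    ## profile over the energy scale factor (u1) with the electron flux
    ## limiter (u2) pinned at its midway value, as in Figure 8
    obj <- function(u) -50 * (u[1] - 0.77)^2 + rnorm(1, sd = 0.2)  ## toy
    u1 <- seq(0, 1, length = 50)
    lp <- sapply(u1, function(v) obj(c(v, 0.5)))
    plot(u1, lp, col = 2, xlab = "energy scale factor (coded)",
         ylab = "log posterior (slice)")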
FIG. 8. Slice(s) of the profile log posterior surface over energy scale factor with electron flux limiter fixed to its midway value in the range: isotropic on left; separable on right. In both plots, the left axes show the scale for the unbiased model, and the right for the biased one.

For a second look at uncertainty in $\hat{u}$, we re-performed inference on one hundred parametric bootstrap resamples of the field data observations $Y^F_{N_F}$. See, for example, Kleijnen (2014) for a nice review of the bootstrap applied to models of simulation experiments. The resulting estimates are shown as open circles in Figure 7. Observe that the bootstrap estimates agree with the heat plot depiction of the posterior density, as interpolated from the NOMAD samples. An exception may be the separable unbiased case (bottom right), which contains a dispersed cluster of lower energy scale factor estimates paired with larger estimated electron flux limiter settings. It is important to note that the bootstrap distribution would not, in general, be identical to the posterior surface. However, we draw comfort from their large degree of similarity in this example. The dual summaries of uncertainty in the figure(s) suggest that the $\hat{u}$-values we estimated from the original $Y^F_{N_F}$ are both representative (among open circles) and attain high probability (in light colored regions) under the posterior. If NOMAD is indeed converging prematurely in the biased setup, due to the “noise” in the objective, the bootstrap results suggest it is still finding highly probable $\hat{u}$ values.
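The bootstrap wrapper is simple, if expensive: each resample repeats the entire search. A minimal sketch, where yhatF, sigma.eps and calibrate() are stand-ins for the fitted means at the $N_F$ field sites, the estimated observation standard deviation, and the initial-design-plus-NOMAD optimization:

    ## parametric bootstrap for uncertainty in u-hat
    boot.uhat <- function(yhatF, sigma.eps, calibrate, B = 100) {
      out <- matrix(NA, B, 2)  ## one (u1-hat, u2-hat) row per resample
      for (b in 1:B) {
        yF.star <- yhatF + rnorm(length(yhatF), sd = sigma.eps)
        out[b, ] <- calibrate(yF.star)  ## re-run the full search
      }
      out  ## the open circles overlaid in Figure 7
    }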
6.3. Prediction. Next we make predictions at an interesting input setting provided to us by the CRASH team. The configuration is listed in the “nominal settings” column of Table 3. In past experiments, it was found that some of the desired input values were not exactly achieved when measured on the experimental apparatus (i.e., in the field). For example, the laser energy could be set to 4000 joules, but a laser energy of 3900 joules is what would be observed. Our aim here is to provide predictions for field data experiments before they are run on the apparatus. Therefore, for three of the variables the CRASH team provided a distribution over the inputs (third column in the table). In the case of Be thickness, no variation was observed in past experiments, but as a conservative accounting of uncertainty, the input was sampled from a uniform distribution within manufacturing specifications. We were asked to propagate these uncertainties through the calibrated predictive model(s).
TABLE 3
Settings and distributions for the design variables in the 2012 experiments. The Be thickness is uniform over the specified range, and the laser energy and Xe fill pressure are both normal with the specified mean and standard deviation

Input                       Nominal value   Distribution
Design variables
  Be thickness (microns)    21              Unif(20.5, 21.5)
  Laser energy (J)          3800            N(3800, 81.64)
  Xe fill pressure (atm)    1.15            N(1.15, 0.10)
  Tube diameter (microns)   1150
  Taper length (microns)    500
  Nozzle length (microns)   500
  Aspect ratio (microns)    2
  Time (ns)                 26
In this manner the exercise is one of propagation: uncertainty quantification in the most basic sense, determining how uncertain inputs filter through to uncertain outputs. As discussed in Section 4.3, our calibration is able to further account for some additional estimation uncertainties, like those from emulation, estimation of bias, and observation error $\sigma^2_\varepsilon$, but not others like $\hat{u}$, without further simulation (e.g., a bootstrap). To clarify, the scheme used here is as follows: (i) sample an input x according to Table 3; (ii) sample from the predictive distributions for y at that x given $\hat{u}$, as in Section 4.3; (iii) repeat. We note that augmenting with iteration over bootstrap estimates of $\hat{u}$ produces a slightly larger spread, but those results are not shown here. The goal of this experiment is to explore how a calibrated model (i.e., using one good choice of u) predicts in a small out-of-sample exercise.
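The scheme is a plain Monte Carlo loop. A self-contained sketch follows, with pred_draw() standing in for a draw from the calibrated predictive distribution of Section 4.3 (a toy version is supplied so the snippet runs); the fixed inputs of Table 3 are implicitly held at their nominal settings.

    ## propagate Table 3 input uncertainty through the calibrated model
    pred_draw <- function(x) sum(x) / 2 + rnorm(1, sd = 50)  ## toy stand-in
    y <- replicate(10000, {
      x <- c(runif(1, 20.5, 21.5),   ## (i) Be thickness (microns)
             rnorm(1, 3800, 81.64),  ## laser energy (J), sd in J
             rnorm(1, 1.15, 0.10))   ## Xe fill pressure (atm)
      pred_draw(x)                   ## (ii) predictive draw at x
    })
    plot(density(y))  ## a Figure 9-style predictive density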
FIG. 9. Predictive densities for the 2012 experiments. The acronyms IU, IB, SU, SB link boxplots on the right to the densities plotted on the left. M indicates the marginal computer model data; F indicates the marginal distribution of the field data.

Figure 9, focusing first on the left panel, shows the predictive distributions for our four variations. We first observe that, on the scale of the response marginalized over all inputs (roughly from 1000 to 4500), the predictive distributions are remarkably similar for all methods, despite choosing different $\hat{u}$ for the electron flux limiter. However, observe that estimating bias leads to predictions (red densities) exhibiting a greater degree of uncertainty. Those models involve extra estimating steps, and the random values of the nominal settings from Table 3 filter through to mean and variance values for the estimated bias. That the mode of the final distribution under the biased model (dashed red) is distinctly larger than the others, while at the same time providing substantial spread for smaller values (but not larger ones, i.e., it is skewed toward the modes of the others), suggests that these predictions are the most conservative. This squares well with an a priori preference for estimating bias and allowing separate lengthscales for each input.

The right panel shows a boxplot version of the same distributions alongside the output of a field experiment subsequently performed at the nominal input settings in Table 3. From the plot we can see that all four distributions were quite accurate, showing greatest agreement with the separable biased variation. We conclude that there is a certain robustness to our calibration exercise(s), lending assurances to the methodology generally, and to the predictions provided for the motivating application.
7. Discussion. Motivated by an experiment from radiative shock hydrodynamics, we presented a new approach to model calibration that can accommodate large computer experiments, which are increasingly common in simulation-based applied work. The cost of computation continues to fall, with more and more processor cores being packed onto motherboards, and more nodes into computing clusters, whereas the costs of field work remain constant (or are possibly increasing). Although the established, fully Bayesian KOH approach to calibration has many desirable features, we believe that it is too computationally heavy to thrive in this environment. Something thriftier, retaining many of the salient features of KOH, is increasingly essential.
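To put the computational gap in rough terms (a standard operation-count argument under dense covariance matrices, not a measured benchmark), each KOH likelihood evaluation decomposes the joint covariance of all $M$ simulations and $N_F$ field observations, whereas the local approach only ever factorizes small local designs:
$$
\mathcal{O}\big((M + N_F)^3\big) \text{ per MCMC iteration} \quad \text{versus} \quad \mathcal{O}(n^3) \text{ per predictive location, with } n \ll M.
$$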
Our method pairs local approximate Gaussian process (GP) emulation with a modularized approach to calibration, where the glue is a flexible derivative-free optimization method. The ingredients have been carefully chosen to work well from an engineering standpoint. All software deployed is open source and available in R. The extra subroutines we developed have been included in the laGP package on CRAN. During the time that this paper was under revision, we came across two works [Damblin et al. (2015), Wong, Storlie and Lee (2014)] attacking computer model calibration while leveraging similar themes: backing off of fully Bayesian aspects of KOH, and framing calibration as optimization. Like ours, both of these papers are motivated by pragmatism when it comes to devoting substantial computational resources to quantities which are poorly identified. Our method is unique in its treatment of large-scale computer model emulation via local approximation, and in providing open source software.
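As a pointer to that software, here is a minimal, hedged illustration of the emulation building block: local approximate GP prediction via laGP's aGP function on an augmented (x, u) input space, with u pinned at a chosen setting. The toy simulator and the particular argument choices are ours for illustration; the calibration wrapper in the package has a richer interface, documented there.

    library(laGP)
    XU <- matrix(runif(3000), ncol = 3)                 ## design over (x1, x2, u)
    Z <- sin(2*pi*XU[, 1]) + XU[, 3]*cos(2*pi*XU[, 2])  ## toy simulator output
    XF <- matrix(runif(40), ncol = 2)                   ## field input sites
    uhat <- 0.5                                         ## calibrated setting
    p <- aGP(XU, Z, cbind(XF, uhat), end = 50, verb = 0)  ## size-50 local designs
    cbind(mean = p$mean, var = p$var)[1:5, ]            ## predictive moments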
The biggest drawback of our approach is that it does not average over uncertainty in the estimated calibration parameter $\hat{u}$. As demonstrated in Figure 7, output from the scheme can provide insight into the posterior for u, giving an indication of how robust a particular choice of $\hat{u}$ might be. However, we do not provide a method for sampling from that distribution, as we believe that would require too much computation to be practical. As we demonstrate, a parametric bootstrap is always an option, which is a tack also taken by Wong, Storlie and Lee (2014). But in our real-data example, it would seem that a small amount of extra uncertainty comes at the very high price of 100× greater computational effort.
We observed, as many have previously, that the calibration apparatus can yield excellent predictions even when the estimated $\hat{u}$ is far from the true value. This can be attributed to the extreme flexibility afforded by coupled nonparametric regression models, of which GPs are just one example, which further leverage an augmented design space: the calibration parameters, u. Authors have recently exploited similar ideas toward tractable nonstationary modeling: in the first case, Ba and Joseph (2012) proposed coupling GPs, and in the second, Bornn, Shaddick and Zidek (2012) proposed auxiliary input variables. We were surprised to discover that the KOH calibration model, preceding these methods by nearly a decade, effectively nests them: the first without auxiliary inputs, and the second without bias.
REFERENCES

AUDET, C. and DENNIS, J. E., JR. (2006). Mesh adaptive direct search algorithms for constrained optimization. SIAM J. Optim. 17 188–217 (electronic). MR2219150

BA, S. and JOSEPH, V. R. (2012). Composite Gaussian process models for emulating expensive functions. Ann. Appl. Stat. 6 1838–1860. MR3058685

BASTOS, L. S. and O'HAGAN, A. (2009). Diagnostics for Gaussian process emulators. Technometrics 51 425–438. MR2756478

BOEHLY, T. R., BROWN, D. L., CRAXTON, R. S., KECK, R. L., KNAUER, J. P., KELLY, J. H., KESSLER, T. J., KUMPAN, S. A., LOUCKS, S. J., LETZRING, S. A., MARSHALL, F. J., MCCRORY, R. L., MORSE, S. F. B., SEKA, W., SOURES, J. M. and VERDON, C. P. (1997). Initial performance results of the OMEGA laser system. Opt. Commun. 133 495–506.

BORNN, L., SHADDICK, G. and ZIDEK, J. V. (2012). Modeling nonstationary processes through dimension expansion. J. Amer. Statist. Assoc. 107 281–289. MR2949359

CONN, A. R., SCHEINBERG, K. and VICENTE, L. N. (2009). Introduction to Derivative-Free Optimization. MPS/SIAM Series on Optimization 8. SIAM, Philadelphia, PA. MR2487816

DAMBLIN, G., BARBILLON, P., KELLER, M., PASANISI, A. and PARENT, E. (2015). Adaptive numerical designs for the calibration of computer models. Technical report, AgroParisTech.

DRAKE, R. P., DOSS, F. W., MCCLARREN, R. G., ADAMS, M. L., AMATO, N., BINGHAM, D., CHOU, C. C., DISTEFANO, C., FIDKOWSKI, K., FRYXELL, B., GOMBOSI, T. I., GROSSKOPF, M. J., HOLLOWAY, J. P., VAN DER HOLST, B., HUNTINGTON, C. M., KARNI, S., KRAULAND, C. M., KURANZ, C. C., LARSEN, E., VAN LEER, B., MALLICK, B., MARION, D., MARTIN, W., MOREL, J. E., MYRA, E. S., NAIR, V., POWELL, K. G., RAUCHWERGER, L., ROE, P., RUTTER, E., SOKOLOV, I. V., STOUT, Q., TORRALVA, B. R., TOTH, G., THORNTON, K. and VISCO, A. J. (2011). Radiative effects in radiative shocks in shock tubes. High Energy Density Phys. 7 130–140.

EIDSVIK, J., SHABY, B. A., REICH, B. J., WHEELER, M. and NIEMI, J. (2014). Estimation and prediction in spatial models with block composite likelihoods. J. Comput. Graph. Statist. 23 295–315. MR3215812

GOH, J., BINGHAM, D., HOLLOWAY, J. P., GROSSKOPF, M. J., KURANZ, C. C. and RUTTER, E. (2013). Prediction and computer model calibration using outputs from multifidelity simulators. Technometrics 55 501–512. MR3176554

GRAMACY, R. B. (2013). laGP: Local approximate Gaussian process regression. R package version 1.0.

GRAMACY, R. B. and APLEY, D. W. (2015). Local Gaussian process approximation for large computer experiments. J. Comput. Graph. Statist. 24 561–578. MR3357395

GRAMACY, R. and HAALAND, B. (2015). Speeding up neighborhood search in local Gaussian process prediction. Technometrics. To appear. Available at arXiv:1409.0074.

GRAMACY, R. B., NIEMI, J. and WEISS, R. M. (2014). Massively parallel approximate Gaussian process regression. SIAM/ASA J. Uncertain. Quantificat. 2 564–584. MR3283921

GRAMACY, R. B. and POLSON, N. G. (2011). Particle learning of Gaussian process models for sequential design and optimization. J. Comput. Graph. Statist. 20 102–118. MR2816540

HAALAND, B. and QIAN, P. Z. G. (2011). Accurate emulators for large-scale computer experiments. Ann. Statist. 39 2974–3002. MR3012398

HIGDON, D., KENNEDY, M., CAVENDISH, J. C., CAFEO, J. A. and RYNE, R. D. (2004). Combining field data and computer simulations for calibration and prediction. SIAM J. Sci. Comput. 26 448–466. MR2116355

KAUFMAN, C. G., BINGHAM, D., HABIB, S., HEITMANN, K. and FRIEMAN, J. A. (2011). Efficient emulators of computer experiments using compactly supported correlation functions, with an application to cosmology. Ann. Appl. Stat. 5 2470–2492. MR2907123

KENNEDY, M. C. and O'HAGAN, A. (2001). Bayesian calibration of computer models. J. R. Stat. Soc. Ser. B. Stat. Methodol. 63 425–464. MR1858398

KLEIJNEN, J. P. C. (2014). Simulation-optimization via Kriging and bootstrapping. J. Simul. 8 241–250.

LE DIGABEL, S. (2011). Algorithm 909: NOMAD: Nonlinear optimization with the MADS algorithm. ACM Trans. Math. Software 37 Art. 44, 15. MR2774836

LIU, F., BAYARRI, M. J. and BERGER, J. O. (2009). Modularization in Bayesian analysis, with emphasis on analysis of computer models. Bayesian Anal. 4 119–150. MR2486241

LOEPPKY, J., BINGHAM, D. and WELCH, W. (2006). Computer model calibration or tuning in practice. Technical report, Univ. British Columbia.

MACDONALD, N., RANJAN, P. and CHIPMAN, H. (2012). GPfit: An R package for Gaussian process model fitting using a new optimization algorithm. Technical report, Acadia Univ., Wolfville, Nova Scotia. Available at arXiv:1305.0759.

MCCLARREN, R., RYU, D., DRAKE, P., GROSSKOPF, M., BINGHAM, D., CHOU, C.-C., FRYXELL, B., VAN DER HOLST, B., HOLLOWAY, J., KURANZ, C., MALLICK, B., RUTTER, E. and TORRALVA, B. (2011). A physics informed emulator for laser-driven radiating shock simulations. Reliab. Eng. Syst. Saf. 96 1194–1207.

MCKAY, M. D., BECKMAN, R. J. and CONOVER, W. J. (1979). A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 21 239–245. MR0533252

MORRIS, M. D., MITCHELL, T. J. and YLVISAKER, D. (1993). Bayesian design and analysis of computer experiments: Use of derivatives in surface prediction. Technometrics 35 243–255. MR1234641

PACIOREK, C. J. and SCHERVISH, M. J. (2006). Spatial modelling using a new class of nonstationary covariance functions. Environmetrics 17 483–506. MR2240939

PACIOREK, C., LIPSHITZ, B., ZHUO, W., PRABHAT, KAUFMAN, C. and THOMAS, R. (2013). Parallelizing Gaussian process calculations in R. Technical report, Univ. California, Berkeley. Available at arXiv:1305.4886.

RACINE, J. S. and NIE, Z. (2012). crs: Categorical regression splines. R package version 0.15-18.

SACKS, J., WELCH, W. J., MITCHELL, T. J. and WYNN, H. P. (1989). Design and analysis of computer experiments. Statist. Sci. 4 409–435. MR1041765

SANG, H. and HUANG, J. Z. (2012). A full scale approximation of covariance functions for large spatial data sets. J. R. Stat. Soc. Ser. B. Stat. Methodol. 74 111–132. MR2885842

SANTNER, T. J., WILLIAMS, B. J. and NOTZ, W. I. (2003). The Design and Analysis of Computer Experiments. Springer, New York. MR2160708

SCHMIDT, A. M. and O'HAGAN, A. (2003). Bayesian inference for non-stationary spatial covariance structure via spatial deformations. J. R. Stat. Soc. Ser. B. Stat. Methodol. 65 743–758. MR1998632

SOBOL, W. (1993). Analysis of variance of “component stripping” decomposition of multiexponential curves. Comput. Methods Programs Biomed. 39 243–257.

STEIN, M. L., CHI, Z. and WELTY, L. J. (2004). Approximating likelihoods for large spatial data sets. J. R. Stat. Soc. Ser. B. Stat. Methodol. 66 275–296. MR2062376

VECCHIA, A. V. (1988). Estimation and model identification for continuous spatial processes. J. R. Stat. Soc. Ser. B. Stat. Methodol. 50 297–312. MR0964183

WONG, R. K., STORLIE, C. B. and LEE, T. C. (2014). A frequentist approach to computer model calibration. Technical report, Iowa State Univ., Ames, IA.
R. B. GRAMACY
BOOTH SCHOOL OF BUSINESS
UNIVERSITY OF CHICAGO
CHICAGO, ILLINOIS 60637
USA
E-MAIL: rbgramacy@chicagobooth.edu

D. BINGHAM
DEPARTMENT OF STATISTICS AND ACTUARIAL SCIENCE
SIMON FRASER UNIVERSITY
BURNABY, BRITISH COLUMBIA V5A 1S6
CANADA
E-MAIL: dbingham@stat.sfu.ca

J. P. HOLLOWAY, M. J. GROSSKOPF, C. C. KURANZ, E. R. RUTTER, M. T. TRANTHAM and R. P. DRAKE
CENTER FOR RADIATIVE SHOCK HYDRODYNAMICS
UNIVERSITY OF MICHIGAN
ANN ARBOR, MICHIGAN 48109
USA
E-MAIL: hagar@umich.edu, mikegros@umich.edu, ckuranz@umich.edu, ruttere@umich.edu, mtrantha@umich.edu, rpdrake@umich.edu