Design Editing for Offline Model-based Optimization
Ye Yuan1,2∗† Youyuan Zhang1∗ Can (Sam) Chen1,2 Haolun Wu1,2
Zixuan (Melody) Li1,2 Jianmo Li1 James J. Clark1 Xue Liu1
1McGill University, 2Mila - Quebec AI Institute
ye.yuan3@mail.mcgill.ca, youyuan.zhang@mail.mcgill.ca,
can.chen@mila.quebec, haolun.wu@mail.mcgill.ca,
zixuan.li3@mail.mcgill.ca, jianmo.li@mail.mcgill.ca,
james.clark1@mcgill.ca, xueliu@cs.mcgill.ca
Abstract
Offline model-based optimization (MBO) aims to maximize a black-box objective
function using only an offline dataset of designs and scores. These tasks span vari-
ous domains, such as robotics, material design, protein and molecular engineering.
A prevalent approach involves training a conditional generative model on existing
designs and their associated scores, followed by the generation of new designs
conditioned on higher target scores. However, these newly generated designs
often underperform due to the lack of high-scoring training data. To address this
challenge, we introduce a novel method, Design Editing for Offline Model-based
Optimization (DEMO), which consists of two phases. In the first phase, termed
pseudo-target distribution generation, we apply gradient ascent on the offline
dataset using a trained surrogate model, producing a synthetic dataset where the
predicted scores serve as new labels. A conditional diffusion model is subsequently
trained on this synthetic dataset to capture a pseudo-target distribution, which
enhances the accuracy of the conditional diffusion model in generating higher-
scoring designs. Nevertheless, the pseudo-target distribution is susceptible to noise
stemming from inaccuracies in the surrogate model, consequently predisposing the
conditional diffusion model to generate suboptimal designs. We hence propose
the second phase, existing design editing, to directly incorporate the high-scoring
features from the offline dataset into design generation. In this phase, top designs
from the offline dataset are edited by introducing noise and are subsequently
refined using the conditional diffusion model to produce high-scoring designs.
Overall, high-scoring designs begin by inheriting high-scoring features in the
second phase and are further refined with a more accurate conditional diffusion
model from the first phase. Empirical evaluations on 7 offline MBO tasks show that
DEMO outperforms various baseline methods, achieving the highest mean rank of
1.7 and median rank of 1. The source code is available here.
1 Introduction
In numerous fields, a primary goal is to innovate and design new objects with specific desired traits [1].
This encompasses areas like robotics, material design, protein and molecular engineering [2, 3, 4, 5].
∗Equal contribution with random order.
†Corresponding author.
Preprint. Under review.
Conventionally, these objectives are pursued by iteratively testing a black-box objective function that
maps a design to its property score. However, such testing can be expensive, time-consuming, or
even hazardous [3, 4, 5, 6, 7]. Thus, it is more feasible to utilize an existing offline dataset of designs
and their scores to find optimal solutions, without additional real-world testing [1]. This problem
is known as offline model-based optimization (MBO). The aim of MBO is to identify a design that
is known as offline model-based optimization (MBO). The aim of MBO is to identify a design that
optimizes the black-box objective function using only the offline dataset.
A common strategy in MBO involves training a conditional generative model on the available offline
dataset to capture the conditional probability distribution p(x|y), where x denotes designs and y
represents property scores. The model then generates new designs conditioned on higher target scores.
Essentially, conditional generative models are designed to establish a one-to-many relationship,
mapping property scores to all possible designs. This becomes particularly challenging when the
black-box objective function operates over a high-dimensional space. Fortunately, previous research
has demonstrated that generative techniques can be effective in solving offline MBO tasks. For
instance, the CbAS method utilizes a variational autoencoder [8], while MIN applies a generative
adversarial network (GAN) [9, 10]. DDOM extends these techniques by integrating a classifier-free
conditional diffusion model to enhance generative capabilities [11].
Nonetheless, one important yet unexplored problem with these generative model-based methods
is that they are trained solely on the offline dataset. This training approach results in models
that effectively mimic the distribution of the offline dataset they are trained on but fail to capture
the information of designs with higher scores. Therefore, while these models learn to replicate the
distribution of existing designs, they struggle to consistently produce new designs that significantly
outperform those in the offline dataset.
To address this challenge, we introduce an innovative and effective approach, Design Editing for
Offline Model-based Optimization (DEMO). DEMO is structured into two primary phases: pseudo-
target distribution generation and existing design editing. In the phase of pseudo-target distribution
generation, to address the scarcity of high-scoring training data, we first augment new data by utilizing
the offline dataset, which may contain pairs of superconductor materials and critical temperatures
for example. To achieve this, a surrogate model, represented as f_θ(·), is trained on the offline
dataset D, and gradient ascent is applied to existing designs with respect to the surrogate model,
creating a synthetic dataset D′ with predicted scores as new labels. As illustrated in Figure 1 (a),
the surrogate model fits the offline data p_1 to p_5, generating new data points p_a and p_b through
gradient ascent. Subsequently, a classifier-free conditional diffusion model is trained on D′ to learn
the conditional probability distribution of these synthetic designs along with their predicted scores.
This diffusion model characterizes a pseudo-target distribution, which has improved accuracy in
generating higher-scoring designs.
Figure 1: Illustration of DEMO: A conditional diffusion model, acting as the pseudo-target distribu-
tion, is trained on a synthetic dataset produced through a surrogate model. New designs are generated
by modifying top existing designs using the diffusion model, under the guidance of target scores.
However, as shown in Figure 1 (a), the surrogate model may not accurately capture the black-box
objective function, resulting in the pseudo-target distribution possibly containing noisy information
stemming from the surrogate model. Thus, generating directly from the pseudo-target distribution
could lead to some suboptimal designs, which have high predicted scores but low ground-truth
scores, necessitating the second phase of DEMO. This phase, termed existing design editing, directly
incorporates the high-scoring features from the offline dataset to provide more guidance to the design
generation process. Specifically, we edit top designs from the offline dataset by introducing random
noise to them and employing the conditional diffusion model from the first phase to remove the noise,
guided by higher target scores. As illustrated in Figure 1 (a), after injecting noise, the distribution
of top designs in the offline dataset (represented by the purple contour) has more overlap with
the pseudo-target distribution (represented by the orange contour). By progressively removing the
noise, we gradually project these existing top designs to the manifold of higher-scoring designs, as
demonstrated in Figure 1 (b). In essence, DEMO produces new designs that first inherit high-scoring
features from existing designs and are then refined by a more accurate conditional diffusion model.
In summary, this paper makes three principal contributions:
• We introduce a novel method, Design Editing for Offline Model-based Optimization (DEMO).
DEMO operates in two main phases: the first, pseudo-target distribution generation, involves
employing a surrogate model to create a synthetic dataset and training a conditional diffusion model
on this synthetic dataset to serve as the pseudo-target distribution.
• The second phase, existing design editing, introduces random noise to existing top designs and
uses the trained conditional diffusion model to refine them, resulting in designs which not only
inherit high-scoring features from existing top designs but also achieve higher scores by leveraging
information from the pseudo-target distribution.
• Extensive experiments demonstrate that DEMO effectively and reliably generates new designs,
yielding state-of-the-art results across 7 offline MBO tasks, with a mean rank of 1.7 and a median
rank of 1 among 16 methods.
2 Preliminary
2.1 Offline Model-based Optimization
Offline model-based optimization (MBO) addresses a range of optimization challenges with the aim
of maximizing a black-box objective function based on an offline dataset. Mathematically, we define
the valid design space as X = R^d, with d representing the dimension of the design. Offline MBO is
formulated as:

$$x^* = \arg\max_{x \in \mathcal{X}} f(x), \tag{1}$$
where f(·) is the black-box objective function, and x ∈ X is a potential design. For the optimization
process, we utilize an offline dataset D = {(x_i, y_i)}_{i=1}^N, with x_i representing an existing design,
such as a superconductor material, and y_i representing the associated property score, such as the
critical temperature. Usually, this optimization process outputs K candidates for optimal designs,
where K is a small budget to test the black-box objective function. The offline MBO problem also
finds applications in other areas, like robot design, as well as protein and molecule engineering.
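To make this setup concrete, here is a minimal sketch of the offline MBO protocol with randomly generated placeholder data; `propose_candidates` is a hypothetical naive baseline, not the method proposed in this paper:

```python
import numpy as np

# Placeholder offline dataset: N designs in R^d with associated scores.
N, d, K = 10_000, 60, 256
X = np.random.randn(N, d)   # designs x_i
y = np.random.randn(N)      # offline scores y_i = f(x_i)

def propose_candidates(X, y, K):
    """Naive baseline: return the K highest-scoring existing designs.
    An offline MBO method replaces this with a learned optimizer."""
    top = np.argsort(y)[-K:]
    return X[top]

candidates = propose_candidates(X, y, K)   # shape (K, d)
# The black-box f(.) may only be queried on these K candidates after
# optimization (e.g., a physical experiment); it is never available
# during training.
```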
2.2 Classifier-free Conditional Diffusion Models
Diffusion models stand out in the family of generative models due to their unique approach involving
forward diffusion and backward denoising processes. The essence of diffusion models is to gradually
add noise to a sample, followed by training a neural network to reverse this noise addition, thus
recovering the original data distribution. In this work, we follow the formulation of diffusion models
with continuous time [12, 13]. Here, x_t is a random variable denoting the state of a data point at time
t ∈ [0, T]. The diffusion process is defined by a stochastic differential equation (SDE):

$$dx = f(x, t)\, dt + g(t)\, dw, \tag{2}$$
where f(·, t) is the drift coefficient of x_t, g(·) is the diffusion coefficient of x_t, and w is a standard
Wiener process. The backward denoising process is given by the reverse-time SDE:

$$dx = \left[ f(x, t) - g(t)^2 \nabla_x \log p_t(x) \right] dt + g(t)\, d\bar{w}, \tag{3}$$
where dt represents a negative infinitesimal step in time, and w̄ is a reverse-time Wiener process. The
gradient of the log probability, ∇_x log p_t(x), is approximated by a neural network s_ϕ(x_t, t) with
score-matching objectives [14, 15].
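To sketch how Eq. (3) is simulated in practice, the step below applies a generic Euler-Maruyama discretization of the reverse-time SDE; `drift_f`, `diffusion_g`, and `score_model` are placeholder callables for f(·, t), g(·), and the learned score s_ϕ, and this simple scheme is illustrative rather than the sampler used later in the paper:

```python
import torch

def reverse_sde_step(x, t, dt, drift_f, diffusion_g, score_model):
    """One Euler-Maruyama step of the reverse-time SDE in Eq. (3),
    integrating from time t down to t - dt (dt > 0)."""
    g = diffusion_g(t)
    score = score_model(x, t)                 # approximates grad_x log p_t(x)
    drift = drift_f(x, t) - (g ** 2) * score  # reverse-time drift
    noise = torch.randn_like(x)
    return x - drift * dt + g * (dt ** 0.5) * noise
```

Iterating this step from t = T down to t = 0, starting from a Gaussian sample, yields an approximate sample from the data distribution.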
Beyond basic diffusion models, our focus is to train a conditional diffusion model that learns
the conditional probability distribution of designs based on their associated property scores. To
incorporate conditions into diffusion models, Ho et al. [16] divide the score function
into a combination of conditional and unconditional components, known as classifier-free diffusion
models. Specifically, a single neural network, s_ϕ(x_t, t, y), is trained to handle both components by
utilizing y as the condition or leaving it empty for the unconditional component. Formally, the guided
score s̃_ϕ is the combination

$$\tilde{s}_\phi(x_t, t, y) = (1 + \omega)\, s_\phi(x_t, t, y) - \omega\, s_\phi(x_t, t), \tag{4}$$
where ω is a parameter that adjusts the influence of the conditions. A higher value of ω ensures that
the generation process adheres more closely to the specified conditions, while a lower ω value allows
greater flexibility in the outputs.
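A minimal sketch of Eq. (4) follows; representing the empty condition with `None` is an implementation assumption (a learned null embedding or a sentinel value are common alternatives):

```python
import torch

def guided_score(score_net, x_t, t, y, omega):
    """Classifier-free guidance (Eq. 4): blend the conditional and
    unconditional score estimates with guidance weight omega."""
    s_cond = score_net(x_t, t, y)        # s_phi(x_t, t, y)
    s_uncond = score_net(x_t, t, None)   # s_phi(x_t, t): condition dropped
    return (1.0 + omega) * s_cond - omega * s_uncond
```

During training, the condition y is randomly dropped for a fraction of samples so that the single network learns both the conditional and unconditional score functions.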
3 Related Works
3.1 Offline Model-based Optimization
Recent offline model-based optimization (MBO) techniques broadly fall into two categories: (i)
those that employ gradient-based optimizations and (ii) those that create new designs via generative
models. Gradient-based methods often employ regularization techniques that enhance either the
surrogate model [17, 18, 19] or the design itself [20, 21], thus improving the model's robustness and
generalization capacity. It is worth noting that while some approaches also involve synthesizing new
data with pseudo-labels [22, 23], they aim to identify useful information from these synthetic data to
correct the surrogate model's inaccuracies. The second category encompasses methods that learn
to replicate the distribution of existing designs and includes approaches such as MIN [9], CbAS [8],
Auto CbAS [24], and DDOM [11]. These methods are known for their ability to generate innovative
designs by sampling from learned distributions. DEMO distinguishes itself by training a conditional
diffusion model that learns a pseudo-target distribution and incorporating features from existing top
designs, which facilitates effectively and consistently generating new superior candidates.
3.2 Diffusion-Based Editing
Diffusion models have shown remarkable success in various generation tasks across multiple modal-
ities, especially for their ability to control the generation process based on given conditions. For
instance, recent advancements have utilized diffusion models for zero-shot, test-time editing in the
domains of text-based image and video generation. SDEdit [25] employs an editing strategy to
balance realism and faithfulness in image generation. To improve reconstruction quality, methodologies
such as DDIM Inversion [26], Null-text Inversion [27], and Negative-prompt Inversion [28]
concentrate on deterministic mappings from source latents to initial noise, conditioned on source text.
Building on these, CycleDiffusion [29] and Direct Inversion [30] leverage source latents from each
inversion step and further improve the faithfulness of the target image to the source image. Following
the image editing technique, several video editing methods [31, 32, 33, 34, 35, 36] adopt image
diffusion models and enforce temporal consistency across frames, offering practical and efficient
solutions for video editing. Inspired by the success of these editing techniques in the field of computer
vision, we edit existing top designs towards a pseudo-target distribution in the context of the offline
MBO problem, enhancing both the effectiveness and reliability of generating new designs.
4 Methodology
In this section, we elaborate on the details of our proposed Design Editing for Offline Model-based
Optimization (DEMO), including two phases. We introduce the first phase, named pseudo-target
distribution generation, in section 4.1. This phase trains a conditional diffusion model, serving as the
pseudo-target distribution, on a synthetic dataset created by performing gradient ascent with respect
to a surrogate model trained on the offline dataset. While the first phase achieves a more accurate
conditional diffusion model capable of generating designs with higher scores than a model trained
solely on the offline dataset, it is susceptible to noise caused by inaccuracies in the surrogate model.
This motivates the second phase, termed Existing Design Editing, described in section 4.2, which
explicitly incorporates high-scoring features from existing top designs. Intuitively, one can make an
analogy of our method to writing code for a new research project. In coding for research, the initial
step often involves sourcing and adapting useful existing code from previous projects, tailoring it to
new requirements through modifications and enhancements. In a similar fashion, DEMO generates
new designs by initially inheriting high-scoring features from top existing designs (akin to reusing
existing code) and subsequently refining them through a more accurate conditional diffusion model
(akin to modifying code for a new purpose). Algorithm 1 illustrates the complete process of DEMO.
4.1 Pseudo-target Distribution Generation
Due to the scarcity of high-scoring training data, conditional generative models trained only on the
offline dataset often fail to consistently produce new designs that substantially surpass the existing
ones. One promising yet underutilized approach to address this issue is to generate a synthetic dataset
first, by applying gradient ascent on existing designs using a trained surrogate model. Conditional
generative models trained on this synthetic dataset capture a pseudo-target distribution and are
more adept at creating designs with higher scores.
Creation of Synthetic Dataset. Initially, a deep neural network (DNN), denoted as f_θ(·) with
parameters θ, is trained on the offline dataset D = {(x_i, y_i)}_{i=1}^N, where x_i and y_i denote a design
and its associated score, respectively. The parameters θ are optimized as:

$$\theta^* = \arg\min_\theta \frac{1}{N}\sum_{i=1}^{N}\left(f_\theta(x_i) - y_i\right)^2. \tag{5}$$
The solution f_{θ*}(·) obtained from Eq. (5) serves as a surrogate for the unknown black-box objective
function f(·) in Eq. (1). New data are then generated by performing gradient ascent on the existing
designs with respect to the learned surrogate model f_{θ*}(·). For a design x_i in D, we update it as:

$$x_{i,t} = x_{i,t-1} + \eta \nabla_x f_{\theta^*}(x)\big|_{x = x_{i,t-1}}, \quad \text{for } t \in \{1, \cdots, T\}, \tag{6}$$
where T is the total number of iterations, and η is the step size for the gradient ascent update. The
initial point x_{i,0} is the same as x_i, and x_{i,T}, acquired at step T, is a synthetic design with an
enhanced predicted score. By iteratively using each design in the offline dataset D as the initial point,
a synthetic dataset D′ of the same size as D is created, with predicted scores as labels. This process is
outlined from line 2 to line 8 in Algorithm 1.
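A compact PyTorch sketch of this procedure (Eqs. (5) and (6)) is given below; the dataset is a random placeholder, and the network architecture, epoch count, and learning rate are simplified relative to the full protocol in Section 5.4:

```python
import torch
import torch.nn as nn

# Placeholder offline dataset; in practice (X, y) come from the benchmark task.
N, d = 1024, 60
X, y = torch.randn(N, d), torch.randn(N)

# Eq. (5): fit the surrogate f_theta by MSE regression.
surrogate = nn.Sequential(nn.Linear(d, 2048), nn.ReLU(), nn.Linear(2048, 1))
opt = torch.optim.Adam(surrogate.parameters())
for _ in range(200):
    loss = ((surrogate(X).squeeze(-1) - y) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

def gradient_ascent(x0, steps=100, eta=1e-3):
    """Eq. (6): move each design uphill on the learned surrogate."""
    x = x0.clone()
    for _ in range(steps):
        x.requires_grad_(True)
        grad = torch.autograd.grad(surrogate(x).sum(), x)[0]
        x = (x + eta * grad).detach()
    return x

X_synth = gradient_ascent(X)                       # synthetic designs for D'
y_synth = surrogate(X_synth).squeeze(-1).detach()  # predicted scores as labels
```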
Training of Conditional Diffusion Model. We employ a classifier-free conditional diffusion
model [16] to learn the conditional probability distribution of synthetic designs and their predicted
scores in D′, which captures a pseudo-target distribution. Following the approach in DDOM [11], we
use the Variance Preserving (VP) stochastic differential equation (SDE) for the forward diffusion
process, as specified in [12]:

$$dx = -\frac{\beta(t)}{2}x\, dt + \sqrt{\beta(t)}\, dw, \tag{7}$$
where β(t) is a continuous time function for t ∈ [0, 1]. The forward process in DDPM [37] is proved
to be a discretization of Eq. (7) [12]. To integrate conditions in the backward denoising process,
we need to train a DNN s_ϕ(x_t, t, y) with parameters ϕ, conditioned on the time t and the score y
associated with the unperturbed design x_0 corresponding to x_t. The parameters ϕ are optimized as:

$$\phi^* = \arg\min_\phi \; \mathbb{E}_t\, \lambda(t)\, \mathbb{E}_{x_0, y}\, \mathbb{E}_{x_t | x_0} \left\| s_\phi(x_t, t, y) - \nabla_x \log p_t(x_t | x_0) \right\|^2, \tag{8}$$
where λ(t) is a positive weighting function depending on time. Since we train on the synthetic
dataset D′, the model optimized according to Eq. (8) more accurately represents the gradient of the
logarithm of a pseudo-target distribution. This distribution essentially reflects the marginal probability
distribution of designs that have enhanced predicted scores. With the optimized model s_{ϕ*}(x_t, t, y),
we thereby improve the accuracy in generating new high-scoring designs by simulating the backward
denoising process. This part is described in line 9 of Algorithm 1.
4.2 Existing Design Editing
Due to potential inaccuracies of the surrogate model f_{θ*}(·) in representing the black-box objective
function, the synthetic dataset D′ might include noisy data. Therefore, directly generating from the
pseudo-target distribution could lead to suboptimal new designs. Driven by the success of editing
techniques in image synthesis tasks [25, 38], we explore the potential of creating new designs from
top existing designs, instead of initiating from a random latent variable sampled from the standard
Gaussian prior. We perturb x_top by introducing noise at a specific time m out of {1, ···, M} and
auxiliary noise levels β_1, ···, β_M:

$$x_{\text{perturb}} = x_{\text{top}} + \sqrt{1 - \bar{\alpha}_m}\, \epsilon, \tag{9}$$

where α_m = 1 − β_m, ᾱ_m = ∏_{s=1}^{m} α_s, and ϵ ∼ N(0, I). This results in a closed form that samples
x_perturb ∼ N(x_top, (1 − ᾱ_m)I). The perturbed design is then used as the starting point. Given a
target property score ŷ, a new design is synthesized using a second-order Heun's sampler [11] with
the model s_{ϕ*}(·). To yield K candidate optimal designs, we select the top K designs from D to
obtain various perturbed designs and denoise them conditioned on ŷ. Lines 11 to 16 of Algorithm 1
present the process of this phase.
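A sketch of the editing step is shown below; the DDPM-style β schedule and the `denoise_from` callable (standing in for the second-order Heun sampler with s_{ϕ*}) are illustrative assumptions:

```python
import torch

M = 1000
betas = torch.linspace(1e-4, 0.02, M)          # assumed auxiliary noise levels
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # alpha_bar_m = prod_s alpha_s

def edit_design(x_top, m, denoise_from, y_target=1.0):
    """Eq. (9): perturb a top design, then denoise it back conditioned on
    the target score; m in {1, ..., M} controls the edit strength."""
    eps = torch.randn_like(x_top)
    x_perturb = x_top + torch.sqrt(1.0 - alpha_bar[m - 1]) * eps
    # `denoise_from` runs the reverse process from time m down to 0 with the
    # trained conditional diffusion model, guided by y_target.
    return denoise_from(x_perturb, m, y_target)
```

With the setting m = 400 out of M = 1000 (Section 5.4), the perturbation is strong enough to overlap with the pseudo-target distribution while still anchoring generation to the top design's features.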
Algorithm 1 Design Editing for Offline Model-based Optimization
Input: Offline dataset D = {(x_i, y_i)}_{i=1}^N, a target score ŷ, and a time m.
Output: K candidate optimal designs.
1:  /* Pseudo-target Distribution Generation */
2:  Initialize a surrogate model f_θ(·) and optimize θ with Eq. (5) to obtain f_{θ*}(·).
3:  D′ = {}
4:  for i = 1, 2, ···, N do
5:      x_{i,0} ← x_i
6:      for t = 1, 2, ···, T do
7:          Update x_{i,t} with Eq. (6).
8:      Append (x_{i,T}, f_{θ*}(x_{i,T})) to D′.
9:  Initialize s_ϕ(·) and optimize ϕ with Eq. (8) on D′ to obtain s_{ϕ*}(·).
10: /* Existing Design Editing */
11: Candidates = {}
12: for k = 1, 2, ···, K do
13:     Select design x_top with the k-th best score among all designs in D.
14:     Perturb x_top with Eq. (9) and the given time m.
15:     Denoise x_perturb and generate x_new using Heun's method with s_{ϕ*}(·) and ŷ.
16:     Append x_new to Candidates.
17: return Candidates
5 Experiments
This section first describes the experiment setup, followed by the implementation details and results.
We aim to answer the following questions in this section: (Q1) Is our proposed DEMO more effective
than baseline methods in solving the offline MBO problem? (Q2) Are the two phases described
in Section 4 both necessary? (Q3) Compared to existing generative model-based approaches, can
DEMO more reliably and consistently generate new higher-scoring designs?
5.1 Dataset and Tasks
We carry out experiments on 7 tasks selected from Design-Bench [1] and BayesO Benchmarks [39],
including 4 continuous tasks and 3 discrete tasks. The continuous tasks are as follows: (i) Superconductor
(SuperC) [5], where the goal is to create a superconductor with 86 continuous components
to maximize critical temperature, using 17,010 designs; (ii) Ant Morphology (Ant) [1, 40], where
the objective is to design a four-legged ant with 60 continuous components to increase crawling
speed, based on 10,004 designs; (iii) D'Kitty Morphology (D'Kitty) [1, 41], where the focus is on
designing a four-legged D'Kitty with 56 continuous components to enhance crawling speed, using
10,004 designs; (iv) Inverse Levy Function (Levy) [39], where the aim is to maximize function
values of the inverse black-box Levy function with 60 input dimensions, using 15,000 designs. The
discrete tasks include: (v) TF Bind 8 (TF8) [6], where the goal is to identify an 8-unit DNA sequence
that maximizes binding activity score, with 32,898 designs; (vi) TF Bind 10 (TF10) [6], where the
objective is to find a 10-unit DNA sequence that optimizes binding activity score, using 50,000
designs; (vii) NAS [42], where the aim is to discover the optimal neural network architecture to
improve test accuracy on the CIFAR-10 dataset [43], using 1,771 designs.
5.2 Evaluation and Metrics
Following the evaluation protocol used in previous studies [1, 11, 22], we assume the budget K = 256
and generate 256 new designs for each method. The 100-th (max) percentile normalized ground-truth
score is reported in Section 5.5, and the 50-th (median) percentile score is provided in Appendix A.1.
This normalized score is calculated as

$$y_n = \frac{y - y_{\min}}{y_{\max} - y_{\min}},$$

where y_min and y_max are the minimum and maximum scores in the entire offline dataset, respectively.
For better comparison, we include the normalized score of the best design in the offline dataset,
denoted as D(best). Additionally, we provide mean and median rankings across all 7 tasks for a
comprehensive performance evaluation.
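A small sketch of this metric computation, with placeholder scores and normalization bounds:

```python
import numpy as np

def normalized_scores(y_candidates, y_min, y_max):
    """y_n = (y - y_min) / (y_max - y_min)."""
    y = np.asarray(y_candidates, dtype=float)
    return (y - y_min) / (y_max - y_min)

# Report the 100th (max) and 50th (median) percentiles over K = 256 candidates;
# the raw scores below are random placeholders.
y_n = normalized_scores(np.random.randn(256), y_min=-2.0, y_max=3.0)
print(y_n.max(), np.median(y_n))
```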
5.3 Comparison Methods
We benchmark DEMO against three groups of baseline approaches: (i) traditional methods, (ii)
those utilizing gradient optimizations from current designs, and (iii) those employing generative
models for sampling. Traditional methods include: (1) BO-qEI [44]: conducts Bayesian Optimization
to maximize the surrogate, proposes designs using the quasi-Expected-Improvement acquisition
function, and labels the designs using the surrogate model. (2) CMA-ES [45]: progressively adjusts
the distribution toward the optimal design by altering the covariance matrix. (3) REINFORCE [46]:
optimizes the distribution over the input space using the learned surrogate. The second category
includes: (4) Grad: performs simple gradient ascent on existing designs to create new ones. (5)
Mean: optimizes the average prediction of the ensemble of surrogate models. (6) Min: optimizes the
lowest prediction from a group of learned objective functions. (7) COMs [18]: applies regularization
to assign lower scores to designs derived through gradient ascent. (8) ROMA [17]: introduces
smoothness regularization to the DNN. (9) NEMO [19]: limits the discrepancy between the surrogate
and the black-box objective function using normalized maximum likelihood before performing
gradient ascent. (10) BDI [21]: employs forward and backward mappings to transfer knowledge from
the offline dataset to the design. (11) IOM [47]: ensures representation consistency between the
training dataset and the optimized designs. Generative model-based methods include: (12) CbAS [8],
which adapts a VAE model to steer the design distribution toward areas with higher scores. (13)
Auto CbAS [24], which uses importance sampling to update a regression model based on CbAS. (14)
MIN [9], which establishes a relationship between scores and designs and seeks optimal designs
within this framework. (15) DDOM [11], which learns a generative diffusion model conditioned on
the score values.
5.4 Implementation Details
We follow the training protocols from [18] for all comparative methods unless stated otherwise. A
3-layer MLP with ReLU activation is used for both f_θ(·) and s_ϕ(·), with a hidden layer size of 2048.
In Algorithm 1, the iteration count, T, is set at 100 for both continuous and discrete tasks. The
Adam optimizer [48] is utilized to train the surrogate models over 200 epochs with a batch size of 128
and a learning rate set at 1e−1. The step size, η, in Eq. (6) is configured at 1e−3 for continuous
tasks and 1e−1 for discrete tasks. The conditional diffusion model, s_ϕ(·), undergoes training for 1000
epochs with a batch size of 128. For existing design editing, following precedents set by
previous studies [49, 11], we assign a target score, ŷ, of 1 and set M to 1000. The selected value of m is
400, with further elaboration provided in Appendix A.2. Results from traditional methodologies are
referenced from [1], and we conduct 8 independent trials for other methods, reporting the mean and
standard error. All experiments are conducted on a single NVIDIA V100 GPU, with execution times
per trial ranging from 10 minutes to 20 hours, depending on the specific tasks.
Table 1: Experimental results on continuous tasks for comparison.
Method Superconductor Ant Morphology D’Kitty Morphology Levy
D(best) 0.399 0.565 0.884 0.613
BO-qEI 0.402 ±0.034 0.819 ±0.000 0.896 ±0.000 0.810 ±0.016
CMA-ES 0.465 ±0.024 1.214 ±0.732 0.724 ±0.001 0.887 ±0.025
REINFORCE 0.481 ±0.013 0.266 ±0.032 0.562 ±0.196 0.564 ±0.090
Grad 0.489 ±0.018 0.927 ±0.027 0.949 ±0.014 0.948 ±0.031
Mean 0.505 ±0.013 0.940 ±0.014 0.956 ±0.014 0.984 ±0.023
Min 0.501 ±0.019 0.918 ±0.034 0.942 ±0.009 0.964 ±0.023
COMs 0.481 ±0.028 0.842 ±0.037 0.926 ±0.019 0.936 ±0.025
ROMA 0.509 ±0.015 0.916 ±0.030 0.929 ±0.013 0.976 ±0.019
NEMO 0.502 ±0.002 0.955 ±0.006 0.952 ±0.004 0.969 ±0.019
BDI 0.513 ±0.000 0.906 ±0.000 0.919 ±0.000 0.938 ±0.000
IOM 0.518 ±0.020 0.922 ±0.030 0.944 ±0.012 0.988 ±0.021
CbAS 0.503 ±0.069 0.876 ±0.031 0.892 ±0.008 0.938 ±0.037
Auto CbAS 0.421 ±0.045 0.882 ±0.045 0.906 ±0.006 0.797 ±0.033
MIN 0.499 ±0.017 0.445 ±0.080 0.892 ±0.011 0.761 ±0.037
DDOM 0.486 ±0.013 0.952 ±0.007 0.941 ±0.006 0.927 ±0.031
DEMO (ours) 0.520 ±0.006 0.971 ±0.005 0.957 ±0.006 1.005 ±0.020
Table 2: Experimental results on discrete tasks, and ranking on all tasks for comparison.
Method TF Bind 8 TF Bind 10 NAS Rank Mean Rank Median
D(best) 0.439 0.467 0.436
BO-qEI 0.798 ±0.083 0.652 ±0.038 1.079 ±0.059 11.1/16 13/16
CMA-ES 0.953 ±0.022 0.670 ±0.023 0.985 ±0.079 7.1/16 3/16
REINFORCE 0.948 ±0.028 0.663 ±0.034 −1.895 ±0.000 12.1/16 16/16
Grad 0.898 ±0.033 0.638 ±0.022 0.611 ±0.052 8.9/16 10/16
Mean 0.895 ±0.020 0.654 ±0.028 0.663 ±0.058 6.4/16 5/16
Min 0.931 ±0.036 0.634 ±0.033 0.708 ±0.027 8.0/16 8/16
COMs 0.474 ±0.053 0.625 ±0.010 0.796 ±0.029 11.1/16 12/16
ROMA 0.921 ±0.040 0.669 ±0.035 0.934 ±0.025 5.7/16 4/16
NEMO 0.942 ±0.003 0.708 ±0.010 0.735 ±0.012 4.6/16 5/16
BDI 0.870 ±0.000 0.605 ±0.000 0.722 ±0.000 9.7/16 10/16
IOM 0.870 ±0.074 0.648 ±0.025 0.411 ±0.044 7.6/16 7/16
CbAS 0.927 ±0.051 0.651 ±0.060 0.683 ±0.079 9.3/16 8/16
Auto CbAS 0.910 ±0.044 0.630 ±0.045 0.506 ±0.074 12.4/16 13/16
MIN 0.905 ±0.052 0.616 ±0.021 0.717 ±0.046 12.3/16 13/16
DDOM 0.961 ±0.024 0.640 ±0.029 0.737 ±0.014 7.3/16 7/16
DEMO (ours) 0.980 ±0.004 0.762 ±0.058 0.766 ±0.017 1.7/16 1/16
5.5 Results
Performance in Continuous Tasks. Table 1 presents the results of the 4 continuous tasks. DEMO
reaches state-of-the-art performance on all of them. When compared to other generative model-based
approaches, such as MIN and DDOM, DEMO generally outperforms them because these methods
train models only on the offline dataset and may not learn characteristics of higher-scoring designs.
DEMO achieves better performance by effectively mitigating this issue. Moreover, DEMO beats
gradient-based methods, like Grad and COMs, by leveraging guidance from existing top designs and
a higher target score simultaneously. This indicates that DEMO is effective for continuous tasks.
Performance in Discrete Tasks. Table 2 exhibits the results of the 3 discrete tasks. DEMO attains
top performances in TF Bind 8 and TF Bind 10, where the results on TF10 surpass other methods by a
significant margin, suggesting the ability of DEMO to solve discrete offline MBO tasks. Nonetheless,
DEMO underperforms on NAS, which might be caused by two reasons. First, each neural network
architecture is encoded as a sequence of one-hot vectors, which has a length of 64. This encoding
process might be incapable of precisely representing all features of a given architecture, inducing
undesirable performance on NAS. Furthermore, after checking the offline dataset of NAS, we find
that many existing designs share commonalities. This redundancy means that the offline dataset
of NAS contains less useful information than those of other tasks, which further explains why the
performance of DEMO on NAS is not as strong.
Figure 2: The black triangles represent the mean rank, and the vertical sticks showcase the median
rank. The whiskers indicate the minimum and maximum of the rank.

Figure 3: The proportion is calculated as the number of new designs which surpass D(best) divided
by the budget 256, indicating the reliability to consistently generate new higher-scoring designs.
This figure demonstrates that DEMO is more reliable than DDOM in all tasks.

Summary. These results on both continuous and discrete tasks soundly answer Q1. DEMO attains
the highest rankings with a mean of 1.7/16 and a median of 1/16, as detailed in Table 2 and Figure 2,
and secures top performances in all tasks. We have further run a Welch's t-test on the tasks
where DEMO obtains state-of-the-art results. We obtain p-values of 0.007 on SuperC, 0.00003 on
Ant, 0.08 on D'Kitty, 0.005 on Levy, 0.005 on TF8, and 0.02 on TF10. This confirms that DEMO
accomplishes statistically significant improvements in 5/7 tasks.
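The significance test above corresponds to a standard Welch's t-test over the per-trial scores; a sketch with placeholder samples:

```python
import numpy as np
from scipy import stats

# Per-trial max normalized scores for DEMO and a baseline on one task
# (8 independent trials each; the values below are placeholders).
demo = np.array([0.52, 0.51, 0.53, 0.52, 0.52, 0.51, 0.53, 0.52])
base = np.array([0.50, 0.49, 0.52, 0.51, 0.50, 0.49, 0.51, 0.50])

# equal_var=False selects Welch's t-test (no equal-variance assumption).
t_stat, p_value = stats.ttest_ind(demo, base, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```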
5.6 Ablation Study
Table 3: Ablation studies on two phases of DEMO.
Task Dim. DEMO w/o pseudo-target w/o editing
SuperC 86 0.520 ±0.006 0.487 ±0.012 0.482 ±0.013
Ant 60 0.971 ±0.005 0.945 ±0.016 0.963 ±0.008
D'Kitty 56 0.957 ±0.006 0.955 ±0.005 0.933 ±0.002
Levy 60 1.005 ±0.020 0.901 ±0.029 0.990 ±0.020
TF8 8 0.980 ±0.004 0.757 ±0.063 0.965 ±0.008
TF10 10 0.762 ±0.058 0.626 ±0.009 0.658 ±0.019
NAS 64 0.766 ±0.017 0.741 ±0.022 0.668 ±0.084
To rigorously assess the individual contributions of pseudo-target distribution generation
(pseudo-target) and existing design editing (editing) within our DEMO method, ablation experiments
are conducted by systematically removing each phase. The omission of the pseudo-target phase
includes training a conditional diffusion model only on the offline dataset and then applying the
editing phase. In contrast, the removal of the editing phase involves using the model trained during
the pseudo-target phase to generate new designs starting from random Gaussian noise.
The results, as summarized in Table 3, provide clear insights into the impact of these modifications.
For the 4 continuous tasks, DEMO consistently achieves higher performance compared to its ablated
versions. For instance, in the task SuperC, DEMO achieves a score of 0.520 ±0.006, significantly
higher than the versions without the pseudo-target phase (0.487 ±0.012) and without the editing
phase (0.482 ±0.013). Similar improvements are observed in Ant, D'Kitty, and Levy, underscoring
the effectiveness of integrating both phases in enhancing performance in continuous tasks. In the
discrete tasks TF8, TF10, and NAS, DEMO's superior performance over both partial versions is
evident, highlighting its comprehensive effectiveness in managing discrete challenges. Overall, the
ablation studies validate the importance of pseudo-target distribution generation and existing
design editing within the DEMO method, answering Q2: both phases are necessary for DEMO.
These phases collectively contribute to enhancements across a range of tasks and input dimensions.
5.7 Reliability Study
As previously noted, generative model-based methods, which train solely on the offline dataset, often
fail to generate new designs that consistently score higher. In this subsection, we assess the ability
of DEMO to reliably produce superior designs compared to DDOM, which represents the latest
and most robust among generative model-based approaches. We also discuss the comparison to
gradient-based approaches in Appendix A.3. To measure reliability, we compute the proportion of
new designs that exceed the best score in the offline dataset, D(best). The results are depicted in
Figure 3. DEMO consistently outperforms DDOM across all tasks, achieving notable improvements,
particularly in the SuperC and NAS tasks. This confirms DEMO's enhanced reliability over the
state-of-the-art generative model-based baseline in both continuous and discrete settings. The median
scores included in Appendix A.1 further support these findings. DEMO achieves the top median-score
rankings, affirming the reliability of DEMO and answering Q3.
6 Conclusion and Discussion
In this study, we introduce Design Editing for Offline Model-based Optimization (DEMO), which
consists of two phases. The first phase, pseudo-target distribution generation, involves training a
surrogate model on the offline dataset and applying gradient ascent to create a synthetic dataset where
the predicted scores serve as new labels. A conditional diffusion model is subsequently trained on
this synthetic dataset to learn a pseudo-target distribution. The second phase, existing design editing,
introduces random noise to existing top designs and employs the learned diffusion model to denoise
them, conditioned on higher target scores. Overall, DEMO generates new designs by inheriting
high-scoring features from top existing designs in the second phase and refining them with the more
accurate conditional diffusion model obtained in the first phase. Extensive experiments on diverse
offline MBO tasks validate that DEMO outperforms various baseline approaches, yielding state-of-
the-art performance. The limitations and potential negative impacts of this study are discussed in
Appendix A.4 and Appendix A.5, respectively.
7 Acknowledgement
This research is partly facilitated by the computational resources provided by Compute Canada and
Mila Cluster.
References
[1] Brandon Trabucco, Xinyang Geng, Aviral Kumar, and Sergey Levine. Design-Bench: Benchmarks for data-driven offline model-based optimization. arXiv preprint arXiv:2202.08450, 2022.
[2] Thomas Liao, Grant Wang, Brian Yang, Rene Lee, Kristofer Pister, Sergey Levine, and Roberto Calandra. Data-efficient learning of morphology and controller for a microrobot. arXiv preprint arXiv:1905.01334, 2019.
[3] Karen S Sarkisyan et al. Local fitness landscape of the green fluorescent protein. Nature, 2016.
[4] Christof Angermueller, David Dohan, David Belanger, Ramya Deshpande, Kevin Murphy, and Lucy Colwell. Model-based reinforcement learning for biological sequence design. In Proc. Int. Conf. Learning Rep. (ICLR), 2019.
[5] Kam Hamidieh. A data-driven statistical model for predicting the critical temperature of a superconductor. Computational Materials Science, 2018.
[6] Luis A Barrera et al. Survey of variation in human transcription factors reveals prevalent DNA binding changes. Science, 2016.
[7] Paul J Sample, Ban Wang, David W Reid, Vlad Presnyak, Iain J McFadyen, David R Morris, and Georg Seelig. Human 5′ UTR design and variant effect prediction from a massively parallel translation assay. Nature Biotechnology, 2019.
[8] David Brookes, Hahnbeom Park, and Jennifer Listgarten. Conditioning by adaptive sampling for robust design. In Proc. Int. Conf. Machine Lea. (ICML), 2019.
[9] Aviral Kumar and Sergey Levine. Model inversion networks for model-based optimization. Proc. Adv. Neur. Inf. Proc. Syst (NeurIPS), 2020.
[10] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks, 2014.
[11] Siddarth Krishnamoorthy, Satvik Mehul Mashkaria, and Aditya Grover. Diffusion models for black-box optimization, 2023.
[12] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations, 2021.
[13] Chin-Wei Huang, Jae Hyun Lim, and Aaron Courville. A variational perspective on diffusion-based generative models and score matching, 2021.
[14] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.
[15] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution, 2020.
[16] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022.
[17] Sihyun Yu, Sungsoo Ahn, Le Song, and Jinwoo Shin. RoMA: Robust model adaptation for offline model-based optimization. Proc. Adv. Neur. Inf. Proc. Syst (NeurIPS), 2021.
[18] Brandon Trabucco, Aviral Kumar, Xinyang Geng, and Sergey Levine. Conservative objective models for effective offline model-based optimization, 2021.
[19] Justin Fu and Sergey Levine. Offline model-based optimization via normalized maximum likelihood estimation. Proc. Int. Conf. Learning Rep. (ICLR), 2021.
[20] Can Chen, Yingxue Zhang, Xue Liu, and Mark Coates. Bidirectional learning for offline model-based biological sequence design, 2023.
[21] Can Chen, Yingxue Zhang, Jie Fu, Xue Liu, and Mark Coates. Bidirectional learning for offline infinite-width model-based optimization, 2023.
[22] Ye Yuan, Can Chen, Zixuan Liu, Willie Neiswanger, and Xue Liu. Importance-aware co-teaching for offline model-based optimization, 2023.
[23] Can Chen, Christopher Beckham, Zixuan Liu, Xue Liu, and Christopher Pal. Parallel-mentoring for offline model-based optimization, 2023.
[24] Clara Fannjiang and Jennifer Listgarten. Autofocused oracles for model-based design. Proc. Adv. Neur. Inf. Proc. Syst (NeurIPS), 2020.
[25] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations, 2022.
[26] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models, 2022.
[27] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6038–6047, 2023.
[28] Daiki Miyake, Akihiro Iohara, Yu Saito, and Toshiyuki Tanaka. Negative-prompt inversion: Fast image inversion for editing with text-guided diffusion models. arXiv preprint arXiv:2305.16807, 2023.
[29] Chen Henry Wu and Fernando De la Torre. A latent space of stochastic diffusion models for zero-shot image editing and guidance. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7378–7387, 2023.
[30] Xuan Ju, Ailing Zeng, Yuxuan Bian, Shaoteng Liu, and Qiang Xu. Direct inversion: Boosting diffusion-based editing with 3 lines of code. arXiv preprint arXiv:2310.01506, 2023.
[31] Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. FateZero: Fusing attentions for zero-shot text-based video editing. arXiv preprint arXiv:2303.09535, 2023.
[32] Duygu Ceylan, Chun-Hao P Huang, and Niloy J Mitra. Pix2Video: Video editing using image diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23206–23217, 2023.
[33] Shuai Yang, Yifan Zhou, Ziwei Liu, and Chen Change Loy. Rerender a video: Zero-shot text-guided video-to-video translation. arXiv preprint arXiv:2306.07954, 2023.
[34] Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. TokenFlow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373, 2023.
[35] Yuren Cong, Mengmeng Xu, Christian Simon, Shoufa Chen, Jiawei Ren, Yanping Xie, Juan-Manuel Perez-Rua, Bodo Rosenhahn, Tao Xiang, and Sen He. FLATTEN: Optical flow-guided attention for consistent text-to-video editing. arXiv preprint arXiv:2310.05922, 2023.
[36] Youyuan Zhang, Xuan Ju, and James J Clark. FastVideoEdit: Leveraging consistency models for efficient text-to-video editing. arXiv preprint arXiv:2403.06269, 2024.
[37] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models, 2020.
[38] Xuan Su, Jiaming Song, Chenlin Meng, and Stefano Ermon. Dual diffusion implicit bridges for image-to-image translation. In International Conference on Learning Representations, 2023.
[39] Jungtaek Kim. BayesO Benchmarks: Benchmark functions for Bayesian optimization, 2023.
[40] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
[41] Michael Ahn, Henry Zhu, Kristian Hartikainen, Hugo Ponte, Abhishek Gupta, Sergey Levine, and Vikash Kumar. ROBEL: Robotics benchmarks for learning with low-cost robots. In Conf. on Robot Lea. (CoRL), 2020.
[42] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2017.
[43] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[44] James T Wilson, Riccardo Moriconi, Frank Hutter, and Marc Peter Deisenroth. The reparameterization trick for acquisition functions. arXiv preprint arXiv:1712.00424, 2017.
[45] Nikolaus Hansen. The CMA evolution strategy: A comparing review. In Towards a New Evolutionary Computation: Advances in the Estimation of Distribution Algorithms, 2006.
[46] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.
[47] Han Qi, Yi Su, Aviral Kumar, and Sergey Levine. Data-driven model-based optimization via invariant representation learning. In Proc. Adv. Neur. Inf. Proc. Syst (NeurIPS), 2022.
[48] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.
[49] Minsu Kim, Federico Berto, Sungsoo Ahn, and Jinkyoo Park. Bootstrapped training of score-conditioned generator for offline design of biological sequences, 2023.
A Appendix
A.1 Median Normalized Scores
Table 4: Experimental results on continuous tasks for comparison.
Method Superconductor Ant Morphology D’Kitty Morphology Levy
D(best) 0.399 0.565 0.884 0.613
BO-qEI 0.300 ±0.015 0.567 ±0.000 0.883 ±0.000 0.643 ±0.009
CMA-ES 0.379 ±0.003 −0.045 ±0.004 0.684 ±0.016 0.410 ±0.009
REINFORCE 0.463 ±0.016 0.138 ±0.032 0.356 ±0.131 0.377 ±0.065
Grad 0.293 ±0.010 0.463 ±0.023 0.862 ±0.007 0.613 ±0.019
Mean 0.334 ±0.004 0.569 ±0.011 0.876 ±0.005 0.561 ±0.007
Min 0.364 ±0.030 0.569 ±0.021 0.873 ±0.009 0.537 ±0.006
COMs 0.316 ±0.024 0.564 ±0.002 0.881 ±0.002 0.511 ±0.012
ROMA 0.370 ±0.019 0.477 ±0.038 0.854 ±0.007 0.558 ±0.003
NEMO 0.320 ±0.008 0.592 ±0.000 0.883 ±0.000 0.538 ±0.006
BDI 0.412 ±0.000 0.474 ±0.000 0.855 ±0.000 0.534 ±0.003
IOM 0.350 ±0.023 0.513 ±0.035 0.876 ±0.006 0.562 ±0.007
CbAS 0.111 ±0.017 0.384 ±0.016 0.753 ±0.008 0.479 ±0.020
Auto CbAS 0.131 ±0.010 0.364 ±0.014 0.736 ±0.025 0.499 ±0.022
MIN 0.336 ±0.016 0.618 ±0.040 0.887 ±0.004 0.681 ±0.030
DDOM 0.346 ±0.009 0.615 ±0.007 0.861 ±0.003 0.595 ±0.012
DEMO (ours) 0.412 ±0.008 0.624 ±0.014 0.875 ±0.003 0.601 ±0.006
Table 5: Experimental results on discrete tasks, and ranking on all tasks for comparison.
Method TF Bind 8 TF Bind 10 NAS Rank Mean Rank Median
D(best) 0.439 0.467 0.436
BO-qEI 0.439 ±0.000 0.467 ±0.000 0.544 ±0.099 6.7/16 7/16
CMA-ES 0.537 ±0.014 0.484 ±0.014 0.591 ±0.102 8.9/16 6/16
REINFORCE 0.462 ±0.021 0.475 ±0.008 −1.895 ±0.000 11.3/16 15/16
Grad 0.556 ±0.021 0.562 ±0.017 0.227 ±0.110 7.9/16 9/16
Mean 0.539 ±0.030 0.539 ±0.010 0.494 ±0.077 6.1/16 5/16
Min 0.569 ±0.050 0.485 ±0.021 0.567 ±0.006 5.3/16 5/16
COMs 0.439 ±0.000 0.467 ±0.002 0.525 ±0.003 8.7/16 8/16
ROMA 0.555 ±0.020 0.512 ±0.020 0.525 ±0.003 6.9/16 6/16
NEMO 0.438 ±0.001 0.454 ±0.001 0.564 ±0.016 8.1/16 9/16
BDI 0.439 ±0.000 0.476 ±0.000 0.517 ±0.000 8.3/16 8/16
IOM 0.439 ±0.000 0.477 ±0.010 −0.050 ±0.011 8.0/16 7/16
CbAS 0.428 ±0.010 0.463 ±0.007 0.292 ±0.027 13.6/16 13/16
Auto CbAS 0.419 ±0.007 0.461 ±0.007 0.217 ±0.005 14.3/16 14/16
MIN 0.421 ±0.015 0.468 ±0.006 0.433 ±0.000 6.7/16 9/16
DDOM 0.401 ±0.008 0.464 ±0.006 0.306 ±0.017 9.4/16 10/16
DEMO (ours) 0.826 ±0.005 0.475 ±0.004 0.541 ±0.005 4.0/16 4/16
Performance in Continuous Tasks. Table 4 showcases the median normalized scores for various
baseline methods across 4 continuous tasks. DEMO, while not always topping the charts, demonstrates
robust performance across these tasks, consistently outperforming several baseline methods.
For example, in the Ant Morphology task, DEMO's score of 0.624 ±0.014 is the highest among
all approaches. This highlights DEMO's capability to approximate the distribution of higher-scoring
designs effectively. Notably, DEMO outperforms traditional generative models like CbAS and Auto
CbAS by significant margins across all tasks, underscoring its advanced generative capabilities. It
also maintains a competitive edge against more recent generative methods like MIN and DDOM.

Performance in Discrete Tasks. Moving to discrete tasks, as detailed in Table 5, DEMO exhibits
impressive performance in the TF Bind 8 task, substantially surpassing all baselines with a score of
0.826 ±0.005. However, in more complex tasks like TF Bind 10 and NAS, while DEMO performs
competitively, it does not lead the field. This mixed performance can be attributed to DEMO's
methodology which, although highly effective in capturing a broad distribution of high-quality
designs, might struggle in task environments with redundancy in design features.

Summary. The results presented in Tables 4 and 5 collectively validate DEMO's efficacy across both
continuous and discrete optimization tasks, providing further support for answering Q1 affirmatively.
Figure 4: Normalized scores on Superconductor and TF Bind 8 as the value of m varies, comparing
DEMO (ours), Grad, and DDOM. Selecting m near 0 results in generated designs that align closely
with the distribution of existing designs. Conversely, setting m near 1000 steers the generated designs
toward the pseudo-target distribution. Optimal designs are achieved by choosing m in the mid-range,
effectively utilizing information from both the pseudo-target distribution and the top existing designs.
With a mean rank of 4.0/16 and a median rank of 4/16 in terms of the median normalized scores,
DEMO stands out among 16 competing methods. This comprehensive performance underscores
DEMO's capacity to integrate and leverage complex design distributions effectively, setting a new
standard in generative optimization methods.
A.2 Sensitivity to the Choice of m
In Eq. (9), selecting a time m close to M results in x_perturb resembling random Gaussian noise,
which introduces greater flexibility into the new design generation process. On the other hand, if m is
closer to 0, the resulting design retains more characteristics of the existing top design. Thus, m serves
as a critical hyperparameter in our methodology. This section explores the robustness of DEMO to
various choices of m. We perform experiments on one continuous task, SuperC, and one discrete
task, TF8, with m ranging from 0 to 1000 in increments of 100. As illustrated in Figure 4, DEMO
generally outperforms the baseline methods with different choices of m. Nevertheless, overly extreme
values of m, whether too high or too low, can diminish performance. Selecting an excessively low m
causes the model to adhere too closely to the distribution of existing designs, while choosing an
overly high m biases the model towards the pseudo-target distribution, neglecting the guidance of
existing top designs. Choosing m from a mid-range effectively balances the influences from both the
pseudo-target distribution and the top existing designs. Empirical results suggest that an m within the
range of [200, 600] yields optimal performance, leading us to set m = 400 for all tasks.
A.3 Extension of Reliability Study
This section extends the reliability study in section 5.7, comparing DEMO with a gradient-based
approach. When compared to Grad, DEMO demonstrates greater consistency in 5 out of 7 tasks.
However, Grad outperforms DEMO in the Levy and TF10 tasks, which can be attributed to the
gradient-based method's tendency to generate new designs within a narrower distribution. While Grad achieves
a higher proportion of higher-scoring new designs in these two tasks, DEMO generates new designs
within a wider distribution and thus produces candidates with higher maximum scores, as evidenced
in Table 2.
A.4 Limitations
We have demonstrated the effectiveness of DEMO across a wide range of tasks. However, some
evaluation methods may not fully capture real-world complexities. For example, in the superconductor
task [5], we follow traditional practice by using a random forest regression model as the oracle,
as done in prior studies [1]. Unfortunately, this model might not entirely reflect the intricacies
of real-world situations, which could lead to discrepancies between our oracle and actual ground-truth
outcomes. Engaging with domain experts in the future could help enhance these evaluation
approaches. Nevertheless, given DEMO's straightforward approach and the empirical evidence
supporting its robustness and efficacy across various tasks detailed in the Design-Bench [1] and
BayesO Benchmarks [39], we remain confident in its ability to generalize effectively to different
contexts.

Figure 5: The proportion is calculated as the number of new designs which surpass D(best) divided
by the budget 256, indicating the reliability to consistently generate new higher-scoring designs. This
figure demonstrates that DEMO is more reliable than Grad in 5/7 tasks.
A.5 Negative Impacts
This study seeks to advance the field of Machine Learning. However, it’s important to recognize
that advanced optimization techniques can be used for either beneficial or detrimental purposes,
depending on their application. For example, while these methods can contribute positively to society
through the development of drugs and materials, they also have the potential to be misused to create
harmful substances or products. As researchers, we must stay aware and ensure that our contributions
promote societal betterment, while also carefully assessing potential risks and ethical concerns.