Memory-Efficient Gradient Unrolling for Large-Scale Bi-level Optimization
Qianli Shen1∗, Yezhen Wang1, Zhouhao Yang1, Xiang Li1, Haonan Wang1, Yang Zhang1, Jonathan Scarlett1, Zhanxing Zhu2, Kenji Kawaguchi1
1National University of Singapore 2University of Southampton, UK
Abstract
Bi-level optimization (BO) has become a fundamental mathematical framework for addressing hierarchical machine learning problems. As deep learning models continue to grow in size, the demand for scalable bi-level optimization has become increasingly critical. Traditional gradient-based bi-level optimization algorithms, due to their inherent characteristics, are ill-suited to meet the demands of large-scale applications. In this paper, we introduce Forward Gradient Unrolling with Forward Gradient, abbreviated as (FG)2U, which achieves an unbiased stochastic approximation of the meta gradient for bi-level optimization. (FG)2U circumvents the memory and approximation issues associated with classical bi-level optimization approaches, and delivers significantly more accurate gradient estimates than existing large-scale bi-level optimization approaches. Additionally, (FG)2U is inherently designed to support parallel computing, enabling it to effectively leverage large-scale distributed computing systems to achieve significant computational efficiency. In practice, (FG)2U and other methods can be strategically placed at different stages of the training process to achieve a more cost-effective two-phase paradigm. Further, (FG)2U is easy to implement within popular deep learning frameworks, and can be conveniently adapted to address more challenging black-box bi-level optimization scenarios. We provide a thorough convergence analysis and a comprehensive practical discussion for (FG)2U, complemented by extensive empirical evaluations, showcasing its superior performance in diverse large-scale bi-level optimization tasks.
1 Introduction
Bi-level optimization is a mathematical framework with a long history of research [10, 65, 73], dealing with hierarchical optimization problems where one problem is nested within the other. A bi-level optimization problem can be formulated as:
$$\min_{\phi} f(\theta^*(\phi), \phi) \quad \text{s.t.} \quad \theta^*(\phi) \in \arg\min_{\theta} g(\theta, \phi), \tag{1}$$
where $\phi \in \Phi \subseteq \mathbb{R}^N$ denotes the meta parameter, $\theta \in \Theta \subseteq \mathbb{R}^M$ denotes the inner parameter, and $f$, $g$ are called the meta objective function and inner objective function, respectively.
Recently, with the rise of deep learning, bi-level optimization has regained attention as a theoretical framework covering a wide range of machine learning problems, including hyperparameter optimization [46, 43, 17, 16, 45], neural architecture search [78, 38, 14], robust machine learning [79, 76, 71, 26], meta learning [15, 53, 49, 2], and physics-informed machine learning [23, 62]. In these scenarios, the inner problem often pertains to the optimization of neural networks, thereby precipitating challenges associated with gradient-based bi-level optimization.
∗Correspondence: shenqianli@u.nus.edu
| Category | Method | Constant Memory | Hessian-Free | Stochastic Optimization | Grad. Approx. Preference |
|---|---|---|---|---|---|
| Classical | GU [17, 16] | ✗ | ✓ | ✓ | - |
| Classical | IF [53, 19, 64] | ✓ | ✗ | ✓ | - |
| Classical | VF [39, 37, 61, 33] | ✓ | ✓ | ✗ | - |
| Large-scale | TRGU [60] | ✓ | ✓ | ✓ | Efficiency |
| Large-scale | Hessian-Free [76, 75, 9] | ✓ | ✓ | ✓ | Efficiency |
| Large-scale | (FG)2U (ours) | ✓ | ✓ | ✓ | Accuracy |
[Figure 1 plot panels: (Bottom Left) F1 score and GPU memory usage vs. method/unrolled depth (RGU with depths 1/2/4/6, (FG)2U with depths 24/48) on meta-learning online adaptation (DistilGPT2); (Bottom Center) accuracy (%) and GPU memory usage (GiB) for TRGU, Hessian-Free, and (FG)2U on data condensation (CIFAR-100, IPC=50); (Bottom Right) efficiency (sec/step) vs. GPU memory usage (GiB) for TRGU, Hessian-Free, and (FG)2U with 1/2/4/8 GPUs.]
Figure 1: Top Left: A comparison of bi-level optimization methods. (FG)2U circumvents the large-scale challenges inherent in classical bi-level optimization techniques. Within large-scale bi-level optimization, (FG)2U prioritizes the accuracy of gradient approximation over efficiency. Top Right: An overview of the cost-effective two-phase paradigm. (FG)2U is ideally positioned in Phase II to enhance performance after an approximate solution has been obtained using other efficient methods. Bottom Left: GPU memory usage and performance on the Meta Learning Online Adaptation experiment. (FG)2U can effectively address the memory issue of RGU when both the inner model and the unrolled depth are large. Bottom Center: GPU memory usage and performance on the Data Condensation experiments. The performance of (FG)2U surpasses that of other large-scale bi-level optimization methods, owing to its accurate gradient approximation, while demonstrating better memory efficiency. Bottom Right: Efficiency tradeoff of (FG)2U on the Data Condensation experiments. The efficiency of (FG)2U can be well enhanced via intra/inter-GPU parallelism.
Consequently, various gradient-based bi-level optimization algorithms have been developed [73]. These algorithms typically employ an iterative solution $\theta_T$ obtained by executing multiple inner optimization steps to approximate the meta gradient, and provide different tradeoffs between computational costs and performance for meta gradient approximation.

However, as the scale of deep learning models continues to expand, the requirements for scalability in bi-level optimization correspondingly increase. Existing gradient-based bi-level optimization algorithms, due to their inherent characteristics, are ill-suited to meet the demands of large-scale applications. Concretely, gradient unrolling (GU) methods [17, 16, 40, 60] are bottlenecked by the memory overhead associated with either the dimension of the inner parameter or the number of iterative steps for the inner problem. Implicit Function (IF) approaches [48, 19, 64, 76] are compromised by approximation errors, which stem from the iterative estimation of inner solutions and computations that involve the Hessian matrix. Value Function (VF) based strategies [39, 37, 61, 33], although they exhibit commendable theoretical properties [8] for deterministic bi-level optimization, have yet to gain traction in practical applications, predominantly due to their limitations in addressing large-scale stochastic challenges [73]. Recent algorithmic advancements [60, 9] have been specifically tailored for large-scale bi-level optimization. Although these methodologies facilitate efficient gradient approximation by compromising accuracy, they may result in significantly suboptimal performance due to biased gradient approximations. Additionally, these methods struggle in more complex scenarios, such as when inner problems are addressed through black-box optimization.
In this paper, we propose a novel method called Forward Gradient Unrolling with Forward Gradient, abbreviated as (FG)2U, which achieves an unbiased stochastic approximation of the meta gradient for bi-level optimization. (FG)2U circumvents the memory issues associated with GU-based approaches and the approximation issues associated with IF-based approaches. Compared to recently developed large-scale bi-level optimization approaches, (FG)2U delivers significantly more accurate gradient estimates. Additionally, (FG)2U is inherently designed to support parallel computing, enabling it to effectively leverage large-scale distributed computing systems to achieve significant computational efficiency. In practice, a cost-effective two-phase paradigm can be achieved by strategically placing (FG)2U and other methods at different stages of the training process to balance efficiency and performance. Further, (FG)2U is easy to implement within popular deep learning frameworks, and can be conveniently adapted to address more challenging zeroth-order bi-level optimization scenarios.

We provide an overview of (FG)2U in Figure 1 to illustrate its strengths and role in large-scale bi-level optimization. The rest of the paper is organized as follows. First, in Section 2, we summarize existing bi-level optimization algorithms and discuss their limitations in large-scale contexts. Next, in Section 3, we introduce the proposed method, (FG)2U, followed by a convergence analysis in Section 3.1 and a detailed discussion of practical considerations in Section 3.2. Finally, in Section 4, we conduct extensive empirical studies covering large-scale bi-level optimization in computer vision, natural language processing, and physics-informed machine learning to demonstrate the efficacy of (FG)2U in large-scale bi-level optimization scenarios.
2 Background
Gradient-based Bi-level Optimization. Within deep learning applications, the model concerned with optimizing over $\theta$ as presented in (1) typically constitutes a deep neural network. The optimal parameters of such networks are not explicitly accessible and are estimated through iterative procedures. Consequently, the primal problem of bi-level optimization in (1) is approximately reformulated as follows:
$$\min_{\phi \in \Phi} h(\phi) := f(\theta_T(\phi), \phi), \quad \text{where } \theta_0(\phi) = \Omega_0(\phi),\ \theta_t(\phi) = \Omega_t(\theta_{t-1}(\phi), \phi) \in \Theta,\ t = 1, \dots, T, \tag{2}$$
where $\Phi \subseteq \mathbb{R}^N$, $\Theta \subseteq \mathbb{R}^M$ are the parameter spaces; $T$, commonly called the unrolled depth, denotes the number of inner optimization steps for approximating $\theta^*(\phi)$; $\Omega_0 : \mathbb{R}^N \to \mathbb{R}^M$ specifies the initialization of the inner optimization; and $\Omega_t : \Theta \times \Phi \to \Theta$ delineates the transition dynamics of the inner optimization at timestep $t$. In particular, for gradient descent, $\Omega_t(\theta_{t-1}(\phi), \phi) = \theta_{t-1} - \eta_t \nabla_\theta g(\theta_{t-1}, \phi)$, where $\eta_t$ denotes the step size at timestep $t$.
To optimize $\phi$ using a first-order method, it is necessary to estimate the meta gradient $\nabla_\phi h$, which can be further decomposed according to the chain rule:
$$\underbrace{\nabla_\phi h(\phi)}_{\text{meta gradient}} = \underbrace{\frac{\partial f(\theta_T(\phi), \phi)}{\partial \theta_T} \frac{d\theta_T(\phi)}{d\phi}}_{\text{implicit gradient}} + \underbrace{\frac{\partial f(\theta_T(\phi), \phi)}{\partial \phi}}_{\text{explicit gradient}}. \tag{3}$$
The computation of the meta gradient poses a significant challenge, primarily due to the need for efficient approximation of the implicit gradient. This task is complicated by the recursive dependency of $\theta_T$ on $\phi$. To surmount this challenge, a variety of gradient-based bi-level optimization algorithms have been developed, as extensively reviewed in [73]. These algorithms can be fundamentally categorized into three types based on their approach to meta-gradient approximation: Gradient Unrolling (GU), Implicit Function (IF), and Value Function (VF). Recent innovations such as truncated RGU (TRGU) [60] and Hessian-Free approaches [76, 75, 9], which are predicated on GU and IF methodologies respectively, introduce significant biases in their approximations to accommodate the computational constraints of large-scale scenarios. In the subsequent paragraphs, we furnish a concise overview of GU-based approaches, addressing their non-constant memory issues in large-scale applications. Extended discussions on the remaining methods are reserved for Appendix B.
Gradient Unrolling. The core idea behind GU [17, 16, 40, 60] entails unrolling the inner optimization into an expansive computational graph, followed by the employment of automatic differentiation (AD) techniques for the iterative computation of gradients.

Forward Gradient Unrolling (FGU) [17, 16] computes the meta gradient using the following forward recursive formula, starting from $Z_0 = \frac{d\Omega_0(\phi)}{d\phi}$:
$$\underbrace{\frac{d\theta_t(\phi)}{d\phi}}_{Z_t} = \underbrace{\frac{\partial \Omega_t(\theta_{t-1}(\phi), \phi)}{\partial \theta_{t-1}}}_{A_t} \underbrace{\frac{d\theta_{t-1}(\phi)}{d\phi}}_{Z_{t-1}} + \underbrace{\frac{\partial \Omega_t(\theta_{t-1}(\phi), \phi)}{\partial \phi}}_{B_t}, \quad t = 1, \dots, T. \tag{4}$$
Reverse Gradient Unrolling (RGU) [46, 16], instead of employing the explicit recursive formulas for $Z_T$, focuses on the implicit recursive formulas for $\nabla_\phi h$:
$$\nabla_\phi h(\phi) = \underbrace{\frac{\partial f(\theta_T(\phi), \phi)}{\partial \theta_T}}_{d_T} \underbrace{\frac{d\theta_T(\phi)}{d\phi}}_{Z_T} + \underbrace{\frac{\partial f(\theta_T(\phi), \phi)}{\partial \phi}}_{c_T} = d_T Z_T + c_T \overset{(4)}{=} \underbrace{d_T A_T}_{d_{T-1}} Z_{T-1} + \underbrace{d_T B_T + c_T}_{c_{T-1}} = \cdots = d_0 Z_0 + c_0. \tag{5}$$
The corresponding reverse recursive formulas can thus be summarized as
$$c_{t-1} = c_t + d_t B_t, \quad d_{t-1} = d_t A_t, \quad t = T, \dots, 1. \tag{6}$$
Weakness (GU): Non-Constant Memory. Both GU approaches exhibit a non-constant memory overhead, which constrains their utility in large-scale scenarios. The forward recursive formulas in (4) revolve around the Jacobian matrix product, demanding $O(MN)$ space. The reverse recursive formulas in (6) necessitate the storage of the entire trajectory of the inner optimization $\theta_{0:T}$ for backward computation, thereby imposing a memory requirement of $O(TM)$. These requirements are often impractical for large-scale bi-level optimization, where $\phi$ and $\theta$ are of high dimension and a significant unrolled depth is required.
Forward Gradient. Forward-mode automatic differentiation (forward-mode AD) has been applied to a variety of research fields, including the training of recurrent neural networks [70] and the computation of Hessian-vector products [50]. However, the computation of the true gradient via forward-mode AD requires the full Jacobian, which is typically too costly to compute.

To address this, forward gradient learning [69, 4, 63, 56], built upon forward-mode AD, was proposed. Forward gradient methods update parameters based on the directional gradient along a random perturbation direction for backpropagation-free training. More formally, given a differentiable function $h : \mathbb{R}^N \to \mathbb{R}$, the gradient for a given input $\phi \in \mathbb{R}^N$ can be approximated as
$$\hat{\nabla} h(\phi) = \nabla h(\phi) v v^\top, \tag{7}$$
where $v \sim p(v)$ is an $N$-dimensional multivariate random variable satisfying $\mathbb{E}[v v^\top] = I$. Common choices for the distribution of $v$ include the Rademacher distribution $v \sim \mathrm{Unif}(\{-1, 1\}^N)$, the Gaussian distribution $v \sim \mathcal{N}(0, I)$, and the uniform distribution over a set of normalized orthogonal coordinates $v \sim \mathrm{Unif}(\{\sqrt{N} e_i\}_{1:N})$. For any given $\phi$, $\hat{\nabla} h(\phi)$ is an unbiased estimator of $\nabla h(\phi)$, since $\mathbb{E}[\hat{\nabla} h(\phi)] = \mathbb{E}[\nabla h(\phi) v v^\top] = \nabla h(\phi) \mathbb{E}[v v^\top] = \nabla h(\phi) I = \nabla h(\phi)$. Despite the unbiasedness of $\hat{\nabla} h$, the dimension-dependent variance of the estimated gradient with a single direction impedes scaling up to high-dimensional problems. In practice, Monte Carlo gradient estimation can be used, averaging forward gradients over multiple random directions to reduce the variance.
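To make the estimator concrete, the following is a minimal JAX sketch of a Rademacher forward gradient computed with jvp and averaged over $b$ directions. The toy objective `h` and all names are illustrative assumptions, not part of our experimental code.

```python
# Minimal sketch of a forward gradient estimate (Eq. (7), averaged over b directions).
import jax
import jax.numpy as jnp

def h(phi):
    return jnp.sum(jnp.sin(phi) ** 2)  # toy differentiable objective h: R^N -> R

def forward_gradient(phi, key, b=8):
    """(1/b) * sum_i (grad h(phi) . v_i) v_i with Rademacher directions, E[v v^T] = I."""
    v = jnp.where(jax.random.bernoulli(key, 0.5, (b,) + phi.shape), 1.0, -1.0)
    v = v.astype(phi.dtype)
    # jax.jvp returns (h(phi), grad h(phi) . v_i) without forming the full gradient.
    hv = jax.vmap(lambda vi: jax.jvp(h, (phi,), (vi,))[1])(v)
    return jnp.mean(hv[:, None] * v, axis=0)

phi = jnp.ones(5)
print(forward_gradient(phi, jax.random.PRNGKey(0)))  # unbiased estimate of grad h(phi)
print(jax.grad(h)(phi))                              # exact gradient, for comparison
```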
3 (FG)2U: Forward Gradient Unrolling with Forward Gradient

We aim to circumvent the memory overhead issues associated with forward gradient unrolling (FGU) as discussed in Section 2. We begin by examining the forward gradient of $h$ at $\phi$,
$$\hat{\nabla} h(\phi) = \nabla h(\phi) v v^\top \overset{(5)}{=} (d_T Z_T v + c_T v) v^\top, \tag{8}$$
where $v \sim p(v)$ is an $N$-dimensional multivariate random variable satisfying $\mathbb{E}[v v^\top] = I$. We follow the idea of FGU introduced in Section 2 to compute $Z_T v$. Multiplying both sides of (4) by $v$ on the right, we obtain the recursive formulas for $Z_t v$:
$$Z_0 v = B_0 v; \quad Z_t v = A_t Z_{t-1} v + B_t v, \quad t = 1, \dots, T, \tag{9}$$
where $B_0 := \frac{d\Omega_0(\phi)}{d\phi}$. The revised recursive formulas in (9) facilitate the tracking of an $M$-dimensional vector $Z_t v$, rather than the full Jacobian $Z_t$ of size $M \times N$, throughout the forward pass. The stochastic estimation in (8) is unbiased, adhering to the properties of forward gradient methods. To reduce the variance, we use a Monte Carlo estimate via averaged forward gradients over $b$ i.i.d. random directions:
$$\hat{\nabla} h(\phi) = \frac{1}{b} \sum_{i=1}^{b} \nabla h(\phi) v_i v_i^\top = \frac{1}{b} \sum_{i=1}^{b} (d_T Z_T v_i + c_T v_i) v_i^\top. \tag{10}$$
We call this algorithm (FG)2U, as an abbreviation of Forward Gradient Unrolling with Forward Gradient. The algorithm is summarized in Appendix A as Algorithm 1.
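For concreteness, below is a minimal JAX sketch of one (FG)2U meta-gradient estimate: each jvp through the unrolled inner loop implicitly carries out the $Z_t v$ recursion in (9), and the $b$ directional derivatives are averaged as in (10). The toy inner objective, meta loss, and dimensions are illustrative stand-ins for the actual bi-level problems of Section 4.

```python
# Minimal sketch of one (FG)2U meta-gradient estimate (Algorithm 1 / Eq. (10)).
# `inner_step` plays the role of Omega_t and `meta_loss` the role of f.
import jax
import jax.numpy as jnp

def inner_step(theta, phi):                      # Omega_t: one gradient step on g
    g = lambda th: jnp.sum((th - phi) ** 2)      # toy inner objective g(theta, phi)
    return theta - 0.1 * jax.grad(g)(theta)

def meta_loss(theta, phi):                       # f(theta_T(phi), phi)
    return jnp.sum(theta ** 2) + 0.01 * jnp.sum(phi ** 2)

def h(phi, theta0, T=20):                        # h(phi) = f(theta_T(phi), phi), Eq. (2)
    theta = theta0
    for _ in range(T):
        theta = inner_step(theta, phi)
    return meta_loss(theta, phi)

def fg2u_estimate(phi, theta0, key, b=4):
    """Each jvp through the unrolled loop carries the Z_t v recursion of Eq. (9)
    with O(M) extra memory per direction; estimates are averaged as in Eq. (10)."""
    v = jnp.where(jax.random.bernoulli(key, 0.5, (b,) + phi.shape), 1.0, -1.0)
    v = v.astype(phi.dtype)
    hv = jax.vmap(lambda vi: jax.jvp(lambda p: h(p, theta0), (phi,), (vi,))[1])(v)
    return jnp.mean(hv[:, None] * v, axis=0)

phi, theta0 = jnp.ones(3), jnp.zeros(3)
print(fg2u_estimate(phi, theta0, jax.random.PRNGKey(0)))   # (FG)2U estimate
print(jax.grad(lambda p: h(p, theta0))(phi))               # exact meta gradient
```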
Compared to GU-based methods, as discussed in Section 2, (FG)2U eliminates the memory dependency on the meta parameter dimension $N$ and the unrolled depth $T$ without introducing bias, significantly enhancing memory efficiency. Unlike IF-based methods, as discussed in Appendix B.2, (FG)2U overcomes the approximation issues associated with them while maintaining a constant memory overhead, thus providing superior gradient approximation. Compared to TRGU and Hessian-Free methods, which compromise approximation accuracy for efficiency, (FG)2U consistently delivers accurate gradient approximations. The computational efficiency of (FG)2U can be further enhanced by leveraging large-scale distributed computing resources, capitalizing on its inherently parallelizable formulation as presented in (10). In practice, a more cost-effective two-phase paradigm can be achieved by strategically placing (FG)2U and other methods at different stages of the training process, as we will discuss in Section 3.2. For an illustration of the role of (FG)2U in large-scale bi-level optimization, please refer to Figure 1.
3.1 Convergence

In this section, we provide a convergence analysis for (FG)2U. The proofs can be found in Appendix C. First, we establish a bound on the variance of the estimated gradient when employing random vectors whose entries follow the Rademacher distribution.

Lemma 3.1. For any $\phi \in \Phi$, if $v_i \sim \mathrm{Unif}(\{-1, 1\}^N)$, the gradient estimation in (10) satisfies
$$\mathbb{E}\|\hat{\nabla} h(\phi) - \nabla h(\phi)\|^2 = \frac{1}{\rho} \|\nabla h(\phi)\|^2,$$
where $\rho := \frac{b}{N-1} \in (0, 1]$ as the sample size $b$ is selected from $1, \dots, N-1$.

The resultant error is bounded by $O\left(\frac{N-1}{b}\right)$, where $b$ represents the sample size used for computing the forward gradient and $N$ is the dimensionality of the gradient itself. This bound demonstrates how the error scales inversely with the sample size while also being influenced by the gradient's dimensionality.
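As a quick sanity check of Lemma 3.1, the identity can be verified empirically by Monte Carlo; the dimensions below are toy choices, not part of the paper's experiments.

```python
# Empirical check of E||hat-grad - grad||^2 = (1/rho) ||grad||^2, rho = b/(N-1).
import jax
import jax.numpy as jnp

k1, k2 = jax.random.split(jax.random.PRNGKey(0))
N, b, trials = 8, 2, 200_000
g = jax.random.normal(k1, (N,))                          # stands in for grad h(phi)
v = jnp.where(jax.random.bernoulli(k2, 0.5, (trials, b, N)), 1.0, -1.0)
est = jnp.mean((v @ g)[..., None] * v, axis=1)           # (trials, N) b-sample estimates
emp = jnp.mean(jnp.sum((est - g) ** 2, axis=-1))
print(emp, (N - 1) / b * jnp.sum(g ** 2))                # empirical vs. theoretical
```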
Next, we lay down the following assumptions, on which our main theorems are based. Let $\psi = (\theta, \phi) \in \Theta \times \Phi$ denote the combination of the lower-level parameter $\theta$ and the meta parameter $\phi$. Following existing papers on the theory of bilevel optimization [45, 60, 28], in Assumption 3.2 we adopt standard assumptions on the smoothness of the objective functions $f$ and $g$.

Assumption 3.2. The meta objective function $f(\psi)$ and the lower-level objective function $g(\psi)$ are both $C$-Lipschitz and $L$-smooth, i.e., for any $\psi, \psi' \in \Theta \times \Phi$,
$$|f(\psi) - f(\psi')| \le C\|\psi - \psi'\|, \quad \|\nabla f(\psi) - \nabla f(\psi')\| \le L\|\psi - \psi'\|, \tag{11}$$
$$|g(\psi) - g(\psi')| \le C\|\psi - \psi'\|, \quad \|\nabla g(\psi) - \nabla g(\psi')\| \le L\|\psi - \psi'\|. \tag{12}$$

The next assumption requires that the transition functions $\Omega$ satisfy similar smoothness conditions.

Assumption 3.3. The transition functions $\Omega_{0:T}$ are $C_\Omega$-Lipschitz and $L_\Omega$-smooth, i.e., for any $\phi, \phi' \in \Phi$,
$$\|\Omega_0(\phi) - \Omega_0(\phi')\| \le C_\Omega \|\phi - \phi'\|, \quad \|\nabla \Omega_0(\phi) - \nabla \Omega_0(\phi')\| \le L_\Omega \|\phi - \phi'\|, \tag{13}$$
and for any $\psi, \psi' \in \Theta \times \Phi$, $t = 1, \dots, T$,
$$\|\Omega_t(\psi) - \Omega_t(\psi')\| \le C_\Omega \|\psi - \psi'\|, \quad \|\nabla \Omega_t(\psi) - \nabla \Omega_t(\psi')\| \le L_\Omega \|\psi - \psi'\|. \tag{14}$$

Assumption 3.3 is made to ensure the generality of our analysis over different optimizers. Note that $\Omega$ is scheme-dependent with respect to the gradient-based optimizer adopted for the lower-level problem. In many cases, such as gradient descent where $\Omega_t(\psi_{t-1}) = \theta_{t-1} - \eta_t \nabla_\theta g(\psi_{t-1})$, Assumption 3.3 is a direct consequence of Assumption 3.2.
We propose the following theorem and remark for convergence analysis.

Theorem 3.4 (Convergence). Suppose that Assumption 3.2 and Assumption 3.3 hold. Setting the learning rate $\beta = \frac{\rho}{(\rho + 1) L_h}$ for gradient descent over the hyperparameter $\phi$, there exists a constant $L_h$ (depending on $C$, $L$, $C_\Omega$, $L_\Omega$, and $T$, and defined formally in the proof) such that
$$\frac{1}{K} \sum_{k=0}^{K-1} \mathbb{E}\|\nabla h(\phi_k)\|^2 \le \frac{4 L_h (\mathbb{E}[h(\phi_0)] - \min_\phi h(\phi))}{\rho K}. \tag{15}$$

Remark 3.5. Theorem 3.4 shows that Algorithm 1 converges to an $\epsilon$-accurate stationary point with a convergence rate of $O(\epsilon^{-1} \rho^{-1})$.

Recall that $\rho = \frac{b}{N-1}$, indicating that a sample size $b = O(N)$ is necessary to achieve a convergence rate of $O(\epsilon^{-1})$. This requirement poses a particular challenge when managing high-dimensional meta parameters $\phi$. Fortunately, the parallelizable nature of forward gradient methods enables the mitigation of computational overhead through distributed computing, thereby alleviating the computational demands in large-scale applications.
3.2 Practical Considerations

Cost-Effective Two-Phase Paradigm. It is important to note that the upper bound delineated in (15) depends linearly on the performance discrepancy between the initialized meta parameter $\phi_0$ and the optimum. This dependence motivates the adoption of a more cost-effective two-phase paradigm for large-scale bi-level optimization. In the initial phase, we utilize efficient yet less accurate gradient approximation methods, such as TRGU [60] and Hessian-Free [9], to efficiently establish an initial $\phi_0$ that surpasses random initialization, while keeping computational overhead manageable. Subsequently, in the second phase, (FG)2U is utilized for a more accurate, albeit less efficient, gradient approximation to further elevate performance, leveraging extensive computational resources.
Implementation. The technique employed in computing $\nabla h(\phi) v$ is forward-mode automatic differentiation (forward-mode AD). In advanced automatic differentiation libraries such as JAX [5] and PyTorch [3], forward-mode AD is efficiently implemented as the Jacobian-vector product (jvp), without the necessity of explicitly computing the Jacobian matrix. The FLOP cost of jvp is approximately three times that of a standard forward pass, while the memory overhead is doubled. In practice, it is only necessary to define the forward computational graph of the inner optimization and invoke forward-mode AD, which simplifies the implementation process significantly. Regarding distributed training, JAX offers the vmap interface for efficient intra-GPU parallelism and the pmap interface for effective inter-GPU parallelism, as sketched below.
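The following sketch illustrates how the $b$ directions can be vectorized with vmap and sharded with pmap; the toy objective is carried over from the earlier sketches and is an illustrative assumption, not our experimental setup.

```python
# Sketch of intra/inter-GPU parallelism over the b forward-gradient directions.
import jax
import jax.numpy as jnp

h = lambda phi: jnp.sum(jnp.sin(phi) ** 2)     # toy objective

def directional_grad(phi, v):                  # one grad h(phi) . v via forward-mode AD
    return jax.jvp(h, (phi,), (v,))[1]

# Intra-GPU: vectorize over the direction axis (v has shape [b, N]).
batched = jax.vmap(directional_grad, in_axes=(None, 0))

# Inter-GPU: additionally shard directions across devices (v: [n_dev, b_per_dev, N]).
sharded = jax.pmap(batched, in_axes=(None, 0))

phi = jnp.ones(4)
v = jnp.where(jax.random.bernoulli(jax.random.PRNGKey(0), 0.5, (8, 4)), 1.0, -1.0)
print(batched(phi, v.astype(phi.dtype)).shape)  # (8,) directional derivatives
```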
Zeroth-Order Bi-level Optimization. In certain applications of bi-level optimization, the inner problem is approached as a black box, where the gradient of $\Omega$ is inaccessible, rendering analytical gradient unrolling infeasible. One example is PDE-constrained optimization [23, 62], in which the inner problem entails solving a partial differential equation (PDE) using a non-differentiable solver. In such scenarios, rather than employing forward-mode AD, one can resort to finite difference methods to approximate the directional gradient $\nabla h(\phi) v$ by
$$\nabla h(\phi) v = \lim_{\mu \to 0} \frac{h(\phi + \mu v) - h(\phi)}{\mu} \approx \frac{h(\phi + \bar{\mu} v) - h(\phi)}{\bar{\mu}} \tag{16}$$
with a sufficiently small positive $\bar{\mu} > 0$. We refer to this zeroth-order variant of (FG)2U as (FG)2U-ZO, noting that the computation solely encompasses two forward passes and does not involve any first-order information. The memory complexity is the same as forward-mode AD, and the actual computation time is slightly less than forward-mode AD, at the cost of introducing an approximation bias. We give a more detailed discussion within the context of zeroth-order optimization [41] in Appendix D, and empirically study a corresponding case in Section 4.
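A minimal sketch of (FG)2U-ZO under these assumptions follows; `solve_inner` is a hypothetical stand-in for a black-box (e.g., non-differentiable numerical) inner solver.

```python
# Minimal sketch of the zeroth-order directional gradient in Eq. (16), as used by
# (FG)2U-ZO. `solve_inner` is a hypothetical placeholder, not a real solver API.
import jax
import jax.numpy as jnp

def solve_inner(phi):                 # black box returning theta_T(phi); no gradients used
    return jnp.tanh(phi)              # placeholder for, e.g., a numerical PDE solver

def h(phi):                           # h(phi) = f(theta_T(phi), phi)
    return jnp.sum(solve_inner(phi) ** 2)

def fg2u_zo_estimate(phi, key, b=8, mu=1e-4):
    """(1/b) sum_i [(h(phi + mu v_i) - h(phi)) / mu] v_i: b + 1 forward passes in
    total, no first-order information, at the cost of an O(mu) approximation bias."""
    v = jnp.where(jax.random.bernoulli(key, 0.5, (b,) + phi.shape), 1.0, -1.0)
    h0 = h(phi)
    hv = jax.vmap(lambda vi: (h(phi + mu * vi) - h0) / mu)(v)
    return jnp.mean(hv[:, None] * v, axis=0)

phi = 0.3 * jnp.ones(5)
print(fg2u_zo_estimate(phi, jax.random.PRNGKey(0)))
```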
4 Experiments

We conduct experiments across various contexts, as detailed in the respective subsections. Initially, we engage in an image data condensation task, where we focus on a comprehensive performance comparison between (FG)2U and both classical and large-scale bi-level optimization algorithms. Subsequently, we investigate meta-learning for the online adaptation of language models, employing a GPT model as the inner model, to illustrate how (FG)2U effectively circumvents the non-constant memory issue associated with RGU. Finally, we address a physics-informed bi-level optimization problem, where gradient-based inner solvers are ineffective, to demonstrate the efficacy of combining (FG)2U-ZO, the zeroth-order variant of (FG)2U discussed in Section 3.2, with non-differentiable numerical solvers.
Data Condensation. To overcome the challenges posed by large-scale datasets, a line of work known as data condensation [68, 72] has been proposed. The main idea is to generate a compact, synthesized dataset designed to elicit similar behaviors in machine learning models as those trained with the original, massive dataset. The objective of the mainstream principles [72] designed for data condensation can be naturally formulated as a bi-level optimization problem. We focus on the best-known principle, performance matching [72], on classification tasks, which can be formulated as
$$\min_{D_c} \mathcal{L}(\theta_T; D_o), \quad \text{where } \theta_t = \theta_{t-1} - \eta \nabla \mathcal{L}(\theta_{t-1}; D_c), \ t = 1, \dots, T, \tag{17}$$
where $D_o$ and $D_c$ respectively denote the original and condensed dataset, $\theta$ denotes the model parameter, $\mathcal{L}$ denotes the cross-entropy loss function, and $\eta$ represents the step size for inner optimization.
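To make the connection to (2) explicit, here is a minimal sketch of the performance-matching objective (17) with a toy linear model in place of the ConvNet used in our experiments; (FG)2U then estimates the gradient with respect to $D_c$ by pushing jvp's through this unroll, exactly as in the sketch of Section 3.

```python
# Minimal sketch of the performance-matching objective (Eq. (17)): h(D_c) trains a
# model on the condensed data and evaluates it on the original data. The tiny linear
# model and all names are illustrative, not the 3-layer ConvNet setup of Table 1.
import jax
import jax.numpy as jnp

def loss(theta, X, y):                           # L(theta; D): toy squared loss
    return jnp.mean((X @ theta - y) ** 2)

def h(Xc, yc, Xo, yo, theta0, T=50, eta=0.1):
    theta = theta0
    for _ in range(T):                           # inner unroll on condensed data D_c
        theta = theta - eta * jax.grad(loss)(theta, Xc, yc)
    return loss(theta, Xo, yo)                   # meta objective on original data D_o

key = jax.random.PRNGKey(0)
Xo = jax.random.normal(key, (128, 4)); yo = Xo @ jnp.ones(4)   # "original" data
Xc = jnp.zeros((8, 4));                yc = jnp.zeros(8)        # condensed data (phi)
print(h(Xc, yc, Xo, yo, jnp.zeros(4)))
```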
| Dataset | IPC | Ratio (%) | TRGU | Hessian-Free | Neumann | (FG)2U | RGU (ref.) | WHOLE (ref.) |
|---|---|---|---|---|---|---|---|---|
| MNIST | 1 | 0.017 | 73.76±1.68 | 65.98±1.38 | 68.37±1.44 | 82.44±0.68 | 92.32±0.33 | 99.6±0.00 |
| MNIST | 10 | 0.17 | 94.05±0.33 | 94.97±0.34 | 95.75±0.24 | 96.12±0.28 | 96.79±0.29 | 99.6±0.00 |
| MNIST | 50 | 0.83 | 96.63±0.41 | 96.34±0.31 | 96.78±0.22 | 97.01±0.19 | 97.72±0.23 | 99.6±0.00 |
| CIFAR-10 | 1 | 0.02 | 20.78±1.07 | 19.72±1.28 | 21.33±0.90 | 29.37±0.75 | 34.08±0.55 | 84.8±0.10 |
| CIFAR-10 | 10 | 0.2 | 44.01±0.57 | 45.32±1.02 | 47.67±0.87 | 50.10±0.56 | 53.15±0.53 | 84.8±0.10 |
| CIFAR-10 | 50 | 1 | 49.22±0.45 | 48.73±0.78 | 50.02±0.69 | 51.98±0.44 | 56.37±0.37 | 84.8±0.10 |
| CIFAR-100 | 1 | 0.2 | 3.96±0.68 | 3.14±0.41 | 4.52±0.56 | 8.22±0.45 | 15.61±0.32 | 56.2±0.30 |
| CIFAR-100 | 10 | 2 | 20.20±0.66 | 19.01±0.84 | 20.87±0.82 | 23.38±0.33 | 25.42±0.45 | 56.2±0.30 |
| CIFAR-100 | 50 | 10 | 22.33±0.93 | 23.59±0.71 | 24.52±0.77 | 25.84±0.31 | 28.52±0.53 | 56.2±0.30 |

Table 1: Performance (test accuracy, %) comparison among various bi-level optimization methods on the data condensation task over three datasets. All datasets are condensed using a 3-layer ConvNet. IPC: image(s) per class. Ratio (%): the ratio of condensed examples to the whole training set. RGU and WHOLE (training on the whole dataset) are provided for reference.
We conducted our experiments following the standard data condensation setting established by [68, 77, 67]. A more detailed task description is given in Appendix E.1, and implementation details are given in Appendix F.1.

The condensed datasets are evaluated using 3-layer convolutional networks with randomly initialized parameters, and the average accuracies on test datasets are summarized in Table 1. Compared to large-scale bi-level optimization methods like TRGU and Hessian-Free, which prioritize efficiency at the expense of approximation accuracy, (FG)2U exhibits significantly better performance, due to more accurate gradient approximation as explained in Appendix B. Additionally, we assessed the Neumann series (denoted as Neumann in Table 1), an IF-based method that mitigates gradient approximation errors through extended computations, as introduced in Appendix B.2. While it demonstrates performance enhancements over the Hessian-Free method, Neumann still yields suboptimal performance compared to (FG)2U, owing to the inherent bias of the IF-based method. Further discussions and supporting evidence are available in Appendix B.2.

The results of RGU, which represent the upper performance bound for both TRGU and (FG)2U, are provided for reference, along with the results from training on the entire dataset (denoted as WHOLE in Table 1), representing the upper performance bound for all approaches. However, it is crucial to acknowledge that RGU is not practical in large-scale bi-level optimization scenarios due to its non-constant memory requirements, as discussed in Section 2. This limitation will be further exemplified in the subsequent experiment, where the inner model is significantly larger. In principle, the performance of (FG)2U can be further improved to approach that of RGU by increasing the number of random directions for gradient approximation.

The memory and computational efficiencies of TRGU, Hessian-Free, and (FG)2U in the most challenging case (CIFAR-100, IPC=50) are reported in Figure 1 (Bottom Right), demonstrating that the efficiency of (FG)2U can be significantly enhanced through intra/inter-GPU parallelism.
Meta Learning Online Adaptation of Language Models. The online adaptation of language models (LMs) has been studied recently to keep the knowledge of LMs current [34, 27]. However, trivial auto-regressive fine-tuning of the LM, which applies uniform weights to all tokens, often results in suboptimal performance in downstream tasks. This issue stems from the default average negative log-likelihood (NLL) loss, which fails to capture the significance of tokens [25]. To overcome this limitation, [25] proposed Context-aware Meta-learned Loss Scaling (CaMeLS), a strategy that employs meta-learning to adjust token weights for more effective online adaptation. Specifically, they meta-train a weight model to reweight the auto-regressive loss during online fine-tuning, aiming to enhance LM performance on downstream question-answering tasks. A comprehensive task description and the mathematical formulation of the objectives are detailed in Appendix E.2.

The trained weight model is subsequently fine-tuned on unseen online documents and evaluated on corresponding question-answering tasks. In [25], RGU is utilized for meta gradient approximation. To mitigate the non-constant memory issue associated with RGU, a DistilGPT2 model [59] is chosen as the surrogate base model for training the weight model, instead of the larger models typically employed for online adaptation. Additionally, a very limited unrolled depth of 6 is utilized within a 40 GiB GPU memory budget. In our experiments, since (FG)2U circumvents the non-constant memory issue associated with RGU, we are able to increase the unrolled depth and upscale the base model for training the weight model. Empirical evaluations are conducted on two datasets, StreamingQA [36] and SQuAD-Seq [54].
| Model (# params) | Method | StreamingQA EM (↑) | StreamingQA F1 (↑) | SQuAD-Seq EM (↑) | SQuAD-Seq F1 (↑) |
|---|---|---|---|---|---|
| DistilGPT2 (82M) | CaMeLS + RGU [25, 66] | 1.62 | 5.79 | 1.45 | 3.08 |
| DistilGPT2 (82M) | CaMeLS + RGU (impl.) | 2.04 | 5.53 | 1.52 | 3.16 |
| DistilGPT2 (82M) | CaMeLS + (FG)2U (ours) | 2.22 | 6.37 | 1.72 | 3.50 |
| GPT2-Large (774M) | CaMeLS + RGU [25, 66] | 5.35 | 10.60 | 4.97 | 8.63 |
| GPT2-Large (774M) | CaMeLS + RGU (impl.) | 7.02 | 12.19 | 4.86 | 8.57 |
| GPT2-Large (774M) | CaMeLS + (FG)2U (ours) | 7.21 | 12.50 | 5.56 | 8.99 |
| GPT2-XL (1.5B) | CaMeLS + RGU [25, 66] | 6.55 | 11.67 | 6.70 | 10.15 |
| GPT2-XL (1.5B) | CaMeLS + RGU (impl.) | 7.93 | 12.94 | 6.71 | 9.65 |
| GPT2-XL (1.5B) | CaMeLS + (FG)2U (ours) | 8.89 | 14.42 | 7.37 | 10.37 |

Table 2: Comparison of online adaptation performance. The reported evaluation metrics are exact match (EM) and F1 scores. For vanilla CaMeLS [25], RGU is conducted with unrolled depth 6, using DistilGPT2 as the base model. We present both the results reported by [66] and those from our implementation (denoted as impl.). For CaMeLS + (FG)2U, we select unrolled depths from {24, 48} and the base model from {DistilGPT2, GPT2}. We report the results for the combination that yields the best F1 score. Additional details and ablation studies are documented in Appendix G.1.
First, we increased the unrolled depth while maintaining DistilGPT2 as the base model. We plot the F1 scores and GPU memory usage for RGU with unrolled depths of {1, 2, 4, 6} and (FG)2U with unrolled depths of {24, 48} on StreamingQA in Figure 1 (Bottom Left). The performance of the weight model is positively correlated with the unrolled depth, substantiating the benefits of training with larger unrolled depths. The non-constant memory issue associated with RGU can be observed as the unrolled depth increases, while (FG)2U maintains constant memory even with a large unrolled depth. Subsequently, we upscaled the base model to GPT2 to reduce the disparity between training and evaluation. The performances are summarized in Table 2, with detailed ablation studies on unrolled depths and base model variants documented in Table G.1 and Table G.2.
Data-driven Discovery of Partial Differential Equations (PDEs). Let us consider the following general form of parametrized, nonlinear PDEs:
$$u_t + \mathcal{N}[u; \phi] = 0, \quad x \in \Psi, \ t \in [0, T], \tag{18}$$
where $x$ denotes the space-time coordinate, $\Psi$ denotes a bounded domain with boundary, $u : [0, T] \times \Psi \to \mathbb{R}$ denotes the latent solution, $u_t$ represents the first-order derivative of $u$ with respect to $t$, and $\mathcal{N}$ is a general differential operator parameterized by $\phi$, acting on $\Psi$. This setup encompasses a broad spectrum of problems in physics. For example, the one-dimensional Burgers' equation is defined by $\mathcal{N}[u; \phi] = \mu u u_x - \nu u_{xx}$, where $\phi = (\mu, \nu) \in \mathbb{R}^2$, and $u_x$, $u_{xx}$ represent the first- and second-order derivatives of $u$ with respect to $x$, respectively.
[Figure 2, Left: relative L2 error (log scale) vs. speed (seconds per solution) on Burgers for PINN (Adam), PINN (SGD), and the numerical solver; the numerical solver is both faster and more accurate.]

| Method | (FG)2U | (FG)2U-ZO |
|---|---|---|
| Inner solver | PINN [52] | Numerical |
| Burgers: $\epsilon_\phi$ (E-2, ↓) | 2143.58±855.26 | 0.97±0.45 |
| Burgers: $\epsilon_u$ (E-3, ↓) | 336.06±46.91 | 0.63±0.33 |
| Allen-Cahn: $\epsilon_\phi$ (E-2, ↓) | 438.13±101.77 | 2.34±0.64 |
| Allen-Cahn: $\epsilon_u$ (E-3, ↓) | 133.61±35.93 | 0.97±0.54 |
| KdV: $\epsilon_\phi$ (E-2, ↓) | 94.40±4.31 | 0.72±0.57 |
| KdV: $\epsilon_u$ (E-3, ↓) | 832.81±67.01 | 2.72±1.55 |

Figure 2: Left: Comparison of efficiency between the PINN solver and the numerical solver. We evaluated Adam [29] and SGD as the inner optimizers for the PINN solver, with steps ranging from 100 to 50,000. The results demonstrate that the numerical solver is significantly more efficient. Right: Comparison of relative L2 errors in the prediction of $\phi$ and $u$, where $\epsilon_\phi = \|\phi_{pred} - \phi\|_2 / \|\phi\|_2$ and $\epsilon_u = \|u_{pred} - u\|_2 / \|u\|_2$.
The problem of data-driven discovery of PDEs [52] can be framed as follows: given a set of scattered observations of the latent solution $u(x)$, what are the parameters most accurately describing the observed data? The problem can be formulated as a PDE-constrained optimization problem (PDECO):
$$\min_{\phi} \mathbb{E}_{x, u \sim D} |u(x; \phi) - u|^2 \quad \text{s.t.} \quad u_t + \mathcal{N}[u(\cdot; \phi); \phi] = 0, \ x \in \Psi, \tag{19}$$
where $D = \{(x_i, u_i)\}_{1:k}$ denotes the observed data. In cases where closed-form solutions of the nonlinear PDEs are intractable, parametric solutions $u_\theta$ are used to approximate the latent solution $u$ for a given $\phi$. The PDECO in (19) is then reformulated into a bi-level optimization problem:
$$\min_{\phi} \mathbb{E}_{x, u \sim D} |u_{\theta_S(\phi)}(x; \phi) - u|^2 \quad \text{s.t.} \quad \theta_s(\phi) = \Omega_s(\theta_{s-1}, \phi), \ s = 1, \dots, S. \tag{20}$$
Employing gradient-based PDE solvers, such as physics-informed neural networks (PINNs) [52], facilitates the direct application of (FG)2U. However, as demonstrated in Figure 2 (Left), the accuracy and efficiency of PINNs fall short of the rigorous demands of scientific computing. This limitation has prompted us to integrate faster and more accurate traditional solvers such as the spectral method [1] (see also Appendix E.3.4) to tackle the inner problem. Given that these solvers are non-differentiable, we employ (FG)2U-ZO, the zeroth-order variant of (FG)2U introduced in Section 3.2, to solve the problem.

We conduct experiments on three nonlinear PDEs: Burgers, Allen-Cahn, and KdV, with a more detailed task description available in Appendix E.3. The results are summarized in Figure 2 (Right). We observe that the combination of (FG)2U-ZO and the numerical solver significantly outperforms (FG)2U with the PINN solver, in terms of the prediction of both $\phi$ and $u$. The implementation details are documented in Appendix F.3.
5 Conclusion

In this work, we propose a novel algorithm, Forward Gradient Unrolling with Forward Gradient, abbreviated as (FG)2U, designed to tackle the challenges associated with large-scale bi-level optimization. We conduct a convergence analysis of (FG)2U, perform extensive comparisons with existing methods, and provide detailed discussions on its practical applications. Additionally, we undertake an empirical evaluation across a series of large-scale bi-level optimization tasks. Our findings indicate that (FG)2U effectively complements existing bi-level optimization algorithms, addressing gaps in large-scale bi-level optimization scenarios. A brief discussion on the limitations of our approach and directions for future work is presented in Appendix H.

6 Acknowledgements

This research is supported by the National Research Foundation Singapore under the AI Singapore Programme (AISG Award No: AISG2-TC-2023-010-SGIL) and the Singapore Ministry of Education Academic Research Fund Tier 1 (Award No: T1 251RES2207, T1 251RES2218). The computational work for this article was partially performed on resources of the National Supercomputing Centre, Singapore (https://www.nscc.sg).
References

[1] William F Ames. Numerical methods for partial differential equations. Academic Press, 2014.

[2] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. Advances in Neural Information Processing Systems, 29, 2016.

[3] Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, Geeta Chauhan, Anjali Chourdia, Will Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, Brian Hirsh, Sherlock Huang, Kshiteej Kalambarkar, Laurent Kirsch, Michael Lazos, Mario Lezcano, Yanbo Liang, Jason Liang, Yinghai Lu, CK Luk, Bert Maher, Yunjie Pan, Christian Puhrsch, Matthias Reso, Mark Saroufim, Marcos Yukio Siraichi, Helen Suk, Michael Suo, Phil Tillet, Eikan Wang, Xiaodong Wang, William Wen, Shunting Zhang, Xu Zhao, Keren Zhou, Richard Zou, Ajit Mathews, Gregory Chanan, Peng Wu, and Soumith Chintala. PyTorch 2: Faster machine learning through dynamic Python bytecode transformation and graph compilation. In 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS '24). ACM, April 2024.

[4] Atılım Güneş Baydin, Barak A Pearlmutter, Don Syme, Frank Wood, and Philip Torr. Gradients without backpropagation. arXiv preprint arXiv:2202.08587, 2022.

[5] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018.

[6] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

[7] Claudio Canuto, M Yousuff Hussaini, Alfio Quarteroni, and Thomas A Zang. Spectral methods: fundamentals in single domains. Springer Science & Business Media, 2007.

[8] Lesi Chen, Yaohua Ma, and Jingzhao Zhang. Near-optimal fully first-order algorithms for finding stationary points in bilevel optimization. arXiv preprint arXiv:2306.14853, 2023.

[9] Sang Choe, Sanket Vaibhav Mehta, Hwijeen Ahn, Willie Neiswanger, Pengtao Xie, Emma Strubell, and Eric Xing. Making scalable meta learning practical. Advances in Neural Information Processing Systems, 36, 2024.

[10] Benoît Colson, Patrice Marcotte, and Gilles Savard. An overview of bilevel optimization. Annals of Operations Research, 153:235–256, 2007.

[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina N. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[12] Tian Dong, Bo Zhao, and Lingjuan Lyu. Privacy for free: How does dataset condensation help privacy? In International Conference on Machine Learning, pages 5378–5396. PMLR, 2022.

[13] John C Duchi, Michael I Jordan, Martin J Wainwright, and Andre Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61(5):2788–2806, 2015.

[14] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey. Journal of Machine Learning Research, 20(55):1–21, 2019.

[15] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135. PMLR, 2017.

[16] Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. Forward and reverse gradient-based hyperparameter optimization. In International Conference on Machine Learning, pages 1165–1173. PMLR, 2017.

[17] Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimiliano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. In International Conference on Machine Learning, pages 1568–1577. PMLR, 2018.

[18] Tianyu Gao, Adam Fisch, and Danqi Chen. Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723, 2020.

[19] Saeed Ghadimi and Mengdi Wang. Approximation methods for bilevel programming. arXiv preprint arXiv:1802.02246, 2018.

[20] Ian J Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013.

[21] David Gottlieb and Steven A Orszag. Numerical analysis of spectral methods: theory and applications. SIAM, 1977.

[22] Evan Greensmith, Peter L Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(9), 2004.

[23] Zhongkai Hao, Chengyang Ying, Hang Su, Jun Zhu, Jian Song, and Ze Cheng. Bi-level physics-informed neural networks for PDE constrained optimization using Broyden's hypergradients. arXiv preprint arXiv:2209.07075, 2022.

[24] Ryuichiro Hataya and Makoto Yamada. Nyström method for accurate and scalable implicit differentiation. In International Conference on Artificial Intelligence and Statistics, pages 4643–4654. PMLR, 2023.

[25] Nathan Hu, Eric Mitchell, Christopher D Manning, and Chelsea Finn. Meta-learning online adaptation of language models. arXiv preprint arXiv:2305.15076, 2023.

[26] W Ronny Huang, Jonas Geiping, Liam Fowl, Gavin Taylor, and Tom Goldstein. MetaPoison: Practical general-purpose clean-label data poisoning. Advances in Neural Information Processing Systems, 33:12080–12091, 2020.

[27] Joel Jang, Seonghyeon Ye, Sohee Yang, Joongbo Shin, Janghoon Han, Gyeonghun Kim, Stanley Jungkyu Choi, and Minjoon Seo. Towards continual knowledge learning of language models. arXiv preprint arXiv:2110.03215, 2021.

[28] Kaiyi Ji, Junjie Yang, and Yingbin Liang. Bilevel optimization: Convergence analysis and enhanced design. In International Conference on Machine Learning, pages 4882–4892. PMLR, 2021.

[29] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[30] Steven George Krantz and Harold R Parks. The implicit function theorem: History, theory, and applications. Springer Science & Business Media, 2002.

[31] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

[32] Harshat Kumar, Dionysios S Kalogerias, George J Pappas, and Alejandro Ribeiro. Zeroth-order deterministic policy gradient. arXiv preprint arXiv:2006.07314, 2020.

[33] Jeongyeol Kwon, Dohyun Kwon, Stephen Wright, and Robert D Nowak. A fully first-order method for stochastic bilevel optimization. In International Conference on Machine Learning, pages 18083–18113. PMLR, 2023.

[34] Angeliki Lazaridou, Adhi Kuncoro, Elena Gribovskaya, Devang Agrawal, Adam Liska, Tayfun Terzi, Mai Gimenez, Cyprien de Masson d'Autume, Tomas Kocisky, Sebastian Ruder, et al. Mind the gap: Assessing temporal generalization in neural language models. Advances in Neural Information Processing Systems, 34:29348–29363, 2021.

[35] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[36] Adam Liska, Tomas Kocisky, Elena Gribovskaya, Tayfun Terzi, Eren Sezener, Devang Agrawal, D'Autume Cyprien De Masson, Tim Scholtes, Manzil Zaheer, Susannah Young, et al. StreamingQA: A benchmark for adaptation to new knowledge over time in question answering models. In International Conference on Machine Learning, pages 13604–13622. PMLR, 2022.

[37] Bo Liu, Mao Ye, Stephen Wright, Peter Stone, and Qiang Liu. BOME! Bilevel optimization made easy: A simple first-order approach. Advances in Neural Information Processing Systems, 35:17248–17262, 2022.

[38] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.

[39] Risheng Liu, Xuan Liu, Xiaoming Yuan, Shangzhi Zeng, and Jin Zhang. A value-function-based interior-point method for non-convex bi-level optimization. In International Conference on Machine Learning, pages 6882–6892. PMLR, 2021.

[40] Risheng Liu, Pan Mu, Xiaoming Yuan, Shangzhi Zeng, and Jin Zhang. A generic first-order algorithmic framework for bi-level programming beyond lower-level singleton. In International Conference on Machine Learning, pages 6305–6315. PMLR, 2020.

[41] Sijia Liu, Pin-Yu Chen, Bhavya Kailkhura, Gaoyuan Zhang, Alfred O Hero III, and Pramod K Varshney. A primer on zeroth-order optimization in signal processing and machine learning: Principals, recent advances, and applications. IEEE Signal Processing Magazine, 37(5):43–54, 2020.

[42] Sijia Liu, Bhavya Kailkhura, Pin-Yu Chen, Paishun Ting, Shiyu Chang, and Lisa Amini. Zeroth-order stochastic variance reduction for nonconvex optimization. Advances in Neural Information Processing Systems, 31, 2018.

[43] Jonathan Lorraine, Paul Vicol, and David Duvenaud. Optimizing millions of hyperparameters by implicit differentiation. In International Conference on Artificial Intelligence and Statistics, pages 1540–1552. PMLR, 2020.

[44] Lu Lu, Raphael Pestourie, Wenjie Yao, Zhicheng Wang, Francesc Verdugo, and Steven G Johnson. Physics-informed neural networks with hard constraints for inverse design. SIAM Journal on Scientific Computing, 43(6):B1105–B1132, 2021.

[45] Jelena Luketina, Mathias Berglund, Klaus Greff, and Tapani Raiko. Scalable gradient-based tuning of continuous regularization hyperparameters. In International Conference on Machine Learning, pages 2952–2960. PMLR, 2016.

[46] Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pages 2113–2122. PMLR, 2015.

[47] Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D Lee, Danqi Chen, and Sanjeev Arora. Fine-tuning language models with just forward passes. Advances in Neural Information Processing Systems, 36:53038–53075, 2023.

[48] John L Nazareth. Conjugate gradient method. Wiley Interdisciplinary Reviews: Computational Statistics, 1(3):348–353, 2009.

[49] Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.

[50] Barak A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147–160, 1994.

[51] George F Pinder. Numerical methods for solving partial differential equations: a comprehensive introduction for scientists and engineers. John Wiley & Sons, 2018.

[52] Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019.

[53] Aravind Rajeswaran, Chelsea Finn, Sham M Kakade, and Sergey Levine. Meta-learning with implicit gradients. Advances in Neural Information Processing Systems, 32, 2019.

[54] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.

[55] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.

[56] Mengye Ren, Simon Kornblith, Renjie Liao, and Geoffrey Hinton. Scaling forward gradient with local losses. arXiv preprint arXiv:2210.03310, 2022.

[57] Levent Sagun, Leon Bottou, and Yann LeCun. Eigenvalues of the Hessian in deep learning: Singularity and beyond. arXiv preprint arXiv:1611.07476, 2016.

[58] Levent Sagun, Utku Evci, V. Ugur Güney, Yann N. Dauphin, and Léon Bottou. Empirical analysis of the Hessian of over-parametrized neural networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Workshop Track Proceedings, 2018.

[59] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.

[60] Amirreza Shaban, Ching-An Cheng, Nathan Hatch, and Byron Boots. Truncated back-propagation for bilevel optimization. In The International Conference on Artificial Intelligence and Statistics, pages 1723–1732. PMLR, 2019.

[61] Han Shen and Tianyi Chen. On penalty-based bilevel gradient descent method. In International Conference on Machine Learning, pages 30992–31015. PMLR, 2023.

[62] Qianli Shen, Wai Hoh Tang, Zhun Deng, Apostolos Psaros, and Kenji Kawaguchi. PICProp: Physics-informed confidence propagation for uncertainty quantification. Advances in Neural Information Processing Systems, 36, 2024.

[63] David Silver, Anirudh Goyal, Ivo Danihelka, Matteo Hessel, and Hado van Hasselt. Learning by directional gradient descent. In International Conference on Learning Representations, 2021.

[64] Sidak Pal Singh and Dan Alistarh. WoodFisher: Efficient second-order approximation for neural network compression. Advances in Neural Information Processing Systems, 33:18098–18109, 2020.

[65] Ankur Sinha, Pekka Malo, and Kalyanmoy Deb. A review on bilevel optimization: From classical to evolutionary approaches and applications. IEEE Transactions on Evolutionary Computation, 22(2):276–295, 2017.

[66] Jihoon Tack, Jaehyung Kim, Eric Mitchell, Jinwoo Shin, Yee Whye Teh, and Jonathan Richard Schwarz. Online adaptation of language models with a memory of amortized contexts. arXiv preprint arXiv:2403.04317, 2024.

[67] Kai Wang, Bo Zhao, Xiangyu Peng, Zheng Zhu, Shuo Yang, Shuo Wang, Guan Huang, Hakan Bilen, Xinchao Wang, and Yang You. CAFE: Learning to condense dataset by aligning features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12196–12205, 2022.

[68] Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and Alexei A Efros. Dataset distillation. arXiv preprint arXiv:1811.10959, 2018.

[69] R. E. Wengert. A simple automatic derivative evaluation program. Communications of the ACM, 7(8):463–464, 1964.

[70] Ronald J. Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280, 1989.

[71] Peng Yang, Yingjie Lao, and Ping Li. Robust watermarking for deep neural networks via bi-level optimization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14841–14850, 2021.

[72] Ruonan Yu, Songhua Liu, and Xinchao Wang. Dataset distillation: A comprehensive review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.

[73] Yihua Zhang, Prashant Khanduri, Ioannis Tsaknakis, Yuguang Yao, Mingyi Hong, and Sijia Liu. An introduction to bi-level optimization: Foundations and applications in signal processing and machine learning. arXiv preprint arXiv:2308.00788, 2023.

[74] Yihua Zhang, Pingzhi Li, Junyuan Hong, Jiaxiang Li, Yimeng Zhang, Wenqing Zheng, Pin-Yu Chen, Jason D Lee, Wotao Yin, Mingyi Hong, et al. Revisiting zeroth-order optimization for memory-efficient LLM fine-tuning: A benchmark. arXiv preprint arXiv:2402.11592, 2024.

[75] Yihua Zhang, Yuguang Yao, Parikshit Ram, Pu Zhao, Tianlong Chen, Mingyi Hong, Yanzhi Wang, and Sijia Liu. Advancing model pruning via bi-level optimization. Advances in Neural Information Processing Systems, 35:18309–18326, 2022.

[76] Yihua Zhang, Guanhua Zhang, Prashant Khanduri, Mingyi Hong, Shiyu Chang, and Sijia Liu. Revisiting and advancing fast adversarial training through the lens of bi-level optimization. In International Conference on Machine Learning, pages 26693–26712. PMLR, 2022.

[77] Bo Zhao, Konda Reddy Mopuri, and Hakan Bilen. Dataset condensation with gradient matching. arXiv preprint arXiv:2006.05929, 2020.

[78] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.

[79] Simiao Zuo, Chen Liang, Haoming Jiang, Xiaodong Liu, Pengcheng He, Jianfeng Gao, Weizhu Chen, and Tuo Zhao. Adversarial regularization as Stackelberg game: An unrolled optimization approach. arXiv preprint arXiv:2104.04886, 2021.
A Algorithm

Algorithm 1 (FG)2U: Forward Gradient Unrolling with Forward Gradient

Require: Initial inner parameters $\theta_0$, initial meta parameter $\phi_0$, random direction distribution $p$, number of random directions $b$, total meta steps $K$, meta update mappings $\Psi_{1:K}$.
1: $\theta \leftarrow \theta_0$, $\phi \leftarrow \phi_0$
2: for $k = 1, \dots, K$ do
3:   for $i = 1, \dots, b$ do
4:     Sample $v_i \sim p(\cdot)$ and initialize $y_i \leftarrow \frac{d\Omega_0(\phi)}{d\phi} v_i$
5:   end for
6:   for $t = 1, \dots, T$ do
7:     $A \leftarrow \frac{\partial \Omega_t(\theta, \phi)}{\partial \theta}$, $B \leftarrow \frac{\partial \Omega_t(\theta, \phi)}{\partial \phi}$, $\theta \leftarrow \Omega_t(\theta, \phi)$
8:     for $i = 1, \dots, b$ do
9:       $y_i \leftarrow A y_i + B v_i$
10:    end for
11:  end for
12:  for $i = 1, \dots, b$ do
13:    $w_i \leftarrow \frac{\partial f(\theta, \phi)}{\partial \theta} y_i + \frac{\partial f(\theta, \phi)}{\partial \phi} v_i$
14:  end for
15:  $\phi \leftarrow \Psi_k\big(\phi, \frac{1}{b} \sum_{i=1}^{b} w_i v_i^\top\big)$
16: end for
17: return $\phi$
B Extended Discussion on Bi-level Optimization

B.1 Truncated Reverse Gradient Unrolling (TRGU)

To address the memory issue of GU methods, truncated Reverse Gradient Unrolling (TRGU) [60] was proposed to reduce memory usage by preserving only the last $s$ steps of the inner optimization trajectory. However, this introduces a significant bias in large-scale scenarios, particularly when the permissible $s$ is small.

Recall (5) and (6), where the conventional RGU method computes the hypergradient by fully unrolling the $T$-step inner optimization into a computational graph. Instead, TRGU performs $s$-step truncated back-propagation and approximates the gradient with the intermediate term $c_{T-s}$:
$$c_{T-s} = c_T + \sum_{t=T-s+1}^{T} d_T A_T \cdots A_{t+1} B_t. \tag{21}$$
According to Proposition 3.1 in [60], if the inner-level objective function $g$ is $L$-smooth, twice-differentiable, and globally $\alpha$-strongly convex, and the gradient update rule is $\theta_t = \theta_{t-1} - \eta \nabla_\theta g(\theta_{t-1}, \phi)$, then the bias of $s$-step TRGU is bounded by
$$\|\nabla_\phi h - c_{T-s}\| \le \frac{(1 - \eta\alpha)^s}{\eta\alpha} \|d_T\| \max_{t \in \{0, \dots, T-s\}} \|B_t\|. \tag{22}$$
The bound (22) demonstrates an exponentially decaying rate in $s$ for the bias of $s$-step TRGU. However, as $s$ gets smaller, meaning that we truncate the computational graph more heavily in pursuit of lower memory cost, the bias grows exponentially. This results in an inaccurate calculation of the hypergradient. In contrast, our (FG)2U is an unbiased estimator of the hypergradient, while still maintaining high memory efficiency with a small forward-gradient sample size as in (10).
B.2 Implicit Function (IF)

Another idea for computing the implicit gradient is to utilize the implicit function theorem (IFT) [30]. Suppose that the inner optimality $\nabla_\theta g(\theta_T(\phi), \phi) \approx 0$ is approximately achieved by sufficient inner optimization steps. If $g$ is second-order differentiable, applying the implicit function theorem and taking the first-order derivative with respect to $\phi$ yields
$$\frac{\partial^2 g(\theta_T(\phi), \phi)}{\partial \theta_T^2} \frac{d\theta_T(\phi)}{d\phi} + \frac{\partial^2 g(\theta_T(\phi), \phi)}{\partial \theta_T \partial \phi} \approx 0. \tag{23}$$
Then, if the Hessian is further assumed to be invertible, the meta gradient can be approximated as
$$\nabla h(\phi) \approx - \underbrace{\frac{\partial f(\theta_T(\phi), \phi)}{\partial \theta_T}}_{d} \underbrace{\left( \frac{\partial^2 g(\theta_T(\phi), \phi)}{\partial \theta_T^2} \right)^{-1}}_{H^{-1}} \underbrace{\frac{\partial^2 g(\theta_T(\phi), \phi)}{\partial \theta_T \partial \phi}}_{Y} + \underbrace{\frac{\partial f(\theta_T(\phi), \phi)}{\partial \phi}}_{c}. \tag{24}$$
The main challenge lies in the computation of the inverse Hessian matrix $H^{-1}$, which is intractable when $\theta$ is of high dimensionality. Fortunately, several iterative inverse Hessian-vector product (ihvp) approximators requiring only Hessian-vector products (hvp) and $O(M)$ space can be employed to produce an estimate $\widehat{dH^{-1}}$ of $dH^{-1}$, based on Conjugate Gradient [48, 53], the Neumann series [19, 28], and low-rank approximation [64, 24].

Neumann Series. The inverse Hessian-vector product can be approximated with a truncated sum of the Neumann series [19, 28],
$$\widehat{dH^{-1}} = \alpha \sum_{k=0}^{K} d(I - \alpha H)^k = dH^{-1} - \alpha \sum_{k=K+1}^{\infty} d(I - \alpha H)^k, \tag{25}$$
where $\alpha$ is a hyperparameter to ensure convergence, and $K$ is the number of truncated steps. Compared to other IF-based methods, the Neumann series has demonstrated good empirical performance and stability [19], and its stochastic variant has been well studied [28].
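As an illustration, the following sketch implements the truncated Neumann-series ihvp in (25) using only Hessian-vector products on a toy quadratic; the quadratic and all names are illustrative assumptions, not the experimental implementation.

```python
# Minimal sketch of the truncated Neumann-series ihvp in Eq. (25): d H^{-1} is
# approximated using only Hessian-vector products, never materializing H. For a
# real network, hvp would use jvp-of-grad in exactly the same way.
import jax
import jax.numpy as jnp

A = jnp.diag(jnp.array([1.0, 2.0, 4.0]))
g = lambda theta: 0.5 * theta @ A @ theta           # toy inner objective, Hessian H = A

def hvp(theta, u):                                   # H u via forward-over-reverse AD
    return jax.jvp(jax.grad(g), (theta,), (u,))[1]

def neumann_ihvp(theta, d, K=200, alpha=0.2):
    """alpha * sum_{k=0}^{K} d (I - alpha H)^k, accumulated term by term."""
    p, acc = d, d                                    # p holds d (I - alpha H)^k
    for _ in range(K):
        p = p - alpha * hvp(theta, p)                # valid since H is symmetric
        acc = acc + p
    return alpha * acc

theta, d = jnp.zeros(3), jnp.ones(3)
print(neumann_ihvp(theta, d))                        # approx d H^{-1} = [1, 0.5, 0.25]
```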
Weakness (IF): Approximation Errors. The errors of IF emanate from two distinct sources. Firstly, IFT presupposes that the Karush-Kuhn-Tucker (KKT) conditions of the inner problem are satisfied, leading to an approximation error in (23) when iterative approximations of the inner solutions are used. Secondly, the singular nature of the Hessian in neural network training [57] leads to costly and unstable inverse Hessian approximation in practical applications, with a heavy reliance on engineering efforts [9]. More formally, recall
$$\nabla h(\phi) = \underbrace{\frac{\partial f(\theta_T(\phi), \phi)}{\partial \theta_T}}_{d} \underbrace{\frac{d\theta_T(\phi)}{d\phi}}_{Z} + \underbrace{\frac{\partial f(\theta_T(\phi), \phi)}{\partial \phi}}_{c}. \tag{26}$$
The approximation error can be decomposed into
$$\underbrace{\nabla h(\phi) - \hat{\nabla} h(\phi)}_{\epsilon} = \underbrace{d(Z + H^{-1} Y)}_{\epsilon_{if}} + \underbrace{(\widehat{dH^{-1}} - dH^{-1}) Y}_{\epsilon_{inv}}. \tag{27}$$
To reduce the computational cost, Hessian-free approaches [76, 75, 9] propose approximating the Hessian as an identity matrix, incorporating additional assumptions about the inner model and objective:
$$\widehat{dH^{-1}} = \alpha d I, \tag{28}$$
where $\alpha > 0$ is a hyperparameter to control the magnitude. However, these numerous assumptions often diverge from practical scenarios, resulting in significant approximation errors and consequently inducing suboptimal outcomes.
Figure B.1: CIFAR100, IPC=50: (a) inner loss vs. iterations; (b) log10 of the L2 gradient norm vs. iterations, for the Neumann-series approximation.
In Figure B.1, it is evident that the inner optimization has not converged by unrolled step 100, as indicated by both the inner loss and the gradient norm. This implies that the Karush-Kuhn-Tucker (KKT) conditions are not satisfied, so the approximation used in (23) introduces a bias.
B.3 Value Function (VF)
The VF-based methodology [39, 37, 61, 33] considers an equivalent reformulation of the original optimization problem outlined in (2):
$$\min_{\theta,\phi} f(\theta,\phi)\quad\text{s.t.}\quad g(\theta,\phi) \le g(\theta_T(\phi),\phi). \tag{29}$$
This reformulation casts the standard bi-level optimization problem into a constrained single-level optimization problem. VF-based methods circumvent the need for second-order computations and have demonstrated near-optimal complexity, comparable to second-order methodologies in deterministic settings, as reported in [8].
Weakness (VF): Stochastic Optimization. VF-based strategies have yet to gain widespread adoption in practical ML applications, primarily because these methods struggle to address large-scale stochastic problems, where the complexity significantly impedes their performance [73].
C Proofs of Theoretical Results
In this section, we detail the proofs of Lemma 3.1 and Theorem 3.4. Our approach to proving Theorem 3.4 follows a similar high-level approach as [19, 28, 60], with some important distinctions. Initially, in Lemma B.2, we extend the smoothness properties of the objective functions $f$ and $g$, and the transition functions $\Omega$, to the $T$-th-iteration lower-level parameter $\theta_T$. Following this, Lemma B.3 establishes the smoothness of the meta objective $f(\theta_T,\phi)$, incorporating results from the inner-loop computations. Building on the demonstrated smoothness of the meta objective function and the variance of the forward gradient method (as shown in Lemma 3.1), we then validate the convergence properties of Algorithm 1.

The novelty in our proof of Theorem 3.4 lies in two primary aspects. Firstly, our analysis does not presume that the lower-level optimization yields an optimal solution $\theta^*$; instead, it more realistically assumes the use of $\theta_T$, which is derived from a finite number of iterations. This assumption aligns more closely with the computational constraints encountered in real-world scenarios. Secondly, our convergence analysis explicitly accounts for the variance of our unbiased gradient estimator, achieving a convergence rate of $O(\epsilon^{-1}\rho^{-1})$. This demonstrates that utilizing the forward gradient method, while significantly reducing memory requirements, does not adversely affect the algorithm's convergence rate, underscoring the practical viability and efficiency of our approach even with memory constraints.
C.1 Proof of Lemma 3.1
For convenience, the lemma is restated as follows.
Lemma 3.1. For any $\phi\in\Phi$, the gradient estimate of the forward gradient method,
$$\hat\nabla h(\phi) = \frac{1}{b}\sum_{i=1}^{b}\nabla h(\phi)\,v_i v_i^\top,$$
where $v_i\sim\mathrm{Unif}(\{-1,1\}^N)$ and $b$ denotes the sample size, satisfies
$$\mathbb{E}\big\|\hat\nabla h(\phi) - \nabla h(\phi)\big\|^2 = \frac{1}{\rho}\,\|\nabla h(\phi)\|^2,$$
where $\rho := \frac{b}{N-1}\in(0,1]$, as $b$ is selected from $\{1,\dots,N-1\}$.
Proof. We start by computing the variance of the one-sample estimate $\hat\nabla h(\phi) = \nabla h(\phi)\,vv^\top$. Since $\mathbb{E}[vv^\top] = I$, we know that $\mathbb{E}[\hat\nabla h(\phi)] = \nabla h(\phi)$. Consequently,
$$\begin{aligned}
\mathbb{E}\|\hat\nabla h(\phi) - \mathbb{E}\hat\nabla h(\phi)\|^2 &= \mathbb{E}\|\nabla h(\phi)(vv^\top - I)\|^2\\
&= \mathbb{E}\big[\nabla h(\phi)(vv^\top - I)(vv^\top - I)^\top\nabla h(\phi)^\top\big]\\
&= \mathbb{E}\big[\nabla h(\phi)(vv^\top)(vv^\top)^\top\nabla h(\phi)^\top - 2\,\nabla h(\phi)(vv^\top)\nabla h(\phi)^\top + \nabla h(\phi)\nabla h(\phi)^\top\big]\\
&= \mathbb{E}\|\nabla h(\phi)vv^\top\|^2 - 2\,\mathbb{E}\|\nabla h(\phi)v\|^2 + \|\nabla h(\phi)\|^2. \tag{30}
\end{aligned}$$
Since $v$ is an $N$-dimensional Rademacher random variable, we have $\mathbb{E}\|\nabla h(\phi)v\|^2 = \|\nabla h(\phi)\|^2$ and $\mathbb{E}\|\nabla h(\phi)vv^\top\|^2 = N\|\nabla h(\phi)\|^2$. Then,
$$(30) = \mathbb{E}\|\nabla h(\phi)vv^\top\|^2 - \|\nabla h(\phi)\|^2 = (N-1)\|\nabla h(\phi)\|^2.$$
For the multi-sample estimate $\hat\nabla h(\phi) = \frac{1}{b}\sum_{i=1}^{b}\nabla h(\phi)v_i v_i^\top$, since the $v_i$ are sampled i.i.d. and $\mathbb{E}[\hat\nabla h(\phi)] = \nabla h(\phi)$, we have
$$\begin{aligned}
\mathbb{E}\|\hat\nabla h(\phi) - \mathbb{E}\hat\nabla h(\phi)\|^2 &= \mathbb{E}\Big\|\nabla h(\phi)\,\frac{1}{b}\sum_{i=1}^{b}(v_i v_i^\top - I)\Big\|^2\\
&= \frac{1}{b^2}\sum_{i=1}^{b}\mathbb{E}\|\nabla h(\phi)(v_i v_i^\top - I)\|^2 = \frac{N-1}{b}\,\|\nabla h(\phi)\|^2.
\end{aligned}$$
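As a quick numerical sanity check of Lemma 3.1 (an illustration we add here, not part of the original experiments), the following self-contained NumPy snippet verifies that the relative estimation error concentrates around $(N-1)/b$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, b, trials = 100, 10, 20000
g = rng.normal(size=N)                           # stands in for grad h(phi)

err2 = 0.0
for _ in range(trials):
    V = rng.choice([-1.0, 1.0], size=(b, N))     # b Rademacher directions
    g_hat = ((V @ g)[:, None] * V).mean(axis=0)  # (1/b) sum_i (g . v_i) v_i
    err2 += np.sum((g_hat - g) ** 2)

print(err2 / trials / np.sum(g ** 2))            # approx (N - 1) / b = 9.9
```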
C.2 Proof of Theorem 3.4
To prove our main result (Theorem 3.4), we first establish useful smoothness properties of the lower-level parameter learned from solving the lower-level optimization problem. Subsequently, we establish the smoothness of the meta objective function evaluated at the approximated lower-level parameters in Lemma B.3.

Regarding the lower-level parameter $\theta(\phi)$, we present the following lemma, which is based on Assumption 3.3 and establishes that $\theta(\phi)$ inherits similar Lipschitz continuity and smoothness properties as $\Omega_t$.

Lemma B.2. Under Assumptions 3.2 and 3.3, $\theta_T(\phi)$ is $C_Z$-Lipschitz and $L_Z$-smooth, i.e., for any $\phi,\phi'\in\Phi$,
$$\|\theta_T(\phi)-\theta_T(\phi')\| \le C_Z\|\phi-\phi'\|,\qquad \|\nabla\theta_T(\phi)-\nabla\theta_T(\phi')\| \le L_Z\|\phi-\phi'\|,$$
where $C_Z = \frac{C_\Omega^{T+2}-C_\Omega}{C_\Omega-1}$ and $L_Z = L_\Omega\Big(C_\Omega^T + \frac{C_\Omega^{T+2}T}{C_\Omega-1} - \frac{C_\Omega^T-1}{(C_\Omega-1)^2}\Big)$.
Proof. We start with the proof of Lipschitz continuity. For any pair $\phi,\phi'\in\Phi$, using (2) and Assumption 3.3, we have
$$\|\theta_s(\phi)-\theta_s(\phi')\| = \|\Omega(\theta_{s-1}(\phi),\phi)-\Omega(\theta_{s-1}(\phi'),\phi')\| \le C_\Omega\|\theta_{s-1}(\phi)-\theta_{s-1}(\phi')\| + C_\Omega\|\phi-\phi'\|. \tag{31}$$
Applying (31) recursively over $s = 1,\dots,t$ gives
$$\|\theta_t(\phi)-\theta_t(\phi')\| \le C_\Omega^t\|\theta_0(\phi)-\theta_0(\phi')\| + \sum_{s=1}^{t}C_\Omega^s\|\phi-\phi'\| \le \sum_{s=1}^{t+1}C_\Omega^s\|\phi-\phi'\| = \frac{C_\Omega^{t+2}-C_\Omega}{C_\Omega-1}\|\phi-\phi'\|, \tag{32}$$
where the last inequality holds from the fact that $\theta_0 = \Omega_0$ together with Assumption 3.3, and the subsequent equality follows from the geometric series summation formula.

Therefore, $\theta_t(\phi)$ is $C_Z(t)$-Lipschitz, where $C_Z(t) := \frac{C_\Omega^{t+2}-C_\Omega}{C_\Omega-1}$. Substituting $t = T$, we get $C_Z = C_Z(T) = \frac{C_\Omega^{T+2}-C_\Omega}{C_\Omega-1}$.
We now proceed with the proof of $L_Z$-smoothness. For simplicity of notation, we follow (4) and denote
$$Z_t(\phi) = \nabla\theta_t(\phi);\qquad A_t(\phi) = \frac{\partial\Omega_t(\theta_{t-1}(\phi),\phi)}{\partial\theta_{t-1}};\qquad B_t(\phi) = \frac{\partial\Omega_t(\theta_{t-1}(\phi),\phi)}{\partial\phi}.$$
Subsequently, considering the update rule $Z_t(\phi) = A_t(\phi)Z_{t-1}(\phi) + B_t(\phi)$ of Forward Gradient Unrolling (4), we have
$$\begin{aligned}
\|\nabla\theta_t(\phi)-\nabla\theta_t(\phi')\| &= \|Z_t(\phi)-Z_t(\phi')\|\\
&\overset{(4)}{=} \big\|A_t(\phi)Z_{t-1}(\phi) + B_t(\phi) - \big(A_t(\phi')Z_{t-1}(\phi') + B_t(\phi')\big)\big\|\\
&\le \|A_t(\phi)Z_{t-1}(\phi) - A_t(\phi')Z_{t-1}(\phi')\| + \|B_t(\phi)-B_t(\phi')\|\\
&= \|A_t(\phi)Z_{t-1}(\phi) - A_t(\phi)Z_{t-1}(\phi') + A_t(\phi)Z_{t-1}(\phi') - A_t(\phi')Z_{t-1}(\phi')\| + \|B_t(\phi)-B_t(\phi')\|\\
&\le \|A_t(\phi)\|\cdot\|Z_{t-1}(\phi)-Z_{t-1}(\phi')\| + \|A_t(\phi)-A_t(\phi')\|\cdot\|Z_{t-1}(\phi')\| + \|B_t(\phi)-B_t(\phi')\|\\
&\le C_\Omega\|Z_{t-1}(\phi)-Z_{t-1}(\phi')\| + (C_Z(t)+1)L_\Omega\|\phi-\phi'\|, \tag{33}
\end{aligned}$$
where the last inequality follows from Assumption 3.3, which states that $\Omega_0(\phi)$ and $\Omega_{1:T}(\psi)$ are $C_\Omega$-Lipschitz and $L_\Omega$-smooth, together with the previously proved result that $\theta_t(\phi)$ is $C_Z(t)$-Lipschitz.
Noting that $Z_0(\phi) = \nabla\theta_0(\phi) = \nabla\Omega_0(\phi)$, applying (33) recursively over $t = 1,\dots,T$ gives
$$\begin{aligned}
\|\nabla\theta_T(\phi)-\nabla\theta_T(\phi')\| &\le C_\Omega^T\|Z_0(\phi)-Z_0(\phi')\| + \sum_{t=1}^{T}C_\Omega^{T-t}(C_Z(t)+1)L_\Omega\|\phi-\phi'\|\\
&= C_\Omega^T\|\nabla\Omega_0(\phi)-\nabla\Omega_0(\phi')\| + C_\Omega^T\sum_{t=1}^{T}\frac{C_Z(t)+1}{C_\Omega^{t}}L_\Omega\|\phi-\phi'\|\\
&\le L_\Omega C_\Omega^T\|\phi-\phi'\| + C_\Omega^T\sum_{t=1}^{T}\frac{C_\Omega^{t+2}-1}{C_\Omega^{t}(C_\Omega-1)}L_\Omega\|\phi-\phi'\|\\
&= L_\Omega\Bigg[C_\Omega^T + C_\Omega^T\sum_{t=1}^{T}\frac{C_\Omega^2}{C_\Omega-1} - \frac{C_\Omega^T}{C_\Omega-1}\sum_{t=1}^{T}\frac{1}{C_\Omega^{t}}\Bigg]\|\phi-\phi'\|\\
&= L_\Omega\Bigg[C_\Omega^T + \frac{C_\Omega^{T+2}T}{C_\Omega-1} - \frac{C_\Omega^T}{C_\Omega-1}\cdot\frac{\frac{1}{C_\Omega}\big(1-\frac{1}{C_\Omega^T}\big)}{1-\frac{1}{C_\Omega}}\Bigg]\|\phi-\phi'\|\\
&= L_\Omega\Bigg(C_\Omega^T + \frac{C_\Omega^{T+2}T}{C_\Omega-1} - \frac{C_\Omega^T-1}{(C_\Omega-1)^2}\Bigg)\|\phi-\phi'\|,
\end{aligned}$$
where the third line follows from Assumption 3.3 that $\Omega_0(\phi)$ is $L_\Omega$-smooth and the choice $C_Z(t) = \frac{C_\Omega^{t+2}-C_\Omega}{C_\Omega-1}$ (which gives $C_Z(t)+1 = \frac{C_\Omega^{t+2}-1}{C_\Omega-1}$), and the fifth line again uses the geometric series summation formula.

Hence, $\theta_T(\phi)$ is $L_Z$-smooth with $L_Z = L_\Omega\Big(C_\Omega^T + \frac{C_\Omega^{T+2}T}{C_\Omega-1} - \frac{C_\Omega^T-1}{(C_\Omega-1)^2}\Big)$.
Next, we provide a lemma establishing that the upper-level objective $f$, evaluated at the learned parameters $(\theta_T(\phi),\phi)$, also adheres to certain smoothness properties.

Lemma B.3. Define $h(\phi) := f(\theta_T(\phi),\phi)$. Under Assumptions 3.2 and 3.3, $h(\phi)$ is $L_h$-smooth, i.e., for any $\phi,\phi'\in\Phi$,
$$\|\nabla h(\phi)-\nabla h(\phi')\| \le L_h\|\phi-\phi'\|,$$
where $L_h = (C_Z+1)^2 L + C L_Z$, with $C_Z$ and $L_Z$ defined in Lemma B.2.
Proof. For simplicity of notation, we follow (5) and denote
$$Z_t(\phi) = \nabla\theta_t(\phi);\qquad c_T(\phi) = \frac{\partial f(\theta_T(\phi),\phi)}{\partial\phi};\qquad d_T(\phi) = \frac{\partial f(\theta_T(\phi),\phi)}{\partial\theta_T}.$$
For any $\phi,\phi'\in\Phi$, following a similar proof as Lemma B.2, we have
$$\begin{aligned}
\|\nabla h(\phi)-\nabla h(\phi')\| &\overset{(5)}{=} \|d_T(\phi)Z_T(\phi) + c_T(\phi) - (d_T(\phi')Z_T(\phi') + c_T(\phi'))\|\\
&\le \|d_T(\phi)Z_T(\phi) - d_T(\phi')Z_T(\phi')\| + \|c_T(\phi)-c_T(\phi')\|\\
&= \|d_T(\phi)Z_T(\phi) - d_T(\phi')Z_T(\phi) + d_T(\phi')Z_T(\phi) - d_T(\phi')Z_T(\phi')\| + \|c_T(\phi)-c_T(\phi')\|\\
&\le \|d_T(\phi)-d_T(\phi')\|\cdot\|Z_T(\phi)\| + \|d_T(\phi')\|\cdot\|Z_T(\phi)-Z_T(\phi')\| + \|c_T(\phi)-c_T(\phi')\|. \tag{34}
\end{aligned}$$
Subsequently, we deduce that
$$\begin{aligned}
(34) &\le C_Z\|d_T(\phi)-d_T(\phi')\| + C L_Z\|\phi-\phi'\| + \|c_T(\phi)-c_T(\phi')\|\\
&\le C_Z\bigg(\Big\|\frac{\partial f(\theta_T(\phi),\phi)}{\partial\theta_T} - \frac{\partial f(\theta_T(\phi'),\phi)}{\partial\theta_T}\Big\| + \Big\|\frac{\partial f(\theta_T(\phi'),\phi)}{\partial\theta_T} - \frac{\partial f(\theta_T(\phi'),\phi')}{\partial\theta_T}\Big\|\bigg)\\
&\quad + \Big\|\frac{\partial f(\theta_T(\phi),\phi)}{\partial\phi} - \frac{\partial f(\theta_T(\phi'),\phi)}{\partial\phi}\Big\| + \Big\|\frac{\partial f(\theta_T(\phi'),\phi)}{\partial\phi} - \frac{\partial f(\theta_T(\phi'),\phi')}{\partial\phi}\Big\| + C L_Z\|\phi-\phi'\|\\
&\le C_Z L\|\theta_T(\phi)-\theta_T(\phi')\| + C_Z L\|\phi-\phi'\| + L\|\theta_T(\phi)-\theta_T(\phi')\| + L\|\phi-\phi'\| + C L_Z\|\phi-\phi'\|\\
&\le C_Z L\,C_Z\|\phi-\phi'\| + C_Z L\|\phi-\phi'\| + L C_Z\|\phi-\phi'\| + L\|\phi-\phi'\| + C L_Z\|\phi-\phi'\|\\
&= \big[(C_Z+1)^2 L + C L_Z\big]\|\phi-\phi'\|,
\end{aligned}$$
where the first, third, and fourth lines all follow directly from Lemma B.2 and Assumption 3.3 (recall that the latter states that $f$ is $L$-Lipschitz and $C$-smooth).

Therefore, $h(\phi)$ is $L_h$-smooth with $L_h = (C_Z+1)^2 L + C L_Z$.
Now, based on the aforementioned lemmas, we present the proof of our main theorem: the convergence analysis for our bilevel optimization method (FG)2U.

Theorem 3.4 (Convergence). Suppose that Assumption 3.2 and Assumption 3.3 hold. Setting the learning rate $\beta = \frac{\rho}{(\rho+1)L_h}$ for gradient descent over the hyperparameter $\phi$, we have
$$\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}\|\nabla h(\phi_k)\|^2 \le \frac{4L_h\big(\mathbb{E}[h(\phi_0)] - \min_\phi h(\phi)\big)}{\rho K}. \tag{35}$$
Proof. We have
$$\begin{aligned}
h(\phi_{k+1}) - h(\phi_k) &\le \langle\nabla h(\phi_k),\,\phi_{k+1}-\phi_k\rangle + \frac{L_h}{2}\|\phi_{k+1}-\phi_k\|^2\\
&= -\beta\langle\nabla h(\phi_k),\,\hat\nabla h(\phi_k)\rangle + \frac{\beta^2 L_h}{2}\|\hat\nabla h(\phi_k)\|^2\\
&= -\beta\langle\nabla h(\phi_k),\,\hat\nabla h(\phi_k)\rangle + \frac{\beta^2 L_h}{2}\|\nabla h(\phi_k) + \hat\nabla h(\phi_k) - \nabla h(\phi_k)\|^2\\
&= -\frac{\beta^2 L_h}{2}\|\nabla h(\phi_k)\|^2 + (\beta^2 L_h - \beta)\langle\nabla h(\phi_k),\,\hat\nabla h(\phi_k)\rangle + \frac{\beta^2 L_h}{2}\|\nabla h(\phi_k) - \hat\nabla h(\phi_k)\|^2, \tag{36}
\end{aligned}$$
where the first inequality is a well-known inequality for smooth functions, with the $L_h$-smoothness itself following from Lemma B.3, and the subsequent equality uses the gradient descent rule $\phi_{k+1} = \phi_k - \beta\hat\nabla h(\phi_k)$.
By Lemma 3.1 and the fact that $\hat\nabla h$ is unbiased (see the proof of Lemma 3.1), we know that
$$\mathbb{E}\big[\langle\nabla h(\phi_k),\,\hat\nabla h(\phi_k)\rangle\,\big|\,\phi_k\big] = \|\nabla h(\phi_k)\|^2;\qquad \mathbb{E}\big[\|\nabla h(\phi_k)-\hat\nabla h(\phi_k)\|^2\,\big|\,\phi_k\big] = \frac{1}{\rho}\|\nabla h(\phi_k)\|^2.$$
Therefore, taking the conditional expectation $\mathbb{E}[\,\cdot\,|\,\phi_k]$ over (36) gives
$$\mathbb{E}[h(\phi_{k+1})\,|\,\phi_k] - h(\phi_k) \le -\Big(\beta - \Big(1+\frac{1}{\rho}\Big)\frac{\beta^2 L_h}{2}\Big)\|\nabla h(\phi_k)\|^2. \tag{37}$$
Furthermore, taking the full expectation and telescoping (37) over $k$ from $0$ to $K-1$ yields
$$\frac{1}{K}\sum_{k=0}^{K-1}\Big(\beta - \frac{(\rho+1)L_h}{2\rho}\beta^2\Big)\mathbb{E}\|\nabla h(\phi_k)\|^2 \le \frac{\mathbb{E}[h(\phi_0)] - \mathbb{E}[h(\phi_K)]}{K} \le \frac{\mathbb{E}[h(\phi_0)] - \min_\phi h(\phi)}{K}. \tag{38}$$
Choosing $\beta = \frac{\rho}{(\rho+1)L_h}$, we have
$$\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}\|\nabla h(\phi_k)\|^2 \le \frac{2(\rho+1)L_h\big(\mathbb{E}[h(\phi_0)] - \min_\phi h(\phi)\big)}{\rho K} \le \frac{4L_h\big(\mathbb{E}[h(\phi_0)] - \min_\phi h(\phi)\big)}{\rho K}. \tag{39}$$
Hence, Algorithm 1 requires $O(\epsilon^{-1}\rho^{-1})$ steps to attain an $\epsilon$-accurate stationary point.
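To illustrate the guarantee, the following toy NumPy experiment (our own illustrative addition, not an experiment from the paper) runs forward-gradient descent on the quadratic $h(\phi) = \frac{L_h}{2}\|\phi\|^2$ with the step size $\beta = \rho/((\rho+1)L_h)$ prescribed by Theorem 3.4, and checks that the averaged squared gradient norm stays below the bound (35):

```python
import numpy as np

rng = np.random.default_rng(1)
N, b, K, Lh = 50, 5, 2000, 4.0
rho = b / (N - 1)
beta = rho / ((rho + 1) * Lh)

phi = rng.normal(size=N)
h0 = 0.5 * Lh * np.sum(phi ** 2)       # h(phi_0); min_phi h(phi) = 0
avg_sq_norm = 0.0
for _ in range(K):
    grad = Lh * phi                    # exact gradient of the quadratic
    avg_sq_norm += np.sum(grad ** 2) / K
    V = rng.choice([-1.0, 1.0], size=(b, N))
    g_hat = ((V @ grad)[:, None] * V).mean(axis=0)   # forward-gradient estimate
    phi = phi - beta * g_hat

bound = 4 * Lh * h0 / (rho * K)
print(avg_sq_norm, "<=", bound)        # the bound (35) holds in expectation
```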
D Zeroth-Order Derivative Estimator
In this section, we give a more detailed introduction to zeroth-order (ZO) derivative estimators. These estimators are pivotal in scenarios where the computation of exact derivatives is either infeasible due to memory constraints or computationally prohibitive. Apart from the forward gradient method employed in (FG)2U, randomized smoothing (RS) is another widely used derivative estimator, applied both in reinforcement learning [22, 32] and to large language models [18, 47, 74].
For a function $F:\mathbb{R}^N\to\mathbb{R}$, gradient estimation via RS can be mathematically formulated as
$$\nabla_x F \approx \mathbb{E}_{v\sim\mathcal{N}(0,I)}\Big[\frac{F(x+\epsilon v) - F(x)}{\epsilon}\,v^\top\Big] \approx \frac{1}{b}\sum_{i=1}^{b}\frac{F(x+\epsilon v_i) - F(x)}{\epsilon}\,v_i^\top, \tag{40}$$
where $b$ is the number of random samples, $\epsilon$ is the smoothing parameter, and the $v_i$ are samples drawn from a standard Gaussian distribution. Regarding the accuracy of the estimate, it has been shown in [13, 42] that the variance of RS is roughly of order $O(N/b)$, matching that of FG as proved in Lemma 3.1.
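For concreteness, a minimal NumPy sketch of the finite-sample RS estimator in (40) follows; `F` is any black-box scalar function, and all names are our own illustrative choices:

```python
import numpy as np

def rs_gradient(F, x, eps=1e-4, b=64, rng=None):
    """Randomized-smoothing gradient estimate of a black-box scalar F at x,
    as in (40). Uses b + 1 function evaluations and no derivatives of F."""
    rng = rng or np.random.default_rng()
    Fx = F(x)
    grad = np.zeros_like(x)
    for _ in range(b):
        v = rng.standard_normal(x.shape)
        grad += (F(x + eps * v) - Fx) / eps * v
    return grad / b

# Example: F(x) = ||x||^2 has gradient 2x.
x = np.array([1.0, -2.0, 3.0])
print(rs_gradient(lambda z: np.sum(z ** 2), x, b=5000))  # approx [2, -4, 6]
```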
RS stands out particularly in its ability to estimate gradients of functions evaluated through black-box systems,
where internal operations are inaccessible or highly complex. This characteristic makes RS exceptionally
valuable in practical applications such as adversarial robustness and black-box optimization, where obtaining
direct gradients might not be possible. Another advantage of RS is its robustness against noise and discontinuities
in the function landscape. Unlike deterministic methods, the stochastic nature of RS allows it to approximate the
gradient over a smoothed version of the function, providing stability in scenarios where slight perturbations can
lead to substantial changes in the output.
While RS provides robust gradient estimates across various scenarios, it is critical to recognize that RS is inherently biased when the expectation cannot be computed exactly. In many CV and NLP applications, the computational expense of Monte Carlo sampling at the evaluation stage is prohibitive, leading to biased estimates when using RS. In the context of inverse PDE problems, however, where the inner-loop solvers are non-differentiable numerical solvers, we employ RS as a zeroth-order derivative estimator.
E Detailed Task Description
E.1 Data Condensation
In the era of rapid advancement in machine learning, a multitude of foundation models [11, 6, 55] have benefited from training on large-scale datasets, exhibiting formidable performance that models trained on small-scale data cannot match. However, the exponential growth of data also presents challenges: (1) Models updated with only new data are prone to catastrophic forgetting [20], while retaining all historical data for subsequent training imposes significant storage and computational burdens. (2) Applications within the realm of meta-learning, such as hyperparameter tuning [46, 43] and neural architecture search [78, 38], necessitate multiple training iterations over datasets; the computational cost of these operations scales dramatically with dataset size, posing a bottleneck for efficiency and scalability. (3) The widespread dissemination and utilization of datasets have raised significant concerns regarding privacy and copyright [12].
To overcome the challenges posed by large-scale datasets, a line of work known as data condensation [68, 72] has been proposed, with the idea of generating a compact, synthesized dataset designed to elicit similar behaviors in machine learning models as those trained with the original, massive dataset. The objectives of the mainstream principles [72] designed for data condensation can be naturally formulated as bi-level optimization problems. We focus on the best-known principle, performance matching [72], on the classification task, which can be formulated as
$$\min_{D_c}\,\mathcal{L}(\theta_T; D_o),\quad\text{where } \theta_t = \theta_{t-1} - \eta\nabla\mathcal{L}(\theta_{t-1}; D_c),\ t = 1,\dots,T, \tag{41}$$
where $D_o$ denotes the original dataset and $D_c$ the condensed dataset.
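As an illustration of (41), a minimal PyTorch-style sketch of the bi-level objective $h(D_c)$ is given below; `model_apply` (a functional model taking flat parameters) and the dataset tensors are hypothetical placeholders. Its hypergradient with respect to `x_syn` can then be estimated exactly as in the `fg2u_hypergradient` sketch of Appendix B.1, with the condensed images playing the role of $\phi$.

```python
import torch
import torch.nn.functional as F
from torch.func import grad

def performance_matching_objective(x_syn, y_syn, theta0, model_apply,
                                   x_real, y_real, T=100, lr=0.1):
    """h(D_c) in (41): train theta on the condensed data (x_syn, y_syn) for
    T unrolled SGD steps, then evaluate the loss on real data (x_real, y_real)."""
    inner_loss = lambda th, xs: F.cross_entropy(model_apply(th, xs), y_syn)
    theta = theta0
    for _ in range(T):
        theta = theta - lr * grad(inner_loss)(theta, x_syn)  # unrolled SGD step
    return F.cross_entropy(model_apply(theta, x_real), y_real)
```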
We conduct our experiments to condense the following image datasets:
• MNIST [35]: a handwritten digits dataset containing 60,000 training images and 10,000 testing images of size 28×28 from 10 categories.
• CIFAR10/100 [31]: colored natural image datasets containing 50,000 training images and 10,000 testing images from 10/100 categories, respectively.
The scale of the condensed dataset fundamentally impacts the results. Therefore, we consider different scales for each dataset, with images per class (IPC) set to 1, 10, and 50. The condensed dataset is used to train randomly initialized models and is evaluated on a test dataset.
E.2 Meta Learning Online Adaptation of Language Models
The online adaptation of language models (LMs) [34, 27] has been studied recently to keep the knowledge of LMs up to date. However, naive auto-regressive fine-tuning of the LM with uniform weights for all tokens results in poor performance on downstream tasks, as the default average negative log-likelihood (NLL) loss does not accurately reflect the importance of tokens [25]. To address this issue, [25] proposed Context-aware Meta-learned Loss Scaling (CaMeLS), which meta-learns the weights of tokens for effective online adaptation. More formally, let $\theta$ denote the parameters of the base model being adapted and $\phi$ the parameters of a parametric weight model that assigns a weight to each token; the meta-learned online adaptation of the LM can then be formulated as the following bi-level optimization:
$$\min_\phi\,\mathcal{L}_{\mathrm{meta}}(\theta_T(\phi),\phi)\quad\text{s.t.}\quad \theta_t(\phi) = \theta_{t-1}(\phi) - \eta\nabla_\theta\mathcal{L}_{\mathrm{train}}(\theta_{t-1}, w_\phi),\ t = 1,\dots,T. \tag{42}$$
We follow the setting studied by [25], where the downstream task is question answering. The meta objective consists of a question-answering term measuring the performance gained from adaptation, and a locality term that prevents the updated parameters from excessively changing the base model's behavior. Let $D_{\mathrm{QA}}$ denote the question-answering dataset, $D_{\mathrm{loc}}$ the locality dataset, and $c\in\mathbb{R}_+$ the weight of the locality term; the meta objective is then formally defined as
$$\mathcal{L}_{\mathrm{meta}}(\theta_T(\phi),\phi) := \mathbb{E}_{q,a\sim D_{\mathrm{QA}}}\big[-\log p_{\theta_T}(a\,|\,q)\big] + c\,\mathbb{E}_{x\sim D_{\mathrm{loc}}}\Big[\sum_i \mathrm{KL}\big(p_{\theta_T}(\cdot\,|\,x_{:i})\,\big\|\,p_{\theta_0}(\cdot\,|\,x_{:i})\big)\Big]. \tag{43}$$
The inner objective is defined as a weighted NLL loss, where the weights are determined by the weight model $w_\phi$:
$$\mathcal{L}_{\mathrm{train}}(\theta, w_\phi) := \mathbb{E}_{x\sim D_{\mathrm{train}}}\Big[\sum_i -w_\phi(x_i, x)\log p_\theta(x_i\,|\,x_{:i})\Big]. \tag{44}$$
The trained weight model is then used on unseen online documents to fine-tune the base model, which is evaluated on the corresponding question-answering tasks.
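For illustration, a minimal PyTorch sketch of the per-token weighted NLL in (44) follows, using Hugging-Face-style next-token logits; the tensor names are our own:

```python
import torch
import torch.nn.functional as F

def weighted_nll(logits, input_ids, token_weights):
    """Per-token weighted NLL as in (44).
    logits: (B, T, V) next-token logits; input_ids: (B, T) token ids;
    token_weights: (B, T) weights w_phi(x_i, x) from the weight model."""
    # Shift so that position i predicts token i+1 (causal-LM convention).
    logp = F.log_softmax(logits[:, :-1], dim=-1)
    tgt = input_ids[:, 1:].unsqueeze(-1)
    token_logp = logp.gather(-1, tgt).squeeze(-1)   # log p_theta(x_i | x_{:i})
    return -(token_weights[:, 1:] * token_logp).mean()
```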
E.3 Data-driven Discovery of Partial Differential Equations (PDEs)
Figure E.1: Visualization of the 2D latent solutions for the Burgers, Allen-Cahn, and KdV equations. The observed data are sampled on an 8×8 grid, denoted by white points.
We conducted experiments on three non-linear PDEs, with the latent solutions visualized in Figure E.1. The PDE structures (45)-(47) are assumed to be known, while the PDE parameters $\nu$ in (45)-(47) are assumed to be unknown. For each equation, 64 observed data points are sampled on an 8×8 grid. The objective is to predict the unknown PDE parameters from the observed data. The predicted PDE parameters are evaluated by their error with respect to the ground truth, as well as the error in the corresponding prediction of the latent solution.
E.3.1 Burgers Equation
The nonlinear viscous Burgers equation is a pivotal partial differential equation arising in diverse domains of applied mathematics, such as fluid mechanics, nonlinear acoustics, and traffic flow. It can be deduced from the Navier-Stokes equations for the velocity field by omitting the pressure gradient term. In our experiment, the equation, along with Dirichlet boundary conditions, is expressed as follows:
$$\begin{aligned}
&u_t + u u_x - \nu u_{xx} = 0,\quad x\in[-1,1],\ t\in[0,1],\ \nu>0,\\
&u(0,x) = -\sin(\pi x),\\
&u(t,-1) = u(t,1) = 0,
\end{aligned} \tag{45}$$
with actual viscosity $\nu = \frac{0.01}{\pi}\approx 0.0031831$. For PINN, following [44], we enforced the initial condition in the output by choosing a surrogate model of the solution,
$$\hat u(x) = (1-\exp(-t))\,\mathrm{NN}(x;\theta) - \sin(\pi x),$$
where $\mathrm{NN}(x;\theta)$ is a neural network.
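For illustration, a minimal PyTorch sketch of this hard-constrained surrogate and the corresponding Burgers residual follows; `net` is any MLP mapping $(t,x)$ to a scalar and `nu` is the trainable PDE parameter (both placeholder names of ours):

```python
import torch

def u_hat(net, t, x):
    # Hard-constrained surrogate: satisfies u(0, x) = -sin(pi x) exactly.
    return (1 - torch.exp(-t)) * net(torch.cat([t, x], dim=-1)) \
           - torch.sin(torch.pi * x)

def burgers_residual(net, nu, t, x):
    """PDE residual u_t + u u_x - nu u_xx at collocation points (t, x)."""
    t = t.requires_grad_(True)
    x = x.requires_grad_(True)
    u = u_hat(net, t, x)
    g = lambda y, z: torch.autograd.grad(y, z, torch.ones_like(y),
                                         create_graph=True)[0]
    u_t, u_x = g(u, t), g(u, x)
    u_xx = g(u_x, x)
    return u_t + u * u_x - nu * u_xx   # driven to zero by the PINN loss
```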
E.3.2 Allen-Cahn Equation
The Allen-Cahn equation is a reaction-diffusion equation of mathematical physics describing the process of phase separation in multi-component alloy systems, including order-disorder transitions. In our experiment, it is expressed as follows:
$$\begin{aligned}
&u_t - \nu u_{xx} = 5(u - u^3),\quad x\in[-1,1],\ t\in[0,1],\ \nu>0,\\
&u(0,x) = x^2\cos(\pi x),\\
&u(t,-1) = u(t,1) = -1,
\end{aligned} \tag{46}$$
with the actual diffusion coefficient $\nu = 0.001$. For PINN, we enforced the initial condition in the output by choosing a surrogate model of the solution,
$$\hat u(x) = (1-\exp(-t))\,\mathrm{NN}(x;\theta) + x^2\cos(\pi x),$$
where $\mathrm{NN}(x;\theta)$ is a neural network.
E.3.3 Korteweg–De Vries (KdV) Equation
The Korteweg–de Vries (KdV) equation serves as a mathematical model for waves on shallow water surfaces. It is distinguished as a prototypical example of an integrable PDE, characterized by features typical of integrable systems, including a plethora of explicit solutions, notably soliton solutions, and an infinite number of conserved quantities. These properties are particularly noteworthy given the inherent nonlinearity of the equation, which generally complicates the solvability of PDEs. Specifically, we consider:
$$\begin{aligned}
&u_t + u u_x + \nu u_{xxx} = 0,\quad x\in[-1,1],\ t\in[0,1],\ \nu>0,\\
&u(0,x) = \cos(\pi x),
\end{aligned} \tag{47}$$
with the actual coefficient of dispersion $\nu$ equal to 0.0025. For PINN, we enforced the initial condition in the output by choosing a surrogate model of the solution,
$$\hat u(x) = (1-\exp(-t))\,\mathrm{NN}(x;\theta) + \cos(\pi x),$$
where $\mathrm{NN}(x;\theta)$ is a neural network.
E.3.4 Numerical PDE solver
Numerical solvers play a critical role in the study and application of PDEs, enabling the simulation and analysis of complex physical phenomena that cannot be addressed analytically [51]. These solvers convert PDEs into a form that can be handled computationally, typically by discretizing the domain into a finite set of points or elements and approximating the derivatives. Conventional numerical methods include finite difference methods, finite element methods, and spectral methods [1].
Among the various numerical methods for solving PDEs, spectral methods stand out for their ability to deliver highly accurate solutions, particularly for problems with smooth solutions [21, 7]. Spectral methods involve representing the solution to a PDE as a sum of basis functions, such as trigonometric polynomials, which are globally defined over the domain. This approach contrasts with finite difference and finite element methods, where the solution is localized to grid points or elements. In this paper, we mainly adopt spectral methods, as we focus on the Burgers, Allen-Cahn, and KdV equations, all three of which can be efficiently and accurately resolved by spectral techniques.
F Implementation Details
F.1 Data Condensation
We conducted our experiments following the standard data condensation setting established by [68, 77, 67]. Both condensation and evaluation are performed on a depth-3 convolutional neural network [58]. The hyperparameters used for (FG)2U are summarized in Table F.1. All experiments are conducted on NVIDIA-L40S (40G).

Datasets                  MNIST              CIFAR10            CIFAR100
IPC                       1     10    50     1     10    50     1     10    50
Unrolled Depth            100   100   100    100   100   100    100   100   200
# Random Directions       32    32    32     32    32    32     32    32    32
Inner Batch Size          Full  Full  Full   Full  Full  Full   Full  Full  100
Hessian-Free Pretraining  ✗     ✓     ✓      ✗     ✓     ✓      ✗     ✓     ✓
Gradient Accumulation     16    32    32     32    32    32     64    64    64
Outer Steps               10000 10000 10000  10000 10000 10000  10000 10000 10000
Outer Step Size           1e-2  5e-4  5e-4   1e-2  5e-4  5e-4   1e-2  5e-4  5e-4
Evaluation Steps          1000  10000 10000  1000  10000 10000  1000  10000 10000

Table F.1: (FG)2U hyperparameters for data condensation experiments.
F.2 Meta Learning Online Adaptation of Language Models
We adhered to the standard settings of CaMeLS [25] and adapted their official code for our implementation. The only modification made was replacing the meta gradient approximation module with (FG)2U. It is important to note that the base models used for meta-learning were initially pre-trained on a split QA-paired set. While the official codebase provided the script for pretraining, it did not include the exact base model weights they used. We executed the official script to generate the pre-trained base models and observed that meta-learning performance is sensitive to the choice of base models. For a fair comparison, we report both the results from [66] (where CaMeLS [25] presented performance improvements over baselines using bar plots without specific metric values) and the results with our best custom pre-trained base models. Following the two-phase training paradigm introduced in Section 3.2, we initialized (FG)2U training from the RGU (DistilGPT2, unrolled depth 6) results. The hyperparameters used for (FG)2U are summarized in Table F.2, while all remaining hyperparameters were kept the same as in [25]. All experiments are conducted on one NVIDIA A100 GPU (80G).
Base Model               DistilGPT2   GPT2
Unrolled Depth           24/48        24/48
# Random Directions      12           8
Gradient Accumulation    32           32
Outer Optimizer          Adam         Adam
Outer Step Size          2.5e-6       2.5e-6

Table F.2: (FG)2U hyperparameters for CaMeLS experiments.
F.3 Data-driven Discovery of Partial Differential Equations (PDEs)
The hyperparameters used for this experiment are summarized in Table F.3. All experiments are conducted on NVIDIA-L40S (40G). The PINN architecture is a depth-9, width-20 MLP with tanh activations.
PDEs                            Burgers               Allen-Cahn            KdV
Inner Solver                    PINN      Numerical   PINN      Numerical   PINN      Numerical
Directional Grad. Calculation   FAD       ZO          FAD       ZO          FAD       ZO
# Random Directions             1         1           1         1           1         1
Outer Steps                     5000      5000        5000      5000        5000      5000
Unrolling Depth                 1000      -           1000      -           1000      -
Grid Size (Numerical Method)    -         256×512     -         256×512     -         256×512
Range of Initial ν              (0, 1e1]  (0, 1e1]    (0, 1e-1] (0, 1e-1]   (0, 1e-2] (0, 1e-2]
Outer Optimizer                 Adam      Adam        Adam      Adam        Adam      Adam
Outer Step Size                 1e-2      1e-2        1e-2      1e-2        1e-3      1e-3
Inner Optimizer                 SGD       -           SGD       -           SGD       -
Inner Batch Size                5000      -           5000      -           5000      -
Inner Step Size                 1e-3      -           1e-3      -           1e-3      -
µ for Finite Difference         -         1e-4        -         1e-4        -         1e-4

Table F.3: (FG)2U hyperparameters for the discovery-of-PDEs experiments. ν denotes the unknown PDE parameters.
G Additional Experimental Results
G.1 Meta Learning Online Adaptation of Language Models
Ablation results on the unrolled depth and the base model are summarized in Table G.1 and Table G.2.
Model (# params)     Method          Unrolled    DistilGPT2           GPT2
                                     Steps (#)   EM (↑)    F1 (↑)     EM (↑)    F1 (↑)
DistilGPT2 (82M)     RGU (impl.)     6           2.04      5.53       OOM
                     (FG)2U (ours)   24          2.10      5.59       2.22      6.37
                     (FG)2U (ours)   48          2.10      6.25       2.16      6.32
GPT2-Large (774M)    RGU (impl.)     6           7.02      12.19      OOM
                     (FG)2U (ours)   24          6.91      12.12      7.21      12.50
                     (FG)2U (ours)   48          7.03      12.31      7.27      12.45
GPT2-XL (1.5B)       RGU (impl.)     6           7.93      12.94      OOM
                     (FG)2U (ours)   24          8.34      13.46      8.89      14.42
                     (FG)2U (ours)   48          8.23      13.70      8.65      13.91

Table G.1: StreamingQA: ablation results on the unrolled depth and the base model.
Model (# params)     Method          Unrolled    DistilGPT2           GPT2
                                     Steps (#)   EM (↑)    F1 (↑)     EM (↑)    F1 (↑)
DistilGPT2 (82M)     RGU (impl.)     6           1.52      3.16       OOM
                     (FG)2U (ours)   24          1.72      3.49       1.72      3.50
                     (FG)2U (ours)   48          1.75      3.47       1.73      3.49
GPT2-Large (774M)    RGU (impl.)     6           4.86      8.57       OOM
                     (FG)2U (ours)   24          5.49      8.88       5.56      8.99
                     (FG)2U (ours)   48          5.45      8.90       5.32      8.97
GPT2-XL (1.5B)       RGU (impl.)     6           6.71      9.65       OOM
                     (FG)2U (ours)   24          7.00      10.13      7.27      10.33
                     (FG)2U (ours)   48          7.37      10.37      7.25      10.32

Table G.2: SQuAD: ablation results on the unrolled depth and the base model.
H Limitations and Future Works
The experiments conducted in this paper are of relatively small scale, with the largest inner model being a GPT-2 model. We look forward to validating the effectiveness of (FG)2U on larger-scale bi-level optimization tasks. Additionally, the application of black-box bi-level optimization and the potential of (FG)2U-ZO remain underexplored, considering the prevalent black-box interaction between users and models today. We hope our work will inspire further development of large-scale bi-level optimization algorithms and their application in the corresponding scenarios. Furthermore, we have not specifically addressed the efficiency issues that (FG)2U inherits from the forward gradient method. Enhancing the efficiency of (FG)2U while maintaining its gradient estimation accuracy will be an important direction for future research.