Available via license: CC BY 4.0
Constrained Optimization in the Presence of Noise
Figen Oztoprak∗, Richard Byrd†, Jorge Nocedal‡
October 12, 2021
Abstract
The problem of interest is the minimization of a nonlinear function subject to
nonlinear equality constraints using a sequential quadratic programming (SQP) method.
The minimization must be performed while observing only noisy evaluations of the
objective and constraint functions. In order to obtain stability, the classical SQP
method is modified by relaxing the standard Armijo line search based on the noise
level in the functions, which is assumed to be known. Convergence theory is presented
giving conditions under which the iterates converge to a neighborhood of the solution
characterized by the noise level and the problem conditioning. The analysis assumes
that the SQP algorithm does not require regularization or trust regions. Numerical
experiments indicate that the relaxed line search improves the practical performance of
the method on problems involving uniformly distributed noise. One important application
of this work is in the field of derivative-free optimization, when finite differences are
employed to estimate gradients.
1 Introduction
Let us consider the equality constrained nonlinear optimization problem
$$\min_{x} \; f(x) \quad \text{s.t.} \quad c(x) = 0, \tag{1.1}$$
where $f : \mathbb{R}^n \to \mathbb{R}$ and $c : \mathbb{R}^n \to \mathbb{R}^m$ are smooth functions. We assume that the minimization must be performed while observing approximate evaluations $\tilde f(x), \tilde c(x)$ of the functions $f$, $c$ and their derivatives.
We consider the application of a sequential quadratic programming (SQP) algorithm that employs an $\ell_1$ merit function to control the stepsize. The goal of the paper is to study the effect of noise on the behavior of the SQP algorithm, particularly the achievable accuracy in the solution, and to highlight the aspects of the algorithm that are most susceptible to errors (or noise) and redesign them. This work was motivated by applications in which the derivatives of $f$ and $c$ are approximated by finite differences [15], and thus contain errors, but the algorithm and analysis apply to the more general setting in which stochastic or deterministic noise is present in both the function and derivative evaluations.

∗ Artelys Corporation
† Computer Science Department, University of Colorado, Boulder, USA
‡ Department of Industrial Engineering and Management Sciences, Northwestern University, USA. This author was supported by National Science Foundation grant DMS-2011494, AFOSR grant FA95502110084, and ONR grant N00014-21-1-2675.
arXiv:2110.04355v1 [math.OC] 8 Oct 2021
Let us define
$$g(x) = \nabla f(x), \qquad J(x) = \nabla c(x) \in \mathbb{R}^{m \times n}, \qquad m < n, \tag{1.2}$$
and let $\tilde g(x), \tilde J(x)$ be the corresponding noisy evaluations. The iteration of the SQP algorithm is given by
$$x_{k+1} = x_k + \alpha_k d_k, \tag{1.3}$$
where $d_k$ is the solution of the quadratic subproblem
$$\min_{d \in \mathbb{R}^n} \; \tfrac{1}{2} d^T H_k d + \tilde g_k^T d \tag{1.4}$$
$$\text{s.t.} \quad \tilde c_k + \tilde J_k d = 0, \tag{1.5}$$
and the steplength $\alpha_k > 0$ is chosen so as to ensure sufficient decrease in the merit function
$$\tilde\phi(x) = \tilde f(x) + \pi \|\tilde c(x)\|_1 \tag{1.6}$$
when the iterates are far away from a solution. Here $\pi > 0$ is a penalty parameter that is adjusted during the course of the optimization. The symmetric matrix $H_k$ is generally chosen as an approximation to the Hessian of the Lagrangian. However, in this paper we assume that $H_k$ is a multiple of the identity matrix,
$$H_k = \beta_k I, \qquad \beta_k > 0, \tag{1.7}$$
because allowing more general choices introduces more constants in the analysis without contributing to the main goals of this investigation.
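To make the subproblem (1.4)-(1.5) concrete, the following is a minimal sketch of how the step could be computed when $H_k = \beta_k I$, by solving the KKT system of the quadratic program with NumPy. The function name `sqp_step` and the sample data are illustrative assumptions, not taken from the authors' implementation.

```python
import numpy as np

def sqp_step(g_noisy, c_noisy, J_noisy, beta):
    """Solve min_d 0.5*beta*||d||^2 + g^T d  s.t.  c + J d = 0 via its KKT system."""
    m, n = J_noisy.shape
    # KKT matrix [[beta*I, J^T], [J, 0]]; it is nonsingular when beta > 0
    # and J has full row rank (assumed here).
    K = np.block([[beta * np.eye(n), J_noisy.T],
                  [J_noisy, np.zeros((m, m))]])
    rhs = -np.concatenate([g_noisy, c_noisy])
    sol = np.linalg.solve(K, rhs)
    return sol[:n], sol[n:]  # step d and the multiplier of the subproblem

# Hypothetical data with n = 3 variables and m = 1 constraint.
g = np.array([1.0, -2.0, 0.5])
c = np.array([0.3])
J = np.array([[1.0, 1.0, 1.0]])
beta = 50.0
d, lam = sqp_step(g, c, J, beta)
```

The returned step satisfies the linearized constraint $\tilde c_k + \tilde J_k d_k = 0$ up to rounding error; a dense solve is used only for clarity and would be replaced by a structured factorization for large problems.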
As in the noiseless case, the control of the penalty parameter in (1.6) is of critical importance in the SQP algorithm. The parameter $\pi$ should be chosen so that the SQP direction $d_k$ is a descent direction for $\tilde\phi$ at $x_k$, and it should provide adequate control on the size of $\alpha_k$. The proposed algorithm has the general form of a classical SQP method [7], specialized to the case when $H_k$ is a multiple of the identity matrix, and introduces a modification in the line search designed to handle noise.
We assume throughout that the noise in the function and gradient evaluations is bounded by some constants $\epsilon_f$ and $\epsilon_c$. This is not always the case in practice (e.g. when noise is Gaussian) but it covers many important practical settings, including computational noise [11]. Furthermore, we assume that $\epsilon_f, \epsilon_c$ are known, or can be estimated, and that the algorithm has access to them.
This study was motivated by some practical computations performed by the authors using the knitro software package [5]. They selected a few challenging nonlinear optimization problems involving equality and inequality constraints, injected noise in the objective and constraints, and computed derivatives using noise-aware finite difference formulas; see e.g., Moré and Wild [11] and Berahas et al. [1]. They observed that, for low levels of noise, knitro returned acceptable answers, even though one might suspect the default algorithm to be brittle in this setting. As the noise level was increased, the quality of the solution deteriorated markedly, suggesting that classical optimization methods should be redesigned to handle noise. To guide this investigation, it is essential to develop a convergence theory. In this paper, we focus on the case when noise cannot be diminished, and characterize the accuracy of a noise tolerant optimization algorithm.
As a first step in this investigation, we find it convenient to consider equality constrained
optimization, and study the performance of a sequential quadratic optimization method,
which is a simple method in this setting and must yet confront some important challenges
raised by the presence of noise.
1.1 Contributions of this work
The main contribution of this paper is the development of a convergence theory for a classical sequential quadratic programming (SQP) algorithm for equality constrained optimization in the presence of noise. It is shown that, by introducing a relaxation in the line search procedure while keeping all other components of the SQP method unchanged, the iterates of the algorithm reach an acceptable neighborhood $C_1$ of the solution defined by a stationarity measure for the problem. Furthermore, once the iterates enter $C_1$ they cannot escape a larger neighborhood $C_2$ and must revisit $C_1$ an infinite number of times. The analysis gives a detailed characterization of these neighborhoods in terms of the noise level and problem characteristics. Numerical experiments show that the relaxed line search is, in fact, beneficial in practice.
Our convergence results assume that errors in function and gradients are bounded, and
the analysis is deterministic, yielding somewhat pessimistic bounds. We believe, however,
that the results can be useful in the design of robust constrained optimization methods.
Specifically, our analysis suggests that only slight modifications are needed so that a classical
SQP method is able to handle bounded noise.
1.2 Literature Review
Early work on constrained optimization in the presence of noise is reviewed by Poljak (a.k.a. Polyak) [13]. His study includes penalty, Lagrange, or extended Lagrange functions, and establishes probabilistic convergence theorems provided the steplength is chosen small enough from the start. Hintermueller [10] studies a penalty SQP method in which equality constraints are replaced with upper and lower bounding surrogates. Assuming that the noise level in the function is known, it is shown that in the limit the bounds contain a solution. Schittkowski [14] uses a non-monotone line search to handle errors due to approximate function and derivative evaluations. His algorithm was implemented in the NLPQLP software, which is reported to be successful in practice, but no convergence theory was presented.
The work that is most closely related to this study is [2, 3, 6]. In [3], an SQP method for equality constrained optimization is presented to handle the case when the objective function is stochastic and the constraints are deterministic. The stepsize is obtained by adaptively estimating Lipschitz constants in place of a line search. Conditions for convergence in expectation are established. [2] considers the case when Jacobians can be rank deficient, proposes a step decomposition approach, and presents compelling numerical results. [6] studies an SQP algorithm with an inexact step computation for the same problem setting. These three papers give careful attention to the behavior of the penalty parameter. For example, in [3] the penalty parameter is chosen in a way that provides sufficient descent in the quadratic model of the merit function in the deterministic setting. In the stochastic setting, they employ the stochastic gradient of the objective in the same formulae for updating the penalty parameter, but they can no longer guarantee that the resulting penalty parameter will be large enough and bounded. They prove their convergence results assuming that the penalty parameter is well behaved. Then, they discuss the probability of having small penalty values, and note that the boundedness issue is resolved by making the same assumption as in this paper, namely that noise is always bounded.
Notation. We let $\|\cdot\|$ denote the $\ell_2$ norm, unless otherwise stated. As is the convention, $f_k$ stands for $f(x_k)$, and similarly for other functions. The terms error and noise in the functions are used interchangeably. Since we assume absolute bounds on these quantities, the distinction between them is not important in this study.
2 The Algorithm
Before presenting the algorithm, we introduce some notation. We model the first-order change in the merit function $\phi$ at an iterate $x_k$ as
$$\tilde\ell(x_k; d_k) = \tilde g_k^T d_k + \pi_k \|\tilde c_k + \tilde J_k d_k\|_1 - \pi_k \|\tilde c_k\|_1. \tag{2.1}$$
We also define
$$\hat\lambda_k = (\tilde J_k \tilde J_k^T)^{-1} \tilde J_k \tilde g_k, \tag{2.2}$$
which is the standard least squares multiplier estimate [12, eqn (18.21)], accounting for noisy function evaluations. We assume that $\tilde J_k$ is full rank for all $k$, hence $\hat\lambda_k$ is well defined.
The penalty parameter will be updated using the following classical formula [12, eqn (18.32)]. Given a (fixed) parameter $\tau \in (0,1)$, we set at every iteration
$$\pi_k = \begin{cases} \pi_{k-1} & \text{if } \pi_{k-1} \ge \dfrac{1}{1-\tau} \|\hat\lambda_k\|_\infty, \\[2mm] \dfrac{2}{1-\tau} \|\hat\lambda_k\|_\infty & \text{otherwise.} \end{cases} \tag{2.3}$$
The factor 2 in the second line of (2.3) is introduced so that when $\pi_k$ is increased, it is increased substantially. We will see that this rule ensures that $\pi_k$ is eventually fixed. (In general, SQP methods do not set $H_k = \beta_k I$. In that case, using the least squares multiplier estimate in (2.3) will not lead to a convergent method.)
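The multiplier estimate (2.2) and the update rule (2.3) translate into a few lines of code. The sketch below is one possible rendering; the function names `least_squares_multiplier` and `update_penalty` and the sample data are assumptions made for illustration.

```python
import numpy as np

def least_squares_multiplier(J_noisy, g_noisy):
    # lambda_hat = (J J^T)^{-1} J g, as in (2.2); assumes J has full row rank.
    return np.linalg.solve(J_noisy @ J_noisy.T, J_noisy @ g_noisy)

def update_penalty(pi_prev, lam_hat, tau):
    # Rule (2.3): keep pi if it already dominates ||lambda_hat||_inf / (1 - tau),
    # otherwise jump to twice that threshold.
    threshold = np.linalg.norm(lam_hat, np.inf) / (1.0 - tau)
    return pi_prev if pi_prev >= threshold else 2.0 * threshold

# Hypothetical data: m = 2 constraints, n = 3 variables.
J = np.array([[1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0]])
g = np.array([3.0, 4.0, 5.0])
lam = least_squares_multiplier(J, g)            # J J^T = diag(1, 4), J g = [3, 8]
pi_small = update_penalty(1.0, lam, tau=0.5)    # below the threshold, so increased
pi_large = update_penalty(100.0, lam, tau=0.5)  # above the threshold, so unchanged
```

Because every increase at least doubles the threshold value, a bounded sequence of multiplier estimates can trigger only finitely many increases, which is the mechanism behind the claim that $\pi_k$ is eventually fixed.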
The algorithm for solving problem (1.1), when only noisy evaluations of the functions $\tilde f, \tilde c, \tilde g, \tilde J$ are available, is as follows.
Algorithm 1 Noise Tolerant SQP Algorithm
Input: Initial iterate $x_0$, initial merit parameter $\pi_{-1} > 0$, bounds $\epsilon_f, \epsilon_c$ on the noise (3.1), and constants $\tau, \nu \in (0,1)$.
Set $k \leftarrow 0$
Repeat until a termination test is satisfied:
1: Compute $\beta_k > 0$ and set $H_k = \beta_k I$ in (1.4)
2: Compute $d_k$ by solving (1.4)-(1.5)
3: Compute $\hat\lambda_k$ via (2.2)
4: Update penalty parameter $\pi_k$ by (2.3)
5: Compute $\tilde\ell(x_k; d_k)$ as in (2.1)
6: Set $R = 2(\epsilon_f + \pi_k \epsilon_c)$
7: Choose steplength $\alpha_k > 0$ such that
$$\tilde\phi(x_k + \alpha_k d_k) \le \tilde\phi(x_k) + \nu \alpha_k \tilde\ell(x_k; d_k) + R, \tag{2.4}$$
8: Compute new iterate: $x_{k+1} = x_k + \alpha_k d_k$
9: Set $k \leftarrow k + 1$
The steplength $\alpha_k$ is computed in Step 7 using a backtracking line search. We refer to (2.4) as the relaxed Armijo condition. The term $R$ introduces a margin that facilitates the convergence analysis in the presence of noise, and as discussed in Section 4, is also useful in practice. Note that the line search cannot fail since (2.4) is satisfied for sufficiently small $\alpha_k$, by definition of $R$. In this paper, we assume that the quadratic subproblem (1.4)-(1.5) has a unique solution at every iteration. This is admittedly a strong assumption, but one that helps us focus on the effect of noise without the complicating effects of regularization parameters or trust regions. The study of a practical algorithm that employs those globalization strategies will be the subject of future work.
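The backtracking search for the relaxed Armijo condition (2.4) can be sketched as follows. This is an illustrative version rather than the authors' code: the name `relaxed_backtracking`, the cap `max_halvings`, and the one-dimensional noiseless test function are all assumptions, chosen so that the effect of the margin $R$ is easy to trace by hand.

```python
import numpy as np

def relaxed_backtracking(phi, x, d, ell, R, nu=0.1, max_halvings=60):
    """Halve alpha until phi(x + alpha*d) <= phi(x) + nu*alpha*ell + R, cf. (2.4)."""
    phi0 = phi(x)
    alpha = 1.0
    for _ in range(max_halvings):
        if phi(x + alpha * d) <= phi0 + nu * alpha * ell + R:
            return alpha
        alpha *= 0.5
    return alpha  # safeguard only; not part of the algorithm statement

# Hypothetical stand-in for the merit function: phi(z) = ||z||^2, no noise.
phi = lambda z: float(z @ z)
x = np.array([1.0])
d = np.array([-4.0])           # a descent direction that overshoots the minimizer
ell = float(2.0 * x @ d)       # first-order model change, here -8

alpha_classical = relaxed_backtracking(phi, x, d, ell, R=0.0)   # standard Armijo
alpha_relaxed = relaxed_backtracking(phi, x, d, ell, R=10.0)    # large margin
```

With $R = 0$ the unit step and the half step are rejected and the search returns $\alpha = 0.25$; with the margin $R = 10$ the unit step is accepted immediately. In Algorithm 1 the margin is tied to the noise level through $R = 2(\epsilon_f + \pi_k \epsilon_c)$ rather than chosen arbitrarily as in this toy run.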
3 Global Convergence
In this section we show that the iterates generated by Algorithm 1 converge to a neighborhood
of the solution determined by the noise level and certain characteristics of the problem. We
also show that once the iterates reach this neighborhood they cannot stray away from it
(under normal circumstances). We start by stating the assumptions upon which our analysis
is built.
Assumptions 3.1. The function $f$ has a Lipschitz continuous gradient with constant $L_f$. The functions $\nabla c_i$ are Lipschitz continuous for $i = 1, \ldots, m$, with the corresponding constants held in the vector $L_c$.
We also assume that the error (or noise) in the evaluation of the functions is bounded.
Assumptions 3.2. There exist positive constants $\epsilon_f, \epsilon_c, \epsilon_g, \epsilon_J$ such that for all $x \in \mathbb{R}^n$,
$$|\tilde f(x) - f(x)| \le \epsilon_f, \qquad \|\tilde c(x) - c(x)\|_1 \le \epsilon_c, \tag{3.1}$$
$$\|\tilde g(x) - g(x)\| \le \epsilon_g, \qquad \|\tilde J(x) - J(x)\|_{1,2} \le \epsilon_J. \tag{3.2}$$
Here, $\|\cdot\|$ denotes the Euclidean norm and $\|\cdot\|_{1,2}$ denotes the matrix norm induced by the $\ell_1$ norm on $\mathbb{R}^m$ and the Euclidean norm on $\mathbb{R}^n$.
As already mentioned, we assume that, for all $k$, the matrices $\tilde J_k$ have full rank so that the quadratic problem (1.4)-(1.5) has a unique solution. To state this precisely, we let $\sigma_{\min}(A)$ denote the smallest singular value of a matrix $A$.
Assumptions 3.3. For all $k$, the scalar $\beta_k$ in (1.7) satisfies
$$0 < b_l \le \beta_k \le b_u, \tag{3.3}$$
for some constants $b_l, b_u$, and there is a constant $\gamma > 0$ such that
$$\sigma_{\min}(J_k) \ge \gamma, \quad \text{with } \gamma > \epsilon_J, \quad \forall k. \tag{3.4}$$
Furthermore, the sequences $\{f_k\}$, $\{\|c_k\|\}$, $\{\|g_k\|\}$, $\{\|J_k\|\}$ generated by the algorithm are bounded.
By the matrix inversion lemma [8] and (3.2), if $J_k$ has full rank and $\gamma > \epsilon_J$, then $\tilde J_k$ is also full rank and
$$\|\tilde J_k^T (\tilde J_k \tilde J_k^T)^{-1}\| \le \frac{1}{\gamma - \epsilon_J} \equiv \delta, \quad \forall k. \tag{3.5}$$
The assumption that the sequences $\{f_k\}$, $\{\|c_k\|\}$, $\{\|g_k\|\}$, $\{\|J_k\|\}$ generated by the algorithm are bounded is fairly standard in the literature and is designed to avoid pathological situations. For example, the merit function $\phi$ may be unbounded below away from the solution if $\pi$ is not large enough. Although there are strategies to avoid these situations (see e.g. [12, §18.5]), we do not include them in our algorithm, for simplicity.
Given these three sets of assumptions, we are ready to study the convergence properties of Algorithm 1. Let us apply the well known descent lemma (see e.g. [4]) to the true (noiseless) merit function
$$\phi(x) = f(x) + \pi \|c(x)\|_1. \tag{3.6}$$
We have that for any $(x, d)$
$$\phi(x + \alpha d) \le \phi(x) + \alpha g(x)^T d + \pi \big( \|c(x) + \alpha J(x) d\|_1 - \|c(x)\|_1 \big) + \tfrac{1}{2} \big( L_f + \pi \|L_c\|_1 \big) \alpha^2 \|d\|^2. \tag{3.7}$$
Thus, we can write
$$\phi(x + \alpha d) - \phi(x) \le \ell(x; \alpha d) + \tfrac{1}{2} \big( L_f + \pi \|L_c\|_1 \big) \alpha^2 \|d\|^2, \tag{3.8}$$
where
$$\ell(x; s) = g(x)^T s + \pi \|c(x) + J(x) s\|_1 - \pi \|c(x)\|_1. \tag{3.9}$$
When function and derivatives are exact, it is easy to show that for $\pi$ sufficiently large and $\alpha$ sufficiently small we can guarantee a reduction in $\phi$; see [12]. We must establish that this is also the case in the noisy setting, before the iterates approach the region around the solution where noise dominates. We begin by establishing bounds on the step $d_k$.
3.1 Preliminary results
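The optimality conditions of the quadratic problem (1.4)-(1.5) are given by
$$\begin{pmatrix} H_k & \tilde J_k^T \\ \tilde J_k & 0 \end{pmatrix} \begin{pmatrix} d_k \\ d_y \end{pmatrix} = - \begin{pmatrix} \tilde g_k + \tilde J_k^T y \\ \tilde c_k \end{pmatrix}, \tag{3.10}$$
for some Lagrange multiplier $y \in \mathbb{R}^m$. The step $d_k$ can be written as the sum of two orthogonal components,
$$d_k = v_k + u_k, \tag{3.11}$$
where $v_k$ is in the range space of $\tilde J_k^T$ and $u_k$ is in the null space of $\tilde J_k$. A simple computation from (3.10) shows that
$$v_k = -\tilde J_k^T (\tilde J_k \tilde J_k^T)^{-1} \tilde c_k, \qquad u_k = -\frac{1}{\beta_k} \tilde P_k \tilde g_k, \tag{3.12}$$
where
$$\tilde P_k = I - \tilde J_k^T (\tilde J_k \tilde J_k^T)^{-1} \tilde J_k \tag{3.13}$$
is an orthogonal projection matrix onto the tangent space of the constraints. We now establish bounds on $u_k, v_k$. In what follows, we let $J^\dagger$ denote the Moore-Penrose generalized inverse of a matrix $J$, and define $P_k = I - J_k^T (J_k J_k^T)^{-1} J_k$. Since $\tilde P_k$ and $P_k$ are orthogonal projections, we have that $\|\tilde P_k\| = \|P_k\| = 1$.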
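The closed-form components in (3.12)-(3.13) can be checked numerically. The sketch below computes $v_k$, $u_k$ and the projector for hypothetical data; the helper name `step_components` and the numbers are assumptions introduced only for illustration.

```python
import numpy as np

def step_components(g_noisy, c_noisy, J_noisy, beta):
    """Return (v, u, P) from (3.12)-(3.13), assuming J has full row rank."""
    n = J_noisy.shape[1]
    JJt = J_noisy @ J_noisy.T
    # Normal (range-space) component: restores the linearized constraint.
    v = -J_noisy.T @ np.linalg.solve(JJt, c_noisy)
    # Orthogonal projector onto the null space of J.
    P = np.eye(n) - J_noisy.T @ np.linalg.solve(JJt, J_noisy)
    # Tangential component: scaled projected steepest descent.
    u = -(P @ g_noisy) / beta
    return v, u, P

# Hypothetical data: n = 3, m = 2.
g = np.array([1.0, -2.0, 0.5])
c = np.array([0.3, -0.1])
J = np.array([[1.0, 1.0, 1.0],
              [1.0, -1.0, 0.0]])
beta = 50.0
v, u, P = step_components(g, c, J, beta)
d = v + u
```

For these data one can confirm that $v^T u = 0$, that $\tilde c + \tilde J d = 0$, and that the spectral norm of the projector equals one, in agreement with the statements above.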
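Lemma 3.4. Under Assumptions 3.1 and 3.2 we have both
$$\|v_k\|_1 \le \delta \|\tilde c_k\|_1 \le \delta (\|c_k\|_1 + \epsilon_c) \tag{3.14}$$
$$\|u_k\| \le \frac{1}{\beta_k} \big( \|P_k g_k\| + \|g_k\| \eta \epsilon_J + \epsilon_g \big), \tag{3.15}$$
where $\delta$ is defined in (3.5) and
$$\eta = 1/\gamma. \tag{3.16}$$
Therefore,
$$\|d_k\| \le \delta (\|c_k\|_1 + \epsilon_c) + \frac{1}{\beta_k} \big( \|P_k g_k\| + \|g_k\| \eta \epsilon_J + \epsilon_g \big). \tag{3.17}$$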
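Proof. The bounds (3.14) follow directly from (3.12), (3.5), and (3.1). By (3.2), we can bound the norm of the tangential component as follows
$$\begin{aligned}
\|u_k\| &= \frac{1}{\beta_k} \|\tilde P_k \tilde g_k\| \\
&\le \frac{1}{\beta_k} \big( \|P_k g_k\| + \|(\tilde P_k - P_k) g_k\| + \|\tilde P_k\| \|g_k - \tilde g_k\| \big) \\
&\le \frac{1}{\beta_k} \big( \|P_k g_k\| + \|\tilde P_k - P_k\| \|g_k\| + \epsilon_g \big).
\end{aligned} \tag{3.18}$$
Moreover, by the bounds on perturbed projection matrices [16, Theorems 2.3 and 2.4] we have that
$$\|\tilde P_k - P_k\| \le \frac{\epsilon_J}{\gamma} \equiv \eta \epsilon_J. \tag{3.19}$$
This yields (3.15).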
3.2 Penalty Parameter and Model Decrease
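We note from (3.8) that in order to obtain a decrease in the true merit function $\phi$, we must ensure that $\ell(x_k; \alpha_k d_k)$ is negative. We will see that this can be achieved for $\alpha_k = 1$ by choosing a sufficiently large penalty parameter $\pi$, and provided noise does not dominate.
Lemma 3.5. If at every iteration $k$ the penalty parameter satisfies
$$\pi_k \ge \frac{1}{1-\tau} \|(\tilde J_k \tilde J_k^T)^{-1} \tilde J_k \tilde g_k\|_\infty, \qquad \tau \in (0,1), \tag{3.20}$$
then
$$\begin{aligned}
\ell(x_k; d_k) \le{} & -\frac{1}{\beta_k} g_k^T P_k g_k + \frac{1}{\beta_k} \big( \|g_k\|^2 \eta \epsilon_J + \epsilon_g \|g_k\| \big) - \tau \pi_k \|c_k\|_1 + \epsilon_g \delta (\|c_k\|_1 + \epsilon_c) \\
& + \pi_k \Big[ (2 - \tau) \epsilon_c + \epsilon_J \Big( \delta (\|c_k\|_1 + \epsilon_c) + \frac{1}{\beta_k} \big( \|P_k g_k\| + \|g_k\| \eta \epsilon_J + \epsilon_g \big) \Big) \Big].
\end{aligned} \tag{3.21}$$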
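Proof. Since
$$d_k = -\frac{1}{\beta_k} \tilde P_k \tilde g_k - \tilde J_k^T (\tilde J_k \tilde J_k^T)^{-1} \tilde c_k,$$
we have from (3.9), (1.5), (3.5), (3.1), (3.2), and the definition of the $\|\cdot\|_{1,2}$ norm in (3.2), that
$$\begin{aligned}
\ell(x_k; d_k) ={} & g_k^T d_k + \pi_k \|c_k + J_k d_k\|_1 - \pi_k \|c_k\|_1 \\
\le{} & -\frac{1}{\beta_k} g_k^T \tilde P_k \tilde g_k - g_k^T \tilde J_k^T (\tilde J_k \tilde J_k^T)^{-1} \tilde c_k + \pi_k \|c_k + J_k d_k\|_1 - \pi_k \|c_k\|_1 \\
\le{} & -\frac{1}{\beta_k} g_k^T \tilde P_k \tilde g_k - g_k^T \tilde J_k^T (\tilde J_k \tilde J_k^T)^{-1} \tilde c_k + \pi_k \|(c_k - \tilde c_k) + (J_k - \tilde J_k) d_k\|_1 - \pi_k \|c_k\|_1 \\
\le{} & -\frac{1}{\beta_k} g_k^T \tilde P_k \tilde g_k - \tilde g_k^T \tilde J_k^T (\tilde J_k \tilde J_k^T)^{-1} \tilde c_k + \epsilon_g \delta \|\tilde c_k\|_1 + \pi_k (\epsilon_c + \epsilon_J \|d_k\|) - \pi_k \|c_k\|_1 \\
\le{} & -\frac{1}{\beta_k} g_k^T \tilde P_k \tilde g_k - \tilde g_k^T \tilde J_k^T (\tilde J_k \tilde J_k^T)^{-1} \tilde c_k + \epsilon_g \delta (\|c_k\|_1 + \epsilon_c) \\
& + \pi_k \Big[ \epsilon_c + \epsilon_J \Big( \delta (\|c_k\|_1 + \epsilon_c) + \frac{1}{\beta_k} \big( \|P_k g_k\| + \|g_k\| \eta \epsilon_J + \epsilon_g \big) \Big) \Big] - \pi_k \|c_k\|_1,
\end{aligned} \tag{3.22}$$
the last line following by (3.17). Next, since $\|\tilde P_k\| = 1$ and recalling (3.19), we obtain
$$\begin{aligned}
-g_k^T \tilde P_k \tilde g_k &\le -g_k^T P_k g_k + \|g_k\| \|P_k g_k - \tilde P_k \tilde g_k\| \\
&\le -g_k^T P_k g_k + \|g_k\| \|P_k g_k - \tilde P_k g_k + \tilde P_k g_k - \tilde P_k \tilde g_k\| \\
&\le -g_k^T P_k g_k + \|g_k\|^2 \|P_k - \tilde P_k\| + \|g_k\| \|g_k - \tilde g_k\| \\
&\le -g_k^T P_k g_k + \|g_k\|^2 \eta \epsilon_J + \|g_k\| \epsilon_g.
\end{aligned}$$
Therefore,
$$\begin{aligned}
\ell(x_k; d_k) \le{} & -\frac{1}{\beta_k} g_k^T P_k g_k + \frac{1}{\beta_k} \big( \|g_k\|^2 \eta \epsilon_J + \epsilon_g \|g_k\| \big) - \tilde g_k^T \tilde J_k^T (\tilde J_k \tilde J_k^T)^{-1} \tilde c_k + \epsilon_g \delta (\|c_k\|_1 + \epsilon_c) \\
& + \pi_k \Big[ \epsilon_c + \epsilon_J \Big( \delta (\|c_k\|_1 + \epsilon_c) + \frac{1}{\beta_k} \big( \|P_k g_k\| + \|g_k\| \eta \epsilon_J + \epsilon_g \big) \Big) \Big] - \pi_k \|c_k\|_1.
\end{aligned}$$
Now suppose that we choose the parameter $\pi_k$ so that (3.20) holds. Then
$$-\tilde g_k^T \tilde J_k^T (\tilde J_k \tilde J_k^T)^{-1} \tilde c_k \le \|\tilde g_k^T \tilde J_k^T (\tilde J_k \tilde J_k^T)^{-1}\|_\infty \|\tilde c_k\|_1 \le (1 - \tau) \pi_k (\|c_k\|_1 + \epsilon_c),$$
and it follows that
$$\begin{aligned}
\ell(x_k; d_k) \le{} & -\frac{1}{\beta_k} g_k^T P_k g_k + \frac{1}{\beta_k} \big( \|g_k\|^2 \eta \epsilon_J + \epsilon_g \|g_k\| \big) - \tau \pi_k \|c_k\|_1 + \epsilon_g \delta (\|c_k\|_1 + \epsilon_c) \\
& + \pi_k \Big[ (2 - \tau) \epsilon_c + \epsilon_J \Big( \delta (\|c_k\|_1 + \epsilon_c) + \frac{1}{\beta_k} \big( \|P_k g_k\| + \|g_k\| \eta \epsilon_J + \epsilon_g \big) \Big) \Big].
\end{aligned}$$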
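Lemma 3.5 implies that for any $x_k$ such that the right hand side of (3.21) is negative, we have $\ell(x_k; d_k) < 0$. We now provide conditions under which the decrease in $\ell$ is proportional to the optimality conditions of the nonlinear problem (1.1). Specifically, since $g_k^T P_k g_k = \|P_k g_k\|^2$ is the norm squared of the projected gradient, a combination of $g_k^T P_k g_k$ and $\|c_k\|_1$ can be regarded as a measure of stationarity of the constrained optimization problem. The following result assumes that the optimality measure is not small compared to the errors (or noise).
Corollary 3.6. Choose any $\theta_1 \in [0, 1)$. For any $x_k$ sufficiently far from the solution such that
$$(1 - \theta_1) \Big( \frac{1}{\beta_k} g_k^T P_k g_k + \tau \pi_k \|c_k\|_1 \Big) \ge E(x_k, \beta_k, \pi_k), \tag{3.23}$$
where
$$\begin{aligned}
E(x, \beta, \pi) ={} & \frac{1}{\beta} \big( \|g(x)\|^2 \eta \epsilon_J + \epsilon_g \|g(x)\| \big) + \epsilon_g \delta (\|c(x)\|_1 + \epsilon_c) \\
& + \pi \Big[ (2 - \tau) \epsilon_c + \epsilon_J \Big( \delta (\|c(x)\|_1 + \epsilon_c) + \frac{1}{\beta} \big( \|P(x) g(x)\| + \|g(x)\| \eta \epsilon_J + \epsilon_g \big) \Big) \Big],
\end{aligned} \tag{3.24}$$
we have
$$\ell(x_k; d_k) \le -\theta_1 \Big( \frac{1}{\beta_k} g_k^T P_k g_k + \tau \pi_k \|c_k\|_1 \Big). \tag{3.25}$$
Proof. For any $\theta_1 \in [0, 1)$, we can rewrite (3.21) as
$$\begin{aligned}
\ell(x_k; d_k) \le{} & -\theta_1 \Big( \frac{1}{\beta_k} g_k^T P_k g_k + \tau \pi_k \|c_k\|_1 \Big) - (1 - \theta_1) \Big( \frac{1}{\beta_k} g_k^T P_k g_k + \tau \pi_k \|c_k\|_1 \Big) \\
& + \frac{1}{\beta_k} \big( \|g_k\|^2 \eta \epsilon_J + \epsilon_g \|g_k\| \big) + \epsilon_g \delta (\|c_k\|_1 + \epsilon_c) \\
& + \pi_k \Big[ (2 - \tau) \epsilon_c + \epsilon_J \Big( \delta (\|c_k\|_1 + \epsilon_c) + \frac{1}{\beta_k} \big( \|P_k g_k\| + \|g_k\| \eta \epsilon_J + \epsilon_g \big) \Big) \Big],
\end{aligned}$$
from which (3.25) follows by condition (3.23).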
In order to make this result, and similar results to be proved later, more understandable and more convenient to use, we recall that $g(x)^T P(x) g(x) = \|P(x) g(x)\|^2$, and define the function
$$\psi_\pi(x) = \frac{1}{b_u} \|P(x) g(x)\|^2 + \pi \tau \|c(x)\|_1, \tag{3.26}$$
where $b_u$ is given in (3.3). Clearly, $\psi_\pi$ may be viewed as a measure of non-stationarity since $\psi_\pi(x^*) = 0$ when $x^*$ is a stationary point of the problem (1.1). Given this notation we can restate a slightly weaker version of Corollary 3.6.
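As an illustration of the measure (3.26), the sketch below evaluates $\psi_\pi$ from exact (noiseless) quantities for a small toy problem. The problem, the parameter values, and the function name `stationarity_measure` are assumptions chosen for illustration; they are not one of the test problems of Section 4.

```python
import numpy as np

def stationarity_measure(g, c, J, pi, tau, b_u):
    """psi_pi(x) = ||P g||^2 / b_u + pi * tau * ||c||_1, cf. (3.26)."""
    n = J.shape[1]
    # Orthogonal projector onto the null space of J (J assumed full row rank).
    P = np.eye(n) - J.T @ np.linalg.solve(J @ J.T, J)
    Pg = P @ g
    return float(Pg @ Pg) / b_u + pi * tau * np.linalg.norm(c, 1)

# Toy problem: min x1^2 + x2^2  s.t.  x1 + x2 - 1 = 0, with solution (0.5, 0.5).
grad = lambda x: 2.0 * x
cons = lambda x: np.array([x[0] + x[1] - 1.0])
jac = lambda x: np.array([[1.0, 1.0]])

x_star = np.array([0.5, 0.5])
x_feas = np.array([1.0, 0.0])   # feasible but not stationary
psi_star = stationarity_measure(grad(x_star), cons(x_star), jac(x_star),
                                pi=10.0, tau=0.9, b_u=50.0)
psi_feas = stationarity_measure(grad(x_feas), cons(x_feas), jac(x_feas),
                                pi=10.0, tau=0.9, b_u=50.0)
```

At the solution the projected gradient and the constraint violation both vanish, so the measure is zero; at the feasible point $(1, 0)$ the projected gradient is $(1, -1)$ and the measure equals $2 / b_u$.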
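Corollary 3.7. Choose any $\theta_1 \in [0, 1)$. For any $x_k$ sufficiently far from the solution such that
$$\psi_{\pi_k}(x_k) \ge E(x_k, \beta_k, \pi_k) / (1 - \theta_1), \tag{3.27}$$
we have
$$\ell(x_k; d_k) \le -\theta_1 \Big( \frac{1}{\beta_k} g_k^T P_k g_k + \tau \pi_k \|c_k\|_1 \Big) \le -\theta_1 \psi_{\pi_k}(x_k). \tag{3.28}$$
Proof. The result follows from the fact that
$$\psi_{\pi_k}(x_k) \le \frac{1}{\beta_k} g_k^T P_k g_k + \tau \pi_k \|c_k\|_1.$$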
3.3 Line search
Since $\pi_k$ is defined by (2.3) and (2.2), and by Assumptions 3.3, we have that $\{\|\hat\lambda_k\|\}$ is bounded. Moreover, since $\{\pi_k\}$ is monotone and since $\pi_k - \pi_{k-1}$ is either zero or greater than $\pi_{k-1}$, there exist values $k_0$ and $\bar\pi$ such that
$$\pi_k = \bar\pi, \quad \forall k \ge k_0, \tag{3.29}$$
and (3.20) is satisfied. The rest of the analysis assumes that the penalty parameter has attained that fixed value $\bar\pi$. Thus, for the rest of the section
$$\tilde\phi(x_k) \equiv \tilde f(x_k) + \bar\pi \|\tilde c(x_k)\|_1, \qquad \phi(x_k) \equiv f(x_k) + \bar\pi \|c(x_k)\|_1, \quad \forall k \ge k_0. \tag{3.30}$$
Algorithm 1 sets $x_{k+1} = x_k + \alpha_k d_k$, where $\alpha_k$ is chosen by repeated halving until the relaxed Armijo condition is satisfied:
$$\tilde\phi(x_k + \alpha_k d_k) \le \tilde\phi(x_k) + \nu \alpha_k \tilde\ell(x_k; d_k) + R,$$
for some constants $\nu \in (0, 1)$ and $R \ge 2(\epsilon_f + \bar\pi \epsilon_c)$, where $\tilde\ell(x_k; d_k)$ is defined in (2.1). In other words, we require that the decrease in the noisy merit function be a fraction of the decrease of the noisy first-order model $\tilde\ell$, plus a relaxation term.
To ensure that the line search yields significant progress toward a solution, we need to show that $\alpha_k$ is bounded away from zero and that $\tilde\ell(x_k; d_k)$ is sufficiently negative. To do so, we recall that we have established in (3.28) that the noiseless first-order model $\ell(x_k; d_k)$ is sufficiently negative when condition (3.27) is satisfied. To relate $\ell(x_k; d_k)$ to $\tilde\ell(x_k; d_k)$, we recall (3.9) and (2.1), and measure the difference between these two quantities. By (3.17)
$$\begin{aligned}
|\tilde\ell(x_k; d_k) - \ell(x_k; d_k)| &\le \epsilon_g \|d_k\| + 2 \bar\pi \epsilon_c + \bar\pi \epsilon_J \|d_k\| \\
&\le (\epsilon_g + \bar\pi \epsilon_J) \Big[ \delta (\|c_k\|_1 + \epsilon_c) + \frac{1}{\beta_k} \big( \|P_k g_k\| + \|g_k\| \eta \epsilon_J + \epsilon_g \big) \Big] + 2 \bar\pi \epsilon_c
\end{aligned} \tag{3.31}$$
$$\le (\epsilon_g + \bar\pi \epsilon_J) \Big[ \delta (C_c + \epsilon_c) + \frac{1}{b_l} \big( C_g + C_g \eta \epsilon_J + \epsilon_g \big) \Big] + 2 \bar\pi \epsilon_c \equiv \epsilon_\ell, \tag{3.32}$$
where $C_g, C_c$ are constants such that
$$\|g(x_k)\| \le C_g, \qquad \|c(x_k)\|_1 \le C_c, \quad \forall k > k_0. \tag{3.33}$$
We know that these constants exist because of Assumptions 3.3. We now describe conditions under which one can characterize the size of the steplength $\alpha_k$. Let
$$L = L_f + \bar\pi \|L_c\|_1, \tag{3.34}$$
where $L_f, L_c$ are defined in Assumptions 3.1.
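Theorem 3.8. Let $\theta_1$ be defined as in Corollary 3.6, choose constants $\theta_2 < \theta_1$, $\nu \in (0, 1)$ and
$$R \ge 2(\epsilon_f + \bar\pi \epsilon_c) \equiv 2 \epsilon_\phi. \tag{3.35}$$
Then, for all iterates $x_k$ with $k \ge k_0$ that satisfy both (3.27) and
$$(1 - \nu)(\theta_1 - \theta_2) \Big( \frac{1}{\beta_k} \|P_k g_k\|^2 + \bar\pi \tau \|c_k\|_1 \Big) > 2 \nu \epsilon_\ell, \tag{3.36}$$
if the steplength satisfies
$$\alpha_k < \frac{(1 - \nu) \theta_2 \Big( \dfrac{1}{\beta_k} \|P_k g_k\|^2 + \bar\pi \tau \|c_k\|_1 \Big)}{\dfrac{L}{2} \Big[ \delta^2 (\|c_k\|_1 + \epsilon_c)^2 + \dfrac{1}{\beta_k^2} \big( \|P_k g_k\| + \|g_k\| \eta \epsilon_J + \epsilon_g \big)^2 \Big]} \equiv \hat\alpha_k, \tag{3.37}$$
then
$$\tilde\phi(x_k + \alpha_k d_k) \le \tilde\phi(x_k) + \nu \alpha_k \tilde\ell(x_k; d_k) + R. \tag{3.38}$$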
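Proof. By the definition (3.35) of $\epsilon_\phi$, (3.8), (3.34), the convexity of $\ell(x_k; \cdot)$, (3.32), (3.28), the fact that $P_k^2 = P_k$, and (3.17), we get
$$\begin{aligned}
\tilde\phi(x_k + \alpha d_k) - \tilde\phi(x_k) \le{} & \phi(x_k + \alpha d_k) - \phi(x_k) + 2 \epsilon_\phi \\
\le{} & \ell(x_k; \alpha d_k) + 2 \epsilon_\phi + \frac{L}{2} \alpha^2 \|d_k\|^2 \\
\le{} & \alpha \ell(x_k; d_k) + 2 \epsilon_\phi + \frac{L}{2} \alpha^2 \|d_k\|^2 \\
={} & \nu \alpha \ell(x_k; d_k) + 2 \epsilon_\phi + (1 - \nu) \alpha \ell(x_k; d_k) + \frac{L}{2} \alpha^2 \|d_k\|^2 \\
\le{} & \nu \alpha \tilde\ell(x_k; d_k) + \nu \alpha \epsilon_\ell + 2 \epsilon_\phi + (1 - \nu) \alpha \ell(x_k; d_k) + \frac{L}{2} \alpha^2 \|d_k\|^2 \\
\le{} & \nu \alpha \tilde\ell(x_k; d_k) + 2 \epsilon_\phi + 2 \nu \alpha \epsilon_\ell - (1 - \nu) \theta_1 \alpha \Big( \frac{1}{\beta_k} g_k^T P_k g_k + \bar\pi \tau \|c_k\|_1 \Big) + \frac{L}{2} \alpha^2 \|d_k\|^2 \\
\le{} & \nu \alpha \tilde\ell(x_k; d_k) + 2 \epsilon_\phi + 2 \nu \alpha \epsilon_\ell - (1 - \nu) \theta_1 \alpha \Big( \frac{1}{\beta_k} \|P_k g_k\|^2 + \bar\pi \tau \|c_k\|_1 \Big) \\
& + \frac{L}{2} \alpha^2 \Big[ \delta^2 (\|c_k\|_1 + \epsilon_c)^2 + \frac{1}{\beta_k^2} \big( \|P_k g_k\| + \|g_k\| \eta \epsilon_J + \epsilon_g \big)^2 \Big],
\end{aligned}$$
the last line following from the orthogonality of the components (3.11) of $d_k$.
Now we choose a constant $\theta_2 < \theta_1$, and consider iterates $x_k$ such that (3.36) holds. For such iterates we have
$$\begin{aligned}
\tilde\phi(x_k + \alpha d_k) - \tilde\phi(x_k) \le{} & \nu \alpha \tilde\ell(x_k; d_k) + 2 \epsilon_\phi - (1 - \nu) \theta_2 \alpha \Big( \frac{1}{\beta_k} \|P_k g_k\|^2 + \bar\pi \tau \|c_k\|_1 \Big) \\
& + \frac{L}{2} \alpha^2 \Big[ \delta^2 (\|c_k\|_1 + \epsilon_c)^2 + \frac{1}{\beta_k^2} \big( \|P_k g_k\| + \|g_k\| \eta \epsilon_J + \epsilon_g \big)^2 \Big].
\end{aligned}$$
Then, for any steplength satisfying (3.37), where $x_k$ satisfies (3.23) and (3.36), we have
$$\tilde\phi(x_k + \alpha d_k) - \tilde\phi(x_k) \le \nu \alpha \tilde\ell(x_k; d_k) + 2 \epsilon_\phi,$$
and thus (3.38) holds since $R \ge 2 \epsilon_\phi$.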
Note that condition (3.36) is implied by the slightly stronger inequality
$$(1 - \nu)(\theta_1 - \theta_2) \psi_{\bar\pi}(x_k) > 2 \nu \epsilon_\ell. \tag{3.39}$$
Since the numerator in (3.37) is bounded away from zero by (3.36), and the denominator is bounded above given the assumed global upper bounds on $c_k, g_k$, and lower bound on $\beta_k$ stated in Assumptions 3.3, it follows that there is a constant $\bar\alpha$ such that $\hat\alpha_k > 2 \bar\alpha$ for all $k \ge k_0$. The algorithm employs a backtracking line search that halves each trial step, hence we can conclude that
$$\bar\alpha \le \alpha_k, \quad \text{for } k \ge k_0. \tag{3.40}$$
This will allow us to show that, when the conditions in Theorem 3.8 are satisfied, the algorithm will make non-negligible progress.
3.4 The Main Convergence Result
Now we show that Algorithm 1 will eventually generate iterates close to a stationary point of the problem, as measured by the function $\psi_{\bar\pi}(x)$ defined in (3.26). To do so, we note that condition (3.27) implies that the linear model decrease $\ell$ is sufficiently negative in the sense of (3.28), and we have established a bound in (3.32) for the distance between $\ell$ and $\tilde\ell$. Furthermore, we have shown that condition (3.39) ensures that the relaxed Armijo condition (3.38) is satisfied for steplengths $\alpha_k$ that are bounded away from zero. Those two conditions, (3.27) and (3.39), are necessary to ensure that the algorithm makes significant progress, but they are not sufficient. To control the effect of noise in testing (2.4) as well as the effect of the relaxation factor, we impose one additional condition,
$$\psi_{\bar\pi}(x_k) \ge \frac{2R + 4 \epsilon_\phi}{\nu \bar\alpha \theta_2}, \tag{3.41}$$
to help define the region where Algorithm 1 progresses toward stationarity.
One more refinement is needed. The definition of the term $E(x_k, \beta_k, \pi_k)$ given in (3.24) involves $c(x_k)$ and $g(x_k)$, which makes the region defined by (3.27) difficult to interpret. Therefore, we compute an upper bound for $E$. If we define
$$\begin{aligned}
\mathcal{E} ={} & \frac{1}{b_l} \big( C_g^2 \eta \epsilon_J + \epsilon_g C_g \big) + \epsilon_g \delta (C_c + \epsilon_c) \\
& + \bar\pi \Big[ (2 - \tau) \epsilon_c + \epsilon_J \Big( \delta (C_c + \epsilon_c) + \frac{1}{b_l} \big( C_g + C_g \eta \epsilon_J + \epsilon_g \big) \Big) \Big],
\end{aligned} \tag{3.42}$$
where $C_g, C_c$ are given in (3.33), then we have that $E(x_k, \beta_k, \bar\pi) \le \mathcal{E}$ for all $k \ge k_0$. We can thus state a condition that implies (3.27):
$$\psi_{\bar\pi}(x_k) \ge \frac{\mathcal{E}}{1 - \theta_1}, \quad \forall k \ge k_0. \tag{3.43}$$
In summary, the analysis presented above holds if conditions (3.43), (3.39) are satisfied and we also impose condition (3.41). This allows us to characterize a region, which we denote by $C_1$, where errors dominate and improvement in the merit function $\phi$ cannot be guaranteed. In other words, $C_1$ is the region where at least one of the three conditions (3.43), (3.39), (3.41) is not satisfied.
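Definition 3.9. The critical region $C_1$ is defined as the set of $x \in \mathbb{R}^n$ satisfying
$$\psi_{\bar\pi}(x) \le \max \left\{ \frac{\mathcal{E}}{1 - \theta_1}, \; \frac{2 \nu \epsilon_\ell}{(1 - \nu)(\theta_1 - \theta_2)}, \; \frac{2R + 4 \epsilon_\phi}{\nu \bar\alpha \theta_2} \right\}, \tag{3.44}$$
where $\mathcal{E}$ and $\epsilon_\ell$ are defined by (3.42) and (3.32), respectively, and $\theta_1, \theta_2$ are constants such that $0 < \theta_2 < \theta_1 < 1$.
We also define the following set.
Definition 3.10. Let $w = \sup \{\phi(x) : x \in C_1\}$, and define the level set
$$C_2 = \{x : \phi(x) \le w + 2 \epsilon_\phi + R\}.$$
Note that by construction $C_1 \subseteq C_2$. We are now ready to state the main convergence result.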
Theorem 3.11. Suppose that Algorithm 1 generates a sequence $\{x_k\}$ from $x_0$ satisfying Assumptions 3.1-3.3. There is an iteration $k_1$ at which $\{x_k\}$ enters the critical region $C_1$, and for all $k > k_1$ the iterates remain in the critical level set $C_2$. The iterates may leave $C_1$, but there must be infinitely many iterates in $C_1$.
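Proof. Recall that the index $k_0$ is defined in (3.29). If $x_k \notin C_1$ and $k \ge k_0$, then the assumptions of Theorem 3.8 are satisfied and (3.38) holds. Therefore, by (3.40), (3.32), (3.25), (3.28), (3.26), (3.36)
$$\begin{aligned}
\phi(x_k + \alpha_k d_k) - \phi(x_k) &\le \tilde\phi(x_k + \alpha_k d_k) - \tilde\phi(x_k) + 2 \epsilon_\phi && (3.45) \\
&\le \nu \bar\alpha \tilde\ell(x_k; d_k) + 2 \epsilon_\phi + R && (3.46) \\
&\le \nu \bar\alpha \ell(x_k; d_k) + \nu \bar\alpha \epsilon_\ell + 2 \epsilon_\phi + R \\
&\le -\nu \bar\alpha \theta_1 \Big( \frac{1}{\beta_k} g_k^T P_k g_k + \tau \bar\pi \|c_k\|_1 \Big) + \nu \bar\alpha \epsilon_\ell + 2 \epsilon_\phi + R \\
&\le -\nu \bar\alpha \theta_1 \psi_{\bar\pi}(x_k) + \nu \bar\alpha \epsilon_\ell + 2 \epsilon_\phi + R \\
&= -\big[ \nu \bar\alpha \theta_2 + \bar\alpha \nu (\theta_1 - \theta_2) \big] \psi_{\bar\pi}(x_k) + \nu \bar\alpha \epsilon_\ell + 2 \epsilon_\phi + R \\
&\le -\nu \bar\alpha \theta_2 \psi_{\bar\pi}(x_k) + 2 \epsilon_\phi + R. && (3.47)
\end{aligned}$$
Combining this bound with (3.41), we have that if $x_k \notin C_1$ then
$$\phi(x_{k+1}) - \phi(x_k) \le -\frac{\nu \bar\alpha \theta_2}{2} \psi_{\bar\pi}(x_k). \tag{3.48}$$
Since the sequence $\{\phi(x_k)\}$ is bounded below by Assumptions 3.3, the iterates cannot remain outside $C_1$ indefinitely: otherwise (3.48) would force $\psi_{\bar\pi}(x_k)$ to converge to zero. It follows that Algorithm 1 eventually generates an iterate in $C_1$.
Now if $x_k \in C_1$, then by Step 6 in Algorithm 1,
$$\phi(x_{k+1}) \le \phi(x_k) + 2 \epsilon_\phi + R \le w + 2 \epsilon_\phi + R,$$
so that $x_{k+1} \in C_2$.
On the other hand, if $x_k \in C_2 \setminus C_1$, then by (3.48)
$$\phi(x_{k+1}) - \phi(x_k) \le 0,$$
which implies $x_{k+1} \in C_2$. Thus the rest of the sequence lies in $C_2$, with infinitely many iterates in $C_1$.
We should note that since we are not assuming that the objective function is strongly convex or satisfies a quadratic growth condition, it is possible that the supremum in Definition 3.10 is $w = \infty$. This is, however, an unlikely scenario.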
3.5 Discussion
Let us take a closer look at the main result of this paper, Theorem 3.11, since the critical region $C_1$ defined in (3.44) is complex.
By the definitions (3.42) and (3.32), we have that $\mathcal{E}$ and $\epsilon_\ell$ are both of order $O(\epsilon_c, \epsilon_g, \epsilon_J)$, and so is the right hand side in (3.44). This is as desired. The constants in these orders of magnitude matter, so we must characterize them.
First note that the critical region $C_1$, the set $C_2$ and $\bar\pi$ depend on the starting point $x_0$. It is then possible that $\bar\pi$ could be very large in some cases, although in practice this does not seem to be a major concern. The constants $C_g, C_c$, which also enter in the definition of $\mathcal{E}$ and $\epsilon_\ell$, could be quite large. One can, however, give a tighter definition of $C_1$ by not introducing these constants. In this case, we would define $\epsilon_\ell$ by (3.31) and employ (3.27), rather than (3.43). This makes the main theorem more precise, albeit more difficult to interpret.
Returning to the constants in (3.42) and (3.32), we have that
$$\epsilon_\ell, \mathcal{E} \sim \delta, \; \frac{1}{b_l}, \; \frac{\eta}{b_l},$$
and from (3.4), (3.5), (3.16) we observe that
$$\sigma_{\min}(\tilde J_k) \ge \gamma, \qquad \delta = \frac{1}{\gamma - \epsilon_J} \ge \frac{1}{\sigma_{\min}(\tilde J_k) - \epsilon_J}, \qquad \text{and} \quad \eta = \frac{1}{\gamma} = \frac{1}{\sigma_{\min}(\tilde J_k)}.$$
The effects of a nearly rank-deficient Jacobian and of the Hessian approximation $\beta_k I$ are now apparent.
It is interesting to compare $C_1$ with the region obtained by Berahas et al. [1] for unconstrained strongly convex optimization. When constraints are not present, i.e., $m = 0$, conditions (3.23) and (3.36) defining $C_1$ reduce to requirements of the form $\|g_k\| \ge c_1 \epsilon_g$ and $\|g_k\|^2 \ge c_2 (\epsilon_g \|g_k\| + \epsilon_g^2)$ for some constants $c_1$ and $c_2$, respectively. That corresponds to Case 1 in the analysis of [1], in which case $\epsilon_g$ is small as compared to $\|g_k\|$ by some factor $\beta \in (0, 1)$, so that the line search ensures an improvement in the exact objective function ($f(x)$ in our notation). Similar to the setting in this paper, [1] employs a relaxed line search which does not fail even in the critical region; that is, when $\|g_k\| \le \beta \epsilon_g$. Their analysis then provides a level set that the iterates cannot leave, which depends on the relaxation term $R$ as well as $\epsilon_\phi$ (i.e. $\epsilon_f$ in the unconstrained case) as in the definition of $C_2$ in our analysis. Since strong convexity is assumed in [1], they can define this level set in terms of a strong convexity parameter rather than a bound such as $w$ in Definition 3.10.
4 Numerical Experiments
We implemented Algorithm 1 in Python. We set $\nu = 0.1$, $\tau = 0.9$, and $\beta_k = 50$, for all $k$. The purpose of the numerical experiments is to supplement the theoretical results, which are stated in terms of the merit function $\phi$, by reporting the distance to the solution $\|x_k - x^*\|$ as the iteration progresses. In order to gain an idea of this behavior, it suffices to test only a few examples. We selected the following three small-scale equality-constrained problems from the CUTEst set [9].

| problem | classification | objective | constraints |
|---|---|---|---|
| HS7 | OOR2-AN-2-1 | $\ln(1 + x_1^2) - x_2$ | $(1 + x_1^2)^2 + x_2^2 = 4$ |
| BT11 | OOR2-AN-4-3 | $-x_1 x_2 x_3 x_4$ | $x_1^3 + x_2^2 = 1$; $x_1^2 x_4 - x_3 = 0$; $x_4^2 - x_2 = 0$ |
| HS40 | OOR2-AY-5-3 | $(x_1 - 1)^2 + (x_1 - x_2)^2 + (x_2 - x_3)^2 + (x_3 - x_4)^4 + (x_4 - x_5)^4$ | $x_1 + x_2^2 + x_3^3 = -2 + \sqrt{18}$; $x_2 + x_4 + x_3^2 = -2 + \sqrt{8}$; $x_1 - x_5 = 2$ |
We add uniformly distributed random noise to the exact function values and to each component of the exact gradients; i.e., for $\xi_i \sim U(-\epsilon_1, \epsilon_1)$ and $\psi_{ij} \sim U(-\epsilon_2, \epsilon_2)$ we set
$$\tilde f(x) = f(x) + \xi_0, \qquad \tilde c_i(x) = c_i(x) + \xi_i,$$
$$\tilde g_j(x) = g_j(x) + \psi_{0j}, \qquad \tilde J_{ij}(x) = J_{ij}(x) + \psi_{ij}.$$
In our tests, we vary $\epsilon_1, \epsilon_2$, and report $\|x_k - x^*\|$, where $x^*$ is a locally optimal solution obtained by using exact gradients in the algorithm. For each of these problems, $x^*$ is a nondegenerate stationary point.
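The noise model above can be sketched in a few lines. The wrapper `make_noisy_oracles`, the helper `noise_bounds`, and the toy problem are assumptions introduced for illustration. The bound formulas are worst-case values derived here from the norms in (3.1)-(3.2) for componentwise uniform noise; they are not stated explicitly in the text, although for $n = 2$, $m = 1$ and $\epsilon_1 = \epsilon_2 = 10^{-3}$ they reproduce the values listed for HS7 in the caption of Figure 4.1.

```python
import numpy as np

def make_noisy_oracles(f, c, g, J, eps1, eps2, rng):
    """Wrap exact callables with additive componentwise uniform noise."""
    def f_noisy(x):
        return f(x) + rng.uniform(-eps1, eps1)
    def c_noisy(x):
        cx = c(x)
        return cx + rng.uniform(-eps1, eps1, size=cx.shape)
    def g_noisy(x):
        gx = g(x)
        return gx + rng.uniform(-eps2, eps2, size=gx.shape)
    def J_noisy(x):
        Jx = J(x)
        return Jx + rng.uniform(-eps2, eps2, size=Jx.shape)
    return f_noisy, c_noisy, g_noisy, J_noisy

def noise_bounds(m, n, eps1, eps2):
    """Worst-case constants for (3.1)-(3.2) under the noise model above."""
    return {
        "eps_f": eps1,                   # a single scalar perturbation
        "eps_c": m * eps1,               # l1 norm of m perturbations
        "eps_g": np.sqrt(n) * eps2,      # Euclidean norm of n perturbations
        "eps_J": m * np.sqrt(n) * eps2,  # (1,2)-norm bounded by sum of row 2-norms
    }

# Toy problem (an assumption, not one of the CUTEst problems): n = 2, m = 1.
f = lambda x: x[0] ** 2 + x[1] ** 2
c = lambda x: np.array([x[0] + x[1] - 1.0])
g = lambda x: 2.0 * x
J = lambda x: np.array([[1.0, 1.0]])

rng = np.random.default_rng(0)
f_n, c_n, g_n, J_n = make_noisy_oracles(f, c, g, J, 1e-3, 1e-3, rng)
bounds = noise_bounds(m=1, n=2, eps1=1e-3, eps2=1e-3)
x = np.array([0.3, 0.4])
```

These constants are what Step 6 of Algorithm 1 consumes when it forms $R = 2(\epsilon_f + \pi_k \epsilon_c)$.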
Asymptotic Behavior.
In Figure 4.1, we plot
kxk−x∗k
for 1000 iterations, for
1
=
2
= 10
−3
in the definitions of
ξi
, and
ψij
We also display the values of
f, c, g, J
defined in
(3.1)
-
(3.2)
We should note that in each of the runs the penalty parameter
πk
became fixed
within the first 15 iterations. We observe that
{kxk−x∗k}
is contained in a band whose
upper bound is frequently visited by the algorithm, whereas the lower bound is defined
by large irregular spikes. These results suggest that if one desires the highest accuracy
in the solution, the algorithm should continue beyond the point where oscillations in the
merit function occur, since there is little risk that the iterates will stray away from the
neighborhood of the solution, and there is a chance that significantly higher accuracy is
achieved at some iterates.
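Figure 4.1: Distance to optimality ($\log_2(\|x_k - x^*\|)$) vs iteration number for $\epsilon_1 = \epsilon_2 = 10^{-3}$
(a) HS7. $\epsilon_f = 10^{-3}$, $\epsilon_c = 10^{-3}$, $\epsilon_g = 1.41 \times 10^{-3}$, $\epsilon_J = 1.41 \times 10^{-3}$
(b) BT11. $\epsilon_f = 10^{-3}$, $\epsilon_c = 3 \times 10^{-3}$, $\epsilon_g = 2.24 \times 10^{-3}$, $\epsilon_J = 6.71 \times 10^{-3}$
(c) HS40. $\epsilon_f = 10^{-3}$, $\epsilon_c = 3 \times 10^{-3}$, $\epsilon_g = 2 \times 10^{-3}$, $\epsilon_J = 6 \times 10^{-3}$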
Benefits of the relaxed line search. The only unconventional part of Algorithm 1 is
the relaxed line search (2.4). To observe the effect of the relaxation, we solved the test
problems with and without it; the results are reported in Tables 4.1–4.3. We observe that
when the relaxation is disabled, the line search often fails in a neighborhood of $x^*$ (we
terminate the algorithm as soon as there is a line search failure). When the relaxation is
enabled, the line search is always successful. In this case, we let the algorithm run for 100,
500, and 1000 iterations. It is apparent that the relaxed line search allows the algorithm to
continue iterating past the point where the traditional line search would fail, yielding much
better accuracy in the solution.
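The difference between the two line searches can be illustrated with a small sketch. Condition (2.4) is not restated in this section, so the form used here is an assumption: a backtracking Armijo test whose right-hand side is enlarged by a relaxation term `eps_R`, with `eps_R = 0` recovering the classical test. The names `relaxed_backtracking`, `WorstCaseNoise`, `decrease`, and `shrink` are hypothetical, and the deterministic "noise" is a stand-in chosen only to make the behavior reproducible.

```python
import numpy as np


def relaxed_backtracking(merit, x, d, decrease, eps_R,
                         nu=0.1, shrink=0.5, max_backtracks=30):
    """Return a step size accepted by a relaxed Armijo test, or None on failure.

    merit    : callable returning a (possibly noisy) merit function value
    decrease : negative estimate of the change in the merit function along d
    eps_R    : relaxation added to the right-hand side (0 = classical test)
    """
    phi0 = merit(x)
    alpha = 1.0
    for _ in range(max_backtracks):
        if merit(x + alpha * d) <= phi0 + nu * alpha * decrease + eps_R:
            return alpha
        alpha *= shrink
    return None  # line search failure


class WorstCaseNoise:
    """Noisy quadratic 0.5*||x||^2: the first evaluation is biased low by eps,
    every later evaluation is biased high by eps (bounded, adversarial noise)."""

    def __init__(self, eps):
        self.eps = eps
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        bias = -self.eps if self.calls == 1 else self.eps
        return 0.5 * float(x @ x) + bias


# Near the minimizer the true decrease (about 1e-8) is far below the noise (1e-3).
x = np.array([1e-4])
d = -x
decrease = -float(x @ x)
classical = relaxed_backtracking(WorstCaseNoise(1e-3), x, d, decrease, eps_R=0.0)
relaxed = relaxed_backtracking(WorstCaseNoise(1e-3), x, d, decrease, eps_R=2e-3)
```

In this toy setting the classical test rejects every trial step, because the noise gap of $2 \times 10^{-3}$ between two evaluations swamps the achievable decrease, while the relaxed test with `eps_R` equal to that gap accepts the full step.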
Table 4.1: $\min_k \{\|x_k - x^*\|\}$ when $\epsilon_1 = \epsilon_2 = 10^{-5}$
          relaxation disabled                            relaxation enabled
problem   iter. of failure   $\min_k \{\|x_k - x^*\|\}$   $k_{\max} = 100$   $k_{\max} = 500$   $k_{\max} = 1000$
HS7       77                 7.8260E-3                    1.0234E-3          4.9413E-8          4.9413E-8
BT11      64                 4.8346E-2                    3.9258E-3          1.9791E-6          1.4133E-6
HS40      26                 3.4728E-2                    2.1251E-3          1.09888E-6         1.0988E-6
Table 4.2: $\min_k \{\|x_k - x^*\|\}$ when $\epsilon_1 = \epsilon_2 = 10^{-3}$
          relaxation disabled                            relaxation enabled
problem   iter. of failure   $\min_k \{\|x_k - x^*\|\}$   $k_{\max} = 100$   $k_{\max} = 500$   $k_{\max} = 1000$
HS7       42                 8.0390E-2                    1.0401E-3          4.9328E-6          4.9328E-6
BT11      18                 9.3324E-1                    4.0003E-3          1.9804E-4          1.4060E-4
HS40      6                  6.4144E-2                    2.2293E-3          1.1183E-4          4.9328E-6
Table 4.3: $\min_k \{\|x_k - x^*\|\}$ when $\epsilon_1 = \epsilon_2 = 10^{-1}$
          relaxation disabled                            relaxation enabled
problem   iter. of failure   $\min_k \{\|x_k - x^*\|\}$   $k_{\max} = 100$   $k_{\max} = 500$   $k_{\max} = 1000$
HS7       10                 3.7404E-1                    1.3113E-3          4.5607E-4          2.5422E-4
BT11      8                  1.7108                       2.0598E-2          2.0598E-2          1.9451E-2
HS40      2                  1.1817E-1                    5.8202E-2          3.8673E-2          3.8673E-2
Effect of incorrect noise level estimations. In Algorithm 1, estimations of $\epsilon_f$ and $\epsilon_c$
are needed to set the relaxation bound $\epsilon_R$ in (2.4). It is clear that underestimating the
noise level can cause failure of the relaxed line search, which never fails when the true level
(or an overestimation) is provided. On the other hand, overestimation can lead to large
oscillations. The precise behavior of the algorithm will depend on the stop test, and there is
no universally adopted stopping criterion in the noisy setting, to our knowledge.
Nevertheless, we performed the following experiments using a stop test that could
be considered a naive modification of termination tests in standard packages. We simply
terminate the algorithm when the observed (noisy) feasibility and optimality errors are
smaller than the (estimated) noise provided for these quantities, i.e.,
$$\|\tilde c(x_k)\|_1 \le \epsilon_c^{\mathrm{est}} \quad \text{and} \quad \|\tilde g(x_k) + \tilde J(x_k)^T \lambda_k\| \le \epsilon_g^{\mathrm{est}} + \|\lambda_k\|_\infty \, \epsilon_J^{\mathrm{est}}. \qquad (4.1)$$
Tables 4.4–4.6 report the quantity $\min_k \{\|x_k - x^*\|\}$ when the algorithm employs estimated
noise levels $\epsilon_1^{\mathrm{est}}$ and $\epsilon_2^{\mathrm{est}}$ that are 10, 100 and 1000 times larger or smaller than the correct
values. We perform this experiment for $\epsilon_i = 10^{-1}, 10^{-3}, 10^{-5}$. A termination due to the
satisfaction of condition (4.1) is marked with (opt), and a line search failure is marked
with (ls).
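As a rough illustration, the stop test (4.1) could be coded as follows. The function name `noisy_stop_test` and its argument names are hypothetical, and the norm in the optimality condition is assumed to be Euclidean, since (4.1) does not specify it.

```python
import numpy as np


def noisy_stop_test(c_noisy, g_noisy, J_noisy, lam,
                    eps_c_est, eps_g_est, eps_J_est):
    """Return True when both conditions of the naive stop test (4.1) hold.

    c_noisy : noisy constraint values, shape (m,)
    g_noisy : noisy objective gradient, shape (n,)
    J_noisy : noisy constraint Jacobian, shape (m, n)
    lam     : current multiplier estimate, shape (m,)
    """
    feasibility = np.linalg.norm(c_noisy, 1)
    # Euclidean norm is an assumption of this sketch.
    optimality = np.linalg.norm(g_noisy + J_noisy.T @ lam)
    bound = eps_g_est + np.linalg.norm(lam, np.inf) * eps_J_est
    return bool(feasibility <= eps_c_est and optimality <= bound)
```

Because both thresholds scale with the supplied estimates, overestimated noise levels make the test easier to satisfy and underestimated ones make it harder, which matches the pattern of (opt) and (ls) terminations reported in the tables.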
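Table 4.4: $\min_k \{\|x_k - x^*\|\}$ when true $\epsilon_i = 10^{-5}$; $i = 1, 2$
          $\epsilon_i^{\mathrm{est}} = \epsilon_i$                  $\epsilon_i^{\mathrm{est}} = 0.001\,\epsilon_i$            $\epsilon_i^{\mathrm{est}} = 1000\,\epsilon_i$
problem   iter.        $\min_k \{\|x_k - x^*\|\}$   iter.       $\min_k \{\|x_k - x^*\|\}$   iter.       $\min_k \{\|x_k - x^*\|\}$
HS7       188 (opt)    3.2017E-6                    69 (ls)     8.8000E-3                    74 (opt)    5.5704E-3
BT11      233 (opt)    2.4010E-6                    64 (ls)     4.8346E-2                    64 (opt)    4.0112E-2
HS40      2703 (opt)   8.0766E-7                    26 (ls)     3.4728E-2                    27 (opt)    2.8305E-2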
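Table 4.5: $\min_k \{\|x_k - x^*\|\}$ when true $\epsilon_i = 10^{-3}$; $i = 1, 2$
          $\epsilon_i^{\mathrm{est}} = \epsilon_i$                  $\epsilon_i^{\mathrm{est}} = 0.01\,\epsilon_i$             $\epsilon_i^{\mathrm{est}} = 100\,\epsilon_i$
problem   iter.        $\min_k \{\|x_k - x^*\|\}$   iter.       $\min_k \{\|x_k - x^*\|\}$   iter.       $\min_k \{\|x_k - x^*\|\}$
HS7       117 (opt)    3.5750E-4                    42 (ls)     8.0390E-2                    39 (opt)    5.4414E-2
BT11      149 (opt)    2.7466E-4                    29 (ls)     5.3925E-1                    22 (opt)    5.9597E-1
HS40      154 (opt)    4.2653E-4                    7 (ls)      6.4142E-2                    2 (opt)     6.9002E-2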
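Table 4.6: $\min_k \{\|x_k - x^*\|\}$ when true $\epsilon_i = 10^{-1}$; $i = 1, 2$
          $\epsilon_i^{\mathrm{est}} = \epsilon_i$                  $\epsilon_i^{\mathrm{est}} = 0.1\,\epsilon_i$              $\epsilon_i^{\mathrm{est}} = 10\,\epsilon_i$
problem   iter.        $\min_k \{\|x_k - x^*\|\}$   iter.       $\min_k \{\|x_k - x^*\|\}$   iter.       $\min_k \{\|x_k - x^*\|\}$
HS7       51 (opt)     2.6752E-2                    556 (ls)    2.5682E-4                    5 (opt)     4.2796E-1
BT11      20 (opt)     6.6650E-1                    3233 (ls)   6.8738E-3                    2 (opt)     2.4045
HS40      210 (opt)    5.82E-2                      982 (ls)    1.4785E-2                    0 (opt)     2.8877E-1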
As expected, underestimations cause line search failures while overestimations cause
(4.1) to be triggered at earlier iterations. Another consequence of underestimating $\epsilon_2$ is
that the algorithm might never be able to satisfy (4.1), even if a line search failure occurs
sufficiently late in the run; see for example the entry corresponding to
$\epsilon_i = 10^{-1}$, $\epsilon_i^{\mathrm{est}} = 0.1\,\epsilon_i$.
In summary, over- and underestimation of the noise levels can be harmful in ways that are
dependent on the implementation.
We must point out that an optimization algorithm may provide an indication that the
noise estimates must be re-computed. For example, the recovery procedure described by
Berahas et al. [1] uses information from the line search to request a better estimate (e.g.
through sampling or finite difference tables), and can take precautions to avoid harmful
iterations. Robust implementations of methods for constrained optimization in the presence
of noise should include such features.
5 Final Remarks
Two questions guided this research. What is the best behavior one can expect of a constrained
optimization method when functions and constraints contain a moderate amount of bounded
noise that cannot be diminished at will? What are the minimal modifications of a classical
optimization algorithm that allow it to tolerate noise, when the noise level can be estimated?
In this paper, we focused on a classical sequential quadratic programming method applied
to equality constrained problems. We showed that a modification (relaxation) of the line
search allows the iterates to approach a region around the solution where noise dominates—
and that the iterates remain in a vicinity of this region, under normal circumstances. The
analysis is presented under benign assumptions, for example that the Jacobian of the
constraints is never close to singular, which facilitates the choice of the penalty parameter.
Nevertheless, we believe that the essence of the analysis captures some of the main challenges
to be confronted when functions and derivatives contain noise. The accuracy bounds
presented in this paper will be sharpened in a forthcoming paper that studies the local
behavior of the method near a well behaved minimizer.
The thorny issue of how to design a proper stop test that reflects the desires of the users
has not been addressed in this paper and is worthy of research. The treatment of singularity
and the use of a nondiagonal Hessian also require attention, as does the very important
question of how to handle noisy inequality constraints.
Acknowledgement. We thank Shigeng Sun for his careful reading of the paper and useful
suggestions.
References
[1] Albert S Berahas, Richard H Byrd, and Jorge Nocedal. Derivative-free optimization of noisy functions via quasi-Newton methods. SIAM Journal on Optimization, 29(2):965–993, 2019.
[2] Albert S Berahas, Frank E Curtis, Michael J O’Neill, and Daniel P Robinson. A stochastic sequential quadratic optimization algorithm for nonlinear equality constrained optimization with rank-deficient Jacobians. arXiv preprint arXiv:2106.13015, 2021.
[3] Albert S Berahas, Frank E Curtis, Daniel Robinson, and Baoyu Zhou. Sequential quadratic optimization for nonlinear equality constrained stochastic optimization. SIAM Journal on Optimization, 31(2):1352–1379, 2021.
[4] Dimitri P Bertsekas. Convex Optimization Algorithms. Athena Scientific, 2015.
[5] R. H. Byrd, J. Nocedal, and R. A. Waltz. KNITRO: An integrated package for nonlinear optimization. In G. di Pillo and M. Roma, editors, Large-Scale Nonlinear Optimization, pages 35–59. Springer, 2006.
[6] Frank E Curtis, Daniel P Robinson, and Baoyu Zhou. Inexact sequential quadratic optimization for minimizing a stochastic objective function subject to deterministic nonlinear equality constraints. arXiv preprint arXiv:2107.03512, 2021.
[7] R. Fletcher. Practical Methods of Optimization. Wiley, second edition, 1987.
[8] Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, second edition, 1989.
[9] Nicholas IM Gould, Dominique Orban, and Philippe L Toint. CUTEst: a constrained and unconstrained testing environment with safe threads for mathematical optimization. Computational Optimization and Applications, 60(3):545–557, 2015.
[10] M Hintermüller. Solving nonlinear programming problems with noisy function values and noisy gradients. Journal of Optimization Theory and Applications, 114(1):133–169, 2002.
[11] Jorge J Moré and Stefan M Wild. Estimating derivatives of noisy simulations. ACM Transactions on Mathematical Software (TOMS), 38(3):19, 2012.
[12] Jorge Nocedal and Stephen Wright. Numerical Optimization. Springer New York, 2 edition, 1999.
[13] BT Poljak. Nonlinear programming methods in the presence of noise. Mathematical Programming, 14(1):87–97, 1978.
[14] K Schittkowski. Nlpqlp-nonlinear programming with non-monotone and distributed line search, 2014.
[15] Hao-Jun Michael Shi, Melody Qiming Xuan, Figen Oztoprak, and Jorge Nocedal. On the numerical performance of derivative-free optimization methods based on finite-difference approximations. arXiv preprint arXiv:2102.09762, 2021.
[16] Gilbert W Stewart. On the perturbation of pseudo-inverses, projections and linear least squares problems. SIAM Review, 19(4):634–662, 1977.