Computational Optimization and Applications (2025) 91:145–171
https://doi.org/10.1007/s10589-025-00661-4
Finding search directions in quasi-Newton methods for
minimizing a quadratic function subject to uncertainty
Shen Peng¹,² · Gianpiero Canessa² · David Ek² · Anders Forsgren²
Received: 31 August 2021 / Accepted: 28 January 2025 / Published online: 21 February 2025
© The Author(s) 2025
Abstract
We investigate quasi-Newton methods for minimizing a strongly convex quadratic
function which is subject to errors in the evaluation of the gradients. In particular, we
focus on computing search directions for quasi-Newton methods that all give identical
behavior in exact arithmetic, generating minimizers of Krylov subspaces of increasing
dimensions, thereby having finite termination. The BFGS quasi-Newton method may
be seen as an ideal method in exact arithmetic and is empirically known to behave
very well on a quadratic problem subject to small errors. We investigate large-error
scenarios, in which the expected behavior is not so clear. We consider memoryless
methods that are less expensive than the BFGS method, in that they generate low-rank
quasi-Newton matrices that differ from the identity by a symmetric matrix of rank two.
In addition, a more advanced model for generating the search directions is proposed,
based on solving a chance-constrained optimization problem. Our numerical results
indicate that for large errors, such a low-rank memoryless quasi-Newton method may
perform better than a BFGS method. In addition, the results indicate a potential edge
by including the chance-constrained model in the memoryless quasi-Newton method.
Keywords Quadratic programming ·Quasi-Newton method ·Stochastic
quasi-Newton method ·Chance constrained model
The bulk of the research was carried out while all authors were affiliated with KTH.

Corresponding author: Anders Forsgren
andersf@kth.se

Shen Peng
pengshen@xidian.edu.cn

Gianpiero Canessa
shenp@kth.se

David Ek
daviek@kth.se

¹ School of Mathematics and Statistics, Xidian University, Xi'an 710126, China
² Optimization and Systems Theory, Department of Mathematics, KTH Royal Institute of Technology, SE-100 44 Stockholm, Sweden
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
1 Introduction
A strongly convex n-dimensional quadratic function may be written on the form

    q(x) = (1/2) x^T H x + c^T x + d,

where H is a positive definite and symmetric n×n matrix, c is an n-dimensional vector and d is a constant. The optimization problem of minimizing q(x) is equivalent to solving ∇q(x) = 0, i.e., solving the linear equation Hx + c = 0.
One way to do so by an iterative method is to find an initial point x_0 and associated gradient g_0 = Hx_0 + c. Then generate x_k and g_k, with g_k = Hx_k + c, such that x_k is the minimizer of q(x) on x_0 + K_k(g_0, H), where

    K_0(g_0, H) = {0},   K_k(g_0, H) = span{g_0, Hg_0, H^2 g_0, ..., H^{k-1} g_0},   k = 1, 2, ....

This is equivalent to g_k being orthogonal to K_k(g_0, H), so that g_0, g_1, ..., g_k form an orthogonal basis for K_{k+1}(g_0, H). Since there can be at most n nonzero orthogonal vectors, there is an r, with r ≤ n, such that g_r = 0. Consequently, x_r is the minimizer of q(x). A method for computing such a sequence x_1, x_2, ..., x_r may be characterized by the search direction p_k leading from x_k to x_{k+1}. Given p_k, the step length α_k is given by minimizing q(x_k + α p_k), i.e.,

    α_k = -(g_k^T p_k) / (p_k^T H p_k).   (1.1)
The method of conjugate gradients gives a short recursion for the search direction p_k. It may be written on the form

    p_k = -g_0,   k = 0,
    p_k = -g_k + ((g_k^T g_k) / (g_{k-1}^T g_{k-1})) p_{k-1},   k = 1, 2, ..., r-1.   (1.2)

See, e.g., [21, Chapter 6] for an introduction to the method of conjugate gradients. By expanding for k-1, k-2, ..., 0, the expression for p_k of (1.2) may equivalently be written as

    p_k = -g_k^T g_k Σ_{i=0}^{k} (1/(g_i^T g_i)) g_i.   (1.3)

Using the orthogonality of g_i, i = 1, ..., k, it can be seen from (1.3) that another characterization is given by the conditions that

    (i)  p_k is a linear combination of g_0, g_1, ..., g_k, in addition to   (1.4a)
    (ii) satisfying g_i^T p_k = -g_k^T g_k,  i = 0, ..., k,                  (1.4b)
see, e.g., [7, Lemma 1]. We note from (1.3) that p_k requires information from all previous gradients g_i, i = 0, ..., k. However, (1.2) shows that it suffices to have g_k and p_{k-1}, so that p_{k-1} gives an appropriate linear combination of g_i, i = 0, ..., k-1, in forming p_k together with g_k.
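The recursion (1.2) combined with the exact step length (1.1) can be sketched as follows. This is a minimal illustration of the finite-termination property, not code from the paper; the function and variable names are our own.

```python
import numpy as np

def cg_minimize(H, c, x0, r_max=None):
    """Sketch of the conjugate-gradient recursion (1.2) with the
    exact step length (1.1); stops when the gradient vanishes."""
    x = x0.astype(float)
    g = H @ x + c
    p = -g
    for _ in range(r_max or len(c)):
        if np.linalg.norm(g) < 1e-12:
            break
        alpha = -(g @ p) / (p @ H @ p)      # step length (1.1)
        x = x + alpha * p
        g_new = H @ x + c
        beta = (g_new @ g_new) / (g @ g)    # CG coefficient in (1.2)
        p = -g_new + beta * p
        g = g_new
    return x

# On a strongly convex quadratic the iteration reaches the minimizer
# -H^{-1} c in at most n steps in exact arithmetic.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
H = A @ A.T + 5 * np.eye(5)                 # symmetric positive definite
c = rng.standard_normal(5)
x_star = cg_minimize(H, c, np.zeros(5))
assert np.allclose(H @ x_star + c, 0, atol=1e-8)
```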
An alternative way of computing the search direction p_k satisfying (1.4) is through a quasi-Newton method, in which p_k is defined by a linear equation B_k p_k = -g_k, for B_k a symmetric positive-definite matrix. A well-known method for which B_k gives p_k satisfying (1.4) is the BFGS quasi-Newton method, in which B_0 = I and B_k is formed by adding a particular symmetric rank-2 matrix to B_{k-1}. In our quadratic setting with exact linesearch, the BFGS method may be viewed as the "ideal" update that dynamically transforms B_k from I to H in n steps. Identity curvature is transformed to H-curvature in one new dimension at each step, and this H-curvature information is maintained throughout. See Appendix B for a more detailed discussion of the BFGS method.
The concern of the present paper is to study the behavior of quasi-Newton methods when the gradient is subject to noise. In particular, we are interested in investigating the impact of the quality of the search directions on the behavior of the method. Therefore, we limit the inaccuracy to the search direction p_k only, and throughout allow exact linesearch according to (1.1) to be carried out. This means that we do not necessarily see the behavior of a method, but rather the potential of a method which computes a particular search direction.
The computational cost of a BFGS quasi-Newton method increases with k, as the individual gradients are handled explicitly. We will also consider quasi-Newton matrices that differ from the identity by a matrix of rank two, and refer to an associated quasi-Newton method as a low-rank quasi-Newton method. The corresponding search direction can then be computed from a two-by-two block system. In addition, we investigate the potential for improving the performance of the quasi-Newton method by formulating robust optimization problems of chance-constraint type for computing the search directions. These methods become of higher interest in the case of large noise and multiple copies of the gradients. Our interest is to capture the essence of the behavior, and to understand the interplay between the quality of the computed direction and the robustness given by the chance constraints. The computational cost will always be significantly higher, but our interest is to see if we can gain in terms of robustness and accuracy of the computed solution.
The noise in the gradients can be viewed from different perspectives. Firstly, as mentioned above, finite precision arithmetic gives a residual between the evaluated gradients and the true gradients. Secondly, in many practical problems, such as PDE-constrained optimization, the objective function often contains computational noise created by an inexact linear system solver, adaptive grids, or other internal computations. In addition, random noise can exist in the gradient when minimizing an expectation objective function, due to the randomness of samples; this happens in many machine learning models. By the central limit theorem, such random noise approximately follows a normal distribution. Inspired by this case, normally distributed random noise is considered in the experiments of this work.
Compared with the existing literature, our work proposes a chance-constrained model for finding a robust search direction, in order to reduce the influence of uncertainty. In this model, the validity of the quasi-Newton setting is guaranteed with high probability, while the quality of the search direction is controlled at the same time. In addition, to solve the chance-constrained model in each iteration, we provide a deterministic formulation for the low-rank quasi-Newton setting, based on samples of the random gradient, to obtain a robust search direction. Finally, the experimental results illustrate the effectiveness of the proposed approach for finding search directions. In addition, we compare to the behavior of the BFGS method and two low-rank quasi-Newton methods.
The paper is organized as follows. Section 2 contains a brief discussion on background and related work. In Sect. 3, a description of the low-rank quasi-Newton methods that are used in our study is given. In Sect. 4, we give the chance-constrained model for computing the search direction, with particular focus on a low-rank setting. In Sect. 5, we present the computational results of our study. Finally, a conclusion is given in Sect. 6.
2 Background and related work
The paper builds on previous work in the settings of exact arithmetic and finite precision arithmetic. Forsgren and Odland [8] have studied exact linesearch quasi-Newton methods for minimizing a strongly convex quadratic function, and given necessary and sufficient conditions for a quasi-Newton matrix B_k to generate a search direction which is parallel to that of (1.4) in exact arithmetic. With exact linesearch methods, Ek and Forsgren [7] have studied certain limited-memory quasi-Newton Hessian approximations for minimizing a convex quadratic function in the setting of finite precision arithmetic. Dennis and Walker [6] have considered the use of bounded-deterioration quasi-Newton methods implemented in floating-point arithmetic where only inaccurate values are available. In contrast, our work allows for large noise, and we study performance on a set of test problems.
In the present manuscript, we consider a situation where the function values and gradients cannot be easily obtained and only noisy information about the gradient is available. To handle this situation, stochastic methods have been proposed to minimize the objective function with inaccurate information. Our setting is minimizing a strongly convex quadratic function.
For strongly convex problems, Mokhtari and Ribeiro [16] have proposed a regular-
ized stochastic BFGS method and analyzed its convergence, and Mokhtari and Ribeiro
[17] have further studied an online L-BFGS method. Berahas, Nocedal and Takac [4]
have considered the stable performance of quasi-Newton updating in the multi-batch
setting, illustrated the behavior of the algorithm and studied its convergence properties
for both the convex and nonconvex cases. Byrd et al. [3] have proposed a stochastic
quasi-Newton method in limited memory form through subsampled Hessian-vector
products. Shi et al. [23] have proposed practical extensions of the BFGS and L-BFGS
methods for nonlinear optimization that are capable of dealing with noise by employing
a new linesearch technique. Xie et al. [24] have considered the convergence analysis of
quasi-Newton methods when there are (bounded) errors in both function and gradient
evaluations, and established conditions under which an Armijo-Wolfe linesearch on
the noisy function yields sufficient decrease in the true objective function. In addition,
Irwin and Haber [11] have proposed a version of the BFGS method where the secant
condition is treated with a penalty method approach.
Unlike the stochastic quasi-Newton methods, which are based on subsampled gradients or Hessians, there are also other stochastic tools to reduce the effect of noise when generating the search direction. Lucchi et al. [13] have studied a quasi-Newton method that incorporates a variance-reduction technique to reduce the effect of noise in Hessian matrices, proposing a variance-reduced stochastic Newton method. This method keeps the variance under control by means of a multi-stage scheme. Moritz et al. [15] have proposed a linearly convergent method that integrates the L-BFGS method with the variance reduction technique to alleviate the effect of noisy gradients, by adding the residual between the subsampled gradient and the full gradient to the noisy gradient.
In addition, chance constraints are a natural approach to handling the effect of random noise [1]. Therefore, chance constraints have the potential to reduce the effect of random noise when generating the search direction. By integrating chance constraints in the design of quasi-Newton methods, we investigate the ability to build robustness into the computation of the search direction in the presence of random noise.
3 Low-rank methods for finding the search direction
The desirable properties of the BFGS update come at the expense of making explicit use of each gradient g_i, i = 0, ..., k, when forming B_k. We refer to Appendix B for details. Analogous to the search direction p_k of (1.2) being composed of g_k, g_{k-1} and p_{k-1}, we may let B_k be composed of g_k, g_{k-1} and p_{k-1}, giving a limited-memory quasi-Newton matrix which differs from the identity by a matrix of rank two. We will refer to such a matrix as a low-rank quasi-Newton matrix. Two particular low-rank quasi-Newton matrices will be considered. One is the memoryless BFGS quasi-Newton method, in which

    B_k = I - (1/(p_{k-1}^T p_{k-1})) p_{k-1} p_{k-1}^T + ρ_{k-1} (g_k - g_{k-1})(g_k - g_{k-1})^T,   (3.1)

where the value of ρ_{k-1} is given by the secant condition α_{k-1} B_k p_{k-1} = g_k - g_{k-1}. We denote this particular value of ρ_{k-1} by ρ̂_{k-1}. For exact linesearch, B_k of (3.1) gives

    ρ̂_{k-1} = -1 / (α_{k-1} g_{k-1}^T p_{k-1}).   (3.2)
Then B_k p_k = -g_k for p_k satisfying (1.4) in the case of exact arithmetic, see, e.g., [8, Proposition 1]. The memoryless BFGS matrix given by (3.1) is analogous to the BFGS matrix of (B.4) in the sense that curvature is first removed and then set in the previous search direction p_{k-1}. However, the memoryless BFGS matrix removes curvature from the identity matrix, whereas the BFGS matrix removes curvature from the previous Hessian approximation B_{k-1}. The memoryless BFGS matrix B_k of (3.1) is positive definite and nonsingular, as ρ_{k-1} > 0 and (g_k - g_{k-1})^T p_{k-1} = -g_{k-1}^T p_{k-1} ≠ 0.
We will also interpret the method of conjugate gradients in a quasi-Newton framework, by forming the symmetric CG quasi-Newton matrix B_k given by

    B_k = (I - (1/(g_{k-1}^T p_{k-1})) g_k p_{k-1}^T)(I - (1/(g_{k-1}^T p_{k-1})) p_{k-1} g_k^T).   (3.3)

This matrix is formed by rewriting the recursion (1.2) and making an additional symmetrization, see [8].
The behavior of the low-rank methods in the presence of noise is of interest in its
own right. In addition, the memoryless BFGS method of (3.1) forms the basis for the
chance-constrained model which we propose.
We have chosen to consider the memoryless BFGS method, which uses only the
most recent gradient information, and the BFGS method, which uses all gradients, as
they have equivalent behavior in exact arithmetic. It deserves mentioning that there
are methods that use a limited set of gradients, e.g., L-BFGS [14].
4 A chance-constrained model for finding the search direction
In addition to investigating the behavior of the quasi-Newton methods discussed so far, we are also interested in investigating the potential of increasing the performance of the quasi-Newton methods in the presence of random noise. For the case of exact arithmetic, i.e., no noise, our model direction is the direction p_k that satisfies (1.4). The interest is now to investigate and design quasi-Newton matrices in the presence of random noise. For a given quasi-Newton matrix B_k, the search direction p_k is computed from B_k p_k = -g_k. As there exists random noise in each iteration, the obtained gradient is not accurate; it is the combination of the true gradient and some random noise. The update of the search direction may then result in a non-descent direction because of the influence of the random noise.
Let us assume that calculations are not exact, given computational limitations. Every time a computation is done, a small error in the final result is carried over to the value of the gradient evaluated at a given point. This error can be modeled as a value added to the true value, represented by a normal distribution with mean μ = 0 and a known variance σ^2. Therefore, if g_k is the true value of the gradient at iteration k, the noise ε is a random vector with ε ∼ N(0, σ^2 I). We will let g_k denote the exact gradient and let g̃_k denote the noisy gradient, i.e., g̃_k = g_k + ε, at iteration k. In addition, we will denote by ḡ_k an average of the noisy gradient, where the precise average may depend on the context.
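A minimal sketch of this noise model (the helper name, problem data and sample counts are illustrative assumptions, not quantities from the paper):

```python
import numpy as np

def noisy_gradient(H, c, x, sigma, rng, n_samples=1):
    """Noisy gradients g~ = g + eps with eps ~ N(0, sigma^2 I);
    returns the samples row-wise."""
    g = H @ x + c
    eps = sigma * rng.standard_normal((n_samples, len(c)))
    return g + eps

rng = np.random.default_rng(2)
H = 2 * np.eye(4)
c = np.ones(4)
x = np.zeros(4)
samples = noisy_gradient(H, c, x, sigma=1.0, rng=rng, n_samples=20000)
g_bar = samples.mean(axis=0)          # sample-average gradient (g-bar)
# With mean-zero noise, the average approaches the exact gradient.
assert np.linalg.norm(g_bar - (H @ x + c)) < 0.05
```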
However, this new setting could potentially lead to numerical problems in experiments where B_k is not positive definite as a consequence of how the gradient is sampled. There are different ways to deal with this problem, such as obtaining more samples or resampling the gradient. This is taken into account in our experiments, and how we deal with it is explained in the corresponding section.
In our quasi-Newton setting, the aim is not only to generate descent directions, but also to generate search directions of high quality. In particular, we want to study the balance between the complexity of computing the direction and its quality. Therefore, we consider methods that we refer to as low-rank quasi-Newton methods, where the quasi-Newton matrix differs from the identity matrix by a symmetric matrix of rank two. Then, a chance-constrained optimization model for computing the search direction will be considered, by taking uncertainty into account explicitly.
4.1 A chance-constrained model for the low-rank setting
We will now propose a particular chance-constrained model, applicable for the low-
rank setting.
In the exact arithmetic case, the condition (1.4) gives (g_{i+1} - g_i)^T p_k = 0, i = 0, ..., k-1. This means that p_k is orthogonal to the affine span of the generated gradients. In the noisy setting, we apply the sum of squared residuals Σ_{i=0}^{k-1} ((g_{i+1} - g_i)^T p_k)^2 to characterize the quality of the search direction at iteration k; this measure shows how close the direction is to the characterization in (1.4). We suppose that a direction with lower total residual provides better performance.
Motivated by the memoryless BFGS matrix of (3.1), we will consider a low-rank matrix of the form

    B_k(ρ) = I - (1/(p_{k-1}^T p_{k-1})) p_{k-1} p_{k-1}^T + ρ (g_k - g_{k-1})(g_k - g_{k-1})^T,   (4.1)

where ρ is a variable, and we require ρ > 0. The condition ρ > 0 ensures B_k(ρ) ≻ 0 due to the exact linesearch. In addition, B_k(ρ) p_k = -g_k for p_k satisfying (1.4) in the case of exact arithmetic, see, e.g., [8, Proposition 1].

Then, we can obtain the following model to find a search direction in the low-rank setting,

    minimize_{ρ>0}   Σ_{i=0}^{k-1} t_i^2
    subject to   -t_i ≤ (g_{i+1} - g_i)^T p_k ≤ t_i,   i = 0, 1, ..., k-1,     (D)
                 B_k(ρ) p_k = -g_k,

where B_k(ρ) is defined by (4.1). Note that in exact arithmetic and under exact linesearch, the specific value of ρ has no impact on the search direction. By selecting a proper ρ, we can obtain a search direction p_k according to the relationship B_k(ρ) p_k = -g_k.
The model (D) is actually a deterministic model, where the noisy gradients are deterministic. However, as mentioned in Sect. 1, in some practical problems the gradients themselves are random because of the randomness in the original objective quadratic function. Therefore, it is more natural to view the gradients in model (D) as random vectors.
To create a model that takes the randomness of the gradients into account, we will handle randomness in the gradient g̃_k, but assume that the gradients ḡ_i, i = 0, ..., k-1, are deterministic; these can be the realized values of noisy gradients or average values of noisy gradient samples. A more precise definition of ḡ_i, i = 0, ..., k-1, will be made in the simplified model that will be presented in Sect. 4.2. Then, based on model (D), a chance-constrained model for finding a search direction can be formulated as
    minimize_{ρ>0}   Σ_{i=0}^{k-1} t_i^2
    subject to   P( -t_i ≤ (ḡ_{i+1} - ḡ_i)^T p̃_k ≤ t_i,   i = 0, ..., k-2,
                    -t_{k-1} ≤ (g̃_k - ḡ_{k-1})^T p̃_k ≤ t_{k-1},               (C)
                    B̃_k(ρ) p̃_k = -g̃_k ) ≥ 1 - β,

where β ∈ [0, 1] is a given probability level and

    B̃_k(ρ) = I - (1/(p_{k-1}^T p_{k-1})) p_{k-1} p_{k-1}^T + ρ (g̃_k - ḡ_{k-1})(g̃_k - ḡ_{k-1})^T.   (4.2)

Note that in model (C), only g̃_k is a random vector, while ḡ_i, i = 0, ..., k-1, are constant values. Let ḡ_k denote a mean value of g̃_k, and let

    B̄_k(ρ) = I - (1/(p_{k-1}^T p_{k-1})) p_{k-1} p_{k-1}^T + ρ (ḡ_k - ḡ_{k-1})(ḡ_k - ḡ_{k-1})^T.   (4.3)
We will discuss in Sect. 4.3 how a mean value may be computed. As ḡ_k is an approximation of g_k, since the mean value of the random noise is zero, a search direction p_k can be obtained, similarly to the deterministic case, according to B̄_k(ρ*) p_k = -ḡ_k. Here ρ* is an optimal solution of problem (C).

The value of β indicates the risk-aversion of the decision maker, where β = 0 is the most conservative approach, as we need to comply with the supremum value of the underlying random vector. Even small values of β can have a significant impact on the results [2]; therefore, studying the behavior of the solution for β equal to 0 and close to 0 (typically 0.01 or 0.05) is the usual approach. The chance constraint in problem (C) not only guarantees the validity of the quasi-Newton setting with probability at least 1 - β, but also controls the quality measure of the search direction.
To illustrate the feasibility of model (C), we have the following proposition to show
that the search direction obtained by solving problem (C) is a descent direction with
a high probability.
Proposition 4.1 Denote g̃_k = g_k + ε̃, where g_k is the true gradient and ε̃ is a random noise with mean equal to 0. Let ρ* be an optimal solution obtained by solving problem (C). If B̄_k(ρ*) is invertible and

    P( ‖ε̃‖ < (1/‖B̄_k(ρ*)^{-1} ḡ_k‖) g̃_k^T B̄_k(ρ*)^{-1} ḡ_k ) ≥ 1 - α,

the direction p_k obtained by B̄_k(ρ*) p_k = -ḡ_k is a descent direction with probability at least 1 - α.
Proof The direction p_k is a descent direction if g_k^T p_k < 0. Since B̄_k(ρ*) p_k = -ḡ_k, we have

    g_k^T p_k = (g̃_k - ε̃)^T p_k = -(g̃_k - ε̃)^T B̄_k(ρ*)^{-1} ḡ_k = -g̃_k^T B̄_k(ρ*)^{-1} ḡ_k + ε̃^T B̄_k(ρ*)^{-1} ḡ_k.

Hence, p_k is a descent direction if g̃_k^T B̄_k(ρ*)^{-1} ḡ_k - ε̃^T B̄_k(ρ*)^{-1} ḡ_k > 0. Since

    g̃_k^T B̄_k(ρ*)^{-1} ḡ_k - ε̃^T B̄_k(ρ*)^{-1} ḡ_k ≥ g̃_k^T B̄_k(ρ*)^{-1} ḡ_k - ‖ε̃‖ ‖B̄_k(ρ*)^{-1} ḡ_k‖,

the condition ‖ε̃‖ < (1/‖B̄_k(ρ*)^{-1} ḡ_k‖) g̃_k^T B̄_k(ρ*)^{-1} ḡ_k implies that p_k is a descent direction. Therefore, the conclusion can be obtained. □
Remark 1 From Proposition 4.1, we can observe that when x_k is close to the optimal solution and the value of ‖g̃_k‖ is small, there exists a constant 0 < ᾱ < 1 such that P( ‖ε̃‖ < (1/‖B̄_k(ρ*)^{-1} ḡ_k‖) g̃_k^T B̄_k(ρ*)^{-1} ḡ_k ) ≥ 1 - ᾱ. This implies that problem (C) can always provide a descent direction p_k with some probability, no matter how close x_k is to the optimal solution.
4.2 Simplification of the low-rank chance-constrained model
Models (D) and (C) are hard to solve, given the non-convex nature of the equality constraints created by the condition B_k p_k = -g_k and the associated condition on B_k, and the nature of the chance constraints. For this case, however, we may simplify the formulation. If ρ̂_{k-1} denotes the value given by the secant condition (3.2), we may write (4.1) as

    B_k(ρ) = B_k(ρ̂_{k-1}) + (ρ - ρ̂_{k-1})(g_k - g_{k-1})(g_k - g_{k-1})^T.   (4.4)

The point of introducing ρ̂_{k-1} and B_k(ρ̂_{k-1}) is to give a nonsingular and positive definite B_k(ρ̂_{k-1}), which may be used as a foundation for optimizing over ρ. We may therefore view the optimization over ρ as the potential for improving over utilizing the secant condition. To simplify the notation, we let B̂_k = B_k(ρ̂_{k-1}). Then, application of the Sherman-Morrison formula on (4.4) gives

    B_k(ρ)^{-1} = B̂_k^{-1} + γ B̂_k^{-1} (g_k - g_{k-1})(g_k - g_{k-1})^T B̂_k^{-1}   (4.5)

for

    γ = -(ρ - ρ̂_{k-1}) / (1 + (ρ - ρ̂_{k-1})(g_k - g_{k-1})^T B̂_k^{-1} (g_k - g_{k-1})),   (4.6)

so that an explicit expression for p_k may be given as

    p_k = -B_k(ρ)^{-1} g_k = -B̂_k^{-1} g_k - γ ((g_k - g_{k-1})^T B̂_k^{-1} g_k) B̂_k^{-1} (g_k - g_{k-1}).
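The update (4.4) and its Sherman-Morrison inverse (4.5)-(4.6) can be checked numerically. The sketch below uses arbitrary illustrative vectors and values of ρ̂_{k-1} and ρ, not quantities from an actual iteration:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
p_prev = rng.standard_normal(n)       # stands in for p_{k-1}
y = rng.standard_normal(n)            # stands in for g_k - g_{k-1}
rho_hat, rho = 0.7, 1.3               # illustrative positive values

# B-hat of (3.1)/(4.1) at rho = rho_hat, and B_k(rho) via (4.4).
B_hat = np.eye(n) - np.outer(p_prev, p_prev) / (p_prev @ p_prev) \
        + rho_hat * np.outer(y, y)
B_rho = B_hat + (rho - rho_hat) * np.outer(y, y)       # (4.4)

# Sherman-Morrison inverse (4.5) with gamma from (4.6).
Bh_inv = np.linalg.inv(B_hat)
gamma = -(rho - rho_hat) / (1 + (rho - rho_hat) * (y @ Bh_inv @ y))
B_rho_inv = Bh_inv + gamma * (Bh_inv @ np.outer(y, y) @ Bh_inv)
assert np.allclose(B_rho_inv, np.linalg.inv(B_rho), atol=1e-8)
```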
Note that there is a one-to-one correspondence between γ and ρ, as (4.6) gives

    ρ - ρ̂_{k-1} = -γ / (1 + γ (g_k - g_{k-1})^T B̂_k^{-1} (g_k - g_{k-1})).   (4.7)

In addition, if B̂_k ≻ 0, then (4.6) and (4.7) show that B_k(ρ) ≻ 0 if and only if the equivalent conditions

    γ > -1 / ((g_k - g_{k-1})^T B̂_k^{-1} (g_k - g_{k-1}))   and   ρ - ρ̂_{k-1} > -1 / ((g_k - g_{k-1})^T B̂_k^{-1} (g_k - g_{k-1}))

hold. This is a consequence of these lower bounds defining an interval around B̂_k and B̂_k^{-1}, respectively, where B_k and B_k^{-1} are well defined. As we know that B_k(ρ) ≻ 0 if and only if ρ > 0, the lower bound on ρ - ρ̂_{k-1} implies

    ρ̂_{k-1} = 1 / ((g_k - g_{k-1})^T B̂_k^{-1} (g_k - g_{k-1})).   (4.8)

This result is verified in Appendix C. Consequently, we may simplify the one-to-one relationships between ρ and γ of (4.6) and (4.7) to

    γ = -(ρ - ρ̂_{k-1}) / (1 + (ρ - ρ̂_{k-1})/ρ̂_{k-1}) = -ρ̂_{k-1} + ρ̂_{k-1}^2 / ρ,   (4.9a)
    ρ = ρ̂_{k-1} - γ / (1 + γ/ρ̂_{k-1}) = ρ̂_{k-1}^2 / (ρ̂_{k-1} + γ),               (4.9b)

which are valid for ρ > 0 and γ > -ρ̂_{k-1}. Note that the lower bounds on ρ and γ do not depend on p_{k-1}, g_{k-1} or g_k.
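A quick numeric round-trip confirms that (4.9a) and (4.9b) are mutually inverse on their stated domains (the value of ρ̂_{k-1} below is an arbitrary positive number chosen for illustration):

```python
rho_hat = 0.8  # illustrative positive value of rho-hat_{k-1}

def gamma_of_rho(rho):            # (4.9a)
    return -rho_hat + rho_hat**2 / rho

def rho_of_gamma(gamma):          # (4.9b)
    return rho_hat**2 / (rho_hat + gamma)

# Round trip over rho > 0; the maps invert each other exactly.
for rho in (0.1, 0.8, 5.0):
    assert abs(rho_of_gamma(gamma_of_rho(rho)) - rho) < 1e-12
```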
Summarizing, we may formulate the simplified problem as

    minimize_{γ > -ρ̂_{k-1}}   Σ_{i=0}^{k-1} ((g_{i+1} - g_i)^T p_k)^2
    subject to   p_k = -B̂_k^{-1} g_k - γ ((g_k - g_{k-1})^T B̂_k^{-1} g_k) B̂_k^{-1} (g_k - g_{k-1}),   (DS)

where ρ̂_{k-1} is given by (3.2) and B̂_k = B_k(ρ̂_{k-1}) is given by (4.4) for ρ = ρ̂_{k-1}. Then (DS) is a convex constrained quadratic program if a tolerance is introduced for the strict lower bound on γ. For this problem, we may eliminate p_k to get one variable only, γ. Note the one-to-one correspondence given by (4.9b), which allows us to recover ρ from γ.
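Since p_k is affine in γ, eliminating p_k makes the objective of (DS) a one-variable convex quadratic, which can be minimized in closed form and clipped at the lower bound. The sketch below is our own helper with hypothetical names (D collects the residual vectors g_{i+1} - g_i as rows), not the paper's implementation:

```python
import numpy as np

def solve_DS(B_hat, g_k, g_prev, D, rho_hat, delta=1e-8):
    """Sketch of (DS): p_k = a + gamma*b is affine in gamma, so the
    objective sum((D @ p_k)**2) is a convex quadratic in gamma,
    minimized in closed form and clipped at gamma >= -rho_hat + delta."""
    y = g_k - g_prev
    a = -np.linalg.solve(B_hat, g_k)                       # gamma-free part
    b = -(y @ np.linalg.solve(B_hat, g_k)) * np.linalg.solve(B_hat, y)
    u, v = D @ a, D @ b                                    # residuals: u + gamma*v
    gamma = -(u @ v) / (v @ v) if v @ v > 0 else 0.0       # unconstrained min
    gamma = max(gamma, -rho_hat + delta)                   # enforce lower bound
    return gamma, a + gamma * b

# Illustrative data (not from an actual iteration).
rng = np.random.default_rng(4)
M = rng.standard_normal((4, 4))
B_hat = M @ M.T + 4 * np.eye(4)
g_k, g_prev = rng.standard_normal(4), rng.standard_normal(4)
D = rng.standard_normal((3, 4))
gamma, p_k = solve_DS(B_hat, g_k, g_prev, D, rho_hat=0.5)
```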
In the deterministic case, the parameter ρ̂_{k-1} is typically given by the secant condition α_{k-1} B_k(ρ̂_{k-1}) p_{k-1} = g_k - g_{k-1}, which in our setting with a strictly convex quadratic function and exact linesearch means that ρ̂_{k-1} is given by (3.2). This expression also applies to the BFGS update (B.1), so the discussion for the memoryless BFGS method and the BFGS method is the same. We also note that in the deterministic case, p_k is independent of the value of ρ̂_{k-1} as long as ρ̂_{k-1} is positive, due to (1.4a). For the stochastic case, the choice of ρ̂_{k-1} is less straightforward. It is essential that ρ̂_{k-1} is positive, to ensure that B̄_k(ρ̂_{k-1}) of (4.3) is positive definite for (ḡ_k - ḡ_{k-1})^T p_{k-1} ≠ 0. A secant condition of the form α_{k-1} B̄_k(ρ̂_{k-1}) p_{k-1} = ḡ_k - ḡ_{k-1} does not necessarily give ρ̂_{k-1} > 0, as we can ensure neither α_{k-1} > 0 nor (ḡ_k - ḡ_{k-1})^T p_{k-1} < 0. In this study, we have limited uncertainty to the search directions and allow exact linesearch, i.e., α_{k-1} is given by (1.1). We have chosen to set

    ρ̂_{k-1} = (p_{k-1}^T H p_{k-1}) / (ḡ_{k-1}^T p_{k-1})^2.   (4.10)

This choice is consistent with (3.2) in the deterministic case. In addition, as we compute p_k from B̄_k(ρ) p_k = -ḡ_k for ρ > 0, then assuming (ḡ_k - ḡ_{k-1})^T p_{k-1} ≠ 0, we have B̄_k(ρ) ≻ 0 for ρ > 0, so that ḡ_k^T p_k < 0. In this situation, considering iteration k-1, ρ̂_{k-1} defined by (4.10) is positive. Note that in the stochastic model, the particular value of ρ̂_{k-1} is not important, as the optimal value of γ, and thus ρ, is independent of ρ̂_{k-1}.
Analogous simplification and reformulation of (C) gives

    minimize_{γ > -ρ̂_{k-1}}   Σ_{i=0}^{k-1} t_i^2
    subject to   P( -t_i ≤ (ḡ_{i+1} - ḡ_i)^T p̃_k ≤ t_i,   i = 0, ..., k-2,
                    -t_{k-1} ≤ (g̃_k - ḡ_{k-1})^T p̃_k ≤ t_{k-1},              (CS)
                    p̃_k = -B̃_k^{-1} g̃_k - γ ((g̃_k - ḡ_{k-1})^T B̃_k^{-1} g̃_k) B̃_k^{-1} (g̃_k - ḡ_{k-1}) ) ≥ 1 - β,

where ρ̂_{k-1} is given by (4.10) and B̃_k = B̃_k(ρ̂_{k-1}) is given by (4.2) for ρ = ρ̂_{k-1}.
Chance-constrained models are in general non-convex and hard to solve [22]. However, different equivalent formulations can be applied to obtain an analytical solution or to approximate the chance constraints. It is often difficult to get an analytical solution, since this requires strict assumptions on the probability distribution of the random variables and the structure of the chance constraints, which makes the situation specific and not general enough. In contrast to the analytical solution, scenario approximation (SA) [19] and sample average approximation (SAA) [12, 20] are two general approaches that work well without needing many assumptions on the distribution of the stochastic variable; these will be applied to solve the chance-constrained model in the following sections.
4.3 Deterministic formulation based on sample average approximation
Model (CS) cannot be solved directly in its current state. As previously stated, an analytical solution would prove difficult to obtain without assumptions. Therefore, we propose an SAA approach, given its flexibility to work under any type of stochastic variables. The first step is to formulate a deterministic problem that approximates the solution of (CS). Let Ω be the set of samples, g_kω, ω ∈ Ω, be the i.i.d. noisy gradient samples, δ > 0 a sufficiently small real number, and 0 ≤ K < k an integer that represents the time window to be considered in the model. All samples have a probability of 1/|Ω|. In addition, let B_kω(ρ) be defined by

    B_kω(ρ) = I - (1/(p_{k-1}^T p_{k-1})) p_{k-1} p_{k-1}^T + ρ (g_kω - ḡ_{k-1})(g_kω - ḡ_{k-1})^T.   (4.11)
Then, a deterministic equivalent formulation of (CS) using sample average approximation (SAA) is as follows:

    minimize   Σ_{i=max(0,k-K)}^{k-1} t_i^2
    subject to   -t_i - M z_ω ≤ (ḡ_{i+1} - ḡ_i)^T p_kω ≤ t_i + M z_ω,   i ∈ I_K, ω ∈ Ω,
                 -t_{k-1} - M z_ω ≤ (g_kω - ḡ_{k-1})^T p_kω ≤ t_{k-1} + M z_ω,   ω ∈ Ω,
                 p_kω = -B_kω^{-1} g_kω - γ ((g_kω - ḡ_{k-1})^T B_kω^{-1} g_kω) B_kω^{-1} (g_kω - ḡ_{k-1}),   ω ∈ Ω,     (CSA)
                 Σ_{ω∈Ω} z_ω ≤ |Ω| β,
                 γ ≥ -ρ̂_{k-1} + δ,
                 z_ω ∈ {0, 1},   ω ∈ Ω,

where ρ̂_{k-1} is given by (4.10), B_kω = B_kω(ρ̂_{k-1}) is given by (4.11) for ρ = ρ̂_{k-1}, I_K = {max(0, k-K), ..., k-2}, M is a sufficiently large number, and K ≤ k indicates the number of gradients to be considered in the model. We then form ḡ_k = (1/|Ω⁺|) Σ_{ω∈Ω⁺} g_kω, where Ω⁺ = {ω ∈ Ω : z_ω = 0}, and B̄_k = B̄_k(ρ̂_{k-1}) of (4.3). Finally, p_k is obtained using the equation

    p_k = -(B̄_k^{-1} + γ B̄_k^{-1} (ḡ_k - ḡ_{k-1})(ḡ_k - ḡ_{k-1})^T B̄_k^{-1}) ḡ_k
        = -B̄_k^{-1} ḡ_k - γ ((ḡ_k - ḡ_{k-1})^T B̄_k^{-1} ḡ_k) B̄_k^{-1} (ḡ_k - ḡ_{k-1}).   (4.12)
If β > 0, then (CSA) is a mixed-integer program whose complexity is tied to the number of dimensions and samples used to solve the problem. Since new gradients are added at every iteration, the dimensionality of the problem grows at each step regardless. Moreover, to guarantee the quality of the SAA solution, the sample size should not be too small if the dimension is large. Therefore, the complexity of this approach grows at each step, leading to increasing solving times, which becomes an issue on long runs with low convergence speed. However, we can use a limited-memory or memoryless implementation to mitigate this effect.
If β = 0, then all binary variables must be set to zero and can be eliminated from the problem, creating a continuous linear program which is much simpler to solve. This is commonly referred to as the scenario approach, where all possible sampled scenarios of the random variables are considered. This also implies that the solution will be closely tied to the most conservative of the sampled scenarios.
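For β = 0, after eliminating p_kω and t_i the problem reduces to minimizing a convex piecewise-quadratic function of γ alone: each t_i must dominate the residual magnitude over all sampled scenarios. The sketch below illustrates this reduction with a simple grid search as a stand-in for the LP solve used in the paper; the function names and the grid are our own assumptions:

```python
import numpy as np

def scenario_gamma(residual_fns, gammas):
    """Sketch of the beta = 0 scenario approach.  With all z_omega = 0,
    the binding value of each t_i is the maximum residual magnitude over
    ALL sampled scenarios, so the objective becomes
    sum_i (max over scenarios)^2, a function of gamma alone, minimized
    here by grid search.  residual_fns[omega](gamma) returns the residual
    vector of scenario omega (one entry per constraint i)."""
    def F(gm):
        R = np.array([f(gm) for f in residual_fns])   # scenarios x constraints
        t = np.abs(R).max(axis=0)                     # binding t_i values
        return float(t @ t)
    vals = [F(gm) for gm in gammas]
    return gammas[int(np.argmin(vals))]

# Toy instance: two scenarios with affine residuals gm - 1 and gm + 1;
# the most conservative scenario always binds, and the minimizer is 0.
f1 = lambda gm: np.array([gm - 1.0])
f2 = lambda gm: np.array([gm + 1.0])
best = scenario_gamma([f1, f2], np.linspace(-2.0, 2.0, 401))
assert abs(best) < 0.02
```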
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
5 Computational results for the quasi-Newton methods
Two sets of results are presented. First, we consider a set of randomly generated problems intended to illustrate the properties and methodologies proposed in this paper. The second set consists of real-life instances from the CUTEst test set [9], used to test the applicability of these methods in a more realistic environment. A comparison of the results is provided using different models and/or approximation formulations, and we discuss the practical implications of each method. All models are implemented in Python 3.7.10, using Gurobi 9.1 as the solver for the resulting optimization problems; all computations were performed on an Intel(R) i7 @ 2.7 GHz with 16 GB of memory under macOS 10.
For every experiment, we applied the following algorithm. At iteration k, tol denotes the tolerance threshold on the norm of the gradient of the solution obtained, ḡ_k is the average value of the sampled gradients g_{kω}, K ≤ k denotes the number of gradients to be used in the calculations, and MaxK is the maximum number of steps. When calculating the step length α_k, an exact line search is used with the value of the gradient without random noise (the deterministic value of g_k), as the objective is to isolate the effect of each method on finding a descent direction. It is possible to encounter a matrix B_k that is not positive definite as a result of the random samples obtained at an iteration. If this happens, we skip the current iteration without taking a step in any direction, and continue to the next one.
The algorithm is given in Algorithm 1.
All experiments are repeated 30 times with different random number generator seeds, using 20 samples for the uncertainty set Ω_k at each step k, drawn from a normal distribution with μ = 0 and fixed σ²; the performance profiles therefore also separate each method and experiment by seed. Other ways of updating Ω_k could be used, e.g., reusing information from older iterations, but we draw new samples at every iteration. Finally, we used a maximum number of steps MaxK = 500.
The methods chosen in our experiments are the following:
– BFGS, as presented in (B.2), with ρ̂_{k−1} given by (4.10):
  B_k = B_{k−1} + (1/(ḡ_{k−1}^T p_{k−1})) ḡ_{k−1} ḡ_{k−1}^T + ρ̂_{k−1} (ḡ_k − ḡ_{k−1})(ḡ_k − ḡ_{k−1})^T.
– Memoryless BFGS (ml-BFGS), as presented in (3.1), with ρ̂_{k−1} given by (4.10):
  B_k = I − (1/(p_{k−1}^T p_{k−1})) p_{k−1} p_{k−1}^T + ρ̂_{k−1} (ḡ_k − ḡ_{k−1})(ḡ_k − ḡ_{k−1})^T.
– Chance-Constrained Quasi-Newton (CCQN β). The search direction is obtained by first solving (CSA) for K = k and β, and then obtaining p_k from (4.12).
– Limited-Memory Chance-Constrained Quasi-Newton (lm-CCQN β). Same as CCQN, but with 0 < K < k. For our experiments, we used K = 10.
– Memoryless Chance-Constrained Quasi-Newton (ml-CCQN β). Same as CCQN, but with K = 1.
Algorithm 1 General solving algorithm
1: k ← 0;
2: x_k ← initial point;
3: Ω_k obtained;
4: g_k ← H x_k + c;
5: g_{kω} ← g_k + ω, ∀ω ∈ Ω_k;
6: ḡ_k ← (1/|Ω_k|) Σ_{ω∈Ω_k} g_{kω};
7: while ‖ḡ_k‖₂ > tol and k ≤ MaxK do
8:    if method = CCQN then
9:        γ ← solution to (CSA) using K gradients;
10:       if β > 0 then
11:           Ω⁺ ← {ω ∈ Ω_k : z_ω = 0};
12:           ḡ_k ← (1/|Ω⁺|) Σ_{ω∈Ω⁺} g_{kω};
13:       end if
14:   end if
15:   B_k ← from method;
16:   if B_k is positive definite then
17:       p_k ← solution to B_k p_k = −ḡ_k;
18:       α_k ← −g_k^T p_k / (p_k^T H p_k);
19:       x_{k+1} ← x_k + α_k p_k;
20:   end if
21:   k ← k + 1;
22:   Ω_k obtained;
23:   g_k ← H x_k + c;
24:   g_{kω} ← g_k + ω, ∀ω ∈ Ω_k;
25:   ḡ_k ← (1/|Ω_k|) Σ_{ω∈Ω_k} g_{kω};
26: end while
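A minimal Python sketch of Algorithm 1 for the ml-BFGS choice of B_k on a synthetic quadratic. The stand-in ρ̂ = 1/((ḡ_k − ḡ_{k−1})^T p_{k−1}) is used in place of the ρ̂_{k−1} of (4.10), which is not reproduced here; the problem data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, n_samples, tol, max_k = 20, 1e-6, 20, 1e-2, 500

A = rng.standard_normal((n, n))
H = A.T @ A + np.eye(n)                      # s.p.d. Hessian of the quadratic
c = rng.standard_normal(n)
x = np.zeros(n)

def sampled_gradient(x):
    # average of |Omega_k| noisy gradient samples, N(0, sigma2) noise
    noise = rng.normal(0.0, np.sqrt(sigma2), (n_samples, n))
    return H @ x + c + noise.mean(axis=0)

k, p_prev, gbar_prev = 0, None, None
gbar = sampled_gradient(x)
while np.linalg.norm(gbar) > tol and k <= max_k:
    if p_prev is None:
        B = np.eye(n)                        # first iteration: B_0 = I
    else:
        y = gbar - gbar_prev                 # difference of averaged gradients
        rho = 1.0 / (y @ p_prev)             # stand-in for rho-hat of (4.10)
        B = (np.eye(n)
             - np.outer(p_prev, p_prev) / (p_prev @ p_prev)
             + rho * np.outer(y, y))         # memoryless quasi-Newton matrix
    if np.all(np.linalg.eigvalsh(B) > 0):    # skip the step if B_k is not s.p.d.
        p = np.linalg.solve(B, -gbar)
        g_det = H @ x + c                    # noise-free gradient for the line search
        alpha = -(g_det @ p) / (p @ H @ p)   # exact line search step
        x = x + alpha * p
        p_prev, gbar_prev = p, gbar
    k += 1
    gbar = sampled_gradient(x)
```

With this small noise level the loop drives the averaged gradient norm below tol well before MaxK, in line with the CG-like behavior of the memoryless update.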
– Conjugate Gradients (CG): the symmetric CG matrix as presented in (3.3),
  B_k = (I − (1/(ḡ_{k−1}^T p_{k−1})) ḡ_k p_{k−1}^T)(I − (1/(ḡ_{k−1}^T p_{k−1})) p_{k−1} ḡ_k^T).
– Steepest Descent (SD):
  B_k = I.
The methods can be characterized in different ways: two traditional quasi-Newton methods, BFGS and ml-BFGS, and three stochastic quasi-Newton methods, CCQN β, lm-CCQN β and ml-CCQN β. In addition, we have included CG, which is not a traditional quasi-Newton method, but with B_k given by the recursion of the method of conjugate gradients. All these methods have identical behavior in exact arithmetic. Finally, we have included SD, to see the behavior of the method of steepest descent. We expect SD to be inferior to the other methods, but it still serves as a reference for what happens in the case of larger errors.
5.1 Results for randomly created problems
The first experiment is a set of randomly generated unconstrained quadratic problems. For each problem, the Hessian matrix H is defined as H = Q^T Q + ε diag(U_{1,n}), where Q = a J_{n,n} + (b − a) U_{n,n}, a, b ∈ R, J_{n,m} is the unit n × m matrix, U_{n,m} is an n × m matrix whose components are randomly generated from a Uniform(0,1) distribution, and ε > 0 is a sufficiently small number. The vector c is randomly generated as c = U_{n,1}. In our experiments, we set a = −1, b = 1, n = 100 and ε = 0.3. For these random test problems, small values of the random noise variance, i.e., σ² < 10⁻⁶, caused numerical issues for CCQN when solved with Gurobi. We have therefore used only lm-CCQN and ml-CCQN, and not CCQN, for these problems.
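A sketch of this construction in Python, assuming the small positive number ε scales the diagonal term, i.e. H = Q^T Q + ε diag(U_{1,n}); Q^T Q is positive semidefinite and the diagonal term is positive, so H is symmetric positive definite.

```python
import numpy as np

rng = np.random.default_rng(0)
n, a, b, eps = 100, -1.0, 1.0, 0.3

U = rng.random((n, n))                       # entries ~ Uniform(0, 1)
Q = a * np.ones((n, n)) + (b - a) * U        # Q = a*J + (b - a)*U, entries in (a, b)
H = Q.T @ Q + eps * np.diag(rng.random(n))   # H = Q^T Q + eps * diag(U_{1,n})
c = rng.random(n)                            # c = U_{n,1}

assert np.all(np.linalg.eigvalsh(H) > 0)     # H is positive definite
```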
To compare the methods on the randomly generated problems, we give performance profiles [5] for the methods used, with the number of iterations as the performance measure. The measure is defined as follows. Let K_{m,p} be the number of iterations required by method m ∈ M on problem p ∈ P, where it is set to infinity if the method fails to converge within the maximum number of iterations allowed. For a method m, the measure P_m(τ) is then defined as

P_m(τ) = |{p ∈ P : r_{m,p} ≤ τ}| / |P|,   where r_{m,p} = K_{m,p} / min_{m∈M} {K_{m,p}}.

This means that P_m(τ) is the fraction of problems that method m solves within a factor τ of the minimum number of iterations for each problem. We can then plot P_m(τ) for τ ≥ 1 for each method m to obtain a comparison.
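As an illustration, P_m(τ) can be computed in a few lines of Python; the iteration counts below are hypothetical, with np.inf marking a failed run.

```python
import numpy as np

# hypothetical iteration counts K[m][p] for 3 methods on 3 problems;
# np.inf marks a run that did not converge within the iteration limit
K = {
    "BFGS":    [10, 20, np.inf],
    "ml-BFGS": [12, 18, 40],
    "SD":      [50, 90, 60],
}

def profile(K, method, tau):
    """Fraction of problems the method solves within tau times the best count."""
    counts = np.array(list(K.values()))      # |M| x |P| matrix of counts
    best = counts.min(axis=0)                # per-problem minimum over methods
    r = np.array(K[method]) / best           # performance ratios r_{m,p}
    return np.mean(r <= tau)

print(profile(K, "ml-BFGS", 1.0))   # 2/3: fastest on two of the three problems
```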
Since random noise distorts the gradient norms once the run reaches a certain point, a set of performance profiles is created for tolerance values close to the random noise variance; i.e., if the variance is σ² = 10⁻², then the performance profiles are given for tolerances such as 10⁻¹ and 10⁻². A higher tolerance does not show significant differences from the deterministic case, and a lower one can cause the methods not to reach the threshold, so these cases are not shown.
Figure 1 shows the behavior of each method for the two selected random noise levels. For σ² = 10⁻⁶, BFGS has the best performance. The other quasi-Newton methods have very similar performance, slightly behind BFGS. CG is slightly behind the quasi-Newton methods, whereas SD, as expected, does not perform well. For the higher random noise level, σ² = 10⁻², all quasi-Newton methods have similar behavior and perform best. Again, CG is slightly behind and SD does not perform well. It is interesting to see the similar performance of all quasi-Newton methods, traditional as well as stochastic, and memoryless as well as full-memory methods.
In this experiment, the chosen value of MaxK was not large enough for SD to show convergence, as seen in Fig. 1; however, we ran the same experiment for this method with a larger value, showing that the average log norm of the gradient found by SD can eventually pass the tol = σ² barrier, similar to CCQN, lm-CCQN and ml-BFGS.
Figure 2 shows the performance profiles of each method for two random noise variance levels under two tolerance thresholds. BFGS consistently has the best performance of all methods. The superiority is less distinct for the high random noise level and high accuracy. All other quasi-Newton methods behave quite similarly. CG
Fig. 1 Average log norm of the gradient at step k for each tested method with different random noise variances for the randomly generated test problems
Fig. 2 Performance profiles for different tolerances and random noise variance levels for the randomly generated test problems
Fig. 3 Performance profile of the minimum gradient norm for different random noise variance levels for
the randomly generated test problems
is comparable to these quasi-Newton methods for the lower accuracy but is inferior for the higher accuracy. SD consistently performs poorly compared to the other methods.
Figure 3 shows the performance profiles of the minimum gradient norm found on the set of problems with different seeds for two random noise variance levels. When σ² = 10⁻⁶, we observe that BFGS obtains the minimum value consistently, i.e., for every problem and every seed, followed by the other quasi-Newton methods, which behave similarly. SD shows inferior performance and CG performs poorly.
5.2 Results for CUTEst problems
In our experiments, we compare the performance of the approaches presented in the previous section on a subset of problems from the CUTEst test set [9]: specifically, problems that are quadratic, unconstrained and with the number of variables selectable by the user (QUV in the CUTEst classification system). However, only 6 problem families fall into this category; we therefore added a second batch of problems, consisting of those unconstrained sum-of-squares problems (SUV in the CUTEst classification system) which have a positive definite Hessian at the starting point, and kept that Hessian constant throughout the solution process. This brought the total number of problems to 20. Most of these problems were solved with the parameter N set to 100 (i.e., there are 100 variables in the generated problem); for the problems that did not accept this parameter, we selected the closest value to 100 (generally 50). The performance profiles of the different methods on the random test problems, as presented in Sect. 5.1, did not show significant differences between the two variance levels 10⁻² and 10⁻⁶. Therefore, for the CUTEst test problems, we only consider the random noise variance 10⁻². Similarly, the behavior of lm-CCQN and ml-CCQN with β = 0 and β = 0.1 was very similar. The parameter β is therefore set to 0 for lm-CCQN, ml-CCQN and CCQN, so that only linear programs need to be solved.
Table 1 shows the average number of iterations that each method needed to reach the tolerance 10⁻² on the 20 CUTEst problems. Each row presents the average number of iterations needed to solve the corresponding problem with each method. The minimal
Table 1 Average number of iterations needed to reach the tolerance level 10⁻² on the CUTEst test problems
BFGS CG lm-CCQN ml-BFGS ml-CCQN CCQN SD
ARGLINA 12 28 111 115
BDQRTIC 208 50 62 53 57 74
CHAINWOO 268 370 104 84 86 114
CHNROSNB 90 47 39 38 38 40 310
CHNRSNBM 79 63 36 36 35 37 123
DIXON3DQ 120 ––
DQDRTIC 188 397 111 25 24 110
ERRINROS 73 415 – – – –
ERRINRSM 60 362 – – – –
EXTROSNB 231 90 30 24 23 31 158
HILBERTB 55755 76
INTEQNELS 14 40 13 13 13 13 8
LIARWHD 166 43 14 14 14 14 30
MOREBV 153 ––
PENALTY1 151 33 13 13 13 13 6
PENALTY2 208 111 75 75 190
SROSENBR 273 381 66 64 64 67 152
TQUARTIC 15 22 45 45 42 45 310
TRIDIA 179 324 115 130 115 142
WOODS 319 495 133 116 115 132
average number of iterations needed for each problem over all solution approaches is marked in bold. The columns list the average number of iterations for each method. The symbol "–" indicates that no convergence was obtained within 500 iterations for the tolerance 10⁻².
Table 1 indicates that, with respect to the measure chosen, the memoryless methods, i.e., ml-BFGS and ml-CCQN, perform well. They have similar performance, with a slight edge for ml-CCQN, which attains the minimal average number of iterations on the largest number of problems, eleven. Both are unable to solve the same four problems within the number of iterations allowed. The BFGS method shows the most robust behavior, solving all but one problem, followed by CG, which solves all but two problems.
It is interesting to compare the performance on the CUTEst problems with that on the random problems. The consistently superior performance of BFGS on the random problems is not present on the CUTEst problems. It is particularly interesting that the memoryless methods are the ones with the best performance. We also note that SD performs worst in the test, as expected, and fails to converge within 500 iterations on half of the problems. For two problems, however, SD obtained the minimal average number of iterations. The methods CG, lm-CCQN and CCQN perform about the same on these 20 CUTEst problems, each attaining the minimal average number of iterations on two or three problems.
6 Conclusion
We have studied quasi-Newton methods for minimizing a strongly convex quadratic function in a noisy framework. We have considered a memoryless BFGS method and compared it to a BFGS method, the method of conjugate gradients and steepest descent. In order to potentially improve the performance of the low-rank quasi-Newton method, a chance-constrained stochastic optimization model has also been formulated. The secant condition is here replaced by the solution of a one-dimensional convex quadratic programming problem. The proposed chance-constrained model, which can be solved effectively by the sample average approximation method or the scenario approach, has been proven to provide a descent search direction with high probability in the random noisy framework, whereas the deterministic model may fail to provide a descent direction if the random noise level is large.
In the numerical experiments, we have compared the methods in a noisy setting on a set of random test problems and on a set of CUTEst test problems. The BFGS method consistently performed best on the random test problems. This may not be surprising, given that the BFGS method builds an exact model of the Hessian in the exact-arithmetic case. For the CUTEst test problems, the memoryless methods show the best performance, with a slight edge for the memoryless chance-constrained quasi-Newton method. It is very interesting to observe that the memoryless methods, with their much simpler Hessian approximation, perform so well. It is also interesting to see that a chance-constrained model might improve the performance further.
Finally, our intention has been to investigate the behavior of, and the interplay between quality and robustness in, the low-rank quasi-Newton method, especially in the case of large random noise and multiple copies of gradients. Both the theoretical and numerical results show that the chance-constrained model may improve the robustness and accuracy of the computed solution, although the computational cost can be high. This shows potential for further consideration and exploration in convex optimization problems.
A. Average log norms of gradients of the CUTEst problems
In this section, the average log norm of the gradient for each CUTEst problem is presented. The objective is to make evident the differences between problems, as discussed in Sect. 5.2. All of these results are presented for σ² = 10⁻².
See Figs. 4, 5, 6, 7, 8, 9, 10, 11, 12, 13.
Fig. 4 Average log norm of the gradient at step k for each tested method
Fig. 5 Average log norm of the gradient at step k for each tested method
Fig. 6 Average log norm of the gradient at step k for each tested method
Fig. 7 Average log norm of the gradient at step k for each tested method
Fig. 8 Average log norm of the gradient at step k for each tested method
Fig. 9 Average log norm of the gradient at step k for each tested method
Fig. 10 Average log norm of the gradient at step k for each tested method
Fig. 11 Average log norm of the gradient at step k for each tested method
Fig. 12 Average log norm of the gradient at step k for each tested method
Fig. 13 Average log norm of the gradient at step k for each tested method
B. The exact linesearch BFGS method on a quadratic function
As mentioned in Sect. 1, a well-known method for which B_k gives p_k satisfying (1.4) is the BFGS quasi-Newton method with B_0 = I. In the BFGS method, B_0 = I and

B_k = B_{k−1} + (1/(g_{k−1}^T p_{k−1})) g_{k−1} g_{k−1}^T + (1/(α_{k−1} (g_k − g_{k−1})^T p_{k−1})) (g_k − g_{k−1})(g_k − g_{k−1})^T,   k = 1, ..., r.   (B.1)
Expansion gives

B_k = B_{k−1} + (1/(g_{k−1}^T p_{k−1})) g_{k−1} g_{k−1}^T + (1/(α_{k−1} (g_k − g_{k−1})^T p_{k−1})) (g_k − g_{k−1})(g_k − g_{k−1})^T
    = I + Σ_{i=0}^{k−1} (1/(g_i^T p_i)) g_i g_i^T + Σ_{i=0}^{k−1} (1/(α_i (g_{i+1} − g_i)^T p_i)) (g_{i+1} − g_i)(g_{i+1} − g_i)^T.   (B.2)
For the case of exact linesearch, the BFGS matrix of (B.2) takes the form

B_k = B_{k−1} + (1/(g_{k−1}^T p_{k−1})) g_{k−1} g_{k−1}^T − (1/(α_{k−1} g_{k−1}^T p_{k−1})) (g_k − g_{k−1})(g_k − g_{k−1})^T
    = I + Σ_{i=0}^{k−1} (1/(g_i^T p_i)) g_i g_i^T − Σ_{i=0}^{k−1} (1/(α_i g_i^T p_i)) (g_{i+1} − g_i)(g_{i+1} − g_i)^T.   (B.3)
If, in addition, the objective function is quadratic, then the BFGS matrix of (B.3) may be written as

B_k = B_{k−1} − (1/(g_{k−1}^T g_{k−1})) g_{k−1} g_{k−1}^T + (1/(p_{k−1}^T H p_{k−1})) H p_{k−1} p_{k−1}^T H
    = I − Σ_{i=0}^{k−1} (1/(g_i^T g_i)) g_i g_i^T + Σ_{i=0}^{k−1} (1/(p_i^T H p_i)) H p_i p_i^T H,   (B.4)
see, e.g., [7]. If n steps are taken, then with P_n = (p_0 p_1 ··· p_{n−1}), it follows that P_n is square and nonsingular, so that

H = H H^{−1} H = H P_n P_n^{−1} H^{−1} P_n^{−T} P_n^T H
  = H P_n (P_n^T H P_n)^{−1} P_n^T H = Σ_{i=0}^{n−1} (1/(p_i^T H p_i)) H p_i p_i^T H,   (B.5)
where the conjugacy of the p_i is a consequence of (1.4). In this situation, (B.4) may therefore be seen as a dynamic way of generating the true Hessian in n steps, if the method does not converge early, since B_n = H by

I − Σ_{i=0}^{n−1} (1/(g_i^T g_i)) g_i g_i^T = 0   and   Σ_{i=0}^{n−1} (1/(p_i^T H p_i)) H p_i p_i^T H = H.

This is a consequence of the orthogonal gradients then spanning the whole space, in combination with (B.5). Consequently, the BFGS method may be viewed as the "ideal" update that dynamically transforms B_k from I to H in n steps. Identity curvature is transformed to H-curvature in one new dimension at each step, and this curvature information is maintained throughout.
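This finite-step behavior is easy to observe numerically. The sketch below runs the exact-linesearch BFGS recursion (B.4) on a small random quadratic (the problem data are arbitrary stand-ins) and recovers B_n = H after n steps, along with finite termination.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))
H = A.T @ A + np.eye(n)          # symmetric positive definite Hessian
c = rng.standard_normal(n)
x = np.zeros(n)

B = np.eye(n)                    # B_0 = I
g = H @ x + c
for _ in range(n):
    p = np.linalg.solve(B, -g)           # B_k p_k = -g_k
    alpha = -(g @ p) / (p @ H @ p)       # exact line search on the quadratic
    x = x + alpha * p
    g_next = H @ x + c
    Hp = H @ p
    # quadratic exact-linesearch BFGS update, cf. (B.4)
    B = B - np.outer(g, g) / (g @ g) + np.outer(Hp, Hp) / (p @ Hp)
    g = g_next

assert np.allclose(B, H)                 # B_n = H: the true Hessian is recovered
assert np.linalg.norm(g) < 1e-8          # finite termination in n steps
```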
Finally, to see that the BFGS method in our setting gives p_k that satisfies (1.4), see, e.g., [7, 8, 18]. Alternatively, it is straightforward to verify that B_k p_k = −g_k, with B_k given by (B.3) and p_k given by (1.2), as

(B_{k−1} + (1/(g_{k−1}^T p_{k−1})) g_{k−1} g_{k−1}^T)(−g_k + (g_k^T g_k / (g_{k−1}^T g_{k−1})) p_{k−1}) = −g_k,
−(1/(α_{k−1} g_{k−1}^T p_{k−1})) (g_k − g_{k−1})(g_k − g_{k−1})^T (−g_k + (g_k^T g_k / (g_{k−1}^T g_{k−1})) p_{k−1}) = 0,

taking into account orthogonality of the gradients, B_{k−1} g_k = g_k and B_{k−1} p_{k−1} = −g_{k−1}.
C. Verification of an inverse curvature relationship
The low-rank matrix of (4.1) takes the form

B_k(ρ) = I − (1/(p_{k−1}^T p_{k−1})) p_{k−1} p_{k−1}^T + ρ (g_k − g_{k−1})(g_k − g_{k−1})^T,

and it is positive definite if ρ > 0 and (g_k − g_{k−1})^T p_{k−1} ≠ 0. In this situation, the following lemma shows that (4.8) holds, i.e.,

1/((g_k − g_{k−1})^T B_k(ρ̂_{k−1})^{−1} (g_k − g_{k−1})) = ρ̂_{k−1},

by identifying p = p_{k−1}, y = g_k − g_{k−1} and ρ = ρ̂_{k−1}.
Lemma C.1 For p ∈ R^n, p ≠ 0, and y ∈ R^n, y ≠ 0, that satisfy y^T p ≠ 0, and ρ > 0, let B_ρ be defined by

B_ρ = I − (1/(p^T p)) p p^T + ρ y y^T.

Then,

y^T B_ρ^{−1} y = 1/ρ.
Proof Let M be defined by

M = [ B_ρ   y
      y^T   1/ρ ].

Then, Sylvester's law of inertia gives

In(M) = In(1/ρ) + In(B_ρ − ρ y y^T),   (C.1)

where for M symmetric, the inertia of M, written In(M) = (i₊(M), i₋(M), i₀(M)), denotes the triple of positive, negative and zero eigenvalues of M, see, e.g., [10, Theorem 8.1.7]. We have

B_ρ − ρ y y^T = I − (1/(p^T p)) p p^T,   (C.2)

so that B_ρ − ρ y y^T is positive semidefinite with n − 1 unit eigenvalues and one zero eigenvalue, i.e., In(B_ρ − ρ y y^T) = (n − 1, 0, 1). Therefore, (C.1) gives

In(M) = (1, 0, 0) + (n − 1, 0, 1) = (n, 0, 1),   (C.3)
i.e., M is positive semidefinite with one zero eigenvalue.
Analogously, Sylvester's law of inertia gives

In(M) = In(B_ρ) + In(1/ρ − y^T B_ρ^{−1} y).   (C.4)

We have B_ρ ⪰ 0, since by (C.2), B_ρ − ρ y y^T is positive semidefinite with p spanning its nullspace. As ρ > 0 and y^T p ≠ 0, it follows that B_ρ ≻ 0. Then, a combination of (C.3) and (C.4) gives

(n, 0, 1) = (n, 0, 0) + In(1/ρ − y^T B_ρ^{−1} y),

so that 1/ρ − y^T B_ρ^{−1} y = 0, completing the proof.
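Lemma C.1 is also easy to check numerically; the vectors and ρ below are arbitrary stand-ins satisfying the assumptions of the lemma.

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho = 5, 0.7
p = rng.standard_normal(n)
y = rng.standard_normal(n)
assert abs(y @ p) > 1e-6          # y^T p != 0 holds for generic random vectors

# B_rho = I - p p^T / (p^T p) + rho * y y^T
B = np.eye(n) - np.outer(p, p) / (p @ p) + rho * np.outer(y, y)
val = y @ np.linalg.solve(B, y)   # y^T B_rho^{-1} y
assert np.isclose(val, 1.0 / rho) # equals 1/rho, as the lemma states
```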
Funding Open access funding provided by Royal Institute of Technology.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which
permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence,
and indicate if changes were made. The images or other third party material in this article are included
in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If
material is not included in the article’s Creative Commons licence and your intended use is not permitted
by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the
copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
References
1. Ahmed, S., Xie, W.: Relaxations and approximations of chance constraints under finite distributions. Math. Program. 170(1), 43–65 (2018). https://doi.org/10.1007/s10107-018-1295-z
2. Barrera, J., Homem-de-Mello, T., Moreno, E., Pagnoncelli, B.K., Canessa, G.: Chance-constrained problems and rare events: an importance sampling approach. Math. Program. 157(1), 153–189 (2016). https://doi.org/10.1007/s10107-015-0942-x
3. Byrd, R.H., Hansen, S.L., Nocedal, J., Singer, Y.: A stochastic quasi-Newton method for large-scale optimization. SIAM J. Optim. 26(2), 1008–1031 (2016). https://doi.org/10.1137/140954362
4. Berahas, A.S., Nocedal, J., Takac, M.: A multi-batch L-BFGS method for machine learning. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems, 29 (NIPS 2016), vol. 29 (2016)
5. Dolan, E.D., Moré, J.J.: Benchmarking optimization software with performance profiles. Math. Program. 91(2), 201–213 (2002). https://doi.org/10.1007/s101070100263
6. Dennis, J.E., Walker, H.F.: Inaccuracy in quasi-Newton methods: local improvement theorems. Math. Program. Study 22, 70–85 (1984)
7. Ek, D., Forsgren, A.: Exact linesearch limited-memory quasi-Newton methods for minimizing a quadratic function. Comput. Optim. Appl. 79(3), 789–816 (2021). https://doi.org/10.1007/s10589-021-00277-4
8. Forsgren, A., Odland, T.: On exact linesearch quasi-Newton methods for minimizing a quadratic function. Comput. Optim. Appl. 69(1), 225–241 (2018). https://doi.org/10.1007/s10589-017-9940-7
9. Gould, N.I.M., Orban, D., Toint, P.L.: CUTEst: a constrained and unconstrained testing environment with safe threads for mathematical optimization. Comput. Optim. Appl. 60(3), 545–557 (2015). https://doi.org/10.1007/s10589-014-9687-3
10. Golub, G.H., Van Loan, C.F.: Matrix Computations, 3rd edn. The Johns Hopkins University Press, Baltimore (1996)
11. Irwin, B., Haber, E.: Secant penalized BFGS: a noise robust quasi-Newton method via penalizing the secant condition. Comput. Optim. Appl. 84(3), 651–702 (2023). https://doi.org/10.1007/s10589-022-00448-x
12. Luedtke, J., Ahmed, S.: A sample approximation approach for optimization with probabilistic constraints. SIAM J. Optim. 19(2), 674–699 (2008). https://doi.org/10.1137/070702928
13. Lucchi, A., McWilliams, B., Hofmann, T.: A variance reduced stochastic Newton method. Preprint arXiv:1503.08316 [cs.LG] (2015)
14. Liu, D.C., Nocedal, J.: On the limited memory BFGS method for large scale optimization. Math. Program. 45, 503–528 (1989)
15. Moritz, P., Nishihara, R., Jordan, M.: A linearly-convergent stochastic L-BFGS algorithm. In: Artificial Intelligence and Statistics, pp. 249–258. PMLR (2016)
16. Mokhtari, A., Ribeiro, A.: Regularized stochastic BFGS algorithm. In: 2013 IEEE Global Conference on Signal and Information Processing, pp. 1109–1112. IEEE (2013)
17. Mokhtari, A., Ribeiro, A.: Global convergence of online limited memory BFGS. J. Mach. Learn. Res. 16(1), 3151–3181 (2015)
18. Nazareth, L.: A relationship between the BFGS and conjugate gradient algorithms and its implications for new algorithms. SIAM J. Numer. Anal. 16, 794–800 (1979)
19. Nemirovski, A., Shapiro, A.: Scenario approximations of chance constraints. In: Probabilistic and Randomized Methods for Design Under Uncertainty, pp. 3–47 (2006)
20. Pagnoncelli, B.K., Ahmed, S., Shapiro, A.: Sample average approximation method for chance constrained programming: theory and applications. J. Optim. Theory Appl. 142(2), 399–416 (2009). https://doi.org/10.1007/s10957-009-9523-6
21. Saad, Y.: Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, Philadelphia (2003)
22. Shapiro, A., Dentcheva, D., Ruszczyński, A.: Lectures on Stochastic Programming: Modeling and Theory. SIAM, Philadelphia (2014)
23. Shi, H.-J.M., Xie, Y., Byrd, R., Nocedal, J.: A noise-tolerant quasi-Newton algorithm for unconstrained optimization. SIAM J. Optim. 32(1), 29–55 (2022). https://doi.org/10.1137/20M1373190
24. Xie, Y., Byrd, R.H., Nocedal, J.: Analysis of the BFGS method with errors. SIAM J. Optim. 30(1), 182–209 (2020). https://doi.org/10.1137/19M1240794
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.
123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
1.
2.
3.
4.
5.
6.
Terms and Conditions
Springer Nature journal content, brought to you courtesy of Springer Nature Customer Service Center
GmbH (“Springer Nature”).
Springer Nature supports a reasonable amount of sharing of research papers by authors, subscribers
and authorised users (“Users”), for small-scale personal, non-commercial use provided that all
copyright, trade and service marks and other proprietary notices are maintained. By accessing,
sharing, receiving or otherwise using the Springer Nature journal content you agree to these terms of
use (“Terms”). For these purposes, Springer Nature considers academic use (by researchers and
students) to be non-commercial.
These Terms are supplementary and will apply in addition to any applicable website terms and
conditions, a relevant site licence or a personal subscription. These Terms will prevail over any
conflict or ambiguity with regards to the relevant terms, a site licence or a personal subscription (to
the extent of the conflict or ambiguity only). For Creative Commons-licensed articles, the terms of
the Creative Commons license used will apply.
We collect and use personal data to provide access to the Springer Nature journal content. We may
also use these personal data internally within ResearchGate and Springer Nature and as agreed share
it, in an anonymised way, for purposes of tracking, analysis and reporting. We will not otherwise
disclose your personal data outside the ResearchGate or the Springer Nature group of companies
unless we have your permission as detailed in the Privacy Policy.
While Users may use the Springer Nature journal content for small scale, personal non-commercial
use, it is important to note that Users may not:
use such content for the purpose of providing other users with access on a regular or large scale
basis or as a means to circumvent access control;
use such content where to do so would be considered a criminal or statutory offence in any
jurisdiction, or gives rise to civil liability, or is otherwise unlawful;
falsely or misleadingly imply or suggest endorsement, approval , sponsorship, or association
unless explicitly agreed to by Springer Nature in writing;
use bots or other automated methods to access the content or redirect messages
override any security feature or exclusionary protocol; or
share the content in order to create substitute for Springer Nature products or services or a
systematic database of Springer Nature journal content.
In line with the restriction against commercial use, Springer Nature does not permit the creation of a
product or service that creates revenue, royalties, rent or income from our content or its inclusion as
part of a paid for service or for other commercial gain. Springer Nature journal content cannot be
used for inter-library loans and librarians may not upload Springer Nature journal content on a large
scale into their, or any other, institutional repository.
These terms of use are reviewed regularly and may be amended at any time. Springer Nature is not
obligated to publish any information or content on this website and may remove it or features or
functionality at our sole discretion, at any time with or without notice. Springer Nature may revoke
this licence to you at any time and remove access to any copies of the Springer Nature journal content
which have been saved.
To the fullest extent permitted by law, Springer Nature makes no warranties, representations or
guarantees to Users, either express or implied with respect to the Springer nature journal content and
all parties disclaim and waive any implied warranties or warranties imposed by law, including
merchantability or fitness for any particular purpose.
Please note that these rights do not automatically extend to content, data or other material published
by Springer Nature that may be licensed from third parties.
If you would like to use or distribute our Springer Nature journal content to a wider audience or on a
regular basis or in any other manner not expressly permitted by these Terms, please contact Springer
Nature at
onlineservice@springernature.com
ResearchGate has not been able to resolve any citations for this publication.
Article
Full-text available
In this paper, we introduce a new variant of the BFGS method designed to perform well when gradient measurements are corrupted by noise. We show that treating the secant condition with a penalty method approach motivated by regularized least squares estimation generates a parametric family with the original BFGS update at one extreme and not updating the inverse Hessian approximation at the other extreme. Furthermore, we find the curvature condition is relaxed as the family moves towards not updating the inverse Hessian approximation, and disappears entirely at the extreme where the inverse Hessian approximation is not updated. These developments allow us to develop a method we refer to as Secant Penalized BFGS (SP-BFGS) that allows one to relax the secant condition based on the amount of noise in the gradient measurements. SP-BFGS provides a means of incrementally updating the new inverse Hessian approximation with a controlled amount of bias towards the previous inverse Hessian approximation, which allows one to replace the overwriting nature of the original BFGS update with an averaging nature that resists the destructive effects of noise and can cope with negative curvature measurements. We discuss the theoretical properties of SP-BFGS, including convergence when minimizing strongly convex functions in the presence of uniformly bounded noise. Finally, we present extensive numerical experiments using over 30 problems from the CUTEst test problem set that demonstrate the superior performance of SP-BFGS compared to BFGS in the presence of both noisy function and gradient evaluations.
Article
Full-text available
The main focus in this paper is exact linesearch methods for minimizing a quadratic function whose Hessian is positive definite. We give a class of limited-memory quasi-Newton Hessian approximations which generate search directions parallel to those of the BFGS method, or equivalently, to those of the method of preconditioned conjugate gradients. In the setting of reduced Hessians, the class provides a dynamical framework for the construction of limited-memory quasi-Newton methods. These methods attain finite termination on quadratic optimization problems in exact arithmetic. We show performance of the methods within this framework in finite precision arithmetic by numerical simulations on sequences of related systems of linear equations, which originate from the CUTEst test collection. In addition, we give a compact representation of the Hessian approximations in the full Broyden class for the general unconstrained optimization problem. This representation consists of explicit matrices and gradients only as vector components.
Article
Full-text available
Optimization problems with constraints involving stochastic parameters that are required to be satisfied with a prespecified probability threshold arise in numerous applications. Such chance constrained optimization problems involve the dual challenges of stochasticity and nonconvexity. In the setting of a finite distribution of the stochastic parameters, an optimization problem with linear chance constraints can be formulated as a mixed integer linear program (MILP). The natural MILP formulation has a weak relaxation bound and is quite difficult to solve. In this paper, we review some recent results on improving the relaxation bounds and constructing approximate solutions for MILP formulations of chance constraints. We also discuss a recently introduced bicriteria approximation algorithm for covering type chance constrained problems. This algorithm uses a relaxation to construct a solution whose (constraint violation) risk level may be larger than the pre-specified threshold, but is within a constant factor of it, and whose objective value is also within a constant factor of the true optimal value. Finally, we present some new results that improve on the bicriteria approximation factors in the finite scenario setting and shed light on the effect of strong relaxations on the approximation ratios.
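The scenario/indicator logic behind the MILP formulation mentioned in this abstract can be shown on a deliberately tiny one-dimensional stand-in (the names `xi`, `eps`, and the data are illustrative choices of ours, not the paper's formulation). In the MILP, a binary variable per scenario marks it as violated and the probability-weighted violations are capped by the risk level; here, enumeration over candidate thresholds plays the role of the binaries.

```python
import numpy as np

rng = np.random.default_rng(42)
K = 20                            # finite number of scenarios
xi = rng.uniform(0.5, 2.0, K)     # scenario realizations of the coefficient
p = np.full(K, 1.0 / K)           # equal scenario probabilities
eps = 0.1                         # allowed violation probability

# Chance constraint: P(xi * x >= 1) >= 1 - eps, minimize x.
# Scenario i is satisfied iff x >= 1 / xi[i]; candidate minimizers are
# exactly the scenario thresholds, scanned in increasing order.
thresholds = np.sort(1.0 / xi)
best = None
for x in thresholds:
    violated = p[(1.0 / xi) > x + 1e-12].sum()
    if violated <= eps + 1e-12:   # mimics sum(p_i * z_i) <= eps in the MILP
        best = x
        break
```

With 20 equally likely scenarios and `eps = 0.1`, the optimum is allowed to violate exactly 2 scenarios, so the scan stops at the third-largest threshold; the weak relaxations and approximation algorithms the abstract discusses address exactly this combinatorial choice of which scenarios to give up.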
Article
Full-text available
This paper concerns exact linesearch quasi-Newton methods for minimizing a quadratic function whose Hessian is positive definite. We show that by interpreting the method of conjugate gradients as a particular exact linesearch quasi-Newton method, necessary and sufficient conditions can be given for an exact linesearch quasi-Newton method to generate a search direction which is parallel to that of the method of conjugate gradients. We also analyze update matrices and give a complete description of the rank-one update matrices that give search directions parallel to those of the method of conjugate gradients. In particular, we characterize the family of such symmetric rank-one update matrices that preserve positive definiteness of the quasi-Newton matrix. This is in contrast to the classical symmetric rank-one update, where there is no freedom in choosing the matrix, and positive definiteness cannot be preserved. The analysis is extended to search directions that are parallel to those of the preconditioned method of conjugate gradients in a straightforward manner.
Article
Full-text available
We study chance-constrained problems in which the constraints involve the probability of a rare event. We discuss the relevance of such problems and show that the existing sampling-based algorithms cannot be applied directly in this case, since they require an impractical number of samples to yield reasonable solutions. We argue that importance sampling (IS) techniques, combined with a Sample Average Approximation (SAA) approach, can be effectively used in such situations, provided that variance can be reduced uniformly with respect to the decision variables. We give sufficient conditions to obtain such uniform variance reduction, and prove asymptotic convergence of the combined SAA-IS approach. As it often happens with IS techniques, the practical performance of the proposed approach relies on exploiting the structure of the problem under study; in our case, we work with a telecommunications problem with Bernoulli input distributions, and show how variance can be reduced uniformly over a suitable approximation of the feasibility set by choosing proper parameters for the IS distributions. Although some of the results are specific to this problem, we are able to draw general insights that can be useful for other classes of problems. We present numerical results to illustrate our findings.
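Why crude sampling fails on rare events, and how importance sampling rescues it, can be seen on a textbook example rather than the paper's telecommunications problem: estimating the Gaussian tail probability P(Z > 4) by tilting the sampling distribution toward the rare region and reweighting by the likelihood ratio. The setup below is our own illustrative sketch, not the paper's IS scheme.

```python
import numpy as np

rng = np.random.default_rng(1)
t, N = 4.0, 100_000
p_true = 3.1671e-5                 # P(Z > 4) for Z ~ N(0, 1)

# Crude Monte Carlo: with N = 1e5 samples, only a handful (if any)
# land in the rare event, so the estimate is extremely noisy.
z = rng.standard_normal(N)
p_mc = np.mean(z > t)

# Importance sampling: draw from the shifted proposal N(t, 1) and
# reweight by the likelihood ratio phi(z) / phi(z - t) = exp(-t z + t^2/2).
zs = rng.standard_normal(N) + t
w = np.exp(-t * zs + t * t / 2.0)
p_is = np.mean((zs > t) * w)
```

With the same budget of samples, the IS estimate lands within a few percent of the true value, whereas the crude estimate is dominated by sampling noise; the uniform-variance-reduction conditions in the abstract ask for this effect to hold uniformly over the decision variables.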
Article
The question of how to parallelize the stochastic gradient descent (SGD) method has received much attention in the literature. In this paper, we focus instead on batch methods that use a sizeable fraction of the training set at each iteration to facilitate parallelism, and that employ second-order information. In order to improve the learning process, we follow a multi-batch approach in which the batch changes at each iteration. This inherently gives the algorithm a stochastic flavor that can cause instability in L-BFGS, a popular batch method in machine learning. These difficulties arise because L-BFGS employs gradient differences to update the Hessian approximations; when these gradients are computed using different data points the process can be unstable. This paper shows how to perform stable quasi-Newton updating in the multi-batch setting, illustrates the behavior of the algorithm in a distributed computing platform, and studies its convergence properties for both the convex and nonconvex cases.
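The stabilization idea this abstract alludes to can be sketched in a few lines: when consecutive batches overlap, computing the curvature pair's gradient difference only on the shared samples makes both gradients see the same data, so the difference reflects genuine curvature. The least-squares objective, batch indices, and helper `batch_grad` below are our own illustrative choices, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 5
A_feat = rng.standard_normal((N, d))   # synthetic least-squares data
b_targ = rng.standard_normal(N)

def batch_grad(idx, x):
    """Gradient of the mean least-squares loss over samples idx."""
    Ai = A_feat[idx]
    return Ai.T @ (Ai @ x - b_targ[idx]) / len(idx)

x = rng.standard_normal(d)
s = 0.1 * rng.standard_normal(d)       # a quasi-Newton step

batch1 = np.arange(0, 120)             # consecutive, overlapping batches
batch2 = np.arange(80, 200)
overlap = np.arange(80, 120)

# Naive difference mixes two batches, so it confounds curvature with
# batch-to-batch gradient noise and can even yield y @ s <= 0.
y_naive = batch_grad(batch2, x + s) - batch_grad(batch1, x)

# Overlap-based difference uses one fixed sample set at both points:
# here y_overlap = (A_O^T A_O / |O|) s, so y_overlap @ s >= 0 always.
y_overlap = batch_grad(overlap, x + s) - batch_grad(overlap, x)
```

The overlap-based pair is guaranteed to satisfy the curvature condition on this convex loss, which is what keeps the L-BFGS Hessian approximations well defined as the batch changes every iteration.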
Article
The authors consider the use of bounded-deterioration quasi-Newton methods implemented in floating-point arithmetic to find solutions to F(x) = 0 where only inaccurate F-values are available. The analysis is for the case where the relative error in F is less than one. They obtain theorems specifying local rates of improvement and limiting accuracies depending on the nearness to Newton's method of the basic algorithm, the accuracy of its implementation, the relative errors in the function values, the accuracy of the solutions of the linear system for the Newton steps, and the unit-rounding errors in the addition of the Newton steps.