A Derivative-Free Approximate Gradient Sampling
Algorithm for Finite Minimax Problems
W. Hare and J. Nutini
February 20, 2013
Abstract
In this paper we present a derivative-free optimization algorithm for finite minimax
problems. The algorithm calculates an approximate gradient for each of the active functions
of the finite max function and uses these to generate an approximate subdifferential.
The negative projection of 0 onto this set is used as a descent direction in an Armijo-like
line search. We also present a robust version of the algorithm, which uses the ‘almost
active’ functions of the finite max function in the calculation of the approximate
subdifferential. Convergence results are presented for both algorithms, showing that either
$f(x_k) \to -\infty$ or every cluster point is a Clarke stationary point. Theoretical and
numerical results are presented for three specific approximate gradients: the simplex gradient, the
centered simplex gradient and the Gupal estimate of the gradient of the Steklov averaged
function. A performance comparison is made between the regular and robust algorithms,
the three approximate gradients, and a regular and robust stopping condition.
Keywords: derivative-free optimization, minimax problems, generalized gradient, subgradient
approximation.
AMS Subject Classification: primary, 90C47, 90C56; secondary, 90C26, 65K05, 49M25.
1 Introduction
In this paper we consider the finite minimax problem:
\[ \min_x f(x) \quad \text{where} \quad f(x) = \max\{f_i(x) : i = 1, \ldots, N\}, \]
where each individual $f_i$ is continuously differentiable. We further restrict ourselves to the field
of derivative-free optimization (DFO), where we are only permitted to compute function values,
i.e., we cannot compute gradient values $\nabla f_i$ directly. We present a derivative-free algorithm
that exploits the smooth substructure of the finite max problem, thereby creating a robust
algorithm with an elegant convergence theory.
Finite minimax problems occur in numerous applications, such as portfolio optimization [8],
control system design [21], engineering design [32], and determining the cosine measure of a
positive spanning set [10, Def 2.7]. In a finite max function, although each individual $f_i$ may be
smooth, taking the maximum forms a nonsmooth function with ‘nondifferentiable ridges’. For
this reason, most algorithms designed to solve finite minimax problems employ some form of
smoothing technique; see [31], [33], [34], and [39] (among many others). In general, these smoothing
techniques require gradient calculations.
However, in many situations gradient information is not available or can be difficult to
compute accurately (see [4], [15], [20], [29] and [10, Chpt 1] for some examples of such situations).
Such situations are considered by research in the area of derivative-free optimization. For a
thorough introduction to several basic DFO frameworks and convergence results for each, see
[10].
Research on optimizing finite max functions without calculating derivatives can be seen as
early as 1975 [28], while more recently there has been a resurgence of interest in this area; see [26] and [19].
In 2006, Liuzzi, Lucidi and Sciandrone used a smoothing technique based on an exponential
penalty function in a directional direct-search framework to form a derivative-free optimization
method for finite minimax problems [26]. This method is shown to globally converge towards
a standard stationary point of the original finite minimax problem.
Also specific to the finite minimax problem, a derivative-free method is presented in [19] that
exploits the smooth substructure of the problem. It combines the frameworks of a directional
direct search method [10, Chpt 7] and the gradient sampling algorithm (GS algorithm) presented
in [6] and [7]. Loosely speaking, the GS algorithm uses a collection of local gradients to build
a ‘robust subdifferential’ of the objective function and uses this to determine a ‘robust descent
direction’. In [19], these ideas are used to develop several methods to find an approximate
descent direction that moves close to parallel to an ‘active manifold’. During each iteration,
points are sampled from around the current iterate and the simplex gradient is calculated for
each of the active functions of the objective function. The calculated simplex gradients are then
used to form an approximate subdifferential, which is then used to determine a likely descent
direction.
Ideas from the GS algorithm have appeared in two other recent DFO methodologies [2] and
[24].
In 2008, Bagirov, Karasözen and Sezer presented a discrete gradient derivative-free method
for unconstrained nonsmooth optimization problems [2]. Described as a derivative-free version
of the bundle method presented in [37], the method uses discrete gradients to approximate
subgradients of the function and build an approximate subdifferential. The analysis of this
method provides proof of convergence to a Clarke stationary point for an extensive class of
nonsmooth problems. In this paper, we focus on the finite minimax problem. This allows us to
require few (other) assumptions on our function while maintaining strong convergence analysis.
It is worth noting that we use the same set of test problems as in [2]. Specifically, we use the
[27] test set and exclude one problem as its sub-functions are complex-valued. (The numerics
in [2] exclude the same problem, and several others, without explanation.)
Using approximate gradient calculations instead of gradient calculations, the GS algorithm
is made derivative free by Kiwiel in [24]. Specifically, Kiwiel employs the Gupal estimate of
the gradient of the Steklov averaged function (see [18] or Section 4.3 herein) as an approximate
gradient. It is shown that, with probability 1, this derivative-free algorithm satisfies the same
convergence results as the GS algorithm – it either drives the f-values to −∞ or each cluster
point is found to be Clarke stationary [24, Theorem 3.8]. No numerical results are presented
for Kiwiel’s derivative-free algorithm.
In this paper, we use the GS algorithm framework with approximate gradients to form a
derivative-free approximate gradient sampling algorithm. As we are dealing with finite max
functions, instead of calculating an approximate gradient at each of the sampled points, we
calculate an approximate gradient for each of the active functions. Expanding the active set
to include ‘almost’ active functions, we also present a robust version of our algorithm, which
is more akin to the GS algorithm. In this robust version, when our iterate is close to a point
of nondifferentiability, the size and shape of our approximate subdifferential will reflect the
presence of ‘almost active’ functions. Hence, when we project 0 onto our approximate subdif-
ferential, the descent direction will direct minimization parallel to a ‘nondifferentiable ridge’,
rather than straight at this ridge. It can be seen in our numerical results that these robust
changes greatly influence the performance of our algorithm.
Our algorithm differs from the above in a few key manners. Unlike in [26] we do not employ a
smoothing technique. Unlike in [19], which uses the directional direct-search framework to imply
convergence, we employ an approximate steepest descent framework. Using this framework, we
are able to analyze convergence directly and develop stopping conditions for the algorithm.
Unlike in [2] and [24], where convergence is proven for a specific approximate gradient, we
prove convergence for any approximate gradient that satisfies a simple error bound dependent
on the sampling radius. As examples, we present the simplex gradient, the centered simplex
gradient and the Gupal estimate of the gradient of the Steklov averaged function. (As a side-
note, Section 4.3 also provides, to the best of the authors’ knowledge, novel error analysis of
the Gupal estimate of the gradient of the Steklov averaged function.)
Focusing on the finite minimax problem provides us with an advantage over the methods
of [2] and [24]. In particular, we only require order $n$ function calls per iteration (where $n$
is the dimension of the problem), while both [2] and [24] require order $mn$ function calls per
iteration (where $m$ is the number of gradients they approximate to build their approximate
subdifferential). (The original GS algorithm suggests that $m \approx 2n$ provides a good value for
$m$.)
The remainder of this paper is organized as follows. In Section 2, we present the approximate
gradient sampling algorithm (AGS algorithm) and our convergence analysis. In Section 3, we
present a robust version of the AGS algorithm (RAGS algorithm), which uses ‘almost active’
functions in the calculation of the approximate subdifferential. In Section 4, we show that
the AGS and RAGS algorithms converge using three specific approximate gradients: simplex
gradient, centered simplex gradient and the Gupal estimate of the gradient of the Steklov
averaged function. Finally, in Section 5, we present our numerical results and analysis.
2 Approximate Gradient Sampling Algorithm
Throughout this paper, we assume that our objective function is of the form
\[ \min_x f(x) \quad \text{where} \quad f(x) = \max\{f_i(x) : i = 1, \ldots, N\}, \tag{1} \]
where each $f_i \in C^1$, but we cannot compute $\nabla f_i$. We use $C^1$ to denote the class of differentiable
functions whose gradient mapping $\nabla f$ is continuous. We denote by $C^{1+}$ the class of continuously
differentiable functions whose gradient mapping $\nabla f$ is locally Lipschitz and we denote by $C^{2+}$
the class of twice continuously differentiable functions whose Hessian mapping $\nabla^2 f$ is locally
Lipschitz. Additionally, throughout this paper, $|\cdot|$ denotes the Euclidean norm and $\|\cdot\|$ denotes
the corresponding matrix norm.
For the finite max function in equation (1), we define the active set of $f$ at a point $\bar{x}$ to
be the set of indices
\[ A(\bar{x}) = \{i : f(\bar{x}) = f_i(\bar{x})\}. \]
The set of active gradients of $f$ at $\bar{x}$ is denoted by
\[ \{\nabla f_i(\bar{x})\}_{i \in A(\bar{x})}. \]
Let $f$ be locally Lipschitz at a point $\bar{x}$. As $f$ is Lipschitz, there exists an open dense set $D \subset \mathbb{R}^n$
such that $f$ is continuously differentiable on $D$. The Clarke subdifferential [9] is constructed
via
\[ \partial f(\bar{x}) = \bigcap_{\varepsilon > 0} G_\varepsilon(\bar{x}) \quad \text{where} \quad G_\varepsilon(\bar{x}) = \operatorname{cl\,conv}\{\nabla f(y) : y \in B_\varepsilon(\bar{x}) \cap D\}. \]
For a finite max function, assuming $f_i \in C^1$ for each $i \in A(\bar{x})$, the Clarke subdifferential (as
proven in [9, Prop. 2.3.12]) is equivalent to
\[ \partial f(\bar{x}) = \operatorname{conv}\{\nabla f_i(\bar{x})\}_{i \in A(\bar{x})}. \tag{2} \]
By equation (2), it is clear that for finite max functions the subdifferential is a compact set.
This will be important in the convergence analysis in Section 2.2.
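As a small illustration of these definitions, the following sketch evaluates a finite max function and its active set for a list of component functions. The helper names `finite_max` and `active_set` and the argument `tol` are assumptions introduced here for illustration; with `tol = 0` the second function returns the active set as defined above, up to floating-point comparison.

```python
def finite_max(fs, x):
    """Evaluate f(x) = max{f_i(x) : i = 1, ..., N} for a list of callables fs."""
    return max(fi(x) for fi in fs)


def active_set(fs, x, tol=0.0):
    """Return the (0-based) indices i with f_i(x) >= f(x) - tol.

    tol = 0 gives the active set A(x); a positive tol is a hypothetical
    floating-point safeguard and is not part of the definition in the text.
    """
    values = [fi(x) for fi in fs]
    f_max = max(values)
    return [i for i, v in enumerate(values) if v >= f_max - tol]


# Example: f(x) = max{x1, -x1, x2 - 1} on R^2.
fs = [lambda x: x[0], lambda x: -x[0], lambda x: x[1] - 1.0]
print(active_set(fs, [0.0, 0.0]))  # the two ridge functions are active
print(active_set(fs, [2.0, 0.0]))  # only the first function is active
```

By equation (2), the Clarke subdifferential at a point is then the convex hull of the gradients whose indices are returned.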
We are now ready to state the general form of the AGS algorithm, an approximate subgradient
descent method.
2.1 Algorithm - AGS
We first provide a partial glossary of notation used in the definition of the AGS algorithm.
Table 1: Glossary of notation used in the AGS algorithm.
Glossary of Notation
$k$: Iteration counter                          $x_k$: Current iterate
$\mu_k$: Accuracy measure                       $\Delta_k$: Sampling radius
$m$: Sample size                                $\theta$: Sampling radius reduction factor
$y_j$: Sampling points                          $Y$: Sampled set of points
$\eta$: Armijo-like parameter                   $d_k$: Search direction
$t_k$: Step length                              $t_{\min}$: Minimum step length
$\nabla_A f_i$: Approximate gradient of $f_i$   $A(x_k)$: Active set at $x_k$
$G_k$: Approximate subdifferential              $\varepsilon_{tol}$: Stopping tolerance
Conceptual Algorithm: [Approximate Gradient Sampling Algorithm]
0. Initialize: Set $k = 0$ and input
   $x_0$ - starting point
   $\mu_0 > 0$ - accuracy measure
   $\Delta_0 > 0$ - initial sampling radius
   $\theta \in (0,1)$ - sampling radius reduction factor
   $0 < \eta < 1$ - Armijo-like parameter
   $t_{\min}$ - minimum step length
   $\varepsilon_{tol} > 0$ - stopping tolerance
1. Generate Approximate Subdifferential $G_k$:
   Generate a set $Y = [x_k, y_1, \ldots, y_m]$ around the current iterate $x_k$ such that
   \[ \max_{j=1,\ldots,m} |y_j - x_k| \le \Delta_k. \]
   Use $Y$ to calculate the approximate gradient of $f_i$, denoted $\nabla_A f_i$, at $x_k$ for each $i \in A(x_k)$.
   Set
   \[ G_k = \operatorname{conv}\{\nabla_A f_i(x_k)\}_{i \in A(x_k)}. \]
2. Generate Search Direction:
   Let
   \[ d_k = -\operatorname{Proj}(0 \mid G_k). \]
   Check if
   \[ \Delta_k \le \mu_k |d_k|. \tag{3} \]
   If (3) does not hold, then set $x_{k+1} = x_k$,
   \[ \Delta_{k+1} = \begin{cases} \theta \mu_k |d_k| & \text{if } |d_k| \ne 0 \\ \theta \Delta_k & \text{if } |d_k| = 0 \end{cases}, \tag{4} \]
   $k = k + 1$ and return to Step 1. If (3) holds and $|d_k| < \varepsilon_{tol}$, then STOP. Else, continue to
   the line search.
3. Line Search:
   Attempt to find $t_k > 0$ such that
   \[ f(x_k + t_k d_k) < f(x_k) - \eta t_k |d_k|^2. \]
   Line Search Failure:
   Set $\mu_{k+1} = \frac{\mu_k}{2}$, $x_{k+1} = x_k$ and go to Step 4.
   Line Search Success:
   Let $x_{k+1}$ be any point such that
   \[ f(x_{k+1}) \le f(x_k + t_k d_k). \]
4. Update and Loop:
   Set $\Delta_{k+1} = \max_{j=1,\ldots,m} |y_j - x_k|$, $k = k + 1$ and return to Step 1.
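The steps above can be sketched in a few dozen lines of Python. This is a minimal sketch under several assumptions that are not taken from the paper: the approximate gradient is a plain forward difference on the points $x_k + \Delta_k e_j$ (a deterministic stand-in for the randomly sampled set $Y$), the projection of 0 onto the convex hull is approximated by a Frank-Wolfe iteration with exact line search, the new iterate is always $x_k + t_k d_k$, and `delta_min` and `max_iter` are added safeguards. All function and parameter names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def min_norm_point(gens, iters=5000, tol=1e-14):
    """Approximate Proj(0 | conv(gens)) by Frank-Wolfe with exact line search."""
    w = list(min(gens, key=lambda g: dot(g, g)))
    for _ in range(iters):
        s = min(gens, key=lambda g: dot(w, g))      # vertex minimizing <w, g>
        gap = dot(w, w) - dot(w, s)                 # Frank-Wolfe optimality gap
        if gap <= tol:
            break
        diff = [si - wi for si, wi in zip(s, w)]
        step = min(1.0, gap / dot(diff, diff))      # exact minimizer on [w, s]
        w = [wi + step * di for wi, di in zip(w, diff)]
    return w


def forward_difference_gradient(fi, x, delta):
    """Assumed stand-in approximate gradient built from x and x + delta * e_j."""
    fx = fi(x)
    grad = []
    for j in range(len(x)):
        y = list(x)
        y[j] += delta
        grad.append((fi(y) - fx) / delta)
    return grad


def ags(fs, x0, delta0=1.0, mu0=1.0, theta=0.5, eta=0.1, t_min=1e-10,
        eps_tol=1e-6, delta_min=1e-12, max_iter=1000):
    f = lambda z: max(fi(z) for fi in fs)
    x, delta, mu = list(x0), delta0, mu0
    for _ in range(max_iter):
        if delta < delta_min:       # added safeguard, not in the conceptual algorithm
            break
        # Step 1: approximate gradients of the active functions.
        fx = f(x)
        grads = [forward_difference_gradient(fi, x, delta)
                 for fi in fs if fi(x) >= fx]
        # Step 2: search direction and radius test (3), update (4).
        d = [-wi for wi in min_norm_point(grads)]
        norm_d = math.sqrt(dot(d, d))
        if delta > mu * norm_d:
            delta = theta * mu * norm_d if norm_d > 0 else theta * delta
            continue
        if norm_d < eps_tol:
            break
        # Step 3: backtracking on the Armijo-like condition.
        t = 1.0
        while t >= t_min:
            trial = [xi + t * di for xi, di in zip(x, d)]
            if f(trial) < fx - eta * t * norm_d ** 2:
                break
            t *= 0.5
        if t < t_min:
            mu *= 0.5               # line search failure
        else:
            x = trial               # line search success
        # Step 4: every sample point lies at distance delta, so delta is unchanged.
    return x
```

Because the iterate only changes on a line search success, the returned point never has a larger function value than `x0`. The Frank-Wolfe routine is chosen only to keep the sketch dependency-free; a quadratic programming solver would be the more usual choice for the projection.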
In Step 0 of the AGS algorithm, we set the iterate counter to 0, provide an initial starting
point x0, and initialize the parameter values.
In Step 1, we create the approximate subdifferential. First, we select a set of points around
$x_k$ within a sampling radius of $\Delta_k$. In implementation, the points are randomly and uniformly
sampled from a ball of radius $\Delta_k$ (using the MATLAB randsphere.m function [36]). Using
this set $Y$, we then calculate an approximate gradient for each of the active functions at $x_k$
and set the approximate subdifferential $G_k$ equal to the convex hull of these active approximate
gradients, $\nabla_A f_i(x_k)$. Details on various approximate gradients appear in Section 4.
In Step 2, we generate a search direction by solving the projection of 0 onto the approximate
subdifferential: $\operatorname{Proj}(0 \mid G_k) \in \arg\min_{g \in G_k} |g|^2$. The search direction $d_k$ is set equal to the
negative of the solution, i.e., $d_k = -\operatorname{Proj}(0 \mid G_k)$.
After finding a search direction, we check the inequality $\Delta_k \le \mu_k |d_k|$. This inequality
determines if the current sampling radius is sufficiently small relative to the distance from 0 to
the approximate subdifferential. If this inequality holds and $|d_k| < \varepsilon_{tol}$, then we terminate the
algorithm, as 0 is within $\varepsilon_{tol}$ of the approximate subdifferential and the sampling radius is small
enough to reason that the approximate subdifferential is accurate. If the above inequality does
not hold, then the approximate subdifferential is not sufficiently accurate to warrant a line
search, so we decrease the sampling radius, set $x_{k+1} = x_k$, update the iterate counter and loop
(Step 1). If the above inequality holds, but $|d_k| \ge \varepsilon_{tol}$, then we proceed to a line search.
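The three possible outcomes of Step 2 can be written as a small decision function. The sketch below assumes $|d_k|$ has already been computed; the function name and the returned labels are hypothetical, while the radius update follows equation (4).

```python
def step2_decision(delta, mu, norm_d, theta, eps_tol):
    """Classify the outcome of Step 2 for sampling radius delta and |d_k| = norm_d.

    Returns a pair (action, radius) with action in {"shrink", "stop", "line_search"}.
    """
    if delta > mu * norm_d:
        # Inequality (3) fails: reduce the sampling radius as in equation (4).
        new_delta = theta * mu * norm_d if norm_d != 0 else theta * delta
        return "shrink", new_delta
    if norm_d < eps_tol:
        # (3) holds and 0 is within eps_tol of the approximate subdifferential.
        return "stop", delta
    # (3) holds but |d_k| >= eps_tol: proceed to the line search.
    return "line_search", delta


print(step2_decision(1.0, 1.0, 0.5, 0.5, 1e-3))   # radius too large: shrink
print(step2_decision(0.1, 1.0, 2.0, 0.5, 1e-3))   # accurate enough: line search
```

Note that when (3) fails the new radius is at most `theta * delta`, since `mu * norm_d < delta` in that branch.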
In Step 3, we carry out a line search. We attempt to find a step length $t_k > 0$ such that the
Armijo-like condition holds
\[ f(x_k + t_k d_k) < f(x_k) - \eta t_k |d_k|^2. \tag{5} \]
This condition ensures sufficient decrease is found in the function value. In implementation,
we use a back-tracking line search (described in [30]) with an initial step-length of $t_{ini} = 1$,
terminating when the step length $t_k$ is less than a threshold $t_{\min}$. If we find a $t_k$ such that
equation (5) holds, then we declare a line search success. If not, then we declare a line search
failure.
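A back-tracking search of this kind might look as follows. This is a sketch only: the halving rule for the step length is an assumption (the text specifies the initial step and the threshold $t_{\min}$, not the reduction factor), and the function name is hypothetical.

```python
def backtracking_line_search(f, x, d, eta=0.1, t_init=1.0, t_min=1e-10):
    """Return t with f(x + t d) < f(x) - eta * t * |d|^2, or None on failure."""
    fx = f(x)
    norm_d_sq = sum(di * di for di in d)
    t = t_init
    while t >= t_min:
        trial = [xi + t * di for xi, di in zip(x, d)]
        if f(trial) < fx - eta * t * norm_d_sq:
            return t          # line search success
        t *= 0.5              # assumed reduction rule: halve the step
    return None               # line search failure: step fell below t_min


f = lambda x: x[0] ** 2
print(backtracking_line_search(f, [1.0], [-2.0]))  # descent direction
print(backtracking_line_search(f, [1.0], [2.0]))   # ascent direction, fails
```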
If a line search success occurs, then we let $x_{k+1}$ be any point such that
\[ f(x_{k+1}) \le f(x_k + t_k d_k). \tag{6} \]
In implementation, we do this by searching through the function values used in the calculation
of our approximate gradients ($\{f(y_i)\}_{y_i \in Y}$). As this set of function values corresponds to
points distributed around our current iterate, there is a good possibility of finding further
function value decrease without having to carry out additional function evaluations. We find
the minimum function value in our set of evaluations and if equation (6) holds for this minimum
value, then we set $x_{k+1}$ equal to the corresponding input point. Otherwise, we set $x_{k+1} =
x_k + t_k d_k$.
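The reuse of already computed function values can be sketched as below. The function name and the representation of the sample as `(point, value)` pairs are assumptions made for illustration.

```python
def select_next_iterate(f, x, d, t, sampled):
    """Pick x_{k+1} after a line search success.

    sampled is a list of (y_j, f(y_j)) pairs that were already evaluated in
    Step 1, so inspecting them costs no additional function evaluations.
    """
    trial = [xi + t * di for xi, di in zip(x, d)]
    f_trial = f(trial)
    y_best, f_best = min(sampled, key=lambda pair: pair[1])
    if f_best <= f_trial:          # condition (6) holds for the best sample point
        return y_best, f_best
    return trial, f_trial


f = lambda z: z[0] ** 2 + z[1] ** 2
sampled = [([1.2, 1.0], f([1.2, 1.0])), ([0.1, 0.1], f([0.1, 0.1]))]
print(select_next_iterate(f, [1.0, 1.0], [-1.0, -1.0], 0.5, sampled))
```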
If a line search failure occurs, then we reduce the accuracy measure $\mu_k$ by a factor of $\frac{1}{2}$ and
set $x_{k+1} = x_k$.
Finally, in Step 4, we update the iterate counter and the sampling radius, and then loop to
Step 1 to resample.
2.2 Convergence
For the following results, we denote the approximate subdifferential of $f$ at $\bar{x}$ as
\[ G(\bar{x}) = \operatorname{conv}\{\nabla_A f_i(\bar{x})\}_{i \in A(\bar{x})}, \]
where $\nabla_A f_i(\bar{x})$ is the approximate gradient of $f_i$ at $\bar{x}$. Our first result establishes an error bound
relation between the elements of the approximate subdifferential and the exact subdifferential.
Lemma 2.1. Let $f = \max\{f_i : i = 1, \ldots, N\}$ where each $f_i \in C^1$. Suppose there exists an
$\varepsilon > 0$ such that $|\nabla_A f_i(\bar{x}) - \nabla f_i(\bar{x})| \le \varepsilon$ for all $i = 1, \ldots, N$. Then
1. for all $w \in G(\bar{x})$, there exists a $v \in \partial f(\bar{x})$ such that $|w - v| \le \varepsilon$, and
2. for all $v \in \partial f(\bar{x})$, there exists a $w \in G(\bar{x})$ such that $|w - v| \le \varepsilon$.
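Proof. 1. By definition, for all $w \in G(\bar{x})$ there exists a set of $\alpha_i$ such that
\[ w = \sum_{i \in A(\bar{x})} \alpha_i \nabla_A f_i(\bar{x}), \quad \text{where } \alpha_i \ge 0, \ \sum_{i \in A(\bar{x})} \alpha_i = 1. \]
By our assumption that each $f_i \in C^1$, we have $\partial f(\bar{x}) = \operatorname{conv}\{\nabla f_i(\bar{x})\}_{i \in A(\bar{x})}$. Using the same $\alpha_i$
as above, we see that
\[ v = \sum_{i \in A(\bar{x})} \alpha_i \nabla f_i(\bar{x}) \in \partial f(\bar{x}). \]
Then
\begin{align*}
|w - v| &= \Big| \sum_{i \in A(\bar{x})} \alpha_i \nabla_A f_i(\bar{x}) - \sum_{i \in A(\bar{x})} \alpha_i \nabla f_i(\bar{x}) \Big| \\
&\le \sum_{i \in A(\bar{x})} \alpha_i |\nabla_A f_i(\bar{x}) - \nabla f_i(\bar{x})| \\
&\le \sum_{i \in A(\bar{x})} \alpha_i \varepsilon \\
&= \varepsilon.
\end{align*}
Hence, for all $w \in G(\bar{x})$, there exists a $v \in \partial f(\bar{x})$ such that
\[ |w - v| \le \varepsilon. \tag{7} \]
2. Analogous arguments can be applied to $v \in \partial f(\bar{x})$.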
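The pairing used in the proof (the same convex weights applied to the approximate and the exact gradients) is easy to check numerically. The gradient values below are made up for illustration and the helper names are hypothetical.

```python
import math


def combine(alphas, vectors):
    """Convex combination sum_i alphas[i] * vectors[i]."""
    n = len(vectors[0])
    return [sum(a * v[j] for a, v in zip(alphas, vectors)) for j in range(n)]


def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Illustrative data (assumed): exact and approximate active gradients.
exact = [[1.0, 0.0], [0.0, 1.0]]
approx = [[1.06, 0.08], [-0.1, 1.0]]
eps = max(distance(a, e) for a, e in zip(approx, exact))

alphas = [0.3, 0.7]
w = combine(alphas, approx)   # element of the approximate subdifferential
v = combine(alphas, exact)    # matching element of the exact subdifferential
print(distance(w, v) <= eps)
```

The bound is typically far from tight for interior weights, since the individual gradient errors partially cancel in the convex combination.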
Lemma 2.1 states the quality of the approximate subdifferential as an approximation to the
exact subdifferential once the approximate gradients of the component functions are quality
approximations to the real gradients. Our next goal (in Theorem 2.4) is to show that eventually
a line search success will occur in the AGS algorithm. To achieve this we make use of the
following lemma.
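Lemma 2.2. Let $f = \max\{f_i : i = 1, \ldots, N\}$ where each $f_i \in C^1$. Suppose there exists an
$\varepsilon > 0$ such that $|\nabla_A f_i(\bar{x}) - \nabla f_i(\bar{x})| \le \varepsilon$ for all $i = 1, \ldots, N$. Define $d = -\operatorname{Proj}(0 \mid G(\bar{x}))$ and
suppose $|d| \ne 0$. Let $\beta \in [0, 1)$. If $\varepsilon < (1 - \beta)|d|$, then for all $v \in \partial f(\bar{x})$ we have
\[ \langle d, v \rangle < -\beta |d|^2. \]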
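Proof. Notice that, by the Projection Theorem [3, Theorem 3.14], $d = -\operatorname{Proj}(0 \mid G(\bar{x}))$ implies
that
\[ \langle 0 - (-d), w - (-d) \rangle \le 0 \quad \text{for all } w \in G(\bar{x}). \]
Hence,
\[ \langle d, w + d \rangle \le 0 \quad \text{for all } w \in G(\bar{x}). \tag{8} \]
So we have for all $v \in \partial f(\bar{x})$,
\begin{align*}
\langle d, v \rangle &= \langle d, v - w + w - d + d \rangle && \text{for all } w \in G(\bar{x}) \\
&= \langle d, v - w \rangle + \langle d, w + d \rangle + \langle d, -d \rangle && \text{for all } w \in G(\bar{x}) \\
&\le \langle d, v - w \rangle - |d|^2 && \text{for all } w \in G(\bar{x}) \\
&\le |d||v - w| - |d|^2 && \text{for all } w \in G(\bar{x}).
\end{align*}
For any $v \in \partial f(\bar{x})$, using $w$ as constructed in Lemma 2.1, we see that
\begin{align*}
\langle d, v \rangle &\le |d|\varepsilon - |d|^2 \\
&< |d|^2 (1 - \beta) - |d|^2 && (\text{as } \varepsilon < (1 - \beta)|d|) \\
&= -\beta |d|^2.
\end{align*}
Remark 2.3. In Lemma 2.2, for the case when $\beta = 0$, the condition $\varepsilon < (1 - \beta)|d|$ simplifies to
$\varepsilon < |d|$. Thus, if $\varepsilon$ is bounded above by $|d|$, then Lemma 2.2 proves that for all $v \in \partial f(\bar{x})$ we
have $\langle d, v \rangle < 0$, showing that $d$ is a descent direction for $f$ at $\bar{x}$.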
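A small worked instance may help fix ideas. The numbers below are chosen purely for illustration and do not come from the paper: two active functions in $\mathbb{R}^2$ whose approximate gradients are each off by $\varepsilon = 0.1$.

```latex
% Illustrative data (assumed):
%   \nabla f_1(\bar{x}) = (1, 0),      \nabla f_2(\bar{x}) = (0, 1),
%   \nabla_A f_1(\bar{x}) = (1.1, 0),  \nabla_A f_2(\bar{x}) = (0, 1.1),
% so the error bound of Lemma 2.2 holds with \varepsilon = 0.1.
\begin{align*}
G(\bar{x}) &= \operatorname{conv}\{(1.1, 0), (0, 1.1)\}, \qquad
d = -\operatorname{Proj}(0 \mid G(\bar{x})) = -(0.55, 0.55), \\
|d| &= 0.55\sqrt{2} \approx 0.778, \qquad |d|^2 = 0.605, \\
\langle d, v \rangle &= -0.55\,(\alpha + (1 - \alpha)) = -0.55
  \quad \text{for every } v = (\alpha, 1 - \alpha) \in \partial f(\bar{x}), \ \alpha \in [0, 1].
\end{align*}
% The hypothesis \varepsilon < (1 - \beta)|d| reads 0.1 < 0.778\,(1 - \beta),
% i.e. \beta < 0.871 (approximately). For any such \beta,
%   -\beta |d|^2 > -0.871 \cdot 0.605 \approx -0.527 > -0.55 = \langle d, v \rangle,
% in agreement with the conclusion \langle d, v \rangle < -\beta |d|^2.
```

In this instance the conclusion in fact holds for any $\beta < 0.55 / 0.605 \approx 0.909$, so the lemma's hypothesis is sufficient but not tight here.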
To guarantee convergence, we must show that, except in the case of $0 \in \partial f(x_k)$, the algorithm
will always be able to find a sampling radius that satisfies the requirements in Step 2. In
Section 4 we show that (for three different approximate gradients) the value $\varepsilon$ (in Lemma 2.2)
is linked to $\Delta$. As unsuccessful line searches will drive $\Delta$ to zero, this implies that eventually
the requirements of Lemma 2.2 will be satisfied. We formalize this in the next two theorems.
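Theorem 2.4. Let $f = \max\{f_i : i = 1, \ldots, N\}$ where each $f_i \in C^1$. Suppose $0 \notin \partial f(x_k)$ for
each iteration $k$. Suppose there exists $\bar{K} > 0$ such that given any set of points generated in Step
1 of the AGS algorithm, the approximate gradient satisfies $|\nabla_A f_i(x_k) - \nabla f_i(x_k)| \le \bar{K} \Delta_k$ for all
$i = 1, \ldots, N$. Let $d_k = -\operatorname{Proj}(0 \mid G(x_k))$. Then for any $\mu > 0$, there exists $\bar{\Delta} = \bar{\Delta}(x_k) > 0$ such
that,
\[ \Delta \le \mu |d_k| + \bar{K} \mu (\Delta_k - \Delta) \quad \text{for all } 0 < \Delta < \bar{\Delta}. \]
Moreover, if $\Delta_k < \bar{\Delta}$, then the following inequality holds
\[ \Delta_k \le \mu |d_k|. \]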
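Proof. Let $\bar{v} = \operatorname{Proj}(0 \mid \partial f(x_k))$ (by assumption, $\bar{v} \ne 0$).
Given $\mu > 0$, let
\[ \bar{\Delta} = \frac{1}{\bar{K} + \frac{1}{\mu}} |\bar{v}|, \tag{9} \]
and consider $0 < \Delta < \bar{\Delta}$. Now create $G(x_k)$ and $d_k = -\operatorname{Proj}(0 \mid G(x_k))$. As $-d_k \in G(x_k)$, by
Lemma 2.1(1), there exists a $v_k \in \partial f(x_k)$ such that
\[ |-d_k - v_k| \le \bar{K} \Delta_k. \]
Then
\begin{align*}
& \bar{K} \Delta_k \ge |-d_k - v_k| \\
\Rightarrow \ & \bar{K} \Delta_k \ge |v_k| - |d_k| \\
\Rightarrow \ & \bar{K} \Delta_k \ge |\bar{v}| - |d_k| \quad (\text{as } |v| \ge |\bar{v}| \text{ for all } v \in \partial f(x_k)).
\end{align*}
Thus, for $0 < \Delta < \bar{\Delta}$, we apply equation (9) to $|\bar{v}|$ in the above inequality to get
\[ \bar{K} \Delta_k \ge \Big( \bar{K} + \frac{1}{\mu} \Big) \Delta - |d_k|, \]
which rearranges to
\[ \Delta \le \mu |d_k| + \bar{K} \mu (\Delta_k - \Delta). \]
Hence, $\Delta \le \mu |d_k| + \bar{K} \mu (\Delta_k - \Delta)$ for all $0 < \Delta < \bar{\Delta}$. Finally, if $\Delta_k < \bar{\Delta}$, then taking $\Delta = \Delta_k$ gives
\[ \Delta_k \le \mu |d_k|. \]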
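The threshold in equation (9) is a simple closed-form expression, and evaluating it for a few values of $\mu$ shows how the admissible sampling radius shrinks as the accuracy measure is reduced. The function name and the sample values of $\bar{K}$ and $|\bar{v}|$ below are assumptions for illustration only; in the DFO setting neither quantity is available to the algorithm, so this is an analysis aid rather than something an implementation would compute.

```python
def delta_bar(k_bar, mu, v_bar_norm):
    """Threshold from equation (9): |v_bar| / (K_bar + 1/mu)."""
    return v_bar_norm / (k_bar + 1.0 / mu)


# Assumed values: K_bar = 2 and |v_bar| = 1.
for mu in (1.0, 0.5, 0.25):
    print(mu, delta_bar(2.0, mu, 1.0))
```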
Remark 2.5. In Theorem 2.4, it is important to note that eventually the condition $\Delta_k < \bar{\Delta}$ will
hold. Examine $\bar{\Delta}$ as constructed above: $\bar{K}$ is a constant and $\bar{v}$ is associated with the current
iterate. However, the current iterate is only updated when a line search success occurs, which
will not occur unless the condition $\Delta_k \le \mu_k |d_k|$ is satisfied. As a result, if $\Delta_k \ge \bar{\Delta}$, the AGS
algorithm will reduce $\Delta_k$, with $\bar{\Delta}$ remaining constant, until $\Delta_k < \bar{\Delta}$.
Recall in Step 3 of the AGS algorithm, for a given $\eta \in (0, 1)$, we attempt to find a step
length $t_k > 0$ such that
\[ f(x_k + t_k d_k) < f(x_k) - \eta t_k |d_k|^2. \]
The following result shows that eventually the above inequality will hold in the AGS algorithm.
Recall that the exact subdifferential for a finite max function, as defined in (2), is a compact
set. Thus, we know that in the following theorem $\tilde{v}$ is well-defined.
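Theorem 2.6. Fix $0 < \eta < 1$. Let $f = \max\{f_i : i = 1, \ldots, N\}$ where each $f_i \in C^1$. Suppose
there exists an $\varepsilon > 0$ such that $|\nabla_A f_i(\bar{x}) - \nabla f_i(\bar{x})| \le \varepsilon$ for all $i = 1, \ldots, N$. Define $d =
-\operatorname{Proj}(0 \mid G(\bar{x}))$ and suppose $|d| \ne 0$. Let $\tilde{v} \in \arg\max\{\langle d, v \rangle : v \in \partial f(\bar{x})\}$. Let $\beta = \frac{2\eta}{1 + \eta}$. If
$\varepsilon < (1 - \beta)|d|$, then there exists $\bar{t} > 0$ such that
\[ f(\bar{x} + td) - f(\bar{x}) < -\eta t |d|^2 \quad \text{for all } 0 < t < \bar{t}. \]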
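Proof. Note that $\beta \in (0, 1)$. Recall, from Lemma 2.2, we have for all $v \in \partial f(\bar{x})$
\[ \langle d, v \rangle < -\beta |d|^2. \tag{10} \]
Using $\beta = \frac{2\eta}{1 + \eta}$, equation (10) becomes
\[ \langle d, v \rangle < -\frac{2\eta}{1 + \eta} |d|^2 \quad \text{for all } v \in \partial f(\bar{x}). \tag{11} \]
From equation (11) we can conclude that for all $v \in \partial f(\bar{x})$
\[ \langle d, v \rangle < 0. \]
Notice that
\[ \lim_{\tau \searrow 0} \frac{f(\bar{x} + \tau d) - f(\bar{x})}{\tau} = \max\{\langle d, v \rangle : v \in \partial f(\bar{x})\} = \langle d, \tilde{v} \rangle < 0. \]
Therefore, there exists $\bar{t} > 0$ such that
\[ \frac{f(\bar{x} + td) - f(\bar{x})}{t} < \frac{\eta + 1}{2} \langle d, \tilde{v} \rangle \quad \text{for all } 0 < t < \bar{t}. \]
For such a $t$, we have
\begin{align*}
f(\bar{x} + td) - f(\bar{x}) &< \frac{\eta + 1}{2} t \langle d, \tilde{v} \rangle \\
&< -\frac{\eta + 1}{2} \cdot \frac{2\eta}{\eta + 1} t |d|^2 \\
&= -\eta t |d|^2.
\end{align*}
Hence,
\[ f(\bar{x} + td) - f(\bar{x}) < -\eta t |d|^2 \quad \text{for all } 0 < t < \bar{t}. \]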
Combining the previous results, we can show that the AGS algorithm is guaranteed to find
function value decrease (provided $0 \notin \partial f(x_k)$). We summarize with the following corollary.
Corollary 2.7. Let $f = \max\{f_i : i = 1, \ldots, N\}$ where each $f_i \in C^1$. Suppose $0 \notin \partial f(x_k)$ for
each iteration $k$. Suppose there exists a $\bar{K} > 0$ such that given any set of points generated in
Step 1 of the AGS algorithm, the approximate gradient satisfies $|\nabla_A f_i(x_k) - \nabla f_i(x_k)| \le \bar{K} \Delta_k$
for all $i = 1, \ldots, N$. Then after a finite number of iterations, the algorithm will find a new
iterate with a lower function value.
Proof. Consider $x_k$, where $0 \notin \partial f(x_k)$.
To find function value decrease with the AGS algorithm, we must declare a line search success
in Step 3. The AGS algorithm will only carry out a line search if the condition below is satisfied
\[ \Delta_k \le \mu_k |d_k|, \tag{12} \]
where $d_k = -\operatorname{Proj}(0 \mid G(x_k))$, as usual. In Theorem 2.4, we showed that for any $\mu_k > 0$, there
exists a $\bar{\Delta} = \bar{\Delta}(x_k) > 0$ such that if $\Delta_k < \bar{\Delta}(x_k)$, then equation (12) is satisfied. If equation
(12) is not satisfied, then $\Delta_k$ is updated according to equation (4) and $x_{k+1} = x_k$, which further
implies $\bar{\Delta} = \bar{\Delta}(x_{k+1}) = \bar{\Delta}(x_k)$ is unchanged. In this case, whether $|d_k| \ne 0$ or $|d_k| = 0$, we
can see that $\Delta_{k+1} \le \theta \Delta_k$. Hence an infinite sequence of equation (12) being unsatisfied is
impossible (as eventually we would have $\Delta_k < \bar{\Delta}$). So eventually equation (12) will be satisfied
and the AGS algorithm will carry out a line search.
Now, in order to have a line search success, we must be able to find a step length $t_k$ such
that the Armijo-like condition holds,
\[ f(x_k + t_k d_k) < f(x_k) - \eta t_k |d_k|^2. \]
In Theorem 2.6, we showed that there exists $\bar{t} > 0$ such that
\[ f(x_k + t_k d_k) - f(x_k) < -\eta t_k |d_k|^2 \quad \text{for all } 0 < t_k < \bar{t}, \]
provided that for $\beta \in (0, 1)$,
\[ \varepsilon < (1 - \beta)|d_k|. \tag{13} \]
Set $\varepsilon = \bar{K} \Delta_k$. If equation (13) does not hold, then a line search failure will occur, resulting in
$\mu_{k+1} = 0.5 \mu_k$. Thus, eventually we will have $\mu_k < \frac{(1 - \beta)}{\bar{K}}$ and
\[ \Delta_k \le \mu_k |d_k| < \frac{(1 - \beta)}{\bar{K}} |d_k|, \]
which means equation (13) will hold. Thus, after a finite number of iterations, the AGS
algorithm will declare a line search success and find a new iterate with a lower function value.
We are now ready to prove convergence. In particular, we study the limiting case of the
algorithm generating an infinite sequence (i.e., the situation with $\varepsilon_{tol} = 0$). In the following,
assuming that the step length $t_k$ is bounded away from 0 means that there exists a $\bar{t} > 0$ such
that $t_k > \bar{t}$ for all $k$.
Theorem 2.8. Let $f = \max\{f_i : i = 1, \ldots, N\}$ where each $f_i \in C^1$. Set $\varepsilon_{tol} = 0$ and suppose
that $\{x_k\}_{k=0}^{\infty}$ is an infinite sequence generated by the AGS algorithm. Suppose there exists
a $\bar{K} > 0$ such that given any set of points generated in Step 1 of the AGS algorithm, the
approximate gradient satisfies the error bound $|\nabla_A f_i(x_k) - \nabla f_i(x_k)| \le \bar{K} \Delta_k$ for all $i = 1, \ldots, N$.
Suppose $t_k$ is bounded away from 0. Then either
1. $f(x_k) \downarrow -\infty$, or
2. $|d_k| \to 0$, $\Delta_k \to 0$ and every cluster point $\bar{x}$ of the sequence $\{x_k\}_{k=0}^{\infty}$ satisfies $0 \in \partial f(\bar{x})$.
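Proof. If $f(x_k) \downarrow -\infty$, then we are done.
Conversely, if $f(x_k)$ is bounded below, then $f(x_k)$ is non-increasing and bounded below,
therefore $f(x_k)$ converges. We consider two cases.
Case 1: An infinite number of line search successes occur.
Let $\bar{x}$ be a cluster point of $\{x_k\}_{k=0}^{\infty}$. Notice that $x_k$ only changes for line search successes, so
there exists a subsequence $\{x_{k_j}\}_{j=0}^{\infty}$ of line search successes such that $x_{k_j} \to \bar{x}$. Then for each
corresponding step length $t_{k_j}$ and direction $d_{k_j}$, the following condition holds
\[ f(x_{k_j + 1}) \le f(x_{k_j} + t_{k_j} d_{k_j}) < f(x_{k_j}) - \eta t_{k_j} |d_{k_j}|^2. \]
Note that
\[ 0 \le \eta t_{k_j} |d_{k_j}|^2 < f(x_{k_j}) - f(x_{k_j + 1}). \]
Since $f(x_k)$ converges we know that $f(x_{k_j}) - f(x_{k_j + 1}) \to 0$. Since $t_{k_j}$ is bounded away from 0,
we see that
\[ \lim_{j \to \infty} |d_{k_j}| = 0. \]
Recall from the AGS algorithm, we check the condition
\[ \Delta_{k_j} \le \mu_{k_j} |d_{k_j}|. \]
As $\Delta_{k_j} > 0$, $\mu_{k_j} \le \mu_0$, and $|d_{k_j}| \to 0$, we conclude that $\Delta_{k_j} \to 0$.
Finally, from Lemma 2.1(1), as $-d_{k_j} \in G(x_{k_j})$, there exists a $v_{k_j} \in \partial f(x_{k_j})$ such that
\begin{align*}
& |-v_{k_j} - d_{k_j}| \le \bar{K} \Delta_{k_j} \\
\Rightarrow \ & |-v_{k_j}| - |d_{k_j}| \le \bar{K} \Delta_{k_j} \\
\Rightarrow \ & |v_{k_j}| \le \bar{K} \Delta_{k_j} + |d_{k_j}|,
\end{align*}
which implies that
\[ 0 \le |v_{k_j}| \le \bar{K} \Delta_{k_j} + |d_{k_j}| \to 0. \]
So,
\[ \lim_{j \to \infty} |v_{k_j}| = 0, \]
where $|v_{k_j}| \ge \operatorname{dist}(0 \mid \partial f(x_{k_j})) \ge 0$, which implies $\operatorname{dist}(0 \mid \partial f(x_{k_j})) \to 0$. We have $x_{k_j} \to \bar{x}$. As $f$
is a finite max function, $\partial f$ is outer semicontinuous (see [35, Definition 5.4 & Proposition 8.7]).
Hence, every cluster point $\bar{x}$ of a convergent subsequence of $\{x_k\}_{k=0}^{\infty}$ satisfies $0 \in \partial f(\bar{x})$.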
Case 2: A finite number of line search successes occur.
This means there exists a $\bar{k}$ such that $x_k = x_{\bar{k}} = \bar{x}$ for all $k \ge \bar{k}$. However, by Corollary 2.7,
if $0 \notin \partial f(\bar{x})$, then after a finite number of iterations, the algorithm will find function value
decrease (line search success). Hence, we have $0 \in \partial f(\bar{x})$.
To see $\Delta_k \to 0$ and $|d_k| \to 0$, note that by Lemma 2.1(2) and $0 \in \partial f(\bar{x})$, we have that for all
$k > \bar{k}$ there exists $d \in G(x_k)$ such that $|d - 0| \le \bar{K} \Delta_k$. In particular, $|d_k| = |\operatorname{Proj}(0 \mid G(x_k))| \le
\bar{K} \Delta_k \le \bar{K} \Delta_0$ for all $k > \bar{k}$. Now note that one of two situations must occur: either equation
(3) is unsatisfied an infinite number of times or after a finite number of steps equation (3) is
always satisfied and a line search failure occurs in Step 3. In the first case, we directly have
$\Delta_k \to 0$ (by Step 2). In the second case, we have $\mu_k \to 0$ (by Step 3), so $\Delta_k \le \mu_k |d_k| \le \mu_k \bar{K} \Delta_0$
(by equation (3)) implies $\Delta_k \to 0$. Finally, $|d_k| \le \bar{K} \Delta_k$ completes the proof.
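Our last result shows that if the algorithm terminates in Step 2, then the distance from 0
to the exact subdifferential is controlled by $\varepsilon_{tol}$.
Theorem 2.9. Let $f = \max\{f_i : i = 1, \ldots, N\}$ where each $f_i \in C^1$. Suppose there exists a
$\bar{K} > 0$ such that for each iteration $k$, the approximate gradient satisfies $|\nabla_A f_i(x_k) - \nabla f_i(x_k)| \le
\bar{K} \Delta_k$ for all $i = 1, \ldots, N$. Suppose the AGS algorithm terminates at some iteration $\bar{k}$ in Step
2 for $\varepsilon_{tol} > 0$. Then
\[ \operatorname{dist}(0 \mid \partial f(x_{\bar{k}})) < (1 + \bar{K} \mu_0) \varepsilon_{tol}. \]
Proof. Let $\bar{w} = \operatorname{Proj}(0 \mid G(x_{\bar{k}}))$. We use $\bar{v} \in \partial f(x_{\bar{k}})$ as constructed in Lemma 2.1(1) to see that
\begin{align*}
\operatorname{dist}(0 \mid \partial f(x_{\bar{k}})) &\le \operatorname{dist}(0 \mid \bar{w}) + \operatorname{dist}(\bar{w} \mid \bar{v}) \\
&= |d_{\bar{k}}| + |\bar{w} - \bar{v}| \\
&\le |d_{\bar{k}}| + \bar{K} \Delta_{\bar{k}} \\
&< \varepsilon_{tol} + \bar{K} \Delta_{\bar{k}}.
\end{align*}
The final statement now follows by the test $\Delta_{\bar{k}} \le \mu_{\bar{k}} |d_{\bar{k}}| \le \mu_0 \varepsilon_{tol}$ in Step 2.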
3 Robust Approximate Gradient Sampling Algorithm
The AGS algorithm depends on the active set of functions at each iterate, A(xk). Of course, it
is possible at various times in the algorithm for there to be functions that are inactive at the
current iterate, but active within a small radius of the current iterate. Typically such behaviour
means that the current iterate is close to a ‘nondifferentiable ridge’ formed by the function. In
[6] and [7], it is suggested that allowing an algorithm to take into account these ‘almost active’
functions will provide a better idea of what is happening at and around the current iterate,
thus, making the algorithm more robust.
In this section we present the robust approximate gradient sampling algorithm (RAGS algorithm).
Specifically, we adapt the AGS algorithm by expanding our active set to include all functions that
are active at any of the points in the set $Y$. Recall from the AGS algorithm that the set $Y$ is
sampled from within a ball of radius $\Delta_k$. Thus, the points in $Y$ are not far from the current
iterate. We define the robust active set next.
Definition 3.1. Let $f = \max\{f_i : i = 1, \ldots, N\}$ where $f_i \in C^1$. Let $y_0 = x_k$ be the current
iterate and $Y = [y_0, y_1, y_2, \ldots, y_m]$ be a set of randomly sampled points from a ball centered at
$y_0$ with radius $\Delta_k$. The robust active set of $f$ on $Y$ is
\[ A(Y) = \bigcup_{y_j \in Y} A(y_j). \]
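The definition translates directly into a union over the sample. In the sketch below the helper names are hypothetical and the sample points are fixed by hand rather than randomly drawn, so that the output is reproducible.

```python
def active_set(fs, x):
    """Indices (0-based) of the functions attaining the maximum at x."""
    values = [fi(x) for fi in fs]
    f_max = max(values)
    return {i for i, v in enumerate(values) if v == f_max}


def robust_active_set(fs, sample):
    """Union of A(y_j) over all points y_j of the sample Y (Y contains x_k)."""
    indices = set()
    for y in sample:
        indices |= active_set(fs, y)
    return indices


# f(x) = max{x, -x} = |x| has a nondifferentiable ridge at 0.
fs = [lambda x: x[0], lambda x: -x[0]]
x_k = [0.05]
Y_large = [x_k, [0.05 + 0.1], [0.05 - 0.1]]    # radius 0.1 reaches across the ridge
Y_small = [x_k, [0.05 + 0.01], [0.05 - 0.01]]  # radius 0.01 stays on one side
print(robust_active_set(fs, Y_large), robust_active_set(fs, Y_small))
```

With the larger radius the second function is picked up as 'almost active'; with the smaller radius the robust active set coincides with the ordinary active set at `x_k`.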
3.1 Algorithm - RAGS
Using the idea of the robust active set, we alter the AGS algorithm to accommodate the robust
active set by replacing Steps 1 and 2 with the following.
1. Generate Approximate Subdifferential $G_k^Y$ (Robust):
   Generate a set $Y = [x_k, y_1, \ldots, y_m]$ around the current iterate $x_k$ such that
   \[ \max_{j=1,\ldots,m} |y_j - x_k| \le \Delta_k. \]
   Use $Y$ to calculate the approximate gradient of $f_i$, denoted $\nabla_A f_i$, at $x_k$ for each $i \in A(Y)$.
   Then set $G_k = \operatorname{conv}\{\nabla_A f_i(x_k)\}_{i \in A(x_k)}$ and $G_k^Y = \operatorname{conv}\{\nabla_A f_i(x_k)\}_{i \in A(Y)}$.
2. Generate Search Direction:
   Let
   \[ d_k = -\operatorname{Proj}(0 \mid G_k). \]
   Let
   \[ d_k^Y = -\operatorname{Proj}(0 \mid G_k^Y). \]
   Check if
   \[ \Delta_k \le \mu_k |d_k|. \tag{14} \]
   If (14) does not hold, then set $x_{k+1} = x_k$,
   \[ \Delta_{k+1} = \begin{cases} \theta \mu_k |d_k| & \text{if } |d_k| \ne 0 \\ \theta \Delta_k & \text{if } |d_k| = 0 \end{cases}, \]
   $k = k + 1$ and return to Step 1. If (14) holds and $|d_k| < \varepsilon_{tol}$, then STOP. Else, continue
   to the line search, using $d_k^Y$ as a search direction.
Notice that in Step 2 we still use the stopping conditions from Section 2. Although this
modification requires the calculation of two projections, it should be noted that neither of
these projections is particularly difficult to calculate. In Section 3.3, we use the Goldstein
approximate subdifferential to adapt Theorem 2.9 to work for stopping conditions based on $d_k^Y$,
but we still do not have theoretical results for the exact subdifferential. It is important to note
that no additional function evaluations are required for this modification.
In the numerics section, we test each version of our algorithm using the robust descent direction
to check the stopping conditions. This alteration shows convincing results that the robust
stopping conditions not only guarantee convergence, but significantly decrease the number of
function evaluations required for the algorithm to converge.
3.2 Convergence
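To show that the RAGS algorithm is well-defined we will require the fact that when $\Delta_k$ is small
enough, the robust active set is in fact equal to the original active set.
Lemma 3.2. Let $f = \max\{f_i : i = 1, \ldots, N\}$ where each $f_i \in C^1$. Let $Y = [\bar{x}, y_1, \ldots, y_m]$ be a
randomly sampled set from a ball centered at $\bar{x}$ with radius $\Delta$. Then there exists an $\tilde{\varepsilon} > 0$ such
that if $Y \subset B_{\tilde{\varepsilon}}(\bar{x})$, then $A(\bar{x}) = A(Y)$.
Proof. Clearly, if $i \in A(\bar{x})$, then $i \in A(Y)$ as $\bar{x} \in Y$.
Consider $i \notin A(\bar{x})$. Then by the definition of $f$, we have that
\[ f_i(\bar{x}) < f(\bar{x}). \]
By definition, $f$ is continuous, thus, there exists an $\tilde{\varepsilon}_i > 0$ such that for all $z \in B_{\tilde{\varepsilon}_i}(\bar{x})$,
\[ f_i(z) < f(z). \]
If $\Delta < \tilde{\varepsilon}_i$, then we have $|y_j - \bar{x}| < \tilde{\varepsilon}_i$ for all $j = 1, \ldots, m$. Therefore,
\[ f_i(y_j) < f(y_j) \quad \text{for all } j = 1, \ldots, m, \tag{15} \]
so $i \notin A(Y)$. Setting $\tilde{\varepsilon} = \min_{i \notin A(\bar{x})} \tilde{\varepsilon}_i$ completes the proof.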
Using Lemma 3.2, we can easily conclude that the AGS algorithm is still well-defined when
using the robust active set.
Corollary 3.3. Let $f = \max\{f_i : i = 1, \ldots, N\}$ where each $f_i \in C^1$. Suppose $0 \notin \partial f(x_k)$ for
each iteration $k$. Suppose there exists a $\bar{K} > 0$ such that given any set of points generated in
Step 1 of the RAGS algorithm, the approximate gradient satisfies $|\nabla_A f_i(x_k) - \nabla f_i(x_k)| \le \bar{K} \Delta_k$
for all $i = 1, \ldots, N$. Then after a finite number of iterations, the RAGS algorithm will find
function value decrease.
Proof. Consider $x_k$, where $0 \notin \partial f(x_k)$.
For eventual contradiction, suppose we do not find function value decrease. In the RAGS
algorithm, this corresponds to an infinite number of line search failures. If we have an infinite
number of line search failures, then $\Delta_k \to 0$, as $|d_k|$ is bounded, and $x_{\bar{k}} = x_k$ for all $\bar{k} \ge k$.
In Lemma 3.2, $\tilde{\varepsilon}$ depends only on $x_k$. Hence, we can conclude that eventually $\Delta_k < \tilde{\varepsilon}$ and
therefore $Y \subset B_{\tilde{\varepsilon}}(x_k)$. Thus, eventually $A(x_k) = A(Y_k)$. Once the two active sets are equal,
the results of Section 2 will hold.
To examine convergence of the RAGS algorithm we use the result that eventually the robust
active set at the current iterate will be a subset of the regular active set at any cluster point of
the algorithm.
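Lemma 3.4. Let $f = \max\{f_i : i = 1, \dots, N\}$ where each $f_i \in C^1$. Let $Y^k = [x^k, y^1, \dots, y^m]$
be a randomly sampled set from a ball centered at $x^k$ with radius $\Delta^k$. Let $x^k \to \bar{x}$. Then there
exists an $\tilde{\varepsilon} > 0$ such that if $Y^k \subseteq B_{\tilde{\varepsilon}}(\bar{x})$, then $A(Y^k) \subseteq A(\bar{x})$.
Proof. Let $i \notin A(\bar{x})$. We must show that for $k$ sufficiently large $i \notin A(Y^k)$.
By definition of $f$, we have that
$$f_i(\bar{x}) < f(\bar{x}).$$
Since $f$ is continuous, there exists an $\tilde{\varepsilon}_i > 0$ such that for all $z \in B_{\tilde{\varepsilon}_i}(\bar{x})$
$$f_i(z) < f(z).$$
If $Y^k \subseteq B_{\tilde{\varepsilon}_i}(\bar{x})$, then we have $|x^k - \bar{x}| < \tilde{\varepsilon}_i$ and $|y^j - \bar{x}| < \tilde{\varepsilon}_i$ for all $j = 1, \dots, m$. Therefore
$$f_i(x^k) < f(x^k)$$
and
$$f_i(y^j) < f(y^j) \quad \text{for all } j = 1, \dots, m.$$
Thus, if $Y^k \subseteq B_{\tilde{\varepsilon}_i}(\bar{x})$, then $i \notin A(Y^k)$. Letting $\tilde{\varepsilon} = \min_i \tilde{\varepsilon}_i$ completes the proof.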
Now we examine the convergence of the RAGS algorithm.
Theorem 3.5. Let $f = \max\{f_i : i = 1, \dots, N\}$ where each $f_i \in C^1$. Set $\varepsilon_{tol} = 0$ and suppose
that $\{x^k\}_{k=0}^{\infty}$ is an infinite sequence generated by the RAGS algorithm. Suppose there exists
a $\bar{K} > 0$ such that given any set of points generated in Step 1 of the RAGS algorithm, the
approximate gradient satisfies the error bound $|\nabla_A f_i(x^k) - \nabla f_i(x^k)| \le \bar{K}\Delta^k$ for all $i = 1, \dots, N$.
Suppose $t_k$ is bounded away from 0. Then either
1. $f(x^k) \downarrow -\infty$, or
2. $|d^k| \to 0$, $\Delta^k \to 0$ and every cluster point $\bar{x}$ of the sequence $\{x^k\}_{k=0}^{\infty}$ satisfies $0 \in \partial f(\bar{x})$.
Proof. If $f(x^k) \downarrow -\infty$, then we are done.
Otherwise, $f(x^k)$ is non-increasing and bounded below, and therefore $f(x^k)$ converges.
We consider two cases.
Case 1: An infinite number of line search successes occur.
Let $\bar{x}$ be a cluster point of $\{x^k\}_{k=0}^{\infty}$. Notice that $x^k$ only changes for line search successes,
so there exists a subsequence $\{x^{k_j}\}_{j=0}^{\infty}$ of line search successes such that $x^{k_j} \to \bar{x}$. Following
the arguments of Theorem 2.8, we have $|d^{k_j}| \to 0$ and $\Delta^{k_j} \to 0$. Notice that if $\Delta^{k_j} \to 0$, then
eventually $Y^{k_j} \subseteq B_{\tilde{\varepsilon}}(\bar{x})$, where $x^{k_j} \to \bar{x}$ and $\tilde{\varepsilon}$ is defined as in Lemma 3.4. Thus, by Lemma
3.4, we have that $A(Y^{k_j}) \subseteq A(\bar{x})$. This means that $G_{Y^{k_j}}(x^{k_j})$ is formed from a subset of the
approximated active gradients, related to $f(\bar{x})$. Thus, by Lemma 2.1(1), as $-d^{k_j} \in G_{Y^{k_j}}(x^{k_j})$,
we can construct a $v^{k_j} \in \partial f(\bar{x})$ from the same set of approximated active gradients, related to
$G_{Y^{k_j}}(x^{k_j})$, such that
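$$
\begin{aligned}
|-d^{k_j} - v^{k_j}|
&= \Big| \sum_{i \in A(Y^{k_j})} \alpha_i \nabla_A f_i(x^{k_j}) - \sum_{i \in A(Y^{k_j})} \alpha_i \nabla f_i(\bar{x}) \Big| \\
&\le \sum_{i \in A(Y^{k_j})} \alpha_i \, |\nabla_A f_i(x^{k_j}) - \nabla f_i(\bar{x})| \\
&\le \sum_{i \in A(Y^{k_j})} \big( \alpha_i \, |\nabla_A f_i(x^{k_j}) - \nabla f_i(x^{k_j})| + \alpha_i \, |\nabla f_i(x^{k_j}) - \nabla f_i(\bar{x})| \big) \\
&\le \sum_{i \in A(Y^{k_j})} \big( \alpha_i \bar{K} \Delta^{k_j} + \alpha_i \, |\nabla f_i(x^{k_j}) - \nabla f_i(\bar{x})| \big) \\
&\le \bar{K} \Delta^{k_j} + \sum_{i \in A(Y^{k_j})} \alpha_i \, |\nabla f_i(x^{k_j}) - \nabla f_i(\bar{x})|.
\end{aligned}
$$
Using the Triangle Inequality, we have that
$$|v^{k_j}| - |d^{k_j}| \le \bar{K} \Delta^{k_j} + \sum_{i \in A(Y^{k_j})} \alpha_i \, |\nabla f_i(x^{k_j}) - \nabla f_i(\bar{x})|.$$
We already showed that $|d^{k_j}| \to 0$ and $\Delta^{k_j} \to 0$. Furthermore, since $\nabla f_i$ is continuous and
$x^{k_j} \to \bar{x}$, we have $|\nabla f_i(x^{k_j}) - \nabla f_i(\bar{x})| \to 0$. So,
$$\lim_{j \to \infty} |v^{k_j}| = 0.$$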
Using the same arguments as in Theorem 2.8, the result follows.
Case 2: A finite number of line search successes occur.
This means there exists a $\bar{k}$ such that $x^k = x^{\bar{k}} = \bar{x}$ for all $k \ge \bar{k}$. However, by Corollary 3.3,
if $0 \notin \partial f(\bar{x})$, then after a finite number of iterations, the algorithm will find function value
decrease (line search success). Hence, we have $0 \in \partial f(\bar{x})$. This implies $\Delta^k \to 0$ and $|d^k| \to 0$, as
in the proof of Theorem 2.8.
Remark 3.6. Using $d^k$ to check our stopping conditions allows the result of Theorem 2.9 to still
hold.
3.3 Robust Stopping with Goldstein Approximate Subdifferential
We want to provide some insight as to how Theorem 2.9 can work for stopping conditions based
on $d^k_Y$, that is, replacing the stopping conditions $\Delta^k \le \mu^k |d^k|$ and $|d^k| < \varepsilon_{tol}$ in Step 2 with the
robust stopping conditions
$$\Delta^k \le \mu^k |d^k_Y| \quad \text{and} \quad |d^k_Y| < \varepsilon_{tol}. \tag{16}$$
In the situation when the algorithm terminates, the following proposition does not theoretically
justify why the stopping conditions are sufficient, but does help explain their logic.
Theoretically, since we do not know what $\bar{x}$ is, we cannot tell when $A(Y) \subseteq A(\bar{x})$. However, as
seen above, we do know that if $x^k \to \bar{x}$, then eventually $A(Y^k) \subseteq A(\bar{x})$.
Proposition 3.7. Let $f = \max\{f_i : i = 1, \dots, N\}$ where each $f_i \in C^1$. Suppose there exists
a $\bar{K} > 0$ such that $|\nabla_A f_i(x) - \nabla f_i(x)| \le \bar{K}\Delta^k$ for all $i = 1, \dots, N$ and for all $x \in B_{\Delta^k}(x^k)$.
Suppose the RAGS algorithm terminates at some iteration $\bar{k}$ in Step 2 using the robust stopping
conditions given in (16). Furthermore, suppose there exists $\bar{x} \in B_{\Delta^{\bar{k}}}(x^{\bar{k}})$ such that
$A(Y^{\bar{k}}) \subseteq A(\bar{x})$. Then
$$\mathrm{dist}(0 \mid \partial f(\bar{x})) < (1 + \bar{K}\mu^0)\,\varepsilon_{tol}.$$
Proof. If $A(Y^{\bar{k}}) \subseteq A(\bar{x})$, then the proofs of Lemma 2.1(1) and Theorem 2.9 still hold.
Additionally, in the following results, we approach the theory for robust stopping conditions
using the Goldstein approximate subdifferential. If the RAGS algorithm terminates in Step 2,
then it is shown that the distance between 0 and the Goldstein approximate subdifferential is
controlled by εtol. Again, this does not prove the robust stopping conditions are sufficient for
the exact subdifferential.
First, the Goldstein approximate subdifferential, as defined in [17], is given by the set
$$\partial^G_{\Delta} f(\bar{x}) = \mathrm{conv}\{\partial f(z) : z \in B_{\Delta}(\bar{x})\}. \tag{17}$$
We now show that the Goldstein approximate subdifferential contains all of the gradients
of the active functions in the robust active set.
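Lemma 3.8. Let $f = \max\{f_i : i = 1, \dots, N\}$. Let $Y = [y^0, y^1, y^2, \dots, y^m]$ be a randomly
sampled set from a ball centered at $y^0 = \bar{x}$ with radius $\Delta$. If $f_i \in C^1$ for each $i$, then
$$\partial^G_{\Delta} f(\bar{x}) \supseteq \mathrm{conv}\{\nabla f_i(y^j) : y^j \in Y,\ i \in A(y^j)\}.$$
Proof. If $f_i \in C^1$ for each $i \in A(Y)$, then by equation (2), for each $y^j \in Y$ we have
$$\partial f(y^j) = \mathrm{conv}\{\nabla f_i(y^j)\}_{i \in A(y^j)} = \mathrm{conv}\{\nabla f_i(y^j) : i \in A(y^j)\}.$$
Using this in our definition of the Goldstein approximate subdifferential in (17) and knowing
$B_{\Delta}(\bar{x}) \supseteq Y$, we have
$$\partial^G_{\Delta} f(\bar{x}) \supseteq \mathrm{conv}\{\mathrm{conv}\{\nabla f_i(y^j) : i \in A(y^j)\} : y^j \in Y\},$$
which simplifies to
$$\partial^G_{\Delta} f(\bar{x}) \supseteq \mathrm{conv}\{\nabla f_i(y^j) : y^j \in Y,\ i \in A(y^j)\}. \tag{18}$$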
Now we have a result similar to Lemma 2.1(1) for $d^k_Y$ with respect to the Goldstein approximate
subdifferential.
Remark 3.9. For the following two results, we assume each of the $f_i \in C^{1+}$ with Lipschitz
constant $L$. Note that this implies the Lipschitz constant $L$ is independent of $i$. If each
$f_i \in C^{1+}$ with Lipschitz constant $L_i$, then $L$ is easily obtained by $L = \max\{L_i : i = 1, \dots, N\}$.
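Lemma 3.10. Let $f = \max\{f_i : i = 1, \dots, N\}$ where $f_i \in C^{1+}$ with Lipschitz constant $L$. Let
$Y = [y^0, y^1, y^2, \dots, y^m]$ be a randomly sampled set from a ball centered at $y^0 = \bar{x}$ with radius $\Delta$.
Suppose there exists a $\bar{K} > 0$ such that $|\nabla_A f_i(\bar{x}) - \nabla f_i(\bar{x})| \le \bar{K}\Delta$. Then for all $w \in G_Y(\bar{x})$,
there exists a $g \in \partial^G_{\Delta} f(\bar{x})$ such that
$$|w - g| \le (\bar{K} + L)\Delta.$$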
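Proof. By definition, for all $w \in G_Y(\bar{x})$ there exists a set of $\alpha_i$ such that
$$w = \sum_{i \in A(Y)} \alpha_i \nabla_A f_i(\bar{x}), \quad \text{where } \alpha_i \ge 0, \ \sum_{i \in A(Y)} \alpha_i = 1.$$
By our assumption that each $f_i \in C^{1+}$, Lemma 3.8 holds. It is clear that for each $i \in A(Y)$,
$i \in A(y^j)$ for some $y^j \in Y$. Let $j_i$ be the index corresponding to this active index; i.e., $i \in A(y^{j_i})$.
Thus, for each $i \in A(Y)$, there is a corresponding active gradient
$$\nabla f_i(y^{j_i}) \in \mathrm{conv}\{\nabla f_i(y^{j_i}) : y^{j_i} \in Y,\ i \in A(y^{j_i})\} \subseteq \partial^G_{\Delta} f(\bar{x}).$$
Using the same $\alpha_i$ as above, we can construct
$$g = \sum_{i \in A(Y)} \alpha_i \nabla f_i(y^{j_i}) \in \mathrm{conv}\{\nabla f_i(y^{j_i}) : y^{j_i} \in Y,\ i \in A(y^{j_i})\} \subseteq \partial^G_{\Delta} f(\bar{x}).$$
Then
$$
\begin{aligned}
|w - g| &= \Big| \sum_{i \in A(Y)} \alpha_i \nabla_A f_i(\bar{x}) - \sum_{i \in A(Y)} \alpha_i \nabla f_i(y^{j_i}) \Big| \\
&\le \sum_{i \in A(Y)} \alpha_i \, |\nabla_A f_i(\bar{x}) - \nabla f_i(y^{j_i})| \\
&\le \sum_{i \in A(Y)} \alpha_i \big( |\nabla_A f_i(\bar{x}) - \nabla f_i(\bar{x})| + |\nabla f_i(\bar{x}) - \nabla f_i(y^{j_i})| \big) \\
&\le \sum_{i \in A(Y)} \alpha_i \big( \bar{K}\Delta + L \max_{j_i} |\bar{x} - y^{j_i}| \big) \\
&\le (\bar{K} + L)\Delta.
\end{aligned}
$$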
Thus, using Lemma 3.10, we can show that if the algorithm stops due to the robust stopping
conditions, then the distance from 0 to the Goldstein approximate subdifferential is controlled
by εtol.
Proposition 3.11. Let $f = \max\{f_i : i = 1, \dots, N\}$ where each $f_i \in C^{1+}$ with Lipschitz
constant $L$. Suppose there exists a $\bar{K} > 0$ such that for each iteration $k$, the approximate
gradient satisfies $|\nabla_A f_i(x^k) - \nabla f_i(x^k)| \le \bar{K}\Delta^k$ for all $i = 1, \dots, N$. Suppose the RAGS algorithm
terminates at some iteration $\bar{k}$ in Step 2 using the robust stopping conditions given in (16).
Then
$$\mathrm{dist}(0 \mid \partial^G_{\Delta^{\bar{k}}} f(x^{\bar{k}})) < [1 + \mu^0(\bar{K} + L)]\,\varepsilon_{tol}.$$
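Proof. Let $\bar{w} = \mathrm{Proj}(0 \mid G_{Y^{\bar{k}}}(x^{\bar{k}}))$. We use $\bar{g} \in \partial^G_{\Delta^{\bar{k}}} f(x^{\bar{k}})$ as constructed in Lemma 3.10 to see
that
$$
\begin{aligned}
\mathrm{dist}(0 \mid \partial^G_{\Delta^{\bar{k}}} f(x^{\bar{k}})) &\le \mathrm{dist}(0 \mid \bar{g}) \\
&\le \mathrm{dist}(0 \mid \bar{w}) + \mathrm{dist}(\bar{w} \mid \bar{g}) \\
&= |d^{\bar{k}}_Y| + |\bar{w} - \bar{g}| \\
&\le |d^{\bar{k}}_Y| + (\bar{K} + L)\Delta^{\bar{k}} \\
&< \varepsilon_{tol} + (\bar{K} + L)\Delta^{\bar{k}}.
\end{aligned}
$$
The statement now follows by the test $\Delta^{\bar{k}} \le \mu^{\bar{k}} |d^{\bar{k}}_Y|$ in Step 2 and the fact that $\mu^{\bar{k}} \le \mu^0$ as
$\{\mu^k\}_{k=0}^{\infty}$ is a non-increasing sequence, so that $\Delta^{\bar{k}} \le \mu^0 |d^{\bar{k}}_Y| < \mu^0 \varepsilon_{tol}$.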
4 Approximate Gradients
As seen in the previous two sections, in order for convergence to be guaranteed in the AGS or
RAGS algorithm, the approximate gradient used must satisfy an error bound for each of the
active $f_i$. Specifically, there must exist a $\bar{K} > 0$ such that
$$|\nabla_A f_i(x^k) - \nabla f_i(x^k)| \le \bar{K}\Delta^k,$$
where $\Delta^k = \max_j |y^j - x^k|$. In this section, we present three specific approximate gradients
that satisfy the above requirement: the simplex gradient, the centered simplex gradient and
the Gupal estimate of the gradient of the Steklov averaged function.
4.1 Simplex Gradient
The simplex gradient is a commonly used approximate gradient. In recent years, several
derivative-free algorithms have been proposed that use the simplex gradient ([5], [22], [12],
[11], and [19] among others). It is geometrically defined as the gradient of the linear interpolation
of $f$ over a set of $n + 1$ points in $\mathbb{R}^n$. Mathematically, we define it as follows.
Let $Y = [y^0, y^1, \dots, y^n]$ be a set of affinely independent points in $\mathbb{R}^n$. We say that $Y$ forms
the simplex $S = \mathrm{conv}\{Y\}$. Thus, $S$ is a simplex if it can be written as the convex hull of an
affinely independent set of $n + 1$ points in $\mathbb{R}^n$.
The simplex gradient of a function $f$ over the set $Y$ is given by
$$\nabla_s f(Y) = L^{-1} \delta f(Y),$$
where
$$L = \begin{bmatrix} y^1 - y^0 & \cdots & y^n - y^0 \end{bmatrix}^{\top}$$
and
$$\delta f(Y) = \begin{bmatrix} f(y^1) - f(y^0) \\ \vdots \\ f(y^n) - f(y^0) \end{bmatrix}.$$
The condition number of the simplex formed by $Y$ is given by $\|\hat{L}^{-1}\|$, where
$$\hat{L} = \frac{1}{\Delta} \begin{bmatrix} y^1 - y^0 & y^2 - y^0 & \cdots & y^n - y^0 \end{bmatrix}^{\top} \quad \text{and where} \quad \Delta = \max_{j = 1, \dots, n} |y^j - y^0|.$$
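As an illustration of the definition above, the following Python sketch computes a simplex gradient by solving $L g = \delta f(Y)$. The helper `solve` (plain Gaussian elimination with partial pivoting), the singularity threshold, and the example data are assumptions made for this sketch; they are not taken from the authors' MATLAB implementation.

```python
def solve(A, b):
    """Solve A g = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, A[r])) + [float(b[r])] for r in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[p][c]) < 1e-14:  # assumed threshold for a degenerate simplex
            raise ValueError("points are not affinely independent")
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    g = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * g[k] for k in range(r + 1, n))
        g[r] = (M[r][n] - s) / M[r][r]
    return g


def simplex_gradient(f, Y):
    """Simplex gradient of f over Y = [y0, y1, ..., yn]."""
    y0 = Y[0]
    # Rows of L are (yj - y0)^T; delta holds f(yj) - f(y0).
    L = [[yj[k] - y0[k] for k in range(len(y0))] for yj in Y[1:]]
    delta = [f(yj) - f(y0) for yj in Y[1:]]
    return solve(L, delta)


# For an affine function the linear interpolant is exact, so the simplex
# gradient should recover the true gradient (3, -2) up to rounding.
f = lambda x: 3.0 * x[0] - 2.0 * x[1] + 1.0
Y = [[0.0, 0.0], [0.1, 0.02], [-0.03, 0.1]]
g = simplex_gradient(f, Y)
print(g)
```

Solving the linear system directly avoids forming $L^{-1}$ explicitly. Computing the condition number $\|\hat{L}^{-1}\|$ is left out of this sketch.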
4.1.1 Convergence
The following result (by Kelley [23]) shows that there exists an appropriate error bound between
the simplex gradient and the exact gradient of our objective function. We note that the Lipschitz
constant used in the following theorem corresponds to $\nabla f_i$.
Theorem 4.1. Consider $f_i \in C^{1+}$ with Lipschitz constant $K_i$ for $\nabla f_i$. Let $Y = [y^0, y^1, \dots, y^n]$
form a simplex. Let
$$\hat{L} = \frac{1}{\Delta} \begin{bmatrix} y^1 - y^0 & y^2 - y^0 & \cdots & y^n - y^0 \end{bmatrix}^{\top}, \quad \text{where} \quad \Delta = \max_{j = 1, \dots, n} |y^j - y^0|.$$
Then the simplex gradient satisfies the error bound
$$|\nabla_s f_i(Y) - \nabla f_i(y^0)| \le \bar{K}\Delta,$$
where $\bar{K} = \frac{1}{2} K_i \sqrt{n}\, \|\hat{L}^{-1}\|$.
Proof. See [23, Lemma 6.2.1].
With the above error bound result, we conclude that convergence holds when using the
simplex gradient as an approximate gradient in both the AGS and RAGS algorithms.
Corollary 4.2. Let $f = \max\{f_i : i = 1, \dots, N\}$ where each $f_i \in C^{1+}$ with Lipschitz constant
$K_i$ for $\nabla f_i$. If the approximate gradient used in the AGS or RAGS algorithm is the simplex
gradient and $\|\hat{L}^{-1}\|$ is bounded above for each simplex gradient computed, then the results of
Theorems 2.4, 2.6, 2.8, 2.9 and 3.5 hold.
4.1.2 Algorithm - Simplex Gradient
In order to calculate a simplex gradient in Step 1, we generate a set $Y = [x^k, y^1, \dots, y^n]$ of
points in $\mathbb{R}^n$ and then check to see if $Y$ forms a well-poised simplex by calculating its condition
number, $\|\hat{L}^{-1}\|$. A bounded condition number ($\|\hat{L}^{-1}\| < n$) ensures a 'good' error bound
between the approximate gradient and the exact gradient.
If the set $Y$ does not form a well-poised simplex ($\|\hat{L}^{-1}\| \ge n$), then we resample. If $Y$
forms a well-poised simplex, then we calculate the simplex gradient of $f_i$ over $Y$ for each
$i \in A(x^k)$ and then set the approximate subdifferential equal to the convex hull of the active
simplex gradients. We note that the probability of generating a random matrix with a condition
number greater than $n$ is asymptotically constant [38]. Thus, randomly generating simplices
is a quick and practical option. Furthermore, notice that calculating the condition number
does not require function evaluations; thus, resampling does not affect the number of function
evaluations required by the algorithm.
4.2 Centered Simplex Gradient
The centered simplex gradient is the average of two simplex gradients. Although it requires more
function evaluations, it has the advantage that the error bound satisfied by the centered
simplex gradient is in terms of $\Delta^2$, rather than $\Delta$.
Let $Y = [y^0, y^1, \dots, y^n]$ form a simplex. We define the sets
$$Y^+ = [x, x + \tilde{y}^1, \dots, x + \tilde{y}^n]$$
and
$$Y^- = [x, x - \tilde{y}^1, \dots, x - \tilde{y}^n],$$
where $x = y^0$ and $\tilde{y}^i = y^i - y^0$ for $i = 1, \dots, n$. The centered simplex gradient is the average
of the two simplex gradients over the sets $Y^+$ and $Y^-$, i.e.,
$$\nabla_{CS} f(Y) = \frac{1}{2}\big(\nabla_S f(Y^+) + \nabla_S f(Y^-)\big).$$
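The averaging in this definition can be sketched in Python as follows. The solver `solve`, the function names, and the test function are assumptions made for illustration. The example uses a quadratic, for which symmetric differences are exact, so the centered simplex gradient should match the true gradient up to rounding while the plain simplex gradient does not.

```python
def solve(A, b):
    """Solve A g = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, A[r])) + [float(b[r])] for r in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    g = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * g[k] for k in range(r + 1, n))
        g[r] = (M[r][n] - s) / M[r][r]
    return g


def simplex_gradient(f, Y):
    """Simplex gradient of f over Y = [y0, y1, ..., yn]."""
    y0 = Y[0]
    L = [[yj[k] - y0[k] for k in range(len(y0))] for yj in Y[1:]]
    delta = [f(yj) - f(y0) for yj in Y[1:]]
    return solve(L, delta)


def centered_simplex_gradient(f, Y):
    """Average of the simplex gradients over Y+ and Y- built from Y."""
    x = Y[0]
    n = len(x)
    steps = [[yj[k] - x[k] for k in range(n)] for yj in Y[1:]]
    Y_plus = [x] + [[x[k] + s[k] for k in range(n)] for s in steps]
    Y_minus = [x] + [[x[k] - s[k] for k in range(n)] for s in steps]
    g_plus = simplex_gradient(f, Y_plus)
    g_minus = simplex_gradient(f, Y_minus)
    return [0.5 * (a + b) for a, b in zip(g_plus, g_minus)]


# Quadratic test function; its gradient at (1, -1) is (-1, -1).
f = lambda v: v[0] ** 2 + 3.0 * v[0] * v[1] + 2.0 * v[1] ** 2
Y = [[1.0, -1.0], [1.1, -0.98], [0.97, -0.9]]
g_s = simplex_gradient(f, Y)
g_cs = centered_simplex_gradient(f, Y)
print(g_s, g_cs)
```

The centered version costs $2n$ function evaluations beyond $f(x)$ instead of $n$, which is the trade-off mentioned above.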
4.2.1 Convergence
To show that the AGS and RAGS algorithms are both well-defined when using the centered
simplex gradient as an approximate gradient, we provide an error bound between the centered
simplex gradient and the exact gradient (again by Kelley [23]).
Theorem 4.3. Consider $f_i \in C^{2+}$ with Lipschitz constant $K_i$ for $\nabla^2 f_i$. Let $Y = [y^0, y^1, \dots, y^n]$
form a simplex. Let
$$\hat{L} = \frac{1}{\Delta} [y^1 - y^0, \dots, y^n - y^0] \quad \text{where} \quad \Delta = \max_{j = 1, \dots, n} |y^j - y^0|.$$
Then the centered simplex gradient satisfies the error bound
$$|\nabla_{CS} f_i(Y) - \nabla f_i(y^0)| \le \bar{K}\Delta^2,$$
where $\bar{K} = K_i \sqrt{n}\, \|\hat{L}^{-1}\|$.
Proof. See [23, Lemma 6.2.5].
Notice that Theorem 4.3 requires $f_i \in C^{2+}$. If $f_i \in C^{1+}$, then the error bound is in terms
of $\Delta$, not $\Delta^2$. With the above error bound result, we conclude that convergence holds when
using the centered simplex gradient as an approximate gradient in both the AGS and RAGS
algorithms.
Corollary 4.4. Let $f = \max\{f_i : i = 1, \dots, N\}$ where each $f_i \in C^{2+}$ with Lipschitz constant
$K_i$ for $\nabla^2 f_i$. If the approximate gradient used in the AGS or RAGS algorithm is the centered
simplex gradient, $\|\hat{L}^{-1}\|$ is bounded above for each centered simplex gradient computed and
$\Delta^0 \le 1$, then the results of Theorems 2.4, 2.6, 2.8, 2.9 and 3.5 hold.
Proof. Since $\Delta^0 \le 1$ and $\{\Delta^k\}$ is non-increasing, $(\Delta^k)^2 \le \Delta^k$ and ergo, Theorems 2.4, 2.6, 2.8, 2.9 and
3.5 hold.
4.2.2 Algorithm
In Step 1, to adapt the AGS or RAGS algorithm to use the centered simplex gradient, we sample
our set $Y$ in the same manner as for the simplex gradient (resampling until a well-poised set is
achieved). We then form the sets $Y^+$ and $Y^-$ and proceed as expected.
4.3 Gupal Estimate
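The nonderivative version of the gradient sampling algorithm presented by Kiwiel in [24] uses
the Gupal estimate of the gradient of the Steklov averaged function as an approximate gradient.
We see in Theorem 4.8 that an appropriate error bound exists for this approximate gradient.
Surprisingly, unlike the error bounds for the simplex and centered simplex gradients, the error
bound in Theorem 4.8 does not include a condition number term. Mathematically, we define
the Gupal estimate of the gradient of the Steklov averaged function as follows.
For $\alpha > 0$, the Steklov averaged function $f_\alpha$, as defined in [16, Def. 3.1], is given by
$$f_\alpha(x) = \int_{\mathbb{R}^n} f(x - y)\,\psi_\alpha(y)\,dy,$$
where $\psi_\alpha : \mathbb{R}^n \to \mathbb{R}_+$ is the Steklov mollifier defined by
$$\psi_\alpha(y) = \begin{cases} 1/\alpha^n & \text{if } y \in [-\alpha/2, \alpha/2]^n, \\ 0 & \text{otherwise.} \end{cases}$$
We can equivalently define the Steklov averaged function by
$$f_\alpha(x) = \frac{1}{\alpha^n} \int_{x_1 - \alpha/2}^{x_1 + \alpha/2} \cdots \int_{x_n - \alpha/2}^{x_n + \alpha/2} f(y)\,dy_1 \dots dy_n. \tag{19}$$
The partial derivatives of $f_\alpha$ are given by ([16, Prop. 3.11] and [18])
$$\frac{\partial f_\alpha}{\partial x_i}(x) = \int_B \gamma_i(x, \alpha, \zeta)\,d\zeta \tag{20}$$
for $i = 1, \dots, n$, where $B = [-1/2, 1/2]^n$ is the unit cube centred at 0 and
$$
\begin{aligned}
\gamma_i(x, \alpha, \zeta) = \frac{1}{\alpha} \Big[ & f(x_1 + \alpha\zeta_1, \dots, x_{i-1} + \alpha\zeta_{i-1},\ x_i + \tfrac{1}{2}\alpha,\ x_{i+1} + \alpha\zeta_{i+1}, \dots, x_n + \alpha\zeta_n) \\
& - f(x_1 + \alpha\zeta_1, \dots, x_{i-1} + \alpha\zeta_{i-1},\ x_i - \tfrac{1}{2}\alpha,\ x_{i+1} + \alpha\zeta_{i+1}, \dots, x_n + \alpha\zeta_n) \Big].
\end{aligned}
\tag{21}
$$
Given $\alpha > 0$ and $z = (\zeta^1, \dots, \zeta^n) \in \prod_{i=1}^{n} B$, the Gupal estimate of $\nabla f_\alpha(x)$ over $z$ is given
by
$$\gamma(x, \alpha, z) = (\gamma_1(x, \alpha, \zeta^1), \dots, \gamma_n(x, \alpha, \zeta^n)). \tag{22}$$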
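Definitions (21) and (22) translate almost directly into code. The Python sketch below is an illustration only: the function name, the use of `random.uniform` to draw each $\zeta^i$ from the cube $[-1/2, 1/2]^n$, and the affine test function are assumptions. For an affine function the difference quotient in (21) is exact, so the estimate should equal the true gradient up to rounding.

```python
import random


def gupal_estimate(f, x, alpha, z):
    """Gupal estimate gamma(x, alpha, z); z[i] is the vector zeta^i for component i."""
    n = len(x)
    gamma = []
    for i in range(n):
        zeta = z[i]
        # Perturb every coordinate by alpha * zeta_k ...
        base = [x[k] + alpha * zeta[k] for k in range(n)]
        # ... except coordinate i, which is set to x_i +/- alpha / 2.
        upper = list(base)
        upper[i] = x[i] + 0.5 * alpha
        lower = list(base)
        lower[i] = x[i] - 0.5 * alpha
        gamma.append((f(upper) - f(lower)) / alpha)
    return gamma


random.seed(0)  # fixed seed so the illustration is reproducible
n = 3
x = [1.0, -2.0, 0.5]
z = [[random.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]

# Affine test function with gradient (2, -1, 4).
f = lambda v: 2.0 * v[0] - v[1] + 4.0 * v[2]
gamma = gupal_estimate(f, x, 0.1, z)
print(gamma)
```

Each component uses two function evaluations, so one estimate costs $2n$ evaluations per sub-function.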
Remark 4.5. Although we define $\gamma(x, \alpha, z)$ as the Gupal estimate of $\nabla f_\alpha(x)$, in Section 4.3.1,
we will show that $\gamma(x, \alpha, z)$ provides a good approximation to the exact gradient, $\nabla f(x)$.
Remark 4.6. For the following results, we note that the $\alpha$ used in the above definitions is
equivalent to our sampling radius $\Delta$. Thus, we will be replacing $\alpha$ with $\Delta$ in the convergence
results in Section 4.3.1.
4.3.1 Convergence
As before, in order to show that both the AGS and RAGS algorithms are well-defined when
using the Gupal estimate as an approximate gradient, we must establish that it provides a good
approximation of our exact gradient. To do this, we first need the following classic result from
[13].
Lemma 4.7. [13, Lemma 4.1.12]. Let $f \in C^{1+}$ with Lipschitz constant $K$ for $\nabla f$. Let $y^0 \in \mathbb{R}^n$.
Then for any $y \in \mathbb{R}^n$
$$|f(y) - f(y^0) - \nabla f(y^0)^{\top}(y - y^0)| \le \frac{1}{2} K |y - y^0|^2.$$
Using Lemma 4.7, we establish an error bound between the Gupal estimate and the exact
gradient of $f$.
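Theorem 4.8. Consider $f_i \in C^{1+}$ with Lipschitz constant $K_i$ for $\nabla f_i$. Let $\varepsilon > 0$. Then for
$\Delta > 0$, $z = (\zeta^1, \dots, \zeta^n) \in Z = \prod_{i=1}^{n} B$ and any point $x \in \mathbb{R}^n$, the Gupal estimate of $\nabla f_{i,\Delta}(x)$
satisfies the error bound
$$|\gamma(x, \Delta, z) - \nabla f_i(x)| \le \sqrt{n}\,\frac{1}{2} K_i \Delta (\sqrt{n} + 3).$$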
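Proof. For $\Delta > 0$, let
$$y^{j-} = [x_1 + \Delta\zeta^j_1, \dots, x_{j-1} + \Delta\zeta^j_{j-1},\ x_j - \tfrac{1}{2}\Delta,\ x_{j+1} + \Delta\zeta^j_{j+1}, \dots, x_n + \Delta\zeta^j_n]^{\top}$$
and
$$y^{j+} = [x_1 + \Delta\zeta^j_1, \dots, x_{j-1} + \Delta\zeta^j_{j-1},\ x_j + \tfrac{1}{2}\Delta,\ x_{j+1} + \Delta\zeta^j_{j+1}, \dots, x_n + \Delta\zeta^j_n]^{\top}.$$
Applying Lemma 4.7, we have that
$$|f_i(y^{j+}) - f_i(y^{j-}) - \nabla f_i(y^{j-})^{\top}(y^{j+} - y^{j-})| \le \frac{1}{2} K_i |y^{j+} - y^{j-}|^2. \tag{23}$$
From equation (21) (with $\alpha = \Delta$), we can see that
$$f_i(y^{j+}) - f_i(y^{j-}) = \Delta\,\gamma_j(x, \Delta, \zeta^j).$$
Hence, equation (23) becomes
$$|\Delta\,\gamma_j(x, \Delta, \zeta^j) - \nabla f_i(y^{j-})^{\top}(y^{j+} - y^{j-})| \le \frac{1}{2} K_i |y^{j+} - y^{j-}|^2. \tag{24}$$
From our definitions of $y^{j-}$ and $y^{j+}$, we can see that
$$y^{j+} - y^{j-} = [0, \dots, 0, \Delta, 0, \dots, 0]^{\top}.$$
The inner product in equation (24) simplifies to
$$\nabla f_i(y^{j-})^{\top}(y^{j+} - y^{j-}) = \Delta\,\frac{\partial f_i}{\partial x_j}(y^{j-}).$$
Thus, we have
$$\Big| \Delta\,\gamma_j(x, \Delta, \zeta^j) - \Delta\,\frac{\partial f_i}{\partial x_j}(y^{j-}) \Big| \le \frac{1}{2} K_i \Delta^2,$$
implying
$$\Big| \gamma_j(x, \Delta, \zeta^j) - \frac{\partial f_i}{\partial x_j}(y^{j-}) \Big| \le \frac{1}{2} K_i \Delta. \tag{25}$$
Also notice that
$$|y^{j-} - x| = \Delta\,\big| (\zeta^j_1, \dots, \zeta^j_{j-1}, -\tfrac{1}{2}, \zeta^j_{j+1}, \dots, \zeta^j_n) \big|.$$
Using the standard basis vector $e_j$, we have
$$|y^{j-} - x| = \Delta\,\big| \zeta^j - \zeta^j_j e_j - \tfrac{1}{2} e_j \big| \le \Delta \big( |\zeta^j| + |\zeta^j_j e_j| + \tfrac{1}{2} \big) \le \frac{1}{2}\Delta(\sqrt{n} + 2).$$
Thus, since $f_i \in C^{1+}$, we have
$$|\nabla f_i(y^{j-}) - \nabla f_i(x)| \le K_i\,\frac{1}{2}\Delta(\sqrt{n} + 2). \tag{26}$$
Noting that
$$\Big| \frac{\partial f_i}{\partial x_j}(y^{j-}) - \frac{\partial f_i}{\partial x_j}(x) \Big| \le |\nabla f_i(y^{j-}) - \nabla f_i(x)|,$$
we have
$$\Big| \frac{\partial f_i}{\partial x_j}(y^{j-}) - \frac{\partial f_i}{\partial x_j}(x) \Big| \le K_i\,\frac{1}{2}\Delta(\sqrt{n} + 2). \tag{27}$$
Using equations (25) and (27), we have
$$
\begin{aligned}
\Big| \gamma_j(x, \Delta, \zeta^j) - \frac{\partial f_i}{\partial x_j}(x) \Big|
&\le \Big| \gamma_j(x, \Delta, \zeta^j) - \frac{\partial f_i}{\partial x_j}(y^{j-}) \Big| + \Big| \frac{\partial f_i}{\partial x_j}(y^{j-}) - \frac{\partial f_i}{\partial x_j}(x) \Big| \\
&\le \frac{1}{2} K_i \Delta + K_i\,\frac{1}{2}\Delta(\sqrt{n} + 2) \\
&= \frac{1}{2} K_i \Delta(\sqrt{n} + 3).
\end{aligned}
$$
Finally,
$$
\begin{aligned}
|\gamma(x, \Delta, z) - \nabla f_i(x)| &= \sqrt{\sum_{j=1}^{n} \Big( \gamma_j(x, \Delta, \zeta^j) - \frac{\partial f_i}{\partial x_j}(x) \Big)^2} \\
&\le \sqrt{\sum_{j=1}^{n} \Big( \frac{1}{2} K_i \Delta(\sqrt{n} + 3) \Big)^2} \\
&= \sqrt{n}\,\frac{1}{2} K_i \Delta(\sqrt{n} + 3).
\end{aligned}
$$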
We conclude that convergence holds when using the Gupal estimate of the gradient of
the Steklov averaged function of fas an approximate gradient in both the AGS and RAGS
algorithms.
Corollary 4.9. Let $f = \max\{f_i : i = 1, \dots, N\}$ where each $f_i \in C^{1+}$ with Lipschitz constant $K_i$
for $\nabla f_i$. If the approximate gradient used in the AGS or RAGS algorithm is the Gupal estimate
of the gradient of the Steklov averaged function, then the results of Theorems 2.4, 2.6, 2.8, 2.9
and 3.5 hold.
4.3.2 Algorithm
To use the Gupal estimate of the gradient of the Steklov averaged function in both the AGS
and RAGS algorithms, in Step 1, we sample $\{z^{kl}\}_{l=1}^{m}$ independently and uniformly from the
unit cube in $\mathbb{R}^{n \times n}$, where $m$ is the number of active functions.
5 Numerical Results
5.1 Versions of the AGS and RAGS Algorithms
We implemented the AGS and RAGS algorithms using the simplex gradient, the centered
simplex gradient and the Gupal estimate of the gradient of the Steklov averaged function as
approximate gradients.
Additionally, we used the robust descent direction to create robust stopping conditions. That
is, the algorithm terminates when
$$\Delta^k \le \mu^k |d^k_Y| \quad \text{and} \quad |d^k_Y| < \varepsilon_{tol}, \tag{28}$$
where $d^k_Y$ is the projection of 0 onto the approximate subdifferential generated using the robust
active set. (See Proposition 3.11 for results linking the robust stopping conditions with
the Goldstein approximate subdifferential.) The implementation was done in MATLAB (v.
7.11.0.584, R2010b). Software is available by contacting the corresponding author.
Let $d^k$ denote the regular descent direction and let $d^k_Y$ denote the robust descent direction.
There are three scenarios that could occur when using the robust stopping conditions:
1. $|d^k| = |d^k_Y|$;
2. $|d^k| \ge |d^k_Y|$, but checking the stopping conditions leads to the same result (line search,
radius decrease or termination); or
3. $|d^k| \ge |d^k_Y|$, but checking the stopping conditions leads to a different result.
In Scenarios 1 and 2, the robust stopping conditions have no influence on the algorithm. In
Scenario 3, we have two cases:
1. $\Delta^k \le \mu^k |d^k_Y| \le \mu^k |d^k|$, but $|d^k| \ge \varepsilon_{tol}$ and $|d^k_Y| < \varepsilon_{tol}$, or
2. $\Delta^k \le \mu^k |d^k|$ holds, but $\Delta^k > \mu^k |d^k_Y|$.
Thus, we hypothesize that the robust stopping conditions will cause the AGS and RAGS
algorithms to do one of two things: to terminate early, providing a solution that has a smaller
quality measure but requires fewer function evaluations to find, or to reduce the sampling radius
instead of carrying out a line search, reducing the number of function evaluations carried out
during that iteration and calculating a more accurate approximate subdifferential at the next
iteration.
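The two cases of Scenario 3 can be reproduced with a small Python sketch of the Step 2 test. The branching below (radius decrease when $\Delta^k > \mu^k |d|$, termination when additionally $|d| < \varepsilon_{tol}$, line search otherwise) is inferred from the three outcomes named in the text and is an assumption of this sketch; the function name and the numbers are hypothetical.

```python
def step2_action(delta, mu, d_norm, eps_tol):
    """Outcome of the Step 2 test for a direction of norm d_norm (assumed logic)."""
    if delta <= mu * d_norm:
        return "terminate" if d_norm < eps_tol else "line search"
    return "radius decrease"


delta, mu, eps_tol = 0.01, 0.5, 0.05
d_regular = 0.1  # |d^k|, norm of the regular descent direction

# Case 1: the robust direction is short enough to trigger termination.
case1 = (step2_action(delta, mu, d_regular, eps_tol),
         step2_action(delta, mu, 0.03, eps_tol))

# Case 2: the robust direction fails the radius test, so the radius shrinks.
case2 = (step2_action(delta, mu, d_regular, eps_tol),
         step2_action(delta, mu, 0.01, eps_tol))

print(case1)
print(case2)
```

In both cases the regular direction would trigger a line search, while the robust direction either stops the algorithm or shrinks the sampling radius, matching the hypothesis stated above.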
Our goal in this testing is to determine if there are any notable numerical differences in the
quality of the three approximate gradients (simplex, centered simplex, and Gupal estimate),
the two search directions (robust and non-robust), and the two stopping conditions (robust and
non-robust). This leads to the following 12 versions:
AGS Simplex (1. non-robust/2. robust stopping)
RAGS Simplex (3. non-robust/4. robust stopping)
AGS Centered Simplex (5. non-robust/6. robust stopping)
RAGS Centered Simplex (7. non-robust/8. robust stopping)
AGS Gupal (9. non-robust/10. robust stopping)
RAGS Gupal (11. non-robust/12. robust stopping)
5.2 Test Sets and Software
Testing was performed on a 2.0 GHz Intel Core i7 MacBook Pro. We used the test set from
Lukšan-Vlček [27]. The first 25 problems presented are of the desired form
$$\min_x \{F(x)\} \quad \text{where} \quad F(x) = \max_{i = 1, 2, \dots, N} \{f_i(x)\}.$$
Of these 25 problems, we omit problem 2.17 because the sub-functions are complex-valued.
Thus, our test set presents a total of 24 finite minimax problems with dimensions from 2 to 20.
There are several problems with functions $f_i$ that have the form $f_i = |\tilde{f}_i|$, where $\tilde{f}_i$ is a smooth
function. We rewrote these functions as $f_i = \max\{\tilde{f}_i, -\tilde{f}_i\}$. The resulting test problems have
from 2 to 130 sub-functions. A summary of the test problems appears in Table 2 in Appendix
A.
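The rewriting of an absolute value as a finite max can be sketched in a few lines of Python. The helper names `split_abs` and `finite_max` and the example function are hypothetical and serve only to illustrate why each such term contributes two smooth sub-functions.

```python
def split_abs(g):
    """Replace |g| by the pair (g, -g), since |g(x)| = max{g(x), -g(x)}."""
    return [g, lambda x: -g(x)]


def finite_max(fs):
    """Build F(x) = max_i f_i(x) from a list of sub-functions."""
    return lambda x: max(f(x) for f in fs)


# Hypothetical smooth function whose absolute value appears in a test problem.
g = lambda x: x[0] ** 2 - 1.0

sub_functions = split_abs(g)
F = finite_max(sub_functions)

print(len(sub_functions), F([0.5]), F([2.0]))
```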
5.3 Initialization and Stopping Conditions
We first describe our choices for the initialization parameters used in the AGS and RAGS
algorithms.
The initial starting points are given for each problem in [27]. We set the initial accuracy
measure to 0.5 with a reduction factor of 0.5. We set the initial sampling radius to 0.1 with
a reduction factor of 0.5. The Armijo-like parameter $\eta$ was chosen to be 0.1 to ensure that a
line search success resulted in a significant function value decrease. We set the minimum step
length to $10^{-10}$.
Next, we discuss the stopping tolerances used to ensure finite termination of the AGS and
RAGS algorithms. We encoded four possible reasons for termination in our algorithm. The
first is our theoretical stopping condition, while the remaining three are to ensure numerical
stability of the algorithm.
1. Stopping conditions met - As stated in the theoretical algorithm, the algorithm terminates
for this reason when $\Delta^k \le \mu^k |d^k|$ and $|d^k| < \varepsilon_{tol}$, where $d^k$ is either the regular or the
robust descent direction.
2. Hessian matrix has NaN / Inf entries - For the solution of the quadratic program in
Step 2, we use the quadprog command in MATLAB, which has certain numerical limitations.
When these limitations result in NaN or Inf entries in the Hessian, the algorithm
terminates.
3. $\Delta^k$, $\mu^k$, and $|d^k|$ are small - This stopping criterion bypasses the test $\Delta^k \le \mu^k |d^k|$ (in Step
2) and stops if $\Delta^k < \Delta_{tol}$, $|\mu^k| < \mu_{tol}$ and $|d^k| < \varepsilon_{tol}$. Examining Theorem 2.9 along with
Theorems 4.1, 4.3 and 4.8, it is clear that this is also a valid stopping criterion. We used
a bound of $10^{-6}$ in our implementation for both $\Delta^k$ and $\mu^k$.
4. Max number of function evaluations reached - As a final failsafe, we added an upper
bound of $10^6$ on the number of function evaluations allowed. (This stopping condition
only occurs once in our results.)
5.4 Results
Due to the randomness in the AGS and RAGS algorithms, we carry out 25 trials for each
version. For each of the 25 trials, we record the number of function evaluations, the number of
iterations, the solution, the quality of the solution and the reason for termination. The quality
was measured by the improvement in the number of digits of accuracy, which is calculated using
the formula
$$-\log\left( \frac{|F_{\min} - F^*|}{|F_0 - F^*|} \right),$$
where $F_{\min}$ is the function value at the final (best) iterate, $F^*$ is the true minimum value
(optimal value) of the problem (as given in [27]) and $F_0$ is the function value at the initial
iterate. Results on function evaluations and solution quality appear in Tables 3, 4 and 5 of the
appendix.
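As a worked illustration of this quality measure, the Python sketch below assumes a base-10 logarithm and a sign convention under which progress gives a positive number of digits. The function name and the sample values are hypothetical.

```python
import math


def digits_of_improvement(F_min, F_star, F_0):
    """Improvement in digits of accuracy (assumes base-10 log, positive = better)."""
    # Note: raises ValueError if F_min equals F_star exactly (log of zero).
    return -math.log10(abs(F_min - F_star) / abs(F_0 - F_star))


# Hypothetical run: initial gap 1.0 to the optimum, final gap 0.001.
digits = digits_of_improvement(F_min=1e-3, F_star=0.0, F_0=1.0)
print(digits)
```

Under this convention, a run counts as a success for "3 digits of improvement" when the returned value is at least 3.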
To visually compare algorithmic versions, we use performance profiles. A performance
profile is the (cumulative) distribution function for a performance metric [14]. For the AGS
and RAGS algorithms, the performance metric is the ratio of the number of function evaluations
taken by the current version to successfully solve each test problem versus the least number
of function evaluations taken by any of the versions to successfully solve each test problem.
Performance profiles eliminate the need to discard failures in numerical results and provide a
visual representation of the performance difference between several solvers. For full details on
the construction of performance profiles, see [14].
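A minimal Python sketch of how such a profile can be computed is given below. It follows the ratio-based metric described above, with failures encoded as `math.inf`; the function name, the data layout, and the evaluation counts are hypothetical and do not come from the tables in the appendix.

```python
import math


def performance_profile(evals, taus):
    """evals[s][p]: function evaluations solver s needed on problem p (math.inf on failure).

    Returns rho[s], the fraction of problems whose performance ratio is <= tau,
    for each tau in taus.
    """
    n_problems = len(next(iter(evals.values())))
    best = [min(evals[s][p] for s in evals) for p in range(n_problems)]
    rho = {}
    for s in evals:
        ratios = [evals[s][p] / best[p] if math.isfinite(best[p]) else math.inf
                  for p in range(n_problems)]
        rho[s] = [sum(r <= tau for r in ratios) / n_problems for tau in taus]
    return rho


# Hypothetical counts for two solver versions on three problems.
evals = {
    "A": [100, 250, math.inf],  # version A fails on the third problem
    "B": [200, 125, 400],
}
taus = [1, 2, 4]
rho = performance_profile(evals, taus)
print(rho)
```

The value of $\rho$ at $\tau = 1$ is the fraction of problems on which a version was the fastest, and its limit for large $\tau$ is the fraction of problems that version solved at all.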
In Figures 1(a) and 1(b) we include a performance profile showing all 12 versions of the
AGS and RAGS algorithms tested, declaring a success for 1 digit and 3 digits of improvement,
respectively.
Figure 1: Performance profiles for 12 versions of AGS/RAGS algorithm, plotting ρ(τ) against τ (logarithmic scale). (a) Accuracy improvement of 1 digit. (b) Accuracy improvement of 3 digits.
In general, we can see that using the Gupal estimate of the gradient of the Steklov averaged
function as an approximate gradient does not produce the best results. It only produces 1 or
more digits of accuracy for problems 2.1, 2.2, 2.4, 2.10, 2.18 and 2.23 (robust version). There is
no significant difference between the performance of the AGS and RAGS algorithms using the
simplex and centered simplex gradients as approximate gradients.
Looking at the results in Tables 3, 4 and 5, and our performance profiles, we can make the
following two observations:
i) the versions of the RAGS algorithm generally outperform (converge faster than) the
versions of the AGS algorithm, and
ii) the RAGS algorithm using the robust stopping conditions terminates faster and with
lower (but still significant) accuracy.
Robust active set: From our results, it is clear that expanding the active set to include
‘almost active’ functions in the RAGS algorithm greatly improves performance for the simplex
and centered simplex algorithm. This robust active set brings more local information into the
approximate subdifferential and thereby allows for descent directions that are more parallel to
any nondifferentiable ridges formed by the function.
Robust stopping conditions: We notice from the performance profiles that in terms of
function evaluations, the robust stopping conditions improve the overall performance of the
RAGS algorithm, although they decrease the average accuracy on some problems. These results
correspond with our previously discussed hypothesis. Furthermore, upon studying the
reasons for termination, it appears that the non-robust stopping conditions cause the AGS and
RAGS algorithms to terminate mainly due to $\Delta^k$ and $\mu^k$ becoming too small. For the robust
stopping conditions, the RAGS algorithm terminated often because the stopping conditions
were satisfied. As our theory in Section 3.3 is not complete, we cannot make any theoretical
statements about how the robust stopping conditions would perform in general (like those in
Theorem 2.9). However, from our results, we conjecture that the alteration is beneficial for
decreasing function evaluations.
In 23 of the 24 problems tested, for both robust and non-robust stopping conditions, the
RAGS algorithm either matches or outperforms the AGS algorithm in average accuracy ob-
tained over 25 trials using the simplex and centered simplex gradients. Knowing this, we
conclude that the improvement of the accuracy is due to the choice of a robust search direction.
5.5 Comparison to NOMAD
NOMAD is a well-established general DFO solver, not specialized for the class of minimax
problems [25, 1]. Using the same set of 24 test problems as in our previous tests, we
compare the RAGS algorithm to the NOMAD algorithm. Based on the previous tests we use the
simplex gradient with robust stopping conditions. All parameters for the NOMAD algorithm
were left at their default settings. The performance profile generated by these two algorithms for the
improvement of 3 digits of accuracy appears in Figure 2.
As seen in Figure 2, NOMAD appears slightly faster than RAGS, but slightly less robust.
Overall, the two algorithms are fairly comparable. Further research into RAGS may lead to
algorithmic improvement. For example, a more informed selection of the sample set $Y$ in Step
1, an improved line search, or better selection of default parameters for RAGS, could all lead
to improved convergence. We leave such research for future study.
6 Conclusion
We have presented a new derivative-free algorithm for finite minimax problems that exploits the
smooth substructure of the problem. Convergence results are given for any arbitrary approxi-
mate gradient that satisfies an error bound dependent on the sampling radius. Three examples
of such approximate gradients are given. Additionally, a robust version of the algorithm is
presented and shown to have the same convergence results as the regular version.
Figure 2: Performance profile comparing RAGS with NOMAD for an improvement of 3 digits of
accuracy, plotting ρ(τ) against τ for Simplex RAGS (robust stopping) and NOMAD.
Through numerical testing, we find that the robust version of the algorithm outperforms
the regular version with respect to both the number of function evaluations required and the
accuracy of the solutions obtained. Additionally, we tested robust stopping conditions and
found that they generally required fewer function evaluations before termination. Overall, the
robust stopping conditions paired with the robust version of the algorithm performed best (as
seen in the performance profiles).
Considerable future work is available in this research direction. Most obvious is an ex-
ploration of the theory behind the performance of the robust stopping conditions. Another
direction lies in the theoretical requirement bounding the step length away from 0 (see The-
orems 2.8 and 3.5). In gradient based methods, one common way to avoid this requirement
is with the use of Wolfe-like conditions. We are unaware of any derivative-free variant on the
Wolfe conditions.
Acknowledgements
The authors would like to express their gratitude to Dr. C. Sagastizábal for inspirational
conversations regarding the Goldstein subdifferential. This research was partially funded by
the NSERC DG program and by UBC IRF.
References
[1] A. M. Abramson, C. Audet, G. Couture, J. E. Dennis Jr., S. Le Digabel and C. Tribes. The
NOMAD project. Software available at http://www.gerad.ca/nomad.
[2] A. M. Bagirov, B. Karasözen, and M. Sezer. Discrete gradient method: derivative-free method
for nonsmooth optimization. J. Optim. Theory Appl., 137(2):317–334, 2008.
[3] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert
Spaces. CMS Books in Mathematics. Springer, New York, NY, 2011.
[4] A. J. Booker, J. E. Dennis Jr., P. D. Frank, D. B. Serafini, and V. Torczon. Optimization using
surrogate objectives on a helicopter test example. In Computational methods for optimal design
and control (Arlington, VA, 1997), volume 24 of Progr. Systems Control Theory, pages 49–58.
Birkhäuser Boston, MA, 1998.
[5] D. M. Bortz and C. T. Kelley. The simplex gradient and noisy optimization problems. In
Computational methods for optimal design and control, volume 24 of Progr. Systems Control
Theory, pages 77–90. Birkhäuser Boston, MA, 1998.
[6] J. V. Burke, A. S. Lewis, and M. L. Overton. Approximating subdifferentials by random sampling
of gradients. Math. Oper. Res., 27(3):567–584, 2002.
[7] J. V. Burke, A. S. Lewis, and M. L. Overton. A robust gradient sampling algorithm for nonsmooth,
nonconvex optimization. SIAM J. Optim., 15(3):751–779, 2005.
[8] X. Cai, K. Teo, X. Yang, and X. Zhou. Portfolio optimization under a minimax rule. Manage.
Sci., 46(7):957–972, 2000.
[9] F. H. Clarke. Optimization and Nonsmooth Analysis. Classics Appl. Math. 5. SIAM, Philadelphia,
second edition, 1990.
[10] A. R. Conn, K. Scheinberg, and L. N. Vicente. Introduction to Derivative-free Optimization,
volume 8 of MPS/SIAM Series on Optimization. SIAM, Philadelphia, 2009.
[11] A. L. Custódio, J. E. Dennis Jr., and L. N. Vicente. Using simplex gradients of nonsmooth
functions in direct search methods. IMA J. Numer. Anal., 28(4):770–784, 2008.
[12] A. L. Custódio and L. N. Vicente. Using sampling and simplex derivatives in pattern search
methods. SIAM J. Optim., 18(2):537–555, 2007.
[13] J. E. Dennis Jr. and R. B. Schnabel. Numerical Methods for Unconstrained Optimization and
Nonlinear Equations (Classics in Applied Mathematics). SIAM, Philadelphia, 1996.
[14] E. D. Dolan and J. J. Moré. Benchmarking optimization software with performance profiles.
Math. Program., 91(2, Ser. A):201–213, 2002.
[15] R. Duvigneau and M. Visonneau. Hydrodynamic design using a derivative-free method. Struct.
and Multidiscip. Optim., 28:195–205, 2004.
[16] Y. M. Ermoliev, V. I. Norkin and R. J.-B. Wets. The minimization of semicontinuous functions:
Mollifier subgradients. SIAM J. Control Optim., 33:149–167, 1995.
[17] A. A. Goldstein. Optimization of Lipschitz continuous functions. Math. Program., 13(1):14–22,
1977.
[18] A. M. Gupal. A method for the minimization of almost differentiable functions. Kibernetika,
pages 114–116, 1977. In Russian; English translation in: Cybernetics, 13(2):220–222, 1977.
[19] W. Hare and M. Macklem. Derivative-free optimization methods for finite minimax problems.
To appear in Optim. Methods and Softw., 2011.
[20] W. L. Hare. Using derivative free optimization for constrained parameter selection in a home
and community care forecasting model. In International Perspectives on Operations Research
and Health Care, Proceedings of the 34th Meeting of the EURO Working Group on Operational
Research Applied to Health Sciences, pages 61–73, 2010.
[21] J. Imae, N. Ohtsuki, Y. Kikuchi, and T. Kobayashi. A minimax control design for nonlinear
systems based on genetic programming: Jung’s collective unconscious approach. Intern. J. Syst.
Sci., 35:775–785, October 2004.
[22] C. T. Kelley. Detection and remediation of stagnation in the Nelder-Mead algorithm using a
sufficient decrease condition. SIAM J. Optim., 10(1):43–55, 1999.
[23] C. T. Kelley. Iterative Methods for Optimization, volume 18 of Frontiers in Applied Mathematics.
SIAM, Philadelphia, 1999.
[24] K. C. Kiwiel. A nonderivative version of the gradient sampling algorithm for nonsmooth
nonconvex optimization. SIAM J. Optim., 20(4):1983–1994, 2010.
[25] S. Le Digabel. Algorithm 909: NOMAD: Nonlinear Optimization with the MADS algorithm.
ACM T Math Software, 37(4):1–15, 2011.
[26] G. Liuzzi, S. Lucidi, and M. Sciandrone. A derivative-free algorithm for linearly constrained finite
minimax problems. SIAM J. Optim., 16(4):1054–1075, 2006.
[27] L. Lukšan and J. Vlček. Test Problems for Nonsmooth Unconstrained and Linearly Constrained
Optimization. Technical report, February 2000.
[28] K. Madsen. Minimax solution of non-linear equations without calculating derivatives, volume 3
of Math. Programming Stud., pages 110–126, 1975.
[29] A. L. Marsden, J. A. Feinstein, and C. A. Taylor. A computational framework for
derivative-free optimization of cardiovascular geometries. Comput. Methods Appl. Mech. Engrg.,
197(21-24):1890–1905, 2008.
[30] J. Nocedal and S. J. Wright. Numerical Optimization. Springer Series in Operations Research.
Springer-Verlag, New York, 1999.
[31] G. Di Pillo, L. Grippo, and S. Lucidi. A smooth method for the finite minimax problem. Math.
Program., 60(2, Ser. A):187–214, 1993.
[32] E. Polak. On the mathematical foundations of nondifferentiable optimization in engineering
design. SIAM Rev., 29(1):21–89, 1987.
[33] E. Polak, J. O. Royset, and R. S. Womersley. Algorithms with adaptive smoothing for finite
minimax problems. J. Optim. Theory Appl., 119(3):459–484, 2003.
[34] R. A. Polyak. Smooth optimization methods for minimax problems. SIAM J. Control Optim.,
26(6):1274–1286, 1988.
[35] R. T. Rockafellar and R. J.-B. Wets. Variational Analysis, volume 317 of Grundlehren der
Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences].
Springer-Verlag, Berlin, 1998.
[36] R. Stafford. Random Points in an n-Dimensional Hypersphere. MATLAB File Exchange,
http://www.mathworks.com/matlabcentral/fileexchange/9443-random-points-in-an-n-dimensional-hypersphere,
2005.
[37] P. Wolfe. A method of conjugate subgradients for minimizing nondifferentiable functions, volume 3
in Math. Programming Stud., pages 145–173, 1975.
[38] M. Wschebor. Smoothed analysis of κ(A). J. Complexity, 20(1):97–107, 2004.
[39] S. Xu. Smoothing method for minimax problems. Comput. Optim. Appl., 20(3):267–279, 2001.
7 Appendix A
Table 2: Test Set Summary: problem name and number, problem dimension (N), and number of
sub-functions (M); * denotes an absolute value operation (doubled number of sub-functions).
Prob. #  Name          N   M       Prob. #  Name       N   M
2.1      CB2           2   3       2.13     GAMMA      4   122*
2.2      WF            2   3       2.14     EXP        5   21
2.3      SPIRAL        2   2       2.15     PBC1       5   60*
2.4      EVD52         3   6       2.16     EVD61      6   102*
2.5      Rosen-Suzuki  4   4       2.18     Filter     9   82*
2.6      Polak 6       4   4       2.19     Wong 1     7   5
2.7      PCB3          3   42*     2.20     Wong 2     10  9
2.8      Bard          3   30*     2.21     Wong 3     20  18
2.9      Kow.-Osborne  4   22*     2.22     Polak 2    10  2
2.10     Davidon 2     4   40*     2.23     Polak 3    11  10
2.11     OET 5         4   42*     2.24     Watson     20  62*
2.12     OET 6         4   42*     2.25     Osborne 2  11  130*
Table 3: Average accuracy for 25 trials obtained by the AGS and RAGS algorithms for the simplex
gradient.
        AGS                                 RAGS
        Regular Stop      Robust Stop       Regular Stop      Robust Stop
Prob.   f-evals   Acc.    f-evals   Acc.    f-evals   Acc.    f-evals   Acc.
2.1     3018      2.082   2855      2.120   2580      9.470   202       6.759
2.2     3136      4.565   3112      4.987   4179      13.211  418       6.343
2.3     3085      0.002   3087      0.002   3090      0.002   3096      0.002
2.4     3254      2.189   3265      2.238   2986      11.559  367       7.570
2.5     3391      1.379   3138      1.351   3576      1.471   539       1.471
2.6     3260      1.236   3341      1.228   4258      1.338   859       1.338
2.7     2949      1.408   2757      1.367   4155      9.939   4190      7.230
2.8     4959      0.879   4492      0.913   3634      9.941   3435      7.655
2.9     2806      0.732   3303      0.581   16000     8.049   13681     3.975
2.10    2978      3.343   2993      3.342   3567      3.459   1924      3.459
2.11    3303      2.554   3453      2.559   35367     6.099   11725     5.063
2.12    2721      1.866   3117      1.871   15052     2.882   8818      2.660
2.13    2580      1.073   2706      0.874   43618     1.952   141       1.679
2.14    3254      1.585   3289      1.086   7713      2.696   4221      1.476
2.15    3917      0.262   5554      0.259   31030     0.286   12796     0.277
2.16    3711      2.182   4500      2.077   20331     3.242   11254     2.178
2.18    10468     0.000   10338     0.000   76355     17.717  30972     17.138
2.19    3397      0.376   3327      0.351   5403      7.105   1767      7.169
2.20    4535      1.624   4271      1.624   8757      8.435   7160      6.073
2.21    8624      2.031   8380      2.157   15225     1.334   11752     1.393
2.22    1563      0.958   1408      1.042   64116     3.049   1256      2.978
2.23    7054      2.557   10392     2.744   6092      6.117   970       6.178
2.24    4570      0.301   7857      0.298   93032     0.447   21204     0.328
2.25    3427      0.339   4197      0.340   98505     0.342   343       0.342
Table 4: Average accuracy for 25 trials obtained by the AGS and RAGS algorithms for the centered
simplex gradient.
        AGS                                 RAGS
        Regular Stop      Robust Stop       Regular Stop      Robust Stop
Prob.   f-evals   Acc.    f-evals   Acc.    f-evals   Acc.    f-evals   Acc.
2.1     3769      2.054   3573      2.051   2351      9.469   221       7.125
2.2     3705      6.888   1284      5.154   4151      9.589   330       5.594
2.3     5410      0.003   5352      0.003   5332      0.003   5353      0.003
2.4     4059      2.520   4154      2.456   4347      11.578  296       6.834
2.5     3949      1.422   3813      1.437   4112      1.471   452       1.471
2.6     3756      1.302   3880      1.309   4815      1.338   879       1.338
2.7     4227      1.435   4187      1.373   5285      9.950   7164      6.372
2.8     6928      0.988   6933      1.003   4116      9.939   3754      7.775
2.9     3301      0.933   3743      0.949   17944     8.072   13014     2.436
2.10    3447      3.343   3424      3.342   4744      3.459   427       3.459
2.11    3593      2.768   4082      2.785   47362     6.344   11886     5.115
2.12    3321      1.892   3406      1.876   15550     2.843   10726     2.651
2.13    3067      1.355   3508      1.216   36969     1.873   519       1.643
2.14    3967      1.771   6110      1.152   9757      2.692   7284      1.510
2.15    4646      0.272   6014      0.273   23947     0.280   15692     0.277
2.16    4518      2.223   6911      2.074   22225     2.628   17001     2.215
2.18    30492     16.931  14671     16.634  125859    17.804  20815     17.293
2.19    4473      0.551   4484      0.591   8561      7.113   1697      5.851
2.20    5462      1.615   5503      1.599   8908      9.011   7846      6.042
2.21    11629     1.887   11724     1.661   18957     1.304   17067     1.339
2.22    1877      1.166   1604      1.160   1453      3.139   2066      3.644
2.23    3807      2.150   7850      3.586   15625     6.117   1020      6.230
2.24    7198      0.302   12745     0.301   115787    0.436   61652     0.329
2.25    4749      0.339   4896      0.341   256508    0.342   568       0.342
Table 5: Average accuracy for 25 trials obtained by the AGS and RAGS algorithms for the Gupal
estimate of the gradient of the Steklov averaged function.
        AGS                                 RAGS
        Regular Stop      Robust Stop       Regular Stop      Robust Stop
Prob.   f-evals   Acc.    f-evals   Acc.    f-evals   Acc.    f-evals   Acc.
2.1     2775      2.448   2542      2.124   13126     3.896   89        2.708
2.2     3729      3.267   2221      2.813   5029      15.904  1776      7.228
2.3     2243      0.000   2262      0.000   2276      0.000   2255      0.000
2.4     2985      2.771   2841      2.892   3475      3.449   2362      3.738
2.5     3493      1.213   3529      1.196   3447      1.211   338       1.200
2.6     3144      0.187   3245      0.188   3018      0.162   3059      0.162
2.7     2631      1.368   3129      1.248   2476      1.048   2208      1.047
2.8     2711      1.125   3898      0.893   2231      0.514   5846      0.515
2.9     3102      0.727   3011      0.600   2955      0.937   3248      0.863
2.10    3075      3.241   2927      3.272   3100      0.000   3050      0.000
2.11    2947      1.527   3307      1.528   3003      1.560   2905      1.560
2.12    3095      1.099   7179      0.000   2670      0.788   7803      0.000
2.13    2755      0.710   1485      0.715   2517      0.231   6871      0.227
2.14    2965      0.574   3070      0.427   2860      0.708   4571      0.668
2.15    2658      0.010   2386      0.017   3355      0.050   3210      0.031
2.16    3431      0.457   3256      0.459   2861      0.199   2620      0.119
2.18    3936      16.345  5814      0.000   3950      16.451  6598      4.542
2.19    3337      0.014   3270      0.011   3488      0.970   3376      0.957
2.20    4604      0.835   4434      0.808   9459      1.360   10560     1.359
2.21    5468      0.000   5418      0.000   6632      0.641   6159      0.635
2.22    21        0.000   21        0.000   21        0.000   21        0.000
2.23    5436      1.814   5176      1.877   1.00E+06  2.354   954       2.415
2.24    7426      0.280   171       0.017   7927      0.043   6389      0.283
2.25    4519      0.286   4814      0.300   3760      0.017   3209      0.023