
A Derivative-Free Approximate Gradient Sampling Algorithm for Finite Minimax Problems

W. Hare and J. Nutini

February 20, 2013

Abstract

In this paper we present a derivative-free optimization algorithm for finite minimax problems. The algorithm calculates an approximate gradient for each of the active functions of the finite max function and uses these to generate an approximate subdifferential. The negative projection of 0 onto this set is used as a descent direction in an Armijo-like line search. We also present a robust version of the algorithm, which uses the 'almost active' functions of the finite max function in the calculation of the approximate subdifferential. Convergence results are presented for both algorithms, showing that either f(x_k) → −∞ or every cluster point is a Clarke stationary point. Theoretical and numerical results are presented for three specific approximate gradients: the simplex gradient, the centered simplex gradient and the Gupal estimate of the gradient of the Steklov averaged function. A performance comparison is made between the regular and robust algorithms, the three approximate gradients, and a regular and robust stopping condition.

Keywords: derivative-free optimization, minimax problems, generalized gradient, subgradient approximation.

AMS Subject Classification: primary, 90C47, 90C56; secondary, 90C26, 65K05, 49M25.

1 Introduction

In this paper we consider the finite minimax problem:

min_x f(x) where f(x) = max{f_i(x) : i = 1, . . . , N},

where each individual f_i is continuously differentiable. We further restrict ourselves to the field of derivative-free optimization (DFO), where we are only permitted to compute function values, i.e., we cannot compute gradient values ∇f_i directly. We present a derivative-free algorithm that exploits the smooth substructure of the finite max problem, thereby creating a robust algorithm with an elegant convergence theory.

Finite minimax problems occur in numerous applications, such as portfolio optimization [8], control system design [21], engineering design [32], and determining the cosine measure of a positive spanning set [10, Def 2.7]. In a finite max function, although each individual f_i may be smooth, taking the maximum forms a nonsmooth function with 'nondifferentiable ridges'. For this reason, most algorithms designed to solve finite minimax problems employ some form of smoothing technique; see [31], [33], [34], and [39] (among many others). In general, these smoothing techniques require gradient calculations.

However, in many situations gradient information is not available or can be difficult to compute accurately (see [4], [15], [20], [29] and [10, Chpt 1] for some examples of such situations). Such situations are the focus of research in derivative-free optimization. For a thorough introduction to several basic DFO frameworks and convergence results for each, see [10].

Research on optimizing finite max functions without calculating derivatives can be seen as early as 1975 [28], and the area has recently seen renewed interest, [26] and [19]. In 2006, Liuzzi, Lucidi and Sciandrone used a smoothing technique based on an exponential penalty function in a directional direct-search framework to form a derivative-free optimization method for finite minimax problems [26]. This method is shown to converge globally towards a standard stationary point of the original finite minimax problem.

Also specific to the finite minimax problem, a derivative-free method is presented in [19] that exploits the smooth substructure of the problem. It combines the frameworks of a directional direct search method [10, Chpt 7] and the gradient sampling algorithm (GS algorithm) presented in [6] and [7]. Loosely speaking, the GS algorithm uses a collection of local gradients to build a 'robust subdifferential' of the objective function and uses this to determine a 'robust descent direction'. In [19], these ideas are used to develop several methods to find an approximate descent direction that moves close to parallel to an 'active manifold'. During each iteration, points are sampled from around the current iterate and the simplex gradient is calculated for each of the active functions of the objective function. The calculated simplex gradients are then used to form an approximate subdifferential, which is then used to determine a likely descent direction.

Ideas from the GS algorithm have appeared in two other recent DFO methodologies, [2] and [24].

In 2008, Bagirov, Karasözen and Sezer presented a discrete gradient derivative-free method for unconstrained nonsmooth optimization problems [2]. Described as a derivative-free version of the bundle method presented in [37], the method uses discrete gradients to approximate subgradients of the function and build an approximate subdifferential. The analysis of this method provides proof of convergence to a Clarke stationary point for an extensive class of nonsmooth problems. In this paper, we focus on the finite minimax problem. This allows us to require few (other) assumptions on our function while maintaining strong convergence analysis. It is worth noting that we use the same set of test problems as in [2]. Specifically, we use the [27] test set and exclude one problem as its sub-functions are complex-valued. (The numerics in [2] exclude the same problem, and several others, without explanation.)

Using approximate gradient calculations instead of gradient calculations, the GS algorithm is made derivative-free by Kiwiel in [24]. Specifically, Kiwiel employs the Gupal estimate of the gradient of the Steklov averaged function (see [18] or Section 4.3 herein) as an approximate gradient. It is shown that, with probability 1, this derivative-free algorithm satisfies the same convergence results as the GS algorithm – it either drives the f-values to −∞ or each cluster point is found to be Clarke stationary [24, Theorem 3.8]. No numerical results are presented for Kiwiel's derivative-free algorithm.

In this paper, we use the GS algorithm framework with approximate gradients to form a derivative-free approximate gradient sampling algorithm. As we are dealing with finite max functions, instead of calculating an approximate gradient at each of the sampled points, we calculate an approximate gradient for each of the active functions. Expanding the active set to include 'almost' active functions, we also present a robust version of our algorithm, which is more akin to the GS algorithm. In this robust version, when our iterate is close to a point of nondifferentiability, the size and shape of our approximate subdifferential will reflect the presence of 'almost active' functions. Hence, when we project 0 onto our approximate subdifferential, the descent direction will direct minimization parallel to a 'nondifferentiable ridge', rather than straight at this ridge. It can be seen in our numerical results that these robust changes greatly influence the performance of our algorithm.

Our algorithm differs from the above in a few key ways. Unlike in [26], we do not employ a smoothing technique. Unlike in [19], which uses the directional direct-search framework to imply convergence, we employ an approximate steepest descent framework. Using this framework, we are able to analyze convergence directly and develop stopping conditions for the algorithm. Unlike in [2] and [24], where convergence is proven for a specific approximate gradient, we prove convergence for any approximate gradient that satisfies a simple error bound dependent on the sampling radius. As examples, we present the simplex gradient, the centered simplex gradient and the Gupal estimate of the gradient of the Steklov averaged function. (As a side-note, Section 4.3 also provides, to the best of the authors' knowledge, novel error analysis of the Gupal estimate of the gradient of the Steklov averaged function.)

Focusing on the finite minimax problem provides us with an advantage over the methods of [2] and [24]. In particular, we only require order n function calls per iteration (where n is the dimension of the problem), while both [2] and [24] require order mn function calls per iteration (where m is the number of gradients they approximate to build their approximate subdifferential). (The original GS algorithm suggests that m ≈ 2n provides a good value for m.)

The remainder of this paper is organized as follows. In Section 2, we present the approximate gradient sampling algorithm (AGS algorithm) and our convergence analysis. In Section 3, we present a robust version of the AGS algorithm (RAGS algorithm), which uses 'almost active' functions in the calculation of the approximate subdifferential. In Section 4, we show that the AGS and RAGS algorithms converge using three specific approximate gradients: the simplex gradient, the centered simplex gradient and the Gupal estimate of the gradient of the Steklov averaged function. Finally, in Section 5, we present our numerical results and analysis.

2 Approximate Gradient Sampling Algorithm

Throughout this paper, we assume that our objective function is of the form

min_x f(x) where f(x) = max{f_i(x) : i = 1, . . . , N},    (1)

where each f_i ∈ C¹, but we cannot compute ∇f_i. We use C¹ to denote the class of differentiable functions whose gradient mapping ∇f is continuous. We denote by C¹⁺ the class of continuously differentiable functions whose gradient mapping ∇f is locally Lipschitz, and we denote by C²⁺ the class of twice continuously differentiable functions whose Hessian mapping ∇²f is locally Lipschitz. Additionally, throughout this paper, |·| denotes the Euclidean norm and ‖·‖ denotes the corresponding matrix norm.


For the finite max function in equation (1), we define the active set of f at a point x̄ to be the set of indices

A(x̄) = {i : f(x̄) = f_i(x̄)}.

The set of active gradients of f at x̄ is denoted by

{∇f_i(x̄)}_{i∈A(x̄)}.
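As an illustration (not part of the paper's implementation), the active set can be computed from function values alone. The Python sketch below adds a small tolerance `tol` to guard against floating-point ties; this tolerance is our own practical assumption, while the definition above uses exact equality.

```python
import numpy as np

def active_set(fs, x, tol=1e-12):
    """Return {i : f_i(x) = f(x)} for f = max over the list fs.

    fs is a list of callables f_i; tol is a hypothetical tolerance,
    since exact floating-point equality is fragile in practice.
    """
    vals = np.array([fi(x) for fi in fs])
    return set(np.flatnonzero(vals >= vals.max() - tol))
```

For example, with f(x) = max{x, −x, 0.5}, the point x = 0.5 activates the first and third functions.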

Let f be locally Lipschitz at a point x̄. As f is Lipschitz, there exists an open dense set D ⊂ IRⁿ such that f is continuously differentiable on D. The Clarke subdifferential [9] is constructed via

∂f(x) = ⋂_{ε>0} G_ε(x) where G_ε(x) = cl conv{∇f(y) : y ∈ B_ε(x) ∩ D}.

For a finite max function, assuming f_i ∈ C¹ for each i ∈ A(x̄), the Clarke subdifferential (as proven in [9, Prop. 2.3.12]) is equivalent to

∂f(x̄) = conv{∇f_i(x̄)}_{i∈A(x̄)}.    (2)

By equation (2), it is clear that for finite max functions the subdifferential is a compact set. This will be important in the convergence analysis in Section 2.2.
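As a small worked example, consider f(x) = |x| = max{x, −x}. At x̄ = 0 both component functions are active, and equation (2) gives

```latex
\partial f(0) = \operatorname{conv}\{\nabla f_1(0), \nabla f_2(0)\}
             = \operatorname{conv}\{1, -1\} = [-1, 1],
```

a compact interval, in agreement with the observation above.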

We are now ready to state the general form of the AGS algorithm, an approximate subgradient descent method.

2.1 Algorithm - AGS

We first provide a partial glossary of notation used in the definition of the AGS algorithm.

Table 1: Glossary of notation used in the AGS algorithm.

Glossary of Notation

k: Iteration counter
x_k: Current iterate
µ_k: Accuracy measure
∆_k: Sampling radius
m: Sample size
θ: Sampling radius reduction factor
y_j: Sampling points
Y: Sampled set of points
η: Armijo-like parameter
d_k: Search direction
t_k: Step length
t_min: Minimum step length
∇_A f_i: Approximate gradient of f_i
A(x_k): Active set at x_k
G_k: Approximate subdifferential
ε_tol: Stopping tolerance

Conceptual Algorithm: [Approximate Gradient Sampling Algorithm]

0. Initialize: Set k = 0 and input

x_0 - starting point
µ_0 > 0 - accuracy measure
∆_0 > 0 - initial sampling radius
θ ∈ (0, 1) - sampling radius reduction factor
0 < η < 1 - Armijo-like parameter
t_min - minimum step length
ε_tol > 0 - stopping tolerance

1. Generate Approximate Subdifferential G_k:

Generate a set Y = [x_k, y_1, . . . , y_m] around the current iterate x_k such that

max_{j=1,...,m} |y_j − x_k| ≤ ∆_k.

Use Y to calculate the approximate gradient of f_i, denoted ∇_A f_i, at x_k for each i ∈ A(x_k). Set

G_k = conv{∇_A f_i(x_k)}_{i∈A(x_k)}.

2. Generate Search Direction:

Let

d_k = −Proj(0|G_k).

Check if

∆_k ≤ µ_k |d_k|.    (3)

If (3) does not hold, then set x_{k+1} = x_k,

∆_{k+1} = θµ_k|d_k| if |d_k| ≠ 0, θ∆_k if |d_k| = 0,    (4)

k = k + 1 and return to Step 1. If (3) holds and |d_k| < ε_tol, then STOP. Else, continue to the line search.

3. Line Search:

Attempt to find t_k > 0 such that

f(x_k + t_k d_k) < f(x_k) − η t_k |d_k|².

Line Search Failure:
Set µ_{k+1} = µ_k/2, x_{k+1} = x_k and go to Step 4.

Line Search Success:
Let x_{k+1} be any point such that

f(x_{k+1}) ≤ f(x_k + t_k d_k).

4. Update and Loop:

Set ∆_{k+1} = max_{j=1,...,m} |y_j − x_k|, k = k + 1 and return to Step 1.


In Step 0 of the AGS algorithm, we set the iterate counter to 0, provide an initial starting point x_0, and initialize the parameter values.

In Step 1, we create the approximate subdifferential. First, we select a set of points around x_k within a sampling radius of ∆_k. In implementation, the points are randomly and uniformly sampled from a ball of radius ∆_k (using the MATLAB randsphere.m function [36]). Using this set Y, we then calculate an approximate gradient for each of the active functions at x_k and set the approximate subdifferential G_k equal to the convex hull of these active approximate gradients, ∇_A f_i(x_k). Details on various approximate gradients appear in Section 4.
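For concreteness, uniform sampling from a ball can be sketched in Python as follows. The function `sample_ball` is a hypothetical numpy stand-in for the MATLAB randsphere.m routine cited above: directions are drawn uniformly on the sphere, and radii are scaled by U^(1/n) so the points are uniform in volume.

```python
import numpy as np

def sample_ball(center, radius, m, rng=None):
    """Sample m points uniformly from the closed ball of the given
    radius around center (cf. the randsphere.m routine [36])."""
    rng = np.random.default_rng() if rng is None else rng
    center = np.asarray(center, dtype=float)
    n = center.size
    dirs = rng.standard_normal((m, n))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # uniform directions
    radii = radius * rng.random(m) ** (1.0 / n)          # uniform in volume
    return center + radii[:, None] * dirs
```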

In Step 2, we generate a search direction by solving the projection of 0 onto the approximate subdifferential: Proj(0|G_k) ∈ argmin_{g∈G_k} |g|². The search direction d_k is set equal to the negative of the solution, i.e., d_k = −Proj(0|G_k).
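This projection subproblem is a small convex QP over the unit simplex: minimize |Σ_i α_i g_i|² subject to α_i ≥ 0 and Σ_i α_i = 1. The Python sketch below is our own illustration (the paper does not prescribe a solver); it uses projected gradient descent with the standard Euclidean projection onto the simplex.

```python
import numpy as np

def project_simplex(a):
    """Euclidean projection of a onto {alpha : alpha >= 0, sum(alpha) = 1}."""
    u = np.sort(a)[::-1]
    css = np.cumsum(u)
    rho = np.flatnonzero(u * np.arange(1, a.size + 1) > css - 1.0)[-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(a - theta, 0.0)

def search_direction(grads, iters=2000):
    """Return d = -Proj(0 | conv{rows of grads}) by projected gradient
    descent on the simplex-constrained least-norm problem."""
    G = np.asarray(grads, dtype=float)
    Q = G @ G.T                                   # Gram matrix of gradients
    step = 1.0 / max(float(np.linalg.eigvalsh(Q).max()), 1e-12)
    alpha = np.full(G.shape[0], 1.0 / G.shape[0])
    for _ in range(iters):
        alpha = project_simplex(alpha - step * (Q @ alpha))
    return -(G.T @ alpha)
```

For instance, for gradients (1, 0) and (0, 1), the least-norm element of the convex hull is (0.5, 0.5), so d = (−0.5, −0.5).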

After finding a search direction, we check the inequality ∆_k ≤ µ_k|d_k|. This inequality determines if the current sampling radius is sufficiently small relative to the distance from 0 to the approximate subdifferential. If this inequality holds and |d_k| < ε_tol, then we terminate the algorithm, as 0 is within ε_tol of the approximate subdifferential and the sampling radius is small enough to reason that the approximate subdifferential is accurate. If the above inequality does not hold, then the approximate subdifferential is not sufficiently accurate to warrant a line search, so we decrease the sampling radius, set x_{k+1} = x_k, update the iterate counter and loop (Step 1). If the above inequality holds, but |d_k| ≥ ε_tol, then we proceed to a line search.

In Step 3, we carry out a line search. We attempt to find a step length t_k > 0 such that the Armijo-like condition holds

f(x_k + t_k d_k) < f(x_k) − η t_k |d_k|².    (5)

This condition ensures sufficient decrease is found in the function value. In implementation, we use a back-tracking line search (described in [30]) with an initial step-length of t_ini = 1, terminating when the step length t_k is less than a threshold t_min. If we find a t_k such that equation (5) holds, then we declare a line search success. If not, then we declare a line search failure.
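A back-tracking implementation of this Armijo-like test can be sketched as follows. This is our own Python illustration; parameter names such as `t_init` and `backtrack` are assumptions, with [30] describing the standard scheme.

```python
import numpy as np

def armijo_like_linesearch(f, x, d, eta, t_min, t_init=1.0, backtrack=0.5):
    """Search for t with f(x + t*d) < f(x) - eta * t * |d|^2.

    Returns (t, True) on a line search success, and (0.0, False)
    on failure, i.e. once t drops below t_min.
    """
    fx = f(x)
    dd = float(np.dot(d, d))   # |d|^2
    t = t_init
    while t >= t_min:
        if f(x + t * d) < fx - eta * t * dd:
            return t, True
        t *= backtrack
    return 0.0, False
```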

If a line search success occurs, then we let x_{k+1} be any point such that

f(x_{k+1}) ≤ f(x_k + t_k d_k).    (6)

In implementation, we do this by searching through the function values used in the calculation of our approximate gradients ({f(y_i)}_{y_i∈Y}). As this set of function values corresponds to points distributed around our current iterate, there is a good possibility of finding further function value decrease without having to carry out additional function evaluations. We find the minimum function value in our set of evaluations and if equation (6) holds for this minimum value, then we set x_{k+1} equal to the corresponding input point. Otherwise, we set x_{k+1} = x_k + t_k d_k.
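This reuse of sampled values can be sketched as below (a Python illustration of our own; `Y` holds the sample points and `f_vals` their already-computed function values, so no new evaluations are made beyond the one the line search produced).

```python
import numpy as np

def next_iterate(f, x, t, d, Y, f_vals):
    """Choose x_{k+1} after a line search success (equation (6)):
    take the best already-evaluated sample point if it beats x + t*d."""
    x_ls = x + t * d
    f_ls = f(x_ls)                 # value already computed by the line search
    j = int(np.argmin(f_vals))
    if f_vals[j] <= f_ls:          # equation (6) holds for the sample point
        return Y[j], f_vals[j]
    return x_ls, f_ls
```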

If a line search failure occurs, then we reduce the accuracy measure µ_k by a factor of 1/2 and set x_{k+1} = x_k.

Finally, in Step 4, we update the iterate counter and the sampling radius, and then loop to Step 1 to resample.


2.2 Convergence

For the following results, we denote the approximate subdifferential of f at x̄ as

G(x̄) = conv{∇_A f_i(x̄)}_{i∈A(x̄)},

where ∇_A f_i(x̄) is the approximate gradient of f_i at x̄. Our first result establishes an error bound relation between the elements of the approximate subdifferential and the exact subdifferential.

Lemma 2.1. Let f = max{f_i : i = 1, . . . , N} where each f_i ∈ C¹. Suppose there exists an ε > 0 such that |∇_A f_i(x̄) − ∇f_i(x̄)| ≤ ε for all i = 1, . . . , N. Then

1. for all w ∈ G(x̄), there exists a v ∈ ∂f(x̄) such that |w − v| ≤ ε, and

2. for all v ∈ ∂f(x̄), there exists a w ∈ G(x̄) such that |w − v| ≤ ε.

Proof. 1. By definition, for all w ∈ G(x̄) there exists a set of α_i such that

w = Σ_{i∈A(x̄)} α_i ∇_A f_i(x̄), where α_i ≥ 0, Σ_{i∈A(x̄)} α_i = 1.

By our assumption that each f_i ∈ C¹, we have ∂f(x̄) = conv{∇f_i(x̄)}_{i∈A(x̄)}. Using the same α_i as above, we see that

v = Σ_{i∈A(x̄)} α_i ∇f_i(x̄) ∈ ∂f(x̄).

Then

|w − v| = |Σ_{i∈A(x̄)} α_i ∇_A f_i(x̄) − Σ_{i∈A(x̄)} α_i ∇f_i(x̄)|
        ≤ Σ_{i∈A(x̄)} α_i |∇_A f_i(x̄) − ∇f_i(x̄)|
        ≤ Σ_{i∈A(x̄)} α_i ε
        = ε.

Hence, for all w ∈ G(x̄), there exists a v ∈ ∂f(x̄) such that

|w − v| ≤ ε.    (7)

2. Analogous arguments can be applied to v ∈ ∂f(x̄).

Lemma 2.1 quantifies the quality of the approximate subdifferential as an approximation to the exact subdifferential, provided the approximate gradients of the component functions are good approximations to the true gradients. Our next goal (in Theorem 2.4) is to show that eventually a line search success will occur in the AGS algorithm. To achieve this we make use of the following lemma.

Lemma 2.2. Let f = max{f_i : i = 1, . . . , N} where each f_i ∈ C¹. Suppose there exists an ε > 0 such that |∇_A f_i(x̄) − ∇f_i(x̄)| ≤ ε for all i = 1, . . . , N. Define d = −Proj(0|G(x̄)) and suppose |d| ≠ 0. Let β ∈ [0, 1). If ε < (1 − β)|d|, then for all v ∈ ∂f(x̄) we have

⟨d, v⟩ < −β|d|².


Proof. Notice that, by the Projection Theorem [3, Theorem 3.14], d = −Proj(0|G(x̄)) implies that

⟨0 − (−d), w − (−d)⟩ ≤ 0 for all w ∈ G(x̄).

Hence,

⟨d, w + d⟩ ≤ 0 for all w ∈ G(x̄).    (8)

So we have for all v ∈ ∂f(x̄),

⟨d, v⟩ = ⟨d, v − w + w + d − d⟩ for all w ∈ G(x̄)
       = ⟨d, v − w⟩ + ⟨d, w + d⟩ + ⟨d, −d⟩ for all w ∈ G(x̄)
       ≤ ⟨d, v − w⟩ − |d|² for all w ∈ G(x̄)
       ≤ |d||v − w| − |d|² for all w ∈ G(x̄).

For any v ∈ ∂f(x̄), using w as constructed in Lemma 2.1, we see that

⟨d, v⟩ ≤ |d|ε − |d|²
       < |d|²(1 − β) − |d|²  (as ε < (1 − β)|d|)
       = −β|d|².

Remark 2.3. In Lemma 2.2, for the case when β = 0, the condition ε < (1 − β)|d| simplifies to ε < |d|. Thus, if ε is bounded above by |d|, then Lemma 2.2 proves that for all v ∈ ∂f(x̄) we have ⟨d, v⟩ < 0, showing that d is a descent direction for f at x̄.

To guarantee convergence, we must show that, except in the case of 0 ∈ ∂f(x_k), the algorithm will always be able to find a sampling radius that satisfies the requirements in Step 2. In Section 4 we show that (for three different approximate gradients) the value ε (in Lemma 2.2) is linked to ∆. As unsuccessful line searches will drive ∆ to zero, this implies that eventually the requirements of Lemma 2.2 will be satisfied. We formalize this in the next two theorems.

Theorem 2.4. Let f = max{f_i : i = 1, . . . , N} where each f_i ∈ C¹. Suppose 0 ∉ ∂f(x_k) for each iteration k. Suppose there exists K̄ > 0 such that given any set of points generated in Step 1 of the AGS algorithm, the approximate gradient satisfies |∇_A f_i(x_k) − ∇f_i(x_k)| ≤ K̄∆_k for all i = 1, . . . , N. Let d_k = −Proj(0|G(x_k)). Then for any µ > 0, there exists ∆̄ = ∆̄(x_k) > 0 such that

∆ ≤ µ|d_k| + K̄µ(∆_k − ∆) for all 0 < ∆ < ∆̄.

Moreover, if ∆_k < ∆̄, then the following inequality holds

∆_k ≤ µ|d_k|.


Proof. Let v̄ = Proj(0|∂f(x_k)) (by assumption, v̄ ≠ 0). Given µ > 0, let

∆̄ = (1/(K̄ + 1/µ)) |v̄|,    (9)

and consider 0 < ∆ < ∆̄. Now create G(x_k) and d_k = −Proj(0|G(x_k)). As −d_k ∈ G(x_k), by Lemma 2.1(1), there exists a v_k ∈ ∂f(x_k) such that

|−d_k − v_k| ≤ K̄∆_k.

Then

K̄∆_k ≥ |−d_k − v_k|
⇒ K̄∆_k ≥ |v_k| − |d_k|
⇒ K̄∆_k ≥ |v̄| − |d_k|  (as |v| ≥ |v̄| for all v ∈ ∂f(x_k)).

Thus, for 0 < ∆ < ∆̄, we apply equation (9) to |v̄| in the above inequality to get

K̄∆_k ≥ (K̄ + 1/µ)∆ − |d_k|,

which rearranges to

∆ ≤ µ|d_k| + K̄µ(∆_k − ∆).

Hence, ∆ ≤ µ|d_k| + K̄µ(∆_k − ∆) for all 0 < ∆ < ∆̄. Finally, if ∆_k < ∆̄, then

∆_k ≤ µ|d_k|.

Remark 2.5. In Theorem 2.4, it is important to note that eventually the condition ∆_k < ∆̄ will hold. Examine ∆̄ as constructed above: K̄ is a constant and v̄ is associated with the current iterate. However, the current iterate is only updated when a line search success occurs, which will not occur unless the condition ∆_k ≤ µ_k|d_k| is satisfied. As a result, if ∆_k ≥ ∆̄, the AGS algorithm will reduce ∆_k, with ∆̄ remaining constant, until ∆_k < ∆̄.

Recall in Step 3 of the AGS algorithm, for a given η ∈ (0, 1), we attempt to find a step length t_k > 0 such that

f(x_k + t_k d_k) < f(x_k) − η t_k |d_k|².

The following result shows that eventually the above inequality will hold in the AGS algorithm. Recall that the exact subdifferential for a finite max function, as defined in (2), is a compact set. Thus, we know that in the following theorem ṽ is well-defined.

Theorem 2.6. Fix 0 < η < 1. Let f = max{f_i : i = 1, . . . , N} where each f_i ∈ C¹. Suppose there exists an ε > 0 such that |∇_A f_i(x̄) − ∇f_i(x̄)| ≤ ε for all i = 1, . . . , N. Define d = −Proj(0|G(x̄)) and suppose |d| ≠ 0. Let ṽ ∈ arg max{⟨d, v⟩ : v ∈ ∂f(x̄)}. Let β = 2η/(1 + η). If ε < (1 − β)|d|, then there exists t̄ > 0 such that

f(x̄ + td) − f(x̄) < −ηt|d|² for all 0 < t < t̄.


Proof. Note that β ∈ (0, 1). Recall, from Lemma 2.2, we have for all v ∈ ∂f(x̄)

⟨d, v⟩ < −β|d|².    (10)

Using β = 2η/(1 + η), equation (10) becomes

⟨d, v⟩ < −(2η/(1 + η))|d|² for all v ∈ ∂f(x̄).    (11)

From equation (11) we can conclude that for all v ∈ ∂f(x̄)

⟨d, v⟩ < 0.

Notice that

lim_{τ↘0} (f(x̄ + τd) − f(x̄))/τ = max{⟨d, v⟩ : v ∈ ∂f(x̄)} = ⟨d, ṽ⟩ < 0.

Therefore, there exists t̄ > 0 such that

(f(x̄ + td) − f(x̄))/t < ((η + 1)/2)⟨d, ṽ⟩ for all 0 < t < t̄.

For such a t, we have

f(x̄ + td) − f(x̄) < ((η + 1)/2) t ⟨d, ṽ⟩
                  < −((η + 1)/2)(2η/(η + 1)) t |d|²
                  = −η t |d|².

Hence,

f(x̄ + td) − f(x̄) < −ηt|d|² for all 0 < t < t̄.

Combining the previous results, we can show that the AGS algorithm is guaranteed to find function value decrease (provided 0 ∉ ∂f(x_k)). We summarize with the following corollary.

Corollary 2.7. Let f = max{f_i : i = 1, . . . , N} where each f_i ∈ C¹. Suppose 0 ∉ ∂f(x_k) for each iteration k. Suppose there exists a K̄ > 0 such that given any set of points generated in Step 1 of the AGS algorithm, the approximate gradient satisfies |∇_A f_i(x_k) − ∇f_i(x_k)| ≤ K̄∆_k for all i = 1, . . . , N. Then after a finite number of iterations, the algorithm will find a new iterate with a lower function value.

Proof. Consider x_k, where 0 ∉ ∂f(x_k). To find function value decrease with the AGS algorithm, we must declare a line search success in Step 3. The AGS algorithm will only carry out a line search if the condition below is satisfied

∆_k ≤ µ_k|d_k|,    (12)


where d_k = −Proj(0|G(x_k)), as usual. In Theorem 2.4, we showed that for any µ_k > 0, there exists a ∆̄ = ∆̄(x_k) > 0 such that if ∆_k < ∆̄(x_k), then equation (12) is satisfied. If equation (12) is not satisfied, then ∆_k is updated according to equation (4) and x_{k+1} = x_k, which further implies ∆̄ = ∆̄(x_{k+1}) = ∆̄(x_k) is unchanged. In this case, whether |d_k| ≠ 0 or |d_k| = 0, we can see that ∆_{k+1} ≤ θ∆_k. Hence an infinite sequence of equation (12) being unsatisfied is impossible (as eventually we would have ∆_k < ∆̄). So eventually equation (12) will be satisfied and the AGS algorithm will carry out a line search.

Now, in order to have a line search success, we must be able to find a step length t_k such that the Armijo-like condition holds,

f(x_k + t_k d_k) < f(x_k) − η t_k |d_k|².

In Theorem 2.6, we showed that there exists t̄ > 0 such that

f(x_k + t_k d_k) − f(x_k) < −η t_k |d_k|² for all 0 < t_k < t̄,

provided that for β ∈ (0, 1),

ε < (1 − β)|d_k|.    (13)

Set ε = K̄∆_k. If equation (13) does not hold, then a line search failure will occur, resulting in µ_{k+1} = 0.5µ_k. Thus, eventually we will have µ_k < (1 − β)/K̄ and

∆_k ≤ µ_k|d_k| < ((1 − β)/K̄)|d_k|,

which means equation (13) will hold. Thus, after a finite number of iterations, the AGS algorithm will declare a line search success and find a new iterate with a lower function value.

We are now ready to prove convergence. In particular, we study the limiting case of the algorithm generating an infinite sequence (i.e., the situation with ε_tol = 0). In the following, assuming that the step length t_k is bounded away from 0 means that there exists a t̄ > 0 such that t_k > t̄.

Theorem 2.8. Let f = max{f_i : i = 1, . . . , N} where each f_i ∈ C¹. Set ε_tol = 0 and suppose that {x_k}_{k=0}^∞ is an infinite sequence generated by the AGS algorithm. Suppose there exists a K̄ > 0 such that given any set of points generated in Step 1 of the AGS algorithm, the approximate gradient satisfies the error bound |∇_A f_i(x_k) − ∇f_i(x_k)| ≤ K̄∆_k for all i = 1, . . . , N. Suppose t_k is bounded away from 0. Then either

1. f(x_k) ↓ −∞, or

2. |d_k| → 0, ∆_k ↓ 0 and every cluster point x̄ of the sequence {x_k}_{k=0}^∞ satisfies 0 ∈ ∂f(x̄).

Proof. If f(x_k) ↓ −∞, then we are done. Otherwise, f(x_k) is non-increasing and bounded below, therefore f(x_k) converges. We consider two cases.

Case 1: An infinite number of line search successes occur.

Let x̄ be a cluster point of {x_k}_{k=0}^∞. Notice that x_k only changes for line search successes, so


there exists a subsequence {x_{k_j}}_{j=0}^∞ of line search successes such that x_{k_j} → x̄. Then for each corresponding step length t_{k_j} and direction d_{k_j}, the following condition holds

f(x_{k_j+1}) ≤ f(x_{k_j} + t_{k_j} d_{k_j}) < f(x_{k_j}) − η t_{k_j} |d_{k_j}|².

Note that

0 ≤ η t_{k_j} |d_{k_j}|² < f(x_{k_j}) − f(x_{k_j+1}).

Since f(x_k) converges we know that f(x_{k_j}) − f(x_{k_j+1}) → 0. Since t_{k_j} is bounded away from 0, we see that

lim_{j→∞} |d_{k_j}| = 0.

Recall from the AGS algorithm, we check the condition

∆_{k_j} ≤ µ_{k_j} |d_{k_j}|.

As ∆_{k_j} > 0, µ_{k_j} ≤ µ_0, and |d_{k_j}| → 0, we conclude that ∆_{k_j} ↓ 0.

Finally, from Lemma 2.1(1), as −d_{k_j} ∈ G(x_{k_j}), there exists a v_{k_j} ∈ ∂f(x_{k_j}) such that

|−v_{k_j} − d_{k_j}| ≤ K̄∆_{k_j}
⇒ |−v_{k_j}| − |d_{k_j}| ≤ K̄∆_{k_j}
⇒ |v_{k_j}| ≤ K̄∆_{k_j} + |d_{k_j}|,

which implies that

0 ≤ |v_{k_j}| ≤ K̄∆_{k_j} + |d_{k_j}| → 0.

So,

lim_{j→∞} |v_{k_j}| = 0,

where |v_{k_j}| ≥ dist(0|∂f(x_{k_j})) ≥ 0, which implies dist(0|∂f(x_{k_j})) → 0. We have x_{k_j} → x̄. As f is a finite max function, ∂f is outer semicontinuous (see [35, Definition 5.4 & Proposition 8.7]). Hence, every cluster point x̄ of a convergent subsequence of {x_k}_{k=0}^∞ satisfies 0 ∈ ∂f(x̄).

Case 2: A finite number of line search successes occur.

This means there exists a k̄ such that x_k = x_{k̄} = x̄ for all k ≥ k̄. However, by Corollary 2.7, if 0 ∉ ∂f(x̄), then after a finite number of iterations, the algorithm will find function value decrease (line search success). Hence, we have 0 ∈ ∂f(x̄).

To see ∆_k ↓ 0 and |d_k| → 0, note that by Lemma 2.1(1) and 0 ∈ ∂f(x̄), we have that for all k > k̄ there exists d ∈ G(x_k) such that |d − 0| ≤ K̄∆_k. In particular, |d_k| = |Proj(0|G(x_k))| ≤ K̄∆_k ≤ K̄∆_0 for all k > k̄. Now note that one of two situations must occur: either equation (3) is unsatisfied an infinite number of times, or after a finite number of steps equation (3) is always satisfied and a line search failure occurs in Step 3. In the first case, we directly have ∆_k ↓ 0 (by Step 2). In the second case, we have µ_k ↓ 0 (by Step 3), so ∆_k ≤ µ_k|d_k| ≤ µ_k K̄∆_0 (by equation (3)) implies ∆_k ↓ 0. Finally, |d_k| ≤ K̄∆_k completes the proof.

Our last result shows that if the algorithm terminates in Step 2, then the distance from 0 to the exact subdifferential is controlled by ε_tol.


Theorem 2.9. Let f = max{f_i : i = 1, . . . , N} where each f_i ∈ C¹. Suppose there exists a K̄ > 0 such that for each iteration k, the approximate gradient satisfies |∇_A f_i(x_k) − ∇f_i(x_k)| ≤ K̄∆_k for all i = 1, . . . , N. Suppose the AGS algorithm terminates at some iteration k̄ in Step 2 for ε_tol > 0. Then

dist(0|∂f(x_{k̄})) < (1 + K̄µ_0)ε_tol.

Proof. Let w̄ = Proj(0|G(x_k)). We use v̄ ∈ ∂f(x_k) as constructed in Lemma 2.1(1) to see that

dist(0|∂f(x_k)) ≤ dist(0|w̄) + dist(w̄|v̄)
               = |d_k| + |w̄ − v̄|
               ≤ |d_k| + K̄∆_k
               < ε_tol + K̄∆_k.

The final statement now follows by the test ∆_k ≤ µ_k|d_k| ≤ µ_0 ε_tol in Step 2.

3 Robust Approximate Gradient Sampling Algorithm

The AGS algorithm depends on the active set of functions at each iterate, A(x_k). Of course, it is possible at various times in the algorithm for there to be functions that are inactive at the current iterate, but active within a small radius of the current iterate. Typically such behaviour means that the current iterate is close to a 'nondifferentiable ridge' formed by the function. In [6] and [7], it is suggested that allowing an algorithm to take into account these 'almost active' functions will provide a better idea of what is happening at and around the current iterate, thus making the algorithm more robust.

In this section we present the robust approximate gradient sampling algorithm (RAGS algorithm). Specifically, we adapt the AGS algorithm by expanding our active set to include all functions that are active at any of the points in the set Y. Recall from the AGS algorithm that the set Y is sampled from within a ball of radius ∆_k. Thus, the points in Y are not far from the current iterate. We define the robust active set next.

Definition 3.1. Let f = max{f_i : i = 1, . . . , N} where f_i ∈ C¹. Let y_0 = x_k be the current iterate and Y = [y_0, y_1, y_2, . . . , y_m] be a set of randomly sampled points from a ball centered at y_0 with radius ∆_k. The robust active set of f on Y is

A(Y) = ∪_{y_j∈Y} A(y_j).
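Definition 3.1 can be sketched in Python as a union of per-point active sets. This is our own illustration, not the paper's implementation; the tolerance `tol` in `active_set` is a practical assumption not present in the definition.

```python
import numpy as np

def active_set(fs, x, tol=1e-12):
    """{i : f_i(x) = f(x)}, with a tolerance for floating-point ties."""
    vals = np.array([fi(x) for fi in fs])
    return set(np.flatnonzero(vals >= vals.max() - tol))

def robust_active_set(fs, Y, tol=1e-12):
    """A(Y): the union of the active sets over all sampled points in Y."""
    out = set()
    for y in Y:
        out |= active_set(fs, y, tol)
    return out
```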

3.1 Algorithm - RAGS

Using the idea of the robust active set, we alter the AGS algorithm by replacing Steps 1 and 2 with the following.

1. Generate Approximate Subdifferential G_Y^k (Robust):

Generate a set Y = [x_k, y_1, . . . , y_m] around the current iterate x_k such that

max_{j=1,...,m} |y_j − x_k| ≤ ∆_k.

Use Y to calculate the approximate gradient of f_i, denoted ∇_A f_i, at x_k for each i ∈ A(Y). Then set G_k = conv{∇_A f_i(x_k)}_{i∈A(x_k)} and G_Y^k = conv{∇_A f_i(x_k)}_{i∈A(Y)}.


2. Generate Search Direction:

Let

d_k = −Proj(0|G_k).

Let

d_Y^k = −Proj(0|G_Y^k).

Check if

∆_k ≤ µ_k|d_k|.    (14)

If (14) does not hold, then set x_{k+1} = x_k,

∆_{k+1} = θµ_k|d_k| if |d_k| ≠ 0, θ∆_k if |d_k| = 0,

k = k + 1 and return to Step 1. If (14) holds and |d_k| < ε_tol, then STOP. Else, continue to the line search, using d_Y^k as a search direction.

Notice that in Step 2 we still use the stopping conditions from Section 2. Although this modification requires the calculation of two projections, it should be noted that neither of these projections is particularly difficult to calculate. In Section 3.3, we use the Goldstein approximate subdifferential to adapt Theorem 2.9 to work for stopping conditions based on d_Y^k, but we still do not have theoretical results for the exact subdifferential. It is important to note that no additional function evaluations are required for this modification.

In the numerics section, we test each version of our algorithm using the robust descent direction to check the stopping conditions. The results show convincingly that the robust stopping conditions not only guarantee convergence, but also significantly decrease the number of function evaluations required for the algorithm to converge.

3.2 Convergence

To show that the RAGS algorithm is well-defined, we will require the fact that when ∆_k is small enough, the robust active set is in fact equal to the original active set.

Lemma 3.2. Let f = max{f_i : i = 1, . . . , N} where each f_i ∈ C¹. Let Y = [x̄, y_1, . . . , y_m] be a randomly sampled set from a ball centered at x̄ with radius ∆. Then there exists an ε̃ > 0 such that if Y ⊆ B_ε̃(x̄), then A(x̄) = A(Y).

Proof. Clearly, if i ∈ A(x̄), then i ∈ A(Y) as x̄ ∈ Y.

Consider i ∉ A(x̄). Then by the definition of f, we have that

f_i(x̄) < f(x̄).

By definition, f is continuous, thus, there exists an ε̃_i > 0 such that for all z ∈ B_{ε̃_i}(x̄),

f_i(z) < f(z).

If ∆ < ε̃_i, then we have |y_j − x̄| < ε̃_i for all j = 1, . . . , m. Therefore,

f_i(y_j) < f(y_j) for all j = 1, . . . , m,    (15)

so i ∉ A(Y). Setting ε̃ = min_i ε̃_i completes the proof.


Using Lemma 3.2, we can easily conclude that the AGS algorithm is still well-defined when using the robust active set.

Corollary 3.3. Let f = max{f_i : i = 1, . . . , N} where each f_i ∈ C¹. Suppose 0 ∉ ∂f(x_k) for each iteration k. Suppose there exists a K̄ > 0 such that given any set of points generated in Step 1 of the RAGS algorithm, the approximate gradient satisfies |∇_A f_i(x_k) − ∇f_i(x_k)| ≤ K̄∆_k for all i = 1, . . . , N. Then after a finite number of iterations, the RAGS algorithm will find function value decrease.

Proof. Consider xk, where 0 /∈∂f(xk).

For eventual contradiction, suppose we do not ﬁnd function value decrease. In the RAGS

algorithm, this corresponds to an inﬁnite number of line search failures. If we have an inﬁnite

number of line search failures, then ∆k→0, as |dk|is bounded, and x¯

k=xkfor all ¯

k≥k.

In Lemma 3.2, ˜εdepends only on xk. Hence, we can conclude that eventually ∆k≤˜εand

therefore Y⊆B˜ε(xk). Thus, eventually A(xk) = A(Yk). Once the two active sets are equal,

the results of Section 2 will hold.

To examine convergence of the RAGS algorithm we use the result that eventually the robust

active set at the current iterate will be a subset of the regular active set at any cluster point of

the algorithm.

Lemma 3.4. Let $f = \max\{f_i : i = 1, \dots, N\}$ where each $f_i \in C^1$. Let $Y^k = [x^k, y^1, \dots, y^m]$ be a randomly sampled set from a ball centered at $x^k$ with radius $\Delta^k$. Let $x^k \to \bar{x}$. Then there exists an $\tilde{\varepsilon} > 0$ such that if $Y^k \subseteq B_{\tilde{\varepsilon}}(\bar{x})$, then $A(Y^k) \subseteq A(\bar{x})$.

Proof. Let $i \notin A(\bar{x})$. We must show that for $k$ sufficiently large, $i \notin A(Y^k)$.

By definition of $f$, we have that
$$f_i(\bar{x}) < f(\bar{x}).$$
Since $f$ is continuous, there exists an $\tilde{\varepsilon}_i > 0$ such that for all $z \in B_{\tilde{\varepsilon}_i}(\bar{x})$,
$$f_i(z) < f(z).$$
If $Y^k \subseteq B_{\tilde{\varepsilon}_i}(\bar{x})$, then we have $|x^k - \bar{x}| < \tilde{\varepsilon}_i$ and $|y^j - \bar{x}| < \tilde{\varepsilon}_i$ for all $j = 1, \dots, m$. Therefore
$$f_i(x^k) < f(x^k) \quad\text{and}\quad f_i(y^j) < f(y^j) \text{ for all } j = 1, \dots, m.$$
Thus, if $Y^k \subseteq B_{\tilde{\varepsilon}_i}(\bar{x})$, then $i \notin A(Y^k)$. Letting $\tilde{\varepsilon} = \min_i \tilde{\varepsilon}_i$ completes the proof. $\Box$

Now we examine the convergence of the RAGS algorithm.

Theorem 3.5. Let $f = \max\{f_i : i = 1, \dots, N\}$ where each $f_i \in C^1$. Set $\varepsilon_{tol} = 0$ and suppose that $\{x^k\}_{k=0}^{\infty}$ is an infinite sequence generated by the RAGS algorithm. Suppose there exists a $\bar{K} > 0$ such that, given any set of points generated in Step 1 of the RAGS algorithm, the approximate gradient satisfies the error bound $|\nabla_A f_i(x^k) - \nabla f_i(x^k)| \le \bar{K}\Delta^k$ for all $i = 1, \dots, N$. Suppose $t_k$ is bounded away from 0. Then either

1. $f(x^k) \downarrow -\infty$, or

2. $|d^k| \to 0$, $\Delta^k \downarrow 0$ and every cluster point $\bar{x}$ of the sequence $\{x^k\}_{k=0}^{\infty}$ satisfies $0 \in \partial f(\bar{x})$.

Proof. If $f(x^k) \downarrow -\infty$, then we are done.

Otherwise, $f(x^k)$ is bounded below; since $f(x^k)$ is also non-increasing, it converges. We consider two cases.

Case 1: An infinite number of line search successes occur.

Let $\bar{x}$ be a cluster point of $\{x^k\}_{k=0}^{\infty}$. Notice that $x^k$ only changes at line search successes, so there exists a subsequence $\{x^{k_j}\}_{j=0}^{\infty}$ of line search successes such that $x^{k_j} \to \bar{x}$. Following the arguments of Theorem 2.8, we have $|d^{k_j}| \to 0$ and $\Delta^{k_j} \downarrow 0$. Notice that if $\Delta^{k_j} \downarrow 0$, then eventually $Y^{k_j} \subseteq B_{\tilde{\varepsilon}}(\bar{x})$, where $x^{k_j} \to \bar{x}$ and $\tilde{\varepsilon}$ is defined as in Lemma 3.4. Thus, by Lemma 3.4, we have that $A(Y^{k_j}) \subseteq A(\bar{x})$. This means that $G_{Y^{k_j}}(x^{k_j})$ is formed from a subset of the approximated active gradients related to $\partial f(\bar{x})$. Thus, by Lemma 2.1(1), as $-d^{k_j} \in G_{Y^{k_j}}(x^{k_j})$, we can construct a $v^{k_j} \in \partial f(\bar{x})$ from the same set of approximated active gradients related to $G_{Y^{k_j}}(x^{k_j})$, such that
$$\begin{aligned}
|-d^{k_j} - v^{k_j}| &= \Big| \sum_{i \in A(Y^{k_j})} \alpha_i \nabla_A f_i(x^{k_j}) - \sum_{i \in A(Y^{k_j})} \alpha_i \nabla f_i(\bar{x}) \Big| \\
&\le \sum_{i \in A(Y^{k_j})} \alpha_i \big| \nabla_A f_i(x^{k_j}) - \nabla f_i(\bar{x}) \big| \\
&\le \sum_{i \in A(Y^{k_j})} \alpha_i \big( |\nabla_A f_i(x^{k_j}) - \nabla f_i(x^{k_j})| + |\nabla f_i(x^{k_j}) - \nabla f_i(\bar{x})| \big) \\
&\le \sum_{i \in A(Y^{k_j})} \alpha_i \big( \bar{K}\Delta^{k_j} + |\nabla f_i(x^{k_j}) - \nabla f_i(\bar{x})| \big) \\
&\le \bar{K}\Delta^{k_j} + \sum_{i \in A(Y^{k_j})} \alpha_i |\nabla f_i(x^{k_j}) - \nabla f_i(\bar{x})|.
\end{aligned}$$
Using the triangle inequality, we have that
$$|v^{k_j}| - |d^{k_j}| \le \bar{K}\Delta^{k_j} + \sum_{i \in A(Y^{k_j})} \alpha_i |\nabla f_i(x^{k_j}) - \nabla f_i(\bar{x})|.$$
We have already shown that $|d^{k_j}| \to 0$ and $\Delta^{k_j} \downarrow 0$. Furthermore, since $\nabla f_i$ is continuous and $x^{k_j} \to \bar{x}$, we have $|\nabla f_i(x^{k_j}) - \nabla f_i(\bar{x})| \to 0$. So,
$$\lim_{j \to \infty} |v^{k_j}| = 0.$$
Using the same arguments as in Theorem 2.8, the result follows.

Case 2: A finite number of line search successes occur.

This means there exists a $\bar{k}$ such that $x^k = x^{\bar{k}} = \bar{x}$ for all $k \ge \bar{k}$. However, by Corollary 3.3, if $0 \notin \partial f(\bar{x})$, then after a finite number of iterations, the algorithm will find function value decrease (a line search success). Hence, we have $0 \in \partial f(\bar{x})$. This implies $\Delta^k \downarrow 0$ and $|d^k| \to 0$, as in the proof of Theorem 2.8. $\Box$

Remark 3.6. Using $d^k$ to check our stopping conditions allows the result of Theorem 2.9 to still hold.


3.3 Robust Stopping with Goldstein Approximate Subdifferential

We want to provide some insight as to how Theorem 2.9 can work for stopping conditions based on $d^k_Y$; that is, replacing the stopping conditions $\Delta^k \le \mu^k|d^k|$ and $|d^k| < \varepsilon_{tol}$ in Step 2 with the robust stopping conditions
$$\Delta^k \le \mu^k|d^k_Y| \quad\text{and}\quad |d^k_Y| < \varepsilon_{tol}. \tag{16}$$

In the situation when the algorithm terminates, the following proposition does not theoretically justify why the stopping conditions are sufficient, but it does help explain their logic. Theoretically, since we do not know what $\bar{x}$ is, we cannot tell when $A(Y) \subseteq A(\bar{x})$. However, as seen above, we do know that if $x^k \to \bar{x}$, then eventually $A(Y^k) \subseteq A(\bar{x})$.

Proposition 3.7. Let $f = \max\{f_i : i = 1, \dots, N\}$ where each $f_i \in C^1$. Suppose there exists a $\bar{K} > 0$ such that $|\nabla_A f_i(x) - \nabla f_i(x)| \le \bar{K}\Delta^k$ for all $i = 1, \dots, N$ and for all $x \in B_{\Delta^k}(x^k)$. Suppose the RAGS algorithm terminates at some iteration $\bar{k}$ in Step 2 using the robust stopping conditions given in (16). Furthermore, suppose there exists $\bar{x} \in B_{\Delta^{\bar{k}}}(x^{\bar{k}})$ such that $A(Y^{\bar{k}}) \subseteq A(\bar{x})$. Then
$$\operatorname{dist}(0 \mid \partial f(\bar{x})) < (1 + \bar{K}\mu^0)\varepsilon_{tol}.$$

Proof. If $A(Y^{\bar{k}}) \subseteq A(\bar{x})$, then the proofs of Lemma 2.1(1) and Theorem 2.9 still hold. $\Box$

Additionally, in the following results, we approach the theory for robust stopping conditions using the Goldstein approximate subdifferential. If the RAGS algorithm terminates in Step 2, then it is shown that the distance between 0 and the Goldstein approximate subdifferential is controlled by $\varepsilon_{tol}$. Again, this does not prove that the robust stopping conditions are sufficient for the exact subdifferential.

First, the Goldstein approximate subdifferential, as defined in [17], is given by the set
$$\partial^G_{\Delta} f(\bar{x}) = \operatorname{conv}\{\partial f(z) : z \in B_{\Delta}(\bar{x})\}. \tag{17}$$

We now show that the Goldstein approximate subdifferential contains all of the gradients of the active functions in the robust active set.

Lemma 3.8. Let $f = \max\{f_i : i = 1, \dots, N\}$. Let $Y = [y^0, y^1, y^2, \dots, y^m]$ be a randomly sampled set from a ball centered at $y^0 = \bar{x}$ with radius $\Delta$. If $f_i \in C^1$ for each $i$, then
$$\partial^G_{\Delta} f(\bar{x}) \supseteq \operatorname{conv}\{\nabla f_i(y^j) : y^j \in Y,\ i \in A(y^j)\}.$$

Proof. If $f_i \in C^1$ for each $i \in A(Y)$, then by equation (2), for each $y^j \in Y$ we have
$$\partial f(y^j) = \operatorname{conv}\{\nabla f_i(y^j)\}_{i \in A(y^j)} = \operatorname{conv}\{\nabla f_i(y^j) : i \in A(y^j)\}.$$
Using this in our definition of the Goldstein approximate subdifferential in (17) and knowing $B_{\Delta}(\bar{x}) \supseteq Y$, we have
$$\partial^G_{\Delta} f(\bar{x}) \supseteq \operatorname{conv}\{\operatorname{conv}\{\nabla f_i(y^j) : i \in A(y^j)\} : y^j \in Y\},$$
which simplifies to
$$\partial^G_{\Delta} f(\bar{x}) \supseteq \operatorname{conv}\{\nabla f_i(y^j) : y^j \in Y,\ i \in A(y^j)\}. \tag{18}$$
$\Box$


Now we have a result similar to Lemma 2.1(1) for $d^k_Y$ with respect to the Goldstein approximate subdifferential.

Remark 3.9. For the following two results, we assume each of the $f_i \in C^{1+}$ with Lipschitz constant $L$. Note that this implies the Lipschitz constant $L$ is independent of $i$. If each $f_i \in C^{1+}$ with Lipschitz constant $L_i$, then $L$ is easily obtained by $L = \max\{L_i : i = 1, \dots, N\}$.

Lemma 3.10. Let $f = \max\{f_i : i = 1, \dots, N\}$ where $f_i \in C^{1+}$ with Lipschitz constant $L$. Let $Y = [y^0, y^1, y^2, \dots, y^m]$ be a randomly sampled set from a ball centered at $y^0 = \bar{x}$ with radius $\Delta$. Suppose there exists a $\bar{K} > 0$ such that $|\nabla_A f_i(\bar{x}) - \nabla f_i(\bar{x})| \le \bar{K}\Delta$. Then for all $w \in G_Y(\bar{x})$, there exists a $g \in \partial^G_{\Delta} f(\bar{x})$ such that
$$|w - g| \le (\bar{K} + L)\Delta.$$

Proof. By definition, for all $w \in G_Y(\bar{x})$ there exists a set of $\alpha_i$ such that
$$w = \sum_{i \in A(Y)} \alpha_i \nabla_A f_i(\bar{x}), \quad\text{where } \alpha_i \ge 0,\ \sum_{i \in A(Y)} \alpha_i = 1.$$
By our assumption that each $f_i \in C^{1+}$, Lemma 3.8 holds. It is clear that for each $i \in A(Y)$, we have $i \in A(y^j)$ for some $y^j \in Y$. Let $j_i$ be the index corresponding to this active index; i.e., $i \in A(y^{j_i})$. Thus, for each $i \in A(Y)$, there is a corresponding active gradient
$$\nabla f_i(y^{j_i}) \in \operatorname{conv}\{\nabla f_i(y^{j_i}) : y^{j_i} \in Y,\ i \in A(y^{j_i})\} \subseteq \partial^G_{\Delta} f(\bar{x}).$$
Using the same $\alpha_i$ as above, we can construct
$$g = \sum_{i \in A(Y)} \alpha_i \nabla f_i(y^{j_i}) \in \operatorname{conv}\{\nabla f_i(y^{j_i}) : y^{j_i} \in Y,\ i \in A(y^{j_i})\} \subseteq \partial^G_{\Delta} f(\bar{x}).$$
Then
$$\begin{aligned}
|w - g| &= \Big| \sum_{i \in A(Y)} \alpha_i \nabla_A f_i(\bar{x}) - \sum_{i \in A(Y)} \alpha_i \nabla f_i(y^{j_i}) \Big| \\
&\le \sum_{i \in A(Y)} \alpha_i |\nabla_A f_i(\bar{x}) - \nabla f_i(y^{j_i})| \\
&\le \sum_{i \in A(Y)} \alpha_i \big( |\nabla_A f_i(\bar{x}) - \nabla f_i(\bar{x})| + |\nabla f_i(\bar{x}) - \nabla f_i(y^{j_i})| \big) \\
&\le \sum_{i \in A(Y)} \alpha_i \Big( \bar{K}\Delta + L \max_{j_i} |\bar{x} - y^{j_i}| \Big) \\
&\le (\bar{K} + L)\Delta. \qquad\Box
\end{aligned}$$

Thus, using Lemma 3.10, we can show that if the algorithm stops due to the robust stopping conditions, then the distance from 0 to the Goldstein approximate subdifferential is controlled by $\varepsilon_{tol}$.


Proposition 3.11. Let $f = \max\{f_i : i = 1, \dots, N\}$ where each $f_i \in C^{1+}$ with Lipschitz constant $L$. Suppose there exists a $\bar{K} > 0$ such that for each iteration $k$, the approximate gradient satisfies $|\nabla_A f_i(x^k) - \nabla f_i(x^k)| \le \bar{K}\Delta^k$ for all $i = 1, \dots, N$. Suppose the RAGS algorithm terminates at some iteration $\bar{k}$ in Step 2 using the robust stopping conditions given in (16). Then
$$\operatorname{dist}\big(0 \mid \partial^G_{\Delta^{\bar{k}}} f(x^{\bar{k}})\big) < [1 + \mu^0(\bar{K} + L)]\varepsilon_{tol}.$$

Proof. Let $\bar{w} = \operatorname{Proj}(0 \mid G_{Y^{\bar{k}}}(x^{\bar{k}}))$. We use $\bar{g} \in \partial^G_{\Delta^{\bar{k}}} f(x^{\bar{k}})$ as constructed in Lemma 3.10 to see that
$$\begin{aligned}
\operatorname{dist}\big(0 \mid \partial^G_{\Delta^{\bar{k}}} f(x^{\bar{k}})\big) &\le \operatorname{dist}(0 \mid \bar{g}) \\
&\le \operatorname{dist}(0 \mid \bar{w}) + \operatorname{dist}(\bar{w} \mid \bar{g}) \\
&= |d^{\bar{k}}_Y| + |\bar{w} - \bar{g}| \\
&\le |d^{\bar{k}}_Y| + (\bar{K} + L)\Delta^{\bar{k}} \\
&< \varepsilon_{tol} + (\bar{K} + L)\Delta^{\bar{k}}.
\end{aligned}$$
The statement now follows from the test $\Delta^{\bar{k}} \le \mu^{\bar{k}}|d^{\bar{k}}_Y|$ in Step 2 and the fact that $\mu^{\bar{k}} \le \mu^0$, as $\{\mu^k\}_{k=0}^{\infty}$ is a non-increasing sequence. $\Box$
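The quantity $d^k_Y = -\operatorname{Proj}(0 \mid G_Y(x^k))$ used above is the solution of a small quadratic program (the paper's implementation uses MATLAB's `quadprog`). Purely for illustration, when the approximate subdifferential has only two generators, the projection of the origin onto their convex hull (a line segment) has a closed form; the gradients below are hypothetical.

```python
def proj_origin_segment(g1, g2):
    """Project the origin onto the segment conv{g1, g2} in R^n.

    Minimizes |(1-t)*g1 + t*g2|^2 over t in [0, 1]; the unconstrained
    minimizer is t = <g1, g1 - g2> / |g1 - g2|^2, clamped to [0, 1].
    """
    diff = [a - b for a, b in zip(g1, g2)]
    denom = sum(c * c for c in diff)
    if denom == 0.0:  # g1 == g2: the segment is a single point
        return list(g1)
    t = sum(a * c for a, c in zip(g1, diff)) / denom
    t = max(0.0, min(1.0, t))
    return [(1 - t) * a + t * b for a, b in zip(g1, g2)]

# Two (hypothetical) active gradients straddling a nondifferentiable ridge.
g1, g2 = [1.0, 1.0], [1.0, -1.0]
w = proj_origin_segment(g1, g2)  # nearest point of the hull to 0: [1.0, 0.0]
d = [-c for c in w]              # descent direction d = -w
```

With more than two generators the same minimum-norm problem is solved numerically, which is exactly the quadratic program referenced in Step 2.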

4 Approximate Gradients

As seen in the previous two sections, in order for convergence to be guaranteed in the AGS or RAGS algorithm, the approximate gradient used must satisfy an error bound for each of the active $f_i$. Specifically, there must exist a $\bar{K} > 0$ such that
$$|\nabla_A f_i(x^k) - \nabla f_i(x^k)| \le \bar{K}\Delta^k,$$
where $\Delta^k = \max_j |y^j - x^k|$. In this section, we present three specific approximate gradients that satisfy the above requirement: the simplex gradient, the centered simplex gradient and the Gupal estimate of the gradient of the Steklov averaged function.

4.1 Simplex Gradient

The simplex gradient is a commonly used approximate gradient. In recent years, several derivative-free algorithms have been proposed that use the simplex gradient ([5], [22], [12], [11], and [19], among others). It is geometrically defined as the gradient of the linear interpolation of $f$ over a set of $n+1$ points in $\mathbb{R}^n$. Mathematically, we define it as follows.

Let $Y = [y^0, y^1, \dots, y^n]$ be a set of affinely independent points in $\mathbb{R}^n$. We say that $Y$ forms the simplex $S = \operatorname{conv}\{Y\}$. Thus, $S$ is a simplex if it can be written as the convex hull of an affinely independent set of $n+1$ points in $\mathbb{R}^n$.

The simplex gradient of a function $f$ over the set $Y$ is given by
$$\nabla_s f(Y) = L^{-1}\delta f(Y),$$
where
$$L = [y^1 - y^0 \;\; \cdots \;\; y^n - y^0]^\top$$
and
$$\delta f(Y) = \begin{bmatrix} f(y^1) - f(y^0) \\ \vdots \\ f(y^n) - f(y^0) \end{bmatrix}.$$
The condition number of the simplex formed by $Y$ is given by $\|\hat{L}^{-1}\|$, where
$$\hat{L} = \frac{1}{\Delta}[y^1 - y^0 \;\; y^2 - y^0 \;\; \dots \;\; y^n - y^0]^\top \quad\text{and where } \Delta = \max_{j=1,\dots,n} |y^j - y^0|.$$
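As a sketch (pure Python, no libraries), the simplex gradient can be computed by forming $L$ and $\delta f(Y)$ and solving one linear system; the function and sample set below are illustrative, not from the paper's test set.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def simplex_gradient(f, Y):
    """Simplex gradient over Y = [y0, y1, ..., yn]: solve L g = delta_f,
    where row j of L is (y^j - y^0)^T and (delta_f)_j = f(y^j) - f(y^0)."""
    y0 = Y[0]
    L = [[yj[k] - y0[k] for k in range(len(y0))] for yj in Y[1:]]
    rhs = [f(yj) - f(y0) for yj in Y[1:]]
    return solve_linear(L, rhs)

# For an affine function the linear interpolation is exact, so the
# simplex gradient recovers the true gradient.
f = lambda x: 2.0 * x[0] + 3.0 * x[1] - 1.0
Y = [[0.0, 0.0], [0.1, 0.0], [0.03, 0.1]]
g = simplex_gradient(f, Y)  # g ≈ [2.0, 3.0]
```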

4.1.1 Convergence

The following result (by Kelley [23]) shows that there exists an appropriate error bound between the simplex gradient and the exact gradient of our objective function. We note that the Lipschitz constant used in the following theorem corresponds to $\nabla f_i$.

Theorem 4.1. Consider $f_i \in C^{1+}$ with Lipschitz constant $K_i$ for $\nabla f_i$. Let $Y = [y^0, y^1, \dots, y^n]$ form a simplex. Let
$$\hat{L} = \frac{1}{\Delta}[y^1 - y^0 \;\; y^2 - y^0 \;\; \dots \;\; y^n - y^0]^\top, \quad\text{where } \Delta = \max_{j=1,\dots,n} |y^j - y^0|.$$
Then the simplex gradient satisfies the error bound
$$|\nabla_s f_i(Y) - \nabla f_i(y^0)| \le \bar{K}\Delta, \quad\text{where } \bar{K} = \tfrac{1}{2} K_i \sqrt{n}\, \|\hat{L}^{-1}\|.$$

Proof. See [23, Lemma 6.2.1]. $\Box$

With the above error bound result, we conclude that convergence holds when using the simplex gradient as an approximate gradient in both the AGS and RAGS algorithms.

Corollary 4.2. Let $f = \max\{f_i : i = 1, \dots, N\}$ where each $f_i \in C^{1+}$ with Lipschitz constant $K_i$ for $\nabla f_i$. If the approximate gradient used in the AGS or RAGS algorithm is the simplex gradient and $\|\hat{L}^{-1}\|$ is bounded above for each simplex gradient computed, then the results of Theorems 2.4, 2.6, 2.8, 2.9 and 3.5 hold.

4.1.2 Algorithm - Simplex Gradient

In order to calculate a simplex gradient in Step 1, we generate a set $Y = [x^k, y^1, \dots, y^n]$ of points in $\mathbb{R}^n$ and then check whether $Y$ forms a well-poised simplex by calculating its condition number, $\|\hat{L}^{-1}\|$. A bounded condition number ($\|\hat{L}^{-1}\| < n$) ensures a 'good' error bound between the approximate gradient and the exact gradient.

If the set $Y$ does not form a well-poised simplex ($\|\hat{L}^{-1}\| \ge n$), then we resample. If $Y$ forms a well-poised simplex, then we calculate the simplex gradient of $f_i$ over $Y$ for each $i \in A(x^k)$ and then set the approximate subdifferential equal to the convex hull of the active simplex gradients. We note that the probability of generating a random matrix with a condition number greater than $n$ is asymptotically constant [38]. Thus, randomly generating simplices is a quick and practical option. Furthermore, notice that calculating the condition number does not require function evaluations; thus, resampling does not affect the number of function evaluations required by the algorithm.
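For intuition, a minimal well-poisedness check in the plane: build $\hat{L}$, invert it (closed form for $2\times 2$), and compare $\|\hat{L}^{-1}\|$ against $n$. We assume here that $\|\cdot\|$ is the operator 2-norm, computed from the eigenvalues of $\hat{L}^{-\top}\hat{L}^{-1}$; the sample points are hypothetical.

```python
import math

def cond_check_2d(y0, y1, y2, n_bound=2.0):
    """Return (norm_of_Lhat_inverse, well_poised) for a 2-D sample set.

    L_hat has rows (y^j - y^0)/Delta with Delta = max_j |y^j - y^0|;
    the set is declared well poised when ||L_hat^{-1}||_2 < n_bound.
    """
    rows = [[y1[0] - y0[0], y1[1] - y0[1]],
            [y2[0] - y0[0], y2[1] - y0[1]]]
    delta = max(math.hypot(r[0], r[1]) for r in rows)
    a, b = rows[0][0] / delta, rows[0][1] / delta
    c, d = rows[1][0] / delta, rows[1][1] / delta
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    # 2-norm of a 2x2 matrix M: sqrt of the largest eigenvalue of M^T M.
    s11 = inv[0][0] ** 2 + inv[1][0] ** 2
    s22 = inv[0][1] ** 2 + inv[1][1] ** 2
    s12 = inv[0][0] * inv[0][1] + inv[1][0] * inv[1][1]
    tr, dt = s11 + s22, s11 * s22 - s12 * s12
    lam_max = (tr + math.sqrt(max(tr * tr - 4 * dt, 0.0))) / 2.0
    norm = math.sqrt(lam_max)
    return norm, norm < n_bound

# Orthogonal directions of equal length give the best conditioning:
# L_hat is the identity, so ||L_hat^{-1}|| = 1 < n = 2.
norm, ok = cond_check_2d([0.0, 0.0], [0.1, 0.0], [0.0, 0.1])
```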


4.2 Centered Simplex Gradient

The centered simplex gradient is the average of two simplex gradients. Although it requires more function evaluations, it has the advantage that the error bound it satisfies is in terms of $\Delta^2$, rather than $\Delta$.

Let $Y = [y^0, y^1, \dots, y^n]$ form a simplex. We define the sets
$$Y^+ = [x, x + \tilde{y}^1, \dots, x + \tilde{y}^n] \quad\text{and}\quad Y^- = [x, x - \tilde{y}^1, \dots, x - \tilde{y}^n],$$
where $x = y^0$ and $\tilde{y}^i = y^i - y^0$ for $i = 1, \dots, n$. The centered simplex gradient is the average of the two simplex gradients over the sets $Y^+$ and $Y^-$; i.e.,
$$\nabla_{CS} f(Y) = \tfrac{1}{2}\big(\nabla_S f(Y^+) + \nabla_S f(Y^-)\big).$$
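For the coordinate simplex $\tilde{y}^j = \Delta e^j$, the centered simplex gradient reduces to componentwise central differences, which this illustrative sketch exploits. Since the $\Delta^2$ error constant involves the Lipschitz constant of the Hessian, the estimate is exact for quadratics; the test function is hypothetical.

```python
def centered_simplex_gradient(f, x, delta):
    """Centered simplex gradient over the coordinate simplex: average of
    the simplex gradients over Y+ = [x, x + delta*e_j] and
    Y- = [x, x - delta*e_j], which equals the central difference
    (f(x + delta*e_j) - f(x - delta*e_j)) / (2*delta) in each coordinate j.
    """
    g = []
    for j in range(len(x)):
        xp = x[:]; xp[j] += delta
        xm = x[:]; xm[j] -= delta
        g.append((f(xp) - f(xm)) / (2.0 * delta))
    return g

# Exact on quadratics (the Hessian is constant, so its Lipschitz
# constant is 0), even with the fairly large radius delta = 0.1.
f = lambda x: x[0] ** 2 + 2.0 * x[1] ** 2
g = centered_simplex_gradient(f, [1.0, 1.0], 0.1)  # g ≈ [2.0, 4.0]
```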

4.2.1 Convergence

To show that the AGS and RAGS algorithms are both well-defined when using the centered simplex gradient as an approximate gradient, we provide an error bound between the centered simplex gradient and the exact gradient (again by Kelley [23]).

Theorem 4.3. Consider $f_i \in C^{2+}$ with Lipschitz constant $K_i$ for $\nabla^2 f_i$. Let $Y = [y^0, y^1, \dots, y^n]$ form a simplex. Let
$$\hat{L} = \frac{1}{\Delta}[y^1 - y^0, \dots, y^n - y^0], \quad\text{where } \Delta = \max_{j=1,\dots,n} |y^j - y^0|.$$
Then the centered simplex gradient satisfies the error bound
$$|\nabla_{CS} f_i(Y) - \nabla f_i(y^0)| \le \bar{K}\Delta^2, \quad\text{where } \bar{K} = K_i \sqrt{n}\, \|\hat{L}^{-1}\|.$$

Proof. See [23, Lemma 6.2.5]. $\Box$

Notice that Theorem 4.3 requires $f_i \in C^{2+}$. If $f_i \in C^{1+}$, then the error bound is in terms of $\Delta$, not $\Delta^2$. With the above error bound result, we conclude that convergence holds when using the centered simplex gradient as an approximate gradient in both the AGS and RAGS algorithms.

Corollary 4.4. Let $f = \max\{f_i : i = 1, \dots, N\}$ where each $f_i \in C^{2+}$ with Lipschitz constant $K_i$ for $\nabla^2 f_i$. If the approximate gradient used in the AGS or RAGS algorithm is the centered simplex gradient, $\|\hat{L}^{-1}\|$ is bounded above for each centered simplex gradient computed and $\Delta^0 \le 1$, then the results of Theorems 2.4, 2.6, 2.8, 2.9 and 3.5 hold.

Proof. Since $\Delta^0 \le 1$ and $\{\Delta^k\}$ is non-increasing, we have $(\Delta^k)^2 \le \Delta^k$, and hence Theorems 2.4, 2.6, 2.8, 2.9 and 3.5 hold. $\Box$


4.2.2 Algorithm

In Step 1, to adapt the AGS or RAGS algorithm to use the centered simplex gradient, we sample our set $Y$ in the same manner as for the simplex gradient (resampling until a well-poised set is achieved). We then form the sets $Y^+$ and $Y^-$ and proceed as expected.

4.3 Gupal Estimate

The nonderivative version of the gradient sampling algorithm presented by Kiwiel in [24] uses the Gupal estimate of the gradient of the Steklov averaged function as an approximate gradient. We see in Theorem 4.8 that an appropriate error bound exists for this approximate gradient. Surprisingly, unlike the error bounds for the simplex and centered simplex gradients, the error bound in Theorem 4.8 does not include a condition number term. Mathematically, we define the Gupal estimate of the gradient of the Steklov averaged function as follows.

For $\alpha > 0$, the Steklov averaged function $f_\alpha$, as defined in [16, Def. 3.1], is given by
$$f_\alpha(x) = \int_{\mathbb{R}^n} f(x - y)\,\psi_\alpha(y)\,dy,$$
where $\psi_\alpha : \mathbb{R}^n \to \mathbb{R}_+$ is the Steklov mollifier defined by
$$\psi_\alpha(y) = \begin{cases} 1/\alpha^n & \text{if } y \in [-\alpha/2, \alpha/2]^n, \\ 0 & \text{otherwise.} \end{cases}$$
We can equivalently define the Steklov averaged function by
$$f_\alpha(x) = \frac{1}{\alpha^n} \int_{x_1 - \alpha/2}^{x_1 + \alpha/2} \cdots \int_{x_n - \alpha/2}^{x_n + \alpha/2} f(y)\,dy_1 \dots dy_n. \tag{19}$$
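As a one-dimensional illustration (our notation, not from the paper), the Steklov average of $f(x) = |x|$ at the kink is $f_\alpha(0) = \frac{1}{\alpha}\int_{-\alpha/2}^{\alpha/2} |y|\,dy = \alpha/4$, which a simple midpoint-rule quadrature of (19) reproduces:

```python
def steklov_average_1d(f, x, alpha, m=2000):
    """Approximate f_alpha(x) = (1/alpha) * integral of f over
    [x - alpha/2, x + alpha/2] via the midpoint rule with m cells."""
    h = alpha / m
    lo = x - alpha / 2.0
    total = sum(f(lo + (k + 0.5) * h) for k in range(m))
    return total * h / alpha

alpha = 0.2
val = steklov_average_1d(abs, 0.0, alpha)  # ≈ alpha / 4 = 0.05
```

Away from the kink ($|x| \ge \alpha/2$), the average leaves $|x|$ unchanged, so the mollification only smooths a neighborhood of the nondifferentiable point.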

The partial derivatives of $f_\alpha$ are given by ([16, Prop. 3.11] and [18])
$$\frac{\partial f_\alpha}{\partial x_i}(x) = \int_{B_\infty} \gamma_i(x, \alpha, \zeta)\,d\zeta \tag{20}$$
for $i = 1, \dots, n$, where $B_\infty = [-1/2, 1/2]^n$ is the unit cube centered at 0 and
$$\begin{aligned}
\gamma_i(x, \alpha, \zeta) = \frac{1}{\alpha}\big[ &f(x_1 + \alpha\zeta_1, \dots, x_{i-1} + \alpha\zeta_{i-1},\ x_i + \tfrac{1}{2}\alpha,\ x_{i+1} + \alpha\zeta_{i+1}, \dots, x_n + \alpha\zeta_n) \\
- \ &f(x_1 + \alpha\zeta_1, \dots, x_{i-1} + \alpha\zeta_{i-1},\ x_i - \tfrac{1}{2}\alpha,\ x_{i+1} + \alpha\zeta_{i+1}, \dots, x_n + \alpha\zeta_n)\big].
\end{aligned} \tag{21}$$
Given $\alpha > 0$ and $z = (\zeta^1, \dots, \zeta^n) \in \prod_{i=1}^n B_\infty$, the Gupal estimate of $\nabla f_\alpha(x)$ over $z$ is given by
$$\gamma(x, \alpha, z) = \big(\gamma_1(x, \alpha, \zeta^1), \dots, \gamma_n(x, \alpha, \zeta^n)\big). \tag{22}$$
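A direct pure-Python transcription of (21)–(22) reads as follows; the sample $z$ would be drawn uniformly from the unit cube as in Step 1, and the test function is illustrative.

```python
import random

def gupal_estimate(f, x, alpha, z):
    """Gupal estimate gamma(x, alpha, z) of grad f_alpha(x), per (21)-(22).

    z = (zeta^1, ..., zeta^n), where each zeta^i is a point of the cube
    [-1/2, 1/2]^n; component i uses the difference of f at x_i +/- alpha/2
    with the remaining coordinates perturbed by alpha * zeta^i.
    """
    n = len(x)
    g = []
    for i in range(n):
        zeta = z[i]
        base = [x[k] + alpha * zeta[k] for k in range(n)]
        hi = base[:]; hi[i] = x[i] + 0.5 * alpha
        lo = base[:]; lo[i] = x[i] - 0.5 * alpha
        g.append((f(hi) - f(lo)) / alpha)
    return g

# For an affine f, the two evaluation points differ only in coordinate i,
# so the estimate recovers the gradient exactly, for any sample z.
f = lambda x: 1.0 * x[0] + 2.0 * x[1] + 5.0
z = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(2)]
g = gupal_estimate(f, [0.3, -0.7], 0.1, z)  # g ≈ [1.0, 2.0]
```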

Remark 4.5. Although we define $\gamma(x, \alpha, z)$ as the Gupal estimate of $\nabla f_\alpha(x)$, in Section 4.3.1 we will show that $\gamma(x, \alpha, z)$ provides a good approximation to the exact gradient, $\nabla f(x)$.

Remark 4.6. For the following results, we note that the $\alpha$ used in the above definitions is equivalent to our sampling radius $\Delta$. Thus, we will be replacing $\alpha$ with $\Delta$ in the convergence results in Section 4.3.1.


4.3.1 Convergence

As before, in order to show that both the AGS and RAGS algorithms are well-defined when using the Gupal estimate as an approximate gradient, we must establish that it provides a good approximation of the exact gradient. To do this, we first need the following classic result from [13].

Lemma 4.7 ([13, Lemma 4.1.12]). Let $f \in C^{1+}$ with Lipschitz constant $K$ for $\nabla f$. Let $y^0 \in \mathbb{R}^n$. Then for any $y \in \mathbb{R}^n$,
$$|f(y) - f(y^0) - \nabla f(y^0)^\top (y - y^0)| \le \tfrac{1}{2} K |y - y^0|^2.$$

Using Lemma 4.7, we establish an error bound between the Gupal estimate and the exact gradient of $f$.

Theorem 4.8. Consider $f_i \in C^{1+}$ with Lipschitz constant $K_i$ for $\nabla f_i$. Then for $\Delta > 0$, $z = (\zeta^1, \dots, \zeta^n) \in Z = \prod_{i=1}^n B_\infty$ and any point $x \in \mathbb{R}^n$, the Gupal estimate of $\nabla f_{i,\Delta}(x)$ satisfies the error bound
$$|\gamma(x, \Delta, z) - \nabla f_i(x)| \le \sqrt{n}\,\tfrac{1}{2} K_i \Delta (\sqrt{n} + 3).$$

Proof. For $\Delta > 0$, let
$$y^{j-} = [x_1 + \Delta\zeta^j_1, \dots, x_{j-1} + \Delta\zeta^j_{j-1},\ x_j - \tfrac{1}{2}\Delta,\ x_{j+1} + \Delta\zeta^j_{j+1}, \dots, x_n + \Delta\zeta^j_n]^\top$$
and
$$y^{j+} = [x_1 + \Delta\zeta^j_1, \dots, x_{j-1} + \Delta\zeta^j_{j-1},\ x_j + \tfrac{1}{2}\Delta,\ x_{j+1} + \Delta\zeta^j_{j+1}, \dots, x_n + \Delta\zeta^j_n]^\top.$$
Applying Lemma 4.7, we have that
$$|f_i(y^{j+}) - f_i(y^{j-}) - \nabla f_i(y^{j-})^\top (y^{j+} - y^{j-})| \le \tfrac{1}{2} K_i |y^{j+} - y^{j-}|^2. \tag{23}$$
From equation (21) (with $\alpha = \Delta$), we can see that
$$f_i(y^{j+}) - f_i(y^{j-}) = \Delta\,\gamma_j(x, \Delta, \zeta^j).$$
Hence, equation (23) becomes
$$|\Delta\gamma_j(x, \Delta, \zeta^j) - \nabla f_i(y^{j-})^\top (y^{j+} - y^{j-})| \le \tfrac{1}{2} K_i |y^{j+} - y^{j-}|^2. \tag{24}$$
From our definitions of $y^{j-}$ and $y^{j+}$, we can see that
$$y^{j+} - y^{j-} = [0, \dots, 0, \Delta, 0, \dots, 0]^\top.$$
The inner product in equation (24) therefore simplifies to
$$\nabla f_i(y^{j-})^\top (y^{j+} - y^{j-}) = \Delta\,\frac{\partial f_i}{\partial x_j}(y^{j-}).$$
Thus, we have
$$\Big|\Delta\gamma_j(x, \Delta, \zeta^j) - \Delta\,\frac{\partial f_i}{\partial x_j}(y^{j-})\Big| \le \tfrac{1}{2} K_i \Delta^2,$$
implying
$$\Big|\gamma_j(x, \Delta, \zeta^j) - \frac{\partial f_i}{\partial x_j}(y^{j-})\Big| \le \tfrac{1}{2} K_i \Delta. \tag{25}$$
Also notice that
$$|y^{j-} - x| = \Delta\,\Big|\big(\zeta^j_1, \dots, \zeta^j_{j-1}, -\tfrac{1}{2}, \zeta^j_{j+1}, \dots, \zeta^j_n\big)\Big|.$$
Using the standard basis vector $e^j$, we have
$$|y^{j-} - x| = \Delta\,\big|\zeta^j - \zeta^j_j e^j - \tfrac{1}{2} e^j\big| \le \Delta\Big(|\zeta^j| + |\zeta^j_j e^j| + \tfrac{1}{2}\Big) \le \tfrac{1}{2}\Delta(\sqrt{n} + 2).$$
Thus, since $f_i \in C^{1+}$, we have
$$|\nabla f_i(y^{j-}) - \nabla f_i(x)| \le K_i\,\tfrac{1}{2}\Delta(\sqrt{n} + 2). \tag{26}$$
Noting that
$$\Big|\frac{\partial f_i}{\partial x_j}(y^{j-}) - \frac{\partial f_i}{\partial x_j}(x)\Big| \le |\nabla f_i(y^{j-}) - \nabla f_i(x)|,$$
we have
$$\Big|\frac{\partial f_i}{\partial x_j}(y^{j-}) - \frac{\partial f_i}{\partial x_j}(x)\Big| \le K_i\,\tfrac{1}{2}\Delta(\sqrt{n} + 2). \tag{27}$$
Using equations (25) and (27), we have
$$\begin{aligned}
\Big|\gamma_j(x, \Delta, \zeta^j) - \frac{\partial f_i}{\partial x_j}(x)\Big| &\le \Big|\gamma_j(x, \Delta, \zeta^j) - \frac{\partial f_i}{\partial x_j}(y^{j-})\Big| + \Big|\frac{\partial f_i}{\partial x_j}(y^{j-}) - \frac{\partial f_i}{\partial x_j}(x)\Big| \\
&\le \tfrac{1}{2} K_i \Delta + K_i\,\tfrac{1}{2}\Delta(\sqrt{n} + 2) \\
&= \tfrac{1}{2} K_i \Delta(\sqrt{n} + 3).
\end{aligned}$$
Finally,
$$|\gamma(x, \Delta, z) - \nabla f_i(x)| = \sqrt{\sum_{j=1}^n \Big(\gamma_j(x, \Delta, \zeta^j) - \frac{\partial f_i}{\partial x_j}(x)\Big)^2} \le \sqrt{\sum_{j=1}^n \Big(\tfrac{1}{2} K_i \Delta(\sqrt{n} + 3)\Big)^2} = \sqrt{n}\,\tfrac{1}{2} K_i \Delta(\sqrt{n} + 3). \qquad\Box$$


We conclude that convergence holds when using the Gupal estimate of the gradient of the Steklov averaged function of $f$ as an approximate gradient in both the AGS and RAGS algorithms.

Corollary 4.9. Let $f = \max\{f_i : i = 1, \dots, N\}$ where each $f_i \in C^{1+}$ with Lipschitz constant $K_i$ for $\nabla f_i$. If the approximate gradient used in the AGS or RAGS algorithm is the Gupal estimate of the gradient of the Steklov averaged function, then the results of Theorems 2.4, 2.6, 2.8, 2.9 and 3.5 hold.

4.3.2 Algorithm

To use the Gupal estimate of the gradient of the Steklov averaged function in the AGS and RAGS algorithms, in Step 1 we sample $\{z^{k,l}\}_{l=1}^{m}$ independently and uniformly from the unit cube in $\mathbb{R}^{n \times n}$, where $m$ is the number of active functions.

5 Numerical Results

5.1 Versions of the AGS and RAGS Algorithms

We implemented the AGS and RAGS algorithms using the simplex gradient, the centered

simplex gradient and the Gupal estimate of the gradient of the Steklov averaged function as

approximate gradients.

Additionally, we used the robust descent direction to create robust stopping conditions. That is, the algorithm terminates when
$$\Delta^k \le \mu^k |d^k_Y| \quad\text{and}\quad |d^k_Y| < \varepsilon_{tol}, \tag{28}$$
where $d^k_Y$ is the projection of 0 onto the approximate subdifferential generated using the robust active set. (See Proposition 3.11 for results linking the robust stopping conditions with the Goldstein approximate subdifferential.) The implementation was done in MATLAB (v. 7.11.0.584, R2010b). Software is available by contacting the corresponding author.

Let $d^k$ denote the regular descent direction and let $d^k_Y$ denote the robust descent direction. There are three scenarios that could occur when using the robust stopping conditions:

1. $|d^k| = |d^k_Y|$;

2. $|d^k| \ge |d^k_Y|$, but checking the stopping conditions leads to the same result (line search, radius decrease or termination); or

3. $|d^k| \ge |d^k_Y|$, but checking the stopping conditions leads to a different result.

In Scenarios 1 and 2, the robust stopping conditions have no influence on the algorithm. In Scenario 3, we have two cases:

1. $\Delta^k \le \mu^k|d^k_Y| \le \mu^k|d^k|$, but $|d^k| \ge \varepsilon_{tol}$ and $|d^k_Y| < \varepsilon_{tol}$; or

2. $\Delta^k \le \mu^k|d^k|$ holds, but $\Delta^k > \mu^k|d^k_Y|$.


Thus, we hypothesize that the robust stopping conditions will cause the AGS and RAGS algorithms to do one of two things: terminate early, providing a solution with a smaller quality measure but requiring fewer function evaluations to find; or reduce the sampling radius instead of carrying out a line search, which reduces the number of function evaluations carried out during that iteration and yields a more accurate approximate subdifferential at the next iteration.

Our goal in this testing is to determine if there are any notable numerical differences in the quality of the three approximate gradients (simplex, centered simplex, and Gupal estimate), the two search directions (robust and non-robust), and the two stopping conditions (robust and non-robust). This leads to the following 12 versions:

AGS Simplex (1. non-robust/2. robust stopping)

RAGS Simplex (3. non-robust/4. robust stopping)

AGS Centered Simplex (5. non-robust/6. robust stopping)

RAGS Centered Simplex (7. non-robust/8. robust stopping)

AGS Gupal (9. non-robust/10. robust stopping)

RAGS Gupal (11. non-robust/12. robust stopping)

5.2 Test Sets and Software

Testing was performed on a 2.0 GHz Intel Core i7 Macbook Pro. We used the test set from

Lukˇsan-Vlˇcek, [27]. The ﬁrst 25 problems presented are of the desired form

min

x{F(x)}where F(x) = max

i=1,2,...,N{fi(x)}.

Of these 25 problems, we omit problem 2.17 because the sub-functions are complex-valued.

Thus, our test set presents a total of 24 ﬁnite minimax problems with dimensions from 2 to 20.

There are several problems with functions fithat have the form fi=|fi|, where fiis a smooth

function. We rewrote these functions as fi= max{fi,−fi}. The resulting test problems have

from 2 to 130 sub-functions. A summary of the test problems appears in Table 2 in Appendix

A.

5.3 Initialization and Stopping Conditions

We first describe our choices for the initialization parameters used in the AGS and RAGS algorithms.

The initial starting points are given for each problem in [27]. We set the initial accuracy measure to 0.5 with a reduction factor of 0.5. We set the initial sampling radius to 0.1 with a reduction factor of 0.5. The Armijo-like parameter $\eta$ was chosen to be 0.1 to ensure that a line search success resulted in a significant function value decrease. We set the minimum step length to $10^{-10}$.

Next, we discuss the stopping tolerances used to ensure finite termination of the AGS and RAGS algorithms. We encoded four possible reasons for termination in our algorithm. The first is our theoretical stopping condition, while the remaining three ensure numerical stability of the algorithm.

1. Stopping conditions met - As stated in the theoretical algorithm, the algorithm terminates for this reason when $\Delta^k \le \mu^k|d^k|$ and $|d^k| < \varepsilon_{tol}$, where $d^k$ is either the regular or the robust descent direction.

2. Hessian matrix has NaN/Inf entries - For the solution of the quadratic program in Step 2, we use the quadprog command in MATLAB, which has certain numerical limitations. When these limitations result in NaN or Inf entries in the Hessian, the algorithm terminates.

3. $\Delta^k$, $\mu^k$, and $|d^k|$ are small - This stopping criterion bypasses the test $\Delta^k \le \mu^k|d^k|$ (in Step 2) and stops if $\Delta^k < \Delta_{tol}$, $|\mu^k| < \mu_{tol}$ and $|d^k| < \varepsilon_{tol}$. Examining Theorem 2.9 along with Theorems 4.1, 4.3 and 4.8, it is clear that this is also a valid stopping criterion. We used a bound of $10^{-6}$ in our implementation for both $\Delta^k$ and $\mu^k$.

4. Max number of function evaluations reached - As a final failsafe, we added an upper bound of $10^6$ on the number of function evaluations allowed. (This stopping condition only occurs once in our results.)

5.4 Results

Due to the randomness in the AGS and RAGS algorithms, we carry out 25 trials for each version. For each of the 25 trials, we record the number of function evaluations, the number of iterations, the solution, the quality of the solution and the reason for termination. The quality is measured by the improvement in the number of digits of accuracy, which is calculated using the formula
$$-\log_{10} \frac{|F_{\min} - F^*|}{|F_0 - F^*|},$$
where $F_{\min}$ is the function value at the final (best) iterate, $F^*$ is the true minimum value (optimal value) of the problem (as given in [27]) and $F_0$ is the function value at the initial iterate. Results on function evaluations and solution quality appear in Tables 3, 4 and 5 of the appendix.
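Concretely, with the logarithm taken base 10 so that the measure counts decimal digits, the computation is:

```python
import math

def digits_of_accuracy(F_min, F_star, F0):
    """Improvement in digits of accuracy: -log10(|Fmin - F*| / |F0 - F*|)."""
    return -math.log10(abs(F_min - F_star) / abs(F0 - F_star))

# Starting 1.0 above the optimal value and finishing 1e-3 above it
# corresponds to a gain of 3 digits of accuracy.
d = digits_of_accuracy(0.001, 0.0, 1.0)  # ≈ 3.0
```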

To visually compare algorithmic versions, we use performance profiles. A performance profile is the (cumulative) distribution function for a performance metric [14]. For the AGS and RAGS algorithms, the performance metric is the ratio of the number of function evaluations taken by the current version to successfully solve each test problem versus the fewest number of function evaluations taken by any of the versions to successfully solve that problem. Performance profiles eliminate the need to discard failures in numerical results and provide a visual representation of the performance difference between several solvers. For full details on the construction of performance profiles, see [14].
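As a sketch of the construction in [14]: with $t_{p,s}$ the cost of solver $s$ on problem $p$ (infinite on failure), the profile value $\rho_s(\tau)$ is the fraction of problems on which $s$ is within a factor $\tau$ of the best solver. The costs below are hypothetical.

```python
def performance_profile(times, tau):
    """times[s][p] = cost of solver s on problem p (float('inf') = failure).
    Returns rho[s] = fraction of problems p with times[s][p] <= tau * best_p,
    where best_p is the smallest cost any solver achieves on problem p."""
    n_solvers, n_probs = len(times), len(times[0])
    best = [min(times[s][p] for s in range(n_solvers)) for p in range(n_probs)]
    return [sum(times[s][p] <= tau * best[p] for p in range(n_probs)) / n_probs
            for s in range(n_solvers)]

# Hypothetical function-evaluation counts for two solvers on three problems.
times = [[100.0, 400.0, float('inf')],   # solver A fails on problem 3
         [200.0, 200.0, 500.0]]          # solver B
rho_at_1 = performance_profile(times, 1.0)  # fraction of wins: [1/3, 2/3]
rho_at_2 = performance_profile(times, 2.0)  # within factor 2: [2/3, 1.0]
```

Failures never satisfy the ratio test for any finite $\tau$, which is why profiles handle failed runs without discarding them.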

In Figures 1(a) and 1(b) we include performance profiles showing all 12 versions of the AGS and RAGS algorithms tested, declaring a success for 1 digit and 3 digits of improvement, respectively.

[Figure 1: Performance profiles for 12 versions of the AGS/RAGS algorithm. (a) Accuracy improvement of 1 digit. (b) Accuracy improvement of 3 digits.]

In general, we can see that using the Gupal estimate of the gradient of the Steklov averaged function as an approximate gradient does not produce the best results. It only produces 1 or more digits of accuracy for problems 2.1, 2.2, 2.4, 2.10, 2.18 and 2.23 (robust version). There is no significant difference between the performance of the AGS and RAGS algorithms using the simplex and centered simplex gradients as approximate gradients.

Looking at the results in Tables 3, 4 and 5, and our performance profiles, we can make the following two observations:

i) the versions of the RAGS algorithm generally outperform (converge faster than) the versions of the AGS algorithm, and

ii) the RAGS algorithm using the robust stopping conditions terminates faster and with lower (but still significant) accuracy.

Robust active set: From our results, it is clear that expanding the active set to include 'almost active' functions in the RAGS algorithm greatly improves performance for the simplex and centered simplex versions. This robust active set brings more local information into the approximate subdifferential and thereby allows for descent directions that are more parallel to any nondifferentiable ridges formed by the function.

Robust stopping conditions: We notice from the performance profiles that, in terms of function evaluations, the robust stopping conditions improve the overall performance of the RAGS algorithm, although they decrease the average accuracy on some problems. These results correspond with our previously discussed hypothesis. Furthermore, upon studying the reasons for termination, it appears that with the non-robust stopping conditions the AGS and RAGS algorithms terminate mainly due to $\Delta^k$ and $\mu^k$ becoming too small. With the robust stopping conditions, the RAGS algorithm often terminated because the stopping conditions were satisfied. As our theory in Section 3.3 is not complete, we cannot make any theoretical statements about how the robust stopping conditions would perform in general (like those in Theorem 2.9). However, from our results, we conjecture that the alteration is beneficial for decreasing function evaluations.

In 23 of the 24 problems tested, for both robust and non-robust stopping conditions, the RAGS algorithm either matches or outperforms the AGS algorithm in average accuracy obtained over 25 trials using the simplex and centered simplex gradients. Knowing this, we conclude that the improvement in accuracy is due to the choice of a robust search direction.

5.5 Comparison to NOMAD

NOMAD is a well-established general DFO solver, not specialized for the class of minimax problems [25, 1]. Using the same set of 24 test problems as in our previous tests, we compare the RAGS algorithm to the NOMAD algorithm. Based on the previous tests, we use the simplex gradient with robust stopping conditions. All parameters for the NOMAD algorithm were left at their default settings. The performance profile generated by these two algorithms for an improvement of 3 digits of accuracy appears in Figure 2.

As seen in Figure 2, NOMAD appears slightly faster than RAGS, but slightly less robust. Overall, the two algorithms are fairly comparable. Further research into RAGS may lead to algorithmic improvement. For example, a more informed selection of the sample set $Y$ in Step 1, an improved line search, or better selection of default parameters for RAGS could all lead to improved convergence. We leave such research for future study.

6 Conclusion

We have presented a new derivative-free algorithm for finite minimax problems that exploits the smooth substructure of the problem. Convergence results are given for any arbitrary approximate gradient that satisfies an error bound dependent on the sampling radius. Three examples of such approximate gradients are given. Additionally, a robust version of the algorithm is presented and shown to have the same convergence results as the regular version.

[Figure 2: Performance profile comparing RAGS with NOMAD for an improvement of 3 digits of accuracy.]

Through numerical testing, we find that the robust version of the algorithm outperforms the regular version with respect to both the number of function evaluations required and the accuracy of the solutions obtained. Additionally, we tested robust stopping conditions and found that they generally required fewer function evaluations before termination. Overall, the robust stopping conditions paired with the robust version of the algorithm performed best (as seen in the performance profiles).

Considerable future work is available in this research direction. Most obvious is an exploration of the theory behind the performance of the robust stopping conditions. Another direction lies in the theoretical requirement bounding the step length away from 0 (see Theorems 2.8 and 3.5). In gradient-based methods, one common way to avoid this requirement is with the use of Wolfe-like conditions. We are unaware of any derivative-free variant of the Wolfe conditions.
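The Armijo-like line search used by the algorithm accepts the first step length giving sufficient decrease; the open question above is whether a curvature (Wolfe-type) test can be added without derivatives. A generic sketch of Armijo backtracking, with hypothetical parameter values, where only function values of f are evaluated once a directional-derivative estimate g_d has been formed:

```python
def armijo_backtrack(f, x, d, g_d, eta=1e-4, beta=0.5, t0=1.0, max_iter=50):
    """Armijo-like backtracking: shrink t until sufficient decrease holds.

    g_d is an estimate of the directional derivative f'(x; d) (negative
    for a descent direction). The acceptance test
        f(x + t*d) <= f(x) + eta * t * g_d
    uses only function values of f.
    """
    t = t0
    fx = f(x)
    for _ in range(max_iter):
        trial = [xi + t * di for xi, di in zip(x, d)]
        if f(trial) <= fx + eta * t * g_d:
            return t  # sufficient decrease achieved
        t *= beta  # backtrack
    return t

# Example: f(x) = x^2 at x = 1 with descent direction d = -2, g_d = -4.
t = armijo_backtrack(lambda x: x[0] ** 2, [1.0], [-2.0], -4.0)
```

A Wolfe-like condition would additionally reject steps that are too short by testing curvature along d, which is exactly the piece that is unclear how to certify from function values alone.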

Acknowledgements

The authors would like to express their gratitude to Dr. C. Sagastizábal for inspirational conversations regarding the Goldstein subdifferential. This research was partially funded by the NSERC DG program and by UBC IRF.

References

[1] A. M. Abramson, C. Audet, G. Couture, J. E. Dennis Jr., S. Le Digabel and C. Tribes. The

NOMAD project. Software available at http://www.gerad.ca/nomad.

[2] A. M. Bagirov, B. Karasözen, and M. Sezer. Discrete gradient method: derivative-free method for nonsmooth optimization. J. Optim. Theory Appl., 137(2):317–334, 2008.

[3] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert

Spaces. CMS Books in Mathematics. Springer, New York, NY, 2011.


[4] A. J. Booker, J. E. Dennis Jr., P. D. Frank, D. B. Seraﬁni, and V. Torczon. Optimization using

surrogate objectives on a helicopter test example. In Computational methods for optimal design

and control (Arlington, VA, 1997), volume 24 of Progr. Systems Control Theory, pages 49–58.

Birkhäuser Boston, MA, 1998.

[5] D. M. Bortz and C. T. Kelley. The simplex gradient and noisy optimization problems. In

Computational methods for optimal design and control, volume 24 of Progr. Systems Control

Theory, pages 77–90. Birkhäuser Boston, MA, 1998.

[6] J. V. Burke, A. S. Lewis, and M. L. Overton. Approximating subdiﬀerentials by random sampling

of gradients. Math. Oper. Res., 27(3):567–584, 2002.

[7] J. V. Burke, A. S. Lewis, and M. L. Overton. A robust gradient sampling algorithm for nonsmooth,

nonconvex optimization. SIAM J. Optim., 15(3):751–779, 2005.

[8] X. Cai, K. Teo, X. Yang, and X. Zhou. Portfolio optimization under a minimax rule. Manage.

Sci., 46(7):957–972, 2000.

[9] F. H. Clarke. Optimization and Nonsmooth Analysis. Classics Appl. Math. 5. SIAM, Philadelphia,

second edition, 1990.

[10] A. R. Conn, K. Scheinberg, and L. N. Vicente. Introduction to Derivative-free Optimization,

volume 8 of MPS/SIAM Series on Optimization. SIAM, Philadelphia, 2009.

[11] A. L. Custódio, J. E. Dennis Jr., and L. N. Vicente. Using simplex gradients of nonsmooth

functions in direct search methods. IMA J. Numer. Anal., 28(4):770–784, 2008.

[12] A. L. Custódio and L. N. Vicente. Using sampling and simplex derivatives in pattern search

methods. SIAM J. Optim., 18(2):537–555, 2007.

[13] J. E. Dennis Jr. and R. B. Schnabel. Numerical Methods for Unconstrained Optimization and

Nonlinear Equations (Classics in Applied Mathematics). SIAM, Philadelphia, 1996.

[14] E. D. Dolan and J. J. Moré. Benchmarking optimization software with performance profiles.

Math. Program., 91(2, Ser. A):201–213, 2002.

[15] R. Duvigneau and M. Visonneau. Hydrodynamic design using a derivative-free method. Struct.

and Multidiscip. Optim., 28:195–205, 2004.

[16] Y. M. Ermoliev, V. I. Norkin and R. J.-B. Wets. The minimization of semicontinuous functions:

Molliﬁer subgradients. SIAM J. Control Optim., 33:149–167, 1995.

[17] A. A. Goldstein. Optimization of Lipschitz continuous functions. Math. Program., 13(1):14–22,

1977.

[18] A. M. Gupal. A method for the minimization of almost diﬀerentiable functions. Kibernetika,

pages 114–116, 1977. In Russian; English translation in: Cybernetics, 13(2):220–222, 1977.

[19] W. Hare and M. Macklem. Derivative-free optimization methods for ﬁnite minimax problems.

To appear in Optim. Methods and Softw., 2011.

[20] W. L. Hare. Using derivative free optimization for constrained parameter selection in a home

and community care forecasting model. In International Perspectives on Operations Research

and Health Care, Proceedings of the 34th Meeting of the EURO Working Group on Operational

Research Applied to Health Sciences, pages 61–73, 2010.


[21] J. Imae, N. Ohtsuki, Y. Kikuchi, and T. Kobayashi. A minimax control design for nonlinear

systems based on genetic programming: Jung’s collective unconscious approach. Intern. J. Syst.

Sci., 35:775–785, October 2004.

[22] C. T. Kelley. Detection and remediation of stagnation in the Nelder-Mead algorithm using a

suﬃcient decrease condition. SIAM J. Optim., 10(1):43–55, 1999.

[23] C. T. Kelley. Iterative Methods for Optimization, volume 18 of Frontiers in Applied Mathematics.

SIAM, Philadelphia, 1999.

[24] K. C. Kiwiel. A nonderivative version of the gradient sampling algorithm for nonsmooth nonconvex optimization. SIAM J. Optim., 20(4):1983–1994, 2010.

[25] S. Le Digabel. Algorithm 909: NOMAD: Nonlinear Optimization with the MADS algorithm.

ACM T Math Software, 37(4):1–15, 2011.

[26] G. Liuzzi, S. Lucidi, and M. Sciandrone. A derivative-free algorithm for linearly constrained ﬁnite

minimax problems. SIAM J. Optim., 16(4):1054–1075, 2006.

[27] L. Lukšan and J. Vlček. Test Problems for Nonsmooth Unconstrained and Linearly Constrained

Optimization. Technical report, February 2000.

[28] K. Madsen. Minimax solution of non-linear equations without calculating derivatives, volume 3

of Math. Programming Stud., pages 110–126, 1975.

[29] A. L. Marsden, J. A. Feinstein, and C. A. Taylor. A computational framework for derivative-free optimization of cardiovascular geometries. Comput. Methods Appl. Mech. Engrg., 197(21-24):1890–1905, 2008.

[30] J. Nocedal and S. J. Wright. Numerical Optimization. Springer Series in Operations Research.

Springer-Verlag, New York, 1999.

[31] G. Di Pillo, L. Grippo, and S. Lucidi. A smooth method for the ﬁnite minimax problem. Math.

Program., 60(2, Ser. A):187–214, 1993.

[32] E. Polak. On the mathematical foundations of nondiﬀerentiable optimization in engineering

design. SIAM Rev., 29(1):21–89, 1987.

[33] E. Polak, J. O. Royset, and R. S. Womersley. Algorithms with adaptive smoothing for ﬁnite

minimax problems. J. Optim. Theory Appl., 119(3):459–484, 2003.

[34] R. A. Polyak. Smooth optimization methods for minimax problems. SIAM J. Control Optim.,

26(6):1274–1286, 1988.

[35] R. T. Rockafellar and R. J.-B. Wets. Variational Analysis, volume 317 of Grundlehren der

Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer-

Verlag, Berlin, 1998.

[36] R. Stafford. Random Points in an n-Dimensional Hypersphere. MATLAB File Exchange, http://www.mathworks.com/matlabcentral/fileexchange/9443-random-points-in-an-n-dimensional-hypersphere, 2005.

[37] P. Wolfe. A method of conjugate subgradients for minimizing nondiﬀerentiable functions, volume 3

in Math. Programming Stud., pages 145–173, 1975.


[38] M. Wschebor. Smoothed analysis of κ(A). J. Complexity, 20(1):97–107, 2004.

[39] S. Xu. Smoothing method for minimax problems. Comput. Optim. Appl., 20(3):267–279, 2001.

7 Appendix A

Table 2: Test Set Summary: problem name and number, problem dimension (N), and number of sub-functions (M); * denotes an absolute value operation (doubled number of sub-functions).

Prob. #  Name           N   M
2.1      CB2            2   3
2.2      WF             2   3
2.3      SPIRAL         2   2
2.4      EVD52          3   6
2.5      Rosen-Suzuki   4   4
2.6      Polak 6        4   4
2.7      PCB3           3   42*
2.8      Bard           3   30*
2.9      Kow.-Osborne   4   22*
2.10     Davidon 2      4   40*
2.11     OET 5          4   42*
2.12     OET 6          4   42*
2.13     GAMMA          4   122*
2.14     EXP            5   21
2.15     PBC1           5   60*
2.16     EVD61          6   102*
2.18     Filter         9   82*
2.19     Wong 1         7   5
2.20     Wong 2         10  9
2.21     Wong 3         20  18
2.22     Polak 2        10  2
2.23     Polak 3        11  10
2.24     Watson         20  62*
2.25     Osborne 2      11  130*


Table 3: Average accuracy for 25 trials obtained by the AGS and RAGS algorithms for the simplex

gradient.

AGS RAGS

Regular Stop Robust Stop Regular Stop Robust Stop

Prob. f-evals Acc. f-evals Acc. f-evals Acc. f-evals Acc.

2.1 3018 2.082 2855 2.120 2580 9.470 202 6.759

2.2 3136 4.565 3112 4.987 4179 13.211 418 6.343

2.3 3085 0.002 3087 0.002 3090 0.002 3096 0.002

2.4 3254 2.189 3265 2.238 2986 11.559 367 7.570

2.5 3391 1.379 3138 1.351 3576 1.471 539 1.471

2.6 3260 1.236 3341 1.228 4258 1.338 859 1.338

2.7 2949 1.408 2757 1.367 4155 9.939 4190 7.230

2.8 4959 0.879 4492 0.913 3634 9.941 3435 7.655

2.9 2806 0.732 3303 0.581 16000 8.049 13681 3.975

2.10 2978 3.343 2993 3.342 3567 3.459 1924 3.459

2.11 3303 2.554 3453 2.559 35367 6.099 11725 5.063

2.12 2721 1.866 3117 1.871 15052 2.882 8818 2.660

2.13 2580 1.073 2706 0.874 43618 1.952 141 1.679

2.14 3254 1.585 3289 1.086 7713 2.696 4221 1.476

2.15 3917 0.262 5554 0.259 31030 0.286 12796 0.277

2.16 3711 2.182 4500 2.077 20331 3.242 11254 2.178

2.18 10468 0.000 10338 0.000 76355 17.717 30972 17.138

2.19 3397 0.376 3327 0.351 5403 7.105 1767 7.169

2.20 4535 1.624 4271 1.624 8757 8.435 7160 6.073

2.21 8624 2.031 8380 2.157 15225 1.334 11752 1.393

2.22 1563 0.958 1408 1.042 64116 3.049 1256 2.978

2.23 7054 2.557 10392 2.744 6092 6.117 970 6.178

2.24 4570 0.301 7857 0.298 93032 0.447 21204 0.328

2.25 3427 0.339 4197 0.340 98505 0.342 343 0.342


Table 4: Average accuracy for 25 trials obtained by the AGS and RAGS algorithms for the centered

simplex gradient.

AGS RAGS

Regular Stop Robust Stop Regular Stop Robust Stop

Prob. f-evals Acc. f-evals Acc. f-evals Acc. f-evals Acc.

2.1 3769 2.054 3573 2.051 2351 9.469 221 7.125

2.2 3705 6.888 1284 5.154 4151 9.589 330 5.594

2.3 5410 0.003 5352 0.003 5332 0.003 5353 0.003

2.4 4059 2.520 4154 2.456 4347 11.578 296 6.834

2.5 3949 1.422 3813 1.437 4112 1.471 452 1.471

2.6 3756 1.302 3880 1.309 4815 1.338 879 1.338

2.7 4227 1.435 4187 1.373 5285 9.950 7164 6.372

2.8 6928 0.988 6933 1.003 4116 9.939 3754 7.775

2.9 3301 0.933 3743 0.949 17944 8.072 13014 2.436

2.10 3447 3.343 3424 3.342 4744 3.459 427 3.459

2.11 3593 2.768 4082 2.785 47362 6.344 11886 5.115

2.12 3321 1.892 3406 1.876 15550 2.843 10726 2.651

2.13 3067 1.355 3508 1.216 36969 1.873 519 1.643

2.14 3967 1.771 6110 1.152 9757 2.692 7284 1.510

2.15 4646 0.272 6014 0.273 23947 0.280 15692 0.277

2.16 4518 2.223 6911 2.074 22225 2.628 17001 2.215

2.18 30492 16.931 14671 16.634 125859 17.804 20815 17.293

2.19 4473 0.551 4484 0.591 8561 7.113 1697 5.851

2.20 5462 1.615 5503 1.599 8908 9.011 7846 6.042

2.21 11629 1.887 11724 1.661 18957 1.304 17067 1.339

2.22 1877 1.166 1604 1.160 1453 3.139 2066 3.644

2.23 3807 2.150 7850 3.586 15625 6.117 1020 6.230

2.24 7198 0.302 12745 0.301 115787 0.436 61652 0.329

2.25 4749 0.339 4896 0.341 256508 0.342 568 0.342


Table 5: Average accuracy for 25 trials obtained by the AGS and RAGS algorithms for the Gupal estimate of the gradient of the Steklov averaged function.

AGS RAGS

Regular Stop Robust Stop Regular Stop Robust Stop

Prob. f-evals Acc. f-evals Acc. f-evals Acc. f-evals Acc.

2.1 2775 2.448 2542 2.124 13126 3.896 89 2.708

2.2 3729 3.267 2221 2.813 5029 15.904 1776 7.228

2.3 2243 0.000 2262 0.000 2276 0.000 2255 0.000

2.4 2985 2.771 2841 2.892 3475 3.449 2362 3.738

2.5 3493 1.213 3529 1.196 3447 1.211 338 1.200

2.6 3144 0.187 3245 0.188 3018 0.162 3059 0.162

2.7 2631 1.368 3129 1.248 2476 1.048 2208 1.047

2.8 2711 1.125 3898 0.893 2231 0.514 5846 0.515

2.9 3102 0.727 3011 0.600 2955 0.937 3248 0.863

2.10 3075 3.241 2927 3.272 3100 0.000 3050 0.000

2.11 2947 1.527 3307 1.528 3003 1.560 2905 1.560

2.12 3095 1.099 7179 0.000 2670 0.788 7803 0.000

2.13 2755 0.710 1485 0.715 2517 0.231 6871 0.227

2.14 2965 0.574 3070 0.427 2860 0.708 4571 0.668

2.15 2658 0.010 2386 0.017 3355 0.050 3210 0.031

2.16 3431 0.457 3256 0.459 2861 0.199 2620 0.119

2.18 3936 16.345 5814 0.000 3950 16.451 6598 4.542

2.19 3337 0.014 3270 0.011 3488 0.970 3376 0.957

2.20 4604 0.835 4434 0.808 9459 1.360 10560 1.359

2.21 5468 0.000 5418 0.000 6632 0.641 6159 0.635

2.22 21 0.000 21 0.000 21 0.000 21 0.000

2.23 5436 1.814 5176 1.877 1.00E+06 2.354 954 2.415

2.24 7426 0.280 171 0.017 7927 0.043 6389 0.283

2.25 4519 0.286 4814 0.300 3760 0.017 3209 0.023
