Mathematical Programming manuscript No.

(will be inserted by the editor)

Yu. Nesterov, B. Polyak

Cubic regularization of Newton method

and its global performance

Received: June, 2004 / . . . / Final version: January 2006

Abstract. In this paper, we provide a theoretical analysis of a cubic regularization of the Newton method as applied to the unconstrained minimization problem. For this scheme, we prove general local convergence results. However, the main contribution of the paper is related to global worst-case complexity bounds for different problem classes, including some nonconvex cases. It is shown that the search direction can be computed by a standard linear algebra technique.

Keywords: General nonlinear optimization, unconstrained optimization, Newton method, trust-region methods, global complexity bounds, global rate of convergence.

1. Introduction

Motivation. Starting from the seminal papers by Bennet [1] and Kantorovich [6], the Newton method has turned into an important tool for numerous applied problems. In the simplest case of unconstrained minimization of a multivariate function,

$$\min_{x \in \mathbb{R}^n} f(x),$$

the standard Newton scheme looks as follows:

$$x_{k+1} = x_k - [f''(x_k)]^{-1} f'(x_k).$$

Despite its very natural motivation, this scheme has several hidden drawbacks. First of all, it may happen that at the current test point the Hessian is degenerate; in this case the method is not well-defined. Secondly, it may happen that this scheme diverges or converges to a saddle point or even to a point of local maximum. In the last fifty years the number of different suggestions for improving the

Yurii Nesterov: Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 34 voie du Roman Pays, 1348 Louvain-la-Neuve, Belgium; e-mail: nesterov@core.ucl.ac.be.

B.T.Polyak: Institute of Control Science, Profsojuznaya 65, Moscow 117997, Russia; e-mail: boris@ipu.rssi.ru.

The research results presented in this paper have been supported by a grant “Action de recherche concertée ARC 04/09-315” from the “Direction de la recherche scientifique - Communauté française de Belgique”. The scientific responsibility rests with the authors.

Mathematics Subject Classification (1991): 49M15, 49M37, 58C15, 90C25, 90C30.

scheme was extremely large. The reader can consult a 1000-item bibliography in the recent exhaustive covering of the field [2]. However, most of them combine in different ways the following ideas.

– Levenberg-Marquardt regularization. As suggested in [7,8], if $f''(x)$ is not positive definite, let us regularize it with a unit matrix. Namely, use $-G^{-1} f'(x)$ with $G = f''(x) + \gamma I \succ 0$ in order to perform the step:

$$x_{k+1} = x_k - [f''(x_k) + \gamma I]^{-1} f'(x_k).$$

This strategy sometimes is considered as a way to mix Newton’s method with the gradient method.

– Line search. Since we are interested in a minimization, it looks reasonable to allow a certain step size $h_k > 0$:

$$x_{k+1} = x_k - h_k [f''(x_k)]^{-1} f'(x_k)$$

(this is a damped Newton method [12]). This can help to form a monotone sequence of function values: $f(x_{k+1}) \le f(x_k)$.

– Trust-region approach [5,4,3,2]. In accordance with this approach, at point $x_k$ we have to form its neighborhood, where the second-order approximation of the function is reliable. This is a trust region $\Delta(x_k)$, for instance $\Delta(x_k) = \{x : \|x - x_k\| \le \epsilon\}$ with some $\epsilon > 0$. Then the next point $x_{k+1}$ is chosen as a solution to the following auxiliary problem:

$$\min_{x \in \Delta(x_k)} \left[ \langle f'(x_k), x - x_k \rangle + \tfrac{1}{2} \langle f''(x_k)(x - x_k), x - x_k \rangle \right].$$

Note that for $\Delta(x_k) \equiv \mathbb{R}^n$, this is exactly the standard Newton step.
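The first two ideas above can be illustrated by a minimal Python sketch. It is only an illustration under stated assumptions: the function names (`newton_step`, `levenberg_marquardt_step`, `damped_newton_step`), the toy quadratic objective and the values of the regularization parameter and step size are hypothetical choices made here, not taken from the paper.

```python
import numpy as np

def newton_step(grad, hess, x):
    # Standard Newton step: x - [f''(x)]^{-1} f'(x).
    return x - np.linalg.solve(hess(x), grad(x))

def levenberg_marquardt_step(grad, hess, x, gamma):
    # Regularized step: x - [f''(x) + gamma*I]^{-1} f'(x).
    n = x.shape[0]
    return x - np.linalg.solve(hess(x) + gamma * np.eye(n), grad(x))

def damped_newton_step(grad, hess, x, h):
    # Damped Newton step with step size h > 0.
    return x - h * np.linalg.solve(hess(x), grad(x))

# Toy objective (an assumption for illustration): f(x) = 0.5*<Ax, x> - <b, x>.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda x: A @ x - b
hess = lambda x: A

x0 = np.array([5.0, -4.0])
x_star = np.linalg.solve(A, b)          # exact minimizer of the quadratic
x_newton = newton_step(grad, hess, x0)  # reaches x_star in one step
x_lm = levenberg_marquardt_step(grad, hess, x0, gamma=1.0)
x_damped = damped_newton_step(grad, hess, x0, h=0.5)
```

On a strictly convex quadratic the pure Newton step lands on the minimizer, while the regularized and damped steps only move part of the way; this is the trade-off the list describes.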

We would encourage the reader to look in [2] for different combinations and implementations of the above ideas. Here we only mention that despite a huge variety of results, there still exist open theoretical questions in this field. And, in our opinion, the most important group of questions is related to the worst-case guarantees for the global behavior of second-order schemes.

Indeed, as far as we know, up to now there are very few results on the global performance of the Newton method. One example is an easy class of smooth strongly convex functions where we can get a rate of convergence for a damped Newton method [11,10]. However, the number of iterations required is hard to compare with that for the gradient method. In fact, up to now the relations between the gradient method and the Newton method have not been clarified. Of course, the requirements for the applicability of these methods are different (e.g. smoothness assumptions are stronger for Newton’s method), as is the computational burden (necessity to compute second derivatives, store matrices and solve linear equations at each iteration of Newton’s method). However, there exist numerous problems where computation of the Hessian is not much harder than computation of the gradient, and the iteration costs of both methods are comparable. Quite often, one reads the opinion that in such situations the Newton method is good at the final stage of the minimization process, but it is better to use the gradient method for the first iterations. Here we dispute this position: we show that theoretically, a properly chosen Newton-type scheme outperforms the gradient scheme (taking into account only the number of iterations) in all situations under consideration.

In this paper we propose a modification of the Newton method, which is constructed in a similar way to the well-known gradient mapping [9]. Assume that function $f$ has a Lipschitz continuous gradient:

$$\|f'(x) - f'(y)\| \le D \|y - x\|, \quad \forall x, y \in \mathbb{R}^n.$$

Suppose we need to solve the problem

$$\min_{x \in Q} f(x),$$

where $Q$ is a closed convex set. Then we can choose the next point $x_{k+1}$ in our sequence as a solution of the following auxiliary problem:

$$\min_{y \in Q} \xi_{1,x_k}(y), \quad \xi_{1,x_k}(y) = f(x_k) + \langle f'(x_k), y - x_k \rangle + \tfrac{1}{2} D \|y - x_k\|^2. \tag{1.1}$$

Convergence of this scheme follows from the fact that $\xi_{1,x_k}(y)$ is an upper first-order approximation of the objective function, that is $\xi_{1,x_k}(y) \ge f(y)$ $\forall y \in \mathbb{R}^n$ (see, for example, [10], Section 2.2.4, for details). If $Q \equiv \mathbb{R}^n$, then the rule (1.1) results in a usual gradient scheme:

$$x_{k+1} = x_k - \tfrac{1}{D} f'(x_k).$$
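For completeness, the step from (1.1) to the gradient scheme can be written out. This is a short worked derivation added for illustration; it uses only the definition of $\xi_{1,x_k}$ with $Q \equiv \mathbb{R}^n$ and the upper-bound property stated above.

```latex
\nabla_y \xi_{1,x_k}(y) = f'(x_k) + D\,(y - x_k) = 0
\quad\Longrightarrow\quad
x_{k+1} = x_k - \tfrac{1}{D} f'(x_k),
\qquad
f(x_{k+1}) \le \xi_{1,x_k}(x_{k+1}) = f(x_k) - \tfrac{1}{2D}\,\|f'(x_k)\|^2 .
```

The last relation shows that each gradient step decreases the objective by at least $\frac{1}{2D}\|f'(x_k)\|^2$.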

Note that we can do a similar thing with the second-order approximation. Indeed, assume that the Hessian of our objective function is Lipschitz continuous:

$$\|f''(x) - f''(y)\| \le L \|x - y\|, \quad \forall x, y \in \mathbb{R}^n.$$

Then, it is easy to see that the auxiliary function

$$\xi_{2,x}(y) = f(x) + \langle f'(x), y - x \rangle + \tfrac{1}{2} \langle f''(x)(y - x), y - x \rangle + \tfrac{L}{6} \|y - x\|^3$$

will be an upper second-order approximation for our objective function:

$$f(y) \le \xi_{2,x}(y) \quad \forall y \in \mathbb{R}^n.$$

Thus, we can try to find the next point in our second-order scheme from the following auxiliary minimization problem:

$$x_{k+1} \in \operatorname{Arg}\min_{y} \xi_{2,x_k}(y) \tag{1.2}$$

(here Argmin refers to a global minimizer). This is exactly the approach we analyze in this paper; we call it cubic regularization of Newton’s method. Note that problem (1.2) is non-convex and it can have local minima. However, our approach is implementable since this problem is equivalent to minimizing an explicitly written convex function of one variable.

Contents. In Section 2 we introduce cubic regularization and present its main properties. In Section 3 we analyze the general convergence of the process. We prove that under very mild assumptions all limit points of the process satisfy the necessary second-order optimality condition. In this general setting we get a rate of convergence for the norms of the gradients, which is better than the rate ensured by the gradient scheme. We also prove the local quadratic convergence of the process. In Section 4 we give the global complexity results of our scheme for different problem classes. We show that in all situations the global rate of convergence is surprisingly fast (like $O(1/k^2)$ for star-convex functions, where $k$ is the iteration counter). Moreover, under rather weak non-degeneracy assumptions, we have local super-linear convergence either of the order $\frac{4}{3}$ or $\frac{3}{2}$. We show that this happens even if the Hessian is degenerate at the solution set. In Section 5 we show how to compute a solution to the cubic regularization problem and discuss some efficient strategies for estimating the Lipschitz constant for the Hessian. We conclude the paper by a short discussion presented in Section 6.

Notation. In what follows we denote by $\langle \cdot, \cdot \rangle$ the standard inner product in $\mathbb{R}^n$:

$$\langle x, y \rangle = \sum_{i=1}^{n} x^{(i)} y^{(i)}, \quad x, y \in \mathbb{R}^n,$$

and by $\|x\|$ the standard Euclidean norm:

$$\|x\| = \langle x, x \rangle^{1/2}.$$

For a symmetric $n \times n$ matrix $H$, its spectrum is denoted by $\{\lambda_i(H)\}_{i=1}^{n}$. We assume that the eigenvalues are numbered in decreasing order:

$$\lambda_1(H) \ge \dots \ge \lambda_n(H).$$

Hence, we write $H \succeq 0$ if and only if $\lambda_n(H) \ge 0$. In what follows, for a matrix $A$ we use the standard spectral matrix norm:

$$\|A\| = \lambda_1(AA^T)^{1/2}.$$

Finally, $I$ denotes a unit $n \times n$ matrix.

Acknowledgement. We are very thankful to anonymous referees for their numerous comments on the initial version of the paper. Indeed, it may be too ambitious to derive from our purely theoretical results any conclusion on the practical efficiency of corresponding algorithmic implementations. However, the authors do believe that the developed theory could pave a way for future progress in computational practice.

2. Cubic regularization of quadratic approximation

Let $\mathcal{F} \subseteq \mathbb{R}^n$ be a closed convex set with non-empty interior. Consider a twice differentiable function $f(x)$, $x \in \mathcal{F}$. Let $x_0 \in \operatorname{int} \mathcal{F}$ be a starting point of our iterative schemes. We assume that the set $\mathcal{F}$ is large enough: it contains at least the level set

$$\mathcal{L}(f(x_0)) \equiv \{x \in \mathbb{R}^n : f(x) \le f(x_0)\}$$

in its interior. Moreover, in this paper we always assume the following.

Assumption 1 The Hessian of function $f$ is Lipschitz continuous on $\mathcal{F}$:

$$\|f''(x) - f''(y)\| \le L \|x - y\|, \quad \forall x, y \in \mathcal{F}, \tag{2.1}$$

for some $L > 0$.

For the sake of completeness, let us present the following trivial consequences

of our assumption (compare with [12, Section 3]).

Lemma 1. For any $x$ and $y$ from $\mathcal{F}$ we have

$$\|f'(y) - f'(x) - f''(x)(y - x)\| \le \tfrac{1}{2} L \|y - x\|^2, \tag{2.2}$$

$$|f(y) - f(x) - \langle f'(x), y - x \rangle - \tfrac{1}{2} \langle f''(x)(y - x), y - x \rangle| \le \tfrac{L}{6} \|y - x\|^3. \tag{2.3}$$

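Proof. Indeed,

$$\|f'(y) - f'(x) - f''(x)(y - x)\| = \left\| \int_0^1 [f''(x + \tau(y - x)) - f''(x)](y - x)\, d\tau \right\| \le L \|y - x\|^2 \int_0^1 \tau\, d\tau = \tfrac{1}{2} L \|y - x\|^2.$$

Therefore,

$$|f(y) - f(x) - \langle f'(x), y - x \rangle - \tfrac{1}{2} \langle f''(x)(y - x), y - x \rangle|$$

$$= \left| \int_0^1 \langle f'(x + \lambda(y - x)) - f'(x) - \lambda f''(x)(y - x), y - x \rangle\, d\lambda \right| \le \tfrac{1}{2} L \|y - x\|^3 \int_0^1 \lambda^2\, d\lambda = \tfrac{L}{6} \|y - x\|^3.$$

□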

Let $M$ be a positive parameter. Define a modified Newton step using the following cubic regularization of quadratic approximation of function $f(x)$:

$$T_M(x) \in \operatorname{Arg}\min_{y} \left[ \langle f'(x), y - x \rangle + \tfrac{1}{2} \langle f''(x)(y - x), y - x \rangle + \tfrac{M}{6} \|y - x\|^3 \right], \tag{2.4}$$

where "Arg" indicates that $T_M(x)$ is chosen from the set of global minima of the corresponding minimization problem. We postpone discussion of the complexity of finding this point up to Section 5.1.

Note that point $T_M(x)$ satisfies the following system of nonlinear equations:

$$f'(x) + f''(x)(y - x) + \tfrac{1}{2} M \|y - x\| \cdot (y - x) = 0. \tag{2.5}$$
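To make (2.4) and (2.5) concrete, here is a minimal Python sketch of one way to compute the step $h = T_M(x) - x$ numerically. It is not the procedure of Section 5.1 (which is outside this excerpt); it is an assumption-laden illustration that uses an eigendecomposition of $f''(x)$ and bisection on the scalar $r = \|h\|$, relying on (2.5) together with the condition $f''(x) + \frac{1}{2} M r I \succeq 0$ that the source states as Proposition 1. The function name `cubic_step`, the tolerances and the handling of the degenerate case are hypothetical choices.

```python
import numpy as np

def cubic_step(g, H, M, iters=200):
    """Sketch: return (h, r) with h minimizing <g,h> + 0.5<Hh,h> + (M/6)||h||^3."""
    lam, Q = np.linalg.eigh(H)      # eigenvalues in ascending order
    gt = Q.T @ g                    # gradient in the eigenbasis
    # r must satisfy H + (M r / 2) I >= 0, i.e. r >= -2*lam_min/M.
    lo = max(0.0, -2.0 * lam[0] / M)
    # Upper bound from r <= ||g|| / (lam_min + M r / 2).
    hi = (-lam[0] + np.sqrt(lam[0] ** 2 + 2.0 * M * np.linalg.norm(g))) / M

    def coords(r):
        # Solve (Lambda + (M r / 2) I) s = -gt, skipping numerically zero pivots.
        d = lam + 0.5 * M * r
        safe = d > 1e-14 * (1.0 + np.abs(lam).max())
        return np.where(safe, -gt / np.where(safe, d, 1.0), 0.0)

    # ||s(r)|| - r is decreasing in r, so bisection finds the root of (2.5).
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(coords(mid)) > mid:
            lo = mid
        else:
            hi = mid
    r = hi
    s = coords(r)
    # Degenerate case: pad along the eigenvector of the smallest eigenvalue.
    gap = r * r - s @ s
    if gap > 1e-8 * r * r:
        s[0] += np.sqrt(gap)
    return Q @ s, r

# Example with an indefinite Hessian (illustrative data).
H = np.array([[1.0, 0.0], [0.0, -2.0]])
g = np.array([1.0, 1.0])
M = 1.0
h, r = cubic_step(g, H, M)
residual = g + H @ h + 0.5 * M * r * h   # left-hand side of (2.5)
```

The returned pair can be checked directly: `residual` should vanish up to rounding, `r` should equal the norm of `h`, and the matrix `H + 0.5 * M * r * I` should be positive semidefinite.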

Denote $r_M(x) = \|x - T_M(x)\|$. Taking in (2.5) $y = T_M(x)$ and multiplying it by $T_M(x) - x$ we get equation

$$\langle f'(x), T_M(x) - x \rangle + \langle f''(x)(T_M(x) - x), T_M(x) - x \rangle + \tfrac{1}{2} M r_M^3(x) = 0. \tag{2.6}$$

In our analysis of the process (3.3), we need the following fact.

Proposition 1. For any $x \in \mathcal{F}$ we have

$$f''(x) + \tfrac{1}{2} M r_M(x) I \succeq 0. \tag{2.7}$$

This statement follows from Theorem 10, which will be proved later in Section 5.1. Now let us present the main properties of the mapping $T_M(\cdot)$.

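Lemma 2. For any $x \in \mathcal{F}$, $f(x) \le f(x_0)$, we have the following relation:

$$\langle f'(x), x - T_M(x) \rangle \ge 0. \tag{2.8}$$

If $M \ge \frac{2}{3} L$ and $x \in \operatorname{int} \mathcal{F}$, then $T_M(x) \in \mathcal{L}(f(x)) \subseteq \mathcal{F}$.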

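Proof. Indeed, multiplying (2.7) by $x - T_M(x)$ twice, we get

$$\langle f''(x)(T_M(x) - x), T_M(x) - x \rangle + \tfrac{1}{2} M r_M^3(x) \ge 0.$$

Therefore (2.8) follows from (2.6).

Further, let $M \ge \frac{2}{3} L$. Assume that $T_M(x) \notin \mathcal{F}$. Then $r_M(x) > 0$. Consider the following points:

$$y_\alpha = x + \alpha (T_M(x) - x), \quad \alpha \in [0, 1].$$

Since $y_0 \in \operatorname{int} \mathcal{F}$, the value

$$\bar\alpha : \; y_{\bar\alpha} \in \partial \mathcal{F}$$

is well defined. In accordance with our assumption, $\bar\alpha < 1$ and $y_\alpha \in \mathcal{F}$ for all $\alpha \in [0, \bar\alpha]$. Therefore, using (2.3), relation (2.6) and inequality (2.8), we get

$$f(y_\alpha) \le f(x) + \langle f'(x), y_\alpha - x \rangle + \tfrac{1}{2} \langle f''(x)(y_\alpha - x), y_\alpha - x \rangle + \tfrac{\alpha^3 L}{6} r_M^3(x)$$

$$\le f(x) + \langle f'(x), y_\alpha - x \rangle + \tfrac{1}{2} \langle f''(x)(y_\alpha - x), y_\alpha - x \rangle + \tfrac{\alpha^3 M}{4} r_M^3(x)$$

$$= f(x) + \left(\alpha - \tfrac{\alpha^2}{2}\right) \langle f'(x), T_M(x) - x \rangle - \tfrac{\alpha^2 (1 - \alpha)}{4} M r_M^3(x)$$

$$\le f(x) - \tfrac{\alpha^2 (1 - \alpha)}{4} M r_M^3(x).$$

Thus, $f(y_{\bar\alpha}) < f(x)$. Therefore $y_{\bar\alpha} \in \operatorname{int} \mathcal{L}(f(x)) \subseteq \operatorname{int} \mathcal{F}$. That is a contradiction. Hence, $T_M(x) \in \mathcal{F}$ and using the same arguments we prove that $f(T_M(x)) \le f(x)$. □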

Lemma 3. If $T_M(x) \in \mathcal{F}$, then

$$\|f'(T_M(x))\| \le \tfrac{1}{2} (L + M) r_M^2(x). \tag{2.9}$$

Proof. From equation (2.5), we get

$$\|f'(x) + f''(x)(T_M(x) - x)\| = \tfrac{1}{2} M r_M^2(x).$$

On the other hand, in view of (2.2), we have

$$\|f'(T_M(x)) - f'(x) - f''(x)(T_M(x) - x)\| \le \tfrac{1}{2} L r_M^2(x).$$

Combining these two relations, we obtain inequality (2.9). □

Define

$$\bar f_M(x) = \min_{y} \left[ f(x) + \langle f'(x), y - x \rangle + \tfrac{1}{2} \langle f''(x)(y - x), y - x \rangle + \tfrac{M}{6} \|y - x\|^3 \right].$$

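Lemma 4. For any $x \in \mathcal{F}$ we have

$$\bar f_M(x) \le \min_{y \in \mathcal{F}} \left[ f(y) + \tfrac{L + M}{6} \|y - x\|^3 \right], \tag{2.10}$$

$$f(x) - \bar f_M(x) \ge \tfrac{M}{12} r_M^3(x). \tag{2.11}$$

Moreover, if $M \ge L$, then $T_M(x) \in \mathcal{F}$ and

$$f(T_M(x)) \le \bar f_M(x). \tag{2.12}$$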

Proof. Indeed, using the lower bound in (2.3), for any $y \in \mathcal{F}$ we have

$$f(x) + \langle f'(x), y - x \rangle + \tfrac{1}{2} \langle f''(x)(y - x), y - x \rangle \le f(y) + \tfrac{L}{6} \|y - x\|^3,$$

and inequality (2.10) follows from the definition of $\bar f_M(x)$.

Further, in view of the definition of point $T \stackrel{\mathrm{def}}{=} T_M(x)$, relation (2.6) and inequality (2.8), we have

$$f(x) - \bar f_M(x) = \langle f'(x), x - T \rangle - \tfrac{1}{2} \langle f''(x)(T - x), T - x \rangle - \tfrac{M}{6} r_M^3(x)$$

$$= \tfrac{1}{2} \langle f'(x), x - T \rangle + \tfrac{M}{12} r_M^3(x) \ge \tfrac{M}{12} r_M^3(x).$$

Finally, if $M \ge L$, then $T_M(x) \in \mathcal{F}$ in view of Lemma 2. Therefore, we get inequality (2.12) from the upper bound in (2.3). □

3. General convergence results

In this paper the main problem of interest is:

$$\min_{x \in \mathbb{R}^n} f(x), \tag{3.1}$$

where the objective function $f(x)$ satisfies Assumption 1. Recall that the necessary conditions for a point $x^*$ to be a local solution to problem (3.1) are as follows:

$$f'(x^*) = 0, \quad f''(x^*) \succeq 0. \tag{3.2}$$

Therefore, for an arbitrary $x \in \mathcal{F}$ we can introduce the following measure of local optimality:

$$\mu_M(x) = \max\left\{ \sqrt{\tfrac{2}{L + M} \|f'(x)\|}, \; -\tfrac{2}{2L + M} \lambda_n(f''(x)) \right\},$$

where $M$ is a positive parameter. It is clear that for any $x$ from $\mathcal{F}$ the measure $\mu_M(x)$ is non-negative and it vanishes only at the points satisfying conditions (3.2). The analytical form of this measure can be justified by the following result.

Lemma 5. For any $x \in \mathcal{F}$ we have $\mu_M(T_M(x)) \le r_M(x)$.

Proof. The proof follows immediately from inequality (2.9) and relation (2.7) since

$$f''(T_M(x)) \succeq f''(x) - L r_M(x) I \succeq -\left(\tfrac{1}{2} M + L\right) r_M(x) I.$$

□

Let $L_0 \in (0, L]$ be a positive parameter. Consider the following regularized Newton scheme.

Cubic regularization of Newton method

Initialization: Choose $x_0 \in \mathbb{R}^n$.

Iteration $k$, ($k \ge 0$):

1. Find $M_k \in [L_0, 2L]$ such that $f(T_{M_k}(x_k)) \le \bar f_{M_k}(x_k)$.

2. Set $x_{k+1} = T_{M_k}(x_k)$.

(3.3)
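Scheme (3.3) can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: the inner solver `cubic_step` is a bisection-based stand-in for the procedure of Section 5.1, and the way $M_k$ is searched (doubling until the test in Step 1 holds, then halving down to a floor `M0` playing the role of $L_0$) is an assumption, since the search procedure of Section 5.2 is outside this excerpt. All function and parameter names are hypothetical.

```python
import numpy as np

def cubic_step(g, H, M, iters=200):
    # Stand-in solver: h minimizing <g,h> + 0.5<Hh,h> + (M/6)||h||^3.
    lam, Q = np.linalg.eigh(H)
    gt = Q.T @ g
    lo = max(0.0, -2.0 * lam[0] / M)
    hi = (-lam[0] + np.sqrt(lam[0] ** 2 + 2.0 * M * np.linalg.norm(g))) / M

    def coords(r):
        d = lam + 0.5 * M * r
        safe = d > 1e-14 * (1.0 + np.abs(lam).max())
        return np.where(safe, -gt / np.where(safe, d, 1.0), 0.0)

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(coords(mid)) > mid:
            lo = mid
        else:
            hi = mid
    s = coords(hi)
    gap = hi * hi - s @ s
    if gap > 1e-8 * hi * hi:        # degenerate case: pad along eigenvector
        s[0] += np.sqrt(gap)
    return Q @ s

def cubic_newton(f, grad, hess, x0, M0=1.0, n_iter=100, max_doublings=60):
    x = np.asarray(x0, dtype=float)
    M = M0
    history = [f(x)]
    for _ in range(n_iter):
        fx, g, H = f(x), grad(x), hess(x)
        for _ in range(max_doublings):
            h = cubic_step(g, H, M)
            # Minimal value of the cubic model, i.e. \bar f_M(x).
            model_min = (fx + g @ h + 0.5 * h @ H @ h
                         + M / 6.0 * np.linalg.norm(h) ** 3)
            if f(x + h) <= model_min + 1e-12 * (1.0 + abs(fx)):
                break               # Step 1 of (3.3) is satisfied
            M *= 2.0
        x = x + h                   # Step 2 of (3.3)
        history.append(f(x))
        M = max(M0, 0.5 * M)        # try a smaller M next time
    return x, history

# Illustrative objective: f(x, y) = x^2 y^2 + x^2 + y^2.
f = lambda z: z[0] ** 2 * z[1] ** 2 + z[0] ** 2 + z[1] ** 2
grad = lambda z: np.array([2 * z[0] * z[1] ** 2 + 2 * z[0],
                           2 * z[0] ** 2 * z[1] + 2 * z[1]])
hess = lambda z: np.array([[2 * z[1] ** 2 + 2, 4 * z[0] * z[1]],
                           [4 * z[0] * z[1], 2 * z[0] ** 2 + 2]])
x_final, history = cubic_newton(f, grad, hess, [1.0, 2.0])
```

The starting point is chosen so that the Hessian is indefinite there, a situation in which the plain Newton step is unreliable while the cubic model still has a well-defined global minimizer.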

Since $\bar f_M(x) \le f(x)$, this process is monotone:

$$f(x_{k+1}) \le f(x_k).$$

If the constant $L$ is known, we can take $M_k \equiv L$ in Step 1 of this scheme. In the opposite case, it is possible to apply a simple search procedure; we will discuss its complexity later in Section 5.2. Now let us make the following simple observation.

Theorem 1. Let the sequence $\{x_i\}$ be generated by method (3.3). Assume that the objective function $f(x)$ is bounded below:

$$f(x) \ge f^* \quad \forall x \in \mathcal{F}.$$

Then $\sum_{i=0}^{\infty} r_{M_i}^3(x_i) \le \frac{12}{L_0} (f(x_0) - f^*)$. Moreover, $\lim_{i \to \infty} \mu_L(x_i) = 0$ and for any $k \ge 1$ we have

$$\min_{1 \le i \le k} \mu_L(x_i) \le \frac{8}{3} \cdot \left( \frac{3 (f(x_0) - f^*)}{2k \cdot L_0} \right)^{1/3}. \tag{3.4}$$

Proof. In view of inequality (2.11), we have

$$f(x_0) - f^* \ge \sum_{i=0}^{k-1} [f(x_i) - f(x_{i+1})] \ge \sum_{i=0}^{k-1} \frac{M_i}{12} r_{M_i}^3(x_i) \ge \frac{L_0}{12} \sum_{i=0}^{k-1} r_{M_i}^3(x_i).$$

It remains to use the statement of Lemma 5 and the upper bound on $M_k$ in (3.3):

$$r_{M_i}(x_i) \ge \mu_{M_i}(x_{i+1}) \ge \tfrac{3}{4} \mu_L(x_{i+1}).$$

□

Note that inequality (3.4) implies that

$$\min_{1 \le i \le k} \|f'(x_i)\| \le O(k^{-2/3}).$$

It is well known that for the gradient scheme a possible level of the right-hand side in this inequality is of the order $O(k^{-1/2})$ (see, for example, [10], inequality (1.2.13)).

Theorem 1 helps to get the convergence results in many different situations. We mention only one of them.

Theorem 2. Let the sequence $\{x_i\}$ be generated by method (3.3). For some $i \ge 0$, assume that the set $\mathcal{L}(f(x_i))$ is bounded. Then there exists a limit

$$\lim_{i \to \infty} f(x_i) = f^*.$$

The set $X^*$ of the limit points of this sequence is non-empty. Moreover, this is a connected set, such that for any $x^* \in X^*$ we have

$$f(x^*) = f^*, \quad f'(x^*) = 0, \quad f''(x^*) \succeq 0.$$

Proof. The proof of this theorem can be derived from Theorem 1 in a standard way. □

Let us describe now the behavior of the process (3.3) in a neighborhood of a

non-degenerate stationary point, which is not a point of local minimum.

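Lemma 6. Let $\bar x \in \operatorname{int} \mathcal{F}$ be a non-degenerate saddle point or a point of local maximum of function $f(x)$:

$$f'(\bar x) = 0, \quad \lambda_n(f''(\bar x)) < 0.$$

Then there exist constants $\epsilon, \delta > 0$ such that whenever a point $x_i$ appears to be in a set $Q = \{x : \|x - \bar x\| \le \epsilon, \; f(x) \ge f(\bar x)\}$ (for instance, if $x_i = \bar x$), then the next point $x_{i+1}$ leaves the set $Q$:

$$f(x_{i+1}) \le f(\bar x) - \delta.$$

Proof. Let for some $d$ with $\|d\| = 1$, and for some $\bar\tau > 0$ we have

$$\langle f''(\bar x) d, d \rangle \equiv -2\sigma < 0, \quad \bar x \pm \bar\tau d \in \mathcal{F}$$

(the factor 2 is what makes the term $\frac{1}{2} \tau^2 \langle f''(\bar x) d, d \rangle$ from (2.3) equal to $-\sigma \tau^2$ in the estimates below). Define $\epsilon = \min\left\{ \frac{\sigma}{2L}, \bar\tau \right\}$ and $\delta = \frac{\sigma}{6} \epsilon^2$. Then, in view of inequality (2.10), the upper bound on $M_i$, and inequality (2.3), for $|\tau| \le \bar\tau$ we get the following estimate:

$$f(x_{i+1}) \le f(\bar x + \tau d) + \tfrac{L}{2} \|\bar x + \tau d - x_i\|^3 \le f(\bar x) - \sigma \tau^2 + \tfrac{L}{6} |\tau|^3 + \tfrac{L}{2} \left[ \epsilon^2 + 2\tau \langle d, \bar x - x_i \rangle + \tau^2 \right]^{3/2}.$$

Since we are free in the choice of the sign of $\tau$, we can guarantee that

$$f(x_{i+1}) \le f(\bar x) - \sigma \tau^2 + \tfrac{L}{6} |\tau|^3 + \tfrac{L}{2} \left[ \epsilon^2 + \tau^2 \right]^{3/2}, \quad |\tau| \le \bar\tau.$$

Let us choose $\tau = \epsilon \le \bar\tau$. Then

$$f(x_{i+1}) \le f(\bar x) - \sigma \tau^2 + \tfrac{5L}{3} \tau^3 \le f(\bar x) - \sigma \tau^2 + \tfrac{5L}{3} \cdot \tfrac{\sigma}{2L} \cdot \tau^2 = f(\bar x) - \tfrac{1}{6} \sigma \tau^2.$$

Since the process (3.3) is monotone with respect to the objective function, it will never return to $Q$. □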

Consider now the behavior of the regularized Newton scheme (3.3) in a neighborhood of a non-degenerate local minimum. It appears that in such a situation assumption $L_0 > 0$ is not necessary anymore. Let us analyze a relaxed version of (3.3):

$$x_{k+1} = T_{M_k}(x_k), \quad k \ge 0, \tag{3.5}$$

where $M_k \in (0, 2L]$. Denote

$$\delta_k = \frac{L \|f'(x_k)\|}{\lambda_n^2(f''(x_k))}.$$

Theorem 3. Let $f''(x_0) \succ 0$ and $\delta_0 \le \frac{1}{4}$. Let points $\{x_k\}$ be generated by (3.5). Then:

1. For all $k \ge 0$, the values $\delta_k$ are well defined and they converge quadratically to zero:

$$\delta_{k+1} \le \frac{3}{2} \left( \frac{\delta_k}{1 - \delta_k} \right)^2 \le \frac{8}{3} \delta_k^2 \le \frac{2}{3} \delta_k, \quad k \ge 0. \tag{3.6}$$

2. Minimal eigenvalues of all Hessians $f''(x_k)$ lie within the following bounds:

$$e^{-1} \lambda_n(f''(x_0)) \le \lambda_n(f''(x_k)) \le e^{3/4} \lambda_n(f''(x_0)). \tag{3.7}$$

3. The whole sequence $\{x_i\}$ converges quadratically to a point $x^*$, which is a non-degenerate local minimum of function $f(x)$. In particular, for any $k \ge 1$ we have

$$\|f'(x_k)\| \le \lambda_n^2(f''(x_0)) \, \frac{9 e^{3/2}}{16 L} \left( \frac{1}{2} \right)^{2^k}. \tag{3.8}$$

Proof. Assume that for some $k \ge 0$ we have $f''(x_k) \succ 0$. Then the corresponding $\delta_k$ is well defined. Assume that $\delta_k \le \frac{1}{4}$. From equation (2.5) we have

$$r_{M_k}(x_k) = \|T_{M_k}(x_k) - x_k\| = \left\| \left( f''(x_k) + r_{M_k}(x_k) \tfrac{M_k}{2} I \right)^{-1} f'(x_k) \right\| \le \frac{\|f'(x_k)\|}{\lambda_n(f''(x_k))}. \tag{3.9}$$

Note also that $f''(x_{k+1}) \succeq f''(x_k) - r_{M_k}(x_k) L I$. Therefore

$$\lambda_n(f''(x_{k+1})) \ge \lambda_n(f''(x_k)) - r_{M_k}(x_k) L \ge \lambda_n(f''(x_k)) - \frac{L \|f'(x_k)\|}{\lambda_n(f''(x_k))} = (1 - \delta_k) \lambda_n(f''(x_k)).$$

Thus, $f''(x_{k+1})$ is also positive definite. Moreover, using inequality (2.9) and the upper bound for $M_k$ we obtain

$$\delta_{k+1} = \frac{L \|f'(x_{k+1})\|}{\lambda_n^2(f''(x_{k+1}))} \le \frac{3 L^2 r_{M_k}^2(x_k)}{2 \lambda_n^2(f''(x_{k+1}))} \le \frac{3 L^2 \|f'(x_k)\|^2}{2 \lambda_n^4(f''(x_k)) (1 - \delta_k)^2} = \frac{3}{2} \left( \frac{\delta_k}{1 - \delta_k} \right)^2 \le \frac{8}{3} \delta_k^2.$$

Thus, $\delta_{k+1} \le \frac{1}{4}$ and we prove (3.6) by induction. Note that we also get $\delta_{k+1} \le \frac{2}{3} \delta_k$.

Further, as we have already seen,

$$\ln \frac{\lambda_n(f''(x_k))}{\lambda_n(f''(x_0))} \ge \sum_{i=0}^{\infty} \ln(1 - \delta_i) \ge -\sum_{i=0}^{\infty} \frac{\delta_i}{1 - \delta_i} \ge -\frac{1}{1 - \delta_0} \sum_{i=0}^{\infty} \delta_i \ge -1.$$

In order to get an upper bound, note that $f''(x_{k+1}) \preceq f''(x_k) + r_{M_k}(x_k) L I$. Hence,

$$\lambda_n(f''(x_{k+1})) \le \lambda_n(f''(x_k)) + r_{M_k}(x_k) L \le (1 + \delta_k) \lambda_n(f''(x_k)).$$

Therefore

$$\ln \frac{\lambda_n(f''(x_k))}{\lambda_n(f''(x_0))} \le \sum_{i=0}^{\infty} \ln(1 + \delta_i) \le \sum_{i=0}^{\infty} \delta_i \le \frac{3}{4}.$$

It remains to prove Item 3 of the theorem. In view of inequalities (3.9) and (3.7), we have

$$r_{M_k}(x_k) \le \frac{1}{L} \lambda_n(f''(x_k)) \delta_k \le \frac{e^{3/4}}{L} \lambda_n(f''(x_0)) \delta_k.$$

Thus, $\{x_i\}$ is a Cauchy sequence, which has a unique limit point $x^*$. Since the eigenvalues of $f''(x)$ are continuous functions of $x$, from (3.7) we conclude that $f''(x^*) \succ 0$.

Further, from inequality (3.6) we get

$$\delta_{k+1} \le \frac{\delta_k^2}{(1 - \delta_0)^2} \le \frac{16}{9} \delta_k^2.$$

Denoting $\hat\delta_k = \frac{16}{9} \delta_k$, we get $\hat\delta_{k+1} \le \hat\delta_k^2$. Thus, for any $k \ge 1$ we have

$$\delta_k = \frac{9}{16} \hat\delta_k \le \frac{9}{16} \hat\delta_0^{2^k} < \frac{9}{16} \left( \frac{1}{2} \right)^{2^k}.$$

Using the upper bound in (3.7), we get (3.8). □

4. Global efficiency on specific problem classes

In the previous section, we have already seen that the modified Newton scheme can be supported by a global efficiency estimate (3.4) on a general class of non-convex problems. The main goal of this section is to show that on more specific classes of non-convex problems the global performance of the scheme (3.3) is much better. To the best of our knowledge, the results of this section are the first global complexity results on a Newton-type scheme. The nice feature of the scheme (3.3) consists in its ability to adjust the performance to a specific problem class automatically.

4.1. Star-convex functions

Let us start from a definition.

Definition 1. We call a function $f(x)$ star-convex if its set of global minimums $X^*$ is not empty and for any $x^* \in X^*$ and any $x \in \mathbb{R}^n$ we have

$$f(\alpha x^* + (1 - \alpha) x) \le \alpha f(x^*) + (1 - \alpha) f(x) \quad \forall x \in \mathcal{F}, \; \forall \alpha \in [0, 1]. \tag{4.1}$$

A particular example of a star-convex function is a usual convex function. However, in general, a star-convex function does not need to be convex, even for the scalar case. For instance, $f(x) = |x| (1 - e^{-|x|})$, $x \in \mathbb{R}$, is star-convex, but not convex. Star-convex functions arise quite often in optimization problems related to sum of squares, e.g. the function $f(x, y) = x^2 y^2 + x^2 + y^2$ belongs to this class.
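The scalar example can be checked numerically. The sketch below is only an illustration on a finite grid (grid ranges and the test points $a = 3$, $b = 5$ are arbitrary choices made here): it verifies inequality (4.1) with $x^* = 0$ and exhibits a pair of points where midpoint convexity fails.

```python
import numpy as np

def f(x):
    # Scalar example from the text: f(x) = |x| * (1 - exp(-|x|)).
    return np.abs(x) * (1.0 - np.exp(-np.abs(x)))

x_star = 0.0                                  # global minimizer, f(0) = 0
xs = np.linspace(-10.0, 10.0, 401)
alphas = np.linspace(0.0, 1.0, 101)
X, A = np.meshgrid(xs, alphas)

# Inequality (4.1): f(a*x_star + (1-a)*x) <= a*f(x_star) + (1-a)*f(x).
lhs = f(A * x_star + (1.0 - A) * X)
rhs = A * f(x_star) + (1.0 - A) * f(X)
star_convex_on_grid = bool(np.all(lhs <= rhs + 1e-12))

# Convexity would require f((a+b)/2) <= (f(a)+f(b))/2; it fails for a=3, b=5.
a, b = 3.0, 5.0
midpoint_gap = f(0.5 * (a + b)) - 0.5 * (f(a) + f(b))
```

A positive `midpoint_gap` confirms that the function is not convex, in line with $f''(x) = (2 - x) e^{-x} < 0$ for $x > 2$.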

Theorem 4. Assume that the objective function in (3.1) is star-convex and the set $\mathcal{F}$ is bounded: $\operatorname{diam} \mathcal{F} = D < \infty$. Let the sequence $\{x_k\}$ be generated by method (3.3).

1. If $f(x_0) - f^* \ge \frac{3}{2} L D^3$, then $f(x_1) - f^* \le \frac{1}{2} L D^3$.

2. If $f(x_0) - f^* \le \frac{3}{2} L D^3$, then the rate of convergence of process (3.3) is as follows:

$$f(x_k) - f(x^*) \le \frac{3 L D^3}{2 \left(1 + \frac{1}{3} k\right)^2}, \quad k \ge 0. \tag{4.2}$$

Proof. Indeed, in view of inequality (2.10), the upper bound on the parameters $M_k$ and definition (4.1), for any $k \ge 0$ we have:

$$f(x_{k+1}) - f(x^*) \le \min_{y} \left[ f(y) - f(x^*) + \tfrac{L}{2} \|y - x_k\|^3 : \; y = \alpha x^* + (1 - \alpha) x_k, \; \alpha \in [0, 1] \right]$$

$$\le \min_{\alpha \in [0,1]} \left[ f(x_k) - f(x^*) - \alpha (f(x_k) - f(x^*)) + \tfrac{L}{2} \alpha^3 \|x^* - x_k\|^3 \right]$$

$$\le \min_{\alpha \in [0,1]} \left[ f(x_k) - f(x^*) - \alpha (f(x_k) - f(x^*)) + \tfrac{L}{2} \alpha^3 D^3 \right].$$

The minimum of the objective function in the last minimization problem in $\alpha \ge 0$ is achieved for

$$\alpha_k = \sqrt{\frac{2 (f(x_k) - f(x^*))}{3 L D^3}}.$$

If $\alpha_k \ge 1$, then the actual optimal value corresponds to $\alpha = 1$. In this case

$$f(x_{k+1}) - f(x^*) \le \tfrac{1}{2} L D^3.$$

Since the process (3.3) is monotone, this can happen only at the first iteration of the method.

Assume that $\alpha_k \le 1$. Then

$$f(x_{k+1}) - f(x^*) \le f(x_k) - f(x^*) - \left[ \tfrac{2}{3} (f(x_k) - f(x^*)) \right]^{3/2} \frac{1}{\sqrt{L D^3}}.$$

Or, in a more convenient notation, that is $\alpha_{k+1}^2 \le \alpha_k^2 - \frac{2}{3} \alpha_k^3 < \alpha_k^2$. Therefore

$$\frac{1}{\alpha_{k+1}} - \frac{1}{\alpha_k} = \frac{\alpha_k - \alpha_{k+1}}{\alpha_k \alpha_{k+1}} = \frac{\alpha_k^2 - \alpha_{k+1}^2}{\alpha_k \alpha_{k+1} (\alpha_k + \alpha_{k+1})} \ge \frac{\alpha_k^2 - \alpha_{k+1}^2}{2 \alpha_k^3} \ge \frac{1}{3}.$$

Thus, $\frac{1}{\alpha_k} \ge \frac{1}{\alpha_0} + \frac{k}{3} \ge 1 + \frac{k}{3}$, and (4.2) follows. □
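The last step of the proof can be illustrated numerically. The following sketch iterates the recursion $\alpha_{k+1}^2 = \alpha_k^2 - \frac{2}{3} \alpha_k^3$ with equality (the slowest decrease the proof allows, starting from the largest admissible value $\alpha_0 = 1$) and compares it with the bound $\alpha_k \le 1/(1 + k/3)$ that yields (4.2). The helper name and the number of steps are arbitrary choices for illustration.

```python
import math

def worst_case_alphas(alpha0, n_steps):
    # Iterate alpha_{k+1}^2 = alpha_k^2 - (2/3) * alpha_k^3 (equality case).
    alphas = [alpha0]
    for _ in range(n_steps):
        a = alphas[-1]
        alphas.append(math.sqrt(a * a - (2.0 / 3.0) * a ** 3))
    return alphas

n_steps = 50
alphas = worst_case_alphas(1.0, n_steps)
# Bound from the proof: 1/alpha_k >= 1 + k/3.
bounds = [1.0 / (1.0 + k / 3.0) for k in range(n_steps + 1)]
bound_holds = all(a <= b + 1e-12 for a, b in zip(alphas, bounds))
```

Since $f(x_k) - f(x^*) = \frac{3}{2} L D^3 \alpha_k^2$ by the definition of $\alpha_k$, the bound on $\alpha_k$ translates directly into the $O(1/k^2)$ estimate (4.2).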

Let us introduce now a generalization of the notion of non-degenerate global minimum.

Definition 2. We say that the optimal set $X^*$ of function $f(x)$ is globally non-degenerate if there exists a constant $\gamma > 0$ such that for any $x \in \mathcal{F}$ we have

$$f(x) - f^* \ge \frac{\gamma}{2} \rho^2(x, X^*), \tag{4.3}$$

where $f^*$ is the global minimal value of function $f(x)$, and $\rho(x, X^*)$ is the Euclidean distance from $x$ to $X^*$.

Of course, this property holds for strongly convex functions (in this case $X^*$ is a singleton), however it can also hold for some non-convex functions. As an example, consider $f(x) = (\|x\|^2 - 1)^2$, $X^* = \{x : \|x\| = 1\}$. Moreover, if the set $X^*$ has a connected non-trivial component, the Hessians of the objective function at these points cannot be non-degenerate. However, as we will see, in this situation the modified Newton scheme ensures a super-linear rate of convergence. Denote

$$\bar\omega = \frac{1}{L^2} \left( \frac{\gamma}{2} \right)^3.$$

Theorem 5. Let function $f(x)$ be star-convex. Assume that it has also a globally non-degenerate optimal set. Then the performance of the scheme (3.3) on this problem is as follows.

1. If $f(x_0) - f(x^*) \ge \frac{4}{9} \bar\omega$, then at the first phase of the process we get the following rate of convergence:

$$f(x_k) - f(x^*) \le \left[ (f(x_0) - f(x^*))^{1/4} - \frac{k}{6} \sqrt{\frac{2}{3}} \, \bar\omega^{1/4} \right]^4. \tag{4.4}$$

This phase is terminated as soon as $f(x_{k_0}) - f(x^*) \le \frac{4}{9} \bar\omega$ for some $k_0 \ge 0$.

2. For $k \ge k_0$ the sequence converges superlinearly:

$$f(x_{k+1}) - f(x^*) \le \frac{1}{2} (f(x_k) - f(x^*)) \sqrt{\frac{f(x_k) - f(x^*)}{\bar\omega}}. \tag{4.5}$$

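Proof. Denote by $x_k^*$ the projection of the point $x_k$ onto the optimal set $X^*$. In view of inequality (2.10), the upper bound on the parameters $M_k$ and definitions (4.1), (4.3), for any $k \ge 0$ we have:

$$f(x_{k+1}) - f(x^*) \le \min_{\alpha \in [0,1]} \left[ f(x_k) - f(x^*) - \alpha (f(x_k) - f(x^*)) + \tfrac{L}{2} \alpha^3 \|x_k^* - x_k\|^3 \right]$$

$$\le \min_{\alpha \in [0,1]} \left[ f(x_k) - f(x^*) - \alpha (f(x_k) - f(x^*)) + \tfrac{L}{2} \alpha^3 \left( \tfrac{2}{\gamma} (f(x_k) - f(x^*)) \right)^{3/2} \right].$$

Denoting $\Delta_k = (f(x_k) - f(x^*)) / \bar\omega$, we get inequality

$$\Delta_{k+1} \le \min_{\alpha \in [0,1]} \left[ \Delta_k - \alpha \Delta_k + \tfrac{1}{2} \alpha^3 \Delta_k^{3/2} \right]. \tag{4.6}$$

Note that the first order optimality condition for $\alpha \ge 0$ in this problem is

$$\alpha_k = \sqrt{\tfrac{2}{3} \Delta_k^{-1/2}}.$$

Therefore, if $\Delta_k \ge \frac{4}{9}$, we get

$$\Delta_{k+1} \le \Delta_k - \left( \tfrac{2}{3} \right)^{3/2} \Delta_k^{3/4}.$$

Denoting $u_k = \frac{9}{4} \Delta_k$ we get a simpler relation:

$$u_{k+1} \le u_k - \tfrac{2}{3} u_k^{3/4},$$

which is applicable if $u_k \ge 1$. Since the right-hand side of this inequality is increasing for $u_k \ge \frac{1}{16}$, let us prove by induction that

$$u_k \le \left[ u_0^{1/4} - \tfrac{k}{6} \right]^4.$$

Indeed, inequality

$$\left[ u_0^{1/4} - \tfrac{k+1}{6} \right]^4 \ge \left[ u_0^{1/4} - \tfrac{k}{6} \right]^4 - \tfrac{2}{3} \left[ u_0^{1/4} - \tfrac{k}{6} \right]^3$$

clearly is equivalent to

$$\tfrac{2}{3} \left[ u_0^{1/4} - \tfrac{k}{6} \right]^3 \ge \left[ u_0^{1/4} - \tfrac{k}{6} \right]^4 - \left[ u_0^{1/4} - \tfrac{k+1}{6} \right]^4$$

$$= \tfrac{1}{6} \left( \left[ u_0^{1/4} - \tfrac{k}{6} \right]^3 + \left[ u_0^{1/4} - \tfrac{k}{6} \right]^2 \left[ u_0^{1/4} - \tfrac{k+1}{6} \right] + \left[ u_0^{1/4} - \tfrac{k}{6} \right] \left[ u_0^{1/4} - \tfrac{k+1}{6} \right]^2 + \left[ u_0^{1/4} - \tfrac{k+1}{6} \right]^3 \right),$$

which is obviously true.

Finally, if $u_k \le 1$, then the optimal value for $\alpha$ in (4.6) is one and we get (4.5). □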

4.2. Gradient-dominated functions

Let us study now another interesting class of problems.

Definition 3. A function $f(x)$ is called gradient dominated of degree $p \in [1, 2]$ if it attains a global minimum at some point $x^*$ and for any $x \in \mathcal{F}$ we have

$$f(x) - f(x^*) \le \tau_f \|f'(x)\|^p, \tag{4.7}$$

where $\tau_f$ is a positive constant. The parameter $p$ is called the degree of domination.

Note that we do not assume that the global minimum of function $f$ is unique. For $p = 2$, this class of functions has been introduced in [13].

Let us give several examples of gradient dominated functions.

Example 1. Convex functions. Let $f$ be convex on $\mathbb{R}^n$. Assume it achieves its minimum at point $x^*$. Then, for any $x \in \mathbb{R}^n$, $\|x - x^*\| \le R$, we have

$$f(x) - f(x^*) \le \langle f'(x), x - x^* \rangle \le \|f'(x)\| \cdot R.$$

Thus, on the set $\mathcal{F} = \{x : \|x - x^*\| \le R\}$, function $f$ is a gradient dominated function of degree one with $\tau_f = R$. □

Example 2. Strongly convex functions. Let $f$ be differentiable and strongly convex on $\mathbb{R}^n$. This means that there exists a constant $\gamma > 0$ such that

$$f(y) \ge f(x) + \langle f'(x), y - x \rangle + \tfrac{1}{2} \gamma \|y - x\|^2, \tag{4.8}$$

for all $x, y \in \mathbb{R}^n$. Then (see, for example, [10], inequality (2.1.19)),

$$f(x) - f(x^*) \le \frac{1}{2\gamma} \|f'(x)\|^2 \quad \forall x \in \mathbb{R}^n.$$

Thus, on the set $\mathcal{F} = \mathbb{R}^n$, function $f$ is a gradient dominated function of degree two with $\tau_f = \frac{1}{2\gamma}$. □

Example 3. Sum of squares. Consider a system of non-linear equations:

$$g(x) = 0, \tag{4.9}$$

where $g(x) = (g_1(x), \dots, g_m(x))^T : \mathbb{R}^n \to \mathbb{R}^m$ is a differentiable function. We assume that $m \le n$ and that there exists a solution $x^*$ to (4.9). Let us assume in addition that the Jacobian

$$J(x) = (g_1'(x), \dots, g_m'(x))$$

is uniformly non-degenerate on a certain convex set $F$ containing $x^*$. This means that the value

$$\sigma \equiv \inf_{x \in F} \lambda_n\left(J^T(x) J(x)\right)$$

is positive. Consider the function

$$f(x) = \frac{1}{2} \sum_{i=1}^{m} g_i^2(x).$$

Clearly, $f(x^*) = 0$. Note that $f'(x) = J(x) g(x)$. Therefore

$$\|f'(x)\|^2 = \langle \left(J^T(x) J(x)\right) g(x), g(x) \rangle \ge \sigma \|g(x)\|^2 = 2\sigma \left(f(x) - f(x^*)\right).$$

Thus, $f$ is a gradient dominated function on $F$ of degree two with $\tau_f = \frac{1}{2\sigma}$. Note that, for $m < n$, the set of solutions to (4.9) is not a singleton and therefore the Hessians of function $f$ are necessarily degenerate at the solutions. □
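As a rough numerical illustration of Example 3, the pointwise inequality $\|f'(x)\|^2 \ge 2\lambda_{\min}(J^T(x)J(x))\, f(x)$ behind the estimate can be checked with NumPy. This is only a sketch: the map `g`, its Jacobian `jac`, and the helper `dominance_gap` are hypothetical choices made for the illustration, not objects from the paper, and $\lambda_{\min}$ denotes the smallest eigenvalue.

```python
import numpy as np

def g(x):
    """Hypothetical smooth map R^3 -> R^2 (m < n) with g(0) = 0."""
    return np.array([x[0] + 0.1 * np.sin(x[1]),
                     x[1] + 0.1 * np.cos(x[2]) - 0.1])

def jac(x):
    """n x m matrix J(x) whose columns are the gradients of g_1 and g_2."""
    return np.array([[1.0, 0.0],
                     [0.1 * np.cos(x[1]), 1.0],
                     [0.0, -0.1 * np.sin(x[2])]])

def f(x):
    """Sum-of-squares objective f(x) = 0.5 * ||g(x)||^2."""
    return 0.5 * float(g(x) @ g(x))

def grad_f(x):
    """Gradient f'(x) = J(x) g(x)."""
    return jac(x) @ g(x)

def dominance_gap(x):
    """||f'(x)||^2 - 2 * lambda_min(J^T J) * f(x); nonnegative by the
    Rayleigh-quotient bound used in Example 3."""
    J = jac(x)
    lam_min = np.linalg.eigvalsh(J.T @ J)[0]  # eigenvalues come in ascending order
    grad = grad_f(x)
    return float(grad @ grad) - 2.0 * lam_min * f(x)
```

Taking the infimum of `lam_min` over a set $F$ would give the constant $\sigma$ of the example; the sketch only tests the inequality point by point.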

In order to study the complexity of minimization of the gradient dominated

functions, we need one auxiliary result.

Lemma 7. At each step of method (3.3) we can guarantee the following decrease of the objective function:

$$f(x_k) - f(x_{k+1}) \ge \frac{L_0 \cdot \|f'(x_{k+1})\|^{3/2}}{3\sqrt{2} \cdot (L + L_0)^{3/2}}, \quad k \ge 0. \tag{4.10}$$

Proof. In view of inequalities (2.11) and (2.9) we get

$$f(x_k) - f(x_{k+1}) \ge \frac{M_k}{12} r_{M_k}^3(x_k) \ge \frac{M_k}{12} \left( \frac{2\|f'(x_{k+1})\|}{L + M_k} \right)^{3/2} = \frac{M_k \|f'(x_{k+1})\|^{3/2}}{3\sqrt{2} \cdot (L + M_k)^{3/2}}.$$

It remains to note that the right-hand side of this inequality is increasing in $M_k \le 2L$. Thus, we can replace $M_k$ by its lower bound $L_0$. □

Let us start with the analysis of the gradient dominated functions of degree one. The following theorem states that the process can be partitioned into two phases. The initial phase (with large values of the objective function) terminates fast enough, while at the second phase we have an $O(1/k^2)$ rate of convergence.

Theorem 6. Let us apply method (3.3) to minimization of a gradient dominated function $f(x)$ of degree $p = 1$.

1. If the initial value of the objective function is large enough:

$$f(x_0) - f(x^*) \ge \hat\omega \equiv \frac{18}{L_0^2} \tau_f^3 \cdot (L + L_0)^3,$$

then the process converges to the region $\mathcal{L}(\hat\omega)$ superlinearly:

$$\ln\left( \tfrac{1}{\hat\omega} (f(x_k) - f(x^*)) \right) \le \left( \tfrac{2}{3} \right)^k \ln\left( \tfrac{1}{\hat\omega} (f(x_0) - f(x^*)) \right). \tag{4.11}$$

2. If $f(x_0) - f(x^*) \le \gamma^2 \hat\omega$ for some $\gamma > 1$, then we have the following estimate for the rate of convergence:

$$f(x_k) - f(x^*) \le \hat\omega \cdot \frac{\gamma^2 \left( 2 + \frac{3}{2}\gamma \right)^2}{\left( 2 + \left( k + \frac{3}{2} \right) \cdot \gamma \right)^2}, \quad k \ge 0. \tag{4.12}$$


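Proof. Using inequalities (4.10) and (4.7) with $p = 1$, we get

$$f(x_k) - f(x_{k+1}) \ge \frac{L_0 \cdot (f(x_{k+1}) - f(x^*))^{3/2}}{3\sqrt{2} \cdot (L + L_0)^{3/2} \cdot \tau_f^{3/2}} = \hat\omega^{-1/2} (f(x_{k+1}) - f(x^*))^{3/2}.$$

Denoting $\delta_k = (f(x_k) - f(x^*)) / \hat\omega$, we obtain

$$\delta_k - \delta_{k+1} \ge \delta_{k+1}^{3/2}. \tag{4.13}$$

Hence, as long as $\delta_k \ge 1$, we get

$$\ln \delta_k \le \left( \tfrac{2}{3} \right)^k \ln \delta_0,$$

and that is (4.11).

Let us prove now inequality (4.12). Using inequality (4.13), we have

$$\begin{aligned}
\frac{1}{\sqrt{\delta_{k+1}}} - \frac{1}{\sqrt{\delta_k}}
&\ge \frac{1}{\sqrt{\delta_{k+1}}} - \frac{1}{\sqrt{\delta_{k+1} + \delta_{k+1}^{3/2}}}
= \frac{\sqrt{\delta_{k+1} + \delta_{k+1}^{3/2}} - \sqrt{\delta_{k+1}}}{\sqrt{\delta_{k+1}} \sqrt{\delta_{k+1} + \delta_{k+1}^{3/2}}} \\
&= \frac{1}{\sqrt{1 + \sqrt{\delta_{k+1}}} \cdot \left( 1 + \sqrt{1 + \sqrt{\delta_{k+1}}} \right)}
= \frac{1}{1 + \sqrt{\delta_{k+1}} + \sqrt{1 + \sqrt{\delta_{k+1}}}} \\
&\ge \frac{1}{2 + \frac{3}{2}\sqrt{\delta_{k+1}}} \ge \frac{1}{2 + \frac{3}{2}\sqrt{\delta_0}}.
\end{aligned}$$

Thus, $\frac{1}{\sqrt{\delta_k}} \ge \frac{1}{\gamma} + \frac{k}{2 + \frac{3}{2}\gamma}$, and this is (4.12). □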
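The recurrence (4.13) can be explored numerically. The sketch below is an illustration under stated assumptions rather than part of the proof: it generates the slowest sequence allowed by (4.13), namely the one with equality $\delta_k = \delta_{k+1} + \delta_{k+1}^{3/2}$, starting from $\delta_0 = \gamma^2$, and compares it with the right-hand side of (4.12) divided by $\hat\omega$. The function names are hypothetical.

```python
def next_delta(delta_k, iters=200):
    """Solve d + d**1.5 = delta_k for d in [0, delta_k] by bisection.

    The left-hand side is increasing in d, so this is the largest
    delta_{k+1} compatible with inequality (4.13).
    """
    lo, hi = 0.0, delta_k
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid + mid ** 1.5 > delta_k:
            hi = mid
        else:
            lo = mid
    return lo

def worst_case_sequence(delta0, steps):
    """Slowest-decreasing sequence satisfying (4.13) with equality."""
    seq = [delta0]
    for _ in range(steps):
        seq.append(next_delta(seq[-1]))
    return seq

def bound_4_12(k, gamma):
    """Right-hand side of (4.12) divided by omega-hat."""
    return (gamma * (2.0 + 1.5 * gamma) / (2.0 + (k + 1.5) * gamma)) ** 2
```

For instance, with `gamma = 3.0` one can compare `worst_case_sequence(gamma ** 2, 50)[k]` against `bound_4_12(k, gamma)` for each `k`; the two coincide at `k = 0` by construction.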

The reader should not be confused by the superlinear rate of convergence established by (4.11). It is valid only for the first stage of the process and describes a convergence to the set $\mathcal{L}(\hat\omega)$. For example, the first stage of the process discussed in Theorem 4 is even shorter: it takes a single iteration.

Let us look now at the gradient dominated functions of degree two. Here two phases of the process can be indicated as well.

Theorem 7. Let us apply method (3.3) to minimization of a gradient dominated function $f(x)$ of degree $p = 2$.

1. If the initial value of the objective function is large enough:

$$f(x_0) - f(x^*) \ge \tilde\omega \equiv \frac{L_0^4}{324 (L + L_0)^6 \tau_f^3}, \tag{4.14}$$

then at its first phase the process converges as follows:

$$f(x_k) - f(x^*) \le (f(x_0) - f(x^*)) \cdot e^{-k \cdot \sigma}, \tag{4.15}$$

where $\sigma = \dfrac{\tilde\omega^{1/4}}{\tilde\omega^{1/4} + (f(x_0) - f(x^*))^{1/4}}$. This phase ends on the first iteration $k_0$ for which (4.14) does not hold.

2. For $k \ge k_0$ the rate of convergence is superlinear:

$$f(x_{k+1}) - f(x^*) \le \tilde\omega \cdot \left( \frac{f(x_k) - f(x^*)}{\tilde\omega} \right)^{4/3}. \tag{4.16}$$


Proof. Using inequalities (4.10) and (4.7) with $p = 2$, we get

$$f(x_k) - f(x_{k+1}) \ge \frac{L_0 \cdot (f(x_{k+1}) - f(x^*))^{3/4}}{3\sqrt{2} \cdot (L + L_0)^{3/2} \cdot \tau_f^{3/4}} = \tilde\omega^{1/4} (f(x_{k+1}) - f(x^*))^{3/4}.$$

Denoting $\delta_k = (f(x_k) - f(x^*)) / \tilde\omega$, we obtain

$$\delta_k \ge \delta_{k+1} + \delta_{k+1}^{3/4}. \tag{4.17}$$

Hence,

$$\frac{\delta_k}{\delta_{k+1}} \ge 1 + \delta_k^{-1/4} \ge 1 + \delta_0^{-1/4} = \frac{1}{1 - \sigma} \ge e^{\sigma},$$

and we get (4.15). Finally, from (4.17) we have $\delta_{k+1} \le \delta_k^{4/3}$, and that is (4.16). □
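Both phases of Theorem 7 can be observed on the extremal sequence of (4.17). The following sketch makes the same kind of assumption as a worst-case simulation usually does: it takes equality in (4.17), $\delta_k = \delta_{k+1} + \delta_{k+1}^{3/4}$, and exposes the normalized forms of (4.15) and (4.16), using $\sigma = 1/(1 + \delta_0^{1/4})$, which is the expression for $\sigma$ after division by $\tilde\omega$. All function names are hypothetical.

```python
import math

def next_delta(delta_k, iters=200):
    """Solve d + d**0.75 = delta_k for d in [0, delta_k] by bisection,
    i.e. take equality in (4.17) (the slowest admissible decrease)."""
    lo, hi = 0.0, delta_k
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid + mid ** 0.75 > delta_k:
            hi = mid
        else:
            lo = mid
    return lo

def worst_case_sequence(delta0, steps):
    """Sequence delta_0, delta_1, ... generated with equality in (4.17)."""
    seq = [delta0]
    for _ in range(steps):
        seq.append(next_delta(seq[-1]))
    return seq

def linear_phase_bound(k, delta0):
    """Normalized right-hand side of (4.15): delta_0 * exp(-k * sigma)."""
    sigma = 1.0 / (1.0 + delta0 ** 0.25)
    return delta0 * math.exp(-k * sigma)

def superlinear_bound(delta_k):
    """Normalized right-hand side of (4.16): delta_k ** (4/3)."""
    return delta_k ** (4.0 / 3.0)
```

The bound (4.16) is informative only once $\delta_k < 1$, which is why the theorem separates the two phases.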

Comparing the statement of Theorem 7 with other theorems of this section we see a significant difference: this is the first time when the initial gap $f(x_0) - f(x^*)$ enters the complexity estimate of the first phase of the process in a polynomial way; in all other cases the dependence on this gap is much weaker.

Note that it is possible to embed the gradient dominated functions of degree one into the class of gradient dominated functions of degree two. However, the reader can check that this only spoils the efficiency estimates established by Theorem 7.

4.3. Nonlinear transformations of convex functions

Let $u(x) : \mathbb{R}^n \to \mathbb{R}^n$ be a non-degenerate operator. Denote by $v(u)$ its inverse:

$$v(u) : \mathbb{R}^n \to \mathbb{R}^n, \quad v(u(x)) \equiv x.$$

Consider the following function:

$$f(x) = \phi(u(x)),$$

where $\phi(u)$ is a convex function with bounded level sets. Such classes are typical for minimization problems with composite objective functions. Denote by $x^* \equiv v(u^*)$ its minimum. Let us fix some $x_0 \in \mathbb{R}^n$. Denote

$$\sigma = \max_u \{ \|v'(u)\| : \phi(u) \le f(x_0) \}, \qquad D = \max_u \{ \|u - u^*\| : \phi(u) \le f(x_0) \}.$$

The following result is straightforward.

Lemma 8. For any $x, y \in \mathcal{L}(f(x_0))$ we have

$$\|x - y\| \le \sigma \|u(x) - u(y)\|. \tag{4.18}$$


Proof. Indeed, for $x, y \in \mathcal{L}(f(x_0))$, we have $\phi(u(x)) \le f(x_0)$ and $\phi(u(y)) \le f(x_0)$. Consider the trajectory $x(t) = v(t u(y) + (1 - t) u(x))$, $t \in [0, 1]$. Then

$$y - x = \int_0^1 x'(t)\, dt = \left( \int_0^1 v'(t u(y) + (1 - t) u(x))\, dt \right) \cdot (u(y) - u(x)),$$

and (4.18) follows. □

The following result is very similar to Theorem 4.

Theorem 8. Assume that function $f$ has Lipschitz continuous Hessian on $F \supseteq \mathcal{L}(f(x_0))$ with Lipschitz constant $L$, and let the sequence $\{x_k\}$ be generated by method (3.3).

1. If $f(x_0) - f^* \ge \frac{3}{2} L (\sigma D)^3$, then $f(x_1) - f^* \le \frac{1}{2} L (\sigma D)^3$.

2. If $f(x_0) - f^* \le \frac{3}{2} L (\sigma D)^3$, then the rate of convergence of the process (3.3) is as follows:

$$f(x_k) - f(x^*) \le \frac{3 L (\sigma D)^3}{2 \left( 1 + \frac{1}{3} k \right)^2}, \quad k \ge 0. \tag{4.19}$$

Proof. Indeed, in view of inequality (2.10), the upper bound on the parameters $M_k$, and definition (4.1), for any $k \ge 0$ we have:

$$f(x_{k+1}) - f(x^*) \le \min_y \left[ f(y) - f(x^*) + \frac{L}{2} \|y - x_k\|^3 : \; y = v(\alpha u^* + (1 - \alpha) u(x_k)), \; \alpha \in [0, 1] \right].$$

By definition of points $y$ in the above minimization problem and (4.18), we have

$$\begin{aligned}
f(y) - f(x^*) &= \phi(\alpha u^* + (1 - \alpha) u(x_k)) - \phi(u^*) \le (1 - \alpha)(f(x_k) - f(x^*)), \\
\|y - x_k\| &\le \alpha \sigma \|u(x_k) - u^*\| \le \alpha \sigma D.
\end{aligned}$$

This means that the reasoning of Theorem 4 goes through with $D$ replaced by $\sigma D$. □

Let us prove a statement on strongly convex $\phi$. Denote $\check\omega = \frac{1}{L^2} \left( \frac{\gamma}{2\sigma^2} \right)^3$.

Theorem 9. Let function $\phi$ be strongly convex with convexity parameter $\gamma > 0$. Then, under the assumptions of Theorem 8, the performance of the scheme (3.3) is as follows.

1. If $f(x_0) - f(x^*) \ge \frac{4}{9} \check\omega$, then at the first phase of the process we get the following rate of convergence:

$$f(x_k) - f(x^*) \le \left[ (f(x_0) - f(x^*))^{1/4} - \frac{k}{6} \sqrt{\frac{2}{3}}\, \check\omega^{1/4} \right]^4. \tag{4.20}$$

This phase is terminated as soon as $f(x_{k_0}) - f(x^*) \le \frac{4}{9} \check\omega$ for some $k_0 \ge 0$.

2. For $k \ge k_0$ the sequence converges superlinearly:

$$f(x_{k+1}) - f(x^*) \le \frac{1}{2} (f(x_k) - f(x^*)) \sqrt{\frac{f(x_k) - f(x^*)}{\check\omega}}. \tag{4.21}$$


Proof. Indeed, in view of inequality (2.10), the upper bound on the parameters $M_k$, and definition (4.1), for any $k \ge 0$ we have:

$$f(x_{k+1}) - f(x^*) \le \min_y \left[ f(y) - f(x^*) + \frac{L}{2} \|y - x_k\|^3 : \; y = v(\alpha u^* + (1 - \alpha) u(x_k)), \; \alpha \in [0, 1] \right].$$

By definition of points $y$ in the above minimization problem and (4.18), we have

$$\begin{aligned}
f(y) - f(x^*) &= \phi(\alpha u^* + (1 - \alpha) u(x_k)) - \phi(u^*) \le (1 - \alpha)(f(x_k) - f(x^*)), \\
\|y - x_k\| &\le \alpha \sigma \|u(x_k) - u^*\| \le \alpha \sigma \sqrt{\frac{2}{\gamma} (f(x_0) - f(x^*))}.
\end{aligned}$$

This means that the reasoning of Theorem 5 goes through with $L$ replaced by $\sigma^3 L$. □

Note that the functions discussed in this section are often used as test functions for non-convex optimization algorithms.

5. Implementation issues

5.1. Solving the cubic regularization

Note that the auxiliary minimization problem (2.4), which we need to solve in order to compute the mapping $T_M(x)$, namely,

$$\min_{h \in \mathbb{R}^n} \left[ \langle g, h \rangle + \frac{1}{2} \langle Hh, h \rangle + \frac{M}{6} \|h\|^3 \right], \tag{5.1}$$

is substantially nonconvex. It can have isolated strict local minima, while we need to find a global one. Nevertheless, as we will show in this section, this problem is equivalent to a convex one-dimensional optimization problem.

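Before we present an "algorithmic" proof of this fact, let us provide it with a general explanation. Introduce the following objects:

$$\begin{aligned}
\xi_1(h) &= \langle g, h \rangle + \frac{1}{2} \langle Hh, h \rangle, \qquad \xi_2(h) = \|h\|^2, \\
Q &= \left\{ \xi = (\xi^{(1)}, \xi^{(2)})^T : \; \xi^{(1)} = \xi_1(h), \; \xi^{(2)} = \xi_2(h), \; h \in \mathbb{R}^n \right\} \subset \mathbb{R}^2, \\
\varphi(\xi) &= \xi^{(1)} + \frac{M}{6} \left( \xi^{(2)} \right)_+^{3/2},
\end{aligned}$$

where $(a)_+ = \max\{a, 0\}$. Then

$$\min_{h \in \mathbb{R}^n} \left[ \langle g, h \rangle + \frac{1}{2} \langle Hh, h \rangle + \frac{M}{6} \|h\|^3 \right] \equiv \min_{h \in \mathbb{R}^n} \left[ \xi_1(h) + \frac{M}{6} \xi_2^{3/2}(h) \right] = \min_{\xi \in Q} \varphi(\xi).$$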


Theorem 2.2 in [14] guarantees that for $n \ge 2$ the set $Q$ is convex and closed. Thus, we have reduced the initial nonconvex minimization problem in $\mathbb{R}^n$ to a convex constrained minimization problem in $\mathbb{R}^2$. Up to this moment, this reduction is not constructive, because $Q$ is given in implicit form. However, the next statement shows that the description of this set is quite simple.

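Denote

$$v_u(h) = \langle g, h \rangle + \frac{1}{2} \langle Hh, h \rangle + \frac{M}{6} \|h\|^3, \quad h \in \mathbb{R}^n,$$

and

$$v_l(r) = -\frac{1}{2} \left\langle \left( H + \frac{Mr}{2} I \right)^{-1} g, g \right\rangle - \frac{M}{12} r^3.$$

For the first function sometimes we use the notation $v_u(g; h)$. Denote

$$\mathcal{D} = \left\{ r \in \mathbb{R} : \; H + \frac{M}{2} r I \succ 0, \; r \ge 0 \right\}.$$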

Theorem 10. For any $M > 0$ we have the following relation:

$$\min_{h \in \mathbb{R}^n} v_u(h) = \sup_{r \in \mathcal{D}} v_l(r). \tag{5.2}$$

For any $r \in \mathcal{D}$, direction $h(r) = -\left( H + \frac{Mr}{2} I \right)^{-1} g$ satisfies equation

$$0 \le v_u(h(r)) - v_l(r) = \frac{M}{12} (r + 2\|h(r)\|)(\|h(r)\| - r)^2 = \frac{4}{3M} \cdot \frac{r + 2\|h(r)\|}{(r + \|h(r)\|)^2} \cdot v_l'(r)^2. \tag{5.3}$$

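Proof. Denote the left-hand side of relation (5.2) by $v_u^*$, and its right-hand side by $v_l^*$. Let us show that $v_u^* \ge v_l^*$. Indeed,

$$\begin{aligned}
v_u^* &= \min_{h \in \mathbb{R}^n} \left[ \langle g, h \rangle + \frac{1}{2} \langle Hh, h \rangle + \frac{M}{6} \|h\|^3 \right] \\
&= \min_{h \in \mathbb{R}^n,\; \tau = \|h\|^2} \left[ \langle g, h \rangle + \frac{1}{2} \langle Hh, h \rangle + \frac{M}{6} (\tau)_+^{3/2} \right] \\
&= \min_{h \in \mathbb{R}^n,\; \tau \in \mathbb{R}} \; \sup_{r \in \mathbb{R}} \left[ \langle g, h \rangle + \frac{1}{2} \langle Hh, h \rangle + \frac{M}{6} (\tau)_+^{3/2} + \frac{M}{4} r \left( \|h\|^2 - \tau \right) \right] \\
&\ge \sup_{r \in \mathcal{D}} \; \min_{h \in \mathbb{R}^n,\; \tau \in \mathbb{R}} \left[ \langle g, h \rangle + \frac{1}{2} \langle Hh, h \rangle + \frac{M}{6} (\tau)_+^{3/2} + \frac{M}{4} r \left( \|h\|^2 - \tau \right) \right] \equiv v_l^*.
\end{aligned}$$

Consider now an arbitrary $r \in \mathcal{D}$. Then

$$g = -H h(r) - \frac{M}{2} r h(r).$$

Therefore

$$\begin{aligned}
v_u(h(r)) &= \langle g, h(r) \rangle + \frac{1}{2} \langle H h(r), h(r) \rangle + \frac{M}{6} \|h(r)\|^3 \\
&= -\frac{1}{2} \langle H h(r), h(r) \rangle - \frac{M}{2} r \|h(r)\|^2 + \frac{M}{6} \|h(r)\|^3 \\
&= -\frac{1}{2} \left\langle \left( H + \frac{Mr}{2} I \right) h(r), h(r) \right\rangle - \frac{M}{4} r \|h(r)\|^2 + \frac{M}{6} \|h(r)\|^3 \\
&= v_l(r) + \frac{M}{12} r^3 - \frac{M}{4} r \|h(r)\|^2 + \frac{M}{6} \|h(r)\|^3 \\
&= v_l(r) + \frac{M}{12} (r + 2\|h(r)\|) \cdot (\|h(r)\| - r)^2.
\end{aligned}$$

Thus, relation (5.3) is proved.

Note that

$$v_l'(r) = \frac{M}{4} \left( \|h(r)\|^2 - r^2 \right).$$

Therefore, if the optimal value $v_l^*$ is attained at some $r^* > 0$ from $\mathcal{D}$, then $v_l'(r^*) = 0$ and by (5.3) we conclude that $v_u^* = v_l^*$. If $r^* = \frac{2}{M} (-\lambda_n(H))_+$, then equality (5.2) can be justified by continuity arguments (since $v_u^* \equiv v_u^*(g)$ is a concave function in $g \in \mathbb{R}^n$; see also the discussion below). □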

Note that Proposition 1 follows from the deﬁnition of set D.

Theorem 10 demonstrates that in the non-degenerate situation the solution of problem (5.2) can be found from the one-dimensional equation

$$r = \left\| \left( H + \frac{Mr}{2} I \right)^{-1} g \right\|, \quad r \ge \frac{2}{M} (-\lambda_n(H))_+. \tag{5.4}$$

A technique for solving such equations is very well developed for the needs of trust region methods (see [2], Chapter 7, for exhaustive expositions of the different approaches). As compared with (5.4), the equation arising in trust region schemes has a constant left-hand side. But of course, all possible difficulties in this equation are due to the non-linear (convex) right-hand side.

For completeness of presentation, let us briefly discuss the structure of equation (5.4). In the basis of eigenvectors of matrix $H$ this equation can be written as

$$r^2 = \sum_{i=1}^{n} \frac{\tilde g_i^2}{\left( \lambda_i + \frac{M}{2} r \right)^2}, \quad r \ge \frac{2}{M} (-\lambda_n)_+, \tag{5.5}$$

where $\lambda_i$ are eigenvalues of matrix $H$ and $\tilde g_i$ are coordinates of vector $g$ in the new basis.

If $\tilde g_n \ne 0$, then the solution $r^*$ of equation (5.5) is in the interior of the domain:

$$r > \frac{2}{M} (-\lambda_n)_+,$$

and we can compute the displacement $h(r^*)$ by the explicit expression:

$$h(g; r^*) = -\left( H + \frac{M r^*}{2} I \right)^{-1} g.$$
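For the non-degenerate case, equation (5.5) can be solved by any one-dimensional root finder. The sketch below is one possible illustration, not the technique of [2]: it assumes a dense symmetric `H`, uses NumPy's `eigh` for the eigenbasis, and applies plain bisection to $\|h(r)\| - r$, which is decreasing on the domain of (5.5). The function names are hypothetical, and the degenerate case $\tilde g_n = 0$ is deliberately not handled.

```python
import numpy as np

def solve_cubic_model(g, H, M, iters=200):
    """Minimize <g,h> + 0.5<Hh,h> + (M/6)||h||^3 via equation (5.5).

    Assumes the component of g along the eigenvector of the smallest
    eigenvalue of H is nonzero (the non-degenerate case).
    """
    lam, S = np.linalg.eigh(H)            # ascending order: lam[0] is the smallest eigenvalue
    g_t = S.T @ g                         # coordinates of g in the eigenbasis
    r_min = max(0.0, -2.0 * lam[0] / M)   # left end of the domain in (5.5)

    def step_norm(r):
        return float(np.linalg.norm(g_t / (lam + 0.5 * M * r)))

    lo, hi = r_min, max(1.0, 2.0 * r_min)
    while step_norm(hi) > hi:             # grow hi until ||h(hi)|| <= hi
        hi *= 2.0
    for _ in range(iters):                # bisection on ||h(r)|| - r
        mid = 0.5 * (lo + hi)
        if step_norm(mid) > mid:
            lo = mid
        else:
            hi = mid
    r = hi
    h = -S @ (g_t / (lam + 0.5 * M * r))  # h(r) = -(H + (M r / 2) I)^{-1} g
    return h, r

def model_value(g, H, M, h):
    """Value of the cubic model v_u(h)."""
    return float(g @ h + 0.5 * h @ H @ h + M * np.linalg.norm(h) ** 3 / 6.0)
```

At the returned pair one expects $\|h\| \approx r$, so that by (5.3) the gap between the primal and dual values vanishes up to rounding.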

If $\tilde g_n = 0$ then this formula does not work and we have to consider different cases. In order to avoid all these complications, let us mention the following simple result.

Lemma 9. Let $\tilde g_n = 0$. Define $g(\epsilon) = \tilde g + \epsilon e_n$, where $e_n$ is the $n$th coordinate vector. Denote by $r^*(\epsilon)$ the solution of equation (5.5) with $\tilde g = g(\epsilon)$. Then any limit point of the trajectory

$$h(g(\epsilon); r^*(\epsilon)), \quad \epsilon \to 0,$$

is a global minimum in $h$ of function $v_u(g; h)$.


Proof. Indeed, function $v_u^*(g)$ is concave for $g \in \mathbb{R}^n$. Therefore it is continuous. Hence,

$$v_u^*(g) = \lim_{\epsilon \to 0} v_u^*(g(\epsilon)) = \lim_{\epsilon \to 0} v_u(g(\epsilon); h(g(\epsilon); r^*(\epsilon))).$$

It remains to note that the function $v_u(g; h)$ is continuous in both arguments. □

In order to illustrate the diﬃculties arising in the dual problem, let us look

at an example.

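Example 4. Let $n = 2$ and

$$\tilde g = \begin{pmatrix} -1 \\ 0 \end{pmatrix}, \quad \lambda_1 = 0, \quad \lambda_2 = -1, \quad M = 1.$$

Thus, our primal problem is as follows:

$$\min_{h \in \mathbb{R}^2} \left\{ \psi(h) \equiv -h^{(1)} - \frac{1}{2} \left( h^{(2)} \right)^2 + \frac{1}{6} \left[ \sqrt{\left( h^{(1)} \right)^2 + \left( h^{(2)} \right)^2} \right]^3 \right\}.$$

Following (2.5), we have to solve the following system of non-linear equations:

$$\begin{aligned}
\frac{h^{(1)}}{2} \sqrt{\left( h^{(1)} \right)^2 + \left( h^{(2)} \right)^2} &= 1, \\
\frac{h^{(2)}}{2} \sqrt{\left( h^{(1)} \right)^2 + \left( h^{(2)} \right)^2} &= h^{(2)}.
\end{aligned}$$

Thus, we have three candidate solutions:

$$h_1^* = \begin{pmatrix} \sqrt{2} \\ 0 \end{pmatrix}, \quad h_2^* = \begin{pmatrix} 1 \\ \sqrt{3} \end{pmatrix}, \quad h_3^* = \begin{pmatrix} 1 \\ -\sqrt{3} \end{pmatrix}.$$

By direct substitution we can see that

$$\psi(h_1^*) = -\frac{2\sqrt{2}}{3} > -\frac{7}{6} = \psi(h_2^*) = \psi(h_3^*).$$

Thus, both $h_2^*$ and $h_3^*$ are our global solutions.

Let us look at the dual problem:

$$\sup_r \left[ \phi(r) \equiv -\frac{r^3}{12} - \frac{1}{2} \cdot \frac{1}{0 + \frac{1}{2} r} = -\frac{r^3}{12} - \frac{1}{r} : \; -1 + \frac{1}{2} r > 0 \right].$$

Note that $\phi'(r) = -\frac{r^2}{4} + \frac{1}{r^2}$. Thus, $\phi'(2) = -\frac{3}{4} < 0$ and we conclude that

$$r^* = 2, \quad \phi^* = -\frac{7}{6}.$$

However, $r^*$ does not satisfy the equation $\phi'(r) = 0$ and the object $h(r^*)$ is not defined. □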
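The numbers in Example 4 are easy to confirm by direct evaluation. The short sketch below (function names are hypothetical) evaluates $\psi$ at the three candidates, the residual of the stationarity system, and the dual function $\phi$ with its derivative at $r = 2$.

```python
import math

def psi(h1, h2):
    """Primal objective of Example 4 (M = 1, eigenvalues 0 and -1)."""
    return -h1 - 0.5 * h2 ** 2 + math.hypot(h1, h2) ** 3 / 6.0

def grad_psi(h1, h2):
    """Gradient of psi; it vanishes at the three candidate points."""
    norm = math.hypot(h1, h2)
    return (-1.0 + 0.5 * norm * h1, -h2 + 0.5 * norm * h2)

def phi(r):
    """Dual objective of Example 4, considered for r > 2."""
    return -r ** 3 / 12.0 - 1.0 / r

def dphi(r):
    """Derivative of the dual objective."""
    return -r ** 2 / 4.0 + 1.0 / r ** 2

candidates = {
    "h1": (math.sqrt(2.0), 0.0),
    "h2": (1.0, math.sqrt(3.0)),
    "h3": (1.0, -math.sqrt(3.0)),
}
values = {name: psi(*h) for name, h in candidates.items()}
```

Here `values["h2"]` and `values["h3"]` agree with `phi(2.0)`, while `dphi(2.0)` is negative, so the dual supremum sits on the boundary of the domain, as the example states.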


Let us conclude this section with a precise description of the solution of the primal problem in (5.2) in terms of the eigenvalues of matrix $H$. Denote by $\{s_i\}_{i=1}^n$ an orthonormal basis of eigenvectors of $H$, and let $\hat k$ satisfy the conditions

$$\tilde g^{(i)} \ne 0 \;\text{ for } i < \hat k, \qquad \tilde g^{(i)} = 0 \;\text{ for } i \ge \hat k.$$

Assume that $r^*$ is the solution to the dual problem in (5.2). Then the solution of the primal problem is given by the vector

$$h^* = -\sum_{i=1}^{\hat k - 1} \frac{\tilde g^{(i)} s_i}{\lambda_i + \frac{M}{2} r^*} + \sigma s_n,$$

where $\sigma$ is chosen in accordance with the condition $\|h^*\| = r^*$. Note that this rule works also for $\hat k = 1$ or $\hat k = n + 1$.

We leave the justification of the above rule as an exercise for the reader. As far as we know, a technique for finding $h^*$ without computation of the basis of eigenvectors is not known yet.

5.2. Line search strategies

Let us discuss the possible computational cost of Step 1 in the method (3.3), which consists of finding $M_k \in [L_0, 2L]$ satisfying the inequality:

$$f(T_{M_k}(x_k)) \le \bar f_{M_k}(x_k).$$

Note that for $M_k \ge L$ this inequality holds. Consider now the strategy

$$\textbf{while } f(T_{M_k}(x_k)) > f(x_k) \textbf{ do } M_k := 2 M_k; \quad M_{k+1} := M_k. \tag{5.6}$$

It is clear that if we start the process (3.3) with any $M_0 \in [L_0, 2L]$, then the above procedure, as applied at each iteration of the method, has the following advantages:

– $M_k \le 2L$.

– The total amount of additional computations of the mappings $T_{M_k}(x)$ during the whole process (3.3) is bounded by

$$\log_2 \frac{2L}{L_0}.$$

This amount does not depend on the number of iterations in the main process.

However, it may be that the rule (5.6) is too conservative. Indeed, we can only increase our estimate for the constant $L$ and never come back. This may force the method to take only short steps. A more reasonable strategy looks as follows:

$$\begin{aligned}
&\textbf{while } f(T_{M_k}(x_k)) > f(x_k) \textbf{ do } M_k := 2 M_k; \\
&x_{k+1} := T_{M_k}(x_k); \quad M_{k+1} := \max\{ \tfrac{1}{2} M_k, L_0 \}.
\end{aligned} \tag{5.7}$$


Then it is easy to prove by induction that $N_k$, the total number of computations of mappings $T_M(x)$ made by (5.7) during the first $k$ iterations, is bounded as follows:

$$N_k \le 2k + \log_2 \frac{M_k}{L_0}.$$

Thus, if $N$ is the number of iterations in this process, then we compute at most

$$2N + \log_2 \frac{2L}{L_0}$$

mappings $T_M(x)$. That seems to be a reasonable price to pay for the possibility to go by long steps.
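The counting argument behind the bound on $N_k$ can be traced in a one-dimensional sketch. Several things here are assumptions made for illustration: the problem is scalar, so the cubic step has the closed form $r = (-H + \sqrt{H^2 + 2M|g|})/M$; the acceptance test is the inequality of Step 1, with $\bar f_M(x)$ taken to be the value of the cubic model at its minimizer; and a small tolerance is added to absorb rounding. The function names are hypothetical.

```python
import math

def cubic_step_1d(g, H, M):
    """Global minimizer of g*h + H*h**2/2 + M*|h|**3/6 over real h."""
    r = (-H + math.sqrt(H * H + 2.0 * M * abs(g))) / M
    return -math.copysign(r, g)

def run_adaptive(f, df, d2f, x0, L0, iters):
    """Scalar cubic Newton iterations with the doubling/halving rule (5.7).

    Returns the final point and a list of triples (k, M_k, N_k), where M_k
    is the estimate at the start of iteration k and N_k counts the cubic
    steps computed during the first k iterations.
    """
    x, M, n_maps = x0, L0, 0
    history = []
    for k in range(iters):
        history.append((k, M, n_maps))
        g, H = df(x), d2f(x)
        while True:
            h = cubic_step_1d(g, H, M)
            n_maps += 1
            model = f(x) + g * h + 0.5 * H * h * h + M * abs(h) ** 3 / 6.0
            if f(x + h) <= model + 1e-12:   # Step 1 test with a rounding tolerance
                break
            M *= 2.0                        # estimate too small: double it
        x += h
        M = max(0.5 * M, L0)                # allow the estimate to decrease again
    history.append((iters, M, n_maps))
    return x, history
```

A possible use is `run_adaptive(lambda x: 0.5 * x * x + math.cos(x), lambda x: x - math.sin(x), lambda x: 1.0 - math.cos(x), 3.0, 1.0 / 64.0, 10)`, a test function whose second derivative is Lipschitz with constant $L = 1$; each recorded triple can then be compared with $2k + \log_2 (M_k / L_0)$.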

6. Discussion

Let us compare the results presented in this paper with some known facts on global efficiency of other minimization schemes. Since there are almost no such results for the nonconvex case, let us look at a simple class of convex problems.

Assume that function $f(x)$ is strongly convex on $\mathbb{R}^n$ with convexity parameter $\gamma > 0$ (see (4.8)). In this case it has a unique minimum $x^*$ and condition (4.3) holds for all $x \in \mathbb{R}^n$ (see, for example, [10], Section 2.1.3). Assume also that the Hessian of $f(x)$ is Lipschitz continuous:

$$\|f''(x) - f''(y)\| \le L \|x - y\|, \quad \forall x, y \in \mathbb{R}^n.$$

For such functions, let us obtain the complexity bounds of method (3.3) using

the results of Theorems 4 and 5.

Let us fix some $x_0 \in \mathbb{R}^n$. Denote by $D$ the radius of its level set:

$$D = \max_x \{ \|x - x^*\| : f(x) \le f(x_0) \}.$$

From (4.3) we get

$$D \le \left[ \frac{2}{\gamma} (f(x_0) - f(x^*)) \right]^{1/2}.$$

We will see that it is natural to measure the quality of starting point $x_0$ by the following characteristic:

$$\kappa \equiv \kappa(x_0) = \frac{LD}{\gamma}.$$

Let us introduce three switching values

$$\omega_0 = \frac{\gamma^3}{18 L^2} \equiv \frac{4}{9} \bar\omega, \qquad \omega_1 = \frac{3}{2} \gamma D^2, \qquad \omega_2 = \frac{3}{2} L D^3.$$

In view of Theorem 4, we can reach the level $f(x_0) - f(x^*) \le \frac{1}{2} L D^3$ in one additional iteration. Therefore without loss of generality we assume that

$$f(x_1) - f(x^*) \le \omega_2.$$

Assume also that we are interested in a very high accuracy of the solution. Note that the case $\kappa \le 1$ is very easy since the first iteration of method (3.3) comes very close to the region of superlinear convergence (see Item 2 of Theorem 5).


Consider the case $\kappa \ge 1$. Then $\omega_0 \le \omega_1 \le \omega_2$. Let us estimate the duration of the following phases:

Phase 1: $\omega_1 \le f(x_i) \le \omega_2$,

Phase 2: $\omega_0 \le f(x_i) \le \omega_1$,

Phase 3: $\epsilon \le f(x_i) \le \omega_0$.

In view of Theorem 4, the duration $k_1$ of the first phase is bounded as follows:

$$\omega_1 \le \frac{3 L D^3}{2 \left( 1 + \frac{1}{3} k_1 \right)^2}.$$

Thus, $k_1 \le 3\sqrt{\kappa}$. Further, in view of Item 1 of Theorem 5, we can bound the duration $k_2$ of the second phase:

$$\omega_0^{1/4} \le (f(x_{k_1 + 1}) - f(x^*))^{1/4} - \frac{k_2}{6} \omega_0^{1/4} \le \left( \tfrac{1}{2} \gamma D^2 \right)^{1/4} - \frac{k_2}{6} \omega_0^{1/4}.$$

This gives the following bound: $k_2 \le 3^{3/4} 2^{1/2} \sqrt{\kappa} \le 3.25 \sqrt{\kappa}$.

Finally, denote $\delta_k = \frac{1}{4\omega_0} (f(x_k) - f(x^*))$. In view of inequality (4.5) we have:

$$\delta_{k+1} \le \delta_k^{3/2}, \quad k \ge \bar k \equiv k_1 + k_2 + 1.$$

At the same time $f(x_{\bar k}) - f(x^*) \le \omega_0$. Thus, $\delta_{\bar k} \le \frac{1}{4}$, and the bound on the duration $k_3$ of the last phase can be found from the inequality

$$4^{\left( \frac{3}{2} \right)^{k_3}} \le \frac{4 \omega_0}{\epsilon}.$$

That is, $k_3 \le \log_{\frac{3}{2}} \log_4 \frac{2\gamma^3}{9 \epsilon L^2}$. Putting all bounds together, we obtain that the total number of steps $N$ in (3.3) is bounded as follows:

$$N \le 6.25 \sqrt{\frac{LD}{\gamma}} + \log_{\frac{3}{2}} \left( \log_4 \frac{1}{\epsilon} + \log_4 \frac{2\gamma^3}{9 L^2} \right). \tag{6.1}$$
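To get a feeling for the size of the two terms in (6.1), the bound can simply be evaluated. The sketch below does this and, for orientation only, also evaluates the standard order-of-magnitude expressions for the fast gradient method and the plain gradient method with the hidden constants set to one, which is an assumption. The function names and the parameter `L_hat` (the largest eigenvalue of $f''(x^*)$) are introduced for illustration.

```python
import math

def cubic_newton_bound(L, gamma, D, eps):
    """Right-hand side of (6.1); requires 4*omega_0/eps > 1 so that the
    inner logarithm is positive."""
    kappa = L * D / gamma
    omega0 = gamma ** 3 / (18.0 * L ** 2)
    inner = math.log(4.0 * omega0 / eps, 4)   # log_4(1/eps) + log_4(2 gamma^3 / (9 L^2))
    return 6.25 * math.sqrt(kappa) + math.log(inner, 1.5)

def fast_gradient_order(L_hat, L, gamma, D, eps):
    """sqrt((L_hat + L D)/gamma) * ln((L_hat + L D) D^2 / eps), constant set to 1."""
    q = L_hat + L * D
    return math.sqrt(q / gamma) * math.log(q * D ** 2 / eps)

def gradient_order(L_hat, L, gamma, D, eps):
    """((L_hat + L D)/gamma) * ln((L_hat + L D) D^2 / eps), constant set to 1."""
    q = L_hat + L * D
    return (q / gamma) * math.log(q * D ** 2 / eps)
```

In `cubic_newton_bound` the accuracy `eps` enters only through the doubly logarithmic term, whereas in the two first-order expressions it multiplies the condition-number factor.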

It is interesting that in estimate (6.1) the parameters of our problem interact with accuracy in an additive way. Recall that usually such an interaction is multiplicative. Let us estimate, for example, the complexity of our problem for the so-called "optimal first-order method" for strongly convex functions with Lipschitz continuous gradient (see [10], Section 2.2.1). Denote by $\hat L$ the largest eigenvalue of matrix $f''(x^*)$. Then we can guarantee that

$$\gamma I \le f''(x) \le (\hat L + LD) I \quad \forall x, \; \|x - x^*\| \le D.$$

Thus, the complexity bound for the optimal method is of the order

$$O\left( \sqrt{\frac{\hat L + LD}{\gamma}} \, \ln \frac{(\hat L + LD) D^2}{\epsilon} \right)$$

iterations. For the gradient method it is much worse:

$$O\left( \frac{\hat L + LD}{\gamma} \, \ln \frac{(\hat L + LD) D^2}{\epsilon} \right).$$


Thus, we conclude that the global complexity estimates of the modified Newton scheme (3.3) are incomparably better than the estimates of the gradient schemes. At the same time, we should remember, of course, about the difference in the computational cost of each iteration.

Note that similar bounds can be obtained for other classes of non-convex problems. For example, for nonlinear transformations of convex functions (see Section 4.3), the complexity bound is as follows:

$$N \le 6.25 \sqrt{\frac{\sigma}{\gamma} LD} + \log_{\frac{3}{2}} \left( \log_4 \frac{1}{\epsilon} + \log_4 \frac{2\gamma^3}{9 \sigma^6 L^2} \right). \tag{6.2}$$

To conclude, note that in scheme (3.3) it is possible to ﬁnd elements of

Levenberg-Marquardt approach (see relation (2.7)), or a trust-region approach

(see Theorem 10 and related discussion), or a line-search technique (see the rule

of Step 1 in (3.3)). However, all these facts are consequences of the main idea of

the scheme, that is the choice of the next test point as a global minimizer of the

upper second-order approximation of objective function.

References

1. Bennet A.A.: Newton's method in general analysis. Proc. Nat. Ac. Sci. USA. 2(10), 592–598 (1916)

2. Conn A.R., Gould N.I.M., Toint Ph.L.: Trust Region Methods. SIAM, Philadelphia, 2000.

3. Dennis J.E., Jr., Schnabel R.B.: Numerical Methods for Unconstrained Optimization and Nonlinear Equations. SIAM, Philadelphia, 1996.

4. Fletcher R.: Practical Methods of Optimization, Vol. 1, Unconstrained Minimization. John Wiley, NY, 1980.

5. Goldfeld S., Quandt R., Trotter H.: Maximization by quadratic hill climbing. Econometrica. 34, 541–551 (1966)

6. Kantorovich L.V.: Functional analysis and applied mathematics. Uspehi Matem. Nauk. 3(1), 89–185 (1948), (in Russian). Translated as N.B.S. Report 1509, Washington D.C. (1952)

7. Levenberg K.: A method for the solution of certain problems in least squares. Quart. Appl. Math. 2, 164–168 (1944)

8. Marquardt D.: An algorithm for least-squares estimation of nonlinear parameters. SIAM J. Appl. Math. 11, 431–441 (1963)

9. Nemirovsky A., Yudin D.: Informational complexity and efficient methods for solution of convex extremal problems. Wiley, New York, 1983.

10. Nesterov Yu.: Introductory lectures on convex programming: a basic course. Kluwer, Boston, 2004.

11. Nesterov Yu., Nemirovskii A.: Interior-Point Polynomial Algorithms in Convex Programming. SIAM, Philadelphia, 1994.

12. Ortega J.M., Rheinboldt W.C.: Iterative Solution of Nonlinear Equations in Several Variables. Academic Press, NY, 1970.

13. Polyak B.T.: Gradient methods for minimization of functionals. USSR Comp. Math. Math. Phys. 3(3), 643–653 (1963)

14. Polyak B.T.: Convexity of quadratic transformations and its use in control and optimization. J. Optim. Theory and Appl. 99(3), 553–583 (1998)