A Progressive Batching L-BFGS Method for Machine Learning
Raghu Bollapragada^1, Dheevatsa Mudigere^2, Jorge Nocedal^1, Hao-Jun Michael Shi^1, Ping Tak Peter Tang^3

^1 Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL, USA. ^2 Intel Corporation, Bangalore, India. ^3 Intel Corporation, Santa Clara, CA, USA. Correspondence to: Raghu Bollapragada <raghu.bollapragada@u.northwestern.edu>, Dheevatsa Mudigere <dheevatsa.mudigere@intel.com>, Jorge Nocedal <j-nocedal@northwestern.edu>, Hao-Jun Michael Shi <hjmshi@u.northwestern.edu>, Ping Tak Peter Tang <peter.tang@intel.com>.
Abstract
The standard L-BFGS method relies on gradient approximations that are not dominated by noise, so that search directions are descent directions, the line search is reliable, and quasi-Newton updating yields useful quadratic models of the objective function. All of this appears to call for a full batch approach, but since small batch sizes give rise to faster algorithms with better generalization properties, L-BFGS is currently not considered an algorithm of choice for large-scale machine learning applications. One need not, however, choose between the two extremes represented by the full batch or highly stochastic regimes, and may instead follow a progressive batching approach in which the sample size increases during the course of the optimization. In this paper, we present a new version of the L-BFGS algorithm that combines three basic components (progressive batching, a stochastic line search, and stable quasi-Newton updating) and that performs well on training logistic regression and deep neural networks. We provide supporting convergence theory for the method.
1. Introduction
The L-BFGS method (Liu & Nocedal,1989) has tradition-
ally been regarded as a batch method in the machine learn-
ing community. This is because quasi-Newton algorithms
need gradients of high quality in order to construct useful
quadratic models and perform reliable line searches. These
algorithmic ingredients can be implemented, it seems, only
by using very large batch sizes, resulting in a costly iteration that makes the overall algorithm slow compared with stochastic gradient methods (Robbins & Monro, 1951).
Even before the resurgence of neural networks, many re-
searchers observed that a well-tuned implementation of the
stochastic gradient (SG) method was far more effective on
large-scale logistic regression applications than the batch
L-BFGS method, even when taking into account the advan-
tages of parallelism offered by the use of large batches. The
preeminence of the SG method (and its variants) became
more pronounced with the advent of deep neural networks,
and some researchers have speculated that SG is endowed
with certain regularization properties that are essential in the
minimization of such complex nonconvex functions (Hardt
et al.,2015;Keskar et al.,2016).
In this paper, we postulate that the most efficient algorithms
for machine learning may not reside entirely in the highly
stochastic or full batch regimes, but should employ a pro-
gressive batching approach in which the sample size is ini-
tially small, and is increased as the iteration progresses.
This view is consistent with recent numerical experiments
on training various deep neural networks (Smith et al.,2017;
Goyal et al.,2017), where the SG method, with increas-
ing sample sizes, yields similar test loss and accuracy as
the standard (fixed mini-batch) SG method, while offering
significantly greater opportunities for parallelism.
Progressive batching algorithms have received much atten-
tion recently from a theoretical perspective. It has been
shown that they enjoy complexity bounds that rival those of
the SG method (Byrd et al.,2012), and that they can achieve
a fast rate of convergence (Friedlander & Schmidt,2012).
The main appeal of these methods is that they inherit the
efficient initial behavior of the SG method, offer greater
opportunities to exploit parallelism, and allow for the in-
corporation of second-order information. The latter can be
done efficiently via quasi-Newton updating.
An integral part of quasi-Newton methods is the line search,
which ensures that a convex quadratic model can be con-
structed at every iteration. One challenge that immediately
arises is how to perform this line search when the objec-
tive function is stochastic. This is an issue that has not
received sufficient attention in the literature, where stochas-
tic line searches have been largely dismissed as inappropri-
ate. In this paper, we take a step towards the development
of stochastic line searches for machine learning by study-
ing a key component, namely the initial estimate in the
one-dimensional search. Our approach, which is based on
statistical considerations, is designed for an Armijo-style
backtracking line search.
1.1. Literature Review
Progressive batching (sometimes referred to as dynamic
sampling) has been well studied in the optimization litera-
ture, both for stochastic gradient and subsampled Newton-
type methods (Byrd et al.,2012;Friedlander & Schmidt,
2012;Cartis & Scheinberg,2015;Pasupathy et al.,2015;
Roosta-Khorasani & Mahoney,2016a;b;Bollapragada et al.,
2016;2017;De et al.,2017). Friedlander and Schmidt
(2012) introduced theoretical conditions under which a pro-
gressive batching SG method converges linearly for finite
sum problems, and experimented with a quasi-Newton adap-
tation of their algorithm. Byrd et al. (2012) proposed a
progressive batching strategy, based on a norm test, that de-
termines when to increase the sample size; they established
linear convergence and computational complexity bounds
in the case when the batch size grows geometrically. More
recently, Bollapragada et al. (2017) introduced a batch con-
trol mechanism based on an inner product test that improves
upon the norm test mentioned above.
There has been a renewed interest in understanding the gen-
eralization properties of small-batch and large-batch meth-
ods for training neural networks; see (Keskar et al.,2016;
Dinh et al.,2017;Goyal et al.,2017;Hoffer et al.,2017).
Keskar et al. (2016) empirically observed that large-batch
methods converge to solutions with inferior generalization
properties; however, Goyal et al. (2017) showed that large-
batch methods can match the performance of small-batch
methods when a warm-up strategy is used in conjunction
with scaling the step length by the same factor as the batch
size. Hoffer et al. (2017) and You et al. (2017) also explored
larger batch sizes and steplengths to reduce the number of
updates necessary to train the network. All of these studies
naturally led to an interest in progressive batching tech-
niques. Smith et al. (2017) showed empirically that increas-
ing the sample size and decaying the steplength are quan-
titatively equivalent for the SG method; hence, steplength
schedules could be directly converted to batch size sched-
ules. This approach was parallelized by Devarakonda et al.
(2017). De et al. (2017) presented numerical results with
a progressive batching method that employs the norm test.
Balles et al. (2016) proposed an adaptive dynamic sample size scheme that couples the sample size with the steplength.
Stochastic second-order methods have been explored within
the context of convex and non-convex optimization; see
(Schraudolph et al.,2007;Sohl-Dickstein et al.,2014;
Mokhtari & Ribeiro, 2015; Berahas et al., 2016; Byrd et al., 2016; Keskar & Berahas, 2016; Curtis, 2016; Berahas & Takáč, 2017; Zhou et al., 2017). Schraudolph et al. (2007)
ensured stability of quasi-Newton updating by computing
gradients using the same batch at the beginning and end
of the iteration. Since this can potentially double the cost
of the iteration, Berahas et al. (2016) proposed to achieve
gradient consistency by computing gradients based on the
overlap between consecutive batches; this approach was
further tested by Berahas and Takáč (2017). An interesting
approach introduced by Martens and Grosse (2015;2016)
approximates the Fisher information matrix to scale the
gradient; a distributed implementation of their K-FAC ap-
proach is described in (Ba et al.,2016). Another approach
approximately computes the inverse Hessian by using the
Neumann power series representation of matrices (Krishnan
et al.,2017).
1.2. Contributions
This paper builds upon three algorithmic components that
have recently received attention in the literature — progres-
sive batching, stable quasi-Newton updating, and adaptive
steplength selection. It advances their design and puts them
together in a novel algorithm with attractive theoretical and
computational properties.
The cornerstone of our progressive batching strategy is the
mechanism proposed by Bollapragada et al. (2017) in the
context of first-order methods. We extend their inner prod-
uct control test to second-order algorithms, something that
is delicate and leads to a significant modification of the
original procedure. Another main contribution of the paper
is the design of an Armijo-style backtracking line search
where the initial steplength is chosen based on statistical
information gathered during the course of the iteration. We
show that this steplength procedure is effective on a wide
range of applications, as it leads to well scaled steps and
allows for the BFGS update to be performed most of the
time, even for nonconvex problems. We also test two tech-
niques for ensuring the stability of quasi-Newton updating,
and observe that the overlapping procedure described by
Berahas et al. (2016) is more efficient than a straightforward
adaptation of classical quasi-Newton methods (Schraudolph
et al.,2007).
We report numerical tests on large-scale logistic regression
and deep neural network training tasks that indicate that our
method is robust and efficient, and has good generalization
properties. An additional advantage is that the method re-
quires almost no parameter tuning, which is possible due to
the incorporation of second-order information. All of this
suggests that our approach has the potential to become one
of the leading optimization methods for training deep neural
networks. In order to achieve this, the algorithm must be
optimized for parallel execution, something that was only
briefly explored in this study.
2. A Progressive Batching Quasi-Newton Method

The problem of interest is
$$\min_{x \in \mathbb{R}^d} F(x) = \int f(x; z, y)\, dP(z, y), \qquad (1)$$
where $f$ is the composition of a prediction function (parametrized by $x$) and a loss function, and $(z, y)$ are random input-output pairs with probability distribution $P(z, y)$. The associated empirical risk problem consists of minimizing
$$R(x) = \frac{1}{N} \sum_{i=1}^{N} f(x; z_i, y_i) \triangleq \frac{1}{N} \sum_{i=1}^{N} F_i(x),$$
where we define $F_i(x) = f(x; z_i, y_i)$. A stochastic quasi-Newton method is given by
$$x_{k+1} = x_k - \alpha_k H_k g_k^{S_k}, \qquad (2)$$
where the batch (or subsampled) gradient is given by
$$g_k^{S_k} = \nabla F_{S_k}(x_k) \triangleq \frac{1}{|S_k|} \sum_{i \in S_k} \nabla F_i(x_k), \qquad (3)$$
the set $S_k \subset \{1, 2, \cdots\}$ indexes data points $(y_i, z_i)$ sampled from the distribution $P(z, y)$, and $H_k$ is a positive definite quasi-Newton matrix. We now discuss each of the components of the new method.
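For concreteness, the following is a minimal sketch of one iteration of the subsampled quasi-Newton step (2)-(3); it is not the implementation used in our experiments, and the helper names `grad_fi` (per-example gradient) and `apply_Hk` (application of $H_k$ to a vector) are hypothetical.

```python
# Minimal sketch of iteration (2)-(3); grad_fi(x, i) and apply_Hk(v) are
# hypothetical helpers for the component gradients and the L-BFGS matrix H_k.
def subsampled_gradient(grad_fi, x, sample):
    """Average the component gradients over the index set S_k, as in (3)."""
    return sum(grad_fi(x, i) for i in sample) / len(sample)

def qn_step(x, grad_fi, apply_Hk, sample, alpha):
    """One step x_{k+1} = x_k - alpha_k H_k g_k^{S_k}, as in (2)."""
    g = subsampled_gradient(grad_fi, x, sample)
    return x - alpha * apply_Hk(g)
```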
2.1. Sample Size Selection
The proposed algorithm has the form (29)-(30). Initially, it utilizes a small batch size $|S_k|$, and increases it gradually in order to attain a fast local rate of convergence and permit the use of second-order information. A challenging question is to determine when, and by how much, to increase the batch size $|S_k|$ over the course of the optimization procedure based on observed gradients, as opposed to using prescribed rules that depend on the iteration number $k$.

We propose to build upon the strategy introduced by Bollapragada et al. (2017) in the context of first-order methods. Their inner product test determines a sample size such that the search direction is a descent direction with high probability. A straightforward extension of this strategy to the quasi-Newton setting is not appropriate since requiring only that a stochastic quasi-Newton search direction be a descent direction with high probability would underutilize the curvature information contained in the search direction.

We would like, instead, for the search direction $d_k = -H_k g_k^{S_k}$ to make an acute angle with the true quasi-Newton search direction $-H_k \nabla F(x_k)$, with high probability. Although this does not imply that $d_k$ is a descent direction for $F$, this will normally be the case for any reasonable quasi-Newton matrix.

To derive the new inner product quasi-Newton (IPQN) test, we first observe that the stochastic quasi-Newton search direction makes an acute angle with the true quasi-Newton direction in expectation, i.e.,
$$\mathbb{E}_k\left[(H_k \nabla F(x_k))^T (H_k g_k^{S_k})\right] = \|H_k \nabla F(x_k)\|^2, \qquad (4)$$
where $\mathbb{E}_k$ denotes the conditional expectation at $x_k$. We must, however, control the variance of this quantity to achieve our stated objective. Specifically, we select the sample size $|S_k|$ such that the following condition is satisfied:
$$\mathbb{E}_k\left[\left((H_k \nabla F(x_k))^T (H_k g_k^{S_k}) - \|H_k \nabla F(x_k)\|^2\right)^2\right] \le \theta^2 \|H_k \nabla F(x_k)\|^4, \qquad (5)$$
for some $\theta > 0$. The left hand side of (5) is difficult to compute but can be bounded by the true variance of individual search directions, i.e.,
$$\frac{\mathbb{E}_k\left[\left((H_k \nabla F(x_k))^T (H_k g_k^{i}) - \|H_k \nabla F(x_k)\|^2\right)^2\right]}{|S_k|} \le \theta^2 \|H_k \nabla F(x_k)\|^4, \qquad (6)$$
where $g_k^{i} = \nabla F_i(x_k)$. This test involves the true expected gradient and variance, but we can approximate these quantities with sample gradient and variance estimates, respectively, yielding the practical inner product quasi-Newton test:
$$\frac{\mathrm{Var}_{i \in S_k^v}\left((g_k^{i})^T H_k^2 g_k^{S_k}\right)}{|S_k|} \le \theta^2 \left\|H_k g_k^{S_k}\right\|^4, \qquad (7)$$
where $S_k^v \subseteq S_k$ is a subset of the current sample (batch), and the variance term is defined as
$$\mathrm{Var}_{i \in S_k^v}\left((g_k^{i})^T H_k^2 g_k^{S_k}\right) = \frac{\sum_{i \in S_k^v}\left((g_k^{i})^T H_k^2 g_k^{S_k} - \left\|H_k g_k^{S_k}\right\|^2\right)^2}{|S_k^v| - 1}. \qquad (8)$$
The variance (8) may be computed using just one additional Hessian-vector product of $H_k$ with $H_k g_k^{S_k}$. Whenever condition (7) is not satisfied, we increase the sample size $|S_k|$. In order to estimate the increase that would lead to a satisfaction of (7), we reason as follows. If we assume that the new sample $\bar{S}_k$ is such that $H_k g_k^{S_k} \approx H_k g_k^{\bar{S}_k}$, and similarly for the variance estimate, then a simple computation shows that a lower bound on the new sample size is
$$|\bar{S}_k| \ge \frac{\mathrm{Var}_{i \in S_k^v}\left((g_k^{i})^T H_k^2 g_k^{S_k}\right)}{\theta^2 \left\|H_k g_k^{S_k}\right\|^4} \triangleq b_k. \qquad (9)$$
In our implementation of the algorithm, we set the new sample size as $|S_{k+1}| = \lceil b_k \rceil$. When the sample approximation of $F(x_k)$ is not accurate, which can occur when $|S_k|$ is small, the progressive batching mechanism just described may not be reliable. In this case we employ the moving window technique described in Section 4.2 of Bollapragada et al. (2017) to produce a sample estimate of $\nabla F(x_k)$.
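As an illustration, here is a minimal sketch (under our own naming assumptions, not the authors' code) of how the practical test (7) and the sample-size estimate (9) could be evaluated. The rows of `gi` hold the individual gradients $g_k^i$ for $i \in S_k^v$, `gS` is the batch gradient $g_k^{S_k}$, and `apply_Hk` applies the L-BFGS matrix $H_k$ to a vector.

```python
# Minimal sketch of the practical IPQN test (7) and the sample-size estimate (9).
import numpy as np

def ipqn_batch_size(gi, gS, apply_Hk, batch_size, theta=0.9):
    HgS = apply_Hk(gS)                     # H_k g_k^{S_k}
    H2gS = apply_Hk(HgS)                   # H_k^2 g_k^{S_k}: the one extra product
    proj = gi @ H2gS                       # (g_k^i)^T H_k^2 g_k^{S_k}, i in S_k^v
    var = np.sum((proj - HgS @ HgS) ** 2) / (len(proj) - 1)   # sample variance (8)
    rhs = theta ** 2 * np.linalg.norm(HgS) ** 4
    if var / batch_size <= rhs:            # condition (7) holds: keep |S_k|
        return batch_size
    return int(np.ceil(var / rhs))         # otherwise grow to ceil(b_k), as in (9)
```

Only one extra application of $H_k$ (to form $H_k^2 g_k^{S_k}$) is required beyond the search direction computation, in line with the remark after (8).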
2.2. The Line Search
In deterministic optimization, line searches are employed to ensure that the step is not too short and to guarantee sufficient decrease in the objective function. Line searches are particularly important in quasi-Newton methods since they ensure robustness and efficiency of the iteration with little additional cost.

In contrast, stochastic line searches are poorly understood and rarely employed in practice because they must make decisions based on sample function values
$$F_{S_k}(x) = \frac{1}{|S_k|} \sum_{i \in S_k} F_i(x), \qquad (10)$$
which are noisy approximations to the true objective $F$.

One of the key questions in the design of a stochastic line search is how to ensure, with high probability, that there is a decrease in the true function when one can only observe stochastic approximations $F_{S_k}(x)$. We address this question by proposing a formula for the step size $\alpha_k$ that controls possible increases in the true function. Specifically, the first trial steplength in the stochastic backtracking line search is computed so that the predicted decrease in the expected function value is sufficiently large, as we now explain.

Using Lipschitz continuity of $\nabla F(x)$ and taking conditional expectation, we can show the following inequality
$$\mathbb{E}_k[F_{k+1}] \le F_k - \alpha_k \nabla F(x_k)^T H_k^{1/2} W_k H_k^{1/2} \nabla F(x_k), \qquad (11)$$
where
$$W_k = I - \frac{L\alpha_k}{2}\left(1 + \frac{\mathrm{Var}\{H_k g_k^{i}\}}{|S_k|\,\|H_k \nabla F(x_k)\|^2}\right) H_k, \qquad \mathrm{Var}\{H_k g_k^{i}\} = \mathbb{E}_k\left[\|H_k g_k^{i} - H_k \nabla F(x_k)\|^2\right],$$
$F_k = F(x_k)$, and $L$ is the Lipschitz constant. The proof of (A.1) is given in the supplement.

The only difference in (A.1) between the deterministic and stochastic quasi-Newton methods is the additional variance term in the matrix $W_k$. To obtain decrease in the function value in the deterministic case, the matrix $I - \frac{L\alpha_k}{2} H_k$ must be positive definite, whereas in the stochastic case the matrix $W_k$ must be positive definite to yield a decrease in $F$ in expectation. In the deterministic case, for a reasonably good quasi-Newton matrix $H_k$, one expects that $\alpha_k = 1$ will result in a decrease in the function, and therefore the initial trial steplength parameter should be chosen to be 1. In the stochastic case, the initial trial value
$$\hat{\alpha}_k = \left(1 + \frac{\mathrm{Var}\{H_k g_k^{i}\}}{|S_k|\,\|H_k \nabla F(x_k)\|^2}\right)^{-1} \qquad (12)$$
will result in decrease in the expected function value. However, since formula (12) involves the expensive computation of the individual matrix-vector products $H_k g_k^{i}$, we approximate the variance-bias ratio as follows:
$$\bar{\alpha}_k = \left(1 + \frac{\mathrm{Var}\{g_k^{i}\}}{|S_k|\,\|\nabla F(x_k)\|^2}\right)^{-1}, \qquad (13)$$
where $\mathrm{Var}\{g_k^{i}\} = \mathbb{E}_k\left[\|g_k^{i} - \nabla F(x_k)\|^2\right]$. In our practical implementation, we estimate the population variance and gradient with the sample variance and gradient, respectively, yielding the initial steplength
$$\alpha_k = \left(1 + \frac{\mathrm{Var}_{i \in S_k^v}\{g_k^{i}\}}{|S_k|\,\left\|g_k^{S_k}\right\|^2}\right)^{-1}, \qquad (14)$$
where
$$\mathrm{Var}_{i \in S_k^v}\{g_k^{i}\} = \frac{1}{|S_k^v| - 1}\sum_{i \in S_k^v}\left\|g_k^{i} - g_k^{S_k}\right\|^2 \qquad (15)$$
and $S_k^v \subseteq S_k$. With this initial value of $\alpha_k$ in hand, our algorithm performs a backtracking line search that aims to satisfy the Armijo condition
$$F_{S_k}(x_k - \alpha_k H_k g_k^{S_k}) \le F_{S_k}(x_k) - c_1 \alpha_k (g_k^{S_k})^T H_k g_k^{S_k}, \qquad (16)$$
where $c_1 > 0$.
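To make the procedure concrete, here is a minimal sketch (with hypothetical helper names, not the implementation used in our experiments) of the initial steplength (14)-(15) and of the backtracking loop enforcing the stochastic Armijo condition (16). The rows of `gi` hold the individual gradients over $S_k^v$, `gS` is the batch gradient, and `f_batch` evaluates $F_{S_k}$.

```python
# Minimal sketch of the initial steplength (14)-(15) and the backtracking loop (16).
import numpy as np

def initial_steplength(gi, gS, batch_size):
    var = np.sum(np.linalg.norm(gi - gS, axis=1) ** 2) / (len(gi) - 1)   # (15)
    return 1.0 / (1.0 + var / (batch_size * np.linalg.norm(gS) ** 2))    # (14)

def backtracking(x, f_batch, gS, p, alpha, c1=1e-4):
    # With p = -H_k g_k^{S_k}, note (g_k^{S_k})^T H_k g_k^{S_k} = -gS . p,
    # so condition (16) reads f(x + alpha p) <= f(x) + c1 * alpha * (gS . p).
    f0, slope = f_batch(x), gS @ p
    while f_batch(x + alpha * p) > f0 + c1 * alpha * slope:
        alpha *= 0.5
    return alpha
```

In the experiments reported in Section 4, the initial value (14) is accepted by (16) most of the time, so the while loop rarely executes.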
2.3. Stable Quasi-Newton Updates
In the BFGS and L-BFGS methods, the inverse Hessian approximation is updated using the formula
$$H_{k+1} = V_k^T H_k V_k + \rho_k s_k s_k^T, \qquad \rho_k = (y_k^T s_k)^{-1}, \qquad V_k = I - \rho_k y_k s_k^T, \qquad (17)$$
where $s_k = x_{k+1} - x_k$ and $y_k$ is the difference in the gradients at $x_{k+1}$ and $x_k$. When the batch changes from one iteration to the next ($S_{k+1} \neq S_k$), it is not obvious how $y_k$ should be defined. It has been observed that when $y_k$ is computed using different samples, the updating process may be unstable, and hence it seems natural to use the same sample at the beginning and at the end of the iteration (Schraudolph et al., 2007), and define
$$y_k = g_{k+1}^{S_k} - g_k^{S_k}. \qquad (18)$$
However, this requires that the gradient be evaluated twice for every batch $S_k$: at $x_k$ and at $x_{k+1}$. To avoid this additional cost, Berahas et al. (2016) propose to use the overlap between consecutive samples in the gradient differencing. If we denote this overlap as $O_k = S_k \cap S_{k+1}$, then one defines
$$y_k = g_{k+1}^{O_k} - g_k^{O_k}. \qquad (19)$$
This requires no extra computation since the two gradients in this expression are subsets of the gradients corresponding to the samples $S_k$ and $S_{k+1}$. The overlap should not be too small to avoid differencing noise, but this is easily achieved in practice. We test both formulas for $y_k$ in our implementation of the method; see Section 4.
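A minimal sketch of the multi-batch curvature pair (19) follows; `grad_fi(x, i)` is a hypothetical per-example gradient routine and `overlap` is the index set $O_k = S_k \cap S_{k+1}$.

```python
# Minimal sketch of the overlap-based curvature pair (19).
def overlap_curvature_pair(x_k, x_k1, overlap, grad_fi):
    """Return (s_k, y_k) with y_k = g^{O_k}(x_{k+1}) - g^{O_k}(x_k)."""
    g_old = sum(grad_fi(x_k, i) for i in overlap) / len(overlap)
    g_new = sum(grad_fi(x_k1, i) for i in overlap) / len(overlap)
    return x_k1 - x_k, g_new - g_old
```

Both $g_k^{O_k}$ and $g_{k+1}^{O_k}$ are subsets of gradients that the method computes anyway for $S_k$ and $S_{k+1}$, so in an actual implementation they would be reused rather than recomputed as above.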
2.4. The Complete Algorithm
The pseudocode of the progressive batching L-BFGS method is given in Algorithm 1. Observe that the limited memory Hessian approximation $H_k$ in Line 8 is independent of the choice of the sample $S_k$. Specifically, $H_k$ is defined by a collection of curvature pairs $\{(s_j, y_j)\}$, where the most recent pair is based on the sample $S_{k-1}$; see Line 14. For the batch size control test (7), we choose $\theta = 0.9$ in the logistic regression experiments, and $\theta$ is a tunable parameter chosen in the interval $[0.9, 3]$ in the neural network experiments. The constant $c_1$ in (16) is set to $c_1 = 10^{-4}$. For L-BFGS, we set the memory as $m = 10$. We skip the quasi-Newton update if the following curvature condition is not satisfied:
$$y_k^T s_k > \epsilon \|s_k\|^2, \quad \text{with } \epsilon = 10^{-2}. \qquad (20)$$
The initial Hessian matrix $H_k^0$ in the L-BFGS recursion at each iteration is chosen as $\gamma_k I$, where $\gamma_k = y_k^T s_k / y_k^T y_k$.
3. Convergence Analysis
We now present convergence results for the proposed algorithm, both for strongly convex and nonconvex objective functions. Our emphasis is in analyzing the effect of progressive sampling, and therefore, we follow common practice and assume that the steplength in the algorithm is fixed ($\alpha_k = \alpha$), and that the inverse L-BFGS matrix $H_k$ has bounded eigenvalues, i.e.,
$$\Lambda_1 I \preceq H_k \preceq \Lambda_2 I. \qquad (21)$$
Algorithm 1 Progressive Batching L-BFGS Method

Input: Initial iterate $x_0$, initial sample size $|S_0|$
Initialization: Set $k \leftarrow 0$
Repeat until convergence:
 1: Sample $S_k \subseteq \{1, \cdots, N\}$ with sample size $|S_k|$
 2: if condition (7) is not satisfied then
 3:   Compute $b_k$ using (9), and set $\hat{b}_k \leftarrow \lceil b_k \rceil - |S_k|$
 4:   Sample $S_+ \subseteq \{1, \cdots, N\} \setminus S_k$ with $|S_+| = \hat{b}_k$
 5:   Set $S_k \leftarrow S_k \cup S_+$
 6: end if
 7: Compute $g_k^{S_k}$
 8: Compute $p_k = -H_k g_k^{S_k}$ using the L-BFGS two-loop recursion in (Nocedal & Wright, 1999)
 9: Compute $\alpha_k$ using (14)
10: while the Armijo condition (16) is not satisfied do
11:   Set $\alpha_k = \alpha_k / 2$
12: end while
13: Compute $x_{k+1} = x_k + \alpha_k p_k$
14: Compute $y_k$ using (18) or (19)
15: Compute $s_k = x_{k+1} - x_k$
16: if $y_k^T s_k > \epsilon \|s_k\|^2$ then
17:   if the number of stored pairs $(y_j, s_j)$ exceeds $m$ then
18:     Discard the oldest curvature pair $(y_j, s_j)$
19:   end if
20:   Store the new curvature pair $(y_k, s_k)$
21: end if
22: Set $k \leftarrow k + 1$
23: Set $|S_k| = |S_{k-1}|$
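For reference, the following is a minimal sketch of the standard L-BFGS two-loop recursion invoked in Line 8, with the initial matrix $H_k^0 = \gamma_k I$ described in Section 2.4. The `pairs` container (a list of the $m$ most recent $(s_j, y_j)$ pairs, oldest first) is a layout assumption, not the authors' data structure.

```python
# Minimal sketch of the L-BFGS two-loop recursion used in Line 8 of Algorithm 1.
import numpy as np

def lbfgs_direction(g, pairs):
    """Return p = -H_k g for the L-BFGS matrix defined by the stored (s, y) pairs."""
    q, alphas = g.copy(), []
    for s, y in reversed(pairs):               # first loop: newest to oldest pair
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        q -= a * y
        alphas.append((rho, a))
    if pairs:                                   # H_k^0 = gamma_k I, gamma_k = y^T s / y^T y
        s, y = pairs[-1]
        q *= (y @ s) / (y @ y)
    for (s, y), (rho, a) in zip(pairs, reversed(alphas)):   # second loop: oldest to newest
        b = rho * (y @ q)
        q += (a - b) * s
    return -q
```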
This assumption can be justified both in the convex and nonconvex cases under certain conditions; see (Berahas et al., 2016). We assume that the sample size is controlled by the exact inner product quasi-Newton test (31). This test is designed for efficiency, and in rare situations could allow for the generation of arbitrarily long search directions. To prevent this from happening, we introduce an additional control on the sample size $|S_k|$, by extending (to the quasi-Newton setting) the orthogonality test introduced in (Bollapragada et al., 2017). This additional requirement states that the current sample size $|S_k|$ is acceptable only if
$$\frac{\mathbb{E}_k\left[\left\|H_k g_k^{i} - \frac{(H_k g_k^{i})^T (H_k \nabla F(x_k))}{\|H_k \nabla F(x_k)\|^2} H_k \nabla F(x_k)\right\|^2\right]}{|S_k|} \le \nu^2 \|H_k \nabla F(x_k)\|^2, \qquad (22)$$
for some given $\nu > 0$.
We now establish linear convergence when the objective is strongly convex.

Theorem 3.1. Suppose that $F$ is twice continuously differentiable and that there exist constants $0 < \mu \le L$ such that
$$\mu I \preceq \nabla^2 F(x) \preceq L I, \quad \forall x \in \mathbb{R}^d. \qquad (23)$$
Let $\{x_k\}$ be generated by iteration (29), for any $x_0$, where $|S_k|$ is chosen by the (exact variance) inner product quasi-Newton test (31). Suppose that the orthogonality condition (32) holds at every iteration, and that the matrices $H_k$ satisfy (B.2). Then, if
$$\alpha_k = \alpha \le \frac{1}{(1 + \theta^2 + \nu^2) L \Lambda_2}, \qquad (24)$$
we have that
$$\mathbb{E}[F(x_k) - F(x^*)] \le \rho^k\left(F(x_0) - F(x^*)\right), \qquad (25)$$
where $x^*$ denotes the minimizer of $F$, $\rho = 1 - \mu \Lambda_1 \alpha$, and $\mathbb{E}$ denotes the total expectation.
The proof of this result is given in the supplement. We now consider the case when $F$ is nonconvex and bounded below.

Theorem 3.2. Suppose that $F$ is twice continuously differentiable and bounded below, and that there exists a constant $L > 0$ such that
$$\nabla^2 F(x) \preceq L I, \quad \forall x \in \mathbb{R}^d. \qquad (26)$$
Let $\{x_k\}$ be generated by iteration (29), for any $x_0$, where $|S_k|$ is chosen so that (31) and (32) are satisfied, and suppose that (B.2) holds. Then, if $\alpha_k$ satisfies (36), we have
$$\lim_{k \to \infty} \mathbb{E}[\|\nabla F(x_k)\|^2] = 0. \qquad (27)$$
Moreover, for any positive integer $T$ we have that
$$\min_{0 \le k \le T-1} \mathbb{E}[\|\nabla F(x_k)\|^2] \le \frac{2}{\alpha T \Lambda_1}\left(F(x_0) - F_{\min}\right),$$
where $F_{\min}$ is a lower bound on $F$ in $\mathbb{R}^d$.

The proof is given in the supplement. This result shows that the sequence of gradients $\{\|\nabla F(x_k)\|\}$ converges to zero in expectation, and establishes a global sublinear rate of convergence of the smallest gradients generated after every $T$ steps.
4. Numerical Results
In this section, we present numerical results for the proposed algorithm, which we refer to as PBQN for the Progressive Batching Quasi-Newton algorithm.
4.1. Experiments on Logistic Regression Problems
We first test our algorithm on binary classification problems where the objective function is given by the logistic loss with $\ell_2$ regularization:
$$R(x) = \frac{1}{N}\sum_{i=1}^{N}\log\left(1 + \exp(-z_i x^T y_i)\right) + \frac{\lambda}{2}\|x\|^2, \qquad (28)$$
with $\lambda = 1/N$. We consider the 8 datasets listed in the supplement. An approximation $R^*$ of the optimal function value is computed for each problem by running the full batch L-BFGS method until $\|\nabla R(x_k)\|_\infty \le 10^{-8}$. Training error is defined as $R(x_k) - R^*$, where $R(x_k)$ is evaluated over the training set; test loss is evaluated over the test set without the $\ell_2$ regularization term.
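As a small illustration, the following sketch evaluates the regularized logistic loss (28) and its gradient with $\lambda = 1/N$. Following the notation in (28), the rows of `Y` are the feature vectors $y_i$ and `z` holds the labels $z_i \in \{-1, +1\}$; the variable names themselves are our own.

```python
# Minimal sketch of the regularized logistic loss (28) and its gradient.
import numpy as np

def logistic_loss_and_grad(x, Y, z):
    """R(x) from (28) with lambda = 1/N; Y is N x d, z holds labels in {-1, +1}."""
    N = Y.shape[0]
    lam = 1.0 / N
    margins = z * (Y @ x)                                    # z_i * x^T y_i
    loss = np.mean(np.log1p(np.exp(-margins))) + 0.5 * lam * (x @ x)
    # d/dx log(1 + exp(-m_i)) = -z_i y_i / (1 + exp(m_i))
    grad = -(Y.T @ (z / (1.0 + np.exp(margins)))) / N + lam * x
    return loss, grad
```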
We tested two options for computing the curvature vector $y_k$ in the PBQN method: the multi-batch (MB) approach (19) with 25% sample overlap, and the full overlap (FO) approach (18). We set $\theta = 0.9$ in (7), chose $|S_0| = 512$, and set all other parameters to the default values given in Section 2. Thus, none of the parameters in our PBQN method were tuned for each individual dataset. We compared our algorithm against two other methods: (i) Stochastic gradient (SG) with a batch size of 1; (ii) SVRG (Johnson & Zhang, 2013) with the inner loop length set to $N$. The steplength for SG and SVRG is constant and tuned for each problem ($\alpha_k \equiv \alpha = 2^j$, for $j \in \{-10, -9, \ldots, 9, 10\}$) so as to give best performance.
In Figures 1 and 2 we present results for two datasets, spam and covertype; the rest of the results are given in the supplement. The horizontal axis measures the number of full gradient evaluations, or equivalently, the number of times that $N$ component gradients $\nabla F_i$ were evaluated. The left-most figure reports the long term trend over 100 gradient evaluations, while the rest of the figures zoom into the first 10 gradient evaluations to show the initial behavior of the methods. The vertical axis measures training error, test loss, and test accuracy, respectively, from left to right.
The proposed algorithm competes well for these two datasets in terms of training error, test loss and test accuracy, and decreases these measures more evenly than the SG and SVRG methods. Our numerical experience indicates that formula (14) is quite effective at estimating the steplength parameter, as it is accepted by the backtracking line search for most iterations. As a result, the line search computes very few additional function values.

It is interesting to note that SVRG is not as efficient in the initial epochs compared to PBQN or SG, when measured either in terms of test loss or test accuracy. The training error for SVRG decreases rapidly in later epochs, but this rapid improvement is not observed in the test loss and accuracy. Neither PBQN nor SVRG significantly outperforms the other across all datasets tested in terms of training error, as observed in the supplement.

Our results indicate that defining the curvature vector using the MB approach is preferable to using the FO approach. The number of iterations required by the PBQN method is significantly smaller compared to the SG method, suggesting the potential efficiency gains of a parallel implementation of our algorithm.
Figure 1. spam dataset: Performance of the progressive batching L-BFGS method (PBQN), with multi-batch (25% overlap) and full-overlap approaches, and the SG and SVRG methods. [Plots of training error, test loss, and test accuracy against gradient evaluations.]

Figure 2. covertype dataset: Performance of the progressive batching L-BFGS methods, with multi-batch (25% overlap) and full-overlap approaches, and the SG and SVRG methods. [Plots of training error, test loss, and test accuracy against gradient evaluations.]
4.2. Results on Neural Networks
We have performed a preliminary investigation into the
performance of the PBQN algorithm for training neural
networks. As is well-known, the resulting optimization
problems are quite difficult due to the existence of local
minimizers, some of which generalize poorly. Thus our first
requirement when applying the PBQN method was to obtain
as good generalization as SG, something we have achieved.
Our investigation into how to obtain fast performance is,
however, still underway for reasons discussed below. Never-
theless, our results are worth reporting because they show
that our line search procedure is performing as expected, and
that the overall number of iterations required by the PBQN
method is small enough so that a parallel implementation
could yield state-of-the-art results, based on the theoretical
performance model detailed in the supplement.
We compared our algorithm, as described in Section 2, against SG and Adam (Kingma & Ba, 2014). It has taken many years to design regularization techniques and heuristics that greatly improve the performance of the SG method for deep learning (Srivastava et al., 2014; Ioffe & Szegedy, 2015). These include batch normalization and dropout, which (in their current form) are not conducive to the PBQN approach due to the need for gradient consistency when evaluating the curvature pairs in L-BFGS. Therefore, we do not implement batch normalization and dropout in any of the methods tested, and leave the study of their extension to the PBQN setting for future work.
We consider three network architectures: (i) a small convolutional neural network on CIFAR-10 (C) (Krizhevsky, 2009), (ii) an AlexNet-like convolutional network on MNIST and CIFAR-10 (A1 and A2, respectively) (LeCun et al., 1998; Krizhevsky et al., 2012), and (iii) a residual network (ResNet18) on CIFAR-10 (R) (He et al., 2016). The network architecture details and additional plots are given in the supplement. All of these networks were implemented in PyTorch (Paszke et al., 2017). The results for the CIFAR-10 AlexNet and CIFAR-10 ResNet18 are given in Figures 3 and 4, respectively. We report results both against the total number of iterations and the total number of gradient evaluations. Table 1 shows the best test accuracies attained by each of the four methods over the various networks.
In all our experiments, we initialize the batch size as $|S_0| = 512$ in the PBQN method, and fix the batch size to $|S_k| = 128$ for SG and Adam. The parameter $\theta$ given in (7), which controls the batch size increase in the PBQN method, was tuned lightly by choosing among the 3 values 0.9, 2, and 3. SG and Adam are tuned using a development-based decay (dev-decay) scheme, which tracks the best validation loss at each epoch and reduces the steplength by a constant factor $\delta$ if the validation loss does not improve after $e$ epochs.
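A minimal sketch of such a dev-decay schedule is shown below; the class name and the default values of $\delta$ and the patience $e$ are our own assumptions, since the text only describes the rule.

```python
# Minimal sketch of a dev-decay steplength schedule: track the best validation
# loss and multiply the steplength by delta after e epochs without improvement.
class DevDecay:
    def __init__(self, alpha, delta=0.5, patience=1):
        self.alpha, self.delta, self.patience = alpha, delta, patience
        self.best, self.bad_epochs = float("inf"), 0

    def step(self, val_loss):
        """Call once per epoch with the validation loss; returns the steplength."""
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.alpha *= self.delta
                self.bad_epochs = 0
        return self.alpha
```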
We observe from our results that the PBQN method achieves a similar test accuracy as SG and Adam, but requires more gradient evaluations. Improvements in performance can be obtained by ensuring that the PBQN method exerts a finer control on the sample size in the small batch regime, something that requires further investigation. Nevertheless, the small number of iterations required by the PBQN method, together with the fact that it employs larger batch sizes than SG during much of the run, suggests that a distributed version similar to a data-parallel distributed implementation of the SG method (Chen et al., 2016; Das et al., 2016) would lead to a highly competitive method.
Figure 3. CIFAR-10 AlexNet (A2): Performance of the progressive batching L-BFGS methods, with multi-batch (25% overlap) and full-overlap approaches, and the SG and Adam methods. The best results for L-BFGS are achieved with $\theta = 0.9$. [Plots of training loss, test loss, and test accuracy against iterations, and test accuracy against gradient evaluations.]

Figure 4. CIFAR-10 ResNet18 (R): Performance of the progressive batching L-BFGS methods, with multi-batch (25% overlap) and full-overlap approaches, and the SG and Adam methods. The best results for L-BFGS are achieved with $\theta = 2$. [Plots of training loss, test loss, and test accuracy against iterations, and test accuracy against gradient evaluations.]

Table 1. Best test accuracy performance of SG, Adam, multi-batch L-BFGS, and full overlap L-BFGS on various networks over 5 different runs and initializations.

Network   SG      Adam    MB      FO
C         66.24   67.03   67.37   62.46
A1        99.25   99.34   99.16   99.05
A2        73.46   73.59   73.02   72.74
R         69.5    70.16   70.28   69.44
Similar to the logistic regression case, we observe that the steplength computed via (14) is almost always accepted by the Armijo condition, and typically lies within $(0.1, 1)$. Once the algorithm has trained for a significant number of iterations using the full batch, the algorithm begins to overfit on the training set, resulting in worsened test loss and accuracy, as observed in the graphs.
5. Final Remarks
Several types of quasi-Newton methods have been proposed in the literature to address the challenges arising in machine learning. Some of these methods operate in the purely stochastic setting (which makes quasi-Newton updating difficult) or in the purely batch regime (which leads to generalization problems). We believe that progressive batching is the right context for designing an L-BFGS method that has good generalization properties, does not expose any free parameters, and has fast convergence. The advantages of our approach are clearly seen in logistic regression experiments.

To make the new method competitive with SG and Adam for deep learning, we need to improve several of its components. This includes the design of a more robust progressive batching mechanism, the redesign of batch normalization and dropout heuristics to improve the generalization performance of our method for training larger networks, and most importantly, the design of a parallelized implementation that takes advantage of the higher granularity of each iteration. We believe that the potential of the proposed approach as an alternative to SG for deep learning is worthy of further investigation.
Acknowledgements
We thank Albert Berahas for his insightful comments re-
garding multi-batch L-BFGS and probabilistic line searches,
as well as for his useful feedback on earlier versions of
the manuscript. We also thank the anonymous reviewers
for their useful feedback. Bollapragada is supported by
DOE award DE-FG02-87ER25047. Nocedal is supported
by NSF award DMS-1620070. Shi is supported by Intel
grant SP0036122.
References
Ba, J., Grosse, R., and Martens, J. Distributed second-order optimization using Kronecker-factored approximations. 2016.
Balles, L., Romero, J., and Hennig, P. Coupling adap-
tive batch sizes with learning rates. arXiv preprint
arXiv:1612.05086, 2016.
Berahas, A. S. and Takáč, M. A robust multi-batch L-BFGS method for machine learning. arXiv preprint arXiv:1707.08552, 2017.
Berahas, A. S., Nocedal, J., and Takáč, M. A multi-batch L-BFGS method for machine learning. In Advances in Neural Information Processing Systems, pp. 1055–1063, 2016.
Bertsekas, D. P., Nedić, A., and Ozdaglar, A. E. Convex analysis and optimization. Athena Scientific, Belmont, 2003.
Bollapragada, R., Byrd, R., and Nocedal, J. Exact and
inexact subsampled Newton methods for optimization.
arXiv preprint arXiv:1609.08502, 2016.
Bollapragada, R., Byrd, R., and Nocedal, J. Adaptive sam-
pling strategies for stochastic optimization. arXiv preprint
arXiv:1710.11258, 2017.
Byrd, R. H., Chin, G. M., Nocedal, J., and Wu, Y. Sam-
ple size selection in optimization methods for machine
learning. Mathematical Programming, 134(1):127–155,
2012.
Byrd, R. H., Hansen, S. L., Nocedal, J., and Singer, Y. A
stochastic quasi-Newton method for large-scale optimiza-
tion. SIAM Journal on Optimization, 26(2):1008–1031,
2016.
Carbonetto, P. New probabilistic inference algorithms that
harness the strengths of variational and Monte Carlo
methods. PhD thesis, University of British Columbia,
2009.
Cartis, C. and Scheinberg, K. Global convergence rate
analysis of unconstrained optimization methods based on
probabilistic models. Mathematical Programming, pp.
1–39, 2015.
Chang, C. and Lin, C. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Chen, J., Monga, R., Bengio, S., and Jozefowicz, R. Re-
visiting distributed synchronous sgd. arXiv preprint
arXiv:1604.00981, 2016.
Cormack, G. and Lynam, T. Spam corpus creation for TREC.
In Proc. 2nd Conference on Email and Anti-Spam, 2005.
http://plg.uwaterloo.ca/gvcormac/treccorpus.
Curtis, F. A self-correcting variable-metric algorithm for
stochastic optimization. In International Conference on
Machine Learning, pp. 632–641, 2016.
Das, D., Avancha, S., Mudigere, D., Vaidynathan, K., Srid-
haran, S., Kalamkar, D., Kaul, B., and Dubey, P. Dis-
tributed deep learning using synchronous stochastic gra-
dient descent. arXiv preprint arXiv:1602.06709, 2016.
De, S., Yadav, A., Jacobs, D., and Goldstein, T. Automated
inference with adaptive batches. In Artificial Intelligence
and Statistics, pp. 1504–1513, 2017.
Devarakonda, A., Naumov, M., and Garland, M. Adabatch:
Adaptive batch sizes for training deep neural networks.
arXiv preprint arXiv:1712.02029, 2017.
Dinh, L., Pascanu, R., Bengio, S., and Bengio, Y. Sharp
minima can generalize for deep nets. arXiv preprint
arXiv:1703.04933, 2017.
Friedlander, M. P. and Schmidt, M. Hybrid deterministic-
stochastic methods for data fitting. SIAM Journal on
Scientific Computing, 34(3):A1380–A1405, 2012.
Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
Grosse, R. and Martens, J. A kronecker-factored approxi-
mate fisher matrix for convolution layers. In International
Conference on Machine Learning, pp. 573–582, 2016.
Guyon, I., Aliferis, C. F., Cooper, G. F., Elisseeff, A., Pellet,
J., Spirtes, P., and Statnikov, A. R. Design and analy-
sis of the causation and prediction challenge. In WCCI
Causation and Prediction Challenge, pp. 1–33, 2008.
Hardt, M., Recht, B., and Singer, Y. Train faster, generalize
better: Stability of stochastic gradient descent. arXiv
preprint arXiv:1509.01240, 2015.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learn-
ing for image recognition. In Proceedings of the IEEE
conference on computer vision and pattern recognition,
pp. 770–778, 2016.
Hoffer, E., Hubara, I., and Soudry, D. Train longer,
generalize better: closing the generalization gap in
large batch training of neural networks. arXiv preprint
arXiv:1705.08741, 2017.
Ioffe, S. and Szegedy, C. Batch normalization: Accelerating
deep network training by reducing internal covariate shift.
In International conference on machine learning, pp. 448–
456, 2015.
Johnson, R. and Zhang, T. Accelerating stochastic gradient
descent using predictive variance reduction. In Advances
in Neural Information Processing Systems 26, pp. 315–
323, 2013.
Keskar, N. S. and Berahas, A. S. adaqn: An adaptive quasi-
newton algorithm for training rnns. In Joint European
Conference on Machine Learning and Knowledge Dis-
covery in Databases, pp. 1–16. Springer, 2016.
Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy,
M., and Tang, P. T. P. On large-batch training for deep
learning: Generalization gap and sharp minima. arXiv
preprint arXiv:1609.04836, 2016.
Kingma, D. and Ba, J. Adam: A method for stochastic
optimization. arXiv preprint arXiv:1412.6980, 2014.
Krishnan, S., Xiao, Y., and Saurous, R. A. Neumann opti-
mizer: A practical optimization algorithm for deep neural
networks. arXiv preprint arXiv:1712.03298, 2017.
Krizhevsky, A. Learning multiple layers of features from
tiny images. 2009.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet
classification with deep convolutional neural networks.
In Advances in Neural Information Processing Systems,
pp. 1097–1105, 2012.
Kurth, T., Zhang, J., Satish, N., Racah, E., Mitliagkas, I.,
Patwary, M. M. A., Malas, T., Sundaram, N., Bhimji, W.,
Smorkalov, M., et al. Deep learning at 15pf: Supervised
and semi-supervised classification for scientific data. In
Proceedings of the International Conference for High Per-
formance Computing, Networking, Storage and Analysis,
pp. 7. ACM, 2017.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-
based learning applied to document recognition. Proceed-
ings of the IEEE, 86(11):2278–2324, 1998.
Liu, D. C. and Nocedal, J. On the limited memory bfgs
method for large scale optimization. Mathematical pro-
gramming, 45(1-3):503–528, 1989.
Martens, J. and Grosse, R. Optimizing neural networks with
kronecker-factored approximate curvature. In Interna-
tional Conference on Machine Learning, pp. 2408–2417,
2015.
Mokhtari, A. and Ribeiro, A. Global convergence of on-
line limited memory bfgs. Journal of Machine Learning
Research, 16(1):3151–3181, 2015.
Nocedal, J. and Wright, S. Numerical Optimization.
Springer New York, 2 edition, 1999.
Pasupathy, R., Glynn, P., Ghosh, S., and Hashemi, F. S. On
sampling rates in stochastic recursions. 2015. Under
Review.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E.,
DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer,
A. Automatic differentiation in pytorch. 2017.
Robbins, H. and Monro, S. A stochastic approximation
method. The annals of mathematical statistics, pp. 400–
407, 1951.
Roosta-Khorasani, F. and Mahoney, M. W. Sub-sampled
Newton methods II: Local convergence rates. arXiv
preprint arXiv:1601.04738, 2016a.
Roosta-Khorasani, F. and Mahoney, M. W. Sub-sampled
Newton methods I: Globally convergent algorithms.
arXiv preprint arXiv:1601.04737, 2016b.
Schraudolph, N. N., Yu, J., and Günter, S. A stochastic quasi-Newton method for online convex optimization. In International Conference on Artificial Intelligence and Statistics, pp. 436–443, 2007.
Smith, S. L., Kindermans, P., and Le, Q. V. Don’t decay
the learning rate, increase the batch size. arXiv preprint
arXiv:1711.00489, 2017.
Sohl-Dickstein, J., Poole, B., and Ganguli, S. Fast large-
scale optimization by unifying stochastic gradient and
quasi-Newton methods. In International Conference on
Machine Learning, pp. 604–612, 2014.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I.,
and Salakhutdinov, R. Dropout: A simple way to prevent
neural networks from overfitting. The Journal of Machine
Learning Research, 15(1):1929–1958, 2014.
You, Y., Gitman, I., and Ginsburg, B. Scaling sgd
batch size to 32k for imagenet training. arXiv preprint
arXiv:1708.03888, 2017.
Zhou, C., Gao, W., and Goldfarb, D. Stochastic adaptive
quasi-Newton methods for minimizing expected values.
In International Conference on Machine Learning, pp.
4150–4159, 2017.
A. Initial Step Length Derivation
To establish our results, recall that the stochastic quasi-Newton method is defined as
$$x_{k+1} = x_k - \alpha_k H_k g_k^{S_k}, \qquad (29)$$
where the batch (or subsampled) gradient is given by
$$g_k^{S_k} = \nabla F_{S_k}(x_k) = \frac{1}{|S_k|}\sum_{i \in S_k}\nabla F_i(x_k), \qquad (30)$$
and the set $S_k \subset \{1, 2, \cdots\}$ indexes data points $(y_i, z_i)$. The algorithm selects the Hessian approximation $H_k$ through quasi-Newton updating prior to selecting the new sample $S_k$ to define the search direction $p_k$. We will use $\mathbb{E}_k$ to denote the conditional expectation at $x_k$ and use $\mathbb{E}$ to denote the total expectation.

The primary theoretical mechanism for determining batch sizes is the exact variance inner product quasi-Newton (IPQN) test, which is defined as
$$\frac{\mathbb{E}_k\left[\left((H_k \nabla F(x_k))^T (H_k g_k^{i}) - \|H_k \nabla F(x_k)\|^2\right)^2\right]}{|S_k|} \le \theta^2 \|H_k \nabla F(x_k)\|^4. \qquad (31)$$
We establish the inequality used to determine the initial steplength $\alpha_k$ for the stochastic line search.

Lemma A.1. Assume that $F$ is continuously differentiable with Lipschitz continuous gradient with Lipschitz constant $L$. Then
$$\mathbb{E}_k[F(x_{k+1})] \le F(x_k) - \alpha_k \nabla F(x_k)^T H_k^{1/2} W_k H_k^{1/2} \nabla F(x_k),$$
where
$$W_k = I - \frac{L\alpha_k}{2}\left(1 + \frac{\mathrm{Var}\{H_k g_k^{i}\}}{|S_k|\,\|H_k \nabla F(x_k)\|^2}\right) H_k,$$
and $\mathrm{Var}\{H_k g_k^{i}\} = \mathbb{E}_k\left[\|H_k g_k^{i} - H_k \nabla F(x_k)\|^2\right]$.

Proof. By Lipschitz continuity of the gradient, we have that
$$\begin{aligned}
\mathbb{E}_k[F(x_{k+1})] &\le F(x_k) - \alpha_k \nabla F(x_k)^T H_k \mathbb{E}_k\left[g_k^{S_k}\right] + \frac{L\alpha_k^2}{2}\mathbb{E}_k\left[\|H_k g_k^{S_k}\|^2\right] \\
&= F(x_k) - \alpha_k \nabla F(x_k)^T H_k \nabla F(x_k) + \frac{L\alpha_k^2}{2}\left(\|H_k \nabla F(x_k)\|^2 + \mathbb{E}_k\left[\|H_k g_k^{S_k} - H_k \nabla F(x_k)\|^2\right]\right) \\
&\le F(x_k) - \alpha_k \nabla F(x_k)^T H_k \nabla F(x_k) + \frac{L\alpha_k^2}{2}\left(\|H_k \nabla F(x_k)\|^2 + \frac{\mathrm{Var}\{H_k g_k^{i}\}}{|S_k|\,\|H_k \nabla F(x_k)\|^2}\|H_k \nabla F(x_k)\|^2\right) \\
&= F(x_k) - \alpha_k \nabla F(x_k)^T H_k^{1/2}\left(I - \frac{L\alpha_k}{2}\left(1 + \frac{\mathrm{Var}\{H_k g_k^{i}\}}{|S_k|\,\|H_k \nabla F(x_k)\|^2}\right) H_k\right) H_k^{1/2}\nabla F(x_k) \\
&= F(x_k) - \alpha_k \nabla F(x_k)^T H_k^{1/2} W_k H_k^{1/2}\nabla F(x_k).
\end{aligned}$$
B. Convergence Analysis
For the rest of our analysis, we make the following two assumptions.

Assumption B.1. The orthogonality condition is satisfied for all $k$, i.e.,
$$\frac{\mathbb{E}_k\left[\left\|H_k g_k^{i} - \frac{(H_k g_k^{i})^T (H_k \nabla F(x_k))}{\|H_k \nabla F(x_k)\|^2} H_k \nabla F(x_k)\right\|^2\right]}{|S_k|} \le \nu^2 \|H_k \nabla F(x_k)\|^2, \qquad (32)$$
for some large $\nu > 0$.

Assumption B.2. The eigenvalues of $H_k$ are contained in an interval in $\mathbb{R}_+$, i.e., for all $k$ there exist constants $\Lambda_2 \ge \Lambda_1 > 0$ such that
$$\Lambda_1 I \preceq H_k \preceq \Lambda_2 I. \qquad (33)$$

Condition (32) ensures that the stochastic quasi-Newton direction is bounded away from orthogonality to $-H_k \nabla F(x_k)$, with high probability, and prevents the variance in the individual quasi-Newton directions from being too large relative to the variance in the individual quasi-Newton directions along $-H_k \nabla F(x_k)$. Assumption B.2 holds, for example, when $F$ is convex and a regularization parameter is included so that any subsampled Hessian $\nabla^2 F_S(x)$ is positive definite. It can also be shown to hold in the nonconvex case by applying cautious BFGS updating, e.g., by updating $H_k$ only when $y_k^T s_k \ge \epsilon \|s_k\|^2$, where $\epsilon > 0$ is a predetermined constant (Berahas et al., 2016).
We begin by establishing a technical descent lemma.

Lemma B.3. Suppose that $F$ is twice continuously differentiable and that there exists a constant $L > 0$ such that
$$\nabla^2 F(x) \preceq L I, \quad \forall x \in \mathbb{R}^d. \qquad (34)$$
Let $\{x_k\}$ be generated by iteration (29) for any $x_0$, where $|S_k|$ is chosen by the (exact variance) inner product quasi-Newton test (31) for a given constant $\theta > 0$, and suppose that Assumptions (B.1) and (B.2) hold. Then, for any $k$,
$$\mathbb{E}_k\left[\|H_k g_k^{S_k}\|^2\right] \le (1 + \theta^2 + \nu^2)\|H_k \nabla F(x_k)\|^2. \qquad (35)$$
Moreover, if $\alpha_k$ satisfies
$$\alpha_k = \alpha \le \frac{1}{(1 + \theta^2 + \nu^2) L \Lambda_2}, \qquad (36)$$
we have that
$$\mathbb{E}_k[F(x_{k+1})] \le F(x_k) - \frac{\alpha}{2}\|H_k^{1/2}\nabla F(x_k)\|^2. \qquad (37)$$

Proof. By Assumption (B.1), the orthogonality condition, we have that
$$\mathbb{E}_k\left[\left\|H_k g_k^{S_k} - \frac{(H_k g_k^{S_k})^T (H_k \nabla F(x_k))}{\|H_k \nabla F(x_k)\|^2} H_k \nabla F(x_k)\right\|^2\right] \le \frac{\mathbb{E}_k\left[\left\|H_k g_k^{i} - \frac{(H_k g_k^{i})^T (H_k \nabla F(x_k))}{\|H_k \nabla F(x_k)\|^2} H_k \nabla F(x_k)\right\|^2\right]}{|S_k|} \le \nu^2\|H_k \nabla F(x_k)\|^2. \qquad (38)$$
Now, expanding the left hand side of inequality (38), we get
$$\begin{aligned}
\mathbb{E}_k\left[\left\|H_k g_k^{S_k} - \frac{(H_k g_k^{S_k})^T (H_k \nabla F(x_k))}{\|H_k \nabla F(x_k)\|^2} H_k \nabla F(x_k)\right\|^2\right]
&= \mathbb{E}_k\left[\|H_k g_k^{S_k}\|^2\right] - 2\frac{\mathbb{E}_k\left[\left((H_k g_k^{S_k})^T (H_k \nabla F(x_k))\right)^2\right]}{\|H_k \nabla F(x_k)\|^2} + \frac{\mathbb{E}_k\left[\left((H_k g_k^{S_k})^T (H_k \nabla F(x_k))\right)^2\right]}{\|H_k \nabla F(x_k)\|^2} \\
&= \mathbb{E}_k\left[\|H_k g_k^{S_k}\|^2\right] - \frac{\mathbb{E}_k\left[\left((H_k g_k^{S_k})^T (H_k \nabla F(x_k))\right)^2\right]}{\|H_k \nabla F(x_k)\|^2} \\
&\le \nu^2\|H_k \nabla F(x_k)\|^2.
\end{aligned}$$
Therefore, rearranging gives the inequality
$$\mathbb{E}_k\left[\|H_k g_k^{S_k}\|^2\right] \le \frac{\mathbb{E}_k\left[\left((H_k g_k^{S_k})^T (H_k \nabla F(x_k))\right)^2\right]}{\|H_k \nabla F(x_k)\|^2} + \nu^2\|H_k \nabla F(x_k)\|^2. \qquad (39)$$
To bound the first term on the right side of this inequality, we use the inner product quasi-Newton test; in particular, $|S_k|$ satisfies
$$\mathbb{E}_k\left[\left((H_k \nabla F(x_k))^T (H_k g_k^{S_k}) - \|H_k \nabla F(x_k)\|^2\right)^2\right] \le \frac{\mathbb{E}_k\left[\left((H_k \nabla F(x_k))^T (H_k g_k^{i}) - \|H_k \nabla F(x_k)\|^2\right)^2\right]}{|S_k|} \le \theta^2\|H_k \nabla F(x_k)\|^4, \qquad (40)$$
where the second inequality holds by the IPQN test. Since
$$\mathbb{E}_k\left[\left((H_k \nabla F(x_k))^T (H_k g_k^{S_k}) - \|H_k \nabla F(x_k)\|^2\right)^2\right] = \mathbb{E}_k\left[\left((H_k \nabla F(x_k))^T (H_k g_k^{S_k})\right)^2\right] - \|H_k \nabla F(x_k)\|^4, \qquad (41)$$
we have
$$\mathbb{E}_k\left[\left((H_k g_k^{S_k})^T (H_k \nabla F(x_k))\right)^2\right] \le \|H_k \nabla F(x_k)\|^4 + \theta^2\|H_k \nabla F(x_k)\|^4 = (1 + \theta^2)\|H_k \nabla F(x_k)\|^4, \qquad (42)$$
by (40) and (41). Substituting (42) into (39), we get the following bound on the length of the search direction:
$$\mathbb{E}_k\left[\|H_k g_k^{S_k}\|^2\right] \le (1 + \theta^2 + \nu^2)\|H_k \nabla F(x_k)\|^2,$$
which proves (35). Using this inequality, Assumption B.2, and the bounds on the Hessian and steplength (34) and (36), we have
$$\begin{aligned}
\mathbb{E}_k[F(x_{k+1})] &\le F(x_k) - \mathbb{E}_k\left[\alpha (H_k g_k^{S_k})^T \nabla F(x_k)\right] + \mathbb{E}_k\left[\frac{L\alpha^2}{2}\|H_k g_k^{S_k}\|^2\right] \\
&= F(x_k) - \alpha \nabla F(x_k)^T H_k \nabla F(x_k) + \frac{L\alpha^2}{2}\mathbb{E}_k\left[\|H_k g_k^{S_k}\|^2\right] \\
&\le F(x_k) - \alpha \nabla F(x_k)^T H_k \nabla F(x_k) + \frac{L\alpha^2}{2}(1 + \theta^2 + \nu^2)\|H_k \nabla F(x_k)\|^2 \\
&= F(x_k) - \alpha (H_k^{1/2}\nabla F(x_k))^T\left(I - \frac{L\alpha(1 + \theta^2 + \nu^2)}{2} H_k\right) H_k^{1/2}\nabla F(x_k) \\
&\le F(x_k) - \alpha\left(1 - \frac{L\Lambda_2\alpha(1 + \theta^2 + \nu^2)}{2}\right)\|H_k^{1/2}\nabla F(x_k)\|^2 \\
&\le F(x_k) - \frac{\alpha}{2}\|H_k^{1/2}\nabla F(x_k)\|^2.
\end{aligned}$$
We now show that the stochastic quasi-Newton iteration (29) with a fixed steplength $\alpha$ is linearly convergent when $F$ is strongly convex. In the following discussion, $x^*$ denotes the minimizer of $F$.

Theorem B.4. Suppose that $F$ is twice continuously differentiable and that there exist constants $0 < \mu \le L$ such that
$$\mu I \preceq \nabla^2 F(x) \preceq L I, \quad \forall x \in \mathbb{R}^d. \qquad (43)$$
Let $\{x_k\}$ be generated by iteration (29), for any $x_0$, where $|S_k|$ is chosen by the (exact variance) inner product quasi-Newton test (31), and suppose that Assumptions (B.1) and (B.2) hold. Then, if $\alpha_k$ satisfies (36), we have that
$$\mathbb{E}[F(x_k) - F(x^*)] \le \rho^k\left(F(x_0) - F(x^*)\right), \qquad (44)$$
where $x^*$ denotes the minimizer of $F$, and $\rho = 1 - \mu\Lambda_1\alpha$.

Proof. It is well known (Bertsekas et al., 2003) that for strongly convex functions,
$$\|\nabla F(x_k)\|^2 \ge 2\mu\left[F(x_k) - F(x^*)\right].$$
Substituting this into (37), subtracting $F(x^*)$ from both sides, and using Assumption B.2, we obtain
$$\begin{aligned}
\mathbb{E}_k[F(x_{k+1}) - F(x^*)] &\le F(x_k) - F(x^*) - \frac{\alpha}{2}\|H_k^{1/2}\nabla F(x_k)\|^2 \\
&\le F(x_k) - F(x^*) - \frac{\alpha}{2}\Lambda_1\|\nabla F(x_k)\|^2 \\
&\le (1 - \mu\Lambda_1\alpha)\left(F(x_k) - F(x^*)\right).
\end{aligned}$$
The theorem follows from taking total expectation.
We now consider the case when $F$ is nonconvex and bounded below.

Theorem B.5. Suppose that $F$ is twice continuously differentiable and bounded below, and that there exists a constant $L > 0$ such that
$$\nabla^2 F(x) \preceq L I, \quad \forall x \in \mathbb{R}^d. \qquad (45)$$
Let $\{x_k\}$ be generated by iteration (29), for any $x_0$, where $|S_k|$ is chosen by the (exact variance) inner product quasi-Newton test (31), and suppose that Assumptions (B.1) and (B.2) hold. Then, if $\alpha_k$ satisfies (36), we have
$$\lim_{k \to \infty}\mathbb{E}[\|\nabla F(x_k)\|^2] = 0. \qquad (46)$$
Moreover, for any positive integer $T$ we have that
$$\min_{0 \le k \le T-1}\mathbb{E}[\|\nabla F(x_k)\|^2] \le \frac{2}{\alpha T\Lambda_1}\left(F(x_0) - F_{\min}\right),$$
where $F_{\min}$ is a lower bound on $F$ in $\mathbb{R}^d$.

Proof. From Lemma B.3 and by taking total expectation, we have
$$\mathbb{E}[F(x_{k+1})] \le \mathbb{E}[F(x_k)] - \frac{\alpha}{2}\mathbb{E}\left[\|H_k^{1/2}\nabla F(x_k)\|^2\right],$$
and hence
$$\mathbb{E}\left[\|H_k^{1/2}\nabla F(x_k)\|^2\right] \le \frac{2}{\alpha}\mathbb{E}\left[F(x_k) - F(x_{k+1})\right].$$
Summing both sides of this inequality from $k = 0$ to $T - 1$, and since $F$ is bounded below by $F_{\min}$, we get
$$\sum_{k=0}^{T-1}\mathbb{E}\left[\|H_k^{1/2}\nabla F(x_k)\|^2\right] \le \frac{2}{\alpha}\mathbb{E}\left[F(x_0) - F(x_T)\right] \le \frac{2}{\alpha}\left[F(x_0) - F_{\min}\right].$$
Using the bound on the eigenvalues of $H_k$ and taking limits, we obtain
$$\Lambda_1\lim_{T\to\infty}\sum_{k=0}^{T-1}\mathbb{E}[\|\nabla F(x_k)\|^2] \le \lim_{T\to\infty}\sum_{k=0}^{T-1}\mathbb{E}\left[\|H_k^{1/2}\nabla F(x_k)\|^2\right] < \infty,$$
which implies (46). We can also conclude that
$$\min_{0 \le k \le T-1}\mathbb{E}[\|\nabla F(x_k)\|^2] \le \frac{1}{T}\sum_{k=0}^{T-1}\mathbb{E}[\|\nabla F(x_k)\|^2] \le \frac{2}{\alpha T\Lambda_1}\left(F(x_0) - F_{\min}\right).$$
C. Additional Numerical Experiments
C.1. Datasets
Table 2 summarizes the datasets used for the experiments. Some of these datasets divide the data into training and testing sets; for the rest, we randomly divide the data so that the training set constitutes 90% of the total.

Table 2. Characteristics of all datasets used in the experiments.

Dataset     # Data Points (train; test)   # Features   # Classes   Source
gisette     (6,000; 1,000)                5,000        2           (Chang & Lin, 2011)
mushrooms   (7,311; 813)                  112          2           (Chang & Lin, 2011)
sido        (11,410; 1,268)               4,932        2           (Guyon et al., 2008)
ijcnn       (35,000; 91,701)              22           2           (Chang & Lin, 2011)
spam        (82,970; 9,219)               823,470      2           (Cormack & Lynam, 2005; Carbonetto, 2009)
alpha       (450,000; 50,000)             500          2           synthetic
covertype   (522,910; 58,102)             54           2           (Chang & Lin, 2011)
url         (2,156,517; 239,613)          3,231,961    2           (Chang & Lin, 2011)
MNIST       (60,000; 10,000)              28 × 28      10          (LeCun et al., 1998)
CIFAR-10    (50,000; 10,000)              32 × 32      10          (Krizhevsky, 2009)

The alpha dataset is a synthetic dataset that is available at ftp://largescale.ml.tu-berlin.de.
C.2. Logistic Regression Experiments
We report the numerical results on binary classification logistic regression problems on the 8 datasets given in Table 2. We plot the performance measured in terms of training error, test loss, and test accuracy against gradient evaluations. We also report the behavior of the batch sizes and steplengths for both variants of the PBQN method.
0 20 40 60 80 100
Gradient Evaluations
10−4
10−3
10−2
10−1
100
101
Training Error
SG
SVRG
PBQN-MB
PBQN-FO
0246810
Gradient Evaluations
10−2
10−1
100
101
Training Error
SG
SVRG
PBQN-MB
PBQN-FO
0 2 4 6 8 10
Gradient Evaluations
0.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
1.6
Test Loss
SG
SVRG
PBQN-MB
PBQN-FO
0 2 4 6 8 10
Gradient Evaluations
50
60
70
80
90
100
Test Accuracy
SG
SVRG
PBQN-MB
PBQN-FO
0 20 40 60 80 100 120 140
Iterations
0
1
2
3
4
5
6
Batch Sizes
×103
PBQN-MB
PBQN-FO
0 20 40 60 80 100 120 140
Iterations
0.0
0.2
0.4
0.6
0.8
1.0
Steplength
PBQN-MB
PBQN-FO
Figure 5. gisette dataset:
Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and
full-overlap (FO) approaches, and the SG and SVRG methods.
A Progressive Batching L-BFGS Method for Machine Learning
0 5 10 15 20 25 30 35 40
Gradient Evaluations
10−6
10−5
10−4
10−3
10−2
10−1
100
101
Training Error
SG
SVRG
PBQN-MB
PBQN-FO
0246810
Gradient Evaluations
10−5
10−4
10−3
10−2
10−1
100
101
Training Error
SG
SVRG
PBQN-MB
PBQN-FO
0 2 4 6 8 10
Gradient Evaluations
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
Test Loss
SG
SVRG
PBQN-MB
PBQN-FO
0 2 4 6 8 10
Gradient Evaluations
40
50
60
70
80
90
100
Test Accuracy
SG
SVRG
PBQN-MB
PBQN-FO
0 10 20 30 40 50 60 70
Iterations
0
1
2
3
4
5
6
7
8
Batch Sizes
×103
PBQN-MB
PBQN-FO
0 10 20 30 40 50 60 70
Iterations
0.0
0.2
0.4
0.6
0.8
1.0
Steplength
PBQN-MB
PBQN-FO
Figure 6. mushrooms dataset:
Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and
full-overlap (FO) approaches, and the SG and SVRG methods.
0 20 40 60 80 100
Gradient Evaluations
10−5
10−4
10−3
10−2
10−1
100
Training Error
SG
SVRG
PBQN-MB
PBQN-FO
0246810
Gradient Evaluations
10−3
10−2
10−1
100
Training Error
SG
SVRG
PBQN-MB
PBQN-FO
0 2 4 6 8 10
Gradient Evaluations
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
Test Loss
SG
SVRG
PBQN-MB
PBQN-FO
0 2 4 6 8 10
Gradient Evaluations
50
60
70
80
90
100
Test Accuracy
SG
SVRG
PBQN-MB
PBQN-FO
0 50 100 150 200
Iterations
0.0
0.2
0.4
0.6
0.8
1.0
1.2
Batch Sizes
×104
PBQN-MB
PBQN-FO
0 50 100 150 200
Iterations
0.0
0.2
0.4
0.6
0.8
1.0
Steplength
PBQN-MB
PBQN-FO
Figure 7. sido dataset:
Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and full-overlap
(FO) approaches, and the SG and SVRG methods.
A Progressive Batching L-BFGS Method for Machine Learning
0 10 20 30 40 50 60
Gradient Evaluations
10−6
10−5
10−4
10−3
10−2
10−1
100
Training Error
SG
SVRG
PBQN-MB
PBQN-FO
0246810
Gradient Evaluations
10−5
10−4
10−3
10−2
10−1
100
Training Error
SG
SVRG
PBQN-MB
PBQN-FO
0 2 4 6 8 10
Gradient Evaluations
0.2
0.3
0.4
0.5
0.6
0.7
Test Loss
SG
SVRG
PBQN-MB
PBQN-FO
0 2 4 6 8 10
Gradient Evaluations
50
55
60
65
70
75
80
85
90
95
Test Accuracy
SG
SVRG
PBQN-MB
PBQN-FO
0 20 40 60 80 100 120 140 160
Iterations
0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
Batch Sizes
×104
PBQN-MB
PBQN-FO
0 20 40 60 80 100 120 140 160
Iterations
0.0
0.2
0.4
0.6
0.8
1.0
Steplength
PBQN-MB
PBQN-FO
Figure 8. ijcnn dataset:
Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and full-
overlap(FO) approaches, and the SG and SVRG methods.
0 10 20 30 40 50 60 70
Gradient Evaluations
10−6
10−5
10−4
10−3
10−2
10−1
100
Training Error
SG
SVRG
PBQN-MB
PBQN-FO
0246810
Gradient Evaluations
10−4
10−3
10−2
10−1
100
Training Error
SG
SVRG
PBQN-MB
PBQN-FO
0 2 4 6 8 10
Gradient Evaluations
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
Test Loss
SG
SVRG
PBQN-MB
PBQN-FO
0 2 4 6 8 10
Gradient Evaluations
50
60
70
80
90
100
Test Accuracy
SG
SVRG
PBQN-MB
PBQN-FO
0 50 100 150 200 250 300 350
Iterations
0
1
2
3
4
5
6
7
8
9
Batch Sizes
×104
PBQN-MB
PBQN-FO
0 50 100 150 200 250 300 350
Iterations
0.0
0.2
0.4
0.6
0.8
1.0
Steplength
PBQN-MB
PBQN-FO
Figure 9. spam dataset:
Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and full-
overlap (FO) approaches, and the SG and SVRG methods.
A Progressive Batching L-BFGS Method for Machine Learning
[Figure: panels plotting training error (log scale), test loss, and test accuracy against gradient evaluations for SG, SVRG, PBQN-MB, and PBQN-FO, together with the batch sizes and steplengths used by PBQN-MB and PBQN-FO over the iterations.]
Figure 10. alpha dataset:
Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and
full-overlap (FO) approaches, and the SG and SVRG methods.
[Figure: panels plotting training error (log scale), test loss, and test accuracy against gradient evaluations for SG, SVRG, PBQN-MB, and PBQN-FO, together with the batch sizes and steplengths used by PBQN-MB and PBQN-FO over the iterations.]
Figure 11. covertype dataset:
Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and
full-overlap (FO) approaches, and the SG and SVRG methods.
[Figure: panels plotting training error (log scale), test loss, and test accuracy against gradient evaluations for SG, SVRG, PBQN-MB, and PBQN-FO, together with the batch sizes and steplengths used by PBQN-MB and PBQN-FO over the iterations.]
Figure 12. url dataset:
Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and full-overlap
(FO) approaches, and the SG and SVRG methods. Note that we only ran the SG and SVRG algorithms for 3 gradient evaluations, since the equivalent number of iterations had already reached the order of $10^7$.
C.3. Neural Network Experiments
We describe each neural network architecture below. We plot the training loss, test loss and test accuracy against the total
number of iterations and gradient evaluations. We also report the behavior of the batch sizes and steplengths for both variants
of the PBQN method.
C.3.1. CIFAR-10 CONVOLUTIONAL NETWORK (C) ARCHITECTURE
The small convolutional neural network (ConvNet) is a 2-layer convolutional network with two alternating stages of 5×5 kernels and 2×2 max pooling, followed by a fully connected layer with 1000 ReLU units. The first convolutional layer yields 6 output channels and the second convolutional layer yields 16 output channels.
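As a concrete illustration, a minimal PyTorch-style sketch of this architecture is given below; the ReLU units after each convolution, the absence of padding, and the final 10-class output layer are our assumptions and may differ from the implementation used in the experiments.

import torch.nn as nn

# Sketch of the ConvNet (C) for 3x32x32 CIFAR-10 inputs; padding and ReLU
# placement after the convolutions are assumptions, not fixed by the text.
convnet = nn.Sequential(
    nn.Conv2d(3, 6, kernel_size=5),    # first 5x5 stage, 6 output channels
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),       # 2x2 max pooling
    nn.Conv2d(6, 16, kernel_size=5),   # second 5x5 stage, 16 output channels
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
    nn.Flatten(),                      # 16 x 5 x 5 = 400 features
    nn.Linear(400, 1000),              # fully connected layer with 1000 ReLU units
    nn.ReLU(),
    nn.Linear(1000, 10),               # assumed 10-way CIFAR-10 output
)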
C.3.2. CIFAR-10 AND MNIST ALEXNET-LIKE NETWORK (A1, A2) ARCHITECTURE
The larger convolutional network (AlexNet) is an adaptation of the AlexNet architecture (Krizhevsky et al., 2012) for CIFAR-10 and MNIST. The CIFAR-10 version consists of three convolutional layers with max pooling followed by two fully-connected layers. The first convolutional layer uses a 5×5 kernel with a stride of 2 and 64 output channels. The second and third convolutional layers use a 3×3 kernel with a stride of 1 and 64 output channels. Each convolutional layer is followed by ReLU activations and 3×3 max pooling with a stride of 2. This is all followed by two fully-connected layers with 384 and 192 neurons, respectively, each with ReLU activations. The MNIST version of this network differs only in using a 2×2 max pooling layer after the last convolutional layer.
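A minimal PyTorch-style sketch of the CIFAR-10 version follows; the padding values and the final 10-class output layer are assumptions chosen so that the dimensions are consistent for 32×32 inputs, and may differ from the implementation used in the experiments.

import torch.nn as nn

# Sketch of the AlexNet-like network (A2) for 3x32x32 CIFAR-10 inputs.
# Padding values are assumptions; they only affect intermediate feature sizes.
alexnet_cifar = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=5, stride=2, padding=2),   # 5x5, stride 2, 64 channels
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),        # 3x3 max pooling, stride 2
    nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),   # 3x3, stride 1, 64 channels
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
    nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
    nn.Flatten(),                      # 64 x 2 x 2 = 256 features for 32x32 inputs
    nn.Linear(256, 384), nn.ReLU(),    # first fully-connected layer, 384 neurons
    nn.Linear(384, 192), nn.ReLU(),    # second fully-connected layer, 192 neurons
    nn.Linear(192, 10),                # assumed 10-way CIFAR-10 output
)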
C.3.3. CIFAR-10 RESIDUAL NETWORK (R) ARCHITECTURE
The residual network (ResNet18) is a slight modification of the ImageNet ResNet18 architecture for CIFAR-10 (He et al., 2016). It follows the same architecture as ResNet18 for ImageNet but removes the global average pooling layer before the 1000-neuron fully-connected layer. ReLU activations and max poolings are included appropriately.
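A rough sketch of this modification, assuming the torchvision ResNet18 implementation and a 10-class output layer (neither of which is specified here), is:

import torch.nn as nn
from torchvision.models import resnet18

# Start from the standard ImageNet ResNet18 and remove the global average
# pooling layer that precedes the final fully-connected layer.
model = resnet18(num_classes=10)   # assumed 10-way CIFAR-10 output
model.avgpool = nn.Identity()      # drop global average pooling
# For 32x32 CIFAR-10 inputs the last residual block already produces a 1x1
# feature map, so flattening still yields 512 features for the final layer.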
[Figure: panels plotting training loss, test loss, and test accuracy against iterations and gradient evaluations for SG, Adam, PBQN-MB, and PBQN-FO, together with the batch sizes and steplengths used by PBQN-MB and PBQN-FO over the iterations.]
Figure 13. CIFAR-10 ConvNet (C):
Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap)
and full-overlap (FO) approaches, and the SG and Adam methods. The best results for L-BFGS are achieved with θ = 0.9.
[Figure: panels plotting training loss, test loss, and test accuracy against iterations and gradient evaluations for SG, Adam, PBQN-MB, and PBQN-FO, together with the batch sizes and steplengths used by PBQN-MB and PBQN-FO over the iterations.]
Figure 14. MNIST AlexNet (A1):
Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and
full-overlap (FO) approaches, and the SG and Adam methods. The best results for L-BFGS are achieved with θ = 2.
[Figure: panels plotting training loss, test loss, and test accuracy against iterations and gradient evaluations for SG, Adam, PBQN-MB, and PBQN-FO, together with the batch sizes and steplengths used by PBQN-MB and PBQN-FO over the iterations.]
Figure 15. CIFAR-10 AlexNet (A2):
Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap)
and full-overlap (FO) approaches, and the SG and Adam methods. The best results for L-BFGS are achieved with θ = 0.9.
[Figure: panels plotting training loss, test loss, and test accuracy against iterations and gradient evaluations for SG, Adam, PBQN-MB, and PBQN-FO, together with the batch sizes and steplengths used by PBQN-MB and PBQN-FO over the iterations.]
Figure 16. CIFAR-10 ResNet18 (R):
Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap)
and full-overlap (FO) approaches, and the SG and Adam methods. The best results for L-BFGS are achieved with θ = 2.
D. Performance Model
The use of increasing batch sizes in the PBQN algorithm yields a larger effective batch size than the SG method, allowing PBQN to scale to a larger number of nodes than is currently possible even with large-batch training (Goyal et al., 2017). With improved scalability and richer gradient information, we expect a reduction in training time. To demonstrate the potential of a parallelized implementation of PBQN to reduce training time, we extend the idealized performance model from (Keskar et al., 2016) to the PBQN algorithm. For PBQN to be competitive, it must achieve the following: (i) the quality of its solution should match or improve upon SG's solution (as shown in Table 1 of the main paper); (ii) it should utilize a larger effective batch size; and (iii) it should converge to the solution in fewer iterations. We provide an initial analysis by establishing the analytic requirements for improved training time; we leave discussion of implementation details, memory requirements, and large-scale experiments for future work.
Let the effective batch size for PBQN and the conventional SG batch size be denoted by $\widehat{B}_L$ and $B_S$, respectively. From Algorithm 1, we observe that the PBQN iteration involves extra computation in addition to the gradient computation performed by SG. The additional steps are: the L-BFGS two-loop recursion, which includes several operations over the stored curvature pairs and network parameters (Algorithm 1:6); the stochastic line search for identifying the steplength (Algorithm 1:7-16); and curvature pair updating (Algorithm 1:18-21). However, most of these supplemental operations act only on the weights of the network, which is orders of magnitude cheaper than computing the gradient. The two-loop recursion performs $O(10)$ operations over the network parameters and curvature pairs. The cost of variance estimation is negligible, since a fixed number of samples can be used throughout the run and its computation can be parallelized, avoiding a serial bottleneck.
The only exception is the stochastic line search, which requires additional forward propagations through the model for different sets of network parameters. However, this occurs only when the steplength is not accepted, which happens infrequently in practice. We make the pessimistic assumption of one additional forward propagation every iteration, amounting to an additional $\tfrac{1}{3}$ of the cost of the gradient computation (forward propagation, plus back propagation with respect to activations and weights). Hence, the ratio of the cost-per-iteration of PBQN, $C_L$, to SG's cost-per-iteration, $C_S$, is $\tfrac{4}{3}$. Let $I_S$ and $I_L$ be the number of iterations that SG and PBQN, respectively, take to reach similar test accuracy. Let $N$ denote the target number of nodes used for training, with $N < \widehat{B}_L$. For $N$ nodes, the parallel efficiency of SG is assumed to be $P_e(N)$, and we assume that, at the target node count, there is no drop in parallel efficiency for PBQN due to its large effective batch size.
For the PBQN method to achieve a lower training time, the following relation should hold:
$$\frac{I_L\, C_L\, \widehat{B}_L}{N} \;<\; \frac{I_S\, C_S\, B_S}{N\, P_e(N)}. \tag{47}$$
In terms of iterations, we can rewrite this as
$$\frac{I_L}{I_S} \;<\; \frac{C_S}{C_L}\,\frac{B_S}{\widehat{B}_L}\,\frac{1}{P_e(N)}. \tag{48}$$
Assuming a target node count $N = B_S < \widehat{B}_L$, the scaling efficiency of SG drops significantly due to the reduced work per node, giving a parallel efficiency of $P_e(N) = 0.2$; see (Kurth et al., 2017; You et al., 2017). If we additionally assume that the effective batch size of PBQN is $4\times$ larger, with an SG large batch of $\approx 8$K and a PBQN batch of $\approx 32$K as observed in our experiments (Section 4), this gives $\widehat{B}_L / B_S = 4$. The bound in (48) then evaluates to $\tfrac{3}{4}\cdot\tfrac{1}{4}\cdot\tfrac{1}{0.2} \approx 0.94$; that is, PBQN must converge in about the same number of iterations as SG in order to achieve a lower training time. The results in Section 4 show that PBQN converges in significantly fewer iterations than SG, hence establishing the potential for lower training times. We refer the reader to (Das et al., 2016) for a more detailed model and commentary on the effect of batch size on performance.
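As a concrete check of condition (48) under these assumptions, the short Python sketch below reproduces the bound; the values it uses are the assumed numbers quoted above, not measurements.

# Sketch: evaluate the iteration-ratio bound in (48) under the stated assumptions.
def iteration_ratio_bound(cost_ratio_sg_to_pbqn, batch_ratio_sg_to_pbqn, parallel_efficiency):
    """Return the largest admissible I_L / I_S for PBQN to train faster than SG."""
    return cost_ratio_sg_to_pbqn * batch_ratio_sg_to_pbqn / parallel_efficiency

# Assumed values from the discussion above: C_S/C_L = 3/4, B_S/B_L_hat = 1/4, P_e(N) = 0.2.
bound = iteration_ratio_bound(3.0 / 4.0, 1.0 / 4.0, 0.2)
print(bound)  # 0.9375: PBQN must take roughly as few iterations as SG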