A Progressive Batching L-BFGS Method for Machine Learning
Raghu Bollapragada 1, Dheevatsa Mudigere 2, Jorge Nocedal 1, Hao-Jun Michael Shi 1, Ping Tak Peter Tang 3
Abstract
The standard L-BFGS method relies on gradient approximations that are not dominated by noise, so that search directions are descent directions, the line search is reliable, and quasi-Newton updating yields useful quadratic models of the objective function. All of this appears to call for a full batch approach, but since small batch sizes give rise to faster algorithms with better generalization properties, L-BFGS is currently not considered an algorithm of choice for large-scale machine learning applications. One need not, however, choose between the two extremes represented by the full batch or highly stochastic regimes, and may instead follow a progressive batching approach in which the sample size increases during the course of the optimization. In this paper, we present a new version of the L-BFGS algorithm that combines three basic components (progressive batching, a stochastic line search, and stable quasi-Newton updating) and that performs well on training logistic regression and deep neural networks. We provide supporting convergence theory for the method.
1. Introduction
The L-BFGS method (Liu & Nocedal, 1989) has traditionally been regarded as a batch method in the machine learning community. This is because quasi-Newton algorithms need gradients of high quality in order to construct useful quadratic models and perform reliable line searches. These algorithmic ingredients can be implemented, it seems, only by using very large batch sizes, resulting in a costly iteration that makes the overall algorithm slow compared with stochastic gradient methods (Robbins & Monro, 1951).

1 Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL, USA. 2 Intel Corporation, Bangalore, India. 3 Intel Corporation, Santa Clara, CA, USA. Correspondence to: Raghu Bollapragada <raghu.bollapragada@u.northwestern.edu>, Dheevatsa Mudigere <dheevatsa.mudigere@intel.com>, Jorge Nocedal <j-nocedal@northwestern.edu>, Hao-Jun Michael Shi <hjmshi@u.northwestern.edu>, Ping Tak Peter Tang <peter.tang@intel.com>.
Even before the resurgence of neural networks, many re-
searchers observed that a well-tuned implementation of the
stochastic gradient (SG) method was far more effective on
large-scale logistic regression applications than the batch
L-BFGS method, even when taking into account the advan-
tages of parallelism offered by the use of large batches. The
preeminence of the SG method (and its variants) became
more pronounced with the advent of deep neural networks,
and some researchers have speculated that SG is endowed
with certain regularization properties that are essential in the
minimization of such complex nonconvex functions (Hardt
et al.,2015;Keskar et al.,2016).
In this paper, we postulate that the most efficient algorithms
for machine learning may not reside entirely in the highly
stochastic or full batch regimes, but should employ a pro-
gressive batching approach in which the sample size is ini-
tially small, and is increased as the iteration progresses.
This view is consistent with recent numerical experiments
on training various deep neural networks (Smith et al.,2017;
Goyal et al.,2017), where the SG method, with increas-
ing sample sizes, yields similar test loss and accuracy as
the standard (fixed mini-batch) SG method, while offering
significantly greater opportunities for parallelism.
Progressive batching algorithms have received much atten-
tion recently from a theoretical perspective. It has been
shown that they enjoy complexity bounds that rival those of
the SG method (Byrd et al.,2012), and that they can achieve
a fast rate of convergence (Friedlander & Schmidt,2012).
The main appeal of these methods is that they inherit the
efficient initial behavior of the SG method, offer greater
opportunities to exploit parallelism, and allow for the in-
corporation of second-order information. The latter can be
done efficiently via quasi-Newton updating.
An integral part of quasi-Newton methods is the line search,
which ensures that a convex quadratic model can be con-
structed at every iteration. One challenge that immediately
arises is how to perform this line search when the objec-
tive function is stochastic. This is an issue that has not
received sufficient attention in the literature, where stochas-
tic line searches have been largely dismissed as inappropri-
ate. In this paper, we take a step towards the development
of stochastic line searches for machine learning by study-
ing a key component, namely the initial estimate in the
one-dimensional search. Our approach, which is based on
statistical considerations, is designed for an Armijo-style
backtracking line search.
1.1. Literature Review
Progressive batching (sometimes referred to as dynamic
sampling) has been well studied in the optimization litera-
ture, both for stochastic gradient and subsampled Newton-
type methods (Byrd et al.,2012;Friedlander & Schmidt,
2012;Cartis & Scheinberg,2015;Pasupathy et al.,2015;
Roosta-Khorasani & Mahoney,2016a;b;Bollapragada et al.,
2016;2017;De et al.,2017). Friedlander and Schmidt
(2012) introduced theoretical conditions under which a pro-
gressive batching SG method converges linearly for finite
sum problems, and experimented with a quasi-Newton adap-
tation of their algorithm. Byrd et al. (2012) proposed a
progressive batching strategy, based on a norm test, that de-
termines when to increase the sample size; they established
linear convergence and computational complexity bounds
in the case when the batch size grows geometrically. More
recently, Bollapragada et al. (2017) introduced a batch con-
trol mechanism based on an inner product test that improves
upon the norm test mentioned above.
There has been a renewed interest in understanding the gen-
eralization properties of small-batch and large-batch meth-
ods for training neural networks; see (Keskar et al.,2016;
Dinh et al.,2017;Goyal et al.,2017;Hoffer et al.,2017).
Keskar et al. (2016) empirically observed that large-batch
methods converge to solutions with inferior generalization
properties; however, Goyal et al. (2017) showed that large-
batch methods can match the performance of small-batch
methods when a warm-up strategy is used in conjunction
with scaling the step length by the same factor as the batch
size. Hoffer et al. (2017) and You et al. (2017) also explored
larger batch sizes and steplengths to reduce the number of
updates necessary to train the network. All of these studies
naturally led to an interest in progressive batching tech-
niques. Smith et al. (2017) showed empirically that increas-
ing the sample size and decaying the steplength are quan-
titatively equivalent for the SG method; hence, steplength
schedules could be directly converted to batch size sched-
ules. This approach was parallelized by Devarakonda et al.
(2017). De et al. (2017) presented numerical results with
a progressive batching method that employs the norm test.
Balles et al. (2016) proposed an adaptive dynamic sample size scheme that couples the sample size with the steplength.
Stochastic second-order methods have been explored within
the context of convex and non-convex optimization; see
(Schraudolph et al.,2007;Sohl-Dickstein et al.,2014;
Mokhtari & Ribeiro,2015;Berahas et al.,2016;Byrd et al.,
2016;Keskar & Berahas,2016;Curtis,2016;Berahas &
Takáč, 2017; Zhou et al., 2017). Schraudolph et al. (2007)
ensured stability of quasi-Newton updating by computing
gradients using the same batch at the beginning and end
of the iteration. Since this can potentially double the cost
of the iteration, Berahas et al. (2016) proposed to achieve
gradient consistency by computing gradients based on the
overlap between consecutive batches; this approach was
further tested by Berahas and Takáč (2017). An interesting
approach introduced by Martens and Grosse (2015;2016)
approximates the Fisher information matrix to scale the
gradient; a distributed implementation of their K-FAC ap-
proach is described in (Ba et al.,2016). Another approach
approximately computes the inverse Hessian by using the
Neumann power series representation of matrices (Krishnan
et al.,2017).
1.2. Contributions
This paper builds upon three algorithmic components that
have recently received attention in the literature — progres-
sive batching, stable quasi-Newton updating, and adaptive
steplength selection. It advances their design and puts them
together in a novel algorithm with attractive theoretical and
computational properties.
The cornerstone of our progressive batching strategy is the
mechanism proposed by Bollapragada et al. (2017) in the
context of first-order methods. We extend their inner prod-
uct control test to second-order algorithms, something that
is delicate and leads to a significant modification of the
original procedure. Another main contribution of the paper
is the design of an Armijo-style backtracking line search
where the initial steplength is chosen based on statistical
information gathered during the course of the iteration. We
show that this steplength procedure is effective on a wide
range of applications, as it leads to well scaled steps and
allows for the BFGS update to be performed most of the
time, even for nonconvex problems. We also test two tech-
niques for ensuring the stability of quasi-Newton updating,
and observe that the overlapping procedure described by
Berahas et al. (2016) is more efficient than a straightforward
adaptation of classical quasi-Newton methods (Schraudolph
et al.,2007).
We report numerical tests on large-scale logistic regression
and deep neural network training tasks that indicate that our
method is robust and efficient, and has good generalization
properties. An additional advantage is that the method re-
quires almost no parameter tuning, which is possible due to
the incorporation of second-order information. All of this
suggests that our approach has the potential to become one
of the leading optimization methods for training deep neural
networks. In order to achieve this, the algorithm must be
optimized for parallel execution, something that was only
briefly explored in this study.
2. A Progressive Batching Quasi-Newton Method

The problem of interest is
$$\min_{x \in \mathbb{R}^d} F(x) = \int f(x; z, y)\, dP(z, y), \tag{1}$$
where $f$ is the composition of a prediction function (parametrized by $x$) and a loss function, and $(z, y)$ are random input-output pairs with probability distribution $P(z, y)$. The associated empirical risk problem consists of minimizing
$$R(x) = \frac{1}{N}\sum_{i=1}^{N} f(x; z_i, y_i) \triangleq \frac{1}{N}\sum_{i=1}^{N} F_i(x),$$
where we define $F_i(x) = f(x; z_i, y_i)$. A stochastic quasi-Newton method is given by
$$x_{k+1} = x_k - \alpha_k H_k g_k^{S_k}, \tag{2}$$
where the batch (or subsampled) gradient is given by
$$g_k^{S_k} = \nabla F^{S_k}(x_k) \triangleq \frac{1}{|S_k|}\sum_{i \in S_k} \nabla F_i(x_k), \tag{3}$$
the set $S_k \subset \{1, 2, \cdots\}$ indexes data points $(y_i, z_i)$ sampled from the distribution $P(z, y)$, and $H_k$ is a positive definite quasi-Newton matrix. We now discuss each of the components of the new method.
2.1. Sample Size Selection
The proposed algorithm has the form (2)-(3). Initially, it utilizes a small batch size $|S_k|$, and increases it gradually in order to attain a fast local rate of convergence and permit the use of second-order information. A challenging question is to determine when, and by how much, to increase the batch size $|S_k|$ over the course of the optimization procedure based on observed gradients, as opposed to using prescribed rules that depend on the iteration number $k$.
We propose to build upon the strategy introduced by Bol-
lapragada et al. (2017) in the context of first-order methods.
Their inner product test determines a sample size such that
the search direction is a descent direction with high prob-
ability. A straightforward extension of this strategy to the
quasi-Newton setting is not appropriate since requiring only
that a stochastic quasi-Newton search direction be a de-
scent direction with high probability would underutilize the
curvature information contained in the search direction.
We would like, instead, for the search direction $d_k = -H_k g_k^{S_k}$ to make an acute angle with the true quasi-Newton search direction $-H_k \nabla F(x_k)$, with high probability. Although this does not imply that $d_k$ is a descent direction for $F$, this will normally be the case for any reasonable quasi-Newton matrix.
To derive the new inner product quasi-Newton (IPQN) test, we first observe that the stochastic quasi-Newton search direction makes an acute angle with the true quasi-Newton direction in expectation, i.e.,
$$\mathbb{E}_k\!\left[(H_k \nabla F(x_k))^T (H_k g_k^{S_k})\right] = \|H_k \nabla F(x_k)\|^2, \tag{4}$$
where $\mathbb{E}_k$ denotes the conditional expectation at $x_k$. We must, however, control the variance of this quantity to achieve our stated objective. Specifically, we select the sample size $|S_k|$ such that the following condition is satisfied:
$$\mathbb{E}_k\!\left[\left((H_k \nabla F(x_k))^T (H_k g_k^{S_k}) - \|H_k \nabla F(x_k)\|^2\right)^2\right] \le \theta^2 \|H_k \nabla F(x_k)\|^4, \tag{5}$$
for some $\theta > 0$. The left hand side of (5) is difficult to compute but can be bounded by the true variance of individual search directions, i.e.,
$$\frac{\mathbb{E}_k\!\left[\left((H_k \nabla F(x_k))^T (H_k g_k^{i}) - \|H_k \nabla F(x_k)\|^2\right)^2\right]}{|S_k|} \le \theta^2 \|H_k \nabla F(x_k)\|^4, \tag{6}$$
where $g_k^i = \nabla F_i(x_k)$. This test involves the true expected gradient and variance, but we can approximate these quantities with sample gradient and variance estimates, respectively, yielding the practical inner product quasi-Newton test:
$$\frac{\operatorname{Var}_{i \in S_k^v}\!\left((g_k^i)^T H_k^2\, g_k^{S_k}\right)}{|S_k|} \le \theta^2 \big\|H_k g_k^{S_k}\big\|^4, \tag{7}$$
where $S_k^v \subseteq S_k$ is a subset of the current sample (batch), and the variance term is defined as
$$\operatorname{Var}_{i \in S_k^v}\!\left((g_k^i)^T H_k^2\, g_k^{S_k}\right) = \frac{\sum_{i \in S_k^v}\left((g_k^i)^T H_k^2\, g_k^{S_k} - \big\|H_k g_k^{S_k}\big\|_2^2\right)^2}{|S_k^v| - 1}. \tag{8}$$
The variance (8) may be computed using just one additional Hessian vector product of $H_k$ with $H_k g_k^{S_k}$. Whenever condition (7) is not satisfied, we increase the sample size $|S_k|$. In order to estimate the increase that would lead to a satisfaction of (7), we reason as follows. If we assume that the new sample $\bar{S}_k$ is such that
$$\big\|H_k g_k^{S_k}\big\| \approx \big\|H_k g_k^{\bar{S}_k}\big\|,$$
and similarly for the variance estimate, then a simple computation shows that a lower bound on the new sample size is
$$|\bar{S}_k| \ge \frac{\operatorname{Var}_{i \in S_k^v}\!\left((g_k^i)^T H_k^2\, g_k^{S_k}\right)}{\theta^2 \big\|H_k g_k^{S_k}\big\|^4} \triangleq b_k. \tag{9}$$
In our implementation of the algorithm, we set the new sample size as $|S_{k+1}| = \lceil b_k \rceil$. When the sample approximation of $\nabla F(x_k)$ is not accurate, which can occur when $|S_k|$ is small, the progressive batching mechanism just described may not be reliable. In this case we employ the moving window technique described in Section 4.2 of Bollapragada et al. (2017), to produce a sample estimate of $\nabla F(x_k)$.
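To make the batch control concrete, the following sketch shows one way the practical test (7) and the increase rule (9) could be implemented in NumPy. It is a minimal illustration under simplifying assumptions, not the authors' implementation: the function names are ours, the variance subsample $S_k^v$ is taken to be all of $S_k$, and `apply_H` stands for whatever routine applies $H_k$ to a vector (e.g. an L-BFGS two-loop recursion).

```python
import numpy as np

def ipqn_batch_control(grads, apply_H, theta=0.9):
    """Check the practical IPQN test (7); if it fails, suggest a new batch size via (9).

    grads   : array of shape (|S_k|, d) holding the per-sample gradients g_k^i, i in S_k
    apply_H : callable v -> H_k v (e.g. an L-BFGS two-loop recursion)
    theta   : variance control parameter of the test
    """
    S = grads.shape[0]
    g_S = grads.mean(axis=0)              # batch gradient g_k^{S_k}
    Hg = apply_H(g_S)                     # H_k g_k^{S_k}
    H2g = apply_H(Hg)                     # H_k^2 g_k^{S_k}: the one extra product mentioned after (8)
    proj = grads @ H2g                    # (g_k^i)^T H_k^2 g_k^{S_k}, here with S_k^v = S_k
    sample_var = proj.var(ddof=1)         # sample variance (8)
    threshold = theta ** 2 * np.linalg.norm(Hg) ** 4
    passed = sample_var / S <= threshold  # test (7)
    b_k = sample_var / threshold          # lower bound (9) on the new sample size
    return passed, int(np.ceil(b_k))
```

In the spirit of Algorithm 1, when the test fails one would augment the current batch with roughly $\lceil b_k \rceil - |S_k|$ fresh indices.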
2.2. The Line Search
In deterministic optimization, line searches are employed
to ensure that the step is not too short and to guarantee
sufficient decrease in the objective function. Line searches
are particularly important in quasi-Newton methods since
they ensure robustness and efficiency of the iteration with
little additional cost.
In contrast, stochastic line searches are poorly understood
and rarely employed in practice because they must make
decisions based on sample function values
$$F^{S_k}(x) = \frac{1}{|S_k|}\sum_{i \in S_k} F_i(x), \tag{10}$$
which are noisy approximations to the true objective $F$. One of the key questions in the design of a stochastic line search is how to ensure, with high probability, that there is a decrease in the true function when one can only observe stochastic approximations $F^{S_k}(x)$. We address this question by proposing a formula for the step size $\alpha_k$ that controls possible increases in the true function. Specifically, the first trial steplength in the stochastic backtracking line search is computed so that the predicted decrease in the expected function value is sufficiently large, as we now explain.

Using Lipschitz continuity of $\nabla F(x)$ and taking conditional expectation, we can show the following inequality
$$\mathbb{E}_k[F_{k+1}] \le F_k - \alpha_k \nabla F(x_k)^T H_k^{1/2} W_k H_k^{1/2} \nabla F(x_k), \tag{11}$$
where
$$W_k = I - \frac{\alpha_k L}{2}\left(1 + \frac{\operatorname{Var}\{H_k g_k^i\}}{|S_k|\,\|H_k \nabla F(x_k)\|^2}\right) H_k, \qquad \operatorname{Var}\{H_k g_k^i\} = \mathbb{E}_k\!\left[\|H_k g_k^i - H_k \nabla F(x_k)\|^2\right],$$
$F_k = F(x_k)$, and $L$ is the Lipschitz constant. The proof of (A.1) is given in the supplement.

The only difference in (A.1) between the deterministic and stochastic quasi-Newton methods is the additional variance term in the matrix $W_k$. To obtain decrease in the function value in the deterministic case, the matrix $I - \frac{\alpha_k L}{2} H_k$ must be positive definite, whereas in the stochastic case the matrix $W_k$ must be positive definite to yield a decrease in $F$ in expectation. In the deterministic case, for a reasonably good quasi-Newton matrix $H_k$, one expects that $\alpha_k = 1$ will result in a decrease in the function, and therefore the initial trial steplength parameter should be chosen to be 1. In the stochastic case, the initial trial value
$$\hat{\alpha}_k = \left(1 + \frac{\operatorname{Var}\{H_k g_k^i\}}{|S_k|\,\|H_k \nabla F(x_k)\|^2}\right)^{-1} \tag{12}$$
will result in decrease in the expected function value. However, since formula (12) involves the expensive computation of the individual matrix-vector products $H_k g_k^i$, we approximate the variance-bias ratio as follows:
$$\bar{\alpha}_k = \left(1 + \frac{\operatorname{Var}\{g_k^i\}}{|S_k|\,\|\nabla F(x_k)\|^2}\right)^{-1}, \tag{13}$$
where $\operatorname{Var}\{g_k^i\} = \mathbb{E}_k\!\left[\|g_k^i - \nabla F(x_k)\|^2\right]$. In our practical implementation, we estimate the population variance and gradient with the sample variance and gradient, respectively, yielding the initial steplength
$$\alpha_k = \left(1 + \frac{\operatorname{Var}_{i \in S_k^v}\{g_k^i\}}{|S_k|\,\big\|g_k^{S_k}\big\|^2}\right)^{-1}, \tag{14}$$
where
$$\operatorname{Var}_{i \in S_k^v}\{g_k^i\} = \frac{1}{|S_k^v| - 1}\sum_{i \in S_k^v}\big\|g_k^i - g_k^{S_k}\big\|^2 \tag{15}$$
and $S_k^v \subseteq S_k$. With this initial value of $\alpha_k$ in hand, our algorithm performs a backtracking line search that aims to satisfy the Armijo condition
$$F^{S_k}\!\left(x_k - \alpha_k H_k g_k^{S_k}\right) \le F^{S_k}(x_k) - c_1 \alpha_k \left(g_k^{S_k}\right)^T H_k g_k^{S_k}, \tag{16}$$
where $c_1 > 0$.
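As an illustration of the steplength selection just described, the sketch below computes the initial trial value (14)-(15) from per-sample gradients and then backtracks on the sampled objective until the Armijo condition (16) holds. This is a minimal NumPy sketch rather than the authors' code; the function names and the choice $S_k^v = S_k$ are assumptions made here for concreteness.

```python
import numpy as np

def initial_steplength(grads):
    """Initial trial steplength (14), using the sample variance (15).

    grads : array of shape (|S_k|, d) with the per-sample gradients g_k^i (here S_k^v = S_k).
    """
    g_S = grads.mean(axis=0)                                 # batch gradient g_k^{S_k}
    var = np.sum((grads - g_S) ** 2) / (grads.shape[0] - 1)  # Var_{i in S_k^v}{g_k^i}
    ratio = var / (grads.shape[0] * np.dot(g_S, g_S))
    return 1.0 / (1.0 + ratio)

def armijo_backtracking(F_S, x, g_S, p, alpha0, c1=1e-4, max_backtracks=20):
    """Backtracking line search on the sampled objective F_S, Armijo condition (16).

    p is the search direction -H_k g_k^{S_k}, so (g_S)^T p = -(g_S)^T H_k g_S.
    """
    f0 = F_S(x)
    slope = np.dot(g_S, p)         # negative for a descent direction
    alpha = alpha0
    for _ in range(max_backtracks):
        if F_S(x + alpha * p) <= f0 + c1 * alpha * slope:
            break
        alpha *= 0.5               # halve the steplength, as in Algorithm 1
    return alpha
```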
2.3. Stable Quasi-Newton Updates
In the BFGS and L-BFGS methods, the inverse Hessian approximation is updated using the formula
$$H_{k+1} = V_k^T H_k V_k + \rho_k s_k s_k^T, \qquad \rho_k = (y_k^T s_k)^{-1}, \qquad V_k = I - \rho_k y_k s_k^T, \tag{17}$$
where $s_k = x_{k+1} - x_k$ and $y_k$ is the difference in the gradients at $x_{k+1}$ and $x_k$. When the batch changes from one iteration to the next ($S_{k+1} \neq S_k$), it is not obvious how $y_k$ should be defined. It has been observed that when $y_k$ is computed using different samples, the updating process may be unstable, and hence it seems natural to use the same sample at the beginning and at the end of the iteration (Schraudolph et al., 2007), and define
$$y_k = g_{k+1}^{S_k} - g_k^{S_k}. \tag{18}$$
However, this requires that the gradient be evaluated twice for every batch $S_k$: at $x_k$ and $x_{k+1}$. To avoid this additional cost, Berahas et al. (2016) propose to use the overlap between consecutive samples in the gradient differencing. If we denote this overlap as $O_k = S_k \cap S_{k+1}$, then one defines
$$y_k = g_{k+1}^{O_k} - g_k^{O_k}. \tag{19}$$
This requires no extra computation since the two gradients in this expression are subsets of the gradients corresponding to the samples $S_k$ and $S_{k+1}$. The overlap should not be too small to avoid differencing noise, but this is easily achieved in practice. We test both formulas for $y_k$ in our implementation of the method; see Section 4.
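A minimal sketch of the overlap-based curvature pair (19), together with the curvature-skip condition (20) used later in Algorithm 1, might look as follows. The helper names (`per_sample_grad`, `overlap_idx`) are placeholders introduced here for illustration, not part of the authors' code.

```python
import numpy as np

def overlap_curvature_pair(x_k, x_k1, overlap_idx, per_sample_grad):
    """Curvature pair (s_k, y_k) via the multi-batch overlap formula (19).

    x_k, x_k1       : consecutive iterates
    overlap_idx     : indices in the overlap O_k = S_k intersect S_{k+1}
    per_sample_grad : callable (x, idx) -> gradient averaged over the given indices
    """
    s_k = x_k1 - x_k
    y_k = per_sample_grad(x_k1, overlap_idx) - per_sample_grad(x_k, overlap_idx)
    return s_k, y_k

def accept_pair(s_k, y_k, eps=1e-2):
    """Curvature condition (20): store the pair only if y_k^T s_k > eps * ||s_k||^2."""
    return float(y_k @ s_k) > eps * float(s_k @ s_k)
```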
2.4. The Complete Algorithm
The pseudocode of the progressive batching L-BFGS method is given in Algorithm 1. Observe that the limited memory Hessian approximation $H_k$ in Line 8 is independent of the choice of the sample $S_k$. Specifically, $H_k$ is defined by a collection of curvature pairs $\{(s_j, y_j)\}$, where the most recent pair is based on the sample $S_{k-1}$; see Line 14. For the batch size control test (7), we choose $\theta = 0.9$ in the logistic regression experiments, and $\theta$ is a tunable parameter chosen in the interval $[0.9, 3]$ in the neural network experiments. The constant $c_1$ in (16) is set to $c_1 = 10^{-4}$. For L-BFGS, we set the memory as $m = 10$. We skip the quasi-Newton update if the following curvature condition is not satisfied:
$$y_k^T s_k > \epsilon \|s_k\|^2, \quad \text{with } \epsilon = 10^{-2}. \tag{20}$$
The initial Hessian matrix $H_k^0$ in the L-BFGS recursion at each iteration is chosen as $\gamma_k I$, where $\gamma_k = y_k^T s_k / y_k^T y_k$.
3. Convergence Analysis
We now present convergence results for the proposed algorithm, both for strongly convex and nonconvex objective functions. Our emphasis is on analyzing the effect of progressive sampling, and therefore we follow common practice and assume that the steplength in the algorithm is fixed ($\alpha_k = \alpha$), and that the inverse L-BFGS matrix $H_k$ has bounded eigenvalues, i.e.,
$$\Lambda_1 I \preceq H_k \preceq \Lambda_2 I. \tag{21}$$
Algorithm 1 Progressive Batching L-BFGS Method

Input: Initial iterate x_0, initial sample size |S_0|;
Initialization: Set k <- 0
Repeat until convergence:
 1: Sample S_k ⊆ {1, · · · , N} with sample size |S_k|
 2: if condition (7) is not satisfied then
 3:   Compute b_k using (9), and set b̂_k <- ⌈b_k⌉ − |S_k|
 4:   Sample S_+ ⊆ {1, · · · , N} \ S_k with |S_+| = b̂_k
 5:   Set S_k <- S_k ∪ S_+
 6: end if
 7: Compute g_k^{S_k}
 8: Compute p_k = −H_k g_k^{S_k} using the L-BFGS two-loop recursion in (Nocedal & Wright, 1999)
 9: Compute α_k using (14)
10: while the Armijo condition (16) is not satisfied do
11:   Set α_k = α_k / 2
12: end while
13: Compute x_{k+1} = x_k + α_k p_k
14: Compute y_k using (18) or (19)
15: Compute s_k = x_{k+1} − x_k
16: if y_k^T s_k > ε‖s_k‖² then
17:   if the number of stored pairs (y_j, s_j) exceeds m then
18:     Discard the oldest curvature pair (y_j, s_j)
19:   end if
20:   Store the new curvature pair (y_k, s_k)
21: end if
22: Set k <- k + 1
23: Set |S_k| = |S_{k−1}|
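Line 8 of Algorithm 1 applies $H_k$ through the standard L-BFGS two-loop recursion, with the initial scaling $H_k^0 = \gamma_k I$ described in Section 2.4. The following NumPy sketch of that recursion is included for completeness; it follows the textbook form in (Nocedal & Wright, 1999), and the function and variable names are ours.

```python
import numpy as np

def lbfgs_two_loop(g, pairs):
    """Compute p = -H_k g with the L-BFGS two-loop recursion (Line 8 of Algorithm 1).

    g     : current batch gradient g_k^{S_k}
    pairs : list of stored curvature pairs (s_j, y_j), oldest first, at most m of them
    Returns the search direction p_k = -H_k g.
    """
    q = g.copy()
    stored = []
    for s, y in reversed(pairs):            # first loop: newest pair to oldest
        rho = 1.0 / float(y @ s)
        a = rho * float(s @ q)
        stored.append((rho, a, s, y))
        q -= a * y
    if pairs:                               # initial scaling H_k^0 = gamma_k * I
        s, y = pairs[-1]
        gamma = float(y @ s) / float(y @ y)
    else:
        gamma = 1.0
    r = gamma * q
    for rho, a, s, y in reversed(stored):   # second loop: oldest pair to newest
        b = rho * float(y @ r)
        r += (a - b) * s
    return -r
```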
This assumption can be justified both in the convex and nonconvex cases under certain conditions; see (Berahas et al., 2016). We assume that the sample size is controlled by the exact inner product quasi-Newton test (31). This test is designed for efficiency, and in rare situations could allow for the generation of arbitrarily long search directions. To prevent this from happening, we introduce an additional control on the sample size $|S_k|$, by extending (to the quasi-Newton setting) the orthogonality test introduced in (Bollapragada et al., 2017). This additional requirement states that the current sample size $|S_k|$ is acceptable only if
$$\frac{\mathbb{E}_k\!\left[\left\|H_k g_k^i - \frac{(H_k g_k^i)^T (H_k \nabla F(x_k))}{\|H_k \nabla F(x_k)\|^2}\, H_k \nabla F(x_k)\right\|^2\right]}{|S_k|} \le \nu^2 \|H_k \nabla F(x_k)\|^2, \tag{22}$$
for some given $\nu > 0$.
We now establish linear convergence when the objective is strongly convex.

Theorem 3.1. Suppose that $F$ is twice continuously differentiable and that there exist constants $0 < \mu \le L$ such that
$$\mu I \preceq \nabla^2 F(x) \preceq L I, \quad \forall x \in \mathbb{R}^d. \tag{23}$$
Let $\{x_k\}$ be generated by iteration (29), for any $x_0$, where $|S_k|$ is chosen by the (exact variance) inner product quasi-Newton test (31). Suppose that the orthogonality condition (32) holds at every iteration, and that the matrices $H_k$ satisfy (B.2). Then, if
$$\alpha_k = \alpha \le \frac{1}{(1 + \theta^2 + \nu^2) L \Lambda_2}, \tag{24}$$
we have that
$$\mathbb{E}[F(x_k) - F(x^*)] \le \rho^k \left(F(x_0) - F(x^*)\right), \tag{25}$$
where $x^*$ denotes the minimizer of $F$, $\rho = 1 - \mu \Lambda_1 \alpha$, and $\mathbb{E}$ denotes the total expectation.
The proof of this result is given in the supplement. We now consider the case when $F$ is nonconvex and bounded below.

Theorem 3.2. Suppose that $F$ is twice continuously differentiable and bounded below, and that there exists a constant $L > 0$ such that
$$\nabla^2 F(x) \preceq L I, \quad \forall x \in \mathbb{R}^d. \tag{26}$$
Let $\{x_k\}$ be generated by iteration (29), for any $x_0$, where $|S_k|$ is chosen so that (31) and (32) are satisfied, and suppose that (B.2) holds. Then, if $\alpha_k$ satisfies (36), we have
$$\lim_{k \to \infty} \mathbb{E}\!\left[\|\nabla F(x_k)\|^2\right] = 0. \tag{27}$$
Moreover, for any positive integer $T$ we have that
$$\min_{0 \le k \le T-1} \mathbb{E}\!\left[\|\nabla F(x_k)\|^2\right] \le \frac{2}{\alpha T \Lambda_1}\left(F(x_0) - F_{\min}\right),$$
where $F_{\min}$ is a lower bound on $F$ in $\mathbb{R}^d$.

The proof is given in the supplement. This result shows that the sequence of gradients $\{\|\nabla F(x_k)\|\}$ converges to zero in expectation, and establishes a global sublinear rate of convergence of the smallest gradients generated after every $T$ steps.
4. Numerical Results
In this section, we present numerical results for the proposed
algorithm, which we refer to as PBQN for the
P
rogressive
Batching Quasi-Newton algorithm.
4.1. Experiments on Logistic Regression Problems
We first test our algorithm on binary classification problems
where the objective function is given by the logistic loss
with $\ell_2$ regularization:
$$R(x) = \frac{1}{N}\sum_{i=1}^{N} \log\!\big(1 + \exp(-z_i\, x^T y_i)\big) + \frac{\lambda}{2}\|x\|^2, \tag{28}$$
with $\lambda = 1/N$. We consider the 8 datasets listed in the supplement. An approximation $R^*$ of the optimal function value is computed for each problem by running the full batch L-BFGS method until $\|\nabla R(x_k)\| \le 10^{-8}$. Training error is defined as $R(x_k) - R^*$, where $R(x_k)$ is evaluated over the training set; test loss is evaluated over the test set without the $\ell_2$ regularization term.
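For concreteness, a minimal NumPy sketch of the regularized logistic objective (28) and its gradient is given below. It assumes the labels take values in {-1, +1} and that the rows of `Z` are the feature vectors; the function name is ours, and the expression is not numerically stabilized.

```python
import numpy as np

def logistic_objective_and_grad(x, Z, y, lam):
    """Regularized logistic loss (28) and its gradient.

    Z   : (N, d) matrix whose rows are the feature vectors
    y   : (N,) labels in {-1, +1}
    lam : regularization parameter, lambda = 1/N in the experiments
    """
    margins = y * (Z @ x)                         # y_i * x^T z_i for each sample
    # per-sample loss log(1 + exp(-margin)); not numerically stabilized (illustration only)
    loss = np.mean(np.log1p(np.exp(-margins))) + 0.5 * lam * np.dot(x, x)
    # d/dx log(1 + exp(-m_i)) = -y_i * z_i * sigmoid(-m_i)
    coeff = -y / (1.0 + np.exp(margins))
    grad = Z.T @ coeff / Z.shape[0] + lam * x
    return loss, grad
```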
We tested two options for computing the curvature vector
yk
in the PBQN method: the multi-batch (MB) approach
(19)
with 25% sample overlap, and the full overlap (FO)
approach
(18)
. We set
θ= 0.9
in
(7)
, chose
|S0|= 512
, and
set all other parameters to the default values given in Sec-
tion 2. Thus, none of the parameters in our PBQN method
were tuned for each individual dataset. We compared our
algorithm against two other methods: (i) Stochastic gradient
(SG) with a batch size of 1; (ii) SVRG (Johnson & Zhang,
2013) with the inner loop length set to
N
. The steplength
for SG and SVRG is constant and tuned for each problem
($\alpha_k \equiv \alpha = 2^j$, for $j \in \{-10, -9, \ldots, 9, 10\}$) so as to give best performance.
In Figures 1 and 2 we present results for two datasets, spam and covertype; the rest of the results are given in the supplement. The horizontal axis measures the number of
supplement. The horizontal axis measures the number of
full gradient evaluations, or equivalently, the number of
times that
N
component gradients
Fi
were evaluated. The
left-most figure reports the long term trend over 100 gradient
evaluations, while the rest of the figures zoom into the first
10 gradient evaluations to show the initial behavior of the
methods. The vertical axis measures training error, test loss,
and test accuracy, respectively, from left to right.
The proposed algorithm competes well for these two
datasets in terms of training error, test loss and test accuracy,
and decreases these measures more evenly than the SG and
SVRG. Our numerical experience indicates that formula
(14)
is quite effective at estimating the steplength parameter,
as it is accepted by the backtracking line search for most
iterations. As a result, the line search computes very few
additional function values.
It is interesting to note that SVRG is not as efficient as PBQN or SG in the initial epochs, when measured in terms of either test loss or test accuracy. The training error for SVRG decreases rapidly in later epochs, but this rapid improvement is not observed in the test loss and accuracy.
Neither the PBQN nor SVRG significantly outperforms the
other across all datasets tested in terms of training error, as
observed in the supplement.
Our results indicate that defining the curvature vector using
the MB approach is preferable to using the FO approach.
The number of iterations required by the PBQN method is significantly smaller compared to the SG method, suggesting the potential efficiency gains of a parallel implementation of our algorithm.
[Figure 1 here: training error, test loss, and test accuracy plotted against gradient evaluations for SG, SVRG, PBQN-MB, and PBQN-FO.]

Figure 1. spam dataset: Performance of the progressive batching L-BFGS method (PBQN), with multi-batch (25% overlap) and full-overlap approaches, and the SG and SVRG methods.
[Figure 2 here: training error, test loss, and test accuracy plotted against gradient evaluations for SG, SVRG, PBQN-MB, and PBQN-FO.]

Figure 2. covertype dataset: Performance of the progressive batching L-BFGS methods, with multi-batch (25% overlap) and full-overlap approaches, and the SG and SVRG methods.
4.2. Results on Neural Networks
We have performed a preliminary investigation into the
performance of the PBQN algorithm for training neural
networks. As is well-known, the resulting optimization
problems are quite difficult due to the existence of local
minimizers, some of which generalize poorly. Thus our first
requirement when applying the PBQN method was to obtain
generalization as good as that of SG, something we have achieved.
Our investigation into how to obtain fast performance is,
however, still underway for reasons discussed below. Never-
theless, our results are worth reporting because they show
that our line search procedure is performing as expected, and
that the overall number of iterations required by the PBQN
method is small enough so that a parallel implementation
could yield state-of-the-art results, based on the theoretical
performance model detailed in the supplement.
We compared our algorithm, as described in Section 2,
against SG and Adam (Kingma & Ba,2014). It has taken
many years to design regularization techniques and heuris-
tics that greatly improve the performance of the SG method
for deep learning (Srivastava et al.,2014;Ioffe & Szegedy,
2015). These include batch normalization and dropout,
which (in their current form) are not conducive to the PBQN
approach due to the need for gradient consistency when
evaluating the curvature pairs in L-BFGS. Therefore, we do
not implement batch normalization and dropout in any of
the methods tested, and leave the study of their extension to
the PBQN setting for future work.
We consider three network architectures: (i) a small convolu-
tional neural network on CIFAR-10 (
C
) (Krizhevsky,2009),
(ii) an AlexNet-like convolutional network on MNIST
and CIFAR-10 (
A1,A2
, respectively) (LeCun et al.,1998;
Krizhevsky et al.,2012), and (iii) a residual network
(ResNet18) on CIFAR-10 (
R
) (He et al.,2016). The net-
work architecture details and additional plots are given in
the supplement. All of these networks were implemented in
PyTorch (Paszke et al.,2017). The results for the CIFAR-
10 AlexNet and CIFAR-10 ResNet18 are given in Figures
3 and 4, respectively. We report results both against the
total number of iterations and the total number of gradient
evaluations. Table 1 shows the best test accuracies attained
by each of the four methods over the various networks.
In all our experiments, we initialize the batch size as $|S_0| = 512$ in the PBQN method, and fix the batch size to $|S_k| = 128$ for SG and Adam. The parameter $\theta$ given in (7), which controls the batch size increase in the PBQN method, was tuned lightly by choosing among the 3 values: 0.9, 2, 3. SG and Adam are tuned using a development-based decay (dev-decay) scheme, which tracks the best validation loss at each epoch and reduces the steplength by a constant factor $\delta$ if the validation loss does not improve after $e$ epochs.
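The dev-decay scheme can be summarized by a small scheduler such as the sketch below. The class name and the default values of the decay factor and the patience are placeholder assumptions; the specific values of $\delta$ and $e$ used in the experiments are not stated in this section.

```python
class DevDecay:
    """Sketch of a development-based decay (dev-decay) steplength schedule:
    track the best validation loss seen so far and multiply the steplength by
    delta when it has not improved for patience_e consecutive epochs."""

    def __init__(self, alpha0, delta=0.5, patience_e=1):
        self.alpha = alpha0          # current steplength
        self.delta = delta           # multiplicative decay factor (placeholder value)
        self.patience_e = patience_e # epochs without improvement before decaying (placeholder)
        self.best = float("inf")
        self.stall = 0

    def step(self, val_loss):
        """Call once per epoch with the current validation loss; returns the steplength."""
        if val_loss < self.best:
            self.best = val_loss
            self.stall = 0
        else:
            self.stall += 1
            if self.stall >= self.patience_e:
                self.alpha *= self.delta
                self.stall = 0
        return self.alpha
```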
We observe from our results that the PBQN method achieves
a similar test accuracy as SG and Adam, but requires more
gradient evaluations. Improvements in performance can be
obtained by ensuring that the PBQN method exerts a finer
control on the sample size in the small batch regime — some-
thing that requires further investigation. Nevertheless, the small number of iterations required by the PBQN method, together with the fact that it employs larger batch sizes than SG during much of the run, suggests that a distributed version similar to a data-parallel distributed implementation of the SG method (Chen et al., 2016; Das et al., 2016) would lead to a highly competitive method.
[Figure 3 here: training loss, test loss, and test accuracy plotted against iterations, and test accuracy plotted against gradient evaluations, for SG, Adam, PBQN-MB, and PBQN-FO.]

Figure 3. CIFAR-10 AlexNet (A2): Performance of the progressive batching L-BFGS methods, with multi-batch (25% overlap) and full-overlap approaches, and the SG and Adam methods. The best results for L-BFGS are achieved with θ = 0.9.
[Figure 4 here: training loss, test loss, and test accuracy plotted against iterations for SG, Adam, PBQN-MB, and PBQN-FO.]

Figure 4. CIFAR-10 ResNet18 (R): Performance of the progressive batching L-BFGS methods, with multi-batch (25% overlap) and full-overlap approaches, and the SG and Adam methods. The best results for L-BFGS are achieved with θ = 2.
Table 1. Best test accuracy performance of SG, Adam, multi-batch L-BFGS (MB), and full-overlap L-BFGS (FO) on various networks over 5 different runs and initializations.

Network | SG    | Adam  | MB    | FO
C       | 66.24 | 67.03 | 67.37 | 62.46
A1      | 99.25 | 99.34 | 99.16 | 99.05
A2      | 73.46 | 73.59 | 73.02 | 72.74
R       | 69.5  | 70.16 | 70.28 | 69.44
Similar to the logistic regression case, we observe that the steplength computed via (14) is almost always accepted by the Armijo condition, and typically lies within (0.1, 1). Once the algorithm has trained for a significant number of iterations using full-batch, the algorithm begins to overfit on the training set, resulting in worsened test loss and accuracy, as observed in the graphs.
5. Final Remarks
Several types of quasi-Newton methods have been proposed
in the literature to address the challenges arising in ma-
chine learning. Some of these method operate in the purely
stochastic setting (which makes quasi-Newton updating dif-
ficult) or in the purely batch regime (which leads to general-
ization problems). We believe that progressive batching is
the right context for designing an L-BFGS method that has
good generalization properties, does not expose any free pa-
rameters, and has fast convergence. The advantages of our
approach are clearly seen in logistic regression experiments.
To make the new method competitive with SG and Adam
for deep learning, we need to improve several of its compo-
nents. This includes the design of a more robust progressive
batching mechanism, the redesign of batch normalization
and dropout heuristics to improve the generalization perfor-
mance of our method for training larger networks, and most
importantly, the design of a parallelized implementation that
takes advantage of the higher granularity of each iteration.
We believe that the potential of the proposed approach as
an alternative to SG for deep learning is worthy of further
investigation.
Acknowledgements
We thank Albert Berahas for his insightful comments re-
garding multi-batch L-BFGS and probabilistic line searches,
as well as for his useful feedback on earlier versions of
the manuscript. We also thank the anonymous reviewers
for their useful feedback. Bollapragada is supported by
DOE award DE-FG02-87ER25047. Nocedal is supported
by NSF award DMS-1620070. Shi is supported by Intel
grant SP0036122.
References
Ba, J., Grosse, R., and Martens, J. Distributed second-order
optimization using kronecker-factored approximations.
2016.
Balles, L., Romero, J., and Hennig, P. Coupling adap-
tive batch sizes with learning rates. arXiv preprint
arXiv:1612.05086, 2016.
Berahas, A. S. and Takáč, M. A robust multi-batch L-BFGS method for machine learning. arXiv preprint arXiv:1707.08552, 2017.
Berahas, A. S., Nocedal, J., and Takáč, M. A multi-batch L-BFGS method for machine learning. In Advances in Neural Information Processing Systems, pp. 1055–1063, 2016.
Bertsekas, D. P., Nedić, A., and Ozdaglar, A. E. Convex Analysis and Optimization. Athena Scientific, Belmont, 2003.
Bollapragada, R., Byrd, R., and Nocedal, J. Exact and
inexact subsampled Newton methods for optimization.
arXiv preprint arXiv:1609.08502, 2016.
Bollapragada, R., Byrd, R., and Nocedal, J. Adaptive sam-
pling strategies for stochastic optimization. arXiv preprint
arXiv:1710.11258, 2017.
Byrd, R. H., Chin, G. M., Nocedal, J., and Wu, Y. Sam-
ple size selection in optimization methods for machine
learning. Mathematical Programming, 134(1):127–155,
2012.
Byrd, R. H., Hansen, S. L., Nocedal, J., and Singer, Y. A
stochastic quasi-Newton method for large-scale optimiza-
tion. SIAM Journal on Optimization, 26(2):1008–1031,
2016.
Carbonetto, P. New probabilistic inference algorithms that
harness the strengths of variational and Monte Carlo
methods. PhD thesis, University of British Columbia,
2009.
Cartis, C. and Scheinberg, K. Global convergence rate
analysis of unconstrained optimization methods based on
probabilistic models. Mathematical Programming, pp.
1–39, 2015.
Chang, C. and Lin, C. LIBSVM: A library for support vector
machines. ACM Transactions on Intelligent Systems and
Technology, 2:27:1–27:27, 2011. Software available at
http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Chen, J., Monga, R., Bengio, S., and Jozefowicz, R. Re-
visiting distributed synchronous sgd. arXiv preprint
arXiv:1604.00981, 2016.
Cormack, G. and Lynam, T. Spam corpus creation for TREC.
In Proc. 2nd Conference on Email and Anti-Spam, 2005.
http://plg.uwaterloo.ca/gvcormac/treccorpus.
Curtis, F. A self-correcting variable-metric algorithm for
stochastic optimization. In International Conference on
Machine Learning, pp. 632–641, 2016.
Das, D., Avancha, S., Mudigere, D., Vaidynathan, K., Srid-
haran, S., Kalamkar, D., Kaul, B., and Dubey, P. Dis-
tributed deep learning using synchronous stochastic gra-
dient descent. arXiv preprint arXiv:1602.06709, 2016.
De, S., Yadav, A., Jacobs, D., and Goldstein, T. Automated
inference with adaptive batches. In Artificial Intelligence
and Statistics, pp. 1504–1513, 2017.
Devarakonda, A., Naumov, M., and Garland, M. Adabatch:
Adaptive batch sizes for training deep neural networks.
arXiv preprint arXiv:1712.02029, 2017.
Dinh, L., Pascanu, R., Bengio, S., and Bengio, Y. Sharp
minima can generalize for deep nets. arXiv preprint
arXiv:1703.04933, 2017.
Friedlander, M. P. and Schmidt, M. Hybrid deterministic-
stochastic methods for data fitting. SIAM Journal on
Scientific Computing, 34(3):A1380–A1405, 2012.
Goyal, P., Dollár, P., Girshick, R., Noordhuis, P.,
Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and
He, K. Accurate, large minibatch sgd: Training imagenet
in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
Grosse, R. and Martens, J. A kronecker-factored approxi-
mate fisher matrix for convolution layers. In International
Conference on Machine Learning, pp. 573–582, 2016.
Guyon, I., Aliferis, C. F., Cooper, G. F., Elisseeff, A., Pellet,
J., Spirtes, P., and Statnikov, A. R. Design and analy-
sis of the causation and prediction challenge. In WCCI
Causation and Prediction Challenge, pp. 1–33, 2008.
Hardt, M., Recht, B., and Singer, Y. Train faster, generalize
better: Stability of stochastic gradient descent. arXiv
preprint arXiv:1509.01240, 2015.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learn-
ing for image recognition. In Proceedings of the IEEE
conference on computer vision and pattern recognition,
pp. 770–778, 2016.
Hoffer, E., Hubara, I., and Soudry, D. Train longer,
generalize better: closing the generalization gap in
large batch training of neural networks. arXiv preprint
arXiv:1705.08741, 2017.
Ioffe, S. and Szegedy, C. Batch normalization: Accelerating
deep network training by reducing internal covariate shift.
In International conference on machine learning, pp. 448–
456, 2015.
Johnson, R. and Zhang, T. Accelerating stochastic gradient
descent using predictive variance reduction. In Advances
in Neural Information Processing Systems 26, pp. 315–
323, 2013.
Keskar, N. S. and Berahas, A. S. adaqn: An adaptive quasi-
newton algorithm for training rnns. In Joint European
Conference on Machine Learning and Knowledge Dis-
covery in Databases, pp. 1–16. Springer, 2016.
Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy,
M., and Tang, P. T. P. On large-batch training for deep
learning: Generalization gap and sharp minima. arXiv
preprint arXiv:1609.04836, 2016.
Kingma, D. and Ba, J. Adam: A method for stochastic
optimization. arXiv preprint arXiv:1412.6980, 2014.
Krishnan, S., Xiao, Y., and Saurous, R. A. Neumann opti-
mizer: A practical optimization algorithm for deep neural
networks. arXiv preprint arXiv:1712.03298, 2017.
Krizhevsky, A. Learning multiple layers of features from
tiny images. 2009.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet
classification with deep convolutional neural networks.
In Advances in Neural Information Processing Systems,
pp. 1097–1105, 2012.
Kurth, T., Zhang, J., Satish, N., Racah, E., Mitliagkas, I.,
Patwary, M. M. A., Malas, T., Sundaram, N., Bhimji, W.,
Smorkalov, M., et al. Deep learning at 15pf: Supervised
and semi-supervised classification for scientific data. In
Proceedings of the International Conference for High Per-
formance Computing, Networking, Storage and Analysis,
pp. 7. ACM, 2017.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-
based learning applied to document recognition. Proceed-
ings of the IEEE, 86(11):2278–2324, 1998.
Liu, D. C. and Nocedal, J. On the limited memory bfgs
method for large scale optimization. Mathematical pro-
gramming, 45(1-3):503–528, 1989.
Martens, J. and Grosse, R. Optimizing neural networks with
kronecker-factored approximate curvature. In Interna-
tional Conference on Machine Learning, pp. 2408–2417,
2015.
Mokhtari, A. and Ribeiro, A. Global convergence of on-
line limited memory bfgs. Journal of Machine Learning
Research, 16(1):3151–3181, 2015.
Nocedal, J. and Wright, S. Numerical Optimization.
Springer New York, 2 edition, 1999.
Pasupathy, R., Glynn, P., Ghosh, S., and Hashemi, F. S. On
sampling rates in stochastic recursions. 2015. Under
Review.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E.,
DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer,
A. Automatic differentiation in pytorch. 2017.
Robbins, H. and Monro, S. A stochastic approximation
method. The annals of mathematical statistics, pp. 400–
407, 1951.
Roosta-Khorasani, F. and Mahoney, M. W. Sub-sampled
Newton methods II: Local convergence rates. arXiv
preprint arXiv:1601.04738, 2016a.
Roosta-Khorasani, F. and Mahoney, M. W. Sub-sampled
Newton methods I: Globally convergent algorithms.
arXiv preprint arXiv:1601.04737, 2016b.
Schraudolph, N. N., Yu, J., and Günter, S. A stochastic
quasi-newton method for online convex optimization. In
International Conference on Artificial Intelligence and
Statistics, pp. 436–443, 2007.
Smith, S. L., Kindermans, P., and Le, Q. V. Don’t decay
the learning rate, increase the batch size. arXiv preprint
arXiv:1711.00489, 2017.
Sohl-Dickstein, J., Poole, B., and Ganguli, S. Fast large-
scale optimization by unifying stochastic gradient and
quasi-Newton methods. In International Conference on
Machine Learning, pp. 604–612, 2014.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I.,
and Salakhutdinov, R. Dropout: A simple way to prevent
neural networks from overfitting. The Journal of Machine
Learning Research, 15(1):1929–1958, 2014.
You, Y., Gitman, I., and Ginsburg, B. Scaling sgd
batch size to 32k for imagenet training. arXiv preprint
arXiv:1708.03888, 2017.
Zhou, C., Gao, W., and Goldfarb, D. Stochastic adaptive
quasi-Newton methods for minimizing expected values.
In International Conference on Machine Learning, pp.
4150–4159, 2017.
A. Initial Step Length Derivation
To establish our results, recall that the stochastic quasi-Newton method is defined as
$$x_{k+1} = x_k - \alpha_k H_k g_k^{S_k}, \tag{29}$$
where the batch (or subsampled) gradient is given by
$$g_k^{S_k} = \nabla F^{S_k}(x_k) = \frac{1}{|S_k|}\sum_{i \in S_k} \nabla F_i(x_k), \tag{30}$$
and the set $S_k \subset \{1, 2, \cdots\}$ indexes data points $(y_i, z_i)$. The algorithm selects the Hessian approximation $H_k$ through quasi-Newton updating prior to selecting the new sample $S_k$ to define the search direction $p_k$. We will use $\mathbb{E}_k$ to denote the conditional expectation at $x_k$ and use $\mathbb{E}$ to denote the total expectation.

The primary theoretical mechanism for determining batch sizes is the exact variance inner product quasi-Newton (IPQN) test, which is defined as
$$\frac{\mathbb{E}_k\!\left[\left((H_k \nabla F(x_k))^T (H_k g_k^i) - \|H_k \nabla F(x_k)\|^2\right)^2\right]}{|S_k|} \le \theta^2 \|H_k \nabla F(x_k)\|^4. \tag{31}$$
We establish the inequality used to determine the initial steplength $\alpha_k$ for the stochastic line search.

Lemma A.1. Assume that $F$ is continuously differentiable with Lipschitz continuous gradient with Lipschitz constant $L$. Then
$$\mathbb{E}_k[F(x_{k+1})] \le F(x_k) - \alpha_k \nabla F(x_k)^T H_k^{1/2} W_k H_k^{1/2} \nabla F(x_k),$$
where
$$W_k = I - \frac{\alpha_k L}{2}\left(1 + \frac{\operatorname{Var}\{H_k g_k^i\}}{|S_k|\,\|H_k \nabla F(x_k)\|^2}\right) H_k,$$
and $\operatorname{Var}\{H_k g_k^i\} = \mathbb{E}_k\!\left[\|H_k g_k^i - H_k \nabla F(x_k)\|^2\right]$.

Proof. By Lipschitz continuity of the gradient, we have that
$$\begin{aligned}
\mathbb{E}_k[F(x_{k+1})] &\le F(x_k) - \alpha_k \nabla F(x_k)^T H_k\, \mathbb{E}_k\!\left[g_k^{S_k}\right] + \frac{L \alpha_k^2}{2}\, \mathbb{E}_k\!\left[\|H_k g_k^{S_k}\|^2\right] \\
&= F(x_k) - \alpha_k \nabla F(x_k)^T H_k \nabla F(x_k) + \frac{L \alpha_k^2}{2}\left(\|H_k \nabla F(x_k)\|^2 + \mathbb{E}_k\!\left[\|H_k g_k^{S_k} - H_k \nabla F(x_k)\|^2\right]\right) \\
&\le F(x_k) - \alpha_k \nabla F(x_k)^T H_k \nabla F(x_k) + \frac{L \alpha_k^2}{2}\left(\|H_k \nabla F(x_k)\|^2 + \frac{\operatorname{Var}\{H_k g_k^i\}}{|S_k|\,\|H_k \nabla F(x_k)\|^2}\,\|H_k \nabla F(x_k)\|^2\right) \\
&= F(x_k) - \alpha_k \nabla F(x_k)^T H_k^{1/2}\left[I - \frac{\alpha_k L}{2}\left(1 + \frac{\operatorname{Var}\{H_k g_k^i\}}{|S_k|\,\|H_k \nabla F(x_k)\|^2}\right) H_k\right] H_k^{1/2} \nabla F(x_k) \\
&= F(x_k) - \alpha_k \nabla F(x_k)^T H_k^{1/2} W_k H_k^{1/2} \nabla F(x_k).
\end{aligned}$$
B. Convergence Analysis
For the rest of our analysis, we make the following two assumptions.

Assumption B.1. The orthogonality condition is satisfied for all $k$, i.e.,
$$\frac{\mathbb{E}_k\!\left[\left\|H_k g_k^i - \frac{(H_k g_k^i)^T (H_k \nabla F(x_k))}{\|H_k \nabla F(x_k)\|^2}\, H_k \nabla F(x_k)\right\|^2\right]}{|S_k|} \le \nu^2 \|H_k \nabla F(x_k)\|^2, \tag{32}$$
for some large $\nu > 0$.
Assumption B.2. The eigenvalues of $H_k$ are contained in an interval in $\mathbb{R}^+$, i.e., for all $k$ there exist constants $\Lambda_2 \ge \Lambda_1 > 0$ such that
$$\Lambda_1 I \preceq H_k \preceq \Lambda_2 I. \tag{33}$$
Condition (32) ensures that the stochastic quasi-Newton direction is bounded away from orthogonality to $H_k \nabla F(x_k)$, with high probability, and prevents the variance of the individual quasi-Newton directions from being too large relative to the variance of the individual quasi-Newton directions along $H_k \nabla F(x_k)$. Assumption B.2 holds, for example, when $F$ is convex and a regularization parameter is included so that any subsampled Hessian $\nabla^2 F^S(x)$ is positive definite. It can also be shown to hold in the nonconvex case by applying cautious BFGS updating, e.g., by updating $H_k$ only when $y_k^T s_k \ge \epsilon \|s_k\|_2^2$, where $\epsilon > 0$ is a predetermined constant (Berahas et al., 2016).
We begin by establishing a technical descent lemma.

Lemma B.3. Suppose that $F$ is twice continuously differentiable and that there exists a constant $L > 0$ such that
$$\nabla^2 F(x) \preceq L I, \quad \forall x \in \mathbb{R}^d. \tag{34}$$
Let $\{x_k\}$ be generated by iteration (29) for any $x_0$, where $|S_k|$ is chosen by the (exact variance) inner product quasi-Newton test (31) for a given constant $\theta > 0$, and suppose that Assumptions (B.1) and (B.2) hold. Then, for any $k$,
$$\mathbb{E}_k\!\left[\|H_k g_k^{S_k}\|^2\right] \le (1 + \theta^2 + \nu^2)\,\|H_k \nabla F(x_k)\|^2. \tag{35}$$
Moreover, if $\alpha_k$ satisfies
$$\alpha_k = \alpha \le \frac{1}{(1 + \theta^2 + \nu^2) L \Lambda_2}, \tag{36}$$
we have that
$$\mathbb{E}_k[F(x_{k+1})] \le F(x_k) - \frac{\alpha}{2}\,\|H_k^{1/2} \nabla F(x_k)\|^2. \tag{37}$$
Proof. By Assumption (B.1), the orthogonality condition, we have that
$$\begin{aligned}
\mathbb{E}_k\!\left[\left\|H_k g_k^{S_k} - \frac{(H_k g_k^{S_k})^T (H_k \nabla F(x_k))}{\|H_k \nabla F(x_k)\|^2}\, H_k \nabla F(x_k)\right\|^2\right]
&\le \frac{\mathbb{E}_k\!\left[\left\|H_k g_k^i - \frac{(H_k g_k^i)^T (H_k \nabla F(x_k))}{\|H_k \nabla F(x_k)\|^2}\, H_k \nabla F(x_k)\right\|^2\right]}{|S_k|} \qquad (38) \\
&\le \nu^2 \|H_k \nabla F(x_k)\|^2.
\end{aligned}$$
Now, expanding the left hand side of inequality (38), we get
$$\begin{aligned}
\mathbb{E}_k\!\left[\left\|H_k g_k^{S_k} - \frac{(H_k g_k^{S_k})^T (H_k \nabla F(x_k))}{\|H_k \nabla F(x_k)\|^2}\, H_k \nabla F(x_k)\right\|^2\right]
&= \mathbb{E}_k\!\left[\|H_k g_k^{S_k}\|^2\right] - 2\,\frac{\mathbb{E}_k\!\left[\left((H_k g_k^{S_k})^T (H_k \nabla F(x_k))\right)^2\right]}{\|H_k \nabla F(x_k)\|^2} + \frac{\mathbb{E}_k\!\left[\left((H_k g_k^{S_k})^T (H_k \nabla F(x_k))\right)^2\right]}{\|H_k \nabla F(x_k)\|^2} \\
&= \mathbb{E}_k\!\left[\|H_k g_k^{S_k}\|^2\right] - \frac{\mathbb{E}_k\!\left[\left((H_k g_k^{S_k})^T (H_k \nabla F(x_k))\right)^2\right]}{\|H_k \nabla F(x_k)\|^2} \\
&\le \nu^2 \|H_k \nabla F(x_k)\|^2.
\end{aligned}$$
Therefore, rearranging gives the inequality
$$\mathbb{E}_k\!\left[\|H_k g_k^{S_k}\|^2\right] \le \frac{\mathbb{E}_k\!\left[\left((H_k g_k^{S_k})^T (H_k \nabla F(x_k))\right)^2\right]}{\|H_k \nabla F(x_k)\|^2} + \nu^2 \|H_k \nabla F(x_k)\|^2. \tag{39}$$
To bound the first term on the right side of this inequality, we use the inner product quasi-Newton test; in particular, $|S_k|$ satisfies
$$\mathbb{E}_k\!\left[\left((H_k \nabla F(x_k))^T (H_k g_k^{S_k}) - \|H_k \nabla F(x_k)\|^2\right)^2\right] \le \frac{\mathbb{E}_k\!\left[\left((H_k \nabla F(x_k))^T (H_k g_k^i) - \|H_k \nabla F(x_k)\|^2\right)^2\right]}{|S_k|} \le \theta^2 \|H_k \nabla F(x_k)\|^4, \tag{40}$$
where the second inequality holds by the IPQN test. Since
$$\mathbb{E}_k\!\left[\left((H_k \nabla F(x_k))^T (H_k g_k^{S_k}) - \|H_k \nabla F(x_k)\|^2\right)^2\right] = \mathbb{E}_k\!\left[\left((H_k \nabla F(x_k))^T (H_k g_k^{S_k})\right)^2\right] - \|H_k \nabla F(x_k)\|^4, \tag{41}$$
we have
$$\mathbb{E}_k\!\left[\left((H_k g_k^{S_k})^T (H_k \nabla F(x_k))\right)^2\right] \le \|H_k \nabla F(x_k)\|^4 + \theta^2 \|H_k \nabla F(x_k)\|^4 = (1 + \theta^2)\,\|H_k \nabla F(x_k)\|^4, \tag{42}$$
by (40) and (41). Substituting (42) into (39), we get the following bound on the length of the search direction:
$$\mathbb{E}_k\!\left[\|H_k g_k^{S_k}\|^2\right] \le (1 + \theta^2 + \nu^2)\,\|H_k \nabla F(x_k)\|^2,$$
which proves (35). Using this inequality, Assumption B.2, and the bounds on the Hessian and steplength (34) and (36), we have
$$\begin{aligned}
\mathbb{E}_k[F(x_{k+1})] &\le F(x_k) - \mathbb{E}_k\!\left[\alpha\, (H_k g_k^{S_k})^T \nabla F(x_k)\right] + \mathbb{E}_k\!\left[\frac{L \alpha^2}{2}\,\|H_k g_k^{S_k}\|^2\right] \\
&= F(x_k) - \alpha \nabla F(x_k)^T H_k \nabla F(x_k) + \frac{L \alpha^2}{2}\,\mathbb{E}_k\!\left[\|H_k g_k^{S_k}\|^2\right] \\
&\le F(x_k) - \alpha \nabla F(x_k)^T H_k \nabla F(x_k) + \frac{L \alpha^2}{2}(1 + \theta^2 + \nu^2)\,\|H_k \nabla F(x_k)\|^2 \\
&= F(x_k) - \alpha\, (H_k^{1/2} \nabla F(x_k))^T \left[I - \frac{L \alpha (1 + \theta^2 + \nu^2)}{2}\, H_k\right] H_k^{1/2} \nabla F(x_k) \\
&\le F(x_k) - \alpha \left(1 - \frac{L \Lambda_2 \alpha (1 + \theta^2 + \nu^2)}{2}\right)\|H_k^{1/2} \nabla F(x_k)\|^2 \\
&\le F(x_k) - \frac{\alpha}{2}\,\|H_k^{1/2} \nabla F(x_k)\|^2.
\end{aligned}$$
We now show that the stochastic quasi-Newton iteration (29) with a fixed steplength $\alpha$ is linearly convergent when $F$ is strongly convex. In the following discussion, $x^*$ denotes the minimizer of $F$.

Theorem B.4. Suppose that $F$ is twice continuously differentiable and that there exist constants $0 < \mu \le L$ such that
$$\mu I \preceq \nabla^2 F(x) \preceq L I, \quad \forall x \in \mathbb{R}^d. \tag{43}$$
Let $\{x_k\}$ be generated by iteration (29), for any $x_0$, where $|S_k|$ is chosen by the (exact variance) inner product quasi-Newton test (31), and suppose that Assumptions (B.1) and (B.2) hold. Then, if $\alpha_k$ satisfies (36), we have that
$$\mathbb{E}[F(x_k) - F(x^*)] \le \rho^k \left(F(x_0) - F(x^*)\right), \tag{44}$$
where $x^*$ denotes the minimizer of $F$, and $\rho = 1 - \mu \Lambda_1 \alpha$.

Proof. It is well-known (Bertsekas et al., 2003) that for strongly convex functions,
$$\|\nabla F(x_k)\|^2 \ge 2\mu\,[F(x_k) - F(x^*)].$$
Substituting this into (37), subtracting $F(x^*)$ from both sides, and using Assumption B.2, we obtain
$$\begin{aligned}
\mathbb{E}_k[F(x_{k+1}) - F(x^*)] &\le F(x_k) - F(x^*) - \frac{\alpha}{2}\,\|H_k^{1/2} \nabla F(x_k)\|^2 \\
&\le F(x_k) - F(x^*) - \frac{\alpha}{2}\,\Lambda_1 \|\nabla F(x_k)\|^2 \\
&\le (1 - \mu \Lambda_1 \alpha)\,(F(x_k) - F(x^*)).
\end{aligned}$$
The theorem follows from taking total expectation.
We now consider the case when $F$ is nonconvex and bounded below.

Theorem B.5. Suppose that $F$ is twice continuously differentiable and bounded below, and that there exists a constant $L > 0$ such that
$$\nabla^2 F(x) \preceq L I, \quad \forall x \in \mathbb{R}^d. \tag{45}$$
Let $\{x_k\}$ be generated by iteration (29), for any $x_0$, where $|S_k|$ is chosen by the (exact variance) inner product quasi-Newton test (31), and suppose that Assumptions (B.1) and (B.2) hold. Then, if $\alpha_k$ satisfies (36), we have
$$\lim_{k \to \infty} \mathbb{E}\!\left[\|\nabla F(x_k)\|^2\right] = 0. \tag{46}$$
Moreover, for any positive integer $T$ we have that
$$\min_{0 \le k \le T-1} \mathbb{E}\!\left[\|\nabla F(x_k)\|^2\right] \le \frac{2}{\alpha T \Lambda_1}\left(F(x_0) - F_{\min}\right),$$
where $F_{\min}$ is a lower bound on $F$ in $\mathbb{R}^d$.

Proof. From Lemma B.3 and by taking total expectation, we have
$$\mathbb{E}[F(x_{k+1})] \le \mathbb{E}[F(x_k)] - \frac{\alpha}{2}\,\mathbb{E}\!\left[\|H_k^{1/2} \nabla F(x_k)\|^2\right],$$
and hence
$$\mathbb{E}\!\left[\|H_k^{1/2} \nabla F(x_k)\|^2\right] \le \frac{2}{\alpha}\,\mathbb{E}[F(x_k) - F(x_{k+1})].$$
Summing both sides of this inequality from $k = 0$ to $T - 1$, and since $F$ is bounded below by $F_{\min}$, we get
$$\sum_{k=0}^{T-1} \mathbb{E}\!\left[\|H_k^{1/2} \nabla F(x_k)\|^2\right] \le \frac{2}{\alpha}\,\mathbb{E}[F(x_0) - F(x_T)] \le \frac{2}{\alpha}\,[F(x_0) - F_{\min}].$$
Using the bound on the eigenvalues of $H_k$ and taking limits, we obtain
$$\Lambda_1 \lim_{T \to \infty} \sum_{k=0}^{T-1} \mathbb{E}\!\left[\|\nabla F(x_k)\|^2\right] \le \lim_{T \to \infty} \sum_{k=0}^{T-1} \mathbb{E}\!\left[\|H_k^{1/2} \nabla F(x_k)\|^2\right] < \infty,$$
which implies (46). We can also conclude that
$$\min_{0 \le k \le T-1} \mathbb{E}\!\left[\|\nabla F(x_k)\|^2\right] \le \frac{1}{T}\sum_{k=0}^{T-1} \mathbb{E}\!\left[\|\nabla F(x_k)\|^2\right] \le \frac{2}{\alpha T \Lambda_1}\left(F(x_0) - F_{\min}\right).$$
C. Additional Numerical Experiments
C.1. Datasets
Table 2 summarizes the datasets used for the experiments. Some of these datasets divide the data into training and testing
sets; for the rest, we randomly divide the data so that the training set constitutes 90% of the total.
Table 2. Characteristics of all datasets used in the experiments.
Dataset   | # Data Points (train; test) | # Features | # Classes | Source
gisette   | (6,000; 1,000)              | 5,000      | 2         | (Chang & Lin, 2011)
mushrooms | (7,311; 813)                | 112        | 2         | (Chang & Lin, 2011)
sido      | (11,410; 1,268)             | 4,932      | 2         | (Guyon et al., 2008)
ijcnn     | (35,000; 91,701)            | 22         | 2         | (Chang & Lin, 2011)
spam      | (82,970; 9,219)             | 823,470    | 2         | (Cormack & Lynam, 2005; Carbonetto, 2009)
alpha     | (450,000; 50,000)           | 500        | 2         | synthetic
covertype | (522,910; 58,102)           | 54         | 2         | (Chang & Lin, 2011)
url       | (2,156,517; 239,613)        | 3,231,961  | 2         | (Chang & Lin, 2011)
MNIST     | (60,000; 10,000)            | 28 × 28    | 10        | (LeCun et al., 1998)
CIFAR-10  | (50,000; 10,000)            | 32 × 32    | 10        | (Krizhevsky, 2009)
The alpha dataset is a synthetic dataset that is available at ftp://largescale.ml.tu-berlin.de.
C.2. Logistic Regression Experiments
We report the numerical results on binary classification logistic regression problems on the
8
datasets given in Table 2. We
plot the performance measured in terms of training error, test loss and test accuracy against gradient evaluations. We also
report the behavior of the batch sizes and steplengths for both variants of the PBQN method.
[Figure 5 here: training error, test loss, and test accuracy vs. gradient evaluations, and batch size and steplength vs. iterations, for SG, SVRG, PBQN-MB, and PBQN-FO.]

Figure 5. gisette dataset: Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and full-overlap (FO) approaches, and the SG and SVRG methods.
[Figure 6 here: training error, test loss, and test accuracy vs. gradient evaluations, and batch size and steplength vs. iterations, for SG, SVRG, PBQN-MB, and PBQN-FO.]

Figure 6. mushrooms dataset: Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and full-overlap (FO) approaches, and the SG and SVRG methods.
[Figure 7 here: training error, test loss, and test accuracy vs. gradient evaluations, and batch size and steplength vs. iterations, for SG, SVRG, PBQN-MB, and PBQN-FO.]

Figure 7. sido dataset: Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and full-overlap (FO) approaches, and the SG and SVRG methods.
[Figure 8 here: training error, test loss, and test accuracy vs. gradient evaluations, and batch size and steplength vs. iterations, for SG, SVRG, PBQN-MB, and PBQN-FO.]

Figure 8. ijcnn dataset: Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and full-overlap (FO) approaches, and the SG and SVRG methods.
0 10 20 30 40 50 60 70
Gradient Evaluations
106
105
104
103
102
101
100
Training Error
SG
SVRG
PBQN-MB
PBQN-FO
0246810
Gradient Evaluations
104
103
102
101
100
Training Error
SG
SVRG
PBQN-MB
PBQN-FO
0 2 4 6 8 10
Gradient Evaluations
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
Test Loss
SG
SVRG
PBQN-MB
PBQN-FO
0 2 4 6 8 10
Gradient Evaluations
50
60
70
80
90
100
Test Accuracy
SG
SVRG
PBQN-MB
PBQN-FO
0 50 100 150 200 250 300 350
Iterations
0
1
2
3
4
5
6
7
8
9
Batch Sizes
×104
PBQN-MB
PBQN-FO
0 50 100 150 200 250 300 350
Iterations
0.0
0.2
0.4
0.6
0.8
1.0
Steplength
PBQN-MB
PBQN-FO
Figure 9. spam dataset:
Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and full-
overlap (FO) approaches, and the SG and SVRG methods.
[Figure panels: training error, test loss, and test accuracy versus gradient evaluations for SG, SVRG, PBQN-MB, and PBQN-FO; batch sizes and steplengths versus iterations for PBQN-MB and PBQN-FO.]
Figure 10. alpha dataset: Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and full-overlap (FO) approaches, and the SG and SVRG methods.
[Figure panels: training error, test loss, and test accuracy versus gradient evaluations for SG, SVRG, PBQN-MB, and PBQN-FO; batch sizes and steplengths versus iterations for PBQN-MB and PBQN-FO.]
Figure 11. covertype dataset: Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and full-overlap (FO) approaches, and the SG and SVRG methods.
[Figure panels: training error, test loss, and test accuracy versus gradient evaluations for SG, SVRG, PBQN-MB, and PBQN-FO; batch sizes and steplengths versus iterations for PBQN-MB and PBQN-FO.]
Figure 12. url dataset: Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and full-overlap (FO) approaches, and the SG and SVRG methods. Note that we only ran the SG and SVRG algorithms for 3 gradient evaluations since the equivalent number of iterations already reached the order of $10^7$.
C.3. Neural Network Experiments
We describe each neural network architecture below. We plot the training loss, test loss and test accuracy against the total
number of iterations and gradient evaluations. We also report the behavior of the batch sizes and steplengths for both variants
of the PBQN method.
C.3.1. CIFAR-10 CONVOLUTIONAL NETWORK (C) ARCHITECTURE
The small convolutional neural network (ConvNet) is a 2-layer convolutional network with two alternating stages of 5×5 kernels and 2×2 max pooling, followed by a fully connected layer with 1000 ReLU units. The first convolutional layer yields 6 output channels and the second convolutional layer yields 16 output channels.
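As a concrete illustration, the following is a minimal PyTorch sketch of this ConvNet as we read the description above. The use of ReLU after each convolution, the absence of padding, and the final 10-way output layer are our assumptions for a runnable example; they are not specified in the text.

```python
import torch.nn as nn

class ConvNetC(nn.Module):
    """Sketch of the small CIFAR-10 ConvNet (C): two conv / 2x2 max-pool stages
    followed by a 1000-unit fully connected ReLU layer. Padding choices and the
    final 10-way classifier are assumptions, not details from the paper."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 6, kernel_size=5),    # first conv layer: 6 output channels
            nn.ReLU(),
            nn.MaxPool2d(2),                   # 2x2 max pooling: 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),   # second conv layer: 16 output channels
            nn.ReLU(),
            nn.MaxPool2d(2),                   # 2x2 max pooling: 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 1000),       # fully connected layer with 1000 ReLU units
            nn.ReLU(),
            nn.Linear(1000, num_classes),      # assumed 10-way output for CIFAR-10
        )

    def forward(self, x):                      # x: (N, 3, 32, 32) CIFAR-10 images
        return self.classifier(self.features(x))
```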
C.3.2. CIFAR-10 AND MNIST ALEXNET-LIKE NETWORK (A1, A2) ARCHITECTURE
The larger convolutional network (AlexNet) is an adaptation of the AlexNet architecture (Krizhevsky et al., 2012) for CIFAR-10 and MNIST. The CIFAR-10 version consists of three convolutional layers with max pooling followed by two fully-connected layers. The first convolutional layer uses a 5×5 kernel with a stride of 2 and 64 output channels. The second and third convolutional layers use a 3×3 kernel with a stride of 1 and 64 output channels. Each convolutional layer is followed by ReLU activations and 3×3 max pooling with a stride of 2. This is followed by two fully-connected layers with 384 and 192 neurons, respectively, each with ReLU activations. The MNIST version of this network differs only in that it uses a single 2×2 max pooling layer after the last convolutional layer.
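The following is a minimal PyTorch sketch of the CIFAR-10 variant (A2). The kernel sizes, strides, channel counts, and fully connected widths follow the description above; the padding values and the final 10-way output layer are assumptions made so that the example runs end to end.

```python
import torch.nn as nn

class AlexNetA2(nn.Module):
    """Sketch of the CIFAR-10 AlexNet-like network (A2); padding values and the
    10-way output layer are assumptions, not specifications from the paper."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=5, stride=2, padding=2),  # 32x32 -> 16x16
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),      # 16x16 -> 8x8
            nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),      # 8x8 -> 4x4
            nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),      # 4x4 -> 2x2
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 2 * 2, 384),   # first fully connected layer, 384 neurons
            nn.ReLU(),
            nn.Linear(384, 192),          # second fully connected layer, 192 neurons
            nn.ReLU(),
            nn.Linear(192, num_classes),  # assumed 10-way output
        )

    def forward(self, x):                 # x: (N, 3, 32, 32) CIFAR-10 images
        return self.classifier(self.features(x))
```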
C.3.3. CIFAR-10 RESIDUAL NETWORK (R) ARCHITECTURE
The residual network (ResNet18) is a slight modification of the ImageNet ResNet18 architecture for CIFAR-10 (He et al., 2016). It follows the same architecture as ResNet18 for ImageNet but removes the global average pooling layer before the 1000-neuron fully-connected layer. ReLU activations and max poolings are included appropriately.
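One possible way to realize this modification, assuming torchvision's ResNet18 implementation and a 10-way output layer for CIFAR-10 (both assumptions on our part, not the authors' code), is sketched below.

```python
import torch.nn as nn
from torchvision.models import resnet18

# Sketch: start from the standard ImageNet ResNet18 and replace the global
# average pooling before the final fully connected layer with an identity.
# The 10-way output layer for CIFAR-10 is an assumption.
model = resnet18(num_classes=10)
model.avgpool = nn.Identity()  # remove the global average pooling layer
# For 32x32 CIFAR-10 inputs the final feature map is 512x1x1, so the flattened
# input to model.fc (512 features) is unchanged by removing the pooling.
```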
[Figure panels: training loss, test loss, and test accuracy versus iterations and versus gradient evaluations for SG, Adam, PBQN-MB, and PBQN-FO; batch sizes versus iterations for all four methods; steplengths versus iterations for PBQN-MB and PBQN-FO.]
Figure 13. CIFAR-10 ConvNet (C): Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and full-overlap (FO) approaches, and the SG and Adam methods. The best results for L-BFGS are achieved with θ = 0.9.
[Figure panels: training loss, test loss, and test accuracy versus iterations and versus gradient evaluations for SG, Adam, PBQN-MB, and PBQN-FO; batch sizes versus iterations for all four methods; steplengths versus iterations for PBQN-MB and PBQN-FO.]
Figure 14. MNIST AlexNet (A1): Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and full-overlap (FO) approaches, and the SG and Adam methods. The best results for L-BFGS are achieved with θ = 2.
[Figure panels: training loss, test loss, and test accuracy versus iterations and versus gradient evaluations for SG, Adam, PBQN-MB, and PBQN-FO; batch sizes versus iterations for all four methods; steplengths versus iterations for PBQN-MB and PBQN-FO.]
Figure 15. CIFAR-10 AlexNet (A2): Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and full-overlap (FO) approaches, and the SG and Adam methods. The best results for L-BFGS are achieved with θ = 0.9.
[Figure panels: training loss, test loss, and test accuracy versus iterations and versus gradient evaluations for SG, Adam, PBQN-MB, and PBQN-FO; batch sizes versus iterations for all four methods; steplengths versus iterations for PBQN-MB and PBQN-FO.]
Figure 16. CIFAR-10 ResNet18 (R): Performance of the progressive batching L-BFGS methods, with multi-batch (MB) (25% overlap) and full-overlap (FO) approaches, and the SG and Adam methods. The best results for L-BFGS are achieved with θ = 2.
D. Performance Model
The use of increasing batch sizes in the PBQN algorithm yields a larger effective batch size than the SG method, allowing PBQN to scale to a larger number of nodes than is currently permissible even with large-batch training (Goyal et al., 2017). With improved scalability and richer gradient information, we expect a reduction in training time. To demonstrate the potential of a parallelized implementation of PBQN to reduce training time, we extend the idealized performance model from (Keskar et al., 2016) to the PBQN algorithm. For PBQN to be competitive, it must achieve the following: (i) the quality of its solution should match or improve upon SG's solution (as shown in Table 1 of the main paper); (ii) it should utilize a larger effective batch size; and (iii) it should converge to the solution in a smaller number of iterations. We provide an initial analysis by establishing the analytic requirements for improved training time; we leave discussion of implementation details, memory requirements, and large-scale experiments for future work.
Let the effective batch size for PBQN and the conventional SG batch size be denoted by $\widehat{B}_L$ and $B_S$, respectively. From Algorithm 1, we observe that each PBQN iteration involves extra computation beyond the gradient computation performed by SG. The additional steps are: the L-BFGS two-loop recursion, which involves several operations over the stored curvature pairs and network parameters (Algorithm 1:6); the stochastic line search for identifying the steplength (Algorithm 1:7-16); and the curvature pair updating (Algorithm 1:18-21). Most of these supplemental operations, however, act only on the weights of the network and are therefore orders of magnitude cheaper than computing the gradient; the two-loop recursion performs $O(10)$ operations over the network parameters and curvature pairs. The cost of variance estimation is negligible since a fixed number of samples can be used throughout the run, and its computation can be parallelized so that it does not become a serial bottleneck.
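To make the per-iteration overhead concrete, the following is a generic NumPy sketch of the standard L-BFGS two-loop recursion (a textbook version, not a transcription of the paper's Algorithm 1). With $m$ stored curvature pairs it uses only $O(m)$ inner products and vector updates on parameter-sized vectors, consistent with the $O(10)$ operation count stated above.

```python
import numpy as np

def two_loop_recursion(grad, s_list, y_list):
    """Compute H_k * grad from stored curvature pairs (s_i, y_i), oldest first.
    Generic L-BFGS two-loop recursion; a sketch, not the paper's code."""
    q = grad.copy()
    rhos, alphas = [], []
    # First loop: newest pair to oldest pair.
    for s, y in zip(reversed(s_list), reversed(y_list)):
        rho = 1.0 / np.dot(y, s)
        alpha = rho * np.dot(s, q)
        q -= alpha * y
        rhos.append(rho)
        alphas.append(alpha)
    # Initial Hessian approximation gamma * I from the most recent pair.
    if s_list:
        gamma = np.dot(s_list[-1], y_list[-1]) / np.dot(y_list[-1], y_list[-1])
    else:
        gamma = 1.0
    r = gamma * q
    # Second loop: oldest pair to newest pair.
    for (s, y), rho, alpha in zip(zip(s_list, y_list), reversed(rhos), reversed(alphas)):
        beta = rho * np.dot(y, r)
        r += (alpha - beta) * s
    return r  # approximates H_k * grad; the quasi-Newton direction is -r
```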
The one exception is the stochastic line search, which requires additional forward propagations over the model for different sets of network parameters. However, this happens only when the steplength is not accepted, which occurs infrequently in practice. We make the pessimistic assumption of one additional forward propagation per iteration, amounting to roughly an additional $\tfrac{1}{3}$ of the cost of a gradient computation (forward propagation, plus back propagation with respect to activations and weights). Hence, the ratio of the cost per iteration of PBQN, $C_L$, to SG's cost per iteration, $C_S$, is $\tfrac{4}{3}$. Let $I_S$ and $I_L$ be the number of iterations that SG and PBQN, respectively, take to reach a similar test accuracy. The target number of nodes to be used for training is $N$, with $N < \widehat{B}_L$. For $N$ nodes, the parallel efficiency of SG is assumed to be $P_e(N)$, and we assume that, at the target node count, there is no drop in parallel efficiency for PBQN because of its large effective batch size.
For a lower training time with the PBQN method, the following relation should hold:
$$\frac{I_L \, C_L \, \widehat{B}_L}{N} < \frac{I_S \, C_S \, B_S}{N \, P_e(N)}. \qquad (47)$$
In terms of iterations, we can rewrite this as
$$\frac{I_L}{I_S} < \frac{C_S}{C_L} \cdot \frac{B_S}{\widehat{B}_L} \cdot \frac{1}{P_e(N)}. \qquad (48)$$
Assuming a target node count $N = B_S < \widehat{B}_L$, the scaling efficiency of SG drops significantly due to the reduced work per node, giving a parallel efficiency of $P_e(N) = 0.2$; see (Kurth et al., 2017; You et al., 2017). If we additionally assume that the effective batch size for PBQN is $4\times$ larger, with an SG large batch of 8K and a PBQN effective batch of 32K as observed in our experiments (Section 4), this gives $\widehat{B}_L / B_S = 4$. Under these assumptions, PBQN must converge in about the same number of iterations as SG in order to achieve a lower training time. The results in Section 4 show that PBQN converges in significantly fewer iterations than SG, hence establishing the potential for lower training times. We refer the reader to (Das et al., 2016) for a more detailed model and commentary on the effect of batch size on performance.
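As a sanity check of inequality (48) under the assumptions just stated ($C_S/C_L = 3/4$, $B_S/\widehat{B}_L = 1/4$, $P_e(N) = 0.2$), the required bound on the iteration ratio can be computed directly; note that these numbers are the model's assumptions, not measurements.

```python
# Bound on I_L / I_S from inequality (48) under the stated assumptions.
cost_ratio = 3.0 / 4.0      # C_S / C_L: a PBQN iteration costs 4/3 of an SG iteration
batch_ratio = 1.0 / 4.0     # B_S / B_L_hat: SG batch 8K vs. PBQN effective batch 32K
parallel_eff = 0.2          # P_e(N): SG parallel efficiency at N = B_S nodes

bound = cost_ratio * batch_ratio / parallel_eff
print(bound)  # 0.9375: PBQN must take roughly as many iterations as SG, or fewer
```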