
Speeding up L-BFGS by direct approximation of the inverse Hessian matrix


Computational Optimization and Applications (2025) 91:283–310
https://doi.org/10.1007/s10589-025-00665-0
Ashkan Sadeghi-Lotfabadi¹ · Kamaledin Ghiasi-Shirazi¹
Received: 31 January 2024 / Accepted: 31 January 2025 / Published online: 24 February 2025
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2025
Abstract
L-BFGS is one of the widely used quasi-Newton methods. Instead of explicitly storing an approximation H of the inverse Hessian, L-BFGS keeps a limited number of vectors that can be used for computing the product of H by the gradient. However, this computation is sequential, each step depending on the outcome of the previous step. To solve this problem, we propose the Direct L-BFGS (DirL-BFGS) method that, seeing H as a linear operator, directly stores a low-rank plus diagonal (LRPD) representation of H. Employing the LRPD representation enables us to leverage the benefits of vector processing, leading to accelerating and parallelizing the calculations in the form of single instruction, multiple data. We evaluate our proposed method on different quadratic optimization problems and several regression and classification tasks with neural networks. Numerical results show that DirL-BFGS is faster overall than L-BFGS.
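
To make the contrast concrete, the sketch below (NumPy) places the classical L-BFGS two-loop recursion, whose inner updates must run one after another, next to a direct product with a low-rank plus diagonal operator, which reduces to a few dense matrix-vector operations that vectorize well. The parameterization H ≈ diag(d) + U Uᵀ is only an illustrative assumption; the exact LRPD representation used by DirL-BFGS is defined in the paper, and the two routines are compared only for their data-dependency structure, not for numerical equivalence.

# Illustrative sketch only: contrasts the sequential two-loop recursion with a
# direct low-rank-plus-diagonal (LRPD) product. The form H ~ diag(d) + U U^T is
# an assumption for this example, not necessarily the paper's exact
# DirL-BFGS representation.
import numpy as np

def lbfgs_two_loop(g, s_list, y_list, gamma):
    # Classic L-BFGS two-loop recursion: each update reads the result of the
    # previous one, so the loop cannot be vectorized across the stored pairs.
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    q = g.copy()
    alphas = []
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        alpha = rho * (s @ q)          # depends on the current q
        q -= alpha * y
        alphas.append(alpha)
    r = gamma * q                      # initial approximation H_0 = gamma * I
    for (s, y, rho), alpha in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        beta = rho * (y @ r)           # depends on the current r
        r += s * (alpha - beta)
    return r                           # r ~ H g

def lrpd_product(g, d, U):
    # Direct product with H ~ diag(d) + U U^T: two dense operations that map
    # naturally onto SIMD/vector hardware.
    return d * g + U @ (U.T @ g)
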
Keywords Limited-memory BFGS · Low-rank plus diagonal approximation · Vectorization · Single instruction multiple data (SIMD)
1 Introduction
We consider an unconstrained optimization to find an optimal parameter $w$ for a smooth objective function $f:\mathbb{R}^n \to \mathbb{R}$:

$$w = \operatorname*{arg\,min}_{w \in \mathbb{R}^n} f(w). \qquad (1)$$
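
As a point of reference for problem (1), a limited-memory quasi-Newton solver can be called off the shelf; the toy quadratic below is synthetic and chosen purely for illustration.

# Toy instance of problem (1): minimize a convex quadratic
# f(w) = 0.5 w^T A w - b^T w with SciPy's off-the-shelf L-BFGS-B solver.
# A and b are made up for the example; any smooth objective with a gradient would do.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 50
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)            # symmetric positive definite Hessian
b = rng.standard_normal(n)

def f(w):
    return 0.5 * w @ A @ w - b @ w

def grad(w):
    return A @ w - b

res = minimize(f, np.zeros(n), jac=grad, method="L-BFGS-B")
print(res.nit, np.linalg.norm(grad(res.x)))  # iteration count and final gradient norm
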
Corresponding author: Kamaledin Ghiasi-Shirazi, k.ghiasi@um.ac.ir

Ashkan Sadeghi-Lotfabadi, sadeghia@mail.um.ac.ir

¹ Department of Computer Engineering, Faculty of Engineering, Ferdowsi University of Mashhad, Azadi Street, Mashhad 9177948974, Iran