Preprint (PDF available)

Basis Matters: Better Communication-Efficient Second Order Methods for Federated Learning


Abstract

Recent advances in distributed optimization have shown that Newton-type methods with proper communication compression mechanisms can guarantee fast local rates and low communication cost compared to first order methods. We discover that the communication cost of these methods can be further reduced, sometimes dramatically so, with a surprisingly simple trick: Basis Learn (BL). The idea is to transform the usual representation of the local Hessians via a change of basis in the space of matrices and apply compression tools to the new representation. To demonstrate the potential of using custom bases, we design a new Newton-type method (BL1), which reduces communication cost via both the BL technique and a bidirectional compression mechanism. Furthermore, we present two alternative extensions (BL2 and BL3) to partial participation to accommodate federated learning applications. We prove local linear and superlinear rates independent of the condition number. Finally, we support our claims with numerical experiments by comparing several first and second order methods.

Full text: https://arxiv.org/abs/2111.01847
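To make the trick concrete, here is a minimal NumPy sketch, written under our own assumptions rather than taken from the paper: a symmetric Hessian is expressed by its coefficients in an orthonormal basis of the space of symmetric matrices, the coefficient vector is sparsified with a top-k compressor, and the matrix is reconstructed on the receiving side. The function names (sym_basis, top_k, compress_in_basis) are illustrative; the paper's contribution lies in choosing a better, problem-adapted basis so that these coefficients are far sparser than the raw Hessian entries.

```python
import numpy as np

def sym_basis(d):
    """Orthonormal basis (Frobenius inner product) of d x d symmetric matrices."""
    basis = []
    for i in range(d):
        E = np.zeros((d, d)); E[i, i] = 1.0
        basis.append(E)
    for i in range(d):
        for j in range(i + 1, d):
            E = np.zeros((d, d)); E[i, j] = E[j, i] = 1.0 / np.sqrt(2)
            basis.append(E)
    return basis

def top_k(v, k):
    """Keep the k largest-magnitude coordinates of v, zero out the rest."""
    idx = np.argsort(np.abs(v))[-k:]
    out = np.zeros_like(v)
    out[idx] = v[idx]
    return out

def compress_in_basis(H, basis, k):
    """Express H in the given basis, sparsify the coefficients, reconstruct."""
    coeffs = np.array([np.sum(H * B) for B in basis])   # Frobenius inner products
    coeffs_k = top_k(coeffs, k)
    return sum(c * B for c, B in zip(coeffs_k, basis))

# toy example: compress a random symmetric "Hessian"
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
H = A @ A.T
H_hat = compress_in_basis(H, sym_basis(5), k=5)
print(np.linalg.norm(H - H_hat) / np.linalg.norm(H))  # relative compression error
```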
References
Conference Paper
We develop several new communication-efficient second-order methods for distributed optimization. Our first method, NEWTON-STAR, is a variant of Newton's method from which it inherits its fast local quadratic rate. However, unlike Newton's method, NEWTON-STAR enjoys the same per iteration communication cost as gradient descent. While this method is impractical as it relies on the use of certain unknown parameters characterizing the Hessian of the objective function at the optimum, it serves as the starting point which enables us to design practical variants thereof with strong theoretical guarantees. In particular, we design a stochastic sparsification strategy for learning the unknown parameters in an iterative fashion in a communication efficient manner. Applying this strategy to NEWTON-STAR leads to our next method, NEWTON-LEARN, for which we prove local linear and superlinear rates independent of the condition number. When applicable, this method can have dramatically superior convergence behavior when compared to state-of-the-art methods. Finally, we develop a globalization strategy using cubic regularization which leads to our next method, CUBIC-NEWTON-LEARN, for which we prove global sublinear and linear convergence rates, and a fast superlinear rate. Our results are supported with experimental results on real datasets, and show several orders of magnitude improvement on baseline and state-of-the-art methods in terms of communication complexity.
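As a rough illustration of the learning mechanism described above, the sketch below is a simplification of ours, assuming a generalized-linear-model structure where the local Hessian is (1/m) sum_j phi''(a_j^T x) a_j a_j^T: it maintains estimates h_j of the per-sample curvature coefficients and refreshes them with a compressed difference, so only a few coordinates travel over the network per round. Step-size rules, projections, and the exact compressors of the actual NEWTON-LEARN method are not reproduced here.

```python
import numpy as np

def rand_k(v, k, rng):
    """Unbiased random-k sparsifier: keep k random coordinates, rescale by len(v)/k."""
    out = np.zeros_like(v)
    idx = rng.choice(len(v), size=k, replace=False)
    out[idx] = (len(v) / k) * v[idx]
    return out

def learn_hessian_coeffs(h, x, A, phi_dd, k, eta, rng):
    """One compressed update of the per-sample curvature estimates h_j ~ phi''(a_j^T x*).
    Only the k nonzero coordinates of the compressed difference need to be communicated."""
    target = phi_dd(A @ x)                      # current coefficients phi''(a_j^T x)
    h_new = h + eta * rand_k(target - h, k, rng)
    return np.maximum(h_new, 0.0)               # keep estimates nonnegative (convex case)

def hessian_from_coeffs(h, A, lam):
    """Assemble the Hessian estimate (1/m) A^T diag(h) A + lam I from the learned h."""
    m, d = A.shape
    return A.T @ (h[:, None] * A) / m + lam * np.eye(d)

# toy usage with logistic-type curvature phi''(t) = sigmoid(t) (1 - sigmoid(t))
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 4))
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
phi_dd = lambda t: sigmoid(t) * (1.0 - sigmoid(t))
h = learn_hessian_coeffs(np.zeros(50), np.zeros(4), A, phi_dd, k=5, eta=1.0, rng=rng)
H_est = hessian_from_coeffs(h, A, lam=1e-2)
```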
Article
Modern large-scale machine learning applications require stochastic optimization algorithms to be implemented on distributed computational architectures. A key bottleneck is the communication overhead for exchanging information such as stochastic gradients among different workers. In this paper, to reduce the communication cost we propose a convex optimization formulation to minimize the coding length of stochastic gradients. To solve the optimal sparsification efficiently, several simple and fast algorithms are proposed for approximate solution, with theoretical guarantees for sparseness. Experiments on $\ell_2$ regularized logistic regression, support vector machines, and convolutional neural networks validate our sparsification approaches.
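The paper chooses the keep-probabilities by solving a convex program that minimizes the expected coding length; the sketch below only illustrates the underlying unbiased sparsify-and-rescale template with a simple magnitude-proportional choice of probabilities, so it should not be read as the proposed algorithm.

```python
import numpy as np

def unbiased_sparsify(g, budget, rng):
    """Randomly drop coordinates of g and rescale survivors by 1/p_i so the result
    is unbiased (E[out_i] = g_i). Keep-probabilities are proportional to |g_i|,
    capped at 1, with roughly `budget` expected nonzeros. The referenced paper
    instead obtains the probabilities from a convex coding-length minimization."""
    p = np.minimum(1.0, budget * np.abs(g) / (np.sum(np.abs(g)) + 1e-12))
    keep = rng.random(len(g)) < p
    out = np.zeros_like(g)
    out[keep] = g[keep] / p[keep]
    return out

rng = np.random.default_rng(0)
g = rng.standard_normal(10)
print(unbiased_sparsify(g, budget=3, rng=rng))
```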
Article
A classical algorithm for solving the system of nonlinear equations $F(x) = 0$ is Newton's method \[ x_{k + 1} = x_k + s_k ,\quad {\text{where }}F'(x_k )s_k = - F(x_k ),\quad x_0 {\text{ given}}.\] The method is attractive because it converges rapidly from any sufficiently good initial guess $x_0$. However, solving a system of linear equations (the Newton equations) at each stage can be expensive if the number of unknowns is large and may not be justified when $x_k$ is far from a solution. Therefore, we consider the class of inexact Newton methods: \[ x_{k + 1} = x_k + s_k ,\quad {\text{where }}F'(x_k )s_k = - F(x_k ) + r_k ,\quad {{\left\| {r_k } \right\|} / {\left\| {F(x_k )} \right\|}} \leqq \eta _k \] which solve the Newton equations only approximately and in some unspecified manner. Under the natural assumption that the forcing sequence $\{ \eta_k \}$ is uniformly less than one, we show that all such methods are locally convergent and characterize the order of convergence in terms of the rate of convergence of the relative residuals.
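A small self-contained sketch of this template follows; it is our illustration, with a constant forcing term eta_k = eta and a conjugate-gradient-style inner solver, and the names inexact_newton and approx_solve are ours.

```python
import numpy as np

def inexact_newton(F, J, x0, eta=0.5, tol=1e-10, max_outer=50):
    """Inexact Newton: solve J(x_k) s = -F(x_k) only approximately at each step,
    stopping the inner iteration once ||r_k|| <= eta_k * ||F(x_k)||
    (here with a constant forcing term eta_k = eta < 1)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_outer):
        Fx = F(x)
        if np.linalg.norm(Fx) < tol:
            break
        s = approx_solve(J(x), -Fx, eta * np.linalg.norm(Fx))
        x = x + s
    return x

def approx_solve(A, b, res_tol):
    """Conjugate gradients on the normal equations A^T A s = A^T b, stopped as soon
    as the residual of the original system satisfies ||A s - b|| <= res_tol."""
    s = np.zeros_like(b)
    r = A.T @ (b - A @ s)
    p = r.copy()
    for _ in range(len(b) * 10):
        if np.linalg.norm(A @ s - b) <= res_tol:
            break
        Ap = A.T @ (A @ p)
        alpha = (r @ r) / (p @ Ap)
        s = s + alpha * p
        r_new = r - alpha * Ap
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
    return s

# toy system: F(x) = [x0^2 + x1 - 3, x0 + x1^2 - 5], root at (1, 2)
F = lambda x: np.array([x[0]**2 + x[1] - 3.0, x[0] + x[1]**2 - 5.0])
J = lambda x: np.array([[2*x[0], 1.0], [1.0, 2*x[1]]])
print(inexact_newton(F, J, x0=[1.0, 1.0]))
```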
Article
We consider distributed optimization over several devices, each sending incremental model updates to a central server. This setting is considered, for instance, in federated learning. Various schemes have been designed to compress the model updates in order to reduce the overall communication cost. However, existing methods suffer from a significant slowdown due to the additional variance ω>0 coming from the compression operator and, as a result, only converge sublinearly. What is needed is a variance reduction technique for taming the variance introduced by compression. We propose the first methods that achieve linear convergence for arbitrary compression operators. For strongly convex functions with condition number κ, distributed among n machines with a finite-sum structure, each worker having fewer than m components, we also (i) give analysis for the weakly convex and the non-convex cases and (ii) verify in experiments that our novel variance reduced schemes are more efficient than the baselines. Moreover, we show theoretically that as the number of devices increases, higher compression levels are possible without this affecting the overall number of communications in comparison with methods that do not perform any compression. This leads to a significant reduction in communication cost. Our general analysis allows one to pick the most suitable compression for each problem, finding the right balance between additional variance and communication savings. Finally, we also (iii) give analysis for arbitrary quantized updates.
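The variance-reduction idea can be sketched as follows, in a simplified, generic form that omits the paper's specific compressors, step sizes, and variance-reduced gradient estimators: each worker keeps a shift h_i, transmits the compressed difference between its gradient and the shift, and both sides move the shift toward the gradient, so the quantity being compressed shrinks to zero at the optimum.

```python
import numpy as np

def top_k(v, k):
    """Generic sparsifier standing in for the compression operator C."""
    idx = np.argsort(np.abs(v))[-k:]
    out = np.zeros_like(v)
    out[idx] = v[idx]
    return out

def shifted_compressed_step(x, grads, shifts, lr, alpha, k):
    """One round of the shifted-compression template: worker i transmits C(g_i - h_i)
    instead of C(g_i), and both worker and server move the shift h_i toward g_i.
    As h_i approaches the local gradient at the optimum, the compressed quantity
    shrinks, which is what removes the compression variance and restores linear rates."""
    messages = [top_k(g - h, k) for g, h in zip(grads, shifts)]     # what is sent
    new_shifts = [h + alpha * m for h, m in zip(shifts, messages)]  # shift update
    g_hat = np.mean([h + m for h, m in zip(shifts, messages)], axis=0)
    return x - lr * g_hat, new_shifts

# toy usage with 4 workers and a 10-dimensional model
rng = np.random.default_rng(0)
grads = [rng.standard_normal(10) for _ in range(4)]
shifts = [np.zeros(10) for _ in range(4)]
x = np.zeros(10)
x, shifts = shifted_compressed_step(x, grads, shifts, lr=0.1, alpha=0.5, k=3)
```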
Article
High network communication cost for synchronizing gradients and parameters is the well-known bottleneck of distributed training. In this work, we propose TernGrad that uses ternary gradients to accelerate distributed deep learning in data parallelism. Our approach requires only three numerical levels {-1,0,1} which can aggressively reduce the communication time. We mathematically prove the convergence of TernGrad under the assumption of a bound on gradients. Guided by the bound, we propose layer-wise ternarizing and gradient clipping to improve its convergence. Our experiments show that applying TernGrad on AlexNet does not incur any accuracy loss and can even improve accuracy. The accuracy loss of GoogLeNet induced by TernGrad is less than 2% on average. Finally, a performance model is proposed to study the scalability of TernGrad. Experiments show significant speed gains for various deep neural networks.
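The three-level encoding can be sketched as follows; this is our reading of the scheme, and the layer-wise ternarizing, gradient clipping, and scaler-sharing details of TernGrad are omitted.

```python
import numpy as np

def ternarize(g, rng):
    """Stochastic ternarization: encode g with levels {-1, 0, +1} times a single
    scalar s = max|g_i|. Each coordinate keeps its sign with probability |g_i|/s,
    so the result is an unbiased estimate of g."""
    s = np.max(np.abs(g))
    if s == 0.0:
        return np.zeros_like(g)
    b = rng.random(g.shape) < np.abs(g) / s
    return s * np.sign(g) * b

rng = np.random.default_rng(0)
g = rng.standard_normal(8)
samples = np.mean([ternarize(g, rng) for _ in range(20000)], axis=0)
print(np.max(np.abs(samples - g)))   # small: the compressor is unbiased in expectation
```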
Conference Paper
Regularized logistic regression is a very useful classification method, but for large-scale data, its distributed training has not been investigated much. In this work, we propose a distributed Newton method for training logistic regression. Many interesting techniques are discussed for reducing the communication cost and speeding up the computation. Experiments show that the proposed method is competitive with or even faster than state-of-the-art approaches such as Alternating Direction Method of Multipliers (ADMM) and Vowpal Wabbit (VW). We have released an MPI-based implementation for public use.
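For reference, the Newton system behind L2-regularized logistic regression looks as follows; this single-machine sketch is ours and does not reproduce the paper's distributed solver or its communication-saving techniques.

```python
import numpy as np

def newton_step_logreg(w, X, y, lam):
    """One exact Newton step for L2-regularized logistic regression with labels y in {0,1}.
    In a distributed setting, each machine would contribute its local gradient and
    Hessian (or Hessian-vector products); here everything is formed on one machine."""
    p = 1.0 / (1.0 + np.exp(-X @ w))              # predicted probabilities
    grad = X.T @ (p - y) / len(y) + lam * w
    D = p * (1.0 - p)                             # diagonal logistic weights
    H = X.T @ (D[:, None] * X) / len(y) + lam * np.eye(X.shape[1])
    return w - np.linalg.solve(H, grad)

# toy data and a few Newton iterations
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
w_true = rng.standard_normal(5)
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)
w = np.zeros(5)
for _ in range(10):
    w = newton_step_logreg(w, X, y, lam=1e-2)
print(w)
```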
Article
Let $F$ be a mapping from real $n$-dimensional Euclidean space into itself. Most practical algorithms for finding a zero of $F$ are of the form \[ x_{k+1} = x_k - B_k^{-1} F(x_k), \] where $\{B_k\}$ is a sequence of nonsingular matrices. The main result of this paper is a characterization theorem for the superlinear convergence to a zero of $F$ of sequences of the above form. This result is then used to give a unified treatment of the results on the superlinear convergence of the Davidon-Fletcher-Powell method obtained by Powell for the case in which exact line searches are used, and by Broyden, Dennis, and Moré for the case without line searches. As a by-product, several results on the asymptotic behavior of the sequence $\{B_k\}$ are obtained. An interesting aspect of these results is that superlinear convergence is obtained without any consistency conditions; i.e., without requiring that the sequence $\{B_k\}$ converge to the Jacobian matrix of $F$ at the zero. In fact, a modification of an example due to Powell shows that most of the known quasi-Newton methods are not, in general, consistent. Finally, it is pointed out that the above-mentioned characterization theorem applies to other single and double rank quasi-Newton methods, and that the results of this paper can be used to obtain their superlinear convergence.
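The characterization theorem referred to above is commonly stated as the Dennis-Moré condition. As a standard summary rather than a quotation from the paper: under the usual smoothness and nonsingularity assumptions at the zero $x_*$, the iteration $x_{k+1} = x_k - B_k^{-1} F(x_k)$ converges superlinearly if and only if \[ \lim_{k \to \infty} \frac{\left\| \left( B_k - F'(x_*) \right)(x_{k+1} - x_k) \right\|}{\left\| x_{k+1} - x_k \right\|} = 0 . \] That is, $B_k$ only needs to agree with the Jacobian along the step directions, not converge to it, which is exactly why no consistency condition is required.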
Article
LIBSVM is a library for support vector machines (SVM). Its goal is to help users to easily use SVM as a tool. In this document, we present all its implementation details. For the use of LIBSVM, the README file included in the package and the LIBSVM FAQ provide the information.
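As a usage illustration (not taken from the document), scikit-learn's SVC class wraps LIBSVM internally, so a minimal classification run looks like this:

```python
import numpy as np
from sklearn.svm import SVC   # scikit-learn's SVC is built on top of LIBSVM

# toy two-class problem
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")  # RBF kernel, default regularization
clf.fit(X, y)
print(clf.score(X, y))        # training accuracy
print(clf.support_.shape[0])  # number of support vectors found by LIBSVM's solver
```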