Preprint

# A Communication-Efficient Decentralized Newton's Method with Provably Faster Convergence

## Abstract

In this paper, we consider a strongly convex finite-sum minimization problem over a decentralized network and propose a communication-efficient decentralized Newton's method for solving it. First, we apply dynamic average consensus (DAC) so that each node can use a local gradient approximation and a local Hessian approximation to track the global gradient and Hessian, respectively. Second, since exchanging full Hessian approximations is communication-intensive, the nodes exchange compressed Hessian approximations instead and apply an error compensation mechanism to correct for the compression noise. Third, we introduce multi-step consensus for exchanging local variables and local gradient approximations to balance computation and communication. With novel analysis, we establish a globally linear (resp., asymptotically super-linear) convergence rate for the proposed method when $m$ is constant (resp., tends to infinity), where $m$ is the number of consensus inner steps. To the best of our knowledge, this is the first super-linear convergence result for a communication-efficient decentralized Newton's method. Moreover, the rate we establish is provably faster than those of first-order methods. Our numerical results on various applications corroborate the theoretical findings.
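The role of the number of consensus inner steps $m$ can be illustrated with a minimal sketch of multi-step consensus. This is not the authors' algorithm: the 5-node ring, the uniform 1/3 mixing weights, and the names `mix`, `multi_step_consensus`, and `disagreement` are assumptions chosen for illustration. It only shows that repeating a neighbor-averaging round $m$ times keeps the network average fixed while shrinking the disagreement among nodes.

```python
def mix(W, x):
    """One consensus round: each node takes a weighted average of its neighbors' values."""
    n = len(x)
    return [sum(W[i][j] * x[j] for j in range(n)) for i in range(n)]


def multi_step_consensus(W, x, m):
    """Apply m consensus rounds, i.e. compute W^m x using only neighbor exchanges."""
    for _ in range(m):
        x = mix(W, x)
    return x


def disagreement(x):
    """Largest deviation of any node's value from the network average."""
    avg = sum(x) / len(x)
    return max(abs(v - avg) for v in x)


# Hypothetical 5-node ring: each node weights itself and its two neighbors by 1/3,
# which gives a symmetric, doubly stochastic mixing matrix.
n = 5
W = [[1 / 3 if (i - j) % n in (0, 1, n - 1) else 0.0 for j in range(n)] for i in range(n)]
x0 = [1.0, 4.0, -2.0, 7.0, 0.0]  # illustrative local values

for m in (1, 2, 5, 10):
    xm = multi_step_consensus(W, x0, m)
    print(m, disagreement(xm))
```

Each extra inner step costs one more round of communication, which is the computation/communication trade-off that the choice of $m$ controls.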

Article
Communication compression techniques are of growing interest for solving the decentralized optimization problem under limited communication, where the global objective is to minimize the average of local cost functions over a multiagent network using only local computation and peer-to-peer communication. In this article, we propose a novel compressed gradient tracking algorithm (C-GT) that combines the gradient tracking technique with communication compression. In particular, C-GT is compatible with a general class of compression operators that unifies both unbiased and biased compressors. We show that C-GT inherits the advantages of gradient tracking-based algorithms and achieves a linear convergence rate for strongly convex and smooth objective functions. Numerical examples complement the theoretical findings and demonstrate the efficiency and flexibility of the proposed algorithm.
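To make the distinction between biased and unbiased compressors concrete, the sketch below implements two standard examples: top-k sparsification (biased) and scaled random-k sparsification (unbiased). These are generic textbook operators and are not claimed to be the ones used in C-GT; the function names and the test vector are assumptions for illustration.

```python
import random


def top_k(x, k):
    """Biased compressor: keep the k largest-magnitude entries, zero the rest."""
    keep = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:k]
    return [x[i] if i in keep else 0.0 for i in range(len(x))]


def rand_k(x, k, rng):
    """Unbiased compressor: keep k uniformly chosen entries, scaled by d/k."""
    d = len(x)
    keep = set(rng.sample(range(d), k))
    return [x[i] * d / k if i in keep else 0.0 for i in range(d)]


def sq_norm(x):
    """Squared Euclidean norm."""
    return sum(v * v for v in x)


x = [0.5, -3.0, 0.1, 2.0, -0.2, 1.0]  # illustrative vector
k = 2

# Top-k satisfies the contraction ||C(x) - x||^2 <= (1 - k/d) ||x||^2.
cx = top_k(x, k)
err = sq_norm([a - b for a, b in zip(x, cx)])
print(cx, err <= (1 - k / len(x)) * sq_norm(x))

# Averaging many rand-k draws approaches x, which illustrates unbiasedness.
rng = random.Random(0)
trials = 20000
mean = [0.0] * len(x)
for _ in range(trials):
    mean = [m + c / trials for m, c in zip(mean, rand_k(x, k, rng))]
print([round(m, 2) for m in mean])
```

Both operators transmit only k of the d entries; top-k has lower error per call but is biased, which is typically why biased compressors are paired with some form of error compensation.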
Conference Paper
We develop several new communication-efficient second-order methods for distributed optimization. Our first method, NEWTON-STAR, is a variant of Newton's method from which it inherits its fast local quadratic rate. However, unlike Newton's method, NEWTON-STAR enjoys the same per-iteration communication cost as gradient descent. While this method is impractical as it relies on the use of certain unknown parameters characterizing the Hessian of the objective function at the optimum, it serves as the starting point which enables us to design practical variants thereof with strong theoretical guarantees. In particular, we design a stochastic sparsification strategy for learning the unknown parameters in an iterative fashion in a communication-efficient manner. Applying this strategy to NEWTON-STAR leads to our next method, NEWTON-LEARN, for which we prove local linear and superlinear rates independent of the condition number. When applicable, this method can have dramatically superior convergence behavior when compared to state-of-the-art methods. Finally, we develop a globalization strategy using cubic regularization which leads to our next method, CUBIC-NEWTON-LEARN, for which we prove global sublinear and linear convergence rates, and a fast superlinear rate. Our results are supported with experimental results on real datasets, and show several orders of magnitude improvement on baseline and state-of-the-art methods in terms of communication complexity.
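One way to read the NEWTON-STAR idea is as a Newton-type step in which the Hessian is frozen at the (unknown) optimum $x^*$. The display below is a sketch of that reading; the symbols $f$, $x^k$, and $x^*$ are notation assumed here, not taken from the abstract.

```latex
x^{k+1} = x^{k} - \left( \nabla^{2} f(x^{*}) \right)^{-1} \nabla f(x^{k})
```

Because the matrix $\nabla^{2} f(x^{*})$ does not change across iterations, only gradient information would need to be communicated at each step, which is consistent with the stated gradient-descent-like communication cost. It is also why the method is impractical as stated: $x^*$ is not known in advance.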
Article
We introduce the primal-dual quasi-Newton (PD-QN) method as an approximated second order method for solving decentralized optimization problems. The PD-QN method performs quasi-Newton updates on both the primal and dual variables of the consensus optimization problem to find the optimal point of the augmented Lagrangian. By optimizing the augmented Lagrangian, the PD-QN method is able to find the exact solution to the consensus problem with a linear rate of convergence. We derive fully decentralized quasi-Newton updates that approximate second order information to reduce the computational burden relative to dual methods and to make the method more robust in ill-conditioned problems relative to first order methods. The linear convergence rate of PD-QN is established formally and strong performance advantages relative to existing dual and primal-dual methods are shown numerically.
Article
Recently, there has been growing interest in solving consensus optimization problems in a multi-agent network. In this paper, we develop a decentralized algorithm for the consensus optimization problem $$\min\limits_{x\in\mathbb{R}^p}\bar{f}(x)=\frac{1}{n}\sum\limits_{i=1}^n f_i(x),$$ which is defined over a connected network of $n$ agents, where each $f_i$ is held privately by agent $i$ and encodes the agent's data and objective. All the agents shall collaboratively find the minimizer while each agent can only communicate with its neighbors. Such a computation scheme avoids a data fusion center or long-distance communication and offers better load balance to the network. This paper proposes a novel decentralized exact first-order algorithm (abbreviated as EXTRA) to solve the consensus optimization problem. "Exact" means that it can converge to the exact solution. EXTRA can use a fixed, large step size and has synchronized iterations, and the local variable of every agent $i$ converges uniformly and consensually to an exact minimizer of $\bar{f}$. In contrast, the well-known decentralized gradient descent (DGD) method must use diminishing step sizes in order to converge to an exact minimizer. EXTRA and DGD have a similar per-iteration complexity. EXTRA uses the gradients of the last two iterates, whereas DGD uses only that of the last one. EXTRA has the best known convergence rates among the existing first-order decentralized algorithms for decentralized consensus optimization with convex differentiable objectives. Specifically, if the $f_i$'s are convex and have Lipschitz continuous gradients, EXTRA has an ergodic convergence rate $O\left(\frac{1}{k}\right)$. If $\bar{f}$ is also (restricted) strongly convex, the rate improves to linear at $O(C^{-k})$ for some constant $C>1$.
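The contrast between DGD and EXTRA under a fixed step size can be seen on a toy problem. The sketch below assumes scalar local objectives $f_i(x) = \tfrac{1}{2}(x - b_i)^2$ (so the minimizer of $\bar{f}$ is the mean of the $b_i$), a hypothetical 4-node ring with uniform weights, and the commonly used choice $\tilde{W} = (I + W)/2$ in the EXTRA recursion; none of these specifics come from the abstract.

```python
def matvec(A, x):
    """Dense matrix-vector product on plain lists."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def grad(x, b):
    """Stacked local gradients for f_i(x) = 0.5 * (x - b_i)^2."""
    return [xi - bi for xi, bi in zip(x, b)]


def dgd(W, b, alpha, iters):
    """DGD with a fixed step size: x <- W x - alpha * grad(x)."""
    x = [0.0] * len(b)
    for _ in range(iters):
        x = [wx - alpha * g for wx, g in zip(matvec(W, x), grad(x, b))]
    return x


def extra(W, b, alpha, iters):
    """EXTRA: x^{k+2} = (I + W) x^{k+1} - W_tilde x^k - alpha * (grad^{k+1} - grad^k)."""
    n = len(b)
    W_tilde = [[(W[i][j] + (i == j)) / 2 for j in range(n)] for i in range(n)]
    x_prev = [0.0] * n
    # First step is a plain DGD step.
    x = [wx - alpha * g for wx, g in zip(matvec(W, x_prev), grad(x_prev, b))]
    for _ in range(iters):
        wx, wtx = matvec(W, x), matvec(W_tilde, x_prev)
        g, g_prev = grad(x, b), grad(x_prev, b)
        x_next = [x[i] + wx[i] - wtx[i] - alpha * (g[i] - g_prev[i]) for i in range(n)]
        x_prev, x = x, x_next
    return x


def max_error(x, x_star):
    """Worst-case distance of any node's value from the minimizer."""
    return max(abs(v - x_star) for v in x)


n = 4
W = [[1 / 3 if (i - j) % n in (0, 1, n - 1) else 0.0 for j in range(n)] for i in range(n)]
b = [1.0, 2.0, 3.0, 6.0]  # illustrative local data
x_star = sum(b) / n

print("DGD  :", max_error(dgd(W, b, 0.3, 300), x_star))
print("EXTRA:", max_error(extra(W, b, 0.3, 300), x_star))
```

With the same fixed step size, DGD settles at a point where the nodes disagree and are biased away from the minimizer, while EXTRA drives every node to the minimizer up to floating-point error.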
Article
We study a distributed computation model for optimizing a sum of convex objective functions corresponding to multiple agents. For solving this (not necessarily smooth) optimization problem, we consider a subgradient method that is distributed among the agents. The method involves every agent minimizing his/her own objective function while exchanging information locally with other agents in the network over a time-varying topology. We provide convergence results and convergence rate estimates for the subgradient method. Our convergence rate results explicitly characterize the tradeoff between a desired accuracy of the generated approximate optimal solutions and the number of iterations needed to achieve the accuracy.
Article
Conjugate gradient methods are widely used for unconstrained optimization, especially large scale problems. However, the strong Wolfe conditions are usually used in the analyses and implementations of conjugate gradient methods. This paper presents a new version of the conjugate gradient method, which converges globally provided the line search satisfies the standard Wolfe conditions. The conditions on the objective function are also weak, which are similar to that required by the Zoutendijk condition. Key words: unconstrained optimization, new conjugate gradient method, Wolfe conditions, global convergence. AMS subject classifications: 65k, 90c. 1. Introduction. Our problem is to minimize a function of $n$ variables, $$\min f(x), \tag{1.1}$$ where $f$ is smooth and its gradient $g(x)$ is available. Conjugate gradient methods for solving (1.1) are iterative methods of the form $$x_{k+1} = x_k + \alpha_k d_k, \tag{1.2}$$ where $\alpha_k > 0$ is a steplength, $d_k$ is a search direction. Normally the search direction at...
Article
This paper considers the distributed optimization problem where each node of a peer-to-peer network minimizes a finite sum of objective functions by communicating with its neighboring nodes. In sharp contrast to the existing literature where the fastest distributed algorithms converge either with a global linear or a local superlinear rate, we propose a distributed adaptive Newton (DAN) algorithm with a global quadratic convergence rate. Our key idea lies in the design of a finite-time set-consensus method with Polyak’s adaptive stepsize. Moreover, we introduce a low-rank matrix approximation (LA) technique to compress the innovation of the Hessian matrix so that each node only needs to transmit a message of dimension O(p) (where p is the dimension of decision vectors) per iteration, which is essentially the same as that of first-order methods. Nevertheless, the resulting DAN-LA converges to an optimal solution with a global superlinear rate. Numerical experiments on logistic regression problems are conducted to validate their advantages over existing methods.
Article
We study distributed composite optimization over networks: agents minimize a sum of smooth (strongly) convex functions, the agents’ sum-utility, plus a nonsmooth (extended-valued) convex one. We propose a general unified algorithmic framework for such a class of problems and provide a convergence analysis leveraging the theory of operator splitting. Distinguishing features of our scheme are: (i) When each of the agent’s functions is strongly convex, the algorithm converges at a linear rate, whose dependence on the agents’ functions and network topology is decoupled; (ii) When the objective function is convex (but not strongly convex), similar decoupling as in (i) is established for the coefficient of the proved sublinear rate. This also reveals the role of function heterogeneity on the convergence rate. (iii) The algorithm can adjust the ratio between the number of communications and computations to achieve a rate (in terms of computations) independent of the network connectivity; and (iv) A by-product of our analysis is a tuning recommendation for several existing (non-accelerated) distributed algorithms yielding provably faster (worst-case) convergence rate for the class of problems under consideration.
Article
This paper considers the problem of decentralized consensus optimization over a network, where each node holds a strongly convex and twice-differentiable local objective function. Our goal is to minimize the sum of the local objective functions and find the exact optimal solution using only local computation and neighboring communication. We propose a novel Newton tracking algorithm, which updates the local variable in each node along a local Newton direction modified with neighboring and historical information. We investigate the connections between the proposed Newton tracking algorithm and several existing methods, including gradient tracking and second-order methods. We prove that the proposed algorithm converges to the exact optimal solution at a linear rate. Furthermore, when the iterate is close to the optimal solution, we show that the proposed algorithm requires $O\left(\max\left\{\kappa_f \sqrt{\kappa_g} + \kappa_f^2, \tfrac{\kappa_g^{3/2}}{\kappa_f} + \kappa_f\sqrt{\kappa_g} \right\} \log{\frac{1}{\Delta}} \right)$ iterations to find a $\Delta$-optimal solution, where $\kappa_f$ and $\kappa_g$ are condition numbers of the objective function and the graph, respectively. Our numerical results demonstrate the efficacy of Newton tracking and validate the theoretical findings.
Article
One of the main advantages of second-order methods in a centralized setting is that they are insensitive to the condition number of the objective function's Hessian. For applications such as regression analysis, this means that less pre-processing of the data is required for the algorithm to work well, as the ill-conditioning caused by highly correlated variables will not be as problematic. Similar condition number independence has not yet been established for distributed methods. In this paper, we analyze the performance of a simple distributed second-order algorithm on quadratic problems and show that its convergence depends only logarithmically on the condition number. Our empirical results indicate that the use of second-order information can yield large efficiency improvements over first-order methods, both in terms of iterations and communications, when the condition number is of the same order of magnitude as the problem dimension.
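The insensitivity of centralized second-order methods to the condition number can be illustrated on a diagonal quadratic. The sketch below is an assumption-laden toy (objective $f(x) = \tfrac{1}{2}\sum_i a_i x_i^2$ with $a = (1, \kappa)$, gradient descent with step $1/L$, hypothetical function names), not the distributed algorithm analyzed in the paper.

```python
def gd_iterations(a, x0, tol):
    """Gradient descent with step 1/L on f(x) = 0.5 * sum(a_i * x_i^2)."""
    step = 1.0 / max(a)
    x, k = list(x0), 0
    while max(abs(v) for v in x) > tol:
        x = [v - step * ai * v for ai, v in zip(a, x)]
        k += 1
    return k


def newton_iterations(a, x0, tol):
    """Newton's method: the Hessian is diag(a), so each step is grad_i / a_i."""
    x, k = list(x0), 0
    while max(abs(v) for v in x) > tol:
        x = [v - (ai * v) / ai for ai, v in zip(a, x)]
        k += 1
    return k


# Iteration counts as the condition number kappa grows.
for kappa in (10.0, 100.0, 1000.0):
    a = [1.0, kappa]
    print(kappa, gd_iterations(a, [1.0, 1.0], 1e-6), newton_iterations(a, [1.0, 1.0], 1e-6))
```

The gradient descent count grows roughly in proportion to $\kappa$, whereas Newton's method solves a quadratic in a single step regardless of $\kappa$.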
Article
In this paper, a class of Decentralized Approximate Newton (DEAN) methods for addressing convex optimization on a networked system is developed, where nodes in the networked system seek a consensus that minimizes the sum of their individual objective functions through local interactions only. The proposed DEAN algorithms allow each node to repeatedly perform a local approximate Newton update, which leverages tracking the global Newton direction and dissipating the discrepancies among the nodes. Under less restrictive problem assumptions in comparison with most existing second-order methods, the DEAN algorithms enable the nodes to reach a consensus that can be arbitrarily close to the optimum. Moreover, for a particular DEAN algorithm, the nodes linearly converge to a common suboptimal solution with an explicit error bound. Finally, simulations demonstrate the competitive performance of DEAN in convergence speed, accuracy, and efficiency.
Article
This paper is concerned with developing a novel distributed Kalman filtering algorithm over wireless sensor networks based on a randomized consensus strategy. Compared with centralized algorithms, distributed filtering techniques require less computation per sensor and lead to more robust estimation since they simply use the information from the neighboring nodes in the network. However, poor local sensor estimation caused by limited observability and network topology changes which interfere with the global consensus are challenging issues. Motivated by this observation, we propose a novel randomized gossip-based distributed Kalman filtering algorithm. Information exchange and computation in the proposed algorithm can be carried out in an arbitrarily connected network of nodes. In addition, the computational burden can be distributed for a sensor which communicates with a stochastically selected neighbor at each clock step under schemes of the gossip algorithm. In this case, the error covariance matrix changes stochastically at every clock step, thus the convergence is considered in a probabilistic sense. We provide the mean square convergence analysis of the proposed algorit
Article
This work studies a class of non-smooth decentralized multi-agent optimization problems where the agents aim at minimizing a sum of local strongly-convex smooth components plus a common non-smooth term. We propose a general primal-dual algorithmic framework that captures many existing state-of-the-art algorithms including the adapt-then-combine gradient-tracking methods for smooth costs. We then prove linear convergence of the proposed method (to the exact minimizer) in the presence of the non-smooth term. Moreover, for the more general class of problems with agent specific non-smooth terms, we show that linear convergence cannot be achieved (in the worst case) for the class of algorithms that uses the gradients and the proximal mappings of the smooth and non-smooth parts, respectively. We further provide a numerical counterexample that shows some state-of-the-art algorithms fail to converge linearly for strongly-convex objectives and different local non-smooth terms.
Article
This paper considers nonconvex distributed constrained optimization over networks, modeled as directed (possibly time-varying) graphs. We introduce the first algorithmic framework for the minimization of the sum of a smooth nonconvex (nonseparable) function, the agent’s sum-utility, plus a difference-of-convex function (with nonsmooth convex part). This general formulation arises in many applications, from statistical machine learning to engineering. The proposed distributed method combines successive convex approximation techniques with a judiciously designed perturbed push-sum consensus mechanism that aims to track locally the gradient of the (smooth part of the) sum-utility. Sublinear convergence rate is proved when a fixed step-size (possibly different among the agents) is employed whereas asymptotic convergence to stationary solutions is proved using a diminishing step-size. Numerical results show that our algorithms compare favorably with current schemes on both convex and nonconvex problems.
Article
There has been a growing effort in studying the distributed optimization problem over a network. The objective is to optimize a global function formed by a sum of local functions, using only local computation and communication. The literature has developed consensus-based distributed (sub)gradient descent (DGD) methods and has shown that they have the same convergence rate $O\left(\frac{\log t}{\sqrt{t}}\right)$ as the centralized (sub)gradient methods (CGD) when the function is convex but possibly nonsmooth. However, when the function is convex and smooth, under the framework of DGD, it is unclear how to harness the smoothness to obtain a faster convergence rate comparable to CGD’s convergence rate. In this paper, we propose a distributed algorithm that, despite using the same amount of communication per iteration as DGD, can effectively harness the function smoothness and converge to the optimum with a rate of $O\left(\frac{1}{t}\right)$. If the objective function is further strongly convex, our algorithm has a linear convergence rate. Both rates match the convergence rate of CGD. The key step in our algorithm is a novel gradient estimation scheme that uses history information to achieve fast and accurate estimation of the average gradient. To motivate the necessity of history information, we also show that it is impossible for a class of distributed algorithms like DGD to achieve a linear convergence rate without using history information even if the objective function is strongly convex and smooth.
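A gradient estimation scheme that uses history information is commonly written in the following gradient-tracking form. The notation is assumed here for illustration: $w_{ij}$ are entries of a doubly stochastic mixing matrix, $\eta$ is a fixed step size, $x_i^t$ is node $i$'s iterate, and $s_i^t$ is its running estimate of the average gradient.

```latex
\begin{aligned}
x_i^{t+1} &= \sum_{j} w_{ij}\, x_j^{t} - \eta\, s_i^{t}, \\
s_i^{t+1} &= \sum_{j} w_{ij}\, s_j^{t} + \nabla f_i\!\left(x_i^{t+1}\right) - \nabla f_i\!\left(x_i^{t}\right),
\qquad s_i^{0} = \nabla f_i\!\left(x_i^{0}\right).
\end{aligned}
```

Summing the second line over $i$ and using that the columns of the mixing matrix sum to one gives $\frac{1}{n}\sum_i s_i^{t} = \frac{1}{n}\sum_i \nabla f_i(x_i^{t})$ for every $t$, so the estimates always average to the current average gradient. The gradient difference term is where the history information enters.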
Article
This work develops a distributed optimization strategy with guaranteed exact convergence for a broad class of left-stochastic combination policies. The resulting exact diffusion strategy is shown in Part II to have a wider stability range and superior convergence performance than the EXTRA strategy. The exact diffusion solution is applicable to non-symmetric left-stochastic combination matrices, while most earlier developments on exact consensus implementations are limited to doubly-stochastic matrices; these latter matrices impose stringent constraints on the network topology. Similar difficulties arise for implementations with right-stochastic policies, which are common in push-sum consensus solutions. The derivation of the exact diffusion strategy in this work relies on reformulating the aggregate optimization problem as a penalized problem and resorting to a diagonally-weighted incremental construction. Detailed stability and convergence analyses are pursued in Part II and are facilitated by examining the evolution of the error dynamics in a transformed domain. Numerical simulations illustrate the theoretical conclusions.
Article
We study the problem of minimizing a sum of convex objective functions, where the components of the objective are available at different nodes of a network and nodes are allowed to only communicate with their neighbors. The use of distributed gradient methods is a common approach to solve this problem. Their popularity notwithstanding, these methods exhibit slow convergence and a consequent large number of communications between nodes to approach the optimal argument because they rely on first-order information only. This paper proposes the network Newton (NN) method as a distributed algorithm that incorporates second-order information. This is done via distributed implementation of approximations of a suitably chosen Newton step. The approximations are obtained by truncation of the Newton step's Taylor expansion. This leads to a family of methods defined by the number K of Taylor series terms kept in the approximation. When keeping K terms of the Taylor series, the method is called NN-K and can be implemented through the aggregation of information in K-hop neighborhoods. Convergence to a point close to the optimal argument at a rate that is at least linear is proven and the existence of a tradeoff between convergence time and the distance to the optimal argument is shown. The numerical experiments corroborate reductions in the number of iterations and the communication cost that are necessary to achieve convergence relative to first-order alternatives.
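The truncation behind the NN-K family can be sketched with a generic matrix splitting. The symbols are assumptions for illustration: $H$ denotes the Hessian-like matrix defining the Newton step, $D$ a positive definite block-diagonal part that each node can invert locally, $B$ a remainder with the sparsity pattern of the network, and $g$ the gradient; the exact splitting used by the authors is not given in the abstract.

```latex
\begin{aligned}
H &= D - B, \\
H^{-1} &= D^{-1/2} \sum_{k=0}^{\infty} \left( D^{-1/2} B D^{-1/2} \right)^{k} D^{-1/2}
\quad \text{if } \rho\!\left( D^{-1/2} B D^{-1/2} \right) < 1, \\
d^{(K)} &= - D^{-1/2} \sum_{k=0}^{K} \left( D^{-1/2} B D^{-1/2} \right)^{k} D^{-1/2}\, g .
\end{aligned}
```

Each additional power of $B$ reaches one hop further in the network, which matches the statement that keeping $K$ terms requires information from $K$-hop neighborhoods. Truncation also explains why convergence is to a neighborhood of the optimum rather than to the optimum itself.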
Article
In this paper, we propose a distributed Newton method for consensus optimization. Our approach outperforms state-of-the-art methods, including ADMM. The key idea is to exploit the sparsity of the dual Hessian and recast the computation of the Newton step as one of efficiently solving symmetric diagonally dominant linear equations. We validate our algorithm both theoretically and empirically. On the theory side, we demonstrate that our algorithm exhibits superlinear convergence within a neighborhood of optimality. Empirically, we show the superiority of this new method on a variety of machine learning problems. The proposed approach is scalable to very large problems and has a low communication overhead.
Article
This paper considers decentralized consensus optimization problems where different summands of a global objective function are available at nodes of a network that can communicate with neighbors only. The proximal method of multipliers is considered as a powerful tool that relies on proximal primal descent and dual ascent updates on a suitably defined augmented Lagrangian. The structure of the augmented Lagrangian makes this problem nondecomposable, which precludes distributed implementations. This problem is regularly addressed by the use of the alternating direction method of multipliers. The exact second order method (ESOM) is introduced here as an alternative that relies on: (i) The use of a separable quadratic approximation of the augmented Lagrangian. (ii) A truncated Taylor’s series to estimate the solution of the first order condition imposed on the minimization of the quadratic approximation of the augmented Lagrangian. The sequences of primal and dual variables generated by ESOM are shown to converge linearly to their optimal arguments when the aggregate cost function is strongly convex and its gradients are Lipschitz continuous. Numerical results demonstrate advantages of ESOM relative to decentralized alternatives in solving least squares and logistic regression problems.
Article
We study nonconvex distributed optimization in multi-agent networks with time-varying (nonsymmetric) connectivity. We introduce the first algorithmic framework for the distributed minimization of the sum of a smooth (possibly nonconvex and nonseparable) function - the agents' sum-utility - plus a convex (possibly nonsmooth and nonseparable) regularizer. The latter is usually employed to enforce some structure in the solution, typically sparsity. The proposed method hinges on successive convex approximation techniques while leveraging dynamic consensus as a mechanism to distribute the computation among the agents: each agent first solves (possibly inexactly) a local convex approximation of the nonconvex original problem, and then performs local averaging operations. Asymptotic convergence to (stationary) solutions of the nonconvex problem is established. Our algorithmic framework is then customized to a variety of convex and nonconvex problems in several fields, including signal processing, communications, networking, and machine learning. Numerical results show that the new method compares favorably to existing distributed algorithms on both convex and nonconvex problems.
Article
This paper develops the Decentralized Linearized Alternating Direction Method of Multipliers (DLM) that minimizes a sum of local cost functions in a multiagent network. The algorithm mimics operation of the decentralized alternating direction method of multipliers (DADMM) except that it linearizes the optimization objective at each iteration. This results in iterations that, instead of successive minimizations, implement steps whose cost is akin to the much lower cost of the gradient descent step used in the distributed gradient method (DGM). The algorithm is proven to converge to the optimal solution when the local cost functions have Lipschitz continuous gradients. Its rate of convergence is shown to be linear if the local cost functions are further assumed to be strongly convex. Numerical experiments in least squares and logistic regression problems show that the numbers of iterations needed to achieve equivalent optimality gaps are similar for DLM and ADMM, and both are much smaller than that of DGM. In that sense, DLM combines the rapid convergence of ADMM with the low computational burden of DGM.
Article
We consider the problem of finding a linear iteration that yields distributed averaging consensus over a network, i.e., that asymptotically computes the average of some initial values given at the nodes. When the iteration is assumed symmetric, the problem of finding the fastest converging linear iteration can be cast as a semidefinite program, and therefore efficiently and globally solved. These optimal linear iterations are often substantially faster than several common heuristics that are based on the Laplacian of the associated graph. We show how problem structure can be exploited to speed up interior-point methods for solving the fastest distributed linear iteration problem, for networks with up to a thousand or so edges. We also describe a simple subgradient method that handles far larger problems, with up to 100 000 edges. We give several extensions and variations on the basic problem.
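As a point of comparison for the optimized iterations, the sketch below builds one commonly used degree-based heuristic, Metropolis weights, and runs the linear iteration $x(t+1) = W x(t)$. It does not solve the semidefinite program described above; the 5-node path graph and the function names are assumptions for illustration.

```python
def metropolis_weights(edges, n):
    """Symmetric, doubly stochastic weights: W_ij = 1 / (1 + max(d_i, d_j)) on edges."""
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    W = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        W[i][j] = W[j][i] = 1.0 / (1 + max(deg[i], deg[j]))
    for i in range(n):
        # The diagonal is still zero here, so this sums the off-diagonal weights.
        W[i][i] = 1.0 - sum(W[i])
    return W


def average_by_iteration(W, x, steps):
    """Run the linear iteration x(t+1) = W x(t) for a fixed number of steps."""
    for _ in range(steps):
        x = [sum(w * v for w, v in zip(row, x)) for row in W]
    return x


# Hypothetical 5-node path graph 0-1-2-3-4.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
W = metropolis_weights(edges, 5)
x0 = [10.0, 0.0, 0.0, 0.0, 0.0]  # illustrative initial values
print(average_by_iteration(W, x0, 200))
```

Each node only needs its own degree and its neighbors' degrees to build its row of `W`, which is what makes such heuristics attractive in practice even when they are slower than the optimized weights.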
Article
We consider a symmetric random walk on a connected graph, where each edge is labeled with the probability of transition between the two adjacent vertices. The associated Markov chain has a uniform equilibrium distribution; the rate of convergence to this distribution, i.e., the mixing rate of the Markov chain, is determined by the second largest (in magnitude) eigenvalue of the transition matrix. In this paper we address the problem of assigning probabilities to the edges of the graph in such a way as to minimize the second largest magnitude eigenvalue, i.e., the problem of finding the fastest mixing Markov chain on the graph.
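The problem described above can be written compactly as follows. The notation is assumed here: $P$ is the $n \times n$ transition matrix, $\mathcal{E}$ the edge set of the graph (self-loops allowed), $\mathbf{1}$ the all-ones vector, and $\lambda_1(P) = 1 \ge \lambda_2(P) \ge \dots \ge \lambda_n(P)$ the eigenvalues of $P$.

```latex
\begin{aligned}
\text{minimize} \quad & \mu(P) = \max\{\lambda_2(P),\, -\lambda_n(P)\} \\
\text{subject to} \quad & P \ge 0 \ \text{(entrywise)}, \qquad P\mathbf{1} = \mathbf{1}, \qquad P = P^{T}, \\
& P_{ij} = 0 \quad \text{for } i \ne j,\ \{i,j\} \notin \mathcal{E}.
\end{aligned}
```

Symmetry of $P$ is what makes the equilibrium distribution uniform, and $\mu(P)$ is the second largest eigenvalue magnitude that governs the mixing rate.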