FedNL: Making Newton-Type Methods
Applicable to Federated Learning
Mher Safaryan∗   Rustem Islamov†   Xun Qian‡   Peter Richtárik§
June 05, 2021
Abstract
Inspired by recent work of Islamov et al. (2021), we propose a family of Federated Newton Learn (FedNL) methods, which we believe is a marked step in the direction of making second-order methods applicable to FL. In contrast to the aforementioned work, FedNL employs a different Hessian learning technique which i) enhances privacy as it does not rely on the training data to be revealed to the coordinating server, ii) makes it applicable beyond generalized linear models, and iii) provably works with general contractive compression operators for compressing the local Hessians, such as Top-$K$ or Rank-$R$, which are vastly superior in practice. Notably, we do not need to rely on error feedback for our methods to work with contractive compressors.

Moreover, we develop FedNL-PP, FedNL-CR and FedNL-LS, which are variants of FedNL that support partial participation, and globalization via cubic regularization and line search, respectively, and FedNL-BC, which is a variant that can further benefit from bidirectional compression of gradients and models, i.e., smart uplink gradient and smart downlink model compression.

We prove local convergence rates that are independent of the condition number, the number of training data points, and compression variance. Our communication efficient Hessian learning technique provably learns the Hessian at the optimum.

Finally, we perform a variety of numerical experiments that show that our FedNL methods have state-of-the-art communication complexity when compared to key baselines.
∗King Abdullah University of Science and Technology, Thuwal, Saudi Arabia.
†King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia, and Moscow Institute of Physics and Technology (MIPT), Dolgoprudny, Russia. This research was conducted while this author was an intern at KAUST and an undergraduate student at MIPT.
‡King Abdullah University of Science and Technology, Thuwal, Saudi Arabia.
§King Abdullah University of Science and Technology, Thuwal, Saudi Arabia.
Contents

1 Introduction
  1.1 First-order methods for FL
  1.2 Towards second-order methods for FL
  1.3 Desiderata for second-order methods applicable to FL
2 Contributions
  2.1 The Newton Learn framework of Islamov et al. [2021]
  2.2 Issues with the Newton Learn framework
  2.3 Our FedNL framework
3 The Vanilla Federated Newton Learn Method
  3.1 New Hessian learning technique
  3.2 Compressing matrices
  3.3 Two options for updating the global model
  3.4 Local convergence theory
  3.5 FedNL and the “Newton Triangle”
4 FedNL with Partial Participation, Globalization and Bidirectional Compression
  4.1 Partial Participation
  4.2 Globalization via Line Search
  4.3 Globalization via Cubic Regularization
  4.4 Bidirectional Compression
5 Experiments
  5.1 Parameter setting
  5.2 Local convergence
  5.3 Global convergence
  5.4 Comparison with NL1
A Extra Experiments
  A.1 Datasets
  A.2 Parameters setting
  A.3 Compression operators
    A.3.1 Random dithering for vectors
    A.3.2 Rank-R compression operator for matrices
    A.3.3 Top-K compression operator for matrices
    A.3.4 Rand-K compression operator for matrices
  A.4 Projection onto the cone of positive definite matrices
  A.5 The effect of compression
  A.6 Comparison of Options 1 and 2
  A.7 Comparison of different compression operators
  A.8 Comparison of different update rules for Hessians
  A.9 Bi-directional compression
  A.10 The performance of FedNL-PP
  A.11 Comparison with NL1
  A.12 Local comparison
  A.13 Global comparison
  A.14 Effect of statistical heterogeneity
B Proofs of Results from Section 3
  B.1 Auxiliary lemma
  B.2 Proof of Theorem 3.6
  B.3 Proof of Lemma 3.7
  B.4 Proof of Lemma 3.8
C Extension: Partial Participation (FedNL-PP)
  C.1 Hessian corrected local gradients $g_i^k$
  C.2 Importance of compression errors $l_i^k$
  C.3 Local convergence theory
  C.4 Proof of Theorem C.1
  C.5 Proof of Lemma C.2
  C.6 Proof of Lemma C.3
D Extension: Globalization via Line Search (FedNL-LS)
  D.1 Line search procedure
  D.2 Local convergence theory
  D.3 Proof of Theorem D.1
  D.4 Proof of Lemma D.2
E Extension: Globalization via Cubic Regularization (FedNL-CR)
  E.1 Cubic regularization
  E.2 Solving the subproblem
  E.3 Importance of compression errors $l_i^k$
  E.4 Global and local convergence theory
  E.5 Proof of Theorem E.1
  E.6 Proof of Lemma E.2
F Extension: Bidirectional Compression (FedNL-BC)
  F.1 Model learning technique
  F.2 Hessian corrected local gradients
  F.3 Local convergence theory
  F.4 Proof of Theorem F.4
  F.5 Proof of Lemma F.5
  F.6 Proof of Lemma F.6
G Local Quadratic Rate of NEWTON-STAR for General Finite-Sum Problems
H Table of Frequently Used Notation
I Limitations
1 Introduction
In this paper we consider the cross-silo federated learning problem
$$\min_{x\in\mathbb{R}^d}\left\{f(x) := \frac{1}{n}\sum_{i=1}^n f_i(x)\right\}, \qquad (1)$$
where $d$ denotes the dimension of the model $x\in\mathbb{R}^d$ we wish to train, $n$ is the total number of silos/machines/devices/clients in the distributed system, $f_i(x)$ is the loss/risk associated with the data stored on machine $i\in[n] := \{1,2,\dots,n\}$, and $f(x)$ is the empirical loss/risk.
1.1 First-order methods for FL
The prevalent paradigm for training federated learning (FL) models [Konečný et al., 2016b,a,
McMahan et al., 2017] (see also the recent surveys by Kairouz et al. [2019] and Li et al. [2020a]) is to use
distributed first-order optimization methods employing one or more tools for enhancing communication
efficiency, which is a key bottleneck in the federated setting.
These tools include communication compression [Konečný et al., 2016b, Alistarh et al., 2017,
Khirirat et al., 2018] and techniques for progressively reducing the variance introduced by compression
[Mishchenko et al., 2019, Horváth et al., 2019, Gorbunov et al., 2020a, Li et al., 2020b, Gorbunov et al.,
2021a], local computation [McMahan et al., 2017, Stich, 2020, Khaled et al., 2020, Mishchenko et al.,
2021a] and techniques for reducing the client drift introduced by local computation [Karimireddy
et al., 2020, Gorbunov et al., 2021b], and partial participation [McMahan et al., 2017, Gower et al.,
2019] and techniques for taming the slow-down introduced by partial participation [Gorbunov et al.,
2020a, Chen et al., 2020].
Other useful techniques for further reducing the communication complexity of FL methods include the use of momentum [Mishchenko et al., 2019, Li et al., 2020b], and adaptive learning rates [Malitsky and Mishchenko, 2019, Xie et al., 2019, Reddi et al., 2020, Mishchenko et al., 2021b]. In addition, aspiring FL methods need to protect the privacy of the clients’ data, and need to be built with data heterogeneity in mind [Kairouz et al., 2019].
1.2 Towards second-order methods for FL
While first-order methods are the methods of choice in the context of FL at the moment, their communication complexity necessarily depends on (a suitable notion of) the condition number of the problem, which can be very large as it depends on the structure of the model being trained, on the choice of the loss function, and most importantly, on the properties of the training data.

However, in many situations when algorithm design is not constrained by the stringent requirements characterizing FL, it is very well known that carefully designed second-order methods can be vastly superior. On an intuitive level, this is mainly because these methods make an extra computational effort to estimate the local curvature of the loss landscape, which is useful in generating more powerful and adaptive update directions. However, in FL it is often communication, and not computation, which forms the key bottleneck, and hence the idea of “going second order” looks attractive. The theoretical benefits of using curvature information are well known. For example, the classical Newton’s method, which forms the basis for most efficient second-order methods in much the same way the gradient descent method forms the basis for more elaborate first-order methods, enjoys a fast condition-number-independent (local) convergence rate [Beck, 2014], which is beyond the reach of all first-order methods. However, Newton’s method does not admit an efficient distributed implementation in the heterogeneous data regime as it requires repeated communication of the local Hessian matrices $\nabla^2 f_i \in \mathbb{R}^{d\times d}$ to the server, which is prohibitive as this constitutes a massive burden on the communication links.
1.3 Desiderata for second-order methods applicable to FL
In this paper we take the stance that it would be highly desirable to develop Newton-type methods
for solving the cross-silo federated learning problem (1) that would
[hd] work well in the truly heterogeneous data setting (i.e., we do not want to assume that the functions $f_1,\dots,f_n$ are “similar”),

[fs] apply to the general finite-sum problem (1), without imparting strong structural assumptions on the local functions $f_1,\dots,f_n$ (e.g., we do not want to assume that the functions $f_1,\dots,f_n$ are quadratics, generalized linear models, and so on),

[as] benefit from Newton-like (matrix-valued) adaptive stepsizes,

[pe] employ at least a rudimentary privacy enhancement mechanism (in particular, we do not want the devices to be sending/revealing their training data to the server),

[uc] enjoy, through unbiased communication compression strategies applied to the Hessian, such as Rand-$K$, the same low $\mathcal{O}(d)$ communication cost per communication round as gradient descent,

[cc] be able to benefit from the more aggressive contractive communication compression strategies applied to the Hessian, such as Top-$K$ and Rank-$R$,

[fr] have fast local rates unattainable by first order methods (e.g., rates independent of the condition number),

[pp] support partial participation (this is important when the number $n$ of devices is very large),

[gg] have global convergence guarantees, and superior global empirical behavior, when combined with a suitable globalization strategy (e.g., line search or cubic regularization),

[gc] optionally be able to use, for a more dramatic communication reduction, additional smart uplink (i.e., device to server) gradient compression,

[mc] optionally be able to use, for a more dramatic communication reduction, additional smart downlink (i.e., server to device) model compression,

[lc] perform provably useful local computation, even in the heterogeneous data setting (it is known that local computation via gradient-type steps, which form the backbone of methods such as FedAvg and LocalSGD, provably helps under some degree of data similarity only).
However, to the best of our knowledge, existing Newton-type methods are not applicable to FL
as they are not compatible with most of the aforementioned desiderata.
It is therefore natural and pertinent to ask whether it is possible to design theoretically
well grounded and empirically well performing Newton-type methods that would be able to
conform to the FL-specific desiderata listed above.
In this work, we address this challenge in the affirmative.
2 Contributions
Before detailing our contributions, it will be very useful to briefly outline the key elements of the
recently proposed Newton Learn (NL) framework of Islamov et al. [2021], which served as the main
inspiration for our work, and which is also the closest work to ours.
2.1 The Newton Learn framework of Islamov et al. [2021]
The starting point of their work is the observation that the Newton-like method
$$x^{k+1} = x^k - \left(\nabla^2 f(x^*)\right)^{-1}\nabla f(x^k),$$
called Newton Star (NS), where $x^*$ is the (unique) solution of (1), converges to $x^*$ locally quadratically under suitable assumptions, which is a desirable property it inherits from the classical Newton method. Clearly, this method is not practical, as it relies on the knowledge of the Hessian at the optimum.

However, under the assumption that the matrix $\nabla^2 f(x^*)$ is known to the server, NS can be implemented with $\mathcal{O}(d)$ communication cost in each communication round. Indeed, NS can simply be treated as gradient descent, albeit with a matrix-valued stepsize equal to $(\nabla^2 f(x^*))^{-1}$.

The first key contribution of Islamov et al. [2021] is the design of a strategy, for which they coined the term Newton Learn, which learns the Hessians $\nabla^2 f_1(x^*),\dots,\nabla^2 f_n(x^*)$, and hence their average, $\nabla^2 f(x^*)$, progressively throughout the iterative process, and does so in a communication efficient manner, using unbiased compression [uc] of Hessian information. In particular, the compression level can be adjusted so that in each communication round only $\mathcal{O}(d)$ floats need to be communicated between each device and the server. In each iteration, the master uses the average of the current learned local Hessian matrices in place of the Hessian at the optimum, and subsequently performs a step similar to that of NS. So, their method uses adaptive matrix-valued stepsizes [as].

Islamov et al. [2021] prove that their learning procedure indeed works in the sense that the sequences of the learned local matrices converge to the local optimal Hessians $\nabla^2 f_i(x^*)$. This property leads to a Newton-like acceleration, and as a result, their NL methods enjoy a local linear convergence rate (for a Lyapunov function that includes Hessian convergence) and a local superlinear convergence rate (for the distance to the optimum) that is independent of the condition number, which is a property beyond the reach of any first-order method [fr]. Moreover, all of this provably works in the heterogeneous data setting [hd].

Finally, they develop a practical and theoretically grounded globalization strategy [gg] based on cubic regularization, called Cubic Newton Learn (CNL).
2.2 Issues with the Newton Learn framework
While the above development is clearly very promising in the context of distributed optimization,
the results suffer from several limitations which prevent the methods from being applicable to FL.
Table 1: Comparison of the main features of our family of FedNL algorithms and results with those of Islamov et al. [2021], which we used as an inspiration. We have made numerous and significant modifications and improvements in order to obtain methods applicable to federated learning.

| Feature | Islamov et al. [2/'21] | This work [5/'21] |
|---|---|---|
| [hd] supports heterogeneous data setting | ✓ | ✓ |
| [fs] applies to general finite-sum problems | ✗ | ✓ |
| [as] uses adaptive stepsizes | ✓ | ✓ |
| [pe] privacy is enhanced (training data is not sent to the server) | ✗ | ✓ |
| [uc] supports unbiased Hessian compression (e.g., Rand-K) | ✓ | ✓ |
| [cc] supports contractive Hessian compression (e.g., Top-K) | ✗ | ✓ |
| [fr] fast local rate: independent of the condition number | ✓ | ✓ |
| [fr] fast local rate: independent of the # of training data points | ✗ | ✓ |
| [fr] fast local rate: independent of the compressor variance | ✗ | ✓ |
| [pp] supports partial participation | ✗ | ✓ (Alg 2) |
| [gg] has global convergence guarantees via line search | ✗ | ✓ (Alg 3) |
| [gg] has global convergence guarantees via cubic regularization | ✓ | ✓ (Alg 4) |
| [gc] supports smart uplink gradient compression at the devices | ✗ | ✓ (Alg 5) |
| [mc] supports smart downlink model compression by the master | ✗ | ✓ (Alg 5) |
| [lc] performs useful local computation | ✓ | ✓ |
First, the Newton Learn strategy of Islamov et al. [2021] critically depends on the assumption that the local functions are of the form
$$f_i(x) = \frac{1}{m}\sum_{j=1}^m \varphi_{ij}\left(a_{ij}^\top x\right), \qquad (2)$$
where $\varphi_{ij}:\mathbb{R}\to\mathbb{R}$ are sufficiently well behaved functions, and $a_{i1},\dots,a_{im}\in\mathbb{R}^d$ are the training data points owned by device $i$. As a result, their approach is limited to generalized linear models only, which violates [fs] from the aforementioned wish list. Second, their communication strategy critically relies on each device $i$ sending a small subset of their private training data $\{a_{i1},\dots,a_{im}\}$ to the server in each communication round, which violates [pe]. Further, while their approach supports $\mathcal{O}(d)$ communication, it does not support more general contractive compressors [cc], such as Top-$K$ and Rank-$R$, which have been found very useful in the context of first order methods with gradient compression. Finally, the methods of Islamov et al. [2021] do not support bidirectional compression of gradients and models [gc], [mc], and do not support partial participation [pp].
2.3 Our FedNL framework
We propose a family of five Federated Newton Learn methods (Algorithms 1–5), which we believe
constitutes a marked step in the direction of making second-order methods applicable to FL.
In contrast to the work of Islamov et al. [2021] (see Table 1), our vanilla method FedNL (Algorithm 1) employs a different Hessian learning technique, which makes it applicable beyond generalized linear models (2) to general finite-sum problems [fs], enhances privacy as it does not rely on the training data to be revealed to the coordinating server [pe], and provably works with general contractive compression operators for compressing the local Hessians, such as Top-$K$ or Rank-$R$, which are vastly superior in practice [cc]. Notably, for our methods to work with contractive compressors, we do not need to rely on error feedback [Seide et al., 2014, Stich et al., 2018, Karimireddy et al., 2019, Gorbunov et al., 2020b], which is essential to prevent divergence in first-order methods employing such compressors [Beznosikov et al., 2020]. We prove that our communication efficient Hessian learning technique provably learns the Hessians at the optimum.

Like Islamov et al. [2021], we prove local convergence rates that are independent of the condition number [fr]. However, unlike their rates, some of our rates are also independent of the number of training data points, and of the compression variance [fr]. All our complexity results are summarized in Table 2.

Table 2: Summary of algorithms proposed and convergence results proved in this paper.

| Method | Convergence result | Rate type | Indep. of condition # | Indep. of # training data | Indep. of compressor | Theorem |
|---|---|---|---|---|---|---|
| Newton Zero, N0 (Equation (9)) | $r^k \le \frac{1}{2^k} r^0$ | local linear | ✓ | ✓ | ✓ | 3.6 |
| FedNL (Algorithm 1) | $r^k \le \frac{1}{2^k} r^0$ | local linear | ✓ | ✓ | ✓ | 3.6 |
|  | $\Phi_1^k \le \theta^k \Phi_1^0$ | local linear | ✓ | ✓ | ✗ | 3.6 |
|  | $r^{k+1} \le \theta^k r^k$ | local superlinear | ✓ | ✓ | ✗ | 3.6 |
| FedNL-PP (Algorithm 2), partial participation | $\mathcal{W}^k \le \theta^k \mathcal{W}^0$ | local linear | ✓ | ✓ | ✓ | C.1 |
|  | $\Phi_2^k \le \theta^k \Phi_2^0$ | local linear | ✓ | ✓ | ✗ | C.1 |
|  | $r^{k+1} \le \theta^k \mathcal{W}^k$ | local linear | ✓ | ✓ | ✗ | C.1 |
| FedNL-LS (Algorithm 3), line search | $\Delta^k \le \theta^k \Delta^0$ | global linear | ✗ | ✓ | ✓ | D.1 |
| FedNL-CR (Algorithm 4), cubic regularization | $\Delta^k \le c/k$ | global sublinear | ✗ | ✓ | ✓ | E.1 |
|  | $\Delta^k \le \theta^k \Delta^0$ | global linear | ✗ | ✓ | ✓ | E.1 |
|  | $\Phi_1^k \le \theta^k \Phi_1^0$ | local linear | ✓ | ✓ | ✗ | E.1 |
|  | $r^{k+1} \le \theta^k r^k$ | local superlinear | ✓ | ✓ | ✗ | E.1 |
| FedNL-BC (Algorithm 5), bidirectional compression | $\Phi_3^k \le \theta^k \Phi_3^0$ | local linear | ✓ | ✓ | ✗ | F.4 |
| Newton Star, NS (Equation (55)) | $r^{k+1} \le c\,(r^k)^2$ | local quadratic | ✓ | ✓ | ✓ | G.1 |

Quantities for which we prove convergence: (i) distance to solution $r^k := \|x^k - x^*\|^2$ and $\mathcal{W}^k := \frac{1}{n}\sum_{i=1}^n \|w_i^k - x^*\|^2$; (ii) Lyapunov functions $\Phi_1^k := c\|x^k - x^*\|^2 + \frac{1}{n}\sum_{i=1}^n \|\mathbf{H}_i^k - \nabla^2 f_i(x^*)\|_F^2$, $\Phi_2^k := c\,\mathcal{W}^k + \frac{1}{n}\sum_{i=1}^n \|\mathbf{H}_i^k - \nabla^2 f_i(x^*)\|_F^2$, and $\Phi_3^k := \|z^k - x^*\|^2 + c\|w^k - x^*\|^2$; (iii) function value suboptimality $\Delta^k := f(x^k) - f(x^*)$. The constants $c > 0$ and $\theta\in(0,1)$ are possibly different each time they appear in this table. Refer to the precise statements of the theorems for the exact values.
Moreover, we show that our approach works in the partial participation [pp] regime by developing
the FedNL-PP method (Algorithm 2), and devise methods employing globalization strategies: FedNL-
LS (Algorithm 3), based on backtracking line search, and FedNL-CR (Algorithm 4), based on cubic
regularization [gg]. We show through experiments that the former is much more powerful in practice
than the latter. Hence, the proposed line search globalization is superior to the cubic regularization
approach employed by Islamov et al. [2021].
Our approach can further benefit from smart uplink gradient compression [gc] and smart downlink
model compression [mc] – see FedNL-BC (Algorithm 5).
Finally, we perform a variety of numerical experiments that show that our FedNL methods have
state-of-the-art communication complexity when compared to key baselines.
3 The Vanilla Federated Newton Learn Method
We start the presentation of our algorithms with the vanilla FedNL method, commenting on the intuitions and technical novelties. The method is formally described in Algorithm 1.¹
3.1 New Hessian learning technique
The first key technical novelty in FedNL is the new mechanism for learning the Hessian $\nabla^2 f(x^*)$ at the (unique) solution $x^*$ in a communication efficient manner. This is achieved by maintaining and progressively updating local Hessian estimates $\mathbf{H}_i^k$ of $\nabla^2 f_i(x^*)$ for all devices $i\in[n]$ and the global Hessian estimate
$$\mathbf{H}^k = \frac{1}{n}\sum_{i=1}^n \mathbf{H}_i^k$$
of $\nabla^2 f(x^*)$ for the central server. Thus, the goal is to induce $\mathbf{H}_i^k \to \nabla^2 f_i(x^*)$ for all $i\in[n]$, and as a consequence, $\mathbf{H}^k \to \nabla^2 f(x^*)$, throughout the training process.
Algorithm 1 FedNL (Federated Newton Learn)
1: Parameters: Hessian learning rate $\alpha \ge 0$; compression operators $\{\mathcal{C}_1^k,\dots,\mathcal{C}_n^k\}$
2: Initialization: $x^0\in\mathbb{R}^d$; $\mathbf{H}_1^0,\dots,\mathbf{H}_n^0\in\mathbb{R}^{d\times d}$ and $\mathbf{H}^0 := \frac{1}{n}\sum_{i=1}^n \mathbf{H}_i^0$
3: for each device $i = 1,\dots,n$ in parallel do
4:    Get $x^k$ from the server and compute local gradient $\nabla f_i(x^k)$ and local Hessian $\nabla^2 f_i(x^k)$
5:    Send $\nabla f_i(x^k)$, $\mathbf{S}_i^k := \mathcal{C}_i^k\left(\nabla^2 f_i(x^k) - \mathbf{H}_i^k\right)$ and $l_i^k := \|\mathbf{H}_i^k - \nabla^2 f_i(x^k)\|_F$ to the server
6:    Update local Hessian shift to $\mathbf{H}_i^{k+1} = \mathbf{H}_i^k + \alpha\mathbf{S}_i^k$
7: end for
8: on server
9:    Get $\nabla f_i(x^k)$, $\mathbf{S}_i^k$ and $l_i^k$ from each node $i\in[n]$
10:   $\nabla f(x^k) = \frac{1}{n}\sum_{i=1}^n \nabla f_i(x^k)$, $\mathbf{S}^k = \frac{1}{n}\sum_{i=1}^n \mathbf{S}_i^k$, $l^k = \frac{1}{n}\sum_{i=1}^n l_i^k$, $\mathbf{H}^{k+1} = \mathbf{H}^k + \alpha\mathbf{S}^k$
11:   Option 1: $x^{k+1} = x^k - \left[\mathbf{H}^k\right]_\mu^{-1}\nabla f(x^k)$
12:   Option 2: $x^{k+1} = x^k - \left(\mathbf{H}^k + l^k\mathbf{I}\right)^{-1}\nabla f(x^k)$
A naive choice for the local estimates $\mathbf{H}_i^k$ would be the exact local Hessians $\nabla^2 f_i(x^k)$, and consequently the global estimate $\mathbf{H}^k$ would be the exact global Hessian $\nabla^2 f(x^k)$. While this naive approach learns the global Hessian at the optimum, it needs to communicate the entire matrices $\nabla^2 f_i(x^k)$ to the server in each iteration, which is extremely costly. Instead, in FedNL we aim to reuse past Hessian information and build the next estimate $\mathbf{H}_i^{k+1}$ by updating the current estimate $\mathbf{H}_i^k$. Since all devices have to be synchronized with the server, we also need to make sure the update from $\mathbf{H}_i^k$ to $\mathbf{H}_i^{k+1}$ is easy to communicate. With this intuition in mind, we propose to update the local Hessian estimates via the rule
$$\mathbf{H}_i^{k+1} = \mathbf{H}_i^k + \alpha\mathbf{S}_i^k, \qquad\text{where}\qquad \mathbf{S}_i^k = \mathcal{C}_i^k\left(\nabla^2 f_i(x^k) - \mathbf{H}_i^k\right),$$
and $\alpha > 0$ is the learning rate. Notice that we reduce the communication cost by explicitly requiring all devices $i\in[n]$ to send only the compressed matrices $\mathbf{S}_i^k$ to the server.

¹For all our methods, we describe the steps constituting a single communication round only. To get an iterative method, one simply needs to repeat the provided steps in an iterative fashion.
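To make this rule concrete, here is a minimal NumPy sketch of the device-side computation in one FedNL communication round. This is our own illustrative code, not the authors' implementation: the simple Top-$K$ compressor and the function names (`topk_matrix`, `device_round`, `hess_fn`, `grad_fn`) are hypothetical choices.

```python
import numpy as np

def topk_matrix(M, K):
    """Keep the K entries of M that are largest in magnitude; zero out the rest."""
    flat = M.flatten()
    idx = np.argpartition(np.abs(flat), -K)[-K:]
    out = np.zeros_like(flat)
    out[idx] = flat[idx]
    return out.reshape(M.shape)

def device_round(grad_fn, hess_fn, x_k, H_i, alpha=1.0, K=None):
    """One FedNL round on device i (sketch).

    grad_fn / hess_fn: callables returning the local gradient / Hessian at x_k.
    Returns the message sent to the server and the updated local estimate H_i.
    """
    grad_i = grad_fn(x_k)                          # local gradient  grad f_i(x^k)
    nabla2_fi = hess_fn(x_k)                       # local Hessian   Hess f_i(x^k)
    K = K if K is not None else x_k.size           # e.g. K = d  =>  O(d) floats sent
    S_i = topk_matrix(nabla2_fi - H_i, K)          # compressed Hessian correction
    l_i = np.linalg.norm(H_i - nabla2_fi, 'fro')   # scalar compression error
    H_i_new = H_i + alpha * S_i                    # local Hessian estimate update
    return (grad_i, S_i, l_i), H_i_new
```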
The Hessian learning technique employed in the Newton Learn framework of Islamov et al. [2021] is critically different from ours as it heavily depends on the structure (2) of the local functions. Indeed, the local optimal Hessians
$$\nabla^2 f_i(x^*) = \frac{1}{m}\sum_{j=1}^m \varphi''_{ij}\left(a_{ij}^\top x^*\right) a_{ij} a_{ij}^\top$$
are learned via the proxy of learning the optimal scalars $\varphi''_{ij}(a_{ij}^\top x^*)$ for all local data points $j\in[m]$, which also requires the transmission of the active data points $a_{ij}$ to the server in each iteration. This makes their method inapplicable to general finite-sum problems [fs], and incapable of securing even the most rudimentary privacy enhancement [pe] mechanism.

We do not make any structural assumption on the problem (1), and rely on the following general conditions to prove the effectiveness of our Hessian learning technique:
Assumption 3.1. The average loss $f$ is $\mu$-strongly convex, and all local losses $f_i(x)$ have Lipschitz continuous Hessians. Let $L$, $L_F$ and $L_\infty$ be the Lipschitz constants with respect to three different matrix norms: spectral, Frobenius and infinity norms, respectively. Formally, we require
$$\|\nabla^2 f_i(x) - \nabla^2 f_i(y)\| \le L\|x-y\|,$$
$$\|\nabla^2 f_i(x) - \nabla^2 f_i(y)\|_F \le L_F\|x-y\|,$$
$$\max_{j,l}\left|\left(\nabla^2 f_i(x) - \nabla^2 f_i(y)\right)_{jl}\right| \le L_\infty\|x-y\|$$
to hold for all $i\in[n]$ and $x,y\in\mathbb{R}^d$.
3.2 Compressing matrices
In the literature on first-order compressed methods, compression operators are typically applied to vectors (e.g., gradients, gradient differences, models). As our approach is based on second-order information, we apply compression operators to $d\times d$ matrices of the form $\nabla^2 f_i(x^k) - \mathbf{H}_i^k$ instead. For this reason, we adapt two popular classes of compression operators used in first-order methods to act on $d\times d$ matrices by treating them as vectors of dimension $d^2$.

Definition 3.2 (Unbiased Compressors). By $\mathbb{B}(\omega)$ we denote the class of (possibly randomized) unbiased compression operators $\mathcal{C}:\mathbb{R}^{d\times d}\to\mathbb{R}^{d\times d}$ with variance parameter $\omega\ge 0$ satisfying
$$\mathbb{E}[\mathcal{C}(\mathbf{M})] = \mathbf{M}, \qquad \mathbb{E}\|\mathcal{C}(\mathbf{M}) - \mathbf{M}\|_F^2 \le \omega\|\mathbf{M}\|_F^2, \qquad \forall\,\mathbf{M}\in\mathbb{R}^{d\times d}. \qquad (3)$$

Common choices of unbiased compressors are random sparsification and quantization (see the Appendix).

Definition 3.3 (Contractive Compressors). By $\mathbb{C}(\delta)$ we denote the class of deterministic contractive compression operators $\mathcal{C}:\mathbb{R}^{d\times d}\to\mathbb{R}^{d\times d}$ with contraction parameter $\delta\in[0,1]$ satisfying
$$\|\mathcal{C}(\mathbf{M})\|_F \le \|\mathbf{M}\|_F, \qquad \|\mathcal{C}(\mathbf{M}) - \mathbf{M}\|_F^2 \le (1-\delta)\|\mathbf{M}\|_F^2, \qquad \forall\,\mathbf{M}\in\mathbb{R}^{d\times d}. \qquad (4)$$
The first condition of (4) can be easily removed by scaling the operator $\mathcal{C}$ appropriately. Indeed, if for some $\mathbf{M}\in\mathbb{R}^{d\times d}$ we have $\|\mathcal{C}(\mathbf{M})\|_F > \|\mathbf{M}\|_F$, then we can use the scaled compressor $\widetilde{\mathcal{C}}(\mathbf{M}) := \frac{\|\mathbf{M}\|_F}{\|\mathcal{C}(\mathbf{M})\|_F}\,\mathcal{C}(\mathbf{M})$ instead, as this satisfies (4) with the same parameter $\delta$. Common examples of contractive compressors are the Top-$K$ and Rank-$R$ operators (see the Appendix).
From the theory of first-order methods employing compressed communication, it is known
that handling contractive biased compressors is much more challenging than handling unbiased
compressors. In particular, a popular mechanism for preventing first-order methods utilizing biased
compressors from divergence is the error feedback framework. However, contractive compressors
often perform much better empirically than their unbiased counterparts. To highlight the strength
of our new Hessian learning technique, we develop our theory in a flexible way, and handle both
families of compression operators. Surprisingly, we do not need to use error feedback for contractive
compressors for our methods to work.
Compression operators are used in [Islamov et al., 2021] in a fundamentally different way. First, their theory supports unbiased compressors only, and does not cover the practically favorable contractive compressors [cc]. More importantly, compression is applied within the representation (2) as an operator acting on the space $\mathbb{R}^m$. In contrast to our strategy of using compression operators, this brings the necessity to reveal, in each iteration, the training data $\{a_{i1},\dots,a_{im}\}$ whose corresponding coefficients in (2) are not zeroed out after the compression step [pe]. Moreover, when $\mathcal{O}(d)$ communication cost per communication round is achieved, the variance of the compression noise depends on the number of data points $m$, which then negatively affects the local convergence rates. As the amount of training data can be huge, our convergence rates provide stronger guarantees by not depending on the size of the training dataset [fr].
3.3 Two options for updating the global model
Finally, we offer two options for updating the global model at the server.

• The first option assumes that the server knows the strong convexity parameter $\mu > 0$ (see Assumption 3.1), and that it is powerful enough to compute the projected Hessian estimate $\left[\mathbf{H}^k\right]_\mu$, i.e., that it is able to project the current global Hessian estimate $\mathbf{H}^k$ onto the set
$$\left\{\mathbf{M}\in\mathbb{R}^{d\times d} : \mathbf{M}^\top = \mathbf{M},\ \mu\mathbf{I}\preceq\mathbf{M}\right\}$$
in each iteration (see the Appendix).

• Alternatively, if $\mu$ is unknown, all devices send the compression errors $l_i^k := \|\mathbf{H}_i^k - \nabla^2 f_i(x^k)\|_F$ (this extra communication is extremely cheap as all $l_i^k$ variables are floats) to the server, which then computes the corrected Hessian estimate $\mathbf{H}^k + l^k\mathbf{I}$ by adding the average error $l^k = \frac{1}{n}\sum_{i=1}^n l_i^k$ to the global Hessian estimate $\mathbf{H}^k$.

Both options require the server in each iteration to solve a linear system to invert either the projected, or the corrected, global Hessian estimate. The purpose of these options is quite simple: unlike the true Hessian, the compressed local Hessian estimates $\mathbf{H}_i^k$, and also the global Hessian estimate $\mathbf{H}^k$, might not be positive definite, or might not even be of full rank. The further importance of the errors $l_i^k$ will be discussed when we consider extensions of FedNL to partial participation and globalization via cubic regularization.
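As a rough sketch of the two server-side options, the following illustrative code (our own; all names such as `project_mu` and `server_update` are hypothetical) aggregates the device messages and takes the Newton-type step. The projection uses the eigenvalue clipping described in Appendix A.4.

```python
import numpy as np

def project_mu(H, mu):
    """Project a symmetric matrix onto {M : M = M^T, M >= mu*I} by clipping eigenvalues."""
    H_sym = (H + H.T) / 2
    lam, U = np.linalg.eigh(H_sym)
    return U @ np.diag(np.maximum(lam, mu)) @ U.T

def server_update(x_k, grads, S_list, l_list, H, alpha=1.0, mu=None):
    """Aggregate device messages, update the Hessian estimate, and step (Option 1 or 2)."""
    grad = np.mean(grads, axis=0)            # grad f(x^k): average of local gradients
    S = np.mean(S_list, axis=0)              # S^k: average compressed correction
    l = np.mean(l_list)                      # l^k: average compression error
    H_new = H + alpha * S                    # H^{k+1} = H^k + alpha * S^k
    if mu is not None:                       # Option 1: project H^k onto {M >= mu*I}
        step_matrix = project_mu(H, mu)
    else:                                    # Option 2: shift H^k by the error l^k
        step_matrix = H + l * np.eye(H.shape[0])
    x_next = x_k - np.linalg.solve(step_matrix, grad)
    return x_next, H_new
```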
3.4 Local convergence theory
Note that FedNL includes two parameters, the compression operators $\mathcal{C}_i^k$ and the Hessian learning rate $\alpha > 0$, and two options to perform global updates by the master. To provide theoretical guarantees, we need one of the following two assumptions.
Assumption 3.4. $\mathcal{C}_i^k\in\mathbb{C}(\delta)$ for all $i\in[n]$ and $k$. Moreover, (i) $\alpha = 1-\sqrt{1-\delta}$, or (ii) $\alpha = 1$.

Assumption 3.5. $\mathcal{C}_i^k\in\mathbb{B}(\omega)$ for all $i\in[n]$ and $k$, and $0 < \alpha \le \frac{1}{\omega+1}$. Moreover, for all $i\in[n]$ and $j,l\in[d]$, each entry $(\mathbf{H}_i^k)_{jl}$ is a convex combination of $\{(\nabla^2 f_i(x^t))_{jl}\}_{t=0}^k$ for any $k\ge 0$.
To present our results in a unified manner, we define some constants depending on which parameters and which option are used in FedNL. Below, the constants $A$ and $B$ depend on the choice of the compressors $\mathcal{C}_i^k$ and the learning rate $\alpha$, while $C$ and $D$ depend on which option is chosen for the global update:
$$(A,B) := \begin{cases}(\alpha,\ 2) & \text{if Assumption 3.4(i) holds,}\\ \left(\delta/4,\ 6/\delta-7/2\right) & \text{if Assumption 3.4(ii) holds,}\\ (\alpha,\ \alpha) & \text{if Assumption 3.5 holds,}\end{cases} \qquad (C,D) := \begin{cases}\left(2,\ L^2\right) & \text{if Option 1 is used,}\\ \left(8,\ (L+2L_F)^2\right) & \text{if Option 2 is used.}\end{cases} \qquad (5)$$
We prove three local rates for FedNL: for the squared distance to the solution $\|x^k - x^*\|^2$, and for the Lyapunov function
$$\Phi^k := \mathcal{H}^k + 6BL_F^2\|x^k - x^*\|^2, \qquad\text{where}\qquad \mathcal{H}^k := \frac{1}{n}\sum_{i=1}^n\left\|\mathbf{H}_i^k - \nabla^2 f_i(x^*)\right\|_F^2.$$
Theorem 3.6. Let Assumption 3.1 hold. Assume $\|x^0 - x^*\|^2 \le \frac{\mu^2}{2D}$ and $\mathcal{H}^k \le \frac{\mu^2}{4C}$ for all $k\ge 0$. Then, FedNL (Algorithm 1) converges linearly with the rate
$$\|x^k - x^*\|^2 \le \frac{1}{2^k}\|x^0 - x^*\|^2. \qquad (6)$$
Moreover, depending on the choice (5) of the compressors $\mathcal{C}_i^k$, learning rate $\alpha$, and which option is used for global model updates, we have the following linear and superlinear rates:
$$\mathbb{E}\left[\Phi^k\right] \le \left(1 - \min\left\{A,\tfrac{1}{3}\right\}\right)^k \Phi^0, \qquad (7)$$
$$\frac{\mathbb{E}\left[\|x^{k+1} - x^*\|^2\right]}{\|x^k - x^*\|^2} \le \left(1 - \min\left\{A,\tfrac{1}{3}\right\}\right)^k \frac{C + \frac{D}{12BL_F^2}}{\mu^2}\,\Phi^0. \qquad (8)$$
Let us comment on these rates.
• First, the local linear rate (6) with respect to the iterates is based on a universal constant, i.e., it does not depend on the condition number of the problem, the size of the training data, or the dimension of the problem. Indeed, the squared distance to the optimum is halved in each iteration.

• Second, we have the linear rate (7) for the Lyapunov function $\Phi^k$, which implies the linear convergence of all local Hessian estimates $\mathbf{H}_i^k$ to the local optimal Hessians $\nabla^2 f_i(x^*)$. Thus, our initial goal to progressively learn the local optimal Hessians $\nabla^2 f_i(x^*)$ in a communication efficient manner is achieved, justifying the effectiveness of the new Hessian learning technique.

• Finally, our Hessian learning process accelerates the convergence of the iterates to a superlinear rate (8). Both rates (7) and (8) are independent of the condition number of the problem and of the number of data points. However, they do depend on the compression variance (since $A$ depends on $\delta$ or $\omega$), which, in case of $\mathcal{O}(d)$ communication constraints, depends on the dimension $d$ only.
For clarity of exposition, in Theorem 3.6 we assumed $\mathcal{H}^k \le \frac{\mu^2}{4C}$ for all iterations $k\ge 0$. Below, we prove that this inequality holds, using the initial conditions only.

Lemma 3.7. Let Assumption 3.4 hold, and assume $\|x^0 - x^*\|^2 \le e_1 := \min\left\{\frac{\mu^2}{4BCL_F^2},\ \frac{\mu^2}{2D}\right\}$ and $\|\mathbf{H}_i^0 - \nabla^2 f_i(x^*)\|_F^2 \le \frac{\mu^2}{4C}$. Then $\|x^k - x^*\|^2 \le e_1$ and $\|\mathbf{H}_i^k - \nabla^2 f_i(x^*)\|_F^2 \le \frac{\mu^2}{4C}$ for all $k\ge 0$.

Lemma 3.8. Let Assumption 3.5 hold, and assume $\|x^0 - x^*\|^2 \le e_2 := \frac{\mu^2}{D + 4Cd^2L_\infty^2}$. Then $\|x^k - x^*\|^2 \le e_2$ and $\mathcal{H}^k \le \frac{\mu^2}{4C}$ for all $k\ge 0$.
3.5 FedNL and the “Newton Triangle”
One implication of Theorem 3.6 is that the local rate $\frac{1}{2^k}$ (see (6)) holds even when we specialize FedNL to $\mathcal{C}_i^k \equiv 0$, $\alpha = 0$ and $\mathbf{H}_i^0 = \nabla^2 f_i(x^0)$ for all $i\in[n]$. These parameter choices give rise to the following simple method, which we call Newton Zero (N0):
$$x^{k+1} = x^k - \left(\nabla^2 f(x^0)\right)^{-1}\nabla f(x^k), \qquad k\ge 0. \qquad (9)$$
Interestingly, N0 only needs initial second-order information, i.e., the Hessian at the zeroth iterate, and the same first-order information as Gradient Descent (GD), i.e., $\nabla f(x^k)$, in each iteration. Moreover, unlike GD, whose rate depends on a condition number, the local rate $\frac{1}{2^k}$ of N0 does not.
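A minimal sketch of N0 (our own illustration, not the authors' code), assuming the Hessian at $x^0$ is positive definite so that a Cholesky factorization exists: the Hessian is factored once at $x^0$, and afterwards each step costs only one gradient evaluation plus a triangular solve.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def newton_zero(grad_fn, hess_fn, x0, num_iters=20):
    """N0 (9): x^{k+1} = x^k - (Hess f(x^0))^{-1} grad f(x^k)."""
    factor = cho_factor(hess_fn(x0))        # factor the Hessian at x^0 once
    x = x0.copy()
    for _ in range(num_iters):
        x = x - cho_solve(factor, grad_fn(x))   # reuse the factorization every step
    return x
```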
Besides, FedNL includes NS (when $\mathcal{C}_i^k \equiv 0$, $\alpha = 0$, $\mathbf{H}_i^0 = \nabla^2 f_i(x^*)$) and the classical Newton method (N) (when $\mathcal{C}_i^k$ is the identity operator, $\alpha = 1$, $\mathbf{H}_i^0 = \mathbf{0}$) as special cases.

It can be helpful to visualize the three special Newton-type methods, N, NS and N0, as the vertices of a triangle capturing a subset of two of these three requirements: 1) $\mathcal{O}(d)$ communication cost per round, 2) implementability in practice, and 3) local quadratic rate. Indeed, each of these three methods satisfies two of these requirements only: N (2+3), NS (1+3) and N0 (1+2). Finally, FedNL interpolates between these requirements. See Figure 1.
4 FedNL with Partial Participation, Globalization and Bidirectional Compression
Due to space limitations, here we briefly describe four extensions to FedNL and the key technical
contributions. Detailed sections for each extension are deferred to the Appendix.
Figure 1: Visualization of the three special Newton-type methods, Newton (N), Newton Star (NS) and Newton Zero (N0), as the vertices of a triangle capturing a subset of two of these three requirements: 1) $\mathcal{O}(d)$ communication cost per round, 2) implementability in practice, and 3) local quadratic rate. Indeed, each of these three methods satisfies two of these requirements only: N (2+3), NS (1+3) and N0 (1+2). Finally, the proposed FedNL framework with its four extensions to Partial Participation (FedNL-PP), globalization via Line Search (FedNL-LS), globalization via Cubic Regularization (FedNL-CR) and Bidirectional Compression (FedNL-BC) interpolates between these requirements.
4.1 Partial Participation
In FedNL-PP (Algorithm 2), the server selects a subset $S^k\subseteq[n]$ of $\tau$ devices, uniformly at random, to participate in each iteration. As devices might be inactive for several iterations, the same local gradient and local Hessian used in FedNL do not provide convergence in this case. To guarantee convergence, devices need to compute Hessian corrected local gradients
$$g_i^k = \left(\mathbf{H}_i^k + l_i^k\mathbf{I}\right)w_i^k - \nabla f_i(w_i^k),$$
where $w_i^k$ is the last global model that device $i$ received from the server. This is an innovation which also requires a different analysis.
4.2 Globalization via Line Search
Our first globalization strategy, FedNL-LS (Algorithm 3), which performs significantly better in practice than FedNL-CR (described next), is based on a backtracking line search procedure. The idea is to fix the search direction
$$d^k = -\left[\mathbf{H}^k\right]_\mu^{-1}\nabla f(x^k)$$
at the server and find the smallest integer $s\ge 0$ which leads to a sufficient decrease in the loss,
$$f(x^k + \gamma^s d^k) \le f(x^k) + c\gamma^s\left\langle\nabla f(x^k),\,d^k\right\rangle,$$
with some parameters $c\in(0,\tfrac{1}{2}]$ and $\gamma\in(0,1)$.
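A generic backtracking sketch of this sufficient-decrease rule (our own illustrative code; the precise FedNL-LS procedure is given in the Appendix):

```python
def backtracking_line_search(f, x, grad, d, c=0.5, gamma=0.5, max_steps=50):
    """Return the stepsize gamma^s for the smallest integer s >= 0 satisfying
    f(x + gamma^s d) <= f(x) + c * gamma^s * <grad, d>."""
    fx = f(x)
    slope = float(grad @ d)      # <grad f(x), d>, negative for a descent direction
    step = 1.0                   # corresponds to s = 0
    for _ in range(max_steps):
        if f(x + step * d) <= fx + c * step * slope:
            return step
        step *= gamma            # increase s by one
    return step
```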
4.3 Globalization via Cubic Regularization
Our next globalization strategy, FedNL-CR (Algorithm 4), is to use a cubic regularization term $\frac{L}{6}\|h\|^3$, where $L$ is the Lipschitz constant for the Hessians and $h$ is the direction to the next iterate. However, to get a global upper bound, we had to correct the global Hessian estimate $\mathbf{H}^k$ via the compression error $l^k$. Indeed, since $\nabla^2 f(x^k) \preceq \mathbf{H}^k + l^k\mathbf{I}$, we deduce
$$f(x^{k+1}) \le f(x^k) + \left\langle\nabla f(x^k),\,h^k\right\rangle + \frac{1}{2}\left\langle\left(\mathbf{H}^k + l^k\mathbf{I}\right)h^k,\,h^k\right\rangle + \frac{L}{6}\|h^k\|^3$$
for all $k\ge 0$. This leads to theoretical challenges and necessitates a new analysis.
4.4 Bidirectional Compression
Finally, we modify FedNL to allow for an even more severe level of compression that cannot be attained by compressing the Hessians only. This is achieved by compressing the gradients (uplink) and the model (downlink) in a “smart” way. In FedNL-BC (Algorithm 5), the server operates its own compressors $\mathcal{C}_M^k$ applied to the model, and uses an additional “smart” model learning technique similar to the proposed Hessian learning technique. On top of this, all devices compress their local gradients via a Bernoulli compression scheme, which necessitates the use of another “smart” strategy using Hessian corrected local gradients
$$g_i^k = \mathbf{H}_i^k\left(z^k - w^k\right) + \nabla f_i(w^k),$$
where $z^k$ is the current learned global model and $w^k$ is the last learned global model at which local gradients were sent to the server. These changes are substantial and require novel analysis.
5 Experiments
We carry out numerical experiments to study the performance of FedNL, and compare it with various
state-of-the-art methods in federated learning. We consider the problem
$$\min_{x\in\mathbb{R}^d}\left\{f(x) := \frac{1}{n}\sum_{i=1}^n f_i(x) + \frac{\lambda}{2}\|x\|^2\right\}, \qquad f_i(x) = \frac{1}{m}\sum_{j=1}^m \log\left(1 + \exp\left(-b_{ij}a_{ij}^\top x\right)\right), \qquad (10)$$
where $\{a_{ij}, b_{ij}\}_{j\in[m]}$ are the data points at the $i$-th device. The datasets were taken from the LibSVM library [Chang and Lin, 2011]: a1a, a9a, w7a, w8a, and phishing.
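For concreteness, here is a small NumPy sketch (our own illustrative code, not the authors') of the local gradient and Hessian of the regularized logistic loss that each device computes in this experiment. We fold the regularizer $\frac{\lambda}{2}\|x\|^2$ into each local loss, which leaves the average objective (10) unchanged.

```python
import numpy as np

def local_loss_grad_hess(x, A, b, lam):
    """Gradient and Hessian of the local regularized logistic loss
    (1/m) sum_j log(1 + exp(-b_ij a_ij^T x)) + (lam/2)||x||^2.

    A: (m, d) matrix whose rows are the data points a_ij; b: (m,) labels in {-1, +1}.
    """
    m, d = A.shape
    margins = b * (A @ x)                      # b_ij * a_ij^T x
    sigma = 1.0 / (1.0 + np.exp(margins))      # derivative weight per sample
    grad = -(A.T @ (b * sigma)) / m + lam * x
    W = sigma * (1.0 - sigma)                  # per-sample curvature weights
    hess = (A.T * W) @ A / m + lam * np.eye(d)
    return grad, hess
```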
5.1 Parameter setting
In all experiments we use the theoretical parameters for gradient type methods (except those using line search): vanilla gradient descent GD, DIANA [Mishchenko et al., 2019], and ADIANA [Li et al., 2020b]. For DINGO [Crane and Roosta, 2019] we use the authors' choice: $\theta = 10^{-4}$, $\phi = 10^{-6}$, $\rho = 10^{-4}$. Backtracking line search for DINGO selects the largest stepsize from $\{1, 2^{-1},\dots,2^{-10}\}$. The initialization of $\mathbf{H}_i^0$ for NL1 [Islamov et al., 2021], FedNL and FedNL-LS is $\nabla^2 f_i(x^0)$, and for FedNL-CR it is $\mathbf{0}$. For FedNL, FedNL-LS, and FedNL-CR we use the Rank-1 compression operator and stepsize $\alpha = 1$.

We use two values of the regularization parameter: $\lambda\in\{10^{-3}, 10^{-4}\}$. In the figures we plot the optimality gap $f(x^k) - f(x^*)$ against the number of communicated bits per node, or the number of communication rounds. The optimal value $f(x^*)$ is chosen as the function value at the 20-th iterate of standard Newton's method.
5.2 Local convergence
In our first experiment we compare FedNL and N0 with gradient type methods: ADIANA with random dithering (ADIANA, RD, $s = d$), DIANA with random dithering (DIANA, RD, $s = d$), vanilla gradient descent (GD), and DINGO. According to the results summarized in Figure 2 (first row), we conclude that FedNL and N0 outperform all gradient type methods and DINGO, locally, by many orders of magnitude. We want to note that we include the communication cost of the initialization for FedNL and N0 in order to make a fair comparison (this is why there is a straight line for these methods initially).
Figure 2: First row (panels: (a) a1a, $\lambda=10^{-3}$; (b) a9a, $\lambda=10^{-4}$; (c) w8a, $\lambda=10^{-3}$; (d) phishing, $\lambda=10^{-4}$): local comparison of FedNL and N0 with (a), (b) ADIANA, DIANA, GD; with (c), (d) DINGO in terms of communication complexity. Second row (panels: (a) w7a, $\lambda=10^{-3}$; (b) a1a, $\lambda=10^{-4}$; (c) phishing, $\lambda=10^{-3}$; (d) a9a, $\lambda=10^{-4}$): global comparison of FedNL-LS, N0-LS and FedNL-CR with (a), (b) ADIANA, DIANA, GD, and GD with line search; with (c), (d) DINGO in terms of communication complexity. Third row (panels: (a) w8a, $\lambda=10^{-3}$; (b) phishing, $\lambda=10^{-3}$; (c) a1a, $\lambda=10^{-4}$; (d) w7a, $\lambda=10^{-4}$): local comparison of FedNL with 3 types of compression operators and NL1 in terms of communication complexity.
5.3 Global convergence
We now compare FedNL-LS, N0-LS, and FedNL-CR with the first-order methods ADIANA and DIANA with random dithering, gradient descent (GD), and GD with line search (GD-LS). Besides, we compare FedNL-LS and FedNL-CR with DINGO. In this experiment we choose $x^0$ far from the solution $x^*$, i.e.,
we test the global convergence behavior; see Figure 2 (second row). We observe that FedNL-LS and
N0-LS are more communication efficient than all first-order methods and DINGO. However, FedNL-CR
is better than GD and GD-LS only. In these experiments we again include the communication cost of
initialization for FedNL-LS and N0-LS.
5.4 Comparison with NL1
Next, we compare FedNL with three types of compression operators: Rank-$R$ ($R=1$), Top-$K$ ($K=d$), and PowerSGD [Vogels et al., 2019] ($R=1$), against NL1 with the Rand-$K$ ($K=1$) compressor. The results, presented in Figure 2 (third row), show that FedNL with the Rank-1 compressor performs the best.
References
Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-
efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing
Systems, pages 1709–1720, 2017.
Amir Beck. Introduction to Nonlinear Optimization: Theory, Algorithms, and Applications with
MATLAB. Society for Industrial and Applied Mathematics, USA, 2014. ISBN 1611973643.
Aleksandr Beznosikov, Samuel Horváth, Peter Richtárik, and Mher Safaryan. On biased compression
for distributed learning. arXiv preprint arXiv:2002.12410, 2020.
Chih-Chung Chang and Chih-Jen Lin. LibSVM: a library for support vector machines. ACM
Transactions on Intelligent Systems and Technology (TIST), 2(3):1–27, 2011.
Wenlin Chen, Samuel Horváth, and Peter Richtárik. Optimal client sampling for federated learning.
arXiv preprint arXiv:2010.13723, 2020.
Rixon Crane and Fred Roosta. DINGO: Distributed Newton-type method for gradient-norm optimization. In Advances in Neural Information Processing Systems, volume 32, pages 9498–9508, 2019.
Eduard Gorbunov, Filip Hanzely, and Peter Richtárik. A unified theory of SGD: Variance reduction,
sampling, quantization and coordinate descent. In The 23rd International Conference on Artificial
Intelligence and Statistics, 2020a.
Eduard Gorbunov, Dmitry Kovalev, Dmitry Makarenko, and Peter Richtárik. Linearly converging
error compensated SGD. In 34th Conference on Neural Information Processing Systems (NeurIPS
2020), 2020b.
Eduard Gorbunov, Konstantin Burlachenko, Zhize Li, and Peter Richtárik. MARINA: Faster
non-convex distributed learning with compression. arXiv preprint arXiv:2102.07845, 2021a.
Eduard Gorbunov, Filip Hanzely, and Peter Richtárik. Local SGD: Unified theory and new efficient
methods. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2021b.
Robert Mansel Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, and Peter
Richtárik. SGD: General analysis and improved rates. In Kamalika Chaudhuri and Ruslan
Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning,
volume 97 of Proceedings of Machine Learning Research, pages 5200–5209, Long Beach, California,
USA, 09–15 Jun 2019. PMLR.
Samuel Horváth, Dmitry Kovalev, Konstantin Mishchenko, Sebastian Stich, and Peter Richtárik.
Stochastic distributed learning with gradient quantization and variance reduction. arXiv preprint
arXiv:1904.05115, 2019.
Rustem Islamov, Xun Qian, and Peter Richtárik. Distributed second order methods with fast rates
and compressed communication. arXiv preprint arXiv:2102.07158, 2021.
Peter Kairouz et al. Advances and open problems in federated learning. arXiv preprint
arXiv:1912.04977, 2019.
Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Stich, and Martin Jaggi. Error feedback
fixes SignSGD and other gradient compression schemes. In Proceedings of the 36th International
Conference on Machine Learning, volume 97, pages 3252–3261, 2019.
Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, and
Ananda Theertha Suresh. SCAFFOLD: Stochastic controlled averaging for on-device federated
learning. In International Conference on Machine Learning (ICML), 2020.
Ahmed Khaled, Konstantin Mishchenko, and Peter Richtárik. Tighter theory for local SGD on
identical and heterogeneous data. In The 23rd International Conference on Artificial Intelligence
and Statistics (AISTATS 2020), 2020.
Sarit Khirirat, Hamid Reza Feyzmahdavian, and Mikael Johansson. Distributed learning with
compressed gradients. arXiv preprint arXiv:1806.06573, 2018.
Jakub Konečný, H. Brendan McMahan, Daniel Ramage, and Peter Richtárik. Federated optimization:
Distributed machine learning for on-device intelligence. arXiv preprint arXiv:1610.02527, 2016a.
Jakub Konečný, H. Brendan McMahan, Felix Yu, Peter Richtárik, Ananda Theertha Suresh, and
Dave Bacon. Federated learning: strategies for improving communication efficiency. In NIPS
Private Multi-Party Machine Learning Workshop, 2016b.
Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith.
Federated optimization in heterogeneous networks. arXiv preprint arXiv:1812.06127, 2018.
Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: challenges,
methods, and future directions. IEEE Signal Processing Magazine, 37(3):50–60, 2020a. doi:
10.1109/MSP.2020.2975749.
Zhize Li, Dmitry Kovalev, Xun Qian, and Peter Richtárik. Acceleration for compressed gradient
descent in distributed and federated optimization. In International Conference on Machine
Learning, 2020b.
Xiaorui Liu, Yao Li, Jiliang Tang, and Ming Yan. A double residual compression algorithm for
efficient distributed learning. In International Conference on Artificial Intelligence and Statistics
(AISTATS), 2020.
Yura Malitsky and Konstantin Mishchenko. Adaptive gradient descent without descent. In International Conference on Machine Learning (ICML), 2019.
H Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas.
Communication-efficient learning of deep networks from decentralized data. In Proceedings of the
20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
Konstantin Mishchenko, Eduard Gorbunov, Martin Takáč, and Peter Richtárik. Distributed learning
with compressed gradient differences. arXiv preprint arXiv:1901.09269, 2019.
Konstantin Mishchenko, Ahmed Khaled, and Peter Richtárik. Proximal and federated random
reshuffling. arXiv preprint arXiv:2102.06704, 2021a.
Konstantin Mishchenko, Bokun Wang, Dmitry Kovalev, and Peter Richtárik. IntSGD: Floatless
compression of stochastic gradients. arXiv preprint arXiv:2102.08374, 2021b.
Constantin Philippenko and Aymeric Dieuleveut. Bidirectional compression in heterogeneous settings
for distributed or federated learning with partial participation: tight convergence guarantees.
arXiv preprint arXiv:2006.14591, 2021.
Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný,
Sanjiv Kumar, and H. Brendan McMahan. Adaptive federated optimization. arXiv preprint
arXiv:2003.00295, 2020.
Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.
S. U. Stich, J.-B. Cordonnier, and M. Jaggi. Sparsified SGD with memory. In Advances in Neural
Information Processing Systems (NeurIPS), 2018.
Sebastian U. Stich. Local SGD converges fast and communicates little. In International Conference
on Learning Representations (ICLR), 2020.
Thijs Vogels, Sai Praneeth Karimireddy, and Martin Jaggi. PowerSGD: Practical low-rank gradient
compression for distributed optimization. In Advances in Neural Information Processing Systems
32 (NeurIPS), 2019.
Cong Xie, Oluwasanmi Koyejo, Indranil Gupta, and Haibin Lin. Local AdaAlter: Communication-
efficient stochastic gradient descent with adaptive learning rates. arXiv preprint arXiv:1911.09030,
2019.
Appendix
A Extra Experiments
We carry out numerical experiments to study the performance of FedNL, and compare it with various
state-of-the-art methods in federated learning. We consider the following problem:
$$\min_{x\in\mathbb{R}^d}\left\{\frac{1}{n}\sum_{i=1}^n f_i(x) + \frac{\lambda}{2}\|x\|^2\right\}, \qquad f_i(x) = \frac{1}{m}\sum_{j=1}^m \log\left(1 + \exp\left(-b_{ij}a_{ij}^\top x\right)\right), \qquad (11)$$
where $\{a_{ij}, b_{ij}\}_{j\in[m]}$ are the data points at the $i$-th device.
A.1 Data sets
The datasets were taken from the LibSVM library [Chang and Lin, 2011]: a1a, a9a, w7a, w8a, and phishing. We partitioned each dataset across several nodes to capture a variety of scenarios. See Table 3 for a more detailed description of the dataset settings.
Table 3: Datasets used in the experiments, with the number of worker nodes n used in each case.

| Dataset | # workers n | # data points (= nm) | # features d |
|---|---|---|---|
| a1a | 16 | 1600 | 123 |
| a9a | 80 | 32560 | 123 |
| w7a | 50 | 24600 | 300 |
| w8a | 142 | 49700 | 300 |
| phishing | 100 | 11000 | 68 |
A.2 Parameters setting
In all experiments we use the theoretical parameters for gradient type methods (except those with a line search procedure): vanilla gradient descent, DIANA [Mishchenko et al., 2019], and ADIANA [Li et al., 2020b]. The constants for DINGO [Crane and Roosta, 2019] are set as the authors did: $\theta = 10^{-4}$, $\phi = 10^{-6}$, $\rho = 10^{-4}$. Backtracking line search for DINGO selects the largest stepsize from $\{1, 2^{-1},\dots,2^{-10}\}$. The initialization of $\mathbf{H}_i^0$ for NL1 [Islamov et al., 2021], FedNL, FedNL-LS, and FedNL-PP is $\nabla^2 f_i(x^0)$, and for FedNL-CR it is $\mathbf{0}$.

We conduct experiments for two values of the regularization parameter: $\lambda\in\{10^{-3}, 10^{-4}\}$. In the figures we plot the optimality gap $f(x^k) - f(x^*)$ against the number of communicated bits per node or the number of communication rounds. The optimal value $f(x^*)$ is chosen as the function value at the 20-th iterate of standard Newton's method.
A.3 Compression operators
Here we describe four compression operators that are used in our experiments.
A.3.1 Random dithering for vectors
For the first order methods ADIANA and DIANA we use the random dithering operator [Alistarh et al., 2017, Horváth et al., 2019]. This compressor with $s$ levels is defined via the following formula:
$$\mathcal{C}(x) := \mathrm{sign}(x)\cdot\|x\|_q\cdot\frac{\xi_s}{s}, \qquad (12)$$
where $\|x\|_q := \left(\sum_i|x_i|^q\right)^{1/q}$ and $\xi_s\in\mathbb{R}^d$ is a random vector with $i$-th element defined as follows:
$$(\xi_s)_i = \begin{cases} l+1 & \text{with probability } \frac{|x_i|}{\|x\|_q}s - l,\\ l & \text{otherwise.}\end{cases} \qquad (13)$$
Here $s\in\mathbb{N}_+$ denotes the number of levels of rounding, and $l$ satisfies $\frac{|x_i|}{\|x\|_q}\in\left[\frac{l}{s},\frac{l+1}{s}\right]$. According to [Horváth et al., 2019], this compressor has variance parameter $\omega \le 2 + \frac{d^{1/2}+d^{1/q}}{s}$. However, for the standard Euclidean norm ($q=2$) one can improve the bound to $\omega \le \min\left\{\frac{d}{s^2},\frac{\sqrt{d}}{s}\right\}$ [Alistarh et al., 2017].
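A minimal NumPy sketch of this compressor (our own illustrative code, not the authors' implementation):

```python
import numpy as np

def random_dithering(x, s, q=2):
    """Random dithering with s levels and the l_q norm, as in (12)-(13)."""
    norm = np.linalg.norm(x, ord=q)
    if norm == 0.0:
        return np.zeros_like(x)
    ratio = np.abs(x) / norm                      # |x_i| / ||x||_q, in [0, 1]
    l = np.floor(ratio * s)                       # lower level index
    prob = ratio * s - l                          # probability of rounding up
    xi = l + (np.random.rand(*x.shape) < prob)    # (xi_s)_i = l or l + 1
    return np.sign(x) * norm * xi / s
```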
A.3.2 Rank-R compression operator for matrices

Our theory supports contractive compression operators; see Definition 3.3. In the experiments for FedNL we use the Rank-$R$ compression operator. Let $\mathbf{X}\in\mathbb{R}^{d\times d}$ and let $\mathbf{U}\Sigma\mathbf{V}^\top$ be the singular value decomposition of $\mathbf{X}$:
$$\mathbf{X} = \sum_{i=1}^d \sigma_i u_i v_i^\top, \qquad (14)$$
where the singular values $\sigma_i$ are sorted in non-increasing order: $\sigma_1\ge\sigma_2\ge\dots\ge\sigma_d$. Then the Rank-$R$ compressor, for $R\le d$, is defined by
$$\mathcal{C}(\mathbf{X}) := \sum_{i=1}^R \sigma_i u_i v_i^\top. \qquad (15)$$
Note that
$$\|\mathbf{X}\|_F^2 \overset{(14)}{=} \Big\|\sum_{i=1}^d \sigma_i u_i v_i^\top\Big\|_F^2 = \sum_{i=1}^d \sigma_i^2 \qquad\text{and}\qquad \|\mathcal{C}(\mathbf{X}) - \mathbf{X}\|_F^2 \overset{(14)+(15)}{=} \Big\|\sum_{i=R+1}^d \sigma_i u_i v_i^\top\Big\|_F^2 = \sum_{i=R+1}^d \sigma_i^2.$$
Since $\frac{1}{d-R}\sum_{i=R+1}^d \sigma_i^2 \le \frac{1}{d}\sum_{i=1}^d \sigma_i^2$, we have
$$\|\mathcal{C}(\mathbf{X}) - \mathbf{X}\|_F^2 \le \frac{d-R}{d}\|\mathbf{X}\|_F^2 = \left(1 - \frac{R}{d}\right)\|\mathbf{X}\|_F^2,$$
and hence the Rank-$R$ compression operator belongs to $\mathbb{C}(\delta)$ with $\delta = \frac{R}{d}$. In the case when $\mathbf{X}\in\mathbb{S}^d$, we have $u_i = v_i$ for all $i\in[d]$, and the Rank-$R$ compressor applied to $\mathbf{X}$ transforms to $\sum_{i=1}^R \sigma_i u_i u_i^\top$, i.e., the output of the Rank-$R$ compressor is automatically a symmetric matrix, too.
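A minimal NumPy sketch of (15) (our own illustration; in practice a truncated or randomized SVD would typically replace the full decomposition):

```python
import numpy as np

def rank_r_compressor(X, R):
    """Rank-R compression (15): keep the top-R terms of the SVD of X."""
    U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :R] @ np.diag(sigma[:R]) @ Vt[:R, :]
```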
A.3.3 Top-K compression operator for matrices

Another example of a contractive compression operator is the Top-$K$ compressor for matrices. For an arbitrary matrix $\mathbf{X}\in\mathbb{R}^{d\times d}$, sort its entries in non-increasing order by magnitude, i.e., $\mathbf{X}_{i_k,j_k}$ is the $k$-th largest element of $\mathbf{X}$ in magnitude. Let $\{\mathbf{E}_{ij}\}_{i,j=1}^d$ be the matrices for which
$$(\mathbf{E}_{ij})_{ps} := \begin{cases} 1, & \text{if } (p,s)=(i,j),\\ 0, & \text{otherwise.}\end{cases} \qquad (16)$$
Then the Top-$K$ compression operator can be defined via
$$\mathcal{C}(\mathbf{X}) := \sum_{k=1}^K \mathbf{X}_{i_k,j_k}\mathbf{E}_{i_k,j_k}. \qquad (17)$$
This compression operator belongs to $\mathbb{C}(\delta)$ with $\delta = \frac{K}{d^2}$. If we need the output of Top-$K$ applied to a symmetric matrix $\mathbf{X}$ to be a symmetric matrix too, then we apply the Top-$K$ compressor only to the lower triangular part of $\mathbf{X}$.
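A minimal NumPy sketch of the symmetry-preserving variant just described (our own illustrative code; here $K$ is assumed not to exceed the number of lower-triangular entries):

```python
import numpy as np

def topk_symmetric(X, K):
    """Top-K (17) applied to the lower triangular part of a symmetric X,
    mirrored back so that the compressed output is symmetric as well."""
    d = X.shape[0]
    rows, cols = np.tril_indices(d)
    vals = X[rows, cols]
    keep = np.argpartition(np.abs(vals), -K)[-K:]   # indices of the K largest entries
    out = np.zeros_like(X)
    out[rows[keep], cols[keep]] = vals[keep]
    out = out + out.T - np.diag(np.diag(out))       # mirror to the upper triangle
    return out
```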
A.3.4 Rand-K compression operator for matrices

Our theory also supports unbiased compression operators; see Definition 3.2. One example is Rand-$K$. For an arbitrary matrix $\mathbf{X}\in\mathbb{R}^{d\times d}$ we choose a set $\mathcal{S}_K$ of indices $(i,j)$ of cardinality $K$ uniformly at random. Then the Rand-$K$ compressor can be defined via
$$\mathcal{C}(\mathbf{X})_{ij} := \begin{cases} \frac{d^2}{K}\mathbf{X}_{ij}, & \text{if } (i,j)\in\mathcal{S}_K,\\ 0, & \text{otherwise.}\end{cases} \qquad (18)$$
This compression operator belongs to $\mathbb{B}(\omega)$ with $\omega = \frac{d^2}{K} - 1$. If we need the output of this compressor to be a symmetric matrix, then we apply it only to the lower triangular part of the input.
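A minimal NumPy sketch of (18) (our own illustrative code):

```python
import numpy as np

def rand_k_matrix(X, K, rng=None):
    """Rand-K (18): keep K uniformly chosen entries, scaled by d^2/K (unbiased)."""
    rng = np.random.default_rng() if rng is None else rng
    d2 = X.size
    idx = rng.choice(d2, size=K, replace=False)   # uniformly random index set S_K
    out = np.zeros(d2)
    out[idx] = (d2 / K) * X.flatten()[idx]        # rescale kept entries for unbiasedness
    return out.reshape(X.shape)
```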
A.4 Projection onto the cone of positive definite matrices
If one uses FedNL with Option 1, then we need to project onto the cone of symmetric positive definite matrices with constant $\mu$:
$$\left\{\mathbf{M}\in\mathbb{R}^{d\times d} : \mathbf{M}^\top = \mathbf{M},\ \mathbf{M}\succeq\mu\mathbf{I}\right\}.$$
The projection of a symmetric matrix $\mathbf{X}$ onto the cone of positive semidefinite matrices can be computed via
$$[\mathbf{X}]_0 := \sum_{i=1}^d \max\{\lambda_i, 0\}\,u_i u_i^\top, \qquad (19)$$
where $\sum_i \lambda_i u_i u_i^\top$ is an eigenvalue decomposition of $\mathbf{X}$. Using the projection onto the cone of positive semidefinite matrices, we can define the projection onto the cone of positive definite matrices with constant $\mu$ via
$$[\mathbf{X}]_\mu := [\mathbf{X} - \mu\mathbf{I}]_0 + \mu\mathbf{I}. \qquad (20)$$
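A minimal NumPy sketch of (19) and (20) (our own illustrative code):

```python
import numpy as np

def project_psd(X):
    """Projection (19) onto the PSD cone: clip negative eigenvalues of a symmetric X."""
    lam, U = np.linalg.eigh(X)
    return U @ np.diag(np.maximum(lam, 0.0)) @ U.T

def project_mu(X, mu):
    """Projection (20) onto {M : M = M^T, M >= mu*I}: shift, project, shift back."""
    d = X.shape[0]
    return project_psd(X - mu * np.eye(d)) + mu * np.eye(d)
```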
Figure 3: The performance of FedNL with different types of compression operators: Rank-$R$ (first row), Top-$K$ (second row), and PowerSGD of rank $R$ (third row), for several values of $R$ and $K$, in terms of communication complexity. (Panels, first and third rows: a1a with $\lambda=10^{-3}$ and $\lambda=10^{-4}$, a9a with $\lambda=10^{-3}$ and $\lambda=10^{-4}$; second row: phishing with $\lambda=10^{-3}$ and $\lambda=10^{-4}$, w7a with $\lambda=10^{-3}$, w8a with $\lambda=10^{-3}$.)
A.5 The effect of compression
First, we investigate how the level of compression influences the performance of FedNL; see Figure 3. Here we study the performance for three types of compression operators: Rank-$R$, Top-$K$, and PowerSGD of rank $R$. According to the numerical experiments, the smaller the compression parameter, the better the performance of FedNL. This statement is true for all three types of compressors.
A.6 Comparison of Options 1 and 2

In our next experiment we investigate which option (1 or 2) for FedNL with the Rank-$R$ compressor and stepsize $\alpha = 1$ demonstrates better results in terms of communication complexity. According to the results in Figure 4, we see that FedNL with projection (Option 1) is more communication efficient than with Option 2. However, Option 1 requires more computing resources.
A.7 Comparison of different compression operators
Next, we study which compression operator is better in terms of communication complexity. Based on the results in Figure 5, we can conclude that Rank-$R$ is the best compression operator; the Top-$K$ and PowerSGD compressors can beat each other in different cases.
Figure 4: The performance of FedNL with Options 1 and 2 in terms of communication complexity (panels: a1a with $\lambda=10^{-3}$, a9a with $\lambda=10^{-3}$, phishing with $\lambda=10^{-4}$, w7a with $\lambda=10^{-4}$).

Figure 5: Comparison of the performance of FedNL with different compression operators in terms of communication complexity (panels: w8a with $\lambda=10^{-3}$, a1a with $\lambda=10^{-4}$, phishing with $\lambda=10^{-3}$, a9a with $\lambda=10^{-4}$).
A.8 Comparison of different update rules for Hessians
In the next step we compare FedNL with three update rules for the Hessians in order to find the best one: the biased Top-$K$ compression operator with stepsize $\alpha = 1$ (Option 1); the biased Top-$K$ compression operator with stepsize $\alpha = 1 - \sqrt{1-\delta}$; and the unbiased Rand-$K$ compression operator with stepsize $\alpha = \frac{1}{\omega+1}$. The results of this experiment are presented in Figure 6. Based on them, we conclude that FedNL with the Top-$K$ compressor and stepsize $\alpha = 1$ demonstrates the best performance. FedNL with the Rand-$K$ compressor and stepsize $\alpha = \frac{1}{\omega+1}$ performs a little better than that with the Top-$K$ compressor and stepsize $\alpha = 1 - \sqrt{1-\delta}$. As a consequence, we will use the biased compression operator with stepsize