
FedNL: Making Newton-Type Methods Applicable to Federated Learning

Mher Safaryan*   Rustem Islamov†   Xun Qian‡   Peter Richtárik§

June 5, 2021

Abstract

Inspired by recent work of Islamov et al. (2021), we propose a family of Federated Newton Learn (FedNL) methods, which we believe is a marked step in the direction of making second-order methods applicable to FL. In contrast to the aforementioned work, FedNL employs a different Hessian learning technique which i) enhances privacy as it does not rely on the training data being revealed to the coordinating server, ii) makes it applicable beyond generalized linear models, and iii) provably works with general contractive compression operators for compressing the local Hessians, such as Top-K or Rank-R, which are vastly superior in practice. Notably, we do not need to rely on error feedback for our methods to work with contractive compressors.

Moreover, we develop FedNL-PP, FedNL-CR and FedNL-LS, which are variants of FedNL that support partial participation, and globalization via cubic regularization and line search, respectively, and FedNL-BC, a variant that can further benefit from bidirectional compression of gradients and models, i.e., smart uplink gradient and smart downlink model compression.

We prove local convergence rates that are independent of the condition number, the number of training data points, and the compression variance. Our communication efficient Hessian learning technique provably learns the Hessian at the optimum.

Finally, we perform a variety of numerical experiments that show that our FedNL methods

have state-of-the-art communication complexity when compared to key baselines.

* King Abdullah University of Science and Technology, Thuwal, Saudi Arabia.
† King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia, and Moscow Institute of Physics and Technology (MIPT), Dolgoprudny, Russia. This research was conducted while this author was an intern at KAUST and an undergraduate student at MIPT.
‡ King Abdullah University of Science and Technology, Thuwal, Saudi Arabia.
§ King Abdullah University of Science and Technology, Thuwal, Saudi Arabia.


Contents

1 Introduction
  1.1 First-order methods for FL
  1.2 Towards second-order methods for FL
  1.3 Desiderata for second-order methods applicable to FL
2 Contributions
  2.1 The Newton Learn framework of Islamov et al. [2021]
  2.2 Issues with the Newton Learn framework
  2.3 Our FedNL framework
3 The Vanilla Federated Newton Learn Method
  3.1 New Hessian learning technique
  3.2 Compressing matrices
  3.3 Two options for updating the global model
  3.4 Local convergence theory
  3.5 FedNL and the "Newton Triangle"
4 FedNL with Partial Participation, Globalization and Bidirectional Compression
  4.1 Partial Participation
  4.2 Globalization via Line Search
  4.3 Globalization via Cubic Regularization
  4.4 Bidirectional Compression
5 Experiments
  5.1 Parameter setting
  5.2 Local convergence
  5.3 Global convergence
  5.4 Comparison with NL1
A Extra Experiments
  A.1 Datasets
  A.2 Parameters setting
  A.3 Compression operators
    A.3.1 Random dithering for vectors
    A.3.2 Rank-R compression operator for matrices
    A.3.3 Top-K compression operator for matrices
    A.3.4 Rand-K compression operator for matrices
  A.4 Projection onto the cone of positive definite matrices
  A.5 The effect of compression
  A.6 Comparison of Options 1 and 2
  A.7 Comparison of different compression operators
  A.8 Comparison of different update rules for Hessians
  A.9 Bi-directional compression
  A.10 The performance of FedNL-PP
  A.11 Comparison with NL1
  A.12 Local comparison
  A.13 Global comparison
  A.14 Effect of statistical heterogeneity
B Proofs of Results from Section 3
  B.1 Auxiliary lemma
  B.2 Proof of Theorem 3.6
  B.3 Proof of Lemma 3.7
  B.4 Proof of Lemma 3.8
C Extension: Partial Participation (FedNL-PP)
  C.1 Hessian corrected local gradients $g_i^k$
  C.2 Importance of compression errors $l_i^k$
  C.3 Local convergence theory
  C.4 Proof of Theorem C.1
  C.5 Proof of Lemma C.2
  C.6 Proof of Lemma C.3
D Extension: Globalization via Line Search (FedNL-LS)
  D.1 Line search procedure
  D.2 Local convergence theory
  D.3 Proof of Theorem D.1
  D.4 Proof of Lemma D.2
E Extension: Globalization via Cubic Regularization (FedNL-CR)
  E.1 Cubic regularization
  E.2 Solving the subproblem
  E.3 Importance of compression errors $l_i^k$
  E.4 Global and local convergence theory
  E.5 Proof of Theorem E.1
  E.6 Proof of Lemma E.2
F Extension: Bidirectional Compression (FedNL-BC)
  F.1 Model learning technique
  F.2 Hessian corrected local gradients
  F.3 Local convergence theory
  F.4 Proof of Theorem F.4
  F.5 Proof of Lemma F.5
  F.6 Proof of Lemma F.6
G Local Quadratic Rate of NEWTON-STAR for General Finite-Sum Problems
H Table of Frequently Used Notation
I Limitations

1 Introduction

In this paper we consider the cross-silo federated learning problem
$$\min_{x\in\mathbb{R}^d} \left\{ f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x) \right\}, \qquad (1)$$
where $d$ denotes the dimension of the model $x\in\mathbb{R}^d$ we wish to train, $n$ is the total number of silos/machines/devices/clients in the distributed system, $f_i(x)$ is the loss/risk associated with the data stored on machine $i\in[n]:=\{1,2,\dots,n\}$, and $f(x)$ is the empirical loss/risk.

1.1 First-order methods for FL

The prevalent paradigm for training federated learning (FL) models [Konečný et al., 2016b,a, McMahan et al., 2017] (see also the recent surveys by Kairouz et al. [2019], Li et al. [2020a]) is to use distributed first-order optimization methods employing one or more tools for enhancing communication efficiency, which is a key bottleneck in the federated setting.

These tools include communication compression [Konečný et al., 2016b, Alistarh et al., 2017, Khirirat et al., 2018] and techniques for progressively reducing the variance introduced by compression [Mishchenko et al., 2019, Horváth et al., 2019, Gorbunov et al., 2020a, Li et al., 2020b, Gorbunov et al., 2021a], local computation [McMahan et al., 2017, Stich, 2020, Khaled et al., 2020, Mishchenko et al., 2021a] and techniques for reducing the client drift introduced by local computation [Karimireddy et al., 2020, Gorbunov et al., 2021b], and partial participation [McMahan et al., 2017, Gower et al., 2019] and techniques for taming the slow-down introduced by partial participation [Gorbunov et al., 2020a, Chen et al., 2020].

Other useful techniques for further reducing the communication complexity of FL methods include the use of momentum [Mishchenko et al., 2019, Li et al., 2020b], and adaptive learning rates [Malitsky and Mishchenko, 2019, Xie et al., 2019, Reddi et al., 2020, Mishchenko et al., 2021b]. In addition, aspiring FL methods need to protect the privacy of the clients' data, and need to be built with data heterogeneity in mind [Kairouz et al., 2019].

1.2 Towards second-order methods for FL

While first-order methods are the methods of choice in the context of FL at the moment, their communication complexity necessarily depends on (a suitable notion of) the condition number of the problem, which can be very large as it depends on the structure of the model being trained, on the choice of the loss function, and, most importantly, on the properties of the training data.

However, in many situations when algorithm design is not constrained by the stringent requirements characterizing FL, it is very well known that carefully designed second-order methods can be vastly superior. On an intuitive level, this is mainly because these methods make an extra computational effort to estimate the local curvature of the loss landscape, which is useful in generating more powerful and adaptive update directions. However, in FL it is often communication, and not computation, which forms the key bottleneck, and hence the idea of "going second order" looks attractive. The theoretical benefits of using curvature information are well known. For example, the classical Newton's method, which forms the basis for the most efficient second-order methods in much the same way the gradient descent method forms the basis for more elaborate first-order methods, enjoys a fast condition-number-independent (local) convergence rate [Beck, 2014], which is beyond the reach of all first-order methods. However, Newton's method does not admit an efficient distributed implementation in the heterogeneous data regime, as it requires repeated communication of the local Hessian matrices $\nabla^2 f_i \in \mathbb{R}^{d\times d}$ to the server, which constitutes a massive burden on the communication links.

1.3 Desiderata for second-order methods applicable to FL

In this paper we take the stance that it would be highly desirable to develop Newton-type methods for solving the cross-silo federated learning problem (1) that would

[hd] work well in the truly heterogeneous data setting (i.e., we do not want to assume that the functions $f_1, \dots, f_n$ are "similar"),

[fs] apply to the general finite-sum problem (1), without imposing strong structural assumptions on the local functions $f_1, \dots, f_n$ (e.g., we do not want to assume that the functions $f_1, \dots, f_n$ are quadratics, generalized linear models, and so on),

[as] benefit from Newton-like (matrix-valued) adaptive stepsizes,

[pe] employ at least a rudimentary privacy enhancement mechanism (in particular, we do not want the devices to be sending/revealing their training data to the server),

[uc] enjoy, through unbiased communication compression strategies applied to the Hessian, such as Rand-K, the same low $O(d)$ communication cost per communication round as gradient descent,

[cc] be able to benefit from the more aggressive contractive communication compression strategies applied to the Hessian, such as Top-K and Rank-R,

[fr] have fast local rates unattainable by first-order methods (e.g., rates independent of the condition number),

[pp] support partial participation (this is important when the number $n$ of devices is very large),

[gg] have global convergence guarantees, and superior global empirical behavior, when combined with a suitable globalization strategy (e.g., line search or cubic regularization),

[gc] optionally be able to use, for a more dramatic communication reduction, additional smart uplink (i.e., device to server) gradient compression,

[mc] optionally be able to use, for a more dramatic communication reduction, additional smart downlink (i.e., server to device) model compression,

[lc] perform provably useful local computation, even in the heterogeneous data setting (it is known that local computation via gradient-type steps, which form the backbone of methods such as FedAvg and LocalSGD, provably helps under some degree of data similarity only).

However, to the best of our knowledge, existing Newton-type methods are not applicable to FL as they are not compatible with most of the aforementioned desiderata.

It is therefore natural and pertinent to ask whether it is possible to design theoretically well grounded and empirically well performing Newton-type methods that would be able to conform to the FL-specific desiderata listed above.

In this work, we address this challenge in the affirmative.

2 Contributions

Before detailing our contributions, it will be very useful to briefly outline the key elements of the recently proposed Newton Learn (NL) framework of Islamov et al. [2021], which served as the main inspiration for our work, and which is also the closest work to ours.

2.1 The Newton Learn framework of Islamov et al. [2021]

The starting point of their work is the observation that the Newton-like method
$$x^{k+1} = x^k - \left(\nabla^2 f(x^*)\right)^{-1}\nabla f(x^k),$$
called Newton Star (NS), where $x^*$ is the (unique) solution of (1), converges to $x^*$ locally quadratically under suitable assumptions, which is a desirable property it inherits from the classical Newton method. Clearly, this method is not practical, as it relies on the knowledge of the Hessian at the optimum.

However, under the assumption that the matrix $\nabla^2 f(x^*)$ is known to the server, NS can be implemented with $O(d)$ cost in each communication round. Indeed, NS can simply be treated as gradient descent, albeit with a matrix-valued stepsize equal to $\left(\nabla^2 f(x^*)\right)^{-1}$.

The first key contribution of Islamov et al. [2021] is the design of a strategy, for which they coined the term Newton Learn, which learns the Hessians $\nabla^2 f_1(x^*), \dots, \nabla^2 f_n(x^*)$, and hence their average, $\nabla^2 f(x^*)$, progressively throughout the iterative process, and does so in a communication efficient manner, using unbiased compression [uc] of Hessian information. In particular, the compression level can be adjusted so that in each communication round, only $O(d)$ floats need to be communicated between each device and the server. In each iteration, the master uses the average of the currently learned local Hessian matrices in place of the Hessian at the optimum, and subsequently performs a step similar to that of NS. So, their method uses adaptive matrix-valued stepsizes [as].

Islamov et al. [2021] prove that their learning procedure indeed works, in the sense that the sequences of learned local matrices converge to the local optimal Hessians $\nabla^2 f_i(x^*)$. This property leads to a Newton-like acceleration, and as a result, their NL methods enjoy a local linear convergence rate (for a Lyapunov function that includes Hessian convergence) and a local superlinear convergence rate (for the distance to the optimum) that is independent of the condition number, which is a property beyond the reach of any first-order method [fr]. Moreover, all of this provably works in the heterogeneous data setting [hd].

Finally, they develop a practical and theoretically grounded globalization strategy [gg] based on cubic regularization, called Cubic Newton Learn (CNL).

2.2 Issues with the Newton Learn framework

While the above development is clearly very promising in the context of distributed optimization, the results suffer from several limitations which prevent the methods from being applicable to FL.

Table 1: Comparison of the main features of our family of FedNL algorithms and results with those of Islamov et al. [2021], which we used as an inspiration. We have made numerous and significant modifications and improvements in order to obtain methods applicable to federated learning.

    #    Feature                                                          Islamov et al. [2/'21]   This Work [5/'21]
    [hd] supports heterogeneous data setting                                       ✓                    ✓
    [fs] applies to general finite-sum problems                                    ✗                    ✓
    [as] uses adaptive stepsizes                                                   ✓                    ✓
    [pe] privacy is enhanced (training data is not sent to the server)             ✗                    ✓
    [uc] supports unbiased Hessian compression (e.g., Rand-K)                      ✓                    ✓
    [cc] supports contractive Hessian compression (e.g., Top-K)                    ✗                    ✓
    [fr] fast local rate: independent of the condition number                      ✓                    ✓
    [fr] fast local rate: independent of the # of training data points             ✗                    ✓
    [fr] fast local rate: independent of the compressor variance                   ✗                    ✓
    [pp] supports partial participation                                            ✗                    ✓ (Alg 2)
    [gg] has global convergence guarantees via line search                         ✗                    ✓ (Alg 3)
    [gg] has global convergence guarantees via cubic regularization                ✓                    ✓ (Alg 4)
    [gc] supports smart uplink gradient compression at the devices                 ✗                    ✓ (Alg 5)
    [mc] supports smart downlink model compression by the master                   ✗                    ✓ (Alg 5)
    [lc] performs useful local computation                                         ✓                    ✓

First, the Newton Learn strategy of Islamov et al. [2021] critically depends on the assumption that the local functions are of the form
$$f_i(x) = \frac{1}{m}\sum_{j=1}^m \varphi_{ij}(a_{ij}^\top x), \qquad (2)$$
where $\varphi_{ij} : \mathbb{R} \to \mathbb{R}$ are sufficiently well behaved functions, and $a_{i1}, \dots, a_{im} \in \mathbb{R}^d$ are the training data points owned by device $i$. As a result, their approach is limited to generalized linear models only, which violates [fs] from the aforementioned wish list. Second, their communication strategy critically relies on each device $i$ sending a small subset of their private training data $\{a_{i1}, \dots, a_{im}\}$ to the server in each communication round, which violates [pe]. Further, while their approach supports $O(d)$ communication, it does not support the more general contractive compressors [cc], such as Top-K and Rank-R, which have been found very useful in the context of first-order methods with gradient compression. Finally, the methods of Islamov et al. [2021] do not support bidirectional compression of gradients and models ([gc], [mc]), and do not support partial participation [pp].

2.3 Our FedNL framework

We propose a family of five Federated Newton Learn methods (Algorithms 1–5), which we believe constitutes a marked step in the direction of making second-order methods applicable to FL.

In contrast to the work of Islamov et al. [2021] (see Table 1), our vanilla method FedNL (Algorithm 1) employs a different Hessian learning technique, which makes it applicable beyond generalized linear models (2) to general finite-sum problems [fs], enhances privacy as it does not rely on the training data being revealed to the coordinating server [pe], and provably works with general contractive compression operators for compressing the local Hessians, such as Top-K or Rank-R, which are vastly superior in practice [cc]. Notably, we do not need to rely on error feedback [Seide et al., 2014, Stich et al., 2018, Karimireddy et al., 2019, Gorbunov et al., 2020b], which is essential

Table 2: Summary of algorithms proposed and convergence results proved in this paper.

    Method                          Result†                    Type                Rate independent of:           Theorem
                                                                                   cond. #  # data  compressor
    Newton Zero
    N0 (Equation (9))               r^k ≤ (1/2^k) r^0          local linear          ✓        ✓        ✓           3.6
    FedNL (Algorithm 1)             r^k ≤ (1/2^k) r^0          local linear          ✓        ✓        ✓           3.6
                                    Φ_1^k ≤ θ^k Φ_1^0          local linear          ✓        ✓        ✗           3.6
                                    r^{k+1} ≤ c θ^k r^k        local superlinear     ✓        ✓        ✗           3.6
    Partial Participation
    FedNL-PP (Algorithm 2)          W^k ≤ θ^k W^0              local linear          ✓        ✓        ✓           C.1
                                    Φ_2^k ≤ θ^k Φ_2^0          local linear          ✓        ✓        ✗           C.1
                                    r^{k+1} ≤ c θ^k W^k        local linear          ✓        ✓        ✗           C.1
    Line Search
    FedNL-LS (Algorithm 3)          Δ^k ≤ θ^k Δ^0              global linear         ✗        ✓        ✓           D.1
    Cubic Regularization
    FedNL-CR (Algorithm 4)          Δ^k ≤ c/k                  global sublinear      ✗        ✓        ✓           E.1
                                    Δ^k ≤ θ^k Δ^0              global linear         ✗        ✓        ✓           E.1
                                    Φ_1^k ≤ θ^k Φ_1^0          local linear          ✓        ✓        ✗           E.1
                                    r^{k+1} ≤ c θ^k r^k        local superlinear     ✓        ✓        ✗           E.1
    Bidirectional Compression
    FedNL-BC (Algorithm 5)          Φ_3^k ≤ θ^k Φ_3^0          local linear          ✓        ✓        ✗           F.4
    Newton Star
    NS (Equation (55))              r^{k+1} ≤ c (r^k)^2        local quadratic       ✓        ✓        ✓           G.1

Quantities for which we prove convergence: (i) distance to solution $r^k := \|x^k - x^*\|^2$; $W^k := \frac{1}{n}\sum_{i=1}^n \|w_i^k - x^*\|^2$; (ii) Lyapunov functions $\Phi_1^k := c\|x^k - x^*\|^2 + \frac{1}{n}\sum_{i=1}^n \|H_i^k - \nabla^2 f_i(x^*)\|_F^2$; $\Phi_2^k := cW^k + \frac{1}{n}\sum_{i=1}^n \|H_i^k - \nabla^2 f_i(x^*)\|_F^2$; $\Phi_3^k := \|z^k - x^*\|^2 + c\|w^k - x^*\|^2$; (iii) function value suboptimality $\Delta^k := f(x^k) - f(x^*)$.

† The constants $c > 0$ and $\theta \in (0,1)$ are possibly different each time they appear in this table. Refer to the precise statements of the theorems for the exact values.

to prevent divergence in first-order methods employing such compressors [Beznosikov et al., 2020], for our methods to work with contractive compressors. We prove that our communication efficient Hessian learning technique provably learns the Hessians at the optimum.

Like Islamov et al. [2021], we prove local convergence rates that are independent of the condition number [fr]. However, unlike their rates, some of our rates are also independent of the number of training data points, and of the compression variance [fr]. All our complexity results are summarized in Table 2.

Moreover, we show that our approach works in the partial participation [pp] regime by developing the FedNL-PP method (Algorithm 2), and devise methods employing globalization strategies: FedNL-LS (Algorithm 3), based on backtracking line search, and FedNL-CR (Algorithm 4), based on cubic regularization [gg]. We show through experiments that the former is much more powerful in practice than the latter. Hence, the proposed line search globalization is superior to the cubic regularization approach employed by Islamov et al. [2021].

Our approach can further benefit from smart uplink gradient compression [gc] and smart downlink model compression [mc]; see FedNL-BC (Algorithm 5).

Finally, we perform a variety of numerical experiments that show that our FedNL methods have state-of-the-art communication complexity when compared to key baselines.


3 The Vanilla Federated Newton Learn Method

We start the presentation of our algorithms with the vanilla FedNL method, commenting on the intuitions and technical novelties. The method is formally described in Algorithm 1. (For all our methods, we describe the steps constituting a single communication round only; to obtain an iterative method, one simply repeats the provided steps.)

3.1 New Hessian learning technique

The first key technical novelty in FedNL is the new mechanism for learning the Hessian $\nabla^2 f(x^*)$ at the (unique) solution $x^*$ in a communication efficient manner. This is achieved by maintaining and progressively updating local Hessian estimates $H_i^k$ of $\nabla^2 f_i(x^*)$ for all devices $i\in[n]$, and the global Hessian estimate
$$H^k = \frac{1}{n}\sum_{i=1}^n H_i^k$$
of $\nabla^2 f(x^*)$ for the central server. Thus, the goal is to induce $H_i^k \to \nabla^2 f_i(x^*)$ for all $i\in[n]$, and as a consequence, $H^k \to \nabla^2 f(x^*)$, throughout the training process.

Algorithm 1 FedNL (Federated Newton Learn)

1:  Parameters: Hessian learning rate $\alpha \ge 0$; compression operators $\{\mathcal{C}_1^k, \dots, \mathcal{C}_n^k\}$
2:  Initialization: $x^0 \in \mathbb{R}^d$; $H_1^0, \dots, H_n^0 \in \mathbb{R}^{d\times d}$ and $H^0 := \frac{1}{n}\sum_{i=1}^n H_i^0$
3:  for each device $i = 1, \dots, n$ in parallel do
4:      Get $x^k$ from the server and compute the local gradient $\nabla f_i(x^k)$ and local Hessian $\nabla^2 f_i(x^k)$
5:      Send $\nabla f_i(x^k)$, $S_i^k := \mathcal{C}_i^k(\nabla^2 f_i(x^k) - H_i^k)$ and $l_i^k := \|H_i^k - \nabla^2 f_i(x^k)\|_F$ to the server
6:      Update the local Hessian shift to $H_i^{k+1} = H_i^k + \alpha S_i^k$
7:  end for
8:  on server
9:      Get $\nabla f_i(x^k)$, $S_i^k$ and $l_i^k$ from each node $i \in [n]$
10:     $\nabla f(x^k) = \frac{1}{n}\sum_{i=1}^n \nabla f_i(x^k)$, $S^k = \frac{1}{n}\sum_{i=1}^n S_i^k$, $l^k = \frac{1}{n}\sum_{i=1}^n l_i^k$, $H^{k+1} = H^k + \alpha S^k$
11:     Option 1: $x^{k+1} = x^k - \left([H^k]_\mu\right)^{-1}\nabla f(x^k)$
12:     Option 2: $x^{k+1} = x^k - \left(H^k + l^k I\right)^{-1}\nabla f(x^k)$
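To fix ideas, the following is a minimal NumPy sketch of one communication round of Algorithm 1 under Option 2. The oracle interface (grads, hesses) and the function name are our own illustrative assumptions, not the paper's reference implementation; in a real federated deployment the loop body runs on the devices.

```python
import numpy as np

def fednl_round(x, H_list, grads, hesses, compress, alpha=1.0):
    """One communication round of FedNL (Algorithm 1) with Option 2; a sketch."""
    n, d = len(H_list), x.shape[0]
    H = sum(H_list) / n                       # server's current estimate H^k
    g = np.zeros(d)
    l = 0.0
    for i in range(n):                        # performed on each device i
        D_i = hesses[i](x) - H_list[i]        # deviation from the true local Hessian
        S_i = compress(D_i)                   # uplink: compressed matrix S_i^k
        l += np.linalg.norm(D_i, 'fro') / n   # uplink: one float l_i^k
        g += grads[i](x) / n                  # uplink: local gradient
        H_list[i] = H_list[i] + alpha * S_i   # synchronized estimate update
    # Server, Option 2: since H^k + l^k I dominates the true Hessian (see
    # Section 4.3), the corrected system is positive definite and safe to solve.
    x_new = x - np.linalg.solve(H + l * np.eye(d), g)
    return x_new, H_list
```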

A naive choice for the local estimates $H_i^k$ would be the exact local Hessians $\nabla^2 f_i(x^k)$, and consequently the global estimate $H^k$ would be the exact global Hessian $\nabla^2 f(x^k)$. While this naive approach learns the global Hessian at the optimum, it needs to communicate the entire matrices $\nabla^2 f_i(x^k)$ to the server in each iteration, which is extremely costly. Instead, in FedNL we aim to reuse past Hessian information and build the next estimate $H_i^{k+1}$ by updating the current estimate $H_i^k$. Since all devices have to be synchronized with the server, we also need to make sure the update from $H_i^k$ to $H_i^{k+1}$ is easy to communicate. With this intuition in mind, we propose to update the local Hessian estimates via the rule
$$H_i^{k+1} = H_i^k + \alpha S_i^k, \qquad \text{where} \qquad S_i^k = \mathcal{C}_i^k\left(\nabla^2 f_i(x^k) - H_i^k\right),$$
and $\alpha > 0$ is the learning rate. Notice that we reduce the communication cost by explicitly requiring all devices $i\in[n]$ to send only the compressed matrices $S_i^k$ to the server.

The Hessian learning technique employed in the Newton Learn framework of Islamov et al. [2021] is critically different to ours, as it heavily depends on the structure (2) of the local functions. Indeed, the local optimal Hessians
$$\nabla^2 f_i(x^*) = \frac{1}{m}\sum_{j=1}^m \varphi_{ij}''(a_{ij}^\top x^*)\, a_{ij} a_{ij}^\top$$
are learned via the proxy of learning the optimal scalars $\varphi_{ij}''(a_{ij}^\top x^*)$ for all local data points $j\in[m]$, which also requires the transmission of the active data points $a_{ij}$ to the server in each iteration. This makes their method inapplicable to general finite-sum problems [fs], and incapable of securing even the most rudimentary privacy enhancement mechanism [pe].

We do not make any structural assumptions on the problem (1), and rely on the following general conditions to prove the effectiveness of our Hessian learning technique:

Assumption 3.1. The average loss $f$ is $\mu$-strongly convex, and all local losses $f_i(x)$ have Lipschitz continuous Hessians. Let $L_*$, $L_F$ and $L_\infty$ be the Lipschitz constants with respect to three different matrix norms: spectral, Frobenius and infinity norms, respectively. Formally, we require
$$\|\nabla^2 f_i(x) - \nabla^2 f_i(y)\| \le L_* \|x - y\|,$$
$$\|\nabla^2 f_i(x) - \nabla^2 f_i(y)\|_F \le L_F \|x - y\|,$$
$$\max_{j,l} \left|\left(\nabla^2 f_i(x) - \nabla^2 f_i(y)\right)_{jl}\right| \le L_\infty \|x - y\|$$
to hold for all $i\in[n]$ and $x, y \in \mathbb{R}^d$.

3.2 Compressing matrices

In the literature on first-order compressed methods, compression operators are typically applied to vectors (e.g., gradients, gradient differences, models). As our approach is based on second-order information, we instead apply compression operators to $d\times d$ matrices of the form $\nabla^2 f_i(x^k) - H_i^k$. For this reason, we adapt two popular classes of compression operators used in first-order methods to act on $d\times d$ matrices by treating them as vectors of dimension $d^2$.

Definition 3.2 (Unbiased Compressors). By $\mathbb{B}(\omega)$ we denote the class of (possibly randomized) unbiased compression operators $\mathcal{C} : \mathbb{R}^{d\times d} \to \mathbb{R}^{d\times d}$ with variance parameter $\omega \ge 0$ satisfying
$$\mathbb{E}[\mathcal{C}(M)] = M, \qquad \mathbb{E}\|\mathcal{C}(M) - M\|_F^2 \le \omega\|M\|_F^2, \qquad \forall M \in \mathbb{R}^{d\times d}. \qquad (3)$$

Common choices of unbiased compressors are random sparsification and quantization (see Appendix).

Definition 3.3 (Contractive Compressors). By $\mathbb{C}(\delta)$ we denote the class of deterministic contractive compression operators $\mathcal{C} : \mathbb{R}^{d\times d} \to \mathbb{R}^{d\times d}$ with contraction parameter $\delta \in [0,1]$ satisfying
$$\|\mathcal{C}(M)\|_F \le \|M\|_F, \qquad \|\mathcal{C}(M) - M\|_F^2 \le (1 - \delta)\|M\|_F^2, \qquad \forall M \in \mathbb{R}^{d\times d}. \qquad (4)$$


The first condition of (4) can easily be removed by scaling the operator $\mathcal{C}$ appropriately. Indeed, if for some $M\in\mathbb{R}^{d\times d}$ we have $\|\mathcal{C}(M)\|_F > \|M\|_F$, then we can use the scaled compressor $\widetilde{\mathcal{C}}(M) := \frac{\|M\|_F}{\|\mathcal{C}(M)\|_F}\,\mathcal{C}(M)$ instead, as this satisfies (4) with the same parameter $\delta$. Common examples of contractive compressors are the Top-K and Rank-R operators (see Appendix).
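As an aside, this scaling trick is a one-line wrapper in code; here is a minimal sketch (the helper name is ours):

```python
import numpy as np

def make_contractive(compress):
    """Wrap a compressor so that ||C(M)||_F <= ||M||_F always holds."""
    def scaled(M):
        CM = compress(M)
        nM, nCM = np.linalg.norm(M, 'fro'), np.linalg.norm(CM, 'fro')
        return CM if nCM <= nM else (nM / nCM) * CM  # rescale only when needed
    return scaled
```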

From the theory of first-order methods employing compressed communication, it is known that handling contractive biased compressors is much more challenging than handling unbiased compressors. In particular, a popular mechanism for preventing first-order methods utilizing biased compressors from diverging is the error feedback framework. However, contractive compressors often perform much better empirically than their unbiased counterparts. To highlight the strength of our new Hessian learning technique, we develop our theory in a flexible way, and handle both families of compression operators. Surprisingly, we do not need to use error feedback for contractive compressors for our methods to work.

Compression operators are used in [Islamov et al., 2021] in a fundamentally different way. First, their theory supports unbiased compressors only, and does not cover the practically favorable contractive compressors [cc]. More importantly, compression is applied within the representation (2) as an operator acting on the space $\mathbb{R}^m$. In contrast to our strategy of using compression operators, this brings the necessity to reveal, in each iteration, the training data $\{a_{i1}, \dots, a_{im}\}$ whose corresponding coefficients in (2) are not zeroed out after the compression step [pe]. Moreover, when $O(d)$ communication cost per communication round is achieved, the variance of the compression noise depends on the number of data points $m$, which then negatively affects the local convergence rates. As the amount of training data can be huge, our convergence rates provide stronger guarantees by not depending on the size of the training dataset [fr].

3.3 Two options for updating the global model

Finally, we offer two options for updating the global model at the server.

• The first option assumes that the server knows the strong convexity parameter $\mu > 0$ (see Assumption 3.1), and that it is powerful enough to compute the projected Hessian estimate $[H^k]_\mu$, i.e., that it is able to project the current global Hessian estimate $H^k$ onto the set
$$\left\{M \in \mathbb{R}^{d\times d} : M^\top = M,\; M \succeq \mu I\right\}$$
in each iteration (see the Appendix).

• Alternatively, if $\mu$ is unknown, all devices send the compression errors
$$l_i^k := \|H_i^k - \nabla^2 f_i(x^k)\|_F$$
(this extra communication is extremely cheap as all $l_i^k$ variables are floats) to the server, which then computes the corrected Hessian estimate $H^k + l^k I$ by adding the average error $l^k = \frac{1}{n}\sum_{i=1}^n l_i^k$ to the global Hessian estimate $H^k$.

Both options require the server in each iteration to solve a linear system to invert either the projected or the corrected global Hessian estimate. The purpose of these options is quite simple: unlike the true Hessian, the compressed local Hessian estimates $H_i^k$, and also the global Hessian estimate $H^k$, might not be positive definite, or might not even be of full rank. The further importance of the errors $l_i^k$ will be discussed when we consider the extensions of FedNL to partial participation and globalization via cubic regularization. A sketch of the two resulting update rules is given below.
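The following minimal NumPy sketch shows the two server-side updates; the helper names are ours, and the projection follows the eigendecomposition formula of Appendix A.4.

```python
import numpy as np

def project_psd(H, mu):
    """Project a symmetric matrix onto {M : M = M^T, M >= mu*I}; see Appendix A.4."""
    lam, U = np.linalg.eigh(H)
    return (U * np.maximum(lam, mu)) @ U.T   # U diag(max(lambda, mu)) U^T

def server_step(x, H, grad_fx, mu=None, l=None):
    """Global model update of FedNL: Option 1 (needs mu) or Option 2 (needs l)."""
    if mu is not None:
        A = project_psd(H, mu)               # Option 1: projected estimate [H^k]_mu
    else:
        A = H + l * np.eye(H.shape[0])       # Option 2: l^k-corrected estimate
    return x - np.linalg.solve(A, grad_fx)
```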


3.4 Local convergence theory

Note that FedNL includes two parameters, the compression operators $\mathcal{C}_i^k$ and the Hessian learning rate $\alpha > 0$, and two options to perform global updates by the master. To provide theoretical guarantees, we need one of the following two assumptions.

Assumption 3.4. $\mathcal{C}_i^k \in \mathbb{C}(\delta)$ for all $i\in[n]$ and $k$. Moreover, (i) $\alpha = 1 - \sqrt{1-\delta}$, or (ii) $\alpha = 1$.

Assumption 3.5. $\mathcal{C}_i^k \in \mathbb{B}(\omega)$ for all $i\in[n]$ and $k$, and $0 < \alpha \le \frac{1}{\omega+1}$. Moreover, for all $i\in[n]$ and $j, l \in [d]$, each entry $(H_i^k)_{jl}$ is a convex combination of $\{(\nabla^2 f_i(x^t))_{jl}\}_{t=0}^k$ for any $k \ge 0$.

To present our results in a unified manner, we define some constants depending on which parameters and which option are used in FedNL. Below, the constants $A$ and $B$ depend on the choice of the compressors $\mathcal{C}_i^k$ and the learning rate $\alpha$, while $C$ and $D$ depend on which option is chosen for the global update:
$$(A, B) := \begin{cases} (\alpha^2,\, \alpha) & \text{if Assumption 3.4(i) holds} \\ (\delta/4,\; 6/\delta - 7/2) & \text{if Assumption 3.4(ii) holds} \\ (\alpha,\, \alpha) & \text{if Assumption 3.5 holds} \end{cases}, \qquad (C, D) := \begin{cases} (2,\, L_*^2) & \text{if Option 1 is used} \\ (8,\, (L_* + 2L_F)^2) & \text{if Option 2 is used} \end{cases}. \qquad (5)$$

We prove three local rates for FedNL: for the squared distance to the solution $\|x^k - x^*\|^2$, and for the Lyapunov function
$$\Phi^k := \mathcal{H}^k + 6BL_F^2\|x^k - x^*\|^2, \qquad \text{where} \qquad \mathcal{H}^k := \frac{1}{n}\sum_{i=1}^n \|H_i^k - \nabla^2 f_i(x^*)\|_F^2.$$

Theorem 3.6. Let Assumption 3.1 hold. Assume $\|x^0 - x^*\|^2 \le \frac{\mu^2}{2D}$ and $\mathcal{H}^k \le \frac{\mu^2}{4C}$ for all $k \ge 0$. Then, FedNL (Algorithm 1) converges linearly with the rate
$$\|x^k - x^*\|^2 \le \frac{1}{2^k}\|x^0 - x^*\|^2. \qquad (6)$$
Moreover, depending on the choice (5) of the compressors $\mathcal{C}_i^k$, the learning rate $\alpha$, and which option is used for global model updates, we have the following linear and superlinear rates:
$$\mathbb{E}[\Phi^k] \le \left(1 - \min\left\{A, \tfrac{1}{3}\right\}\right)^k \Phi^0, \qquad (7)$$
$$\frac{\mathbb{E}\|x^{k+1} - x^*\|^2}{\|x^k - x^*\|^2} \le \left(1 - \min\left\{A, \tfrac{1}{3}\right\}\right)^k \left(C + \frac{D}{12BL_F^2}\right)\frac{\Phi^0}{\mu^2}. \qquad (8)$$

Let us comment on these rates.

• First, the local linear rate (6) with respect to the iterates is based on a universal constant, i.e., it does not depend on the condition number of the problem, the size of the training data, or the dimension of the problem. Indeed, the squared distance to the optimum is halved in each iteration.

• Second, we have the linear rate (7) for the Lyapunov function $\Phi^k$, which implies the linear convergence of all local Hessian estimates $H_i^k$ to the local optimal Hessians $\nabla^2 f_i(x^*)$. Thus, our initial goal to progressively learn the local optimal Hessians $\nabla^2 f_i(x^*)$ in a communication efficient manner is achieved, justifying the effectiveness of the new Hessian learning technique.

• Finally, our Hessian learning process accelerates the convergence of the iterates to the superlinear rate (8). Both rates (7) and (8) are independent of the condition number of the problem and of the number of data points. However, they do depend on the compression variance (since $A$ depends on $\delta$ or $\omega$), which, in the case of $O(d)$ communication constraints, depends on the dimension $d$ only.

For clarity of exposition, in Theorem 3.6 we assumed $\mathcal{H}^k \le \frac{\mu^2}{4C}$ for all iterations $k \ge 0$. Below, we prove that this inequality holds, using the initial conditions only.

Lemma 3.7. Let Assumption 3.4 hold, and assume $\|x^0 - x^*\|^2 \le e_1 := \min\left\{\frac{A\mu^2}{4BCL_F^2}, \frac{\mu^2}{2D}\right\}$ and $\|H_i^0 - \nabla^2 f_i(x^*)\|_F^2 \le \frac{\mu^2}{4C}$. Then $\|x^k - x^*\|^2 \le e_1$ and $\|H_i^k - \nabla^2 f_i(x^*)\|_F^2 \le \frac{\mu^2}{4C}$ for all $k \ge 0$.

Lemma 3.8. Let Assumption 3.5 hold, and assume $\|x^0 - x^*\|^2 \le e_2 := \frac{\mu^2}{D + 4Cd^2L_\infty^2}$. Then $\|x^k - x^*\|^2 \le e_2$ and $\mathcal{H}^k \le \frac{\mu^2}{4C}$ for all $k \ge 0$.

3.5 FedNL and the "Newton Triangle"

One implication of Theorem 3.6 is that the local rate $\frac{1}{2^k}$ (see (6)) holds even when we specialize FedNL to $\mathcal{C}_i^k \equiv 0$, $\alpha = 0$ and $H_i^0 = \nabla^2 f_i(x^0)$ for all $i\in[n]$. These parameter choices give rise to the following simple method, which we call Newton Zero (N0):
$$x^{k+1} = x^k - \left(\nabla^2 f(x^0)\right)^{-1}\nabla f(x^k), \qquad k \ge 0. \qquad (9)$$
Interestingly, N0 only needs initial second-order information, i.e., the Hessian at the zeroth iterate, and the same first-order information as Gradient Descent (GD), i.e., $\nabla f(x^k)$, in each iteration. Moreover, unlike GD, whose rate depends on a condition number, the local rate $\frac{1}{2^k}$ of N0 does not.
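A minimal sketch of N0, assuming full gradient and Hessian oracles (names ours): the Hessian is factored once at $x^0$ and the factorization is reused in every subsequent iteration.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def newton_zero(x0, grad, hess, iters=50):
    """N0: x^{k+1} = x^k - (Hessian at x^0)^{-1} grad f(x^k); a sketch.

    Second-order information is used only once, at the initial point, so each
    iteration costs one gradient evaluation plus two triangular solves.
    """
    factor = cho_factor(hess(x0))   # factor the Hessian at x^0 once (assumed PD locally)
    x = x0.copy()
    for _ in range(iters):
        x = x - cho_solve(factor, grad(x))
    return x
```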

Besides, FedNL includes NS (when $\mathcal{C}_i^k \equiv 0$, $\alpha = 0$, $H_i^0 = \nabla^2 f_i(x^*)$) and classical Newton (N) (when $\mathcal{C}_i^k \equiv \mathrm{I}$, $\alpha = 1$, $H_i^0 = 0$) as special cases.

It can be helpful to visualize the three special Newton-type methods (N, NS and N0) as the vertices of a triangle capturing a subset of two of these three requirements: 1) $O(d)$ communication cost per round, 2) implementability in practice, and 3) local quadratic rate. Indeed, each of these three methods satisfies only two of these requirements: N (2+3), NS (1+3) and N0 (1+2). Finally, FedNL interpolates between these requirements. See Figure 1.

4 FedNL with Partial Participation, Globalization and Bidirectional Compression

Due to space limitations, here we briefly describe four extensions to FedNL and the key technical contributions. Detailed sections for each extension are deferred to the Appendix.


Figure 1: Visualization of the three special Newton-type methods (Newton (N), Newton Star (NS) and Newton Zero (N0)) as the vertices of a triangle capturing a subset of two of these three requirements: 1) $O(d)$ communication cost per round, 2) implementability in practice, and 3) local quadratic rate. Indeed, each of these three methods satisfies only two of these requirements: N (2+3), NS (1+3) and N0 (1+2). Finally, the proposed FedNL framework with its four extensions to Partial Participation (FedNL-PP), globalization via Line Search (FedNL-LS), globalization via Cubic Regularization (FedNL-CR) and Bidirectional Compression (FedNL-BC) interpolates between these requirements.

4.1 Partial Participation

In FedNL-PP (Algorithm 2), the server selects a subset $S^k \subseteq [n]$ of $\tau$ devices, uniformly at random, to participate in each iteration. As devices might be inactive for several iterations, the same local gradient and local Hessian used in FedNL do not provide convergence in this case. To guarantee convergence, devices need to compute Hessian corrected local gradients
$$g_i^k = \left(H_i^k + l_i^k I\right)w_i^k - \nabla f_i(w_i^k),$$
where $w_i^k$ is the last global model that device $i$ received from the server. This is an innovation which also requires a different analysis.

4.2 Globalization via Line Search

Our first globalization strategy, FedNL-LS (Algorithm 3), which performs significantly better in practice than FedNL-CR (described next), is based on a backtracking line search procedure. The idea is for the server to fix the search direction
$$d^k = -\left[H^k\right]_\mu^{-1}\nabla f(x^k)$$
and find the smallest integer $s \ge 0$ which leads to a sufficient decrease in the loss:
$$f(x^k + \gamma^s d^k) \le f(x^k) + c\gamma^s \left\langle \nabla f(x^k), d^k \right\rangle,$$
with some parameters $c \in (0, 1/2]$ and $\gamma \in (0, 1)$. A sketch of this procedure is given below.
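Here is a minimal sketch of the backtracking rule, under the assumption that the server can evaluate the full loss f (the function name is ours):

```python
import numpy as np

def backtracking_step(f, x, g, d, c=0.5, gamma=0.5, s_max=50):
    """Find the smallest s >= 0 with
    f(x + gamma**s * d) <= f(x) + c * gamma**s * <g, d>, then take that step."""
    fx, slope = f(x), float(np.dot(g, d))   # slope = <grad f(x^k), d^k>, negative
    for s in range(s_max + 1):
        t = gamma ** s
        if f(x + t * d) <= fx + c * t * slope:
            return x + t * d
    return x + (gamma ** s_max) * d         # fallback: smallest tried stepsize
```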

4.3 Globalization via Cubic Regularization

Our next globalization strategy, FedNL-CR (Algorithm 4), is to use a cubic regularization term $\frac{L_*}{6}\|h\|^3$, where $L_*$ is the Lipschitz constant for the Hessians and $h$ is the direction to the next iterate. However, to get a global upper bound, we had to correct the global Hessian estimate $H^k$ via the compression error $l^k$. Indeed, since $\nabla^2 f(x^k) \preceq H^k + l^k I$, we deduce
$$f(x^{k+1}) \le f(x^k) + \left\langle \nabla f(x^k), h^k \right\rangle + \frac{1}{2}\left\langle \left(H^k + l^k I\right)h^k, h^k \right\rangle + \frac{L_*}{6}\|h^k\|^3$$
for all $k \ge 0$. This leads to theoretical challenges and necessitates a new analysis.

4.4 Bidirectional Compression

Finally, we modify FedNL to allow for an even more severe level of compression that cannot be attained by compressing the Hessians only. This is achieved by compressing the gradients (uplink) and the model (downlink) in a "smart" way. In FedNL-BC (Algorithm 5), the server operates its own compressors $\mathcal{C}_M^k$ applied to the model, and uses an additional "smart" model learning technique similar to the proposed Hessian learning technique. On top of this, all devices compress their local gradients via a Bernoulli compression scheme, which necessitates the use of another "smart" strategy using Hessian corrected local gradients
$$g_i^k = H_i^k(z^k - w^k) + \nabla f_i(w^k),$$
where $z^k$ is the current learned global model and $w^k$ is the last learned global model at which local gradients were sent to the server. These changes are substantial and require novel analysis.

5 Experiments

We carry out numerical experiments to study the performance of FedNL, and compare it with various state-of-the-art methods in federated learning. We consider the problem
$$\min_{x\in\mathbb{R}^d}\left\{f(x) := \frac{1}{n}\sum_{i=1}^n f_i(x) + \frac{\lambda}{2}\|x\|^2\right\}, \qquad f_i(x) = \frac{1}{m}\sum_{j=1}^m \log\left(1 + \exp(-b_{ij}a_{ij}^\top x)\right), \qquad (10)$$
where $\{a_{ij}, b_{ij}\}_{j\in[m]}$ are the data points at the $i$-th device. The datasets were taken from the LibSVM library [Chang and Lin, 2011]: a1a, a9a, w7a, w8a, and phishing.
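For concreteness, the local loss in (10) admits closed-form gradient and Hessian oracles; below is a minimal NumPy sketch (our own helper, with A stacking the data points $a_{ij}$ as rows and b holding the labels $b_{ij} \in \{-1, +1\}$). The regularizer can be accounted for by adding $\lambda x$ to the aggregated gradient and $\lambda I$ to the aggregated Hessian.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def local_oracles(A, b):
    """Loss, gradient and Hessian of f_i(x) = (1/m) sum_j log(1 + exp(-b_j a_j^T x)).

    A : (m, d) matrix whose rows are the data points on device i
    b : (m,) vector of labels in {-1, +1}
    """
    m = A.shape[0]

    def loss(x):
        return np.mean(np.log1p(np.exp(-b * (A @ x))))

    def grad(x):
        # grad f_i(x) = -(1/m) A^T (b * sigmoid(-b * Ax))
        return -(A.T @ (b * sigmoid(-b * (A @ x)))) / m

    def hess(x):
        # Hessian of f_i(x) = (1/m) A^T diag(p * (1 - p)) A, with p = sigmoid(b * Ax)
        p = sigmoid(b * (A @ x))
        return (A.T * (p * (1 - p))) @ A / m

    return loss, grad, hess
```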

5.1 Parameter setting

In all experiments we use the theoretical parameters for gradient-type methods (except those using line search): vanilla gradient descent GD, DIANA [Mishchenko et al., 2019], and ADIANA [Li et al., 2020b]. For DINGO [Crane and Roosta, 2019] we use the authors' choice: $\theta = 10^{-4}$, $\phi = 10^{-6}$, $\rho = 10^{-4}$. Backtracking line search for DINGO selects the largest stepsize from $\{1, 2^{-1}, \dots, 2^{-10}\}$. The initialization of $H_i^0$ for NL1 [Islamov et al., 2021], FedNL and FedNL-LS is $\nabla^2 f_i(x^0)$, and for FedNL-CR it is $0$. For FedNL, FedNL-LS, and FedNL-CR we use the Rank-1 compression operator and stepsize $\alpha = 1$.

We use two values of the regularization parameter: $\lambda \in \{10^{-3}, 10^{-4}\}$. In the figures we plot the optimality gap $f(x^k) - f(x^*)$ against the number of communicated bits per node, or the number of communication rounds. The optimal value $f(x^*)$ is chosen as the function value at the 20-th iterate of standard Newton's method.

5.2 Local convergence

In our first experiment we compare FedNL and N0 with gradient-type methods: ADIANA with random dithering (ADIANA, RD, $s = \sqrt{d}$), DIANA with random dithering (DIANA, RD, $s = \sqrt{d}$), vanilla gradient descent (GD), and DINGO. According to the results summarized in Figure 2 (first row), we conclude that FedNL and N0 outperform all gradient-type methods and DINGO, locally, by many orders of magnitude. We note that we include the communication cost of the initialization for FedNL and N0 in order to make a fair comparison (this is why there is a straight line for these methods initially).

Figure 2: First row: local comparison of FedNL and N0 with (a), (b) ADIANA, DIANA, GD, and with (c), (d) DINGO, in terms of communication complexity (panels: a1a, $\lambda = 10^{-3}$; a9a, $\lambda = 10^{-4}$; w8a, $\lambda = 10^{-3}$; phishing, $\lambda = 10^{-4}$). Second row: global comparison of FedNL-LS, N0-LS and FedNL-CR with (a), (b) ADIANA, DIANA, GD, and GD with line search, and with (c), (d) DINGO (panels: w7a, $\lambda = 10^{-3}$; a1a, $\lambda = 10^{-4}$; phishing, $\lambda = 10^{-3}$; a9a, $\lambda = 10^{-4}$). Third row: local comparison of FedNL with 3 types of compression operators and NL1 (panels: w8a, $\lambda = 10^{-3}$; phishing, $\lambda = 10^{-3}$; a1a, $\lambda = 10^{-4}$; w7a, $\lambda = 10^{-4}$).

5.3 Global convergence

We now compare FedNL-LS, N0-LS, and FedNL-CR with the first-order methods ADIANA and DIANA with random dithering, gradient descent (GD), and GD with line search (GD-LS). Besides, we compare FedNL-LS and FedNL-CR with DINGO. In this experiment we choose $x^0$ far from the solution $x^*$, i.e., we test the global convergence behavior; see Figure 2 (second row). We observe that FedNL-LS and N0-LS are more communication efficient than all first-order methods and DINGO. However, FedNL-CR is better only than GD and GD-LS. In these experiments we again include the communication cost of initialization for FedNL-LS and N0-LS.

5.4 Comparison with NL1

Next, we compare FedNL with three types of compression operators: Rank-R ($R = 1$), Top-K ($K = d$), and PowerSGD [Vogels et al., 2019] ($R = 1$), against NL1 with the Rand-K ($K = 1$) compressor. The results, presented in Figure 2 (third row), show that FedNL with the Rank-1 compressor performs the best.

References

Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709–1720, 2017.

Amir Beck. Introduction to Nonlinear Optimization: Theory, Algorithms, and Applications with

MATLAB. Society for Industrial and Applied Mathematics, USA, 2014. ISBN 1611973643.

Aleksandr Beznosikov, Samuel Horváth, Peter Richtárik, and Mher Safaryan. On biased compression

for distributed learning. arXiv preprint arXiv:2002.12410, 2020.

Chih-Chung Chang and Chih-Jen Lin. LibSVM: a library for support vector machines. ACM

Transactions on Intelligent Systems and Technology (TIST), 2(3):1–27, 2011.

Wenlin Chen, Samuel Horváth, and Peter Richtárik. Optimal client sampling for federated learning.

arXiv preprint arXiv:2010.13723, 2020.

Rixon Crane and Fred Roosta. DINGO: Distributed Newton-type method for gradient-norm optimization. In Advances in Neural Information Processing Systems, volume 32, pages 9498–9508, 2019.

Eduard Gorbunov, Filip Hanzely, and Peter Richtárik. A unified theory of SGD: Variance reduction, sampling, quantization and coordinate descent. In The 23rd International Conference on Artificial Intelligence and Statistics, 2020a.

Eduard Gorbunov, Dmitry Kovalev, Dmitry Makarenko, and Peter Richtárik. Linearly converging

error compensated SGD. In 34th Conference on Neural Information Processing Systems (NeurIPS

2020), 2020b.

Eduard Gorbunov, Konstantin Burlachenko, Zhize Li, and Peter Richtárik. MARINA: Faster

non-convex distributed learning with compression. arXiv preprint arXiv:2102.07845, 2021a.

Eduard Gorbunov, Filip Hanzely, and Peter Richtárik. Local SGD: Unified theory and new efficient methods. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2021b.


Robert Mansel Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, and Peter

Richtárik. SGD: General analysis and improved rates. In Kamalika Chaudhuri and Ruslan

Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning,

volume 97 of Proceedings of Machine Learning Research, pages 5200–5209, Long Beach, California,

USA, 09–15 Jun 2019. PMLR.

Samuel Horváth, Dmitry Kovalev, Konstantin Mishchenko, Sebastian Stich, and Peter Richtárik.

Stochastic distributed learning with gradient quantization and variance reduction. arXiv preprint

arXiv:1904.05115, 2019.

Rustem Islamov, Xun Qian, and Peter Richtárik. Distributed second order methods with fast rates

and compressed communication. arXiv preprint arXiv:2102.07158, 2021.

Peter Kairouz et al. Advances and open problems in federated learning. arXiv preprint

arXiv:1912.04977, 2019.

Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Stich, and Martin Jaggi. Error feedback fixes SignSGD and other gradient compression schemes. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pages 3252–3261, 2019.

Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, and

Ananda Theertha Suresh. SCAFFOLD: Stochastic controlled averaging for on-device federated

learning. In International Conference on Machine Learning (ICML), 2020.

Ahmed Khaled, Konstantin Mishchenko, and Peter Richtárik. Tighter theory for local SGD on identical and heterogeneous data. In The 23rd International Conference on Artificial Intelligence and Statistics (AISTATS 2020), 2020.

Sarit Khirirat, Hamid Reza Feyzmahdavian, and Mikael Johansson. Distributed learning with

compressed gradients. arXiv preprint arXiv:1806.06573, 2018.

Jakub Konečný, H. Brendan McMahan, Daniel Ramage, and Peter Richtárik. Federated optimization:

Distributed machine learning for on-device intelligence. arXiv preprint arXiv:1610.02527, 2016a.

Jakub Konečný, H. Brendan McMahan, Felix Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: strategies for improving communication efficiency. In NIPS Private Multi-Party Machine Learning Workshop, 2016b.

Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith.

Federated optimization in heterogeneous networks. arXiv preprint arXiv:1812.06127, 2018.

Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: challenges,

methods, and future directions. IEEE Signal Processing Magazine, 37(3):50–60, 2020a. doi:

10.1109/MSP.2020.2975749.

Zhize Li, Dmitry Kovalev, Xun Qian, and Peter Richtárik. Acceleration for compressed gradient

descent in distributed and federated optimization. In International Conference on Machine

Learning, 2020b.


Xiaorui Liu, Yao Li, Jiliang Tang, and Ming Yan. A double residual compression algorithm for efficient distributed learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2020.

Yura Malitsky and Konstantin Mishchenko. Adaptive gradient descent without descent. In International Conference on Machine Learning (ICML), 2019.

H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.

Konstantin Mishchenko, Eduard Gorbunov, Martin Takáč, and Peter Richtárik. Distributed learning with compressed gradient differences. arXiv preprint arXiv:1901.09269, 2019.

Konstantin Mishchenko, Ahmed Khaled, and Peter Richtárik. Proximal and federated random reshuffling. arXiv preprint arXiv:2102.06704, 2021a.

Konstantin Mishchenko, Bokun Wang, Dmitry Kovalev, and Peter Richtárik. IntSGD: Floatless

compression of stochastic gradients. arXiv preprint arXiv:2102.08374, 2021b.

Constantin Philippenko and Aymeric Dieuleveut. Bidirectional compression in heterogeneous settings

for distributed or federated learning with partial participation: tight convergence guarantees.

arXiv preprint arXiv:2006.14591, 2021.

Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný,

Sanjiv Kumar, and H. Brendan McMahan. Adaptive federated optimization. arXiv preprint

arXiv:2003.00295, 2020.

Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.

S. U. Stich, J.-B. Cordonnier, and M. Jaggi. Sparsified SGD with memory. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

Sebastian U. Stich. Local SGD converges fast and communicates little. In International Conference

on Learning Representations (ICLR), 2020.

Thijs Vogels, Sai Praneeth Karimireddy, and Martin Jaggi. PowerSGD: Practical low-rank gradient

compression for distributed optimization. In Advances in Neural Information Processing Systems

32 (NeurIPS), 2019.

Cong Xie, Oluwasanmi Koyejo, Indranil Gupta, and Haibin Lin. Local AdaAlter: Communication-efficient stochastic gradient descent with adaptive learning rates. arXiv preprint arXiv:1911.09030, 2019.


Appendix

A Extra Experiments

We carry out numerical experiments to study the performance of FedNL, and compare it with various

state-of-the-art methods in federated learning. We consider the following problem

$$\min_{x\in\mathbb{R}^d}\left\{\frac{1}{n}\sum_{i=1}^n f_i(x) + \frac{\lambda}{2}\|x\|^2\right\}, \qquad f_i(x) = \frac{1}{m}\sum_{j=1}^m \log\left(1 + \exp(-b_{ij}a_{ij}^\top x)\right), \qquad (11)$$
where $\{a_{ij}, b_{ij}\}_{j\in[m]}$ are the data points at the $i$-th device.

A.1 Data sets

The datasets were taken from the LibSVM library [Chang and Lin, 2011]: a1a, a9a, w7a, w8a, phishing. We partitioned each dataset across several nodes to capture a variety of scenarios. See Table 3 for a more detailed description of the dataset settings.

Table 3: Datasets used in the experiments with the number of worker nodes n used in each case.

    Dataset     # workers n    # data points (= nm)    # features d
    a1a             16                 1600                 123
    a9a             80                32560                 123
    w7a             50                24600                 300
    w8a            142                49700                 300
    phishing       100                11000                  68

A.2 Parameters setting

In all experiments we use the theoretical parameters for gradient-type methods (except those with a line search procedure): vanilla gradient descent, DIANA [Mishchenko et al., 2019], and ADIANA [Li et al., 2020b]. The constants for DINGO [Crane and Roosta, 2019] are set as by the authors: $\theta = 10^{-4}$, $\phi = 10^{-6}$, $\rho = 10^{-4}$. Backtracking line search for DINGO selects the largest stepsize from $\{1, 2^{-1}, \dots, 2^{-10}\}$. The initialization of $H_i^0$ for NL1 [Islamov et al., 2021], FedNL, FedNL-LS, and FedNL-PP is $\nabla^2 f_i(x^0)$, and for FedNL-CR it is $0$.

We conduct experiments for two values of the regularization parameter: $\lambda \in \{10^{-3}, 10^{-4}\}$. In the figures we plot the optimality gap $f(x^k) - f(x^*)$ against the number of communicated bits per node or the number of communication rounds. The optimal value $f(x^*)$ is chosen as the function value at the 20-th iterate of standard Newton's method.

A.3 Compression operators

Here we describe four compression operators that are used in our experiments.


A.3.1 Random dithering for vectors

For the first-order methods ADIANA and DIANA we use the random dithering operator [Alistarh et al., 2017, Horváth et al., 2019]. This compressor with $s$ levels is defined via the formula
$$\mathcal{C}(x) := \mathrm{sign}(x)\cdot\|x\|_q\cdot\frac{\xi_s}{s}, \qquad (12)$$
where $\|x\|_q := \left(\sum_i |x_i|^q\right)^{1/q}$ and $\xi_s \in \mathbb{R}^d$ is a random vector with $i$-th element defined as follows:
$$(\xi_s)_i = \begin{cases} l+1 & \text{with probability } \frac{|x_i|}{\|x\|_q}s - l, \\ l & \text{otherwise.} \end{cases} \qquad (13)$$
Here $s \in \mathbb{N}_+$ denotes the number of rounding levels, and $l$ satisfies $\frac{|x_i|}{\|x\|_q} \in \left[\frac{l}{s}, \frac{l+1}{s}\right)$. According to Horváth et al. [2019], this compressor has variance parameter $\omega \le 2 + \frac{d^{1/2} + d^{1/q}}{s}$. However, for the standard Euclidean norm ($q = 2$) one can improve the bound to $\omega \le \min\left\{\frac{d}{s^2}, \frac{\sqrt{d}}{s}\right\}$ [Alistarh et al., 2017].

A.3.2 Rank-R compression operator for matrices

Our theory supports contractive compression operators; see Definition 3.3. In the experiments for FedNL we use the Rank-R compression operator. Let $X \in \mathbb{R}^{d\times d}$ and let $U\Sigma V^\top$ be the singular value decomposition of $X$:
$$X = \sum_{i=1}^d \sigma_i u_i v_i^\top, \qquad (14)$$
where the singular values $\sigma_i$ are sorted in non-increasing order: $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_d$. Then the Rank-R compressor, for $R \le d$, is defined by
$$\mathcal{C}(X) := \sum_{i=1}^R \sigma_i u_i v_i^\top. \qquad (15)$$
Note that
$$\|X\|_F^2 \overset{(14)}{=} \Big\|\sum_{i=1}^d \sigma_i u_i v_i^\top\Big\|_F^2 = \sum_{i=1}^d \sigma_i^2 \qquad \text{and} \qquad \|\mathcal{C}(X) - X\|_F^2 \overset{(14)+(15)}{=} \Big\|\sum_{i=R+1}^d \sigma_i u_i v_i^\top\Big\|_F^2 = \sum_{i=R+1}^d \sigma_i^2.$$
Since $\frac{1}{d-R}\sum_{i=R+1}^d \sigma_i^2 \le \frac{1}{d}\sum_{i=1}^d \sigma_i^2$, we have
$$\|\mathcal{C}(X) - X\|_F^2 \le \frac{d-R}{d}\|X\|_F^2 = \left(1 - \frac{R}{d}\right)\|X\|_F^2,$$
and hence the Rank-R compression operator belongs to $\mathbb{C}(\delta)$ with $\delta = \frac{R}{d}$. In the case when $X \in \mathbb{S}^d$, we have $u_i = v_i$ for all $i\in[d]$, and the Rank-R compressor applied to $X$ reduces to $\sum_{i=1}^R \sigma_i u_i u_i^\top$, i.e., the output of the Rank-R compressor is automatically a symmetric matrix, too.
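A minimal NumPy sketch of this operator (our own helper; in a real implementation one would transmit the R singular triples, i.e., R(2d+1) floats, rather than the dense reconstruction):

```python
import numpy as np

def rank_r(X, R):
    """Rank-R compressor (15): keep the R largest singular triples of X."""
    U, sigma, Vt = np.linalg.svd(X, full_matrices=False)  # sigma is sorted descending
    return (U[:, :R] * sigma[:R]) @ Vt[:R, :]
```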

21

A.3.3 Top-K compression operator for matrices

Another example of a contractive compression operator is the Top-K compressor for matrices. For an arbitrary matrix $X\in\mathbb{R}^{d\times d}$, sort its entries in non-increasing order by magnitude, i.e., $X_{i_k,j_k}$ is the $k$-th largest element of $X$ in magnitude. Let $\{E_{ij}\}_{i,j=1}^d$ be the matrices defined by
$$(E_{ij})_{ps} := \begin{cases} 1, & \text{if } (p,s) = (i,j), \\ 0, & \text{otherwise.} \end{cases} \qquad (16)$$
Then, the Top-K compression operator can be defined via
$$\mathcal{C}(X) := \sum_{k=1}^K X_{i_k,j_k} E_{i_k,j_k}. \qquad (17)$$
This compression operator belongs to $\mathbb{C}(\delta)$ with $\delta = \frac{K}{d^2}$. If we need the output of Top-K applied to a symmetric matrix $X$ to be symmetric as well, then we apply the Top-K compressor only to the lower triangular part of $X$.
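A minimal NumPy sketch (the function name is ours); the symmetric variant would apply the same selection to the lower triangular part only, as noted above:

```python
import numpy as np

def top_k(X, K):
    """Top-K compressor (17): keep the K largest entries of X by magnitude."""
    flat = X.ravel()
    idx = np.argpartition(np.abs(flat), -K)[-K:]  # indices of the K largest |entries|
    out = np.zeros_like(flat)
    out[idx] = flat[idx]
    return out.reshape(X.shape)
```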

A.3.4 Rand-K compression operator for matrices

Our theory also supports unbiased compression operators; see Definition 3.2. One example is Rand-K. For an arbitrary matrix $X\in\mathbb{R}^{d\times d}$ we choose a set $S_K$ of indices $(i,j)$, of cardinality $K$, uniformly at random. Then the Rand-K compressor can be defined via
$$\mathcal{C}(X)_{ij} := \begin{cases} \frac{d^2}{K} X_{ij} & \text{if } (i,j) \in S_K, \\ 0 & \text{otherwise.} \end{cases} \qquad (18)$$
This compression operator belongs to $\mathbb{B}(\omega)$ with $\omega = \frac{d^2}{K} - 1$. If we need the output of this compressor to be a symmetric matrix, then we apply it only to the lower triangular part of the input.

A.4 Projection onto the cone of positive definite matrices

If one uses FedNL with Option 1, then we need to project onto the cone of symmetric matrices which are positive definite with constant $\mu$:
$$\left\{M \in \mathbb{R}^{d\times d} : M^\top = M,\; M \succeq \mu I\right\}.$$
The projection of a symmetric matrix $X$ onto the cone of positive semidefinite matrices can be computed via
$$[X]_0 := \sum_{i=1}^d \max\{\lambda_i, 0\}\, u_i u_i^\top, \qquad (19)$$
where $\sum_i \lambda_i u_i u_i^\top$ is an eigenvalue decomposition of $X$. Using the projection onto the cone of positive semidefinite matrices, we can define the projection onto the cone of positive definite matrices with constant $\mu$ via
$$[X]_\mu := [X - \mu I]_0 + \mu I. \qquad (20)$$


Figure 3: The performance of FedNL with different types of compression operators: Rank-R (first row; panels a1a and a9a with $\lambda \in \{10^{-3}, 10^{-4}\}$); Top-K (second row; panels phishing with $\lambda \in \{10^{-3}, 10^{-4}\}$, w7a and w8a with $\lambda = 10^{-3}$); PowerSGD of rank R (third row; panels a1a and a9a with $\lambda \in \{10^{-3}, 10^{-4}\}$), for several values of R and K, in terms of communication complexity.

A.5 The effect of compression

First, we investigate how the level of compression influences the performance of FedNL; see Figure 3. Here we study the performance for three types of compression operators: Rank-R, Top-K, and PowerSGD of rank R. According to the numerical experiments, the smaller the compression parameter, the better the performance of FedNL. This statement is true for all three types of compressors.

A.6 Comparison of Options 1 and 2

In our next experiment we investigate which option (1 or 2) of FedNL, with the Rank-R compressor and stepsize $\alpha = 1$, demonstrates better results in terms of communication complexity. According to the results in Figure 4, we see that FedNL with projection (Option 1) is more communication efficient than FedNL with Option 2. However, Option 1 requires more computing resources.

A.7 Comparison of different compression operators

Next, we study which compression operator is better in terms of communication complexity. Based on the results in Figure 5, we can conclude that Rank-R is the best compression operator; the Top-K and PowerSGD compressors can beat each other in different cases.


Figure 4: The performance of FedNL with Options 1 and 2 in terms of communication complexity (panels: a1a and a9a with $\lambda = 10^{-3}$; phishing and w7a with $\lambda = 10^{-4}$).

Figure 5: Comparison of the performance of FedNL with different compression operators in terms of communication complexity (panels: w8a and phishing with $\lambda = 10^{-3}$; a1a and a9a with $\lambda = 10^{-4}$).

A.8 Comparison of different update rules for Hessians

In the next step we compare FedNL with three update rules for the Hessians in order to find the best one. They are: the biased Top-K compression operator with stepsize $\alpha = 1$ (Option 1); the biased Top-K compression operator with stepsize $\alpha = 1 - \sqrt{1-\delta}$; and the unbiased Rand-K compression operator with stepsize $\alpha = \frac{1}{\omega+1}$. The results of this experiment are presented in Figure 6. Based on them, we conclude that FedNL with the Top-K compressor and stepsize $\alpha = 1$ demonstrates the best performance. FedNL with the Rand-K compressor and stepsize $\alpha = \frac{1}{\omega+1}$ performs slightly better than FedNL with the Top-K compressor and stepsize $\alpha = 1 - \sqrt{1-\delta}$. As a consequence, we will use the biased compression operator with stepsize $\alpha = 1$.