Adaptive Expert Models for Personalization in Federated Learning
Martin Isaksson1,2, Edvin Listo Zec3,2, Rickard Cöster1, Daniel Gillblad4,5 and Šarūnas Girdzijauskas3,2
1Ericsson AB, Stockholm, Sweden
2KTH Royal Institute of Technology, Stockholm, Sweden
3RISE Research Institutes of Sweden, Stockholm, Sweden
4AI Sweden, Stockholm, Sweden
5Chalmers AI Research Center, Chalmers University of Technology, Göteborg, Sweden
Abstract
Federated Learning (FL) is a promising framework for distributed learning when data is private and sensitive. However, the state-of-the-art solutions in this framework are not optimal when data is heterogeneous and non-IID. We propose a practical and robust approach to personalization in FL that adjusts to heterogeneous and non-IID data by balancing exploration and exploitation of several global models. To achieve our aim of personalization, we use a Mixture of Experts (MoE) that learns to group clients that are similar to each other, while using the global models more efficiently. We show that our approach achieves an accuracy up to 29.78 % better than the state-of-the-art and up to 4.38 % better compared to a local model in a pathological non-IID setting, even though we tune our approach in the IID setting.
1 Introduction
In many real-world scenarios, data is distributed over organizations or devices and is difficult to centralize. Due to legal reasons, data might have to remain and be processed where it is generated, and in many cases may not be allowed to be transferred [GDPR, 2016]. Furthermore, due to communication limitations it can be practically impossible to send data to a central point of processing. In many applications of Machine Learning (ML) these challenges are becoming increasingly important to address. For example, sensors, cars, radio base stations and mobile devices are capable of generating more relevant training data than can be practically communicated to the cloud [Fersman et al., 2018], and datasets in healthcare and industry cannot legally be moved between hospitals or countries of origin.
Federated Learning (FL) [McMahan et al., 2017; Bonawitz et al., 2019] shows promise for leveraging data that cannot easily be centralized. It has the potential to utilize the compute and storage resources of clients to scale towards large, decentralized datasets while enhancing privacy. However, current approaches fall short when data is heterogeneous as well as non-Independent and Identically Distributed (non-IID), where stark differences between clients and groups of clients can be found. Therefore, personalization of collectively learned models will in practice often be critical to adapt to differences between regions, organizations and individuals in order to achieve the required performance [Kairouz et al., 2021; Ghosh et al., 2020]. This is the problem we address in this paper.

Figure 1: Our approach adjusts to non-Independent and Identically Distributed (IID) data distributions by adaptively training a Mixture of Experts (MoE) for clients that share similar data distributions. (The schematic contrasts the previous approach, which uses only cluster models and selects one for inference, with our approach: global cluster models are trained using FL and used as expert models in the MoE, together with locally trained local and gating models whose learned weights produce the personalized inference.)
Our approach adjusts to non-IID data distributions by adaptively training a Mixture of Experts (MoE) for clients that share similar data distributions. We explore a wide spectrum of data distribution settings, ranging from the same distribution for all clients all the way to different distributions for each client. Our aim is an end-to-end framework that performs comparably to or better than vanilla FL and is robust in all of these settings.
In order to achieve personalization, the authors of [Ghosh et al., 2020] introduce a method for training cluster models using FL. We show that their solution does not perform well in our settings, where only one or a few of the cluster models converge. To solve this, inspired by the Multi-Armed Bandit (MAB) field, we employ an efficient and effective way of balancing exploration and exploitation of these cluster models. As proposed by the authors of [Peterson et al., 2019; Listo Zec et al., 2020], we add a local model and use a MoE that learns to weigh, and make use of, all of the available models to produce a better personalized inference, see Figure 1.
In summary, our main contributions are:
1. We devise an FL algorithm which improves upon [Ghosh et al., 2020] by balancing exploration and exploitation to produce better adapted cluster models, see Section 3.1;
2. We use said cluster models as expert models in an MoE to improve performance, described in Section 3.1;
3. An extensive analysis¹ of our approach with respect to different non-IID aspects that also considers the distribution of client performance, see Section 4.5.

¹ The source code for the experiments can be found at https://github.com/EricssonResearch/fl-moe.
2 Background
2.1 Problem formulation
Consider a distributed and decentralized ML setting with $K$ clients. Each client $k \in \{1, 2, \ldots, K\}$ has access to a local data partition $P_k$ that never leaves the client, where $n_k = |P_k|$ is the number of local data samples.

In this paper we are considering a multi-class classification problem where we have $n = \sum_{k=1}^{K} n_k$ data samples $x_i$, indexed $i \in \{1, 2, \ldots, n_k\}$, and the output class label $y_i$ is in a finite set. We further divide each client partition $P_k$ into local training and test sets. We are interested in performance on the local test set in a non-IID setting, see Section 2.2.
2.2 Regimes of non-IID data
In any decentralized setting it is common to have non-IID data that can be of non-identical client distributions [Hsieh et al., 2020; Kairouz et al., 2021], and which can be characterized as:

- Feature distribution skew (covariate shift). The feature distributions are different between clients. The marginal distributions $P(x)$ vary, but $P(y|x)$ is shared;
- Label distribution skew (prior probability shift, or class imbalance). The distribution of class labels varies between clients, so that $P(y)$ varies but $P(x|y)$ is shared;
- Same label, different features (concept shift). The conditional distribution $P(x|y)$ varies between clients but $P(y)$ is shared;
- Same features, different label (concept shift). The conditional distribution $P(y|x)$ varies between clients, but $P(x)$ is shared;
- Quantity skew (unbalancedness). Clients have different amounts of data.

Furthermore, the data independence between clients and between data samples within a client can also be violated.
2.3 Federated Learning
In a centralized ML solution, data that may be potentially privacy-sensitive is collected to a central location. One way of improving privacy is to use a collaborative ML algorithm such as Federated Averaging (FEDAVG) [McMahan et al., 2017]. In FEDAVG, training of a global model $f_g(x, w_g)$ is distributed, decentralized and synchronous. A parameter server coordinates training on many clients over several communication rounds until convergence.

In communication round $t$, the parameter server selects a fraction $C$ out of $K$ clients as the set $S_t$. Each selected client $k \in S_t$ will train locally on $n_k$ data samples $(x_i, y_i), i \in P_k$, for $E$ epochs before an update is sent to the parameter server. The parameter server performs aggregation of all received updates and updates the global model parameters $w_g$. Finally, the new global model parameters are distributed to all clients.
We can now define our objective as

$$
\min_{w_g \in \mathbb{R}^d} L(w_g)
= \min_{w_g \in \mathbb{R}^d}
\underbrace{\sum_{k=1}^{K} \frac{n_k}{n}
\overbrace{\frac{1}{n_k} \sum_{i \in P_k}
\underbrace{l(x_i, y_i, w_g)}_{\text{sample } i \text{ loss}}}^{\text{client } k \text{ average loss}}}_{\text{population average loss}},
\qquad (1)
$$

where $l(x_i, y_i, w_g)$ is the loss for $y_i$, $\hat{y}_g = f_g(x_i, w_g)$. In other words, we aim to minimize the average loss of the global model over all clients in the population.
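To make the aggregation behind this objective concrete, the following is a minimal Python/NumPy sketch of FEDAVG-style server-side averaging, weighting each client update by its share of the data; the function name and the list-of-arrays parameter representation are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of FedAvg-style server aggregation (illustrative, not the
# authors' code). Client updates are lists of NumPy arrays, one entry per
# model tensor, and n_k is the number of local samples per client.
import numpy as np

def fedavg_aggregate(client_weights, client_num_samples):
    """Weighted average of client updates: w_g <- sum_k (n_k / n) w_k."""
    n_total = float(sum(client_num_samples))
    num_tensors = len(client_weights[0])
    aggregated = []
    for t in range(num_tensors):
        weighted_sum = sum(
            (n_k / n_total) * w[t]
            for w, n_k in zip(client_weights, client_num_samples)
        )
        aggregated.append(weighted_sum)
    return aggregated

# Example: three clients with different amounts of data.
updates = [[np.ones((2, 2)) * c] for c in (1.0, 2.0, 3.0)]
counts = [10, 30, 60]
print(fedavg_aggregate(updates, counts)[0])  # weighted mean of the tensors
```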
2.4 Iterative Federated Clustering
In many real distributed use-cases, data is naturally non-IID and
clients form clusters of similar clients. A possible improvement
over FEDAVG is to introduce cluster models that map to these
clusters, but the problem of identifying clients that belong
to these clusters remains. We aim to find clusters, subsets
of the population of clients, that benefit more from training
together within the subset, as opposed to training with the
entire population.
Using the Iterative Federated Clustering Algorithm (IFCA) [Ghosh et al., 2020], we set the expected largest number of clusters to be $J$ and initialize one cluster model with weights $w_g^j$ per cluster $j \in \{1, 2, \ldots, J\}$. At communication round $t$ each selected client $k$ performs a cluster identity estimation, where it selects the cluster model $\hat{j}_k$ that has the lowest estimated loss on the local training set. The cluster model parameters $w_g^j$ at time $t + 1$ are then updated using only the updates from clients that selected the $j$-th cluster model, so that (using model averaging [McMahan et al., 2017; Ghosh et al., 2020])

$$
n_j \leftarrow \sum_{k \in \{S_t \,|\, \hat{j}_k = j\}} n_k, \qquad (2)
$$

$$
w_g^j(t+1) \leftarrow \sum_{k \in \{S_t \,|\, \hat{j}_k = j\}} \frac{n_k}{n_j}\, w_k(t+1). \qquad (3)
$$
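A minimal sketch of the IFCA steps in (2) and (3), cluster identity estimation on the client and per-cluster model averaging on the server; loss_fn and the data structures are assumed helpers used for illustration, not the authors' code.

```python
# Sketch of IFCA: each client picks the cluster model with the lowest local
# training loss, and the server averages updates per cluster (eqs. (2)-(3)).
# `loss_fn(weights, data)` is an assumed helper that evaluates a model.

def estimate_cluster(cluster_weights, local_data, loss_fn):
    """Return the index j of the cluster model with the lowest local loss."""
    losses = [loss_fn(w_j, local_data) for w_j in cluster_weights]
    return losses.index(min(losses))

def aggregate_per_cluster(num_clusters, client_updates):
    """client_updates: list of (j_hat, n_k, w_k) tuples from selected clients."""
    new_cluster_weights = {}
    for j in range(num_clusters):
        members = [(n_k, w_k) for j_hat, n_k, w_k in client_updates if j_hat == j]
        if not members:
            continue  # no client picked cluster j this round; keep old weights
        n_j = float(sum(n_k for n_k, _ in members))
        num_tensors = len(members[0][1])
        new_cluster_weights[j] = [
            sum((n_k / n_j) * w_k[t] for n_k, w_k in members)
            for t in range(num_tensors)
        ]
    return new_cluster_weights
```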
2.5 Federated Learning using a Mixture of Experts
In order to construct a personalized model for each client, [Listo Zec et al., 2020] first add a local expert model $f_l^k(x, w_l^k)$ that is trained only on local data. Recall the global model $f_g(x, w_g)$ from Section 2.3. The authors of [Listo Zec et al., 2020] then learn to weigh the local expert model and the global model using a gating function from MoE [Jacobs et al., 1991; Peterson et al., 2019; Hanzely and Richtárik, 2020]. The gating function takes the same input $x$ and outputs a weight for each of the expert models. It uses a Softmax in the output layer so that these weights sum to 1. We define $f_h^k(x, w_h^k)$ as the gating function for client $k$. The same model architectures are used for all local models, so $f_h^k(x, w) = f_h^{k'}(x, w)$ and $f_l^k(x, w) = f_l^{k'}(x, w)$ for all pairs of clients $k, k'$. For simplicity, we write $f_l(x) = f_l^k(x, w_l^k)$ and $f_h(x) = f_h^k(x, w_h^k)$ for each client $k$. Parameters $w_l^k$ and $w_h^k$ are local to client $k$ and not shared. Finally, the personalized inference is

$$
\hat{y}_h = f_h(x)\, f_l(x) + \left[1 - f_h(x)\right] f_g(x). \qquad (4)
$$
3 Adaptive Expert Models for Personalization
3.1 Framework overview and motivation
In IFCA, after the training phase, the cluster model with the lowest loss on the validation set is used for all future inferences. All other cluster models are discarded in the clients. A drawback of IFCA is therefore that it does not use all the information available in the clients in the form of unused cluster models. Each client has access to the full set of cluster models, and our hypothesis is that if a client can make use of all of these models we can increase performance.

It is sometimes advantageous to incorporate a local model, as in Section 2.5, especially when the local data distribution is very different from those of other clients. We therefore modify the MoE [Listo Zec et al., 2020] to incorporate all the cluster models from IFCA [Ghosh et al., 2020] and the local model as expert models in the mixture, see Figure 2. We revise (4) to

$$
\hat{y}_h = g_l f_l^k(x) + \sum_{j=0}^{J-1} g_j^k f_g^j(x), \qquad (5)
$$

where $g_l$ is the local model expert weight, and $g_j^k$ is the cluster model expert weight for cluster $j$ from $f_h^k(x)$, see Figure 2.
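As an illustration of (5), the sketch below mixes the local expert and the J cluster experts with the gating model's softmax weights in PyTorch; the module interfaces and tensor shapes are assumptions made for this example, not the authors' implementation.

```python
# Sketch of the personalized MoE inference in eq. (5): the gating model
# outputs J + 1 softmax weights (one for the local expert, one per cluster
# expert), which are used to mix the experts' class-probability outputs.
# Module names and shapes are illustrative assumptions.
import torch

def personalized_inference(x, local_model, cluster_models, gating_model):
    with torch.no_grad():
        gate = torch.softmax(gating_model(x), dim=-1)      # (batch, J + 1)
        experts = [local_model(x)] + [f_g(x) for f_g in cluster_models]
        stacked = torch.stack(experts, dim=-1)             # (batch, classes, J + 1)
        y_hat = (stacked * gate.unsqueeze(1)).sum(dim=-1)  # weighted mixture
    return y_hat
```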
However, importantly, we note that setting $J$ in [Ghosh et al., 2020] to a large value produces few cluster models that actually converge, which lowers performance when they are used in a MoE. We differ from [Ghosh et al., 2020] in the cluster estimation step in that we select the same number of clients $K_s = \lceil CK \rceil$ in every communication round, regardless of $J$. This spreads the client updates out more evenly over the global cluster models. Since cluster models are randomly initialized, we can end up updating one cluster model more than the others by chance. In following communication rounds, a client is more likely to select this cluster model, purely because it has been updated more. This also has the effect that as $J$ increases, the quality of the updates is reduced, as they are averaged from a smaller set of clients. In turn, this means that we need more iterations to converge. Therefore, we make use of the ε-greedy algorithm [Sutton, 1995] in order to allow each client to prioritize gathering information about the cluster models (exploration) or using the estimated best cluster model (exploitation). In each iteration we select a random cluster model with probability ε and the currently best one otherwise, see Algorithm 3.
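The client-side ε-greedy choice (cf. Algorithm 3) can be sketched as follows, again assuming a generic loss_fn helper; this is an illustrative sketch rather than the authors' exact code.

```python
# Sketch of epsilon-greedy cluster selection (cf. Algorithm 3): with
# probability epsilon pick a random cluster model (exploration), otherwise
# pick the cluster model with the lowest loss on the local training set
# (exploitation). `loss_fn` is an assumed helper.
import random

def epsilon_greedy_cluster(cluster_weights, local_data, loss_fn, epsilon):
    if random.random() < epsilon:
        return random.randrange(len(cluster_weights))      # explore
    losses = [loss_fn(w_j, local_data) for w_j in cluster_weights]
    return losses.index(min(losses))                       # exploit
```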
By using the ε-greedy algorithm we make more expert models converge. We can then use the gating function $f_h^k$ from the MoE to adapt to the underlying data distributions and weigh the different expert models. We outline our setup in Figure 1 and provide details in Figure 2 and Algorithms 1 to 4.
Figure 2: Our solution with 2 global cluster models. Each client $k$ has one local expert model $f_l(x, w_l^k)$ and shares $J = 2$ expert cluster models $f_g^j(x, w_g^j)$ with all other clients. A gating model $f_h(x, w_h^k)$ is used to weigh the expert cluster models and produce a personalized inference $\hat{y}_h$ from the input $x$.
When a cluster model has converged it is not cost-effective to transmit this cluster model to every client, so by using per-model early stopping we can reduce communication in both uplink and downlink. Specifically, before training we initialize $\mathcal{J} = \{1, 2, \ldots, J\}$. When early stopping is triggered for a cluster model we remove that cluster model from the set $\mathcal{J}$. The early-stopping algorithm is described in Algorithm 1.
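One possible way to realize per-model early stopping is sketched below; the patience-based criterion is our assumption of how "early stopping triggered" in Algorithm 1 could be implemented, since the paper does not prescribe a specific rule.

```python
# Sketch of per-model early stopping: remove cluster model j from the set of
# selectable models once its validation loss has not improved for `patience`
# rounds. The patience criterion is an illustrative assumption.
def prune_converged_clusters(selectable, val_loss_history, patience=10):
    """selectable: set of cluster indices; val_loss_history: dict j -> [losses]."""
    for j in list(selectable):
        history = val_loss_history[j]
        if len(history) > patience and min(history[-patience:]) >= min(history[:-patience]):
            selectable.discard(j)  # stop training and transmitting model j
    return selectable
```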
4 Experiments
4.1 Datasets
We use three different datasets, with different non-IID characteristics, in which the task is image multi-class classification with a varying number of classes:
- CIFAR-10 [Krizhevsky, 2009], where we use a technique from [Listo Zec et al., 2020] to create client partitions with a controlled Label distribution skew, see Section 4.2;
- Rotated CIFAR-10 [Ghosh et al., 2020], where the client feature distributions are controlled by rotating CIFAR-10 images, an example of same label, different features;
- Federated Extended MNIST (FEMNIST) [Caldas et al., 2018; Cohen et al., 2017], with handwritten characters written by many writers, exhibiting many of the non-IID characteristics outlined in Section 2.2.
4.2 Non-IID sampling
In order to construct a non-IID dataset from the CIFAR-10 dataset [Krizhevsky, 2009] with the properties of class imbalance that we are interested in, we first look at [McMahan et al., 2017]. There, a pathological non-IID dataset is constructed by sorting the dataset by label, dividing it into shards of 300 data samples, and giving each client 2 shards.

However, as in [Listo Zec et al., 2020], we are interested in varying the degree of non-IIDness and therefore assign two majority classes to each client, which make up a fraction p of the data samples of the client. The remaining fraction (1 − p) is sampled uniformly from the other 8 classes. When p = 0.2 each class has an equal probability of being sampled. A similar case to the pathological non-IID setting above is represented by p = 1. In reality, p is unknown.
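The sampling procedure above can be sketched as follows; the helper name and label bookkeeping are illustrative assumptions rather than the authors' partitioning code.

```python
# Sketch of the majority-class sampling: each client gets two majority
# classes that make up a fraction p of its samples, and the remaining (1 - p)
# is drawn uniformly from the other 8 CIFAR-10 classes.
import numpy as np

def sample_client_labels(p, majority_classes, n_samples, num_classes=10, rng=None):
    rng = rng or np.random.default_rng()
    minority_classes = [c for c in range(num_classes) if c not in majority_classes]
    labels = []
    for _ in range(n_samples):
        if rng.random() < p:
            labels.append(rng.choice(majority_classes))   # majority share
        else:
            labels.append(rng.choice(minority_classes))   # uniform over the rest
    return np.array(labels)

# p = 0.2 reproduces the IID case: 0.2 / 2 = 0.8 / 8 = 0.1 per class.
print(np.bincount(sample_client_labels(0.2, [0, 1], 5000), minlength=10))
```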
4.3 Model architecture
We start with the benchmark model defined in [Caldas et al., 2018], which is a Convolutional Neural Network (CNN) model with two convolutional layers and one fully connected layer with fixed hyperparameters. However, in our case where $n_k$ is small, the local model is prone to overfitting, so it is desirable to have a model with lower capacity. Similarly, the gating model is also prone to overfitting due to both the small local dataset and the fact that it aims to solve a multi-label classification problem with fewer classes (expert models) than in the original multi-class classification problem. The local model, gating model and cluster models therefore share the same underlying architecture, but have their hyperparameters individually tuned, see Section 4.4. The AdamW [Loshchilov and Hutter, 2019] optimizer is used to train the local model and the gating model, while Stochastic Gradient Descent (SGD) [Bottou et al., 2018] is used to train the cluster models to avoid issues related to momentum parameters when averaging. We use negative log-likelihood loss in (1).
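For concreteness, a PyTorch sketch of the shared two-convolutional-layer plus one-fully-connected-layer architecture, instantiated with the global-model values from Table 1; kernel sizes, pooling and the 32x32 input resolution are our assumptions, since the text only fixes the overall structure.

```python
# Sketch of the shared two-conv + one-FC architecture used for the cluster,
# local and gating models, parameterized by the values tuned in Table 1.
# Kernel sizes, pooling and input resolution (32x32 CIFAR-10) are assumptions.
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, conv1, conv2, fc, dropout, num_outputs):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, conv1, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(conv1, conv2, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(dropout),
            nn.Linear(conv2 * 8 * 8, fc), nn.ReLU(),
            nn.Linear(fc, num_outputs),
            nn.LogSoftmax(dim=1),  # pairs with the negative log-likelihood loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Global cluster model with the Table 1 hyperparameters (10 CIFAR-10 classes);
# the gating model would instead output J + 1 expert weights.
global_model = SmallCNN(conv1=128, conv2=32, fc=1024, dropout=0.80, num_outputs=10)
print(global_model(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 10])
```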
4.4 Hyperparameter tuning
Hyperparameters are tuned using [Liaw et al., 2018] in four stages. For each model we tune the learning rate η, the number of filters in the two convolutional layers, the number of hidden units in the fully connected layer, dropout, and weight decay. For the ε-greedy exploration method we also tune ε.
Algorithm 1 Adaptive Expert Models for FL (server)
 1: procedure SERVER(C, K)
 2:   initialize J ← {1, 2, ..., J} and w_g^j(0) for all j ∈ J    ▷ Initialize J global cluster models
 3:   K_s ← ⌈CK⌉    ▷ Number of clients to select per communication round
 4:   for t ∈ {1, 2, ...} do    ▷ Until convergence
 5:     S_t ⊆ {1, 2, ..., K}, |S_t| = K_s    ▷ Random sampling of K_s clients
 6:     for all k ∈ S_t do    ▷ For all clients, in parallel
 7:       (w_k(t+1), n_k, ĵ_k) ← k.CLIENT({w_g^j : j ∈ J})    ▷ Local training (Algorithm 2)
 8:     for all j ∈ {1, 2, ..., J} do    ▷ For all cluster models
 9:       n_j ← Σ_{k ∈ {S_t | ĵ_k = j}} n_k    ▷ Total number of samples for cluster model j from clients where ĵ_k = j
10:       w_g^j(t+1) ← Σ_{k ∈ {S_t | ĵ_k = j}} (n_k / n_j) w_k(t+1)    ▷ Update cluster model j with clients where ĵ_k = j
11:       if early stopping is triggered for model j then
12:         J ← J \ {j}    ▷ Optional: remove j from the set of selectable cluster models J

Algorithm 2 Adaptive Expert Models for FL (client)
13: procedure CLIENT({w_g^j : j ∈ J})
14:   ĵ ← CL-EST(ε, {w_g^j : j ∈ J}); n_k ← |P_k|    ▷ Estimate cluster belonging (Algorithm 3)
15:   w_k(t+1) ← UPDATE(w_g^ĵ(t), n_k)    ▷ Local training using cluster model ĵ (Algorithm 4)
16:   return (w_k(t+1), n_k, ĵ)
First, we tune the hyperparameters for a local model and for a single global model. Thereafter, we tune the hyperparameters for the gating model using the best hyperparameters found in the earlier steps. Lastly, we tune ε with two cluster models, J = 2. For the no-exploration experiments we set ε = 0. Hyperparameters depend on p and J, but we tune the hyperparameters for a fixed majority class fraction p = 0.2, which corresponds to the IID case. The tuned hyperparameters are then used for all experiments. We show that our method is still robust in the fully non-IID case when p = 1. See Table 1 for the tuned hyperparameters in the CIFAR-10 experiment.
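A hypothetical sketch of one such tuning stage with Ray Tune, the tool cited as [Liaw et al., 2018]; the search space, sample count and reporting scheme are illustrative assumptions, not the authors' configuration.

```python
# Hypothetical sketch of one hyperparameter-tuning stage with Ray Tune
# ([Liaw et al., 2018]); the search space below is an assumption for
# illustration, not the exact ranges used in the paper.
from ray import tune

def train_and_evaluate(config):
    # ... build a model with config["lr"], config["conv1"], etc.,
    # train it on the training split and evaluate on validation ...
    val_accuracy = 0.0  # placeholder for the measured validation accuracy
    tune.report(accuracy=val_accuracy)

search_space = {
    "lr": tune.loguniform(1e-6, 1e-1),
    "conv1": tune.choice([16, 32, 64, 128]),
    "conv2": tune.choice([16, 32, 64, 256]),
    "fc": tune.choice([8, 256, 1024]),
    "dropout": tune.uniform(0.5, 0.9),
    "weight_decay": tune.loguniform(1e-6, 1e-1),
}

analysis = tune.run(train_and_evaluate, config=search_space,
                    num_samples=50, metric="accuracy", mode="max")
print(analysis.best_config)
```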
4.5 Results
We summarize our results for the class imbalance case, exemplified with the CIFAR-10 dataset, in Table 2. In Figure 3, we see an example of how the performance varies when we increase the non-IID-ness factor p for the case when J = 3. In Figure 3a we see the performance of IFCA [Ghosh et al., 2020] compared to our solution in Figure 3b. We also compare to: a local model fine-tuned from the best cluster model, an entirely local model, and an ensemble model where we include all cluster models as well as the local model with equal weights. In Figure 4 we vary the number of cluster models J for different values of the majority class fraction p.

An often overlooked aspect of performance in FL is the variance between clients. We achieve a smaller inter-client variance, shown for CIFAR-10 in Figure 6a and Table 2.

We see that for CIFAR-10 our ε-greedy exploration method achieves better results for lower values of p by allowing more of the cluster models to converge; thereby more cluster models become useful as experts in the MoE, even though the models are similar, see Figure 5a. For higher values of p we see that the cluster models are adapting to existing clusters in the data, see Figure 5c. The most interesting result is seen in between these extremes, see Figure 5b. We note that the same number of clients pick each cluster model as in IFCA, but we manage to make a better selection and achieve higher performance.
Figure 3: Results for CIFAR-10. Comparison between no exploration and our ε-greedy exploration method for J = 6. Our proposed MoE solution with ε-greedy exploration is superior in all cases from IID to pathological non-IID class distributions, here shown by varying the majority class fraction p. (Panels: (a) no exploration; (b) ε-greedy exploration. Each panel plots validation accuracy against the majority class fraction p for the fine-tuned, MoE, ensemble, IFCA and local models.)
For the rotated CIFAR-10 case we see that IFCA manages to assign each client to the correct clusters at J = 2, and in this same label, different features case our exploration method requires a larger J to achieve the same performance. We also note the very high ε = 0.82. More work is needed on better exploration methods for this case.

The FEMNIST dataset represents a more difficult scenario since there are many non-IID aspects in this dataset. We find that for FEMNIST the best performance is achieved when J = 9, and in Figure 6b we show the distribution of accuracy for the clients for the different models.
5 Related Work
The FEDAVG algorithm [McMahan et al., 2017] is the most prevalent algorithm for learning a global model in FL. This algorithm has demonstrated that an average over model parameters is an efficient way to aggregate local models into a global model.
Algorithm 3 Adaptive Expert Models for FL (cluster assignment)
17: procedure CL-EST(ε, {w_g^j : j ∈ J})
18:   return ĵ ← argmin_{j ∈ J} Σ_{i ∈ P_k} l(x_i, y_i, w_g^j) with probability 1 − ε,    ▷ Lowest-loss cluster model
               or ĵ ~ U{1, J} with probability ε    ▷ Random assignment

Algorithm 4 Adaptive Expert Models for FL (local update)
19: procedure UPDATE(w_k(t+1), n_k)    ▷ Mini-batch gradient descent
20:   for e ∈ {1, 2, ..., E} do    ▷ For a few epochs
21:     for all batches of size B do    ▷ Batch update
22:       w_k(t+1) ← w_k(t+1) − (η/B) ∇_{w_k(t+1)} Σ_{i=1}^{B} l(x_i, y_i, w_k(t+1))    ▷ Local parameter update
23:   return w_k(t+1)
Figure 4: Results for CIFAR-10. Comparison between no exploration (colored dashed lines) and the ε-greedy exploration method (colored solid lines). Our proposed MoE solution with the ε-greedy exploration outperforms all other solutions, including the baseline from IFCA [Ghosh et al., 2020]. It performs better the greater the non-IIDness, here seen by varying the majority class fraction p. Furthermore, our solution is robust to changes in the number of cluster models J. (Panels (a) to (e) plot validation accuracy against the number of cluster models J for p = 0.2, 0.4, 0.6, 0.8 and 1, comparing the MoE, IFCA, local, ensemble and fine-tuned models.)
Figure 5: Results for CIFAR-10. The number of clients in each cluster for the different exploration methods. Clusters are sorted, so that the lowest index corresponds to the most picked cluster. Our ε-greedy exploration method picks the cluster models more evenly. (Panels (a) to (c) show the number of clients per cluster model, sorted by frequency, for p = 0.2, 0.6 and 1.)
However, when data is non-IID, FEDAVG converges slowly or not at all. This has given rise to personalization methods for FL [Kairouz et al., 2021; Hsieh et al., 2020]. Research on how to handle non-IID data among clients is ample and expanding. Solutions include fine-tuning locally [Wang et al., 2019], meta-learning [Jiang et al., 2019; Finn et al., 2017], MAB [Shi et al., 2021], multi-task learning [Li et al., 2021], model-heterogeneous methods [He et al., 2020; Diao et al., 2021], data extension [Tijani et al., 2021], distillation-based methods [Jeong et al., 2018; Li and Wang, 2019] and Prototypical Contrastive FL [Mu et al., 2021].
Mixing local and global models has been explored by [Deng et al., 2020], where a scalar α is optimized to combine global and local models. In [Peterson et al., 2019] the authors propose to use MoE [Jacobs et al., 1991] and learn a gating function that weighs a local and a global expert to enhance user privacy. This work is developed further in [Listo Zec et al., 2020], where the authors use a gating function with larger capacity to learn a personalized model when client data is non-IID. We differ in using cluster models as expert models, and by evaluating our method on datasets with different non-IID characteristics.
Recent work has studied clustering in FL settings for non-IID data [Ghosh et al., 2020; Kim et al., 2021; Briggs et al., 2020]. In [Ghosh et al., 2020] the authors implement a clustering algorithm for handling non-IID data in the form of covariate shift. Their proposed algorithm learns one global model per cluster with a central parameter server, using the training loss of the global models on the local data of clients to perform cluster assignment. In their work, they only perform clustering in the last layer and aggregate the rest into a single model. If a global cluster model is unused for some communication rounds, it is removed from the list to reduce communication overhead. However, this means that a client cannot use other global cluster models to increase performance.
6 Discussion
We adapted the inspiring work by [Ghosh et al., 2020] to work better in our setting and efficiently learned expert models for non-IID client data. Sending all cluster models in each iteration introduces more communication overhead.
Model  | η         | Conv1 | Conv2 | FC   | Dropout | Weight Dec. | E | ε
Global | 5.86×10⁻³ | 128   | 32    | 1024 | 0.80    | 1.10×10⁻³   | 3 | 0.33
Local  | 2.69×10⁻⁴ | 32    | 256   | 256  | 0.76    | 9.89×10⁻³   | – | –
Gate   | 3×10⁻⁶    | 12    | 12    | 8    | 0.78    | 6.88×10⁻⁴   | – | –

Table 1: Tuned hyperparameters in the CIFAR-10 experiment for the global cluster models, the local models and the gating model.
p   | Exp. strategy   | # trials | MoE µ (σ)    | IFCA µ (σ)   | Ensemble µ (σ) | Fine-tuned µ (σ) | Local µ (σ)
0.2 | ε-greedy (ours) | 7        | 72.39 (1.26) | 70.38 (0.74) | 70.82 (1.87)   | 70.16 (0.61)     | 38.52 (0.75)
0.2 | No exploration  | 6        | 57.73 (1.95) | 71.25 (0.86) | 58.58 (1.96)   | 70.13 (1.04)     | 38.06 (0.87)
0.4 | ε-greedy (ours) | 6        | 72.05 (1.79) | 68.59 (1.00) | 69.96 (2.33)   | 70.28 (1.05)     | 43.36 (0.47)
0.4 | No exploration  | 9        | 60.12 (2.37) | 68.30 (1.56) | 59.54 (1.31)   | 69.42 (1.54)     | 43.16 (0.71)
0.6 | ε-greedy (ours) | 8        | 75.22 (0.75) | 66.53 (0.98) | 71.44 (1.17)   | 72.50 (0.54)     | 54.63 (0.33)
0.6 | No exploration  | 9        | 67.94 (0.75) | 61.47 (2.27) | 65.15 (0.87)   | 68.27 (1.71)     | 55.04 (0.45)
0.8 | ε-greedy (ours) | 14       | 81.09 (1.18) | 65.23 (2.40) | 74.44 (0.76)   | 80.13 (1.02)     | 69.49 (0.58)
0.8 | No exploration  | 15       | 75.04 (0.93) | 62.59 (1.95) | 70.82 (1.56)   | 76.49 (1.27)     | 69.55 (0.70)
1.0 | ε-greedy (ours) | 14       | 90.76 (0.82) | 48.79 (5.35) | 71.06 (4.03)   | 90.26 (1.02)     | 86.65 (0.39)
1.0 | No exploration  | 6        | 88.79 (0.52) | 60.97 (2.07) | 71.56 (9.12)   | 91.11 (0.33)     | 86.37 (0.31)

Table 2: Results for CIFAR-10 and p ∈ {0.2, 0.4, ..., 1} when J = 6. Mean µ and standard deviation σ for our exploration method ε-greedy and without exploration. We compare our proposed MoE solution to the baseline from IFCA [Ghosh et al., 2020]. Our proposed solution is superior in all but one case, indicated by bold numbers in the original table.
Figure 6: CDF of client accuracy. Comparison between no exploration (colored dashed lines) and the ε-greedy exploration method (colored solid lines). Our proposed MoE solution with ε-greedy exploration improves accuracy and fairness for two of the datasets. (Panels: (a) CIFAR-10, J = 6, p = 0.6; (b) FEMNIST, J = 9; (c) Rotated CIFAR-10, J = 2. Each panel plots P(X ≤ x) against validation accuracy for the different models.)
We addressed this by removing converged cluster models from the set of selectable cluster models in Algorithm 1, although this is not used in our main results. This only affects the result to a minor degree, but has a larger effect on training time due to wasting client updates on already converged models. Another improvement is the reduced complexity of the cluster assignment step. A notable difference between our work and IFCA is that we share all the weights, as opposed to only the last layer in [Ghosh et al., 2020]. These differences increase the communication overhead further, but this has not been our priority and we leave it for future work.
7 Conclusion
In this paper, we have investigated personalization in a distributed and decentralized ML setting where the data generated on the clients is heterogeneous with non-IID characteristics. We noted that neither FEDAVG nor state-of-the-art solutions achieve high performance in this setting. To address this problem, we proposed a practical framework of MoE using cluster models and local models as expert models, and improved the adaptiveness of the expert models by balancing exploration and exploitation. Specifically, we used a MoE [Listo Zec et al., 2020] to make better use of the cluster models available in the clients and added a local model. We showed that IFCA [Ghosh et al., 2020] does not work well in our setting and, inspired by the MAB field, added an ε-greedy exploration [Sutton, 1995] method to improve the adaptiveness of the cluster models, which increased their usefulness in the MoE. We evaluated our method on three datasets representing different non-IID settings, and found that our approach achieves superior performance on two of the datasets and is robust on the third. Even though we tune our algorithm and hyperparameters in the IID setting, it generalizes well in non-IID settings or with a varying number of cluster models, a testament to its robustness. For example, for CIFAR-10 we see an average accuracy improvement of 29.78 % compared to IFCA and 4.38 % compared to a local model in the pathological non-IID setting. Furthermore, our approach improved the inter-client accuracy variance by 60.39 % compared to IFCA, which indicates improved fairness, but is 60.98 % worse than a local model.

In real-world scenarios data is distributed and often displays non-IID characteristics, and we consider personalization to be a very important direction of research. Finding clusters of similar clients to make learning more efficient is still an open problem. We believe there is potential to improve the convergence of the cluster models further, and that privacy, security and system aspects provide interesting directions for future work.
Ethical Statement
There are no ethical issues.
Acknowledgment
This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation.
The computations were enabled by resources provided by
the Swedish National Infrastructure for Computing (SNIC),
partially funded by the Swedish Research Council through
grant agreement no. 2018–05973.
We thank all reviewers who made suggestions that helped
improve and clarify this manuscript, especially Dr. A. Alam,
F. Cornell, Dr. R. Gaigalas, T. Kvernvik, C. Svahn, F. Vannella,
Dr. H. Shokri Ghadikolaei, D. Sandberg and Prof. S. Haridi.
References
[Bonawitz et al., 2019] Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloé Kiddon, Jakub Konečný, Stefano Mazzocchi, Brendan McMahan, Timon Van Overveldt, David Petrou, Daniel Ramage, and Jason Roselander. Towards Federated Learning at Scale: System Design. In Proc. of Machine Learning and Systems (MLSys), Stanford, CA, USA, 2019.

[Bottou et al., 2018] Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization Methods for Large-Scale Machine Learning. SIAM Rev., 60(2):223–311, 2018.
[Briggs et al., 2020] Christopher Briggs, Zhong Fan, and Peter Andras. Federated learning with hierarchical clustering of local updates to improve training on non-IID data. In Int. Joint Conf. on Neural Networks (IJCNN), Glasgow, United Kingdom. IEEE, 2020.

[Caldas et al., 2018] Sebastian Caldas, Peter Wu, Tian Li, Jakub Konečný, H. Brendan McMahan, Virginia Smith, and Ameet Talwalkar. LEAF: A Benchmark for Federated Settings. CoRR, abs/1812.01097, 2018.

[Cohen et al., 2017] Gregory Cohen, Saeed Afshar, Jonathan Tapson, and André van Schaik. EMNIST: Extending MNIST to handwritten letters. In Int. Joint Conf. on Neural Networks (IJCNN), Anchorage, AK, USA. IEEE, 2017.

[Deng et al., 2020] Yuyang Deng, Mohammad Mahdi Kamani, and Mehrdad Mahdavi. Adaptive Personalized Federated Learning. CoRR, abs/2003.13461, 2020.

[Diao et al., 2021] Enmao Diao, Jie Ding, and Vahid Tarokh. HeteroFL: Computation and Communication Efficient Federated Learning for Heterogeneous Clients. In 9th Int. Conf. on Learning Representations (ICLR), Austria, 2021.

[Fersman et al., 2018] Elena Fersman, Julien Forgeat, Rickard Cöster, Swarup Kumar Mohalik, and Viktor Berggren. Artificial intelligence and machine learning in next-generation systems. Technical report, Ericsson Research, Ericsson AB, 2018.
[Finn et al., 2017]
Chelsea Finn, Pieter Abbeel, and Sergey
Levine. Model-Agnostic Meta-Learning for Fast Adaptation
of Deep Networks. In Proc. of the 34th Int. Conf. on
Machine Learning (ICML), Sydney, NSW, Australia. PMLR,
2017.
[GDPR, 2016]
GDPR. Regulation (EU) 2016/679 on the
protection of natural persons with regard to the processing
of personal data and the free movement of such data, 2016.
[Ghosh et al., 2020]
Avishek Ghosh, Jichan Chung, Dong
Yin, and Kannan Ramchandran. An Efficient Framework
for Clustered Federated Learning. In Advances in Neural
Information Processing Systems (NeurIPS), 2020.
[Hanzely and Richtárik, 2020] Filip Hanzely and Peter Richtárik. Federated Learning of a Mixture of Global and Local Models. CoRR, abs/2002.05516, 2020.
[He et al., 2020]
Chaoyang He, Murali Annavaram, and
Salman Avestimehr. FedNAS: Federated Deep Learning via
Neural Architecture Search. CoRR, abs/2004.08546, 2020.
[Hsieh et al., 2020] Kevin Hsieh, Amar Phanishayee, Onur Mutlu, and Phillip B. Gibbons. The Non-IID Data Quagmire of Decentralized Machine Learning. In Proc. of the 37th Int. Conf. on Machine Learning (ICML). PMLR, 2020.
[Jacobs et al., 1991]
Robert A. Jacobs, Michael I. Jordan,
Steven J. Nowlan, and Geoffrey E. Hinton. Adaptive
Mixtures of Local Experts. Neural Comput., 3(1):79–87,
1991.
[Jeong et al., 2018]
Eunjeong Jeong, Seungeun Oh, Hyesung
Kim, Jihong Park, Mehdi Bennis, and Seong-Lyun Kim.
Communication-Efficient On-Device Machine Learning:
Federated Distillation and Augmentation under Non-IID
Private Data. CoRR, abs/1811.11479, 2018.
[Jiang et al., 2019] Yihan Jiang, Jakub Konečný, Keith Rush, and Sreeram Kannan. Improving Federated Learning Personalization via Model Agnostic Meta Learning. CoRR, abs/1909.12488, 2019.
[Kairouz et al., 2021] Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista A. Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, Rafael G. L. D'Oliveira, Hubert Eichner, Salim El Rouayheb, David Evans, Josh Gardner, Zachary Garrett, Adrià Gascón, Badih Ghazi, Phillip B. Gibbons, Marco Gruteser, Zaïd Harchaoui, Chaoyang He, Lie He, Zhouyuan Huo, Ben Hutchinson, Justin Hsu, Martin Jaggi, Tara Javidi, Gauri Joshi, Mikhail Khodak, Jakub Konečný, Aleksandra Korolova, Farinaz Koushanfar, Sanmi Koyejo, Tancrède Lepoint, Yang Liu, Prateek Mittal, Mehryar Mohri, Richard Nock, Ayfer Özgür, Rasmus Pagh, Hang Qi, Daniel Ramage, Ramesh Raskar, Mariana Raykova, Dawn Song, Weikang Song, Sebastian U. Stich, Ziteng Sun, Ananda Theertha Suresh, Florian Tramèr, Praneeth Vepakomma, Jianyu Wang, Li Xiong, Zheng Xu, Qiang Yang, Felix X. Yu, Han Yu, and Sen Zhao. Advances and Open Problems in Federated Learning. Volume 14, pages 1–210, 2021.

[Kim et al., 2021] Yeongwoo Kim, Ezeddin Al Hakim, Johan Haraldson, Henrik Eriksson, José Mairton B. da Silva Jr., and Carlo Fischione. Dynamic Clustering in Federated Learning. In Int. Conf. on Communications (ICC), Montreal, QC, Canada, pages 1–6. IEEE, 2021.

[Krizhevsky, 2009] Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images. CoRR, pages 1–60, 2009.
[Li and Wang, 2019]
Daliang Li and Junpu Wang. FedMD:
Heterogenous Federated Learning via Model Distillation.
CoRR, abs/1910.03581, 2019.
[Li et al., 2021]
Tian Li, Shengyuan Hu, Ahmad Beirami, and
Virginia Smith. Ditto: Fair and Robust Federated Learning
Through Personalization. In Proc. of the 38th Int. Conf. on
Machine Learning (ICML). PMLR, 2021.
[Liaw et al., 2018] Richard Liaw, Eric Liang, Robert Nishihara, Philipp Moritz, Joseph E. Gonzalez, and Ion Stoica. Tune: A Research Platform for Distributed Model Selection and Training. CoRR, abs/1807.05118, 2018.

[Listo Zec et al., 2020] Edvin Listo Zec, Olof Mogren, John Martinsson, Leon René Sütfeld, and Daniel Gillblad. Federated learning using a mixture of experts. CoRR, abs/2010.02056, 2020.
[Loshchilov and Hutter, 2019]
Ilya Loshchilov and Frank
Hutter. Decoupled Weight Decay Regularization. In 7th Int.
Conf. on Learning Representations (ICLR), New Orleans,
LA, USA, 2019.
[McMahan et al., 2017] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proc. of the 20th Int. Conf. on Artificial Intelligence and Statistics (AISTATS), Fort Lauderdale, FL, USA, volume 54. PMLR, 2017.
[Mu et al., 2021]
Xutong Mu, Yulong Shen, Ke Cheng, Xueli
Geng, Jiaxuan Fu, Tao Zhang, and Zhiwei Zhang. FedProc:
Prototypical Contrastive Federated Learning on Non-IID
data. CoRR, abs/2109.12273, 2021.
[Peterson et al., 2019]
Daniel W. Peterson, Pallika Kanani,
and Virendra J. Marathe. Private Federated Learning with
Domain Adaptation. CoRR, abs/1912.06733, 2019.
[Shi et al., 2021]
Chengshuai Shi, Cong Shen, and Jing Yang.
Federated Multi-armed Bandits with Personalization. In
The 24th Int. Conf. on Artificial Intelligence and Statistics
(AISTATS), 2021.
[Sutton, 1995] Richard S. Sutton. Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding. In Advances in Neural Information Processing Systems (NeurIPS). MIT Press, 1995.
[Tijani et al., 2021]
Saheed A. Tijani, Xingjun Ma, Ran
Zhang, Frank Jiang, and Robin Doss. Federated Learning
with Extreme Label Skew: A Data Extension Approach. In
Int. Joint Conf. on Neural Networks (IJCNN), Shenzhen,
China. IEEE, 2021.
[Wang et al., 2019] Kangkang Wang, Rajiv Mathews, Chloé Kiddon, Hubert Eichner, Françoise Beaufays, and Daniel Ramage. Federated Evaluation of On-device Personalization. CoRR, abs/1910.10252, 2019.