This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
FEDAUX: Leveraging Unlabeled Auxiliary
Data in Federated Learning
Felix Sattler, Tim Korjakow, Roman Rischke, and Wojciech Samek, Member, IEEE
Abstract—Federated distillation (FD) is a popular novel algorithmic paradigm for Federated learning (FL), which achieves training performance competitive to prior parameter averaging-based methods, while additionally allowing the clients to train different model architectures, by distilling the client predictions on an unlabeled auxiliary set of data into a student model. In this work, we propose FEDAUX, an extension to FD, which, under the same set of assumptions, drastically improves the performance by deriving maximum utility from the unlabeled auxiliary data. FEDAUX modifies the FD training procedure in two ways: First, unsupervised pre-training on the auxiliary data is performed to find a suitable model initialization for the distributed training. Second, (ε, δ)-differentially private certainty scoring is used to weight the ensemble predictions on the auxiliary data according to the certainty of each client model. Experiments on large-scale convolutional neural networks (CNNs) and transformer models demonstrate that our proposed method achieves remarkable performance improvements over state-of-the-art FL methods, without adding appreciable computation, communication, or privacy cost. For instance, when training ResNet8 on non-independent identically distributed (non-i.i.d.) subsets of CIFAR10, FEDAUX raises the maximum achieved validation accuracy from 30.4% to 78.1%, further closing the gap to centralized training performance. Code is available at
Index Terms—Certainty-weighted aggregation, differential privacy (DP), federated distillation (FD), federated learning (FL), unsupervised pre-training.
FEDERATED learning (FL) allows distributed entities
(“clients”) to jointly train (deep) machine learning models
on their combined local data, without having to transfer this
data to a centralized location [1]. The Federated training
process is conducted over multiple communication rounds,
where, in each round, a central server aggregates the training
state of the participating learners, for instance, via a parameter averaging operation. Since local training data never
leaves the participating devices, FL can drastically improve
privacy [2]–[4], ownership rights [5], and security [6] for
Manuscript received May 28, 2021; revised September 20, 2021; accepted
November 16, 2021. This work was supported in part by the German
Federal Ministry of Education and Research (BMBF) through the Berlin
Institute for the Foundations of Learning and Data (BIFOLD) under Grant
01IS18025A and Grant 01IS18037I and in part by the EU’s Horizon 2020
Project COPA EUROPE under Grant 957059. (Corresponding author:
Wojciech Samek.)
The authors are with the Department of Artificial Intelligence, Fraunhofer Heinrich Hertz Institute, 10587 Berlin, Germany (e-mail: woj-
This article has supplementary material provided by the
authors and color versions of one or more figures available at
Digital Object Identifier 10.1109/TNNLS.2021.3129371
the participants. As the number of mobile and IoT devices and their capacities to collect and process large amounts of high-quality and privacy-sensitive data steadily grow, Federated training procedures become increasingly relevant.
While the client data in FL is typically assumed to be pri-
vate, in many real-world applications, the server additionally
has access to unlabeled auxiliary data, which roughly matches
the distribution of the client data. For instance, for many
Federated computer vision and natural language processing
problems, such auxiliary data can be given in the form of
public databases such as ImageNet [7] or WikiText [8]. These
databases contain millions to billions of data samples but are
typically lacking the necessary label information to be useful
for training task-specific models.
Recently, Federated distillation (FD), a novel algorithmic
paradigm for FL problems where such auxiliary data is avail-
able, was proposed. In contrast to classic parameter averaging-
based FL algorithms [1], [9]–[12], which require all clients' models to have the same size and structure, FD allows the
models to have the same size and structure, FD allows the
clients to train heterogeneous model architectures, by distilling
the client predictions on the auxiliary set of data into a student
model. This can be particularly beneficial in situations where
clients are running on heterogeneous hardware, and recent studies show that FD-based training also has favorable communication properties [13], [14] and can outperform parameter
averaging-based FL algorithms [15].
However, just like for their parameter-averaging-based
counterparts, the performance of FD-based learning algorithms
falls short of centralized training and deteriorates quickly
if the training data is distributed in a heterogeneous [non-independent identically distributed (non-i.i.d.)] way among the clients. In this work, we aim to further close this performance gap, by exploring the core assumption of FD-based training and deriving maximum utility from the available unlabeled auxiliary data. Our main contributions are as follows.
1) We show that a wide range of (out-of-distribution)
auxiliary datasets are suitable for self-supervised pre-
training and can drastically improve FL performance
across all levels of data heterogeneity.
2) We propose a novel certainty-weighted FD technique,
which improves the performance of FD on non-i.i.d. data
substantially, by exploiting the available auxiliary data,
addressing a long-standing problem in FL research.
3) We derive an (ε, δ)-differentially private mechanism to
constrain the privacy loss associated with transmitting
certainty scores.
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
For more information, see
Fig. 1. FL procedure of FEDAUX is organized in a preparation and a training phase: Preparation phase: P1) The unlabeled auxiliary data is used to pre-train a feature extractor (e.g., using contrastive representation learning). P2) The feature extractor is sent to the clients, where it is used to initialize the client models. Based on the extracted features, a logistic scoring head is trained to distinguish local client data from a subset of the auxiliary data. P3) The trained scoring head is sanitized using an (ε, δ)-differentially private mechanism and then used to compute (differentially private) certainty scores on the distillation data. Training phase: T1) In each communication round, a subset of the client population is selected for training. Each selected client downloads a model initialization from the server and then updates the full model fi (feature extractor and scoring head) using their private local data. T2) The locally trained classifier and scoring models fi and si are sent to the server, where they are combined into a weighted ensemble. T3) Using the unlabeled auxiliary data and the weighted ensemble as a teacher, the server distills a student model, which is used as the initialization point for the next round of Federated training. Note that, in practice, it is preferable to perform the computation of soft labels and scores at the server to save client resources.
4) We extensively evaluate our new method on a wide variety of Federated image and text classification problems, using large-scale convolutional neural networks (CNNs) and transformer models.
Notably, as we will see, the observed significant performance improvements achieved by FEDAUX are possible: 1) under the same assumptions made in the FD literature; 2) with only negligible additional computational overhead for the resource-constrained clients; and 3) with small quantifiable excess privacy loss.
The remainder of this manuscript is organized as follows:
In Section II, we give an introduction to FD and clearly state
our assumptions on the FL setting. In Section III, we describe
the components of our proposed FEDAUX algorithm, namely
unsupervised pre-training and weighted ensemble distillation
and derive an (ε, δ)-differentially private mechanism to obfus-
cate the ensemble weights. In Section IV, we provide the
detailed algorithm for the general FL setting where clients
may locally train different model architectures. In Section V,
we give an overview of the current state of research in FD, as well as FL in the presence of unlabeled auxiliary data in general. In Section VI, we perform extensive numerical studies evaluating the performance, privacy properties, and sensitivity to auxiliary data of FEDAUX against several important baseline methods in a variety of different FL scenarios, including resource-constrained settings. In Section VII, we complement
these quantitative results with a qualitative analysis of our
method, before concluding in Section VIII.
We assume the conventional FL setting, where a population of n clients is holding potentially non-i.i.d. subsets of private labeled data D1, . . . , Dn, from a training data distribution ϕ. The goal in FL is to train a single model f on the combined private data of all local clients. This is generally achieved by performing multiple communication rounds, where each round consists of the following steps.
1) A subset St ⊆ {1, . . . , n} of the client population is selected for training and downloads a model initialization from the server.
2) Starting from this model initialization, each client then proceeds to train a model fi on its local private data Di by taking multiple steps of stochastic gradient descent over the model parameters θi.
3) Finally, the updated models fi, i ∈ St, are sent back to the server, where they are aggregated to form a new server model f, which is used as the initialization point for the next round of FL.
The goal of FL is to obtain a server model f, which optimally generalizes to new samples from the training data distribution ϕ, within a minimum number of communication rounds t ≤ T.
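The round structure above can be sketched as follows. This is a minimal numpy sketch, not the paper's implementation: the client sampling fraction, the toy datasets, and the `local_train` callback are illustrative assumptions, with plain parameter averaging (as in FEDAVG) standing in for the aggregation step.

```python
import numpy as np

def fl_round(server_params, clients, local_train, sample_frac=0.4, rng=None):
    """One FL communication round: select clients, train locally, aggregate.

    `clients` is a list of local datasets; `local_train` maps
    (initial_params, local_data) -> updated params. Names are illustrative.
    """
    rng = rng or np.random.default_rng(0)
    n = len(clients)
    # 1) Select a subset S_t of the client population.
    selected = rng.choice(n, size=max(1, int(sample_frac * n)), replace=False)
    # 2) Each selected client trains starting from the server initialization.
    updates = [local_train(server_params.copy(), clients[i]) for i in selected]
    # 3) Aggregate the updates (here: plain parameter averaging).
    return np.mean(updates, axis=0)

# Toy usage: "local training" nudges parameters toward the client's data.
clients = [np.full(3, float(c)) for c in range(5)]
params = np.zeros(3)
for _ in range(10):
    params = fl_round(params, clients, lambda p, d: 0.5 * p + 0.5 * d)
```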
FD offers a new way of performing the last step of the FL protocol, namely the aggregation of the contributions of FL clients into a single server model [13], [15]–[17]. Instead of aggregating the client model parameters θi directly (for instance, via an averaging operation), the server leverages distillation [18] to train a model on the combined predictions of the client models fi on some public auxiliary set of unlabeled data

Daux ∼ ψ(X).  (2)

The distribution of the unlabeled auxiliary data ψ(X) hereby is generally assumed to deviate from the unknown private data distribution ϕ(X).
Let x ⊆ Daux be a batch of data from the auxiliary distillation dataset. Then, one iteration of distillation over the parameters of the server model θt in communication round t
Fig. 2. Weighted ensemble distillation illustrated in a toy example on the Iris dataset (data points are projected to their two principal components). Three FL clients hold disjoint non-i.i.d. subsets of the training data. Panels 1–3: Predictions made by linear classifiers trained on the data of each client. Labels and predictions are color-coded, client certainty (measured via Gaussian KDE) is visualized via the alpha channel. The mean of client predictions (panel 4) only poorly captures the distribution of training data. In contrast, the certainty-weighted mean of client predictions (panel 5) achieves much higher accuracy.
is performed as

θt ← θt − η ∂DKL(ỹ, σ(f(x, θt)))/∂θt,  with ỹ = σ(A({fi(x) | i ∈ St})).  (3)

Hereby, DKL denotes the Kullback–Leibler divergence, η > 0 is the learning rate, σ is the softmax function, and A is a mechanism to aggregate the soft labels. Existing work [15] aggregates the client predictions by taking the mean according to

A({fi(x) | i ∈ St}) = (1/|St|) Σi∈St fi(x).  (4)
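The mean aggregation and the distillation objective can be sketched as follows; this is an illustrative numpy sketch under assumed shapes and random toy logits, and in practice the gradient step on the KL objective would be taken by an autodiff framework.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mean_soft_labels(client_logits):
    """Unweighted mean over client predictions on a batch.

    client_logits: (num_clients, batch_size, num_classes)."""
    return client_logits.mean(axis=0)

rng = np.random.default_rng(0)
client_logits = rng.normal(size=(3, 8, 10))   # 3 clients, batch of 8
teacher = softmax(mean_soft_labels(client_logits))
student = softmax(rng.normal(size=(8, 10)))   # current server model output
# One distillation iteration takes a gradient step on this KL objective,
# pulling the student's softmax output toward the aggregated teacher.
kl = np.sum(teacher * (np.log(teacher) - np.log(student)), axis=-1).mean()
```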
FD is shown to yield better model fusion than parameter averaging-based techniques, like FEDAVG, resulting in better generalization performance within fewer communication rounds [15]. However, as for all other FL methods, the performance of models trained via FD still lags behind centralized training, and convergence speed suffers considerably if training data is distributed in a non-i.i.d. way among the clients.
To address these issues, in this work, we will present
two improvements to FD-based training, which, as we will
demonstrate, drastically improve training performance in FL
scenarios with both homogeneous and heterogeneous client
data, leading to greater model performance within fewer
communication rounds T.
In this section, we describe how FD-based training can
be improved by deriving maximum utility from the available
unlabeled auxiliary data. An illustration of our proposed
FEDAUX training framework is given in Fig. 1. We first
describe FEDAUX for the homogeneous setting where all
clients locally train the same model architecture. This setting
can readily be generalized to heterogeneous client model
architectures as we will describe in Section IV, where also the
detailed training procedure is given. An exhaustive qualitative
comparison between FEDAUX and baseline methods is given
in Section VII.
A. Self-Supervised Pre-Training
As the first component of the FEDAUX training procedure,
we will exploit the fact that all FD methods require access to
unlabeled auxiliary data Daux. Self-supervised representation
learning can leverage such large records of unlabeled data to
create models which extract meaningful features. For the two
types of data considered in this study—image and sequence
data—strong self-supervised training algorithms are known in
the form of contrastive representation learning [19], [20] and
next-token prediction [21], [22].
Let fi = gi ◦ hi, i = 1, . . . , n, denote a decomposition of the local client models fi into a feature extractor hi and a classification head gi. Such a decomposition can trivially be given, for instance, for CNNs and transformer models, where the feature extractor hi contains all but the final layer of the network, while the classification head is just a single fully connected layer, followed by a sigmoid activation. As part of the FEDAUX
preparation phase (cf. Fig. 1, P1) we propose to pre-train the
feature extractor models hi at the server using self-supervised training on the auxiliary data Daux. We emphasize that this
step is only performed once at the beginning of training and
makes no assumptions on the similarity between the local
training data and the auxiliary data. The pre-training operation
results in a parameterization for the feature extractor h0. Since
the training is performed at the server, using only publicly
available data, this step inflicts neither computational overhead
nor privacy loss on the resource-constrained clients.
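As a concrete illustration of contrastive representation learning, the following is a minimal numpy sketch of a SimCLR-style NT-Xent loss on two augmented views; the temperature, batch size, and random stand-in embeddings are illustrative assumptions, and real pre-training would use an autodiff framework together with data augmentations.

```python
import numpy as np

def nt_xent(z1, z2, tau=0.5):
    """SimCLR-style contrastive (NT-Xent) loss on two augmented views.

    z1[i] and z2[i] are embeddings of two views of the same sample;
    each view's positive is the other view, all remaining rows are negatives.
    """
    z = np.vstack([z1, z2])
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    n = len(z1)
    # Positive index of row i is i+n (first view) resp. i-n (second view).
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    logprob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -logprob[np.arange(2 * n), pos].mean()

rng = np.random.default_rng(0)
loss = nt_xent(rng.normal(size=(16, 8)), rng.normal(size=(16, 8)))
```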
B. Weighted Ensemble Distillation
Different studies have shown that the training speed, stability, and maximum achievable accuracy in existing FL algorithms deteriorate if the training data is distributed in a heterogeneous "non-i.i.d." way among the clients [12], [23], [24]. Federated Ensemble Distillation (FedDF) is no exception to this rule [15].
The underlying problem of combining hypotheses derived from different source domains has been explored in multiple-source domain adaptation theory [25], [26], which shows that standard convex combinations of the hypotheses of the clients, as done in [15], may perform poorly on the target domain. Instead, a distribution-weighted combination of the local hypotheses fi, obtained on data distributions Di, according to

f(x) = Σi (Di(x) / Σj Dj(x)) fi(x)  (6)
Fig. 3. Left: Toy example with three clients holding data sampled from multivariate Gaussian distributions D1, D2, and D3. All clients solve optimization problem J by contrasting their local data with the public negative data, to obtain scoring models s1, s2, s3, respectively. As can be seen in the plots to the right, our proposed scoring method approximates the robust weights proposed in [25], as it holds that si(x)/Σj sj(x) ≈ Di(x)/Σj Dj(x) on the support of the data distributions.
is shown to be robust [25], [26] (in slight abuse of notation, Di(x) hereby refers to the probability density of the local data Di). A simple toy example, displayed in Fig. 2, further illustrates this point: Displayed as scatter points are elements of the Iris dataset, projected to their two main PCA components. The training data is distributed among three clients in a non-i.i.d. fashion, with the label of each data point being indicated by the marker color in the plot. Overlaid in the background are the predictions of linear classifier models that were trained on the local data of each client. As we can see, the models which were trained on the data of clients 1 and 3 uniformly predict that all inputs belong to the "red" and "blue" class, respectively. The predictive power of these models, and consequently their value as teachers for model distillation, is thus very limited. This is also visualized in panel 4, where the mean prediction of the teacher models is displayed. We can, however, improve the teacher ensemble quite significantly, if we weight each teacher's predictions at every location x by its certainty s(x) (approximated via Gaussian KDE), illustrated via the alpha channel in panels 1–3. As we can see in panel 5, weighting the ensemble predictions raises the accuracy from 33% to 88% in this particular toy example.
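The certainty weighting used in this toy example can be sketched as follows; the Gaussian-KDE bandwidth, the two-client Gaussian clusters, and the fixed client predictions are illustrative assumptions, not the exact Iris setup of Fig. 2.

```python
import numpy as np

def kde_certainty(train, x, bw=0.5):
    """Gaussian-KDE certainty score of one client at query points x."""
    d2 = ((x[:, None, :] - train[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * bw ** 2)).mean(axis=1)

def weighted_ensemble(probs, scores):
    """Certainty-weighted mean of client predictions.

    probs: (clients, batch, classes); scores: (clients, batch)."""
    w = scores[:, :, None]
    return (w * probs).sum(axis=0) / w.sum(axis=0)

rng = np.random.default_rng(0)
# Two clients, each holding data from a different region of input space.
local_data = [rng.normal(-2.0, 0.5, (50, 2)), rng.normal(2.0, 0.5, (50, 2))]
x = np.array([[-2.0, -2.0], [2.0, 2.0]])          # two query points
scores = np.stack([kde_certainty(d, x) for d in local_data])
# Each client confidently predicts one class everywhere (the failure mode
# of unweighted averaging); weighting by certainty recovers good labels.
probs = np.stack([np.tile([0.9, 0.1], (2, 1)), np.tile([0.1, 0.9], (2, 1))])
y = weighted_ensemble(probs, scores)
```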
Based on these insights, we propose to modify the aggregation rule of FD (4) to a certainty-weighted average

Aw({fi(x) | i ∈ St}) = Σi∈St si(x) fi(x) / Σi∈St si(x).  (7)

The question remains how to calculate the certainty scores si(x) in a privacy-preserving way and for arbitrary high-dimensional data, where simple methods, such as the Gaussian KDE used in our toy example, fall victim to the curse of dimensionality. To this end, we propose the following approach: We split the available auxiliary data randomly into two disjoint subsets
D− ∪ Ddistill = Daux  (8)

the "negative" data and the "distillation" data. Using the pre-trained model h0 (Section III-A) as a feature extractor, on each client we then train a logistic regression classifier to separate the local data Di from the negatives D−,
Fig. 4. Comparison of validation performance for FD of ResNet-8 on the CIFAR-10 dataset (left) and DistillBert on the Amazon dataset (right) when different scoring techniques are used to obtain the certainty weights si(x) used during ensemble distillation. Certainty scores obtained via two-class logistic regression achieve the best performance and can readily be augmented with a differentially private mechanism.
by optimizing the following regularized empirical risk minimization (ERM) problem:

w∗i = arg minw J(w, h0, Di, D−)  (9)

with

J(w, h0, Di, D−) = a Σx∈Di∪D− l(tx⟨w, x̃⟩) + λR(w).  (10)

Hereby, tx = 2·1(x ∈ Di) − 1 ∈ {−1, 1} defines the binary labels of the separation task, a = (|Di| + |D−|)^−1 is a normalizing factor, and x̃ = h0(x)/‖h0(x)‖2 are the normalized features. We choose l(z) = log(1 + exp(−z)) to be the logistic loss and R(w) = (1/2)‖w‖2² to be the ℓ2-regularizer. Since J is λ-strongly convex in w, problem (9) is uniquely solvable. This step is performed only once on every client, during the preparation phase (cf. Fig. 1, P2), and the computational overhead for the clients of solving (9) is negligible in comparison to the cost of multiple rounds of training the (deep) model fi.
Given the solution w∗i of the regularized ERM problem (9), certainty scores on the distillation data Ddistill can be obtained via the logistic scoring head

si(x) = (1 + exp(−⟨w∗i, x̃⟩))^−1 + ξ.  (11)

A small additive ξ > 0 ensures numerical stability when taking the weighted mean in (7). We always set ξ = 10^−8.
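The scoring pipeline can be sketched as follows: solve the regularized logistic ERM by plain gradient descent on frozen, normalized features, then evaluate the logistic scoring head on distillation features. The solver, feature dimensions, hyperparameters, and synthetic Gaussian features are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def normalize(feats):
    return feats / np.linalg.norm(feats, axis=1, keepdims=True)

def fit_scoring_head(local_feats, neg_feats, lam=0.01, lr=0.1, steps=500):
    """Regularized logistic ERM by gradient descent:
    labels t = +1 for local data and t = -1 for the public negatives."""
    X = normalize(np.vstack([local_feats, neg_feats]))
    t = np.concatenate([np.ones(len(local_feats)), -np.ones(len(neg_feats))])
    a = 1.0 / len(X)                      # normalizing factor
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        z = t * (X @ w)
        # gradient of a * sum log(1 + exp(-z)) + (lam/2) * ||w||^2
        grad = a * (X.T @ (-t / (1.0 + np.exp(z)))) + lam * w
        w -= lr * grad
    return w

def certainty_scores(w, feats, xi=1e-8):
    """Logistic scoring head on normalized distillation features."""
    return 1.0 / (1.0 + np.exp(-(normalize(feats) @ w))) + xi

rng = np.random.default_rng(1)
w_star = fit_scoring_head(rng.normal(1.0, 0.3, (40, 8)),   # local features
                          rng.normal(-1.0, 0.3, (40, 8)))  # negative features
# Distillation features resembling the local data get high certainty scores.
s = certainty_scores(w_star, rng.normal(1.0, 0.3, (5, 8)))
```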
While the scores si(x) can be estimated using a number of different techniques, like density estimation, uncertainty quantification [27], or outlier detection [28], [29], we will now present three distinct motivations for using the logistic regression-based approach described above.
First of all, as illustrated using the toy example given in Fig. 3, the scores obtained via our proposed logistic regression-based approach (11) give a good approximation to the distribution weights suggested by domain adaptation theory [25].
As we can see in the panels to the right, it approximately holds that

si(x) / Σj sj(x) ≈ Di(x) / Σj Dj(x)  (12)

on the support of the data distributions.
Second, scores obtained via logistic regression yield strong
empirical performance on highly complex image data. Fig. 4 shows the maximum accuracy achieved after ten communication rounds by different weighted FedDF methods in an FL scenario with ten clients and highly heterogeneous data (α = 0.01; further details on the data splitting strategy are
given in Section VI). As we can see, the contrastive logistic
scoring approach described above distinctively outperforms
the uniform scoring approach used in [15] and also yields
better results than other generative and discriminative scoring
methods, like Gaussian KDE, Isolation Forests, or One- and
Two-Class SVMs. Details on the implementation of these
scoring methods are given in Supplementary Materials C.
Finally, as we will see in Section III-C, the logistic scoring mechanism can readily be augmented with differential privacy (DP) and provides high utility even under strong formal privacy constraints.
C. Differentially Private Weighted Ensemble Distillation
Sharing the certainty scores {si(x) | x ∈ Ddistill} with the central server intuitively causes privacy loss for the clients. After all, a high score si(x) indicates that the public data point x ∈ Ddistill is similar to the private data Di of client i (in the sense of (9)). To protect the privacy of the clients as well as quantify and limit the privacy loss, we propose to use data-level DP (cf. Fig. 1, P3). Following the classic definition of [30], a randomized mechanism is called differentially private if its output on any input database d is indistinguishable from its output on any neighboring database d′ which differs from d in one element.
Definition 1: A randomized mechanism M : D → R satisfies (ε, δ)-DP if for any two adjacent inputs d and d′ that differ in only one element and for any subset of outputs S ⊆ R, it holds that

P[M(d) ∈ S] ≤ exp(ε) P[M(d′) ∈ S] + δ.  (13)
DP of a mechanism M can be achieved by limiting its sensitivity and then applying a randomized noise mechanism. We adapt a theorem from [31] to establish the sensitivity of (9).
Theorem 1: If R(·) is differentiable and 1-strongly convex and l is differentiable with |l′(z)| ≤ 1 ∀z, then the ℓ2-sensitivity Δ2(M) of the mechanism

M : Di → arg minw J(w, h0, Di, D−)  (14)

is at most 2(λ(|Di| + |D−|))^−1.
The proof can be found in Supplementary Materials G.
As we can see, the sensitivity scales inversely with the size of the total data |Di| + |D−|. From Theorem 1 and application of the Gaussian mechanism [30], it follows that the randomized mechanism

Msan : Di → arg minw J(w, h0, Di, D−) + N  (15)

with N ∼ N(0, Iσ²) and σ² = 8 ln(1.25δ^−1)/(ε²λ²(|Di| + |D−|)²) is (ε, δ)-differentially private.
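The sanitization step can be sketched as follows; `sanitize_scoring_head` is a hypothetical helper that calibrates Gaussian noise to the sensitivity bound of Theorem 1 under the classic Gaussian-mechanism calibration.

```python
import numpy as np

def sanitize_scoring_head(w_star, eps, delta, lam, n_local, n_neg, rng=None):
    """Add Gaussian noise calibrated to the l2-sensitivity bound
    2 / (lam * (n_local + n_neg)) of the regularized ERM solution."""
    rng = rng or np.random.default_rng(0)
    sensitivity = 2.0 / (lam * (n_local + n_neg))
    # sigma = sensitivity * sqrt(2 ln(1.25/delta)) / eps, so that
    # sigma^2 = 8 ln(1.25/delta) / (eps^2 lam^2 (n_local + n_neg)^2).
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return w_star + rng.normal(0.0, sigma, size=w_star.shape)

w_star = np.ones(8)                      # illustrative scoring-head weights
w_san = sanitize_scoring_head(w_star, eps=1.0, delta=1e-5,
                              lam=0.1, n_local=500, n_neg=500)
```

Note how the noise scale shrinks as the local and negative datasets grow, so larger clients pay almost no utility cost for the same (ε, δ) guarantee.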
The post-processing property of DP ensures that the release of any number of scores computed using the output of mechanism Msan is still (ε, δ)-private. Note that in this work we
restrict ourselves to the privacy analysis of the scoring mech-
anism. The differentially private training of deep classifiers fi
is a challenge in its own right and has been addressed, for
example, in [32]. Following the basic composition theorem
[30], the total privacy cost of running FEDAUX is the sum
of the privacy loss of the scoring mechanism Msan and the
privacy loss of communicating the updated models fi (the latter is the same for all FL algorithms).
Like many other FD methods, FEDAUX can natively be applied to FL scenarios where the clients locally train different model architectures. To perform model fusion in such heterogeneous scenarios, FEDAUX constructs several prototypical models on the server, where each prototype represents all clients with identical architecture.
Let us denote by P the set of all such model prototypes. Then, we can define a HashMap R that maps each client i to its corresponding model prototype P, as well as the inverse HashMap R̃ that maps each model prototype P to the set of corresponding clients (s.t. i ∈ R̃[R[i]] ∀i).
The training procedure of FEDAUX can be divided into a preparation phase, which is given in Algorithm 1, and a training phase, which is given in Algorithm 2.
A. Preparation Phase
In the preparation phase, the server uses the unlabeled auxiliary data Daux to pre-train the feature extractor hP0 for each model prototype P using self-supervised training. Suitable methods for self-supervised pre-training are contrastive representation learning [19] or self-supervised language modeling/next-token prediction [21]. The pre-trained feature extractors hP0 are then communicated to the clients and used to initialize part of the local classifier f = g ◦ h. The server also communicates the negative data D− to the clients (in practice, we can instead communicate the extracted features {hP0(x) | x ∈ D−} of the raw data D− to save communication). Each client then optimizes the logistic similarity objective J (9) and sanitizes the output by adding properly scaled Gaussian noise. Finally, the sanitized scoring model w∗i is communicated to the server, where it is used to compute certainty scores si on the distillation data (the certainty scores can also be computed on the clients; however, this results in additional communication of distillation data and scores).
B. Training Phase
The training phase is carried out in T communication rounds. In every round t ≤ T, the server randomly selects a subset St of the overall client population and transmits to them the latest server models θR[i], which match their model
Algorithm 1 FEDAUX Preparation Phase (With Different Model Prototypes P)
init: Split D− ∪ Ddistill ← Daux
init: HashMap R that maps client i to model prototype P
Server does:
for each model prototype P ∈ P do
    hP0 ← self-supervised pre-training on Daux
end for
for each client i ∈ {1, . . . , n} in parallel do
    Client i does:
    w∗i ← arg minw J(w, hP0, Di, D−) + N  # sanitized scoring head
end for
Server does:
for i = 1, . . . , n do
    create HashMap si ← {x → (1 + exp(−⟨w∗i, h̃P0(x)⟩))^−1 + ξ for x ∈ Ddistill}
end for
Fig. 5. Illustration of the Dirichlet data splitting strategy we use throughout the article, exemplary for an FL setting with 20 clients and ten different classes. Marker size indicates the number of samples held by one client for each particular class. Lower values of α lead to more heterogeneous distributions of client data. Figure adapted from [15].
prototype P (in round t = 1, only the pre-trained feature extractor hP0 is transmitted). Each selected client updates its local model by performing multiple steps of stochastic gradient descent (or its variants) on its local training data. This results in an updated parameterization θi on every client, which is communicated to the server. After all clients have finished their local training, the server gathers the updated parameters θi.
Following the recommendations from [15], each prototypical student model is initialized with the average of the parameters from all client models which share the same architecture, according to

θP ← Σi∈St∩R̃[P] (|Di| / Σl∈St∩R̃[P] |Dl|) θi.

Using these model averages as a starting point, for each prototype, the server then distills a new model, based on the clients' certainty-weighted predictions.
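The data-size-weighted initialization can be sketched as follows; the dictionary-based stand-ins for the HashMaps and all names are illustrative.

```python
import numpy as np

def init_prototype(client_params, data_sizes, prototype_members):
    """Data-size-weighted average of the parameters of all selected
    clients that share the prototype's architecture."""
    members = [i for i in prototype_members if i in client_params]
    total = sum(data_sizes[i] for i in members)
    return sum((data_sizes[i] / total) * client_params[i] for i in members)

# Three selected clients sharing one architecture, different dataset sizes.
client_params = {0: np.zeros(4), 1: np.ones(4), 3: np.full(4, 2.0)}
data_sizes = {0: 100, 1: 300, 3: 600}
theta_p = init_prototype(client_params, data_sizes, prototype_members=[0, 1, 3])
# theta_p == 0.1*0 + 0.3*1 + 0.6*2 = 1.5 in every coordinate
```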
A. Ensemble Distillation in FL
FD is a new area of research, which has attracted tremendous attention in the past couple of years. FD techniques
Algorithm 2 FEDAUX Training Phase (With Different Model Prototypes P). Training Requires Feature Extractors hP0 and Scores si From Alg. 1. The Same D− ∪ Ddistill ← Daux as in Alg. 1 Is Used. Choose Learning Rate η and Set ξ = 10^−8
init: HashMap R that maps client i to model prototype P
init: Inverse HashMap R̃ that maps model prototype P to set of clients (s.t. i ∈ R̃[R[i]] ∀i)
init: Initialize model prototype weights θP with feature extractor weights hP0 from Alg. 1
for communication round t = 1, . . . , T do
    select subset of clients St ⊆ {1, . . . , n}
    for selected clients i ∈ St in parallel do
        Client i does:
        θi ← train(θR[i], Di)  # Local Training
    end for
    Server does:
    for each model prototype P ∈ P do
        θP ← Σi∈St∩R̃[P] (|Di| / Σl∈St∩R̃[P] |Dl|) θi  # Parameter Averaging
        for mini-batch x ⊆ Ddistill do
            ỹ ← Σi∈St si[x]σ(fi(x)) / Σi∈St si[x]  # Weighted soft labels (client models fi can be arbitrary)
            θP ← θP − η ∂DKL(ỹ, σ(f(x, θP)))/∂θP  # Optimizer step (Distillation)
        end for
    end for
end for
have at least three distinct advantages over prior, parameter averaging-based methods, and related work can be organized according to which of these aspects it primarily focuses on.
First, FD enables aggregation of client knowledge independent of the model architecture and thus allows clients to train models of different architecture, which gives additional flexibility, especially in hardware-constrained settings. FEDMD
[33], Cronus [34], and FEDH2L [35] are methods which
focus on this aspect. While the main focus of FEDAUX is to
improve performance, our proposed approach is still flexible
enough to handle heterogeneous client models as shown in
Section IV.
A second line of FD research explores the advantageous
communication properties of the framework. As models are
aggregated by means of distillation instead of parameter
averaging, it is no longer necessary to communicate the
raw parameters. Instead, it is sufficient for the clients to
only send their soft-label predictions on the distillation data.
Consequently, the communication in FD scales with the size
of the distillation dataset and not with the size of the jointly
trained model as in the classical parameter averaging-based
FL. This leads to communication savings, especially if the
local models are large and the distillation dataset is small.
Jeong et al. and subsequent work [13], [14], [16], [36] focus
on this aspect. These methods, however, are computationally more expensive for the resource-constrained clients, as distillation needs to be performed locally, and they perform worse than parameter averaging-based training after the same number of
communication rounds. We want to highlight that improving
communication efficiency is not a goal of our proposed
Fig. 6. Evaluation on different neural networks and client population sizes n. Accuracy achieved after T = 100 communication rounds by different FD methods at different levels of data heterogeneity α. STL-10 is used as auxiliary dataset. In the "Mixed" setting, one-third of the client population each trains on ResNet8, MobileNetv2, and Shufflenet, respectively. Black dashed line indicates centralized training performance.
Fig. 7. Evaluating FEDAUX on NLP benchmarks. Performance of FEDAUX for different combinations of local datasets and heterogeneity levels α.Ten
clients training TinyBERT at α=0.01 and C=100%. Bookcorpus is used as auxiliary dataset. Black dashed line indicates centralized training performance.
method, which relies on communication of full models and
thus requires communication at the order of conventional
parameter averaging-based methods.
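To make the communication-scaling argument above concrete, here is an illustrative back-of-the-envelope comparison; the distillation-set size, class count, and parameter count below are hypothetical values chosen for illustration, not measurements from this article.

```python
def fd_upload_bytes(distill_set_size, num_classes, bytes_per_value=4):
    """Per-round client upload for federated distillation:
    one soft-label vector (num_classes float32 values) per
    distillation sample."""
    return distill_set_size * num_classes * bytes_per_value


def param_avg_upload_bytes(num_parameters, bytes_per_value=4):
    """Per-round client upload for parameter averaging:
    the full float32 model."""
    return num_parameters * bytes_per_value


# Hypothetical sizes: 10 000 distillation samples with 10 classes
# vs. a model with 5 million parameters.
soft_labels = fd_upload_bytes(10_000, 10)       # 400 000 B = 0.4 MB
full_model = param_avg_upload_bytes(5_000_000)  # 20 000 000 B = 20 MB
```

Under these assumptions, soft-label upload is 50× cheaper per round, but the advantage shrinks as the distillation set grows or the model shrinks.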
Third, when combined with parameter averaging, it has been
observed that FD methods achieve better performance than
purely parameter averaging-based techniques. Lin et al. [15]
and Chen and Chao [17] propose FL protocols, which are
based on classical FEDAVG and perform ensemble distillation
after averaging the received client updates at the server to
improve performance. FEDBE, proposed by [17], additionally
combines client predictions by means of a Bayesian model
ensemble to further improve robustness of the aggregation.
Our work primarily focuses on this latter aspect. Building
upon the work of [15], we additionally leverage the auxiliary
distillation data for unsupervised pre-training and weigh the
client predictions in the distillation step according to their
certainty scores to better cope with settings where the client’s
data generating distributions are statistically heterogeneous.
We also mention the related work by Guha et al. [37], which
proposes a one-shot distillation method for convex models,
where the server distills the locally optimized client models
in a single round, as well as the work of [38], which addresses
privacy issues in FD. Federated one-shot distillation is also
addressed in [39]. FD for edge learning was proposed in [40].
B. Weighted Ensembles
FEDAUX leverages a weighted ensemble of client models
to distill the locally acquired knowledge into a central server
model. The ensemble weights are determined at an instance
level, based on the certainty of each local model’s predic-
tion. The study of weighted ensembles started around the
1990s with the work by Hashem and Schmeiser [41], Perrone
and Cooper [42], and Sollich and Krogh [43]. A weighted
ensemble of models combines the output of the individual
models by means of a weighted average in order to improve
the overall generalization performance. The weights allow us
to indicate the percentage of trust or expected performance
for each individual model. See [44], [45] for an overview of
ensemble methods. Instead of giving each client a static weight
in the aggregation step of distillation, we weight the clients on
an instance basis as in [46], that is, each client’s prediction is
weighted using a data-dependent certainty score. We note that
weighted combinations of weak classifiers are also commonly
leveraged in centralized settings in the context of mixture of
experts and boosting methods [47]–[49].
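The instance-level weighting described above can be sketched as follows; the array shapes and the origin of the certainty scores are illustrative assumptions, not the exact FEDAUX formulation.

```python
import numpy as np


def weighted_ensemble_soft_labels(client_probs, certainty_scores):
    """Combine client predictions per distillation sample.

    client_probs:     array (n_clients, n_samples, n_classes) of
                      client softmax outputs on the distillation data.
    certainty_scores: array (n_clients, n_samples) of per-instance
                      certainty weights (e.g., from a scoring model).
    Returns (n_samples, n_classes) weighted-average soft labels.
    """
    # Normalize the weights per sample, then average over clients.
    w = certainty_scores / certainty_scores.sum(axis=0, keepdims=True)
    return np.einsum("is,isc->sc", w, client_probs)


# Two clients, one sample: the more certain client dominates.
probs = np.array([[[0.9, 0.1]], [[0.2, 0.8]]])
scores = np.array([[3.0], [1.0]])
soft = weighted_ensemble_soft_labels(probs, scores)  # ≈ [[0.725, 0.275]]
```

The resulting soft labels are then used as distillation targets for the server model; a static per-client weight is the special case where the scores are constant across samples.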
C. Data Heterogeneity in FL
As we will demonstrate, FEDAUX excels, in particular,
in situations where data is distributed heterogeneously among
the clients. As the training data is generated independently on
the participating devices, this type of statistical heterogeneity
in the client data is very typical for FL problems [1]. It is
well known that conventional FL algorithms like FEDAVG
[1] perform best on statistically homogeneous data and suffer
severely in this (“non-i.i.d.”) setting [23], [24]. A number of
different studies [11], [12], [17], [23] have tried to address
this issue, but relevant performance improvements so far have
only been possible under strong assumptions. For instance,
[23] assume that the server has access to labeled public
data from the same distribution as the clients. In contrast,
we only assume that the server has access to unlabeled
public data from a potentially deviating distribution. Other
approaches [12] require high-frequency communication, with
up to thousands of communication rounds, between the server
and clients, which might be prohibitive in a majority of FL
applications where communication channels are intermittent
and slow. In contrast, our proposed approach can drastically
improve FL performance on non-i.i.d. data even after just one
single communication round. For completeness, we note that
there also exists a different line of research, which aims to
address data heterogeneity in FL via meta- and multi-task
learning. Here, separate models are trained for each client [50],
[51] or clients are grouped into different clusters with similar
distributions [52], [53].
D. Unlabeled Data in FL
FEDAUX, like all FD methods, leverages unlabeled aux-
iliary data during Federated training. To the best of our
knowledge, there do not exist any prior studies on the use of
unlabeled auxiliary data in FL outside of FD methods. Feder-
ated semi-supervised learning techniques [54], [55] assume
that clients hold both labeled and unlabeled private data
from the local training distribution. In contrast, we assume
that the server has access to public unlabeled data that may
differ in distribution from the local client data. Federated
self-supervised representation learning [56] aims to train a
feature extractor on private unlabeled client data. In contrast,
we leverage self-supervised representation learning at the
server to find a suitable model initialization.
A. Setup
1) Datasets and Models: We evaluate FEDAUX and SOTA
FL methods on both Federated image and text classifi-
cation problems with large-scale convolutional and trans-
former models, respectively. For our image classification
problems, we train ResNet- [57], MobileNet- [58], and
ShuffleNet-type [59] models on CIFAR-10 and CIFAR-100
and use STL-10, CIFAR-100, and SVHN as well as different
subsets of ImageNet (Mammals, Birds, Dogs, Devices, Inver-
tebrates, Structures)¹ as auxiliary data. In our experiments,
we always use 80% of the auxiliary data as distillation data
Ddistill and 20% as negative data D−. For our text classifi-
cation problems, we train Tiny-Bert [60] on the AG-NEWS
[61] and Multilingual Amazon Reviews Corpus [62] and use
BookCorpus [63] as auxiliary data.
2) FL Environment and Data Partitioning: We consider FL
problems with up to n=100 participating clients. In all
experiments, we split the training data evenly among the
clients according to a Dirichlet distribution following the
procedure outlined in [64] and illustrated in Fig. 5. This allows
us to smoothly adapt the level of non-i.i.d.-ness in the client
data using the Dirichlet parameter α. We experiment with
values for α varying between 100.0 and 0.01. A value of
α = 100.0 results in an almost identical label distribution,
while setting α = 0.01 results in a split where the vast
majority of data on every client stems from one single class.
See Supplementary Material A for a more detailed description
of our data splitting procedure. We vary the client participation
rate C in every round between 20% and 100%.
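A minimal sketch of such a Dirichlet partition following the procedure of [64]; this simplified version draws per-class client proportions from Dir(α) and does not enforce perfectly even client dataset sizes (the exact balancing details are described in Supplementary Material A).

```python
import numpy as np


def dirichlet_split(labels, n_clients, alpha, seed=0):
    """Partition sample indices among clients with per-class
    proportions drawn from Dir(alpha). Small alpha -> each client
    is dominated by few classes; large alpha -> near-identical
    label distributions across clients."""
    rng = np.random.default_rng(seed)
    shards = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # fraction of class-c data assigned to each client
        p = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(p) * len(idx)).astype(int)[:-1]
        for shard, part in zip(shards, np.split(idx, cuts)):
            shard.extend(part.tolist())
    return shards


# Toy dataset: 10 classes with 100 samples each, split across 5 clients.
labels = np.repeat(np.arange(10), 100)
shards = dirichlet_split(labels, n_clients=5, alpha=0.01)
assert sum(len(s) for s in shards) == len(labels)
```

With α = 0.01 nearly all of a class lands on a single client, while α = 100.0 spreads each class almost uniformly.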
¹The methodology for generating these subsets is described in Supplementary Material D.
3) Pre-Training Strategy: For our image classification prob-
lems, we use contrastive representation learning as described
in [19] for pre-training. We use the default set of data
augmentations proposed therein and train with the Adam
optimizer, learning rate set to 10⁻³, and a batch size of 512.
For our text classification problems, we pre-train using self-
supervised next-word prediction.
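For illustration, a minimal numpy sketch of the NT-Xent contrastive objective from [19] that underlies this pre-training step; this is a simplified stand-in (the actual pre-training applies the full augmentation pipeline and an encoder network to the auxiliary images before computing the loss).

```python
import numpy as np


def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent contrastive loss in the style of SimCLR [19].
    z1, z2: (batch, dim) embeddings of two augmented views of
    the same batch of images."""
    n = len(z1)
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                    # exclude self-pairs
    # the positive for sample i is its other view at index (i + n) mod 2n
    pos = np.roll(np.arange(2 * n), n)
    logits = sim - sim.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()
```

Minimizing this loss pulls the two views of each image together while pushing apart all other samples in the batch.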
4) Training the Scoring Model and Privacy Setting: We set
the default privacy parameters to λ = 0.1, ε = 0.1, and
δ = 10⁻⁵, and solve (9) by running L-BFGS [65] until
convergence (1000 steps).
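As an illustration of how such a scoring model can be released privately, the following sketches output perturbation in the style of Chaudhuri et al. [31], under the assumption of a 1-Lipschitz loss on unit-norm features; the mechanism actually used is specified in Section III-C, so treat this as an assumed simplification.

```python
import numpy as np


def privatize_scoring_model(w, n_samples, lam, eps, delta, seed=0):
    """Output-perturbation sketch: release the solution w of a
    lambda-strongly-convex ERM with Gaussian noise. For a
    1-Lipschitz loss on unit-norm features, the L2 sensitivity
    of the minimizer is 2 / (n_samples * lam)."""
    rng = np.random.default_rng(seed)
    sensitivity = 2.0 / (n_samples * lam)
    # Gaussian mechanism calibrated to (eps, delta)-DP.
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return w + rng.normal(0.0, sigma, size=w.shape)


# Default-style parameters: lam = 0.1, eps = 0.1, delta = 1e-5,
# with a hypothetical 512-dimensional scoring model and 500 samples.
w_private = privatize_scoring_model(np.zeros(512), n_samples=500,
                                    lam=0.1, eps=0.1, delta=1e-5)
```

Stronger regularization (larger λ) shrinks the sensitivity and hence the required noise, which matches the trade-off observed in Fig. 8.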
5) Baselines: We compare the performance of FEDAUX
to state-of-the-art FL methods: FEDAVG [1], FEDPROX [11],
FEDDF [15], and FEDBE [17]. To clearly discern the perfor-
mance benefits of the two components of FEDAUX (unsuper-
vised pre-training and weighted ensemble distillation), we also
report performance metrics on versions of these methods
where the auxiliary data was used to pre-train the feature
extractor h (“FEDAVG + P,” “FEDPROX + P,” “FEDDF +
P,” respectively, “FEDBE + P”). For FEDBE, we set the
sample size to 10 as suggested in [17]. For FEDPROX,
we always tune the proximal parameter μ.
6) Optimization: On all image classification tasks, we use the
popular Adam optimizer [66] with a fixed learning rate of
η = 10⁻³ and a batch size of 32 for local training. Distillation
is performed for one epoch for all methods using Adam at
a batch size of 128 and a fixed learning rate of 5 × 10⁻⁵. A more
detailed hyperparameter analysis in Supplementary Material F
shows that this choice of optimization parameters is approxi-
mately optimal for all of the methods. If not stated otherwise,
the number of local epochs E is set to 1.
B. Evaluating FEDAUX on Common FL Benchmarks
We start out by evaluating the performance of FEDAUX on
classic benchmarks for Federated image classification. Fig. 6
shows the maximum accuracy achieved by different FD meth-
ods after T = 100 communication rounds at different levels
of data heterogeneity. As we can see, FEDAUX distinctly
outperforms FEDDF on the entire range of data heterogeneity
levels α on all benchmarks. For instance, when training
ResNet8 with n = 80 clients at α = 0.01, FEDAUX raises the
maximum achieved accuracy from 30.4% to 78.1% (under the
same set of assumptions). The two components of FEDAUX,
unsupervised pre-training and weighted ensemble distillation,
both contribute independently to the performance improve-
ment, as can be seen when comparing with FEDDF + P,
which only uses unsupervised pre-training. Weighted ensemble
distillation as done in FEDAUX leads to greater or equal
performance than equally weighted distillation (FEDDF + P)
across all levels of data heterogeneity. The same overall picture
can be observed in the “Mixed” setting where one-third of the
client population each trains on ResNet8, MobileNetv2, and
ShuffleNet, respectively. (In this setting, parameter averaging
is not possible and thus FEDAVG cannot be applied.) Detailed
training curves are given in the Supplementary Material B.
Table I compares the performance of FEDAUX and baseline
methods at different client participation rates C. We can see
Fig. 8. Privacy analysis. Performance of FEDAUX for different combinations of the privacy parameters ε, δ, and λ. Forty clients training ResNet-8 for
T = 10 rounds on CIFAR-10 at α = 0.01 and C = 40%. STL-10 is used as auxiliary dataset.
Fig. 9. Linear evaluation. Training curves for different FL methods at different levels of data heterogeneity α when only the classification head g is updated
in the training phase. A total of n = 80 clients training ResNet8 on CIFAR-10 at C = 40%, using STL-10 as auxiliary dataset.
that FEDAUX benefits from higher participation rates. In all
scenarios, methods which are initialized using the pre-trained
feature extractor h0 distinctly outperform their randomly
initialized counterparts. In the i.i.d. setting at α = 100.0,
FEDAUX is mostly on par with the (improved) parameter
averaging-based methods FEDAVG + P and FEDPROX + P,
with a maximum performance gap of 0.8%. At α = 0.01,
on the other hand, FEDAUX outperforms all other methods
by a margin of up to 29%.
C. Evaluating FEDAUX on NLP Benchmarks
Fig. 7 shows learning curves for Federated training of
TinyBERT on the Amazon and AG-News datasets at two
different levels of data heterogeneity α. We observe that
FEDAUX significantly outperforms FEDDF + P as well as
FEDAVG + P in the heterogeneous setting (α = 0.01) and
reaches 95% of its final accuracy after one communication
round on both datasets, indicating suitability for one-shot
learning. On more homogeneous data (α = 1.0), FEDAUX
performs mostly on par with pre-trained versions of FEDAVG
and FEDDF, with a maximal performance gap of 1.1% accu-
racy on the test set. We note that the effects of data heterogeneity
are less severe in this setting, as the AG News and Amazon
datasets only have four and five labels, respectively, and an α
of 1.0 already leads to a distribution where each client
owns a subset of the private dataset containing all possible
labels. Further details on our implementation can be found in
Supplementary Material E.
D. Privacy Analysis of FEDAUX
Fig. 8 examines the dependence of FEDAUX’ training
performance on the privacy parameters ε, δ, and the regular-
ization parameter λ. As we can see, performance comparable
to non-private scoring is achievable at conservative privacy
parameters ε and δ. For instance, at λ = 0.01, setting ε =
0.04 and δ = 10⁻⁶ reduces the accuracy from 74.6% to
70.8%. At higher values of λ, better privacy guarantees have
an even less harmful effect, at the cost, however, of an overall
degradation in performance. Throughout this empirical study,
we have set the default privacy parameters to λ = 0.1,
ε = 0.1, and δ = 10⁻⁵. We also perform an empirical
privacy analysis in the Supplementary Material H, which
provides additional intuitive understanding and confidence in
the privacy properties of our method.
E. Evaluating the Dependence on Auxiliary Data
Next, we investigate the influence of the auxiliary dataset
Daux on unsupervised pre-training, distillation, and weighted
distillation, respectively. We use CIFAR-10 as training dataset
and consider eight different auxiliary datasets, which differ w.r.t.
their similarity to this client training data—from more simi-
lar (STL-10, CIFAR-100) to less similar (Devices, SVHN).²
Table II shows the maximum achieved accuracy after T =
100 rounds when each of these datasets is used as auxiliary
data. As we can see, performance always improves when
auxiliary data is used for unsupervised pre-training. Even for
the highly dissimilar SVHN dataset (which contains images
of house numbers), performance of FEDDF + P improves by
1% over FEDDF in both the i.i.d. and non-i.i.d. regime. For
other datasets like Dogs, Birds, or Invertebrates, performance
improves by up to 14%, although they overlap with only one
single class of the CIFAR-10 dataset. The outperformance of
FEDAUX on such a wide variety of highly dissimilar datasets
suggests that beneficial auxiliary data should be available in
²The CIFAR-10 dataset contains images from the classes airplane, automo-
bile, bird, cat, deer, dog, frog, horse, ship, and truck.
the majority of practical FL problems and also has positive
implications from the perspective of privacy. Interestingly, the
performance of FEDDF seems to only weakly correlate with
the performance of FEDDF + P and FEDAUX as a function of
the auxiliary dataset. This suggests that the properties, which
make a dataset useful for distillation, are not the same ones
that make it useful for pre-training and weighted distillation.
Investigating this relationship further is an interesting direction
of future research.
F. FEDAUX in Hardware-Constrained Settings
1) Linear Evaluation: In settings where the FL clients are
hardware-constrained mobile or IoT devices, local training of
entire deep neural networks like ResNet8 might be infeasible.
We therefore also consider the evaluation of different FL
methods, when only the linear classification head gis updated
during the training phase. Fig. 9 shows the training curves in
this setting when clients hold data from the CIFAR-10 dataset.
We see that in this setting, performance of FEDAUX is high,
independent of the data heterogeneity levels α, suggesting that
in the absence of non-convex training dynamics, our proposed
scoring method actually yields robust weighted ensembles in
the sense of [25]. We note that FEDAUX also trains much
more smoothly than all other baseline methods.
2) One-Shot Evaluation: In many FL applications, the num-
ber of times a client can participate in the Federated training
is restricted by communication, energy, and/or privacy con-
straints [37], [67]. To study these types of settings, we inves-
tigate the performance of FEDAUX and other FL methods
in Federated one-shot learning where we set T = 1 and
C = 100%. Table III compares performance in this setting
for n = 100 clients training MobileNetv2 (resp. ShuffleNet).
FEDAUX outperforms the baseline methods in this setting at
all levels of data heterogeneity α.
The experiments performed in the previous section demon-
strate that FEDAUX outperforms state-of-the-art FL methods
by wide margins, in particular, if the training data is distributed
in a heterogeneous way among the clients. In Table IV,
we additionally provide a qualitative comparison between
FEDAUX and the baseline methods FEDAVG and FEDDF.
We can note the following.
A. Client Workload
Compared with FEDAVG and FEDDF, FEDAUX addition-
ally requires the clients to solve the λ-strongly convex
ERM (9) once. For this problem, linearly convergent algorithms are
known [65] and thus the computational overhead (and energy
consumption) is negligible compared with the complexity of
multiple rounds of locally training deep neural networks.
B. Server Workload
FEDAUX also adds computational load to the server for
self-supervised pre-training and computation of the certainty
scores si. As the server is typically assumed to have massively
stronger computational resources than the clients, this can be
neglected.
C. Communication Client → Server
Once, in the preparation phase of FEDAUX, the scoring
models w∗i need to be communicated from the clients to the
server. The overhead of communicating these H-dimensional
vectors, where H is the feature dimension, is negligible
compared to the communication of the full models fi.
D. Communication Server → Clients
FEDAUX also requires the communication of the negative
data D− and the feature extractor h0 from the server to the
clients. The overhead of sending h0 is lower than sending
the full model f, and thus the total downstream commu-
nication is increased by less than a factor of (T + 1)/T.
The overhead of sending D− is small (in our experiments
|D−| = 0.2|Daux|) and can be further reduced by sending
extracted features {h0^P(x) | x ∈ D−} instead of the full data. For
instance, in our experiments with ResNet-8 and CIFAR-100,
we have |D−| = 12 000 and h0^P(x) ∈ ℝ⁵¹², resulting in a total
communication overhead of 12 000 × 512 × 4 B = 24.58 MB
for D−. For comparison, the total communication overhead of
once sending the parameters of ResNet-8 (which needs to be done
T times) is 19.79 MB.
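The overhead figure can be checked directly from the sizes stated above (the parameter count below is inferred from the cited 19.79 MB at 4 bytes per float32 parameter, not taken from the article):

```python
# Feature-based overhead for the negative data: 12 000 samples,
# 512-dimensional features, 4 bytes per float32 value.
feature_overhead_bytes = 12_000 * 512 * 4
feature_overhead_mb = feature_overhead_bytes / 1e6  # 24.576 -> ~24.58 MB

# Implied ResNet-8 size from the cited 19.79 MB upload:
resnet8_params = 19.79e6 / 4  # roughly 4.95 million parameters
```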
E. Privacy Loss
Communicating the scoring models w∗i incurs additional
privacy loss for the clients. Using our proposed sanitization
mechanism, this process is made (ε, δ)-differentially private.
Our experiments in Section VI-D demonstrate that FEDAUX
can achieve drastic performance improvements, even under
conservative privacy constraints. All empirical results reported
are obtained with (ε, δ)-DP at ε = 0.1 and δ = 10⁻⁵.
F. Assumptions
Finally, FEDAUX makes the additional assumption that
unlabeled auxiliary data is available to the server. This assump-
tion is made by all FD methods including FEDDF.
In conclusion, FEDAUX requires resources comparable to
state-of-the-art FD methods and has similar privacy proper-
ties, while at the same time achieving significantly better
performance.
In this work, we have explored FL in the presence of
unlabeled auxiliary data, an assumption made in the quickly
growing area of FD. By leveraging auxiliary data for unsuper-
vised pre-training and certainty weighted ensemble distillation,
we were able to demonstrate that this assumption is rather
strong and can lead to drastically improved performance of FL
algorithms. As we have seen, these performance improvements
can be obtained even if the distribution of the auxiliary data
is highly divergent from the client data distribution and are
maintained when the certainty scores are obfuscated using a
strong DP mechanism. Additionally, our detailed qualitative
comparison with baseline methods revealed that FEDAUX
incurs only marginal excess computation and communication
costs.
On a more fundamental note, the dramatic performance
improvements observed in FEDAUX call into question the
common practice of comparing FD-based methods (which
assume auxiliary data) with parameter averaging-based meth-
ods (which do not make this assumption) [15], [17] and thus
have implications for the future evaluation of FD methods in
general.
An interesting direction of future research would be to
explore how well FD methods and FEDAUX, in particular,
fare if only synthetically generated auxiliary data is available
for distillation and/or pre-training. First studies already show
promising results in this direction [68]. Another interesting
direction to explore would be the extension of our proposed
privacy mechanism from Section III-C to the training phase to
fully quantify the privacy loss of the FEDAUX method. Fur-
thermore, certainty estimates of client predictions as provided
by FEDAUX could also be used to detect anomalous client
behavior and thus increase adversarial robustness [69], [70].
Finally, certainty estimates could also be used to group the
client population into clusters in the spirit of [53] for improved
performance under structured heterogeneity of the client data.
[1] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. Y. Arcas,
“Communication-efficient learning of deep networks from decentralized
data,” in Proc. 20th Int. Conf. Artif. Intell. Statist. (AISTATS), 2017,
pp. 1273–1282.
[2] Q. Li et al., “A survey on federated learning systems: Vision, hype and
reality for data privacy and protection,” 2019, arXiv:1907.09693.
[3] U. Ahmed, G. Srivastava, and J. C.-W. Lin, “A federated learning
approach to frequent itemset mining in cyber-physical systems,” J. Netw.
Syst. Manage., vol. 29, no. 4, pp. 1–17, Oct. 2021.
[4] D. Połap, G. Srivastava, and K. Yu, “Agent architecture of an intelligent
medical system based on federated learning and blockchain technology,”
J. Inf. Secur. Appl., vol. 58, May 2021, Art. no. 102748.
[5] M. J. Sheller et al., “Federated learning in medicine: Facilitating multi-
institutional collaborations without sharing patient data,” Sci. Rep.,
vol. 10, no. 1, pp. 1–12, Dec. 2020.
[6] V. Mothukuri, R. M. Parizi, S. Pouriyeh, Y. Huang, A. Dehghantanha,
and G. Srivastava, “A survey on security and privacy of federated learn-
ing,” Future Gener. Comput. Syst., vol. 115, pp. 619–640, Feb. 2021.
[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet:
A large-scale hierarchical image database,” in Proc. IEEE Comput. Soc.
Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2009, pp. 248–255.
[8] S. Merity, C. Xiong, J. Bradbury, and R. Socher, “Pointer sentinel
mixture models,” 2016, arXiv:1609.07843.
[9] M. Mohri, G. Sivek, and A. T. Suresh, “Agnostic federated learning,”
in Proc. 36th Int. Conf. Mach. Learn. (ICML), 2019, pp. 4615–4625.
[10] S. Reddi et al., “Adaptive federated optimization,” 2020,
[11] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith,
“Federated optimization in heterogeneous networks,” in Proc. Mach.
Learn. Syst. (MLSys), 2020, pp. 1–22.
[12] F. Sattler, S. Wiedemann, K.-R. Müller, and W. Samek, “Robust and
communication-efficient federated learning from non-iid data,” IEEE
Trans. Neural Netw. Learn. Syst., vol. 31, no. 9, pp. 3400–3413,
Sep. 2020.
[13] S. Itahara, T. Nishio, Y. Koda, M. Morikura, and K. Yamamoto,
“Distillation-based semi-supervised federated learning for
communication-efficient collaborative training with non-IID private
data,” 2020, arXiv:2008.06180.
[14] F. Sattler, A. Marban, R. Rischke, and W. Samek, “CFD:
Communication-efficient federated distillation via soft-label quantiza-
tion and delta coding,” IEEE Trans. Netw. Sci. Eng., early access,
May 19, 2021, doi: 10.1109/TNSE.2021.3081748.
[15] T. Lin, L. Kong, S. U. Stich, and M. Jaggi, “Ensemble distillation for
robust model fusion in federated learning,” in Proc. Adv. Neural Inf.
Process. Syst. (NIPS), vol. 33, 2020, pp. 1–26.
[16] E. Jeong, S. Oh, H. Kim, J. Park, M. Bennis, and S.-L. Kim,
“Communication-efficient on-device machine learning: Federated dis-
tillation and augmentation under non-IID private data,” 2018,
[17] H.-Y. Chen and W.-L. Chao, “FedBE: Making Bayesian model ensemble
applicable to federated learning,” 2020, arXiv:2009.01974.
[18] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural
network,” 2015, arXiv:1503.02531.
[19] T. Chen, S. Kornblith, M. Norouzi, and G. E. Hinton, “A simple
framework for contrastive learning of visual representations,” in Proc.
37th Int. Conf. Mach. Learn. (ICML), 2020, pp. 1597–1607.
[20] T. Wang and P. Isola, “Understanding contrastive representation learning
through alignment and uniformity on the hypersphere,” in Proc. Int.
Conf. Mach. Learn., 2020, pp. 9929–9939.
[21] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training
of deep bidirectional transformers for language understanding,” in Proc.
Conf. North Amer. Chapter Assoc. Comput. Linguistics, Hum. Lang.
Technol. (NAACL-HLT), vol. 1, 2019, pp. 4171–4186.
[22] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever,
“Language models are unsupervised multitask learners,” OpenAI blog,
vol. 1, no. 8, p. 9, 2019.
[23] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, “Federated
learning with non-IID data,” 2018, arXiv:1806.00582.
[24] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, “On the convergence
of FedAvg on non-iid data,” in Proc. 8th Int. Conf. Learn. Represent.
(ICLR), 2020, pp. 1–26.
[25] Y. Mansour, M. Mohri, and A. Rostamizadeh, “Domain adaptation
with multiple sources,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS),
vol. 21, 2008, pp. 1041–1048.
[26] J. Hoffman, M. Mohri, and N. Zhang, “Algorithms and theory for
multiple-source adaptation,” in Proc. Adv. Neural Inf. Process. Syst.
(NIPS), vol. 31, 2018, pp. 8256–8266.
[27] L. Oala, C. Heiß, J. Macdonald, M. März, G. Kutyniok, and W. Samek,
“Detecting failure modes in image reconstructions with interval neural
network uncertainty,” Int. J. Comput. Assist. Radiol. Surgery, vol. 4,
pp. 1–9, Sep. 2021.
[28] L. Ruff et al., “A unifying review of deep and shallow anomaly
detection,” Proc. IEEE, vol. 109, no. 5, pp. 756–795, May 2021.
[29] L. Ruff et al., “Deep semi-supervised anomaly detection,” 2019,
[30] C. Dwork and A. Roth, “The algorithmic foundations of differential pri-
vacy,” Found. Trends Theor. Comput. Sci., vol. 9, nos. 3–4, pp. 211–407,
[31] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate, “Differentially
private empirical risk minimization,” J. Mach. Learn. Res., vol. 12,
pp. 1069–1109, Mar. 2011.
[32] M. Abadi et al., “Deep learning with differential privacy,” in Proc. ACM
SIGSAC Conf. Comput. Commun. Secur., Oct. 2016, pp. 308–318.
[33] D. Li and J. Wang, “FedMD: Heterogenous federated learning via model
distillation,” 2019, arXiv:1910.03581.
[34] H. Chang, V. Shejwalkar, R. Shokri, and A. Houmansadr, “Cronus:
Robust and heterogeneous collaborative learning with black-box knowl-
edge transfer,” 2019, arXiv:1912.11279.
[35] Y. Li, W. Zhou, H. Wang, H. Mi, and T. M. Hospedales, “FedH2L:
Federated learning with model and statistical heterogeneity,” 2021,
[36] H. Seo, J. Park, S. Oh, M. Bennis, and S.-L. Kim, “Federated knowledge
distillation,” 2020, arXiv:2011.02367.
[37] N. Guha, A. Talwalkar, and V. Smith, “One-shot federated learning,”
2019, arXiv:1902.11175.
[38] L. Sun and L. Lyu, “Federated model distillation with noise-free
differential privacy,” 2020, arXiv:2009.05537.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
[39] Y. Zhou, G. Pu, X. Ma, X. Li, and D. Wu, “Distilled one-shot federated
learning,” 2020, arXiv:2009.07999.
[40] J.-H. Ahn, O. Simeone, and J. Kang, “Wireless federated distillation
for distributed edge learning with heterogeneous data,” in Proc. IEEE
30th Annu. Int. Symp. Pers., Indoor Mobile Radio Commun. (PIMRC),
Sep. 2019, pp. 1–6.
[41] S. Hashem and B. Schmeiser, “Approximating a function and its deriv-
atives using mse-optimal linear combinations of trained feedforward
neural networks,” in Proc. World Congr. Neural Netw., vol. 1, 1993,
pp. 617–620.
[42] M. P. Perrone and L. N. Cooper, “When networks disagree: Ensemble
methods for hybrid neural networks,” in Neural Networks for Speech
and Image Processing, R. J. Mammone, Ed. London, U.K.: Chapman
and Hall, 1993.
[43] P. Sollich and A. Krogh, “Learning with ensembles: How overfitting can
be useful,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 8, 1995,
pp. 190–196.
[44] A. J. C. Sharkey, “On combining artificial neural nets,” Connection Sci.,
vol. 8, nos. 3–4, pp. 299–314, Dec. 1996.
[45] D. Opitz and R. Maclin, “Popular ensemble methods: An empirical
study,” J. Artif. Intell. Res., vol. 11, pp. 169–198, Aug. 1999.
[46] D. Jimenez, “Dynamically weighted ensemble neural networks for
classification,” in Proc. IEEE Int. Joint Conf. Neural Netw. World Congr.
Comput. Intell., vol. 1, May 1998, pp. 753–756.
[47] S. E. Yuksel, J. N. Wilson, and P. D. Gader, “Twenty years of mixture
of experts,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 8,
pp. 1177–1193, Aug. 2012.
Felix Sattler received the B.Sc. degree in mathematics, the M.Sc. degree in computer science, and the M.Sc. degree in applied mathematics from Technische Universität Berlin, Berlin, Germany, in 2016 and 2018, respectively.
He is currently with the Machine Learning Group, Fraunhofer Heinrich Hertz Institute, Berlin. His research interests include efficient and robust machine learning, federated learning, and multi-task learning.
Tim Korjakow received the B.Sc. degree in computer science from Technische Universität Berlin, Berlin, Germany, in 2019.
He currently works with the Machine Learning Group, Fraunhofer Heinrich Hertz Institute, Berlin. His research interests include distributed machine learning, neural networks, and interpretability methods.
Roman Rischke received the M.Sc. degree in business mathematics from Technische Universität Berlin, Berlin, Germany, in 2012, and the Dr.rer.nat. degree in mathematics from Technische Universität München, Munich, Germany, in 2016.
He currently works as a Post-Doctoral Researcher with the Department of Artificial Intelligence, Fraunhofer Heinrich Hertz Institute, Berlin. His research interests include discrete optimization under data uncertainty, efficient and robust machine learning, and federated learning.
Wojciech Samek (Member, IEEE) studied computer science at the Humboldt University of Berlin, Berlin, Germany, from 2004 to 2010. He received the Ph.D. degree (Hons.) from the Technical University of Berlin, Berlin, in 2014.
He was a Visiting Researcher with the NASA Ames Research Center, Mountain View, CA, USA. In 2014, he founded the Machine Learning Group, Fraunhofer HHI, which he directed until 2020. He is an Associated Faculty at the Berlin Institute for the Foundations of Learning and Data (BIFOLD), the ELLIS Unit Berlin, and the DFG Graduate School BIOQIC. He is currently the Head of the Department of Artificial Intelligence and the Explainable AI Group, Fraunhofer Heinrich Hertz Institute, Berlin. His research interests include deep learning, explainable AI, neural network compression, and federated learning.
Dr. Samek is an Elected Member of the IEEE MLSP Technical Committee. During his studies, he was awarded scholarships from the German Academic Scholarship Foundation and the DFG Research Training Group GRK 1589/1. He has been serving as an AC for NAACL 2021, was a recipient of multiple best paper awards, including the 2020 Pattern Recognition Best Paper Award, and a part of the MPEG-7 Part 17 standardization. He is an Editorial Board Member of Pattern Recognition, PLoS ONE, and IEEE TRANSACTIONS ON