Preprint

Towards Practical Overlay Networks for Decentralized Federated Learning

Abstract

Decentralized federated learning (DFL) uses peer-to-peer communication to avoid the single point of failure problem in federated learning and has been considered an attractive solution for machine learning tasks on distributed devices. We provide the first solution to a fundamental network problem of DFL: what overlay network should DFL use to achieve fast training of highly accurate models, low communication, and decentralized construction and maintenance? Overlay topologies of DFL have been investigated, but no existing DFL topology includes decentralized protocols for network construction and topology maintenance. Without these protocols, DFL cannot run in practice. This work presents an overlay network, called FedLay, which provides fast training and low communication cost for practical DFL. FedLay is the first solution for constructing near-random regular topologies in a decentralized manner and maintaining the topologies under node joins and failures. Experiments based on a prototype implementation and simulations show that FedLay achieves the fastest model convergence and highest accuracy on real datasets compared to existing DFL solutions while incurring small communication costs and being resilient to node joins and failures.
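
For intuition about the topology class targeted here: in a d-regular random graph every node has exactly d neighbors chosen near-uniformly at random, which yields low diameter and good expansion. The snippet below is a centralized baseline using networkx, purely illustrative; it is not FedLay's decentralized construction protocol, only the structure that protocol approximates.

```python
# Centralized baseline for illustration only; FedLay's contribution is
# approximating this structure without any central coordinator.
import networkx as nx

n, d = 100, 4                       # nodes and fixed degree (n * d must be even)
G = nx.random_regular_graph(d, n)   # every node gets exactly d random neighbors

assert all(deg == d for _, deg in G.degree())
print(nx.is_connected(G), nx.diameter(G))  # typically connected, low diameter
```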

References

Article
In recent years, Federated Learning (FL) has gained relevance in training collaborative models without sharing sensitive data. Since its birth, Centralized FL (CFL) has been the most common approach in the literature, where a central entity creates a global model. However, a centralized approach leads to increased latency due to bottlenecks, heightened vulnerability to system failures, and trustworthiness concerns affecting the entity responsible for the global model creation. Decentralized Federated Learning (DFL) emerged to address these concerns by promoting decentralized model aggregation and minimizing reliance on centralized architectures. However, despite the work done in DFL, the literature has not (i) studied the main aspects differentiating DFL and CFL; (ii) analyzed DFL frameworks to create and evaluate new solutions; and (iii) reviewed application scenarios using DFL. Thus, this article identifies and analyzes the main fundamentals of DFL in terms of federation architectures, topologies, communication mechanisms, security approaches, and key performance indicators. Additionally, the paper at hand explores existing mechanisms to optimize critical DFL fundamentals. Then, the most relevant features of the current DFL frameworks are reviewed and compared. After that, it analyzes the most used DFL application scenarios, identifying solutions based on the fundamentals and frameworks previously defined. Finally, the evolution of existing DFL solutions is studied to provide a list of trends, lessons learned, and open challenges.
Conference Paper
The stability and generalization of stochastic gradient-based methods provide valuable insights into understanding the algorithmic performance of machine learning models. As the main workhorse of deep learning, stochastic gradient descent has received a considerable amount of study. Nevertheless, the community has paid little attention to its decentralized variants. In this paper, we provide a novel formulation of decentralized stochastic gradient descent. Leveraging this formulation together with (non)convex optimization theory, we establish the first stability and generalization guarantees for decentralized stochastic gradient descent. Our theoretical results are built on a few common and mild assumptions and reveal, for the first time, that decentralization deteriorates the stability of SGD. We verify our theoretical findings using a variety of decentralized settings and benchmark machine learning models.
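
For concreteness, the decentralized SGD iteration analyzed in work of this kind is usually the gossip-averaging update below (our notation, not necessarily this paper's exact formulation): each node mixes parameters with its neighbors and then takes a local stochastic gradient step.

```latex
% One decentralized SGD step at node i, with mixing matrix W = (w_{ij}):
x_i^{(t+1)} \;=\; \sum_{j=1}^{n} w_{ij}\, x_j^{(t)} \;-\; \eta\, \nabla f_i\bigl(x_i^{(t)};\, \xi_i^{(t)}\bigr),
\qquad W \text{ doubly stochastic},\; w_{ij} = 0 \text{ if } (i,j) \notin E,\; i \neq j .
```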
Article
Federated learning allows multiple parties to jointly train a deep learning model on their combined data, without any of the participants having to reveal their local data to a centralized server. This form of privacy-preserving collaborative learning, however, comes at the cost of a significant communication overhead during training. To address this problem, several compression methods have been proposed in the distributed training literature that can reduce the amount of required communication by up to three orders of magnitude. These existing methods, however, are only of limited utility in the federated learning setting, as they either only compress the upstream communication from the clients to the server (leaving the downstream communication uncompressed) or only perform well under idealized conditions, such as i.i.d. distribution of the client data, which typically cannot be found in federated learning. In this article, we propose sparse ternary compression (STC), a new compression framework that is specifically designed to meet the requirements of the federated learning environment. STC extends the existing compression technique of top-k gradient sparsification with a novel mechanism to enable downstream compression as well as ternarization and optimal Golomb encoding of the weight updates. Our experiments on four different learning tasks demonstrate that STC distinctively outperforms federated averaging in common federated learning scenarios. These results advocate for a paradigm shift in federated optimization toward high-frequency low-bitwidth communication, in particular in the bandwidth-constrained learning environments.
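
A minimal sketch of the two core ingredients, top-k sparsification followed by ternarization (keep only the signs of the k largest-magnitude entries, scaled by their mean magnitude). This is our simplified reading of the mechanism; it omits the downstream compression and Golomb position encoding the abstract mentions.

```python
import numpy as np

def sparse_ternary_compress(update: np.ndarray, k: int) -> np.ndarray:
    """STC-style compression: top-k sparsification plus ternarization.
    Kept entries become sign * mu, where mu is their mean magnitude."""
    flat = update.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # top-k entries by magnitude
    mu = np.abs(flat[idx]).mean()                  # shared magnitude
    out = np.zeros_like(flat)
    out[idx] = np.sign(flat[idx]) * mu             # values in {-mu, 0, +mu}
    return out.reshape(update.shape)

grad = np.random.randn(10_000)
print(np.count_nonzero(sparse_ternary_compress(grad, k=100)))  # -> 100
```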
Article
In this paper, we discuss how to design the graph topology to reduce the communication complexity of certain algorithms for decentralized optimization. Our goal is to minimize the total communication needed to achieve a prescribed accuracy. We discover that the so-called expander graphs are near-optimal choices. We propose three approaches to construct expander graphs for different numbers of nodes and node degrees. Our numerical results show that the performance of decentralized optimization is significantly better on expander graphs than other regular graphs.
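
The property that makes expanders near-optimal is a spectral gap bounded away from zero at constant degree; the gap directly controls how many mixing (and thus communication) rounds are needed. A quick illustrative comparison of a ring against a random regular graph, using standard networkx/numpy calls (our code, not the paper's):

```python
import networkx as nx
import numpy as np

def spectral_gap(G: nx.Graph) -> float:
    """1 minus the second largest eigenvalue magnitude of the lazy walk
    matrix. A larger gap means faster mixing, hence less communication."""
    A = nx.to_numpy_array(G)
    deg = A.sum(axis=1)
    P = 0.5 * (np.eye(len(deg)) + A / deg[:, None])  # lazy random walk
    eig = np.sort(np.abs(np.linalg.eigvals(P)))
    return 1.0 - eig[-2]                             # eig[-1] == 1

print(spectral_gap(nx.cycle_graph(64)))              # ring: tiny gap
print(spectral_gap(nx.random_regular_graph(4, 64)))  # expander-like: large gap
```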
Article
Federated learning (FL) has emerged as a distributed machine learning (ML) technique to train models without sharing users' private data. In this paper, we introduce a decentralized FL scheme called federated learning empowered overlapped clustering for decentralized aggregation (FL-EOCD). FL-EOCD leverages device-to-device (D2D) communications and overlapped clustering to enable decentralized aggregation, where a cluster is defined as the coverage zone of a typical device. Devices located in overlapped clusters are called bridge devices (BDs). In the proposed FL-EOCD scheme, a clustering topology is envisioned in which clusters are connected through BDs, so that the aggregated model of each cluster is disseminated to the other clusters in a decentralized manner without the need for a global aggregator or an additional hop of transmission. To evaluate the proposed FL-EOCD scheme against baseline FL schemes, we consider minimizing the overall energy consumption of devices while maintaining the convergence rate of FL subject to its time constraint. To this end, a joint optimization problem, considering the scheduling of local devices/BDs to cluster heads (CHs) and computation frequency allocation, is formulated, and an iterative solution to this joint problem is devised. Extensive simulations are conducted to verify the effectiveness of the proposed FL-EOCD algorithm over conventional FL schemes in terms of energy consumption, latency, and convergence rate.
Article
Federated averaging (FedAvg) is a communication-efficient algorithm for distributed training with an enormous number of clients. In FedAvg, clients keep their data locally for privacy protection; a central parameter server is used to communicate between clients. This central server distributes the parameters to each client and collects the updated parameters from clients. FedAvg is mostly studied in a centralized fashion, requiring massive communication between the central server and clients, which can lead to channel blocking. Moreover, attacking the central server can break the whole system's privacy. Decentralization can significantly reduce the communication load of the busiest node (the central one) because all nodes communicate only with their neighbors. To this end, in this paper we study decentralized FedAvg with momentum (DFedAvgM), implemented on clients connected by an undirected graph. In DFedAvgM, all clients perform stochastic gradient descent with momentum and communicate with their neighbors only. To further reduce the communication cost, we also consider quantized DFedAvgM. The proposed algorithm involves the mixing matrix, momentum, client training with multiple local iterations, and quantization, introducing extra terms in the Lyapunov analysis. Thus, the analysis in this paper is much more challenging than that of previous decentralized (momentum) SGD or FedAvg. We prove convergence of (quantized) DFedAvgM under trivial assumptions; the convergence rate can be improved to sublinear when the loss function satisfies the Polyak-Łojasiewicz (PŁ) property. Numerically, we find that the proposed algorithm outperforms FedAvg in both convergence speed and communication cost.
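
Schematically, one DFedAvgM round combines multiple local momentum-SGD steps with a neighbor-only mixing step. The sketch below reflects our reading of that structure; the parameter names, exact ordering, and the omitted quantization step are our assumptions, not the paper's exact algorithm.

```python
import numpy as np

def dfedavgm_round(x, neighbors, W, grad_fn, eta=0.1, beta=0.9, local_steps=5):
    """x: dict node_id -> parameter vector; W[i][j]: mixing weight, nonzero
    only for j in neighbors[i] or j == i (assumed row-stochastic)."""
    v = {i: np.zeros_like(xi) for i, xi in x.items()}
    for i in x:                                   # 1) local momentum SGD
        for _ in range(local_steps):
            v[i] = beta * v[i] + grad_fn(i, x[i])
            x[i] = x[i] - eta * v[i]
    return {i: sum(W[i][j] * x[j]                 # 2) gossip with neighbors only
                   for j in list(neighbors[i]) + [i])
            for i in x}
```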
Article
Federated Learning (FL) is a new distributed machine learning (ML) approach that enables thousands of mobile devices to collaboratively train artificial intelligence (AI) models using local data without compromising user privacy. Although FL represents a promising computing paradigm, the training process cannot be fully realized without an appropriate economic mechanism that incentivizes the participation of heterogeneous clients. This work targets social cost minimization and studies incentive mechanism design in FL through a procurement auction. Different from the existing literature, we consider a practical scenario of FL where clients are selected and scheduled at different global iterations to guarantee the completion of the FL job, and we capture the distinct feature of FL that the number of global iterations is determined by the local accuracy of all participants to balance computation and communication. Our auction framework A_FL first decomposes the social cost minimization problem into a series of winner determination problems (WDPs) based on the number of global iterations. To solve each WDP, A_FL invokes a greedy algorithm to determine the winners and a payment algorithm to compute remuneration to the winners. Finally, A_FL returns the best solution among all WDPs. We carry out theoretical analysis to prove that A_FL is truthful, individually rational, and computationally efficient, and achieves near-optimal social cost. We further extend our model to consider multiple FL jobs with corresponding budgets and propose another efficient algorithm, A_FL-M, to solve the extended problem. We conduct large-scale simulations based on real-world data and testbed experiments adopting the FL frameworks FAVOR and CoCoA. Simulation and experiment results show that both A_FL and A_FL-M can reduce the social cost by up to 55% compared with state-of-the-art algorithms.
Article
Decentralized learning involves training machine learning models over remote mobile devices, edge servers, or cloud servers while keeping data localized. Many studies have shown the feasibility of preserving privacy, enhancing training performance, or introducing Byzantine resilience, but none considers all of them simultaneously. We therefore face the following problem: how can we efficiently coordinate the decentralized learning process while simultaneously maintaining learning security and data privacy for the entire system? To address this issue, in this paper we propose SPDL, a blockchain-secured and privacy-preserving decentralized learning system. SPDL seamlessly integrates blockchain, Byzantine Fault-Tolerant (BFT) consensus, a BFT Gradient Aggregation Rule (GAR), and differential privacy into one system, ensuring efficient machine learning while maintaining data privacy, Byzantine fault tolerance, transparency, and traceability. To validate our approach, we provide rigorous analysis of convergence and regret in the presence of Byzantine nodes. We also build an SPDL prototype and conduct extensive experiments to demonstrate that SPDL is effective and efficient with strong security and privacy guarantees.
Article
Federated learning can achieve the purpose of distributed machine learning without sharing the private and sensitive data of end devices. However, highly concurrent access to the server increases the transmission delay of model updates, and a local model may be an unnecessary one whose gradient opposes that of the global model, incurring a large amount of additional communication cost. To this end, we study a framework of edge-based communication optimization that reduces the number of end devices directly connected to the server while avoiding the upload of unnecessary local updates. Specifically, we cluster devices in the same network location and deploy mobile edge nodes in different network locations to serve as hubs for communication between the cloud and end devices, thereby avoiding the latency associated with high server concurrency. Meanwhile, we propose a model cleaning method based on cosine similarity: if the similarity value is less than a preset threshold, the local update is not uploaded to the mobile edge nodes, thus avoiding unnecessary communication. Experimental results show that, compared with traditional federated learning, the proposed scheme reduces the number of local updates by 60% and accelerates the convergence speed of the regression model by 10.3%.
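
The cosine-similarity cleaning rule is compact enough to state in code. A hedged sketch follows; the default threshold and the exact reference vector being compared against (e.g. the previous global update) are our assumptions.

```python
import numpy as np

def should_upload(local_update: np.ndarray,
                  global_direction: np.ndarray,
                  threshold: float = 0.0) -> bool:
    """Upload only if the local update roughly agrees in direction with the
    global reference; negative similarity indicates an opposing gradient."""
    denom = (np.linalg.norm(local_update) *
             np.linalg.norm(global_direction)) + 1e-12  # avoid divide-by-zero
    return float(np.dot(local_update, global_direction)) / denom >= threshold
```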
Article
Federated learning involves training statistical models over remote devices or siloed data centers, such as mobile phones or hospitals, while keeping data localized. Training in heterogeneous and potentially massive networks introduces novel challenges that require a fundamental departure from standard approaches for large-scale machine learning, distributed optimization, and privacy-preserving data analysis. In this article, we discuss the unique characteristics and challenges of federated learning, provide a broad overview of current approaches, and outline several directions of future work that are relevant to a wide range of research communities.
Article
Our personal social networks are big and cluttered, and currently there is no good way to organize them. Social networking sites allow users to manually categorize their friends into social circles (e.g., 'circles' on Google+, and 'lists' on Facebook and Twitter); however, these are laborious to construct and must be updated whenever a user's network grows. We define a novel machine learning task of identifying users' social circles. We pose the problem as a node clustering problem on a user's ego-network, a network of connections between her friends. We develop a model for detecting circles that combines network structure as well as user profile information. For each circle we learn its members and the circle-specific user profile similarity metric. Modeling node membership to multiple circles allows us to detect overlapping as well as hierarchically nested circles. Experiments show that our model accurately identifies circles on a diverse set of data from Facebook, Google+, and Twitter, for all of which we obtain hand-labeled ground truth.
Article
Data center applications require the network to be scalable and bandwidth-rich. Current data center network architectures often use rigid topologies to increase network bandwidth. A major limitation is that they can hardly support incremental network growth. Recent work proposes using random interconnects to provide growth flexibility. However, routing on a random topology suffers from control- and data-plane scalability problems, because routing decisions require global information and forwarding state cannot be aggregated. In this paper we design a novel flexible data center network architecture, Space Shuffle (S2), which applies greedy routing on multiple ring spaces to achieve high throughput, scalability, and flexibility. The proposed greedy routing protocol of S2 effectively exploits the path diversity of densely connected topologies and enables key-based routing. Extensive experimental studies show that S2 provides high bisection bandwidth and throughput, near-optimal routing path lengths, extremely small forwarding state, fairness among concurrent data flows, and resiliency to network failures.
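
The heart of S2-style forwarding is a greedy rule over circular distance: hand the packet to whichever neighbor is closest to the destination key, and deliver locally when no neighbor improves. A single-ring sketch follows (S2 itself combines multiple ring spaces; this simplification and the helper names are ours):

```python
def ring_distance(a: float, b: float) -> float:
    """Circular distance between two positions on the unit ring [0, 1)."""
    d = abs(a - b)
    return min(d, 1.0 - d)

def greedy_next_hop(me: float, neighbor_positions: list, dest: float):
    """Return the neighbor closest to dest, or None to deliver locally."""
    best = min(neighbor_positions, key=lambda p: ring_distance(p, dest))
    return best if ring_distance(best, dest) < ring_distance(me, dest) else None
```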
Article
In this issue, “Best of the Web” presents the modified National Institute of Standards and Technology (MNIST) resources, consisting of a collection of handwritten digit images used extensively in optical character recognition and machine learning research.
Conference Paper
We design a new suite of protocols for a set of nodes in d dimensions (d > 1) to construct and maintain a distributed Delaunay triangulation (DT) in a dynamic environment. The join, leave, and failure protocols in the suite are proved correct for a single join, leave, and failure, respectively. For a system under churn, it is impossible to maintain a correct distributed DT continually. We define an accuracy metric such that accuracy is 100% if and only if the distributed DT is correct. The suite also includes a maintenance protocol designed to recover from incorrect system states and to improve accuracy. In designing the protocols, we make use of two novel observations to substantially improve protocol efficiency. First, in the neighbor discovery process of a node, many replies to the node's queries contain redundant information. Second, a new failure protocol that employs a proactive approach to recovery performs better than the reactive approaches used in prior work. Experimental results show that our new suite of protocols maintains high accuracy for systems under churn and that each system converges to 100% accuracy after churning stops. The protocols are much more efficient than those in prior work.
Article
Industry experience indicates that the ability to incrementally expand data centers is essential. However, existing high-bandwidth network designs have rigid structure that interferes with incremental expansion. We present Jellyfish, a high-capacity network interconnect which, by adopting a random graph topology, lends itself naturally to incremental expansion. Somewhat surprisingly, Jellyfish is more cost-efficient than a fat-tree: a Jellyfish interconnect built using the same equipment as a fat-tree supports as many as 25% more servers at full capacity at the scale of a few thousand nodes, and this advantage improves with scale. Jellyfish also allows great flexibility in building networks with different degrees of oversubscription. However, Jellyfish's unstructured design brings new challenges in routing, physical layout, and wiring. We describe and evaluate approaches that resolve these challenges effectively, indicating that Jellyfish could be deployed in today's data centers.
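
A Jellyfish-style topology can be sampled with a simple randomized procedure: repeatedly link random switch pairs that both have free ports; if a switch is left stranded with two or more free ports, splice it into a random existing link. The sketch below is our rendering of that procedure, not the authors' reference code:

```python
import random

def jellyfish(num_switches: int, ports: int) -> set:
    """Sample a Jellyfish-style random interconnect as a set of links."""
    free = {s: ports for s in range(num_switches)}
    links = set()
    while True:  # phase 1: join random pairs that both have free ports
        open_sw = [s for s in free if free[s] > 0]
        cand = [(a, b) for a in open_sw for b in open_sw
                if a < b and (a, b) not in links]
        if not cand:
            break
        a, b = random.choice(cand)
        links.add((a, b)); free[a] -= 1; free[b] -= 1
    for s in list(free):  # phase 2: splice stranded switches into old links
        while free[s] >= 2:
            cand = [(x, y) for (x, y) in links if s not in (x, y)
                    and tuple(sorted((s, x))) not in links
                    and tuple(sorted((s, y))) not in links]
            if not cand:
                break
            x, y = random.choice(cand)       # break (x, y), add (s, x), (s, y)
            links.remove((x, y))
            links.add(tuple(sorted((s, x)))); links.add(tuple(sorted((s, y))))
            free[s] -= 2
    return links
```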
Article
We consider a symmetric random walk on a connected graph, where each edge is labeled with the probability of transition between the two adjacent vertices. The associated Markov chain has a uniform equilibrium distribution; the rate of convergence to this distribution, i.e., the mixing rate of the Markov chain, is determined by the second largest (in magnitude) eigenvalue of the transition matrix. In this paper we address the problem of assigning probabilities to the edges of the graph in such a way as to minimize the second largest eigenvalue magnitude, i.e., the problem of finding the fastest mixing Markov chain on the graph.
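
Written out in our notation, this is a convex eigenvalue optimization over transition matrices supported on the graph's edges:

```latex
% Fastest mixing Markov chain on G = (V, E); the eigenvalues of P are
% 1 = \lambda_1(P) \ge \lambda_2(P) \ge \dots \ge \lambda_n(P).
\begin{aligned}
\min_{P \in \mathbb{R}^{n \times n}} \quad & \mu(P) = \max\{\lambda_2(P),\, -\lambda_n(P)\} \\
\text{s.t.} \quad & P = P^{\top}, \quad P \mathbf{1} = \mathbf{1}, \quad P \ge 0, \\
& P_{ij} = 0 \ \text{if} \ (i,j) \notin E \ \text{and} \ i \neq j .
\end{aligned}
```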
Article
We propose a family of constant-degree routing networks of logarithmic diameter, with the additional property that the addition or removal of a node to the network requires no global coordination, only a constant number of linkage changes in expectation, and a logarithmic number with high probability. Our randomized construction improves upon existing solutions, such as balanced search trees, by ensuring that the congestion of the network is always within a logarithmic factor of the optimum with high probability. Our construction derives from recent advances in the study of peer-to-peer lookup networks, where rapid changes require efficient and distributed maintenance, and where the lookup efficiency is impacted both by the lengths of paths to requested data and the presence or elimination of bottlenecks in the network.
Article
Efficiently determining the node that stores a data item in a distributed network is an important and challenging problem. This paper describes the motivation and design of the Chord system, a decentralized lookup service that stores key/value pairs for such networks. The Chord protocol takes as input an m-bit identifier (derived by hashing a higher-level application specific key), and returns the node that stores the value corresponding to that key. Each Chord node is identified by an m-bit identifier and each node stores the key identifiers in the system closest to the node's identifier. Each node maintains an m-entry routing table that allows it to look up keys efficiently. Results from theoretical analysis, simulations, and experiments show that Chord is incrementally scalable, with insertion and lookup costs scaling logarithmically with the number of Chord nodes.
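
The lookup described above reduces to a short greedy rule: if the key falls between a node and its successor, the successor is responsible; otherwise jump to the farthest finger preceding the key. A toy static-ring sketch (no joins or stabilization; the identifier arithmetic and helper names are ours):

```python
M = 6                                        # m-bit identifiers, ring size 2**M

def between(x, a, b):
    """True if x lies in the circular interval (a, b] modulo 2**M."""
    x, a, b = x % 2**M, a % 2**M, b % 2**M
    return (a < x <= b) if a < b else (x > a or x <= b)

class ChordNode:
    def __init__(self, node_id, all_ids):
        self.id = node_id
        ids = sorted(all_ids)
        succ = lambda k: next((n for n in ids if n >= k % 2**M), ids[0])
        self.successor = succ(node_id + 1)
        # finger[i] = successor(id + 2^i): exponentially spaced shortcuts
        self.fingers = [succ(node_id + 2**i) for i in range(M)]

def lookup(nodes, start, key, hops=0):
    cur = nodes[start]
    if key == cur.id or between(key, cur.id, cur.successor):
        return (cur.id if key == cur.id else cur.successor), hops
    # farthest finger that precedes the key, falling back to the successor
    nxt = next((f for f in reversed(cur.fingers)
                if f != cur.id and between(f, cur.id, key)), cur.successor)
    return lookup(nodes, nxt, key, hops + 1)

ids = [1, 8, 14, 21, 32, 38, 42, 48]
nodes = {i: ChordNode(i, ids) for i in ids}
print(lookup(nodes, start=1, key=54))        # -> (1, 2): wraps past node 48
```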
Federated learning based on dynamic regularization
Durmus Alp Emre Acar, Yue Zhao, Ramon Matas, Matthew Mattina, Paul Whatmough, and Venkatesh Saligrama. Federated learning based on dynamic regularization. In Proc. of ICLR, 2021.
Leaf: A benchmark for federated settings
Sebastian Caldas, Sai Meher Karthik Duddu, Peter Wu, Tian Li, Jakub Konečný, H. Brendan McMahan, Virginia Smith, and Ameet Talwalkar. Leaf: A benchmark for federated settings, 2019.
Communication efficient framework for decentralized machine learning
Anis Elgabli, Jihong Park, Amrit S Bedi, Mehdi Bennis, and Vaneet Aggarwal. Communication efficient framework for decentralized machine learning. In Proc. of IEEE CISS, 2020.
Cola: Decentralized linear learning
Lie He, An Bian, and Martin Jaggi. Cola: Decentralized linear learning. In Proc. of NIPS, 2018.
Gaia: Geo-Distributed machine learning approaching LAN speeds
Kevin Hsieh, Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R. Ganger, Phillip B. Gibbons, and Onur Mutlu. Gaia: Geo-Distributed machine learning approaching LAN speeds. In Proc. of USENIX NSDI, 2017.
Scaffold: Stochastic controlled averaging for federated learning
Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning. In Proc. of ICML, 2020.
Geographic Routing in d-dimensional Spaces with Guaranteed Delivery and Low Stretch
Simon S. Lam and Chen Qian. Geographic Routing in d-dimensional Spaces with Guaranteed Delivery and Low Stretch. In Proceedings of ACM SIGMETRICS, 2011.
Communication-efficient learning of deep networks from decentralized data
Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proc. of PMLR AISTATS, 2017.
Automatic differentiation in pytorch
A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. Tech Report, 2017.
Fedsplit: An algorithmic framework for fast federated optimization
Reese Pathak and Martin J Wainwright. Fedsplit: An algorithmic framework for fast federated optimization. arXiv preprint arXiv:2005.05238, 2020.
Adaptive federated optimization
Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečnỳ, Sanjiv Kumar, and H Brendan McMahan. Adaptive federated optimization. arXiv preprint arXiv:2003.00295, 2020.
Beyond spectral gap: The role of the topology in decentralized learning
Thijs Vogels, Hadrien Hendrikx, and Martin Jaggi. Beyond spectral gap: The role of the topology in decentralized learning. arXiv preprint arXiv:2206.03093, 2022.
Federated learning with non-iid data
Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated learning with non-iid data. arXiv preprint arXiv:1806.00582, 2018.