Conference Paper

Predictive switch-controller association and control devolution for SDN systems

Authors:
  • Shenzhen Institute of Artificial Intelligence and Robotics for Society
To read the full-text of this research, you can request a copy directly from the authors.

Abstract

In software-defined networking (SDN) systems, the scalability and reliability of the control plane still remain as major concerns. Existing solutions adopt either multi-controller designs or control devolution back to the data plane. The former requires a flexible yet efficient switch-controller association mechanism to adapt to workload changes and potential failures, while the latter demands timely decision making with low overheads. The integrate design for both is even more challenging. Meanwhile, the dramatic advancement in machine learning techniques has boosted the practice of predictive scheduling to improve the responsiveness in various systems. Nonetheless, so far little work has been conducted for SDN systems. In this paper, we study the joint problem of dynamic switch-controller association and control devolution, while investigating the benefits of predictive scheduling in SDN systems. We propose POSCAD, an efficient, online, and distributed scheme that exploits predictive future information to minimize the total system cost and the average request response time with queueing stability guarantee. Theoretical analysis and trace-driven simulation results show that POSCAD requires only mild-value of future information to achieve a near-optimal system cost and near-zero average request response time. Further, POSCAD is robust against mis-prediction to reduce the average request response time.

No full-text available

Request Full-text Paper PDF

To read the full-text of this research,
you can request a copy directly from the authors.

... Among such works, there are two lines of research in recent years that focus on optimizing the effective control and optimization for switch-controller association [3]- [12] and control devolution [13]- [16], respectively. Instead of studying these two topics separately, later works [17] [27] considered the problem of joint switch-controller association and control devolution, then proposed online control and predictive control schemes to optimize system costs with queue stability guarantee, respectively. Although the problem studied by such works is similar to our work, we would like to point out that all such works implicitly assume the full availability of instant system dynamics, which is hard to attain in practice. ...
Preprint
In software-defined networking (SDN) systems, it is a common practice to adopt a multi-controller design and control devolution techniques to improve the performance of the control plane. However, in such systems, the decision-making for joint switch-controller association and control devolution often involves various uncertainties, e.g., the temporal variations of controller accessibility, and computation and communication costs of switches. In practice, statistics of such uncertainties are unattainable and need to be learned in an online fashion, calling for an integrated design of learning and control. In this paper, we formulate a stochastic network optimization problem that aims to minimize time-average system costs and ensure queue stability. By transforming the problem into a combinatorial multi-armed bandit problem with long-term stability constraints, we adopt bandit learning methods and optimal control techniques to handle the exploration-exploitation tradeoff and long-term stability constraints, respectively. Through an integrated design of online learning and online control, we propose an effective Learning-Aided Switch-Controller Association and Control Devolution (LASAC) scheme. Our theoretical analysis and simulation results show that LASAC achieves a tunable tradeoff between queue stability and system cost reduction with a sublinear time-averaged regret bound over a finite time horizon.
Article
A distributed control plane is more scalable and robust in software defined networking. This paper focuses on controller load balancing using packet-in request redirection, that is, given the instantaneous state of the system, determining whether to redirect packet-in requests for each switch, such that the overall control plane response time (CPRT) is minimized. To address the above problem, we propose a framework based on Lyapunov optimization. First, we use the drift-plus-penalty algorithm to combine CPRT minimization problem with controller capacity constraints, and further derive a non-linear program, whose optimal solution is obtained with brute force using standard linearization techniques. Second, we present a greedy strategy to efficiently obtain a solution with a bounded approximation ratio. Third, we reformulate the program as a problem of maximizing a non-monotone submodular function subject to matroid constraints. We implement a controller prototype for packet-in request redirection, and conduct trace-driven simulations to validate our theoretical results. The results show that our algorithms can reduce the average CPRT by 81.6% compared to static assignment, and achieve a 3× improvement in maximum controller capacity violation ratio.
Article
Software defined networking (SDN) is regarded as a new paradigm of flow management in data center networks (DCNs). In SDN, the existing methods of managing flows are centralized by controller in the control plane and highly rely on the switches in the data plane to get forwarding rules from controller. Therefore, it is of great significance to consider the mapping relationship between the controller and the switch based on flow types. To address this issue, this paper first designs a flow-based two-tier centralized management framework for software defined DCNs and formulates a dynamic mapping problem using integer programming called cost-effective adaptive controller provisioning (Cost_EACP) problem, which is proved to be NP-complete. Then, in order to solve this problem, we transform the integer programming problem into fraction programming problem. Subsequently, we design a rounding-based Cost_EACP algorithm (RC_EACP) and propose a switch controller mapping algorithm (SCM_EACP) based on RC_EACP algorithm to address the mapping problem. Furthermore, we analyze the approximation performance and time complexity of proposed algorithm. Finally, experimental results demonstrate that the proposed algorithm is more effective than existing algorithms in terms of reducing the control channel bandwidth cost and delay cost for setting up flow rules, and the gap between the proposed algorithm and the optimal one is less than 24%.
Article
In software-defined networking (SDN) systems, it is a common practice to adopt a multi-controller design and control devolution techniques to improve the performance of the control plane. However, in such systems the decision-making for joint switch-controller association and control devolution often involves various uncertainties, e.g., the temporal variations of controller accessibility, and computation and communication costs of switches. In practice, statistics of such uncertainties are unattainable and need to be learned in an online fashion, calling for an integrated design of learning and control. In this article, we formulate a stochastic network optimization problem that aims to minimize time-average system costs and ensure queue stability. By transforming the problem into a combinatorial multi-armed bandit problem with long-term stability constraints, we adopt bandit learning methods and optimal control techniques to handle the exploration-exploitation tradeoff and long-term stability constraints, respectively. Through an integrated design of online learning and online control, we propose an effective Learning-Aided Switch-Controller Association and Control Devolution ( LASAC ) scheme. Our theoretical analysis and simulation results show that LASAC achieves a tunable tradeoff between queue stability and system cost reduction with a sublinear time-averaged regret bound over a finite time horizon.
Article
For wireless caching networks, the scheme design for content delivery is non-trivial in the face of the following tradeoff. On one hand, to optimize overall throughput, users can associate their nearby APs with great channel capacities; however, this may lead to unstable queue backlogs on APs and prolong request delays. On the other hand, to ensure queue stability, some users may have to associate APs with inferior channel states, which would incur throughput loss. Moreover, for such systems, how to conduct predictive scheduling to reduce delays and the fundamental limits of its benefits remain unexplored. In this paper, we formulate the problem of online user-AP association and resource allocation for content delivery with predictive scheduling under a fixed content placement as a stochastic network optimization problem. By exploiting its unique structure, we transform the problem into a series of modular maximization sub-problems with matroid constraints. Then we devise PUARA , a Predictive User-AP Association and Resource Allocation scheme which achieves a provably near-optimal throughput with queue stability. Our theoretical analysis and simulation results show that PUARA can not only perform a tunable control between throughput maximization and queue stability, but also incur a notable delay reduction with predicted information.
Article
For software-defined networking (SDN) systems, to enhance the scalability and reliability of control plane, existing solutions adopt either multi-controller design with static switch-controller association, or static control devolution by delegating certain request processing back to switches. Such solutions can fall short in face of temporal variations of request traffics, incurring considerable local computation costs on switches and their communication costs to controllers. So far, it still remains an open problem to develop a joint online scheme that conducts dynamic switch-controller association and dynamic control devolution. In addition, the fundamental benefits of predictive scheduling to SDN systems still remain unexplored. In this paper, we identify the non-trivial trade-off in such a joint design and formulate a stochastic network optimization problem which aims to minimize time-averaged total system costs and ensure long-term queue stability. By exploiting the unique problem structure, we devise a predictive online switch-controller association and control devolution (POSCAD) scheme, which solves the problem through a series of online distributed decision making. Theoretical analysis shows that without prediction, POSCAD can achieve near-optimal total system costs a tunable trade-off for queue stability. With prediction, POSCAD can achieve even better performance with shorter latencies. We conduct extensive simulations to evaluate POSCAD. Notably, with mild-value of future information, POSCAD incurs a significant reduction in request latencies, even when faced with prediction errors.
Full-text available
Article
Decentralized orchestration of the control plane is critical to the scalability and reliability of software-defined network (SDN). However, existing orchestrations of SDN are either one-off or centralized, and would be inefficient the presence of temporal and spatial variations in traffic requests. In this paper, a fully distributed orchestration is proposed to minimize the time-average cost of SDN, adapting to the variations. This is achieved by stochastically optimizing the on-demand activation of controllers, adaptive association of controllers and switches, and real-time request processing and dispatching. The proposed approach is able to operate at multiple timescales for activation and association of controllers, and request processing and dispatching, thereby alleviating potential service interruptions caused by orchestration. A new analytic framework is developed to confirm the asymptotic optimality of the proposed approach in the presence of non-negligible signaling delays between controllers. Corroborated from extensive simulations, the proposed approach can save up to 73% the time-average operational cost of SDN, as compared to the existing static orchestration.
Full-text available
Conference Paper
Software Defined Networking (SDN) uses a logically centralized controller to replace the distributed control plane in a traditional network. One of the central challenges faced by the SDN paradigm is the scalability of the logical controller. As a network grows in size, the computational and communication demand faced by a controller may soon exceed the capabilities of a commodity server. In this work, we revisit the task division of labour between the controller and switches, and propose FOCUS, an architecture that offloads a specific subset of control functions, i.e., stable local functions, to the switches' software stack. We implemented a prototype of FOCUS and analyzed the benefits of converting several SDN applications. Due to space restrictions, we only present results for ARP, LLDP and elephant flow detection. Our initial results are promising and they show that FOCUS can reduce a controller's communication overhead by 50% to nearly 100%, and the computational overhead from 80% to 98%. Furthermore, we observe that FOCUS offloading to the switches saves switch CPU because FOCUS reduces the overheads for communication with the controller.
Full-text available
Article
In software-defined networking (SDN), as data plane scale expands, scalability and reliability of the control plane has become major concerns. To mitigate such concerns, two kinds of solutions have been proposed separately. One is multi-controller architecture, i.e., a logically centralized control plane with physically distributed controllers. The other is control devolution, i.e., delegating control of some flows back to switches. Most of existing solutions adopt either static switch-controller association or static devolution, which may not adapt well to the traffic variation, leading to high communication costs between switches and controller, and high computation costs of switches. In this paper, we propose a novel scheme to jointly consider both solutions, i.e., we dynamically associate switches with controllers and dynamically devolve control of flows to switches. Our scheme is an efficient online algorithm that does not need the statistics of traffic flows. By adjusting some parameter V, we can make a trade-off between costs and queue backlogs. Theoretical analysis and extensive simulations show that our scheme yields much lower costs and latency compared to static schemes, and balanced loads among controllers.
Full-text available
Conference Paper
An experimental setup of 32 honeypots reported 17M login attempts originating from 112 different countries and over 6000 distinct source IP addresses. Due to decoupled control and data plane, Software Defined Networks (SDN) can handle these increasing number of attacks by blocking those network connections at the switch level. However, the challenge lies in defining the set of rules on the SDN controller to block malicious network connections. Historical network attack data can be used to automatically identify and block the malicious connections. There are a few existing open-source software tools to monitor and limit the number of login attempts per source IP address one-by-one. However, these solutions cannot efficiently act against a chain of attacks that comprises multiple IP addresses used by each attacker. In this paper, we propose using machine learning algorithms, trained on historical network attack data, to identify the potential malicious connections and potential attack destinations. We use four widely-known machine learning algorithms: C4.5, Bayesian Network (BayesNet), Decision Table (DT), and Naive-Bayes to predict the host that will be attacked based on the historical data. Experimental results show that average prediction accuracy of 91.68% is attained using Bayesian Networks.
Full-text available
Article
The explosive growth of global mobile traffic has lead to a rapid growth in the energy consumption in communication networks. In this paper, we focus on the energy-aware design of the network selection, subchannel, and power allocation in cellular and Wi-Fi networks, while taking into account the traffic delay of mobile users. The problem is particularly challenging due to the two-timescale operations for the network selection (large timescale) and subchannel and power allocation (small timescale). Based on the two-timescale Lyapunov optimization technique, we first design an online Energy-Aware Network Selection and Resource Allocation (ENSRA) algorithm. The ENSRA algorithm yields a power consumption within O(1/V) bound of the optimal value, and guarantees an O(V) traffic delay for any positive control parameter V. Motivated by the recent advancement in the accurate estimation and prediction of user mobility, channel conditions, and traffic demands, we further develop a novel predictive Lyapunov optimization technique to utilize the predictive information, and propose a Predictive Energy-Aware Network Selection and Resource Allocation (P-ENSRA) algorithm. We characterize the performance bounds of P-ENSRA in terms of the power-delay tradeoff theoretically. To reduce the computational complexity, we finally propose a Greedy Predictive Energy-Aware Network Selection and Resource Allocation (GP-ENSRA) algorithm, where the operator solves the problem in P-ENSRA approximately and iteratively. Numerical results show that GP-ENSRA significantly improves the power-delay performance over ENSRA in the large delay regime. For a wide range of system parameters, GP-ENSRA reduces the traffic delay over ENSRA by 20~30% under the same power consumption.
Full-text available
Article
The advent of software defined networking enables flexible, reliable and feature-rich control planes for data center networks. However, the tight coupling of centralized control and complete visibility leads to a wide range of issues among which scalability has risen to prominence. To address this, we present LazyCtrl, a novel hybrid control plane design for data center networks where network control is carried out by distributed control mechanisms inside independent groups of switches while complemented with a global controller. Our design is motivated by the observation that data center traffic is usually highly skewed and thus edge switches can be grouped according to traffic locality. LazyCtrl aims at bringing laziness to the global controller by dynamically devolving most of the control tasks to independent switch groups to process frequent intra-group events near datapaths while handling rare inter-group or other specified events by the controller. We implement LazyCtrl and build a prototype based on Open vSwich and Floodlight. Trace-driven experiments on our prototype show that an effective switch grouping is easy to maintain in multi-tenant clouds and the central controller can be significantly shielded by staying lazy, with its workload reduced by up to 82%.
Full-text available
Article
Software Defined Networks (SDN) give network designers freedom to refactor the network control plane. One core benefit of SDN is that it enables the network control logic to be designed and operated on a global network view, as though it were a centralized application, rather than a distributed system - logically centralized. Regardless of this abstraction, control plane state and logic must inevitably be physically distributed to achieve responsiveness, reliability, and scalability goals. Consequently, we ask: "How does distributed SDN state impact the performance of a logically centralized control application?" Motivated by this question, we characterize the state exchange points in a distributed SDN control plane and identify two key state distribution trade-offs. We simulate these exchange points in the context of an existing SDN load balancer application. We evaluate the impact of inconsistent global network view on load balancer performance and compare different state management approaches. Our results suggest that SDN control state inconsistency significantly degrades performance of logically centralized control applications agnostic to the underlying state distribution.
Full-text available
Conference Paper
OpenFlow is a great concept, but its original design imposes excessive overheads. It can simplify network and traffic management in enterprise and data center environments, because it enables flow-level control over Ethernet switching and provides global visibility of the flows in the network. However, such fine-grained control and visibility comes with costs: the switch-implementation costs of involving the switch's control-plane too often and the distributed-system costs of involving the OpenFlow controller too frequently, both on flow setups and especially for statistics-gathering. In this paper, we analyze these overheads, and show that OpenFlow's current design cannot meet the needs of high-performance networks. We design and evaluate DevoFlow, a modification of the OpenFlow model which gently breaks the coupling between control and global visibility, in a way that maintains a useful amount of visibility without imposing unnecessary costs. We evaluate DevoFlow through simulations, and find that it can load-balance data center traffic as well as fine-grained solutions, without as much overhead: DevoFlow uses 10--53 times fewer flow table entries at an average switch, and uses 10--42 times fewer control messages.
Full-text available
Conference Paper
Computer networks lack a general control paradigm, as traditional networks do not provide any network-wide management abstractions. As a result, each new function (such as routing) must provide its own state distribution, element discovery, and failure recovery mechanisms. We believe this lack of a common control platform has significantly hindered the development of flexible, reliable and feature-rich network control planes. To address this, we present Onix, a platform on top of which a network control plane can be implemented as a distributed system. Control planes written within Onix operate on a global view of the network, and use basic state distribution primitives provided by the platform. Thus Onix provides a general API for control plane implementations, while allowing them to make their own trade-offs among consistency, durability, and scalability.
Full-text available
Article
This whitepaper proposes OpenFlow: a way for researchers to run experimental protocols in the networks they use ev- ery day. OpenFlow is based on an Ethernet switch, with an internal flow-table, and a standardized interface to add and remove flow entries. Our goal is to encourage network- ing vendors to add OpenFlow to their switch products for deployment in college campus backbones and wiring closets. We believe that OpenFlow is a pragmatic compromise: on one hand, it allows researchers to run experiments on hetero- geneous switches in a uniform way at line-rate and with high port-density; while on the other hand, vendors do not need to expose the internal workings of their switches. In addition to allowing researchers to evaluate their ideas in real-world traffic settings, OpenFlow could serve as a useful campus component in proposed large-scale testbeds like GENI. Two buildings at Stanford University will soon run OpenFlow networks, using commercial Ethernet switches and routers. We will work to encourage deployment at other schools; and We encourage you to consider deploying OpenFlow in your university network too.
Article
Software defined networking is increasingly prevalent in data center networks for it enables centralized network configuration and management. However, since switches are statically assigned to controllers and controllers are statically provisioned, traffic dynamics may cause long response time and incur high maintenance cost. To address these issues, we formulate the dynamic controller assignment problem (DCAP) as an online optimization to minimize the total cost caused by response time and maintenance on the cluster of controllers. By applying the randomized fixed horizon control framework, we decompose DCAP into a series of stable matching problems with transfers, guaranteeing a small loss in competitive ratio. Since the matching problem is NP-hard, we propose a hierarchical two-phase algorithm that integrates key concepts from both matching theory and coalitional games to solve it efficiently. Theoretical analysis proves that our algorithm converges to a near-optimal Nash stable solution within tens of iterations. Extensive simulations show that our online approach reduces total cost by about 46%, and achieves better load balancing among controllers compared with static assignment.
Article
We consider online convex optimization (OCO) problems with switching costs and noisy predictions. While the design of online algorithms for OCO problems has received considerable attention, the design of algorithms in the context of noisy predictions is largely open. To this point, two promising algorithms have been proposed: Receding Horizon Control (RHC) and Averaging Fixed Horizon Control (AFHC). The comparison of these policies is largely open. AFHC has been shown to provide better worst-case performance, while RHC outperforms AFHC in many realistic settings. In this paper, we introduce a new class of policies, Committed Horizon Control (CHC), that generalizes both RHC and AFHC. We provide average-case analysis and concentration results for CHC policies, yielding the first analysis of RHC for OCO problems with noisy predictions. Further, we provide explicit results characterizing the optimal CHC policy as a function of properties of the prediction noise, e.g., variance and correlation structure. Our results provide a characterization of when AFHC outperforms RHC and vice versa, as well as when other CHC policies outperform both RHC and AFHC.
Conference Paper
We consider online convex optimization (OCO) problems with switching costs and noisy predictions. While the design of online algorithms for OCO problems has received considerable attention, the design of algorithms in the context of noisy predictions is largely open. To this point, two promising algorithms have been proposed: Receding Horizon Control (RHC) and Averaging Fixed Horizon Control (AFHC). The comparison of these policies is largely open. AFHC has been shown to provide better worst-case performance, while RHC outperforms AFHC in many realistic settings. In this paper, we introduce a new class of policies, Committed Horizon Control (CHC), that generalizes both RHC and AFHC. We provide average-case analysis and concentration results for CHC policies, yielding the first analysis of RHC for OCO problems with noisy predictions. Further, we provide explicit results characterizing the optimal CHC policy as a function of properties of the prediction noise, e.g., variance and correlation structure. Our results provide a characterization of when AFHC outperforms RHC and vice versa, as well as when other CHC policies outperform both RHC and AFHC.
Article
In online service systems, the delay experienced by a user from the service request to the service completion is one of the most critical performance metrics. To improve user delay experience, recent industrial practice suggests a modern system design mechanism: proactive serving, where the system predicts future user requests and allocates its capacity to serve these upcoming requests proactively. In this paper, we investigate the fundamentals of proactive serving from a theoretical perspective. In particular, we show that proactive serving decreases average delay exponentially (as a function of the prediction window size). Our results provide theoretical foundations for proactive serving and shed light on its application in practical systems.
Article
Several distributed SDN controller architectures have been proposed to mitigate the risks of overload and failure. However, since they statically assign switches to controller instances and store state in distributed data stores (which doubles flow setup latency), they hinder operators' ability to minimize both flow setup latency and controller resource consumption. To address this, we propose a novel approach for assigning SDN switches and partitions of SDN application state to distributed controller instances. We present a new way to partition SDN application state that considers the dependencies between application state and SDN switches. We then formally model the assignment problem as a variant of multi-dimensional bin packing and propose a practical heuristic to solve the problem with strict time constraints. Our preliminary evaluations show that our approach yields a 44% decrease in flow setup latency and a 42% reduction in controller operating costs.
Article
In a queuing process, let 1/λ be the mean time between the arrivals of two consecutive units, L be the mean number of units in the system, and W be the mean time spent by a unit in the system. It is shown that, if the three means are finite and the corresponding stochastic processes strictly stationary, and, if the arrival process is metrically transitive with nonzero mean, then L = λW.
Conference Paper
Distributed controllers have been proposed for Software Defined Networking to address the issues of scalability and reliability that a centralized controller suffers from. One key limitation of the distributed controllers is that the mapping between a switch and a controller is statically configured, which may result in uneven load distribution among the controllers. To address this problem, we propose ElastiCon, an elastic distributed controller architecture in which the controller pool is dynamically grown or shrunk according to traffic conditions and the load is dynamically shifted across controllers. We propose a novel switch migration protocol for enabling such load shifting, which conforms with the Openflow standard. We also build a prototype to demonstrate the efficacy of our design.
Conference Paper
OpenFlow assumes a logically centralized controller, which ideally can be physically distributed. However, current deployments rely on a single controller which has major drawbacks including lack of scalability. We present HyperFlow, a distributed event-based control plane for OpenFlow. HyperFlow is logically centralized but physically distributed: it provides scalability while keeping the benefits of network control centralization. By passively synchronizing network-wide views of OpenFlow controllers, HyperFlow localizes decision making to individual controllers, thus minimizing the control plane response time to data plane requests. HyperFlow is resilient to network partitioning and component failures. It also enables interconnecting independently managed OpenFlow networks, an essential feature missing in current OpenFlow deployments. We have implemented HyperFlow as an application for NOX. Our implementation requires minimal changes to NOX, and allows reuse of existing NOX applications with minor modifications. Our preliminary evaluation shows that, assuming sufficient control bandwidth, to bound the window of inconsistency among controllers by a factor of the delay between the farthest controllers, the network changes must occur at a rate lower than 1000 events per second across the network.
Article
Limiting the overhead of frequent events on the control plane is essential for realizing a scalable Software-Defined Network. One way of limiting this overhead is to process frequent events in the data plane. This requires modifying switches and comes at the cost of visibility in the control plane. Taking an alternative route, we propose Kandoo, a framework for preserving scalability without changing switches. Kandoo has two layers of controllers: (i) the bottom layer is a group of controllers with no interconnection, and no knowledge of the network-wide state, and (ii) the top layer is a logically centralized controller that maintains the network-wide state. Controllers at the bottom layer run only local control applications (i.e., applications that can function using the state of a single switch) near datapaths. These controllers handle most of the frequent events and effectively shield the top layer. Kandoo's design enables network operators to replicate local controllers on demand and relieve the load on the top layer, which is the only potential bottleneck in terms of scalability. Our evaluations show that a network controlled by Kandoo has an order of magnitude lower control channel consumption compared to normal OpenFlow networks.
Conference Paper
Although there is tremendous interest in designing improved networks for data centers, very little is known about the network-level traffic characteristics of data centers today. In this paper, we conduct an empirical study of the network traffic in 10 data centers belonging to three different categories, including university, enterprise campus, and cloud data centers. Our definition of cloud data centers includes not only data centers employed by large online service providers offering Internet-facing applications but also data centers used to host data-intensive (MapReduce style) applications). We collect and analyze SNMP statistics, topology and packet-level traces. We examine the range of applications deployed in these data centers and their placement, the flow-level and packet-level transmission properties of these applications, and their impact on network and link utilizations, congestion and packet drops. We describe the implications of the observed traffic patterns for data center internal traffic engineering as well as for recently proposed architectures for data center networks.
Conference Paper
Today's data centers may contain tens of thousands of computers with significant aggregate bandwidth requirements. The network architecture typically consists of a tree of routing and switching elements with progressively more specialized and expensive equipment moving up the network hierarchy. Unfortunately, even when deploying the highest-end IP switches/routers, resulting topologies may only support 50% of the aggregate bandwidth available at the edge of the network, while still incurring tremendous cost. Non-uniform bandwidth among data center nodes complicates application design and limits overall system performance. In this paper, we show how to leverage largely commodity Ethernet switches to support the full aggregate bandwidth of clusters consisting of tens of thousands of elements. Similar to how clusters of commodity computers have largely replaced more specialized SMPs and MPPs, we argue that appropriately architected and interconnected commodity switches may deliver more performance at less cost than available from today's higher-end solutions. Our approach requires no modifications to the end host network interface, operating system, or applications; critically, it is fully backward compatible with Ethernet, IP, and TCP.
Book
This text presents a modern theory of analysis, control, and optimization for dynamic networks. Mathematical techniques of Lyapunov drift and Lyapunov optimization are developed and shown to enable constrained optimization of time averages in general stochastic systems. The focus is on communication and queueing systems, including wireless networks with time-varying channels, mobility, and randomly arriving traffic. A simple drift-plus-penalty framework is used to optimize time averages such as throughput, throughput-utility, power, and distortion. Explicit performance-delay tradeoffs are provided to illustrate the cost of approaching optimality. This theory is also applicable to problems in operations research and economics, where energy-efficient and profit-maximizing decisions must be made without knowing the future. Topics in the text include the following: • Queue stability theory • Backpressure, max-weight, and virtual queue methods • Primal-dual methods for non-convex stochastic utility maximization • Universal scheduling theory for arbitrary sample paths • Approximate and randomized scheduling theory • Optimization of renewal systems and Markov decision systems Detailed examples and numerous problem set questions are provided to reinforce the main concepts.
Article
Industry experience indicates that the ability to incrementally expand data centers is essential. However, existing high-bandwidth network designs have rigid structure that interferes with incremental expansion. We present Jellyfish, a high-capacity network interconnect, which, by adopting a random graph topology, yields itself naturally to incremental expansion. Somewhat surprisingly, Jellyfish is more cost-efficient than a fat-tree: A Jellyfish interconnect built using the same equipment as a fat-tree, supports as many as 25% more servers at full capacity at the scale of a few thousand nodes, and this advantage improves with scale. Jellyfish also allows great flexibility in building networks with different degrees of oversubscription. However, Jellyfish's unstructured design brings new challenges in routing, physical layout, and wiring. We describe and evaluate approaches that resolve these challenges effectively, indicating that Jellyfish could be deployed in today's data centers.
Introducing FBLearner Flow
  • Jeffrey Dunn
Jeffrey Dunn. 2016. Introducing FBLearner Flow. https://code.facebook.com/ posts/1072626246134461/introducing-fblearner-flow-facebook-s-ai-backbone/
Predicting network attack patterns in SDN using machine learning approach
  • Faheem Saurav Nanda
  • Casimer Zafari
  • Eric Decusatis
  • Baijian Wedaa
  • Yang
Netflix adds download functionality
  • Jonathan Broughton
Jonathan Broughton. 2016. Netflix adds download functionality. https: //technology.ihs.com/586280/netflix-adds-download-support