Article

Predictive Switch-Controller Association and Control Devolution for SDN Systems

Authors:
  • Shenzhen Institute of Artificial Intelligence and Robotics for Society

Abstract

For software-defined networking (SDN) systems, existing solutions enhance the scalability and reliability of the control plane by adopting either a multi-controller design with static switch-controller association, or static control devolution that delegates some request processing back to switches. Such solutions can fall short in the face of temporal variations in request traffic, incurring considerable local computation costs on switches and communication costs between switches and controllers. So far, it remains an open problem to develop a joint online scheme that conducts both dynamic switch-controller association and dynamic control devolution. In addition, the fundamental benefits of predictive scheduling to SDN systems remain unexplored. In this paper, we identify the non-trivial trade-off in such a joint design and formulate a stochastic network optimization problem that aims to minimize time-averaged total system costs while ensuring long-term queue stability. By exploiting the unique problem structure, we devise a predictive online switch-controller association and control devolution (POSCAD) scheme, which solves the problem through a series of online, distributed decisions. Theoretical analysis shows that, without prediction, POSCAD achieves near-optimal total system costs with a tunable trade-off against queue stability. With prediction, POSCAD achieves even better performance with shorter latencies. We conduct extensive simulations to evaluate POSCAD. Notably, with only a mild amount of future information, POSCAD achieves a significant reduction in request latencies, even in the face of prediction errors.
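
The per-slot decision at the heart of such a scheme is easy to illustrate. Below is a minimal, hypothetical sketch (not the authors' implementation) of a drift-plus-penalty rule of the kind this design suggests: each switch scores devolving its new requests locally against dispatching them to each controller, weighting unit cost by a parameter V against current queue backlog, and picks the minimizer. All cost terms, rates, and queue dynamics are invented for illustration.

```python
# Illustrative sketch of a per-slot drift-plus-penalty rule for joint
# switch-controller association and control devolution. Costs, service
# rates, and queue dynamics are hypothetical placeholders.
from dataclasses import dataclass, field
import random

V = 50.0  # trade-off knob: larger V favors lower cost over shorter queues

@dataclass
class Controller:
    cid: int
    queue: float = 0.0          # backlog of pending requests

@dataclass
class Switch:
    sid: int
    local_queue: float = 0.0    # backlog of locally devolved requests
    local_cost: float = 2.0     # per-request local computation cost
    comm_cost: dict = field(default_factory=dict)  # per-controller comm cost

def decide(sw, controllers):
    """Return 'local' or a Controller: per-request score is
    V * (unit cost) + (current backlog at the candidate queue)."""
    best, best_score = "local", V * sw.local_cost + sw.local_queue
    for c in controllers:
        score = V * sw.comm_cost[c.cid] + c.queue
        if score < best_score:
            best, best_score = c, score
    return best

# Toy run: 3 controllers, 5 switches, random arrivals, fixed service rates.
controllers = [Controller(cid=i) for i in range(3)]
switches = [Switch(sid=i, comm_cost={c.cid: random.uniform(0.5, 3.0)
                                     for c in controllers}) for i in range(5)]
for t in range(100):
    for sw in switches:
        arrivals = random.randint(0, 4)
        target = decide(sw, controllers)
        if target == "local":
            sw.local_queue += arrivals
        else:
            target.queue += arrivals
    for c in controllers:
        c.queue = max(c.queue - 5.0, 0.0)                # controller service
    for sw in switches:
        sw.local_queue = max(sw.local_queue - 2.0, 0.0)  # switch service
```

Here V plays the role of the tunable trade-off in the abstract: raising it drives time-averaged cost toward optimal at the price of longer queues, and prediction would additionally let pre-served requests skip the queues.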


... Yang Xu et al. [21] proposed the BalCon algorithm to measure the cost of switch redistribution in terms of routing computational load and load cost. Controller deployment [13,22] and task scheduling [23] based on network traffic prediction have also been studied. ...
Article
Full-text available
The software-defined-network-enabled mobile edge computing (SDN-enabled MEC) architecture, which integrates SDN and MEC technologies, gives the MEC flexible and dynamic management of the underlying network resources, reduces the distance between access terminals and computing and network resources, and increases terminals' access to resources. However, the static assignment between MEC servers (MECSs) and controllers in a multi-controller architecture may result in unbalanced load distribution among the controllers, which degrades network performance. In this paper, a multi-objective optimization MECS redistribution algorithm (MOSRA) is proposed to decrease response time and overhead. A controller response time model and a link transmission overhead model are introduced as the basis of an evolutionary algorithm that optimizes MECS redistribution. The proposed algorithm selects a suitable sub-optimal individual from the Pareto front using a coordination-transformation-based strategy. That is, when the master controller of a MECS is reassigned, both the network overhead from the MECS to the controller and the response time of the controller to the MECS's processing requests are optimized. Finally, simulation results demonstrate that MOSRA can solve the redistribution problem within effective time under different network load levels and network sizes, and achieves lower control-plane response time and lower edge-network transmission overhead compared with other algorithms.
Article
Computer networks, and communication over them, have become an integral part of human life since the beginning of the twenty-first century. From the first radio communication at the end of the nineteenth century to the latest wireless and Software Defined Network (SDN) communication over satellites and local computer networks, traffic over computer networks has grown continuously. Internet access is no longer provided only over copper and fiber-optic cables; it has also become a commercial service delivered over low-earth-orbit satellites across the globe. The majority of the content flowing over such networks is academic, research, or corporate data (banking, finance, IT services, social media), and especially entertainment. Organizations of every size, in industry as well as research and development, have faced network congestion over time: as the demand for data access grows, so does congestion. Every nation and organization has its own policy to control domestic network traffic, since bandwidth is limited and commercial interests are involved. Various algorithms focus on regulating network traffic, but few focus on providing alternative paths for better traffic management. This paper uses modern nature-inspired algorithms, such as Ant Colony Optimization, to first identify congestion over the network, and then uses genetic operators such as mutation and crossover to find alternative paths from those insights. The proposed solution was implemented, and the generated results show that the Genetic Algorithm helps find alternative paths more effectively over the SDN.
Article
A distributed control plane is more scalable and robust in software-defined networking. This paper focuses on controller load balancing via packet-in request redirection: given the instantaneous state of the system, determine for each switch whether to redirect its packet-in requests, such that the overall control plane response time (CPRT) is minimized. To address this problem, we propose a framework based on Lyapunov optimization. First, we use the drift-plus-penalty algorithm to combine the CPRT minimization problem with controller capacity constraints, and derive a non-linear program whose optimal solution is obtained by brute force using standard linearization techniques. Second, we present a greedy strategy that efficiently obtains a solution with a bounded approximation ratio. Third, we reformulate the program as maximizing a non-monotone submodular function subject to matroid constraints. We implement a controller prototype for packet-in request redirection and conduct trace-driven simulations to validate our theoretical results. The results show that our algorithms reduce the average CPRT by 81.6% compared with static assignment, and achieve a 3× improvement in the maximum controller capacity violation ratio.
Article
For NFV systems, the key design space includes function chaining for network requests and resource scheduling for servers. The problem is challenging because NFV systems usually pursue multiple (often conflicting) design objectives and require computationally efficient real-time decision making with limited information. Furthermore, the benefits of predictive scheduling to NFV systems still remain unexplored. In this article, we propose POSCARS, an efficient predictive and online service chaining and resource scheduling scheme that achieves tunable trade-offs among various system metrics with a stability guarantee. Through a careful choice of granularity in system modeling, we acquire a better understanding of the trade-offs in our design space. By a non-trivial transformation, we decouple the complex optimization problem into a series of online sub-problems that achieve optimality with only limited information. By employing randomized load balancing techniques, we propose three variants of POSCARS to reduce the overhead of decision making. Theoretical analysis and simulations show that POSCARS and its variants require only a mild amount of future information to achieve near-optimal system cost with ultra-low request response times.
Article
Software-defined networking (SDN) is regarded as a new paradigm of flow management in data center networks (DCNs). In SDN, flow management is centralized at the controller in the control plane, and switches in the data plane rely heavily on the controller for forwarding rules. It is therefore important to consider the mapping between controllers and switches based on flow types. To address this issue, this paper first designs a flow-based two-tier centralized management framework for software-defined DCNs and formulates a dynamic mapping problem as an integer program, called the cost-effective adaptive controller provisioning (Cost_EACP) problem, which is proved to be NP-complete. To solve it, we relax the integer program into a fractional program. We then design a rounding-based Cost_EACP algorithm (RC_EACP) and, on top of it, a switch-controller mapping algorithm (SCM_EACP). Furthermore, we analyze the approximation guarantee and time complexity of the proposed algorithms. Finally, experimental results demonstrate that the proposed algorithm is more effective than existing algorithms at reducing the control-channel bandwidth cost and the delay cost of setting up flow rules, and the gap between the proposed algorithm and the optimum is less than 24%.
Article
Full-text available
Decentralized orchestration of the control plane is critical to the scalability and reliability of software-defined networks (SDN). However, existing orchestrations of SDN are either one-off or centralized, and would be inefficient in the presence of temporal and spatial variations in traffic requests. In this paper, a fully distributed orchestration is proposed to minimize the time-averaged cost of SDN while adapting to these variations. This is achieved by stochastically optimizing the on-demand activation of controllers, the adaptive association of controllers and switches, and real-time request processing and dispatching. The proposed approach can operate at multiple timescales for controller activation and association versus request processing and dispatching, thereby alleviating potential service interruptions caused by orchestration. A new analytic framework is developed to confirm the asymptotic optimality of the proposed approach in the presence of non-negligible signaling delays between controllers. Corroborated by extensive simulations, the proposed approach can save up to 73% of the time-averaged operational cost of SDN, compared with existing static orchestration.
Article
Full-text available
As opposed to the decentralized control logic underpinning the design of the Internet as a complex bundle of box-centric protocols and vertically integrated solutions, the SDN paradigm advocates separating the control logic from hardware and centralizing it in software-based controllers. These key tenets offer new opportunities to introduce innovative applications and to incorporate automatic and adaptive control, thereby easing network management and guaranteeing the user's QoE. Despite the excitement, SDN adoption raises many challenges, including the scalability and reliability issues of centralized designs, which can be addressed by physically decentralizing the control plane. However, such physically distributed but logically centralized systems bring an additional set of challenges. This paper presents a survey on SDN with a special focus on distributed SDN control. Besides reviewing the SDN concept and comparing the SDN architecture to the classical one, the main contribution of this survey is a detailed analysis of state-of-the-art distributed SDN controller platforms, which assesses their advantages and drawbacks and classifies them in novel ways (physical and logical classifications) in order to provide useful guidelines for SDN research and deployment initiatives. A thorough discussion of the major challenges of distributed SDN control is also provided, along with insights into emerging and future trends in the area.
Article
Full-text available
This paper presents NeuTM, a framework for network Traffic Matrix (TM) prediction based on Long Short-Term Memory Recurrent Neural Networks (LSTM RNNs). TM prediction is the problem of estimating a network's future traffic matrix from previously observed traffic data; it is widely used in network planning, resource management, and network security. LSTM is a recurrent neural network (RNN) architecture that is well-suited to learning from data and classifying or predicting time series with time lags of unknown size, and LSTMs have been shown to model long-range dependencies more accurately than conventional RNNs. NeuTM applies this architecture to predict TMs in large networks. By validating our framework on real-world data from the GÉANT network, we show that our model converges quickly and achieves state-of-the-art TM prediction performance.
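
For concreteness, here is a minimal PyTorch sketch of LSTM-based TM prediction; the layer sizes, look-back window, and training data are hypothetical placeholders rather than NeuTM's actual configuration (GÉANT has 23 nodes, hence a 529-entry flattened TM).

```python
# Minimal LSTM traffic-matrix predictor: given a window of past TMs
# (flattened to vectors), predict the next TM. Hyperparameters and the
# random training data below are illustrative only.
import torch
import torch.nn as nn

N = 23 * 23   # flattened TM size for a 23-node network (GÉANT-sized)
WINDOW = 10   # look-back window of past TMs

class TMPredictor(nn.Module):
    def __init__(self, n_flows=N, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_flows, hidden_size=hidden,
                            batch_first=True)
        self.head = nn.Linear(hidden, n_flows)

    def forward(self, x):             # x: (batch, WINDOW, n_flows)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])  # prediction for the next TM

model = TMPredictor()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# One toy training step on random tensors standing in for measured TMs.
x, y = torch.rand(32, WINDOW, N), torch.rand(32, N)
opt.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
opt.step()
```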
Conference Paper
Full-text available
Software Defined Networking (SDN) uses a logically centralized controller to replace the distributed control plane of a traditional network. One of the central challenges faced by the SDN paradigm is the scalability of the logical controller: as a network grows, the computational and communication demand placed on a controller may soon exceed the capabilities of a commodity server. In this work, we revisit the division of labour between the controller and switches, and propose FOCUS, an architecture that offloads a specific subset of control functions, namely stable local functions, to the switches' software stack. We implemented a prototype of FOCUS and analyzed the benefits of converting several SDN applications; due to space restrictions, we only present results for ARP, LLDP, and elephant flow detection. Our initial results are promising: FOCUS can reduce a controller's communication overhead by 50% to nearly 100%, and its computational overhead by 80% to 98%. Furthermore, offloading to the switches also saves switch CPU, because FOCUS reduces the overhead of communicating with the controller.
Conference Paper
Full-text available
An experimental setup of 32 honeypots recorded 17M login attempts originating from 112 different countries and over 6000 distinct source IP addresses. Thanks to their decoupled control and data planes, Software Defined Networks (SDN) can handle this growing number of attacks by blocking malicious network connections at the switch level. The challenge, however, lies in defining the set of rules on the SDN controller that block malicious connections. Historical network attack data can be used to automatically identify and block them. A few existing open-source tools monitor and limit the number of login attempts per source IP address, one address at a time, but such solutions cannot act efficiently against a chain of attacks in which each attacker uses multiple IP addresses. In this paper, we propose using machine learning algorithms, trained on historical network attack data, to identify potential malicious connections and potential attack destinations. We use four widely known algorithms: C4.5, Bayesian Network (BayesNet), Decision Table (DT), and Naive Bayes, to predict the host that will be attacked based on the historical data. Experimental results show that an average prediction accuracy of 91.68% is attained using Bayesian Networks.
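
To make the prediction step concrete, here is an illustrative scikit-learn sketch, not the paper's pipeline (which evaluated the four algorithms above in a different toolchain); the feature layout and the synthetic data are hypothetical stand-ins for real honeypot logs.

```python
# Train a (Gaussian) Naive Bayes classifier to predict which host will be
# attacked, from hypothetical per-attempt features. Synthetic data only.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Hypothetical features per login attempt, e.g. [src_prefix, country_code,
# hour_of_day, attempts_so_far]; label = index of the targeted honeypot.
rng = np.random.default_rng(0)
X = rng.random((5000, 4))
y = rng.integers(0, 32, size=5000)   # 32 honeypots, as in the testbed

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=0)
clf = GaussianNB().fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```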
Article
Full-text available
The explosive growth of global mobile traffic has led to rapid growth in the energy consumption of communication networks. In this paper, we focus on the energy-aware design of network selection, subchannel allocation, and power allocation in cellular and Wi-Fi networks, while taking into account the traffic delay of mobile users. The problem is particularly challenging due to its two-timescale structure: network selection operates on a large timescale, while subchannel and power allocation operate on a small one. Based on the two-timescale Lyapunov optimization technique, we first design an online Energy-Aware Network Selection and Resource Allocation (ENSRA) algorithm. ENSRA yields a power consumption within O(1/V) of the optimal value and guarantees an O(V) traffic delay for any positive control parameter V. Motivated by recent advances in the accurate estimation and prediction of user mobility, channel conditions, and traffic demands, we further develop a novel predictive Lyapunov optimization technique that utilizes predictive information, and propose a Predictive Energy-Aware Network Selection and Resource Allocation (P-ENSRA) algorithm. We theoretically characterize the performance bounds of P-ENSRA in terms of the power-delay tradeoff. To reduce the computational complexity, we finally propose a Greedy Predictive Energy-Aware Network Selection and Resource Allocation (GP-ENSRA) algorithm, in which the operator solves the P-ENSRA problem approximately and iteratively. Numerical results show that GP-ENSRA significantly improves the power-delay performance over ENSRA in the large-delay regime: over a wide range of system parameters, GP-ENSRA reduces traffic delay relative to ENSRA by 20-30% at the same power consumption.
Article
Full-text available
Motivated by the internets of the future, which will likely be considerably larger in size as well as highly heterogeneous and decentralized, we propose Decentralize-SDN (D-SDN), a framework that enables not only physical but also logical distribution of the Software-Defined Networking (SDN) control plane. D-SDN distributes network control by defining a hierarchy of controllers that can match an internet's organizational and administrative structure. By delegating control between main controllers and secondary controllers, D-SDN accommodates administrative decentralization and autonomy. It incorporates security as an integral part of the framework. This paper describes D-SDN and presents two use cases, namely network capacity sharing and public safety network services.
Article
Full-text available
Software-Defined Networking (SDN) is an emerging paradigm that promises to change the state of affairs of current networks, by breaking vertical integration, separating the network's control logic from the underlying routers and switches, promoting (logical) centralization of network control, and introducing the ability to program the network. The separation of concerns introduced between the definition of network policies, their implementation in switching hardware, and the forwarding of traffic, is key to the desired flexibility: by breaking the network control problem into tractable pieces, SDN makes it easier to create and introduce new abstractions in networking, simplifying network management and facilitating network evolution. Today, SDN is both a hot research topic and a concept gaining wide acceptance in industry, which justifies the comprehensive survey presented in this paper. We start by introducing the motivation for SDN, explain its main concepts and how it differs from traditional networking. Next, we present the key building blocks of an SDN infrastructure using a bottom-up, layered approach. We provide an in-depth analysis of the hardware infrastructure, southbound and northbound APIs, network virtualization layers, network operating systems, network programming languages, and management applications. We also look at cross-layer problems such as debugging and troubleshooting. In an effort to anticipate the future evolution of this new paradigm, we discuss the main ongoing research efforts and challenges of SDN. In particular, we address the design of switches and control platforms -- with a focus on aspects such as resiliency, scalability, performance, security and dependability -- as well as new opportunities for carrier transport networks and cloud providers. Last but not least, we analyze the position of SDN as a key enabler of a software-defined environment.
Conference Paper
Full-text available
The data center network is increasingly a cost, reliability and performance bottleneck for cloud computing. Although multi-tree topologies can provide scalable bandwidth and traditional routing algorithms can provide eventual fault tolerance, we argue that recovery speed can be dramatically improved through the co-design of the network topology, routing algorithm and failure detector. We create an engineered network and routing protocol that directly address the failure characteristics observed in data centers. At the core of our proposal is a novel network topology that has many of the same desirable properties as FatTrees, but with much better fault recovery properties. We then create a series of failover protocols that benefit from this topology and are designed to cascade and complement each other. The resulting system, F10, can almost instantaneously reestablish connectivity and load balance, even in the presence of multiple failures. Our results show that following network link and switch failures, F10 has less than 1/7th the packet loss of current schemes. A trace-driven evaluation of MapReduce performance shows that F10's lower packet loss yields a 30% median speedup at the application level.
Article
Full-text available
Modern multi-domain networks now span over datacenter networks, enterprise networks, customer sites and mobile entities. Such networks are critical and, thus, must be resilient, scalable and easily extensible. The emergence of Software-Defined Networking (SDN) protocols, which enables decoupling the data plane from the control plane and dynamically programming the network, opens up new ways to architect such networks. In this paper, we propose DISCO, an open and extensible DIstributed SDN COntrol plane able to cope with the distributed and heterogeneous nature of modern overlay networks and wide area networks. DISCO controllers manage their own network domain and communicate with each other to provide end-to-end network services. This communication is based on a unique lightweight and highly manageable control channel used by agents to self-adaptively share aggregated network-wide information. We implemented DISCO on top of the Floodlight OpenFlow controller and the AMQP protocol. We demonstrated how DISCO's control plane dynamically adapts to heterogeneous network topologies while being resilient enough to survive disruptions and attacks and to provide classic functionalities such as end-point migration and network-wide traffic engineering. The experimental results we present are organized around three use cases: inter-domain topology disruption, end-to-end priority service request and virtual machine migration.
Article
Full-text available
Software Defined Networks (SDN) give network designers freedom to refactor the network control plane. One core benefit of SDN is that it enables the network control logic to be designed and operated on a global network view, as though it were a centralized application, rather than a distributed system - logically centralized. Regardless of this abstraction, control plane state and logic must inevitably be physically distributed to achieve responsiveness, reliability, and scalability goals. Consequently, we ask: "How does distributed SDN state impact the performance of a logically centralized control application?" Motivated by this question, we characterize the state exchange points in a distributed SDN control plane and identify two key state distribution trade-offs. We simulate these exchange points in the context of an existing SDN load balancer application. We evaluate the impact of an inconsistent global network view on load balancer performance and compare different state management approaches. Our results suggest that SDN control-state inconsistency significantly degrades the performance of logically centralized control applications that are agnostic to the underlying state distribution.
Conference Paper
Full-text available
OpenFlow is a great concept, but its original design imposes excessive overheads. It can simplify network and traffic management in enterprise and data center environments, because it enables flow-level control over Ethernet switching and provides global visibility of the flows in the network. However, such fine-grained control and visibility comes with costs: the switch-implementation costs of involving the switch's control-plane too often and the distributed-system costs of involving the OpenFlow controller too frequently, both on flow setups and especially for statistics-gathering. In this paper, we analyze these overheads, and show that OpenFlow's current design cannot meet the needs of high-performance networks. We design and evaluate DevoFlow, a modification of the OpenFlow model which gently breaks the coupling between control and global visibility, in a way that maintains a useful amount of visibility without imposing unnecessary costs. We evaluate DevoFlow through simulations, and find that it can load-balance data center traffic as well as fine-grained solutions, without as much overhead: DevoFlow uses 10--53 times fewer flow table entries at an average switch, and uses 10--42 times fewer control messages.
Article
Full-text available
This whitepaper proposes OpenFlow: a way for researchers to run experimental protocols in the networks they use every day. OpenFlow is based on an Ethernet switch, with an internal flow-table, and a standardized interface to add and remove flow entries. Our goal is to encourage networking vendors to add OpenFlow to their switch products for deployment in college campus backbones and wiring closets. We believe that OpenFlow is a pragmatic compromise: on one hand, it allows researchers to run experiments on heterogeneous switches in a uniform way at line-rate and with high port-density; while on the other hand, vendors do not need to expose the internal workings of their switches. In addition to allowing researchers to evaluate their ideas in real-world traffic settings, OpenFlow could serve as a useful campus component in proposed large-scale testbeds like GENI. Two buildings at Stanford University will soon run OpenFlow networks, using commercial Ethernet switches and routers. We will work to encourage deployment at other schools, and we encourage you to consider deploying OpenFlow in your university network too.
Conference Paper
In software-defined networking (SDN) systems, the scalability and reliability of the control plane remain major concerns. Existing solutions adopt either multi-controller designs or control devolution back to the data plane. The former requires a flexible yet efficient switch-controller association mechanism to adapt to workload changes and potential failures, while the latter demands timely decision making with low overheads; an integrated design for both is even more challenging. Meanwhile, the dramatic advancement of machine learning techniques has boosted the practice of predictive scheduling to improve responsiveness in various systems. Nonetheless, little such work has been conducted for SDN systems so far. In this paper, we study the joint problem of dynamic switch-controller association and control devolution, while investigating the benefits of predictive scheduling in SDN systems. We propose POSCAD, an efficient, online, and distributed scheme that exploits predicted future information to minimize the total system cost and the average request response time with a queueing stability guarantee. Theoretical analysis and trace-driven simulation results show that POSCAD requires only a mild amount of future information to achieve near-optimal system cost and near-zero average request response time, and that it remains robust against mis-prediction in reducing the average request response time.
Chapter
The operations landscape today is more complex than ever. IT Ops teams fight an uphill battle managing the massive amounts of data generated by modern IT systems. They are expected to handle more incidents than ever before under shorter service-level agreements (SLAs), to respond to incidents more quickly, and to improve on key metrics such as mean time to detect (MTTD), mean time to failure (MTTF), mean time between failures (MTBF), and mean time to repair (MTTR). This is not for lack of tools: Digital Enterprise Journal research suggests that 41 percent of enterprises use ten or more tools for IT performance monitoring. Yet downtime is expensive, with companies losing a whopping $5.6 million per outage while MTTR averages 4.2 hours, wasting precious resources. With hybrid multi-cloud, multi-tenant environments, organizations need even more tools to manage the many facets of capacity planning, resource utilization, storage management, anomaly detection, and threat detection and analysis, to name a few.
Article
In recent years, with the rapid development of the Internet and mobile communication technologies, the infrastructure, devices, and resources in networking systems have become more complex and heterogeneous. To efficiently organize, manage, maintain, and optimize networking systems, more intelligence needs to be deployed. However, due to the inherently distributed nature of traditional networks, machine learning techniques are difficult to apply and deploy for controlling and operating networks. Software Defined Networking (SDN) brings new opportunities to provide intelligence inside networks: its capabilities (e.g., logically centralized control, a global view of the network, software-based traffic analysis, and dynamic updating of forwarding rules) make it easier to apply machine learning techniques. In this paper, we provide a comprehensive survey of the literature on machine learning algorithms applied to SDN. First, related work and background knowledge are introduced. Then, we present an overview of machine learning algorithms. In addition, we review how machine learning algorithms are applied in the realm of SDN from the perspectives of traffic classification, routing optimization, Quality of Service (QoS)/Quality of Experience (QoE) prediction, resource management, and security. Finally, challenges and broader perspectives are discussed.
Article
Software-defined networking is increasingly prevalent in data center networks because it enables centralized network configuration and management. However, since switches are statically assigned to controllers and controllers are statically provisioned, traffic dynamics may cause long response times and incur high maintenance costs. To address these issues, we formulate the dynamic controller assignment problem (DCAP) as an online optimization that minimizes the total cost of response time and maintenance on the cluster of controllers. By applying the randomized fixed horizon control framework, we decompose DCAP into a series of stable matching problems with transfers, guaranteeing a small loss in competitive ratio. Since the matching problem is NP-hard, we propose a hierarchical two-phase algorithm that integrates key concepts from both matching theory and coalitional games to solve it efficiently. Theoretical analysis proves that our algorithm converges to a near-optimal Nash-stable solution within tens of iterations. Extensive simulations show that our online approach reduces total cost by about 46% and achieves better load balancing among controllers compared with static assignment.
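
For intuition about the matching step such a scheme builds on, here is a minimal sketch of capacity-constrained deferred acceptance (Gale-Shapley). The actual DCAP algorithm layers transfers and coalitional-game refinements on top of this; the preference data is hypothetical, and the sketch assumes each switch ranks every controller and total capacity covers all switches.

```python
# Deferred acceptance with capacities: switches propose in preference
# order; each controller keeps its best proposals within capacity.
def stable_match(sw_pref, ctrl_rank, capacity):
    """sw_pref[s]      = controllers in switch s's preference order
       ctrl_rank[c][s] = rank controller c gives switch s (lower is better)
       capacity[c]     = number of switches controller c can serve"""
    assigned = {c: [] for c in ctrl_rank}
    next_choice = {s: 0 for s in sw_pref}
    free = list(sw_pref)
    while free:
        s = free.pop()
        c = sw_pref[s][next_choice[s]]
        next_choice[s] += 1
        assigned[c].append(s)
        if len(assigned[c]) > capacity[c]:          # evict worst-ranked
            worst = max(assigned[c], key=lambda x: ctrl_rank[c][x])
            assigned[c].remove(worst)
            free.append(worst)
    return assigned

prefs = {"s1": ["c1", "c2"], "s2": ["c1", "c2"], "s3": ["c2", "c1"]}
ranks = {"c1": {"s1": 0, "s2": 1, "s3": 2},
         "c2": {"s1": 2, "s2": 1, "s3": 0}}
print(stable_match(prefs, ranks, {"c1": 1, "c2": 2}))
# -> {'c1': ['s1'], 'c2': ['s3', 's2']}: no switch-controller pair would
#    both prefer each other over their current assignment.
```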
Conference Paper
We present the design, implementation, and evaluation of B4, a private WAN connecting Google's data centers across the planet. B4 has a number of unique characteristics: i) massive bandwidth requirements deployed to a modest number of sites, ii) elastic traffic demand that seeks to maximize average bandwidth, and iii) full control over the edge servers and network, which enables rate limiting and demand measurement at the edge. These characteristics led to a Software Defined Networking architecture using OpenFlow to control relatively simple switches built from merchant silicon. B4's centralized traffic engineering service drives links to near 100% utilization, while splitting application flows among multiple paths to balance capacity against application priority/demands. We describe experience with three years of B4 production deployment, lessons learned, and areas for future work.
Article
In online service systems, the delay experienced by users from service request to service completion is one of the most critical performance metrics. To improve user delay experience, recent industrial practices suggest a modern system design mechanism: proactive serving, where the service system predicts future user requests and allocates its capacity to serve these upcoming requests proactively. This approach complements the conventional mechanism of capability boosting. In this paper, we propose queuing models for online service systems with proactive serving capability and characterize the user delay reduction by proactive serving. In particular, we show that proactive serving decreases average delay exponentially (as a function of the prediction window size) in the cases where service time follows light-tailed distributions. Furthermore, the exponential decrease in user delay is robust against prediction errors (in terms of miss detection and false alarm) and user demand fluctuation. Compared with the conventional mechanism of capability boosting, proactive serving is more effective in decreasing delay when the system is in the light-load regime. Our trace-driven evaluations demonstrate the practical power of proactive serving: for example, for the data trace of light-tailed YouTube videos, the average user delay decreases by 50% when the system predicts 60 s ahead. Our results provide, from a queuing-theoretical perspective, justifications for the practical application of proactive serving in online service systems.
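
The headline effect is easy to reproduce in a toy model. The sketch below (our own illustration, not the paper's queueing model) simulates a slotted single-server queue in which a request predicted w slots ahead becomes servable early; delay is measured from the request's true arrival time, so pre-served requests see zero delay. The arrival rate and window values are arbitrary.

```python
# Toy proactive-serving simulation: jobs become servable w slots before
# their true arrival; completions before the true arrival count as zero delay.
import random
from collections import deque

def avg_delay(w, lam=0.8, slots=100_000, seed=1):
    rng = random.Random(seed)
    q = deque()                 # true arrival times of servable jobs
    total, served = 0.0, 0
    for t in range(slots):
        if rng.random() < lam:  # a job that truly arrives at t + w
            q.append(t + w)     # ...is predicted and admitted now
        if q:                   # serve one job per slot, FIFO
            arrival = q.popleft()
            total += max(0, t - arrival)
            served += 1
    return total / served

for w in (0, 5, 20):
    print(f"w={w:2d}  average delay: {avg_delay(w):.2f} slots")
```

Running it shows the average delay collapsing as w grows, consistent with the exponential decrease the paper proves for light-tailed service times.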
Article
Datacenter networks suffer unpredictable performance due to a lack of application level bandwidth guarantees. A lot of attention has been drawn to solve this problem such as how to provide bandwidth guarantees for virtualized machines (VMs), proportional bandwidth share among tenants, and high network utilization under peak traffic. However, existing solutions fail to cope with highly dynamic traffic in datacenter networks. In this paper, we propose eBA, an efficient solution to bandwidth allocation that provides end-to-end bandwidth guarantee for VMs under large numbers of short flows and massive bursty traffic in datacenters. eBA leverages a novel distributed VM-to-VM rate control algorithm based on the logistic model under the control-theoretic framework. eBA's implementation requires no changes to hardware or applications and can be deployed in standard protocol stack. The theoretical analysis and the experimental results show that eBA not only guarantees the bandwidth for VMs, but also provides fast convergence to efficiency and fairness, as well as smooth response to bursty traffic.
Article
We consider online convex optimization (OCO) problems with switching costs and noisy predictions. While the design of online algorithms for OCO problems has received considerable attention, the design of algorithms in the context of noisy predictions is largely open. To this point, two promising algorithms have been proposed: Receding Horizon Control (RHC) and Averaging Fixed Horizon Control (AFHC). The comparison of these policies is largely open. AFHC has been shown to provide better worst-case performance, while RHC outperforms AFHC in many realistic settings. In this paper, we introduce a new class of policies, Committed Horizon Control (CHC), that generalizes both RHC and AFHC. We provide average-case analysis and concentration results for CHC policies, yielding the first analysis of RHC for OCO problems with noisy predictions. Further, we provide explicit results characterizing the optimal CHC policy as a function of properties of the prediction noise, e.g., variance and correlation structure. Our results provide a characterization of when AFHC outperforms RHC and vice versa, as well as when other CHC policies outperform both RHC and AFHC.
Article
We present our experiences to date building ONOS (Open Network Operating System), an experimental distributed SDN control platform motivated by the performance, scalability, and availability requirements of large operator networks. We describe and evaluate two ONOS prototypes. The first version implemented core features: a distributed, but logically centralized, global network view; scale-out; and fault tolerance. The second version focused on improving performance. Based on experience with these prototypes, we identify additional steps that will be required for ONOS to support use cases such as core network traffic engineering and scheduling, and to become a usable open source, distributed network OS platform that the SDN community can build upon.
Article
Several distributed SDN controller architectures have been proposed to mitigate the risks of overload and failure. However, since they statically assign switches to controller instances and store state in distributed data stores (which doubles flow setup latency), they hinder operators' ability to minimize both flow setup latency and controller resource consumption. To address this, we propose a novel approach for assigning SDN switches and partitions of SDN application state to distributed controller instances. We present a new way to partition SDN application state that considers the dependencies between application state and SDN switches. We then formally model the assignment problem as a variant of multi-dimensional bin packing and propose a practical heuristic to solve the problem with strict time constraints. Our preliminary evaluations show that our approach yields a 44% decrease in flow setup latency and a 42% reduction in controller operating costs.
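
As a rough illustration of the bin-packing view, here is a first-fit-decreasing sketch for two-dimensional (vector) packing of per-switch demands onto controller instances; the paper's heuristic additionally accounts for state-switch dependencies and strict time constraints, and the demand figures here are hypothetical.

```python
# First-fit-decreasing for vector bin packing: sort items by total demand,
# place each into the first controller "bin" with room in every dimension.
def first_fit_decreasing(items, capacity):
    """items: list of (name, (cpu, mem)) demands;
       capacity: (cpu, mem) budget per controller instance."""
    bins, loads = [], []
    for name, d in sorted(items, key=lambda kv: sum(kv[1]), reverse=True):
        for i, load in enumerate(loads):
            if all(l + x <= c for l, x, c in zip(load, d, capacity)):
                bins[i].append(name)
                loads[i] = tuple(l + x for l, x in zip(load, d))
                break
        else:                       # no existing bin fits: open a new one
            bins.append([name])
            loads.append(tuple(d))
    return bins

demands = [("s1", (0.4, 0.2)), ("s2", (0.3, 0.5)),
           ("s3", (0.5, 0.4)), ("s4", (0.2, 0.1))]
print(first_fit_decreasing(demands, capacity=(1.0, 1.0)))
# -> [['s3', 's2', 's4'], ['s1']]: two controller instances suffice.
```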
Article
In a queuing process, let 1/λ be the mean time between the arrivals of two consecutive units, L be the mean number of units in the system, and W be the mean time spent by a unit in the system. It is shown that if the three means are finite, the corresponding stochastic processes strictly stationary, and the arrival process metrically transitive with nonzero mean, then L = λW.
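
For concreteness, the result in display form, with a small worked instance:

```latex
\[
  L = \lambda W
\]
% Worked instance: with arrival rate $\lambda = 2$ requests per second and
% mean time in system $W = 3$ seconds, the mean number of requests in the
% system is $L = 2 \times 3 = 6$.
```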
Article
Big data, with their promise to discover valuable insights for better decision making, have recently attracted significant interest from both academia and industry. Voluminous data are generated from a variety of users and devices, and are to be stored and processed in powerful data centers. As such, there is a strong demand for building an unimpeded network infrastructure to gather geographically distributed and rapidly generated data, and move them to data centers for effective knowledge discovery. The express network should also be seamlessly extended to interconnect multiple data centers as well as interconnect the server nodes within a data center. In this article, we take a close look at the unique challenges in building such a network infrastructure for big data. Our study covers each and every segment in this network highway: the access networks that connect data sources, the Internet backbone that bridges them to remote data centers, as well as the dedicated network among data centers and within a data center. We also present two case studies of real-world big data applications that are empowered by networking, highlighting interesting and promising future research directions.
Conference Paper
Cloud computing realises the vision of utility computing. Tenants can benefit from on-demand provisioning of computational resources according to a pay-per-use model and can outsource hardware purchases and maintenance. Tenants, however, have only limited ...
Conference Paper
Distributed controllers have been proposed for Software Defined Networking to address the issues of scalability and reliability that a centralized controller suffers from. One key limitation of the distributed controllers is that the mapping between a switch and a controller is statically configured, which may result in uneven load distribution among the controllers. To address this problem, we propose ElastiCon, an elastic distributed controller architecture in which the controller pool is dynamically grown or shrunk according to traffic conditions and the load is dynamically shifted across controllers. We propose a novel switch migration protocol for enabling such load shifting, which conforms to the OpenFlow standard. We also build a prototype to demonstrate the efficacy of our design.
Conference Paper
OpenFlow assumes a logically centralized controller, which ideally can be physically distributed. However, current deployments rely on a single controller which has major drawbacks including lack of scalability. We present HyperFlow, a distributed event-based control plane for OpenFlow. HyperFlow is logically centralized but physically distributed: it provides scalability while keeping the benefits of network control centralization. By passively synchronizing network-wide views of OpenFlow controllers, HyperFlow localizes decision making to individual controllers, thus minimizing the control plane response time to data plane requests. HyperFlow is resilient to network partitioning and component failures. It also enables interconnecting independently managed OpenFlow networks, an essential feature missing in current OpenFlow deployments. We have implemented HyperFlow as an application for NOX. Our implementation requires minimal changes to NOX, and allows reuse of existing NOX applications with minor modifications. Our preliminary evaluation shows that, assuming sufficient control bandwidth, to bound the window of inconsistency among controllers by a factor of the delay between the farthest controllers, the network changes must occur at a rate lower than 1000 events per second across the network.
Article
Motivated by the increasing popularity of learning and predicting human user behavior in communication and computing systems, in this paper, we investigate the fundamental benefit of predictive scheduling, i.e., predicting and pre-serving arrivals, in controlled queueing systems. Based on a lookahead window prediction model, we first establish a novel equivalence between the predictive queueing system with a fully-efficient scheduling scheme and an equivalent queueing system without prediction. This connection allows us to analytically demonstrate that predictive scheduling necessarily improves system delay performance and can drive it to zero with increasing prediction power. We then propose the Predictive Backpressure (PBP) algorithm for achieving optimal utility performance in such predictive systems. PBP efficiently incorporates prediction into stochastic system control and avoids the great complication due to the exponential state-space growth in the prediction window size. We show that PBP can achieve a utility performance that is within O(ε) of the optimal, for any ε > 0, while guaranteeing that the system delay distribution is a shifted-to-the-left version of that under the original Backpressure algorithm. Hence, the average packet delay under PBP is strictly better than that under Backpressure, and vanishes with increasing prediction window size. This implies that the resulting utility-delay tradeoff with predictive scheduling beats the known optimal [O(ε), O(log(1/ε))] tradeoff for systems without prediction.
Article
Limiting the overhead of frequent events on the control plane is essential for realizing a scalable Software-Defined Network. One way of limiting this overhead is to process frequent events in the data plane. This requires modifying switches and comes at the cost of visibility in the control plane. Taking an alternative route, we propose Kandoo, a framework for preserving scalability without changing switches. Kandoo has two layers of controllers: (i) the bottom layer is a group of controllers with no interconnection, and no knowledge of the network-wide state, and (ii) the top layer is a logically centralized controller that maintains the network-wide state. Controllers at the bottom layer run only local control applications (i.e., applications that can function using the state of a single switch) near datapaths. These controllers handle most of the frequent events and effectively shield the top layer. Kandoo's design enables network operators to replicate local controllers on demand and relieve the load on the top layer, which is the only potential bottleneck in terms of scalability. Our evaluations show that a network controlled by Kandoo has an order of magnitude lower control channel consumption compared to normal OpenFlow networks.
Conference Paper
Although there is tremendous interest in designing improved networks for data centers, very little is known about the network-level traffic characteristics of data centers today. In this paper, we conduct an empirical study of the network traffic in 10 data centers belonging to three different categories, including university, enterprise campus, and cloud data centers. Our definition of cloud data centers includes not only data centers employed by large online service providers offering Internet-facing applications but also data centers used to host data-intensive (MapReduce-style) applications. We collect and analyze SNMP statistics, topology and packet-level traces. We examine the range of applications deployed in these data centers and their placement, the flow-level and packet-level transmission properties of these applications, and their impact on network and link utilizations, congestion and packet drops. We describe the implications of the observed traffic patterns for data center internal traffic engineering as well as for recently proposed architectures for data center networks.
Conference Paper
Today's data centers may contain tens of thousands of computers with significant aggregate bandwidth requirements. The network architecture typically consists of a tree of routing and switching elements with progressively more specialized and expensive equipment moving up the network hierarchy. Unfortunately, even when deploying the highest-end IP switches/routers, resulting topologies may only support 50% of the aggregate bandwidth available at the edge of the network, while still incurring tremendous cost. Non-uniform bandwidth among data center nodes complicates application design and limits overall system performance. In this paper, we show how to leverage largely commodity Ethernet switches to support the full aggregate bandwidth of clusters consisting of tens of thousands of elements. Similar to how clusters of commodity computers have largely replaced more specialized SMPs and MPPs, we argue that appropriately architected and interconnected commodity switches may deliver more performance at less cost than available from today's higher-end solutions. Our approach requires no modifications to the end host network interface, operating system, or applications; critically, it is fully backward compatible with Ethernet, IP, and TCP.
Book
This text presents a modern theory of analysis, control, and optimization for dynamic networks. Mathematical techniques of Lyapunov drift and Lyapunov optimization are developed and shown to enable constrained optimization of time averages in general stochastic systems. The focus is on communication and queueing systems, including wireless networks with time-varying channels, mobility, and randomly arriving traffic. A simple drift-plus-penalty framework is used to optimize time averages such as throughput, throughput-utility, power, and distortion. Explicit performance-delay tradeoffs are provided to illustrate the cost of approaching optimality. This theory is also applicable to problems in operations research and economics, where energy-efficient and profit-maximizing decisions must be made without knowing the future. Topics in the text include the following: • Queue stability theory • Backpressure, max-weight, and virtual queue methods • Primal-dual methods for non-convex stochastic utility maximization • Universal scheduling theory for arbitrary sample paths • Approximate and randomized scheduling theory • Optimization of renewal systems and Markov decision systems Detailed examples and numerous problem set questions are provided to reinforce the main concepts.
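
The workhorse of this framework is the per-slot drift-plus-penalty minimization; a standard statement of the rule and its performance guarantee, using the book's quadratic Lyapunov function, is:

```latex
\[
  L(\Theta(t)) = \tfrac{1}{2}\sum_i Q_i(t)^2, \qquad
  \Delta(\Theta(t)) = \mathbb{E}\bigl\{ L(\Theta(t+1)) - L(\Theta(t)) \,\big|\, \Theta(t) \bigr\}.
\]
Each slot, choose the action minimizing a bound on
\[
  \Delta(\Theta(t)) + V\,\mathbb{E}\bigl\{ p(t) \,\big|\, \Theta(t) \bigr\},
\]
where $p(t)$ is the penalty (e.g., cost or power). This drives the
time-averaged penalty to within $O(1/V)$ of optimal while keeping
time-averaged backlog, and hence delay, at $O(V)$: the explicit
performance-delay tradeoff mentioned above.
```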
Article
Industry experience indicates that the ability to incrementally expand data centers is essential. However, existing high-bandwidth network designs have rigid structure that interferes with incremental expansion. We present Jellyfish, a high-capacity network interconnect which, by adopting a random graph topology, lends itself naturally to incremental expansion. Somewhat surprisingly, Jellyfish is more cost-efficient than a fat-tree: a Jellyfish interconnect built using the same equipment as a fat-tree supports as many as 25% more servers at full capacity at the scale of a few thousand nodes, and this advantage improves with scale. Jellyfish also allows great flexibility in building networks with different degrees of oversubscription. However, Jellyfish's unstructured design brings new challenges in routing, physical layout, and wiring. We describe and evaluate approaches that resolve these challenges effectively, indicating that Jellyfish could be deployed in today's data centers.