Thesis

Network-Layer Protocols for Data Center Scalability

Abstract

With the growth of demand for computing resources, data center architectures are growing both in scale and in complexity. In this context, this thesis takes a step back as compared to traditional network approaches, and shows that providing generic primitives directly within the network layer is an effective way to improve the efficiency of resource usage, while decreasing network traffic and management overhead. Building on two recently-introduced network architectures, Segment Routing (SR) and Bit-Indexed Explicit Replication (BIER), network-layer protocols are designed and analyzed to provide three high-level functions: (1) task mobility, (2) reliable content distribution, and (3) load-balancing.

First, task mobility is achieved by using SR to provide a zero-loss virtual machine migration service. This opens the opportunity to study how to orchestrate task placement and migration while aiming at (i) maximizing the inter-task throughput, (ii) maximizing the number of newly-placed tasks, and (iii) minimizing the number of tasks to be migrated.

Second, reliable content distribution is achieved by using BIER to provide a reliable multicast protocol, in which retransmissions of lost packets are targeted towards the precise set of destinations having missed that packet, thus incurring minimal traffic overhead. To decrease the load on the source link, this is then extended to enable retransmissions by local peers from the same group, with SR as a helper to find a suitable retransmission candidate.

Third, load-balancing is achieved by using SR to steer queries through several candidate application instances, each of which takes a local decision as to whether to accept the query, thus achieving better fairness as compared to centralized approaches. The feasibility of a hardware implementation of this approach is investigated, and a solution using covert channels to transparently convey information to the load-balancer is implemented for a state-of-the-art programmable network card. Finally, the possibility of providing autoscaling as a network service is investigated: by letting queries traverse a fixed chain of application instances using SR, autoscaling is triggered by the last instance, depending on its local state.
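To make the common building block concrete, below is a minimal Python sketch of the SRv6 forwarding model that underpins all three functions: a packet carries an ordered segment list, and each segment endpoint activates the next segment. All names (Packet, process_at_segment_endpoint, the instance names) are illustrative assumptions of this sketch, not artifacts of the thesis.

    # Minimal model of IPv6 Segment Routing (SRv6) forwarding: a packet
    # carries an ordered segment list, and each listed node either consumes
    # the next segment or forwards towards the currently active one.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Packet:
        destination: str  # current IPv6 destination (active segment)
        segments: List[str] = field(default_factory=list)  # remaining, last-to-first
        payload: str = ""

    def process_at_segment_endpoint(packet: Packet) -> Packet:
        """SRv6 endpoint behaviour: activate the next segment, if any."""
        if packet.segments:
            packet.destination = packet.segments.pop()  # segments_left -= 1
        return packet

    # A query steered through two candidate application instances before
    # its final destination, mirroring the SR-based steering described above.
    pkt = Packet(destination="app-instance-1",
                 segments=["server", "app-instance-2"],
                 payload="query")
    while pkt.segments:
        print(f"at {pkt.destination}, forwarding on")
        pkt = process_at_segment_endpoint(pkt)
    print(f"delivered to {pkt.destination}")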
Article
Cloud architectures achieve scaling through two main functions: (i) load-balancers, which dispatch queries among replicated virtualized application instances, and (ii) autoscalers, which automatically adjust the number of replicated instances to accommodate variations in load patterns. These functions are often provided through centralized load monitoring, incurring operational complexity. This paper introduces a unified architecture, free of centralized monitoring, achieving both autoscaling and load-balancing, reducing operational overhead while improving response time. Application instances are virtually ordered in a chain, and new queries are forwarded along this chain until an instance, based on its local load, accepts the query. Autoscaling is triggered by the last application instance, which inspects its average load and infers whether its chain is under- or over-provisioned. An analytical model of the system is derived, and proves that the proposed technique can achieve asymptotic zero-wait time with high (and controllable) probability. This result is confirmed by extensive simulations, which highlight close-to-ideal performance in terms of both response time and resource costs.
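As a rough illustration of the mechanism described above, the following Python sketch simulates the chain dispatch and the tail-triggered scaling decision. The capacity and threshold values, and the absence of query completions, are simplifying assumptions of this sketch, not the paper's model.

    # Queries walk an ordered chain of instances; each instance accepts if
    # locally under capacity, and the tail uses its own load to scale up.
    CAPACITY = 4       # assumed per-instance concurrent-query budget
    SCALE_UP_LOAD = 3  # assumed threshold on the tail instance's load

    class Instance:
        def __init__(self, name):
            self.name, self.load = name, 0
        def try_accept(self):
            if self.load < CAPACITY:
                self.load += 1
                return True
            return False

    def dispatch(chain):
        for instance in chain:
            if instance.try_accept():
                return instance
        return None  # chain saturated (query completions omitted in this toy)

    chain = [Instance(f"i{k}") for k in range(3)]
    for q in range(12):
        accepted_by = dispatch(chain)
        print(q, "->", accepted_by.name if accepted_by else "overflow")
        # The autoscaling decision is purely local to the tail of the chain:
        if chain[-1].load >= SCALE_UP_LOAD:
            chain.append(Instance(f"i{len(chain)}"))
            print("tail overloaded: scaled up to", len(chain), "instances")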
Conference Paper
Datacenter load balancers (or muxes) steer traffic destined to a given service across a dynamic set of backend machines. To ensure consistent load balancing decisions when backends join or leave, existing solutions make a load balancing decision per connection and then store it as per-connection state to be used for future packets. While simple to implement, per-connection state is brittle: SYN-flood attacks easily fill state memory, preventing muxes from keeping state for good connections. We present Beamer, a datacenter load-balancer designed for stateless mux operation. The key idea is to leverage the connection state already stored in backend servers to ensure that connections are never dropped under churn: when a server receives a mid-connection packet for which it doesn't have state, it forwards it to another server that should have state for the packet. Stateless load balancing brings many benefits: our software implementation of Beamer is twice as fast as Google's Maglev, the state-of-the-art software load balancer, and can process 40 Gbps of HTTP uplink traffic on 7 cores. Beamer is simple to deploy both in software and in hardware, as our P4 implementation shows. Finally, Beamer allows arbitrary scale-out and scale-in events without dropping any connections.
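The daisy-chaining idea can be sketched in a few lines of Python. Bucket layout, names, and the in-process "forwarding" are assumptions made for illustration; Beamer's actual data plane carries the previous bucket owner in packet headers.

    # The mux hashes each connection to a bucket; every bucket records its
    # current and previous owner, so a server lacking state for a packet
    # forwards it to the previous owner instead of resetting the connection.
    NUM_BUCKETS = 8
    buckets = {b: {"owner": "serverA", "prev": None} for b in range(NUM_BUCKETS)}
    server_state = {"serverA": set(), "serverB": set()}  # established connections

    def mux_forward(conn_id):
        return buckets[hash(conn_id) % NUM_BUCKETS]

    def server_receive(server, conn_id, bucket):
        if conn_id in server_state[server]:
            return f"{server} handles {conn_id}"
        if bucket["prev"]:  # daisy chain: previous owner should hold the state
            return server_receive(bucket["prev"], conn_id, {"prev": None})
        server_state[server].add(conn_id)  # new connection (SYN)
        return f"{server} accepts new {conn_id}"

    print(server_receive("serverA", "c1", mux_forward("c1")))  # SYN at serverA
    for b in buckets.values():                                 # churn: reassign
        b["prev"], b["owner"] = b["owner"], "serverB"
    print(server_receive("serverB", "c1", mux_forward("c1")))  # mid-connection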
Article
BIER (Bit-Indexed Explicit Replication) alleviates the operational complexities of multicast protocols (associated with the multicast tree and the incurred state in intermediate routers) by allowing for source-driven, per-packet destination selection, efficient encoding thereof in packet headers, and stateless forwarding along shortest-path multicast trees. BIER per-packet destination selection enables efficient reliable multicast delivery: packets not received by a subset of intended destinations can be efficiently BIER-retransmitted to only that subset. While BIER-based reliable multicast exhibits attractive performance attributes, relying on source retransmissions for packet recovery may be costly, or even unnecessary, if topologically close peers are able to provide a copy of the packet. Thus, this paper extends the use of reliable BIER multicast to allow recovery also from peers, using Segment Routing (SR) to steer retransmission requests through a set of potential (local) candidates, before requesting retransmissions from the source as a last resort only. A general framework is introduced, which can accommodate different policies for the selection of candidate peers for retransmissions. Simple (both static and adaptive) policies are introduced and analyzed, both (i) theoretically and (ii) by way of simulations in data-center-like and real-world topologies. Results indicate that local peer recovery is able to substantially reduce the overall retransmission traffic, and that this can be achieved through simple policies, where no signalling is required to build a set of candidate peers.
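A hedged Python sketch of the bitstring-targeted retransmission described above (the destination-to-bit mapping and function names are illustrative; in BIER the bitstring travels in the packet header and routers fan out copies accordingly):

    # Each destination maps to one bit; the source ORs together the bits of
    # destinations that NACKed a packet and retransmits once with exactly
    # that bitstring, instead of re-multicasting to the whole group.
    destinations = {"d0": 1 << 0, "d1": 1 << 1, "d2": 1 << 2, "d3": 1 << 3}

    def retransmit_bitstring(nacking_destinations):
        bits = 0
        for dest in nacking_destinations:
            bits |= destinations[dest]
        return bits

    # Packet 17 was lost by d1 and d3 only:
    bits = retransmit_bitstring(["d1", "d3"])
    print(f"retransmit pkt 17 with bitstring {bits:04b}")  # -> 1010
    targets = [d for d, bit in destinations.items() if bits & bit]
    print("delivered only to:", targets)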
Conference Paper
Internet users consume increasing quantities of video content, with higher Quality of Experience (QoE) expectations. Network scalability thus becomes a critical problem for video delivery, as traditional Content Delivery Networks (CDN) struggle to cope with the demand. In particular, content-awareness has been touted as a tool for scaling CDNs through clever request and content placement. Building on that insight, we propose a network paradigm that provides application-awareness in the network layer, enabling the offload of CDN decisions to the data plane. Namely, it uses chunk-level identifiers encoded into IPv6 addresses. These identifiers are used to perform network-layer cache admission by estimating the popularity of requests with a Least-Recently-Used (LRU) filter. Popular requests are then served from the edge cache, while unpopular requests are directly redirected to the origin server, circumventing the HTTP proxy. The parameters of the filter are optimized through analytical modeling and validated via both simulation and experimentation with a testbed featuring real cache servers. This yields improvements in QoE while decreasing the hardware requirements on the edge cache. Specifically, for a typical content distribution, our evaluation shows a 22% increase in hit rate, a 36% decrease in chunk download time, and a 37% decrease in cache server CPU load.
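The LRU-filter admission policy can be illustrated as follows in Python (the filter size and the request trace are arbitrary assumptions; the paper optimizes the filter parameters analytically):

    # A chunk is admitted to the edge cache only if its identifier is already
    # present in a small LRU filter, i.e. only recently-seen ("popular")
    # requests reach the cache; others go straight to the origin.
    from collections import OrderedDict

    class LRUFilter:
        def __init__(self, size):
            self.size, self.entries = size, OrderedDict()
        def seen_recently(self, chunk_id):
            hit = chunk_id in self.entries
            self.entries[chunk_id] = True
            self.entries.move_to_end(chunk_id)
            while len(self.entries) > self.size:
                self.entries.popitem(last=False)  # evict least recently used
            return hit

    lru_filter = LRUFilter(size=2)
    for chunk in ["a", "b", "a", "c", "a"]:
        where = "edge cache" if lru_filter.seen_recently(chunk) else "origin server"
        print(f"request for {chunk}: served by {where}")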
Conference Paper
With the development of large-scale data centers, Virtual Machine (VM) migration is a key component for resource optimization, cost reduction, and maintenance. From a network perspective, traditional VM migration mechanisms rely on the hypervisor running at the destination host advertising the new location of the VM once migration is complete. However, this creates a period of time during which the VM is not reachable, yielding packet loss. This paper introduces a method to perform zero-loss VM migration by using IPv6 Segment Routing (SR). Rather than letting the hypervisor update a locator mapping after VM migration is complete, a logical path consisting of the source and destination hosts is pre-provisioned. Packets destined to the migrating VM are sent through this path using SR, shortly before, during, and shortly after migration: the virtual router on the source host is in charge of forwarding packets locally if the VM migration has not completed yet, or to the destination host otherwise. The proposed mechanism is implemented as a VPP plugin, and the feasibility of zero-loss VM migration is demonstrated with various workloads. Evaluation shows that this yields benefits in terms of session opening latency and TCP throughput.
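A toy Python model of the forwarding decision at the source host's virtual router (names are illustrative; the actual implementation is a VPP plugin operating on SRv6 headers):

    # During migration, every packet for the VM is steered through the
    # pre-provisioned SR path [source_host, destination_host]; the source
    # host delivers locally while the VM is still there, and otherwise
    # lets the packet continue towards the destination host.
    vm_location = "source_host"  # flips when migration completes

    def source_host_vrouter(packet):
        if vm_location == "source_host":
            return f"{packet['dst']}: delivered locally at source_host"
        return f"{packet['dst']}: forwarded along SR path to destination_host"

    print(source_host_vrouter({"dst": "vm1"}))  # before migration completes
    vm_location = "destination_host"            # migration completes
    print(source_host_vrouter({"dst": "vm1"}))  # same path, no loss window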
Conference Paper
P4 has emerged as the de facto standard language for describing how network packets should be processed, and is becoming widely used by network owners, systems developers, researchers and in the classroom. The goal of the work presented here is to make it easier for engineers, researchers and students to learn how to program using P4, and to build prototypes running on real hardware. Our target is the NetFPGA SUME platform, a 4x10 Gb/s PCIe card designed for use in universities for teaching and research. Until now, NetFPGA users have needed to learn an HDL such as Verilog or VHDL, making it off limits to many software developers and students. Therefore, we developed the P4->NetFPGA workflow, allowing developers to describe how packets are to be processed in the high-level P4 language, then compile their P4 programs to run at line rate on the NetFPGA SUME board. The P4->NetFPGA workflow is built upon the Xilinx P4-SDNet compiler and the NetFPGA SUME open source code base. In this paper, we provide an overview of the P4 programming language and describe the P4->NetFPGA workflow. We also describe how the workflow is being used by the P4 community to build research prototypes, and to teach how network systems are built by providing students with hands-on experience working with real hardware.
Article
In the last decade, a number of frameworks have appeared that implement, directly in user space with kernel bypass, high-speed software data plane functionalities on commodity hardware. Vector Packet Processor (VPP) is one such framework, representing an interesting point in the design space in that it offers, in user-space networking, the flexibility of a modular router (Click and variants) together with the benefits provided by techniques such as batch processing that have become commonplace in high-speed networking stacks (such as netmap or DPDK). Similarly to Click, VPP lets users arrange functions as a processing graph, providing a full-blown stack of network functions. However, unlike Click, where the whole graph is traversed for each packet, in VPP each traversed node processes all packets in the batch (called a vector) before moving to the next node. This design choice enables several code optimizations that greatly improve the achievable processing throughput. This article introduces the main VPP concepts and architecture, and experimentally evaluates the impact of design choices (such as batch packet processing) on performance.
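The per-vector processing model can be contrasted with per-packet traversal in a short Python sketch (node functions are placeholders; real VPP nodes are C functions operating on vectors of packet indices):

    # Toy three-node processing graph.
    def parse(pkt):   return pkt + ["parsed"]
    def lookup(pkt):  return pkt + ["looked-up"]
    def forward(pkt): return pkt + ["forwarded"]

    graph = [parse, lookup, forward]

    # Click-style: each packet traverses the whole graph before the next one.
    packets = [[f"pkt{i}"] for i in range(4)]
    results_per_packet = []
    for pkt in packets:
        for node in graph:
            pkt = node(pkt)
        results_per_packet.append(pkt)

    # VPP-style: each node processes the entire vector before the graph
    # advances, keeping the node's instructions and data warm in CPU caches.
    vector = [[f"pkt{i}"] for i in range(4)]
    for node in graph:
        vector = [node(pkt) for pkt in vector]

    assert results_per_packet == vector  # same result, different traversal order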
Book
The Illustrated Network: How TCP/IP Works in a Modern Network, Second Edition presents an illustrated explanation of how TCP/IP works, using consistent examples from a working network configuration that includes servers, routers, and workstations. Diagnostic traces allow the reader to follow the discussion with unprecedented clarity and precision. True to its title, there are 330+ diagrams and screenshots, as well as topology diagrams and a unique repeating chapter-opening diagram. Illustrations are also used as end-of-chapter questions. Based on examples of a complete and modern network, all the material comes from real devices connected and running on the network. The book emphasizes the similarities across all networks, since all share similar components, from the smallest LAN to the global Internet. Layered protocols are the rule, and all hosts attached to the Internet run certain core protocols to enable their applications to function properly. This second edition includes updates throughout, along with four completely new chapters that introduce developments that have occurred since the publication of the first edition, including optical networking, cloud concepts, and VXLAN.
The Illustrated Network: How TCP/IP Works in a Modern Network, Second Edition presents an illustrated explanation on how TCP/IP works, using consistent examples from a working network configuration that includes servers, routers and workstations. Diagnostic traces allow the reader to follow the discussion with unprecedented clarity and precision. True to its title, there are 330+ diagrams and screenshots, as well as topology diagrams and a unique repeating chapter opening diagram. Illustrations are also used as end-of-chapter questions. Based on examples of a complete and modern network, all the material comes from real objects connected and running on the network. The book emphasizes the similarities across all networks, since all share similar components, from the smallest LAN to the global internet. Layered protocols are the rule, and all hosts attached to the Internet run certain core protocols to enable their applications to function properly. This second edition includes updates throughout, along with four completely new chapters that introduce developments that have occurred since the publication of the first edition, including optical networking, cloud concepts and VXLAN.