Conference Paper

# Forwarding Metamorphosis: Fast Programmable Match-Action Processing in Hardware for SDN


## Abstract

In Software Defined Networking (SDN) the control plane is physically separate from the forwarding plane. Control software programs the forwarding plane (e.g., switches and routers) using an open interface, such as OpenFlow. This paper aims to overcome two limitations in current switching chips and the OpenFlow protocol: i) current hardware switches are quite rigid, allowing "Match-Action" processing on only a fixed set of fields, and ii) the OpenFlow specification only defines a limited repertoire of packet processing actions. We propose the RMT (reconfigurable match tables) model, a new RISC-inspired pipelined architecture for switching chips, and we identify the essential minimal set of action primitives to specify how headers are processed in hardware. RMT allows the forwarding plane to be changed in the field without modifying hardware. As in OpenFlow, the programmer can specify multiple match tables of arbitrary width and depth, subject only to an overall resource limit, with each table configurable for matching on arbitrary fields. However, RMT allows the programmer to modify all header fields much more comprehensively than in OpenFlow. Our paper describes the design of a 64 port by 10 Gb/s switch chip implementing the RMT model. Our concrete design demonstrates, contrary to concerns within the community, that flexible OpenFlow hardware switch implementations are feasible at almost no additional cost or power.
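The match-action model the abstract describes can be sketched in plain Python: a pipeline of tables, each matching on configurable header fields and applying a primitive action to the packet's header vector. This is an illustrative sketch only; the names (`Table`, `Pipeline`, `set_port`) are hypothetical and not taken from the paper or from any switch API.

```python
class Table:
    """One logical match table: match on a set of header fields, run an action."""
    def __init__(self, match_fields, rules):
        # rules: list of (match_values, action) pairs; an action mutates headers
        self.match_fields = match_fields
        self.rules = rules

    def apply(self, headers):
        key = tuple(headers.get(f) for f in self.match_fields)
        for match_values, action in self.rules:
            if key == match_values:
                action(headers)
                return
        # No rule matched: the packet falls through unchanged.

class Pipeline:
    """A sequence of match tables applied in order, as in a pipelined ASIC."""
    def __init__(self, tables):
        self.tables = tables

    def process(self, headers):
        for table in self.tables:
            table.apply(headers)
        return headers

# Example: a forwarding table matching on the IPv4 destination address
# and setting the egress port as its action.
def set_port(p):
    def action(h):
        h["egress_port"] = p
    return action

fwd = Table(["ipv4_dst"], [(("10.0.0.1",), set_port(3))])
pipe = Pipeline([fwd])
out = pipe.process({"ipv4_dst": "10.0.0.1"})
```

The point of RMT is that in hardware both the match fields and the action set of each table are reconfigurable in the field, rather than fixed at tape-out.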


... Programmable switches, which allow the data plane behavior to be reconfigured, provide the necessary flexibility. The RMT-based Protocol-Independent Switch Architecture (PISA) [10] has emerged as the de facto standard for programmable switch architecture. ...
... Banzai provides implementations only for the functional units, not for the entire switch chip, so we are unable to directly evaluate the impact of our modifications on the full chip design. However, prior work suggests that ALUs take up only a small portion (i.e., ∼ 10%) of the power/area budget for the entire chip [10]; from this we infer that our modifications would have negligible impact. In other words, this hardware enhancement is feasible today, and is unlikely to become a bottleneck in future hardware generations. ...
... Extending switches' processing capability. Proposed enhancements to the RMT architecture [10] include transactions [105], disaggregated memory [15], and better stateful data plane support [29]. While many focus on improving stateful computations, none address floating point operations. ...
Preprint
Full-text available
The advent of switches with programmable dataplanes has enabled the rapid development of new network functionality, as well as providing a platform for acceleration of a broad range of application-level functionality. However, existing switch hardware was not designed with application acceleration in mind, and thus applications requiring operations or datatypes not used in traditional network protocols must resort to expensive workarounds. Applications involving floating point data, including distributed training for machine learning and distributed query processing, are key examples. In this paper, we propose FPISA, a floating point representation designed to work efficiently in programmable switches. We first implement FPISA on an Intel Tofino switch, but find that it has limitations that impact throughput and accuracy. We then propose hardware changes to address these limitations based on the open-source Banzai switch architecture, and synthesize them in a 15-nm standard-cell library to demonstrate their feasibility. Finally, we use FPISA to implement accelerators for training for machine learning and for query processing, and evaluate their performance on a switch implementing our changes using emulation. We find that FPISA allows distributed training to use 25-75% fewer CPU cores and provide up to 85.9% better throughput in a CPU-constrained environment than SwitchML. For distributed query processing with floating point data, FPISA enables up to 2.7x better throughput than Spark.
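The FPISA representation itself is not detailed in this excerpt, but the underlying difficulty it addresses is that switch ALUs operate on integers. As background, the sign/exponent/mantissa fields of an IEEE 754 float can be peeled apart with exactly the shift-and-mask integer operations such hardware supports; the function name below is illustrative, not from FPISA.

```python
import struct

def float_fields(x):
    # Reinterpret the IEEE 754 single-precision bits of x as a 32-bit
    # integer, then extract sign, biased exponent, and mantissa using
    # only shifts and masks -- the style of operation an integer-only
    # switch ALU can perform.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return sign, exponent, mantissa

# 1.0 = +1.0 * 2^0: sign 0, biased exponent 127, mantissa 0
sign, exp, man = float_fields(1.0)
```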
... We also introduce a new primitive, reset, which models the behavior of P4 between pipeline stages. In many switch architectures [Bosshart et al. 2013], packets are deparsed and then reparsed between pipelines-e.g., after ingress and before egress. The reset command encodes the behavior of the inner step: it combines the deparsed bits with the packet's unparsed payload and passes it along as the input to the next stage. ...
... P4 is a domain-specific programming language for specifying the behavior of network data planes. It is designed to be used with programmable devices such as PISA switches [Bosshart et al. 2013], FPGAs [Ibanez et al. 2019;Wang et al. 2017], or software devices (e.g., eBPF [Høiland-Jørgensen et al. 2018]). The language is based on a pipeline abstraction: given an input packet it executes a sequence of blocks of code, one per pipeline component, to produce the outputs. ...
Preprint
Full-text available
Programming languages like P4 enable specifying the behavior of network data planes in software. However, with increasingly powerful and complex applications running in the network, the risk of faults also increases. Hence, there is growing recognition of the need for methods and tools to statically verify the correctness of P4 code, especially as the language lacks basic safety guarantees. Type systems are a lightweight and compositional way to establish program properties, but there is a significant gap between the kinds of properties that can be proved using simple type systems (e.g., SafeP4) and those that can be obtained using full-blown verification tools (e.g., p4v). In this paper, we close this gap by developing $\Pi$4, a dependently-typed version of P4 based on decidable refinements. We motivate the design of $\Pi$4, prove the soundness of its type system, develop an SMT-based implementation, and present case studies that illustrate its applicability to a variety of data plane programs.
... In this paper, we propose a data-plane algorithm that produces a provably unbiased delay distribution, specifically designed for programmable switches using the Protocol Independent Switch Architecture (PISA) [4]. Our algorithms tackle the bias by keeping track of the probability of getting each sample, and applying a correction factor inversely proportional to this probability when computing the distribution. ...
... 4.1 Using many pipeline stages per fridge. On a PISA programmable switch [4], we are limited to accessing only one index per register array when processing a packet. Algorithms often span multiple pipeline stages and allocate multiple register arrays to improve performance. ...
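The bias-correction idea in the excerpt above — track each sample's keep-probability and weight it by the inverse when building the distribution — can be sketched in a few lines. This is a hypothetical host-side illustration of inverse-probability weighting, not the paper's data-plane algorithm.

```python
def weighted_histogram(samples):
    # samples: list of (delay, p) pairs, where p is the probability with
    # which that sample was kept. Adding 1/p instead of 1 undoes the
    # sampling bias, so delays that are hard to sample are not
    # under-counted in the resulting distribution.
    hist = {}
    for delay, p in samples:
        hist[delay] = hist.get(delay, 0.0) + 1.0 / p
    return hist

# A delay value kept only half the time contributes double weight,
# restoring its true share of the distribution in expectation.
hist = weighted_histogram([(1, 1.0), (2, 0.5)])
```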
... Reconfigurable Match Tables (RMTs): The RMT pipelined architecture proposed by Forwarding Metamorphosis [4] consists of: a parser that produces a packet header vector, a series of logical stages that perform Match-Action operations, a recombination block to reattach headers to packets, and configurable output queues. Logical stages are mapped to one, multiple, and/or fractions of physical stages. ...
... This was chosen using lean levels; the details are omitted for space reasons. Previous optimal hybrid trees were used; the conversion factor for hybridization was 3 and 8 for IPv4 and IPv6 respectively. ...
Preprint
Full-text available
Ternary content addressable memories (TCAMs) are commonly used to implement IP lookup, but suffer from high power and area costs. Thus the TCAM included in modern chips is limited: it can support moderately large datasets in data centers and enterprises, but fails to scale to backbone WAN databases of millions of prefixes. IPv6 deployment also makes it harder to deploy TCAMs because of the larger prefixes used in the 128-bit address space. While the combination of algorithmic techniques and TCAM has been proposed before for reducing power consumption or update costs (e.g., CoolCAM [32] and TreeCAM [28]), we focus on reducing TCAM bits using a scheme we call MashUp that can easily be implemented in modern reconfigurable pipeline chips such as Tofino-3. MashUp uses a new technique, tiling trees, which takes into account TCAM grain (tile) sizes. When applied to a publicly available IPv6 dataset using Tofino-3 TCAM grain sizes (44 by 512), there was a 2X reduction in TCAM required. Further, if we mix TCAM and SRAM using a new technique we call node hybridization, MashUp decreases TCAM bits by 4.5X for IPv6, and by 7.5X for IPv4, allowing wide area databases of 900,000 prefixes to be supported by Tofino-3 and similar chips.
... The advent of fully programmable switches [20,21] and high-level programming languages [22] has solved the problem of poor scalability of traditional switches. Nick McKeown proposed programming protocol-independent packet processors (P4) [23] and the corresponding forwarding model [24,25] that allows administrators to customize the packet-forwarding behavior of switches and improve the programmability of the data plane and the flexibility of packet processing. Signorello proposed NDN.P4 [26], which first implemented native NDN packet parsing and forwarding in a software switch [27] by using P4_14 [28]. ...
Article
Full-text available
Aiming at examining the problems of the low cache hit ratio and high-average routing hops in named data networking (NDN), this paper proposes a cache-optimization strategy based on dynamic popularity and replacement value. When the requested content arrives at the routing node, the latest popularity is calculated based on the number of requests in the current cycle and the popularity of the previous cycle. We adjust the node cache threshold according to the occupation of the node cache space and cache the content with a higher popularity than the threshold. When the cache is complete, the cache-optimization strategy considers the last request time, popularity, and transmission cost of cached content to calculate the replacement value of cached content. We move the content with the lowest replacement value out of the cache, and keep the content with a high replacement value. We deploy the proposed cache-optimization strategy by using a programmable language in a real network with programmable devices. The experimental results illustrate that the strategy proposed in this paper can effectively improve the cache hit ratio and reduce the average routing hops for user request responses compared with other traditional NDN caching strategies.
... This model is motivated by specialized TCAM hardware such as [15], and has been studied intensively [9], [10], [11], [14]. As detailed in [16], common programmable switch architectures such as the RMT and Intel's FlexPipe have tables of different types and in particular tables dedicated to longest prefix matching [17], [18]. In this setting a pattern is in fact a prefix of bits that matches all addresses that start with this prefix. ...
Preprint
Traffic splitting is a required functionality in networks, for example for load balancing over multiple paths or among different servers. The capacities of the servers determine the partition by which traffic should be split. A recent approach implements traffic splitting within the ternary content addressable memory (TCAM), which is often available in switches. It is important to reduce the amount of memory allocated for this task since TCAMs are power consuming and are often also required for other tasks such as classification and routing. Previous work showed how to compute the smallest prefix-matching TCAM necessary to implement a given partition exactly. In this paper we solve the more practical case, where at most $n$ prefix-matching TCAM rules are available, restricting the ability to implement exactly the desired partition. We give simple and efficient algorithms to find $n$ rules that generate a partition closest in $L_\infty$ to the desired one. We do the same for a one-sided version of $L_\infty$ which equals to the maximum overload on a server and for a relative version of it. We use our algorithms to evaluate how the expected error changes as a function of the number of rules, the number of servers, and the width of the TCAM.
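The connection between prefix-matching rules and traffic partitions can be made concrete: a rule that fixes the first ℓ bits of a w-bit address matches 2^(w−ℓ) addresses, so disjoint prefixes induce a partition of uniform traffic whose shares are powers of 1/2. The sketch below is illustrative only; the paper's contribution is choosing at most n such rules whose induced partition is closest in L∞ to a desired one.

```python
def prefix_share(prefix_len, width):
    # A ternary rule fixing the first prefix_len bits of a width-bit
    # address matches 2^(width - prefix_len) addresses, i.e. a
    # 1 / 2^prefix_len share of uniformly distributed traffic.
    return 2 ** (width - prefix_len) / 2 ** width

def partition(prefix_lens, width):
    # Disjoint prefixes induce a traffic partition; shares sum to at most 1.
    return [prefix_share(l, width) for l in prefix_lens]

# Splitting a 3-bit address space with the disjoint prefixes 0*, 10*, 11*
# yields shares 1/2, 1/4, 1/4 across three servers.
shares = partition([1, 2, 2], 3)
```

Because every achievable share is a sum of powers of 1/2, arbitrary server capacities generally cannot be matched exactly with a bounded number of rules, which is what motivates the approximation problem the abstract describes.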
... However, it targets fixed-function switches that support a predetermined set of header fields and actions. Data plane programmability [22,21] is one step towards more flexible switches whose data plane can be changed. ...
Preprint
Full-text available
Applications running in geographically distributed settings are becoming prevalent. Large-scale online services often share or replicate their data into multiple data centers (DCs) in different geographic regions. Driven by the data communication needs of these applications, the inter-datacenter network (IDN) is getting increasingly important. However, we find congestion control for inter-datacenter networks quite challenging. Firstly, inter-datacenter communication involves both data center networks (DCNs) and wide-area networks (WANs) connecting multiple data centers. Such a network environment presents quite heterogeneous characteristics (e.g., buffer depths, RTTs). Existing congestion control mechanisms consider either DCN or WAN congestion, while not simultaneously capturing the degree of congestion in both. Secondly, to reduce evolution cost and improve flexibility, large enterprises have been building and deploying their wide-area routers based on shallow-buffered switching chips. However, with legacy congestion control mechanisms (e.g., TCP Cubic), shallow buffers can easily get overwhelmed by large BDP (bandwidth-delay product) wide-area traffic, leading to high packet losses and degraded throughput. This thesis describes my research efforts on optimizing congestion control mechanisms for inter-datacenter networks. First, we design GEMINI, a practical congestion control mechanism that simultaneously handles congestion in both DCN and WAN. Second, we present FlashPass, a proactive congestion control mechanism that achieves near zero loss without degrading throughput under the shallow-buffered WAN. Extensive evaluation shows their superior performance over existing congestion control mechanisms.
... Wedge 100BF switches are driven by the Barefoot Tofino, a commodity multi-Terabit data plane ASIC that integrates recent designs for programmable line rate packet parsing [69], match-action forwarding [40], and stateful processing [168]. ...
Article
Modern networks can encompass over 100,000 servers. Managing such an extensive network with a diverse set of network policies has become more complicated with the introduction of programmable hardware and distributed network functions. Furthermore, service level agreements (SLAs) require operators to maintain high performance and availability with low latencies. Therefore, it is crucial for operators to resolve any issues in networks quickly. Problems can occur at any layer of the stack: network (load imbalance), data-plane (incorrect packet processing), control-plane (bugs in configuration), and the coordination among them. Unfortunately, existing debugging tools are not sufficient to monitor, analyze, or debug modern networks; either they lack visibility into the network, require manual analysis, or cannot check for some properties. These limitations arise from an outdated view of networks, i.e., that we can look at a single component in isolation. In this thesis, we describe a new approach that looks at measuring, understanding, and debugging the network across devices and time. We also target modern stateful packet processing devices: programmable data-planes and distributed network functions, as these become an increasingly common part of the network. Our key insight is to leverage both in-network packet processing (to collect precise measurements) and out-of-network processing (to coordinate measurements and scale analytics). The resulting systems we design based on this approach can support testing and monitoring at data center scale, and can handle stateful data in the network. We automate the collection and analysis of measurement data to save operator time and take a step towards self-driving networks.
... In the context of computer networks, many advances have occurred in the past decade, especially in terms of Software-Defined Networking (SDN). Such advances have led to the emergence of network programmability [11], [19] and have provided network administrators with the ability to reprogram the behavior of forwarding devices through Domain-Specific Languages (DSL) such as P4, POF, and Lyra [20]–[23]. In the same way that the networking infrastructure advanced in programmability, it also advanced in computational power. ...
Preprint
Network congestion and packet loss pose an ever-increasing challenge to video streaming. Despite the research efforts toward making video encoding schemes resilient to lossy network conditions, forwarding devices have not considered monitoring packet content to prioritize packets and minimize the impact of packet loss on video transmission. In this work, we advocate in favor of in-network computing, employing a packet drop algorithm and an in-network hardware module to devise a solution for improving content-aware video streaming in congested networks. Results show that our approach can reduce intra-predicted packet loss by over 80% at negligible resource usage and performance costs.
Article
Network programming languages (NPLs) empower operators to program network data planes (NDPs) with unprecedented efficiency. Currently, various NPLs and NDPs coexist and none will prevail over the others in the near future. Such diversity raises many problems, including: (1) programs written with different NPLs can hardly interoperate in the same network, (2) most NPLs are bound to specific NDPs, hindering their independent evolution, and (3) compilation techniques cannot be readily reused, resulting in much wasteful work. These problems are mostly owing to the lack of modularity in the compilers, where the missing part is an intermediate representation (IR) for NPLs. To this end, we propose Network Transaction Automaton (NTA), a highly expressive and language-independent IR, and show it can express the semantics of 7 mainstream NPLs. Then, we design CODER, a modular compiler based on NTA, which currently supports 2 NPLs and 3 NDPs. Experiments with real and synthetic programs show CODER can correctly compile those programs for real networks within moderate time.
Article
Network monitoring systems are designed to fulfill operators' intents and serve as essential tools for modern networks. As a result of today's rapidly increasing network bandwidth and scale, network monitors should satisfy on-demand network monitoring for continuously growing traffic volumes. However, existing monitoring systems either cannot satisfy flexible intents on demand or produce significant overheads. In this paper, we present Newton, an intent-driven traffic monitor that is able to specify operators' intents with traffic monitoring queries and conduct dynamic and scalable network-wide query deployment. Newton enables operators to customize and modify queries dynamically without interrupting the network workflow. Besides, Newton proposes systematic optimizations at the device level and network-wide level to reduce resource consumption while deploying queries. Newton can combine resources across switches to deploy complex queries with high resilience to dynamic network status. Evaluations show that Newton offers high flexibility, scalability, and resource efficiency, demonstrating that Newton is a promising fit for deployment in large-scale programmable networks.
Article
Full-text available
Space information networks (SINs) are network systems that can receive, transmit, and process spatial information in real time. They use satellites, stratospheric airships, Unmanned Aerial Vehicles, and other platforms as carriers. SINs support high-dynamic, real-time broadband transmission of Earth observations and ultra-long-distance, long-delay reliable transmission for deep space exploration. The deeper the network integration, the greater the system's security concerns and the more likely SINs are to be compromised or disabled. How to integrate new IT technologies such as artificial intelligence, digital twins, and blockchain into the diverse application scenarios of SINs while maintaining SIN cybersecurity will be a long-term critical technical issue. This study is a review of the security issues of space information networks. First, this paper examines space information networks' security issues and clarifies the relationship between the main security threats, services, and mechanisms. Then, this article selects secure routing and anomaly detection from among many security technologies and surveys them in detail from the two perspectives of traditional methods and artificial intelligence. Subsequently, this paper investigates anomaly detection schemes for space information networks and proposes a deep-learning-based anomaly detection scheme. Finally, we suggest potential research directions and open problems in space information network security. Overall, this paper aims to give readers an overview of newly emerging technologies in space information network security and provide inspiration for future exploration.
Conference Paper
Transport protocols can be implemented in NIC (Network Interface Card) hardware to increase throughput, reduce latency, and free up CPU cycles. If the ideal transport protocol were known, the optimal implementation would be simple: bake it into fixed-function hardware. But transport layer protocols are still evolving, with innovative new algorithms proposed every year. A recent study proposed Tonic, a Verilog-programmable transport layer in hardware. We build on this work to propose a new programmable hardware transport layer architecture, called nanoTransport, optimized for the extremely low-latency message-based RPCs (Remote Procedure Calls) that dominate large, modern distributed data center applications. NanoTransport is programmed using the P4 language, making it easy to modify existing (or create entirely new) transport protocols in hardware. We identify common events and primitive operations, allowing for a streamlined, modular, programmable pipeline, including packetization, reassembly, timeouts, and packet generation, all to be expressed by the programmer. We evaluate our nanoTransport prototype by programming it to run the reliable message-based transport protocols NDP and Homa, as well as a hybrid variant. Our FPGA prototype, implemented in Chisel and running on the Firesim simulator, exposes P4-programmable pipelines and is designed to run in an ASIC at 200Gb/s with each packet processed end-to-end in less than 10ns (including message reassembly).
Article
The P4 language and programmable switch hardware, like the Intel Tofino, have made it possible for network engineers to write new programs that customize the operation of computer networks, thereby improving performance, fault-tolerance, energy use, and security. Unfortunately, possible does not mean easy: there are many implicit constraints that programmers must obey if they wish their programs to compile to specialized networking hardware. In particular, all computations on the same switch must access data structures in a consistent order, or it will not be possible to lay that data out along the switch's packet-processing pipeline. In this paper, we define Lucid 2.0, a new language and type system that guarantees programs access data in a consistent order and hence are pipeline-safe. Lucid 2.0 builds on top of the original Lucid language, which is also pipeline-safe, but lacks the features needed for modular construction of data structure libraries. Hence, Lucid 2.0 adds (1) polymorphism and ordering constraints for code reuse; (2) abstract, hierarchical pipeline locations and data types to support information hiding; (3) compile-time constructors, vectors and loops to allow for construction of flexible data structures; and (4) type inference to lessen the burden of program annotations. We develop the meta-theory of Lucid 2.0, prove soundness, and show how to encode constraint checking as an SMT problem. We demonstrate the utility of Lucid 2.0 by developing a suite of useful networking libraries and applications that exploit our new language features, including Bloom filters, sketches, cuckoo hash tables, distributed firewalls, DNS reflection defenses, network address translators (NATs) and a probabilistic traffic monitoring service.
Article
The network interface cards (NICs) of modern computers are changing to adapt to faster data rates and to help with the scaling issues of general-purpose CPU technologies. Among the ongoing innovations, the inclusion of programmable accelerators on the NIC's data path is particularly interesting, since it provides the opportunity to offload some of the CPU's network packet processing tasks to the accelerator. Given the strict latency constraints of packet processing tasks, accelerators are often implemented leveraging platforms such as Field-Programmable Gate Arrays (FPGAs). FPGAs can be re-programmed after deployment, to adapt to changing application requirements, and can achieve both high throughput and low latency when implementing packet processing tasks. However, they have limited resources that may need to be shared among diverse applications, and programming them is difficult and requires hardware design expertise. We present hXDP, a solution to run on FPGAs software packet processing tasks described with the eBPF technology and targeting the Linux's eXpress Data Path. hXDP uses only a fraction of the available FPGA resources, while matching the performance of high-end CPUs. The iterative execution model of eBPF is not a good fit for FPGA accelerators. Nonetheless, we show that many of the instructions of an eBPF program can be compressed, parallelized, or completely removed, when targeting a purpose-built FPGA design, thereby significantly improving performance. We implement hXDP on an FPGA NIC and evaluate it running real-world unmodified eBPF programs. Our implementation runs at 156.25MHz and uses about 15% of the FPGA resources. Despite these modest requirements, it can run dynamically loaded programs, achieves the packet processing throughput of a high-end CPU core, and provides a 10X lower packet forwarding latency.
Article
By introducing programmability, automated verification, and innovative debugging tools, Software-Defined Networks (SDNs) are poised to meet the increasingly stringent dependability requirements of today's communication networks. However, the design of fault-tolerant SDNs remains an open challenge. This paper considers the design of dependable SDNs through the lens of self-stabilization, a very strong notion of fault-tolerance. In particular, we develop algorithms for an in-band and distributed control plane for SDNs, called Renaissance, which tolerates a wide range of failures. Our self-stabilizing algorithms ensure that after the occurrence of arbitrary failures, (i) every non-faulty SDN controller can reach any switch (or another controller) within a bounded communication delay (in the presence of a bounded number of failures) and (ii) every switch is managed by a controller. We evaluate Renaissance through a rigorous worst-case analysis as well as a prototype implementation (based on OVS and Floodlight, and Mininet).
Chapter
Recent advances in virtualization technologies and distributed networking architectures have led to an increased interest in jointly considering computation and forwarding in network nodes. This is spurred in great part by the proliferation of edge computing as a complement or even alternative to centralized cloud computing, supplementing data centers for data-driven, time-sensitive, and critical applications and telemetry. These trends have given rise to the concept of the edge-cloud continuum: the melding of networking and computing with common and tightly integrated resource allocation capabilities from the edge of the network to the back-end cloud infrastructure, including computing and storage. The network is becoming more like a distributed computer board than the telephone network of yore: instead of providing connections and forwarding, the network can now be considered an essential constituent of the applications themselves, where the boundaries between the networking and computing domains are redefined. Although the joint optimization of communication and computation resource allocation has already been proposed in the past, it predated the current softwarization of networks, the development of new resource sharing paradigms in mobile networks, as well as the rise of data-driven approaches including machine learning. The availability of new hardware architectures (e.g. Tofino) and programming frameworks (e.g. P4) makes it possible to perform some in-network computing at line speed. In order to better understand the fundamental changes that computing will bring to the 6G era, this chapter examines the state of the art in in-network computing and how the edge-cloud continuum should evolve to support the next generation of applications and services.
Article
Full-text available
Conference Paper
Full-text available
Configuration changes are a common source of instability in networks, leading to outages, performance disruptions, and security vulnerabilities. Even when the initial and final configurations are correct, the update process itself often steps through intermediate configurations that exhibit incorrect behaviors. This paper introduces the notion of consistent network updates: updates that are guaranteed to preserve well-defined behaviors when transitioning between configurations. We identify two distinct consistency levels, per-packet and per-flow, and we present general mechanisms for implementing them in Software-Defined Networks using switch APIs like OpenFlow. We develop a formal model of OpenFlow networks, and prove that consistent updates preserve a large class of properties. We describe our prototype implementation, including several optimizations that reduce the overhead required to perform consistent updates. We present a verification tool that leverages consistent updates to significantly reduce the complexity of checking the correctness of network control software. Finally, we describe the results of some simple experiments demonstrating the effectiveness of these optimizations on example applications.
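The per-packet consistency mechanism described above can be sketched as a toy two-phase, version-tagging model in Python. The class and rule encodings below are illustrative assumptions, not the paper's implementation: internal rules are keyed by a configuration version, the new configuration is fully installed before the active version is flipped, so every packet is processed by exactly one configuration.

```python
# Toy model of two-phase, per-packet consistent update (illustrative only):
# internal rules are installed keyed by a configuration version, and the
# ingress stamps each packet with the currently active version.

class VersionedSwitch:
    def __init__(self):
        self.rules = {}          # (version, match) -> action
        self.active_version = 1

    def install(self, version, match, action):
        # new-version rules coexist with old ones until the flip
        self.rules[(version, match)] = action

    def set_active(self, version):
        # one-touch flip at the ingress: new packets use the new config
        self.active_version = version

    def process(self, packet_match):
        # each packet sees exactly one configuration version
        return self.rules.get((self.active_version, packet_match), "drop")
```

Once all packets stamped with the old version have drained from the network, the old-version rules can be garbage-collected.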
Conference Paper
Full-text available
Cuckoo hashing holds great potential as a high-performance hashing scheme for real applications. Up to this point, the greatest drawback of cuckoo hashing appears to be that there is a polynomially small but practically significant probability that a failure occurs during the insertion of an item, requiring an expensive rehashing of all items in the table. In this paper, we show that this failure probability can be dramatically reduced by the addition of a very small constant-sized stash. We demonstrate both analytically and through simulations that stashes of size equivalent to only three or four items yield tremendous improvements, enhancing cuckoo hashing’s practical viability in both hardware and software. Our analysis naturally extends previous analyses of multiple cuckoo hashing variants, and the approach may prove useful in further related schemes.
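The stash idea above can be illustrated with a minimal Python sketch. Table sizes, hash salts, and the eviction bound are arbitrary choices for the example, not values from the paper: when a bounded eviction walk fails, the displaced item is parked in a constant-sized stash instead of triggering a full rehash.

```python
class StashedCuckoo:
    """Illustrative two-table cuckoo hash with a constant-sized stash."""

    def __init__(self, capacity=64, stash_size=4, max_kicks=32):
        self.n = capacity
        self.t1 = [None] * capacity
        self.t2 = [None] * capacity
        self.stash = []              # items whose insertion walk failed
        self.stash_size = stash_size
        self.max_kicks = max_kicks

    def _h1(self, key):
        return hash((0, key)) % self.n

    def _h2(self, key):
        return hash((1, key)) % self.n

    def lookup(self, key):
        # check both candidate cells, then the (tiny) stash
        return (self.t1[self._h1(key)] == key
                or self.t2[self._h2(key)] == key
                or key in self.stash)

    def insert(self, key):
        if self.lookup(key):
            return True
        for _ in range(self.max_kicks):
            i = self._h1(key)
            if self.t1[i] is None:
                self.t1[i] = key
                return True
            key, self.t1[i] = self.t1[i], key   # evict occupant of t1
            j = self._h2(key)
            if self.t2[j] is None:
                self.t2[j] = key
                return True
            key, self.t2[j] = self.t2[j], key   # evict occupant of t2
        # Instead of an expensive full rehash, park the item in the stash.
        if len(self.stash) < self.stash_size:
            self.stash.append(key)
            return True
        return False  # stash full: only now would a rehash be needed
```

Lookups stay worst-case constant time because the stash has constant size.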
Article
Full-text available
We generalize Cuckoo Hashing to d-ary Cuckoo Hashing and show how this yields a simple hash table data structure that stores n elements in (1 + ε)n memory cells, for any constant ε > 0. Assuming uniform hashing, accessing or deleting table entries takes at most d = O(ln(1/ε)) probes and the expected amortized insertion time is constant. This is the first dictionary that has worst case constant access time and expected constant update time, works with (1 + ε)n space, and supports satellite information. Experiments indicate that d = 4 probes suffice for ε ≈ 0.03. We also describe variants of the data structure that allow the use of hash functions that can be evaluated in constant time.
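A hedged Python sketch of d-ary cuckoo hashing follows. The parameters and the random-walk eviction policy are illustrative assumptions (the paper analyzes the scheme under uniform hashing; it does not prescribe this particular insertion heuristic): each key has d candidate cells in a single table, and insertion evicts a random occupant when all candidates are full.

```python
import random

class DAryCuckoo:
    """Sketch of d-ary cuckoo hashing: each key has d candidate cells in
    one table; insertion uses a bounded random walk of evictions."""

    def __init__(self, d=4, capacity=128, max_kicks=200):
        self.d = d
        self.n = capacity
        self.cells = [None] * capacity
        self.max_kicks = max_kicks

    def _slots(self, key):
        # d salted hash functions over one shared table
        return [hash((i, key)) % self.n for i in range(self.d)]

    def lookup(self, key):
        # worst-case d probes
        return any(self.cells[s] == key for s in self._slots(key))

    def insert(self, key):
        if self.lookup(key):
            return True
        for _ in range(self.max_kicks):
            slots = self._slots(key)
            for s in slots:
                if self.cells[s] is None:
                    self.cells[s] = key
                    return True
            s = random.choice(slots)            # evict a random occupant
            key, self.cells[s] = self.cells[s], key
        return False  # walk exceeded its bound; a rehash would be needed
```

With d = 4 the structure sustains high load factors, consistent with the abstract's observation that four probes suffice for ε ≈ 0.03.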
Conference Paper
Full-text available
Virtual routers are a promising way to provide network services such as customer-specific routing, policy-based routing, multi-topology routing, and network virtualization. However, the need to support a separate forwarding information base (FIB) for each virtual router leads to memory scaling challenges. In this paper, we present a small, shared data structure and a fast lookup algorithm that capitalize on the commonality of IP prefixes between each FIB. Experiments with real packet traces and routing tables show that our approach achieves much lower memory requirements and considerably faster lookup times. Our prototype implementation in the Click modular router, running both in user space and in the Linux kernel, demonstrates that our data structure and algorithm are an interesting solution for building scalable routers that support virtualization.
Conference Paper
Full-text available
New protocols for the data link and network layer are being proposed to address limitations of current protocols in terms of scalability, security, and manageability. High-speed routers and switches that implement these protocols traditionally perform packet processing using ASICs which offer high speed, low chip area, and low power. But with inflexible custom hardware, the deployment of new protocols could happen only through equipment upgrades. While newer routers use more flexible network processors for data plane processing, due to power and area constraints lookups in forwarding tables are done with custom lookup modules. Thus most of the proposed protocols can only be deployed with equipment upgrades. To speed up the deployment of new protocols, we propose a flexible lookup module, PLUG (Pipelined Lookup Grid). We can achieve generality without losing efficiency because various custom lookup modules have the same fundamental features we retain: area dominated by memories, simple processing, and strict access patterns defined by the data structure. We implemented IPv4, Ethernet, Ethane, and SEATTLE in our dataflow-based programming model for the PLUG and mapped them to the PLUG hardware which consists of a grid of tiles. Throughput, area, power, and latency of PLUGs are close to those of specialized lookup modules.
Article
Full-text available
This whitepaper proposes OpenFlow: a way for researchers to run experimental protocols in the networks they use every day. OpenFlow is based on an Ethernet switch, with an internal flow-table, and a standardized interface to add and remove flow entries. Our goal is to encourage networking vendors to add OpenFlow to their switch products for deployment in college campus backbones and wiring closets. We believe that OpenFlow is a pragmatic compromise: on one hand, it allows researchers to run experiments on heterogeneous switches in a uniform way at line-rate and with high port-density; while on the other hand, vendors do not need to expose the internal workings of their switches. In addition to allowing researchers to evaluate their ideas in real-world traffic settings, OpenFlow could serve as a useful campus component in proposed large-scale testbeds like GENI. Two buildings at Stanford University will soon run OpenFlow networks, using commercial Ethernet switches and routers. We will work to encourage deployment at other schools, and we encourage you to consider deploying OpenFlow in your university network too.
Article
Full-text available
We revisit the problem of scaling software routers, motivated by recent advances in server technology that enable high-speed parallel processing---a feature router workloads appear ideally suited to exploit. We propose a software router architecture that parallelizes router functionality both across multiple servers and across multiple cores within a single server. By carefully exploiting parallelism at every opportunity, we demonstrate a 35Gbps parallel router prototype; this router capacity can be linearly scaled through the use of additional servers. Our prototype router is fully programmable using the familiar Click/Linux environment and is built entirely from off-the-shelf, general-purpose server hardware.
Conference Paper
We present PacketShader, a high-performance software router framework for general packet processing with Graphics Processing Unit (GPU) acceleration. PacketShader exploits the massively-parallel processing power of GPU to address the CPU bottleneck in current software routers. Combined with our high-performance packet I/O engine, PacketShader outperforms existing software routers by more than a factor of four, forwarding 64B IPv4 packets at 39 Gbps on a single commodity PC. We have implemented IPv4 and IPv6 forwarding, OpenFlow switching, and IPsec tunneling to demonstrate the flexibility and performance advantage of PacketShader. The evaluation results show that GPU brings significantly higher throughput over the CPU-only implementation, confirming the effectiveness of GPU for computation and memory-intensive operations in packet processing.
Article
A crucial problem that needs to be solved is the allocation of memory to processors in a pipeline. Ideally, the processor memories should be totally separate (i.e., one-port memories) in order to minimize contention; however, this minimizes memory sharing. Idealized sharing occurs by using a single shared memory for all processors but this maximizes contention. Instead, in this paper we show that perfect memory sharing of shared memory can be achieved with a collection of two-port memories, as long as the number of processors is less than the number of memories. We show that the problem of allocation is NP-complete in general, but has a fast approximation algorithm that comes within a factor of 3/2 asymptotically. The proof utilizes a new bin packing model, which is interesting in its own right. Further, for important special cases that arise in practice a more sophisticated modification of this approximation algorithm is in fact optimal. We also discuss the online memory allocation problem and present fast online algorithms that provide good memory utilization while allowing fast updates.
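A first-fit-decreasing heuristic in the spirit of the bin-packing formulation above can be sketched as follows. This is an illustrative simplification, not the paper's 3/2-approximation algorithm: demands are packed into fixed-size banks, and the two-port constraint is modeled by allowing at most two processors per bank.

```python
def allocate(demands, bank_size, num_banks):
    """First-fit-decreasing sketch: assign each processor's memory demand
    to two-port banks, splitting a demand across banks when needed, while
    capping each bank at two processors (one per port).
    Returns {bank: [(proc, amount), ...]} data or None if it doesn't fit."""
    banks = [{"free": bank_size, "procs": []} for _ in range(num_banks)]
    # largest demands first, the usual bin-packing heuristic
    for proc, need in sorted(enumerate(demands), key=lambda x: -x[1]):
        for bank in banks:
            if need == 0:
                break
            if len(bank["procs"]) < 2 and bank["free"] > 0:
                take = min(need, bank["free"])
                bank["free"] -= take
                bank["procs"].append((proc, take))
                need -= take
        if need > 0:
            return None  # heuristic failed to place this demand
    return banks
```

Because the heuristic is greedy it can fail on instances an optimal allocator would satisfy, which is exactly the gap the paper's approximation analysis bounds.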
Article
We present a simple dictionary with worst case constant lookup time, equaling the theoretical performance of the classic dynamic perfect hashing scheme of Dietzfelbinger et al. [SIAM J. Comput. 23 (4) (1994) 738–761]. The space usage is similar to that of binary search trees. Besides being conceptually much simpler than previous dynamic dictionaries with worst case constant lookup time, our data structure is interesting in that it does not use perfect hashing, but rather a variant of open addressing where keys can be moved back in their probe sequences. An implementation inspired by our algorithm, but using weaker hash functions, is found to be quite practical. It is competitive with the best known dictionaries having an average case (but no nontrivial worst case) guarantee on lookup time.
Article
Internet routers and switches need to maintain millions of (e.g., per prefix) counters at up to OC-768 speeds that are essential for traffic engineering. Unfortunately, the speed requirements require the use of large amounts of expensive SRAM memory. Shah et al. [1] introduced a cheaper statistics counter architecture that uses a much smaller amount of SRAM by using the SRAM as a cache together with a (cheap) backing DRAM that stores the complete counters. Counters in SRAM are periodically updated to the DRAM before they overflow under the control of a counter management algorithm. Shah et al. [1] also devised a counter management algorithm called LCF that they prove uses an optimal amount of SRAM. Unfortunately, it is difficult to implement LCF at high speeds because it requires sorting to evict the largest counter in the SRAM. This paper removes this bottleneck in [1] by proposing a counter management algorithm called LR(T) (Largest Recent with threshold T) that avoids sorting by only keeping a bitmap that tracks counters that are larger than threshold T. This allows LR(T) to be practically realizable using only at most 2 bits extra per counter and a simple pipelined data structure. Despite this, we show through a formal analysis that, for a particular value of the threshold T, LR(T) requires an optimal amount of SRAM, matching LCF. Further, we also describe an implementation, based on a novel data structure called aggregated bitmap, that allows the LR(T) algorithm to be realized at line rates.
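The threshold-bitmap idea behind LR(T) can be sketched in a few lines of Python. The "SRAM"/"DRAM" split and the flush policy below are simplified for illustration (a real implementation would flush on a fixed schedule and use the paper's aggregated bitmap): instead of sorting to find the largest counter, eviction picks any counter whose small SRAM value has exceeded the threshold T.

```python
class CounterCache:
    """Toy LR(T)-style counter management: small 'SRAM' counters plus a
    set acting as the bitmap of counters that exceeded threshold T; each
    flush cycle evicts one marked counter to the backing 'DRAM' store."""

    def __init__(self, num_counters, threshold):
        self.T = threshold
        self.sram = [0] * num_counters   # small, fast counters
        self.dram = [0] * num_counters   # large, slow backing counters
        self.large = set()               # bitmap: indices with sram > T

    def increment(self, i, amount=1):
        self.sram[i] += amount
        if self.sram[i] > self.T:
            self.large.add(i)            # mark as recently large

    def flush_one(self):
        # evict any recently-large counter; no sorting required
        if self.large:
            i = self.large.pop()
            self.dram[i] += self.sram[i]
            self.sram[i] = 0

    def read(self, i):
        # the true value is the sum of the cached and backing parts
        return self.dram[i] + self.sram[i]
```

The point of the sketch is that membership in `large` replaces the sort LCF needs, which is what makes the scheme pipeline-friendly.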
NP-5 Network Processor
• Ezchip
EZchip. NP-5 Network Processor. http://www.ezchip.com/p_np5.htm.
7 series FPGA overview
• Xilinx
Xilinx. 7 series FPGA overview. http://www.xilinx.com/support/documentation/data_ sheets/ds180_7Series_Overview.pdf.
VXLAN: A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks
• Ietf
• Vxlan
IETF. VXLAN: A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks, May 2013. https://tools.ietf.org/html/ draft-mahalingam-dutt-dcops-vxlan-04.
Rate Control Protocol (RCP)
• N Dukkipati
N. Dukkipati. Rate Control Protocol (RCP). PhD thesis, Stanford University, 2008.
Low power TCAM. US Patent 8,125,810
• P Bosshart
P. Bosshart. Low power TCAM. US Patent 8,125,810, Feb. 2012.
NVGRE: Network Virtualization using Generic Routing Encapsulation
• Ietf
• Nvgre
IETF. NVGRE: Network Virtualization using Generic Routing Encapsulation, Feb. 2013. https://tools.ietf. org/html/draft-sridharan-virtualization-nvgre-02.
NFP-6xxx Flow Processor
• Netronome
Netronome. NFP-6xxx Flow Processor. http://www.netronome.com/pages/flow-processors/.
RFC 5810: ForCES Protocol Specification
• Ietf
• Rfc
IETF. Forwarding and Control Element Separation (ForCES) Protocol Specification, RFC 5810, Mar. 2010.