Ori Rottenstreich

Princeton University, Princeton, New Jersey, United States

Publications (24) · 13.17 Total impact

  •
    ABSTRACT: The most demanding tenants of shared clouds require complete isolation from their neighbors in order to guarantee that their application performance is not affected by other tenants. Unfortunately, while shared clouds can offer an option whereby tenants obtain dedicated servers, they do not offer any network provisioning service that would shield these tenants from network interference. In this paper, we introduce Links as a Service (LaaS), a new cloud service abstraction that provides physical isolation of network links. Each tenant gets an exclusive set of links forming a virtual fat tree, and is guaranteed to receive exactly the same bandwidth and delay as if it were alone in the shared cloud. Under simple assumptions, we derive theoretical conditions for enabling LaaS without capacity over-provisioning in fat-trees. New tenants are admitted to the network only when they can be allocated hosts and links that maintain these conditions. Using experiments on real clusters as well as simulations with real-life tenant sizes, we show that LaaS completely avoids the performance degradation caused by traffic from concurrent tenants on shared links. Compared to mere host isolation, LaaS can improve application performance by up to 200%, at the cost of a 10% reduction in cloud utilization.
    No preview · Article · Sep 2015
  • O. Rottenstreich · J. Tapolcai
    ABSTRACT: Packet classification is a building block in many network services such as routing, filtering, intrusion detection, accounting, monitoring, load balancing and policy enforcement. Compression has recently gained attention as a way to deal with the expected growth in classifier sizes. Typically, compression schemes try to reduce a classifier's size while keeping it semantically equivalent to its original form. Inspired by the advantages of popular compression schemes (e.g. JPEG and MPEG), we study in this paper the applicability of lossy compression to create packet classifiers requiring less memory than optimal semantically-equivalent representations. Our objective is to find a limited-size classifier that correctly classifies a large portion of the traffic, so that it can be implemented in commodity switches with classification modules of a given size. We develop optimal dynamic-programming-based algorithms for several versions of the problem and describe how the small amount of traffic that cannot be classified can be easily treated, especially in software-defined networks. We generalize our solutions to a wide range of classifiers with different similarity metrics. We evaluate their performance on real classifiers and traffic traces, and show that in some cases we can reduce a classifier's size by orders of magnitude while still classifying almost all traffic correctly.
    No preview · Article · May 2015
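    The limited-size objective above can be illustrated with a toy sketch (hypothetical rule format; it assumes each rule matches a disjoint share of the traffic, whereas handling overlapping, prioritized rules is what the paper's dynamic-programming algorithms address):

    ```python
    def best_limited_classifier(rules, budget):
        """Greedy sketch: keep the `budget` rules that match the most traffic.
        Assumes rules match disjoint traffic shares (no priority interactions),
        an assumption real classifiers generally violate."""
        kept = sorted(rules, key=lambda r: r["traffic"], reverse=True)[:budget]
        covered = sum(r["traffic"] for r in kept) / sum(r["traffic"] for r in rules)
        return kept, covered
    ```

    For example, keeping 2 of 3 rules carrying 70%, 20% and 10% of traffic classifies 90% of packets correctly; the remaining 10% would be handled separately, e.g. redirected to the SDN controller.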
  • Tal Mizrahi · Ori Rottenstreich · Yoram Moses
    Full-text · Conference Paper · Apr 2015
  •
    ABSTRACT: Efficient packet classification is a core concern for network services. Traditional multi-field classification approaches, in both software and ternary content-addressable memories (TCAMs), entail tradeoffs between (memory) space and (lookup) time. TCAMs cannot efficiently represent range rules, a common class of classification rules confining values of packet fields to given ranges. The exponential space growth of TCAM entries relative to the number of fields is exacerbated when multiple fields contain ranges. In this work, we present a novel approach that identifies properties of many classifiers which can be implemented in linear space with guaranteed worst-case logarithmic lookup time, and that allows the addition of more fields, including range constraints, without impacting the space and time complexities. On real-life classifiers from Cisco Systems and additional classifiers from ClassBench (with real parameters), 90-95% of rules are thus handled, and the remaining 5-10% of rules can be stored in TCAM to be processed in parallel.
    Full-text · Article · Feb 2015 · IEEE/ACM Transactions on Networking
  • Ori Rottenstreich · Isaac Keslassy · Avinatan Hassidim · Haim Kaplan · Ely Porat
    ABSTRACT: Hardware-based packet classification has become an essential component in many networking devices. It often relies on ternary content-addressable memories (TCAMs), which compare the packet header against a set of rules. TCAMs are not well suited to encode range rules. Range rules are often encoded by multiple TCAM entries, and little is known about the smallest number of entries that one needs for a specific range. In this paper, we introduce the In/Out TCAM, a new architecture that combines a regular TCAM together with a modified TCAM. This custom architecture enables independent encoding of each rule in a set of rules. We provide the following theoretical results for the new architecture: 1) We give an upper bound on the worst-case expansion of range rules in one and two dimensions. 2) For extremal ranges, which are 89% of the ranges that occur in practice, we provide an efficient algorithm that computes an optimal encoding. 3) We present a closed-form formula for the average expansion of an extremal range.
    No preview · Article · Jan 2015 · IEEE/ACM Transactions on Networking
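    For context on why ranges expand into multiple entries, the classic prefix-expansion baseline can be sketched as follows (illustrative code, not the paper's In/Out TCAM scheme): a range is split into the minimal set of aligned power-of-two blocks, each representable as one prefix entry with don't-care bits.

    ```python
    def range_to_prefixes(start, end, w):
        """Expand the range [start, end] over w-bit values into a minimal
        list of prefix entries; '*' marks don't-care bit positions."""
        prefixes = []
        while start <= end:
            # Grow the largest aligned power-of-two block starting at `start`.
            size = 1
            while start % (size * 2) == 0 and start + size * 2 - 1 <= end:
                size *= 2
            free_bits = size.bit_length() - 1
            fixed = w - free_bits
            bits = format(start >> free_bits, '0%db' % fixed) if fixed else ''
            prefixes.append(bits + '*' * free_bits)
            start += size
        return prefixes
    ```

    On the known worst case [1, 2^W-2], prefix expansion needs 2W-2 entries, which is the baseline that tighter worst-case bounds improve on.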
  • A. Shpiner · E. Zahavi · O. Rottenstreich
    ABSTRACT: Data center networks demand high-bandwidth switches. These networks also sustain common incast scenarios, which require large switch buffers. Network and switch designers therefore encounter a buffer-bandwidth tradeoff: large switch buffers allow absorbing a larger incast workload, while higher switch bandwidth allows both faster buffer draining and more link pausing, which reduces the buffering demand for incast. As the two features compete for silicon resources and the device power budget, modeling their relative impact on the network is critical. In this work our aim is to evaluate this buffer-bandwidth tradeoff. We analyze the worst-case incast scenario in a lossless network and find by how much the buffer size can be reduced, and the link bandwidth increased, while maintaining the same network performance. In addition, we analyze the multi-level incast cascade and support our findings by simulations. Our analysis shows that increasing bandwidth allows reducing the buffering demand by at least the same ratio, while preserving the same network performance. In particular, we show that the switch buffers can be omitted if the link bandwidth is doubled.
    No preview · Article · Oct 2014
  •
    ABSTRACT: Efficient packet classification is a core concern for network services. Traditional multi-field classification approaches, in both software and ternary content-addressable memories (TCAMs), entail tradeoffs between (memory) space and (lookup) time. TCAMs cannot efficiently represent range rules, a common class of classification rules confining values of packet fields to given ranges. The exponential space growth of TCAM entries relative to the number of fields is exacerbated when multiple fields contain ranges. In this work, we present a novel approach that identifies properties of many classifiers which can be implemented in linear space with guaranteed worst-case logarithmic lookup time, and that allows the addition of more fields, including range constraints, without impacting the space and time complexities. On real-life classifiers from Cisco Systems and additional classifiers from ClassBench (with real parameters), 90-95% of rules are thus handled, and the remaining 5-10% of rules can be stored in TCAM to be processed in parallel.
    Full-text · Conference Paper · Aug 2014
  • Ori Rottenstreich · Yoram Revah
    ABSTRACT: We study a perfectly-periodic scheduling problem for a resource shared among several users. Each user is characterized by a weight describing the number of times it has to use the resource within a cyclic schedule. Under the constraint that the resource can be used by at most one user in each time slot, we would like to find a schedule with a minimal time period (number of time slots) in which each user is served exactly once every fixed number of time slots, as determined by its weight. As with many other variants of periodic-scheduling problems, we first prove that the problem is NP-hard. We then describe different cases for which we can calculate the exact value of the optimal time period, and present algorithms that obtain optimal schedules. We also study the optimal time period in the case of two users with random weights drawn according to known distributions. We then discuss the general case of an arbitrary number of users with general weights, and provide approximation algorithms that achieve schedules with guaranteed time periods. Last, we conduct simulations to examine the presented analysis.
    No preview · Conference Paper · Jul 2014
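    The perfect-periodicity requirement can be made concrete with a small checker (a sketch; the user ids and list-per-slot format are illustrative): a user holding k of the T slots must be served exactly every T/k slots, counting the wrap-around gap of the cycle.

    ```python
    def is_perfectly_periodic(schedule):
        """Check that a cyclic schedule (one user id per time slot) is
        perfectly periodic: each user's appearances are evenly spaced."""
        T = len(schedule)
        slots = {}
        for t, user in enumerate(schedule):
            slots.setdefault(user, []).append(t)
        for ts in slots.values():
            if T % len(ts) != 0:          # period must divide evenly
                return False
            gap = T // len(ts)
            # Consecutive gaps, including the wrap-around one.
            gaps = [b - a for a, b in zip(ts, ts[1:])] + [ts[0] + T - ts[-1]]
            if any(g != gap for g in gaps):
                return False
        return True
    ```

    For instance, the cycle A, B, A, C is perfectly periodic (A every 2 slots, B and C every 4), while A, A, B, C is not.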
  • Ori Rottenstreich · Pu Li · Inbal Horev · Isaac Keslassy · Shivkumar Kalyanaraman
    ABSTRACT: Packet reordering has become one of the most significant bottlenecks in next-generation switch designs. A switch can experience a reordering-delay contagion, in which a few late packets affect a disproportionate number of other packets. This contagion takes two forms. First, since switch designers tend to keep the switch flow order, i.e. the order of packets arriving at the same switch input and departing from the same switch output, a packet may be delayed by packets of other flows for little or no reason. Further, within a flow, if a single packet is delayed for a long time, then all the other packets of the same flow have to wait for it and suffer as well. In this paper, we suggest solutions against this reordering contagion. We first suggest several hash-based counter schemes that prevent inter-flow blocking and reduce reordering delay. We further suggest schemes based on network coding to protect against rare events with high queueing delay within a flow. Last, we demonstrate using both analysis and simulations that these solutions can indeed reduce the resequencing delay. For instance, resequencing delays are reduced by up to an order of magnitude using real-life traces and a real hashing function.
    No preview · Article · May 2014 · IEEE Transactions on Computers
  •
    ABSTRACT: With the rise of datacenter virtualization, the number of entries in the forwarding tables of datacenter switches is expected to scale from several thousand to several million. Unfortunately, such forwarding-table sizes would not fit in on-chip memory using current implementations. In this paper, we investigate the compressibility of forwarding tables. We first introduce a novel forwarding-table architecture with separate encoding in each column, designed to keep supporting fast random accesses and fixed-width memory words. Then, we show that although finding the optimal encoding is NP-hard, we can suggest an encoding whose memory requirement per row entry is guaranteed to be within a small additive constant of the optimum. Next, we analyze the common case of two-column forwarding tables, and show that such tables can be represented as bipartite graphs. We deduce graph-theoretical bounds on the encoding size. We also introduce an algorithm for optimal conditional encoding of the second column given an encoding of the first. In addition, we explain how our architecture can handle table updates. Last, we evaluate our suggested encoding techniques on synthetic forwarding tables as well as on real-life tables.
    No preview · Article · Jan 2014 · IEEE Journal on Selected Areas in Communications
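    The separate-per-column idea can be illustrated with a minimal sketch that only computes fixed per-column widths from dictionary sizes; the paper's encodings (variable-width codes, conditional encoding of the second column) are more refined than this.

    ```python
    import math

    def per_column_widths(table):
        """Fixed-width encoding with a separate dictionary per column:
        a column holding d distinct values needs ceil(log2(d)) bits per
        row (1 bit minimum), preserving fixed-width words and fast
        random access by row index."""
        return [max(1, math.ceil(math.log2(len(set(col)))))
                for col in zip(*table)]
    ```

    A forwarding table whose columns hold only a few distinct next-hop values thus compresses to a few bits per row, regardless of the width of the original entries.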
  • Ori Rottenstreich · Isaac Keslassy · Yoram Revah · Aviran Kadosh
    ABSTRACT: Pipelines are widely used to increase throughput in multi-core chips by parallelizing packet processing. Typically, each packet type is serviced by a dedicated pipeline. However, as the number of packet types and their required services grow, there are not enough cores to dedicate a pipeline to each packet type. In this paper, we study pipeline sharing, whereby a single pipeline can serve several packet types. Pipeline sharing decreases the total number of cores needed, but typically increases pipeline lengths and therefore packet delays. We consider the optimization problem of allocating cores among different packet types such that the average delay is minimized. We suggest a polynomial-time algorithm that finds the optimal solution when the packet types satisfy a specific property, and present a greedy algorithm for the general case. Last, we examine our solutions on synthetic examples, on packet-processing applications, and on real-life H.264 standard requirements.
    No preview · Conference Paper · Aug 2013
  • Ori Rottenstreich · Rami Cohen · Danny Raz · Isaac Keslassy
    ABSTRACT: In recent years, hardware-based packet classification has become an essential component in many networking devices. It often relies on ternary content-addressable memories (TCAMs), which can compare the packet header in parallel against a large set of rules. Designers of TCAMs often have to deal with unpredictable sets of rules that result in highly variable rule expansions, and can rely only on heuristic encoding algorithms with no reasonable guarantees. In this paper, given several types of rules, we provide new upper bounds on the worst-case TCAM rule expansions. In particular, we prove that a W-bit range can be encoded in W TCAM entries, improving upon the previously known bound of 2W-5. We further prove the optimality of this bound of W for prefix encoding, using new analytical tools based on independent sets and alternating paths. Next, we generalize these lower bounds to a new class of codes called hierarchical codes, which includes both binary codes and Gray codes. Last, we propose a modified TCAM architecture that can use additional logic to significantly reduce the rule expansions, both in the worst case and on real-life classification databases.
    No preview · Article · Jun 2013 · IEEE Transactions on Computers
  • Ori Rottenstreich · Isaac Keslassy · Avinatan Hassidim · Haim Kaplan · Ely Porat
    ABSTRACT: Hardware-based packet classification has become an essential component in many networking devices. It often relies on TCAMs (ternary content-addressable memories), which need to compare the packet header against a set of rules. But efficiently encoding these rules is not an easy task. In particular, the most complicated rules are range rules, which usually require multiple TCAM entries to encode them. However, little is known on the optimal encoding of such non-trivial rules. In this work, we take steps towards finding an optimal encoding scheme for every possible range rule. We first present an optimal encoding for all possible generalized extremal rules. Such rules represent 89% of all non-trivial rules in a typical real-life classification database. We also suggest a new method of simply calculating the optimal expansion of an extremal range, and present a closed-form formula of the average optimal expansion over all extremal ranges. Next, we present new bounds on the worst-case expansion of general classification rules, both in one-dimensional and two-dimensional ranges. Last, we introduce a new TCAM architecture that can leverage these results by providing a guaranteed expansion on the tough rules, while dealing with simpler rules using a regular TCAM. We conclude by verifying our theoretical results in experiments with synthetic and real-life classification databases.
    No preview · Conference Paper · Apr 2013
  •
    ABSTRACT: With the rise of datacenter virtualization, the number of entries in forwarding tables is expected to scale from several thousand to several million. Unfortunately, such forwarding-table sizes can hardly be implemented today in on-chip memory. In this paper, we investigate the compressibility of forwarding tables. We first introduce a novel forwarding-table architecture with separate encoding in each column, designed to keep supporting fast random accesses and fixed-width memory words. Then, we suggest an encoding whose memory requirement per row entry is guaranteed to be within a small additive constant of the optimum. Next, we analyze the common case of two-column forwarding tables, and show that such tables can be represented as bipartite graphs. We deduce graph-theoretical bounds on the encoding size. We also introduce an algorithm for optimal conditional encoding of the second column given an encoding of the first. In addition, we explain how our architecture can handle table updates. Last, we evaluate our suggested encoding techniques on synthetic forwarding tables as well as on real-life tables.
    No preview · Conference Paper · Apr 2013
  • O. Rottenstreich · A. Berman · Y. Cassuto · I. Keslassy
    ABSTRACT: To enable direct access to a memory word based on its index, memories make use of fixed-width arrays, in which a fixed number of bits is allocated for the representation of each data entry. In this paper we consider the problem of encoding data entries of two fields, drawn independently according to known and generally different distributions. Our goal is to find two prefix codes for the two fields, that jointly maximize the probability that the total length of an encoded data entry is within a fixed given width. We study this probability and develop upper and lower bounds. We also show how to find an optimal code for the second field given a fixed code for the first field.
    No preview · Conference Paper · Jan 2013
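    The quantity being maximized can be sketched directly from the two codes' length distributions (an illustrative function; the symbol-to-(probability, codeword length) map is an assumed input format, and the fields are taken to be independent as in the abstract):

    ```python
    from itertools import product

    def fit_probability(field1, field2, width):
        """Probability that the concatenation of two independently drawn,
        prefix-encoded fields fits in a `width`-bit memory word.
        Each field maps symbol -> (probability, codeword length)."""
        return sum(p1 * p2
                   for (p1, l1), (p2, l2) in product(field1.values(),
                                                     field2.values())
                   if l1 + l2 <= width)
    ```

    Choosing shorter codewords for likely symbols in either field raises this probability, which is why jointly optimizing the two prefix codes matters.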
  • Ori Rottenstreich · Isaac Keslassy
    ABSTRACT: In this paper, we uncover the Bloom paradox in Bloom filters: sometimes, it is better to disregard the query results of Bloom filters, and in fact not to even query them, thus making them useless. We first analyze conditions under which the Bloom paradox occurs in a Bloom filter, and demonstrate that it depends on the a priori probability that a given element belongs to the represented set. We show that the Bloom paradox also applies to Counting Bloom Filters (CBFs), and depends on the product of the hashed counters of each element. In addition, both for Bloom filters and CBFs, we suggest improved architectures that deal with the Bloom paradox. We also provide fundamental memory lower bounds required to support element queries with limited false-positive and false-negative rates. Last, using simulations, we verify our theoretical results, and show that our improved schemes can lead to a significant improvement in the performance of Bloom filters and CBFs.
    Preview · Article · May 2012 · Proceedings - IEEE INFOCOM
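    The dependence on the a priori membership probability follows from Bayes' rule; a one-line sketch (assuming the filter has false-positive rate fpr and no false negatives):

    ```python
    def posterior_membership(prior, fpr):
        """Posterior probability that a queried element really belongs to
        the set, given the Bloom filter answered 'yes': the filter never
        says 'no' for a member, and says 'yes' for a non-member with
        probability fpr."""
        return prior / (prior + (1 - prior) * fpr)
    ```

    With a rare-membership workload, e.g. a prior of 1e-6 and a 1% false-positive rate, the posterior is about 1e-4: almost every 'yes' is a false positive, so the query result carries almost no information, which is the essence of the paradox.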
  • Ori Rottenstreich · Yossi Kanizo · Isaac Keslassy
    ABSTRACT: Counting Bloom filters (CBFs) are widely used in networking device algorithms. They implement fast set representations to support membership queries with limited error and, unlike Bloom filters, support element deletions. However, they consume significant amounts of memory. In this paper we introduce a new general method based on variable increments to improve the efficiency of CBFs and their variants. Unlike in CBFs, at each element insertion the hashed counters are incremented by a hashed variable increment instead of a unit increment. Then, to query an element, the exact value of a counter is considered, not just its positiveness. We present two simple schemes based on this method. We demonstrate that this method can always achieve a lower false positive rate and a lower overflow probability bound than CBF in practical systems. We also show how it can be easily implemented in hardware, with limited added complexity and memory overhead. We further explain how this method can extend many variants of CBF that have been published in the literature. Last, using simulations, we show how it can improve the false positive rate of CBFs by up to an order of magnitude given the same amount of memory.
    Preview · Article · Mar 2012 · Proceedings - IEEE INFOCOM
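    A minimal sketch of the variable-increment idea, assuming an increment set D = {L, ..., 2L-1} and a deliberately simplified query rule (the class name, hashing, and rule are illustrative, not the paper's exact construction): since every increment is at least L, a counter minus the queried element's own increment must be 0 or at least L if the element is present, so intermediate remainders can be ruled out without false negatives.

    ```python
    import random

    class VariableIncrementCBF:
        """Counting Bloom filter variant: each (element, hash) pair adds a
        value drawn from D = {L, ..., 2L-1} instead of 1, and a query
        inspects the exact counter value, not just its positiveness."""
        L = 4
        D = (4, 5, 6, 7)

        def __init__(self, m, k, seed=0):
            self.m, self.k = m, k
            self.counters = [0] * m
            self.seed = seed

        def _hashes(self, item):
            # Deterministic per-item (counter index, increment) pairs.
            rnd = random.Random(hash((item, self.seed)))
            return [(rnd.randrange(self.m), rnd.choice(self.D))
                    for _ in range(self.k)]

        def add(self, item):
            for idx, inc in self._hashes(item):
                self.counters[idx] += inc

        def remove(self, item):
            for idx, inc in self._hashes(item):
                self.counters[idx] -= inc

        def might_contain(self, item):
            for idx, inc in self._hashes(item):
                rem = self.counters[idx] - inc
                # If the item were present, rem would be 0 or a sum of
                # increments from D, hence 0 or at least L.
                if rem < 0 or 0 < rem < self.L:
                    return False
            return True
    ```

    A plain CBF would report any positive counter as a possible match; here a counter of, say, 5 rules out an element whose increment at that position is 4, which is the source of the lower false-positive rate.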
  • A. Berman · Y. Birk · O. Rottenstreich
    ABSTRACT: The level of write-once memory cells (e.g., Flash) can only be raised individually. Bulk erasure is possible, but only a number of times (endurance) that decreases sharply with increasing cell capacity or cell-size reduction. A device's declared storage capacity and the total amount of information that can be written to it over its lifetime thus jointly characterize it. Write-once memory (WOM) coding permits a trade-off between these, enabling multiple writes between erasures. The guaranteed (over data) number of such writes is presently considered the important measure. We observe that given the typical endurance (10^4-10^5 for Flash), the actual write capacity is very close to its mean value with extremely high probability, rendering the mean number of writes between erasures much more interesting than its guaranteed value in any practical setting. For the Linear WOM code, we derive the write-capacity CDF, and show that even the nearly guaranteed number of writes between erasures is substantially larger than the guaranteed one. Based on this, we present the Piecewise Linear WOM code, whereby a page is partitioned into numerous independent WOM segments (whose writes must all succeed), permitting a flexible storage/write-capacity tradeoff. We show that the aforementioned behavior is preserved. Finally, we outline an extension to multi-level cells.
    No preview · Conference Paper · Jan 2012
  • Ori Rottenstreich · Isaac Keslassy
    ABSTRACT: All high-speed Internet devices need to implement classification, i.e. they must determine whether incoming packet headers belong to a given subset of a search space. To do so, they encode the subset using ternary arrays in special high-speed devices called TCAMs (ternary content-addressable memories). However, the optimal coding for arbitrary subsets is unknown. In particular, to encode an arbitrary range subset of the space of all W-bit values, previous works have successively reduced the upper bound on the code length from 2W-2 to 2W-4, then 2W-5, and finally W TCAM entries. In this paper, we prove that this final result is optimal for typical prefix coding and cannot be further improved, i.e. the bound of W is tight. To prove this, we introduce new analytical tools based on independent sets and alternating paths.
    Preview · Conference Paper · Jun 2010
  • Itamar Cohen · Ori Rottenstreich · Isaac Keslassy
    ABSTRACT: Chip multiprocessors (CMPs) combine increasingly many general-purpose processor cores on a single chip. These cores run several tasks with unpredictable communication needs, resulting in uncertain and often-changing traffic patterns. This unpredictability leads network-on-chip (NoC) designers to plan for the worst case traffic patterns, and significantly overprovision link capacities. In this paper, we provide NoC designers with an alternative statistical approach. We first present the traffic-load distribution plots (T-Plots), illustrating how much capacity overprovisioning is needed to service 90, 99, or 100 percent of all traffic patterns. We prove that in the general case, plotting T-Plots is #P-complete, and therefore extremely complex. We then show how to determine the exact mean and variance of the traffic load on any edge, and use these to provide Gaussian-based models for the T-Plots, as well as guaranteed performance bounds. We also explain how to practically approximate T-Plots using random-walk-based methods. Finally, we use T-Plots to reduce the network power consumption by providing an efficient capacity allocation algorithm with predictable performance guarantees.
    Full-text · Article · Jun 2010 · IEEE Transactions on Computers