Ori Rottenstreich

Technion - Israel Institute of Technology, Haifa, Israel

Publications (17) · 8.56 Total impact

  •
    ABSTRACT: Efficient packet classification is a core concern for network services. Traditional multi-field classification approaches, in both software and Ternary Content-Addressable Memories (TCAMs), entail tradeoffs between (memory) space and (lookup) time. TCAMs cannot efficiently represent range rules, a common class of classification rules that confine the values of packet fields to given ranges. The exponential space growth of TCAM entries relative to the number of fields is exacerbated when multiple fields contain ranges. In this work, we present a novel approach that identifies properties shared by many classifiers, allowing them to be implemented in linear space with guaranteed worst-case logarithmic time, and that supports the addition of further fields, including range constraints, without affecting the space and time complexities. On real-life classifiers from Cisco Systems and additional classifiers from ClassBench (with real parameters), 90-95% of the rules are thus handled, and the remaining 5-10% can be stored in TCAM to be processed in parallel.
    SIGCOMM; 08/2014
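A minimal sketch of the linear-space, logarithmic-time flavor of lookup described above, for a single range field (the rule set and action names are illustrative, not taken from the paper): non-overlapping range rules can be matched by binary search over sorted range starts.

```python
import bisect

# Hypothetical single-field classifier: each rule confines the field (here a
# TCP port) to a range.  Non-overlapping ranges admit O(n) space and
# O(log n) worst-case lookup via binary search over the sorted range starts.
rules = [((0, 1023), "deny"), ((1024, 49151), "allow"), ((49152, 65535), "inspect")]
starts = [lo for (lo, _hi), _action in rules]

def classify(port):
    """Return the action of the rule whose range contains `port`, else None."""
    i = bisect.bisect_right(starts, port) - 1   # last rule starting <= port
    if i >= 0:
        (lo, hi), action = rules[i]
        if lo <= port <= hi:
            return action
    return None
```

The binary search replaces one TCAM-style parallel match; the hard part the paper addresses is identifying when a multi-field classifier decomposes into structures like this.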
  •
    ABSTRACT: Packet reordering has now become one of the most significant bottlenecks in next-generation switch designs. A switch practically experiences a reordering delay contagion, in which a few late packets may affect a disproportionate number of other packets. This contagion can take two forms. First, since switch designers tend to keep the switch flow order, i.e., the order of packets arriving at the same switch input and departing from the same switch output, a packet may be delayed by packets of other flows for little or no reason. Further, within a flow, if a single packet is delayed for a long time, then all the other packets of the same flow will have to wait for it and suffer as well. In this paper, we suggest solutions against this reordering contagion. We first suggest several hash-based counter schemes that prevent inter-flow blocking and reduce reordering delay. We further suggest schemes based on network coding to protect against rare events with high queueing delay within a flow. Last, we demonstrate using both analysis and simulations that the use of these solutions can indeed reduce the resequencing delay. For instance, resequencing delays are reduced by up to an order of magnitude using real-life traces and a real hashing function.
    IEEE Transactions on Computers, 05/2014 · 1.47 Impact Factor
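A toy sketch of the hash-based counter idea (bucket count, tagging scheme, and class names are assumptions for illustration, not the paper's exact schemes): the input tags each packet with a per-bucket sequence number, where the bucket is a hash of the flow ID, and the output releases a packet only when its tag is the next expected value for its bucket, so flows hashed to different buckets never block each other.

```python
NUM_BUCKETS = 4

def bucket(flow_id):
    # Integer flow IDs keep hash() deterministic across runs in CPython.
    return hash(flow_id) % NUM_BUCKETS

class Tagger:
    """Runs at the switch input: stamps packets with per-bucket counters."""
    def __init__(self):
        self.next_tag = [0] * NUM_BUCKETS
    def tag(self, flow_id):
        b = bucket(flow_id)
        t = self.next_tag[b]
        self.next_tag[b] += 1
        return (b, t)

class Resequencer:
    """Runs at the switch output: restores per-bucket order."""
    def __init__(self):
        self.expected = [0] * NUM_BUCKETS
        self.waiting = {}                    # (bucket, tag) -> packet
    def arrive(self, pkt, b, t):
        """Buffer pkt; return the list of packets now releasable, in order."""
        self.waiting[(b, t)] = pkt
        out = []
        while (b, self.expected[b]) in self.waiting:
            out.append(self.waiting.pop((b, self.expected[b])))
            self.expected[b] += 1
        return out
```

A late packet only stalls later packets of its own hash bucket, which is the inter-flow blocking reduction the abstract refers to.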
  •
    ABSTRACT: With the rise of datacenter virtualization, the number of entries in the forwarding tables of datacenter switches is expected to scale from several thousands to several millions. Unfortunately, such forwarding table sizes would not fit on-chip memory using current implementations. In this paper, we investigate the compressibility of forwarding tables. We first introduce a novel forwarding table architecture with separate encoding in each column. It is designed to keep supporting fast random accesses and fixed-width memory words. Then, we show that although finding the optimal encoding is NP-hard, we can suggest an encoding whose memory requirement per row entry is guaranteed to be within a small additive constant of the optimum. Next, we analyze the common case of two-column forwarding tables, and show that such tables can be presented as bipartite graphs. We deduce graph-theoretical bounds on the encoding size. We also introduce an algorithm for optimal conditional encoding of the second column given an encoding of the first one. In addition, we explain how our architecture can handle table updates. Last, we evaluate our suggested encoding techniques on synthetic forwarding tables as well as on real-life tables.
    IEEE Journal on Selected Areas in Communications 01/2014; 32(1):138-151. · 4.14 Impact Factor
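A minimal sketch of separate per-column encoding (a plain fixed-width dictionary code per column, an assumed simplification rather than the paper's near-optimal encoder): each row becomes the concatenation of its columns' codewords, so random access and fixed-width memory words are preserved.

```python
def encode_column(values):
    """Map each distinct value in a column to a fixed-width binary codeword."""
    distinct = sorted(set(values))
    width = max(1, (len(distinct) - 1).bit_length())   # bits per codeword
    code = {v: format(i, f"0{width}b") for i, v in enumerate(distinct)}
    return code, width

# Hypothetical two-column forwarding table: prefix -> next hop.
table = [("10.0.0.0/8", "port1"), ("10.1.0.0/16", "port2"), ("10.2.0.0/16", "port1")]
codes = [encode_column(col) for col in zip(*table)]
rows = ["".join(code[v] for (code, _w), v in zip(codes, row)) for row in table]
# Each encoded row is 2 + 1 = 3 bits wide, regardless of the raw string sizes.
```

Encoding each column separately lets a column with few distinct values (here the next-hop column) shrink to very few bits, which is where the compressibility studied in the paper comes from.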
  •
    ABSTRACT: Pipelines are widely used to increase throughput in multi-core chips by parallelizing packet processing. Typically, each packet type is serviced by a dedicated pipeline. However, with the increase in the number of packet types and their number of required services, there are not enough cores for pipelines. In this paper, we study pipeline sharing, such that a single pipeline can be used to serve several packet types. Pipeline sharing decreases the needed total number of cores, but typically increases pipeline lengths and therefore packet delays. We consider the optimization problem of allocating cores between different packet types such that the average delay is minimized. We suggest a polynomial-time algorithm that finds the optimal solution when the packet types preserve a specific property. We also present a greedy algorithm for the general case. Last, we examine our solutions on synthetic examples, on packet-processing applications, and on real-life H.264 standard requirements.
    Proceedings of the 2013 IEEE 21st Annual Symposium on High-Performance Interconnects; 08/2013
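A hedged sketch of a marginal-gain greedy for dividing C cores among packet types (the delay model below, weighted delay w[i]*L[i]/c[i], is an illustrative assumption and not the paper's model): repeatedly hand the next core to the type whose weighted delay drops the most.

```python
def greedy_allocate(w, L, C):
    """Greedily split C cores among len(w) packet types.

    w[i]: traffic weight of type i; L[i]: its per-packet work.
    Assumed model: a type served by c cores sees delay L[i] / c.
    """
    n = len(w)
    cores = [1] * n                       # every type needs at least one core
    for _ in range(C - n):
        def gain(i):
            # Drop in weighted delay if type i receives one more core.
            return w[i] * (L[i] / cores[i] - L[i] / (cores[i] + 1))
        cores[max(range(n), key=gain)] += 1
    return cores
```

For example, two equally weighted types with work 8 and 2 sharing 5 cores end up with 3 and 2 cores respectively under this model.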
  • O. Rottenstreich, R. Cohen, D. Raz, I. Keslassy
    ABSTRACT: In recent years, hardware-based packet classification has become an essential component in many networking devices. It often relies on ternary content-addressable memories (TCAMs), which can compare in parallel the packet header against a large set of rules. Designers of TCAMs often have to deal with unpredictable sets of rules. Such sets result in highly variable rule expansions, leaving designers to rely on heuristic encoding algorithms with no reasonable guarantees. In this paper, given several types of rules, we provide new upper bounds on the TCAM worst case rule expansions. In particular, we prove that a W-bit range can be encoded in W TCAM entries, improving upon the previously known bound of 2W - 5. We further prove the optimality of this bound of W for prefix encoding, using new analytical tools based on independent sets and alternating paths. Next, we generalize these lower bounds to a new class of codes called hierarchical codes that includes both binary codes and Gray codes. Last, we propose a modified TCAM architecture that can use additional logic to significantly reduce the rule expansions, both in the worst case and using real-life classification databases.
    IEEE Transactions on Computers 06/2013; 62(6):1127-1140. · 1.47 Impact Factor
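The rule expansion being bounded here is concrete: a range rule must be split into ternary prefix entries before a standard TCAM can hold it. The classic prefix-expansion routine below shows the effect; e.g. over W=4 bits the range [1, 14] already needs 2W-2 = 6 entries.

```python
def range_to_prefixes(lo, hi, W):
    """Return ternary strings ('0'/'1'/'*') covering exactly [lo, hi] over W bits."""
    out = []
    def rec(prefix, lo_p, hi_p):
        if lo_p > hi or hi_p < lo:          # subtree disjoint from the range
            return
        if lo <= lo_p and hi_p <= hi:       # subtree fully inside: one entry
            out.append(prefix + '*' * (W - len(prefix)))
            return
        mid = (lo_p + hi_p) // 2            # split the subtree in half
        rec(prefix + '0', lo_p, mid)
        rec(prefix + '1', mid + 1, hi_p)
    rec('', 0, 2 ** W - 1)
    return out

# range_to_prefixes(1, 14, 4) yields 6 entries:
# 0001, 001*, 01**, 10**, 110*, 1110
```

The paper's W-entry bound relies on more general ternary encodings than the pure prefixes produced here.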
  •
    ABSTRACT: With the rise of datacenter virtualization, the number of entries in forwarding tables is expected to scale from several thousands to several millions. Unfortunately, such forwarding table sizes can hardly be implemented today in on-chip memory. In this paper, we investigate the compressibility of forwarding tables. We first introduce a novel forwarding table architecture with separate encoding in each column. It is designed to keep supporting fast random accesses and fixed-width memory words. Then, we suggest an encoding whose memory requirement per row entry is guaranteed to be within a small additive constant of the optimum. Next, we analyze the common case of two-column forwarding tables, and show that such tables can be presented as bipartite graphs. We deduce graph-theoretical bounds on the encoding size. We also introduce an algorithm for optimal conditional encoding of the second column given an encoding of the first one. In addition, we explain how our architecture can handle table updates. Last, we evaluate our suggested encoding techniques on synthetic forwarding tables as well as on real-life tables.
    INFOCOM, 2013 Proceedings IEEE; 01/2013
  • O. Rottenstreich, I. Keslassy, A. Hassidim, H. Kaplan, E. Porat
    ABSTRACT: Hardware-based packet classification has become an essential component in many networking devices. It often relies on TCAMs (ternary content-addressable memories), which need to compare the packet header against a set of rules. But efficiently encoding these rules is not an easy task. In particular, the most complicated rules are range rules, which usually require multiple TCAM entries to encode them. However, little is known on the optimal encoding of such non-trivial rules. In this work, we take steps towards finding an optimal encoding scheme for every possible range rule. We first present an optimal encoding for all possible generalized extremal rules. Such rules represent 89% of all non-trivial rules in a typical real-life classification database. We also suggest a new method of simply calculating the optimal expansion of an extremal range, and present a closed-form formula of the average optimal expansion over all extremal ranges. Next, we present new bounds on the worst-case expansion of general classification rules, both in one-dimensional and two-dimensional ranges. Last, we introduce a new TCAM architecture that can leverage these results by providing a guaranteed expansion on the tough rules, while dealing with simpler rules using a regular TCAM. We conclude by verifying our theoretical results in experiments with synthetic and real-life classification databases.
    INFOCOM, 2013 Proceedings IEEE; 01/2013
  • O. Rottenstreich, A. Berman, Y. Cassuto, I. Keslassy
    ABSTRACT: To enable direct access to a memory word based on its index, memories make use of fixed-width arrays, in which a fixed number of bits is allocated for the representation of each data entry. In this paper we consider the problem of encoding data entries of two fields, drawn independently according to known and generally different distributions. Our goal is to find two prefix codes for the two fields, that jointly maximize the probability that the total length of an encoded data entry is within a fixed given width. We study this probability and develop upper and lower bounds. We also show how to find an optimal code for the second field given a fixed code for the first field.
    Information Theory Proceedings (ISIT), 2013 IEEE International Symposium on; 01/2013
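A sketch of the objective in this problem, using plain Huffman codes for the two fields as a plausible (not claimed optimal) choice: given a prefix code per field, compute the probability that the concatenated two-field codeword fits within the fixed memory width.

```python
import heapq

def huffman_lengths(probs):
    """Return a Huffman codeword length for each symbol probability."""
    if len(probs) == 1:
        return [1]
    heap = [(p, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, s1 = heapq.heappop(heap)
        p2, s2 = heapq.heappop(heap)
        for i in s1 + s2:                 # merging deepens every member symbol
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, s1 + s2))
    return lengths

def fit_probability(p_a, p_b, width):
    """P(total encoded length of the two independent fields <= width)."""
    la, lb = huffman_lengths(p_a), huffman_lengths(p_b)
    return sum(pa * pb
               for pa, l1 in zip(p_a, la)
               for pb, l2 in zip(p_b, lb)
               if l1 + l2 <= width)
```

The paper's point is that codes maximizing this fit probability generally differ from average-length-optimal codes like the Huffman codes used here.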
  • Ori Rottenstreich, Yossi Kanizo, Isaac Keslassy
    ABSTRACT: Counting Bloom Filters (CBFs) are widely used in networking device algorithms. They implement fast set representations to support membership queries with limited error, and support element deletions unlike Bloom Filters. However, they consume significant amounts of memory. In this paper we introduce a new general method based on variable increments to improve the efficiency of CBFs and their variants. Unlike CBFs, at each element insertion, the hashed counters are incremented by a hashed variable increment instead of a unit increment. Then, to query an element, the exact value of a counter is considered, and not just whether it is positive. We present two simple schemes based on this method. We demonstrate that this method can always achieve a lower false positive rate and a lower overflow probability bound than CBF in practical systems. We also show how it can be easily implemented in hardware, with limited added complexity and memory overhead. We further explain how this method can extend many variants of CBF that have been published in the literature. Last, using simulations, we show how it can improve the false positive rate of CBFs by up to an order of magnitude given the same amount of memory.
    Proceedings - IEEE INFOCOM 01/2012; 22(4):1880-1888.
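A simplified sketch of the variable-increment idea (the parameters, hash construction, and exact query rule below are assumptions for illustration, not the paper's VI-CBF construction): each element adds a hashed increment from a small set D to each of its counters instead of adding 1, and a query inspects counter values, not just positivity.

```python
import hashlib

M, K = 64, 3                  # number of counters and hash functions
D = [4, 5, 6, 7]              # variable increments (the paper uses Bh-like sets)
counters = [0] * M

def _h(x, i):
    """Deterministic hash of x for slot/increment selection i."""
    d = hashlib.sha256(f"{i}:{x}".encode()).digest()
    return int.from_bytes(d[:4], "big")

def slots(x):
    """K (counter index, variable increment) pairs for element x."""
    return [(_h(x, i) % M, D[_h(x, i + K) % len(D)]) for i in range(K)]

def insert(x):
    for j, v in slots(x):
        counters[j] += v

def query(x):
    # Negative if any counter is too small to contain x's increment, or if it
    # holds exactly one increment that differs from x's (exact-value test).
    for j, v in slots(x):
        c = counters[j]
        if c < v or (c < 2 * min(D) and c != v):
            return False
    return True
```

The exact-value test is what lets this scheme reject elements a unit-increment CBF would falsely accept: a counter holding a single foreign increment is visibly incompatible.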
  • Ori Rottenstreich, Isaac Keslassy
    ABSTRACT: In this paper, we uncover the Bloom paradox in Bloom filters: sometimes, it is better to disregard the query results of Bloom filters, and in fact not to even query them, thus making them useless. We first analyze conditions under which the Bloom paradox occurs in a Bloom filter, and demonstrate that it depends on the a priori probability that a given element belongs to the represented set. We show that the Bloom paradox also applies to Counting Bloom Filters (CBFs), and depends on the product of the hashed counters of each element. In addition, both for Bloom filters and CBFs, we suggest improved architectures that deal with the Bloom paradox. We also provide fundamental memory lower bounds required to support element queries with limited false-positive and false-negative rates. Last, using simulations, we verify our theoretical results, and show that our improved schemes can lead to a significant improvement in the performance of Bloom filters and CBFs.
    Proceedings - IEEE INFOCOM 01/2012
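The dependence on the a priori membership probability can be seen with a one-line Bayes computation (the numbers below are illustrative assumptions): when the prior is low enough, a positive answer from the filter is still almost certainly a false positive, so the query result carries almost no information.

```python
def posterior_in_set(prior, false_positive_rate):
    """P(element in set | filter says yes), for a filter with no false negatives."""
    p_yes = prior * 1.0 + (1 - prior) * false_positive_rate
    return prior / p_yes

# With a 1% false-positive rate but a one-in-a-million prior, a positive
# answer is overwhelmingly likely to be a lie:
p = posterior_in_set(1e-6, 0.01)
```

With a balanced prior of 0.5 the same filter's positive answer is trustworthy, which is exactly the regime dependence the paper analyzes.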
  • A. Berman, Y. Birk, O. Rottenstreich
    ABSTRACT: The level of write-once memory cells (e.g., Flash) can only be raised individually. Bulk erasure is possible, but only a number of times (endurance) that decreases sharply with increasing cell capacity or cell-size reduction. A device's declared storage capacity and the total amount of information that can be written to it over its lifetime thus jointly characterize it. Write-once memory (WOM) coding permits a trade-off between these, enabling multiple writes between erasures. The guaranteed (over data) number of such writes is presently considered the important measure. We observe that given the typical endurance (10^4-10^5 for Flash), the actual write capacity is very close to its mean value with extremely high probability, rendering the mean number of writes between erasures much more interesting than its guaranteed value in any practical setting. For the Linear WOM code, we derive the write capacity CDF, and show that even the nearly guaranteed number of writes between erasures is substantially larger than the guaranteed one. Based on this, we present Piecewise Linear WOM code, whereby a page is partitioned into numerous independent WOM segments (whose writes must all succeed), permitting a flexible storage/write capacity tradeoff. We show that the aforementioned behavior is preserved. Finally, we outline an extension to Multi-Level cells.
    Communication, Control, and Computing (Allerton), 2012 50th Annual Allerton Conference on; 01/2012
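The paper analyzes the Linear WOM code; as a hedged illustration of WOM coding in general (not the paper's code), here is the classic Rivest-Shamir code that writes a 2-bit value twice into 3 write-once cells, where cells can only go from 0 to 1 between erasures.

```python
# First- and second-generation codewords for each 2-bit value 0..3.
GEN1 = {0: (0, 0, 0), 1: (0, 0, 1), 2: (0, 1, 0), 3: (1, 0, 0)}
GEN2 = {0: (1, 1, 1), 1: (1, 1, 0), 2: (1, 0, 1), 3: (0, 1, 1)}

def write1(v):
    """First write of value v into fresh (all-zero) cells."""
    return GEN1[v]

def write2(cells, v):
    """Second write of value v; only allowed to raise cells from 0 to 1."""
    if decode(cells) == v:
        return cells                      # same value: no cell needs to rise
    new = GEN2[v]
    assert all(a <= b for a, b in zip(cells, new)), "write-once violated"
    return new

def decode(cells):
    """Cell weight tells the generation; invert the matching table."""
    table = GEN1 if sum(cells) <= 1 else GEN2
    return next(v for v, c in table.items() if c == cells)
```

Two guaranteed writes per erasure in 3 cells instead of 2 is the kind of storage/write-capacity trade-off the abstract refers to; the paper's contribution is about the distribution, not just the guarantee, of the number of writes.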
  • Itamar Cohen, Ori Rottenstreich, Isaac Keslassy
    ABSTRACT: Chip multiprocessors (CMPs) combine increasingly many general-purpose processor cores on a single chip. These cores run several tasks with unpredictable communication needs, resulting in uncertain and often-changing traffic patterns. This unpredictability leads network-on-chip (NoC) designers to plan for the worst case traffic patterns, and significantly overprovision link capacities. In this paper, we provide NoC designers with an alternative statistical approach. We first present the traffic-load distribution plots (T-Plots), illustrating how much capacity overprovisioning is needed to service 90, 99, or 100 percent of all traffic patterns. We prove that in the general case, plotting T-Plots is #P-complete, and therefore extremely complex. We then show how to determine the exact mean and variance of the traffic load on any edge, and use these to provide Gaussian-based models for the T-Plots, as well as guaranteed performance bounds. We also explain how to practically approximate T-Plots using random-walk-based methods. Finally, we use T-Plots to reduce the network power consumption by providing an efficient capacity allocation algorithm with predictable performance guarantees.
    IEEE Transactions on Computers 06/2010; 59:748-761. · 1.47 Impact Factor
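A hedged sketch of what a T-Plot measures (the topology, traffic model, and sampling below are illustrative assumptions; the paper's random-walk approximation differs): sample random permutation traffic patterns on a line of n nodes with shortest-path routing, and record the load crossing one fixed edge. Percentiles of the sampled loads show how much capacity covers, say, 99% of patterns.

```python
import random

def edge_load(perm, edge):
    """Number of flows src -> perm[src] crossing the cut between edge, edge+1."""
    return sum(1 for src, dst in enumerate(perm)
               if min(src, dst) <= edge < max(src, dst))

def t_plot_samples(n, edge, trials, seed=0):
    """Monte-Carlo samples of the load on one edge under permutation traffic."""
    rng = random.Random(seed)
    loads = []
    for _ in range(trials):
        perm = list(range(n))
        rng.shuffle(perm)
        loads.append(edge_load(perm, edge))
    return loads

loads = t_plot_samples(n=8, edge=3, trials=1000)
```

Here the mean load on the middle edge is 4, while the worst case is 8; provisioning for a high percentile instead of the worst case is the capacity saving the paper quantifies.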
  • Ori Rottenstreich, Isaac Keslassy
    ABSTRACT: Designers of TCAMs (Ternary CAMs) for packet classification deal with unpredictable sets of rules, resulting in highly variable rule expansions, and rely on heuristic encoding algorithms with no reasonable expansion guarantees. In this paper, given several types of rules, we provide new upper bounds on the TCAM worst-case rule expansions. In particular, we prove that a W-bit range can be encoded using W TCAM entries, improving upon the previously-known bound of 2W-5. We also propose a modified TCAM architecture that uses additional logic to significantly reduce the rule expansions, both in the worst case and in experiments with real-life classification databases.
    INFOCOM 2010. 29th IEEE International Conference on Computer Communications, Joint Conference of the IEEE Computer and Communications Societies, 15-19 March 2010, San Diego, CA, USA; 01/2010
  • Ori Rottenstreich, Isaac Keslassy
    ABSTRACT: All high-speed Internet devices need to implement classification, i.e. they must determine whether incoming packet headers belong to a given subset of a search space. To do so, they encode the subset using ternary arrays in special high-speed devices called TCAMs (ternary content-addressable memories). However, the optimal coding for arbitrary subsets is unknown. In particular, to encode an arbitrary range subset of the space of all W-bit values, previous works have successively reduced the upper bound on the code length from 2W-2 to 2W-4, then 2W-5, and finally W TCAM entries. In this paper, we prove that this final result is optimal for typical prefix coding and cannot be further improved, i.e. the bound of W is tight. To do so, we introduce new analytical tools based on independent sets and alternating paths.
    IEEE International Symposium on Information Theory, ISIT 2010, June 13-18, 2010, Austin, Texas, USA, Proceedings; 01/2010
  • I. Cohen, O. Rottenstreich, I. Keslassy
    ABSTRACT: Chip multiprocessors (CMPs) combine increasingly many general-purpose processor cores on a single chip. These cores run several tasks with unpredictable communication needs, resulting in uncertain and often-changing traffic patterns. This unpredictability leads network-on-chip (NoC) designers to plan for the worst-case traffic patterns, and significantly over-provision link capacities. In this paper, we provide NoC designers with an alternative statistical approach. We first present the traffic-load distribution plots (T-plots), illustrating how much capacity over-provisioning is needed to service 90%, 99%, or 100% of all traffic patterns. We prove that in the general case, plotting T-plots is #P-complete, and therefore extremely complex. We then show how to determine the exact mean and variance of the traffic load on any edge, and use these to provide Gaussian-based models for the T-plots, as well as guaranteed performance bounds. Finally, we use T-plots to reduce the network power consumption by providing an efficient capacity allocation algorithm with predictable performance guarantees.
    Networks-on-Chip, 2008. NoCS 2008. Second ACM/IEEE International Symposium on; 05/2008
  • Itamar Cohen, Ori Rottenstreich, Isaac Keslassy
    ABSTRACT: Chip multiprocessors (CMPs) combine increasingly many general-purpose processor cores on a single chip. These cores run several tasks with unpredictable communication needs, resulting in uncertain and often-changing traffic patterns. This unpredictability leads network-on-chip (NoC) designers to plan for the worst-case traffic patterns, and significantly over-provision link capacities. In this paper, we provide NoC designers with an alternative statistical approach. We first present the traffic-load distribution plots (T-Plots), illustrating how much capacity over-provisioning is needed to service 90%, 99%, or 100% of all traffic patterns. We prove that in the general case, plotting T-Plots is #P-complete, and therefore extremely complex. We then show how to determine the exact mean and variance of the traffic load on any edge, and use these to provide Gaussian-based models for the T-Plots, as well as guaranteed performance bounds. We also explain how to practically approximate T-Plots using random-walk-based methods. Finally, we use T-Plots to reduce the network power consumption by providing an efficient capacity allocation algorithm with predictable performance guarantees.
  •
    ABSTRACT: Packet reordering has now become one of the most significant bottlenecks in next-generation switch designs. In this paper, we argue that current packet order requirements for switches are too stringent, for little or no reason. Instead of requiring all packets sharing the same switch input and switch output to be ordered, we only require packets that share the same source and destination IP addresses to be ordered. We then exploit this new definition by suggesting several hash-based counter schemes that prevent inter-flow blocking and reduce reordering delay. The schemes are transparent to the routing scheme. We further suggest schemes based on network coding to protect against rare events with high queueing delay. We also point out an inherent reordering delay unfairness between elephants and mice, and introduce several mechanisms to correct this unfairness. Last, we demonstrate using both analysis and simulations that the use of these solutions can indeed reduce the resequencing delay. For instance, resequencing delays are reduced by up to a factor of 10 using real-life traces and a real hashing function.