-
ABSTRACT: Large data sets are increasingly common in cloud and virtualized
environments. For example, transfers of multiple gigabytes are commonplace, as
are replicated blocks of such sizes. There is a need for fast error-correction
or data reconciliation in such settings even when the expected number of errors
is small.
Motivated by such cloud reconciliation problems, we consider error-correction
schemes designed for large data, after explaining why previous approaches
appear unsuitable. We introduce Biff codes, which are based on Bloom filters
and are designed for large data. For Biff codes with a message of length $L$
and $E$ errors, the encoding time is $O(L)$, decoding time is $O(L + E)$ and
the space overhead is $O(E)$. Biff codes are low-density parity-check codes;
they are similar to Tornado codes, but are designed for errors instead of
erasures. Further, Biff codes are designed to be very simple, removing any
explicit graph structures and based entirely on hash tables. We derive Biff
codes by a simple reduction from a set reconciliation algorithm for a recently
developed data structure, invertible Bloom lookup tables. While the underlying
theory is extremely simple, what makes this code especially attractive is the
ease with which it can be implemented and the speed of decoding. We present
results from a prototype implementation that decodes messages of 1 million
words with thousands of errors in well under a second.
08/2012;
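The abstract above does not spell out the Biff code construction, so the following is only a generic sketch of the set-reconciliation idea behind invertible Bloom lookup tables, not the authors' implementation. The table layout (three disjoint subtables of 64 cells), the SHA-256-based hashing, and every function name are assumptions made for illustration. The point it illustrates is that the table size depends on the number of differences, not on the size of the sets.

```python
import hashlib

K = 3       # assumed number of hash functions (one cell per subtable)
CELLS = 64  # assumed cells per subtable; sized for a handful of differences


def _h(tag, key):
    """64-bit hash of an integer key, varied by a tag (illustrative choice)."""
    digest = hashlib.sha256(f"{tag}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def _cells(key):
    # One cell in each of K disjoint subtables, so the K cells are distinct.
    return [i * CELLS + _h(i, key) % CELLS for i in range(K)]


def toggle(table, key, sign):
    """Add (sign=+1) or remove (sign=-1) a key from the table."""
    for c in _cells(key):
        table[c][0] += sign            # count
        table[c][1] ^= key             # XOR of keys
        table[c][2] ^= _h("chk", key)  # XOR of key checksums


def build(keys):
    table = [[0, 0, 0] for _ in range(K * CELLS)]
    for key in keys:
        toggle(table, key, +1)
    return table


def subtract(a, b):
    # Keys present in both sets cancel cell by cell.
    return [[x[0] - y[0], x[1] ^ y[1], x[2] ^ y[2]] for x, y in zip(a, b)]


def decode(diff):
    """Peel cells holding exactly one key until no more progress is made."""
    only_a, only_b = set(), set()
    progress = True
    while progress:
        progress = False
        for cell in diff:
            count, key, chk = cell
            if count in (1, -1) and _h("chk", key) == chk:
                (only_a if count == 1 else only_b).add(key)
                toggle(diff, key, -count)  # remove the recovered key
                progress = True
    fully_decoded = all(cell == [0, 0, 0] for cell in diff)
    return only_a, only_b, fully_decoded
```

In use, each party builds a table over its own set, one table is sent across, and the receiver calls `decode(subtract(table_a, table_b))`. Decoding can fail when too many differences share cells, which is why the sketch returns a `fully_decoded` flag rather than assuming success.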
-
ABSTRACT: As advances in technology allow for the collection, storage, and analysis of vast amounts of data, the task of screening and assessing the significance of discovered patterns is becoming a major challenge in data mining applications. In this work, we address significance in the context of frequent itemset mining. Specifically, we develop a novel methodology to identify a meaningful support threshold s* for a dataset, such that the number of itemsets with support at least s* represents a substantial deviation from what would be expected in a random dataset with the same number of transactions and the same individual item frequencies. These itemsets can then be flagged as statistically significant with a small false discovery rate. We present extensive experimental results to substantiate the effectiveness of our methodology. Comment: A preliminary version of this work was presented in ACM PODS 2009. 20 pages, 0 figures
02/2010;
-
IEEE/ACM Trans. Netw. 01/2010; 18:1752-1765.
-
ABSTRACT: Hashing is an extremely useful technique for a variety of high-speed packet-processing applications in routers. In this chapter,
we survey much of the recent work in this area, paying particular attention to the interaction between theoretical and applied
research. We assume very little background in either the theory or applications of hashing, reviewing the fundamentals as
necessary.
12/2009: pages 181-218;
-
SIAM J. Comput. 01/2009; 39:1543-1561.
-
ABSTRACT: We introduce the hiring problem, in which a growing company continuously interviews and decides whether to hire applicants. This problem is similar in spirit but quite different from the well-studied secretary problem. Like the secretary problem, it captures fundamental aspects of decision making under uncertainty and has many possible applications. We analyze natural strategies of hiring above the current average, considering both the mean and the median averages; we call these Lake Wobegon strategies. Like the hiring problem itself, our strategies are intuitive, simple to describe, and amenable to mathematically and economically significant modifications. We demonstrate several intriguing behaviors of the two strategies. Specifically, we show dramatic differences between hiring above the mean and above the median. We also show that both strategies are intrinsically connected to the lognormal distribution, leading to only very weak concentration results, and the marked importance of the first few hires on the overall outcome.
SIAM J. Comput. 01/2009; 39:1184-1193.
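The two Lake Wobegon strategies named in the abstract above can be tried out with a small simulation. This is a minimal sketch under assumptions that the abstract does not state: applicant quality is drawn uniformly from (0, 1), the company starts with a single employee of quality 0.5, and the function name `simulate` is hypothetical.

```python
import random
import statistics


def simulate(threshold_fn, interviews, seed=0, start_quality=0.5):
    """Interview applicants one by one; hire those above the current average.

    threshold_fn is statistics.mean or statistics.median, applied to the
    qualities of the current employees.
    """
    rng = random.Random(seed)
    staff = [start_quality]          # assumed starting employee
    for _ in range(interviews):
        quality = rng.random()       # assumed uniform applicant quality
        if quality > threshold_fn(staff):
            staff.append(quality)
    return staff


if __name__ == "__main__":
    for name, fn in (("mean", statistics.mean), ("median", statistics.median)):
        staff = simulate(fn, interviews=5000)
        print(name, "hires:", len(staff) - 1, "average:", round(fn(staff), 4))
```

Running it with different seeds and interview counts gives a feel for how differently the two thresholds behave and how much the first few hires matter.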
-
ABSTRACT: A standard technique from the hashing literature is to use two hash functions $h_1(x)$ and $h_2(x)$ to simulate additional hash functions of the form $g_i(x) = h_1(x) + i h_2(x)$. We demonstrate that this technique can be usefully applied to Bloom filters and related data structures. Specifically, only two hash functions are necessary to effectively implement a Bloom filter without any loss in the asymptotic false positive probability. This leads to less computation and potentially less need for randomness in practice. © 2008 Wiley Periodicals, Inc. Random Struct. Alg., 2008
Random Structures and Algorithms 08/2008; 33(2):187 - 218. · 1.03 Impact Factor
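The two-hash technique described in the abstract above can be sketched as a small Bloom filter. This is an illustrative sketch, not the paper's code: the class name, the use of two halves of a SHA-256 digest as the base hashes, and the reduction modulo the table size `m` are all assumptions.

```python
import hashlib


class DoubleHashBloomFilter:
    """Bloom filter whose k probe positions are derived from two base hashes."""

    def __init__(self, m, k):
        self.m = m                    # number of bit positions
        self.k = k                    # number of simulated hash functions
        self.bits = bytearray(m)      # one byte per bit, for readability

    def _base_hashes(self, item):
        # Assumed choice: two 64-bit slices of one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return h1, h2

    def _positions(self, item):
        # g_i(x) = h1(x) + i * h2(x), reduced modulo the table size.
        h1, h2 = self._base_hashes(item)
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # May report false positives, never false negatives.
        return all(self.bits[pos] for pos in self._positions(item))
```

For example, after `bf = DoubleHashBloomFilter(m=1024, k=5)` and `bf.add("alpha")`, the test `"alpha" in bf` is true, while each item costs one digest computation instead of five independent hashes.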
-
Algorithms - ESA 2008, 16th Annual European Symposium, Karlsruhe, Germany, September 15-17, 2008. Proceedings; 01/2008
-
IEEE/ACM Trans. Netw. 01/2008; 16:218-231.
-
INFOCOM 2008. 27th IEEE International Conference on Computer Communications, Joint Conference of the IEEE Computer and Communications Societies, 13-18 April 2008, Phoenix, AZ, USA; 01/2008
-
Random Struct. Algorithms. 01/2008; 33:187-218.
-
Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2008, San Francisco, California, USA, January 20-22, 2008; 01/2008
-
ABSTRACT: A standard technique from the hashing literature is to use two hash functions $h_1(x)$ and $h_2(x)$ to simulate additional hash functions of the form $g_i(x) = h_1(x) + i h_2(x)$. We demonstrate that this technique can be usefully applied to Bloom filters and related data structures. Specifically,
only two hash functions are necessary to effectively implement a Bloom filter without any loss in the asymptotic false positive
probability. This leads to less computation and potentially less need for randomness in practice.
09/2006: pages 456-467;
-
Algorithms - ESA 2006, 14th Annual European Symposium, Zurich, Switzerland, September 11-13, 2006, Proceedings; 01/2006
-
ABSTRACT: Cuckoo hashing holds great potential as a high-performance hashing scheme for real applications. Up to this point, the greatest
drawback of cuckoo hashing appears to be that there is a polynomially small but practically significant probability that a
failure occurs during the insertion of an item, requiring an expensive rehashing of all items in the table. In this paper,
we show that this failure probability can be dramatically reduced by the addition of a very small constant-sized stash. We demonstrate both analytically and through simulations that stashes of size equivalent to only three or four items yield
tremendous improvements, enhancing cuckoo hashing’s practical viability in both hardware and software. Our analysis naturally
extends previous analyses of multiple cuckoo hashing variants, and the approach may prove useful in further related schemes.
01/1970: pages 611-622;
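The stash idea in the abstract above can be illustrated with a toy two-table cuckoo hash. This is a minimal sketch under assumptions of its own: the class name, the salted use of Python's built-in `hash`, the default stash size of four, and the eviction limit are illustrative choices, and keys are assumed to be hashable and not `None`.

```python
import random


class CuckooWithStash:
    """Two-table cuckoo hash set with a small constant-sized stash."""

    def __init__(self, buckets_per_table=16, stash_size=4, max_kicks=32, seed=0):
        self.n = buckets_per_table
        self.tables = [[None] * self.n, [None] * self.n]
        self.stash = []               # overflow area, searched linearly
        self.stash_size = stash_size
        self.max_kicks = max_kicks
        rng = random.Random(seed)
        self.salts = (rng.getrandbits(64), rng.getrandbits(64))

    def _slot(self, t, key):
        # Assumed hash: Python's hash of (salt, key), one salt per table.
        return hash((self.salts[t], key)) % self.n

    def contains(self, key):
        in_tables = any(self.tables[t][self._slot(t, key)] == key for t in (0, 1))
        return in_tables or key in self.stash

    def insert(self, key):
        if self.contains(key):
            return
        t = 0
        for _ in range(self.max_kicks):
            s = self._slot(t, key)
            # Place the key and pick up whatever occupied the slot.
            key, self.tables[t][s] = self.tables[t][s], key
            if key is None:
                return
            t = 1 - t                 # evicted key tries its other table
        if len(self.stash) < self.stash_size:
            self.stash.append(key)    # the stash absorbs the failed insertion
            return
        # Without a stash this would happen on every failed insertion.
        raise RuntimeError(f"rehash needed; unplaced key: {key!r}")
```

A lookup checks two table slots plus at most `stash_size` stash entries, so keeping the stash to a few items keeps the worst-case lookup cost constant while avoiding most full rehashes.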
-
Applied Sciences.
-
ABSTRACT: A technique from the hashing literature is to use two hash functions $h_1(x)$ and $h_2(x)$ to simulate additional hash functions of the form $g_i(x) = h_1(x) + i h_2(x)$. We demonstrate that this technique can be usefully applied to Bloom filters and related data structures. Specifically, only two hash functions are necessary to effectively implement a Bloom filter without any loss in the asymptotic false positive probability. This leads to less computation and potentially less need for randomness in practice.
-
ABSTRACT: Cuckoo hashing combines multiple-choice hashing with the power to move elements, providing hash tables with very high space utilization and low probability of overflow. However, inserting a new object into such a hash table can take substantial time, requiring many elements to be moved. While these events are rare and the amortized performance of these data structures is excellent, this shortcoming is unacceptable in many applications, particularly those involving hardware router implementations. We address this difficulty, focusing on the potential of content-addressable memories and queueing techniques to provide a de-amortization of cuckoo hashing suitable for hardware, and in particular for high-performance routers.