Conference Paper

A Multi-attribute Data Structure with Parallel Bloom Filters for Network Services

DOI: 10.1007/11945918_30
Conference: High Performance Computing - HiPC 2006, 13th International Conference, Bangalore, India, December 18-21, 2006, Proceedings
Source: DBLP


Bloom filters are widely used to represent sets of items because they are simple, space-efficient randomized data structures. In this paper, we propose a new structure, based on Bloom filters, for representing items with multiple attributes. The structure is composed of Parallel Bloom Filters (PBF) and a hash table, which together support accurate and efficient representation and querying of items. The PBF is a counter-based matrix consisting of multiple submatrices, each of which stores one attribute of an item. The hash table, as an auxiliary structure, captures a verification value for each item that reflects the inherent dependency among all of the item's attributes. Because correctly querying an item with multiple attributes is more complicated than querying a single value, we use a two-step verification process to confirm the presence of a particular item and reduce the false positive probability.
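The structure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name, the salted SHA-256 hashing, the parameter defaults, and above all the way the verification value is computed (a digest over all attributes joined together) are assumptions made for illustration only. The sketch keeps one counter array per attribute and a dictionary as the auxiliary hash table, and a query succeeds only if both verification steps pass.

```python
import hashlib


class ParallelBloomFilter:
    """Illustrative sketch: one counting Bloom filter per attribute plus an
    auxiliary verification table. All names and hashing choices are assumed."""

    def __init__(self, num_attributes, num_counters=1024, num_hashes=4):
        self.num_attributes = num_attributes
        self.m = num_counters
        self.k = num_hashes
        # One counter array ("submatrix") per attribute.
        self.counters = [[0] * num_counters for _ in range(num_attributes)]
        # Auxiliary hash table: verification value -> number of stored items.
        self.verification = {}

    def _positions(self, attr_index, value):
        # Derive k counter positions from salted SHA-256 digests (assumed scheme).
        positions = []
        for i in range(self.k):
            data = f"{attr_index}:{i}:{value}".encode("utf-8")
            digest = hashlib.sha256(data).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.m)
        return positions

    def _verification_value(self, item):
        # Assumed scheme: a digest over all attributes together, so the value
        # depends on the combination rather than on each attribute alone.
        joined = "\x1f".join(str(v) for v in item).encode("utf-8")
        return hashlib.sha256(joined).hexdigest()

    def _check_arity(self, item):
        if len(item) != self.num_attributes:
            raise ValueError("item has the wrong number of attributes")

    def insert(self, item):
        self._check_arity(item)
        for a, value in enumerate(item):
            for p in self._positions(a, value):
                self.counters[a][p] += 1
        v = self._verification_value(item)
        self.verification[v] = self.verification.get(v, 0) + 1

    def attributes_present(self, item):
        # Step 1: each attribute must appear in its own submatrix.
        self._check_arity(item)
        return all(
            self.counters[a][p] > 0
            for a, value in enumerate(item)
            for p in self._positions(a, value)
        )

    def query(self, item):
        # Step 2 runs only if step 1 passes: the attributes must also have
        # been inserted together, as recorded by the verification value.
        if not self.attributes_present(item):
            return False
        return self._verification_value(item) in self.verification
```

For example, after inserting ("10.0.0.1", 80, "tcp") and ("10.0.0.2", 443, "udp"), the mixed item ("10.0.0.1", 443, "tcp") passes the per-attribute check, because every one of its attribute values was inserted by some item, but it is rejected by the verification step. Storing a full digest in a dictionary is a simplification; a space-conscious design would store a much shorter verification value and accept a small residual false positive probability.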

  • Source
    • "Parallel bloom filter (PBF) [7], [8] tries to solve the problem by adding a verification value of each attribute insertion for the item. Then it stores the verification value in a summary bloom filter. "
    ABSTRACT: With the rapid accumulation of data of various types, modern database systems face the problem of managing multidimensional data. The main challenge is to design a highly efficient storage mechanism that supports fast item lookup with exact membership queries or partial-information membership queries. This paper presents a novel data structure called Cartesian-join of Bloom Filters. The method maintains a matrix that stores the Cartesian product of attribute bloom filters, each of which represents one dimension of the dataset. Experiments show that the proposed approach not only achieves the same false positive rate as a traditional bloom filter of the same size, but also offers the advantage of by-attribute membership queries. The data structure uses only ten bits to store a four-dimensional item, and the average false rate for a query is one percent. The algorithm is robust even under highly correlated queries.
    Full-text · Article · Sep 2014
  • Source
    • "[32] discusses the use of Bloom Filters to speed-up the name-to-location resolution process in large distributed systems. Parallel versions of Bloom Filters were also proposed for multi-core applications [13] [22] [27]. A related problem of computing the number of distinct elements in a data stream was studied in [20]. "
    ABSTRACT: The unparalleled growth and popularity of the Internet, coupled with the advent of diverse modern applications such as search engines, on-line transactions, climate warning systems, etc., has led to an unprecedented expansion in the volume of data stored world-wide. Efficient storage, management, and processing of such massive amounts of data has emerged as a central theme of research in this direction. Detection and removal of redundancies and duplicates in real-time from such multi-trillion-record sets to bolster resource and compute efficiency constitutes a challenging area of study. The infeasibility of storing the entire data from potentially unbounded data streams, together with the need for precise elimination of duplicates, calls for intelligent approximate duplicate detection algorithms. The literature hosts numerous works based on the well-known probabilistic bitmap structure, the Bloom Filter, and its variants. In this paper we propose a novel data structure, Streaming Quotient Filter (SQF), for efficient detection and removal of duplicates in data streams. SQF intelligently stores the signatures of elements arriving on a data stream and, along with an eviction policy, provides near zero false positive and false negative rates. We show that the near optimal performance of SQF is achieved with a very low memory requirement, making it ideal for real-time memory-efficient de-duplication applications with extremely low tolerance for false positives and false negatives. We present a detailed theoretical analysis of the working of SQF, providing a guarantee on its performance. Empirically, we compare SQF to alternate methods and show that the proposed method is superior in terms of memory and accuracy compared to the existing solutions. We also discuss Dynamic SQF for evolving streams and the parallel implementation of SQF.
    Full-text · Article · Jun 2013 · Proceedings of the VLDB Endowment
  • Source
    • "Bloom Filters have been applied even to network related applications such as finding heavy flows for stochastically fair blue queue management [26], packet classification [27], per-flow state management and longest prefix matching [28]. Multiple Bloom Filters in conjunction with hash tables have been studied to represent items with multiple attributes accurately and efficiently with low false positive rates [29]. Bloomjoin used for distributed joins have also been extended to minimize network usage for query execution based on database statistics. "
    ABSTRACT: Applications involving telecommunication call data records, web pages, online transactions, medical records, stock markets, climate warning systems, etc., necessitate efficient management and processing of massive amounts of data from diverse sources. De-duplication, or Intelligent Compression, in streaming scenarios for approximate identification and elimination of duplicates from such unbounded data streams is a greater challenge given the real-time nature of data arrival. Stable Bloom Filters (SBF) address this problem to a certain extent. In this work, we present several novel algorithms for the problem of approximate detection of duplicates in data streams. We propose the Reservoir Sampling based Bloom Filter (RSBF), combining the working principles of reservoir sampling and Bloom Filters. We also present variants of the novel Biased Sampling based Bloom Filter (BSBF), based on biased sampling concepts. We also propose a randomized load balanced variant of the sampling Bloom Filter approach to efficiently tackle duplicate detection. In this work, we thus provide a generic framework for de-duplication using Bloom Filters. Using detailed theoretical analysis, we prove analytical bounds on the false positive rate, false negative rate and convergence rate of the proposed structures. We show that our models clearly outperform the existing methods. We also present an empirical analysis of the structures using real-world datasets (3 million records) and synthetic datasets (1 billion records) capturing various input distributions.
    Full-text · Article · Dec 2012