Conference Proceeding

A Multi-attribute Data Structure with Parallel Bloom Filters for Network Services.

01/2006; DOI:10.1007/11945918_30 In proceeding of: High Performance Computing - HiPC 2006, 13th International Conference, Bangalore, India, December 18-21, 2006, Proceedings
Source: DBLP

ABSTRACT: A Bloom filter has been widely utilized to represent a set of items because it is a simple, space-efficient randomized data structure. In this paper, we propose a new structure, based on Bloom filters, to support the representation of items with multiple attributes. The structure is composed of Parallel Bloom Filters (PBF) and a hash table to support the accurate and efficient representation and query of items. The PBF is a counter-based matrix and consists of multiple submatrices. Each submatrix can store one attribute of an item. The hash table, as an auxiliary structure, captures a verification value of an item, which reflects the inherent dependency among all attributes of the item. Because correctly querying an item with multiple attributes is complicated, we use a two-step verification process to confirm the presence of a particular item and reduce the false positive probability.
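The structure described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the class name, the salted SHA-256 hashing, the default sizes, and in particular the way the verification value is computed (a digest over all attributes joined together) are assumptions made here for illustration only. The sketch keeps one counting Bloom filter per attribute and an auxiliary hash table, and answers a query in two steps.

```python
import hashlib


def positions(value, k, m, salt):
    """Map value to k counter indices in [0, m) using salted SHA-256 (illustrative choice)."""
    out = []
    for i in range(k):
        digest = hashlib.sha256(f"{salt}:{i}:{value}".encode()).digest()
        out.append(int.from_bytes(digest[:8], "big") % m)
    return out


class ParallelBloomFilters:
    """Hypothetical sketch: one counting Bloom filter per attribute plus a hash table."""

    def __init__(self, num_attributes, m=1024, k=4):
        self.p, self.m, self.k = num_attributes, m, k
        # One row of counters (a "submatrix") per attribute.
        self.counters = [[0] * m for _ in range(num_attributes)]
        # Auxiliary hash table: verification value -> number of stored items.
        self.verification = {}

    def _verification_value(self, attributes):
        # Assumption: a digest over all attributes jointly stands in for the
        # paper's verification value, tying the attributes of one item together.
        return hashlib.sha256("\x1f".join(attributes).encode()).hexdigest()

    def _check_arity(self, attributes):
        if len(attributes) != self.p:
            raise ValueError(f"expected {self.p} attributes, got {len(attributes)}")

    def insert(self, attributes):
        self._check_arity(attributes)
        for j, attr in enumerate(attributes):
            for pos in positions(attr, self.k, self.m, salt=j):
                self.counters[j][pos] += 1
        v = self._verification_value(attributes)
        self.verification[v] = self.verification.get(v, 0) + 1

    def attributes_present(self, attributes):
        """Step 1: every attribute must appear in its own filter."""
        self._check_arity(attributes)
        return all(
            self.counters[j][pos] > 0
            for j, attr in enumerate(attributes)
            for pos in positions(attr, self.k, self.m, salt=j)
        )

    def query(self, attributes):
        """Step 2 runs only if step 1 passes: check the joint verification value."""
        if not self.attributes_present(attributes):
            return False
        return self._verification_value(attributes) in self.verification

    def delete(self, attributes):
        """Counters allow removal; only items that pass both steps are removed."""
        if not self.query(attributes):
            return False
        for j, attr in enumerate(attributes):
            for pos in positions(attr, self.k, self.m, salt=j):
                self.counters[j][pos] -= 1
        v = self._verification_value(attributes)
        self.verification[v] -= 1
        if self.verification[v] == 0:
            del self.verification[v]
        return True


pbf = ParallelBloomFilters(num_attributes=3)
pbf.insert(("10.0.0.1", "80", "tcp"))
pbf.insert(("10.0.0.2", "443", "udp"))
# Each attribute of this mixed tuple was stored, but never together as one item,
# so step 1 passes while step 2 rejects the query.
mixed = ("10.0.0.1", "443", "tcp")
print(pbf.attributes_present(mixed), pbf.query(mixed))
```

The mixed tuple at the end shows why the per-attribute filters alone are not enough: attributes drawn from different stored items pass every individual filter, and only the check against the hash table rules the combination out.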

  • Source
    ABSTRACT: Removing redundancy in the data is an important problem as it helps in resource and compute efficiency for downstream processing of massive (10 million to 100 million records) datasets. In application domains such as IR, stock markets, telecom and others there is a strong need for real-time data redundancy removal of enormous amounts of data flowing at the rate of 1Gb/s or higher. We consider the problem of finding Range Motifs (clusters) over records in a large dataset such that records within the same cluster are approximately close to each other. This problem is closely related to the approximate nearest neighbour search but is more computationally expensive. Real-time scalable approximate Range Motif discovery on massive datasets is a challenging problem. We present the design of novel sequential and parallel approximate Range Motif discovery and data de-duplication algorithms using Bloom filters. We establish asymptotic upper bounds on the false positive and false negative rates for our algorithm. Further, time complexity analysis of our parallel algorithm on multi-core architectures has been presented. For 10 million records, our parallel algorithm can perform approximate Range Motif discovery and data de-duplication, on 4 sets (clusters), in 59s, on 16 core Intel Xeon 5570 architecture. This gives a throughput of around 170K records/s and around 700Mb/s (using records of size 4K bits). To the best of our knowledge, this is the highest real-time throughput for approximate Range Motif discovery and data redundancy removal on such massive datasets.
    EDBT 2011, 14th International Conference on Extending Database Technology, Uppsala, Sweden, March 21-24, 2011, Proceedings; 01/2011
  • Source
    ABSTRACT: With the explosion of information stored world-wide, data-intensive computing has become a central area of research. Efficient management and processing of this exponentially growing volume of data from diverse sources, such as telecommunication call data records, online transaction records, etc., has become a necessity. Removing redundancy from such huge (multi-billion record) datasets, resulting in resource and compute efficiency for downstream processing, constitutes an important area of study. "Intelligent compression" or deduplication in streaming scenarios, for precise identification and elimination of duplicates from the unbounded data stream, is a greater challenge given the real-time nature of data arrival. Stable Bloom Filters (SBF) address this problem to a certain extent. However, SBF suffers from a high false negative rate (FNR) and a slow convergence rate, thereby rendering it inefficient for applications with low FNR tolerance. In this paper, we present a novel Reservoir Sampling based Bloom Filter (RSBF) data structure, based on the combined concepts of reservoir sampling and Bloom filters, for approximate detection of duplicates in data streams. Using detailed theoretical analysis, we prove analytical bounds on its false positive rate (FPR), false negative rate (FNR) and convergence rate with low memory requirements. We show that RSBF offers the currently lowest FNR and convergence rates, which are better than those of SBF while using the same memory. Using empirical analysis on real-world datasets (3 million records) and synthetic datasets with around 1 billion records, we demonstrate up to 2x improvement in FNR with better convergence rates as compared to SBF, while exhibiting comparable FPR. To the best of our knowledge, this is the first attempt to integrate the reservoir sampling method with Bloom filters for deduplication in streaming scenarios.
  • ABSTRACT: This paper presents a scalable and adaptive decentralized metadata lookup scheme for ultra large-scale file systems (petabytes or even exabytes). Our scheme logically organizes metadata servers (MDSs) into a multi-layered query hierarchy and exploits grouped Bloom filters to efficiently route metadata requests to the desired MDSs through the hierarchy. This metadata lookup scheme can be executed at network or memory speed. An effective workload balancing algorithm is also developed for server reconfigurations. The scheme is evaluated through extensive trace-driven simulations and a prototype implementation in Linux. It can significantly improve metadata management scalability and query efficiency in ultra large-scale storage systems.
    International Journal of Latest Trends in Engineering and Technology (IJLTET). 07/2013; Vol. 2(Issue 4):295-300.



Yu Hua