A Multi-attribute Data Structure with Parallel
Bloom Filters for Network Services⋆
Yu Hua1,2and Bin Xiao1
1Department of Computing
Hong Kong Polytechnic University, Kowloon, Hong Kong
2School of Computer Science and Technology
Huazhong University of Science and Technology, Wuhan, China
Abstract. A Bloom filter has been widely utilized to represent a set of
items because it is a simple space-efficient randomized data structure. In
this paper, we propose a new structure to support the representation of
items with multiple attributes based on Bloom filters. The structure is
composed of Parallel Bloom Filters (PBF) and a hash table to support
the accurate and efficient representation and query of items. The PBF is
a counter-based matrix and consists of multiple submatrixes. Each sub-
matrix can store one attribute of an item. The hash table as an auxiliary
structure captures a verification value of an item, which can reflect the
inherent dependency of all attributes for the item. Because the correct
query of an item with multiple attributes becomes complicated, we use a
two-step verification process to ensure the presence of a particular item
to reduce false positive probability.
1 Introduction

A standard Bloom filter can represent a set of items as a bit array using several
independent hash functions and support the query of items. Using a Bloom
filter to represent a set, one can query whether an item is a member of the set
according to the Bloom filter, instead of the set. This compact representation is
the tradeoff for allowing a small probability of false positive in the membership
query. However, the space savings often outweigh this drawback when the false
positive probability is rather low. Bloom filters are widely used in practice
when space resources are at a premium.
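As a quick illustration, the mechanism above can be sketched in a few lines of Python; the class name, the salted SHA-1 hashing, and the parameters m and k are illustrative choices rather than any particular implementation:

```python
import hashlib

class BloomFilter:
    """A standard Bloom filter: a bit array of size m probed by k hash functions."""

    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _hashes(self, item):
        # Derive k hash positions from salted SHA-1 digests (illustrative choice).
        for j in range(self.k):
            digest = hashlib.sha1(f"{j}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._hashes(item):
            self.bits[pos] = 1

    def query(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._hashes(item))
```

A query answers "possibly in the set" or "definitely not in the set": adding an item sets at most k bits, so a negative answer can never be wrong, while a positive answer carries a small false positive probability.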
From the standard Bloom filter, many other forms of Bloom filters have been
proposed for various purposes, such as counting Bloom filters, compressed Bloom
filters, hierarchical Bloom filters, space-code Bloom filters, and spectral
Bloom filters. Counting Bloom filters replace an array of bits with counters
in order to count the number of items hashed to each location. It is very useful
to apply counting Bloom filters to support the deletion operation and handle a
set that is changing over time.

⋆ This work is partially supported by HK RGC CERG B-Q827 and POLYU A-PA2F,
and by the National Basic Research 973 Program of China under Grant
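The counting variant can be sketched the same way; replacing each bit with a counter makes deletion safe (again an illustrative Python sketch, with salted SHA-1 hashing as an assumed hash family):

```python
import hashlib

class CountingBloomFilter:
    """Counting Bloom filter sketch: each bit becomes a counter so that
    items can be removed as the set changes over time."""

    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _hashes(self, item):
        for j in range(self.k):
            digest = hashlib.sha1(f"{j}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._hashes(item):
            self.counters[pos] += 1

    def remove(self, item):
        # Decrement only if the item appears to be present, to avoid
        # driving counters negative for items that were never added.
        if self.query(item):
            for pos in self._hashes(item):
                self.counters[pos] -= 1

    def query(self, item):
        return all(self.counters[pos] > 0 for pos in self._hashes(item))
```

Adding the same item twice and removing it once leaves it queryable; a second removal clears it, which a plain bit array cannot express.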
With the booming development of network services, queries based on multiple
attributes of an item become more attractive. However, not much work has been
done in this area. Previous work mainly focused on the representation of a set
of items with a single attribute, and such structures cannot represent items
with multiple attributes accurately. Because one item has multiple attributes,
the inherent dependency among the attributes is lost if we store the attributes
in different places by computing hash functions independently. A simple
expansion of the standard Bloom filter has no functional unit to record this
multi-attribute dependency, so query operations would often return wrong
answers. The loss of dependency information among the multiple attributes of an
item greatly increases the false positive probability. Thus, we need to develop
a new structure for the representation of items with multiple attributes.
In this paper, we make the following main contributions. First, we propose
a new Bloom filter structure that can support the representation of items with
multiple attributes while keeping the false positive probability of membership
queries at a very low level. The new structure is composed of Parallel Bloom
Filters (PBF) and a hash table to support the accurate and efficient
representation and query of items. The PBF is a counter-based matrix and
consists of multiple submatrixes. Each submatrix can store one attribute of an
item. The hash table captures a verification value of an item, which can
reflect the inherent dependency of all attributes for one item. We generate the
verification values by an attenuated method, which tremendously reduces the
item collision probability. Second, we present a two-step verification process
to justify the presence of a particular item. Because the multiple attributes
of an item make a correct query complicated, verification in the PBF alone is
insufficient to distinguish the attributes of one item from those of another.
Verification in the hash table complements the process and leads to accurate
query results. Third, the new data structure in the PBF uses a counter in each
entry so that it can support comprehensive data operations of adding, querying
and removing items, and these operations retain O(1) computational complexity
in the novel structure. We also study the false positive probability and
algebra operations through mathematical analysis and experiments. Finally, we
show through theoretical analysis and simulations that the new Bloom filter
structure and the proposed algorithms for data operations are efficient and
accurate in representing an item with multiple attributes while yielding a
sufficiently small false positive probability.
The rest of the paper is organized as follows. Section 2 introduces the related
work. Section 3 presents the new Bloom filter structure, which is composed of the
PBF and hash table. Section 4 illustrates the operations of adding, querying and
removing items. In Section 5, we present the corresponding algebra operations.
Section 6 provides the performance evaluation and Section 7 concludes our paper.
2 Related Work

A Bloom filter can be used to support membership queries because of its simple
space-efficient data structure to represent a set, and Bloom filters have been
broadly applied to network-related applications. Bloom filters are used to find
heavy flows for the stochastic fair blue queue management scheme and to
summarize contents to help global collaboration. Bloom filters provide a useful
tool to assist network routing, such as route lookup, packet classification,
per-flow state management and longest prefix matching.
There is a great deal of room to develop variants or extensions of Bloom
filters for specific applications. When space is an issue, a Bloom filter can be
an excellent alternative to keeping an explicit list. One such variant is the
exponentially decaying Bloom filter (EDBF), which encodes probabilistic routing
tables in a highly compressed manner and allows for efficient aggregation and
propagation.
In addition, network applications emphasize a strong need to engineer hash-based
data structures that can achieve faster lookup speeds with better worst-case
performance in practice. From the engineering perspective, one approach extends
the multiple-hashing Bloom filter with a small amount of multi-port on-chip
memory, which can support better throughput for router applications based on
hash tables.
Due to the essential role in network services, the structure expansion of Bloom
filters is a well-researched topic. While some approaches exist in the literature,
most work emphasizes improvements to the Bloom filters themselves. One
proposal, the multi-dimension dynamic Bloom filters (MDDBF), supports
representation and membership queries based on the multi-attribute dimension.
Its basic idea is to represent a dynamic set A with a dynamic s × m bit matrix
that consists of s standard Bloom filters. However, the MDDBF
lacks a verification process of the inherent dependency of multiple attributes of
an item, which may increase the false positive probability.
3 The Proposed Structure

In this section, we introduce a novel structure, composed of the PBF and a hash
table, to represent items with p attributes. The hash table stores the
verification values of items, and we provide an improved method for generating
these verification values.
3.1 Proposed Structure
Figure 1 shows the proposed structure based on the counting Bloom filters. The
whole structure includes two parts: PBF and a hash table. PBF and the hash
table are used to store multiple attributes and the verification values of items,
respectively. PBF uses counting Bloom filters to support the deletion
Fig. 1. The proposed structure based on counting Bloom filters.
operation and can be viewed as a matrix, which consists of p parallel submatrixes
in order to represent p attributes. A submatrix is composed of q parallel arrays
and can be used to represent one attribute. An array consists of m counters and
is related to one hash function; the q parallel arrays correspond to q hash
functions. Assume that a_i is the ith attribute of item a. We use H[i][j](a_i)
(1 ≤ i ≤ p, 1 ≤ j ≤ q) to denote the hash value computed by the jth hash
function for the ith attribute of item a. Thus, each submatrix has q × m
counters, and the PBF, composed of p submatrixes, uses p × q × m counters to
store items with p attributes.
The hash table contains the verification values, which can be used to verify
the inherent dependency among different attributes from one item. We measure
the verification values as a function of the hash values. Let
v_i = F(H[i][j](a_i)) be the verification value of the ith attribute of item a.
The verification value of item a can be computed as V_a = Σ_{i=1}^{p} v_i,
which can be inserted into the hash table for future dependency tests.
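As a concrete illustration, the structure above might be sketched as follows in Python; the SHA-1-based hash family, the sum-based F, and all names are illustrative assumptions rather than the authors' implementation:

```python
import hashlib

class ParallelBloomFilters:
    """Sketch of the PBF: p submatrixes, each with q arrays of m counters,
    plus a hash table of verification values (all names illustrative)."""

    def __init__(self, p=3, q=4, m=256):
        self.p, self.q, self.m = p, q, m
        # counters[i][j][pos]: submatrix i (attribute i), hash function j.
        self.counters = [[[0] * m for _ in range(q)] for _ in range(p)]
        self.table = {}  # verification value -> number of matching items

    def _h(self, i, j, attr):
        # H[i][j](a_i): the jth hash of the ith attribute (assumed family).
        digest = hashlib.sha1(f"{i}:{j}:{attr}".encode()).hexdigest()
        return int(digest, 16) % self.m

    def add(self, attrs):
        assert len(attrs) == self.p
        v_total = 0  # V_a, accumulated over all p attributes
        for i, a in enumerate(attrs):
            for j in range(self.q):
                pos = self._h(i, j, a)
                self.counters[i][j][pos] += 1
                v_total += pos  # basic sum-based verification value
        self.table[v_total] = self.table.get(v_total, 0) + 1
```

One add touches exactly p × q counters, one per hash array, and records a single verification value in the hash table.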
3.2 Role of Hash Table
The fundamental role of the hash table is to verify the inherent dependency
of all attributes for an item and avoid the query collision. The main reason
for the query collision in terms of multiple attributes is that the dependency
among multiple attributes is lost after we insert p attributes into p independent
submatrixes, respectively. Then, the PBF only knows the existence of attributes
and cannot determine whether those attributes belong to one item. Meanwhile,
the verification based on PBF itself is not enough to distinguish attributes from
one item to another. Therefore, the hash table can be used to confirm whether
the queried multiple attributes belong to one item.
Thus, if a query receives the answer True, a two-step verification process must
be conducted. First, we need to check whether queried attributes exist in PBF.
Second, we need to verify whether the multiple attributes belong to a single item
based on the verification value in the hash table.
Traditionally, the hash values computed by hash functions are only used to
update the location counters in the counting Bloom filters. In the proposed
structure, we utilize the hash values to generate the verification values, which
can stand for existing items.
The basic method of generating the verification value is to add all the hash
values and store their sum in the hash table. For example, the value of
variable v_i is v_i = F(H[i][j](a_i)) = Σ_{j=1}^{q} H[i][j](a_i) for the ith
attribute of item a. In this case, the function F is a sum operation. Then, the
verification value of item a is
V_a = Σ_{i=1}^{p} v_i = Σ_{i=1}^{p} Σ_{j=1}^{q} H[i][j](a_i). Thus, V_a can be
inserted into the hash table and stands for an existing item a. However, in the
basic method, the values computed by different hash functions may be the same
and their sums might be the same, too. Thus, different items might hold the
same verification value in the hash table, which leads to verification
collisions.

The improved method utilizes the sequential information of the hash functions
to distinguish the verification values of different items. We allocate
different weights to the sequential hash functions in order to reflect the
difference among them. As for the ith attribute of item a, the value from the
jth hash function in the ith submatrix is defined as H[i][j](a_i)/2^{j-1},
which is similar to the idea of the attenuated Bloom filters. In attenuated
Bloom filters, higher filter levels are attenuated with respect to earlier
filter levels, forming a lossy distributed index. Therefore, for item a, the
verification value of the ith attribute is defined as
v_i = F(H[i][j](a_i)) = Σ_{j=1}^{q} H[i][j](a_i)/2^{j-1}. The verification
value V_a = Σ_{i=1}^{p} v_i of item a can then be inserted into the hash table.
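The difference between the two methods can be demonstrated in a few lines, assuming an attenuation factor of 1/2 per hash-function level in the spirit of attenuated Bloom filters (the exact weights are an assumption for illustration):

```python
def basic_value(hash_values):
    # Basic method: F is a plain sum, so permuting the hash values
    # (e.g. swapping values between hash functions or items) keeps the
    # sum unchanged, which causes verification collisions.
    return sum(hash_values)

def attenuated_value(hash_values):
    # Improved method: the jth hash value is weighted by 1/2^(j-1)
    # (j starting at 1), so the order of the hash functions matters.
    return sum(h / 2 ** j for j, h in enumerate(hash_values))
```

With hash values (3, 5) versus (5, 3), the basic sums are both 8, while the attenuated values are 3 + 5/2 = 5.5 and 5 + 3/2 = 6.5, so the two orderings are no longer confused.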
4 Operations on Data Structure
Given a certain item a, it has p attributes and each attribute can be represented
using q hash functions as shown in Figure 1. We denote its verification value by
V_a, which is initialized to zero. Meanwhile, we can implement the corresponding
operations, such as adding, querying and removing items, with a complexity of
O(1) in the parallel Bloom filters and the hash table.
Figure 2 presents the algorithm of adding items in the proposed structure. We
need to compute the hash values of multiple attributes by hash functions and