A Multi-attribute Data Structure with Parallel
Bloom Filters for Network Services⋆
Yu Hua^{1,2} and Bin Xiao^{1}

^{1} Department of Computing, Hong Kong Polytechnic University, Kowloon, Hong Kong
^{2} School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China
Abstract. A Bloom filter has been widely utilized to represent a set of
items because it is a simple space-efficient randomized data structure. In
this paper, we propose a new structure to support the representation of
items with multiple attributes based on Bloom filters. The structure is
composed of Parallel Bloom Filters (PBF) and a hash table to support
the accurate and efficient representation and query of items. The PBF is
a counter-based matrix and consists of multiple submatrixes. Each sub-
matrix can store one attribute of an item. The hash table as an auxiliary
structure captures a verification value of an item, which can reflect the
inherent dependency of all attributes for the item. Because the correct
query of an item with multiple attributes becomes complicated, we use a
two-step verification process to ensure the presence of a particular item
to reduce false positive probability.
1 Introduction

A standard Bloom filter can represent a set of items as a bit array using several independent hash functions and supports the query of items. Using a Bloom
filter to represent a set, one can query whether an item is a member of the set
according to the Bloom filter, instead of the set. This compact representation is
the tradeoff for allowing a small probability of false positive in the membership
query. However, the space savings often outweigh this drawback when the false
positive probability is rather low. Bloom filters can be widely used in practice
when space resource is at a premium.
From the standard Bloom filter, many other forms of Bloom filters have been proposed for various purposes, such as counting Bloom filters, compressed Bloom filters, hierarchical Bloom filters, space-code Bloom filters and spectral Bloom filters. Counting Bloom filters replace an array of bits with counters in order to count the number of items hashed to each location. It is very useful to apply counting Bloom filters to support the deletion operation and handle a set that is changing over time.

⋆ This work is partially supported by HK RGC CERG B-Q827 and POLYU A-PA2F, and by the National Basic Research 973 Program of China under Grant
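The counting mechanism can be sketched as follows; this is a minimal illustration only (the parameter choices and the salted-SHA-1 hashing are assumptions, not part of the constructions cited above):

```python
# A minimal counting Bloom filter sketch: each slot holds a counter
# instead of a bit, so previously added items can also be removed.
import hashlib

class CountingBloomFilter:
    def __init__(self, m=1024, k=4):
        self.m = m                    # number of counters
        self.k = k                    # number of hash functions
        self.counters = [0] * m

    def _positions(self, item):
        # Derive k positions from salted SHA-1 digests (illustrative choice).
        for j in range(self.k):
            h = hashlib.sha1(f"{j}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.counters[pos] += 1

    def remove(self, item):
        for pos in self._positions(item):
            if self.counters[pos] > 0:
                self.counters[pos] -= 1

    def query(self, item):
        # True may be a false positive; False is always correct.
        return all(self.counters[pos] > 0 for pos in self._positions(item))

cbf = CountingBloomFilter()
cbf.add("flow-42")
print(cbf.query("flow-42"))   # True
cbf.remove("flow-42")
print(cbf.query("flow-42"))   # False: deletion restored the counters
```

A plain bit-array Bloom filter cannot support `remove` safely, because clearing a shared bit would also delete other items; the counters avoid this.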
With the booming development of network services, queries based on multiple attributes of an item become more attractive. However, not much work has been done in this area. Previous work mainly focused on the representation of a set of items with a single attribute and cannot be used to represent items with multiple attributes accurately. Because one item has multiple attributes, the inherent dependency among the attributes is lost if we only store the attributes in different places by computing hash functions independently. A simple data-structure expansion of the standard Bloom filter has no functional unit to record the dependency among multiple attributes, so query operations can often receive wrong answers. The loss of dependency information among the multiple attributes of an item greatly increases the false positive probability. Thus, we need to develop a new structure for the representation of items with multiple attributes.
In this paper, we make the following main contributions. First, we propose a new Bloom filter structure that can support the representation of items with multiple attributes and keep the false positive probability of membership queries at a very low level. The new structure is composed of Parallel Bloom Filters (PBF) and a hash table to support the accurate and efficient representation and query of items. The PBF is a counter-based matrix and consists of multiple submatrixes. Each submatrix can store one attribute of an item. The hash table captures a verification value of an item, which can reflect the inherent dependency of all attributes for one item. We generate the verification values by an attenuated method, which greatly reduces the item collision probability. Second, we present a two-step verification process to justify the presence of a particular item. Because the multiple attributes of an item make correct queries complicated, verification in the PBF alone is insufficient to distinguish the attributes of one item from those of another. Verification in the hash table complements the verification process and leads to accurate query results. Third, the new data structure in the PBF uses a counter in each entry so that it can support comprehensive data operations of adding, querying and removing items, and these operations retain O(1) computational complexity in the novel structure. We also study the false positive probability and algebra operations through mathematical analysis and experiments. Finally, we show through theoretical analysis and simulations that the new Bloom filter structure and the proposed algorithms for data operations are efficient and accurate in realizing the representation of an item with multiple attributes while yielding a sufficiently small false positive probability.
The rest of the paper is organized as follows. Section 2 introduces the related
work. Section 3 presents the new Bloom filter structure, which is composed of the
PBF and hash table. Section 4 illustrates the operations of adding, querying and
removing items. In Section 5, we present the corresponding algebra operations.
Section 6 provides the performance evaluation and Section 7 concludes our paper.
2 Related Work

A Bloom filter can be used to support membership queries because of its simple space-efficient data structure for representing a set, and Bloom filters have been broadly applied to network-related applications. Bloom filters are used to find heavy flows for the stochastic fair blue queue management scheme and to summarize contents to help global collaboration. Bloom filters also provide a useful tool to assist network routing, such as route lookup, packet classification, per-flow state management and longest prefix matching.
There is a great deal of room to develop variants or extensions of Bloom filters for specific applications. When space is an issue, a Bloom filter can be an excellent alternative to keeping an explicit list. One such design is the exponentially decaying Bloom filter (EDBF), a data structure that encodes probabilistic routing tables in a highly compressed manner and allows for efficient aggregation and propagation.
In addition, network applications emphasize a strong need to engineer hash-based data structures that can achieve faster lookup speeds with better worst-case performance in practice. From the engineering perspective, the multiple-hashing Bloom filter has been extended with a small amount of multi-port on-chip memory, which can support better throughput for router applications based on hash tables.
Due to their essential role in network services, the structural expansion of Bloom filters is a well-researched topic. While some approaches exist in the literature, most work emphasizes improvements to the Bloom filters themselves. The multi-dimension dynamic Bloom filters (MDDBF) were suggested to support representation and membership queries based on the multi-attribute dimension. Their basic idea is to represent a dynamic set A with a dynamic s × m bit matrix that consists of s standard Bloom filters. However, the MDDBF lacks a verification process for the inherent dependency among the multiple attributes of an item, which may increase the false positive probability.
3 The Proposed Structure

In this section, we introduce a novel structure, composed of the PBF and a hash table, to represent items with p attributes. The hash table stores the verification values of items, and we provide an improved method for generating these verification values.
3.1 Proposed Structure
Figure 1 shows the proposed structure based on counting Bloom filters. The whole structure includes two parts: the PBF and a hash table, which store the multiple attributes and the verification values of items, respectively. The PBF uses counting Bloom filters to support the deletion
Fig. 1. The proposed structure based on counting Bloom filters.
operation and can be viewed as a matrix, which consists of p parallel submatrixes
in order to represent p attributes. A submatrix is composed of q parallel arrays
and can be used to represent one attribute. An array consists of m counters and
is related to one hash function; the q parallel arrays correspond to the q hash functions. Assume that a_i is the ith attribute of item a. We use H[i][j](a_i) (1 ≤ i ≤ p, 1 ≤ j ≤ q) to denote the hash value computed by the jth hash function for the ith attribute of item a. Thus, each submatrix has q × m counters, and the PBF, composed of p submatrixes, utilizes p × q × m counters to store items with p attributes.
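Under assumed small parameters p, q and m, the PBF layout and one insertion can be sketched as follows (the SHA-256-based stand-in for H[i][j] and the example attribute values are illustrative assumptions, not the paper's hash functions):

```python
# Sketch of the PBF layout: one q x m counter submatrix per attribute,
# one counter array per hash function within each submatrix.
import hashlib

def h(i, j, attr, m):
    # Illustrative stand-in for H[i][j](a_i): the j-th hash of attribute i.
    d = hashlib.sha256(f"{i}:{j}:{attr}".encode()).hexdigest()
    return int(d, 16) % m

p, q, m = 3, 4, 64                        # attributes, hash functions, counters
pbf = [[[0] * m for _ in range(q)] for _ in range(p)]   # p x q x m counters

item = ("10.0.0.1", "tcp", "8080")        # an item with p = 3 attributes
for i, attr in enumerate(item):
    for j in range(q):
        pbf[i][j][h(i, j, attr, m)] += 1  # submatrix i, array j

# Each attribute is present iff all q of its counters are nonzero.
present = all(pbf[i][j][h(i, j, attr, m)] > 0
              for i, attr in enumerate(item) for j in range(q))
print(present)   # True
```

Note that inserting one item increments exactly p × q counters, one per (submatrix, array) pair, which is what makes the later deletion operation a matter of decrementing the same counters.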
The hash table contains the verification values, which can be used to verify
the inherent dependency among different attributes from one item. We measure
the verification values as a function of the hash values. Let v_i = F(H[i][j](a_i)) be the verification value of the ith attribute of item a. The verification value of item a can be computed as V_a = Σ_{i=1}^{p} v_i, which can be inserted into the hash table for future dependency tests.
3.2 Role of Hash Table
The fundamental role of the hash table is to verify the inherent dependency of all attributes for an item and to avoid query collisions. The main reason for query collisions with multiple attributes is that the dependency among the attributes is lost after we insert the p attributes into p independent submatrixes, respectively. Then, the PBF only knows the existence of attributes
and cannot determine whether those attributes belong to one item. Meanwhile,
the verification based on PBF itself is not enough to distinguish attributes from
one item to another. Therefore, the hash table can be used to confirm whether
the queried multiple attributes belong to one item.
Thus, if a query receives answer True, the two-step verification process must
be conducted. First, we need to check whether queried attributes exist in PBF.
Second, we need to verify whether the multiple attributes belong to a single item
based on the verification value in the hash table.
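The two-step process can be sketched as follows, using for illustration the basic sum-based verification function described later in this section (the hash choice, parameters and example attribute values are assumptions):

```python
# Sketch of the two-step query: step 1 checks the PBF counters,
# step 2 checks the verification value in the hash table.
import hashlib

p, q, m = 2, 3, 64

def h(i, j, attr):
    return int(hashlib.sha256(f"{i}:{j}:{attr}".encode()).hexdigest(), 16) % m

pbf = [[[0] * m for _ in range(q)] for _ in range(p)]
verification_table = set()

def add(item):
    v = 0
    for i, attr in enumerate(item):
        for j in range(q):
            pbf[i][j][h(i, j, attr)] += 1
            v += h(i, j, attr)
    verification_table.add(v)            # V_a = sum of all hash values

def query(item):
    # Step 1: every attribute must be found in its own submatrix.
    if not all(pbf[i][j][h(i, j, attr)] > 0
               for i, attr in enumerate(item) for j in range(q)):
        return False
    # Step 2: the attributes must belong to ONE item (dependency check).
    v = sum(h(i, j, attr) for i, attr in enumerate(item) for j in range(q))
    return v in verification_table

add(("alice", "admin"))
add(("bob", "guest"))
print(query(("alice", "admin")))   # True
print(query(("alice", "guest")))   # both attributes exist, but the
                                   # dependency check usually rejects
```

The second query is exactly the case the hash table exists for: each attribute passes step 1 because it was stored for some item, yet the pair was never inserted as one item, so its verification value is normally absent from the table.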
Traditionally, the hash values computed by the hash functions are only used to update the location counters in the counting Bloom filters. In the proposed structure, we also utilize the hash values to generate the verification values, which stand for existing items.

The basic method of generating the verification value is to add all the hash values and store their sum in the hash table. For example, the value of variable v_i is v_i = F(H[i][j](a_i)) = Σ_{j=1}^{q} H[i][j](a_i) for the ith attribute of item a. In this case, the function F is a sum operation. Then, the verification value of item a is V_a = Σ_{i=1}^{p} v_i = Σ_{i=1}^{p} Σ_{j=1}^{q} H[i][j](a_i). Thus, V_a can be inserted into the hash table and stands for an existing item a. However, in the basic method, the values computed by different hash functions may be identical and their sums might be identical, too. Thus, different items might hold the same verification value in the hash table, which leads to verification collisions.

The improved method utilizes the sequential information of the hash functions to distinguish the verification values of different items. We allocate different weights to sequential hash functions in order to reflect the difference among them. As for the ith attribute of item a, the value from the jth hash function in the ith submatrix is defined as H[i][j](a_i)/2^(j-1), which is similar to the idea of the Attenuated Bloom Filters. In attenuated Bloom filters, higher filter levels are attenuated with respect to earlier filter levels, and the structure serves as a lossy distributed index. Therefore, as for item a, the verification value of the ith attribute is defined as v_i = F(H[i][j](a_i)) = Σ_{j=1}^{q} H[i][j](a_i)/2^(j-1). This verification value of item a can be inserted into the hash table.
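The difference between the basic sum and the weighted (attenuated) verification values can be seen with hypothetical hash values (the concrete numbers and the 1/2^(j-1) weights here are illustrative assumptions):

```python
# Why position-dependent weights reduce verification collisions:
# two attributes whose q = 3 hash values are the same multiset in a
# different order collide under a plain sum, but not under weighting.
hashes_x = [5, 9, 2]     # H[i][1..3] for one attribute value
hashes_y = [2, 5, 9]     # same values, different order -> same sum

def basic(hs):
    return sum(hs)                                  # basic F: plain sum

def weighted(hs):
    # Attenuated weighting: the j-th hash value is divided by 2^(j-1).
    return sum(h / 2 ** j for j, h in enumerate(hs))

print(basic(hashes_x) == basic(hashes_y))        # True  -> collision
print(weighted(hashes_x) == weighted(hashes_y))  # False -> distinguished
```

Here weighted(hashes_x) = 5 + 4.5 + 0.5 = 10.0 while weighted(hashes_y) = 2 + 2.5 + 2.25 = 6.75, so the two attribute values no longer share a verification value.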
4 Operations on Data Structure
Given a certain item a, it has p attributes and each attribute can be represented
using q hash functions, as shown in Figure 1. We denote its verification value by V_a, which is initialized to zero. Meanwhile, we can implement the corresponding
operations, such as adding, querying and removing items, with a complexity of
O(1) in the parallel Bloom filters and the hash table.
Figure 2 presents the algorithm of adding items in the proposed structure. We
need to compute the hash values of multiple attributes by hash functions and