Approximating Aggregation Queries in Peer-to-Peer Networks
Benjamin Arai
UC Riverside
barai@cs.ucr.edu
Gautam Das
UT Arlington
gdas@cse.uta.edu
Dimitrios Gunopulos
UC Riverside
dg@cs.ucr.edu
Vana Kalogeraki
UC Riverside
vana@cs.ucr.edu
Abstract
Peer-to-peer databases are becoming prevalent on the
Internet for distribution and sharing of documents,
applications, and other digital media. The problem of
answering large scale, ad-hoc analysis queries – e.g.,
aggregation queries – on these databases poses unique
challenges. Exact solutions can be time consuming and
difficult to implement given the distributed and dynamic
nature of peer-to-peer databases. In this paper we
present novel sampling-based techniques for approximate
answering of ad-hoc aggregation queries in such
databases. Computing a high-quality random sample of
the database efficiently in the P2P environment is
complicated due to several factors – the data is
distributed (usually in uneven quantities) across many
peers, within each peer the data is often highly
correlated, and moreover, even collecting a random
sample of the peers is difficult to accomplish. To counter
these problems, we have developed an adaptive two-phase
sampling approach, based on random walks of the P2P
graph as well as block-level sampling techniques. We
present extensive experimental evaluations to
demonstrate the feasibility of our proposed solution.
1. Introduction
Peer-to-Peer Databases: The peer-to-peer network
model is quickly becoming the preferred medium for file
sharing and distributing data over the Internet. A peer-to-
peer (P2P) network consists of numerous peer nodes that
share data and resources with other peers on an equal
basis. Unlike traditional client-server models, no central
coordination exists in a P2P system, thus there is no
central point of failure. P2P networks are scalable, fault
tolerant, and dynamic, and nodes can join and depart the
network with ease. The most compelling applications on
P2P systems to date have been file sharing and retrieval.
For example, P2P systems such as Napster [25], Gnutella
[15], KaZaA [20] and Freenet [13] are principally known
for their file sharing capabilities, e.g., the sharing of
songs, music, and so on. Furthermore, researchers have
been interested in extending sophisticated IR techniques
such as keyword search and relevance retrieval to P2P
databases.
Aggregation Queries: In this paper, however, we
consider a problem on P2P systems that is different from
the typical search and retrieval applications. As P2P
systems mature beyond file sharing applications and start
getting deployed in increasingly sophisticated e-business
and scientific environments, the vast amount of data
within P2P databases poses a different challenge that has
not been adequately researched thus far – that of how to
answer aggregation queries on such databases.
Aggregation queries have the potential of finding
applications in decision support, data analysis and data
mining. For example, millions of peers across the world
may be cooperating on a grand experiment in astronomy,
and astronomers may be interested in asking decision
support queries that require the aggregation of vast
amounts of data covering thousands of peers.
We make the problem more precise as follows.
Consider a single table T that is distributed over a P2P
system; i.e., the peers store horizontal partitions (of
varying sizes) of this table. An aggregation query such as
the following may be introduced at any peer (this peer is
henceforth called the sink):
Aggregation Query
SELECT Agg-Op(Col) FROM T
WHERE selection-condition
In the above query, the Agg-Op may be any aggregation
operator such as SUM, COUNT, AVG, and so on; Col
may be any numeric measure column of T, or even an
expression involving multiple columns; and the selection-
condition decides which tuples should be involved in the
aggregation. While our main focus is on the above
standard SQL aggregation operators, we also briefly
discuss other interesting statistical estimators such as
medians, quantiles, histograms, and distinct values.
While aggregation queries have been heavily
investigated in traditional databases, it is not clear that
these techniques will easily adapt to the P2P domain. For
example, decision support techniques such as OLAP
commonly employ materialized views; however, the
distribution and management of such views appears
difficult in such a dynamic and decentralized domain [19,
11]. In contrast, the alternative of answering aggregation
queries at runtime “from scratch” by crawling and
scanning the entire P2P repository is prohibitively slow.
Approximate Query Processing: Fortunately, it has
been observed that in most typical data analysis and data
mining applications, timeliness and interactivity are more
important considerations than accuracy - thus data
analysts are often willing to overlook small inaccuracies
in the answer provided the answer can be obtained fast
enough. This observation has been the primary driving
force behind recent development of approximate query
processing (AQP) techniques for aggregation queries in
traditional databases and decision support systems [9, 3,
6, 8, 1, 14, 5, 7, 23]. Numerous AQP techniques have
been developed, the most popular ones based on random
sampling, where a small random sample of the rows of
the database is drawn, the query is executed on this small
sample, and the results extrapolated to the whole
database. Besides being simple to implement,
random sampling has the compelling advantage that,
along with an estimate of the aggregate, one can also
provide confidence intervals for the error that hold with
high probability. Broadly, two types of sampling-based
approaches have been investigated: (a) Pre-computed
samples - where a random sample is pre-computed by
scanning the database, and the same sample is reused for
several queries, and (b) Online samples - where the
sample is drawn “on the fly” upon encountering a query.
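To make the sampling idea concrete, the following is a minimal sketch of how a SUM can be estimated from a uniform random sample in a traditional, centralized setting, together with an approximate confidence interval. It is only an illustration: the function name estimate_sum, the made-up table, and the normal-approximation interval (with no finite-population correction) are assumptions of this sketch and not part of the techniques developed in this paper.

```python
import math
import random

def estimate_sum(sample, population_size, z=1.96):
    """Scale a uniform random sample up to a SUM estimate.

    Returns (estimate, half_width), where half_width is an approximate
    95% confidence half-width under a normal approximation.
    """
    n = len(sample)
    mean = sum(sample) / n
    # Unbiased sample variance (requires n >= 2).
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    estimate = population_size * mean
    # Ignores the finite-population correction for simplicity.
    half_width = z * population_size * math.sqrt(var / n)
    return estimate, half_width

rng = random.Random(0)
table = [rng.randint(1, 100) for _ in range(100_000)]  # made-up measure column
sample = rng.sample(table, 1_000)                      # 1% uniform sample
est, hw = estimate_sum(sample, len(table))
print(f"estimate = {est:.0f} +/- {hw:.0f}, exact = {sum(table)}")
```

The interval narrows roughly as the square root of the sample size, which is why a small sample is often enough when small inaccuracies are acceptable.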
Goal of Paper: In this paper, we also approach the
challenges of decision support and data analysis on P2P
databases in the same manner, i.e., we investigate what it
takes to enable AQP techniques on such distributed
databases.
The cost of query execution in traditional databases is
usually a straightforward concept – it is either I/O cost or
CPU cost, or a combination of the two. In fact, most AQP
approaches simplify this concept even further, by just
trying to minimize the number of tuples in the sample;
thus making the assumption that the sample size is
directly related to the cost of query execution. However,
in P2P networks, the cost of query execution is a
combination of several quantities, e.g., the number of
participating peers, the bandwidth consumed (i.e.,
amount of data shipped over the network), the number of
messages exchanged, the latency (the end-to-end time to
propagate the query across multiple peers and receive
replies), the I/O cost of accessing data from participating
peers, the CPU cost of processing data at participating
peers, and so on. In this paper, we shall be concerned with
several of these cost metrics.
Challenges: Let us now discuss what it takes for
sampling-based AQP techniques to be incorporated into
P2P systems. We first observe that two main approaches
have emerged for constructing P2P networks today,
structured and unstructured. Structured P2P networks
(such as Pastry [27] and Chord [30]) are organized in such
a way that data items are located at specific nodes in the
network and nodes maintain some state information, to
enable efficient retrieval of the data. This organization
sacrifices atomicity by mapping data items to particular
nodes and assumes that all nodes are equal in terms of
resources, which can lead to bottlenecks and hot-spots.
Our work focuses on unstructured P2P networks, which
make no assumption about the location of the data items
on the nodes, and nodes are able to join the system at
random times and depart without a priori notification.
Several recent efforts have demonstrated that unstructured
P2P networks can be used efficiently for multicast,
distributed object location and information retrieval [10,
24, 31].
For approximate query processing in unstructured
P2P systems, attempting to adapt the approach of pre-
computed samples is impractical for several reasons: (a) it
involves scanning the entire P2P repository, which is
difficult, (b) since no centralized storage exists, it is not
clear where the pre-computed sample should reside, and (c)
the very dynamic nature of P2P systems indicates that
pre-computed samples will quickly become stale unless
they are frequently refreshed.
Thus, the approach taken in this paper is to
investigate the feasibility of online sampling techniques
for AQP on P2P databases. However, online sampling
approaches in P2P databases pose their own set of
challenges. To illustrate these challenges, consider the
problem of attempting to draw a uniform random sample
of n tuples from such a P2P database containing a total of
N tuples. To ensure a true uniform random sample, our
sampling procedure should be such that each subset of n
tuples out of N should be equally likely to be drawn.
However, this is an extremely challenging problem due to
the following two reasons.
• Picking even a set of uniform random peers is a
difficult problem, as the sink does not have the IP
addresses of all peers in the network. This is a
well-known problem that other researchers have tackled
(in different contexts) using random walk techniques
on the P2P graph [14, 21, 4] – i.e., where a
Markovian random walk is initiated from the sink
that picks adjacent peers to visit with equal
probability, and under certain connectivity properties,
the random walk is expected to rapidly reach a
stationary distribution. If the graph is badly clustered
with small cuts, this slows the speed at which the
walk converges. Moreover, even after convergence,
the stationary distribution is not uniform; in fact, it is
skewed towards giving higher probabilities to nodes
with larger degrees in the P2P graph.
• Even if we could select a peer (or a set of peers)
uniformly at random, it does not make the problem of
selecting a uniform random set of tuples much easier.
This is because visiting a peer at random has an
associated overhead, thus it makes sense to select
multiple tuples at random from this peer during the
same visit. However, this may compromise the
quality of the final set of tuples retrieved, as the
tuples within the same peer are likely to be correlated
– e.g., if the P2P database contained listings of, say
movies, the movies stored on a specific peer are
likely to be of the same genre. This correlation can be
reduced if we select just one tuple at random from a
randomly selected peer; however the overheads
associated with such a scheme will be intolerable.
Goal of Paper: Approximating Aggregation
Queries in P2P Networks
Given an aggregation query and a desired error
bound at a sink peer, compute with “minimum cost”
an approximate answer to this query that satisfies the
error bound.
Our Approach: We briefly describe the framework of
our approach. Essentially, we abandon trying to pick true
uniform random samples of the tuples, as such samples
are likely to be extremely impractical to obtain. Instead,
we consider an approach where we are willing to work
with skewed samples, provided we can accurately
estimate the skew during the sampling process. To get the
accuracy in the query answer desired by the user, our
skewed samples can be larger than the size of a
corresponding uniform random sample that delivers the
same accuracy, however, our samples are much more cost
efficient to generate.
Although we do not advocate any significant pre-
processing, we assume that certain aspects of the P2P
graph are known to all peers, such as the average degree
of the nodes, a good estimate of the number of peers in
the system, certain topological characteristics of the graph
structure, and so on. Estimating these parameters via pre-
processing are interesting problems in their own right,
however we omit these details from this paper. The main
point we make is that these parameters are relatively slow
to change and thus do not have to be estimated at query
time – it is the data contents of peers that changes more
rapidly, hence the random sampling process that picks a
representative sample of tuples has to be done at runtime.
Our approach has two major phases. In the first
phase, we initiate a fixed-length random walk from the
sink. This random walk should be long enough to ensure
that the visited peers¹ represent a close sample from the
underlying stationary distribution – the appropriate length
of such a walk is determined in a pre-processing step. We
then retrieve certain information from the visited peers,
such as the number of tuples, the aggregate of tuples (e.g.,
SUM/COUNT/AVG, etc.) that satisfy the selection
condition, and send this information back to the sink. This
information is then analyzed at the sink to determine the
skewed nature of the data that is distributed across the
network - such as the variance of the aggregates of the
data at peers, the amount of correlation between tuples
that exists within the same peers, the variance in the
degrees of individual nodes in the P2P graph (recall that
the degree has a bearing on the probability that a node
will be sampled by the random walk), and so on. Once
this data has been analyzed at the sink, an estimate is
made of how many more samples are required – and in
what way these samples should be collected – so that the
original query can be optimally answered within the
desired accuracy with high probability. For example, the
first phase may recommend that the best way to answer
this query is to visit m’ more peers, and from each peer,
randomly sample t tuples. We mention that the first phase
is not overly driven by heuristics – instead it is based on
strong underlying theoretical principles, such as theory of
random walks [14, 21, 4], as well as statistical techniques
such as cluster sampling, block-level sampling and cross-
validation [9, 16].
The second phase is then straightforward – a random
walk is reinitiated and tuples collected according to the
recommendations made by the first phase. Effectively, the
first phase is used to “sniff” the network and determine an
optimal-cost “query plan”, which is then implemented in
the second phase. For certain aggregates, such as COUNT
and SUM, further optimizations may be achieved by
pushing the selections and aggregations to the peers – i.e.,
the local aggregates instead of raw samples are returned
to the sink, which are then composed into a final answer.
Summary of Contributions:
• We introduce the important problem of approximate
query processing in P2P databases that is likely to be
of increasing significance in the future.
• The problem is analyzed in detail, and its unique
challenges are comprehensively discussed.
• Adaptive, two-phase sampling-based approaches are
proposed, based on well-founded theoretical
principles.
• The results of extensive experiments are presented
that demonstrate the importance of the problem and
the validity of our approaches.
¹ Actually, only a small fraction of the visited peers are selected
for consideration, and the rest are “jumped over” – this is
determined by the jump size parameter that is discussed in later
sections.
The rest of this paper is organized as follows. In Section
2 we describe related work. We provide the foundation of
our approach in Section 3, and the algorithm in Section 4.
In Section 5 we present the results of experiments, and
conclude in Section 6.
2. Related Work
Peer-to-Peer (P2P) systems are becoming very popular
because they provide an efficient mechanism for building
large, scalable systems [24]. Most recent work has
focused on Distributed Hash Tables (DHTs) [26, 27, 30].
Such techniques provide scalability advantages over
unstructured systems (such as Gnutella); however, they are
not flexible enough for some applications, especially
when nodes join or leave the network frequently or
change their connections often.
Recent work has proposed different techniques for
exact query processing in P2P systems. Most proposals
use structured overlay networks (DHTs), such as CAN,
Pastry, or Chord. Such techniques include PIER [17],
DIM [23], or [28], and since they use DHTs they have a
different focus and are not directly applicable to our case.
A hybrid system, Mercury [4], using routing hubs to
answer range queries, was also recently proposed. This
system is also designed to provide exact answers to range
queries. Exact solutions to OLAP queries have been
considered in [11, 19].
Methods to sample random peers in P2P networks
have been proposed in [14, 21, 4]. These techniques use
Markov chain random walks to select random peers from
the network. Their results show that when certain
structural properties of the graph are known or can be
estimated (such as the second eigenvalue of the graph) the
parameters of the walk can be set so that a representative
sample of the stationary distribution can be collected with
high probability. In [4] it is shown that if the graph is an
expander, a random walk converges to the stationary
distribution in O(logM) steps, where M is the number of
peers in the network.
Our work also generalizes to the P2P domain
previous work on approximate query processing in
relational databases. Recent work by [9, 3, 6, 8, 1, 14, 5,
7, 23] has developed powerful techniques for employing
sampling in the database engine to approximate
aggregation queries and to estimate database statistics.
Recent techniques have focused on providing formal
foundations and algorithms for block-level sampling and
are thus most relevant to our work. The objective in
block-level sampling is to derive a representative sample
by only randomly selecting a set of disk blocks of a
relation [9, 16]. Specifically, [9] presents a technique for
histogram estimation that uses cross-validation to identify
the amount of sampling required for a desired accuracy
level. In addition, the paper [16] considers the problem of
deciding what percentage of a disk block should be
included in the sample, given a cost model.
3. Foundations of our Approach
In this section we discuss the principles behind our
approach for approximate query processing on P2P
databases. Our actual algorithm is described in Section 4.
3.1. The Peer-to-Peer Model
We assume an unstructured P2P network represented as a
graph G = (P, E), with a vertex set P={p1, p2, ..., pM} and
an edge set E. The vertices in P represent the peers in the
network and the edges in E represent the connections
between the vertices in P. Each peer p is identified by the
processor’s IP address and a port number (IPp, portp). The
peer p is also characterized by the capabilities of the
processor on which it is located, including its CPU speed
pcpu, memory bandwidth pmem and disk space pdisk. The
node also has a limited amount of bandwidth to the
network, noted by pband. In unstructured P2P networks, a
node becomes a member of the network by establishing a
connection with at least one peer currently in the network.
Each node maintains a small number of connections with
its peers; the number of connections is typically limited
by the resources at the peer. We denote the number of
connections a peer is maintaining by pconn.
The peers in the network use the Gnutella P2P
protocol to communicate. The Gnutella P2P protocol
supports four message types (Ping, Pong, Query,
Query_Hit); of which the Ping and Pong messages are
used to establish connections with other peers, and the
Query and Query_Hit messages are used to search in the
P2P network. Gnutella, however, uses a naïve Breadth
First Search (BFS) technique in which queries are
propagated to all the peers in the network, and thus
consumes excessive network and processing resources
and results in poor performance. Our approach, on the
other hand, uses a probabilistic search algorithm based on
random walks. The key idea is that each node forwards a
query message, called a walker, randomly to one of its
adjacent peers. This technique is shown to improve the
search efficiency and reduce unnecessary traffic in the
P2P network.
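The forwarding rule above can be sketched in a few lines. This is a simplified, single-process illustration rather than a Gnutella implementation: the overlay is a hypothetical five-peer graph stored as adjacency lists, and the names random_walk and neighbors are assumptions of this sketch. Self loops are not modeled here.

```python
import random

def random_walk(neighbors, sink, hops, rng):
    """Forward a walker `hops` times, starting at the sink.

    Returns the list of peers visited after the sink, in order.
    """
    path = []
    current = sink
    for _ in range(hops):
        # Each peer forwards the walker to one adjacent peer,
        # chosen uniformly at random.
        current = rng.choice(neighbors[current])
        path.append(current)
    return path

# Hypothetical 5-peer overlay, given as adjacency lists.
neighbors = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D", "E"],
    "D": ["B", "C"],
    "E": ["C"],
}
path = random_walk(neighbors, "A", 10, random.Random(7))
print(path)
```

Unlike BFS flooding, each hop generates exactly one message, so the traffic grows with the length of the walk rather than with the size of the network.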
3.2. Query Cost Measures
As mentioned in the introduction, the cost of the
execution of a query in P2P databases is more
complicated than equivalent cost measures in traditional
databases. The primary cost measure we consider is
latency, which is the end-to-end time to propagate the
query across multiple peers and receive replies.
For the purpose of illustration, we focus in this
section on the SUM and COUNT aggregates. For these
specific aggregates, latency can be approximated by an
even simpler measure: the number of peers that
participate in the algorithm. This measure is appropriate
for these aggregates primarily because the overheads of
visiting peers dominate other incurred costs.
To see this, we note that the aggregation operator (as
well as the selection filter) can be pushed to each visited
peer. Once a peer is visited by the algorithm, the peer can
be instructed to simply execute the original query on its
local data and send only the aggregate (and its degree)
back to the sink, from which the sink can reconstruct the
overall answer. Moreover, this information can be sent
directly without necessitating any intermediate hops, as
the visited peer knows the IP address of the sink from
which the query originated. Thus the bandwidth
requirement of such an approach is uniformly very small
for all visited peers – they are not required to send more
voluminous raw data (e.g., all or parts of the local
database) back to the sink.
In approximating latency by the number of visited
peers, we also make the implicit assumption that the
overhead of visiting peers dominates the costs of local
computations (such as, execution of the original query on
the local database). This is of course true if the local
databases are fairly small. To ensure that the local
computations remain small even if local databases are
large, our approach in such cases is to execute the
aggregation query only on a small fixed-sized random
sample of the local data (i.e., we sub-sample from the
peer), scale the result to the entire local database, and
send the scaled aggregate back to the sink. This way, we
ensure that the local computations are uniformly small
across all visited peers.
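A minimal sketch of this local step for a COUNT query is given below. The function name local_count, the in-memory list of tuples, and the predicate argument are illustrative assumptions; a real peer would evaluate the selection condition against its own storage.

```python
import random

def local_count(tuples, predicate, sample_size, rng):
    """Estimate, at one peer, the COUNT of tuples satisfying `predicate`."""
    if len(tuples) <= sample_size:
        # Small local database: execute the query exactly.
        return float(sum(1 for t in tuples if predicate(t)))
    # Large local database: query a fixed-size random sub-sample ...
    sub = rng.sample(tuples, sample_size)
    hits = sum(1 for t in sub if predicate(t))
    # ... and scale the result to the entire local database.
    return hits * len(tuples) / sample_size

rng = random.Random(3)
local_db = list(range(10_000))  # made-up local partition
print(local_count(local_db, lambda t: t % 2 == 0, 200, rng))
```

Because the sub-sample size is fixed, the work done at each peer is bounded regardless of how large its local partition is, and only a single number (plus the peer's degree) has to be returned to the sink.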
In summary, for SUM and COUNT aggregates,
latency is shown to be proportional to the number of
visited peers. Thus, our goal is to minimize the number of
peers that must be visited in order to arrive at an
approximate answer with the desired accuracy.
We mention that for other types of aggregations –
e.g., statistics computations such as medians, quantiles,
histograms, and distinct values – the cost model is more
complex as the aggregation operator usually cannot be
pushed to the peers. In such cases, more voluminous data
has to be sub-sampled from the visited peers and sent
back to the sink, thus incurring nontrivial bandwidth
costs. An appropriate cost model usually has to take into
account multiple factors, such as costs of visiting peers,
local computations at peers, transportation of data back to
the sink, and local computations at the sink. Handling
such aggregations is part of ongoing work – e.g., we have
interesting results on the median and quantile
computations that are presented later in the paper –
however, we omit complete details of these efforts due to
lack of space.
3.3. Random Walk in Graphs
In seeking a random sample of the P2P database, we have
to overcome the sub-problem of how to collect a random
sample of the peers themselves. Unrepresentative samples
of peers can quickly skew results producing erroneous
aggregation statistics. Sampling in a non-hierarchical
decentralized P2P network presents several obstacles in
obtaining near uniform random samples. This is because
no peer (including the query sink) knows the IP addresses
of all other peers in the network – they are only aware of
their immediate neighbors. If this were not the case,
clearly the sink could locally generate a random subset of
IP addresses from among all the IP addresses, and visit
the appropriate peers directly. We note that this problem
is not encountered in traditional databases, as even if one
has to resort to cluster (or block-level) sampling such as
in [9, 16], obtaining an efficient sample of the blocks
themselves is straightforward.
This problem has been recognized in other contexts
(see [14] and the references therein), and interesting
solutions based on Markov chain random walks have been
proposed. We briefly review such approaches here. A
Markov chain random walk is a procedure that is initiated
at the sink, and for each visited peer, the next peer to visit
is selected with equal probability from among its
neighbors (and itself – thus self loops are allowed). It is
well known that, if this walk is carried out long enough,
the eventual probability of reaching any peer p will reach
a stationary distribution. To make this more precise, let P
= {p1, p2, …, pM} be the entire set of peers, let E be the
entire set of edges, and let the degree of a peer p be
deg(p). Then the probability of any peer p in the
stationary distribution is:
\[ \mathrm{prob}(p) = \frac{\deg(p)}{2|E|} \]
It is important to note that the above distribution is not
uniform – the probability of each peer is proportional to
its degree. Thus, even if we can efficiently achieve this
distribution, we will have to compensate for the fact that
the distribution is skewed as above, if we have to use
samples drawn from it for answering aggregation queries.
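One standard way to compensate for such a known skew is to weight each sampled peer's reply by the inverse of its stationary probability. The sketch below illustrates this for a COUNT or SUM query; it is offered as an assumption about how the compensation could be done, not as the exact estimator developed in this paper. The names stationary_prob and weighted_estimate, and the small degree and count tables, are hypothetical.

```python
def stationary_prob(degree, num_edges):
    """prob(p) = deg(p) / (2|E|) for a peer of the given degree."""
    return degree / (2 * num_edges)

def weighted_estimate(replies, num_edges):
    """Estimate the global aggregate from per-peer replies.

    replies: list of (local_aggregate, degree) pairs, one per sampled peer,
    assumed to be drawn from the stationary distribution.
    """
    m = len(replies)
    # Dividing by prob(p) cancels the bias towards high-degree peers.
    return sum(y / stationary_prob(d, num_edges) for y, d in replies) / m

# Hypothetical 5-peer graph: degrees and local COUNT results.
degrees = {"A": 2, "B": 3, "C": 4, "D": 2, "E": 1}
counts = {"A": 10, "B": 40, "C": 25, "D": 5, "E": 20}
num_edges = sum(degrees.values()) // 2  # every edge contributes to two degrees

replies = [(counts[p], degrees[p]) for p in ["C", "B", "C", "A"]]
print(weighted_estimate(replies, num_edges))
```

Under this weighting, the expected value of each term equals the true global aggregate, so averaging over more sampled peers reduces the variance without introducing bias.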
The main issue that has concerned researchers has
been the speed of convergence, i.e., how many hops h are
necessary before one gets close to the stationary
distribution. Most results have pointed to certain broad
connectivity properties that the graph should possess for
this to happen. In particular, it has been shown that if the
transition probabilities that govern the random walk on
the P2P graph are modeled as an MxM matrix, the second
eigenvalue plays an important role in these convergence
results. The second eigenvalue describes connectivity
properties of graphs - in particular whether the graph has
small cuts which would adversely impact the length of the
walk necessary to arrive at convergence. For example,
Figure 1 shows a clustered graph with a small cut.
Figure 1: Two clusters with a small cut between each other
As the results in [14] show, if the P2P graph is well
connected (i.e., it has a small second eigenvalue, and the
minimum degree of the graph is large), then the random
walk quickly converges as it “loses memory” rapidly. In
fact, under certain specific conditions of connectedness
(e.g., expander graphs that are common in P2P networks),
convergence can be achieved in O(logM) steps.
In our case, recall from the introduction that we
assume that we are allowed a certain amount of
preprocessing to determine various properties of the P2P
graph that will be useful at query time – under the
assumption that the graph topology changes less rapidly
compared to the data content at the peers. The speed of
convergence of a random walk in this graph is determined
in this preprocessing step, in addition to other useful
properties such as number of nodes M, the number of
edges |E|, and so on. With respect to speed of
convergence, we essentially determine a jump parameter j
that determines how many peers to skip between
selections of peers for the sample. As the jump increases,
the correlation between successive peers that are selected
for the sample decreases rapidly.
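The role of the jump parameter can be sketched in Python as follows. This is a minimal sketch under our own assumptions: the function name `random_walk_sample`, the adjacency-list representation, and the ring topology are hypothetical, and the walk simply keeps every j-th peer it reaches.

```python
import random


def random_walk_sample(adjacency, start, m, j, rng):
    """Walk from `start` and keep every j-th peer reached until m peers are kept."""
    selected = []
    current = start
    hops = 0
    while len(selected) < m:
        current = rng.choice(adjacency[current])  # move to a uniformly random neighbor
        hops += 1
        if hops % j == 0:  # j - 1 peers are passed over between selections
            selected.append(current)
    return selected


# Hypothetical ring of 10 peers, each connected to its two neighbors.
ring = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
sample = random_walk_sample(ring, start=0, m=5, j=3, rng=random.Random(0))
```

The walk costs j * m hops in total, so a larger j trades bandwidth for lower correlation between consecutive selections.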
3.4. Sampling Theorems
In this subsection, we shall develop the formal sampling
theorems that drive our algorithm. We shall show how the
tuples that are retrieved from the first phase of our
algorithm can be utilized to recommend how the second
phase should be executed, i.e., the “query plan” for
answering the query approximately so that a desired error
is achieved.
We focus here on the COUNT aggregate for the
purpose of illustrating our main ideas (our formal results
can be easily extended for the SUM case). Finally, to
keep the discussion simple, we assume that all local
databases at peers are small, i.e., sub-sampling is not
required (our results can be extended for the sub-sampling
case, and in fact our algorithm in Section 4 does not make
this assumption).
As discussed earlier, our algorithm has two phases. In
the first phase, our algorithm will visit a predefined
number of peers m using a random walk such that the
sample of visited peers will appear as if they have been
drawn from the stationary distribution of the graph. The
query will be executed locally at each visited peer, and
the aggregates will be sent back to the sink, along with other
information such as the degrees of the visited peers (from
which information such as the peers’ probabilities in the
stationary distribution can be computed). The sink
analyzes this information, and then determines how many
more peers need to be visited in the second phase. The
theorems that we develop next provide the foundations on
which the decisions in the first phase are made.
Recall that P = {p1, p2, …, pM} is the set of peers.
For a tuple u, let y(u) = 1 if u satisfies the selection
condition, and = 0 otherwise.
Let the aggregate for a peer p be
$$y(p) = \sum_{u \in p} y(u)$$
Let y be the exact answer for the query, i.e.,
$$y = \sum_{p \in P} y(p)$$
The query also comes with a desired error threshold $\Delta_{req}$.
The implication of this requirement is that, if y’ is the
estimated count by our algorithm, then
$$|y - y'| \le \Delta_{req}$$
Now, consider a fixed-size sample of peers S = {s1, s2…
sm} where each si is from P. This sample is picked by the
random walk in the first phase. We can approximate this
process as that of picking peers in m rounds, where in
each round a random peer si is picked from P with
probability prob(si). We also assume that peers may be
picked with replacement – i.e., multiple copies of the
same peer may be added to the sample – as this greatly
simplifies the statistical derivations below.
Consider the quantity y’’ defined as follows:
$$y'' = \frac{1}{m} \sum_{s \in S} \frac{y(s)}{prob(s)} \qquad (1)$$
Theorem 1: $E[y''] = y$, that is, y’’ is an unbiased
estimator of y.
Proof: Intuitively, each sampled peer s tries to estimate y
as y(s)/prob(s), i.e., by scaling its own aggregate by the
inverse of its probability of getting picked. The final
estimate y’’ is simply the average of the m individual
estimates.
To proceed with the proof, consider the simple case
of only one sampled peer, i.e., m = 1. In this case,
$$E[y''] = \sum_{p \in P} \frac{y(p)}{prob(p)} \, prob(p) = \sum_{p \in P} y(p) = y$$
To extend to any m, we make use of the linearity of
expectation formula: E[X+Y] = E[X] + E[Y] for random
variables X and Y (that need not even be independent).
Thus if the expected estimate of any single random peer is
y, then the expected average estimate by m random peers
is also y.
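The unbiasedness claim can be checked exactly on a toy instance. The Python sketch below evaluates the estimator of Equation 1 and computes its expectation by enumerating every possible sample drawn with replacement. The peer data in `peers` and the function names are hypothetical, chosen only for illustration.

```python
from fractions import Fraction
import itertools


def estimate_count(sample):
    """Equation 1: average of y(s) / prob(s) over the sampled peers.

    `sample` is a sequence of (y_s, prob_s) pairs.
    """
    return sum(Fraction(y_s) / prob_s for y_s, prob_s in sample) / len(sample)


def expected_estimate(peers, m):
    """Exact E[y''] obtained by enumerating all samples of size m drawn with replacement."""
    total = Fraction(0)
    for combo in itertools.product(peers, repeat=m):
        weight = Fraction(1)
        for _, prob in combo:
            weight *= prob  # probability of drawing this particular sequence of peers
        total += weight * estimate_count(combo)
    return total


# Hypothetical peers as (y(p), prob(p)), with prob(p) = deg(p) / (2|E|) and 2|E| = 8.
peers = [
    (10, Fraction(3, 8)),
    (0, Fraction(2, 8)),
    (4, Fraction(2, 8)),
    (6, Fraction(1, 8)),
]
true_count = sum(y_p for y_p, _ in peers)
```

For these peers the enumeration returns the true count for every sample size, even though individual samples can be far from it.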
We next need to determine the variance of the
random variable y’’.
Theorem 2 (Standard Error Theorem):
$$Var[y''] = \frac{1}{m} \sum_{p \in P} \left( \frac{y(p)}{prob(p)} - y \right)^2 prob(p)$$
Proof: To easily derive this variance, let us consider the
simple case of only one sampled peer, i.e., m = 1. In this
case, it is easy to see that the variance is defined by the
quantity
$$C = \sum_{p \in P} \left( \frac{y(p)}{prob(p)} - y \right)^2 prob(p)$$
To extend to any m, we make use of the following
formulas for variance: (a) Var[aX] = a^2 Var[X], and (b)
Var[X+Y] = Var[X] + Var[Y], where X and Y are
independent random variables and a is a constant. Using
these formulas, we can easily show that Var[y’’] = C/m.
The above Standard Error Theorem shows that the
variance varies inversely as the sample size. The quantity
C also represents the “badness” of the clustering of the
data in the peers – the larger the C, the more the
correlation amongst the tuples within peers, and
consequently the more peers need to be sampled to keep
the variance of the estimator y’’ small. Notice also that if
we divide the variance by N^2, we will effectively get the
square of the error of the relative count aggregate, if y’’
was used as an estimator for y.
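The relationship Var[y’’] = C/m can be verified exactly on small inputs. The Python sketch below computes C from its definition and compares it with the variance obtained by brute-force enumeration of all samples. The peer lists and function names are hypothetical and serve only as an illustration.

```python
from fractions import Fraction
import itertools


def clustering_badness(peers):
    """C = sum over peers of prob(p) * (y(p) / prob(p) - y)^2."""
    y = sum(y_p for y_p, _ in peers)
    return sum(prob * (Fraction(y_p) / prob - y) ** 2 for y_p, prob in peers)


def exact_variance(peers, m):
    """Exact Var[y''] for samples of size m drawn with replacement, by enumeration."""
    y = sum(y_p for y_p, _ in peers)
    variance = Fraction(0)
    for combo in itertools.product(peers, repeat=m):
        weight = Fraction(1)
        for _, prob in combo:
            weight *= prob
        est = sum(Fraction(y_s) / prob_s for y_s, prob_s in combo) / m
        variance += weight * (est - y) ** 2  # E[y''] = y, so this is the variance
    return variance


# Hypothetical peers as (y(p), prob(p)) with 2|E| = 8.
skewed_peers = [(10, Fraction(3, 8)), (0, Fraction(2, 8)),
                (4, Fraction(2, 8)), (6, Fraction(1, 8))]
# Counts exactly proportional to prob(p): every peer gives the same estimate.
proportional_peers = [(3, Fraction(3, 8)), (2, Fraction(2, 8)),
                      (2, Fraction(2, 8)), (1, Fraction(1, 8))]
C = clustering_badness(skewed_peers)
```

When each peer's count is exactly proportional to its selection probability, C is zero and a single peer suffices; the further the data departs from that situation, the larger C becomes.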
Our case is actually the reverse, i.e., we are given a
desired error threshold $\Delta_{req}$, and the task is to determine
the appropriate number of peers to sample that will satisfy
this threshold. Of course, we have used a fixed-sized m in
the first phase, so unless we are simply lucky, it’s unlikely
that this particular m will satisfy the desired accuracy.
However, we can use the first phase more carefully to
determine the appropriate sample size to draw in the
second phase, say m’.
The main task is to use the sample drawn in the first
phase to try and estimate C; because once we estimate C,
we can determine m’ using Theorem 2. We suggest a
simple cross-validation procedure as described below to
estimate C (this procedure is inspired by previous work in
a different context, see [9]).
Consider two random samples of peers of size m each
drawn from the stationary distribution. Let y1’’ and y2’’ be
the two estimates of y by these samples respectively
according to Equation 1. We define the cross-validation
error as:
$$CVError = y_1'' - y_2''$$
Theorem 3: $E[CVError^2] = 2\,E[(y'' - y)^2]$
Proof: Since y1’’ and y2’’ are independent and each is an
unbiased estimator of y, the cross term vanishes and
$$E[CVError^2] = E[(y_1'' - y_2'')^2] = E[(y_1'' - y)^2] + E[(y_2'' - y)^2] = 2\,E[(y'' - y)^2]$$
This theorem says that the expected value of the square of
the cross-validation error is 2 times the expected value of
the square of the actual error.
This cross-validation error can be estimated in the
first phase by the following procedure. Randomly divide
the m samples into two halves, and compute the cross-
validation error (for sample size m/2). We can then
determine C by fitting this computed error and the sample
size m/2 into the equation in Theorem 2. To get a
somewhat more robust estimation for C, we can repeat the
random halving of the sample collected in the first phase
several times and take the average value of C. We also
note that since the cross-validation error is larger than the
true error, the value of C is conservatively overestimated.
Once C is determined (i.e., the “badness” of the
clustering of data in the peers), we can determine the right
number of peers to sample in the second phase, m’, to
achieve the desired accuracy.
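The cross-validation procedure can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the threshold `delta_req` is taken to be in the same units as the estimate, and C is fitted conservatively as (m/2) times the average squared cross-validation error.

```python
import math
import random


def estimate_from(sample):
    """Equation 1 applied to a list of (y_s, prob_s) pairs."""
    return sum(y_s / prob_s for y_s, prob_s in sample) / len(sample)


def recommend_phase_two_size(sample, delta_req, repeats, rng):
    """Estimate C by repeated random halving and return the suggested m'."""
    half = len(sample) // 2
    total_squared_error = 0.0
    for _ in range(repeats):
        shuffled = sample[:]
        rng.shuffle(shuffled)
        first = estimate_from(shuffled[:half])
        second = estimate_from(shuffled[half:2 * half])
        total_squared_error += (first - second) ** 2
    mean_squared_error = total_squared_error / repeats
    c_hat = half * mean_squared_error  # fit CVError^2 = C / (m/2)
    return math.ceil(c_hat / delta_req ** 2)  # smallest m' with C / m' <= delta_req^2


# Hypothetical phase-one result: two peers with estimates 2 and 6.
m_prime = recommend_phase_two_size([(1, 0.5), (3, 0.5)], delta_req=2.0,
                                   repeats=3, rng=random.Random(0))
```

Averaging the squared error over several random halvings smooths out the luck of any single partition, at no extra network cost since it reuses the phase-one sample.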
4. Our Algorithm
In this section we present details of our two-phase
algorithm for approximating answering of aggregate
queries. For the sake of illustration, we focus on
approximating COUNT queries – it can be easily
extended to SUM queries. The pseudo code of the
algorithm is presented below.
Algorithm: COUNT queries
Predefined Values
  M : Total number of peers in network
  |E| : Total number of edges in network
  m : Number of peers to visit in Phase I
  j : Jump size for random walk
  t : Max #tuples to be sub-sampled per peer
Inputs
  Q : COUNT query with selection condition
  Sink : Peer where query is initiated
  Δreq : Desired max error
Phase I
// Perform Random Walk
1. Curr = Sink; Hops = 1;
2. while (Hops <= j * m) {
3.   if (Hops % j == 0)
4.     Visit(Curr);
5.   Hops++;
6.   Curr = random adjacent peer
7. }
// Visit Peer
1. Visit(Curr) {
2.   if (#tuples of Curr <= t) {
3.     Execute Q on all tuples
4.   } else {
5.     Execute Q on t randomly sampled
6.     tuples
7.   }
8.   y(Curr) = (result of Q) * (#tuples / #processedTuples)
9.   Return (y(Curr), deg(Curr)) to Sink
10. }
// Cross-Validate at Sink
1. Let S = {s1, s2, …, sm} be the visited peers
2. Partition S randomly into two halves: S1 & S2
3. Compute y1’’ = (Σ_{s∈S1} y(s)/prob(s)) / (m/2) and
   y2’’ = (Σ_{s∈S2} y(s)/prob(s)) / (m/2),
   where prob(s) = deg(s) / (2|E|)
4. Compute CVError = y1’’ − y2’’
5. Return m’ = (m/2) * CVError^2 / Δreq^2
Phase II
1. Visit m’ peers using random walk
2. Let S’ = {s1, s2, …, sm’} be the visited peers
3. Return y’’ = (Σ_{s∈S’} y(s)/prob(s)) / m’
Our approach in the first phase is broken up into the
following main components. First, we perform a random
walk on the peer-to-peer network, attempting to avoid
skewing due to graph clustering and vertices of high
degree. Our walk skips j nodes between each selection to
reduce the dependency between consecutive selected
peers. As the jump size increases, our method increases
overall bandwidth requirements within the database, but
for most cases small jump sizes suffice for obtaining
random samples.
Second, we compute aggregates of the data at the
peers and send these back to the sink. Note that in the
previous section, we had not formally discussed the issue
of sub-sampling at peers – this was primarily done to keep
the previous discussion simple. In reality, the local
databases at some peers can be quite large, and
aggregating them in their entirety may not be negligible
compared to the overhead of visiting the peer – in other
words, the simplistic cost model of only counting the
number of visited peers is inappropriate. In such cases, it
is preferable to randomly sub-sample a small portion of
the local database, and apply the aggregation only to this
sub-sample. Thus, the ideal approach for this problem is
to develop a cost model that takes into account cost of
visiting peers as well as local processing costs; and for
such cost models, an ideal two-phase algorithm should
determine various parameters in the first phase, such as
how many peers to visit in the second phase, and how
many tuples to sub-sample from each visited peer. In this
paper we take a somewhat simpler approach, in which
we fix a constant t (determined at preprocessing time via
experiments), such that if a peer has at most t tuples, its
database is aggregated in its entirety, whereas if the peer
has more than t tuples, then t tuples are randomly selected
and aggregated. Sub-sampling can be more efficient than
scanning the entire local database – e.g., by block-level
sampling in which only a small number of disk blocks are
retrieved. If the data in the disk blocks are highly
correlated, it will simply mean that the number of peers to
be visited will increase, as determined by our cross-
validation approach at query time.
Third, we estimate the cross-validation error of the
collected sample, and use that to estimate the additional
number of peers that need to be visited in the second
phase. For improving robustness, steps 2-4 in the cross-
validation procedure can be repeated a few times and the
average squared CVError computed.
Once the first phase has completed, the second phase
is then straightforward – we simply initiate a second
random walk based on the recommendations of the first
phase, and compute the final aggregate.
Although the algorithm has been presented for the
case of COUNT, it can be easily extended for SUM.
Finally, we re-emphasize that for more complex
aggregates, such as estimation of medians, quantiles, and
distinct values, more sophisticated algorithms are
required. This is part of ongoing work, and we mention
some preliminary results in the experimental section.
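To make the overall flow concrete, the following Python sketch simulates both phases for COUNT on an in-memory graph. It is a simplified sketch under our own assumptions and not the Java implementation used in the experiments: all names are hypothetical, 2|E| is computed directly from the adjacency lists (standing in for the preprocessing step), `delta_req` is in the same units as the count, and at least one peer is always visited in the second phase.

```python
import math
import random


def visit(peer_tuples, predicate, t, rng):
    """Local COUNT at one peer, scaled up when only t tuples are sub-sampled."""
    processed = peer_tuples if len(peer_tuples) <= t else rng.sample(peer_tuples, t)
    if not processed:
        return 0.0
    hits = sum(1 for u in processed if predicate(u))
    return hits * len(peer_tuples) / len(processed)


def walk(adjacency, data, predicate, start, m, j, t, rng):
    """Random walk keeping every j-th peer; returns (y(s), prob(s)) pairs and the last peer."""
    two_e = sum(len(nbrs) for nbrs in adjacency.values())
    sample, curr, hops = [], start, 0
    while len(sample) < m:
        curr = rng.choice(adjacency[curr])
        hops += 1
        if hops % j == 0:
            prob = len(adjacency[curr]) / two_e  # deg(s) / (2|E|)
            sample.append((visit(data[curr], predicate, t, rng), prob))
    return sample, curr


def estimate(sample):
    """Equation 1 applied to a list of (y(s), prob(s)) pairs."""
    return sum(y_s / p_s for y_s, p_s in sample) / len(sample)


def two_phase_count(adjacency, data, predicate, sink, m, j, t, delta_req, rng, repeats=5):
    """Phase I: sample m peers and cross-validate. Phase II: sample m' peers and estimate."""
    phase_one, last = walk(adjacency, data, predicate, sink, m, j, t, rng)
    half = m // 2
    squared = 0.0
    for _ in range(repeats):
        shuffled = phase_one[:]
        rng.shuffle(shuffled)
        squared += (estimate(shuffled[:half]) - estimate(shuffled[half:2 * half])) ** 2
    m_prime = max(1, math.ceil(half * (squared / repeats) / delta_req ** 2))
    phase_two, _ = walk(adjacency, data, predicate, last, m_prime, j, t, rng)
    return estimate(phase_two), m_prime


# Hypothetical network: 6 fully connected peers, each holding the tuples 1..4.
full_graph = {i: [k for k in range(6) if k != i] for i in range(6)}
local_data = {i: [1, 2, 3, 4] for i in range(6)}
demo_estimate, demo_m_prime = two_phase_count(
    full_graph, local_data, lambda u: u <= 2, sink=0,
    m=4, j=2, t=10, delta_req=1.0, rng=random.Random(7))
```

In this deliberately uniform example every peer yields the same estimate, so the cross-validation error is zero and the second phase needs only a single peer; clustered data would drive m’ up.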
5. Experimental Evaluation
In this section, we provide experimental
justification for our methods. We have implemented our
algorithms on simulated and real-world topologies using
various degrees of data clustering and topology structures.
5.1. Implementation
Our algorithms and peer-to-peer topologies are
implemented in Java 5.0 with the graph generation tool
Jung [18] version 1.6. Our implementation includes both
sampled and real-world Gnutella topology samples. All
of our experiments were run on AMD dual Opteron 2.0
GHz processors with 2GB of RAM.
5.2. Generation of P2P Networks and Databases
5.2.1. P2P Networks
Synthetic Topology: Power laws [12] offer insight
into the structure of Internet topologies, and [2] confirms
that power laws extend to peer-to-peer networks. Our
synthetic topology is created through the process of
connecting sub-graphs using the graph generation tool
Jung [18]. It consists of 10,000 peers and 100,000 edges.
The parameters during graph creation are:
• Sub-graphs [s]: s sub-graphs are created that follow
the power-laws topology [12].
• Edges between sub-graphs [e]: The size of e
determines the cut size between sub-graphs. As the
cut size decreases, number of edges between sub-
graphs decreases.
Real-World Topology: We also experimented with 2001
Gnutella topology data containing 22,556 peers and
52,321 edges, acquired from the group of M. Ripeanu at
the University of Chicago.
5.2.2. P2P Databases
Both types of networks were populated with data
generated by a synthetic data generator. We use single
attribute tuples. The attribute values have a range
between 1 and 100. The values follow the Zipf-
distribution. The parameters that define the main
characteristics of our synthetic data sets are as follows:
• Cluster Level [CL]: If the cluster level is equal to 0,
then the dataset is perfectly clustered, i.e., it is sorted
and then partitioned across the peers. If the cluster
level is set to 1, then the dataset is randomly
permuted, then partitioned across the peers. In-
between values correspond to in-between scenarios.
• Skew [Z]: The skew determines the slant in the
frequency distribution of distinct values in the data. Low
skew values give the dataset an even distribution of
frequencies per value; conversely, high skew values
distort the distribution of frequencies.
We populated the synthetic network with 1,000,000 tuples
and the Gnutella network with 2,200,000 tuples. It is well-
known that peer-to-peer databases have strong clustering
properties, e.g., large networks such as Gnutella contain
sub-graphs of peers, containing similar music genre,
movies, software, or documents [22]. Thus, while
populating the peers of both networks, we distributed the
data in a breadth-first method, in order to obtain
reasonable clustering of synthetic data within the
topologies. That is, when loading a peer, the adjacent peers
are also loaded with similarly clustered data.
5.2.3. Aggregation Queries
In our experiments we use SUM and COUNT range
queries with different selectivities of the form: “SELECT
COUNT(A) FROM T WHERE A BETWEEN A1 AND
A2” (i.e., find the number of tuples with values in the
range [A1, A2]).
5.3. Input Parameters
We evaluate the accuracy, use of network resources, the
size of sample acquired, and total number of tuples
sampled from the network. We define each of the user
defined inputs as follows:
1. Required Accuracy [$\Delta_{req}$]: This parameter
is the maximum allowed error for the estimated
answer.
2. Tuples Sampled per Peer [t]: This parameter
defines the number of tuples to sample from each
selected peer.
3. Jump Size [j]: This parameter defines the number
of peers to pass over before selecting the next peer
for sampling.
4. Initial Sample Size [$r_{orig}$]: This parameter defines
the initial number of tuples to acquire from the
database to execute the first phase. (Thus, $r_{orig} / t = m$,
where m is the number of peers visited in the first
phase. In our experiments, the local databases are
always large enough to ensure that sub-sampling
always takes place.)
Parameter 1 is provided by the user for each query.
Parameters 2-4 may be provided by the user, or may be
set via a pre-processing step. At the end of the
experimental section we provide a user guide for setting
parameters 2-4.
5.4. Evaluation Metrics – Cost and Accuracy
Our algorithms are evaluated based on the cost of
execution as well as how close they get to the desired
accuracy. As discussed earlier, we use latency as a
measure of our cost, noting that in our case it is
proportional to the number of peers visited. In fact, if the
number of tuples to be sampled is the same for all peers -
which is true in our experiments - latency is also
proportional to the total number of sample tuples drawn
by the overall algorithm. Thus we use the number of
sample tuples used as a surrogate for latency in describing
our results.
5.5. Experiments
All of our results were generated from five independent
experiments and averaged for each individual parameter
configuration. Errors are normalized between 0 and 1.
Accuracy: Figures 2 and 3 show representative accuracy
results for COUNT using synthetic and real datasets. In
this case we have a query with selectivity 30%, CL=0.2,
and Z=0.2. In Figure 2 we vary the required accuracy.
The figure shows that the algorithm’s result is always
within the required accuracy. In Figure 3 we set required
accuracy to 0.1 and show the resulting accuracy for
queries with different selectivities.
Sample Size: Figures 4 and 5 show that the required
sample size increases with $1/\Delta_{req}^2$. They also show that
the required sample size does not vary much when the
initial sample size ranges from 1000 to 3000. The
selectivity of the query in this experiment was 30%, and
the algorithm gave an answer within the required
accuracy in all cases. We note that the result of our
algorithm specifies the number of peers to be sampled. In
the experiments we convert it to the number of samples
by taking 25 samples per peer. Figure 6 shows that the
improvement by getting more tuples per peer is small. To
minimize the cost of sampling in each peer we take 25
samples in each peer.
Comparison with naïve techniques: Figure 7 compares
our approach with DFS, where we collect our sample
using a random walk with j=0, and BFS, where we collect
our sample from the peers in the neighborhood of the
querying peer. Note that our method always meets the
required accuracy. Our technique clearly outperforms
both techniques.
Effects of data clustering and skew: Figures 8, 9, 10,
and 11 show the effects of different degrees of data
clustering (8, 9) and different degrees of skew (10, 11).
Figures 7 to 12 simulate a peer-to-peer database with two
sub-graphs, each containing similar data within individual
sub-graphs but different from others. The results show
that with clustering closer to 0 (data are more clustered)
we need to collect more samples, while with clustering
close to 1 (data are less clustered) we need fewer samples,
since each peer contains a better sample of the entire
dataset. Regarding skew, the results show that when
skew increases, we need fewer samples. The reason is
that some values become much more frequent in the
dataset, and their counts are therefore easier to estimate.
Graph size vs. jump size: Figure 12 illustrates the
relationship between jump size and size of cuts in a peer-
to-peer database. As the number of edges connecting sub-
graphs or the jump size increases, the accuracy of the
sample increases. The number of edges connecting
sub-graphs and the jump size are inversely related in
determining the quality of the sample acquired.
Evaluating the SUM query: Figures 13 and 14 show
that our technique achieves similar accuracy results for
SUM. Here we estimate the SUM of all tuples in the
database (i.e., selectivity = 1).
Figure 2: Effects of required accuracy on the error percentage for the COUNT technique. Plot: Required Accuracy vs. Error % (CL=0.25, Z=0.2, j=10, Selectivity=30); series: Synthetic, Gnutella.
Figure 3: Effects of selectivity on the error percentage for the COUNT technique. Plot: Selectivity vs. Error % (Required Acc=0.10, Z=0.2, j=10); series: Synthetic, Gnutella.
Figure 4: Effects of the sample size collected for given required accuracies and initial sample sizes for the COUNT technique. Plot: Required Acc vs. Initial Sample Size vs. Sample Size, Synthetic Topology (Peers=10,000, Edges=100,000, Tuples Per Peer=50).
Figure 5: Effects of the sample size collected for given required accuracies and initial sample sizes for the COUNT technique. Plot: Required Acc vs. Initial Sample Size vs. Sample Size, Real-world Topology: Gnutella (Peers=22,556, Edges=52,321, Tuples Per Peer=50).
Figure 6: The figure shows the number of peers does not make a vast difference in accuracy. Plot: Samples per peer vs. Error %, Synthetic Topology (Peers=10,000, Edges=100,000, Req Acc=0.10, Z=0.2, j=10).
Figure 7: The figure shows random walks perform better than BFS and DFS. Plot: Required Accuracy vs. Error %, Synthetic Topology (CL=0.25, Z=0.2, Peers=10,000, Edges=100,000, j=10, Sub-Graphs=2, Cut-Size=1000); series: Random Walk, BFS, DFS.
5.6. Estimating the Median
Figures 15 and 16 show that our technique can be
extended to accurately estimate the median. Our
algorithm for computing the median is given below:
1. Select m peers at random using random walk.
2. Each peer sj computes its median medj and
sends it to the sink, along with prob(sj).
3. The sink randomly partitions the m medians
into two groups of m/2 medians, Group1 and
Group2.
4. Let medg1 be the weighted median of Group1,
i.e., such that the following is minimized:
$$\left| \sum_{j \in Group1,\ med_j < med_{g1}} \frac{1}{prob(s_j)} - \sum_{j \in Group1,\ med_j > med_{g1}} \frac{1}{prob(s_j)} \right|$$
5. Find the error between the median of Group2
(say medg2) and the weighted rank of medg1 in
Group2. I.e., let
$$c = \left| \sum_{j \in Group2,\ med_j < med_{g1}} \frac{1}{prob(s_j)} - \sum_{j \in Group2,\ med_j < med_{g2}} \frac{1}{prob(s_j)} \right| \Big/ (m/2)$$
6. Select additional $c^2 / \Delta_{req}^2$ peers using random
walk.
7. Find and return the weighted median of the
medians of the additional peers.
In these experiments we use both the Gnutella and
synthetic graph, vary the clustering factor, and set
$\Delta_{req} = 0.1$. The error that we show in the graph is the
difference between the true rank of the median that the
algorithm returns, and N/2.
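The weighted median used in steps 4 and 7 can be sketched in Python as follows. This is a minimal sketch under our own assumptions: the function name `weighted_median` is hypothetical, each local median is weighted by 1/prob(sj), and ties are broken in favor of the first candidate encountered.

```python
def weighted_median(medians, probs):
    """Return the median whose weight below and weight above are most nearly balanced.

    Each local median med_j carries weight 1 / prob(s_j).
    """
    weights = [1.0 / p for p in probs]
    best, best_gap = None, None
    for candidate in medians:
        below = sum(w for med, w in zip(medians, weights) if med < candidate)
        above = sum(w for med, w in zip(medians, weights) if med > candidate)
        gap = abs(below - above)
        if best_gap is None or gap < best_gap:
            best, best_gap = candidate, gap
    return best


# Hypothetical local medians reported by three peers.
equal_weights = weighted_median([10, 20, 30], [0.25, 0.25, 0.25])
skewed_weights = weighted_median([10, 20, 30], [0.5, 0.5, 0.1])
```

With equal probabilities this reduces to the ordinary median of the reported values; a peer with a low selection probability pulls the result toward its own median because it stands in for more of the network.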
Figure 8: Effects of clustering on the error percentage for the COUNT technique. Plot: Clustering vs. Error % (Required Acc=0.10, Z=0.2, j=10, Selectivity=30); series: Synthetic, Gnutella.
Figure 9: Effects of clustering on the sample size for the COUNT technique. Plot: Clustering vs. Sample Size (Required Acc=0.10, Z=0.2, j=10, Selectivity=30); series: Synthetic, Gnutella.
Figure 10: Effects of skew on the error percentage for the COUNT technique. Plot: Skew vs. Error % (Required Acc=0.10, CL=0.25); series: Synthetic, Gnutella.
Figure 11: Effects of skew on the sample size for the COUNT technique. Plot: Skew vs. Sample Size (Required Acc=0.10, CL=0.25, j=10); series: Synthetic, Gnutella.
Figure 12: Effects of cut size with jump size on error percentage for SUM technique. Plot: Cut Size vs. Jump Size vs. Error %, Synthetic Topology (Peers=10,000, Req Acc=0.10, Zeta=0.2, Sub-Graph=2).
Figure 13: Effects of clustering on the error percentage for the SUM technique. Plot: Clustering vs. Error % (Z=0.2, Req Acc=0.10, j=10); series: Synthetic, Gnutella.
6. Conclusion & Future Work
In this paper we present adaptive sampling-based
techniques for the novel problem of approximate
answering of ad-hoc aggregation queries in P2P
databases. We present extensive experimental
evaluations to demonstrate the feasibility of our solutions.
Several intriguing open problems remain. Is it
possible to build hybrid solutions that do some amount of
pre-computations of samples, in addition to “on-the-fly”
sampling such as ours? Is it possible for sampling-based
algorithms to perform “biased sampling”, i.e., focus the
samples from regions of the database where tuples that
satisfy the query are likely to exist? More generally,
decision support and data analysis in P2P databases
appear to be an important area of research with emerging
applications, and we hope our work will encourage
further research in this field.
7. Acknowledgements
Thanks to M. Ripeanu at the University of Chicago for
providing us with the Gnutella topology samples. The
work of Kalogeraki and Gunopulos was supported by
NSF 0330481.
8. References
[1] S. Acharya, P. B. Gibbons, and V. Poosala. Aqua: A Fast
Decision Support System Using Approximate Query Answers.
Demo in Intl. Conf. on Very Large Databases (VLDB '99).
[2] L. Adamic, R. Lukose, A. Puniyani, and B. Huberman.
Search in Power-Law Networks. Phys. Rev. E, 2001.
[3] B. Babcock, S. Chaudhuri, and G. Das. Dynamic Sample
Selection for Approximate Query Processing. SIGMOD
Conference 2003: 539-550.
[4] A.R. Bharambe, M. Agrawal, and S. Seshan. Mercury:
Supporting Scalable Multi-Attribute Range Queries. SIGCOMM
2004.
[5] M. Charikar, S. Chaudhuri, R. Motwani, and V.
Narasayya. Towards estimation error guarantees for distinct
values. In Proceedings of the ACM Symp. on Principles of
Database Systems, 2000.
[6] S. Chaudhuri, G. Das, M. Datar, R. Motwani, and V.
Narasayya. Overcoming Limitations of Sampling for
Aggregation Queries. ICDE 2001: 534-542.
[7] S. Chaudhuri, R. Motwani, and V. Narasayya. Random
sampling for histogram construction: How much is enough? In
Proceedings of the 1998 ACM SIGMOD Intl. Conf. on
Management of Data, pages 436-447, 1998.
[8] S. Chaudhuri, G. Das, and V. Narasayya. A Robust,
Optimization-Based Approach for Approximate Answering of
Aggregate Queries. SIGMOD Conference 2001.
[9] S. Chaudhuri, G. Das, and U. Srivastava. Effective Use of
Block-Level Sampling in Statistics Estimation. SIGMOD 2004.
[10] Y. Chu, S. Rao, and H. Zhang. A case for end system
multicast. In Proceedings of ACM Sigmetrics 2000.
[11] Mauricio Minuto Espil and Alejandro A. Vaisman.
Aggregate queries in peer-to-peer OLAP. DOLAP '04.
[12] C. Faloutsos, P. Faloutsos, and M. Faloutsos. On Power-
Law Relationships of the Internet Topology. SIGCOMM 1999.
[13] Freenet Homepage, http://freenet.sourceforge.net
[14] C. Gkantsidis, M. Mihail, and A. Saberi. Random Walks in
Peer-to-Peer Networks. IEEE Infocom 2004.
[15] Gnutella Homepage, http://rfc-gnutella.sourceforge.net.
[16] P. Haas and C. König. A Bi-Level Bernoulli Scheme for
Database Sampling. SIGMOD 2004.
[17] R. Huebsch, J. Hellerstein, N. Lanham, B. T. Loo, S.
Shenker, and I. Stoica. Querying the Internet with PIER.
VLDB 2003.
[18] JUNG website. http://jung.sourceforge.net.
[19] P. Kalnis, W. S. Ng, B. C. Ooi and D. Papadias and K-L.
Tan. An adaptive peer-to-peer network for distributed caching
of OLAP results. SIGMOD 2002.
[20] KaZaA Homepage, http://www.kazaa.com.
[21] V. King and J. Saia. Choosing a Random Peer. PODC
2004.
[22] F. Le Fessant, S. Handurukande, A.-M. Kermarrec, and L.
Massoulié. Clustering in Peer-to-Peer File Sharing Workloads.
3rd Intl. Workshop on Peer-to-Peer Systems IPTPS 2004.
[23] X. Li, Y.J. Kim, R. Govindan, and W. Hong. Multi-
dimensional range queries in sensor networks. SENSYS 2003.
[24] D. Milojicic, V. Kalogeraki, R. Lukose, K. Nagaraja, J.
Pruyne, B. Richard, S. Rollins, and Z. Xu. Peer-to-Peer
Computing. HP Technical Report, HPL-2002-57.
[25] Napster Homepage, http://www.napster.com.
[26] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S.
Shenker. A Scalable Content-Addressable Network. SIGCOMM
2001.
[27] A. Rowstron and P. Druschel. Pastry: Scalable,
distributed object location and routing for large-scale peer-to-
peer systems. IFIP/ACM Middleware 2001.
[28] O.D. Sahin, A. Gupta, D. Agrawal, and A. El Abbadi. A
Peer-to-peer Framework for Caching Range Queries. ICDE
2004.
[29] Julian L. Simon. Resampling: The New Statistics. Second
Edition published October 1997.
[30] I. Stoica, R. Morris, D. Karger, M. Kaashoek, and H.
Balakrishnan. Chord: A scalable Peer-to-peer Lookup Service
for Internet Applications. SIGCOMM 2001.
[31] D. Zeinalipour-Yazti, V. Kalogeraki, D. Gunopulos.
Exploiting locality for scalable information retrieval in peer-to-
peer networks. Inf. Syst. 30(4): 277-298 (2005).
Figure 14: Effects of clustering on the sample size for the SUM technique. Plot: Clustering vs. Sample Size (Z=0.2, Req Acc=0.10, j=10); series: Synthetic, Gnutella.
Figure 15: Effects of clustering on the error percentage for the median technique. Plot: Clustering vs. Error % (Z=0.2, Req Acc=0.10, j=10); series: Synthetic, Gnutella.
Figure 16: Effects of clustering on the sample size for the median technique. Plot: Clustering vs. Sample Size (Z=0.2, Req Acc=0.10, j=10); series: Synthetic, Gnutella.