Efficient Approximate Query Processing in Peer-to-Peer Networks
Benjamin Arai, Student Member, IEEE, Gautam Das, Dimitrios Gunopulos, Member, IEEE, and
Vana Kalogeraki, Member, IEEE
Abstract—Peer-to-peer (P2P) databases are becoming prevalent on the Internet for distribution and sharing of documents,
applications, and other digital media. The problem of answering large-scale ad hoc analysis queries, for example, aggregation queries,
on these databases poses unique challenges. Exact solutions can be time consuming and difficult to implement, given the distributed
and dynamic nature of P2P databases. In this paper, we present novel sampling-based techniques for approximate answering of ad
hoc aggregation queries in such databases. Computing a high-quality random sample of the database efficiently in the P2P
environment is complicated due to several factors: the data is distributed (usually in uneven quantities) across many peers; within each
peer, the data is often highly correlated; and, moreover, even collecting a random sample of the peers is difficult to accomplish. To
counter these problems, we have developed an adaptive two-phase sampling approach based on random walks of the P2P graph, as
well as block-level sampling techniques. We present extensive experimental evaluations to demonstrate the feasibility of our proposed
solution.
Index Terms—Approximation methods, computer networks, distributed databases, distributed database query processing, distributed
estimation, database systems, distributed systems.
1 INTRODUCTION
1.1 Peer-to-Peer (P2P) Databases
The P2P network model is quickly becoming the preferred
medium for file sharing and distributing data over the
Internet. A P2P network consists of numerous peer nodes
that share data and resources with other peers on an equal
basis. Unlike traditional client-server models, no central
coordination exists in a P2P system; thus, there is no central
point of failure. P2P networks are scalable, fault tolerant,
and dynamic, and nodes can join and depart the network
with ease. The most compelling applications on P2P systems
to date have been file sharing and retrieval. For example,
P2P systems such as Napster [30], Gnutella [17], KaZaA [22],
and Freenet [15] are principally known for their file sharing
capabilities, such as the sharing of songs, music, and
so on. Furthermore, researchers have been interested in
extending sophisticated information retrieval (IR) techniques
such as keyword search and relevance retrieval to P2P
databases.
1.2 Aggregation Queries
In this paper, however, we consider a problem on P2P
systems that is different from the typical search and
retrieval applications. As P2P systems mature beyond file
sharing applications and start getting deployed in increasingly
sophisticated e-business and scientific environments,
the vast amount of data within P2P databases poses a
different challenge that has not been adequately researched
thus far, that is, how aggregation queries on such databases
can be answered. Aggregation queries have the potential of
finding applications in decision support, data analysis, and
data mining. For example, millions of peers across the
world may be cooperating on a grand experiment in
astronomy, and astronomers may be interested in asking
decision support queries that require the aggregation of
vast amounts of data covering thousands of peers. In
addition, there is real-world value for aggregation queries in
network monitoring scenarios such as temperature and
anomaly detection in sensor networks [39], Intrusion
Detection Systems [26], [29], and application signature
analysis [35] in P2P networks. Sensor networks can directly
benefit from aggregation of traffic analysis data by offering
a more efficient means of computing various network-based
aggregates such as the average message size and maximum
data throughput within the network, with minimal energy
consumption and decreased response times.
We make the problem more precise as follows: Consider
a single table T that is distributed over a P2P system; that is,
the peers store horizontal partitions (of varying sizes) of this
table. An aggregation query such as the following may be
introduced at any peer (this peer is henceforth called the
query node).
Aggregation query
SELECT Agg-Op(Col) FROM T WHERE selection-condition
In the above query, the Agg-Op may be any aggregation
operator such as SUM, COUNT, AVG, and so on,
Col may be any numeric measure column of T or even
an expression involving multiple columns, and the
selection condition decides which tuples should be
involved in the aggregation. Although our main focus
is on the above standard SQL aggregation operators, we
also briefly discuss other interesting statistical estimators
such as medians, quantiles, histograms, and distinct
values.
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 19, NO. 7, JULY 2007
• B. Arai, D. Gunopulos, and V. Kalogeraki are with the Computer Science
and Engineering Department, University of California, Riverside,
Riverside, CA 92507. E-mail: {barai, dg, vana}@cs.ucr.edu.
• G. Das is with the Computer Science and Engineering Department,
University of Texas at Arlington, Arlington, TX 76019.
E-mail: gdas@cse.uta.edu.
Manuscript received 28 Feb. 2006; revised 16 Nov. 2006; accepted 24 Jan.
2007; published online 5 Feb. 2007.
For information on obtaining reprints of this article, please send e-mail to:
tkde@computer.org, and reference IEEECS Log Number TKDE-0105-0206.
Digital Object Identifier no. 10.1109/TKDE.2007.1064.
1041-4347/07/$25.00 © 2007 IEEE. Published by the IEEE Computer Society.
Although aggregation queries have been heavily inves-
tigated in traditional databases, it is not clear that these
techniques will easily adapt to the P2P domain. For
example, decision support techniques such as online
analytical processing (OLAP) commonly employ materi-
alized views; however, the distribution and management of
such views appear difficult in such a dynamic and
decentralized domain [21], [13]. In contrast, the alternative
of answering aggregation queries at runtime “from scratch”
by crawling and scanning the entire P2P repository is
prohibitively slow.
1.3 Approximate Query Processing (AQP)
Fortunately, it has been observed that in most typical data
analysis and data mining applications, timeliness and
interactivity are more important considerations than accuracy;
thus, data analysts are often willing to overlook small
inaccuracies in the answer, provided that the answer can be
obtained fast enough. This observation has been the
primary driving force behind the recent development of
AQP techniques for aggregation queries in traditional
databases and decision support systems [11], [3], [8], [10],
[1], [16], [7], [9], [27]. Numerous AQP techniques have been
developed: The most popular ones are based on random
sampling, where a small random sample of the rows of the
database is drawn, the query is executed on this small
sample, and the results are extrapolated to the whole
database. Besides being simple to implement, random
sampling has the compelling advantage that, in addition to
an estimate of the aggregate, one can also provide
confidence intervals for the error that hold with high
probability. Broadly, two types of sampling-based approaches
have been investigated: 1) precomputed samples,
where a random sample is precomputed by scanning the
database and the same sample is reused for several queries
and 2) online samples, where the sample is drawn “on the
fly” upon encountering a query.
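To make the sampling-based estimate concrete, the following minimal Python sketch draws a uniform random sample from a centralized table, extrapolates a SUM to the whole table, and reports a normal-approximation confidence interval. The function name `approximate_sum`, the z-value of 1.96 (roughly 95 percent confidence), and the synthetic table are illustrative assumptions and are not taken from the cited AQP systems.

```python
import math
import random
import statistics

def approximate_sum(rows, sample_size, rng, z=1.96):
    """Estimate SUM(rows) from a uniform random sample drawn without replacement.

    Returns (estimate, half_width). The interval estimate +/- half_width is a
    normal-approximation confidence interval (z=1.96 is roughly 95 percent).
    """
    n_total = len(rows)
    sample = rng.sample(rows, sample_size)
    mean = statistics.fmean(sample)
    # Standard error of the sample mean, with a finite-population correction.
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    fpc = math.sqrt((n_total - sample_size) / (n_total - 1))
    estimate = n_total * mean                    # extrapolate to the whole table
    half_width = z * n_total * std_err * fpc
    return estimate, half_width

rng = random.Random(7)
table = [rng.uniform(0, 100) for _ in range(100_000)]   # synthetic measure column
estimate, half_width = approximate_sum(table, 1_000, rng)
print(f"approximate SUM = {estimate:.0f} +/- {half_width:.0f}")
print(f"exact SUM       = {sum(table):.0f}")
```

Only 1 percent of the rows are read here, which is the trade-off between accuracy and response time that motivates AQP.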
1.4 Goal of the Paper
In this paper, we approach the challenges of decision
support and data analysis on P2P databases in the same
manner; that is, we investigate what it takes to enable AQP
techniques on such distributed databases.
Goal of the Paper: Approximating Aggregation Queries in
P2P Networks.
Given an aggregation query and a desired error bound at a query
node peer, compute with “minimum cost” an approximate
answer to this query that satisfies the error bound.
The cost of query execution in traditional databases is
usually a straightforward concept: It is either I/O cost or
CPU cost or a combination of the two. In fact, most AQP
approaches simplify this concept even further by just trying
to minimize the number of tuples in the sample, thus
making the assumption that the sample size is directly
related to the cost of query execution. However, in P2P
networks, the cost of query execution is a combination of
several quantities such as the number of participating peers,
the bandwidth consumed (that is, the amount of data
shipped over the network), the number of messages exchanged,
the latency (the time to propagate the query across
multiple peers and receive replies), the I/O cost of accessing
data from participating peers, the CPU cost of processing
data at participating peers, and so on. In this paper, we
treat latency as the primary quantity to minimize, although
our technique could easily be extended to deal with other
cost metrics.
1.5 Challenges
Let us now discuss what it takes for sampling-based AQP
techniques to be incorporated into P2P systems. We first
observe that two main approaches have emerged for
constructing P2P networks today: structured and unstructured.
Structured P2P networks (such as Pastry [33] and
Chord [37]) are organized in such a way that data items are
located at specific nodes in the network, and nodes
maintain some state information to enable efficient retrieval
of the data. This organization maps data items to particular
nodes and assumes that all nodes are equal in terms of
resources, which can lead to bottlenecks and hot spots. Our
work focuses on unstructured P2P networks, which make
no assumption about the location of the data items at the
nodes, and in which nodes are able to join the system at
random times and depart without a priori notification.
Several recent efforts have demonstrated that unstructured
P2P networks can be used efficiently for multicast,
distributed object location, and information retrieval
[12], [27], [38].
For AQP in unstructured P2P systems, attempting to
adapt the approach of precomputed samples is impractical
for several reasons: 1) It involves scanning the entire P2P
repository, which is difficult, 2) since no centralized storage
exists, it is not clear where the precomputed sample should
reside, and 3) the very dynamic nature of P2P systems
indicates that precomputed samples will quickly become
stale, unless they are frequently refreshed.
Thus, the approach taken in this paper is to investigate
the feasibility of online sampling techniques for AQP on
P2P databases. However, online sampling approaches in
P2P databases pose their own set of challenges. To illustrate
these challenges, consider the problem of attempting to
draw a uniform random sample of n tuples from such a P2P
database containing a total of N tuples. To ensure a true
uniform random sample, our sampling procedure should
be such that each subset of n tuples out of N is
equally likely to be drawn. However, this is an extremely
challenging problem for two reasons:
• Picking even a set of uniform random peers is a
difficult problem, as the query node does not have
the Internet Protocol (IP) addresses of all peers in the
network. This is a well-known problem that other
researchers have tackled (in different contexts) by
using random-walk techniques on the P2P graph [16],
[24], [4]. That is, a Markovian random walk is
initiated from the query node and picks adjacent
peers to visit with equal probability; under
certain connectivity properties, the random walk is
expected to rapidly reach a stationary distribution. If
the graph is badly clustered with small cuts, then
this affects the speed at which the walk converges.
Moreover, even after convergence, the stationary
distribution is not uniform; in fact, it is skewed
toward giving higher probabilities to nodes with
larger degrees in the P2P graph.
• Even if we could select a peer (or a set of peers)
uniformly at random, it does not make the problem
of selecting a uniform random set of tuples much easier.
This is because visiting a peer at random has an
associated overhead; thus, it makes sense to select
multiple tuples at random from this peer during the
same visit. However, this may compromise the
quality of the final set of tuples retrieved, as the
tuples within the same peer are likely to be correlated.
For example, if the P2P database contained listings
of, say, movies, then the movies stored on a specific
peer are likely to be of the same genre. This
correlation can be reduced if we select just one tuple
at random from a randomly selected peer; however,
the overheads associated with such a scheme will be
intolerable.
1.6 Our Approach
We briefly describe the framework of our approach.
Essentially, we abandon trying to pick true uniform random
samples of the tuples, as such samples are likely to be
extremely impractical to obtain. Instead, we consider an
approach where we are willing to work with skewed samples,
provided that we can accurately estimate the skew during
the sampling process. To get the accuracy in the query
answer desired by the user, our skewed samples can be
larger than the size of a corresponding uniform random
sample that delivers the same accuracy; however, our
samples are much more cost efficient to generate.
Although we do not advocate any significant preprocessing,
we assume that certain aspects of the P2P graph are
known to all peers, such as the average degree of the nodes,
a good estimate of the number of peers in the system,
certain topological characteristics of the graph structure,
and so on. Estimating these parameters via preprocessing
is an interesting problem in its own right; however, we
omit these details from this paper. The main point that we
make is that these parameters are relatively slow to change
and thus do not have to be estimated at query time: It is the
data contents of peers that change more rapidly; hence, the
random sampling process that picks a representative
sample of tuples has to be done at runtime.
Our approach has two major phases. In the first phase,
we initiate a fixed-length random walk from the query
node. This random walk should be long enough to ensure
that the visited peers represent a close sample from the
underlying stationary distribution (the appropriate length
of such a walk is determined in a preprocessing step). We
then retrieve certain information from the visited peers,
such as the number of tuples and the aggregate (for
example, SUM, COUNT, AVG, and so forth) of the tuples that
satisfy the selection condition, and send this information
back to the query node. This information is then analyzed
at the query node to determine the skewed nature of the
data that is distributed across the network, such as the
variance of the aggregates of the data at peers, the amount
of correlation between tuples that exists within the same
peers, the variance in the degrees of individual nodes in
the P2P graph (recall that the degree has a bearing on the
probability that a node will be sampled by the random
walk), and so on. Once this data has been analyzed at the
query node, an estimate is made of how many more
samples are required (and in what way these
samples should be collected) so that the original query can be
optimally answered within the desired accuracy, with high
probability. For example, the first phase may recommend
that the best way to answer this query is to visit $m'$ more
peers and, from each peer, randomly sample $t$ tuples. We
mention that the first phase is not overly driven by
heuristics. Instead, it is based on underlying theoretical
principles such as the theory of random walks [16], [24], [4]
as well as statistical techniques such as cluster sampling,
block-level sampling, and cross validation [11], [18].
The second phase is then straightforward: A random
walk is reinitiated, and tuples are collected according to the
recommendations made by the first phase. Effectively, the
first phase is used to “sniff” the network and determine an
optimal-cost “query plan,” which is then implemented in
the second phase. For certain aggregates such as COUNT
and SUM, further optimizations may be achieved by
pushing the selections and aggregations to the peers; that
is, the local aggregates instead of raw samples are returned
to the query node, which are then composed into a final
answer.
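The division of labor between the two phases can be sketched in Python as follows. This is only an illustration under stated assumptions: a textbook normal-approximation sample-size rule stands in for the cross-validation analysis mentioned above, each visited peer is assumed to contribute one skew-corrected estimate of the answer, and the names `plan_second_phase`, `two_phase_estimate`, and `draw_peer_value` are hypothetical.

```python
import math
import random
import statistics

def plan_second_phase(first_phase_values, target_half_width, z=1.96):
    """Return how many additional peers to visit after the first phase.

    first_phase_values holds one per-peer estimate of the query answer for
    each peer visited in phase one (assumed already corrected for skew).
    Assumption: a normal-approximation sample-size rule replaces the
    cross-validation analysis described in the text.
    """
    m = len(first_phase_values)
    spread = statistics.stdev(first_phase_values)
    needed = math.ceil((z * spread / target_half_width) ** 2)
    return max(0, needed - m)

def two_phase_estimate(draw_peer_value, m_first, target_half_width):
    """draw_peer_value() models visiting one more peer on the random walk."""
    values = [draw_peer_value() for _ in range(m_first)]       # phase one: sniff
    extra = plan_second_phase(values, target_half_width)       # the "query plan"
    values += [draw_peer_value() for _ in range(extra)]        # phase two: execute
    return statistics.fmean(values), len(values)

rng = random.Random(1)
answer, peers_visited = two_phase_estimate(lambda: rng.gauss(1000, 200), 20, 25.0)
print(f"estimate {answer:.1f} after visiting {peers_visited} peers")
```

The more the first-phase values vary from peer to peer, the more peers the plan asks the second phase to visit.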
In addition, we explore in-network techniques for
dissemination of values throughout the network. We
accomplish this through a hybrid technique building upon
the Gossip protocol. A Gossip protocol is executed in
rounds. In each round, participating peers select adjacent
peers uniformly at random with which to share information. The Gossip
protocol exploits a communication mechanism where peers
diffuse local aggregates to adjacent peers. This process
relies heavily upon mass conservation, which states that
the average of all of the sums of individual peers is the
correct average, and the sum of all of the weights is n [23].
In general, as the number of passes of the Gossip protocol
increases, values of participating peers are increasingly
diffused through the network (in our case, the local groups);
therefore, sampling diffused values provides a better
representation of the values contained in the network than
sampling a single peer.
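The mass conservation property can be illustrated with a short Python sketch of a push-sum style Gossip round, in which every peer keeps a (sum, weight) pair, retains half of it, and pushes the other half to one adjacent peer chosen uniformly at random. We assume this is the variant that the description above refers to; the function name `push_sum`, the synchronous rounds, and the six-peer complete overlay are illustrative choices, not the hybrid technique of this paper.

```python
import random

def push_sum(values, neighbors, rounds, rng):
    """Run synchronous push-sum gossip; return the final (sums, weights) lists.

    values: initial local value per peer; neighbors: adjacency lists.
    Each peer starts with the pair (value, 1) and estimates the global
    average as sum / weight.
    """
    n = len(values)
    sums = list(values)
    weights = [1.0] * n
    for _ in range(rounds):
        new_sums = [0.0] * n
        new_weights = [0.0] * n
        for i in range(n):
            j = rng.choice(neighbors[i])         # adjacent peer, uniformly at random
            half_s, half_w = sums[i] / 2, weights[i] / 2
            # Keep one half locally and push the other half to peer j.
            new_sums[i] += half_s
            new_weights[i] += half_w
            new_sums[j] += half_s
            new_weights[j] += half_w
        sums, weights = new_sums, new_weights
    return sums, weights

values = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
complete = [[j for j in range(6) if j != i] for i in range(6)]
sums, weights = push_sum(values, complete, rounds=100, rng=random.Random(3))
estimates = [s / w for s, w in zip(sums, weights)]
print(sum(sums), sum(weights))   # total mass and total weight are conserved
print(estimates)                 # every peer's estimate approaches the average
```

Because no mass is created or lost in a round, the totals of the sums and of the weights never change, which is why the ratios at the individual peers can all converge to the true average.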
The contributions of this paper are summarized as
follows:
• We introduce the important problem of AQP in P2P
databases, which is likely to be of increasing
significance in the future.
• The problem is analyzed in detail, and its unique
challenges are comprehensively discussed.
• A hybrid sampling technique is presented that
maximizes per-peer in-network computation by
building upon the Gossip protocol.
• Adaptive two-phase sampling-based approaches
are proposed based on well-founded theoretical
principles.
• We present an adaptive approach for computing
aggregates such as COUNT, SUM, AVERAGE, and
MEDIAN.
• The results of extensive experiments are presented,
which demonstrate the importance of the problem
and the validity of our approaches.
The rest of this paper is organized as follows: In Section 2,
we describe related work. We provide the foundation of our
approach in Section 3, the algorithm in Section 4, and the
hybrid solution to random sampling in Section 5. In
Section 6, we present the experimental results, and we
conclude in Section 7.
2 RELATED WORK
P2P systems are becoming very popular because they
provide an efficient mechanism for building large scalable
systems [28]. Most recent work has focused on Distributed
Hash Tables (DHTs) [32], [33], [37]. Such techniques
provide scalability advantages over unstructured systems
(such as Gnutella); however, they are not flexible enough
for some applications, especially when nodes join or leave
the network frequently or change their connections often.
Recent work has proposed different techniques for
exact query processing in P2P systems. Most proposals
use structured overlay networks (DHTs), such as CAN,
Pastry, and Chord. Such techniques include PIER [19],
DIM [27], or Pastry [33], and since they use DHTs, they
have a different focus and are not directly applicable to
our case. A hybrid system, Mercury [4], using routing
hubs to answer range queries was also recently proposed.
This system is also designed to provide exact answers to
range queries. Exact solutions to OLAP queries have been
considered in [13] and [21].
Methods to sample random peers in P2P networks have
been proposed in [16], [24], and [4]. These techniques use
Markov-chain random walks to select random peers from
the network. Their results show that when certain structural
properties of the graph are known or can be estimated (such
as the second eigenvalue of the graph), the parameters of
the walk can be set so that a representative sample of the
stationary distribution can be collected with high probability.
In [4], it is shown that if the graph is an expander,
then a random walk converges to the stationary distribution
in $O(\log M)$ steps, where $M$ is the number of peers in the
network.
There are known techniques for computing approximate
aggregates in distributed settings (most notably, the Gossip
protocol [5], [6], [23]). The technique works generally as a
preprocessing step where all peers in a network attempt to
mix data among adjacent peers, eventually converging
upon a single value. The inability to contact all nodes in the
network makes it exceedingly difficult to Gossip in the
traditional sense.
Our work also generalizes to the P2P domain
previous work on AQP in relational databases. Recent
work in [11], [3], [8], [10], [1], [16], [7], [9], and [27] has
developed powerful techniques for employing sampling in
the database engine to approximate aggregation queries
and to estimate database statistics. Recent techniques have
focused on providing formal foundations and algorithms
for block-level sampling and are thus most relevant to our
work. The objective in block-level sampling is to derive a
representative sample by only randomly selecting a set of
disk blocks of a relation [11], [18]. Specifically, [11] presents
a technique for histogram estimation, which uses cross
validation to identify the amount of sampling required for a
desired accuracy level. In addition, [18] considers the
problem of deciding what percentage of a disk block
should be included in the sample, given a cost model.
3 FOUNDATIONS OF OUR APPROACH
In this section, we discuss the principles behind our
approach for AQP on P2P databases. Our actual algorithm
is described in Section 4.
3.1 The Peer-to-Peer Model
We assume an unstructured P2P network represented as a
graph $G = (P, E)$, with a vertex set $P = \{p_1, p_2, \ldots, p_M\}$ and
an edge set $E$. The vertices in $P$ represent the peers in the
network, and the edges in $E$ represent the connections
between the vertices in $P$. Each peer $p$ is identified by the
processor’s IP address and a port number ($IP_p$ and $port_p$).
The peer $p$ is also characterized by the capabilities of the
processor on which it is located, including its CPU speed
$p_{cpu}$, memory bandwidth $p_{mem}$, and disk space $p_{disk}$. The
node also has a limited amount of bandwidth to the
network, denoted by $p_{band}$. In unstructured P2P networks, a
node becomes a member of the network by establishing a
connection with at least one peer currently in the network.
Each node maintains a small number of connections with its
peers: The number of connections is typically limited by the
resources at the peer. We denote the number of connections
that a peer is maintaining by $p_{conn}$.
The peers in the network use the Gnutella P2P protocol
to communicate. The Gnutella P2P protocol supports four
message types (Ping, Pong, Query, and Query_Hit), of
which the Ping and Pong messages are used to establish
connections with other peers, and the Query and Query_Hit
messages are used to search in the P2P network. Gnutella,
however, uses a naive Breadth-First Search (BFS) technique
in which queries are propagated to all the peers in the
network and thus consumes excessive network and processing
resources and results in poor performance. Our
approach, on the other hand, uses a probabilistic search
algorithm based on random walks. The key idea is that each
node forwards a query message, called a walker, randomly to
one of its adjacent peers. This technique is shown to
improve the search efficiency and reduce unnecessary
traffic in the P2P network.
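The walker-based forwarding described in this section can be sketched in a few lines of Python. The overlay below is a made-up four-peer example, and `random_walk` is a hypothetical helper rather than a Gnutella message handler; it only shows that each hop sends one message to one adjacent peer instead of flooding all of them.

```python
import random

def random_walk(neighbors, query_node, hops, rng):
    """Forward a walker for `hops` steps and return the peers it visits.

    neighbors: dict mapping each peer id to the list of its adjacent peers.
    At every step the current peer forwards the walker to one adjacent peer
    chosen uniformly at random, instead of flooding all of them as BFS does.
    """
    visited = []
    current = query_node
    for _ in range(hops):
        current = rng.choice(neighbors[current])
        visited.append(current)
    return visited

overlay = {"a": ["b", "c"], "b": ["a", "c", "d"], "c": ["a", "b"], "d": ["b"]}
path = random_walk(overlay, "a", hops=10, rng=random.Random(0))
print(path)   # ten visited peers, one message per hop
```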
3.2 Query Cost Measures
As mentioned in Section 1, the cost of the execution of a
query in P2P databases is more complicated than equivalent
cost measures in traditional databases. The primary cost
measure that we consider is latency, which is the time that it
takes to propagate the query across multiple peers and
receive replies at the query node. In our algorithm, latency
can be approximated by the number of peers that
participate in the random walk. This measure is appropriate
for our algorithm because it performs a single random walk
starting from the query node. Thus, latency becomes
proportional to the total number of visited peers in the
random walk.
To see this, we note that the aggregation operator (as
well as the selection filter and IP address of the query node)
can be pushed to each visited peer. Once a peer is visited by
the algorithm, the peer can be instructed to simply execute
the original query on its local data and send only the
aggregate and the degree of the node back to the query
node, from which the query node can reconstruct the
overall answer. Moreover, this information can be sent
directly without necessitating any intermediate hops, as the
visited peer knows the IP address of the query node from
which the query originated. This is reasonable, considering
that the IP address can be pushed to visited peers along
with the aggregation operator and that P2P networks such as
KaZaA run on top of a TCP/IP layer, making it feasible to
make direct connections with peers. Thus, the bandwidth
requirement of such an approach is uniformly very small
for all visited peers: They are not required to send more
voluminous raw data (for example, all or parts of the local
database) back to the query node.
In approximating latency by the number of peers
participating in the random walk, we also make the implicit
assumption that the overhead of visiting peers dominates
the costs of local computations (such as execution of the
original query on the local database). This is, of course, true
if the local databases are fairly small. To ensure that the
local computations remain small even if local databases are
large, our approach in such cases is to execute the
aggregation query only on a small fixed-sized random
sample of the local data (that is, we subsample from the
peer), scale the result to the entire local database, and send
the scaled aggregate back to the query node. This way, we
ensure that the local computations are uniformly small
across all visited peers.
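A visited peer's reply under this scheme can be sketched as follows. The sketch assumes a SUM query over a single numeric column; the function name `local_reply`, its parameters, and the in-memory list standing in for the local database are all hypothetical.

```python
import random

def local_reply(local_rows, degree, predicate, sample_size, rng):
    """Compute the reply a visited peer sends directly to the query node.

    Runs SUM over the rows matching `predicate`. If the local database holds
    more than `sample_size` rows, only a fixed-size random subsample is
    scanned and the result is scaled up to the full local database.
    Returns (scaled_local_sum, degree).
    """
    if len(local_rows) <= sample_size:
        scanned, scale = local_rows, 1.0
    else:
        scanned = rng.sample(local_rows, sample_size)
        scale = len(local_rows) / sample_size
    partial = sum(value for value in scanned if predicate(value))
    return scale * partial, degree

rng = random.Random(11)
small_peer = [1, 2, 3, 4]                              # scanned in full
large_peer = [rng.uniform(0, 10) for _ in range(50_000)]  # subsampled
print(local_reply(small_peer, 3, lambda v: v > 1, 100, rng))
print(local_reply(large_peer, 5, lambda v: v > 1, 100, rng))
```

Both peers scan at most 100 rows, so the local work per visit is bounded regardless of how large the local database is, and only a pair of numbers travels back to the query node.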
In contrast, suppose that instead of a fixed sized sample,
we decided on sampling a fixed fraction of a visited peer’s
local database. The main problem with this approach is that
it complicates the query cost model. Now, local processing
costs cannot be ignored and, thus, latency cost of executing
a query cannot be modeled as simply being proportional to
the number of visited peers (or even the overall number of
sampled tuples). The latency now becomes a complex (and,
perhaps, system dependent) function of the cost of visiting
peers and local query processing costs. The consequence of
a complicated latency model is that it now becomes difficult
to have a principled two-phase approach to solving the
problem because the first phase now has the task of
determining how many peers should be sampled in the
second phase so that the target accuracy can be achieved
with minimum latency. Moreover, even if the first phase
can somehow determine the number of peers to visit in the
second phase, the actual latency cost of the second phase is
unpredictable. It depends on the type of peers we visit, as
peers with large databases will increase latency, whereas
peers with small databases will decrease latency.
In summary, for SUM and COUNT aggregates, latency is
shown to be proportional to the number of peers participat-
ing in the random walk. Thus, our goal is to minimize the
number of peers that must be visited in order to arrive at an
approximate answer with the desired accuracy.
3.3 Random Walk in Graphs
In seeking a random sample of the P2P database, we have to
overcome the subproblem of how a random sample of the
peers themselves can be collected. Unrepresentative samples
of peers can quickly skew results, producing erroneous
aggregation statistics. Sampling in a nonhierarchical
decentralized P2P network presents several obstacles in obtaining
near-uniform random samples. This is because no peer
(including the query node) knows the IP addresses of all
other peers in the network: They are only aware of their
immediate neighbors. If this were not the case, then, clearly,
the query node could locally generate a random subset of
IP addresses from among all the IP addresses and visit the
appropriate peers directly. We note that this problem is not
encountered in traditional databases, as even if one has to
resort to cluster (or block-level) sampling such as in [11] and
[18], obtaining an efficient sample of the blocks themselves
is straightforward.
This problem has been recognized in other contexts (see
[16] and the references therein), and interesting solutions
based on Markov-chain random walks have been proposed.
We briefly review such approaches here. A Markov-chain
random walk is a procedure that is initiated at the query
node, and for each visited peer, the next peer to visit is
selected with equal probability from among its neighbors
(and itself and, thus, self loops are allowed). It is well
known that if this walk is carried out long enough, then the
eventual probability of reaching any peer p will reach a
stationary distribution. To make this more precise, let
$P = \{p_1, p_2, \ldots, p_M\}$ be the entire set of peers, let $E$ be the entire
set of edges, and let the degree of a peer $p$ be $\deg(p)$. Then,
the probability of any peer $p$ in the stationary distribution is
$$\mathrm{prob}(p) = \frac{\deg(p)}{2|E|}.$$
It is important to note that the above distribution is not
uniform: The probability of each peer is proportional to its
degree. Thus, even if we can efficiently achieve this
distribution, we will have to compensate for the fact that
the distribution is skewed as above if we have to use
samples drawn from it for answering aggregation queries.
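The degree-proportional skew, and one standard way to compensate for it, can be checked numerically on a toy graph. The Python sketch below makes several assumptions that should be read as ours: the walk has no self loops, the four-peer graph and the per-peer values are invented, and the compensation shown is a generic inverse-probability (Hansen-Hurwitz style) estimator rather than the specific estimator developed later in the paper.

```python
def walk_distribution(neighbors, start, hops):
    """Distribution of a simple random walk after `hops` steps from `start`."""
    dist = {p: 0.0 for p in neighbors}
    dist[start] = 1.0
    for _ in range(hops):
        nxt = {p: 0.0 for p in neighbors}
        for p, mass in dist.items():
            for q in neighbors[p]:
                nxt[q] += mass / len(neighbors[p])   # equal probability per neighbor
        dist = nxt
    return dist

def estimate_total(sampled_peers, local_value, neighbors):
    """Estimate the sum of local_value over all peers from a degree-biased sample.

    Each sampled value is divided by prob(p) = deg(p) / (2|E|), which cancels
    the bias toward high-degree peers.
    """
    two_e = sum(len(adj) for adj in neighbors.values())   # equals 2|E|
    weighted = [local_value[p] * two_e / len(neighbors[p]) for p in sampled_peers]
    return sum(weighted) / len(weighted)

# Triangle 0-1-2 with a pendant peer 3 attached to peer 2, so |E| = 4.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
after_walk = walk_distribution(graph, start=0, hops=200)
by_formula = {p: len(adj) / 8 for p, adj in graph.items()}
print(after_walk)
print(by_formula)

local_sums = {0: 10.0, 1: 20.0, 2: 30.0, 3: 40.0}
print(estimate_total([0, 0, 1, 1, 2, 2, 2, 3], local_sums, graph))
```

In the last call each peer appears in the sample exactly as often as its degree, which is the idealized outcome of eight draws from the stationary distribution, and the weighted estimate then recovers the true total of the local sums.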
The main issue that has concerned researchers has been
the speed of convergence.¹ Most results have pointed to certain
broad connectivity properties that the graph should possess
for this to happen. In particular, it has been shown that if the
transition probabilities that govern the random walk on the
P2P graph are modeled as an $M \times M$ matrix, then the second
eigenvalue² plays an important role in these convergence
results. The second eigenvalue describes connectivity properties
of graphs, in particular, whether the graph has a small
cut size,³ which would adversely impact the length of the
walk necessary to arrive at convergence.
As the results in [16] show, if the P2P graph is well
connected (that is, it has a small second eigenvalue, and the
minimum degree of the graph is large), then the random
walk quickly converges as it “loses memory” rapidly. In
fact, under certain specific conditions of connectedness (for
example, expander graphs that are common in P2P networks),
convergence can be achieved in $O(\log M)$ steps.
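The effect of connectivity on the speed of convergence can be observed on two six-peer overlays. The following Python sketch counts the hops needed before the walk's distribution is within a chosen total variation distance of the stationary distribution; the threshold of 0.01, the two example graphs, the absence of self loops, and the helper name `hops_to_converge` are illustrative assumptions.

```python
def hops_to_converge(neighbors, start, eps, max_hops=10_000):
    """Smallest h such that the walk's distribution after h hops is within
    total variation distance eps of deg(p) / (2|E|)."""
    two_e = sum(len(adj) for adj in neighbors.values())
    target = {p: len(adj) / two_e for p, adj in neighbors.items()}
    dist = {p: 0.0 for p in neighbors}
    dist[start] = 1.0
    for h in range(max_hops + 1):
        tv = 0.5 * sum(abs(dist[p] - target[p]) for p in neighbors)
        if tv <= eps:
            return h
        nxt = {p: 0.0 for p in neighbors}
        for p, mass in dist.items():
            for q in neighbors[p]:
                nxt[q] += mass / len(neighbors[p])
        dist = nxt
    raise RuntimeError("walk did not converge within max_hops")

# Well-connected overlay: every pair of the six peers is adjacent.
complete = {i: [j for j in range(6) if j != i] for i in range(6)}
# Badly clustered overlay: two triangles joined by the single edge 2-3.
barbell = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
print(hops_to_converge(complete, 0, 0.01), hops_to_converge(barbell, 0, 0.01))
```

The overlay with the single-edge cut needs a longer walk than the complete overlay, because probability mass can only leak across the cut through one edge per hop.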
In our case, recall from Section 1 that we assume that we
are allowed a certain amount of preprocessing to determine
various properties of the P2P graph that will be useful at
query time (under the assumption that the graph topology
changes less rapidly compared to the data content at the
peers). The speed of convergence of a random walk in this
graph is determined in this preprocessing step, in addition
to other useful properties such as the number of nodes M,
the number of edges $|E|$, and so on. With respect to speed of
1. We define speed of convergence as how many hops h are necessary
before one gets close to the stationary distribution.
2. The second eigenvalue tells how well the peers within the network are
connected, that is, expander versus clustered sets of peers.
3. Given a partition of the peers into two sets A and B, any edge
crossing from A to B is crossing the cut. The cut size is the total
number of edges crossing between A and B.