Conference Paper

Approximating Aggregation Queries in Peer-to-Peer Networks

UC Riverside
DOI: 10.1109/ICDE.2006.23 Conference: Data Engineering, 2006. ICDE '06. Proceedings of the 22nd International Conference on
Source: DBLP

ABSTRACT Peer-to-peer databases are becoming prevalent on the Internet for distribution and sharing of documents, applications, and other digital media. The problem of answering large scale, ad-hoc analysis queries ― e.g., aggregation queries ― on these databases poses unique challenges. Exact solutions can be time consuming and difficult to implement given the distributed and dynamic nature of peer-to-peer databases. In this paper we present novel sampling-based techniques for approximate answering of ad-hoc aggregation queries in such databases. Computing a high-quality random sample of the database efficiently in the P2P environment is complicated due to several factors ― the data is distributed (usually in uneven quantities) across many peers, within each peer the data is often highly correlated, and moreover, even collecting a random sample of the peers is difficult to accomplish. To counter these problems, we have developed an adaptive two-phase sampling approach, based on random walks of the P2P graph as well as block-level sampling techniques. We present extensive experimental evaluations to demonstrate the feasibility of our proposed solutio

Download full-text

Full-text

Available from: Benjamin Arai, Nov 19, 2014
0 Followers
 · 
120 Views
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Estimating the global data distribution in Peer-to-Peer (P2P) networks is an important issue and has yet to be well addressed. It can benefit many P2P applications, such as load balancing analysis, query processing, and data mining. Inspired by the inversion method for random variate generation, in this paper we present a novel model named distribution-free data density estimation for dynamic ring-based P2P networks to achieve high estimation accuracy with low estimation cost regardless of distribution models of the underlying data. It generates random samples for any arbitrary distribution by sampling the global cumulative distribution function and is free from sampling bias. In P2P networks, the key idea for distribution-free estimation is to sample a small subset of peers for estimating the global data distribution over the data domain. Algorithms on computing and sampling the global cumulative distribution function based on which global data distribution is estimated are introduced with detailed theoretical analysis. Our extensive performance study confirms the effectiveness and efficiency of our methods in ring-based P2P networks.
    01/2012; DOI:10.1109/ICDE.2012.19
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Peer-to-peer database systems (P2PDBs) aim at providing database services with node autonomy, high availability and loose coupling between participating nodes by building the DBMS on top of a peer-to-peer network. A key feature of current peer-to-peer systems is resilience to churn in the overlay network layer. A major challenge in P2PDBs is to provide similar robustness in the data and query processing layer. In this paper we in particular describe how aggrega- tion queries in P2PDBs can be handled in order to reduce the impact of churn on accuracy of results. We perform a formal study of data loss and accuracy of such queries, and describe new approaches that increase the accuracy of aggregation queries in P2PDBs under churn.
    12th International Database Engineering and Applications Symposium (IDEAS 2008), September 10-12, 2008, Coimbra, Portugal; 01/2008
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Peer-to-Peer networks have become very popular on the Internet, with millions of peers all over the world sharing large volumes of data. In the assistive healthcare sector, it is likely that P2P networks will develop that interconnect and allow the controlled sharing of patient databases of various hospitals, clinics, and research laboratories. However, the sheer scale of these networks has made it difficult to gather statistics that could be used for building new features. In this paper, we present a technique to obtain estimations of the number of distinct values matching a query on the network. We evaluate the technique experimentally and provide a set of results that demonstrate its effectiveness, as well as its flexibility in supporting a variety of queries and applications.
    Proceedings of the 1st ACM International Conference on Pervasive Technologies Related to Assistive Environments, PETRA 2008, Athens, Greece, July 16-18, 2008; 01/2008