Sharing Aggregate Computation of Multiple Group by Queries over Distributed Data Stream.
ABSTRACT Data streaming systems are becoming essential for monitoring applications such as financial analysis, network intrusion detection and sensor network. These systems often have to process multiple similar but different continuous aggregation queries simultaneously. Since executing each query separately can lead to significant scalability and performance problems, it is vital to share resources by exploiting similarities in the queries. The challenge is to identify overlapping computations that may not be obvious in the queries themselves. In this paper, we reveal new opportunities for sharing work in the context of distributed aggregation queries that vary in their group by predicates. We identify settings in which a large set of m such queries can be answered by executing n< m different queries. The n queries are revealed by analyzing the binary two-dimension array capturing the connection among the queries that they satisfy. We propose a novel algorithmic solution for problem of finding the minimum number of queries in such a distributed-streams setting, in order to optimize the communicate cost across the network. The experiment result show that our approach gives us as much as magnitude performance improvement over the no-share settings.
- SourceAvailable from: Sumit Ganguly
Conference Paper: Distributed Set Expression Cardinality Estimation.[Show abstract] [Hide abstract]
ABSTRACT: We consider the problem of estimating set-expression cardinality in a distributed streaming environment where rapid update streams originating at remote sites are continually transmitted to a central processing sys- tem. At the core of our algorithmic solutions for answering set-expression cardinality queries are two novel techniques for lowering data communication costs without sacrificing answer precision. Our first technique exploits global knowledge of the distribution of certain frequently occurring stream elements to sig- nificantly reduce the transmission of element state in- formation to the central site. Our second technical con- tribution involves a novel way of capturing the seman- tics of the input set expression in a boolean logic for- mula, and using models (of the formula) to determine whether an element state change at a remote site can affect the set expression result. Results of our experi- mental study with real-life as well as synthetic data sets indicate that our distributed set-expression cardinality estimation algorithms achieve substantial reductions in message traffic compared to naive approaches that pro- vide the same accuracy guarantees.(e)Proceedings of the Thirtieth International Conference on Very Large Data Bases, Toronto, Canada, August 31 - September 3 2004; 01/2004
- [Show abstract] [Hide abstract]
ABSTRACT: This paper presents the architecture of PIER , an Internetscale query engine we have been building over the last three years. PIER is the first general-purpose relational query processor targeted at a peer-to-peer (p2p) architecture of thousands or millions of participating nodes on the Internet. It supports massively distributed, database-style dataflows for snapshot and continuous queries. It is intended to serve as a building block for a diverse set of Internet-scale informationcentric applications, particularly those that tap into the standardized data readily available on networked machines, including packet headers, system logs, and file names
Conference Paper: A scalable distributed information management system.[Show abstract] [Hide abstract]
ABSTRACT: We present a Scalable Distributed Information Management System (SDIMS) that aggregates information about large-scale networked systems and that can serve as a basic building block for a broad range of large-scale distributed applications by providing detailed views of nearby information and summary views of global information. To serve as a basic building block, a SDIMS should have four properties: scalability to many nodes and attributes, flexibility to accommodate a broad range of applications, administrative isolation for security and availability, and robustness to node and network failures. We design, implement and evaluate a SDIMS that (1) leverages Distributed Hash Tables (DHT) to create scalable aggregation trees, (2) provides flexibility through a simple API that lets applications control propagation of reads and writes, (3) provides administrative isolation through simple extensions to current DHT algorithms, and (4) achieves robustness to node and network reconfigurations through lazy reaggregation, on-demand reaggregation, and tunable spatial replication. Through extensive simulations and micro-benchmark experiments, we observe that our system is an order of magnitude more scalable than existing approaches, achieves isolation properties at the cost of modestly increased read latency in comparison to flat DHTs, and gracefully handles failures.Proceedings of the ACM SIGCOMM 2004 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, August 30 - September 3, 2004, Portland, Oregon, USA; 10/2004