Conference Paper

Query Optimization over Distributed Data Stream

Software Coll., Northeastern Univ., Shenyang, China
DOI: 10.1109/HIS.2009.198
Conference: Hybrid Intelligent Systems, 2009. HIS '09. Ninth International Conference on, Volume: 2
Source: IEEE Xplore

ABSTRACT Recent research efforts in the field of data stream processing show the increasing importance of processing data streams, e.g., in the e-science domain. Together with the advent of peer-to-peer (P2P) networks and grid computing, this makes it necessary to develop new techniques for distributing and processing continuous queries over data streams in such networks. These systems often have to process multiple similar but different continuous aggregation queries simultaneously. Since executing each query separately can lead to significant scalability and performance problems, it is vital to share resources by exploiting similarities in the queries. The challenge is to identify overlapping computations that may not be obvious in the queries themselves. In this paper, we propose a novel algorithmic solution to the problem of finding the minimum number of queries in such a distributed-streams setting, in order to optimize the communication cost across the network. The experimental results show that our approach gives as much as an order of magnitude performance improvement over the no-sharing setting.
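The idea of sharing work among similar aggregation queries can be illustrated with a minimal Python sketch. It is not the algorithm proposed in the paper, which the abstract does not describe; it only assumes that queries with an identical aggregate and selection range can be evaluated once per node, so that each node sends one partial result per distinct query rather than one per registered query. The `Query` type, the sample data, and all function names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    name: str   # hypothetical query identifier
    agg: str    # "sum", "count", or "max"
    low: int    # selection predicate: low <= value <= high
    high: int


def local_partial(agg, values):
    """Partial aggregate computed at one node over its local values."""
    if agg == "sum":
        return sum(values)
    if agg == "count":
        return len(values)
    if agg == "max":
        return max(values, default=None)
    raise ValueError(f"unsupported aggregate: {agg}")


def merge(agg, partials):
    """Combine per-node partial aggregates into the global answer."""
    if agg in ("sum", "count"):
        return sum(partials)
    present = [p for p in partials if p is not None]
    return max(present, default=None)


def evaluate_shared(queries, node_data):
    """Evaluate each distinct (agg, low, high) once; return results and message count."""
    groups = defaultdict(list)
    for q in queries:
        groups[(q.agg, q.low, q.high)].append(q.name)

    results, messages = {}, 0
    for (agg, low, high), names in groups.items():
        partials = []
        for values in node_data.values():
            selected = [v for v in values if low <= v <= high]
            partials.append(local_partial(agg, selected))
            messages += 1  # one partial result sent per node per distinct query
        answer = merge(agg, partials)
        for name in names:
            results[name] = answer
    return results, messages


# Hypothetical workload: q1 and q2 are duplicates and share one evaluation.
NODE_DATA = {"n1": [3, 8, 15], "n2": [5, 12], "n3": [20, 7]}
QUERIES = [
    Query("q1", "sum", 0, 10),
    Query("q2", "sum", 0, 10),
    Query("q3", "count", 0, 10),
    Query("q4", "max", 5, 20),
]

if __name__ == "__main__":
    results, messages = evaluate_shared(QUERIES, NODE_DATA)
    print(results)
    print("messages with sharing:", messages)
    print("messages without sharing:", len(QUERIES) * len(NODE_DATA))
```

In this toy workload, four registered queries collapse to three distinct ones, so the three nodes send 9 partial results instead of 12. Detecting overlap between queries that are similar but not identical, which the abstract names as the real challenge, is beyond this sketch.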

  • Source
    ABSTRACT: Data streaming systems are becoming essential for monitoring applications such as financial analysis and network intrusion detection. These systems often have to process many similar but different queries over common data. Since executing each query separately can lead to significant scalability and performance problems, it is vital to share resources by exploiting similarities in the queries. In this paper we present ways to efficiently share streaming aggregate queries with differing periodic windows and arbitrary selection predicates. A major contribution is our sharing technique that does not require any up-front multiple query optimization. This is a significant departure from existing techniques that rely on complex static analyses of fixed query workloads. Our approach is particularly vital in streaming systems where queries can join and leave the system at any point. We present a detailed performance study that evaluates our strategies with an implementation and real data. In these experiments, our approach gives us as much as an order of magnitude performance improvement over the state of the art.
    Proceedings of the ACM SIGMOD International Conference on Management of Data, Chicago, Illinois, USA, June 27-29, 2006; 01/2006
  • Source
    ABSTRACT: We present a Scalable Distributed Information Management System (SDIMS) that aggregates information about large-scale networked systems and that can serve as a basic building block for a broad range of large-scale distributed applications by providing detailed views of nearby information and summary views of global information. To serve as a basic building block, a SDIMS should have four properties: scalability to many nodes and attributes, flexibility to accommodate a broad range of...
  • Source
    ABSTRACT: CQL, a continuous query language, is supported by the STREAM prototype data stream management system (DSMS) at Stanford. CQL is an expressive SQL-based declarative language for registering continuous queries against streams and stored relations. We begin by presenting an abstract semantics that relies only on "black-box" mappings among streams and relations. From these mappings we define a precise and general interpretation for continuous queries. CQL is an instantiation of our abstract semantics using SQL to map from relations to relations, window specifications derived from SQL-99 to map from streams to relations, and three new operators to map from relations to streams. Most of the CQL language is operational in the STREAM system. We present the structure of CQL's query execution plans as well as details of the most important components: operators, interoperator queues, synopses, and sharing of components among multiple operators and queries. Examples throughout the paper are drawn from the Linear Road benchmark recently proposed for DSMSs. We also curate a public repository of data stream applications that includes a wide variety of queries expressed in CQL. The relative ease of capturing these applications in CQL is one indicator that the language contains an appropriate set of constructs for data stream processing.
    The VLDB Journal 03/2004; 2(2). DOI:10.1007/s00778-004-0147-z · 1.57 Impact Factor