Article

Continuous Queries over Data Streams

... However, count-based windows ω_count cannot be well defined. As Arasu notes in [Ara06], count-based windows may have ambiguous semantics: they can produce nondeterministic output, because the window must contain exactly N elements, yet multiple elements in the input stream may carry the same timestamp. ...
Article
RSS and Atom are generally less well known than the HTML web format, but they are omnipresent in modern web applications for publishing highly dynamic web content. Nowadays, news sites publish thousands of RSS/Atom feeds, often organized into general topics like politics, economy, sports, and culture. Weblog and microblogging systems like Twitter use the RSS publication format, and even more general social media like Facebook produce an RSS feed for every user and trending topic. This vast number of continuous data sources can be accessed using general-purpose feed aggregator applications like Google Reader, desktop clients like Firefox or Thunderbird, and RSS mash-up applications like Yahoo! Pipes, Netvibes, or Google News. Today, RSS and Atom feeds represent a huge stream of structured text data whose potential is still not fully exploited. In this thesis, we first present ROSES (Really Open Simple and Efficient Syndication), a data model and continuous query language for RSS/Atom feeds. ROSES allows users to create new personalized feeds from existing real-world feeds through a simple, yet complete, declarative query language and algebra. The ROSES algebra has been implemented in a complete, scalable prototype system capable of handling and processing ROSES feed aggregation queries. The query engine has been designed to scale in the number of queries. In particular, it implements a new cost-based multi-query optimization approach based on query normalization and shared filter factorization. We propose two different factorization algorithms: (i) STA, an adaptation of an existing approximation algorithm for finding minimal directed Steiner trees [CCC+98], and (ii) VCA, a greedy approximation algorithm based on efficient heuristics that outperforms the former in optimization cost. Our optimization approach has been validated by extensive experimental evaluation on real-world data collections.
... If the system is extended to support tracing, the incremental evaluation methods can be modified to evaluate post-aggregate nodes as well. The system currently supports only 2-way joins; we plan to extend it to multi-way joins. ...
Conference Paper
We present the methods and architecture of ARGUS, a stream processing system implemented atop commercial DBMSs to support large-scale complex continuous queries over data streams. ARGUS supports incremental operator evaluation and incremental multi-query plan optimization as new queries arrive. The latter goes well beyond the previous state of the art via a suite of techniques such as query-algebra canonicalization, indexing and searching, and topological query network optimization. Building on top of a DBMS, the system provides a value-adding package for existing database applications whose stream processing needs are increasingly demanding. Compared to running the continuous queries directly on the DBMS, ARGUS achieves well over a 100-fold improvement in performance.
... A first attempt at an algebraic formalization of windows was made in the core semantics of STREAM [7] and has been further analyzed in [16], where a clear semantics for the classical window definitions was stated for the first time. ...
Conference Paper
Querying streams of data from sensors or other devices requires several operators. One of the most important is windowing. Creating windows consists in grouping tuples from data streams at a specific rate according to a certain pattern. A large variety of window patterns exist, reflecting different data management semantics that are useful for different purposes. Prior art mainly focused on simple windows, like landmark and sliding windows, and only a few properties were considered for query rewriting. This paper goes one step further by proposing an algebraic model for generic windows. Our proposed model supports temporal, positional, and cross-domain windows, and a window's creation time can be specified by a complex function. This proposal subsumes most popular system formalizations and extends the possibilities of window management. The paper also demonstrates associativity and transposition properties useful for algebraic rewriting in query optimization, and briefly presents an implementation of the model.
Conference Paper
View maintenance is very important in the data warehousing process, since its goal is to keep the data warehouse consistent with its sources, but it raises big challenges in P2P environments. The materialized view maintenance problem has received much attention for distributed data warehouses, but little attention in peer-to-peer (P2P) systems. We therefore introduce the Peer Joining Real Time Data Warehouse algorithm (PJRT) to maintain materialized views on the peerDW architecture proposed in [1], in order to reduce maintenance time in a P2P architecture. The performance of PJRT was measured by comparing it to the existing view maintenance (VM) algorithm [2]. PJRT is then compared to the Joining Real Time Data Warehouse algorithm (JRT), which works on a distributed DW and was proposed in [3], to observe the effect of the peer architecture on maintenance time.
Article
We introduce fast algorithms for selecting a random sample of n records without replacement from a pool of N records, where the value of N is unknown beforehand. The main result of the paper is the design and analysis of Algorithm Z; it does the sampling in one pass using constant space and in O(n(1 + log(N/n))) expected time, which is optimum, up to a constant factor. Several optimizations are studied that collectively improve the speed of the naive version of the algorithm by an order of magnitude. We give an efficient Pascal-like implementation that incorporates these modifications and that is suitable for general use. Theoretical and empirical results indicate that Algorithm Z outperforms current methods by a significant margin.
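For intuition, here is a minimal Python sketch of the simpler Algorithm R on which this line of work builds; Vitter's Algorithm Z reaches the stated O(n(1 + log(N/n))) expected time by additionally skipping over records rather than drawing a random number per record. Names are illustrative, not the paper's.

```python
import random

def reservoir_sample(stream, n):
    """One-pass uniform sample of n records from a stream of unknown length.

    Minimal sketch of Algorithm R; Vitter's Algorithm Z adds skip-distance
    optimizations so most records are passed over without a random draw.
    """
    sample = []
    for t, record in enumerate(stream):
        if t < n:
            sample.append(record)       # fill the reservoir first
        else:
            j = random.randint(0, t)    # record t+1 kept with probability n/(t+1)
            if j < n:
                sample[j] = record
    return sample
```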
Conference Paper
In this overview paper we motivate the need for and research issues arising from a new model of data processing. In this model, data does not take the form of persistent relations, but rather arrives in multiple, continuous, rapid, time-varying data streams. In addition to reviewing past work relevant to data stream systems and current projects in the area, the paper explores topics in stream query languages, new requirements and challenges in query processing, and algorithmic issues.
Conference Paper
We study the problem of power-conserving computation of order statistics in sensor networks. Significant power-reducing optimizations have been devised for computing simple aggregate queries such as COUNT, AVERAGE, or MAX over sensor networks. In contrast, aggregate queries such as MEDIAN have seen little progress over the brute-force approach of forwarding all data to a central server. Moreover, battery life of current sensors seems largely determined by communication costs; therefore we aim to minimize the number of bytes transmitted. Unoptimized aggregate queries typically impose extremely high power consumption on a subset of sensors located near the server. Metrics such as total communication cost underestimate the penalty of such imbalance: network lifetime may be dominated by the worst-case replacement time for depleted batteries. In this paper, we design the first algorithms for computing order statistics such that power consumption is balanced across the entire network. Our first main result is a distributed algorithm that computes an ε-approximate quantile summary of the sensor data such that each sensor transmits only O(log²n/ε) data values, irrespective of the network topology, an improvement over the current worst-case behavior of Ω(n). Second, we show an improved result when the height, h, of the network is significantly smaller than n. Our third result is that we can exactly compute any order statistic (e.g., median) in a distributed manner such that each sensor needs to transmit O(log³n) values. Further, we design the aggregates used by our algorithms to be decomposable. An aggregate Q over a set S is decomposable if there exists a function f such that for all S = S1 ∪ S2, Q(S) = f(Q(S1), Q(S2)). We can thus directly apply existing optimizations to decomposable aggregates that increase error-resilience and reduce communication cost. Finally, we validate our results empirically, through simulation. When we compute the median exactly, we show that, even for moderate-size networks, the worst communication cost for any single node is several times smaller than the corresponding cost in prior median algorithms. We show similar cost reductions when computing approximate order-statistic summaries with guaranteed precision. In all cases, our total communication cost over the entire network is smaller than or equal to the total cost of prior algorithms.
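As a hedged illustration of the decomposability property defined above (not of the paper's quantile algorithms), the Python sketch below shows AVERAGE as a decomposable aggregate: each node forwards a small partial state, and any two states merge via a function f with Q(S1 ∪ S2) = f(Q(S1), Q(S2)).

```python
def local_state(readings):
    # Partial state for AVERAGE: (sum, count); small and topology-independent.
    return (sum(readings), len(readings))

def merge(q1, q2):
    # f such that Q(S1 ∪ S2) = f(Q(S1), Q(S2)): merge partial states pairwise.
    return (q1[0] + q2[0], q1[1] + q2[1])

def average(state):
    total, count = state
    return total / count

# Example: two sensor subtrees merged at an intermediate node.
print(average(merge(local_state([3, 5]), local_state([4, 8, 10]))))  # 6.0
```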
Conference Paper
The sliding window model is useful for discounting stale data in data stream applications. In this model, data elements arrive continually and only the most recent N elements are used when answering queries. We present a novel technique for solving two important and related problems in the sliding window model: maintaining variance and maintaining a k-median clustering. Our solution to the problem of maintaining variance provides a continually updated estimate of the variance of the last N values in a data stream with relative error of at most ε using O((1/ε²) log N) memory. We present a constant-factor approximation algorithm which maintains an approximate k-median solution for the last N data points using O((k/τ⁴) N^{2τ} log² N) memory, where τ < 1/2 is a parameter which trades off the space bound with the approximation factor of 2^{O(1/τ)}.
Conference Paper
We consider the problem of maintaining ε-approximate counts and quantiles over a stream sliding window using limited space. We consider two types of sliding windows depending on whether the number of elements N in the window is fixed (fixed-size sliding window) or variable (variable-size sliding window). In a fixed-size sliding window, both ends of the window slide synchronously over the stream. In a variable-size sliding window, an adversary slides the window ends independently, and therefore has the ability to vary the number of elements N in the window. We present various deterministic and randomized algorithms for approximate counts and quantiles. All of our algorithms require O((1/ε) polylog(1/ε, N)) space. For quantiles, this space requirement is an improvement over the previous best bound of O((1/ε²) polylog(1/ε, N)). We believe that no previous work on space-efficient approximate counts over sliding windows exists.
Conference Paper
Despite the recent surge of research in query processing over data streams, little attention has been devoted to defining precise semantics for continuous queries over streams. We first present an abstract semantics based on several building blocks: formal definitions for streams and relations, mappings among them, and any relational query language. From these basics we define a precise interpretation for continuous queries over streams and relations. We then propose a concrete language, CQL (for Continuous Query Language), which instantiates the abstract semantics using SQL as the relational query language and window specifications derived from SQL-99 to map from streams to relations. We have implemented most of the CQL language in a Data Stream Management System at Stanford, and we have developed a public repository of data stream applications that includes a wide variety of queries expressed in CQL.
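The stream-to-relation and relation-to-stream mappings underlying CQL can be rendered as a toy Python sketch. This is an illustration of the abstract semantics (a [ROWS n] window plus the ISTREAM operator) with made-up names, not part of the CQL implementation; real CQL relations are bags, while sets suffice here.

```python
def rows_window(prefix, n):
    """S2R: a [ROWS n] window maps a stream prefix to the relation
    holding its n most recent tuples."""
    return set(prefix[-n:])

def istream(rel_now, rel_before):
    """R2S: ISTREAM emits the tuples newly inserted into the relation."""
    return rel_now - rel_before

# SELECT ISTREAM(*) FROM S [ROWS 2] WHERE x > 10, at successive instants:
stream = [4, 12, 7, 15]
prev = set()
for i in range(1, len(stream) + 1):
    rel = {x for x in rows_window(stream[:i], 2) if x > 10}  # window + filter
    print(i, sorted(istream(rel, prev)))
    prev = rel
```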
Conference Paper
We present algorithms for fast quantile and frequency estimation in large data streams using graphics processors (GPUs). We exploit the high computation power and memory bandwidth of graphics processors and present a new sorting algorithm that performs rasterization operations on the GPUs. We use sorting as the main computational component for histogram approximation and construction of ε-approximate quantile and frequency summaries. Our algorithms for numerical statistics computation on data streams are deterministic, applicable to fixed- or variable-sized sliding windows, and use a limited memory footprint. We use the GPU as a co-processor and minimize the data transmission between the CPU and GPU by taking into account the low bus bandwidth. We implemented our algorithms on a PC with a NVIDIA GeForce FX 6800 Ultra GPU and a 3.4 GHz Pentium IV CPU and applied them to large data streams consisting of more than 100 million values. We also compared the performance of our GPU-based algorithms with optimized implementations of prior CPU-based algorithms. Overall, our results demonstrate that the graphics processors available on a commodity computer system are efficient stream processors and useful co-processors for mining data streams.
Conference Paper
Monitoring aggregates on IP traffic data streams is a compelling application for data stream management systems. The need for exploratory IP traffic data analysis naturally leads to posing related aggregation queries on data streams that differ only in the choice of grouping attributes. In this paper, we address this problem of efficiently computing multiple aggregations over high-speed data streams, based on a two-level LFTA/HFTA DSMS architecture inspired by Gigascope. Our first contribution is the insight that in such a scenario, additionally computing and maintaining fine-granularity aggregation queries (phantoms) at the LFTA has the benefit of supporting shared computation. Our second contribution is an investigation into the problem of identifying beneficial LFTA configurations of phantoms and user-queries. We formulate this problem as a cost optimization problem, which consists of two sub-optimization problems: how to choose phantoms and how to allocate space for them in the LFTA. We formally show the hardness of determining the optimal configuration, and propose cost-greedy heuristics for these independent sub-problems based on detailed analyses. Our final contribution is a thorough experimental study, based on real IP traffic data, as well as synthetic data, to demonstrate the effectiveness of our techniques for identifying beneficial configurations.
Conference Paper
In many applications involving continuous data streams, data arrival is bursty and data rate fluctuates over time. Systems that seek to give rapid or real-time query responses in such an environment must be prepared to deal gracefully with bursts in data arrival without compromising system performance. We discuss one strategy for processing bursty streams: adaptive, load-aware scheduling of query operators to minimize resource consumption during times of peak load. We show that the choice of an operator scheduling strategy can have significant impact on the run-time system memory usage. We then present Chain scheduling, an operator scheduling strategy for data stream systems that is near-optimal in minimizing run-time memory usage for any collection of single-stream queries involving selections, projections, and foreign-key joins with stored relations. Chain scheduling also performs well for queries with sliding-window joins over multiple streams, and multiple queries of the above types. A thorough experimental evaluation is provided where we demonstrate the potential benefits of Chain scheduling, compare it with competing scheduling strategies, and validate our analytical conclusions.
Conference Paper
Stream processing fits a large class of new applications for which conventional DBMSs fall short. Because many stream-oriented systems are inherently geographically distributed and because distribution offers scalable load management and higher availability, future stream processing systems will operate in a distributed fashion. They will run across the Internet on computers typically owned by multiple cooperating administrative domains. This paper describes the architectural challenges facing the design of large-scale distributed stream processing systems, and discusses novel approaches for addressing load management, high availability, and federated operation issues. We describe two stream processing systems, Aurora* and Medusa, which are being designed to explore complementary solutions to these challenges. This paper discusses the architectural issues facing the design of large-scale distributed stream processing systems. We begin in Section 2 with a brief description of our centralized stream processing system, Aurora (4). We then discuss two complementary efforts to extend Aurora to a distributed environment: Aurora* and Medusa. Aurora* assumes an environment in which all nodes fall under a single administrative domain. Medusa provides the infrastructure to support federated operation of nodes across administrative boundaries. After describing the architectures of these two systems in Section 3, we consider three design challenges common to both: infrastructures and protocols supporting communication amongst nodes (Section 4), load sharing in response to variable network conditions (Section 5), and high availability in the presence of failures (Section 6). We also discuss high-level policy specifications employed by the two systems in Section 7. For all of these issues, we believe that the push-based nature of stream-based applications not only raises new challenges but also offers the possibility of new domain-specific solutions.
Conference Paper
We present the demonstration of the design of "STEAM", Purdue Boiler Makers' stream database system that allows for the processing of continuous and snap-shot queries over data streams. Specifically, the demonstration focuses on the query processing engine, "Nile". Nile extends the query processor engine of an object-relational database management system, PREDATOR, to process continuous queries over data streams. Nile supports extended SQL operators that handle sliding-window execution as an approach to restrict the size of the stored state in operators such as join.
Conference Paper
We consider a router on the Internet analyzing the statistical properties of a TCP/IP packet stream. A fundamental difficulty with measuring traffic behavior on the Internet is that there is simply too much data to be recorded for later analysis, on the order of gigabytes a second. As a result, network routers can collect only relatively few statistics about the data. The central problem addressed here is to use the limited memory of routers to determine essential features of the network traffic stream. A particularly difficult and representative subproblem is to determine the top k categories to which the most packets belong, for a desired value of k and for a given notion of categorization such as the destination IP address. We present an algorithm that deterministically finds (in particular) all categories having a frequency above 1/(m+1) using m counters, which we prove is best possible in the worst case. We also present a sampling-based algorithm for the case that packet categories follow an arbitrary distribution, but their order over time is permuted uniformly at random. Under this model, our algorithm identifies flows above a frequency threshold of roughly 1/√(nm) with high probability, where m is the number of counters and n is the number of packets observed. This guarantee is not far off from the ideal of identifying all flows (probability 1/n), and we prove that it is best possible up to a logarithmic factor. We show that the algorithm ranks the identified flows according to frequency within any desired constant factor of accuracy.
Article
We consider the problem of maintaining aggregates and statistics over data streams, with respect to the last N data elements seen so far. We refer to this model as the sliding window model. We consider the following basic problem: given a stream of bits, maintain a count of the number of 1's in the last N elements seen from the stream. We show that, using O((1/ε) log² N) bits of memory, we can estimate the number of 1's to within a factor of 1 + ε. We also give a matching lower bound of Ω((1/ε) log² N) memory bits for any deterministic or randomized algorithm. We extend our scheme to maintain the sum of the last N positive integers and provide matching upper and lower bounds for this more general problem as well. We also show how to efficiently compute the Lp norms (p ∈ [1, 2]) of vectors in the sliding window model using our techniques. Using our algorithm, one can adapt many other techniques to work for the sliding window model, with a multiplicative overhead of O((1/ε) log N) in memory and a 1 + ε factor loss in accuracy. These include maintaining approximate histograms, hash tables, and statistics or aggregates such as sums and averages.
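A hedged Python sketch of the bucket-based idea (often called the exponential-histogram or DGIM technique) follows: 1's are grouped into buckets of power-of-two sizes, with at most k buckets per size, and the oldest bucket is only half-counted, which is what yields the 1 + ε guarantee for suitable k. This is a simplification for illustration, not the paper's full construction.

```python
class BasicCounting:
    """Approximate count of 1's among the last N bits of a stream.

    Buckets are (timestamp of most recent 1, size) pairs kept newest first,
    with power-of-two sizes and at most k buckets per size; larger k means
    smaller relative error."""
    def __init__(self, N, k=2):
        self.N, self.k, self.t = N, k, 0
        self.buckets = []  # sorted by timestamp, newest first

    def add(self, bit):
        self.t += 1
        # Expire buckets whose most recent 1 has left the window.
        self.buckets = [b for b in self.buckets if b[0] > self.t - self.N]
        if bit != 1:
            return
        self.buckets.insert(0, (self.t, 1))
        size = 1
        while True:
            idxs = [i for i, b in enumerate(self.buckets) if b[1] == size]
            if len(idxs) <= self.k:
                break
            i_new, i_old = idxs[-2], idxs[-1]            # two oldest of this size
            merged = (self.buckets[i_new][0], 2 * size)  # keep newer timestamp
            self.buckets = [b for i, b in enumerate(self.buckets)
                            if i not in (i_new, i_old)]
            self.buckets.append(merged)
            self.buckets.sort(key=lambda b: -b[0])
            size *= 2

    def estimate(self):
        if not self.buckets:
            return 0
        total = sum(size for _, size in self.buckets)
        return total - self.buckets[-1][1] // 2          # half-count oldest bucket
```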
Article
Window queries are proving essential to data-stream processing. In this paper, we present an approach for evaluating sliding-window aggregate queries that reduces both space and computation time for query execution. Our approach divides overlapping windows into disjoint panes, computes sub-aggregates over each pane, and "rolls up" the pane-aggregates to compute window-aggregates. Our experimental study shows that using panes has significant performance benefits.
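A minimal Python sketch of the pane idea, under illustrative names: overlapping windows of RANGE r and SLIDE s are split into disjoint panes of gcd(r, s) tuples, each pane is sub-aggregated once, and window aggregates are rolled up from the pane aggregates (here for SUM; an aggregate like MAX would roll up with max).

```python
from math import gcd

def paned_window_sums(stream, rng, slide):
    """Sliding-window SUM via panes: each input tuple is touched once to
    build pane sub-aggregates, which are then rolled up per window."""
    pane = gcd(rng, slide)
    pane_sums = [sum(stream[i:i + pane]) for i in range(0, len(stream), pane)]
    per_window, step = rng // pane, slide // pane
    return [sum(pane_sums[i:i + per_window])
            for i in range(0, len(pane_sums) - per_window + 1, step)]

# A window of 4 tuples sliding by 2: panes of gcd(4, 2) = 2 tuples each.
print(paned_window_sums([1, 2, 3, 4, 5, 6], rng=4, slide=2))  # [10, 18]
```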
Article
In a database to which data is continually added, users may wish to issue a permanent query and be notified whenever data matches the query. If such continuous queries examine only single records, this can be implemented by examining each record as it arrives. This is very efficient because only the incoming record needs to be scanned. This simple approach does not work for queries involving joins or time. The Tapestry system allows users to issue such queries over a database of mail and bulletin board messages. The user issues a static query, such as "show me all messages that have been replied to by Jones," as though the database were fixed and unchanging. Tapestry converts the query into an incremental query that efficiently finds new matches to the original query as new messages are added to the database. This paper describes the techniques used in Tapestry, which do not depend on triggers and can thus be implemented on any commercial database that supports SQL. Although Tapestry is designed for filtering mail and news messages, its techniques are applicable to any append-only database.
Article
Information dissemination applications are gaining increasing popularity due to dramatic improvements in communications bandwidth and ubiquity. The sheer volume of data available necessitates the use of selective approaches to dissemination in order to avoid overwhelming users with unnecessary information. Existing mechanisms for selective dissemination typically rely on simple keyword matching or "bag of words" information retrieval techniques. The advent of XML as a standard for information exchange and the development of query languages for XML data enable the development of more sophisticated filtering mechanisms that take structure information into account. We have developed several index organizations and search algorithms for performing efficient filtering of XML documents for large-scale information dissemination systems. In this paper we describe these techniques and examine their performance across a range of document, workload, and scale scenarios.
Conference Paper
Recent work on querying data streams has focused on systems where newly arriving data is processed and continuously streamed to the user in real-time. In many emerging applications, however, ad hoc queries and/or intermittent connectivity also require the processing of data that arrives prior to query submission or during a period of disconnection. For such applications, we have developed PSoup, a system that combines the processing of ad-hoc and continuous queries by treating data and queries symmetrically, allowing new queries to be applied to old data and new data to be applied to old queries. PSoup also supports intermittent connectivity by separating the computation of query results from the delivery of those results. PSoup builds on adaptive query processing techniques developed in the Telegraph project at UC Berkeley. In this paper, we describe PSoup and present experiments that demonstrate the effectiveness of our approach.
Article
Research in data stream algorithms has blossomed since late 90s. The talk will trace the history of the Approximate Frequency Counts paper, how it was conceptualized and how it influenced data stream research. The talk will also touch upon a recent development: analysis of personal data streams for improving our quality of lives.
Chapter
A randomized algorithm is one that makes random choices during its execution. The behavior of such an algorithm may thus be random even on a fixed input. The design and analysis of a randomized algorithm focus on establishing that it is likely to behave well on every input; the likelihood in such a statement depends only on the probabilistic choices made by the algorithm during execution and not on any assumptions about the input. It is especially important to distinguish a randomized algorithm from the average-case analysis of algorithms, where one analyzes an algorithm assuming that its input is drawn from a fixed probability distribution. With a randomized algorithm, in contrast, no assumption is made about the input.
Article
The frequency moments of a sequence containing m_i elements of type i, 1 ≤ i ≤ n, are the numbers F_k = ∑_{i=1}^{n} m_i^k. We consider the space complexity of randomized algorithms that approximate the numbers F_k, when the elements of the sequence are given one by one and cannot be stored. Surprisingly, it turns out that the numbers F_0, F_1, and F_2 can be approximated in logarithmic space, whereas the approximation of F_k for k ≥ 6 requires n^{Ω(1)} space. Applications to databases are mentioned as well.
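The celebrated F_2 result can be sketched in a few lines of Python: keep a counter Z = ∑_i m_i·s(i) with random signs s(i) ∈ {−1, +1}; then E[Z²] = F_2 when the signs are 4-wise independent. The sketch below caches one random sign per distinct item for readability, which is illustrative rather than space-efficient.

```python
import random

def f2_estimate(items, trials=64):
    """AMS-style estimator for F2 = sum_i m_i^2 over a list of items."""
    estimates = []
    for _ in range(trials):
        signs, z = {}, 0
        for item in items:
            if item not in signs:                      # stand-in for a 4-wise
                signs[item] = random.choice((-1, 1))   # independent hash
            z += signs[item]
        estimates.append(z * z)                        # E[z^2] equals F2
    return sum(estimates) / trials

data = ["a"] * 3 + ["b"] * 2 + ["c"]                   # F2 = 9 + 4 + 1 = 14
print(f2_estimate(data))
```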
Article
We describe Bro, a stand-alone system for detecting network intruders in real-time by passively monitoring a network link over which the intruder's traffic transits. We give an overview of the system's design, which emphasizes high-speed (FDDI-rate) monitoring, real-time notification, clear separation between mechanism and policy, and extensibility. To achieve these ends, Bro is divided into an `event engine' that reduces a kernel-filtered network traffic stream into a series of higher-level events, and a `policy script interpreter' that interprets event handlers written in a specialized language used to express a site's security policy. Event handlers can update state information, synthesize new events, record information to disk, and generate real-time notifications via syslog. We also discuss a number of attacks that attempt to subvert passive monitoring systems and defenses against these, and give particulars of how Bro analyzes the six applications integrated into it so far: Finger, FTP, Portmapper, Ident, Telnet and Rlogin. The system is publicly available in source code form.
Article
When selecting from, or sorting, a file stored on a read-only tape and the internal storage is rather limited, several passes of the input tape may be required. We study the relation between the amount of internal storage available and the number of passes required to select the Kth highest of N inputs. We show, for example, that to find the median in two passes requires at least and at most internal storage. For probabilistic methods, internal storage is necessary and sufficient for a single pass method which finds the median with arbitrarily high probability.
Article
Two algorithms are presented for finding the values that occur more than n ÷ k times in an array b[0:n – 1]. The second one requires time proportional to n ∗ log(k) and extra space proportional to k. A theorem suggests that this algorithm is optimal among algorithms that are based on comparing array elements. Thus, finding the element that occurs more than n ÷ 2 times requires linear time, while determining whether there is a duplicate (the case k = n) requires time proportional to n ∗ log n. The algorithms may be interesting from a standpoint of programming methodology; each was developed as an extension of the algorithm for the simple case k = 2.
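A one-pass Python sketch in the style that later became known as the Misra-Gries frequent-items algorithm: it returns a superset of the values occurring more than n ÷ k times using at most k − 1 counters, and a second pass can verify exact counts. This is a streaming rendering of the idea, not the paper's array algorithms verbatim.

```python
def frequent_candidates(values, k):
    """Candidates for 'occurs more than n/k times', using k-1 counters."""
    counters = {}
    for x in values:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # No free counter: decrement all, dropping any that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return list(counters)

print(frequent_candidates([1, 2, 1, 3, 1, 4, 1], k=2))  # [1] (majority case)
```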
Article
The number of comparisons required to select the i-th smallest of n numbers is shown to be at most a linear function of n by analysis of a new selection algorithm—PICK. Specifically, no more than 5.4305 n comparisons are ever required. This bound is improved for extreme values of i, and a new lower bound on the requisite number of comparisons is also proved.
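PICK's precise comparison counts are beyond a short sketch, but its core pivot idea, selecting the median of group medians to guarantee a balanced partition, can be shown compactly in Python (without the refinements behind the 5.4305n bound):

```python
def select(a, i):
    """Return the i-th smallest element of a (0-based) in worst-case O(n)."""
    if len(a) <= 5:
        return sorted(a)[i]
    groups = [sorted(a[j:j + 5]) for j in range(0, len(a), 5)]
    medians = [g[len(g) // 2] for g in groups]
    pivot = select(medians, len(medians) // 2)   # median of medians
    lo = [x for x in a if x < pivot]
    hi = [x for x in a if x > pivot]
    n_eq = len(a) - len(lo) - len(hi)
    if i < len(lo):
        return select(lo, i)
    if i < len(lo) + n_eq:
        return pivot
    return select(hi, i - len(lo) - n_eq)

print(select([9, 1, 8, 2, 7, 3, 6, 4, 5], 4))    # 5, the median
```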
Conference Paper
Continuous queries in a Data Stream Management System (DSMS) rely on time as a basis for windows on streams and for defining a consistent semantics for multiple streams and updatable relations. The system clock in a centralized DSMS provides a convenient and well-behaved notion of time, but often it is more appropriate for a DSMS application to define its own notion of time: its own clock(s), sequence numbers, or other forms of ordering and timestamping. Flexible application-defined time poses challenges to the DSMS, since streams may be out of order and uncoordinated with each other, they may incur latency reaching the DSMS, and they may pause or stop. We formalize these challenges and specify how to generate heartbeats so that queries can be evaluated correctly and continuously in an application-defined time domain. Our heartbeat generation algorithm is based on parameters capturing skew between streams, unordering within streams, and latency in streams reaching the DSMS. We also describe how to estimate these parameters at run-time, and we discuss how heartbeats can be used for processing continuous queries.
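A hedged sketch of how such parameters might combine (the names and the exact formula are illustrative, not the paper's notation): a heartbeat with timestamp τ promises that no future tuple carries a timestamp ≤ τ, so τ must discount each stream's latest observed timestamp by its skew, unordering, and latency bounds.

```python
def next_heartbeat(latest_ts, skew, unordering, latency):
    """Largest timestamp tau that every stream is assumed to have passed.

    latest_ts[s]: newest application timestamp seen on stream s
    skew[s], unordering[s], latency[s]: per-stream bounds (known or
    estimated at run time)
    """
    return min(latest_ts[s] - skew[s] - unordering[s] - latency[s]
               for s in latest_ts)

print(next_heartbeat({"S1": 100, "S2": 95},
                     skew={"S1": 0, "S2": 2},
                     unordering={"S1": 3, "S2": 1},
                     latency={"S1": 1, "S2": 1}))  # 91
```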
Conference Paper
Most database management systems maintain statistics on the underlying relation. One of the important statistics is that of the "hot items" in the relation: those that appear many times (most frequently, or more than some threshold). For example, end-biased histograms keep the hot items as part of the histogram and are used in selectivity estimation. Hot items are used as simple outliers in data mining, and in anomaly detection in many applications. We present new methods for dynamically determining the hot items at any time in a relation which is undergoing deletion operations as well as inserts. Our methods maintain small-space data structures that monitor the transactions on the relation, and, when required, quickly output all hot items without rescanning the relation in the database. With user-specified probability, all hot items are correctly reported. Our methods rely on ideas from "group testing." They are simple to implement, and have provable quality, space, and time guarantees. Previously known algorithms for this problem that make similar quality and performance guarantees cannot handle deletions, and those that handle deletions cannot make similar guarantees without rescanning the database. Our experiments with real and synthetic data show that our algorithms are accurate in dynamically tracking the hot items independent of the rate of insertions and deletions.
Conference Paper
To meet the stringent performance requirements of transaction recording systems, much of the recording and query processing functionality, which should preferably be in the database, is actually implemented in the procedural application code, with the attendant difficulties in development, modularization, maintenance, and evolution. To combat this deficiency, we propose a new data model, the chronicle model, which permits the capture, within the data model, of many computations common to transactional data recording systems. A central issue in our model is the incremental maintenance of materialized views in time independent of the size of the recorded stream. Within the chronicle model we study the type of summary queries that can be answered by using persistent views. We measure the complexity of a chronicle model by the complexity of incrementally maintaining its persistent views, and develop languages that ensure a low maintenance complexity independent of the sequence sizes.
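The model's central requirement, maintaining a persistent view in time independent of the chronicle's length, is easy to illustrate in Python for a summary view such as a per-account running total (a hypothetical example, not the paper's formal language):

```python
class RunningTotalView:
    """Materialized summary view over an append-only transaction chronicle.

    Each insertion updates the view in O(1), independent of how many
    transactions have been recorded so far."""
    def __init__(self):
        self.totals = {}

    def insert(self, account, amount):
        self.totals[account] = self.totals.get(account, 0) + amount

    def balance(self, account):
        return self.totals.get(account, 0)

view = RunningTotalView()
for acct, amt in [("a", 10), ("b", 5), ("a", -3)]:
    view.insert(acct, amt)
print(view.balance("a"))  # 7
```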
Conference Paper
This chapter presents CAPE—Continuous Adaptive Query Processing Engine. It is designed to efficiently evaluate continuous queries in highly dynamic stream environments with the following characteristics: the input data may stream into the query engine at widely varying rates; meta knowledge such as punctuations may dynamically be embedded into data streams; as queries are registered into or removed from the query engine, the computing resources available for processing an individual operator may vary greatly over time; and different users may impose different quality of service (QoS) requirements. CAPE employs an optimization framework with heterogeneous-grained adaptivity for effectively coping with such dynamic variations. CAPE differs from other continuous query systems in many ways. It aims to deliver exact query answers by making the best effort through heterogeneous-grained adaptations with the goal to meet users' QoS requirements.
Conference Paper
This paper introduces monitoring applications, which we will show differ substantially from conventional business data processing. The fact that a software system must process and react to continual inputs from many sources (e.g., sensors) rather than from human operators requires one to rethink the fundamental architecture of a DBMS for this application area. In this paper, we present Aurora, a new DBMS that is currently under construction at Brandeis University, Brown University, and M.I.T. We describe the basic system architecture, a stream-oriented set of operators, optimization tactics, and support for real-time operation.
Conference Paper
We study the fundamental limitations of relational algebra (RA) and SQL in supporting sequence and stream queries, and present effective query language and data model enrichments to deal with them. We begin by observing the well-known limitations of SQL in application domains which are important for data streams, such as sequence queries and data mining. Then we present a formal proof that, for continuous queries on data streams, SQL suffers from additional expressive power problems. We begin by focusing on the notion of nonblocking (NB) queries, which are the only continuous queries that can be supported on data streams. We characterize the notion of nonblocking queries by showing that they are equivalent to monotonic queries. Therefore the notion of NB-completeness for RA can be formalized as its ability to express all monotonic queries.
Conference Paper
We have developed Gigascope, a stream database for network applications including traffic analysis, intrusion detection, router configuration analysis, network research, network monitoring, and performance monitoring and debugging. Gigascope is undergoing installation at many sites within the AT&T network, including at OC48 routers, for detailed monitoring. In this paper we describe our motivation for and constraints in developing Gigascope, the Gigascope architecture and query language, and performance issues. We conclude with a discussion of stream database research problems we have found in our application.
Conference Paper
We consider the problem of pipelined filters, where a continuous stream of tuples is processed by a set of commutative filters. Pipelined filters are common in stream applications and capture a large class of multiway stream joins. We focus on the problem of ordering the filters adaptively to minimize processing cost in an environment where stream and filter characteristics vary unpredictably over time. Our core algorithm, A-Greedy (for Adaptive Greedy), has strong theoretical guarantees: If stream and filter characteristics were to stabilize, A-Greedy would converge to an ordering within a small constant factor of optimal. (In experiments A-Greedy usually converges to the optimal ordering.) One very important feature of A-Greedy is that it monitors and responds to selectivities that are correlated across filters (i.e., that are nonindependent), which provides the strong quality guarantee but incurs run-time overhead. We identify a three-way tradeoff among provable convergence to good orderings, run-time overhead, and speed of adaptivity. We develop a suite of variants of A-Greedy that lie at different points on this tradeoff spectrum. We have implemented all our algorithms in the STREAM prototype Data Stream Management System and a thorough performance evaluation is presented.
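If stream and filter characteristics were fixed and independent, the greedy invariant A-Greedy maintains would reduce to a static rule: apply first the filter with the highest drop rate per unit cost. The Python sketch below shows only this static baseline; A-Greedy itself tracks conditional selectivities in a run-time profile and reorders adaptively.

```python
def static_greedy_order(filters):
    """filters: (name, selectivity, cost) with selectivity = P(tuple passes).

    Order by drop rate per unit cost, (1 - selectivity) / cost, descending:
    the static analogue of the greedy invariant A-Greedy maintains adaptively.
    """
    return sorted(filters, key=lambda f: (1 - f[1]) / f[2], reverse=True)

pipeline = [("f1", 0.9, 1.0), ("f2", 0.2, 2.0), ("f3", 0.5, 1.0)]
print([name for name, _, _ in static_greedy_order(pipeline)])
# ['f3', 'f2', 'f1']: the cheap, highly selective filter runs first
```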
Conference Paper
Much of the data exchanged over the Internet will soon be encoded in XML, allowing for sophisticated filtering and content-based routing. We have built a filtering engine called YFilter, which filters streaming XML documents according to XQuery or XPath queries that involve both path expressions and predicates. Unlike previous work, YFilter uses a novel NFA-based execution model. We present the structures and algorithms underlying YFilter, and show its efficiency and scalability under various workloads
Conference Paper
Statistics over the most recently observed data elements are often required in applications involving data streams, such as intrusion detection in network monitoring, stock price prediction in financial markets, Web log mining for access prediction, and user click stream mining for personalization. Among various statistics, computing quantile summaries is probably the most challenging because of its complexity. We study the problem of continuously maintaining a quantile summary of the most recently observed N elements over a stream so that quantile queries can be answered with a guaranteed precision of εN. We developed a space-efficient algorithm for predefined N that requires only one scan of the input data stream and O(log(ε²N)/ε + 1/ε²) space in the worst case. We also developed an algorithm that maintains quantile summaries for the most recent N elements so that quantile queries on any most recent n elements (n ≤ N) can be answered with a guaranteed precision of εn. The worst-case space requirement for this algorithm is only O(log²(εN)/ε²). Our performance study indicated that not only is the actual quantile estimation error far below the guaranteed precision, but the space requirement is also much less than the given theoretical bound.
Conference Paper
A lack of power and extensibility in their query languages has seriously limited the generality of DBMSs and hampered their ability to support data mining applications. Thus, there is a pressing need for more general mechanisms for extending DBMSs to efficiently support database-centric data mining applications. To satisfy this need, we propose a new extensibility mechanism for SQL-compliant DBMSs, and demonstrate its power in supporting decision support applications. The key extension is the ability to define new table functions and aggregate functions in SQL, rather than in external procedural languages as Object-Relational (O-R) DBMSs currently do. This simple extension turns SQL into a powerful language for decision-support applications, including ROLAP, time-series queries, stream-oriented processing, and data mining functions. First, we discuss the use of ATLaS for data mining applications, and then the architecture and techniques used in its realization.
Article
A new selection algorithm is presented which is shown to be very efficient on the average, both theoretically and practically. The number of comparisons used to select the ith smallest of n numbers is n + min(i,n-i) + o(n). A lower bound within 9 percent of the above formula is also derived.