Stan Zdonik

Brown University · Department of Computer Science

About

203 Publications · 46,514 Reads · 19,644 Citations

Publications (203)
Chapter
Over the last several years, a great deal of progress has been made in the area of stream-processing engines (SPEs). Three basic tenets distinguish SPEs from current data processing engines. First, they must support primitives for streaming applications. Unlike Online Transaction Processing (OLTP), which processes messages in isolation, streaming a...
Conference Paper
Searchlight enables search and exploration of large, multi-dimensional data sets interactively. It allows users to explore by specifying rich constraints for the "objects" they are interested in identifying. Constraints can express a variety of properties, including a shape of the object (e.g., a waveform interval of length 10-100ms), its aggregate...
Article
Data analytics has recently grown to include increasingly sophisticated techniques, such as machine learning and advanced statistics. Users frequently express these complex analytics tasks as workflows of user-defined functions (UDFs) that specify each algorithmic step. However, given typical hardware configurations and dataset sizes, the core chal...
Article
This paper presents BigDAWG, a reference implementation of a new architecture for "Big Data" applications. Such applications not only call for large-scale analytics, but also for real-time streaming support, smaller analytics at interactive speeds, data visualization, and cross-storage-system queries. Guided by the principle that "one size does not...
Article
We present a new system, called Searchlight, that uniquely integrates constraint solving and data management techniques. It allows Constraint Programming (CP) machinery to run efficiently inside a DBMS without the need to extract, transform and move the data. This marriage concurrently offers the rich expressiveness and efficiency of constraint-bas...
Article
Full-text available
This paper presents a new view of federated databases to address the growing need for managing information that spans multiple data models. This trend is fueled by the proliferation of storage engines and query languages based on the observation that "no one size fits all". To address this shift, we propose a polystore architecture; it is designed...
Article
Full-text available
Next generation high-performance RDMA-capable networks will require a fundamental rethinking of the design and architecture of modern distributed DBMSs. These systems are commonly designed and optimized under the assumption that the network is the bottleneck: the network is slow and "thin", and thus needs to be avoided as much as possible. Yet this...
Article
Full-text available
Stream processing addresses the needs of real-time applications. Transaction processing addresses the coordination and safety of short atomic computations. Heretofore, these two modes of operation existed in separate, stove-piped systems. In this work, we attempt to fuse the two computational paradigms in a single system called S-Store. In this way...
Chapter
We present a new system, called Searchlight, that uniquely integrates constraint solving and data management techniques. It allows Constraint Programming (CP) machinery to run efficiently inside a DBMS without the need to extract, transform and move the data. This marriage concurrently offers the rich expressiveness and efficiency of constraint-based...
Article
Full-text available
There is a fundamental discrepancy between the targeted and actual users of current analytics frameworks. Most systems are designed for the data and infrastructure of the Googles and Facebooks of the world: petabytes of data distributed across large cloud deployments consisting of thousands of cheap commodity machines. Yet, the vast majority of us...
Article
We present a new interactive data exploration approach, called Semantic Windows (SW), in which users query for multidimensional "windows" of interest via standard DBMS-style queries enhanced with exploration constructs. Users can specify SWs using (i) shape-based properties, e.g., "identify all 3-by-3 windows", as well as (ii) content-based propert...
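The shape-based plus content-based predicate in this abstract can be pictured with a toy sketch (my own illustration; the function name and threshold are made up, and this is not the paper's Semantic Windows exploration algorithm):

```python
# Toy illustration (not the SW algorithm): enumerate all size-by-size
# windows of a 2D grid (shape-based property) and keep those whose
# average value exceeds a threshold (content-based property).
def semantic_windows(grid, size=3, min_avg=5.0):
    rows, cols = len(grid), len(grid[0])
    hits = []
    for r in range(rows - size + 1):
        for c in range(cols - size + 1):
            cells = [grid[r + i][c + j]
                     for i in range(size) for j in range(size)]
            if sum(cells) / len(cells) > min_avg:
                hits.append((r, c))  # top-left corner of a qualifying window
    return hits
```

The paper's system avoids this exhaustive enumeration; the sketch only shows what a qualifying "window" is.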
Article
The traditional wisdom for building disk-based relational database management systems (DBMS) is to organize data in heavily-encoded blocks stored on disk, with a main memory block cache. In order to improve performance given high disk latency, these systems use a multi-threaded architecture with dynamic record-level locking that allows multiple tra...
Conference Paper
Good database design is typically a very difficult and costly process. As database systems get more complex and as the amount of data under management grows, the stakes increase accordingly. Past research produced a number of design tools capable of automatically selecting secondary indexes and materialized views for a known workload. However, a si...
Article
The advent of affordable, shared-nothing computing systems portends a new class of parallel database management systems (DBMS) for on-line transaction processing (OLTP) applications that scale without sacrificing ACID guarantees [7, 9]. The performance of these DBMSs is predicated on the existence of an optimal database design that is tailored for...
Article
Full-text available
A new emerging class of parallel database management systems (DBMS) is designed to take advantage of the partitionable workloads of on-line transaction processing (OLTP) applications. Transactions in these systems are optimized to execute to completion on a single node in a shared-nothing cluster without needing to coordinate with other nodes or us...
Article
Full-text available
Query workloads and database schemas in OLAP applications are becoming increasingly complex. Moreover, the queries and the schemas have to continually evolve to address business requirements. During such repetitive transitions, the order of index deployment has to be considered while designing the physical schemas such as indexes...
Conference Paper
Array database systems are architected for scientific and engineering applications. In these applications, the value of a cell is often imprecise and uncertain. There are at least two reasons that a Monte Carlo query processing algorithm is usually required for such uncertain data. Firstly, a probabilistic graphical model must often be used to mode...
Conference Paper
Full-text available
We develop a novel method, based on the statistical concept of VC-dimension, for selecting a small representative sample from a large database. The execution of a query on the sample provides an accurate estimate for the selectivity (or cardinality of the output) of each operation in the execution of the query on the original large database. The si...
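The end-to-end idea (run the query's predicates on a small sample and read off selectivity estimates) can be sketched as follows. This toy version takes the sample size as a fixed parameter, whereas the paper's contribution is bounding the required size via VC-dimension; all names here are illustrative:

```python
import random

# Toy sample-based selectivity estimation. The paper's result is that a
# sample whose size depends on the VC-dimension of the query class (not
# on the database size) suffices for a guaranteed-accuracy estimate;
# here the size is just given.
def estimate_selectivity(table, predicate, sample_size, seed=0):
    rng = random.Random(seed)
    sample = rng.sample(table, min(sample_size, len(table)))
    return sum(1 for row in sample if predicate(row)) / len(sample)
```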
Conference Paper
This paper argues that next generation database management systems should incorporate a predictive model management component to effectively support both inward-facing applications, such as self management, and user-facing applications such as data-driven predictive analytics. We draw an analogy between model management and data management function...
Article
Full-text available
In this paper, we propose a new benchmark for scientific data management systems called SS-DB. This benchmark, loosely modeled on an astronomy workload, is intended to simulate applications that manipulate array-oriented data through relatively sophisticated user-defined functions. SS-DB is representative of the processing performed in a number of...
Article
Full-text available
Multidimensional array database systems are suited for scientific and engineering applications. Data in these applications is often uncertain and imprecise due to errors in the instruments and observations, etc. There are often correlations exhibited in the distribution of values among the cells of an array. Typically, the correlation is stronger f...
Article
Full-text available
We describe an automatic database design tool that exploits correlations between attributes when recommending materialized views (MVs) and indexes. Although there is a substantial body of related work exploring how to select an appropriate set of MVs and indexes for a given workload, none of this work has explored the effect of correlated attribut...
Article
Full-text available
Uncertain data management has received growing attention from industry and academia. Many efforts have been made to optimize uncertain databases, including the development of special index data structures. However, none of these efforts have explored primary (clustered) indexes for uncertain databases, despite the fact that clustering has the poten...
Article
Uncertain data management has received growing attention from industry and academia. Many efforts have been made to optimize uncertain databases, including the development of special index data structures. However, none of these efforts have explored primary (clustered) indexes for uncertain databases, despite the fact that clustering has the poten...
Conference Paper
Full-text available
For the past year, we have been assembling requirements from a collection of scientific data base users from astronomy, particle physics, fusion, remote sensing, oceanography, and biology. The intent has been to specify a common set of requirements for a new science data base system, which we call SciDB. In addition, we have discovered that very co...
Article
Full-text available
In CIDR 2009, we presented a collection of requirements for SciDB, a DBMS that would meet the needs of scientific users. These included a nested-array data model, science-specific operations such as regrid, and support for uncertainty, lineage, and named versions. In this paper, we present an overview of SciDB's key features and outline a demonstrat...
Article
In relational query processing, there are generally two choices for access paths when performing a predicate lookup for which no clustered index is available. One option is to use an unclustered index. Another is to perform a complete sequential scan of the table. Many analytical workloads do not benefit from the availability of unclustered indexes...
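The access-path trade-off in this abstract follows from a simple cost comparison: an unclustered index pays roughly one random I/O per matching row, while a full scan pays one sequential I/O per page. The sketch below is a textbook-style toy model with made-up cost constants, not this paper's method:

```python
# Toy access-path chooser: beyond some selectivity, the index's random
# I/Os cost more than simply scanning every page of the table.
# The cost constants (random_io, seq_io) are illustrative.
def choose_access_path(n_rows, rows_per_page, selectivity,
                       random_io=10.0, seq_io=1.0):
    index_cost = selectivity * n_rows * random_io   # one probe per match
    scan_cost = (n_rows / rows_per_page) * seq_io   # one read per page
    return "index" if index_cost < scan_cost else "scan"
```

This crossover is why many analytical workloads, which touch large fractions of a table, gain little from unclustered indexes.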
Article
Full-text available
Uncertain data arises in a number of domains, including data integration and sensor networks. Top-k queries that rank results according to some user-defined score are an important tool for exploring large uncertain data sets. As several recent papers have observed, the semantics of top-k queries on uncertain data can be ambiguous due to tradeoffs b...
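One of the candidate semantics such papers weigh can be illustrated with a toy "expected score" ranking; this is an illustrative choice on my part, not necessarily the semantics this paper adopts:

```python
# Toy "expected score" top-k over uncertain tuples: each tuple carries a
# score and an existence probability, and we rank by their product.
# Other semantics (e.g., most-probable top-k set) give different answers,
# which is exactly the ambiguity the abstract refers to.
def topk_expected(tuples, k):
    return sorted(tuples, key=lambda t: t[0] * t[1], reverse=True)[:k]
```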
Conference Paper
Full-text available
Modern database systems increasingly make use of networked storage. This storage can be in the form of SAN's or in the form of shared-nothing nodes in a cluster. One type of attack on databases is arbitrary modification of data in a database through the file system, bypassing database access control. Additionally, for many applications, ensuring st...
Chapter
Full-text available
Borealis is a distributed stream processing engine that has been developed at Brandeis University, Brown University, and MIT. It extends the first generation of data stream processing systems with advanced capabilities such as distributed operation, scalability with time-varying load, high availability against failures, and dynamic data and query m...
Article
Full-text available
This paper describes a unification of two different SQL extensions for streams and its associated semantics. We use the data models from Oracle and StreamBase as our examples. Oracle uses a time-based execution model while StreamBase uses a tuple-based execution model. Time-based execution provides a way to model simultaneity while tuple-based exec...
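The contrast between the two execution models can be sketched in a few lines (a toy illustration, not either vendor's actual semantics): tuple-based windows close after a fixed tuple count, while time-based windows group tuples by timestamp range, so simultaneous tuples stay together.

```python
# Tuple-based execution: a window fires every n tuples, regardless of
# timestamps -- simultaneous tuples may land in different windows.
def tuple_windows(stream, n):
    return [stream[i:i + n] for i in range(0, len(stream), n)]

# Time-based execution: tuples are grouped by timestamp bucket, so tuples
# with the same timestamp always share a window (models simultaneity).
def time_windows(stream, width):
    out = {}
    for ts, val in stream:
        out.setdefault(ts // width, []).append((ts, val))
    return [out[k] for k in sorted(out)]
```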
Article
Full-text available
Our previous work has shown that architectural and application shifts have resulted in modern OLTP databases increasingly falling short of optimal performance (10). In particular, the availability of multiple-cores, the abundance of main memory, the lack of user stalls, and the dominant use of stored procedures are factors that portend a clea...
Article
Full-text available
Time series data is common in many settings including scientific and financial applications. In these applications, the amount of data is often very large. We seek to support prediction queries over time series data. Prediction relies on model building which can be too expensive to be practical if it is based on a large number of data points. We pr...
Conference Paper
Full-text available
We present a replication-based approach that realizes both fast and highly-available stream processing over wide area networks. In our approach, multiple operator replicas send outputs to each downstream replica so that it can use whichever data arrives first. To further expedite the data flow, replicas run independently, possibly processing data i...
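The "use whichever data arrives first" rule amounts to deduplication by sequence number at each downstream node; here is a minimal sketch (class and method names are my own, and real systems must also handle out-of-order processing, which this toy version ignores):

```python
# Toy first-arrival merger: the downstream node keeps the first tuple it
# sees for each sequence number and discards later copies from slower
# replicas, so latency tracks the fastest replica for each tuple.
class FirstArrivalMerger:
    def __init__(self):
        self.seen = {}

    def receive(self, seq, value):
        if seq not in self.seen:
            self.seen[seq] = value
            return value      # first arrival: forward downstream
        return None           # duplicate from a slower replica: drop

    def output(self):
        return [self.seen[s] for s in sorted(self.seen)]
```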
Conference Paper
Full-text available
Scientific and intelligence applications have special data handling needs. In these settings, data does not fit the standard model of short coded records that had dominated the data management area for three decades. Array database systems have a specialized architecture to address this problem. Since the data is typically an approximation of reali...
Conference Paper
Full-text available
Borealis-R is a replication-based system for both fast and highly-available processing of data streams over wide-area networks. In Borealis-R, multiple operator replicas send outputs to downstream replicas, allowing each replica to use whichever data arrives first. To further reduce latency, replicas run without coordination, possibly processing da...
Conference Paper
Full-text available
We present a replication-based approach that enables both fast and reliable stream processing over wide area networks. Our approach replicates stream processing operators in a manner where operator replicas compete with each other to make the earliest impact. Therefore, any processing downstream from such replicas can proceed by relying on the fast...
Conference Paper
Full-text available
Networked information systems require strong security guarantees because of the new threats that they face. Various forms of encryption have been proposed to deal with this problem. In a database system, there are often two contradictory goals: security of the encryption and fast performance of queries. There have been a number of proposals of data...
Conference Paper
Full-text available
We present a collaborative, self-configuring high availability (HA) approach for stream processing that enables low-latency failure recovery while incurring small run-time overhead. Our approach relies on a novel fine-grained checkpointing model that allows query fragments at each server to be backed up at multiple other servers and recovered colle...
Conference Paper
Full-text available
In distributed stream processing environments, large numbers of continuous queries are distributed onto multiple servers. When one or more of these servers become overloaded due to bursty data arrival, excessive load needs to be shed in order to preserve low latency for the query results. Because of the load dependencies among the servers, load she...
Conference Paper
Full-text available
As more sensitive data is captured in electronic form, security becomes more and more important. Data encryption is the main technique for achieving security. While in the past enterprises were hesitant to implement database encryption because of the very high cost, complexity, and performance degradation, they now have to face the ever-growing ris...
Article
Full-text available
Two years ago, some of us wrote a paper predicting the demise of "One Size Fits All (OSFA)" (Sto05a). In that paper, we examined the stream processing and data warehouse markets and gave reasons for a substantial performance advantage to specialized architectures in both markets. Herein, we make three additional contributions. First, we present rea...
Conference Paper
Full-text available
Data stream management systems may be subject to higher input rates than their resources can handle. When overloaded, the system must shed load in order to maintain low-latency query results. In this paper, we describe a load shedding technique for queries consisting of one or more aggregate operators with sliding windows. We introduce a new type o...
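The key property of window-aware shedding is that whole windows, not individual tuples, are dropped, so every aggregate that does get emitted is exact. The sketch below is a toy illustration of that property, not the paper's operator:

```python
import random

# Toy window-level shedding: drop each window with probability p before
# aggregation. Surviving windows produce exact aggregates, in contrast
# to tuple-level drops, which would perturb every window's result.
def shed_and_aggregate(windows, p, agg=sum, seed=0):
    rng = random.Random(seed)
    return [agg(w) for w in windows if rng.random() >= p]
```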
Conference Paper
Full-text available
Data stream processing systems have become ubiquitous in academic [1, 2, 5, 6] and commercial [11] sectors, with application areas that include financial services, network traffic analysis, battlefield monitoring and traffic control [3]. The append-only model of streams implies that input data is immutable and therefore always correct. But in pract...
Conference Paper
Full-text available
Overload management has been an important problem for large-scale dynamic systems. In this paper, we study this problem in the context of our Borealis distributed stream processing system. We show that server nodes must coordinate in their load shedding decisions to achieve global control on output quality. We describe a distributed load shedding a...
Conference Paper
Full-text available
Scalability in stream processing systems can be achieved by using a cluster of computing devices. The processing burden can, thus, be distributed among the nodes by partitioning the query graph. The specific operator placement plan can have a huge impact on performance. Previous work has focused on how to move query operators dynamically in reactio...
Article
Applications that require real-time processing of high-volume data streams are pushing the limits of traditional data processing infrastructures. These stream-based applications include market feed processing and electronic trading on Wall Street, network and infrastructure monitoring, fraud detection, and command and control in military environment...
Conference Paper
Full-text available
Distributed and parallel computing environments are becoming cheap and commonplace. The availability of large numbers of CPU's makes it possible to process more data at higher speeds. Stream-processing systems are also becoming more important, as broad classes of applications require results in real-time. Since load can vary in unpredictable ways,...
Article
Full-text available
This article discusses the database research directions identified at the Lowell meeting: integration of text, data, and code; fusion of information from heterogeneous data sources; and information privacy. The object-oriented (OO) and object-relational (OR) database management systems (DBMS) showed how text and other data types can be added to a DBMS. Several goals ment...
Article
Full-text available
Efficient query processing across a wide-area network requires network awareness, i.e., tracking and leveraging knowledge of network characteristics when making optimization decisions. This paper summarizes our work on network-aware query processing techniques for widely-distributed, large-scale stream-processing applications. We first discuss theop...
Conference Paper
Full-text available
Stream-processing systems are designed to support an emerging class of applications that require sophisticated and timely processing of high-volume data streams, often originating in distributed environments. Unlike traditional data-processing applications that require precise recovery for correctness, many stream-processing applications can tolera...
Conference Paper
Full-text available
Borealis is a second-generation distributed stream processing engine that is being developed at Brandeis University, Brown University, and MIT. Borealis inherits core stream processing functionality from Aurora [14] and distribution functionality from Medusa [51]. Borealis modifies and extends both systems in non-trivial and critical ways to pro...
Conference Paper
Full-text available
Borealis is a distributed stream processing engine that is being developed at Brandeis University, Brown University, and MIT. Borealis inherits core stream processing functionality from Aurora and inter-node communication functionality from Medusa. We propose to demonstrate some of the key aspects of distributed operation in Borealis, using a multi-...
Conference Paper
Full-text available
This paper presents the design of a read-optimized relational DBMS that contrasts sharply with most current systems, which are write-optimized. Among the many differences in its design are: storage of data by column rather than by row, careful coding and packing of objects into storage including main memory during query processing, storing an overl...
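The core storage difference the abstract describes can be shown in miniature (a toy illustration of row- vs. column-oriented layout, not the system's actual storage format or compression scheme):

```python
# Toy row-store vs. column-store layout: a read-mostly scan that touches
# one attribute only needs to read that attribute's column, instead of
# pulling every full row off disk.
rows = [(1, "alice", 10.0), (2, "bob", 20.0), (3, "carol", 30.0)]

def to_columns(rows):
    # Pivot a list of fixed-width tuples into one list per attribute.
    return {i: [r[i] for r in rows] for i in range(len(rows[0]))}

columns = to_columns(rows)
total = sum(columns[2])  # scans only the third column, not whole rows
```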
Article
Full-text available
This experience paper summarizes the key lessons we learned throughout the design and implementation of the Aurora stream-processing engine. For the past 2 years, we have built five stream-based applications using Aurora. We first describe in detail these applications and their implementation in Aurora. We then reflect on the design of Aurora based...
Article
Full-text available
Stream-processing systems are designed to support an emerging class of applications that require sophisticated and timely processing of high-volume data streams, often originating in a distributed environment. Work in stream processing has so far focused primarily on stream-oriented languages and resource-constrained, one-pass query processing. Hig...
Article
Full-text available
This paper provides an overview of these problems, examines why traditional relational systems are inadequate to deal with them, and identifies a new class of data processing currently known as Data Stream Management Systems (DSMS) designed to handle them. We will provide an overview of the current status of the field, as well as focus on some prot...
Conference Paper
Full-text available
The military is working on embedding sensors in a "smart uniform" that will monitor key biological parameters to determine the physiological status of a soldier. The soldier's status can only be determined accurately by combining the readings from several sensors using sophisticated physiological models. Unfortunately, the physical environment and...
Article
Recently, significant efforts have focused on developing novel data-processing systems to support a new class of applications that commonly require sophisticated and timely processing of high-volume data streams. Early work in stream processing has primarily focused on streamoriented languages and resource-constrained, one-pass query-processing. Hi...
Article
Full-text available
A group of senior database researchers gathers every few years to assess the state of database research and to point out problem areas that deserve additional focus. This report summarizes the discussion and conclusions of the sixth ad-hoc meeting held May 4-6, 2003 in Lowell, Mass. It observes that information management continues to be a critical...
Conference Paper
A Data Stream Manager accepts push-based inputs from a set of data sources, processes these inputs with respect to a set of standing queries, and produces outputs based on Quality-of-Service (QoS) specifications. When input rates exceed system capacity, the system will become overloaded and latency will deteriorate. Under these conditions, the syst...
Article
Full-text available
This paper describes the basic processing model and architecture of Aurora, a new system to manage data streams for monitoring applications. Monitoring applications differ substantially from conventional business data processing. The fact that a software system must process and react to continual inputs from many sources (e.g., sensors) rather than...
Article
This chapter reveals that many stream-based applications have sophisticated data processing requirements and real-time performance expectations that need to be met under high-volume, time-varying data streams. To address these challenges, this chapter proposes novel operator scheduling approaches that specify: which operators to schedule; in whic...
Article
This paper describes the basic processing model and architecture of Aurora, a new system to manage data streams for monitoring applications. Monitoring applications differ substantially from conventional business data processing. The fact that a software system must process and react to continual inputs from many sources (e.g., sensors) rather than from human o...
Article
A Data Stream Manager accepts push-based input from a set of data sources, processes these inputs with respect to a set of standing queries, and produces outputs based on Quality-of-Service (QoS) specifications. When input rates exceed system capacity, the system will become overloaded and latency will deteriorate. Under these conditions, the system...
Conference Paper
Full-text available
Modern distributed information systems cope with disconnection and limited bandwidth by using caches. In communication-constrained situations, traditional demand-driven approaches are inadequate. Instead, caches must be preloaded in order to mitigate the absence of connectivity or the paucity of bandwidth. We propose to use application-level knowle...