Mehul A. Shah’s research while affiliated with the University of California, Berkeley and other places


Publications (18)


Figures: Example Dataflow · Dataflow Pairs Normal Processing · Flux design and normal case protocol
Highly Available, Fault-Tolerant, Parallel Dataflows
  • Conference Paper
  • Full-text available

June 2004 · 169 Reads · 211 Citations

Mehul A. Shah · Joseph M. Hellerstein

We present a technique that masks failures in a cluster to provide high availability and fault-tolerance for long-running, parallelized dataflows. We can use these dataflows to implement a variety of continuous query (CQ) applications that require high-throughput, 24x7 operation. Examples include network monitoring, phone call processing, click-stream processing, and online financial analysis. Our main contribution is a scheme that carefully integrates traditional query processing techniques for partitioned parallelism with the process-pairs approach for high availability. This delicate integration allows us to tolerate failures of portions of a parallel dataflow without sacrificing result quality. Upon failure, our technique provides quick fail-over, and automatically recovers the lost pieces on the fly. This piecemeal recovery provides minimal disruption to the ongoing dataflow computation and improved reliability as compared to the straightforward application of the process-pairs technique on a per-dataflow basis. Thus, our technique provides the high availability necessary for critical CQ applications. Our techniques are encapsulated in a reusable dataflow operator called Flux, an extension of the Exchange operator that is used to compose parallel dataflows. Encapsulating the fault-tolerance logic into Flux minimizes modifications to existing operator code and relieves the operator writer of the burden of repeatedly implementing and verifying this critical logic. We present experiments illustrating these features with an implementation of Flux in the TelegraphCQ code base [8].
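The core process-pairs idea from the abstract can be sketched in a few lines: each partition of a stateful operator runs as a primary/secondary pair, both consuming every input, so the secondary's state is ready the instant the primary fails. This is a minimal illustrative sketch with made-up class names, not the paper's Flux implementation (which also handles recovery of lost pieces and in-flight message ordering).

```python
class CountingOperator:
    """A stateful operator: counts tuples per key (e.g. a grouping op)."""
    def __init__(self):
        self.counts = {}

    def process(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]

class ReplicatedPartition:
    """One partition run as a process pair: primary plus hot standby."""
    def __init__(self):
        self.primary = CountingOperator()
        self.secondary = CountingOperator()
        self.primary_alive = True

    def ingest(self, key):
        # Both replicas consume every input, so their state stays in lock-step.
        if self.primary_alive:
            result = self.primary.process(key)
            self.secondary.process(key)     # standby kept warm
            return result
        return self.secondary.process(key)  # fail-over path: state intact

    def fail_primary(self):
        self.primary_alive = False          # simulate a crash

pair = ReplicatedPartition()
for k in ["a", "b", "a"]:
    pair.ingest(k)
pair.fail_primary()
print(pair.ingest("a"))  # secondary continues with intact state: 3
```

Because fail-over is per partition, only the failed piece switches replicas; the rest of the parallel dataflow is undisturbed, which is what gives the "piecemeal" recovery described above.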



Flux: An Adaptive Partitioning Operator for Continuous Query Systems.

January 2003 · 140 Reads · 328 Citations

The long-running nature of continuous queries poses new scalability challenges for dataflow processing. CQ systems execute pipelined dataflows that may be shared across multiple queries. The scalability of these dataflows is limited by their constituent stateful operators, e.g., windowed joins or grouping operators. To scale such operators, a natural solution is to partition them across a shared-nothing platform. But in the CQ context, traditional, static techniques for partitioned parallelism can exhibit detrimental imbalances as workload and runtime conditions evolve. Long-running CQ dataflows must continue to function robustly in the face of these imbalances. To address this challenge, we introduce a dataflow operator called Flux that encapsulates adaptive state partitioning and dataflow routing. Flux is placed between producer-consumer stages in a dataflow pipeline to repartition stateful operators while the pipeline is still executing. We present the Flux architecture, along with repartitioning policies that can be used for CQ operators under shifting processing and memory loads. We show that the Flux mechanism and these policies can provide a several-factor improvement in throughput and orders-of-magnitude improvement in average latency over the static case.
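The repartitioning idea above hinges on keeping operator state in units finer than one-per-node, so a unit can migrate between nodes while the stream keeps flowing. This sketch (illustrative names only, not the paper's code) hashes keys into a fixed number of buckets, routes through a mutable routing table, and moves one bucket's state at a time:

```python
# State for a partitioned grouping operator is kept per hash bucket, so a
# bucket can move from an overloaded node to a lightly loaded one mid-stream;
# the routing table is flipped in the same step.
NUM_BUCKETS = 8

class Node:
    def __init__(self, name):
        self.name = name
        self.buckets = {}   # bucket id -> {key: count}

    def process(self, bucket, key):
        state = self.buckets.setdefault(bucket, {})
        state[key] = state.get(key, 0) + 1

nodes = {"A": Node("A"), "B": Node("B")}
routing = {b: "A" for b in range(NUM_BUCKETS)}  # all buckets start on A

def route(key):
    b = hash(key) % NUM_BUCKETS
    nodes[routing[b]].process(b, key)

def repartition(bucket, dst):
    """Move one bucket's state to dst, then update the routing entry."""
    src = nodes[routing[bucket]]
    nodes[dst].buckets[bucket] = src.buckets.pop(bucket, {})
    routing[bucket] = dst

for key in ["x", "y", "x", "z"]:
    route(key)

# Node A is overloaded: migrate half the buckets to node B on the fly.
for b in range(NUM_BUCKETS // 2):
    repartition(b, "B")

for key in ["x", "y"]:
    route(key)   # new tuples follow the updated routing table

total = sum(c for n in nodes.values()
              for s in n.buckets.values() for c in s.values())
print(total)  # all 6 tuples accounted for; none lost in the move
```

A real Flux operator must additionally pause and buffer in-flight tuples for the bucket being moved; the sketch elides that by migrating between inputs. The many-buckets-per-node layout is what makes fine-grained load balancing possible.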


Figures: PostgreSQL Architecture · TelegraphCQ Architecture
TelegraphCQ: An Architectural Status Report

January 2003 · 693 Reads · 91 Citations

We are building TelegraphCQ, a system to process continuous queries over data streams. Although we had implemented some parts of this technology in earlier Java-based prototypes, our experiences were not positive. As a result, we decided to use PostgreSQL, an open-source RDBMS, as the starting point for our new implementation. In March 2003, we completed an alpha milestone of TelegraphCQ. In this paper, we report on the development status of our project, with a focus on architectural issues. Specifically, we describe our experiences extending a traditional DBMS towards managing data streams, and give an overview of the current early-access release of the system.




TelegraphCQ: Continuous Dataflow Processing for an Uncertain World

December 2002 · 371 Reads · 83 Citations

Increasingly pervasive networks are leading towards a world where data is constantly in motion. In such a world, conventional techniques for query processing, which were developed under the assumption of a far more static and predictable computational environment, will not be sufficient. Instead, query processors based on adaptive dataflow will be necessary. The Telegraph project has developed a suite of novel technologies for continuously adaptive query processing. The next generation Telegraph system, called TelegraphCQ, is focused on meeting the challenges that arise in handling large streams of continuous queries over high-volume, highly-variable data streams. In this paper, we describe the system architecture and its underlying technology, and report on our ongoing implementation effort, which leverages the PostgreSQL open source code base. We also discuss open issues and our research agenda.


Continuously Adaptive Continuous Queries over Streams

April 2002 · 183 Reads · 516 Citations

We present a continuously adaptive, continuous query (CACQ) implementation based on the eddy query processing framework. We show that our design provides significant performance benefits over existing approaches to evaluating continuous queries, not only because of its adaptivity, but also because of the aggressive cross-query sharing of work and space that it enables. By breaking the abstraction of shared relational algebra expressions, our Telegraph CACQ implementation is able to share physical operators -- both selections and join state -- at a very fine grain. We augment these features with a grouped-filter index to simultaneously evaluate multiple selection predicates. We include measurements of the performance of our core system, along with a comparison to existing continuous query approaches.
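The grouped-filter index mentioned above evaluates many queries' selection predicates with a single probe per tuple rather than one test per query. This is a toy sketch of that sharing idea for range predicates over one attribute (illustrative, not the Telegraph code): predicate boundaries are sorted once, every elementary segment is tagged with the queries it satisfies, and each arriving value needs only one binary search.

```python
import bisect

class GroupedFilter:
    def __init__(self, predicates):
        """predicates: {query_id: (lo, hi)} meaning lo <= x < hi."""
        points = sorted({p for lo, hi in predicates.values() for p in (lo, hi)})
        self.bounds = points
        # Precompute, for each elementary segment [points[i], points[i+1]),
        # the set of queries whose predicate the segment satisfies.
        self.matches = []
        for i in range(len(points) - 1):
            seg_lo = points[i]
            self.matches.append({q for q, (lo, hi) in predicates.items()
                                 if lo <= seg_lo < hi})

    def probe(self, x):
        """Return the ids of all queries whose predicate x satisfies."""
        i = bisect.bisect_right(self.bounds, x) - 1
        if 0 <= i < len(self.matches):
            return self.matches[i]
        return set()

gf = GroupedFilter({"q1": (0, 10), "q2": (5, 15), "q3": (8, 9)})
print(sorted(gf.probe(8)))   # ['q1', 'q2', 'q3']
print(sorted(gf.probe(12)))  # ['q2']
```

With Q queries, a probe costs O(log Q) instead of the O(Q) of testing each predicate separately, which is the payoff of sharing the filter work across continuous queries.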


Java support for data-intensive systems

December 2001 · 8 Reads · 20 Citations

ACM SIGMOD Record

Database system designers have traditionally had trouble with the default services and interfaces provided by operating systems. In recent years, developers and enthusiasts have increasingly promoted Java as a serious platform for building data-intensive servers. Java provides a number of very helpful language features, as well as a full run-time environment reminiscent of a traditional operating system. This combination of features and community support raises the question of whether Java is better or worse at supporting data-intensive server software than a traditional operating system coupled with a weakly-typed language such as C or C++. In this paper, we summarize and discuss our experience building the Telegraph dataflow system in Java. We highlight some of the pleasures of coding with Java, and some of the pains of coding around Java in order to obtain good performance in a data-intensive server. For those issues that were painful, we present concrete suggestions for evolving Java's interfaces to better suit serious software systems development. We believe these experiences can provide insight for other designers to avoid pitfalls we encountered and to decide if Java is a suitable platform for their system.


Java Support for Data-Intensive Systems: Experiences Building the Telegraph Dataflow System

November 2001 · 38 Reads · 40 Citations

ACM SIGMOD Record

Database system designers have traditionally had trouble with the default services and interfaces provided by operating systems. In recent years, developers and enthusiasts have increasingly promoted Java as a serious platform for building data-intensive servers. Java provides a number of very helpful language features, as well as a full run-time environment reminiscent of a traditional operating system. This combination of features and community support raises the question of whether Java is better or worse at supporting data-intensive server software than a traditional operating system coupled with a weakly-typed language such as C or C++. In this paper, we summarize and discuss our experience building the Telegraph dataflow system in Java. We highlight some of the pleasures of coding with Java, and some of the pains of coding around Java in order to obtain good performance in a data-intensive server. For those issues that were painful, we present concrete suggestions for evolving Java's interfaces to better suit serious software systems development. We believe these experiences can provide insight for other designers to avoid pitfalls we encountered and to decide if Java is a suitable platform for their system.


Citations (15)


... single-threaded implementation [24]. We now describe the module functionality in detail. ...

Reference:

Using state modules for adaptive query processing
Java support for data-intensive systems
  • Citing Article
  • December 2001

ACM SIGMOD Record

... Instead, they report all the matches of the queries, including the ones corresponding to redundant predicted intervals. In addition, complicated query processors such as XML document filters [3,8,17,33] and data stream management systems (DSMS) [1,4,36] are not applicable to our problem, either. XML document filters usually specify queries as XPath expressions [13] which have no time-bound operators and cannot well express episodes. ...

Fault-tolerant, Load-balancing Queries in Telegraph.
  • Citing Conference Paper
  • June 2001

ACM SIGMOD Record

... Since the beginning of the current century, there has been a plethora of early work [8][9][10] addressing the challenging problems of event processing. Our work has been largely inspired by PIPES [11] and its extension JEPC [2], one of the early approaches for event processing to support implementations of operators using hardware accelerators. ...

TelegraphCQ: Continuous Dataflow Processing.
  • Citing Conference Paper
  • January 2003

... Currently, there are several DSMS prototypes available such as Aurora (Borealis) (Abadi et al. 2005) (Abadi et al. 2003), STREAM (Arasu et al. 2003), TelegraphCQ (Chandrasekaran et al. 2003), PIPES (Krämer and Seeger 2009), and SPADE (Gedik et al. 2008). ...

TelegraphCQ: Continuous Dataflow Processing for an Uncertain World

... The disadvantage of the shuffle strategy is that load balance may not be achieved. Other existing SPS offer hash-based data partition [11], partialkey based [12], or executor-centric [13] solutions to deal with the distribution problem. ...

Flux: An Adaptive Partitioning Operator for Continuous Query Systems.
  • Citing Conference Paper
  • January 2003

... In FPDB, pushdown tasks are managed through adaptive pushdown to improve resource efficiency. Adaptive Query Processing: There is a rich literature on the topic of adaptive query processing [46,51,52], which adjusts the query execution dynamically based on more accurate runtime statistics. Examples include re-optimization techniques [53,81] and memory adaptive operators [63,64,86]. ...

Adaptive Query Processing: Technology in Evolution.
  • Citing Article
  • January 2000

... It provides a Continuous Query Language (CQL) for constructing continuous queries against streams and updatable relations [15]. TelegraphCQ offers an adaptive continuous query engine that adjusts the processing during run-time and applies shared processing where possible [16]. NiagaraCQ splits continuous queries into smaller queries and groups queries with the same expression signature together. ...

TelegraphCQ: An Architectural Status Report

... Examples of operator states include keeping some aggregation or summary of the received tuples in memory or keeping a state machine for detecting patterns (for example, for fraudulent financial transactions) in memory. A conventional approach is replication [53], [54], which uses backup nodes to process the same stream in parallel with the primary set of nodes, and the inputs are sent to both. The system then automatically switches over to the secondary set of nodes upon failures. ...

Highly Available, Fault-Tolerant, Parallel Dataflows

... Some systems such as Millwheel [27] and Dataflow [28] choose to separate state from the application logic. They have the state centralized in a remote storage [22,32,33] (e.g., a database management system, HDFS or GFS) shared among applications, periodically checkpointing it for fault tolerance. Using external storage can scale well to large distributed states, but it significantly increases latency in the critical path of stream processing. ...

TelegraphCQ: Continuous Dataflow Processing for an Uncertain World

... The exact ratio of sequential to random accesses depends on the disk drives and the OS overhead, and we will assume a ratio of 14:1 as a conversion ratio representative of current technology. Note that this test cannot be reversed: failing this criterion does not necessarily mean that a workload ... This test assumes that total execution time of the workload under consideration is dominated by page access cost. ... Using Seagate Barracuda ultra-wide SCSI-2 drives, [19] measures a throughput of ca. ...

amdb: An Access Method Debugging Tool

ACM SIGMOD Record