Joseph M. Hellerstein’s research while affiliated with University of California, Berkeley and other places


Publications (148)


PACMMOD Volume 2 Issue 4: Editorial
  • Article

September 2024 · Proceedings of the ACM on Management of Data

Divyakant Agrawal · Azza Abouzied · Joseph M. Hellerstein

We are pleased to present the 4th issue of Volume 2 of PACMMOD. This issue contains papers that were submitted to the SIGMOD research track in January 2024. Papers accepted in this issue have been invited for presentation in the research track of the ACM SIGMOD Conference on Management of Data 2025, to be held in Berlin, Germany. We look forward to seeing you in Berlin from the 22nd to the 27th June, 2025.


CLX: Towards verifiable PBE data transformation

March 2019 · 178 Reads · 7 Citations

Zhongjun Jin · Michael Cafarella · H. V. Jagadish · [...] · Joseph M. Hellerstein

Effective data analytics on data collected from the real world usually begins with a notoriously expensive pre-processing step of data transformation and wrangling. Human-in-the-loop tools have been proposed to speed up the process of data transformation, using the Programming By Example (PBE) approach. However, two important usability issues limit the effective use of such PBE data transformation systems: (1) the cost of user effort grows quickly as the volume or heterogeneity of the raw data increases (prohibitive user effort), and (2) the underlying process of transformation is opaque to the user and hence difficult to validate, correct and debug (incomprehensibility). In this project, we propose CLX (pronounced "clicks"), a new PBE data transformation paradigm for data normalization, to address these two issues. For the issue of prohibitive user effort, we present a pattern profiling algorithm that hierarchically clusters the input raw data based on format structures, helping the user quickly identify both well-formatted and ill-formatted data and specify the desired format. After the desired transformation logic is inferred, CLX explains it as a set of simple regular expression replacement operations to improve comprehensibility. We experimentally compared the CLX prototype with FlashFill, a state-of-the-art data transformation tool. The results show improvements over the state of the art in saving user effort and enhancing comprehensibility, without loss of efficiency or expressive power. In a user effort study on data sets of various sizes, when the data size grew by a factor of 30, the user effort required by the CLX prototype grew 1.2x whereas that required by FlashFill grew 9.1x. In another test assessing the users' understanding of the transformation logic, the CLX users achieved a success rate about twice that of the FlashFill users.
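
The comprehensibility claim rests on explaining the inferred transformation as plain regular-expression replacements. The sketch below is a hypothetical illustration of that idea, not CLX's actual code or output: a normalization expressed as an ordered list of regex replace operations that a user can read and verify one step at a time (the phone-number example and operation list are invented for illustration).

```python
import re

# Hypothetical example: normalize phone numbers into the form "(XXX) XXX-XXXX".
# Each step is a plain regular-expression replacement, so a user can inspect
# and validate the transformation logic one operation at a time.
replacement_ops = [
    (r"[.\-\s]", ""),                             # strip separators: "510.642.1234" -> "5106421234"
    (r"^(\d{3})(\d{3})(\d{4})$", r"(\1) \2-\3"),  # regroup the digits into the target format
]

def normalize(value: str) -> str:
    for pattern, repl in replacement_ops:
        value = re.sub(pattern, repl, value)
    return value

print(normalize("510.642.1234"))  # (510) 642-1234
print(normalize("510-642-1234"))  # (510) 642-1234
```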


Figure 1: Postgres features first mentioned in the 1986 paper* and the 1991 paper†.
Looking Back at Postgres
  • Preprint
  • File available

January 2019 · 231 Reads

This is a recollection of the UC Berkeley Postgres project, which was led by Mike Stonebraker from the mid-1980's to the mid-1990's. The article was solicited for Stonebraker's Turing Award book, as one of many personal/historical recollections. As a result it focuses on Stonebraker's design ideas and leadership. But Stonebraker was never a coder, and he stayed out of the way of his development team. The Postgres codebase was the work of a team of brilliant students and the occasional university "staff programmers" who had little more experience (and only slightly more compensation) than the students. I was lucky to join that team as a student during the latter years of the project. I got helpful input on this writeup from some of the more senior students on the project, but any errors or omissions are mine. If you spot any such, please contact me and I will try to fix them.


Making Sense of Asynchrony in Interactive Data Visualizations

June 2018 · 50 Reads

Asynchronous interfaces allow users to concurrently issue requests while existing ones are processed. While such interfaces are widely used to support non-blocking input under latency, it is not clear whether users can take advantage of asynchrony while the data is updating, since the UI updates dynamically and the changes can be hard to interpret. Interactive data visualization presents an interesting context for studying the effects of asynchronous interfaces, since interactions are frequent, task latencies can vary widely, and results often require interpretation. In this paper, we study the effects of introducing asynchrony into interactive visualizations, under different latencies, and with different tasks. We observe that traditional asynchronous interfaces, where results update in place, induce users to wait for the result before interacting, not taking advantage of the asynchronous rendering of the results. However, when results are rendered cumulatively over the recent history, users perform asynchronous interactions and get faster task completion times.



Indy: a software system for the dense cloud

September 2017 · 48 Reads

Early iterations of datacenter-scale computing were a reaction to the expensive multiprocessors and supercomputers of their day. They were built on clusters of commodity hardware, which at the time were packages with 2--4 CPUs. However, as datacenter-scale computing has matured, cloud vendors have provided denser, more powerful hardware. Today's cloud infrastructure aims to deliver not only reliable and cost-effective computing, but also excellent performance.


Figure 1: System design for a tweening-based query interface. 
Figure 2: Micro-operations in the Visual Grammar 
DataTweener: a demonstration of a tweening engine for incremental visualization of data transforms

August 2017 · 50 Reads · 1 Citation

Proceedings of the VLDB Endowment

With the development and advancement of new data interaction modalities, data exploration and analysis has become a highly interactive process situating the user in a session of successive queries. With rapidly changing results, it becomes difficult for the end user to fully comprehend transformations, especially the transforms corresponding to complex queries. We introduce "data tweening" as an informative way of visualizing structural data transforms, presenting the users with a series of incremental visual representations of a resultset transformation. We present transformations as ordered sequences of basic structural transforms and visual cues. The sequences are generated using an automated framework which utilizes differences between the consecutive resultsets and queries in a query session. We evaluate the effectiveness of tweening as a visualization method through a user study.


Figure 10: User rating for pivot tweening 
Data tweening: incremental visualization of data transforms

February 2017 · 207 Reads · 26 Citations

Proceedings of the VLDB Endowment

In the context of interactive query sessions, it is common to issue a succession of queries, transforming a dataset to the desired result. It is often difficult to comprehend a succession of transformations, especially for complex queries. Thus, to facilitate understanding of each data transformation and to provide continuous feedback, we introduce the concept of "data tweening", i.e., interpolating between resultsets, presenting to the user a series of incremental visual representations of a resultset transformation. We present tweening methods that consider not just the changes in the result, but also the changes in the query. Through user studies, we show that data tweening allows users to efficiently comprehend data transforms, and also enables them to gain a better understanding of the underlying query operations.
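
As a rough illustration of the idea (not the paper's implementation), the sketch below decomposes a resultset transformation into an ordered sequence of small structural transforms; rendering each intermediate result is what the paper calls a tween. The data, step labels, and functions are made up for the example.

```python
# Hypothetical sketch: present a query transition as an ordered sequence of
# small structural transforms, showing each intermediate resultset instead
# of jumping straight from the old result to the new one.
rows = [
    {"city": "Berlin", "year": 2024, "sales": 120},
    {"city": "Austin", "year": 2024, "sales": 90},
    {"city": "Berlin", "year": 2023, "sales": 80},
]

tween_steps = [
    ("filter rows: year == 2024", lambda rs: [r for r in rs if r["year"] == 2024]),
    ("drop column: year",         lambda rs: [{k: v for k, v in r.items() if k != "year"} for r in rs]),
    ("sort by: sales desc",       lambda rs: sorted(rs, key=lambda r: r["sales"], reverse=True)),
]

result = rows
for label, step in tween_steps:
    result = step(result)
    print(label, "->", result)  # each intermediate frame would be rendered in the UI
```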


High performance transactions via early write visibility

January 2017 · 53 Reads · 108 Citations

Proceedings of the VLDB Endowment

In order to guarantee recoverable transaction execution, database systems permit a transaction's writes to be observable only at the end of its execution. As a consequence, there is generally a delay between the time a transaction performs a write and the time later transactions are permitted to read it. This delayed write visibility can significantly impact the performance of serializable database systems by reducing concurrency among conflicting transactions. This paper makes the observation that delayed write visibility stems from the fact that database systems can arbitrarily abort transactions at any point during their execution. Accordingly, we make the case for database systems which only abort transactions under a restricted set of conditions, thereby enabling a new recoverability mechanism, early write visibility, which safely makes transactions' writes visible prior to the end of their execution. We design a new serializable concurrency control protocol, piece-wise visibility (PWV), with the explicit goal of enabling early write visibility. We evaluate PWV against state-of-the-art serializable protocols and a highly optimized implementation of read committed, and find that PWV can outperform serializable protocols by an order of magnitude and read committed by 3X on high contention workloads.
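
A toy sketch of the core idea, with assumed names and a deliberately simplified setting (no real concurrency control, recovery, or protocol details from the paper): a transaction exposes a write as soon as it has passed the point at which it could still abort, so a later transaction can read that write before the first transaction finishes.

```python
import threading

# Toy illustration of early write visibility (names and structure are assumed):
# once a transaction passes the last point at which it could abort, its write is
# made visible immediately instead of being held back until commit.
store = {"balance": 100}
write_visible = threading.Event()   # signals that T1's write may be read

def t1_transfer():
    # ... validation / logic that could still abort T1 runs here ...
    store["balance"] -= 30          # the write itself
    write_visible.set()             # early write visibility: expose it before T1 completes
    # ... remaining pieces of T1 continue executing ...

def t2_read():
    write_visible.wait()            # T2 reads T1's write without waiting for T1 to finish
    print("T2 sees balance:", store["balance"])

for t in (threading.Thread(target=t1_transfer), threading.Thread(target=t2_read)):
    t.start()
```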


ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

October 2016 · 94 Reads · 6 Citations

Real-time predictive applications can demand continuous and agile development, with new models constantly being trained, tested, and then deployed. Training and testing are done by replaying stored event logs, running new models in the context of historical data in a form of backtesting or "what if?" analysis. To replay weeks or months of logs while developers wait, we need systems that can stream event logs through prediction logic many times faster than the real-time rate. A challenge with high-speed replay is preserving sequential semantics while harnessing parallel processing power. The crux of the problem lies with causal dependencies inherent in the sequential semantics of log replay. We introduce an execution engine that produces serial-equivalent output while accelerating throughput with pipelining and distributed parallelism. This is made possible by optimizing for high throughput rather than the traditional stream processing goal of low latency, and by aggressive sharing of versioned state, a technique we term Multi-Versioned Parallel Streaming (MVPS). In experiments we see that this engine, which we call ReStream, performs as well as batch processing and more than an order of magnitude better than a single-threaded implementation.
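
A hypothetical sketch of the versioned-state idea behind MVPS (the class and method names are invented for illustration): each write to shared state is tagged with the log position that produced it, so a replay worker reads the value "as of" the event it is processing, and the overall output matches what a serial replay would have produced.

```python
from bisect import bisect_right

# Hypothetical multi-versioned cell of shared state: writes are stored with the
# log position that produced them, and reads return the value as of a position,
# so out-of-order parallel workers still observe serial-equivalent state.
class VersionedCell:
    def __init__(self):
        self.positions = []   # sorted log positions of writes
        self.values = []      # values written at those positions

    def write(self, position, value):
        i = bisect_right(self.positions, position)
        self.positions.insert(i, position)
        self.values.insert(i, value)

    def read_as_of(self, position):
        i = bisect_right(self.positions, position)
        return self.values[i - 1] if i else None

counter = VersionedCell()
counter.write(10, 1)
counter.write(30, 2)
print(counter.read_as_of(20))  # 1 -- the value a serial replay would see at log position 20
```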


Citations (80)


... SmSa can benefit from a transaction management solution to provide stronger application correctness guarantees. Existing works have proposed a myriad of methods to enforce the order of concurrent tasks [15,24] and to converge to a consistent state in the presence of disorder [22,29,33,37]. Snapper [24], one of these solutions, supports ACID transactional properties for multi-actor operations through performant deterministic concurrency control, making it a good fit to integrate SmSa with. ...

Reference:

Rethinking State Management in Actor Systems for Cloud-Native Applications
Anna: A KVS for Any Scale
  • Citing Conference Paper
  • April 2018

... The first group is rooted in a paradigm called programming-byexample (PBE) [25,50,52,58,59,92,93,109], where the goal is to synthesize a program that manipulates a given input to get a given output. To do so, methods design different search spaces (operators to be applied over the input) and apply different search algorithms. ...

CLX: Towards verifiable PBE data transformation
  • Citing Conference Paper
  • March 2019

... In these situations, a user does not have the time nor the resources to discern the meaning of every data point and make the proper changes; they need to be provided with a fluid interface that conveys enough information to determine quality and accuracy of the model. Systems proposed by Khan et al. [13] and Rahman et al. [14] are examples of this interface style. In both cases, the focus of the visualization is not on the data points, but rather the operations and trends performed on them. ...

DataTweener: a demonstration of a tweening engine for incremental visualization of data transforms

Proceedings of the VLDB Endowment

... To give a comprehensive view, prior research has condensed the operations into descriptive narratives [14,33] or schematic diagrams [25,51]. In addition, many works focused on visualizing interim results through animation (e.g., [21,29,50]) or a timeline representation (e.g., [2,36,45]). For instance, Datamation [50] visually maps and links each step of the data process to the underlying dataset, providing more context for the audience. ...

Data tweening: incremental visualization of data transforms

Proceedings of the VLDB Endowment

... Directed acyclic graph (DAG) consensus protocols [15,33,33,34,53,54] have demonstrated notable enhancements in both throughput and latency, however, the serial execution is now becoming a bottleneck. Numerous works have endeavored to execute by constructing a dependency graph to trace concurrent transactions [19,41,63,64]. However, approaches prove impractical for smart contracts by assuming that read/write sets are known in prior [10,55]. ...

High performance transactions via early write visibility
  • Citing Article
  • January 2017

Proceedings of the VLDB Endowment

... The SMILES were converted into circular Morgan fingerprints (radius: 3) using the RDKit database cartridge. The fingerprints were indexed using the Generalized Search Tree (GiST) [56] algorithm. The database set-up achieves a Tanimoto similarity search of the predicted scaffolds with more than 180,000 molecules in less than a second. ...

Generalized Search Tree
  • Citing Chapter
  • January 2016

... In order to address the potential network bottlenecks, Khameleon leverages two key properties of DVE applications. First, interactions are preemptive: since responses can arrive out of order (e.g., due to network or server delays), the client renders the data for the most recent request and (silently) drops responses from older requests to avoid confusing the user [82,83]. Second, they are approximation tolerant: it is preferable to quickly render a low-quality response (e.g., fewer points [64] or coarser bins [45]) than to wait for a full quality response. ...

A DeVIL-ish approach to inconsistency in interactive visualizations
  • Citing Conference Paper
  • June 2016

... To tackle these problems, some methods proposed the using of ontologies, employed to provide a formal conceptualization of each domain. Some studies were based on a global ontology/ schema shared by all peers (Alking et al., 2008;Akbarinia and Martins, 2007;Cruz et al., 2007;Haase et al., 2004;Huebsch et al., 2005). PIER (Huebsch et al., 2005), a structured P2P system constitutes a good example in which all peers share a standard schema. ...

The Architecture of PIER: an Internet-Scale Query Processor
  • Citing Article
  • January 2005