David J. DeWitt’s research while affiliated with Cambridge and other places

Publications (270)


Cackle: Analytical Workload Cost and Performance Stability With Elastic Pools
  • Article

December 2023

·

8 Reads

·

3 Citations

Proceedings of the ACM on Management of Data

Matthew Perron

·

Raul Castro Fernandez

·

David DeWitt

·

[...]

·

Samuel Madden

Analytical query workloads are prone to rapid fluctuations in resource demands. These rapid, hard-to-predict changes in resource demand make provisioning a challenge. Users must either over-provision at excessive cost or suffer poor query latency when demand spikes. Prior work shows the viability of using cloud functions to match the supply of compute to the workload demand without provisioning resources ahead of time. For low query volumes, this approach is less costly at reasonable performance compared to provisioned systems, but as query volumes increase the cost overhead of cloud functions outweighs the benefit gained by rapid elasticity. In this work, we propose a novel strategy combining rapidly scalable but expensive resources with slow-to-start but inexpensive virtual machines to gain the benefit of elasticity without losing out on the cost savings of provisioned resources. We demonstrate a technique that minimizes cost over a wide range of workloads, environmental conditions, and compute costs while providing stable query performance. We implement these ideas in Cackle and demonstrate that it achieves similar performance and cost per query across a wide range of workloads, avoiding the cost and performance cliffs of alternative approaches.
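
The cost trade-off in the abstract above can be illustrated with a toy model. This is a sketch under stated assumptions, not Cackle's actual algorithm: the prices, the demand trace, and the function name `hybrid_cost` are hypothetical, and the VM pool is held at a fixed size rather than adjusted over time.

```python
def hybrid_cost(demand, vm_count, vm_price=1.0, fn_price=5.0):
    """Cost of serving a demand trace with a fixed VM pool plus cloud functions.

    demand   : compute units needed in each time step
    vm_count : provisioned VMs, one unit per step each, paid for even when idle
    vm_price : hypothetical price per VM per step
    fn_price : hypothetical price per unit of cloud-function compute per step
    """
    vm_cost = vm_count * vm_price * len(demand)
    # Demand above the pool's capacity overflows to elastic cloud functions.
    overflow = sum(max(0, d - vm_count) for d in demand)
    return vm_cost + overflow * fn_price


# Hypothetical bursty trace: mostly low demand with one spike.
demand = [2, 2, 3, 2, 20, 2, 2, 3]

functions_only = hybrid_cost(demand, vm_count=0)
peak_provisioned = hybrid_cost(demand, vm_count=max(demand))
# Brute-force search for the cheapest fixed pool size.
best_pool = min(range(max(demand) + 1), key=lambda n: hybrid_cost(demand, n))
hybrid = hybrid_cost(demand, best_pool)
```

With these made-up numbers, a small pool of VMs that covers the baseline and lets functions absorb the spike costs less than either provisioning for the peak or running everything on cloud functions.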


A Polystore Based Database Operating System (DBOS)

March 2021

·

8 Reads

·

2 Citations

Lecture Notes in Computer Science

Current operating systems are complex systems that were designed before today’s computing environments. This makes it difficult for them to meet the scalability, heterogeneity, availability, and security challenges in current cloud and parallel computing environments. To address these problems, we propose a radically new OS design based on data-centric architecture: all operating system state should be represented uniformly as database tables, and operations on this state should be made via queries from otherwise stateless tasks. This design makes it easy to scale and evolve the OS without whole-system refactoring, inspect and debug system state, upgrade components without downtime, manage decisions using machine learning, and implement sophisticated security features. We discuss how a database OS (DBOS) can improve the programmability and performance of many of today’s most important applications and propose a plan for the development of a DBOS proof of concept.
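
The data-centric idea in the abstract above, with system state held in tables and changed through queries issued by otherwise stateless tasks, can be sketched with Python's built-in `sqlite3` module. The `tasks` table, its columns, and the `schedule_next` function are hypothetical names chosen for illustration; they are not taken from the DBOS proposal.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tasks (id INTEGER PRIMARY KEY, name TEXT, priority INTEGER, state TEXT)"
)
conn.executemany(
    "INSERT INTO tasks (name, priority, state) VALUES (?, ?, ?)",
    [("backup", 1, "ready"), ("query", 5, "ready"), ("logger", 3, "blocked")],
)


def schedule_next(conn):
    """Stateless scheduler step: pick the highest-priority ready task, mark it running."""
    row = conn.execute(
        "SELECT id, name FROM tasks WHERE state = 'ready' "
        "ORDER BY priority DESC, id LIMIT 1"
    ).fetchone()
    if row is None:
        return None  # nothing is ready to run
    conn.execute("UPDATE tasks SET state = 'running' WHERE id = ?", (row[0],))
    conn.commit()
    return row[1]


first = schedule_next(conn)
second = schedule_next(conn)
third = schedule_next(conn)
# Inspecting system state is just another query.
states = dict(conn.execute("SELECT name, state FROM tasks"))
```

The scheduler keeps no state of its own between calls, so in this sketch it could be restarted or replaced without losing track of which tasks are running.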


DBOS: A Proposal for a Data-Centric Operating System

July 2020

·

243 Reads

·

1 Citation

Current operating systems are complex systems that were designed before today's computing environments. This makes it difficult for them to meet the scalability, heterogeneity, availability, and security challenges in current cloud and parallel computing environments. To address these problems, we propose a radically new OS design based on data-centric architecture: all operating system state should be represented uniformly as database tables, and operations on this state should be made via queries from otherwise stateless tasks. This design makes it easy to scale and evolve the OS without whole-system refactoring, inspect and debug system state, upgrade components without downtime, manage decisions using machine learning, and implement sophisticated security features. We discuss how a database OS (DBOS) can improve the programmability and performance of many of today's most important applications and propose a plan for the development of a DBOS proof of concept.



Starling: A Scalable Query Engine on Cloud Functions

January 2020

·

5 Reads

·

9 Citations

© 2020 Association for Computing Machinery. Much like on-premises systems, the natural choice for running database analytics workloads in the cloud is to provision a cluster of nodes to run a database instance. However, analytics workloads are often bursty or low volume, leaving clusters idle much of the time, meaning customers pay for compute resources even when underutilized. The ability of cloud function services, such as AWS Lambda or Azure Functions, to run small, fine-granularity tasks makes them appear to be a natural choice for query processing in such settings. But implementing an analytics system on cloud functions comes with its own set of challenges. These include managing hundreds of tiny stateless resource-constrained workers, handling stragglers, and shuffling data through opaque cloud services. In this paper we present Starling, a query execution engine built on cloud function services that employs a number of techniques to mitigate these challenges, providing interactive query latency at a lower total cost than provisioned systems with low-to-moderate utilization. In particular, on a 1TB TPC-H dataset in cloud storage, Starling is less expensive than the best provisioned systems for workloads when queries arrive 1 minute apart or more. Starling also has lower latency than competing systems reading from cloud object stores and can scale to larger datasets.


Starling: A Scalable Query Engine on Cloud Function Services

November 2019

·

94 Reads

Much like on-premises systems, the natural choice for running database analytics workloads in the cloud is to provision a cluster of nodes to run a database instance. However, analytics workloads are often bursty or low volume, leaving clusters idle much of the time, meaning customers pay for compute resources even when unused. The ability of cloud function services, such as AWS Lambda or Azure Functions, to run small, fine-granularity tasks makes them appear to be a natural choice for query processing in such settings. But implementing an analytics system on cloud functions comes with its own set of challenges. These include managing hundreds of tiny stateless resource-constrained workers, handling stragglers, and shuffling data through opaque cloud services. In this paper we present Starling, a query execution engine built on cloud function services that employs a number of techniques to mitigate these challenges, providing interactive query latency at a lower total cost than provisioned systems with low-to-moderate utilization. In particular, on a 1TB TPC-H dataset in cloud storage, Starling is less expensive than the best provisioned systems for workloads when queries arrive 1 minute apart or more. Starling also has lower latency than competing systems reading from cloud object stores and can scale to larger datasets.


Choosing a cloud DBMS: architectures and tradeoffs

August 2019

·

126 Reads

·

32 Citations

Proceedings of the VLDB Endowment

As analytic (OLAP) applications move to the cloud, DBMSs have shifted from employing a pure shared-nothing design with locally attached storage to a hybrid design that combines the use of shared-storage (e.g., AWS S3) with the use of shared-nothing query execution mechanisms. This paper sheds light on the resulting tradeoffs, which have not been properly identified in previous work. To this end, it evaluates the TPC-H benchmark across a variety of DBMS offerings running in a cloud environment (AWS) on fast 10Gb+ networks, specifically database-as-a-service offerings (Redshift, Athena), query engines (Presto, Hive), and a traditional cloud-agnostic OLAP database (Vertica). While these comparisons cannot be apples-to-apples in all cases due to cloud configuration restrictions, we nonetheless identify patterns and design choices that are advantageous. These include prioritizing low-cost object stores like S3 for data storage, using system-agnostic yet still performant columnar formats like ORC that allow easy switching to other systems for different workloads, and making features that benefit subsequent runs like query precompilation and caching remote data to faster storage optional rather than required because they disadvantage ad hoc queries.


RoadTracer: Automatic Extraction of Road Networks from Aerial Images

June 2018

·

127 Reads

·

74 Citations

Mapping road networks is currently both expensive and labor-intensive. High-resolution aerial imagery provides a promising avenue to automatically infer a road network. Prior work uses convolutional neural networks (CNNs) to detect which pixels belong to a road (segmentation), and then uses complex post-processing heuristics to infer graph connectivity. We show that these segmentation methods have high error rates because noisy CNN outputs are difficult to correct. We propose RoadTracer, a new method to automatically construct accurate road network maps from aerial images. RoadTracer uses an iterative search process guided by a CNN-based decision function to derive the road network graph directly from the output of the CNN. We compare our approach with a segmentation method on fifteen cities, and find that at a 5% error rate, RoadTracer correctly captures 45% more junctions across these cities.


RoadTracer: Automatic Extraction of Road Networks from Aerial Images
  • Conference Paper
  • Full-text available

June 2018

·

525 Reads

·

370 Citations


Unthule: An Incremental Graph Construction Process for Robust Road Map Extraction from Aerial Images

February 2018

·

112 Reads

·

6 Citations

The availability of highly accurate maps has become crucial due to the increasing importance of location-based mobile applications as well as autonomous vehicles. However, mapping roads is currently an expensive and human-intensive process. High-resolution aerial imagery provides a promising avenue to automatically infer a road network. Prior work uses convolutional neural networks (CNNs) to detect which pixels belong to a road (segmentation), and then uses complex post-processing heuristics to infer graph connectivity. We show that these segmentation methods have high error rates (poor precision) because noisy CNN outputs are difficult to correct. We propose a novel approach, Unthule, to construct highly accurate road maps from aerial images. In contrast to prior work, Unthule uses an incremental search process guided by a CNN-based decision function to derive the road network graph directly from the output of the CNN. We train the CNN to output the direction of roads traversing a supplied point in the aerial imagery, and then use this CNN to incrementally construct the graph. We compare our approach with a segmentation method on fifteen cities, and find that Unthule has a 45% lower error rate in identifying junctions across these cities.
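
The incremental search described in the abstract above can be sketched as a loop that repeatedly asks a decision function where to walk next and grows the graph one edge at a time. This is a toy illustration under stated assumptions, not the authors' implementation: the stack-based backtracking, the grid of road cells, and the names `trace_roads` and `toy_decide` are hypothetical, and a simple neighbor lookup stands in for the trained CNN.

```python
def trace_roads(start, decide):
    """Incrementally build a road graph from `start`.

    decide(point, vertices) stands in for the CNN: it returns the next point
    to walk to from `point`, or None when no unexplored road leaves it.
    """
    vertices = {start}
    edges = set()
    stack = [start]
    while stack:
        current = stack[-1]
        nxt = decide(current, vertices)
        if nxt is None:
            stack.pop()  # no unexplored road here: backtrack
        else:
            vertices.add(nxt)
            edges.add((current, nxt))
            stack.append(nxt)
    return vertices, edges


# Toy "imagery": grid cells that are road, forming a T-junction at (1, 0).
ROAD = {(0, 0), (1, 0), (2, 0), (1, 1), (1, 2)}
STEPS = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def toy_decide(point, vertices):
    """Hypothetical stand-in for the CNN: step to any unvisited adjacent road cell."""
    for dx, dy in STEPS:
        candidate = (point[0] + dx, point[1] + dy)
        if candidate in ROAD and candidate not in vertices:
            return candidate
    return None


vertices, edges = trace_roads((0, 0), toy_decide)
```

Because the graph is built edge by edge, connectivity comes directly from the search rather than from post-processing a per-pixel segmentation.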


Citations (87)


... In a word, for users who are not system experts, serverless query processing greatly reduces the effort required to own and use a data analytic system. However, existing serverless query engines are only cost-efficient in processing bursty and low-volume workloads, as they are built upon serverless computing infrastructures (e.g., cloud functions or micro virtual machines) that are tailored for ephemeral computing tasks [7,10,11]. For sustained workloads, these serverless query engines are less scalable and 1-2 orders of magnitude more expensive than MPP (massively parallel processing) query engines running in provisioned VM (virtual machine) clusters [7]. ...

Reference:

PixelsDB: Serverless and Natural-Language-Aided Data Analytics with Flexible Service Levels and Prices
Cackle: Analytical Workload Cost and Performance Stability With Elastic Pools
  • Citing Article
  • December 2023

Proceedings of the ACM on Management of Data

... In addition to the cost argument, DBMS, even more DBMS for enterprise scenarios, are typically complex software projects that need to fulfill a wide range of requirements. DBMS may even provide core functionality of an operating system [13] and therefore reach a similar complexity as an operating system. In addition, as performance is important, the design of a DBMS may trade performance over simplicity. ...

DBOS: A Proposal for a Data-Centric Operating System
  • Citing Preprint
  • July 2020

... Profiling cost is amortized if the profiled queries execute several times without major runtime changes. This is the case for most periodic workloads [9,18,65], which happen to be the most relevant for saving money. Furthermore, stale profiles can still expose savings opportunities since small errors in costs do not greatly alter what queries migrate in an inter-query plan, as we illustrate in Section 6.6. ...

Choosing a cloud DBMS: architectures and tradeoffs
  • Citing Article
  • August 2019

Proceedings of the VLDB Endowment

... Cheng et al. [19] applied binary thresholding to road segmentation and used postprocessing with morphological refinement to extract road centerlines. Exploring local contextual features is a key issue in semantic segmentation [20,21]. In recent studies, a new semantic segmentation network called D-FusionNet [22] performs well in the road extraction task, which integrates a diluted convolutional block module that expands the receptive field and reduces feature loss in the extraction task. ...

RoadTracer: Automatic Extraction of Road Networks from Aerial Images
  • Citing Article
  • June 2018

... Third, the lack of inductive reasoning ability for AI models leads to disconnected roads, which may lead to inaccurate conclusions in road network-based urban studies. AI methods focus mainly on recognizing individual pixels as roads, rather than inferring road connectivity according to the cognitive process applied by human beings [30]. ...

RoadTracer: Automatic Extraction of Road Networks from Aerial Images

... Nevertheless, the method is not able to derive lane information within the road map. A refined method with higher accuracy is shown by Bastani et al. [10]. In the paper, the authors describe a different method to automatically detect roads in aerial images. ...

Unthule: An Incremental Graph Construction Process for Robust Road Map Extraction from Aerial Images
  • Citing Article
  • February 2018

... This has recently been a hot topic within the database research community. Modeling database workload patterns [22,25,34] and predicting user behavior [14,33,35] has been well studied. These studies, and others in Section 6, have focused on what a user is doing with their database and when they are doing it. ...

Predictive Provisioning: Efficiently Anticipating Usage in Azure SQL Database
  • Citing Conference Paper
  • April 2017

... While time series forecast in general and load prediction in particular are well studied topics, none of the state-of-the-art approaches focused on predicting the lowest valley in customer CPU load for optimized backup scheduling. Instead, existing approaches focus on, for example, idle time detection for predictive resource provisioning [27,39], VM workload prediction for dynamic VM allocation [13,14], and demand-driven auto-scale of resources [19,20,21,22,36,37,41]. Thus, these approaches do not tackle the unique challenges of low load prediction for optimized backup scheduling described above. ...

Not for the timid: on the impact of aggressive over-booking in the cloud
  • Citing Article
  • September 2016

Proceedings of the VLDB Endowment

... The authors of [23] proposed two reactive solutions to balance the load in case any machine of the multi-tenant database system becomes overloaded. The first solution swaps the primary replica on the overloaded machines with one of its secondary replicas in any other machine. ...

STeP: Scalable Tenant Placement for Managing Database-as-a-Service Deployments
  • Citing Conference Paper
  • October 2016