Andrew Pavlo’s research while affiliated with Carnegie Mellon University and other places

Publications (97)


The Key to Effective UDF Optimization: Before Inlining, First Perform Outlining
  • Article

February 2025 · 2 Reads · Proceedings of the VLDB Endowment

Samuel Arch · Yuchen Liu · Todd C. Mowry · [...] · Andrew Pavlo

Although user-defined functions (UDFs) are a popular way to augment SQL's declarative approach with procedural code, the mismatch between programming paradigms creates a fundamental optimization challenge. UDF inlining automatically removes all UDF calls by replacing them with equivalent SQL subqueries. Although inlining leaves queries entirely in SQL (resulting in large performance gains), we observe that inlining the entire UDF often leads to sub-optimal performance. A better approach is to analyze the UDF, deconstruct it into smaller pieces, and inline only the pieces that help query optimization. To achieve this, we propose UDF outlining, a technique to intentionally hide pieces of a UDF from the optimizer, resulting in simpler UDFs and significantly faster query plans. Our implementation (PRISM) demonstrates that UDF outlining improves performance over conventional inlining (on average 1.29× speedup for DuckDB and 298.73× for SQL Server) through a combination of more effective unnesting, improved data skipping, and by avoiding unnecessary joins.
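The core idea can be illustrated with a toy example. The SQL and function names below are hypothetical, not PRISM's actual output; the sketch only shows how outlining splits a UDF so that the optimizer-friendly piece is inlined while the complex piece stays hidden behind a new opaque function.

```python
# Illustrative only: hypothetical SQL showing the UDF outlining idea.
# A UDF that mixes a simple, analyzable branch with an expensive
# procedural computation.
ORIGINAL_UDF = """
CREATE FUNCTION discount(o_id INT) RETURNS DECIMAL AS
BEGIN
    IF (SELECT status FROM orders WHERE id = o_id) <> 'OPEN' THEN
        RETURN 0.0;                -- cheap, optimizer-friendly branch
    END IF;
    RETURN loyalty_calc(o_id);     -- complex procedural logic
END;
"""

# Conventional inlining would replace the whole call with one large
# subquery, which can defeat unnesting and data skipping. Outlining
# instead extracts the complex piece into a new opaque UDF and inlines
# only the part that helps the optimizer:
OUTLINED_QUERY = """
SELECT o.id,
       CASE WHEN o.status <> 'OPEN' THEN 0.0      -- inlined piece
            ELSE outlined_loyalty_calc(o.id)      -- outlined piece
       END AS discount
FROM   orders o;
"""
```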


Intelligent Transaction Scheduling via Conflict Prediction in OLTP DBMS
  • Preprint
  • File available

September 2024 · 73 Reads

Current architectures for main-memory online transaction processing (OLTP) database management systems (DBMS) typically use random scheduling to assign transactions to threads. This approach achieves uniform load across threads, but it ignores the likelihood of conflicts between transactions. If the DBMS could estimate the potential for transaction conflict and then intelligently schedule transactions to avoid conflicts, then the system could improve its performance. Such estimation of transaction conflict, however, is non-trivial for several reasons. First, conflicts occur under complex conditions that are far removed in time from the scheduling decision. Second, transactions must be represented in a compact and efficient manner to allow for fast conflict detection. Third, given some evidence of potential conflict, the DBMS must schedule transactions in such a way that minimizes this conflict. In this paper, we systematically explore the design decisions for solving these problems. We then empirically measure the performance impact of different representations on standard OLTP benchmarks. Our results show that intelligent scheduling using a history increases throughput by ~40% on a 20-core machine.
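As a concrete (if simplified) illustration of the second and third challenges, the sketch below summarizes a transaction's predicted read/write keys as a fixed-size bitset and routes likely-conflicting transactions to the same worker so they serialize instead of aborting each other. The representation and routing policy are plausible stand-ins, not the paper's exact design.

```python
# Minimal sketch of conflict-aware transaction routing, assuming each
# transaction's likely keys can be predicted from history (illustrative,
# not the paper's actual implementation).
from hashlib import blake2b

FILTER_BITS = 1024

def footprint(keys):
    """Compact Bloom-filter-like bitset of the keys a txn may touch."""
    bits = 0
    for k in keys:
        h = int.from_bytes(blake2b(k.encode(), digest_size=8).digest(), "big")
        bits |= 1 << (h % FILTER_BITS)
    return bits

class Scheduler:
    def __init__(self, n_workers):
        # OR of the footprints currently queued on each worker.
        # (A real system would clear bits as transactions commit.)
        self.pending = [0] * n_workers

    def assign(self, txn_keys):
        fp = footprint(txn_keys)
        overlaps = [bin(fp & p).count("1") for p in self.pending]
        # Route to the worker this txn conflicts with most, so
        # likely-conflicting txns serialize on one thread.
        worker = max(range(len(overlaps)), key=lambda i: overlaps[i])
        self.pending[worker] |= fp
        return worker

sched = Scheduler(n_workers=4)
print(sched.assign(["acct:17", "acct:42"]))  # e.g. 0
print(sched.assign(["acct:42"]))             # routed to the same worker
```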

Hit the Gym: Accelerating Query Execution to Efficiently Bootstrap Behavior Models for Self-Driving Database Management Systems

August 2024 · 30 Reads · 1 Citation · Proceedings of the VLDB Endowment

Autonomous database management systems (DBMSs) aim to optimize themselves automatically without human guidance. They rely on machine learning (ML) models that predict their run-time behavior to evaluate whether a candidate configuration is beneficial without the expensive execution of queries. However, the high cost of collecting the training data to build these models makes them impractical for real-world deployments. Furthermore, these models are instance-specific and thus require retraining whenever the DBMS's environment changes. State-of-the-art methods spend over 93% of their time running queries for training versus tuning. To mitigate this problem, we present the Boot framework for automatically accelerating training data collection in DBMSs. Boot utilizes macro- and micro-acceleration (MMA) techniques that modify query execution semantics with approximate run-time telemetry and skip repetitive parts of the training process. To evaluate Boot, we integrated it into a database gym for PostgreSQL. Our experimental evaluation shows that Boot reduces training collection times by up to 268× with modest degradation in model accuracy. These results also indicate that our MMA-based approach scales with dataset size and workload complexity.
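A toy sketch of the "skip repetitive parts" idea: if telemetry for a query template has already been collected, reuse the approximate observation instead of re-executing. All names here are hypothetical, and Boot's actual MMA techniques operate inside the DBMS's execution engine rather than in a wrapper like this.

```python
# Illustrative memoization of training-data collection: pay for a full
# query execution once per template, then reuse approximate telemetry.
telemetry_cache = {}

def collect(template, params, execute):
    """Return run-time telemetry for a query, skipping repeat executions."""
    if template in telemetry_cache:
        # Approximate telemetry: reuse the cached observation instead of
        # re-running a query the model has already seen.
        return telemetry_cache[template]
    obs = execute(template, params)  # expensive: actually run the query
    telemetry_cache[template] = obs
    return obs
```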


The Holon Approach for Simultaneously Tuning Multiple Components in a Self-Driving Database Management System with Machine Learning via Synthesized Proto-Actions

August 2024 · 17 Reads · 1 Citation · Proceedings of the VLDB Endowment

Existing machine learning (ML) approaches to automatically optimize database management systems (DBMSs) only target a single configuration space at a time (e.g., knobs, query hints, indexes). Simultaneously tuning multiple configuration spaces is challenging due to the combined space's complexity. Previous tuning methods work around this by sequentially tuning individual spaces with a pool of tuners. However, these approaches struggle to coordinate their tuners and get stuck in local optima. This paper presents the Proto-X framework that holistically tunes multiple configuration spaces. The key idea of Proto-X is to identify similarities across multiple spaces, encode them in a high-dimensional model, and then synthesize "proto-actions" to navigate the organized space for promising configurations. We evaluate Proto-X against state-of-the-art DBMS tuning frameworks on tuning PostgreSQL for analytical and transactional workloads. By reasoning about configuration spaces that are orders of magnitude more complex than other frameworks (both in terms of quantity and variety), Proto-X discovers configurations that improve PostgreSQL's performance by up to 53% over the next best approach.
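To make "proto-actions" concrete, here is an invented encoding in which knobs and index choices share one joint vector that is decoded into one concrete action per configuration space. Proto-X's real encoding is learned and high-dimensional; this sketch only shows the shape of the interface.

```python
# Illustrative only: a joint vector over two configuration spaces
# (knobs and index candidates), decoded into per-space actions.
import random

KNOBS = {"shared_buffers_mb": (64, 8192), "work_mem_mb": (1, 1024)}
INDEX_CANDIDATES = ["idx_orders_date", "idx_lineitem_part"]  # hypothetical

def decode(proto):
    """Map a point in [0,1]^d to a knob setting and an index selection."""
    knob_vals = {
        name: int(lo + p * (hi - lo))
        for p, (name, (lo, hi)) in zip(proto, KNOBS.items())
    }
    idx_bits = proto[len(KNOBS):]
    indexes = [ix for p, ix in zip(idx_bits, INDEX_CANDIDATES) if p > 0.5]
    return knob_vals, indexes

# A tuner would search this joint space; here we just sample one point.
proto_action = [random.random() for _ in range(len(KNOBS) + len(INDEX_CANDIDATES))]
print(decode(proto_action))
```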


What Goes Around Comes Around... And Around...

July 2024 · 23 Reads · 6 Citations · ACM SIGMOD Record

Two decades ago, one of us co-authored a paper commenting on the previous 40 years of data modelling research and development [188]. That paper demonstrated that the relational model (RM) and SQL are the prevailing choice for database management systems (DBMSs), despite efforts to replace either of them. Instead, SQL absorbed the best ideas from these alternative approaches. We revisit this issue and argue that this same evolution has continued since 2005. Once again there have been repeated efforts to replace either SQL or the RM. But the RM continues to be the dominant data model and SQL has been extended to capture the good ideas from others. As such, we expect more of the same in the future, namely the continued evolution of SQL and relational DBMSs (RDBMSs). We also discuss DBMS implementations and argue that the major advancements have been in the RM systems, primarily driven by changing hardware characteristics.



An Empirical Evaluation of Columnar Storage Formats

December 2023 · 29 Reads · 15 Citations · Proceedings of the VLDB Endowment

Columnar storage is a core component of a modern data analytics system. Although many database management systems (DBMSs) have proprietary storage formats, most provide extensive support to open-source storage formats such as Parquet and ORC to facilitate cross-platform data sharing. But these formats were developed over a decade ago, in the early 2010s, for the Hadoop ecosystem. Since then, both the hardware and workload landscapes have changed. In this paper, we revisit the most widely adopted open-source columnar storage formats (Parquet and ORC) with a deep dive into their internals. We designed a benchmark to stress-test the formats' performance and space efficiency under different workload configurations. From our comprehensive evaluation of Parquet and ORC, we identify design decisions advantageous with modern hardware and real-world data distributions. These include using dictionary encoding by default, favoring decoding speed over compression ratio for integer encoding algorithms, making block compression optional, and embedding finer-grained auxiliary data structures. We also point out the inefficiencies in the format designs when handling common machine learning workloads and using GPUs for decoding. Our analysis identified important considerations that may guide future formats to better fit modern technology trends.
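The recommendations above are easy to observe with an off-the-shelf Parquet reader. The snippet below (file name and data are placeholders) writes a column with dictionary encoding enabled and block compression disabled, then inspects the resulting column-chunk metadata with pyarrow.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A low-cardinality column: a good fit for dictionary encoding.
table = pa.table({"city": ["NYC", "NYC", "PIT", "PIT", "SFO"] * 1000})
pq.write_table(table, "demo.parquet",
               use_dictionary=True,   # dictionary encoding on by default
               compression="none")    # block compression left optional

col = pq.ParquetFile("demo.parquet").metadata.row_group(0).column(0)
print(col.encodings)    # e.g. ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
print(col.compression)  # 'UNCOMPRESSED'
print(col.total_compressed_size, col.total_uncompressed_size)
```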


Tigger: A Database Proxy That Bounces with User-Bypass

August 2023 · 43 Reads · 6 Citations · Proceedings of the VLDB Endowment

Developers often deploy database-specific network proxies whereby applications connect transparently to the proxy instead of directly connecting to the database management system (DBMS). This indirection improves system performance through connection pooling, load balancing, and other DBMS-specific optimizations. Instead of simply forwarding packets, these proxies implement DBMS protocol logic (i.e., at the application layer) to achieve this behavior. Consequently, existing proxies are user-space applications that process requests as they arrive on network sockets and forward them to the appropriate destinations. This approach incurs inefficiencies as the kernel repeatedly copies buffers between user-space and kernel-space, and the associated system calls add CPU overhead. This paper presents user-bypass, a technique to eliminate these overheads by leveraging modern operating system features that support custom code execution. User-bypass pushes application logic into kernel-space via Linux's eBPF infrastructure. To demonstrate its benefits, we implemented Tigger, a PostgreSQL-compatible DBMS proxy using user-bypass to eliminate the overheads of traditional proxy design. We compare Tigger's performance against other state-of-the-art proxies widely used in real-world deployments. Our experiments show that Tigger outperforms other proxies: in one scenario it achieved both the lowest transaction latencies (up to 29% reduction) and the lowest CPU utilization (up to 42% reduction). The results show that user-bypass implementations like Tigger are well-suited to DBMS proxies' unique requirements.
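For contrast with user-bypass, here is a minimal sketch of the traditional user-space design the paper describes: every recv()/sendall() pair is a system call that copies the buffer across the user/kernel boundary, which is precisely the overhead Tigger's eBPF-based approach eliminates. Ports and addresses are placeholders.

```python
# Conventional user-space proxy loop (the baseline, NOT user-bypass).
import socket
import threading

def pump(src, dst):
    # Each iteration: one syscall + kernel->user copy, then one
    # syscall + user->kernel copy. User-bypass keeps this in the kernel.
    while (data := src.recv(4096)):
        dst.sendall(data)

def proxy(listen_port=6432, backend=("127.0.0.1", 5432)):
    srv = socket.create_server(("0.0.0.0", listen_port))
    while True:
        client, _ = srv.accept()
        upstream = socket.create_connection(backend)
        threading.Thread(target=pump, args=(client, upstream), daemon=True).start()
        threading.Thread(target=pump, args=(upstream, client), daemon=True).start()
```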


An Empirical Evaluation of Columnar Storage Formats

April 2023 · 41 Reads

Columnar storage is one of the core components of a modern data analytics system. Although many database management systems (DBMSs) have proprietary storage formats, most provide extensive support to open-source storage formats such as Parquet and ORC to facilitate cross-platform data sharing. But these formats were developed over a decade ago, in the early 2010s, for the Hadoop ecosystem. Since then, both the hardware and workload landscapes have changed significantly. In this paper, we revisit the most widely adopted open-source columnar storage formats (Parquet and ORC) with a deep dive into their internals. We designed a benchmark to stress-test the formats' performance and space efficiency under different workload configurations. From our comprehensive evaluation of Parquet and ORC, we identify design decisions advantageous with modern hardware and real-world data distributions. These include using dictionary encoding by default, favoring decoding speed over compression ratio for integer encoding algorithms, making block compression optional, and embedding finer-grained auxiliary data structures. Our analysis identifies important considerations that may guide future formats to better fit modern technology trends.



Citations (71)


... SlabCity [13] takes an approach similar to traditional superoptimization, by considering SQL-level semantic rewrites of queries to improve performance (e.g., query simplification). Finally, DataFarm [85] and HitTheGym [49] investigated how to best produce datasets for machine learning powered database components, including query optimizers. ...

Reference:

Learned Offline Query Planning via Bayesian Optimization
Hit the Gym: Accelerating Query Execution to Efficiently Bootstrap Behavior Models for Self-Driving Database Management Systems
  • Citing Article
  • August 2024

Proceedings of the VLDB Endowment

... However, this approach is bandwidth-intensive, which can lead to performance bottlenecks in low-bandwidth or high-traffic environments. Besides, plaintext download can further reduce the data to be transferred through compression techniques [5,55,70,88]. Although we do not apply compression in our experiments, it can be integrated as an optional optimization. ...

An Empirical Evaluation of Columnar Storage Formats
  • Citing Article
  • December 2023

Proceedings of the VLDB Endowment

... In contrast, Cherry Garcia [46], Epoxy [34], and ScalarDB [5] implement transaction management and concurrency control over the abstractions of underlying engines, making them extendable to more kinds of engines, including NoSQL systems. Other solutions focus on managing connections between databases and clients [47], [48]. These solutions enable the routing of statements to one or multiple database servers, thereby offering scalability and high performance. ...

Tigger: A Database Proxy That Bounces with User-Bypass
  • Citing Article
  • August 2023

Proceedings of the VLDB Endowment

... The latest version of Hyrise [31] develops the chunk-based column store, a PAX [4]-like data layout that horizontally divides a table into partitions (a.k.a. row groups), where each partition is organized in columns. NoisePage [88] (previously named Peloton [11]), a self-driving columnar database [20], [75], [76], [94], also adopts such a data layout based on the Apache Arrow [8] format. Caldera [9] relies on a copy-on-write mechanism with a CPU/GPU architecture where OLTP workloads are handled by CPU threads and OLAP workloads are executed on the GPU in parallel. ...

Tastes Great! Less Filling! High Performance and Accurate Training Data Collection for Self-Driving Database Management Systems
  • Citing Article
  • June 2022

... They encompass tasks such as data retrieval and placement within index structures [38,39], data sorting [41], query optimization involving cardinality estimation [36,88] and query plan enumeration [40], as well as automatic database tuning [79]. Consequently, in a recent report [1], a group of leading database researchers anticipated that ML will be a significant research direction for improving database engines. ...

The Seattle report on database research
  • Citing Article
  • August 2022

Communications of the ACM

... Recently, various techniques have been proposed to detect bugs in DBMSs. For relational DBMSs, researchers have proposed methods to identify logic bugs [1,8,20,36-38], transactional bugs [4,8,17,19,22,43], and performance issues [2,20,25]. A notable example is SQLancer, which integrates several novel approaches to identify a range of bug types. ...

Litmus: Towards a Practical Database Management System with Verifiable ACID Properties and Transaction Correctness
  • Citing Conference Paper
  • June 2022

... Machine Learning for Database Systems. There have been many extensive studies focused on using machine learning to optimize storage systems, particularly for SQL databases [13,56,82]. Ottertune [76] uses machine learning techniques to analyze and tune database configuration knobs and give recommendation settings based on statistics collected from the database. ...

Tastes Great! Less Filling! High Performance and Accurate Training Data Collection for Self-Driving Database Management Systems
  • Citing Conference Paper
  • June 2022

... Ottertune [76] uses machine learning techniques to analyze and tune database configuration knobs and give recommendation settings based on statistics collected from the database. Pavlo et al. [64,65] and Aken et al. [77] introduced the concept of a self-driving database that can make automated decisions with machine learning models. Abu-Libdeh et al. [4] present several design decisions involved in integrating learned indexes into Bigtable [16] and demonstrate significant performance improvements as a result. ...

Make your database system dream of electric sheep: towards self-driving operation
  • Citing Article
  • July 2021

Proceedings of the VLDB Endowment