Matei Zaharia’s research while affiliated with the University of California, Berkeley, and other institutions


Publications (211)


Adaptive and Robust Query Execution for Lakehouses at Scale
  • Article

November 2024 · Proceedings of the VLDB Endowment

Maryann Xue · Steven Chen · Andy Lam · [...] · Michalis Petropoulos

Many organizations have embraced the "Lakehouse" data management paradigm, which involves constructing structured data warehouses on top of open, unstructured data lakes. This approach stands in stark contrast to traditional, closed, relational databases and introduces challenges for performance and stability of distributed query processors. Firstly, in large-scale, open Lakehouses with uncurated data, high ingestion rates, external tables, or deeply nested schemas, it is often costly or wasteful to maintain perfect and up-to-date table and column statistics. Secondly, inherently imperfect cardinality estimates with conjunctive predicates, joins and user-defined functions can lead to bad query plans. Thirdly, for the sheer magnitude of data involved, strictly relying on static query plan decisions can result in performance and stability issues such as excessive data movement, substantial disk spillage, or high memory pressure. To address these challenges, this paper presents our design, implementation, evaluation and practice of the Adaptive Query Execution (AQE) framework, which exploits natural execution pipeline breakers in query plans to collect accurate statistics and re-optimize them at runtime for both performance and robustness. In the TPC-DS benchmark, the technique demonstrates up to 25× per query speedup. At Databricks, AQE has been successfully deployed in production for multiple years. It powers billions of queries and ETL jobs to process exabytes of data per day, through key enterprise products such as Databricks Runtime, Databricks SQL, and Delta Live Tables.
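The runtime re-optimization idea in this abstract can be sketched in a few lines. This is an illustrative toy, not Databricks' implementation: the statistics collector, the 10 MB broadcast threshold, and all function names are assumptions for the example.

```python
# Toy sketch of adaptive query execution: after a pipeline breaker (e.g. a
# shuffle) finishes, use the *observed* output statistics to pick a join
# strategy, instead of trusting potentially stale static estimates.

BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024  # assumed 10 MB threshold

def finished_stage_stats(rows):
    """Collect accurate statistics once a stage has fully materialized."""
    return {"row_count": len(rows), "size_bytes": sum(len(str(r)) for r in rows)}

def choose_join_strategy(left_stats, right_stats):
    """Runtime plan decision: broadcast the smaller side if it is small
    enough, otherwise fall back to a shuffle join."""
    smaller = min(left_stats["size_bytes"], right_stats["size_bytes"])
    return "broadcast_hash_join" if smaller <= BROADCAST_THRESHOLD_BYTES else "shuffle_hash_join"

build_side = finished_stage_stats([("k%d" % i, i) for i in range(100)])
probe_side = finished_stage_stats([("k%d" % i, i * i) for i in range(100_000)])
print(choose_join_strategy(build_side, probe_side))  # broadcast_hash_join
```

The point of deferring the decision to a pipeline breaker is that `size_bytes` is now exact, so a mis-estimated filter or UDF upstream cannot push the optimizer into an oversized broadcast.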



Text2SQL is Not Enough: Unifying AI and Databases with TAG
  • Preprint
  • File available

Figure 2: Example aggregation results. The RAG baseline provides an incomplete answer to the query, while Text2SQL + LM fails to answer the question using any data from the DB. The hand-written TAG baseline provides the most thorough answer, synthesizing data from the DB and its own world knowledge.

August 2024

AI systems that serve natural language questions over databases promise to unlock tremendous value. Such systems would allow users to leverage the powerful reasoning and knowledge capabilities of language models (LMs) alongside the scalable computational power of data management systems. These combined capabilities would empower users to ask arbitrary natural language questions over custom data sources. However, existing methods and benchmarks insufficiently explore this setting. Text2SQL methods focus solely on natural language questions that can be expressed in relational algebra, representing a small subset of the questions real users wish to ask. Likewise, Retrieval-Augmented Generation (RAG) considers the limited subset of queries that can be answered with point lookups to one or a few data records within the database. We propose Table-Augmented Generation (TAG), a unified and general-purpose paradigm for answering natural language questions over databases. The TAG model represents a wide range of interactions between the LM and database that have been previously unexplored and creates exciting research opportunities for leveraging the world knowledge and reasoning capabilities of LMs over data. We systematically develop benchmarks to study the TAG problem and find that standard methods answer no more than 20% of queries correctly, confirming the need for further research in this area. We release code for the benchmark at https://github.com/TAG-Research/TAG-Bench.
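The TAG interaction pattern the abstract describes can be sketched end to end: an LM produces a database query, the database executes it, and the LM synthesizes the final answer over the returned rows. A minimal sketch, not the paper's code: both `lm_*` functions are stubs standing in for real model calls, and the `movies` table is invented for illustration.

```python
# Sketch of Table-Augmented Generation: LM -> query -> DB -> rows -> LM.
import sqlite3

def lm_generate_query(question):
    # Stand-in for an LM call that translates the question into SQL.
    return "SELECT title, year FROM movies WHERE rating > 8.0"

def lm_synthesize_answer(question, rows):
    # Stand-in for an LM call that reasons over the retrieved rows,
    # optionally mixing in its own world knowledge.
    return f"Found {len(rows)} highly rated movies: " + ", ".join(t for t, _ in rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, year INT, rating REAL)")
conn.executemany("INSERT INTO movies VALUES (?, ?, ?)",
                 [("Heat", 1995, 8.3), ("Casino", 1995, 8.2), ("Gigli", 2003, 2.6)])

question = "Which movies are critically acclaimed?"
rows = conn.execute(lm_generate_query(question)).fetchall()
print(lm_synthesize_answer(question, rows))  # prints "Found 2 highly rated movies: Heat, Casino"
```

Unlike plain Text2SQL, the second LM call lets the system answer questions whose final step is reasoning rather than relational algebra; unlike RAG, the retrieval step is an exact query, not a point lookup.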


Networks of Networks: Complexity Class Principles Applied to Compound AI Systems Design

July 2024

As practitioners seek to surpass the current reliability and quality frontier of monolithic models, Compound AI Systems consisting of many language model inference calls are increasingly employed. In this work, we construct systems, which we call Networks of Networks (NoNs), organized around the distinction between generating a proposed answer and verifying its correctness, a fundamental concept in complexity theory that we show empirically extends to Language Models (LMs). We introduce a verifier-based judge NoN with K generators, an instantiation of "best-of-K" or "judge-based" compound AI systems. Through experiments on synthetic tasks such as prime factorization, and core benchmarks such as MMLU, we demonstrate notable performance gains. For instance, in factoring products of two 3-digit primes, a simple NoN improves accuracy from 3.7% to 36.6%. On MMLU, a verifier-based judge construction with only 3 generators boosts accuracy over individual GPT-4-Turbo calls by 2.8%. Our analysis reveals that these gains are most pronounced in domains where verification is notably easier than generation, a characterization which we believe subsumes many reasoning and procedural knowledge tasks but often does not hold for factual and declarative knowledge-based settings. For the mathematical and formal logic reasoning-based subjects of MMLU, we observe a gain of 5–8% or higher, while seeing no gain on others such as geography and religion. We provide key takeaways for ML practitioners, including the importance of considering verification complexity, the impact of witness format on verifiability, and a simple test to determine the potential benefit of this NoN approach for a given problem distribution. This work aims to inform future research and practice in the design of compound AI systems.
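The generation-versus-verification gap driving the NoN design can be demonstrated on the abstract's own factoring example. A minimal sketch, with a random guesser standing in for each LM generator; the K value, seed, and search range are arbitrary choices for the example, not values from the paper.

```python
# Best-of-K with a cheap verifier: proposing a factorization is hard,
# but checking a proposed one only takes a multiplication.
import random

N = 101 * 103  # product of two primes; the task is to recover the factors

def generator(rng):
    # Stand-in for one LM generator proposing a candidate factorization.
    a = rng.randrange(2, 200)
    return (a, N // a)

def verifier(candidate):
    # Verification is easy: multiply the proposed factors back together.
    a, b = candidate
    return a * b == N and a > 1 and b > 1

def best_of_k(k, seed=0):
    rng = random.Random(seed)
    for _ in range(k):
        cand = generator(rng)
        if verifier(cand):
            return cand
    return None

print(best_of_k(k=5000))
```

Even a weak generator becomes useful once a reliable verifier filters its K proposals, which is exactly the regime the paper identifies: verification much easier than generation.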


LOTUS: Enabling Semantic Queries with LLMs Over Tables of Unstructured and Structured Data

July 2024

The semantic capabilities of language models (LMs) have the potential to enable rich analytics and reasoning over vast knowledge corpora. Unfortunately, existing systems lack high-level abstractions to perform semantic queries at scale. We introduce semantic operators, a declarative programming interface that extends the relational model with composable AI-based operations for semantic queries over datasets (e.g., sorting or aggregating records using natural language criteria). Each operator can be implemented and optimized in multiple ways, opening a rich space for execution plans similar to relational operators. We implement our operators and several optimizations for them in LOTUS, an open-source query engine with a Pandas-like API. We demonstrate LOTUS' effectiveness across a series of real applications, including fact-checking, extreme multi-label classification, and search. We find that LOTUS' programming model is highly expressive, capturing state-of-the-art query pipelines with low development overhead. Specifically, on the FEVER dataset, LOTUS' programs can reproduce FacTool, a recent state-of-the-art fact-checking pipeline, in a few lines of code, and implement a new pipeline that improves accuracy by 9.5%, while offering 7–34× lower execution time. In the extreme multi-label classification task on the BioDEX dataset, LOTUS reproduces state-of-the-art result quality with its join operator, while providing an efficient algorithm that runs 800× faster than a naive join. In the search and ranking application, LOTUS allows a simple composition of operators to achieve 5.9–49.4% higher nDCG@10 than the vanilla retriever and re-ranker, while also providing query efficiency, with 1.67–10× lower execution time than LM-based ranking methods used by prior works. LOTUS is publicly available at https://github.com/stanford-futuredata/lotus.
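The idea of a semantic operator can be sketched without the real system. The `sem_filter` below is not LOTUS' actual API: it is a toy filter in which a keyword stub replaces the per-row LM call, so the example runs offline; the paper rows and criterion are invented.

```python
# Sketch of a semantic operator: a relational-style filter whose predicate
# is a natural-language criterion, evaluated once per row by an "LM".
def stub_lm_judge(text, criterion):
    # Stand-in for an LM yes/no call such as "Is this abstract about databases?"
    keywords = {"about databases": ["query", "SQL", "lakehouse"]}
    return any(k.lower() in text.lower() for k in keywords[criterion])

def sem_filter(rows, column, criterion, judge=stub_lm_judge):
    """Declarative semantic filter: like a WHERE clause, but the predicate
    is natural language and each row costs one judge (LM) call."""
    return [r for r in rows if judge(r[column], criterion)]

papers = [
    {"title": "AQE", "abstract": "Runtime re-optimization of query plans in a lakehouse."},
    {"title": "NoN", "abstract": "Generation versus verification in compound AI systems."},
]
print([p["title"] for p in sem_filter(papers, "abstract", "about databases")])  # ['AQE']
```

Because the operator is declarative, an engine is free to batch, cache, or approximate the per-row judge calls, which is the execution-plan space the abstract refers to.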


Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs

June 2024

Language Model Programs, i.e. sophisticated pipelines of modular language model (LM) calls, are increasingly advancing NLP tasks, but they require crafting prompts that are jointly effective for all modules. We study prompt optimization for LM programs, i.e. how to update these prompts to maximize a downstream metric without access to module-level labels or gradients. To make this tractable, we factorize our problem into optimizing the free-form instructions and few-shot demonstrations of every module and introduce several strategies to craft task-grounded instructions and navigate credit assignment across modules. Our strategies include (i) program- and data-aware techniques for proposing effective instructions, (ii) a stochastic mini-batch evaluation function for learning a surrogate model of our objective, and (iii) a meta-optimization procedure in which we refine how LMs construct proposals over time. Using these insights we develop MIPRO, a novel optimizer that outperforms baselines on five of six diverse LM programs using a best-in-class open-source model (Llama-3-8B), by up to 12.9% in accuracy. We will release our new optimizers and benchmark in DSPy at https://github.com/stanfordnlp/dspy.
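One ingredient from the abstract, the stochastic mini-batch evaluation function, can be sketched as follows. This is not MIPRO itself: the candidate instructions, the quality numbers, and `run_program` are fabricated so the surrogate scoring loop is runnable without an LM.

```python
# Sketch of scoring candidate instructions on cheap random mini-batches
# instead of the full dataset, then keeping the best-scoring candidate.
import random

rng = random.Random(42)
dataset = list(range(200))  # stand-in training examples

def run_program(instruction, example):
    # Fabricated metric: pretend "candidate B" is the genuinely better instruction.
    quality = {"candidate A": 0.2, "candidate B": 0.9}[instruction]
    return rng.random() < quality

def minibatch_score(instruction, batch_size=40):
    # Noisy but cheap estimate of downstream accuracy on a random mini-batch.
    batch = rng.sample(dataset, batch_size)
    return sum(run_program(instruction, ex) for ex in batch) / batch_size

candidates = ["candidate A", "candidate B"]
scores = {c: minibatch_score(c) for c in candidates}
best = max(scores, key=scores.get)
print(best)  # "candidate B" wins under this fabricated metric
```

The mini-batch estimates are noisy, which is why the paper pairs them with a surrogate model of the objective rather than trusting any single evaluation.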


ACORN: Performant and Predicate-Agnostic Search Over Vector Embeddings and Structured Data

May 2024 · Proceedings of the ACM on Management of Data

Applications increasingly leverage mixed-modality data, and must jointly search over vector data, such as embedded images, text and video, as well as structured data, such as attributes and keywords. Proposed methods for this hybrid search setting either suffer from poor performance or support a severely restricted set of search predicates (e.g., only small sets of equality predicates), making them impractical for many applications. To address this, we present ACORN, an approach for performant and predicate-agnostic hybrid search. ACORN builds on Hierarchical Navigable Small Worlds (HNSW), a state-of-the-art graph-based approximate nearest neighbor index, and can be implemented efficiently by extending existing HNSW libraries. ACORN introduces the idea of predicate subgraph traversal to emulate a theoretically ideal, but impractical, hybrid search strategy. ACORN's predicate-agnostic construction algorithm is designed to enable this effective search strategy, while supporting a wide array of predicate sets and query semantics. We systematically evaluate ACORN on both prior benchmark datasets, with simple, low-cardinality predicate sets, and complex multi-modal datasets not supported by prior methods. We show that ACORN achieves state-of-the-art performance on all datasets, outperforming prior methods with 2–1,000× higher throughput at a fixed recall. Our code is available at: https://github.com/stanford-futuredata/ACORN.
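The predicate-handling idea can be illustrated on a toy proximity graph: traverse the graph, but only report nodes whose attributes pass the predicate. This is a heavy simplification of ACORN's predicate subgraph traversal (the real algorithm routes and prunes an HNSW index far more carefully); the graph, vectors, and attributes here are invented.

```python
# Filtered search over a tiny hand-built proximity graph: the predicate is
# applied during traversal, so arbitrary filters work without a per-filter index.
import heapq

# node -> (vector, attributes, neighbors)
graph = {
    "a": ((0.0, 0.0), {"color": "red"},  ["b", "c"]),
    "b": ((1.0, 0.0), {"color": "blue"}, ["a", "c", "d"]),
    "c": ((0.0, 1.0), {"color": "red"},  ["a", "b", "d"]),
    "d": ((1.0, 1.0), {"color": "red"},  ["b", "c"]),
}

def dist(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))

def filtered_search(query, predicate, entry="a", k=2):
    """Traverse the graph (exhaustively, for this toy) but only collect
    nodes passing the predicate, so non-matching nodes still provide
    connectivity when matches are sparse."""
    seen, matches = {entry}, []
    frontier = [(dist(query, graph[entry][0]), entry)]
    while frontier:
        d, node = heapq.heappop(frontier)
        _, attrs, nbrs = graph[node]
        if predicate(attrs):
            matches.append((d, node))
        for n in nbrs:
            if n not in seen:
                seen.add(n)
                heapq.heappush(frontier, (dist(query, graph[n][0]), n))
    return [n for _, n in sorted(matches)[:k]]

print(filtered_search((0.9, 0.9), lambda a: a["color"] == "red"))  # ['d', 'c']
```

The key property, shared with the paper's approach, is that the filter is an arbitrary Python predicate evaluated at query time, rather than something baked into index construction.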





Citations (50)


... Frameworks like RAGAS [9] and ARES [29] propose evaluating RAG systems based on context relevance, answer faithfulness, and answer relevance, using LLMs or fine-tuned discriminators for measurement. RGB [3] assesses the noise robustness, negative rejection, information integration, and counterfactual robustness of RAG in news data. ...

Reference:

OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation
ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems
  • Citing Conference Paper
  • January 2024

... Expanding to more complex scenarios beyond offline and online workloads will also be crucial as inference patterns grow in complexity, including prefix-sharing, multi-path reasoning, and compound AI systems (Patel et al., 2024a; Lin et al., 2024; Zheng et al., 2023a; Jeong et al., 2024; Zhong et al., 2024a; Zaharia et al., 2024). ...

LOTUS: Enabling Semantic Queries with LLMs Over Tables of Unstructured and Structured Data

... The process involves crafting a standardized conversational template that can exploit the model's vulnerabilities. This often includes strategies to either explicitly trick the model into making critical errors [283,318] or implicitly guide it toward unintended outputs [30,176,182,213,292,314,425,461]. Additionally, these templates can be fine-tuned through iterative training and optimization, enhancing their capabilities to consistently induce the model to perform undesirable actions across a variety of scenarios [13,180,323,463,463,468]. ...

Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks
  • Citing Conference Paper
  • May 2024

... It is often the case that users aim to find the data objects that satisfy some constraints on the attributes and have their vector nearest to the user's query vector. Such questions are usually referred to as attribute-filtering ANN queries [49,52,55,71,79,83]. Besides, due to recent developments in information retrieval, there has also been a trend toward multi-vector search [40,52,63]. ...

ACORN: Performant and Predicate-Agnostic Search Over Vector Embeddings and Structured Data
  • Citing Article
  • May 2024

Proceedings of the ACM on Management of Data

... For example, Bottaro and Ramgopal [49] proposed streaming the application pipeline so that downstream calls can be invoked as soon as they are ready, without waiting for the complete response. Additionally, Santhanam et al. [55] introduced ALTO, an FM serving system for streaming AI pipelines, demonstrating improved throughput and tail latency by streaming intermediate outputs to downstream tasks. While these methods are effective, the process is still manual and hard to extend to multiple objectives, making it hard to scale for more complex FMware. ...

ALTO: An Efficient Network Orchestrator for Compound AI Systems
  • Citing Conference Paper
  • April 2024

... The recent ChatGPT is built on GPT-3.5 or GPT-4 and they are improved from GPT-3 [5], GPT-2 [30], and the original GPT [29]. There are surveys and evaluations about ChatGPT-related models [23,20,4,21,6]. These transformer-related language models are usually called large language models (LLMs). ...

How Is ChatGPT’s Behavior Changing Over Time?
  • Citing Article
  • March 2024

... LLM-based agents, characterized by their strong generalization abilities and broad applicability, have demonstrated significant advancements in language proficiency and interaction with humans [21]. Motivated by the outstanding performance of these agents, scholars have explored and exploited their capability in various tasks of chemical and material research, such as literature mining [22–29], molecule and material discovery [27,29–38], reaction condition recommendation and optimization [26–29,39–41], and lab apparatus automation [27–29,40–43]. ...

Image and Data Mining in Reticular Chemistry Powered by GPT-4V

... They can develop techniques for isolating the re-execution of events from impacting the actual microservice states. Recent tools [63] allow for replaying operations in database-backed applications. However, it is unclear how this approach can be applied to events and the distributed states of microservices. ...

R3: Record-Replay-Retroaction for Database-Backed Applications

Proceedings of the VLDB Endowment

... The advantages of LoRA include reduced memory requirements for saving fine-tuned models, more efficient training, no impact on inference speed, and the capacity for combination with other parameter-efficient fine-tuning methods. The memory requirements for saving a fine-tuned large language model with LoRA are limited to the size of the pairs of update matrices, which are orders of magnitude smaller than the original model. Training is also more efficient and requires less GPU memory since gradients only need to be calculated for the update matrices. ...

How is ChatGPT's behavior changing over time?
  • Citing Preprint
  • July 2023