Matei Zaharia's research while affiliated with Stanford University and other places

Publications (176)

Conference Paper
Introduction: Segmenting left atrial (LA) substructures, including the LA body, appendage (LAA), and pulmonary veins (PVs), from computed tomography (CT) is central to electroanatomic mapping for ablation and functional studies in patients with atrial fibrillation (AF). However, this process requires manual outlining, which needs special training, is...
Preprint
Commercial ML APIs offered by providers such as Google, Amazon and Microsoft have dramatically simplified ML adoption in many applications. Numerous companies and academics pay to use ML APIs for tasks such as object detection, OCR and sentiment analysis. Different ML APIs tackling the same task can have very heterogeneous performance. Moreover, th...
Preprint
Deployed machine learning (ML) models often encounter new user data that differs from their training data. Therefore, estimating how well a given model might perform on the new data is an important step toward reliable ML applications. This is very challenging, however, as the data distribution can change in flexible ways, and we may not have any l...
Preprint
Developers are increasingly using function-as-a-service (FaaS) platforms for data-centric applications that primarily perform low-latency and transactional operations on data, such as for microservices or web serving workloads. Unfortunately, existing and recently proposed FaaS platforms support these applications poorly because they separate appli...
Article
As artificial intelligence (AI) transitions from research to deployment, creating the appropriate datasets and data pipelines to develop and evaluate AI models is increasingly the biggest challenge. Automated AI model builders that are publicly available can now achieve top performance in many applications. In contrast, the design and sculpting of...
Article
Data exploration systems that provide differential privacy must manage a privacy budget that measures the amount of privacy lost across multiple queries. One effective strategy to manage the privacy budget is to compute a one-time private synopsis of the data, to which users can make an unlimited number of queries. However, existing systems using s...
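The budget-management problem this abstract describes can be sketched as follows. This is a minimal illustration, not the paper's system: it assumes basic sequential composition (the epsilons of successive queries add up) and a counting query of sensitivity 1 answered with the Laplace mechanism. The names `PrivacyBudget`, `laplace_noise`, and `noisy_count` are hypothetical. It shows per-query budgeting, the cost that a one-time private synopsis is meant to avoid.

```python
import random


class PrivacyBudget:
    """Tracks cumulative epsilon under basic sequential composition (assumed)."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def remaining(self):
        return self.total - self.spent

    def charge(self, epsilon):
        # Refuse any query that would push total privacy loss past the budget.
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(data, predicate, epsilon, budget, rng):
    """Answer a counting query (sensitivity 1) and charge epsilon to the budget."""
    budget.charge(epsilon)
    true_count = sum(1 for x in data if predicate(x))
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

With a total budget of 1.0, two queries at epsilon 0.4 each succeed and a third is refused, which is why an analyst who issues many queries directly runs out of budget.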
Preprint
Ablation, a common therapeutic procedure for atrial fibrillation (AF), is guided to specific targets in the heart, often after laborious segmentation of computed tomography. Machine learning (ML) can automate such tasks but requires large training datasets. Inspired by natural intelligence, which builds conceptual models to learn without larg...
Article
Many active learning and search approaches are intractable for large-scale industrial settings with billions of unlabeled examples. Existing approaches search globally for the optimal examples to label, scaling linearly or even quadratically with the unlabeled data. In this paper, we improve the computational efficiency of active learning and searc...
Preprint
Pre-trained language models are increasingly important components across multiple information retrieval (IR) paradigms. Late interaction, introduced with the ColBERT model and recently refined in ColBERTv2, is a popular paradigm that holds state-of-the-art status across many benchmarks. To dramatically speed up the search latency of late interactio...
Article
We consider the problem of assigning or allocating resources to a set of jobs. We consider the case when the resources are fungible, that is, the job can be done with any mix of the resources, but with different efficiencies. In our formulation we maximize a total utility subject to a given limit on the resource usage, which is a convex optimizatio...
Preprint
ML is being deployed in complex, real-world scenarios where errors have impactful consequences. In these systems, thorough testing of the ML pipelines is critical. A key component in ML deployment pipelines is the curation of labeled training data. Common practice in the ML literature assumes that labels are the ground truth. However, in our experi...
Preprint
Data-centric AI is a new and exciting research topic in the AI community, but many organizations already build and maintain various "data-centric" applications whose goal is to produce high quality data. These range from traditional business data processing applications (e.g., "how much should we charge each of our customers this month?") to produc...
Preprint
Much genomic data comes in the form of paired-end reads: two reads that represent genetic material with a small gap between. We present a new algorithm for aligning both reads in a pair simultaneously by fuzzily intersecting the sets of candidate alignment locations for each read. This algorithm is often much faster and produces alignments that res...
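The idea of fuzzily intersecting two candidate-location sets can be sketched as below. This is an illustrative reading of the abstract, not the authors' implementation: it assumes candidate locations are integer genome offsets and that "fuzzy" means the two reads must land within a maximum gap of each other. The function name `fuzzy_intersect` and the parameter `max_gap` are hypothetical.

```python
from bisect import bisect_left, bisect_right


def fuzzy_intersect(locs_a, locs_b, max_gap):
    """Return (a, b) pairs of candidate locations that lie within max_gap.

    locs_a and locs_b are candidate alignment offsets for the two reads
    of a pair; the pairing rule (absolute distance <= max_gap) is assumed.
    """
    sorted_b = sorted(locs_b)
    pairs = []
    for a in sorted(locs_a):
        # Binary-search the window [a - max_gap, a + max_gap] in sorted_b.
        lo = bisect_left(sorted_b, a - max_gap)
        hi = bisect_right(sorted_b, a + max_gap)
        pairs.extend((a, b) for b in sorted_b[lo:hi])
    return pairs
```

Only the surviving pairs would need full scoring, which is where a speedup over aligning each read independently could come from.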
Preprint
Neural information retrieval (IR) has greatly advanced search and other knowledge-intensive language tasks. While many neural IR methods encode queries and documents into single-vector representations, late interaction models produce multi-vector representations at the granularity of each token and decompose relevance modeling into scalable token-l...
Preprint
This paper investigates deep neural network (DNN) compression from the perspective of compactly representing and storing trained parameters. We explore the previously overlooked opportunity of cross-layer architecture-agnostic representation sharing for DNN parameters. To do this, we decouple feedforward parameters from DNN architectures and levera...
Preprint
The rapidly growing size of deep neural network (DNN) models and datasets has given rise to a variety of distribution strategies such as data, tensor-model, pipeline parallelism, and hybrid combinations thereof. Each of these strategies offers its own trade-offs and exhibits optimal performance across different models and hardware topologies. Selec...
Preprint
Resource allocation problems in many computer systems can be formulated as mathematical optimization problems. However, finding exact solutions to these problems using off-the-shelf solvers is often intractable for large problem sizes with tight SLAs, leading system designers to rely on cheap, heuristic algorithms. We observe, however, that many al...
Preprint
Many text generation systems benefit from using a retriever to retrieve passages from a textual knowledge corpus (e.g., Wikipedia) which are then provided as additional context to the generator. For open-ended generation tasks (like generating informative utterances in conversations) many varied passages may be equally relevant and we find that exi...
Article
Systems for Open-Domain Question Answering (OpenQA) generally depend on a retriever for finding candidate passages in a large corpus and a reader for extracting answers from those passages. In much recent work, the retriever is a learned component that uses coarse-grained vector representations of questions and passages. We argue that this modeling...
Article
This paper lays out the rationale for building a completely new operating system (OS) stack. Rather than build on a single node OS together with separate cluster schedulers, distributed filesystems, and network managers, we argue that a distributed transactional DBMS should be the basis for a scalable cluster OS. We show herein that such a database...
Preprint
AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and...
Preprint
Researchers and industry analysts are increasingly interested in computing aggregation queries over large, unstructured datasets with selective predicates that are computed using expensive deep neural networks (DNNs). As these DNNs are expensive and because many applications can tolerate approximate answers, analysts are interested in accelerating...
Preprint
Machine learning (ML) prediction APIs are increasingly widely used. An ML API can change over time due to model updates or retraining. This presents a key challenge in the usage of the API because it is often not clear to the user if and how the ML model has changed. Model shifts can affect downstream application performance and also create oversig...
Preprint
Given a dataset $\mathcal{D}$, we are interested in computing the mean of a subset of $\mathcal{D}$ which matches a predicate. \algname leverages stratified sampling and proxy models to efficiently compute this statistic given a sampling budget $N$. In this document, we theoretically analyze \algname and show that the MSE of the estimate decays at...
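A rough sketch of estimating a predicate-restricted mean with stratified sampling and a proxy score follows. It is not the algorithm analyzed in the abstract (whose name is elided there): it assumes strata are equal-size slices of the data sorted by proxy score and that the budget is split evenly across strata, whereas the paper's allocation may differ. All names (`stratified_predicate_mean`, `proxy`, `predicate`, `value`) are hypothetical.

```python
import random


def stratified_predicate_mean(records, proxy, predicate, value,
                              budget, num_strata=4, seed=0):
    """Estimate the mean of value(r) over records where predicate(r) holds.

    Records are sorted by the cheap proxy score and cut into strata; the
    expensive predicate is evaluated only on sampled records. The even
    split of the budget across strata is an assumption of this sketch.
    """
    if not records:
        return None
    rng = random.Random(seed)
    ordered = sorted(records, key=proxy)
    size = -(-len(ordered) // num_strata)  # ceiling division
    strata = [ordered[i:i + size] for i in range(0, len(ordered), size)]
    per_stratum = max(1, budget // len(strata))

    numerator = 0.0
    denominator = 0.0
    for stratum in strata:
        sample = rng.sample(stratum, min(per_stratum, len(stratum)))
        matches = [value(r) for r in sample if predicate(r)]
        if not matches:
            continue
        p_hat = len(matches) / len(sample)    # estimated match rate
        mu_hat = sum(matches) / len(matches)  # estimated mean of matches
        numerator += len(stratum) * p_hat * mu_hat
        denominator += len(stratum) * p_hat
    return numerator / denominator if denominator else None
```

When the budget covers the whole dataset, every stratum is sampled in full and the estimate reduces to the exact mean over matching records.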
Article
Researchers and industry analysts are increasingly interested in computing aggregation queries over large, unstructured datasets with selective predicates that are computed using expensive deep neural networks (DNNs). As these DNNs are expensive and because many applications can tolerate approximate answers, analysts are interested in accelerating...
Preprint
Resource allocation problems in many computer systems can be formulated as mathematical optimization problems. However, finding exact solutions to these problems using off-the-shelf solvers in an online setting is often intractable for "hyper-scale" system sizes with tight SLAs, leading system designers to rely on cheap, heuristic algorithms. In th...
Preprint
Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these large models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on a single GPU or even on a multi-GPU server; and b) the number of compute operations required to train t...
Preprint
We consider the problem of assigning or allocating resources to a set of jobs. We consider the case when the resources are fungible, that is, the job can be done with any mix of the resources, but with different efficiencies. In our formulation we maximize a total utility subject to a given limit on the resource usage, which is a convex optimizatio...
Preprint
Multi-label classification tasks such as OCR and multi-object recognition are a major focus of the growing machine learning as a service industry. While many multi-label prediction APIs are available, it is challenging for users to decide which API to use for their own data and budget, due to the heterogeneity in those APIs' price and performance....
Preprint
Multi-hop reasoning (i.e., reasoning across two or more documents) at scale is a key step toward NLP models that can exhibit broad world knowledge by leveraging large collections of documents. We propose Baleen, a system that improves the robustness and scalability of multi-hop reasoning over current approaches. Baleen introduces a per-hop condense...
Article
A range of explanation engines assist data analysts by performing feature selection over increasingly high-volume and high-dimensional data, grouping and highlighting commonalities among data points. While useful in diverse tasks such as user behavior analytics, operational event processing, and root-cause analysis, today’s explanation engines are...
Article
Introduction: Regional dysfunction of cardiac myocardium can result from diverse underlying conditions and cause adverse events even in patients without clinically apparent cardiovascular disease over ejection fraction. Assessment of regional wall motion abnormalities (WMA) currently requires sophisticated imaging that is not feasible for populatio...
Article
Rationale: Susceptibility to ventricular arrhythmias (VT/VF) is difficult to predict in patients with ischemic cardiomyopathy either by clinical tools or by attempting to translate cellular mechanisms to the bedside. Objective: To develop computational phenotypes of patients with ischemic cardiomyopathy, by training then interpreting machine lea...
Article
While deep neural networks (DNNs) are an increasingly popular way to query large corpora of data, their significant runtime remains an active area of research. As a result, researchers have proposed systems and optimizations to reduce these costs by allowing users to trade off accuracy and speed. In this work, we examine end-to-end DNN execution in...
Preprint
Unstructured data is now commonly queried by using target deep neural networks (DNNs) to produce structured information, e.g., object types and positions in video. As these target DNNs can be computationally expensive, recent work uses proxy models to produce query-specific proxy scores. These proxy scores are then used in downstream query processi...
Preprint
Specialized accelerators such as GPUs, TPUs, FPGAs, and custom ASICs have been increasingly deployed to train deep learning models. These accelerators exhibit heterogeneous performance behavior across model architectures. Existing schedulers for clusters of accelerators, which are used to arbitrate these expensive training resources across many use...
Preprint
While deep neural networks (DNNs) are an increasingly popular way to query large corpora of data, their significant runtime remains an active area of research. As a result, researchers have proposed systems and optimizations to reduce these costs by allowing users to trade off accuracy and speed. In this work, we examine end-to-end DNN execution in...
Preprint
Current operating systems are complex systems that were designed before today's computing environments. This makes it difficult for them to meet the scalability, heterogeneity, availability, and security challenges in current cloud and parallel computing environments. To address these problems, we propose a radically new OS design based on data-cen...
Article
Background - Advances in ablation for atrial fibrillation (AF) continue to be hindered by ambiguities in mapping, even between experts. We hypothesized that convolutional neural networks (CNN) may enable objective analysis of intracardiac activation in AF, which could be applied clinically if CNN classifications could also be explained. Methods - W...
Preprint
Systems for Open-Domain Question Answering (OpenQA) generally depend on a retriever for finding candidate passages in a large corpus and a reader for extracting answers from those passages. In much recent work, the retriever is a learned component that uses coarse-grained vector representations of questions and passages. We argue that this modeling...
Preprint
Many active learning and search approaches are intractable for industrial settings with billions of unlabeled examples. Existing approaches, such as uncertainty sampling or information density, search globally for the optimal examples to label, scaling linearly or even quadratically with the unlabeled data. However, in practice, data is often heavi...
Preprint
Data exploration systems that provide differential privacy must manage a privacy budget that measures the amount of privacy lost across multiple queries. One effective strategy to manage the privacy budget is to compute a one-time private synopsis of the data, to which users can make an unlimited number of queries. However, existing systems using s...
Preprint
Scientific workloads have traditionally exploited high levels of sparsity to accelerate computation and reduce memory requirements. While deep neural networks can be made sparse, achieving practical speedups on GPUs is difficult because these applications have relatively moderate levels of sparsity that are not sufficient for existing sparse kernel...
Preprint
Many state-of-the-art results in domains such as NLP and computer vision have been obtained by scaling up the number of parameters in existing models. However, the weight parameters and intermediate outputs of these large models often do not fit in the main memory of a single accelerator device; this means that it is necessary to use multiple accel...
Preprint
Prediction APIs offered for a fee are a fast-growing industry and an important part of machine learning as a service. While many such services are available, the heterogeneity in their price and performance makes it challenging for users to decide which API or combination of APIs to use for their own data and budget. We take a first step towards ad...
Preprint
Recent progress in Natural Language Understanding (NLU) is driving fast-paced advances in Information Retrieval (IR), largely owed to fine-tuning deep language models (LMs) for document ranking. While remarkably effective, the ranking models based on these LMs increase computational cost by orders of magnitude over prior approaches, particularly as...
Preprint
Due to the falling costs of data acquisition and storage, researchers and industry analysts often want to find all instances of rare events in large datasets. For instance, scientists can cheaply capture thousands of hours of video, but are limited by the need to manually inspect all the video to identify relevant objects and events. To reduce this...
Preprint
ML models are increasingly deployed in settings with real world interactions such as vehicles, but unfortunately, these models can fail in systematic ways. To prevent errors, ML engineering teams monitor and continuously improve these models. We propose a new abstraction, model assertions, that adapts the classical use of program assertions as a wa...
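To make the model-assertion abstraction concrete, here is a small hypothetical example. The abstract is truncated and does not specify any particular assertion, so the "flicker" check below (a label seen in the frames before and after but missing in between) is an assumption chosen for illustration, as is the name `flicker_assertion`. Unlike a program assertion that halts execution, this one returns the suspicious inputs so they can be monitored or reviewed.

```python
def flicker_assertion(frames):
    """Flag frames where a label vanishes for exactly one frame.

    frames is a list of sets, each holding the labels a model detected in
    one video frame. Returns (frame_index, missing_labels) tuples for
    frames whose neighbours agree on a label that the frame itself lacks.
    """
    flagged = []
    for i in range(1, len(frames) - 1):
        missing = (frames[i - 1] & frames[i + 1]) - frames[i]
        if missing:
            flagged.append((i, sorted(missing)))
    return flagged
```

For example, with detections `[{"car"}, set(), {"car"}]` the middle frame is flagged because the car is present on both sides of it.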
Article
Recent advances in neural networks (NNs) have enabled automatic querying of large volumes of video data with high accuracy. While these deep NNs can produce accurate annotations of an object's position and type in video, they are computationally expensive and require complex, imperative deployment code to answer queries. Prior work uses approximate...
Conference Paper
The cloud has become one of the most attractive ways for enterprises to purchase software, but it requires building products in a very different way from traditional software, which has not been heavily studied in research. I will explain some of these challenges based on my experience at Databricks, a startup that provides a data analytics platfor...
Preprint
Existing systems for metadata-hiding messaging that provide cryptographic privacy properties have either high communication costs, high computation costs, or both. In this paper, we introduce Express, a metadata-hiding communication system that significantly reduces both communication and computation costs. Express is a two-server system that provi...
Conference Paper
DNN training is extremely time-consuming, necessitating efficient multi-accelerator parallelization. Current approaches to parallelizing training primarily use intra-batch parallelization, where a single iteration of training is split over the available workers, but suffer from diminishing returns at higher worker counts. We present PipeDream, a sy...
Conference Paper
Existing deep neural network (DNN) frameworks optimize the computation graph of a DNN by applying graph transformations manually designed by human experts. This approach misses possible graph optimizations and is difficult to scale, as new DNN operators are introduced on a regular basis. We propose TASO, the first DNN computation graph optimizer th...
Conference Paper
Data movement between main memory and the CPU is a major bottleneck in parallel data-intensive applications. In response, researchers have proposed using compilers and intermediate representations (IRs) that apply optimizations such as loop fusion under existing high-level APIs such as NumPy and TensorFlow. Even though these techniques generally do...
Preprint
Machine learning is experiencing an explosion of software and hardware solutions, and needs industry-standard performance benchmarks to drive design and enable competitive evaluation. However, machine learning training presents a number of unique challenges to benchmarking that do not exist in other domains: (1) some optimizations that improve trai...