Frank McSherry's research while affiliated with ETH Zurich and other places

Publications (72)

Preprint
Full-text available
Incremental view maintenance has been for a long time a central problem in database theory. Many solutions have been proposed for restricted classes of database languages, such as the relational algebra, or Datalog. These techniques do not naturally generalize to richer languages. In this paper we give a general solution to this problem in 3 steps:...
Article
Current systems for data-parallel, incremental processing and view maintenance over high-rate streams isolate the execution of independent queries. This creates unwanted redundancy and overhead in the presence of concurrent incrementally maintained queries: each query must independently maintain the same indexed state over the same input streams, a...
Article
We design and implement Megaphone, a data migration mechanism for stateful distributed dataflow engines with latency objectives. When compared to existing migration mechanisms, Megaphone has the following differentiating characteristics: (i) migrations can be subdivided to a configurable granularity to avoid latency spikes, and (ii) migrations can...
Preprint
Many of the most popular scalable data-processing frameworks are fundamentally limited in the generality of computations they can express and efficiently execute. In particular, we observe that systems' abstractions limit their ability to share and reuse indexed state within and across computations. These limitations result in an inability to expre...
Preprint
We design and implement Megaphone, a data migration mechanism for stateful distributed dataflow engines with latency objectives. When compared to existing migration mechanisms, Megaphone has the following differentiating characteristics: (i) migrations can be subdivided to a configurable granularity to avoid latency spikes, and (ii) migrations can...
Conference Paper
We propose a prototype incremental data migration mechanism for stateful distributed data-parallel dataflow engines with latency objectives. When compared to existing scaling mechanisms, our prototype has the following differentiating characteristics: (i) the mechanism provides tunable granularity for avoiding latency spikes, (ii) reconfigurations...
Article
We study the problem of finding and monitoring fixed-size subgraphs in a continually changing large-scale graph. We present the first approach that (i) performs worst-case optimal computation and communication, (ii) maintains a total memory footprint linear in the number of input edges, and (iii) scales down per-worker computation, communication, a...
Article
We study the problem of finding and monitoring fixed-size subgraphs in a continually changing large-scale graph. We present the first approach that (i) performs worst-case optimal computation and communication, (ii) maintains a total memory footprint linear in the number of input edges, and (iii) scales down per-worker computation, communication, a...
Article
We continue a line of research initiated in Dinur and Nissim (2003); Dwork and Nissim (2004); and Blum et al. (2005) on privacy-preserving statistical databases. Consider a trusted server that holds a database of sensitive information. Given a query function $f$ mapping databases to reals, the so-called {\em true answer} is the result of applying...
Article
We describe the timely dataflow model for distributed computation and its implementation in the Naiad system. The model supports stateful iterative and incremental computations. It enables both low-latency stream processing and high-throughput batch processing, using a new approach to coordination that combines asynchronous and fine-grained synchro...
Article
We report on the design and implementation of a general framework for interactively explaining the outputs of modern data-parallel computations, including iterative data analytics. To produce explanations, existing works adopt a naive backward tracing approach which runs into known issues; naive backward tracing may identify: (i) too much informati...
Conference Paper
This document presents Faucet, a modular flow control approach for distributed data-parallel dataflow engines with support for arbitrary (cyclic) topologies. When compared to existing backpressure techniques Faucet has the following differentiating characteristics: (i) the implementation only relies on existing progress information exposed by the s...
Conference Paper
Differential dataflow is a recent approach to incremental computation that relies on a partially ordered set of differences. In the present paper, we aim to develop its foundations. We define a small programming language whose types are abelian groups equipped with linear inverses, and provide both a standard and a differential denotational semanti...
Article
We present an approach to differentially private computation in which one does not scale up the magnitude of noise for challenging queries, but rather scales down the contributions of challenging records. While scaling down all records uniformly is equivalent to scaling up the noise magnitude, we show that scaling records non-uniformly can result i...
Conference Paper
Naiad is a distributed system for executing data parallel, cyclic dataflow programs. It offers the high throughput of batch processors, the low latency of stream processors, and the ability to perform iterative and incremental computations. Although existing systems offer some of these features, applications that require all three have relied on mu...
Chapter
Tracking the progress of computations can be both important and delicate in distributed systems. In a recent distributed algorithm for this purpose, each processor maintains a delayed view of the pending work, which is represented in terms of points in virtual time. This paper presents a formal specification of that algorithm in the temporal logic...
Article
We report on the design and implementation of Naiad, a set of declarative data-parallel language extensions and an associated runtime supporting efficient and composable incremental and iterative computation. This combination is enabled by a new computational model we call differential dataflow, in which incremental computation can be performed usi...
Article
We present a new workflow for differentially-private publication of graph topologies. First, we produce differentially-private measurements of interesting graph statistics using our new version of the PINQ programming language, Weighted PINQ, which is based on a generalization of differential privacy to weighted sets. Next, we show how to generate...
Article
We advance the approach initiated by Chawla et al. for sanitizing (census) data so as to preserve the privacy of respondents while simultaneously extracting "useful" statistical information. First, we extend the scope of their techniques to a broad and rich class of distributions, specifically, mixtures of highdimensional balls, spheres, Gaussians,...
Article
We present a new platform for differentially private data analysis, wPINQ, and decribe its application to the private analysis of social networks. wPINQ generalizes the existing Privacy Integrated Query (PINQ) declarative programming language for differentially private analysis to support weighted datasets, in which records are assigned real-valued...
Article
Full-text available
Grace is a graph-aware, in-memory, transactional graph management system, specifically built for real-time queries and fast iterative computations. It is designed to run on large multi-cores, taking advantage of the inherent parallelism to improve its performance. Grace contains a number of graph-specific and multi-core-specific optimizations inclu...
Article
We investigate the integration of two approaches to information security: information flow analysis, in which the dependence between secret inputs and public outputs is tracked through a program, and differential privacy, in which a weak dependence between input and output is permitted but provided only through a relatively small set of known diffe...
Article
This chapter describes DryadLINQ, a general-purpose system for large-scale data-parallel computing, and illustrates its use on a number of machine learning problems. The main motivation behind the development of DryadLINQ was to make it easier for nonspecialists to write general-purpose, scalable programs that can operate on very large input datase...
Article
DryadLINQ is a system that facilitates the construction of distributed execution plans for processing large amounts of data on clusters containing potentially thousands of computers. In this paper, we explore how to use DryadLINQ to perform basic matrix operations on large matrices.
Article
We present new theoretical results on differentially private data release useful with respect to any target class of counting queries, coupled with experimental results on a variety of real world data sets. Specifically, we study a simple combination of the multiplicative weights approach of [Hardt and Rothblum, 2010] with the exponential mechanism...
Article
Privacy Integrated Queries (PINQ) is an extensible data analysis platform designed to provide unconditional privacy guarantees for the records of the underlying data sets. PINQ provides analysts with access to records through an SQL-like declarative language (LINQ) amidst otherwise arbitrary C# code. At the same time, the design of PINQ's analysis...
Article
We report on the design and implementation of the Privacy Integrated Queries (PINQ) platform for privacy-preserving data analysis. PINQ provides analysts with a programming interface to unscrubbed data through a SQL-like language. At the same time, the design of PINQ's analysis language and its careful implementation provide formal guarantees of di...
Conference Paper
We consider the potential for network trace analysis while providing the guarantees of "differential privacy." While differential privacy provably obscures the presence or absence of individual records in a dataset, it has two major limitations: analyses must (presently) be expressed in a higher level declarative language; and the analysis results...
Conference Paper
In this work, we describe the collection of information retrieval algorithms we have implemented using DryadLINQ. DryadLINQ is a data parallel processing system that allows programmers to write distributed programs without worrying about the implementation of a distributed system. DryadLINQ executes programs containing SQL-like Language Integrated...
Conference Paper
We identify and investigate a strong connection between probabilistic inference and differential privacy, the latter being a recent privacy definition that permits only indirect observation of data through noisy measurement. Previous research on differential privacy has focused on designing measurement processes whose output is likely to be useful...
Conference Paper
Consider the following problem: given a metric space, some of whose points are "clients," select a set of at most k facility locations to minimize the average distance from the clients to their nearest facility. This is just the well-studied k-median problem, for which many approximation algorithms and hardness results are known. Note that the obje...
Article
Consider the following problem: given a metric space, some of whose points are "clients", open a set of at most $k$ facilities to minimize the average distance from the clients to these facilities. This is just the well-studied $k$-median problem, for which many approximation algorithms and hardness results are known. Note that the objective functi...
Conference Paper
We consider the problem of producing recommendations from collective user behavior while simultaneously providing guarantees of privacy for these users. Specifically, we consider the Netflix Prize data set, and its leading algorithms, adapted to the framework of differential privacy. Unlike prior privacy work concerned with cryptographically securi...
Article
Consider the following problem: given a metric space, some of whose points are "clients," select a set of at most k facility locations to minimize the average distance from the clients to their nearest facility. This is just the well-studied k-median problem, for which many approximation algorithms and hardness results are known. Note that the obje...
Article
We consider the problem of producing recommendations from collective user behavior while simultaneously providing guarantees of privacy for these users. Specifically, we consider the Netflix Prize data set, and its leading algorithms, adapted to the framework of differential privacy. Unlike prior privacy work concerned with cryptographically securi...
Conference Paper
Privacy Integrated Queries (PINQ) is an extensible data analysis platform designed to provide unconditional privacy guarantees for the records of the underlying data sets. PINQ provides analysts with access to records through an SQL-like declarative language (LINQ) amidst otherwise arbitrary C# code. At the same time, the design of PINQ's analysis...
Article
Consider a pollster who wishes to collect private, sensitive data from a number of distrustful individuals. How might the pollster convince the respondents that it is trustworthy? Alternately, what mechanism could the respondents insist upon to ensure that mismanagement of their data is detectable and publicly demonstrable? We detail this problem,...
Article
Full-text available
DryadLINQ is a system and a set of language extensions that enable a new programming model for large scale distributed computing. This technical report contains annotated listings of several example programs written using DryadLINQ, illustrating typical usage.
Article
In many large network settings, such as computer networks, social networks, or hyperlinked text documents, much information can be obtained from the network's spectral properties. However, traditional centralized approaches for computing eigenvectors struggle with at least two obstacles: the data may be difficult to obtain (both due to technical re...
Conference Paper
We study the role that privacy-preserving algorithms, which prevent the leakage of specific information about participants, can play in the design of mechanisms for strategic agents, which must encourage players to honestly report information. Specifically, we show that the recent notion of differential privacv, in addition to its own intrinsic vir...
Article
Given a matrix A, it is often desirable to find a good approximation to A that has low rank. We introduce a simple technique for accelerating the computation of such approximations when A has strong spectral features, that is, when the singular values of interest are significantly greater than those of a random matrix with size and entries similar...
Article
Full-text available
We describe an efficient procedure for sampling representa-tives from a weighted set such that the probability that for any weightings S and T , the probability that the two choose the same sample is the Jacard similarity: P r[sample(S) = sample(T)] = P x min(S(x), T (x)) P x max(S(x), T (x)) . The sampling process takes expected time linear in the...
Conference Paper
The contingency table is a work horse of ocial statistics, the format of reported data for the US Census, Bureau of Labor Statistics, and the Internal Revenue Service. In many settings such as these privacy is not only ethically man- dated, but frequently legally as well. Consequently there is an extensive and diverse literature dedicated to the pr...
Conference Paper
This work is at the intersection of two lines of research. One line, initiated by Dinur and Nissim, investigates the price, in accuracy, of protecting privacy in a statistical database. The second, growing from an extensive literature on compressed sensing (see in particular the work of Donoho and collab- orators (14, 7, 13, 11)) and explicitly con...
Conference Paper
In this work we provide efficient distributed protocols for generating shares of random noise, secure against malicious participants. The purpose of the noise generation is to create a distributed implementation of the privacy-preserving statistical databases described in recent papers [14,4,13]. In these databases, privacy is obtained by perturbin...
Conference Paper
Full-text available
We continue a line of research initiated in [10,11]on privacy-preserving statistical databases. Consider a trusted server that holds a database of sensitive information. Given a query function f mapping databases to reals, the so-called true answer is the result of applying f to the database. To protect privacy, the true answer is perturbed by the...
Conference Paper
Consider a pollster who wishes to collect private, sensitive data from a number of distrustful individuals. How might the pollster convince the respondents that it is trustworthy? Alternately, what mechanism could the respondents insist upon to ensure that mismanagement of their data is detectable and publicly demonstrable?We detail this problem, a...
Article
In this work we provide efficient distributed protocols for generating shares of random noise, secure against malicious participants. The purpose of the noise generation is to come up with a distributed implementation of the privacy-preserving statistical databases described in recent papers [21, 6, 20]. In these databases, privacy is obtained by p...
Conference Paper
We consider the problem of learning mixtures of distributions via spectral methods and derive a characterization of when such methods are useful. Specifically, given a mixture-sample, let \(\bar\mu_{i}, {\bar C_{i}}, \bar w_{i}\) denote the empirical mean, covariance matrix, and mixing weight of the samples from the i-th component. We prove that a...
Conference Paper
Full-text available
Peer-to-peer (P2P) worms exploit common vul- nerabilities in member hosts of a P2P network and spread topologically in the P2P network, a potentially more effective strategy than random scanning for locating victims. This paper describes the danger posed by P2P worms and initiates the study of possible mitigation mechanisms. In particular, the pape...
Conference Paper
We initiate a theoretical study of the census problem. Informally, in a census individual respondents give private information to a trusted party (the census bureau), who publishes a sanitized version of the data. There are two fundamentally conflicting requirements: privacy for the respondents and utility of the sanitized data. Unlike in the study...
Article
In a census, individual respondents give private information to a trusted party (the census bureau), who publishes a sanitized version of the data. There are two fundamentally conflicting requirements: privacy for the respondents and utility of the sanitized data. Note that this framework is inherently noninteractive. Recently, Chawla et al. (TCC'2...
Conference Paper
Full-text available
We consider a statistical database in which a trusted administrator introduces noise to the query responses with the goal of maintaining privacy of individual database entries. In such a database, a query consists of a pair (S, f) where S is a set of rows in the database and f is a function mapping database rows to {0, 1}. The true answer is ΣiεS f...
Conference Paper
Full-text available
We study the problem of pricing items for sale to consumers so as to maximize the seller's revenue. We assume that for each consumer, we know the maximum amount he would be willing to pay for each bundle of items, and want to find pricings of the items with corresponding allocations that maximize seller profit and at the same time are envy-free, wh...
Conference Paper
We extend spectral methods to random graphs with skewed degree distributions through a degree based normalization closely connected to the normalized Laplacian. The normalization is based on intuition drawn from perturbation theory of random matrices, and has the effect of boosting the expectation of the random adjacency matrix without increasing t...
Conference Paper
In many large network settings, such as computer networks, social networks, or hyperlinked text documents, much information can be obtained from the network's spectral properties. However, traditional centralized approaches for computing eigenvectors struggle with at least two obstacles: the data may be difficult to obtain (both due to technical re...
Article
We propose randomized techniques for speeding up Kernel Principal Component Analysis on three levels: sampling and quantization of the Gram matrix in training, randomized rounding in evaluating the kernel expansions, and random projections in evaluating the kernel itself. In all three cases, we give sharp bounds on the accuracy of the obtained appr...
Conference Paper
Problems such as bisection, graph coloring, and clique are generally believed hard in the worst case. However, they can be solved if the input data is drawn randomly from a distribution over graphs containing acceptable solutions. In this paper we show that a simple spectral algorithm can solve all three problems above in the average case, as well...
Conference Paper
Given a matrix A it is often desirable to find an approximation to A that has low rank. We introduce a simple technique for accelerating the computation of such approximations when A has strong spectral structure, i.e., when the singular values of interest are significantly greater than those of a random matrix with size and entries similar to A. O...
Article
Full-text available
Experimental evidence suggests that spectral techniques are valuable for a wide range of applications. A partial list of such applications include (i) semantic analysis of documents used to cluster documents into areas of interest, (ii) collaborative ltering | the reconstruction of missing data items, and (iii) determining the relative importance o...
Article
Full-text available
Experimental evidence suggests that spectral techniques are valuable for a wide range of applications. A partial list of such applications include (i) semantic analysis of documents used to cluster documents into areas of interest, (ii) collaborative filtering - the reconstruction of missing data items, and (iii) determining the relative importance...
Article
Given a matrix A it is often desirable to find an approximation to A that has low rank. We introduce a simple technique for accelerating the computation of such approximations when A has strong spectral structure, i.e., when the singular values of interest are significantly greater than those of a random matrix with size and entries similar to A. O...
Conference Paper
Full-text available
We present a model for web search that captures in a unified manner three critical components of the problem: how the link structure of the web is generated, how the content of a web document is generated, and how a human searcher generates a query. The key to this unification lies in capturing the correlations between these components in terms of...
Conference Paper
We propose randomized techniques for speeding up Kernel Principal Component Analysis on three levels: sampling and quantization of the Gram matrix in training, randomized rounding in evaluating the kernel expansions, and random projections in evaluating the kernel itself. In all three cases, we give sharp bounds on the accuracy of the obtained appr...
Article
Thesis (Ph. D.)--University of Washington, 2004 "Spectral methods" captures generally the class of algorithms which cast their input as a matrix and employ linear algebraic techniques, typically involving the eigenvectors or singular vectors of the matrix. Spectral techniques have had much success in a variety of data analysis domains, from text cl...
Article
Existing computational models for processing continuously changing input data are unable to efficiently support itera-tive queries except in limited special cases. This makes it difficult to perform complex tasks, such as social-graph anal-ysis on changing data at interactive timescales, which would greatly benefit those analyzing the behavior of s...
Article
We study the problem of pricing items for sale to consumers so as to maximize the seller's revenue. We assume that for each consumer, we know the maximum amount he would be willing to pay for each bundle of items, and want to find pricings of the items with corresponding allocations that maximize seller profit and at the same time are envy-free, wh...

Citations

... DBSP is tightly related to Differential Dataflow (DD) [28,31] and its theoretical foundations [2] (and recently [9,27]). All DBSP operators are based on DD operators. ...
... Find the BGP matchings: Find all matchings of P in G using any of the available algorithms for this task [42,3]. Run the automaton: For each obtained matching µ: ...
... On the other hand, SBR may require transfer of a larger state size compared to SBK, if all the keys of a skewed worker are shared with the helper. There are existing works in literature that address these concerns [29,31,39,40,47,58]. In the remainder of this subsection, we compare these two approaches from the perspective of their effects on the results shown to the user. ...
... In this note, we present a simple differentially private algorithm for the global minimum cut problem using only one call to the exponential mechanism. This problem was first studied by Gupta et al. [2010], and they gave a differentially private algorithm with near-optimal utility guarantees. We improve upon their work in many aspects: our algorithm is simpler, more natural, and more efficient than the one given in Gupta et al. [2010], and furthermore provides slightly better privacy and utility guarantees. ...
... Megaphone can specify migrations on a key-by-key basis, and then optimizes this by batching at varying granularities; as Figure 1 shows, the improvement over all-at-once migration can be dramatic. This paper is an extended version of a preliminary workshop publication [19]. In this paper, we describe a more general mechanism, further detail its implementation, and evaluate it more thoroughly on realistic workloads. ...
... Differential privacy [49] has become the leading formalization of privacy. Essentially, the removal of one user n's dataset D n from the dataset tuple D should not affect significantly the outcome of a (user-level) differentially private algorithm. ...
... Two common algorithms are GenericJoin [57] and LeapFrog TrieJoin [87]. Parallel approaches for finding subgraph isomorphism based on these algorithms were proposed by [2] and [29] respectively. Mhedhbi et al. combined both binary joins and the GenericJoin algorithm to evaluate subgraph isomorphism in [56]. ...
... Dataflow-based computational models were proposed to perform complex analytics on high-volume data sets: the timely dataflow [69,70] model targets batch processing, while its extension, differential dataflow [65], targets incremental processing. ...
... DBSP is tightly related to Differential Dataflow (DD) [28,31] and its theoretical foundations [2] (and recently [9,27]). All DBSP operators are based on DD operators. ...
... Although it is very common for Gremlin queries to terminate with a top-k constraint and/or aggregate operation, such an explosion of intermediate results can often lead to memory crisis, especially in an interactive environment with limited memory configuration. While several techniques exist for alleviating memory scarcity in dataflow execution, such as backpressure and memory swapping, they cannot be directly applied in GAIA due to potential deadlocks [25,31] and/or high (disk I/O) latency. To ensure bounded-memory execution without sacrificing performance (parallelism), the local executor in GAIA employs a new mechanism for dataflow execution, called dynamic scheduling. ...