Devavrat Shah

Massachusetts Institute of Technology, Cambridge, Massachusetts, United States

Publications (189) · 106.3 Total Impact

  • Source
    Guy Bresler, David Gamarnik, Devavrat Shah
    ABSTRACT: In this paper we investigate the computational complexity of learning the graph structure underlying a discrete undirected graphical model from i.i.d. samples. We first observe that the notoriously difficult problem of learning parities with noise can be captured as a special case of learning graphical models. This leads to an unconditional computational lower bound of $\Omega (p^{d/2})$ for learning general graphical models on $p$ nodes of maximum degree $d$, for the class of so-called statistical algorithms recently introduced by Feldman et al (2013). The lower bound suggests that the $O(p^d)$ runtime required to exhaustively search over neighborhoods cannot be significantly improved without restricting the class of models. Aside from structural assumptions on the graph such as it being a tree, hypertree, tree-like, etc., many recent papers on structure learning assume that the model has the correlation decay property. Indeed, focusing on ferromagnetic Ising models, Bento and Montanari (2009) showed that all known low-complexity algorithms fail to learn simple graphs when the interaction strength exceeds a number related to the correlation decay threshold. Our second set of results gives a class of repelling (antiferromagnetic) models that have the opposite behavior: very strong interaction allows efficient learning in time $O(p^2)$. We provide an algorithm whose performance interpolates between $O(p^2)$ and $O(p^{d+2})$ depending on the strength of the repulsion.
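    For intuition, the following is a minimal sketch (not the paper's algorithm) of the exhaustive neighborhood search that the $O(p^d)$ baseline refers to: each node's neighborhood is found by brute-force search over candidate sets, scored by empirical conditional entropy. The stopping tolerance `eps` and the variable names are illustrative assumptions.

    ```python
    from itertools import combinations
    from collections import Counter
    import math

    def conditional_entropy(samples, i, S):
        """Empirical H(X_i | X_S) from samples, each a tuple of binary values."""
        joint, marg = Counter(), Counter()
        for x in samples:
            key = tuple(x[j] for j in S)
            joint[(key, x[i])] += 1
            marg[key] += 1
        n = len(samples)
        return -sum((c / n) * math.log(c / marg[key]) for (key, _), c in joint.items())

    def learn_neighborhood(samples, i, p, d, eps=1e-2):
        """Grow a candidate neighborhood of node i (size <= d) by exhaustive search,
        stopping once larger sets no longer reduce the conditional entropy by eps."""
        candidates = [j for j in range(p) if j != i]
        best_S, best_h = (), conditional_entropy(samples, i, ())
        for k in range(1, d + 1):
            S = min(combinations(candidates, k),
                    key=lambda T: conditional_entropy(samples, i, T))
            h = conditional_entropy(samples, i, S)
            if best_h - h < eps:         # no meaningful gain: stop growing
                return best_S
            best_S, best_h = S, h
        return best_S
    ```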
  • Source
    ABSTRACT: We consider approximating a single component of the solution to a system of linear equations $Ax = b$, where $A$ is an invertible real matrix and $b \in \mathbb{R}^n$. If $A$ is either diagonally dominant or positive definite, we can equivalently solve for $x_i$ in $x = Gx + z$ for some $G$ and $z$ such that spectral radius $\rho(G) < 1$. Existing algorithms either focus on computing the full vector $x$ or use Monte Carlo methods to estimate a component $x_i$ under the condition $\|G\|_{\infty} < 1$. We consider the setting where $n$ is large, yet $G$ is sparse, i.e., each row has at most $d$ nonzero entries. We present synchronous and asynchronous randomized variants of a local algorithm which relies on the Neumann series characterization of the component $x_i$, and allows us to limit the sparsity of the vectors involved in the computation, leading to improved convergence rates. Both variants of our algorithm produce an estimate $\hat{x}_i$ such that $|\hat{x}_i - x_i| \leq \epsilon \|x\|_2$, and we provide convergence guarantees when $\|G\|_2 < 1$. We prove that the synchronous local algorithm uses at most $O(\min(d \epsilon^{\ln(d)/\ln(\|G\|_2)}, dn\ln(\epsilon)/\ln(\|G\|_2)))$ multiplications. The asynchronous local algorithm adaptively samples one coordinate to update among the nonzero coordinates of the current iterate in each time step. We prove with high probability that the error contracts by a time varying factor in each step, guaranteeing that the algorithm converges to the correct solution. With probability at least $1 - \delta$, the asynchronous randomized algorithm uses at most $O(\min(d (\epsilon \sqrt{\delta/5})^{-d/(1-\|G\|_2)}, -dn \ln (\epsilon \sqrt{\delta})/(1-\|G\|_2)))$ multiplications. Thus our algorithms obtain an approximation for $x_i$ in constant time with respect to the size of the matrix when $d = O(1)$ and $1/(1-\|G\|_2) = O(1)$ as a function of $n$.
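    A minimal sketch, under assumed data structures (sparse rows of $G$ stored as dicts), of the synchronous local variant: it accumulates the truncated Neumann series $x_i = \sum_k e_i^T G^k z$ while keeping the iterate sparse. The stopping rule and names are illustrative, not the paper's exact pseudocode.

    ```python
    def local_solve_component(G_rows, z, i, eps=1e-6, max_iter=100):
        """G_rows[r] is a dict {column: value} holding row r of the sparse matrix G;
        z is a list of floats. Returns an estimate of x_i for x = G x + z."""
        residual = {i: 1.0}          # sparse row vector e_i^T G^k, starting at k = 0
        estimate = 0.0
        for _ in range(max_iter):
            estimate += sum(v * z[r] for r, v in residual.items())
            nxt = {}                 # one synchronous update: residual <- residual G
            for r, v in residual.items():
                for c, g in G_rows[r].items():
                    nxt[c] = nxt.get(c, 0.0) + v * g
            residual = nxt
            if sum(v * v for v in residual.values()) ** 0.5 < eps:
                break                # residual norm small: remaining terms negligible
        return estimate
    ```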
  • Source
    Sewoong Oh, Devavrat Shah
    ABSTRACT: Motivated by generating personalized recommendations using ordinal (or preference) data, we study the question of learning a mixture of Multinomial Logit (MNL) models, a parameterized class of distributions over permutations, from partial ordinal or preference data (e.g. pair-wise comparisons). Despite its long-standing importance across disciplines including social choice, operations research and revenue management, little is known about this question. In the case of a single MNL model (no mixture), computationally and statistically tractable learning from pair-wise comparisons is feasible. However, even learning a mixture with two MNL components is infeasible in general. Given this state of affairs, we seek conditions under which it is feasible to learn the mixture model in a both computationally and statistically efficient manner. We present a sufficient condition as well as an efficient algorithm for learning mixed MNL models from partial preference/comparison data. In particular, a mixture of $r$ MNL components over $n$ objects can be learnt using samples whose size scales polynomially in $n$ and $r$ (concretely, $r^{3.5}n^3(\log n)^4$, with $r\ll n^{2/7}$ when the model parameters are sufficiently incoherent). The algorithm has two phases: first, learn the pair-wise marginals for each component using tensor decomposition; second, learn the model parameters for each component using Rank Centrality introduced by Negahban et al. In the process of proving these results, we obtain a generalization of existing analysis for tensor decomposition to a more realistic regime where only partial information about each sample is available.
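    For concreteness, a small sketch of the data model assumed here: pairwise comparisons generated by a mixture of $r$ MNL components. The uniform choice of pairs and the variable names are illustrative assumptions, not the paper's setup verbatim.

    ```python
    import random

    def sample_comparisons(mixture_weights, component_scores, num_samples, rng=random):
        """mixture_weights: r probabilities summing to 1; component_scores: r lists
        of positive MNL weights over n items. Returns (winner, loser) pairs."""
        n = len(component_scores[0])
        data = []
        for _ in range(num_samples):
            k = rng.choices(range(len(mixture_weights)), weights=mixture_weights)[0]
            i, j = rng.sample(range(n), 2)
            w = component_scores[k]
            p_i_beats_j = w[i] / (w[i] + w[j])   # MNL / Bradley-Terry choice rule
            data.append((i, j) if rng.random() < p_i_beats_j else (j, i))
        return data
    ```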
  • Source
    Guy Bresler, George H. Chen, Devavrat Shah
    ABSTRACT: Despite the prevalence of collaborative filtering in recommendation systems, there has been little theoretical development on why and how well it works, especially in the "online" setting, where items are recommended to users over time. We address this theoretical gap by introducing a model for online recommendation systems, cast item recommendation under the model as a learning problem, and analyze the performance of a cosine-similarity collaborative filtering method. In our model, each of $n$ users either likes or dislikes each of $m$ items. We assume there to be $k$ types of users, and all the users of a given type share a common string of probabilities determining the chance of liking each item. At each time step, we recommend an item to each user, where a key distinction from related bandit literature is that once a user consumes an item (e.g., watches a movie), then that item cannot be recommended to the same user again. The goal is to maximize the number of likable items recommended to users over time. Our main result establishes that after nearly $\log(km)$ initial learning time steps, a simple collaborative filtering algorithm achieves essentially optimal performance without knowing $k$. The algorithm has an exploitation step that uses cosine similarity and two types of exploration steps, one to explore the space of items (standard in the literature) and the other to explore similarity between users (novel to this work).
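    A minimal sketch (with an assumed array layout) of the exploitation step analyzed here: recommend to a user the unconsumed item most liked by the users whose observed rating vectors are most cosine-similar to theirs. The two exploration steps are omitted.

    ```python
    import numpy as np

    def recommend(ratings, consumed, user, num_neighbors=10):
        """ratings: (n_users, n_items) array with +1 (like), -1 (dislike), 0 (unrated);
        consumed: boolean mask over items already shown to `user`."""
        u = ratings[user].astype(float)
        sims = []
        for v in range(ratings.shape[0]):
            if v == user:
                continue
            w = ratings[v].astype(float)
            denom = np.linalg.norm(u) * np.linalg.norm(w)
            sims.append((0.0 if denom == 0 else float(u @ w) / denom, v))
        neighbors = [v for _, v in sorted(sims, reverse=True)[:num_neighbors]]
        # score each item by how much the most similar users liked it
        scores = ratings[neighbors].sum(axis=0).astype(float)
        scores[consumed] = -np.inf       # never re-recommend a consumed item
        best = int(np.argmax(scores))
        return None if np.isinf(scores[best]) else best
    ```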
  • Source
    Guy Bresler, David Gamarnik, Devavrat Shah
    ABSTRACT: In this paper we consider the problem of learning undirected graphical models from data generated according to the Glauber dynamics. The Glauber dynamics is a Markov chain that sequentially updates individual nodes (variables) in a graphical model and it is frequently used to sample from the stationary distribution (to which it converges given sufficient time). Additionally, the Glauber dynamics is a natural dynamical model in a variety of settings. This work deviates from the standard formulation of graphical model learning in the literature, where one assumes access to i.i.d. samples from the distribution. Much of the research on graphical model learning has been directed towards finding algorithms with low computational cost. As the main result of this work, we establish that the problem of reconstructing binary pairwise graphical models is computationally tractable when we observe the Glauber dynamics. Specifically, we show that a binary pairwise graphical model on $p$ nodes with maximum degree $d$ can be learned in time $f(d)p^3\log p$, for a function $f(d)$, using nearly the information-theoretic minimum possible number of samples. There is no known algorithm of comparable efficiency for learning arbitrary binary pairwise models from i.i.d. samples.
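    A minimal sketch of the observation model: at each step one uniformly chosen node resamples its spin from its conditional distribution given its neighbors, and the learner sees the resulting trajectory. The Ising-style coupling parameterization and names are illustrative assumptions.

    ```python
    import math, random

    def glauber_step(state, adj, rng=random):
        """state: dict node -> +-1; adj: dict node -> {neighbor: coupling}.
        Performs one Glauber update in place and returns the updated node."""
        i = rng.choice(list(state))
        field = sum(J * state[j] for j, J in adj[i].items())
        p_plus = 1.0 / (1.0 + math.exp(-2.0 * field))   # P(x_i = +1 | neighbors)
        state[i] = 1 if rng.random() < p_plus else -1
        return i

    def generate_trajectory(state, adj, steps, rng=random):
        """Return the (updated node, new spin) observations the learner gets to see."""
        trace = []
        for _ in range(steps):
            i = glauber_step(state, adj, rng)
            trace.append((i, state[i]))
        return trace
    ```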
  • Source
    Devavrat Shah, Kang Zhang
    ABSTRACT: In this paper, we discuss the method of Bayesian regression and its efficacy for predicting price variation of Bitcoin, a recently popularized virtual, cryptographic currency. Bayesian regression refers to utilizing empirical data as a proxy to perform Bayesian inference. We utilize Bayesian regression for the so-called "latent source model". The Bayesian regression for the "latent source model" was introduced and discussed by Chen, Nikolov and Shah (2013) and Bresler, Chen and Shah (2014) for the purpose of binary classification. They established theoretical as well as empirical efficacy of the method for the setting of binary classification. In this paper, we instead utilize it for predicting a real-valued quantity, the price of Bitcoin. Based on this price prediction method, we devise a simple strategy for trading Bitcoin. The strategy is able to nearly double the investment in less than a 60-day period when run against a real data trace.
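    A minimal sketch of the kind of estimator described here: the predicted price change is an exponentially weighted average of historical changes, weighted by how similar the current price window is to historical windows. The z-scored squared-distance similarity and the constant c are illustrative assumptions, not the paper's exact choices.

    ```python
    import numpy as np

    def predict_price_change(current_window, past_windows, past_changes, c=1.0):
        """current_window: recent prices (length w); past_windows: (m, w) array of
        historical windows; past_changes: length-m array of the change that followed
        each window. Returns the weighted-average prediction."""
        def zscore(x):
            s = x.std()
            return (x - x.mean()) / (s if s > 0 else 1.0)
        cur = zscore(np.asarray(current_window, dtype=float))
        windows = np.asarray(past_windows, dtype=float)
        sims = np.array([-np.sum((cur - zscore(w)) ** 2) for w in windows])
        weights = np.exp(c * (sims - sims.max()))     # shift for numerical stability
        return float(np.dot(weights, np.asarray(past_changes, float)) / weights.sum())
    ```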
  • Source
    Guy Bresler, David Gamarnik, Devavrat Shah
    ABSTRACT: We consider the problem of learning the canonical parameters specifying an undirected graphical model (Markov random field) from the mean parameters. For graphical models representing a minimal exponential family, the canonical parameters are uniquely determined by the mean parameters, so the problem is feasible in principle. The goal of this paper is to investigate the computational feasibility of this statistical task. Our main result shows that parameter estimation is in general intractable: no algorithm can learn the canonical parameters of a generic pair-wise binary graphical model from the mean parameters in time bounded by a polynomial in the number of variables (unless RP = NP). Indeed, such a result has been believed to be true (see the monograph by Wainwright and Jordan (2008)) but no proof was known. Our proof gives a polynomial time reduction from approximating the partition function of the hard-core model, known to be hard, to learning approximate parameters. Our reduction entails showing that the marginal polytope boundary has an inherent repulsive property, which validates an optimization procedure over the polytope that does not use any knowledge of its structure (as required by the ellipsoid method and others).
  •
    ABSTRACT: An ideal datacenter network should provide several properties, including low median and tail latency, high utilization (throughput), fair allocation of network resources between users or applications, deadline-aware scheduling, and congestion (loss) avoidance. Current datacenter networks inherit the principles that went into the design of the Internet, where packet transmission and path selection decisions are distributed among the endpoints and routers. Instead, we propose that each sender should delegate control to a centralized arbiter, which decides when each packet should be transmitted and what path it should follow. This paper describes Fastpass, a datacenter network architecture built using this principle. Fastpass incorporates two fast algorithms: the first determines the time at which each packet should be transmitted, while the second determines the path to use for that packet. In addition, Fastpass uses an efficient protocol between the endpoints and the arbiter and an arbiter replication strategy for fault-tolerant failover. We deployed and evaluated Fastpass in a portion of Facebook's datacenter network. Our results show that Fastpass achieves high throughput comparable to current networks with a 240x reduction in queue lengths (4.35 Mbytes reduced to 18 Kbytes), achieves much fairer and more consistent flow throughputs than baseline TCP (5200x reduction in the standard deviation of per-flow throughput with five concurrent connections), scales from 1 to 8 cores in the arbiter implementation with the ability to schedule 2.21 Terabits/s of traffic in software on eight cores, and achieves a 2.5x reduction in the number of TCP retransmissions in a latency-sensitive service at Facebook.
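    A minimal, purely illustrative sketch (with assumed data structures; not Fastpass's actual timeslot allocator or path selector) of the constraint the arbiter enforces per timeslot: each source sends at most one packet and each destination receives at most one, so pending demands can be admitted greedily as a maximal matching and the rest carried to the next timeslot.

    ```python
    from collections import deque

    def allocate_timeslot(pending):
        """pending: deque of (src, dst) demands. Returns the (src, dst) pairs
        admitted in this timeslot and leaves the rest queued for the next one."""
        used_src, used_dst, admitted, deferred = set(), set(), [], deque()
        while pending:
            src, dst = pending.popleft()
            if src not in used_src and dst not in used_dst:
                admitted.append((src, dst))
                used_src.add(src)
                used_dst.add(dst)
            else:
                deferred.append((src, dst))
        pending.extend(deferred)
        return admitted
    ```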
  • ACM SIGMETRICS Performance Evaluation Review 06/2014; 42(1):565-566. DOI:10.1145/2637364.2592020
  •
    ABSTRACT: Computing a ranking over choices using consumer data gathered from a heterogeneous population has become an indispensable module for any modern consumer information system, e.g. Yelp, Netflix, Amazon and app stores like Google Play. In such applications, a ranking or recommendation algorithm needs to extract meaningful information from noisy data accurately and in a scalable manner. A principled approach to resolving this challenge requires a model that connects observations to recommendation decisions and a tractable inference algorithm utilizing this model. To that end, we abstract the preference data generated by consumers as noisy, partial realizations of their innate preferences, i.e. orderings or permutations over choices. Inspired by the seminal works of Samuelson (cf. axiom of revealed preferences) and of McFadden (cf. discrete choice models for transportation), we model the population's innate preferences as a mixture of Multinomial Logit models (the so-called MMNL model). Under this model, the recommendation problem boils down to (a) learning the MMNL model from population data, (b) finding an MNL component within the mixture that closely represents the revealed preferences of the consumer at hand, and (c) recommending other choices to her/him that are ranked high according to the thus-found component. In this work, we address the problem of learning the MMNL model from partial preferences. We identify fundamental limitations of any algorithm to learn such a model and provide conditions under which a simple, data-driven (non-parametric) algorithm learns the model effectively. The proposed algorithm has a pleasant similarity to standard collaborative filtering for scalar (or star) ratings, but in the domain of permutations. This work advances the state of the art in the domain of learning distributions over permutations (cf. [2]) as well as in the context of learning mixture distributions (cf. [4]).
    ACM SIGMETRICS Performance Evaluation Review 06/2014; 42(1). DOI:10.1145/2591971.2592020
  • Source
    ABSTRACT: We study the optimal scaling of the expected total queue size in an $n\times n$ input-queued switch, as a function of the number of ports $n$ and the load factor $\rho$, which has been conjectured to be $\Theta (n/(1-\rho))$. In a recent work, the validity of this conjecture has been established for the regime where $1-\rho = O(1/n^2)$. In this paper, we make further progress in the direction of this conjecture. We provide a new class of scheduling policies under which the expected total queue size scales as $O(n^{1.5}(1-\rho)^{-1}\log(1/(1-\rho)))$ when $1-\rho = O(1/n)$. This is an improvement over the state of the art; for example, for $\rho = 1 - 1/n$ the best known bound was $O(n^3)$, while ours is $O(n^{2.5}\log n)$.
  • Source
    ABSTRACT: Computing the stationary distribution of a large finite or countably infinite state space Markov chain has become central to many problems such as statistical inference and network analysis. Standard methods involve large matrix multiplications, as in power iteration, or simulations of long random walks, as in Markov Chain Monte Carlo (MCMC). For both methods, the convergence rate is difficult to determine for general Markov chains. Power iteration is costly, as it is global and involves computation at every state. In this paper, we provide a novel local algorithm that answers whether a chosen state in a Markov chain has stationary probability larger than some $\Delta \in (0,1)$, and outputs an estimate of the stationary probability for itself and other nearby states. Our algorithm runs in constant time with respect to the Markov chain, using information from a local neighborhood of the state on the graph induced by the Markov chain, which has constant size relative to the state space. The multiplicative error of the estimate is upper bounded by a function of the mixing properties of the Markov chain. Simulation results show Markov chains for which this method gives tight estimates.
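    A minimal sketch (assumed interface, not the paper's exact procedure) of the local idea: the stationary probability of a state equals the reciprocal of its expected return time, so short truncated random walks started at that state yield a purely local estimate.

    ```python
    import random

    def estimate_stationary(transitions, state, num_walks=1000, max_len=1000, rng=random):
        """transitions: dict state -> list of (next_state, probability);
        returns a local estimate of the stationary probability of `state`."""
        def step(s):
            u, acc = rng.random(), 0.0
            for nxt, p in transitions[s]:
                acc += p
                if u <= acc:
                    return nxt
            return transitions[s][-1][0]
        total = 0
        for _ in range(num_walks):
            s, t = step(state), 1
            while s != state and t < max_len:
                s, t = step(s), t + 1
            total += t                 # return time, truncated at max_len
        return num_walks / total       # pi_i is approximately 1 / E[return time]
    ```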
  • Source
    ABSTRACT: This paper presents a novel meta algorithm, Partition-Merge (PM), which takes existing centralized algorithms for graph computation and makes them distributed and faster. In a nutshell, PM divides the graph into small subgraphs using our novel randomized partitioning scheme, runs the centralized algorithm on each partition separately, and then stitches the resulting solutions to produce a global solution. We demonstrate the efficiency of the PM algorithm on two popular problems: computation of Maximum A Posteriori (MAP) assignment in an arbitrary pairwise Markov Random Field (MRF), and modularity optimization for community detection. We show that the resulting distributed algorithms for these problems essentially run in time linear in the number of nodes in the graph, and perform as well as, or even better than, the original centralized algorithm as long as the graph has geometric structure. Here we say a graph has geometric structure, or the polynomial growth property, when the number of nodes within distance $r$ of any given node grows no faster than a polynomial function of $r$. More precisely, if the centralized algorithm is a $C$-factor approximation with constant $C \ge 1$, the resulting distributed algorithm is a $(C+\delta)$-factor approximation for any small $\delta>0$; but if the centralized algorithm is a non-constant (e.g. logarithmic) factor approximation, then the resulting distributed algorithm becomes a constant factor approximation. For general graphs, we compute explicit bounds on the loss of performance of the resulting distributed algorithm with respect to the centralized algorithm.
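    A minimal sketch of the Partition-Merge template: the BFS-ball partitioning around random centers below is a simplification of the paper's randomized scheme, and `centralized_solver` stands for any black-box routine that labels the nodes of a subgraph.

    ```python
    import random
    from collections import deque

    def partition_merge(nodes, adj, centralized_solver, radius=2, rng=random):
        """adj: dict node -> iterable of neighbors. Returns a global labeling
        obtained by solving each part independently and concatenating."""
        unassigned, parts = set(nodes), []
        while unassigned:
            center = rng.choice(sorted(unassigned))
            ball, frontier = {center}, deque([(center, 0)])
            while frontier:                          # BFS ball of bounded radius
                v, d = frontier.popleft()
                if d == radius:
                    continue
                for w in adj[v]:
                    if w in unassigned and w not in ball:
                        ball.add(w)
                        frontier.append((w, d + 1))
            parts.append(ball)
            unassigned -= ball
        labels = {}
        for part in parts:                           # solve each part, then stitch
            labels.update(centralized_solver(part, adj))
        return labels
    ```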
  •
    ABSTRACT: This brief presents a technique to evaluate the timing variation of static random access memory (SRAM). Specifically, a method called loop flattening, which reduces the evaluation of the timing statistics in the complex highly structured circuit to that of a single chain of component circuits, is justified. Then, to very quickly evaluate the timing delay of a single chain, a statistical method based on importance sampling augmented with targeted high-dimensional spherical sampling can be employed. The overall methodology has shown 650× or greater speedup over the nominal Monte Carlo approach with 10.5% accuracy in probability. Examples based on both the large-signal and small-signal SRAM read path are discussed, and a detailed comparison with state-of-the-art accelerated statistical simulation techniques is given.
    IEEE Transactions on Very Large Scale Integration (VLSI) Systems 08/2013; 21(8):1558-1562. DOI:10.1109/TVLSI.2012.2212254 · 1.14 Impact Factor
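    A minimal illustration, with a toy one-line delay model standing in for the circuit simulator, of the mean-shifted importance sampling step: sample process variations from a shifted Gaussian so that failures become frequent, then reweight by the likelihood ratio to recover the true failure probability. The targeted spherical-sampling augmentation from the brief is not reproduced.

    ```python
    import numpy as np

    def failure_probability(delay_fn, threshold, dim, shift, num_samples=10000, seed=0):
        """delay_fn maps a variation vector to a delay; estimates
        P(delay_fn(X) > threshold) for X ~ N(0, I) via importance sampling."""
        rng = np.random.default_rng(seed)
        x = rng.standard_normal((num_samples, dim)) + shift     # biased samples
        # likelihood ratio N(0, I) / N(shift, I) evaluated at each sample
        log_w = -x @ shift + 0.5 * float(shift @ shift)
        fails = np.array([delay_fn(row) > threshold for row in x])
        return float(np.mean(fails * np.exp(log_w)))

    # toy usage: delay grows with the sum of variations, failure is a far-tail event
    # p_hat = failure_probability(lambda v: v.sum(), threshold=20.0,
    #                             dim=16, shift=np.full(16, 1.5))
    ```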
  • David R. Karger, Sewoong Oh, Devavrat Shah
    ABSTRACT: Crowdsourcing systems like Amazon's Mechanical Turk have emerged as an effective large-scale human-powered platform for performing tasks in domains such as image classification, data entry, recommendation, and proofreading. Since workers are low-paid (a few cents per task) and the tasks performed are monotonous, the answers obtained are noisy and hence unreliable. To obtain reliable estimates, it is essential to utilize appropriate inference algorithms (e.g. majority voting) coupled with structured redundancy through task assignment. Our goal is to obtain the best possible trade-off between reliability and redundancy. In this paper, we consider a general probabilistic model for noisy observations in crowdsourcing systems and pose the problem of minimizing the total price (i.e. redundancy) that must be paid to achieve a target overall reliability. Concretely, we show that it is possible to obtain an answer to each task correctly with probability 1-ε as long as the redundancy per task is O((K/q) log (K/ε)), where each task can take any of $K$ distinct answers (a priori equally likely) and $q$ is the crowd-quality parameter defined through the probabilistic model. Further, this is effectively the best possible redundancy-accuracy trade-off any system design can achieve. Such a single-parameter crisp characterization of the (order-)optimal trade-off between redundancy and reliability has various useful operational consequences. Further, we analyze the robustness of our approach in the presence of adversarial workers and provide a bound on their influence on the redundancy-accuracy trade-off. Unlike recent prior work [GKM11, KOS11, KOS11], our result applies to non-binary (i.e. K>2) tasks. In effect, we utilize algorithms for binary tasks (with an inhomogeneous error model, unlike that in [GKM11, KOS11, KOS11]) as a key subroutine to obtain answers for K-ary tasks. Technically, the algorithm is based on a low-rank approximation of the weighted adjacency matrix of a random regular bipartite graph, weighted according to the answers provided by the workers.
    Proceedings of the ACM SIGMETRICS/international conference on Measurement and modeling of computer systems; 06/2013
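    A minimal baseline sketch of the simplest inference rule mentioned here, plain per-task majority voting over redundant assignments; the paper's algorithm additionally weights workers via a low-rank approximation of the worker-task answer matrix, which is not reproduced in this sketch.

    ```python
    from collections import Counter

    def majority_vote(answers):
        """answers: dict task -> list of worker answers (each one of K labels).
        Returns dict task -> the most common answer."""
        return {task: Counter(resp).most_common(1)[0][0]
                for task, resp in answers.items() if resp}

    # usage: with redundancy 3 per task, noisy workers are outvoted more often
    # labels = majority_vote({"t1": ["a", "a", "b"], "t2": ["c", "c", "c"]})
    ```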
  • Source
    ABSTRACT: We study a binary classification problem whereby an infinite time series having one of two labels ("event" or "non-event") streams in, and we want to predict the label of the time series. Intuitively, the longer we wait, the more of the time series we see and so the more accurate our prediction could be. Conversely, making a prediction too early could result in a grossly inaccurate prediction. In numerous applications, such as predicting an imminent market crash or revealing which topics will go viral in a social network, making an accurate prediction as early as possible is highly valuable. Motivated by these applications, we propose a generative model for time series which we call a latent source model and which we use for non-parametric online time series classification. Our main assumption is that there are only a few ways in which a time series corresponds to an "event", such as a market crashing or a Twitter topic going viral, and that we have access to training data that are noisy versions of these few distinct modes. Our model naturally leads to weighted majority voting as a classification rule, which operates without knowing or learning what the few latent sources are. We establish theoretical performance guarantees for weighted majority voting under the latent source model and then use the voting to predict which news topics on Twitter will go viral and become trends.
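    A minimal sketch of the weighted majority voting rule the model leads to: each labeled training series votes with weight decaying exponentially in its squared distance to the observed prefix, and the label with the larger total vote is predicted. The fixed alignment and the constant gamma are illustrative simplifications.

    ```python
    import numpy as np

    def weighted_majority_vote(observed_prefix, train_series, train_labels, gamma=1.0):
        """observed_prefix: the first T observed samples; train_series: list of
        training time series (each of length >= T); train_labels: 0/1 labels."""
        obs = np.asarray(observed_prefix, dtype=float)
        T = len(obs)
        votes = {0: 0.0, 1: 0.0}
        for series, label in zip(train_series, train_labels):
            dist = float(np.sum((np.asarray(series[:T], dtype=float) - obs) ** 2))
            votes[label] += np.exp(-gamma * dist)   # closer training series vote more
        return max(votes, key=votes.get)
    ```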
  • Conference Paper: A hardware spinal decoder
    ABSTRACT: Spinal codes are a recently proposed capacity-achieving rateless code. While hardware encoding of spinal codes is straightforward, the design of an efficient, high-speed hardware decoder poses significant challenges. We present the first such decoder. By relaxing data dependencies inherent in the classic M-algorithm decoder, we obtain area and throughput competitive with 3GPP turbo codes as well as greatly reduced latency and complexity. The enabling architectural feature is a novel alpha-beta incremental approximate selection algorithm. We also present a method for obtaining hints which anticipate successful or failed decoding, permitting early termination and/or feedback-driven adaptation of the decoding parameters. We have validated our implementation in FPGA with on-air testing. Provisional hardware synthesis suggests that a near-capacity implementation of spinal codes can achieve a throughput of 12.5 Mbps in a 65 nm technology while using substantially less area than competitive 3GPP turbo code implementations.
    Proceedings of the eighth ACM/IEEE symposium on Architectures for networking and communications systems; 10/2012
  • Source
    ABSTRACT: Spinal codes are a new class of rateless codes that enable wireless networks to cope with time-varying channel conditions in a natural way, without requiring any explicit bit rate selection. The key idea in the code is the sequential application of a pseudo-random hash function to the message bits to produce a sequence of coded symbols for transmission. This encoding ensures that two input messages that differ in even one bit lead to very different coded sequences after the point at which they differ, providing good resilience to noise and bit errors. To decode spinal codes, this paper develops an approximate maximum-likelihood decoder, called the bubble decoder, which runs in time polynomial in the message size and achieves the Shannon capacity over both additive white Gaussian noise (AWGN) and binary symmetric channel (BSC) models. Experimental results obtained from a software implementation of a linear-time decoder show that spinal codes achieve higher throughput than fixed-rate LDPC codes, rateless Raptor codes, and the layered rateless coding approach of Strider, across a range of channel conditions and message sizes. An early hardware prototype that can decode at 10 Mbits/s in FPGA demonstrates that spinal codes are a practical construction.
    ACM SIGCOMM Computer Communication Review 09/2012; 42(4). DOI:10.1145/2342356.2342363 · 1.10 Impact Factor
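    A minimal sketch of the encoding idea: a hash function is applied sequentially to k-bit chunks of the message to produce "spine" values, each of which seeds a stream of coded symbols, so messages that differ in one bit diverge after that point. The specific hash (SHA-256) and symbol mapping here are stand-ins, not the paper's construction.

    ```python
    import hashlib

    def spine_values(message_bits, k=4):
        """message_bits: string of '0'/'1'; returns the list of spine values."""
        spine, spines = 0, []
        for i in range(0, len(message_bits), k):
            chunk = message_bits[i:i + k]
            digest = hashlib.sha256(f"{spine}:{chunk}".encode()).hexdigest()
            spine = int(digest[:8], 16)          # next spine value
            spines.append(spine)
        return spines

    def coded_symbols(spines, passes=3, bits_per_symbol=6):
        """Each pass emits one symbol per spine value; more passes -> lower rate."""
        symbols = []
        for p in range(passes):
            for s in spines:
                digest = hashlib.sha256(f"{s}:{p}".encode()).hexdigest()
                symbols.append(int(digest[:4], 16) % (1 << bits_per_symbol))
        return symbols
    ```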
  • Source
    Sahand Negahban, Sewoong Oh, Devavrat Shah
    ABSTRACT: The question of aggregating pairwise comparisons to obtain a global ranking over a collection of objects has been of interest for a very long time: be it ranking of online gamers (e.g. MSR's TrueSkill system) and chess players, aggregating social opinions, or deciding which product to sell based on transactions. In most settings, in addition to obtaining a ranking, finding 'scores' for each object (e.g. a player's rating) is of interest for understanding the intensity of the preferences. In this paper, we propose Rank Centrality, an iterative rank aggregation algorithm for discovering scores for objects (or items) from pairwise comparisons. The algorithm has a natural random walk interpretation over the graph of objects, with an edge present between a pair of objects if they are compared; the score, which we call Rank Centrality, of an object turns out to be its stationary probability under this random walk. To study the efficacy of the algorithm, we consider the popular Bradley-Terry-Luce (BTL) model in which each object has an associated score which determines the probabilistic outcomes of pairwise comparisons between objects. We bound the finite sample error rates between the scores assumed by the BTL model and those estimated by our algorithm. In particular, the number of samples required to learn the score well with high probability depends on the structure of the comparison graph. When the Laplacian of the comparison graph has a strictly positive spectral gap, e.g. each item is compared to a subset of randomly chosen items, this leads to an order-optimal dependence on the number of samples. Experimental evaluations on synthetic datasets generated according to the BTL model show that our algorithm performs as well as the Maximum Likelihood estimator for that model and outperforms a recently proposed algorithm by Ammar and Shah (2011).
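    A minimal sketch of the Rank Centrality idea: build a random walk on the comparison graph whose transition from i to j is proportional to the empirical fraction of comparisons that j won against i, and take its stationary distribution as the scores. The max-degree normalization and plain power iteration below are simplifications and assume a connected comparison graph.

    ```python
    import numpy as np

    def rank_centrality(num_items, comparisons, num_iter=1000):
        """comparisons: list of (winner, loser) pairs.
        Returns a score vector that sums to 1."""
        wins = np.zeros((num_items, num_items))
        for winner, loser in comparisons:
            wins[winner, loser] += 1.0
        totals = wins + wins.T
        with np.errstate(divide="ignore", invalid="ignore"):
            frac = np.where(totals > 0, wins / totals, 0.0)   # P(row beats col)
        d_max = num_items                        # crude max-degree normalization
        P = frac.T / d_max                       # move from i to j in proportion to j's wins over i
        np.fill_diagonal(P, 0.0)
        P += np.diag(1.0 - P.sum(axis=1))        # lazy self-loops keep rows stochastic
        pi = np.full(num_items, 1.0 / num_items)
        for _ in range(num_iter):                # power iteration to the stationary distribution
            pi = pi @ P
        return pi / pi.sum()
    ```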
  • Source
    Sahand Negahban, Sewoong Oh, Devavrat Shah
    ABSTRACT: The question of aggregating pairwise comparisons to obtain a global ranking over a collection of objects has been of interest for a very long time: be it ranking of online gamers (e.g. MSR's TrueSkill system) and chess players, aggregating social opinions, or deciding which product to sell based on transactions. In most settings, in addition to obtaining a ranking, finding 'scores' for each object (e.g. a player's rating) is of interest for understanding the intensity of the preferences. In this paper, we propose a novel iterative rank aggregation algorithm for discovering scores for objects (or items) from pairwise comparisons. The algorithm has a natural random walk interpretation over the graph of objects, with an edge present between a pair of objects if they are compared; the scores turn out to be the stationary probabilities of this random walk. The algorithm is model independent. To establish the efficacy of our method, however, we consider the popular Bradley-Terry-Luce (BTL) model in which each object has an associated score which determines the probabilistic outcomes of pairwise comparisons between objects. We bound the finite sample error rates between the scores assumed by the BTL model and those estimated by our algorithm. In particular, the number of samples required to learn the score well with high probability depends on the structure of the comparison graph. When the Laplacian of the comparison graph has a strictly positive spectral gap, e.g. each item is compared to a subset of randomly chosen items, this leads to an order-optimal dependence on the number of samples. Experimental evaluations on synthetic datasets generated according to the BTL model show that our (model independent) algorithm performs as well as the Maximum Likelihood estimator for that model and outperforms a recently proposed algorithm by Ammar and Shah [AS11].

Publication Stats

3k Citations
106.30 Total Impact Points

Institutions

  • 2006–2014
    • Massachusetts Institute of Technology
      • Laboratory for Information and Decision Systems
      • Department of Electrical Engineering and Computer Science
      Cambridge, Massachusetts, United States
  • 2001–2006
    • Stanford University
      • Department of Computer Science
      • Department of Electrical Engineering
      Palo Alto, California, United States
    • Politecnico di Torino
      • DET - Department of Electronics and Telecommunications
      Torino, Piedmont, Italy
  • 2005
    • Microsoft
      Redmond, Washington, United States