ABSTRACT: There is much empirical evidence that item-item collaborative filtering works
well in practice. Motivated to understand this, we provide a framework to
design and analyze various recommendation algorithms. The setup amounts to
online binary matrix completion, where at each time a random user requests a
recommendation and the algorithm chooses an entry to reveal in the user's row.
The goal is to minimize regret, or equivalently to maximize the number of +1
entries revealed at any time. We analyze an item-item collaborative filtering
algorithm that can achieve fundamentally better performance compared to
user-user collaborative filtering. The algorithm achieves good "cold-start"
performance (appropriately defined) by quickly making good recommendations to
new users about whom there is little information.
ABSTRACT: In this paper we investigate the computational complexity of learning the
graph structure underlying a discrete undirected graphical model from i.i.d.
samples. We first observe that the notoriously difficult problem of learning
parities with noise can be captured as a special case of learning graphical
models. This leads to an unconditional computational lower bound of $\Omega
(p^{d/2})$ for learning general graphical models on $p$ nodes of maximum degree
$d$, for the class of so-called statistical algorithms recently introduced by
Feldman et al. (2013). The lower bound suggests that the $O(p^d)$ runtime
required to exhaustively search over neighborhoods cannot be significantly
improved without restricting the class of models.
Aside from structural assumptions on the graph such as it being a tree,
hypertree, tree-like, etc., many recent papers on structure learning assume
that the model has the correlation decay property. Indeed, focusing on
ferromagnetic Ising models, Bento and Montanari (2009) showed that all known
low-complexity algorithms fail to learn simple graphs when the interaction
strength exceeds a number related to the correlation decay threshold. Our
second set of results gives a class of repelling (antiferromagnetic) models
that have the opposite behavior: very strong interaction allows efficient
learning in time $O(p^2)$. We provide an algorithm whose performance
interpolates between $O(p^2)$ and $O(p^{d+2})$ depending on the strength of the
repulsion.
Advances in neural information processing systems 12/2014; 4.
ABSTRACT: We consider approximating a single component of the solution to a system of
linear equations $Ax = b$, where $A$ is an invertible real matrix and $b \in
\mathbb{R}^n$. If $A$ is either diagonally dominant or positive definite, we
can equivalently solve for $x_i$ in $x = Gx + z$ for some $G$ and $z$ such that
spectral radius $\rho(G) < 1$. Existing algorithms either focus on computing
the full vector $x$ or use Monte Carlo methods to estimate a component $x_i$
under the condition $\|G\|_{\infty} < 1$. We consider the setting where $n$ is
large, yet $G$ is sparse, i.e., each row has at most $d$ nonzero entries. We
present synchronous and asynchronous randomized variants of a local algorithm
which relies on the Neumann series characterization of the component $x_i$, and
allows us to limit the sparsity of the vectors involved in the computation,
leading to improved convergence rates. Both variants of our algorithm produce
an estimate $\hat{x}_i$ such that $|\hat{x}_i - x_i| \leq \epsilon \|x\|_2$,
and we provide convergence guarantees when $\|G\|_2 < 1$. We prove that the
synchronous local algorithm uses at most $O(\min(d
\epsilon^{\ln(d)/\ln(\|G\|_2)}, dn\ln(\epsilon)/\ln(\|G\|_2)))$
multiplications. The asynchronous local algorithm adaptively samples one
coordinate to update among the nonzero coordinates of the current iterate in
each time step. We prove with high probability that the error contracts by a
time varying factor in each step, guaranteeing that the algorithm converges to
the correct solution. With probability at least $1 - \delta$, the asynchronous
randomized algorithm uses at most $O(\min(d (\epsilon
\sqrt{\delta/5})^{-d/(1-\|G\|_2)}, -dn \ln (\epsilon
\sqrt{\delta})/(1-\|G\|_2)))$ multiplications. Thus our algorithms obtain an
approximation for $x_i$ in constant time with respect to the size of the matrix
when $d = O(1)$ and $1/(1-\|G\|_2) = O(1)$ as a function of $n$.
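The Neumann-series idea behind the local algorithm can be sketched concretely: since $x = \sum_k G^k z$, the component of interest is $x_i = e_i^\top \sum_k G^k z$, so it suffices to propagate the row vector $e_i^\top G^k$. A minimal synchronous sketch in Python (dense matrices for clarity; the paper's algorithm exploits the sparsity of $G$, and the stopping rule here is a simplification):

```python
import numpy as np

def local_solve_component(G, z, i, eps=1e-10, max_iter=10000):
    """Estimate x_i for x = G x + z via the truncated Neumann series
    x_i = sum_k (e_i^T G^k) z, propagating only the row vector e_i^T G^k."""
    n = G.shape[0]
    p = np.zeros(n)
    p[i] = 1.0                       # p holds e_i^T G^k, starting at k = 0
    x_i = 0.0
    for _ in range(max_iter):
        x_i += p @ z                 # add the k-th term of the series
        p = p @ G                    # advance to e_i^T G^{k+1}
        if np.linalg.norm(p) < eps:  # remaining terms are negligible
            break
    return x_i
```

When each row of $G$ has at most $d$ nonzeros, $p$ stays supported on the $d^k$-step neighborhood of $i$, which is the source of the locality.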
ABSTRACT: Motivated by generating personalized recommendations using ordinal (or
preference) data, we study the question of learning a mixture of MultiNomial
Logit (MNL) model, a parameterized class of distributions over permutations,
from partial ordinal or preference data (e.g. pair-wise comparisons). Despite
its long standing importance across disciplines including social choice,
operations research and revenue management, little is known about this
question. In the case of a single MNL model (no mixture), computationally and
statistically tractable learning from pair-wise comparisons is feasible.
However, even learning a mixture of two MNL components is infeasible in
general.
Given this state of affairs, we seek conditions under which it is feasible to
learn the mixture model in a computationally and statistically efficient
manner. We present a sufficient condition as well as an efficient algorithm for
learning mixed MNL models from partial preferences/comparisons data. In
particular, a mixture of $r$ MNL components over $n$ objects can be learnt
using samples whose size scales polynomially in $n$ and $r$ (concretely,
$r^{3.5}n^3(\log n)^4$, with $r\ll n^{2/7}$ when the model parameters are
sufficiently incoherent). The algorithm has two phases: first, learn the
pair-wise marginals for each component using tensor decomposition; second,
learn the model parameters for each component using Rank Centrality introduced
by Negahban et al. In the process of proving these results, we obtain a
generalization of existing analysis for tensor decomposition to a more
realistic regime where only partial information about each sample is available.
Advances in neural information processing systems 11/2014; 1.
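The second phase above, Rank Centrality, has a compact description: build a random walk over items whose transition probability from $i$ to $j$ is proportional to the fraction of comparisons that $j$ wins against $i$, and score items by the stationary distribution. A hedged sketch (dense matrices and plain power iteration; the normalization and parameter names here are illustrative choices, not the paper's):

```python
import numpy as np

def rank_centrality(wins, d_max=None):
    """wins[i][j] = fraction of i-vs-j comparisons won by j.
    Build the Rank Centrality random walk and return its stationary
    distribution as item scores."""
    n = wins.shape[0]
    d = d_max or n                       # normalization by (max) degree
    P = wins / d
    np.fill_diagonal(P, 0)
    P += np.diag(1 - P.sum(axis=1))      # self-loops make rows sum to 1
    pi = np.ones(n) / n                  # power iteration to stationarity
    for _ in range(10000):
        pi_next = pi @ P
        if np.linalg.norm(pi_next - pi, 1) < 1e-12:
            pi = pi_next
            break
        pi = pi_next
    return pi
```

For an MNL model with weights $w$, the ideal win fraction is $w_j/(w_i+w_j)$, and detailed balance gives a stationary distribution proportional to $w$, which is why the walk recovers the model parameters.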
ABSTRACT: Despite the prevalence of collaborative filtering in recommendation systems,
there has been little theoretical development on why and how well it works,
especially in the "online" setting, where items are recommended to users over
time. We address this theoretical gap by introducing a model for online
recommendation systems, cast item recommendation under the model as a learning
problem, and analyze the performance of a cosine-similarity collaborative
filtering method. In our model, each of $n$ users either likes or dislikes each
of $m$ items. We assume there to be $k$ types of users, and all the users of a
given type share a common string of probabilities determining the chance of
liking each item. At each time step, we recommend an item to each user, where a
key distinction from related bandit literature is that once a user consumes an
item (e.g., watches a movie), then that item cannot be recommended to the same
user again. The goal is to maximize the number of likable items recommended to
users over time. Our main result establishes that after nearly $\log(km)$
initial learning time steps, a simple collaborative filtering algorithm
achieves essentially optimal performance without knowing $k$. The algorithm has
an exploitation step that uses cosine similarity and two types of exploration
steps, one to explore the space of items (standard in the literature) and the
other to explore similarity between users (novel to this work).
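The exploitation step can be made concrete with a small sketch: compute cosine similarity between users on their commonly rated items, and let sufficiently similar users vote on the target user's unseen items. The two exploration steps are omitted, and the similarity threshold below is our illustrative parameter, not the paper's:

```python
import numpy as np

def recommend(ratings, user, threshold=0.5):
    """ratings: n x m matrix with +1 (like), -1 (dislike), 0 (unseen).
    Recommend an unseen item for `user`, scored by votes from
    cosine-similar users (exploitation step only)."""
    n, m = ratings.shape
    r_u = ratings[user]
    scores = np.zeros(m)
    for v in range(n):
        if v == user:
            continue
        overlap = (r_u != 0) & (ratings[v] != 0)   # commonly rated items
        denom = np.linalg.norm(r_u[overlap]) * np.linalg.norm(ratings[v][overlap])
        sim = (r_u[overlap] @ ratings[v][overlap]) / denom if denom > 0 else 0.0
        if sim >= threshold:
            scores += (r_u == 0) * ratings[v]      # vote on unseen items only
    unseen = np.where(r_u == 0)[0]
    if len(unseen) == 0:
        return None
    return unseen[np.argmax(scores[unseen])]
```

Note the model's key constraint appears here implicitly: only items with `ratings[user] == 0` (not yet consumed) are candidates for recommendation.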
ABSTRACT: In this paper we consider the problem of learning undirected graphical models
from data generated according to the Glauber dynamics. The Glauber dynamics is
a Markov chain that sequentially updates individual nodes (variables) in a
graphical model and it is frequently used to sample from the stationary
distribution (to which it converges given sufficient time). Additionally, the
Glauber dynamics is a natural dynamical model in a variety of settings. This
work deviates from the standard formulation of graphical model learning in the
literature, where one assumes access to i.i.d. samples from the distribution.
Much of the research on graphical model learning has been directed towards
finding algorithms with low computational cost. As the main result of this
work, we establish that the problem of reconstructing binary pairwise graphical
models is computationally tractable when we observe the Glauber dynamics.
Specifically, we show that a binary pairwise graphical model on $p$ nodes with
maximum degree $d$ can be learned in time $f(d)p^3\log p$, for a function
$f(d)$, using nearly the information-theoretic minimum possible number of
samples. There is no known algorithm of comparable efficiency for learning
arbitrary binary pairwise models from i.i.d. samples.
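For context, the observation model is easy to state in code: one Glauber update picks a node uniformly at random and resamples its spin from the conditional distribution given its neighbors. A minimal sketch for a binary pairwise (Ising) model, using the standard parameterization $p(\sigma) \propto \exp(\sum_{ij} J_{ij}\sigma_i\sigma_j + \sum_i h_i\sigma_i)$ (an assumption of this sketch):

```python
import numpy as np

def glauber_step(sigma, J, h, rng):
    """One Glauber update: pick a node uniformly at random and resample
    its spin from the conditional given the current neighbor spins."""
    p = len(sigma)
    i = rng.integers(p)
    # local field at node i (exclude any diagonal term)
    field = h[i] + J[i] @ sigma - J[i, i] * sigma[i]
    prob_plus = 1.0 / (1.0 + np.exp(-2.0 * field))
    sigma[i] = 1 if rng.random() < prob_plus else -1
    return sigma, i
```

Observing which node updated and to what value at each step is exactly the data the learning algorithm above consumes, in place of i.i.d. samples.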
ABSTRACT: In this paper, we discuss the method of Bayesian regression and its efficacy
for predicting price variation of Bitcoin, a recently popularized virtual,
cryptographic currency. Bayesian regression refers to utilizing empirical data
as proxy to perform Bayesian inference. We utilize Bayesian regression for the
so-called "latent source model". The Bayesian regression for "latent source
model" was introduced and discussed by Chen, Nikolov and Shah (2013) and
Bresler, Chen and Shah (2014) for the purpose of binary classification. They
established theoretical as well as empirical efficacy of the method for the
setting of binary classification.
In this paper, we instead utilize it for predicting a real-valued quantity: the
price of Bitcoin. Based on this price prediction method, we devise a simple
strategy for trading Bitcoin. The strategy is able to nearly double the
investment in less than a 60-day period when run against a real data trace.
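The predictor itself is short: the latent-source view leads to a kernel-weighted average of historical outcomes, with each past price pattern weighted by an exponentiated similarity to the current window. A sketch (the Gaussian-kernel similarity and the scale parameter `c` are our illustrative choices, not necessarily the paper's exact ones):

```python
import numpy as np

def predict_change(x, patterns, outcomes, c=1.0):
    """Latent-source Bayesian regression sketch: weight each historical
    pattern by a similarity kernel and average its subsequent change.
    patterns: k x T matrix of past windows; outcomes: their next changes."""
    # negative squared distance as similarity (a Gaussian kernel)
    sims = -np.sum((patterns - x) ** 2, axis=1)
    w = np.exp(c * sims)
    return float(w @ outcomes / w.sum())
```

The trading strategy then reduces to thresholding this predicted change: buy when it is sufficiently positive, sell when sufficiently negative, else hold.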
ABSTRACT: We consider the problem of learning the canonical parameters specifying an
undirected graphical model (Markov random field) from the mean parameters. For
graphical models representing a minimal exponential family, the canonical
parameters are uniquely determined by the mean parameters, so the problem is
feasible in principle. The goal of this paper is to investigate the
computational feasibility of this statistical task. Our main result shows that
parameter estimation is in general intractable: no algorithm can learn the
canonical parameters of a generic pair-wise binary graphical model from the
mean parameters in time bounded by a polynomial in the number of variables
(unless RP = NP). Indeed, such a result has been believed to be true (see the
monograph by Wainwright and Jordan (2008)) but no proof was known.
Our proof gives a polynomial time reduction from approximating the partition
function of the hard-core model, known to be hard, to learning approximate
parameters. Our reduction entails showing that the marginal polytope boundary
has an inherent repulsive property, which validates an optimization procedure
over the polytope that does not use any knowledge of its structure (as required
by the ellipsoid method and others).
Advances in neural information processing systems 09/2014; 2.
ABSTRACT: An ideal datacenter network should provide several properties, including low median and tail latency, high utilization (throughput), fair allocation of network resources between users or applications, deadline-aware scheduling, and congestion (loss) avoidance. Current datacenter networks inherit the principles that went into the design of the Internet, where packet transmission and path selection decisions are distributed among the endpoints and routers. Instead, we propose that each sender should delegate control---to a centralized arbiter---of when each packet should be transmitted and what path it should follow. This paper describes Fastpass, a datacenter network architecture built using this principle. Fastpass incorporates two fast algorithms: the first determines the time at which each packet should be transmitted, while the second determines the path to use for that packet. In addition, Fastpass uses an efficient protocol between the endpoints and the arbiter and an arbiter replication strategy for fault-tolerant failover. We deployed and evaluated Fastpass in a portion of Facebook's datacenter network. Our results show that Fastpass achieves throughput comparable to current networks at a 240x reduction in queue lengths (4.35 Mbytes reduced to 18 Kbytes), achieves much fairer and more consistent flow throughputs than the baseline TCP (5200x reduction in the standard deviation of per-flow throughput with five concurrent connections), scales from 1 to 8 cores in the arbiter implementation with the ability to schedule 2.21 Terabits/s of traffic in software on eight cores, and achieves a 2.5x reduction in the number of TCP retransmissions in a latency-sensitive service at Facebook.
ABSTRACT: Computing a ranking over choices using consumer data gathered from a heterogeneous population has become an indispensable module for any modern consumer information system, e.g. Yelp, Netflix, Amazon and app-stores like Google play. In such applications, a ranking or recommendation algorithm needs to extract meaningful information from noisy data accurately and in a scalable manner. A principled approach to resolving this challenge requires a model that connects observations to recommendation decisions and a tractable inference algorithm utilizing this model. To that end, we abstract the preference data generated by consumers as noisy, partial realizations of their innate preferences, i.e. orderings or permutations over choices. Inspired by the seminal works of Samuelson (cf. axiom of revealed preferences) and of McFadden (cf. discrete choice models for transportation), we model the population's innate preferences as a Mixture of MultiNomial Logit (MMNL) model. Under this model, the recommendation problem boils down to (a) learning the MMNL model from population data, (b) finding an MNL component within the mixture that closely represents the revealed preferences of the consumer at hand, and (c) recommending other choices to her/him that are ranked high according to the component thus found. In this work, we address the problem of learning the MMNL model from partial preferences. We identify fundamental limitations of any algorithm to learn such a model, as well as conditions under which a simple, data-driven (non-parametric) algorithm learns the model effectively. The proposed algorithm has a pleasant similarity to standard collaborative filtering for scalar (or star) ratings, but in the domain of permutations. This work advances the state of the art in learning distributions over permutations (cf. [2]) as well as in learning mixture distributions (cf. [4]).
ABSTRACT: We study the optimal scaling of the expected total queue size in an $n\times
n$ input-queued switch, as a function of the number of ports $n$ and the load
factor $\rho$, which has been conjectured to be $\Theta (n/(1-\rho))$. In a
recent work, the validity of this conjecture has been established for the
regime where $1-\rho = O(1/n^2)$. In this paper, we make further progress in
the direction of this conjecture. We provide a new class of scheduling policies
under which the expected total queue size scales as
$O(n^{1.5}(1-\rho)^{-1}\log(1/(1-\rho)))$ when $1-\rho = O(1/n)$. This is an
improvement over the state of the art; for example, for $\rho = 1 - 1/n$ the
best known bound was $O(n^3)$, while ours is $O(n^{2.5}\log n)$.
ABSTRACT: Computing the stationary distribution of a large finite or countably infinite
state space Markov Chain has become central to many problems such as
statistical inference and network analysis. Standard methods involve large
matrix multiplications as in power iteration, or simulations of long random
walks, as in Markov Chain Monte Carlo (MCMC). For both methods, the convergence
rate is difficult to determine for general Markov chains. Power iteration is
costly, as it is global and involves computation at every state. In this paper,
we provide a novel local algorithm that answers whether a chosen state in a
Markov chain has stationary probability larger than some $\Delta \in (0,1)$,
and outputs an estimate of the stationary probability for itself and other
nearby states. Our algorithm runs in constant time with respect to the Markov
chain, using information from a local neighborhood of the state on the graph
induced by the Markov chain, which has constant size relative to the state
space. The multiplicative error of the estimate is upper bounded by a function
of the mixing properties of the Markov chain. Simulation results show Markov
chains for which this method gives tight estimates.
Advances in neural information processing systems 12/2013;
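The local flavor of the algorithm above can be illustrated with a simple Monte Carlo sketch: by the renewal property, $\pi(i) = 1/\mathbb{E}[\text{return time to } i]$, so sampling return-time walks from $i$, truncated at a horizon $\theta$, yields a local estimate whose bias is controlled by the truncation. This is a sketch of the idea, not the paper's exact procedure:

```python
import random

def estimate_stationary(next_state, i, num_walks=1000, theta=1000, rng=None):
    """Estimate pi(i) as 1 / E[return time to i], truncating walks at theta.
    next_state(s, rng) samples one Markov-chain transition from state s."""
    rng = rng or random.Random(0)
    total = 0
    for _ in range(num_walks):
        s, t = next_state(i, rng), 1
        while s != i and t < theta:
            s, t = next_state(s, rng), t + 1
        total += t
    # truncated walks shorten the observed return time, biasing the
    # estimate upward; the paper bounds this via mixing properties
    return num_walks / total
```

Only states visited by the walks are ever touched, which is why the cost is constant in the size of the state space.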
ABSTRACT: This paper presents a novel meta algorithm, Partition-Merge (PM), which takes
existing centralized algorithms for graph computation and makes them
distributed and faster. In a nutshell, PM divides the graph into small
subgraphs using our novel randomized partitioning scheme, runs the centralized
algorithm on each partition separately, and then stitches the resulting
solutions to produce a global solution. We demonstrate the efficiency of the PM
algorithm on two popular problems: computation of Maximum A Posteriori (MAP)
assignment in an arbitrary pairwise Markov Random Field (MRF), and modularity
optimization for community detection. We show that the resulting distributed
algorithms for these problems essentially run in time linear in the number of
nodes in the graph, and perform as well -- or even better -- than the original
centralized algorithm as long as the graph has geometric structures. Here we
say a graph has geometric structures, or polynomial growth property, when the
number of nodes within distance $r$ of any given node grows no faster than a
polynomial function of $r$. More precisely, if the centralized algorithm is a
$C$-factor approximation with constant $C \ge 1$, the resulting distributed
algorithm is a $(C+\delta)$-factor approximation for any small $\delta>0$; but if
the centralized algorithm is a non-constant (e.g. logarithmic) factor
approximation, then the resulting distributed algorithm becomes a constant
factor approximation. For general graphs, we compute explicit bounds on the
loss of performance of the resulting distributed algorithm with respect to the
centralized algorithm.
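In sketch form, PM has three steps: grow balls around randomly ordered centers to partition the nodes, run the centralized routine on each part, and union the per-part outputs. A minimal illustration (the ball-growing partition below, with its randomized radius, is a simplification of the paper's randomized partitioning scheme):

```python
import random
from collections import deque

def partition_merge(nodes, adj, solve, radius=2, seed=0):
    """Partition-Merge sketch: randomly partition the graph into balls,
    solve each part with the centralized routine `solve`, and stitch
    the per-part assignments into one global solution."""
    rng = random.Random(seed)
    order = list(nodes)
    rng.shuffle(order)
    part = {}                        # node -> center of its ball
    for c in order:
        if c in part:
            continue
        r = rng.randint(1, radius)   # randomized cut radius
        part[c] = c
        frontier = deque([(c, 0)])
        while frontier:              # BFS out to distance r
            u, d = frontier.popleft()
            if d == r:
                continue
            for v in adj[u]:
                if v not in part:
                    part[v] = c
                    frontier.append((v, d + 1))
    solution = {}
    for c in set(part.values()):     # solve each part independently
        block = [u for u in nodes if part[u] == c]
        solution.update(solve(block, adj))
    return solution
```

Because each part is solved independently, the per-part calls can run on separate machines, which is where the distributed speedup comes from.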
ABSTRACT: This brief presents a technique to evaluate the timing variation of static random access memory (SRAM). Specifically, a method called loop flattening, which reduces the evaluation of the timing statistics in the complex highly structured circuit to that of a single chain of component circuits, is justified. Then, to very quickly evaluate the timing delay of a single chain, a statistical method based on importance sampling augmented with targeted high-dimensional spherical sampling can be employed. The overall methodology has shown 650× or greater speedup over the nominal Monte Carlo approach with 10.5% accuracy in probability. Examples based on both the large-signal and small-signal SRAM read path are discussed, and a detailed comparison with state-of-the-art accelerated statistical simulation techniques is given.
IEEE Transactions on Very Large Scale Integration (VLSI) Systems 08/2013; 21(8):1558-1562. DOI:10.1109/TVLSI.2012.2212254 · 1.36 Impact Factor
ABSTRACT: Crowdsourcing systems like Amazon's Mechanical Turk have emerged as an effective large-scale human-powered platform for performing tasks in domains such as image classification, data entry, recommendation, and proofreading. Since workers are low-paid (a few cents per task) and the tasks performed are monotonous, the answers obtained are noisy and hence unreliable. To obtain reliable estimates, it is essential to utilize appropriate inference algorithms (e.g. majority voting) coupled with structured redundancy through task assignment. Our goal is to obtain the best possible trade-off between reliability and redundancy. In this paper, we consider a general probabilistic model for noisy observations in crowdsourcing systems and pose the problem of minimizing the total price (i.e. redundancy) that must be paid to achieve a target overall reliability. Concretely, we show that it is possible to obtain the correct answer to each task with probability 1-ε as long as the redundancy per task is O((K/q) log (K/ε)), where each task can take any of $K$ distinct answers equally likely, and q is a crowd-quality parameter defined through the probabilistic model. Further, this is effectively the best possible redundancy-accuracy trade-off any system design can achieve. Such a single-parameter crisp characterization of the (order-)optimal trade-off between redundancy and reliability has various useful operational consequences. Further, we analyze the robustness of our approach in the presence of adversarial workers and provide a bound on their influence on the redundancy-accuracy trade-off.
Unlike recent prior work [GKM11, KOS11, KOS11], our result applies to non-binary (i.e. K>2) tasks. In effect, we utilize algorithms for binary tasks (with an inhomogeneous error model, unlike that in [GKM11, KOS11, KOS11]) as a key subroutine to obtain answers for K-ary tasks. Technically, the algorithm is based on a low-rank approximation of the weighted adjacency matrix of a random regular bipartite graph, weighted according to the answers provided by the workers.
Proceedings of the ACM SIGMETRICS/international conference on Measurement and modeling of computer systems; 06/2013
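Two of the moving parts above are easy to make concrete: plurality voting over redundant answers, and the order-level redundancy bound O((K/q) log(K/ε)) from the abstract. A sketch (the bound suppresses constants, so `redundancy_needed` below is an order-of-magnitude illustration, not a guarantee):

```python
import math
from collections import Counter

def majority_vote(answers):
    """Combine redundant worker answers for one K-ary task by plurality."""
    (winner, _), = Counter(answers).most_common(1)
    return winner

def redundancy_needed(K, q, eps):
    """Order-level redundancy per task from the abstract: (K/q) * log(K/eps).
    Constants are suppressed in the O(.), so treat this as illustrative."""
    return math.ceil((K / q) * math.log(K / eps))
```

The point of the paper is that simple redundancy plus the right inference rule already achieves the optimal order, and no system design can do better than this trade-off.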
ABSTRACT: We study a binary classification problem whereby an infinite time series
having one of two labels ("event" or "non-event") streams in, and we want to
predict the label of the time series. Intuitively, the longer we wait, the more
of the time series we see and so the more accurate our prediction could be.
Conversely, making a prediction too early could result in a grossly inaccurate
prediction. In numerous applications, such as predicting an imminent market
crash or revealing which topics will go viral in a social network, making an
accurate prediction as early as possible is highly valuable. Motivated by these
applications, we propose a generative model for time series which we call a
latent source model and which we use for non-parametric online time series
classification. Our main assumption is that there are only a few ways in which
a time series corresponds to an "event", such as a market crashing or a Twitter
topic going viral, and that we have access to training data that are noisy
versions of these few distinct modes. Our model naturally leads to weighted
majority voting as a classification rule, which operates without knowing or
learning what the few latent sources are. We establish theoretical performance
guarantees of weighted majority voting under the latent source model and then
use the voting to predict which news topics on Twitter will go viral to become
trends.
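The classification rule has a short direct implementation: each training series votes for its label with a weight that decays exponentially in its distance to the observed prefix of the stream. The squared-Euclidean distance and the scaling parameter `gamma` below are our illustrative choices:

```python
import numpy as np

def weighted_majority_vote(x, train_series, train_labels, gamma=1.0):
    """Classify a partially observed time series x (length t) by letting
    each training series vote for its label with weight
    exp(-gamma * squared distance over the first t samples)."""
    t = len(x)
    votes = {}
    for s, label in zip(train_series, train_labels):
        d = float(np.sum((np.asarray(s[:t]) - np.asarray(x)) ** 2))
        votes[label] = votes.get(label, 0.0) + np.exp(-gamma * d)
    return max(votes, key=votes.get)
```

Note the rule never identifies the latent sources themselves: the noisy training series stand in for them, which is exactly the "operates without knowing or learning the sources" property claimed above.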
ABSTRACT: Spinal codes are a recently proposed class of capacity-achieving rateless codes. While hardware encoding of spinal codes is straightforward, the design of an efficient, high-speed hardware decoder poses significant challenges. We present the first such decoder. By relaxing data dependencies inherent in the classic M-algorithm decoder, we obtain area and throughput competitive with 3GPP turbo codes as well as greatly reduced latency and complexity. The enabling architectural feature is a novel alpha-beta incremental approximate selection algorithm. We also present a method for obtaining hints which anticipate successful or failed decoding, permitting early termination and/or feedback-driven adaptation of the decoding parameters.
We have validated our implementation in FPGA with on-air testing. Provisional hardware synthesis suggests that a near-capacity implementation of spinal codes can achieve a throughput of 12.5 Mbps in a 65 nm technology while using substantially less area than competitive 3GPP turbo code implementations.
Proceedings of the eighth ACM/IEEE symposium on Architectures for networking and communications systems; 10/2012
ABSTRACT: Crowdsourcing systems, in which tasks are electronically distributed to numerous "information piece-workers", have emerged as an effective paradigm for human-powered solving of large scale problems in domains such as image classification, data entry, optical character recognition, recommendation, and proofreading. Because these low-paid workers can be unreliable, nearly all crowdsourcers must devise schemes to increase confidence in their answers, typically by assigning each task multiple times and combining the answers in some way such as majority voting. In this paper, we consider a general model of such crowdsourcing tasks, and pose the problem of minimizing the total price (i.e., number of task assignments) that must be paid to achieve a target overall reliability. We give a new algorithm for deciding which tasks to assign to which workers and for inferring correct answers from the workers' answers. We show that our algorithm significantly outperforms majority voting and, in fact, is asymptotically optimal through comparison to an oracle that knows the reliability of every worker.
ABSTRACT: Spinal codes are a new class of rateless codes that enable wireless networks to cope with time-varying channel conditions in a natural way, without requiring any explicit bit rate selection. The key idea in the code is the sequential application of a pseudo-random hash function to the message bits to produce a sequence of coded symbols for transmission. This encoding ensures that two input messages that differ in even one bit lead to very different coded sequences after the point at which they differ, providing good resilience to noise and bit errors. To decode spinal codes, this paper develops an approximate maximum-likelihood decoder, called the bubble decoder, which runs in time polynomial in the message size and achieves the Shannon capacity over both additive white Gaussian noise (AWGN) and binary symmetric channel (BSC) models. Experimental results obtained from a software implementation of a linear-time decoder show that spinal codes achieve higher throughput than fixed-rate LDPC codes, rateless Raptor codes, and the layered rateless coding approach of Strider, across a range of channel conditions and message sizes. An early hardware prototype that can decode at 10 Mbits/s in FPGA demonstrates that spinal codes are a practical construction.
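The sequential-hash encoding described above is simple to sketch: the message is split into k-bit chunks, each chunk is hashed together with the previous "spine" state, and coded symbols are derived from each spine value. Here SHA-256 stands in for the code's pseudo-random hash function and the symbol mapping is simplified to 16-bit slices of the digest; both are assumptions of this sketch, not the paper's construction:

```python
import hashlib
import struct

def spinal_encode(message_bits, k=4, symbols_per_pass=1):
    """Sketch of spinal encoding: hash k-bit chunks sequentially into spine
    states, then derive coded symbols from each state."""
    chunks = [message_bits[i:i + k] for i in range(0, len(message_bits), k)]
    state = b"\x00" * 32
    symbols = []
    for chunk in chunks:
        m = bytes([int("".join(map(str, chunk)), 2)])
        state = hashlib.sha256(state + m).digest()  # next spine value
        # map the spine value to coded symbols (here: 16-bit integers)
        for j in range(symbols_per_pass):
            symbols.append(struct.unpack_from(">H", state, 2 * j)[0])
    return symbols
```

Because each spine value depends on all preceding chunks, two messages that agree on a prefix emit identical symbols up to the differing chunk and (with high probability) unrelated symbols afterwards, which is the resilience property the abstract describes.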