About
272 Publications · 39,375 Reads · 14,566 Citations
Publications (272)
We introduce an adaptive method with formal quality guarantees for weak supervision in a non-stationary setting. Our goal is to infer the unknown labels of a sequence of data by using weak supervision sources that provide independent noisy signals of the correct classification for each data point. This setting includes crowdsourcing and programmati...
We develop and analyze a general technique for learning with an unknown distribution drift. Given a sequence of independent observations from the last $T$ steps of a drifting distribution, our algorithm agnostically learns a family of functions with respect to the current distribution at time $T$. Unlike previous work, our technique does not requir...
We study nonparametric density estimation in non-stationary drift settings. Given a sequence of independent samples taken from a distribution that gradually changes in time, the goal is to compute the best estimate for the current distribution. We prove tight minimax risk bounds for both discrete and continuous smooth densities, where the minimum i...
The sets of hyperlinks in web pages, relationship ties in social networks, or sets of recommendations in recommender systems, have a major impact on the diversity of content accessed by the user in a browsing session. Bias induced by the graph structure may trap a reader in a polarized bubble with no access to other opinions. It is widely accepted...
We develop a rigorous mathematical analysis of zero-shot learning with attributes. In this setting, the goal is to label novel classes with no training data, only detectors for attributes and a description of how those attributes are correlated with the target classes, called the class-attribute matrix. We develop the first non-trivial lower bound...
We present a novel method for reducing the computational complexity of rigorously estimating the partition functions (normalizing constants) of Gibbs (Boltzmann) distributions, which arise ubiquitously in probabilistic graphical models. A major obstacle to practical applications of Gibbs distributions is the need to estimate their partition functio...
We introduce Tiered Sampling, a novel technique for estimating the count of sparse motifs in massive graphs whose edges are observed in a stream. Our technique requires only a single pass on the data and uses a memory of fixed size M, which can be orders of magnitude smaller than the number of edges. Our methods address the challenging task of counting sp...
The topology of the hyperlink graph among pages expressing different opinions may influence the exposure of readers to diverse content. Structural bias may trap a reader in a polarized bubble with no access to other opinions. We model readers' behavior as random walks. A node is in a polarized bubble if the expected length of a random walk from it...
© 2019 IEEE. Recently, there have been several proposals to develop visual recommendation systems. The most advanced systems aim to recommend visualizations, which help users to find new correlations or identify an interesting deviation based on the current context of the user's analysis. However, when recommending a visualization to a user, there is...
© 2019 Association for Computing Machinery. Statistical knowledge and domain expertise are key to extracting actionable insights out of data, yet such skills rarely coexist. In Machine Learning, high-quality results are only attainable via mindful data preprocessing, hyperparameter tuning and model selection. Domain experts are often overwhelme...
The Markov-Chain Monte-Carlo (MCMC) method has been used widely in the literature for various applications, in particular estimating the expectation $\mathbb{E}_{\pi}[f]$ of a function $f:\Omega\to [a,b]$ over a distribution $\pi$ on $\Omega$ (a.k.a. mean-estimation), to within $\varepsilon$ additive error (w.h.p.). Letting $R \doteq b-a$, standard...
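As a concrete illustration of the mean-estimation task described above, here is a minimal sketch (not the paper's algorithm or its sample-complexity bounds): a Metropolis chain over a small discrete Ω, with an illustrative target π and a function f into [0, 1], so that R = b - a = 1. All names and parameters below are illustrative choices.

```python
import random

# Target distribution pi on Omega = {0, ..., n-1}; unnormalized weights
# suffice for Metropolis. f maps states into [a, b] = [0, 1].
weights = [1.0, 2.0, 4.0, 2.0, 1.0]           # unnormalized pi
f = lambda x: x / (len(weights) - 1)           # f: Omega -> [0, 1], so R = 1

def metropolis_step(x):
    # Propose a uniform neighbor on the path graph; accept with the
    # Metropolis rule min(1, pi(y) / pi(x)).
    y = x + random.choice([-1, 1])
    if y < 0 or y >= len(weights):
        return x
    if random.random() < min(1.0, weights[y] / weights[x]):
        return y
    return x

def mcmc_mean(burn_in=10_000, samples=100_000):
    x = 0
    for _ in range(burn_in):                   # let the chain approach pi
        x = metropolis_step(x)
    total = 0.0
    for _ in range(samples):                   # average f over the chain
        x = metropolis_step(x)
        total += f(x)
    return total / samples

print(mcmc_mean())   # estimate of E_pi[f]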
We present an algorithm for approximating the diameter of massive weighted undirected graphs on distributed platforms supporting a MapReduce-like abstraction. In order to be efficient in terms of both time and space, our algorithm is based on a decomposition strategy which partitions the graph into disjoint clusters of bounded radius. Theoretically...
The most important feature of Wikipedia is the presence of hyperlinks in pages. Link placement is the product of people's collaboration; consequently, Wikipedia naturally inherits human bias. Because link placement strongly influences users' navigation sessions, one needs to verify that, given a controversial topic, the hyperlinks' net...
Enabling interactive visualization over new datasets at “human speed” is key to democratizing data science and maximizing human productivity. In this work, we first argue why existing analytics infrastructures do not support interactive data exploration and outline the challenges and opportunities of building a system specifically designed for inte...
While standard statistical inference techniques and machine learning generalization bounds assume that tests are run on data selected independently of the hypotheses, practical data analysis and machine learning are usually iterative and adaptive processes where the same holdout data is often used for testing a sequence of hypotheses (or models), w...
Statistical knowledge and domain expertise are key to extracting actionable insights out of data, yet such skills rarely coexist. In Machine Learning, high-quality results are only attainable via mindful data preprocessing, hyperparameter tuning and model selection. Domain experts are often overwhelmed by such complexity, de-facto inhibiting...
We tackle a fundamental problem in empirical game-theoretic analysis (EGTA), that of learning equilibria of simulation-based games. Such games cannot be described in analytical form; instead, a black-box simulator can be queried to obtain noisy samples of utilities. Our approach to EGTA is in the spirit of probably approximately correct learning. W...
Problem
We study the problem of identifying differentially mutated subnetworks of a large gene–gene interaction network, that is, subnetworks that display a significant difference in mutation frequency in two sets of cancer samples. We formally define the associated computational problem and show that the problem is NP-hard.
Algorithm
We propose a...
We frame the problem of selecting an optimal audio encoding scheme as a supervised learning task. Through uniform convergence theory, we guarantee approximately optimal codec selection while controlling for selection bias. We present rigorous statistical guarantees for the codec selection problem that hold for arbitrary distributions over audio seq...
Visual representations of data (visualizations) are tools of great importance and widespread use in data analytics, as they give users visual insight into patterns in the observed data in a simple and effective way. However, since visualization tools are applied to sample data, there is a risk of visualizing random fluctuations in the sample rat...
Over the past decades, researchers and ML practitioners have come up with better and better ways to build, understand and improve the quality of ML models, but mostly under the key assumption that the training data is distributed identically to the testing data. In many real-world applications, however, some potential training examples are unknown...
ΑΒΡΑΞΑΣ (ABRAXAS): Gnostic word of mystic meaning.
We present ABRA, a suite of algorithms to compute and maintain probabilistically guaranteed high-quality approximations of the betweenness centrality of all nodes (or edges) on both static and fully dynamic graphs. Our algorithms use progressive random sampling and their analysis relies on Rademach...
Democratizing Data Science requires a fundamental rethinking of the way data analytics and model discovery is done. Available tools for analyzing massive data sets and curating machine learning models are limited in a number of fundamental ways. First, existing tools require well-trained data scientists to select the appropriate techniques to build...
Covering the edges of a bipartite graph by a minimum set of complete bipartite graphs (bicliques) is a basic graph theoretic problem, with numerous applications. In particular, it is used to characterize parsimonious models of a set of observations (each biclique corresponds to a factor or feature that relates the observations in the two sets of no...
We introduce Tiered Sampling, a novel technique for approximately counting sparse motifs in massive graphs whose edges are observed in a stream. Our technique requires only a single pass on the data and uses a memory of fixed size $M$, which can be orders of magnitude smaller than the number of edges. Our methods address the challenging task of counting spa...
Social networks are important communication and information media. Individuals in a social network share information and influence each other through their social connections. Understanding social influence and information diffusion is a fundamental research endeavor and it has important applications in online social advertising and viral marketing...
Exploring data via visualization has become a popular way to understand complex data. Features or patterns in visualization can be perceived as relevant insights by users, even though they may actually arise from random noise. Moreover, interactive data exploration and visualization recommendation tools can examine a large number of observations, a...
Recent tools for interactive data exploration significantly increase the chance that users make false discoveries. They allow users to (visually) examine many hypotheses and make inference with simple interactions, and thus incur the issue commonly known in statistics as the "multiple hypothesis testing error." In this work, we propose a solution t...
Recent tools for interactive data exploration significantly increase the chance that users make false discoveries. The crux is that these tools implicitly allow the user to test a large body of different hypotheses with just a few clicks, thus incurring the issue commonly known in statistics as the multiple hypothesis testing error. In this paper...
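For context on the multiple hypothesis testing error mentioned in the two abstracts above, here is the classic Benjamini-Hochberg procedure for controlling the false discovery rate. This is a textbook baseline, not the interactive control these papers propose; the function name and example p-values are illustrative.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of hypotheses rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest k with p_(k) <= (k/m) * alpha, then reject the
    # k hypotheses with the smallest p-values.
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k = rank
    return sorted(order[:k])

# Example: ten tests, two genuinely small p-values.
pvals = [0.001, 0.009, 0.04, 0.06, 0.2, 0.35, 0.5, 0.62, 0.8, 0.97]
print(benjamini_hochberg(pvals, alpha=0.05))   # -> [0, 1] with these inputs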
Betweenness centrality is a fundamental centrality measure in social network analysis. Given a large-scale network, how can we find the most central nodes? This question is of key importance to numerous important applications that rely on betweenness centrality, including community detection and understanding graph vulnerability. Despite the large...
Betweenness centrality (BWC) is a fundamental centrality measure in social network analysis. Given a large-scale network, how can we find the most central nodes? This question is of great importance to many key applications that rely on BWC, including community detection and understanding graph vulnerability. Despite the large amount of work on sca...
We present ABRA, a suite of algorithms to compute and maintain probabilistically-guaranteed, high-quality approximations of the betweenness centrality of all nodes (or edges) on both static and fully dynamic graphs. Our algorithms use progressive random sampling and their analysis relies on Rademacher averages and pseudodimension, fundamental concep...
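ABRA's contribution is the adaptive, Rademacher-average-based stopping rule; as a simplified fixed-sample illustration of the underlying sampling idea, the sketch below estimates betweenness by repeatedly sampling a uniformly random shortest path between a random node pair. Function names and the sample size r are illustrative, not ABRA's.

```python
import random
from collections import defaultdict, deque

def sample_shortest_path(adj, s, t):
    """BFS from s counting shortest paths, then backtrack from t, picking
    each predecessor with probability proportional to its path count.
    Returns the internal nodes of a uniform random shortest s-t path."""
    dist, sigma, preds = {s: 0}, defaultdict(int), defaultdict(list)
    sigma[s] = 1
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
                preds[v].append(u)
    if t not in dist:
        return []                        # s and t are disconnected
    path, v = [], t
    while v != s:
        u = random.choices(preds[v], weights=[sigma[p] for p in preds[v]])[0]
        if u != s:
            path.append(u)
        v = u
    return path

def betweenness_estimate(adj, r=10_000):
    nodes = list(adj)
    bc = defaultdict(float)
    for _ in range(r):
        s, t = random.sample(nodes, 2)
        for v in sample_shortest_path(adj, s, t):
            bc[v] += 1.0 / r             # fraction of sampled paths through v
    return bc

# Tiny example: path graph 0-1-2-3; nodes 1 and 2 lie on many shortest paths.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(betweenness_estimate(adj, r=5_000))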
We present TRIÈST, a suite of one-pass streaming algorithms to compute unbiased, low-variance, high-quality approximations of the global and local (i.e., incident to each vertex) number of triangles in a fully-dynamic graph represented as an adversarial stream of edge insertions and deletions. Our algorithms use reservoir sampling and its variants...
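A compact sketch of the insertion-only core idea, in the spirit of the improved variant: reservoir-sample M edges and, when an edge arrives, count the triangles it closes in the sample with an unbiasing weight. Deletions and per-vertex local counts are omitted, and the function name is illustrative.

```python
import random

def triest_impr(stream, M=1000):
    """One-pass global triangle-count estimate on an insertion-only edge
    stream: keep a reservoir of M edges; each arriving edge (u, v) closes
    one triangle per common neighbor of u and v in the sample, weighted by
    the inverse probability that both sampled edges survived."""
    sample, adj = set(), {}
    estimate, t = 0.0, 0
    for u, v in stream:
        t += 1
        # Unbiasing weight eta(t) = max(1, (t-1)(t-2) / (M(M-1))).
        eta = max(1.0, (t - 1) * (t - 2) / (M * (M - 1)))
        common = adj.get(u, set()) & adj.get(v, set())
        estimate += eta * len(common)
        # Standard reservoir sampling on edges.
        if t <= M:
            keep = True
        elif random.random() < M / t:
            keep = True
            x, y = random.choice(tuple(sample))   # evict a random edge
            sample.discard((x, y))
            adj[x].discard(y)
            adj[y].discard(x)
        else:
            keep = False
        if keep:
            sample.add((u, v))
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    return estimate

# Example: edge stream of a clique on 50 nodes (exactly 19,600 triangles).
edges = [(i, j) for i in range(50) for j in range(i + 1, 50)]
random.shuffle(edges)
print(triest_impr(edges, M=300))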
We formulate and study a fundamental search and detection problem, Schedule Optimization, motivated by a variety of real-world applications, ranging from monitoring content changes on the web, social networks, and user activities to detecting failure on large systems with many individual machines. We consider a large system consisting of many nodes,...
Given a dataset of points in a metric space and an integer $k$, a diversity maximization problem requires determining a subset of $k$ points maximizing some diversity objective measure, e.g., the minimum or the average distance between a pair of points in the subset. Diversity maximization problems are computationally hard, hence only approximate s...
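For the max-min objective mentioned above, the classic sequential baseline is Gonzalez-style farthest-point greedy, a 2-approximation; below is a minimal sketch assuming Euclidean points. The paper's contribution concerns general diversity measures and approximate solutions at scale, not this baseline.

```python
import math

def greedy_max_min_diversity(points, k):
    """Gonzalez-style greedy: start from an arbitrary point, then repeatedly
    add the point farthest from the current selection. A classic
    2-approximation for the max-min pairwise-distance objective."""
    selected = [points[0]]
    while len(selected) < k:
        far = max(points, key=lambda p: min(math.dist(p, s) for s in selected))
        selected.append(far)
    return selected

pts = [(0, 0), (1, 0), (0, 1), (5, 5), (5, 4), (9, 0)]
print(greedy_max_min_diversity(pts, 3))   # -> [(0, 0), (9, 0), (5, 5)]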
Load balancing is a well-studied problem, with balls-in-bins being the primary framework. The greedy algorithm $\mathsf{Greedy}[d]$ of Azar et al. places each ball by probing $d > 1$ random bins and placing the ball in the least loaded of them. It ensures a maximum load that is exponentially better than the strategy of placing each ball uniformly a...
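A quick simulation makes the "power of d choices" phenomenon in this abstract tangible: throwing n balls into n bins, Greedy[1] (uniform placement) gives a maximum load of about ln n / ln ln n, while Greedy[2] drops it to about ln ln n / ln 2, exponentially smaller.

```python
import random

def max_load(n, d):
    """Throw n balls into n bins; each ball probes d random bins and goes
    to the least loaded among them (Greedy[d] of Azar et al.)."""
    bins = [0] * n
    for _ in range(n):
        probes = [random.randrange(n) for _ in range(d)]
        target = min(probes, key=lambda i: bins[i])
        bins[target] += 1
    return max(bins)

n = 100_000
print("d=1:", max_load(n, 1))   # roughly ln n / ln ln n
print("d=2:", max_load(n, 2))   # roughly ln ln n / ln 2, much smaller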
We present TRIÈST, a suite of one-pass streaming algorithms to compute unbiased, low-variance, high-quality approximations of the global and local (i.e., incident to each vertex) number of triangles in a fully-dynamic graph represented as an adversarial stream of edge insertions and deletions. Our algorithms use reservoir sampling and its variant...
We study the problem of learning probabilistic models for permutations, where the order between highly ranked items in the observed permutations is more reliable (i.e., consistent in different rankings) than the order between lower ranked items, a typical phenomenon observed in many applications such as web search results and product ranking. We int...
We present ABRA, a suite of algorithms that compute and maintain probabilistically-guaranteed, high-quality approximations of the betweenness centrality of all nodes (or edges) on both static and fully dynamic graphs. Our algorithms rely on random sampling and their analysis leverages Rademacher averages and pseudodimension, fundamental concept...
Detecting new information and events in a dynamic network by probing individual nodes has many practical applications: discovering new webpages, analyzing influence properties in networks, and detecting failure propagation in electronic circuits or infections in public drinking water systems. In practice, it is infeasible for anyone but the owner o...
Advances in DNA sequencing technologies have enabled large cancer sequencing studies, collecting somatic mutation data from a large number of cancer patients. One of the main goals of these studies is the identification of all cancer genes, i.e., genes associated with cancer. Its achievement is complicated by the extensive mutational heterogeneity of canc...
We formulate and study a fundamental search and detection problem, Schedule Optimization, motivated by a variety of real-world applications, ranging from monitoring content changes on the web, social networks, and user activities to detecting failure on large systems with many individual machines. We consider a large system consisting of many nodes,...
We present an algorithm to extract a high-quality approximation of the (top-k) Frequent Itemsets (FIs) from random samples of a transactional dataset. With high probability the approximation is a superset of the FIs, and no itemset with frequency much lower than the threshold is included in it. The algorithm employs progressive sampling, with a st...
Rademacher Averages and the Vapnik-Chervonenkis dimension are fundamental concepts from statistical learning theory. They make it possible to study simultaneous deviation bounds of empirical averages from their expectations for classes of functions, by considering properties of the functions, of their domain (the dataset), and of the sampling process. In this...
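As a small worked example of the machinery this abstract refers to, the following Monte-Carlo computation estimates the empirical Rademacher average E_σ[sup_{f∈F} (1/n) Σ_i σ_i f(x_i)] for an illustrative finite class of threshold functions on [0, 1]; the class, data, and names are all assumptions for the sketch.

```python
import random

def empirical_rademacher(functions, sample, trials=2_000):
    """Monte-Carlo estimate of the empirical Rademacher average:
    E_sigma[ sup_f (1/n) * sum_i sigma_i * f(x_i) ] over random signs."""
    n = len(sample)
    total = 0.0
    for _ in range(trials):
        sigma = [random.choice((-1, 1)) for _ in range(n)]
        total += max(sum(s * f(x) for s, x in zip(sigma, sample)) / n
                     for f in functions)
    return total / trials

# Illustrative class: 0/1 threshold functions f_t(x) = 1[x >= t] on [0, 1].
thresholds = [i / 10 for i in range(11)]
functions = [lambda x, t=t: 1.0 if x >= t else 0.0 for t in thresholds]
sample = [random.random() for _ in range(50)]
print(empirical_rademacher(functions, sample))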
We develop a novel parallel decomposition strategy for unweighted, undirected graphs, based on growing disjoint connected clusters from batches of centers progressively selected from yet uncovered nodes. With respect to similar previous decompositions, our strategy exercises a tighter control on both the number of clusters and their maximum radius....
We present a space/time-efficient practical parallel algorithm for approximating the diameter of massive graphs on distributed platforms supporting a MapReduce-like abstraction. The core of the algorithm is a weighted graph decomposition strategy generating disjoint clusters of bounded weighted radius. The number of parallel rounds executed by the...
It is increasingly accepted that energy savings can be achieved by trading the accuracy of a computing system for energy gains - quite often significantly. This approach is referred to as inexact or approximate computing. Given that a significant portion of the energy in a modern general purpose processor is spent on moving data to and from storage...
We formulate and study the Probabilistic Hitting Set Paradigm (PHSP), a general framework for design and analysis of search and detection algorithms in large scale dynamic networks. The PHSP captures applications ranging from monitoring new contents on the web, blogosphere, and Twitterverse, to analyzing influence properties in social networks, and...
Motivated by the need for robust and fast distributed computation in highly dynamic Peer-to-Peer (P2P) networks, we study algorithms for the fundamental distributed agreement problem. P2P networks are highly dynamic networks that experience heavy node churn (i.e., nodes join and leave the network continuously over time). Our goal is to design fast...
We present the first parallel (MapReduce) algorithm to approximate the diameter of large graphs through graph decomposition which requires a number of parallel rounds that is sub-linear in the diameter and total space linear in the graph size. The quality of the diameter approximation is expressed in terms of the doubling dimension of the graph and...
We present a simple, efficient, and secure data-oblivious randomized shuffle algorithm. This is the first secure data-oblivious shuffle that is not based on sorting. Our method can be used to improve previous oblivious storage solutions for network-based outsourcing of data.
In a multi-armed bandit problem, an online algorithm chooses from a set of strategies in a sequence of trials so as to maximize the total payoff of the chosen strategies. While the performance of bandit algorithms with a small finite strategy set is quite well understood, bandit problems with large strategy sets are still a topic of very active inv...
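For readers unfamiliar with the setting, here is the classic UCB1 index policy for a small finite strategy set, a baseline sketch only (the abstract concerns the much harder large-strategy-set regime). Bernoulli arms and all names are illustrative assumptions.

```python
import math
import random

def ucb1(arm_means, horizon=10_000):
    """Classic UCB1: play each arm once, then always pull the arm with the
    highest empirical mean plus confidence radius sqrt(2 ln t / n_i)."""
    k = len(arm_means)
    pulls, sums = [0] * k, [0.0] * k

    def pull(i):
        reward = 1.0 if random.random() < arm_means[i] else 0.0  # Bernoulli
        pulls[i] += 1
        sums[i] += reward
        return reward

    total = sum(pull(i) for i in range(k))        # initialization round
    for t in range(k + 1, horizon + 1):
        i = max(range(k), key=lambda j: sums[j] / pulls[j]
                + math.sqrt(2 * math.log(t) / pulls[j]))
        total += pull(i)
    return total / horizon                        # average reward per trial

print(ucb1([0.2, 0.5, 0.7]))   # approaches 0.7 as the horizon grows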
A key challenge in genomics is to identify genetic variants that distinguish patients with different survival time following diagnosis or treatment. While the log-rank test is widely used for this purpose, nearly all implementations of the log-rank test rely on an asymptotic approximation that is not appropriate in many genomics applications. This...
Cancer is a disease that is driven by somatic mutations that accumulate in the genome during an individual's lifetime. Recent advances in DNA sequencing technology are enabling genome-wide measurements of these mutations in large cohorts of cancer patients. A major challenge in analyzing this data is to distinguish functional, “driver” mutations tha...
We study robust and efficient distributed algorithms for searching, storing, and maintaining data in dynamic Peer-to-Peer (P2P) networks. P2P networks are highly dynamic networks that experience heavy node churn (i.e., nodes join and leave the network continuously over time). Our goal is to guarantee, despite high node churn rate, that a large numb...
Motivation. Next-generation DNA sequencing technologies now enable the measurement of exomes, genomes, and mRNA expression in many samples. The next challenge is to interpret these large quantities of DNA and RNA sequence data. In many human and cancer genomics studies, a major goal is to discover associations between an observed phenotype and a pa...
In this work we focus on efficient heuristics for solving a class of stochastic planning problems that arise in a variety of business, investment, and industrial applications. The problem is best described in terms of future buy and sell contracts. By buying less reliable, but less expensive, buy (supply) contracts, a company or a trader can cover...
Frequent Itemsets and Association Rules Mining (FIM) is a key task in knowledge discovery from data. As the dataset grows, the cost of solving this task is dominated by the component that depends on the number of transactions in the dataset. We address this issue by proposing PARMA, a parallel algorithm for the MapReduce framework, which scales wel...
Background
Cancer sequencing projects are now measuring somatic mutations in large numbers of cancer genomes. A key challenge in interpreting these data is to distinguish driver mutations, mutations important for cancer development, from passenger mutations that have accumulated in somatic cells but without functional consequences. A common approac...
Over the last decade, PageRank has gained importance in a wide range of applications and domains, ever since it first proved to be effective in determining node importance in large graphs (and was a pioneering idea behind Google’s search engine). In distributed computing alone, PageRank vectors, or more generally random walk based quantities have b...
One of the most important features of the Web graph and social networks is that they are constantly evolving. The classical computational paradigm, which assumes a fixed data set as an input to an algorithm that terminates, is inadequate for such settings. In this paper we study the problem of computing PageRank on an evolving graph. We propose an...
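The static primitive that the evolving-graph algorithm above has to track is ordinary PageRank; here is a minimal power-iteration sketch, with the damping factor 0.85, the dangling-node handling, and the toy graph all being illustrative assumptions rather than the paper's setting.

```python
def pagerank(out_links, damping=0.85, iters=100):
    """Power iteration for PageRank on a directed graph given as
    {node: [out-neighbors]}; dangling nodes redistribute uniformly."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        for u in nodes:
            targets = out_links[u] or nodes      # dangling -> all nodes
            share = damping * rank[u] / len(targets)
            for v in targets:
                new[v] += share
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(graph))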
As advances in technology allow for the collection, storage, and analysis of vast amounts of data, the task of screening and assessing the significance of discovered patterns is becoming a major challenge in data mining applications. In this work, we address significance in the context of frequent itemset mining. Specifically, we develop a novel me...
A major goal of cancer sequencing projects is to identify genetic alterations that determine clinical phenotypes, such as survival time or drug response. Somatic mutations in cancer are typically very diverse, and are found in different sets of genes in different patients. This mutational heterogeneity complicates the discovery of associations betw...
Accurate query performance prediction (QPP) is central to effective resource management, query optimization and query scheduling. Analytical cost models, used in current generation of query optimizers, have been successful in comparing the costs of alternative query plans, but they are poor predictors of execution latency. As a more promising appro...
Two proposed algorithms predict which combinations of mutations in cancer genomes are priorities for experimental study. One relies on interaction network data to identify recurrently mutated sets of genes, while the other searches for groups of mutations that exhibit specific combinatorial properties.
We apply our algorithms to several cancer types including glioblastoma multiforme (GBM), lung adenocarcinoma, and ovarian carcinoma (OV). HotNet identifies significant subnetworks that are part of well-known cancer pathways as well as novel subnetworks. Among the most significant subnetworks identified in OV data is the Notch signaling pathway, and...
Motivated by applications that concern graphs that are evolving and massive in nature, we define a new general framework for computing with such graphs. In our framework, the graph changes over time and an algorithm can only track these changes by explicitly probing the graph. This framework captures the inherent tradeoff between the complexity...
The tasks of extracting (top-$K$) Frequent Itemsets (FI's) and Association Rules (AR's) are fundamental primitives in data mining and database applications. Exact algorithms for these problems exist and are widely used, but their running time is hindered by the need of scanning the entire dataset, possibly multiple times. High quality approximation...
This work explores fundamental modeling and algorithmic issues arising in the well-established MapReduce framework. First, we formally specify a computational model for MapReduce which captures the functional flavor of the paradigm by allowing for a flexible use of parallelism. Indeed, the model diverges from a traditional processor-centric view by...
Cancer sequencing projects are now measuring somatic mutations in large numbers of cancer genomes. A key challenge in interpreting these data is to distinguish driver mutations, mutations important for cancer development, from passenger mutations that have accumulated in somatic cells but without functional consequences. A common approach to identi...
Motivated by the need for robust and fast distributed computation in highly dynamic Peer-to-Peer (P2P) networks, we study algorithms for the fundamental distributed agreement problem. P2P networks are highly dynamic networks that experience heavy node churn (i.e., nodes join and leave the network continuously over time). Our main contribution...
Current trends in data management systems, such as cloud and multi-tenant databases, are leading to data processing environments that concurrently execute heterogeneous query workloads. At the same time, these systems need to satisfy diverse performance expectations. In these newly-emerging settings, avoiding potential Quality-of-Service (QoS) viol...
Next-generation DNA sequencing technologies are enabling genome-wide measurements of somatic mutations in large numbers of cancer patients. A major challenge in the interpretation of these data is to distinguish functional "driver mutations" important for cancer development from random "passenger mutations." A common approach for identifying driver...
We formulate and study a new computational model for dynamic data. In this model, the data changes gradually and the goal of an algorithm is to compute the solution to some problem on the data at each time step, under the constraint that it only has limited access to the data each time. As the data is constantly changing and the algorithm might be...