ABSTRACT: Single Index Models (SIMs) are simple yet flexible semi-parametric models for
classification and regression. Response variables are modeled as a nonlinear,
monotonic function of a linear combination of features. Estimation in this
context requires learning both the feature weights and the nonlinear function.
While methods have been described to learn SIMs in the low dimensional regime,
a method that can efficiently learn SIMs in high dimensions has not been
forthcoming. We propose three variants of a computationally and statistically
efficient algorithm for SIM inference in high dimensions. We establish excess
risk bounds for the proposed algorithms and experimentally validate the
advantages that our SIM learning methods provide relative to Generalized Linear
Model (GLM) and low dimensional SIM based learning methods.
ABSTRACT: This paper investigates the problem of active learning for binary label
prediction on a graph. We introduce a simple and label-efficient algorithm
called S2 for this task. At each step, S2 selects the vertex to be labeled
based on the structure of the graph and all previously gathered labels.
Specifically, S2 queries for the label of the vertex that bisects the *shortest
shortest* path between any pair of oppositely labeled vertices. We present a
theoretical estimate of the number of queries S2 needs in terms of a novel
parametrization of the complexity of binary functions on graphs. We also
present experimental results demonstrating the performance of S2 on both real
and synthetic data. While other graph-based active learning algorithms have
shown promise in practice, our algorithm is the first with both good
performance and theoretical guarantees. Finally, we demonstrate the
implications of the S2 algorithm to the theory of nonparametric active
learning. In particular, we show that S2 achieves near minimax optimal excess
risk for an important class of nonparametric classification problems.
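The bisection step described in this abstract can be illustrated with a minimal Python sketch. It assumes the graph is an adjacency dictionary and the labels are +1/-1; the names `shortest_path` and `s2_query` are hypothetical and not taken from the paper. The sketch covers only the choice of the next vertex to query. Dropping edges between oppositely labeled neighbors, and returning `None` when no connecting path remains, are simplifying assumptions about how the surrounding loop would behave.

```python
from collections import deque


def shortest_path(adj, source, targets):
    """Breadth-first search from source; return the shortest path
    (as a list of vertices) to any vertex in targets, or None."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u in targets:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None


def s2_query(adj, labels):
    """Return the midpoint of the shortest among all shortest paths that
    connect oppositely labeled vertices, or None if no such path exists."""
    # Assumption: edges whose endpoints carry opposite labels are already
    # identified as cut edges, so they are ignored when searching for paths.
    pruned = {
        u: [v for v in nbrs
            if not (u in labels and v in labels and labels[u] != labels[v])]
        for u, nbrs in adj.items()
    }
    positives = [v for v, y in labels.items() if y > 0]
    negatives = {v for v, y in labels.items() if y < 0}
    best = None
    for p in positives:
        path = shortest_path(pruned, p, negatives)
        if path is not None and (best is None or len(path) < len(best)):
            best = path
    if best is None:
        return None
    return best[len(best) // 2]


# Path graph 0 - 1 - 2 - 3 - 4 with the two endpoints labeled.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(s2_query(adj, {0: 1, 4: -1}))
```

On the five-vertex path graph, successive queries behave like binary search for the cut edge, which is the intuition behind the label efficiency claimed in the abstract.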
ABSTRACT: Low-rank matrix completion (LRMC) problems arise in a wide variety of
applications. Previous theory mainly provides conditions for completion under
missing-at-random samplings. An incomplete $d \times N$ matrix is
$\textit{finitely completable}$ if there are at most finitely many rank-$r$
matrices that agree with all its observed entries. Finite completability is the
tipping point in LRMC, as a few additional samples of a finitely completable
matrix guarantee its $\textit{unique}$ completability. The main contribution of
this paper is a full characterization of finitely completable observation sets.
We use this characterization to derive sufficient deterministic sampling
conditions for unique completability. We also show that under uniform random
sampling schemes, these conditions are satisfied with high probability if at
least $\mathscr{O}(\max\{r,\log d \})$ entries per column are observed.
ABSTRACT: This paper considers the problem of recovering an unknown sparse $p \times p$ matrix $X$ from an $m \times m$ matrix $Y = AXB^T$, where $A$ and $B$ are known $m \times p$ matrices with $m \ll p$. The main result shows that there exist constructions of the sketching matrices $A$ and $B$ so that even if $X$ has $O(p)$ nonzeros, it can be recovered exactly and efficiently using a convex program as long as these nonzeros are not concentrated in any single row/column of $X$. Furthermore, it suffices for the size of $Y$ (the sketch dimension) to scale as $m = O(\sqrt{\#\text{ nonzeros in } X} \times \log p)$. The results also show that the recovery is robust and stable in the sense that if $X$ is equal to a sparse matrix plus a perturbation, then the convex program we propose produces an approximation with accuracy proportional to the size of the perturbation. Unlike traditional results on sparse recovery, where the sensing matrix produces independent measurements, our sensing operator is highly constrained (it assumes a tensor product structure). Therefore, proving recovery guarantees requires nonstandard techniques. Indeed, our approach relies on a novel result concerning tensor products of bipartite graphs, which may be of independent interest. This problem is motivated by the following application, among others. Consider a $p \times n$ data matrix $D$, consisting of $n$ observations of $p$ variables. Assume that the correlation matrix $X := DD^T$ is (approximately) sparse in the sense that each of the $p$ variables is significantly correlated with only a few others. Our results show that these significant correlations can be detected even if we have access to only a sketch of the data $S = AD$ with $A \in \mathbb{R}^{m \times p}$.
IEEE Transactions on Information Theory 03/2015; 61(3):1-1. DOI:10.1109/TIT.2015.2391251 · 2.33 Impact Factor
ABSTRACT: The dueling bandit problem is a variation of the classical multi-armed bandit
in which the allowable actions are noisy comparisons between pairs of arms.
This paper focuses on a new approach for finding the "best" arm according to
the Borda criterion using noisy comparisons. We prove that in the absence of
structural assumptions, the sample complexity of this problem is proportional
to the sum of the inverse squared gaps between the Borda scores of each
suboptimal arm and the best arm. We explore this dependence further and
consider structural constraints on the pairwise comparison matrix (a particular
form of sparsity natural to this problem) that can significantly reduce the
sample complexity. This motivates a new algorithm called Successive Elimination
with Comparison Sparsity (SECS) that exploits sparsity to find the Borda winner
using fewer samples than standard algorithms. We also evaluate the new
algorithm experimentally with synthetic and real data. The results show that
the sparsity model and the new algorithm can provide significant improvements
over standard approaches.
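The quantity named in this abstract, the sum of inverse squared gaps between Borda scores, can be made concrete with a short Python sketch. It assumes a known pairwise preference matrix `P`, where `P[i][j]` is the probability that arm `i` beats arm `j`, and takes the Borda score of an arm to be its average win probability against the other arms. The function names are hypothetical, and the sketch assumes a unique Borda winner. It computes the complexity measure only and is not the SECS algorithm.

```python
def borda_scores(P):
    """Borda score of arm i: average probability of beating another arm."""
    n = len(P)
    return [sum(P[i][j] for j in range(n) if j != i) / (n - 1)
            for i in range(n)]


def inverse_squared_gap_sum(P):
    """Sum over suboptimal arms of 1 / (best score - arm score)^2.
    Assumes the Borda winner is unique, so every gap is positive."""
    scores = borda_scores(P)
    best = max(scores)
    winner = scores.index(best)
    return sum(1.0 / (best - s) ** 2
               for i, s in enumerate(scores) if i != winner)


# Hypothetical three-arm example; arm 0 is the Borda winner.
P = [
    [0.5, 0.6, 0.7],
    [0.4, 0.5, 0.6],
    [0.3, 0.4, 0.5],
]
print(borda_scores(P))
print(inverse_squared_gap_sum(P))
```

Small gaps dominate the sum, so arms whose Borda scores are close to the winner's are the ones that drive the number of comparisons needed.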
ABSTRACT: Consider a generic $r$-dimensional subspace of $\mathbb{R}^d$, $r<d$, and
suppose that we are only given projections of this subspace onto small subsets
of the canonical coordinates. The paper establishes necessary and sufficient
deterministic conditions on the subsets for subspace identifiability.
ABSTRACT: This paper studies ordered weighted L1 (OWL) norm regularization for sparse
estimation problems with strongly correlated variables. We prove sufficient
conditions for clustering based on the correlation/collinearity of variables
using the OWL norm, of which the so-called OSCAR is a particular case. Our
results extend previous ones for OSCAR in several ways: for the squared error
loss, our conditions hold for the more general OWL norm and under weaker
assumptions; we also establish clustering conditions for the absolute error
loss, which is, as far as we know, a novel result. Furthermore, we characterize
the statistical performance of OWL norm regularization for generative models in
which certain clusters of regression variables are strongly (even perfectly)
correlated, but variables in different clusters are uncorrelated. We show that
if the true p-dimensional signal generating the data involves only s of the
clusters, then O(s log p) samples suffice to accurately estimate the signal,
regardless of the number of coefficients within the clusters. The estimation of
s-sparse signals with completely independent variables requires just as many
measurements. In other words, using the OWL we pay no price (in terms of the
number of measurements) for the presence of strongly correlated variables.
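For readers unfamiliar with the OWL norm, the following Python sketch evaluates it directly: the magnitudes of the coefficients are sorted in decreasing order and paired with a nonincreasing, nonnegative weight vector. The function names are hypothetical. The OSCAR weights use the standard linear form `lam1 + lam2 * (p - i)`, which is stated here as background rather than taken from the abstract.

```python
def owl_norm(x, w):
    """Ordered weighted L1 norm: sum_i w[i] * |x|_(i), where |x|_(1) >= |x|_(2) >= ...
    The weights w must be nonnegative and nonincreasing."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    if any(w[i] < w[i + 1] for i in range(len(w) - 1)) or (w and w[-1] < 0):
        raise ValueError("weights must be nonnegative and nonincreasing")
    magnitudes = sorted((abs(v) for v in x), reverse=True)
    return sum(wi * mi for wi, mi in zip(w, magnitudes))


def oscar_weights(p, lam1, lam2):
    """Weights for which the OWL norm reduces to the OSCAR penalty
    lam1 * ||x||_1 + lam2 * sum_{i<j} max(|x_i|, |x_j|)."""
    return [lam1 + lam2 * (p - i) for i in range(1, p + 1)]


x = [1.0, -3.0, 2.0]
print(owl_norm(x, [3.0, 2.0, 1.0]))
print(owl_norm(x, oscar_weights(3, 1.0, 0.5)))
```

With all weights equal the OWL norm is a scaled L1 norm, and with only the first weight nonzero it is a scaled L-infinity norm, which is one way to see why it can both sparsify and cluster coefficients.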
ABSTRACT: We give deterministic necessary and sufficient conditions to guarantee that
if a subspace fits certain partially observed data from a union of subspaces,
it is because such data really lies in a subspace.
Furthermore, we give deterministic necessary and sufficient conditions to
guarantee that if a subspace fits certain partially observed data, such
subspace is unique.
We do this by characterizing when and only when a set of incomplete vectors
behaves as a single but complete one.
ABSTRACT: We consider the problem of estimating the evolutionary history of a set of species (phylogeny or species tree) from several genes. It has been known, however, that the evolutionary history of individual genes (gene trees) might be topologically distinct from each other and from the underlying species tree, possibly confounding phylogenetic analysis. A further complication in practice is that one has to estimate gene trees from molecular sequences of finite length. We provide the first full data-requirement analysis of a species tree reconstruction method that takes into account estimation errors at the gene level. Under that criterion, we also devise a novel algorithm that provably improves over all previous methods in a regime of interest.
2014 IEEE International Symposium on Information Theory (ISIT); 06/2014
ABSTRACT: We consider the problem of estimating the evolutionary history of a set of
species (phylogeny or species tree) from several genes. It is known that the
evolutionary history of individual genes (gene trees) might be topologically
distinct from each other and from the underlying species tree, possibly
confounding phylogenetic analysis. A further complication in practice is that
one has to estimate gene trees from molecular sequences of finite length. We
provide the first full data-requirement analysis of a species tree
reconstruction method that takes into account estimation errors at the gene
level. Under that criterion, we also devise a novel reconstruction algorithm
that provably improves over all previous methods in a regime of interest.
IEEE/ACM Transactions on Computational Biology and Bioinformatics 04/2014; 12(2). DOI:10.1109/TCBB.2014.2361685 · 1.44 Impact Factor
ABSTRACT: This paper studies graphical model selection, i.e., the problem of estimating
a graph of statistical relationships among a collection of random variables.
Conventional graphical model selection algorithms are passive, i.e., they
require all the measurements to have been collected before processing begins.
We propose an active learning algorithm that uses junction tree representations
to adapt future measurements based on the information gathered from prior
measurements. We prove that, under certain conditions, our active learning
algorithm requires fewer scalar measurements than any passive algorithm to
reliably estimate a graph. A range of numerical results validates our theory and
demonstrates the benefits of active learning.
ABSTRACT: This paper is concerned with identifying the arm with the highest mean in a multi-armed bandit problem using as few independent samples from the arms as possible. While the so-called “best arm problem” dates back to the 1950s, only recently were two qualitatively different algorithms proposed that achieve the optimal sample complexity for the problem. This paper reviews these recent advances and shows that most best-arm algorithms can be described as variants of the two recent optimal algorithms. For each algorithm type we consider a specific instance to analyze both theoretically and empirically, thereby exposing the core components of the theoretical analysis of these algorithms and intuition about how the algorithms work in practice. The derived sample complexity bounds are novel, and in certain cases improve upon previous bounds. In addition, we compare a variety of state-of-the-art algorithms empirically through simulations for the best-arm problem.
2014 48th Annual Conference on Information Sciences and Systems (CISS); 03/2014
ABSTRACT: Binary logistic regression with a sparsity constraint on the solution plays a
vital role in many high dimensional machine learning applications. In some
cases, the features can be grouped together, so that entire subsets of features
can be selected or zeroed out. In many applications, however, this can be very
restrictive. In this paper, we are interested in a less restrictive form of
structured sparse feature selection: we assume that while features can be
grouped according to some notion of similarity, not all features in a group
need be selected for the task at hand. This is sometimes referred to as a
"sparse group" lasso procedure, and it allows for more flexibility than
traditional group lasso methods. Our framework generalizes conventional sparse
group lasso further by allowing for overlapping groups, an additional
flexibility that presents further challenges. The main contribution of this
paper is a new procedure called Sparse Overlapping Sets (SOS) lasso, a convex
optimization program that automatically selects similar features for learning
in high dimensions. We establish consistency results for the SOSlasso for
classification problems using the logistic regression setting, which
specializes to results for the lasso and the group lasso, some known and some
new. In particular, SOSlasso is motivated by multi-subject fMRI studies in
which functional activity is classified using brain voxels as features, source
localization problems in magnetoencephalography (MEG), and analyzing gene
activation patterns in microarray data analysis. Experiments with real and
synthetic data demonstrate the advantages of SOSlasso compared to the lasso and
group lasso.
ABSTRACT: Second-harmonic generation (SHG) imaging can help reveal interactions between collagen fibers and cancer cells. Quantitative analysis of SHG images of collagen fibers is challenged by the heterogeneity of collagen structures and low signal-to-noise ratio often found while imaging collagen in tissue. The role of collagen in breast cancer progression can be assessed post-acquisition via enhanced computation. To facilitate this, we have implemented and evaluated four algorithms for extracting fiber information, such as number, length, and curvature, from a variety of SHG images of collagen in breast tissue. The image-processing algorithms included a Gaussian filter, SPIRAL-TV filter, Tubeness filter, and curvelet-denoising filter. Fibers are then extracted using an automated tracking algorithm called fiber extraction (FIRE). We evaluated the algorithm performance by comparing length, angle and position of the automatically extracted fibers with those of manually extracted fibers in twenty-five SHG images of breast cancer. We found that the curvelet-denoising filter followed by FIRE, a process we call CT-FIRE, outperforms the other algorithms under investigation. CT-FIRE was then successfully applied to track collagen fiber shape changes over time in an in vivo mouse model for breast cancer.
ABSTRACT: The paper proposes a novel upper confidence bound (UCB) procedure for
identifying the arm with the largest mean in a multi-armed bandit game in the
fixed confidence setting using a small number of total samples. The procedure
cannot be improved in the sense that the number of samples required to identify
the best arm is within a constant factor of a lower bound based on the law of
the iterated logarithm (LIL). Inspired by the LIL, we construct our confidence
bounds to explicitly account for the infinite time horizon of the algorithm. In
addition, by using a novel stopping time for the algorithm we avoid a union
bound over the arms that has been observed in other UCB-type algorithms. We
prove that the algorithm is optimal up to constants and also show through
simulations that it provides superior performance with respect to the
state-of-the-art.
ABSTRACT: Multitask learning can be effective when features useful in one task are also
useful for other tasks, and the group lasso is a standard method for selecting
a common subset of features. In this paper, we are interested in a less
restrictive form of multitask learning, wherein (1) the available features can
be organized into subsets according to a notion of similarity and (2) features
useful in one task are similar, but not necessarily identical, to the features
best suited for other tasks. The main contribution of this paper is a new
procedure called Sparse Overlapping Sets (SOS) lasso, a convex optimization program
that automatically selects similar features for related learning tasks. Error
bounds are derived for SOSlasso and its consistency is established for squared
error loss. In particular, SOSlasso is motivated by multi-subject fMRI studies
in which functional activity is classified using brain voxels as features.
Experiments with real and synthetic data demonstrate the advantages of SOSlasso
compared to the lasso and group lasso.
Advances in neural information processing systems 11/2013;
ABSTRACT: This paper studies the sample complexity of searching over multiple populations. We consider a large number of populations, each corresponding to either distribution P0 or P1. The goal of the search problem studied here is to find one population corresponding to distribution P1 with as few samples as possible. The main contribution is to quantify the number of samples needed to correctly find one such population. We consider two general approaches: nonadaptive sampling methods, which sample each population a predetermined number of times until a population following P1 is found, and adaptive sampling methods, which employ sequential sampling schemes for each population. We first derive a lower bound on the number of samples required by any sampling scheme. We then consider an adaptive procedure consisting of a series of sequential probability ratio tests, and show it comes within a constant factor of the lower bound. We give explicit expressions for this constant when samples of the populations follow Gaussian and Bernoulli distributions. An alternative adaptive scheme is discussed which does not require full knowledge of P1, and comes within a constant factor of the optimal scheme. For comparison, a lower bound on the sampling requirements of any nonadaptive scheme is presented.
IEEE Transactions on Information Theory 08/2013; 59(8):5039-5050. DOI:10.1109/TIT.2013.2258071 · 2.33 Impact Factor
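The adaptive procedure mentioned in the abstract above, a series of sequential probability ratio tests, can be sketched in Python for the Gaussian case. This is an illustration under stated assumptions rather than the paper's exact procedure: P0 and P1 are taken to be Gaussians with known means `mu0` and `mu1` and a common standard deviation, the stopping thresholds are Wald's classical approximations for target error rates `alpha` and `beta`, and the function names are hypothetical.

```python
import math


def gaussian_sprt(samples, mu0, mu1, sigma=1.0, alpha=0.05, beta=0.05):
    """Sequential probability ratio test of N(mu0, sigma^2) against
    N(mu1, sigma^2). Returns (decision, number of samples consumed)."""
    upper = math.log((1 - beta) / alpha)   # accept P1 at or above this
    lower = math.log(beta / (1 - alpha))   # accept P0 at or below this
    llr = 0.0
    n = 0
    for n, x in enumerate(samples, start=1):
        # Log-likelihood ratio increment for one Gaussian observation.
        llr += (mu1 - mu0) * (x - 0.5 * (mu0 + mu1)) / sigma ** 2
        if llr >= upper:
            return "P1", n
        if llr <= lower:
            return "P0", n
    return "undecided", n


def search_populations(populations, mu0, mu1, **kwargs):
    """Test populations one at a time and stop at the first declared P1.
    Returns (index of that population or None, total samples consumed)."""
    total = 0
    for idx, samples in enumerate(populations):
        decision, used = gaussian_sprt(samples, mu0, mu1, **kwargs)
        total += used
        if decision == "P1":
            return idx, total
    return None, total


# Deterministic toy data: the first population looks like P0, the second like P1.
populations = [[0.0] * 10, [1.0] * 10]
print(search_populations(populations, mu0=0.0, mu1=1.0))
```

The appeal of the sequential test in this setting is that populations that clearly follow P0 are discarded after few samples, so most of the sampling budget is spent on the promising ones.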
ABSTRACT: This paper proposes a simple adaptive sensing and group testing algorithm for
sparse signal recovery. The algorithm, termed Compressive Adaptive Sense and
Search (CASS), is shown to be near-optimal in that it succeeds at the lowest
possible signal-to-noise-ratio (SNR) levels. Like traditional compressed
sensing based on random non-adaptive design matrices, the CASS algorithm
requires only k log n measurements to recover a k-sparse signal of dimension n.
However, CASS succeeds at SNR levels that are a factor log n less than required
by standard compressed sensing. From the point of view of constructing and
implementing the sensing operation as well as computing the reconstruction, the
proposed algorithm is substantially less computationally intensive than
standard compressed sensing. CASS is also demonstrated to perform considerably
better in practice through simulation. To the best of our knowledge, this is
the first demonstration of an adaptive compressed sensing algorithm with
near-optimal theoretical guarantees and excellent practical performance. This
paper also shows that methods like compressed sensing, group testing, and
pooling have an advantage beyond simply reducing the number of measurements or
tests -- adaptive versions of such methods can also improve detection and
estimation performance when compared to non-adaptive direct (uncompressed)
sensing.
IEEE Transactions on Information Theory 06/2013; 60(7). DOI:10.1109/TIT.2014.2321552 · 2.33 Impact Factor
ABSTRACT: Sampling from distributions to find the one with the largest mean arises in a
broad range of applications, and it can be mathematically modeled as a
multi-armed bandit problem in which each distribution is associated with an
arm. This paper studies the sample complexity of identifying the best arm
(largest mean) in a multi-armed bandit problem. Motivated by large-scale
applications, we are especially interested in identifying situations where the
total number of samples that is necessary and sufficient to find the best arm
scales linearly with the number of arms. We present a single-parameter
multi-armed bandit model that spans the range from linear to superlinear sample
complexity. We also give a new algorithm for best arm identification, called
PRISM, with linear sample complexity for a wide range of mean distributions.
The algorithm, like most exploration procedures for multi-armed bandits, is
adaptive in the sense that the next arms to sample are selected based on
previous samples. We compare the sample complexity of adaptive procedures with
simpler non-adaptive procedures using new lower bounds. For many problem
instances, the increased sample complexity required by non-adaptive procedures
is a polynomial factor of the number of arms.
ABSTRACT: An undirected graphical model is a joint probability distribution defined on
an undirected graph $G^*$, where the vertices in the graph index a collection
of random variables and the edges encode conditional independence relationships
amongst random variables. The undirected graphical model selection (UGMS)
problem is to estimate the graph $G^*$ given observations drawn from the
undirected graphical model. This paper proposes a framework for decomposing the
UGMS problem into multiple subproblems over clusters and subsets of the
separators in a junction tree. The junction tree is constructed using a graph
that contains a superset of the edges in $G^*$. We highlight three main
properties of using junction trees for UGMS. First, different regularization
parameters or different UGMS algorithms can be used to learn different parts of
the graph. This is possible since the subproblems we identify can be solved
independently of each other. Second, under certain conditions, a junction tree
based UGMS algorithm can produce consistent results with exponentially fewer
observations than the usual requirements of existing algorithms. Third, both
our theoretical and experimental results show that the junction tree framework
does a significantly better job at finding the weakest edges in a graph than
existing methods. This property is a consequence of both the first and second
properties. Finally, we note that our framework is independent of the choice of
the UGMS algorithm and can be used as a wrapper around standard UGMS algorithms
for more accurate graph estimation.
Journal of Machine Learning Research 04/2013; 15. · 2.47 Impact Factor