# Journal of Machine Learning Research

Online ISSN: 1532-4435

Publications

Article

The ℓ1-penalized method, or the Lasso, has emerged as an important tool for the analysis of large data sets. Many important results have been obtained for the Lasso in linear regression which have led to a deeper understanding of high-dimensional statistical problems. In this article, we consider a class of weighted ℓ1-penalized estimators for convex loss functions of a general form, including the generalized linear models. We study the estimation, prediction, selection and sparsity properties of the weighted ℓ1-penalized estimator in sparse, high-dimensional settings where the number of predictors p can be much larger than the sample size n. Adaptive Lasso is considered as a special case. A multistage method is developed to approximate concave regularized estimation by applying an adaptive Lasso recursively. We provide prediction and estimation oracle inequalities for single- and multi-stage estimators, a general selection consistency theorem, and an upper bound for the dimension of the Lasso estimator. Important models including the linear regression, logistic regression and log-linear models are used throughout to illustrate the applications of the general results.
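As an illustration of the adaptive (weighted ℓ1) Lasso special case discussed above, here is a minimal NumPy sketch: a plain coordinate-descent Lasso followed by a second stage whose penalty weights are the inverse first-stage coefficients. The function names and the toy solver are our own, not the authors' method, and no claims about its theoretical properties are intended.

```python
import numpy as np

def lasso_cd(X, y, lam, w=None, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * sum_j w_j |b_j|."""
    n, p = X.shape
    if w is None:
        w = np.ones(p)
    b = np.zeros(p)
    r = y.copy()                          # residual y - X b
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]           # remove coordinate j's contribution
            rho = X[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - lam * w[j], 0.0) / col_sq[j]
            r -= X[:, j] * b[j]
    return b

def adaptive_lasso(X, y, lam, gamma=1.0):
    """Stage 1: ordinary Lasso; stage 2: weights w_j = 1/|b_j|^gamma."""
    b0 = lasso_cd(X, y, lam)
    w = 1.0 / np.maximum(np.abs(b0), 1e-8) ** gamma
    return lasso_cd(X, y, lam, w=w)
```

On sparse linear data the second stage typically zeroes out the small spurious coefficients retained by the first stage while leaving the large ones almost unpenalized.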

…

Article

We use convex relaxation techniques to provide a sequence of regularized low-rank solutions for large-scale matrix completion problems. Using the nuclear norm as a regularizer, we provide a simple and very efficient convex algorithm for minimizing the reconstruction error subject to a bound on the nuclear norm. Our algorithm Soft-Impute iteratively replaces the missing elements with those obtained from a soft-thresholded SVD. With warm starts this allows us to efficiently compute an entire regularization path of solutions on a grid of values of the regularization parameter. The computationally intensive part of our algorithm is in computing a low-rank SVD of a dense matrix. Exploiting the problem structure, we show that the task can be performed with a complexity linear in the matrix dimensions. Our semidefinite-programming algorithm is readily scalable to large matrices: for example it can obtain a rank-80 approximation of a 10^6 × 10^6 incomplete matrix with 10^5 observed entries in 2.5 hours, and can fit a rank-40 approximation to the full Netflix training set in 6.6 hours. Our methods show very good performance in both training and test error when compared with other competitive state-of-the-art techniques.
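The core Soft-Impute iteration is compact enough to sketch directly. The following NumPy version is illustrative only; the paper's implementation additionally exploits sparse-plus-low-rank structure and warm starts for scalability.

```python
import numpy as np

def soft_impute(X, mask, lam, n_iter=200):
    """Iteratively fill missing entries with a soft-thresholded SVD.

    X    : matrix with arbitrary values where mask is False
    mask : True where observed, False where missing
    lam  : nuclear-norm regularization (singular-value threshold)
    """
    Z = np.where(mask, X, 0.0)            # start by zero-filling
    for _ in range(n_iter):
        filled = np.where(mask, X, Z)     # observed values + current imputation
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        s = np.maximum(s - lam, 0.0)      # soft-threshold the singular values
        Z = (U * s) @ Vt
    return Z
```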

…

Article

We describe a class of sparse latent factor models, called graphical factor models (GFMs), and relevant sparse learning algorithms for posterior mode estimation. Linear, Gaussian GFMs have sparse, orthogonal factor loadings matrices that, in addition to sparsity of the implied covariance matrices, also induce conditional independence structures via zeros in the implied precision matrices. We describe the models and their use for robust estimation of sparse latent factor structure and data/signal reconstruction. We develop computational algorithms for model exploration and posterior mode search, addressing the hard combinatorial optimization involved in the search over a huge space of potential sparse configurations. A mean-field variational technique coupled with annealing is developed to successively generate "artificial" posterior distributions that, at the limiting temperature in the annealing schedule, define required posterior modes in the GFM parameter space. Several detailed empirical studies and comparisons to related approaches are discussed, including analyses of handwritten digit image and cancer gene expression data.

…

Article

In this paper, we introduce PEBL, a Python library and application for learning Bayesian network structure from data and prior knowledge that provides features unmatched by alternative software packages: the ability to use interventional data, flexible specification of structural priors, modeling with hidden variables and exploitation of parallel processing. PEBL is released under the MIT open-source license, can be installed from the Python Package Index and is available at http://pebl-project.googlecode.com.

…

Article

Variable selection in high-dimensional space characterizes many contemporary problems in scientific discovery and decision making. Many frequently-used techniques are based on independence screening; examples include correlation ranking (Fan and Lv, 2008) or feature selection using a two-sample t-test in high-dimensional classification (Tibshirani et al., 2003). Within the context of the linear model, Fan and Lv (2008) showed that this simple correlation ranking possesses a sure independence screening property under certain conditions and that its revision, called iterative sure independence screening (ISIS), is needed when the features are marginally unrelated but jointly related to the response variable. In this paper, we extend ISIS, without explicit definition of residuals, to a general pseudo-likelihood framework, which includes generalized linear models as a special case. Even in the least-squares setting, the new method improves ISIS by allowing feature deletion in the iterative process. Our technique allows us to select important features in high-dimensional classification where the popularly used two-sample t-method fails. A new technique is introduced to reduce the false selection rate in the feature screening stage. Several simulated and two real data examples are presented to illustrate the methodology.

…

Article

We consider the problems of estimating the parameters as well as the structure of binary-valued Markov networks. For maximizing the penalized log-likelihood, we implement an approximate procedure based on the pseudo-likelihood of Besag (1975) and generalize it to a fast exact algorithm. The exact algorithm starts with the pseudo-likelihood solution and then adjusts the pseudo-likelihood criterion so that each additional iteration moves it closer to the exact solution. Our results show that this procedure is faster than the competing exact method proposed by Lee, Ganapathi, and Koller (2006a). However, we also find that the approximate pseudo-likelihood as well as the approaches of Wainwright et al. (2006), when implemented using the coordinate descent procedure of Friedman, Hastie, and Tibshirani (2008b), are much faster than the exact methods, and only slightly less accurate.
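A minimal sketch of the pseudo-likelihood idea for binary (±1) Markov networks: each node's conditional distribution is logistic in its neighbors, so the (unpenalized) pseudo-likelihood can be maximized by nodewise gradient ascent. This toy version, with names of our own choosing, omits the ℓ1 penalty and the exact-adjustment step studied in the paper.

```python
import numpy as np

def ising_pseudolikelihood(X, lr=0.1, n_iter=500):
    """Fit pairwise interactions Theta of a binary (+/-1) Markov network
    (no external field) by maximizing Besag's pseudo-likelihood:
    P(x_s | x_rest) = sigmoid(2 * x_s * sum_t Theta[s,t] * x_t).
    Plain gradient ascent, then symmetrization of Theta."""
    n, p = X.shape
    Theta = np.zeros((p, p))
    for _ in range(n_iter):
        M = X @ Theta.T                        # M[i,s] = sum_t Theta[s,t] x_it
        P = 1.0 / (1.0 + np.exp(-2 * X * M))   # P(x_is | rest of row i)
        G = ((1 - P) * 2 * X).T @ X / n        # gradient of mean log pseudo-lik.
        np.fill_diagonal(G, 0.0)               # no self-edges
        Theta += lr * G
        np.fill_diagonal(Theta, 0.0)
    return (Theta + Theta.T) / 2
```

On data from a chain-structured Ising model, the fitted matrix concentrates on the true edges and stays near zero elsewhere.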

…

Article

Chain graphs present a broad class of graphical models for description of conditional independence structures, including both Markov networks and Bayesian networks as special cases. In this paper, we propose a computationally feasible method for the structural learning of chain graphs based on the idea of decomposing the learning problem into a set of smaller scale problems on its decomposed subgraphs. The decomposition requires conditional independencies but does not require the separators to be complete subgraphs. Algorithms for both skeleton recovery and complex arrow orientation are presented. Simulations under a variety of settings demonstrate the competitive performance of our method, especially when the underlying graph is sparse.

…

Article

A non-parametric hierarchical Bayesian framework is developed for designing a classifier, based on a mixture of simple (linear) classifiers. Each simple classifier is termed a local "expert", and the number of experts and their construction are manifested via a Dirichlet process formulation. The simple form of the "experts" allows analytical handling of incomplete data. The model is extended to allow simultaneous design of classifiers on multiple data sets, termed multi-task learning, with this also performed non-parametrically via the Dirichlet process. Fast inference is performed using variational Bayesian (VB) analysis, and example results are presented for several data sets. We also perform inference via Gibbs sampling, to which we compare the VB results.

…

Article

Clustering analysis is widely used in many fields. Traditionally clustering is regarded as unsupervised learning for its lack of a class label or a quantitative response variable, which in contrast is present in supervised learning such as classification and regression. Here we formulate clustering as penalized regression with grouping pursuit. In addition to the novel use of a non-convex group penalty and its associated unique operating characteristics in the proposed clustering method, a main advantage of this formulation is that it allows borrowing well-established results from classification and regression, such as model selection criteria to select the number of clusters, a difficult problem in clustering analysis. In particular, we propose using the generalized cross-validation (GCV) based on generalized degrees of freedom (GDF) to select the number of clusters. We use a few simple numerical examples to compare our proposed method with some existing approaches, demonstrating our method's promising performance.

…

Article

We apply a variational method to automatically determine the number of mixtures of independent components in high-dimensional datasets, in which the sources may be nonsymmetrically distributed. The data are modeled by clusters where each cluster is described as a linear mixture of independent factors. The variational Bayesian method yields an accurate density model for the observed data without overfitting problems. This allows the dimensionality of the data to be identified for each cluster. The new method was successfully applied to a difficult real-world medical dataset for diagnosing glaucoma.

…

Article

We study the variability of predictions made by bagged learners and random forests, and show how to estimate standard errors for these methods. Our work builds on variance estimates for bagging proposed by Efron (1992, 2013) that are based on the jackknife and the infinitesimal jackknife (IJ). In practice, bagged predictors are computed using a finite number B of bootstrap replicates, and working with a large B can be computationally expensive. Direct applications of jackknife and IJ estimators to bagging require B = Θ(n^{1.5}) bootstrap replicates to converge, where n is the size of the training set. We propose improved versions that only require B = Θ(n) replicates. Moreover, we show that the IJ estimator requires 1.7 times fewer bootstrap replicates than the jackknife to achieve a given accuracy. Finally, we study the sampling distributions of the jackknife and IJ variance estimates themselves. We illustrate our findings with multiple experiments and simulation studies.
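Efron's infinitesimal-jackknife estimate for bagging has the simple form V_IJ = Σ_i Cov(N_i, t*)², where N_i counts how often observation i appears in a bootstrap sample and t* is that sample's prediction. A toy NumPy sketch with a 1-nearest-neighbor base learner (the base learner is our choice, purely illustrative, and no finite-B bias correction of the kind studied in the paper is applied):

```python
import numpy as np

def bagged_predict_with_ij(X, y, x0, B=1000, seed=0):
    """Bagged prediction at a query point x0 (1-D features) together with
    the infinitesimal-jackknife variance estimate
    V_IJ = sum_i Cov_b(N_bi, t_b)^2."""
    rng = np.random.default_rng(seed)
    n = len(y)
    N = np.zeros((B, n))                  # bootstrap count of each observation
    t = np.zeros(B)                       # prediction of each bootstrap learner
    for b in range(B):
        idx = rng.integers(0, n, size=n)
        N[b] = np.bincount(idx, minlength=n)
        j = idx[np.argmin(np.abs(X[idx] - x0))]   # 1-NN inside the resample
        t[b] = y[j]
    t_bar = t.mean()
    cov = (N - N.mean(axis=0)).T @ (t - t_bar) / B   # Cov(N_i, t) per obs.
    return t_bar, np.sum(cov ** 2)
```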

…

Article

We present a novel method for estimating tree-structured covariance matrices directly from observed continuous data. Specifically, we estimate a covariance matrix from observations of p continuous random variables encoding a stochastic process over a tree with p leaves. A representation of these classes of matrices as linear combinations of rank-one matrices indicating object partitions is used to formulate estimation as instances of well-studied numerical optimization problems. In particular, our estimates are based on projection, where the covariance estimate is the nearest tree-structured covariance matrix to an observed sample covariance matrix. The problem is posed as a linear or quadratic mixed-integer program (MIP) where a setting of the integer variables in the MIP specifies a set of tree topologies of the structured covariance matrix. We solve these problems to optimality using efficient and robust existing MIP solvers. We present a case study in phylogenetic analysis of gene expression and a simulation study comparing our method to distance-based tree-estimation procedures.

…

Article

In cognitive science and neuroscience, there have been two leading models describing how humans perceive and classify facial expressions of emotion: the continuous and the categorical model. The continuous model defines each facial expression of emotion as a feature vector in a face space. This model explains, for example, how expressions of emotion can be seen at different intensities. In contrast, the categorical model consists of C classifiers, each tuned to a specific emotion category. This model explains, among other findings, why the images in a morphing sequence between a happy and a surprise face are perceived as either happy or surprise but not something in between. While the continuous model has a more difficult time justifying this latter finding, the categorical model is not as good when it comes to explaining how expressions are recognized at different intensities or modes. Most importantly, both models have problems explaining how one can recognize combinations of emotion categories such as happily surprised versus angrily surprised versus surprise. To resolve these issues, in the past several years, we have worked on a revised model that justifies the results reported in the cognitive science and neuroscience literature. This model consists of C distinct continuous spaces. Multiple (compound) emotion categories can be recognized by linearly combining these C face spaces. The dimensions of these spaces are shown to be mostly configural. According to this model, the major task for the classification of facial expressions of emotion is precise, detailed detection of facial landmarks rather than recognition. We provide an overview of the literature justifying the model, show how the resulting model can be employed to build algorithms for the recognition of facial expression of emotion, and propose research directions for machine learning and computer vision researchers to keep pushing the state of the art in these areas.
We also discuss how the model can aid in studies of human perception, social interactions and disorders.

…

Article

Planning problems that involve learning a policy from a single training set of finite horizon trajectories arise in both social science and medical fields. We consider Q-learning with function approximation for this setting and derive an upper bound on the generalization error. This upper bound is in terms of quantities minimized by a Q-learning algorithm, the complexity of the approximation space and an approximation term due to the mismatch between Q-learning and the goal of learning a policy that maximizes the value function.

…

Article

We develop an R package fastclime for solving a family of regularized linear programming (LP) problems. Our package efficiently implements the parametric simplex algorithm, which provides a scalable and sophisticated tool for solving large-scale linear programs. As an illustrative example, one use of our LP solver is to implement an important sparse precision matrix estimation method called CLIME (Constrained ℓ1 Minimization Estimator). Compared with existing packages for this problem such as clime and flare, our package has three advantages: (1) it efficiently calculates the full piecewise-linear regularization path; (2) it provides an accurate dual certificate as a stopping criterion; (3) it is completely coded in C and is highly portable. This package is designed to be useful to statisticians and machine learning researchers for solving a wide range of problems.

…

Article

We present a general and detailed development of an algorithm for finite-horizon fitted-Q iteration with an arbitrary number of reward signals and linear value function approximation using an arbitrary number of state features. This includes a detailed treatment of the 3-reward function case using triangulation primitives from computational geometry and a method for identifying globally dominated actions. We also present an example of how our methods can be used to construct a real-world decision aid by considering symptom reduction, weight gain, and quality of life in sequential treatments for schizophrenia. Finally, we discuss future directions in which to take this work that will further enable our methods to make a positive impact on the field of evidence-based clinical decision support.

…

Article

Hard and soft classifiers are two important groups of techniques for classification problems. Logistic regression and Support Vector Machines are typical examples of soft and hard classifiers respectively. The essential difference between these two groups is whether one needs to estimate the class conditional probability for the classification task or not. In particular, soft classifiers predict the label based on the obtained class conditional probabilities, while hard classifiers bypass the estimation of probabilities and focus on the decision boundary. In practice, for the goal of accurate classification, it is unclear which one to use in a given situation. To tackle this problem, the Large-margin Unified Machine (LUM) was recently proposed as a unified family to embrace both groups. The LUM family enables one to study the behavior change from soft to hard binary classifiers. For multicategory cases, however, the concept of soft and hard classification becomes less clear. In that case, class probability estimation becomes more involved as it requires estimation of a probability vector. In this paper, we propose a new Multicategory LUM (MLUM) framework to investigate the behavior of soft versus hard classification under multicategory settings. Our theoretical and numerical results help to shed some light on the nature of multicategory classification and its transition behavior from soft to hard classifiers. The numerical results suggest that the proposed tuned MLUM yields very competitive performance.

…

Conference Paper

The problem of selecting a subset of relevant features in a potentially overwhelming quantity of data is classic and found in many branches of science: examples in computer vision, text processing and, more recently, bioinformatics are abundant. We present a definition of "relevancy" based on spectral properties of the Affinity (or Laplacian) of the features' measurement matrix. The feature selection process is then based on a continuous ranking of the features defined by a least-squares optimization process. A remarkable property of the feature relevance function is that sparse solutions for the ranking values naturally emerge as a result of a "biased nonnegativity" of a key matrix in the process. As a result, a simple least-squares optimization process converges onto a sparse solution, i.e., a selection of a subset of features which form a local maximum over the relevance function. The feature selection algorithm can be embedded in both unsupervised and supervised inference problems and empirical evidence shows that the feature selections typically achieve high accuracy even when only a small fraction of the features are relevant.

…

Conference Paper

In the context of independent components analysis (ICA), the
mutual information (MI) of the extracted components is one of the most
desirable measures of independence, due to its special properties. This
paper presents a method for performing linear and nonlinear ICA based on
MI, with few approximations. The use of MI as an objective function for
ICA requires the estimation of the statistical distributions of the
separated components. In this work, both the extraction of independent
components and the estimation of their distributions are performed
simultaneously, by a single network with a specialized structure,
trained with a single objective function.

…

Article

This paper comments on the published work dealing with robustness and
regularization of support vector machines (Journal of Machine Learning
Research, vol. 10, pp. 1485-1510, 2009) [arXiv:0803.3490] by H. Xu et al. They
proposed a theorem showing that it is possible to relate robustness in the
feature space and robustness in the sample space directly. In this paper, we
present a counterexample that refutes their theorem.

…

Article

Hierarchical statistical models are widely employed in information science
and data engineering. The models consist of two types of variables: observable
variables that represent the given data and latent variables for the
unobservable labels. An asymptotic analysis of the models plays an important
role in evaluating the learning process; the result of the analysis is applied
not only to theoretical but also to practical situations, such as optimal model
selection and active learning. There are many studies of generalization errors,
which measure the prediction accuracy of the observable variables. However, the
accuracy of estimating the latent variables has not yet been elucidated. For a
quantitative evaluation of this, the present paper formulates
distribution-based functions for the errors in the estimation of the latent
variables. The asymptotic behavior is analyzed for both the maximum likelihood
and the Bayes methods.

…

Article

We study the theoretical advantages of active learning over passive learning.
Specifically, we prove that, in noise-free classifier learning for VC classes,
any passive learning algorithm can be transformed into an active learning
algorithm with asymptotically strictly superior label complexity for all
nontrivial target functions and distributions. We further provide a general
characterization of the magnitudes of these improvements in terms of a novel
generalization of the disagreement coefficient. We also extend these results to
active learning in the presence of label noise, and find that even under broad
classes of noise distributions, we can typically guarantee strict improvements
over the known results for passive learning.

…

Article

Actor-Critic based approaches were among the first to address reinforcement learning in a general setting. Recently, these algorithms have gained renewed interest due to their generality, good convergence properties, and possible biological relevance. In this paper, we introduce an online temporal difference based actor-critic algorithm which is proved to converge to a neighborhood of a local maximum of the average reward. Linear function approximation is used by the critic in order to estimate the value function and the temporal difference signal, which is passed from the critic to the actor. The main distinguishing feature of the present convergence proof is that both the actor and the critic operate on a similar time scale, while in most current convergence proofs they are required to have very different time scales in order to converge. Moreover, the same temporal difference signal is used to update the parameters of both the actor and the critic. A limitation of the proposed approach, compared to results available for two time scale convergence, is that convergence is guaranteed only to a neighborhood of an optimal value, rather than to the optimal value itself. The single time scale and identical temporal difference signal used by the actor and the critic may provide a step towards constructing more biologically realistic models of reinforcement learning in the brain.

…

Article

The investigation of directed acyclic graphs (DAGs) encoding the same Markov
property, that is the same conditional independence relations of multivariate
observational distributions, has a long tradition; many algorithms exist for
model selection and structure learning in Markov equivalence classes. In this
paper, we extend the notion of Markov equivalence of DAGs to the case of
interventional distributions arising from multiple intervention experiments. We
show that under reasonable assumptions on the intervention experiments,
interventional Markov equivalence defines a finer partitioning of DAGs than
observational Markov equivalence and hence improves the identifiability of
causal models. We give a graph theoretic criterion for two DAGs being Markov
equivalent under interventions and show that each interventional Markov
equivalence class can, analogously to the observational case, be uniquely
represented by a chain graph called interventional essential graph (also known
as CPDAG in the observational case). These are key insights for deriving a
generalization of the Greedy Equivalence Search algorithm aimed at structure
learning from interventional data. This new algorithm is evaluated in a
simulation study.

…

Article

We consider the PC-algorithm of Spirtes et al. (2000) for estimating the skeleton of a very high-dimensional directed acyclic graph (DAG) with corresponding Gaussian distribution. The PC-algorithm is computationally feasible for sparse problems with many nodes, i.e. variables, and it has the attractive property of automatically achieving high computational efficiency as a function of sparseness of the true underlying DAG. We prove consistency of the algorithm for very high-dimensional, sparse DAGs where the number of nodes is allowed to quickly grow with sample size n, as fast as O(n^a) for any 0 < a < ∞. The sparseness assumption is rather minimal, requiring only that the neighborhoods in the DAG are of lower order than sample size n. We empirically demonstrate the PC-algorithm for simulated data and argue that the algorithm is rather insensitive to the choice of its single tuning parameter.
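A simplified sketch of the skeleton-estimation step: edges are deleted whenever a Fisher z-test judges a partial correlation insignificant. This toy version, with names of our own choosing, caps the conditioning-set size at 1 and takes a fixed normal critical value as its tuning parameter; the full PC-algorithm iterates over all conditioning orders.

```python
import numpy as np
from itertools import combinations

def pc_skeleton(data, crit=2.58, max_order=1):
    """Simplified PC-style skeleton search on Gaussian data: delete edge
    i-j when some conditioning set S with |S| <= max_order renders the
    partial correlation insignificant under the Fisher z-test.
    crit is the normal critical value (2.58 ~ alpha = 0.01, two-sided)."""
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    adj = ~np.eye(p, dtype=bool)

    def parcorr(i, j, S):
        # partial correlation via inversion of the sub-correlation matrix
        sub = corr[np.ix_((i, j) + S, (i, j) + S)]
        prec = np.linalg.inv(sub)
        return -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])

    for order in range(max_order + 1):
        for i in range(p):
            for j in range(i + 1, p):
                if not adj[i, j]:
                    continue
                others = tuple(k for k in range(p) if adj[i, k] and k != j)
                for S in combinations(others, order):
                    r = np.clip(parcorr(i, j, S), -0.999999, 0.999999)
                    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - order - 3)
                    if abs(z) < crit:      # cannot reject independence
                        adj[i, j] = adj[j, i] = False
                        break
    return adj
```

On a chain X → Y → Z, for instance, the edge X–Z survives the marginal test but is removed once the test conditions on Y.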

…

Article

The AdaBoost algorithm was designed to combine many "weak" hypotheses that
perform slightly better than random guessing into a "strong" hypothesis that
has very low error. We study the rate at which AdaBoost iteratively converges
to the minimum of the "exponential loss." Unlike previous work, our proofs do
not require a weak-learning assumption, nor do they require that minimizers of
the exponential loss are finite. Our first result shows that at iteration $t$,
the exponential loss of AdaBoost's computed parameter vector will be at most
$\epsilon$ more than that of any parameter vector of $\ell_1$-norm bounded by
$B$ in a number of rounds that is at most a polynomial in $B$ and $1/\epsilon$.
We also provide lower bounds showing that a polynomial dependence on these
parameters is necessary. Our second result is that within $C/\epsilon$
iterations, AdaBoost achieves a value of the exponential loss that is at most
$\epsilon$ more than the best possible value, where $C$ depends on the dataset.
We show that this dependence of the rate on $\epsilon$ is optimal up to
constant factors, i.e., at least $\Omega(1/\epsilon)$ rounds are necessary to
achieve within $\epsilon$ of the optimal exponential loss.
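For concreteness, here is a toy AdaBoost on one-dimensional threshold stumps that tracks the exponential loss $(1/n)\sum_i e^{-y_i F(x_i)}$ at each round. It is illustrative only; the paper's results concern this loss, not any particular implementation, and the stump learner here is our own choice.

```python
import numpy as np

def adaboost_stumps(X, y, T=10):
    """AdaBoost with threshold stumps on a 1-D feature, recording the
    exponential loss (1/n) sum_i exp(-y_i F(x_i)) after each round."""
    n = len(y)
    F = np.zeros(n)                       # combined score sum_t alpha_t h_t(x)
    w = np.ones(n) / n                    # example weights
    losses = []
    thresholds = np.unique(X)
    for _ in range(T):
        # pick the stump (threshold, sign) with least weighted error
        best = None
        for thr in thresholds:
            for s in (1, -1):
                h = s * np.where(X <= thr, 1, -1)
                err = w[h != y].sum()
                if best is None or err < best[0]:
                    best = (err, thr, s)
        err, thr, s = best
        err = min(max(err, 1e-12), 1 - 1e-12)   # guard against log(0)
        alpha = 0.5 * np.log((1 - err) / err)
        F += alpha * s * np.where(X <= thr, 1, -1)
        w = np.exp(-y * F)
        losses.append(w.mean())           # exponential loss this round
        w /= w.sum()                      # renormalize weights
    return np.sign(F), losses
```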

…

Article

There is a large literature explaining why AdaBoost is a successful
classifier. The literature on AdaBoost focuses on classifier margins and
boosting's interpretation as the optimization of an exponential likelihood
function. These existing explanations, however, have been pointed out to be
incomplete. A random forest is another popular ensemble method for which there
is substantially less explanation in the literature. We introduce a novel
perspective on AdaBoost and random forests that proposes that the two
algorithms work for similar reasons. While both classifiers achieve similar
predictive accuracy, random forests cannot be conceived as a direct
optimization procedure. Rather, random forests is a self-averaging,
interpolating algorithm which creates what we denote as a "spikey-smooth"
classifier, and we view AdaBoost in the same light. We conjecture that both
AdaBoost and random forests succeed because of this mechanism. We provide a
number of examples and some theoretical justification to support this
explanation. In the process, we question the conventional wisdom that suggests
that boosting algorithms for classification require regularization or early
stopping and should be limited to low complexity classes of learners, such as
decision stumps. We conclude that boosting should be used like random forests:
with large decision trees and without direct regularization or early stopping.

…

Article

When applying aggregating strategies to Prediction with Expert Advice, the learning rate must be adaptively tuned. The natural choice of sqrt(complexity/current loss) renders the analysis of Weighted Majority derivatives quite complicated. In particular, for arbitrary weights there have been no results proven so far. The analysis of the alternative "Follow the Perturbed Leader" (FPL) algorithm from Kalai & Vempala (2003) (based on Hannan's algorithm) is easier. We derive loss bounds for adaptive learning rate and both finite expert classes with uniform weights and countable expert classes with arbitrary weights. For the former setup, our loss bounds match the best known results so far, while for the latter our results are new.
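The FPL rule itself is a few lines: follow the expert whose perturbed cumulative loss is smallest. Below is a sketch with exponentially distributed perturbations and an adaptive rate of the form 1/sqrt(t); the exact scaling is our simplification of the Kalai-Vempala/Hutter-Poland setup, for a finite expert class with uniform weights.

```python
import numpy as np

def fpl(loss_matrix, seed=0):
    """Follow the Perturbed Leader with adaptive rate eta_t = 1/sqrt(t):
    each round, play the expert whose perturbed cumulative loss is smallest,
    then suffer that expert's loss and observe everyone's losses.
    loss_matrix has shape (T rounds, K experts)."""
    rng = np.random.default_rng(seed)
    T, K = loss_matrix.shape
    cum = np.zeros(K)                     # cumulative loss of each expert
    total = 0.0                           # learner's cumulative loss
    for t in range(1, T + 1):
        eta = 1.0 / np.sqrt(t)
        q = rng.exponential(size=K) / eta # perturbation scaled by 1/eta
        leader = np.argmin(cum - q)
        total += loss_matrix[t - 1, leader]
        cum += loss_matrix[t - 1]
    return total, cum.min()               # learner's loss, best expert's loss
```

With two experts of constant loss 0.1 and 0.9, the learner quickly locks onto the better one and its regret stays small.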

…

Article

The scalability of statistical estimators is of increasing importance in
modern applications. One approach to implementing scalable algorithms is to
compress data into a low dimensional latent space using dimension reduction
methods. In this paper we develop an approach for dimension reduction that
exploits the assumption of low rank structure in high dimensional data to gain
both computational and statistical advantages. We adapt recent randomized
low-rank approximation algorithms to provide an efficient solution to principal
component analysis (PCA), and we use this efficient solver to improve parameter
estimation in large-scale linear mixed models (LMM) for association mapping in
statistical and quantitative genomics. A key observation in this paper is that
randomization serves a dual role, improving both computational and statistical
performance by implicitly regularizing the covariance matrix estimate of the
random effect in an LMM. These statistical and computational advantages are
highlighted in our experiments on simulated data and large-scale genomic
studies.
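A compact sketch of the randomized low-rank approach to PCA (a range finder with oversampling and power iterations, in the spirit of the randomized algorithms the paper adapts; the parameter choices here are our own, not the paper's):

```python
import numpy as np

def randomized_pca(X, k, n_oversample=10, n_iter=2, seed=0):
    """Randomized low-rank SVD of the centered data matrix, returning the
    top-k singular values and principal axes."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)               # center for PCA
    n, p = Xc.shape
    Omega = rng.normal(size=(p, k + n_oversample))
    Y = Xc @ Omega                        # sketch of the column space
    for _ in range(n_iter):               # power iterations sharpen the spectrum
        Y = Xc @ (Xc.T @ Y)
    Q, _ = np.linalg.qr(Y)                # orthonormal basis for the range
    B = Q.T @ Xc                          # small (k + oversample) x p matrix
    _, s, Vt = np.linalg.svd(B, full_matrices=False)
    return s[:k], Vt[:k]                  # singular values, principal axes
```

Only the small matrix B is decomposed exactly, so the cost is dominated by a few passes over the data rather than a full SVD.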

…

Article

Very recently crowdsourcing has become the de facto platform for distributing
and collecting human computation for a wide range of tasks and applications
such as information retrieval, natural language processing and machine
learning. Current crowdsourcing platforms have some limitations in the area of
quality control. Most of the effort to ensure good quality has to be done by
the experimenter who has to manage the number of workers needed to reach good
results.
We propose a simple model for adaptive quality control in crowdsourced
multiple-choice tasks which we call the \emph{bandit survey problem}. This
model is related to, but technically different from the well-known multi-armed
bandit problem. We present several algorithms for this problem, and support
them with analysis and simulations. Our approach is based on our experience
conducting relevance evaluation for a large commercial search engine.

…

Conference Paper

We construct a boosting algorithm which is the first booster that is both smooth and adaptive. These two features make it possible to achieve performance improvements for many learning tasks whose solutions use a boosting technique.
Originally, the boosting approach was suggested for the standard PAC model; we analyze possible applications of boosting in the model of agnostic learning (which is "more realistic" than PAC). We derive a lower bound for the final error achievable by boosting in the agnostic model, and we show that our algorithm actually achieves that accuracy (within a constant factor of 2): when the booster faces distribution D, its final error is bounded above by \( \frac{1}{1/2 - \beta}\, err_D(F) + \zeta \), where \( err_D(F) + \beta \) is an upper bound on the error of a hypothesis received from the (agnostic) weak learner when it faces distribution D and \( \zeta \) is any real, so that the complexity of the boosting is polynomial in \( 1/\zeta \). We note that the idea of applying boosting in the agnostic model was first suggested by Ben-David, Long and Mansour, and the above accuracy is an exponential improvement w.r.t. \( \beta \) over their result \( \frac{1}{1/2 - \beta}\, err_D(F)^{2(1/2 - \beta)^2 / \ln(1/\beta - 1)} + \zeta \). Eventually, we construct a boosting "tandem", thus approaching in terms of O the lowest possible number of boosting iterations, as well as in terms of Õ the best possible smoothness. This allows solving adaptively problems whose solutions are based on smooth boosting (like noise-tolerant boosting and DNF membership learning), preserving the original solution's complexity.

…

Article

The False Discovery Rate (FDR) is a commonly used type I error rate in
multiple testing problems. It is defined as the expected False Discovery
Proportion (FDP), that is, the expected fraction of false positives among
rejected hypotheses. When the hypotheses are independent, the
Benjamini-Hochberg procedure achieves FDR control at any pre-specified level.
By construction, FDR control offers no guarantee in terms of power, or type II
error. A number of alternative procedures have been developed, including
plug-in procedures that aim at gaining power by incorporating an estimate of
the proportion of true null hypotheses. In this paper, we study the asymptotic
behavior of a class of plug-in procedures based on kernel estimators of the
density of the $p$-values, as the number $m$ of tested hypotheses grows to
infinity. In a setting where the hypotheses tested are independent, we prove
that these procedures are asymptotically more powerful in two respects: (i) a
tighter asymptotic FDR control for any target FDR level and (ii) a broader
range of target levels yielding positive asymptotic power. We also show that
this increased asymptotic power comes at the price of slower, non-parametric
convergence rates for the FDP. These rates are of the form $m^{-k/(2k+1)}$,
where $k$ is determined by the regularity of the density of the $p$-value
distribution, or, equivalently, of the test statistics distribution. These
results are applied to one- and two-sided test statistics for Gaussian and
Laplace location models, and for the Student model.
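The Benjamini-Hochberg step-up procedure mentioned in this abstract has a short concrete form; the following is a generic sketch (standard textbook material, not code from the paper), rejecting the k smallest p-values where k = max{i : p_(i) ≤ iα/m}:

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: reject the hypotheses with the
    k smallest p-values, where k = max{i : p_(i) <= i*alpha/m}. Controls
    FDR at level alpha when the hypotheses are independent."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresh = alpha * np.arange(1, m + 1) / m  # step-up thresholds i*alpha/m
    below = p[order] <= thresh
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest index meeting the bound
        rejected[order[: k + 1]] = True
    return rejected
```

A plug-in procedure of the kind studied in the paper would additionally divide the thresholds by an estimate of the proportion of true nulls.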

…

Article

The CUR matrix decomposition and the Nystr\"{o}m approximation are two
important low-rank matrix approximation techniques. The Nystr\"{o}m method
approximates a symmetric positive semidefinite matrix in terms of a small
number of its columns, while CUR approximates an arbitrary data matrix by a
small number of its columns and rows. Thus, CUR decomposition can be regarded
as an extension of the Nystr\"{o}m approximation.
In this paper we establish a more general error bound for the adaptive
column/row sampling algorithm, based on which we propose more accurate CUR and
Nystr\"{o}m algorithms with expected relative-error bounds. The proposed CUR
and Nystr\"{o}m algorithms also have low time complexity and can avoid
maintaining the whole data matrix in RAM. In addition, we give theoretical
analysis for the lower error bounds of the standard Nystr\"{o}m method and the
ensemble Nystr\"{o}m method. The main theoretical results established in this
paper are novel, and our analysis makes no special assumption on the data
matrices.
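For concreteness, the standard Nyström approximation that the paper improves upon can be sketched in a few lines (a generic illustration of the baseline, not the proposed algorithm); `idx` denotes the sampled column indices:

```python
import numpy as np

def nystrom(K, idx):
    """Standard Nystrom approximation of a symmetric positive semidefinite
    matrix K from its sampled columns: K_hat = C W^+ C^T, where
    C = K[:, idx] and W = K[idx][:, idx] is the sampled intersection block."""
    C = K[:, idx]
    W = K[np.ix_(idx, idx)]
    return C @ np.linalg.pinv(W) @ C.T
```

When K has exact rank r and the sampled block W captures that rank, the approximation is exact; the paper's contribution concerns relative-error bounds for adaptively sampled columns in the general case.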

…

Article

In this paper, we consider networks consisting of a finite number of
non-overlapping communities. To extract these communities, the interaction
between pairs of nodes may be sampled from a large available data set, which
allows a given node pair to be sampled several times. When a node pair is
sampled, the observed outcome is a binary random variable, equal to 1 if nodes
interact and to 0 otherwise. The outcome is more likely to be positive if nodes
belong to the same communities. For a given budget of node pair samples or
observations, we wish to jointly design a sampling strategy (the sequence of
sampled node pairs) and a clustering algorithm that recover the hidden
communities with the highest possible accuracy. We consider both non-adaptive
and adaptive sampling strategies, and for both classes of strategies, we derive
fundamental performance limits satisfied by any sampling and clustering
algorithm. In particular, we provide necessary conditions for the existence of
algorithms recovering the communities accurately as the network size grows
large. We also devise simple algorithms that accurately reconstruct the
communities when this is at all possible, hence proving that the proposed
necessary conditions for accurate community detection are also sufficient. The
classical problem of community detection in the stochastic block model can be
seen as a particular instance of the problems considered here. But our framework
covers more general scenarios where the sequence of sampled node pairs can be
designed in an adaptive manner. The paper provides new results for the
stochastic block model, and extends the analysis to the case of adaptive
sampling.

…

Article

Online estimation and modelling of i.i.d. data for short sequences over large
or complex "alphabets" is a ubiquitous (sub)problem in machine learning,
information theory, data compression, statistical language processing, and
document analysis. The Dirichlet-Multinomial distribution (also called Polya
urn scheme) and extensions thereof are widely applied for online i.i.d.
estimation. Good a priori choices for the parameters in this regime are,
however, difficult to obtain. I derive an optimal adaptive choice for the main
parameter via tight, data-dependent redundancy bounds for a related model. The
1-line recommendation is to set the 'total mass' = 'precision' =
'concentration' parameter to (m/2) ln[(n+1)/m], where n is the (past) sample size
and m the number of different symbols observed (so far). The resulting
estimator (i) is simple, (ii) online, (iii) fast, (iv) performs well for all m,
small, middle and large, (v) is independent of the base alphabet size, (vi)
non-occurring symbols induce no redundancy, (vii) the constant sequence has
constant redundancy, (viii) symbols that appear only finitely often have
bounded/constant contribution to the redundancy, (ix) is competitive with
(slow) Bayesian mixing over all sub-alphabets.
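To illustrate, the recommended adaptive total mass plugs into a Polya-urn-style sequential predictor roughly as follows. The `adaptive_mass` formula is taken from the abstract; `predictive_prob` is a generic sketch of such a predictor, not necessarily the paper's exact estimator:

```python
import math
from collections import Counter

def adaptive_mass(n, m):
    """Adaptive 'total mass' from the abstract: (m/2) * ln((n+1)/m),
    where n is the past sample size and m the number of distinct
    symbols observed so far (requires m >= 1)."""
    return 0.5 * m * math.log((n + 1) / m)

def predictive_prob(counts, x):
    """Polya-urn-style predictive probability (illustrative sketch only):
    a seen symbol x gets weight counts[x], while the total mass alpha is
    reserved as escape probability for novel symbols."""
    n = sum(counts.values())
    m = len(counts)
    alpha = adaptive_mass(n, m) if m > 0 else 1.0  # fallback before any data
    if x in counts:
        return counts[x] / (n + alpha)
    return alpha / (n + alpha)  # escape mass, shared among unseen symbols
```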

…

Article

We propose a new, nonparametric method for multivariate regression subject to
convexity or concavity constraints on the response function. Convexity
constraints are common in economics, statistics, operations research, financial
engineering and optimization, but there is currently no multivariate method
that is computationally feasible for more than a few hundred observations. We
introduce Convex Adaptive Partitioning (CAP), which creates a globally convex
regression model from locally linear estimates fit on adaptively selected
covariate partitions. CAP is computationally efficient, in stark contrast to
current methods. The most popular method, the least squares estimator, has a
computational complexity of $\mathcal{O}(n^3)$. We show that CAP has a
computational complexity of $\mathcal{O}(n \log(n)\log(\log(n)))$ and also give
consistency results. CAP is applied to value function approximation for pricing
American basket options with a large number of underlying assets.

…

Article

In modeling multivariate time series, it is important to allow time-varying
smoothness in the mean and covariance process. In particular, there may be
certain time intervals exhibiting rapid changes and others in which changes are
slow. If such time-varying smoothness is not accounted for, one can obtain
misleading inferences and predictions, with over-smoothing across erratic time
intervals and under-smoothing across times exhibiting slow variation. This can
lead to mis-calibration of predictive intervals, which can be substantially too
narrow or wide depending on the time. We propose a locally adaptive factor
process for characterizing multivariate mean-covariance changes in continuous
time, allowing locally varying smoothness in both the mean and covariance
matrix. This process is constructed utilizing latent dictionary functions
evolving in time through nested Gaussian processes and linearly related to the
observed data with a sparse mapping. Using a differential equation
representation, we bypass usual computational bottlenecks in obtaining MCMC and
online algorithms for approximate Bayesian inference. The performance is
assessed in simulations and illustrated in a financial application.

…

Article

Hierarchical clustering based on pairwise similarities is a common tool used
in a broad range of scientific applications. However, in many problems it may
be expensive to obtain or compute similarities between the items to be
clustered. This paper investigates the hierarchical clustering of N items based
on a small subset of pairwise similarities, significantly less than the
complete set of N(N-1)/2 similarities. First, we show that if the intracluster
similarities exceed intercluster similarities, then it is possible to correctly
determine the hierarchical clustering from as few as 3N log N similarities. We
demonstrate that achieving this order-of-magnitude savings in the number of
pairwise similarities necessitates sequentially selecting which similarities to
obtain in an adaptive fashion, rather than picking them at random. We then propose an
active clustering method that is robust to a limited fraction of anomalous
similarities, and show how even in the presence of these noisy similarity
values we can resolve the hierarchical clustering using only O(N log^2 N)
pairwise similarities.

…

Article

Hamiltonian Monte Carlo (HMC) is a Markov chain Monte Carlo (MCMC) algorithm
that avoids the random walk behavior and sensitivity to correlated parameters
that plague many MCMC methods by taking a series of steps informed by
first-order gradient information. These features allow it to converge to
high-dimensional target distributions much more quickly than simpler methods
such as random walk Metropolis or Gibbs sampling. However, HMC's performance is
highly sensitive to two user-specified parameters: a step size {\epsilon} and a
desired number of steps L. In particular, if L is too small then the algorithm
exhibits undesirable random walk behavior, while if L is too large the
algorithm wastes computation. We introduce the No-U-Turn Sampler (NUTS), an
extension to HMC that eliminates the need to set the number of steps L. NUTS uses
a recursive algorithm to build a set of likely candidate points that spans a
wide swath of the target distribution, stopping automatically when it starts to
double back and retrace its steps. Empirically, NUTS performs at least as
efficiently as, and sometimes more efficiently than, a well-tuned standard HMC
method, without requiring user intervention or costly tuning runs. We also
derive a method for adapting the step size parameter {\epsilon} on the fly
based on primal-dual averaging. NUTS can thus be used with no hand-tuning at
all. NUTS is also suitable for applications such as BUGS-style automatic
inference engines that require efficient "turnkey" sampling algorithms.
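The gradient-informed steps underlying both HMC and NUTS are leapfrog updates; a minimal sketch of that shared machinery (standard HMC, not the NUTS recursion itself):

```python
import numpy as np

def leapfrog(theta, r, grad_log_p, eps, L):
    """L leapfrog steps of size eps: half momentum step, full position
    step, half momentum step. Approximately conserves the Hamiltonian
    H = -log p(theta) + 0.5*||r||^2, so long trajectories stay on
    high-probability level sets instead of random-walking."""
    theta, r = theta.copy(), r.copy()
    for _ in range(L):
        r += 0.5 * eps * grad_log_p(theta)   # half step on momentum
        theta += eps * r                     # full step on position
        r += 0.5 * eps * grad_log_p(theta)   # half step on momentum
    return theta, r
```

NUTS replaces the fixed L with a doubling recursion that stops once the trajectory begins to double back, i.e., once the momentum at either end points back toward the other end of the trajectory.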

…

Article

In this paper, we consider supervised learning problems such as logistic
regression and study the stochastic gradient method with averaging, in the
usual stochastic approximation setting where observations are used only once.
We show that after $N$ iterations, with a constant step-size proportional to
$1/(R^2 \sqrt{N})$, where $N$ is the number of observations and $R$ is the maximum
norm of the observations, the convergence rate is always of order
$O(1/\sqrt{N})$, and improves to $O(R^2 / \mu N)$ where $\mu$ is the lowest
eigenvalue of the Hessian at the global optimum (when this eigenvalue is
greater than $R^2/\sqrt{N}$). Since $\mu$ does not need to be known in advance,
this shows that averaged stochastic gradient is adaptive to \emph{unknown
local} strong convexity of the objective function. Our proof relies on the
generalized self-concordance properties of the logistic loss and thus extends
to all generalized linear models with uniformly bounded features.
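A minimal single-pass sketch of averaged SGD with this constant step size, assuming labels in {-1, +1} and taking the constant of proportionality to be 1 (the abstract only fixes the step size up to a constant):

```python
import numpy as np

def averaged_sgd_logistic(X, y, R):
    """Single-pass averaged SGD for logistic regression with constant
    step size gamma = 1/(R^2 sqrt(N)); y must take values in {-1, +1}.
    Each observation is used exactly once, and the running average of
    the iterates is returned."""
    N, d = X.shape
    gamma = 1.0 / (R ** 2 * np.sqrt(N))
    w = np.zeros(d)
    w_bar = np.zeros(d)
    for i in range(N):
        margin = y[i] * X[i] @ w
        grad = -y[i] * X[i] / (1.0 + np.exp(margin))  # grad of log(1+e^{-y x.w})
        w -= gamma * grad
        w_bar += (w - w_bar) / (i + 1)  # online average of iterates
    return w_bar
```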

…

Article

Sparse additive models are families of $d$-variate functions that have the
additive decomposition $f^* = \sum_{j \in S} f^*_j$, where $S$ is an unknown
subset of cardinality $s \ll d$. In this paper, we consider the case where each
univariate component function $f^*_j$ lies in a reproducing kernel Hilbert
space (RKHS), and analyze a method for estimating the unknown function $f^*$
based on kernels combined with $\ell_1$-type convex regularization. Working
within a high-dimensional framework that allows both the dimension $d$ and
sparsity $s$ to increase with $n$, we derive convergence rates (upper bounds)
in the $L^2(\mathbb{P})$ and $L^2(\mathbb{P}_n)$ norms over the class of
sparse additive models with each univariate function $f^*_j$ in the unit ball
of a univariate RKHS with bounded kernel function. We
complement our upper bounds by deriving minimax lower bounds on the
$L^2(\mathbb{P})$ error, thereby showing the optimality of our method. Thus, we
obtain optimal minimax rates for many interesting classes of sparse additive
models, including polynomials, splines, and Sobolev classes. We also show that
if, in contrast to our univariate conditions, the multivariate function class
is assumed to be globally bounded, then much faster estimation rates are
possible for any sparsity $s = \Omega(\sqrt{n})$, showing that global
boundedness is a significant restriction in the high-dimensional setting.

…

Article

We consider the problem of learning causal directed acyclic graphs from an
observational joint distribution. One can use these graphs to predict the
outcome of interventional experiments, from which data are often not available.
We show that if the observational distribution follows a structural equation
model with an additive noise structure, the directed acyclic graph becomes
identifiable from the distribution under mild conditions. This constitutes an
interesting alternative to traditional methods that assume faithfulness and
identify the Markov equivalence class of the graph, i.e., leaving some edges
undirected. We provide practical algorithms for finitely many samples and
provide an empirical evaluation.

…

Article

We present a new bandit algorithm, SAO (Stochastic and Adversarial Optimal),
whose regret is, essentially, optimal both for adversarial rewards and for
stochastic rewards. Specifically, SAO combines the square-root worst-case
regret of Exp3 (Auer et al., SIAM J. on Computing 2002) and the
(poly)logarithmic regret of UCB1 (Auer et al., Machine Learning 2002) for
stochastic rewards. Adversarial rewards and stochastic rewards are the two main
settings in the literature on (non-Bayesian) multi-armed bandits. Prior work on
multi-armed bandits treats them separately, and does not attempt to jointly
optimize for both. Our result falls into a general theme of achieving good
worst-case performance while also taking advantage of "nice" problem instances,
an important issue in the design of algorithms with partially known inputs.

…

Article

We consider an original problem that arises from the issue of security
analysis of a power system and that we name optimal discovery with
probabilistic expert advice. We address it with an algorithm based on the
optimistic paradigm and on the Good-Turing missing mass estimator. We prove two
different regret bounds on the performance of this algorithm under weak
assumptions on the probabilistic experts. Under more restrictive hypotheses, we
also prove a macroscopic optimality result, comparing the algorithm both with
an oracle strategy and with uniform sampling. Finally, we provide numerical
experiments illustrating these theoretical findings.
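The Good-Turing missing mass estimator on which the algorithm relies is simple to state: the estimated probability of seeing a so-far-unseen outcome is the fraction of the sample made up of symbols observed exactly once.

```python
def good_turing_missing_mass(counts):
    """Good-Turing estimate of the missing mass: the number of symbols
    seen exactly once, divided by the total sample size. With no data,
    all mass is missing."""
    n = sum(counts.values())
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n if n else 1.0
```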

…

Article

In this technical report I show that the Brier game of prediction is perfectly mixable and find the optimal learning rate and substitution function for it. These results are straightforward, but the computations are surprisingly messy. A game of prediction consists of three components: the observation space Ω, the decision space Γ, and the loss function λ: Ω × Γ → ℝ. In this note we are interested in the following Brier game [1]: Ω is a finite non-empty set, Γ := P(Ω) is the set of all probability measures on Ω, and λ(ω, γ) = ∑_{o ∈ Ω} (γ{o} − δ_ω{o})², where δ_ω ∈ P(Ω) is the probability measure concentrated at ω: δ_ω{ω} = 1 and δ_ω{o} = 0 for o ≠ ω. The Brier game is played repeatedly by a learner with access to decisions made by a pool of experts, which leads to the protocol of prediction with expert advice for the Brier game.
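The Brier loss defined in this abstract, with a prediction γ represented as a dictionary of outcome probabilities over Ω:

```python
def brier_loss(gamma, omega):
    """Brier loss lambda(omega, gamma) = sum over o in Omega of
    (gamma{o} - delta_omega{o})^2, where delta_omega is the point mass
    at the observed outcome omega. gamma maps each outcome to its
    predicted probability."""
    return sum((p - (1.0 if o == omega else 0.0)) ** 2
               for o, p in gamma.items())
```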

…

Article

We study the problem of rank aggregation: given a set of ranked lists, we
want to form a consensus ranking. Furthermore, we consider the case of extreme
lists: i.e., only the ranks of the best or worst elements are known. We impute
missing ranks by the average value and generalise Spearman's \rho to extreme
ranks. Our main contribution is the derivation of a non-parametric estimator
for rank aggregation based on multivariate extensions of Spearman's \rho, which
measures correlation between a set of ranked lists. Multivariate Spearman's
\rho is defined using copulas, and we show that the geometric mean of
normalised ranks maximises multivariate correlation. Motivated by this, we
propose a weighted geometric mean approach for learning to rank which has a
closed-form least squares solution. Finally, we demonstrate good performance
on the rank aggregation benchmarks MQ2007 and MQ2008.
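The geometric-mean aggregation rule described here can be sketched as follows (a simplified illustration assuming complete rankings of the same n items, without the extreme-rank imputation step):

```python
import numpy as np

def aggregate_ranks(rank_lists):
    """Aggregate several rankings of the same n items by the geometric
    mean of normalised ranks (rank/n), which the abstract shows maximises
    multivariate Spearman correlation. Returns item indices ordered from
    best (smallest aggregate rank) to worst."""
    R = np.asarray(rank_lists, dtype=float)  # shape (num_lists, n), ranks 1..n
    n = R.shape[1]
    gm = np.exp(np.log(R / n).mean(axis=0))  # per-item geometric mean
    return np.argsort(gm)
```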

…

Article

In this paper, we consider the problem of "hyper-sparse aggregation". Namely, given a dictionary $F = \{f_1, ..., f_M \}$ of functions, we look for an optimal aggregation algorithm that writes $\tilde f = \sum_{j=1}^M \theta_j f_j$ with as many zero coefficients $\theta_j$ as possible. This problem is of particular interest when $F$ contains many irrelevant functions that should not appear in $\tilde{f}$. We provide an exact oracle inequality for $\tilde f$, where only two coefficients are non-zero, which entails that $\tilde f$ is an optimal aggregation algorithm. Since selectors are suboptimal aggregation procedures, this proves that 2 is the minimal number of elements of $F$ required for the construction of an optimal aggregation procedure in every situation. A simulated example of this algorithm is given on a dictionary obtained using LARS, for the problem of selecting the regularization parameter of the LASSO. We also give an example of the use of aggregation to achieve minimax adaptation over anisotropic Besov spaces, which was not previously known in minimax theory (in regression on a random design).

…

Article

In this paper we prove the optimality of an aggregation procedure. We prove lower bounds for aggregation of model selection type of $M$ density estimators for the Kullback-Leibler divergence (KL), the Hellinger distance and the $L_1$-distance. The lower bound, with respect to the KL divergence, can be achieved by the online-type estimate suggested, among others, by Yang (2000). Combining these results, we state that $\log M/n$ is an optimal rate of aggregation in the sense of Tsybakov (2003), where $n$ is the sample size.

…

Article

We study pool-based active learning of half-spaces. We revisit the aggressive
approach for active learning in the realizable case, and show that it can be
made efficient and practical, while also having theoretical guarantees under
reasonable assumptions. We further show, both theoretically and experimentally,
that it can be preferable to mellow approaches. Our efficient aggressive active
learner of half-spaces has formal approximation guarantees that hold when the
pool is separable with a margin. While our analysis is focused on the
realizable setting, we show that a simple heuristic allows using the same
algorithm successfully for pools with low error as well. We further compare the
aggressive approach to the mellow approach, and prove that there are cases in
which the aggressive approach results in significantly better label complexity
compared to the mellow approach. We demonstrate experimentally that substantial
improvements in label complexity can be achieved using the aggressive approach,
for both realizable and low-error settings.

…

Article

Online Passive-Aggressive (PA) learning is an effective framework for
performing max-margin online learning. However, its deterministic formulation
and single estimated large-margin model can limit its capability to discover
descriptive structures underlying complex data. This paper presents online
Bayesian Passive-Aggressive (BayesPA) learning, which subsumes the online PA
and extends naturally to incorporate latent variables and perform nonparametric
Bayesian inference, thus providing great flexibility for explorative analysis.
We apply BayesPA to topic modeling and derive efficient online learning
algorithms for max-margin topic models. We further develop nonparametric
methods to resolve the number of topics. Experimental results on real datasets
show that our approaches significantly improve time efficiency while
maintaining comparable results with the batch counterparts.

…
