# Journal of Machine Learning Research

Online ISSN: 1532-4435
Publications
Article
The ℓ1-penalized method, or the Lasso, has emerged as an important tool for the analysis of large data sets. Many important results have been obtained for the Lasso in linear regression which have led to a deeper understanding of high-dimensional statistical problems. In this article, we consider a class of weighted ℓ1-penalized estimators for convex loss functions of a general form, including the generalized linear models. We study the estimation, prediction, selection and sparsity properties of the weighted ℓ1-penalized estimator in sparse, high-dimensional settings where the number of predictors p can be much larger than the sample size n. Adaptive Lasso is considered as a special case. A multistage method is developed to approximate concave regularized estimation by applying an adaptive Lasso recursively. We provide prediction and estimation oracle inequalities for single- and multi-stage estimators, a general selection consistency theorem, and an upper bound for the dimension of the Lasso estimator. Important models including the linear regression, logistic regression and log-linear models are used throughout to illustrate the applications of the general results.

Article
We use convex relaxation techniques to provide a sequence of regularized low-rank solutions for large-scale matrix completion problems. Using the nuclear norm as a regularizer, we provide a simple and very efficient convex algorithm for minimizing the reconstruction error subject to a bound on the nuclear norm. Our algorithm Soft-Impute iteratively replaces the missing elements with those obtained from a soft-thresholded SVD. With warm starts this allows us to efficiently compute an entire regularization path of solutions on a grid of values of the regularization parameter. The computationally intensive part of our algorithm is in computing a low-rank SVD of a dense matrix. Exploiting the problem structure, we show that the task can be performed with a complexity linear in the matrix dimensions. Our semidefinite-programming algorithm is readily scalable to large matrices: for example it can obtain a rank-80 approximation of a 10(6) × 10(6) incomplete matrix with 10(5) observed entries in 2.5 hours, and can fit a rank 40 approximation to the full Netflix training set in 6.6 hours. Our methods show very good performance both in training and test error when compared to other competitive state-of-the art techniques.

Article
We describe a class of sparse latent factor models, called graphical factor models (GFMs), and relevant sparse learning algorithms for posterior mode estimation. Linear, Gaussian GFMs have sparse, orthogonal factor loadings matrices, that, in addition to sparsity of the implied covariance matrices, also induce conditional independence structures via zeros in the implied precision matrices. We describe the models and their use for robust estimation of sparse latent factor structure and data/signal reconstruction. We develop computational algorithms for model exploration and posterior mode search, addressing the hard combinatorial optimization involved in the search over a huge space of potential sparse configurations. A mean-field variational technique coupled with annealing is developed to successively generate "artificial" posterior distributions that, at the limiting temperature in the annealing schedule, define required posterior modes in the GFM parameter space. Several detailed empirical studies and comparisons to related approaches are discussed, including analyses of handwritten digit image and cancer gene expression data.

Article
In this paper, we introduce PEBL, a Python library and application for learning Bayesian network structure from data and prior knowledge that provides features unmatched by alternative software packages: the ability to use interventional data, flexible specification of structural priors, modeling with hidden variables and exploitation of parallel processing. PEBL is released under the MIT open-source license, can be installed from the Python Package Index and is available at http://pebl-project.googlecode.com.

Article
Variable selection in high-dimensional space characterizes many contemporary problems in scientific discovery and decision making. Many frequently-used techniques are based on independence screening; examples include correlation ranking (Fan and Lv, 2008) or feature selection using a two-sample t-test in high-dimensional classification (Tibshirani et al., 2003). Within the context of the linear model, Fan and Lv (2008) showed that this simple correlation ranking possesses a sure independence screening property under certain conditions and that its revision, called iteratively sure independent screening (ISIS), is needed when the features are marginally unrelated but jointly related to the response variable. In this paper, we extend ISIS, without explicit definition of residuals, to a general pseudo-likelihood framework, which includes generalized linear models as a special case. Even in the least-squares setting, the new method improves ISIS by allowing feature deletion in the iterative process. Our technique allows us to select important features in high-dimensional classification where the popularly used two-sample t-method fails. A new technique is introduced to reduce the false selection rate in the feature screening stage. Several simulated and two real data examples are presented to illustrate the methodology.

Article
We consider the problems of estimating the parameters as well as the structure of binary-valued Markov networks. For maximizing the penalized log-likelihood, we implement an approximate procedure based on the pseudo-likelihood of Besag (1975) and generalize it to a fast exact algorithm. The exact algorithm starts with the pseudo-likelihood solution and then adjusts the pseudo-likelihood criterion so that each additional iterations moves it closer to the exact solution. Our results show that this procedure is faster than the competing exact method proposed by Lee, Ganapathi, and Koller (2006a). However, we also find that the approximate pseudo-likelihood as well as the approaches of Wainwright et al. (2006), when implemented using the coordinate descent procedure of Friedman, Hastie, and Tibshirani (2008b), are much faster than the exact methods, and only slightly less accurate.

Article
Chain graphs present a broad class of graphical models for description of conditional independence structures, including both Markov networks and Bayesian networks as special cases. In this paper, we propose a computationally feasible method for the structural learning of chain graphs based on the idea of decomposing the learning problem into a set of smaller scale problems on its decomposed subgraphs. The decomposition requires conditional independencies but does not require the separators to be complete subgraphs. Algorithms for both skeleton recovery and complex arrow orientation are presented. Simulations under a variety of settings demonstrate the competitive performance of our method, especially when the underlying graph is sparse.

Article
A non-parametric hierarchical Bayesian framework is developed for designing a classifier, based on a mixture of simple (linear) classifiers. Each simple classifier is termed a local "expert", and the number of experts and their construction are manifested via a Dirichlet process formulation. The simple form of the "experts" allows analytical handling of incomplete data. The model is extended to allow simultaneous design of classifiers on multiple data sets, termed multi-task learning, with this also performed non-parametrically via the Dirichlet process. Fast inference is performed using variational Bayesian (VB) analysis, and example results are presented for several data sets. We also perform inference via Gibbs sampling, to which we compare the VB results.

Article
Clustering analysis is widely used in many fields. Traditionally clustering is regarded as unsupervised learning for its lack of a class label or a quantitative response variable, which in contrast is present in supervised learning such as classification and regression. Here we formulate clustering as penalized regression with grouping pursuit. In addition to the novel use of a non-convex group penalty and its associated unique operating characteristics in the proposed clustering method, a main advantage of this formulation is its allowing borrowing some well established results in classification and regression, such as model selection criteria to select the number of clusters, a difficult problem in clustering analysis. In particular, we propose using the generalized cross-validation (GCV) based on generalized degrees of freedom (GDF) to select the number of clusters. We use a few simple numerical examples to compare our proposed method with some existing approaches, demonstrating our method's promising performance.

Article
We apply a variational method to automatically determine the number of mixtures of independent components in high-dimensional datasets, in which the sources may be nonsymmetrically distributed. The data are modeled by clusters where each cluster is described as a linear mixture of independent factors. The variational Bayesian method yields an accurate density model for the observed data without overfitting problems. This allows the dimensionality of the data to be identified for each cluster. The new method was successfully applied to a difficult real-world medical dataset for diagnosing glaucoma.

Article
We study the variability of predictions made by bagged learners and random forests, and show how to estimate standard errors for these methods. Our work builds on variance estimates for bagging proposed by Efron (1992, 2013) that are based on the jackknife and the infinitesimal jackknife (IJ). In practice, bagged predictors are computed using a finite number B of bootstrap replicates, and working with a large B can be computationally expensive. Direct applications of jackknife and IJ estimators to bagging require B = Θ(n (1.5)) bootstrap replicates to converge, where n is the size of the training set. We propose improved versions that only require B = Θ(n) replicates. Moreover, we show that the IJ estimator requires 1.7 times less bootstrap replicates than the jackknife to achieve a given accuracy. Finally, we study the sampling distributions of the jackknife and IJ variance estimates themselves. We illustrate our findings with multiple experiments and simulation studies.

Article
We present a novel method for estimating tree-structured covariance matrices directly from observed continuous data. Specifically, we estimate a covariance matrix from observations of p continuous random variables encoding a stochastic process over a tree with p leaves. A representation of these classes of matrices as linear combinations of rank-one matrices indicating object partitions is used to formulate estimation as instances of well-studied numerical optimization problems.In particular, our estimates are based on projection, where the covariance estimate is the nearest tree-structured covariance matrix to an observed sample covariance matrix. The problem is posed as a linear or quadratic mixed-integer program (MIP) where a setting of the integer variables in the MIP specifies a set of tree topologies of the structured covariance matrix. We solve these problems to optimality using efficient and robust existing MIP solvers.We present a case study in phylogenetic analysis of gene expression and a simulation study comparing our method to distance-based tree estimating procedures.

Article
In cognitive science and neuroscience, there have been two leading models describing how humans perceive and classify facial expressions of emotion-the continuous and the categorical model. The continuous model defines each facial expression of emotion as a feature vector in a face space. This model explains, for example, how expressions of emotion can be seen at different intensities. In contrast, the categorical model consists of C classifiers, each tuned to a specific emotion category. This model explains, among other findings, why the images in a morphing sequence between a happy and a surprise face are perceived as either happy or surprise but not something in between. While the continuous model has a more difficult time justifying this latter finding, the categorical model is not as good when it comes to explaining how expressions are recognized at different intensities or modes. Most importantly, both models have problems explaining how one can recognize combinations of emotion categories such as happily surprised versus angrily surprised versus surprise. To resolve these issues, in the past several years, we have worked on a revised model that justifies the results reported in the cognitive science and neuroscience literature. This model consists of C distinct continuous spaces. Multiple (compound) emotion categories can be recognized by linearly combining these C face spaces. The dimensions of these spaces are shown to be mostly configural. According to this model, the major task for the classification of facial expressions of emotion is precise, detailed detection of facial landmarks rather than recognition. We provide an overview of the literature justifying the model, show how the resulting model can be employed to build algorithms for the recognition of facial expression of emotion, and propose research directions in machine learning and computer vision researchers to keep pushing the state of the art in these areas. We also discuss how the model can aid in studies of human perception, social interactions and disorders.

Article
Planning problems that involve learning a policy from a single training set of finite horizon trajectories arise in both social science and medical fields. We consider Q-learning with function approximation for this setting and derive an upper bound on the generalization error. This upper bound is in terms of quantities minimized by a Q-learning algorithm, the complexity of the approximation space and an approximation term due to the mismatch between Q-learning and the goal of learning a policy that maximizes the value function.

Article
We develop an R package fastclime for solving a family of regularized linear programming (LP) problems. Our package efficiently implements the parametric simplex algorithm, which provides a scalable and sophisticated tool for solving large-scale linear programs. As an illustrative example, one use of our LP solver is to implement an important sparse precision matrix estimation method called CLIME (Constrained L 1 Minimization Estimator). Compared with existing packages for this problem such as clime and flare, our package has three advantages: (1) it efficiently calculates the full piecewise-linear regularization path; (2) it provides an accurate dual certificate as stopping criterion; (3) it is completely coded in C and is highly portable. This package is designed to be useful to statisticians and machine learning researchers for solving a wide range of problems.

Article
We present a general and detailed development of an algorithm for finite-horizon fitted-Q iteration with an arbitrary number of reward signals and linear value function approximation using an arbitrary number of state features. This includes a detailed treatment of the 3-reward function case using triangulation primitives from computational geometry and a method for identifying globally dominated actions. We also present an example of how our methods can be used to construct a real-world decision aid by considering symptom reduction, weight gain, and quality of life in sequential treatments for schizophrenia. Finally, we discuss future directions in which to take this work that will further enable our methods to make a positive impact on the field of evidence-based clinical decision support.

Article
Hard and soft classifiers are two important groups of techniques for classification problems. Logistic regression and Support Vector Machines are typical examples of soft and hard classifiers respectively. The essential difference between these two groups is whether one needs to estimate the class conditional probability for the classification task or not. In particular, soft classifiers predict the label based on the obtained class conditional probabilities, while hard classifiers bypass the estimation of probabilities and focus on the decision boundary. In practice, for the goal of accurate classification, it is unclear which one to use in a given situation. To tackle this problem, the Large-margin Unified Machine (LUM) was recently proposed as a unified family to embrace both groups. The LUM family enables one to study the behavior change from soft to hard binary classifiers. For multicategory cases, however, the concept of soft and hard classification becomes less clear. In that case, class probability estimation becomes more involved as it requires estimation of a probability vector. In this paper, we propose a new Multicategory LUM (MLUM) framework to investigate the behavior of soft versus hard classification under multicategory settings. Our theoretical and numerical results help to shed some light on the nature of multicategory classification and its transition behavior from soft to hard classifiers. The numerical results suggest that the proposed tuned MLUM yields very competitive performance.

Conference Paper
The problem of selecting a subset of relevant features in a potentially overwhelming quantity of data is classic and found in many branches of science including - examples in computer vision, text processing and more recently bioinformatics are abundant. We present a definition of "relevancy" based on spectral properties of the Affinity (or Laplacian) of the features' measurement matrix. The feature selection process is then based on a continuous ranking of the features defined by a least-squares optimization process. A remarkable property of the feature relevance function is that sparse solutions for the ranking values naturally emerge as a result of a "biased nonnegativity" of a key matrix in the process. As a result, a simple least-squares optimization process converges onto a sparse solution, i.e., a selection of a subset of features which form a local maxima over the relevance function. The feature selection algorithm can be embedded in both unsupervised and supervised inference problems and empirical evidence shows that the feature selections typically achieve high accuracy even when only a small fraction of the features are relevant.

Conference Paper
In the context of independent components analysis (ICA), the mutual information (MI) of the extracted components is one of the most desirable measures of independence, due to its special properties. This paper presents a method for performing linear and nonlinear ICA based on MI, with few approximations. The use of MI as an objective function for ICA requires the estimation of the statistical distributions of the separated components. In this work, both the extraction of independent components and the estimation of their distributions are performed simultaneously, by a single network with a specialized structure, trained with a single objective function

Article
This paper comments on the published work dealing with robustness and regularization of support vector machines (Journal of Machine Learning Research, vol. 10, pp. 1485-1510, 2009) [arXiv:0803.3490] by H. Xu, etc. They proposed a theorem to show that it is possible to relate robustness in the feature space and robustness in the sample space directly. In this paper, we propose a counter example that rejects their theorem.

Article
Hierarchical statistical models are widely employed in information science and data engineering. The models consist of two types of variables: observable variables that represent the given data and latent variables for the unobservable labels. An asymptotic analysis of the models plays an important role in evaluating the learning process; the result of the analysis is applied not only to theoretical but also to practical situations, such as optimal model selection and active learning. There are many studies of generalization errors, which measure the prediction accuracy of the observable variables. However, the accuracy of estimating the latent variables has not yet been elucidated. For a quantitative evaluation of this, the present paper formulates distribution-based functions for the errors in the estimation of the latent variables. The asymptotic behavior is analyzed for both the maximum likelihood and the Bayes methods.

Article
We study the theoretical advantages of active learning over passive learning. Specifically, we prove that, in noise-free classifier learning for VC classes, any passive learning algorithm can be transformed into an active learning algorithm with asymptotically strictly superior label complexity for all nontrivial target functions and distributions. We further provide a general characterization of the magnitudes of these improvements in terms of a novel generalization of the disagreement coefficient. We also extend these results to active learning in the presence of label noise, and find that even under broad classes of noise distributions, we can typically guarantee strict improvements over the known results for passive learning.

Article
Actor-Critic based approaches were among the first to address reinforcement learning in a general setting. Recently, these algorithms have gained renewed interest due to their generality, good convergence properties, and possible biological relevance. In this paper, we introduce an online temporal difference based actor-critic algorithm which is proved to converge to a neighborhood of a local maximum of the average reward. Linear function approximation is used by the critic in order estimate the value function, and the temporal difference signal, which is passed from the critic to the actor. The main distinguishing feature of the present convergence proof is that both the actor and the critic operate on a similar time scale, while in most current convergence proofs they are required to have very different time scales in order to converge. Moreover, the same temporal difference signal is used to update the parameters of both the actor and the critic. A limitation of the proposed approach, compared to results available for two time scale convergence, is that convergence is guaranteed only to a neighborhood of an optimal value, rather to an optimal value itself. The single time scale and identical temporal difference signal used by the actor and the critic, may provide a step towards constructing more biologically realistic models of reinforcement learning in the brain.

Article
The investigation of directed acyclic graphs (DAGs) encoding the same Markov property, that is the same conditional independence relations of multivariate observational distributions, has a long tradition; many algorithms exist for model selection and structure learning in Markov equivalence classes. In this paper, we extend the notion of Markov equivalence of DAGs to the case of interventional distributions arising from multiple intervention experiments. We show that under reasonable assumptions on the intervention experiments, interventional Markov equivalence defines a finer partitioning of DAGs than observational Markov equivalence and hence improves the identifiability of causal models. We give a graph theoretic criterion for two DAGs being Markov equivalent under interventions and show that each interventional Markov equivalence class can, analogously to the observational case, be uniquely represented by a chain graph called interventional essential graph (also known as CPDAG in the observational case). These are key insights for deriving a generalization of the Greedy Equivalence Search algorithm aimed at structure learning from interventional data. This new algorithm is evaluated in a simulation study.

Article
We consider the PC-algorithm Spirtes et. al. (2000) for estimating the skeleton of a very high-dimensional acyclic directed graph (DAG) with corresponding Gaussian distribution. The PC-algorithm is computationally feasible for sparse problems with many nodes, i.e. variables, and it has the attractive property to automatically achieve high computational efficiency as a function of sparseness of the true underlying DAG. We prove consistency of the algorithm for very high-dimensional, sparse DAGs where the number of nodes is allowed to quickly grow with sample size n, as fast as O(n^a) for any 0<a<infinity. The sparseness assumption is rather minimal requiring only that the neighborhoods in the DAG are of lower order than sample size n. We empirically demonstrate the PC-algorithm for simulated data and argue that the algorithm is rather insensitive to the choice of its single tuning parameter.

Article
The AdaBoost algorithm was designed to combine many "weak" hypotheses that perform slightly better than random guessing into a "strong" hypothesis that has very low error. We study the rate at which AdaBoost iteratively converges to the minimum of the "exponential loss." Unlike previous work, our proofs do not require a weak-learning assumption, nor do they require that minimizers of the exponential loss are finite. Our first result shows that at iteration $t$, the exponential loss of AdaBoost's computed parameter vector will be at most $\epsilon$ more than that of any parameter vector of $\ell_1$-norm bounded by $B$ in a number of rounds that is at most a polynomial in $B$ and $1/\epsilon$. We also provide lower bounds showing that a polynomial dependence on these parameters is necessary. Our second result is that within $C/\epsilon$ iterations, AdaBoost achieves a value of the exponential loss that is at most $\epsilon$ more than the best possible value, where $C$ depends on the dataset. We show that this dependence of the rate on $\epsilon$ is optimal up to constant factors, i.e., at least $\Omega(1/\epsilon)$ rounds are necessary to achieve within $\epsilon$ of the optimal exponential loss.

Article
There is a large literature explaining why AdaBoost is a successful classifier. The literature on AdaBoost focuses on classifier margins and boosting's interpretation as the optimization of an exponential likelihood function. These existing explanations, however, have been pointed out to be incomplete. A random forest is another popular ensemble method for which there is substantially less explanation in the literature. We introduce a novel perspective on AdaBoost and random forests that proposes that the two algorithms work for similar reasons. While both classifiers achieve similar predictive accuracy, random forests cannot be conceived as a direct optimization procedure. Rather, random forests is a self-averaging, interpolating algorithm which creates what we denote as a "spikey-smooth" classifier, and we view AdaBoost in the same light. We conjecture that both AdaBoost and random forests succeed because of this mechanism. We provide a number of examples and some theoretical justification to support this explanation. In the process, we question the conventional wisdom that suggests that boosting algorithms for classification require regularization or early stopping and should be limited to low complexity classes of learners, such as decision stumps. We conclude that boosting should be used like random forests: with large decision trees and without direct regularization or early stopping.

Article
When applying aggregating strategies to Prediction with Expert Advice, the learning rate must be adaptively tuned. The natural choice of sqrt(complexity/current loss) renders the analysis of Weighted Majority derivatives quite complicated. In particular, for arbitrary weights there have been no results proven so far. The analysis of the alternative "Follow the Perturbed Leader" (FPL) algorithm from Kalai & Vempala (2003) (based on Hannan's algorithm) is easier. We derive loss bounds for adaptive learning rate and both finite expert classes with uniform weights and countable expert classes with arbitrary weights. For the former setup, our loss bounds match the best known results so far, while for the latter our results are new.

Article
The scalability of statistical estimators is of increasing importance in modern applications. One approach to implementing scalable algorithms is to compress data into a low dimensional latent space using dimension reduction methods. In this paper we develop an approach for dimension reduction that exploits the assumption of low rank structure in high dimensional data to gain both computational and statistical advantages. We adapt recent randomized low-rank approximation algorithms to provide an efficient solution to principal component analysis (PCA), and we use this efficient solver to improve parameter estimation in large-scale linear mixed models (LMM) for association mapping in statistical and quantitative genomics. A key observation in this paper is that randomization serves a dual role, improving both computational and statistical performance by implicitly regularizing the covariance matrix estimate of the random effect in a LMM. These statistical and computational advantages are highlighted in our experiments on simulated data and large-scale genomic studies.

Article
Very recently crowdsourcing has become the de facto platform for distributing and collecting human computation for a wide range of tasks and applications such as information retrieval, natural language processing and machine learning. Current crowdsourcing platforms have some limitations in the area of quality control. Most of the effort to ensure good quality has to be done by the experimenter who has to manage the number of workers needed to reach good results. We propose a simple model for adaptive quality control in crowdsourced multiple-choice tasks which we call the \emph{bandit survey problem}. This model is related to, but technically different from the well-known multi-armed bandit problem. We present several algorithms for this problem, and support them with analysis and simulations. Our approach is based in our experience conducting relevance evaluation for a large commercial search engine.

Conference Paper
We construct a boosting algorithm, which is the first both smooth and adaptive booster. These two features make it possible to achieve performance improvement for many learning tasks whose solution use a boosting technique. Originally, the boosting approach was suggested for the standard PAC model; we analyze possible applications of boosting in the model of agnostic learning (which is “more realistic” than PAC). We derive a lower bound for the final error achievable by boosting in the agnostic model; we show that our algorithm actually achieves that accuracy (within a constant factor of 2): When the booster faces distribution D, its final error is bounded above by $$\frac{1} {{1/2 - \beta }}err_D (F) + \zeta$$, where errD (F) + β is an upper bound on the error of a hypothesis received from the (agnostic) weak learner when it faces distribution D and ζ is any real, so that the complexity of the boosting is polynomial in 1/ζ We note that the idea of applying boosting in the agnostic model was first suggested by Ben-David, Long and Mansour and the above accuracy is an exponential improvement w.r.t. β over their result $$(\frac{1} {{1/2 - \beta }}err_D (F)^{2(1/2 - \beta )^2 /ln(1/\beta - 1)} + \zeta )$$. Eventually, we construct a boosting “tandem”, thus approaching in terms of O the lowest number of the boosting iterations possible, as well as in terms of Õ the best possible smoothness. This allows solving adaptively problems whose solution is based on smooth boosting (like noise tolerant boosting and DNF membership learning), preserving the original solution’s complexity.

Article
The False Discovery Rate (FDR) is a commonly used type I error rate in multiple testing problems. It is defined as the expected False Discovery Proportion (FDP), that is, the expected fraction of false positives among rejected hypotheses. When the hypotheses are independent, the Benjamini-Hochberg procedure achieves FDR control at any pre-specified level. By construction, FDR control offers no guarantee in terms of power, or type II error. A number of alternative procedures have been developed, including plug-in procedures that aim at gaining power by incorporating an estimate of the proportion of true null hypotheses. In this paper, we study the asymptotic behavior of a class of plug-in procedures based on kernel estimators of the density of the $p$-values, as the number $m$ of tested hypotheses grows to infinity. In a setting where the hypotheses tested are independent, we prove that these procedures are asymptotically more powerful in two respects: (i) a tighter asymptotic FDR control for any target FDR level and (ii) a broader range of target levels yielding positive asymptotic power. We also show that this increased asymptotic power comes at the price of slower, non-parametric convergence rates for the FDP. These rates are of the form $m^{-k/(2k+1)}$, where $k$ is determined by the regularity of the density of the $p$-value distribution, or, equivalently, of the test statistics distribution. These results are applied to one- and two-sided tests statistics for Gaussian and Laplace location models, and for the Student model.

Article
The CUR matrix decomposition and the Nystr\"{o}m approximation are two important low-rank matrix approximation techniques. The Nystr\"{o}m method approximates a symmetric positive semidefinite matrix in terms of a small number of its columns, while CUR approximates an arbitrary data matrix by a small number of its columns and rows. Thus, CUR decomposition can be regarded as an extension of the Nystr\"{o}m approximation. In this paper we establish a more general error bound for the adaptive column/row sampling algorithm, based on which we propose more accurate CUR and Nystr\"{o}m algorithms with expected relative-error bounds. The proposed CUR and Nystr\"{o}m algorithms also have low time complexity and can avoid maintaining the whole data matrix in RAM. In addition, we give theoretical analysis for the lower error bounds of the standard Nystr\"{o}m method and the ensemble Nystr\"{o}m method. The main theoretical results established in this paper are novel, and our analysis makes no special assumption on the data matrices.

Article
In this paper, we consider networks consisting of a finite number of non-overlapping communities. To extract these communities, the interaction between pairs of nodes may be sampled from a large available data set, which allows a given node pair to be sampled several times. When a node pair is sampled, the observed outcome is a binary random variable, equal to 1 if nodes interact and to 0 otherwise. The outcome is more likely to be positive if nodes belong to the same communities. For a given budget of node pair samples or observations, we wish to jointly design a sampling strategy (the sequence of sampled node pairs) and a clustering algorithm that recover the hidden communities with the highest possible accuracy. We consider both non-adaptive and adaptive sampling strategies, and for both classes of strategies, we derive fundamental performance limits satisfied by any sampling and clustering algorithm. In particular, we provide necessary conditions for the existence of algorithms recovering the communities accurately as the network size grows large. We also devise simple algorithms that accurately reconstruct the communities when this is at all possible, hence proving that the proposed necessary conditions for accurate community detection are also sufficient. The classical problem of community detection in the stochastic block model can be seen as a particular instance of the problems consider here. But our framework covers more general scenarios where the sequence of sampled node pairs can be designed in an adaptive manner. The paper provides new results for the stochastic block model, and extends the analysis to the case of adaptive sampling.

Article
Online estimation and modelling of i.i.d. data for short sequences over large or complex "alphabets" is a ubiquitous (sub)problem in machine learning, information theory, data compression, statistical language processing, and document analysis. The Dirichlet-Multinomial distribution (also called Polya urn scheme) and extensions thereof are widely applied for online i.i.d. estimation. Good a-priori choices for the parameters in this regime are difficult to obtain though. I derive an optimal adaptive choice for the main parameter via tight, data-dependent redundancy bounds for a related model. The 1-line recommendation is to set the 'total mass' = 'precision' = 'concentration' parameter to m/2ln[(n+1)/m], where n is the (past) sample size and m the number of different symbols observed (so far). The resulting estimator (i) is simple, (ii) online, (iii) fast, (iv) performs well for all m, small, middle and large, (v) is independent of the base alphabet size, (vi) non-occurring symbols induce no redundancy, (vii) the constant sequence has constant redundancy, (viii) symbols that appear only finitely often have bounded/constant contribution to the redundancy, (ix) is competitive with (slow) Bayesian mixing over all sub-alphabets.

Article
We propose a new, nonparametric method for multivariate regression subject to convexity or concavity constraints on the response function. Convexity constraints are common in economics, statistics, operations research, financial engineering and optimization, but there is currently no multivariate method that is computationally feasible for more than a few hundred observations. We introduce Convex Adaptive Partitioning (CAP), which creates a globally convex regression model from locally linear estimates fit on adaptively selected covariate partitions. CAP is computationally efficient, in stark contrast to current methods. The most popular method, the least squares estimator, has a computational complexity of $\mathcal{O}(n^3)$. We show that CAP has a computational complexity of $\mathcal{O}(n \log(n)\log(\log(n)))$ and also give consistency results. CAP is applied to value function approximation for pricing American basket options with a large number of underlying assets.

Article
In modeling multivariate time series, it is important to allow time-varying smoothness in the mean and covariance process. In particular, there may be certain time intervals exhibiting rapid changes and others in which changes are slow. If such time-varying smoothness is not accounted for, one can obtain misleading inferences and predictions, with over-smoothing across erratic time intervals and under-smoothing across times exhibiting slow variation. This can lead to mis-calibration of predictive intervals, which can be substantially too narrow or wide depending on the time. We propose a locally adaptive factor process for characterizing multivariate mean-covariance changes in continuous time, allowing locally varying smoothness in both the mean and covariance matrix. This process is constructed utilizing latent dictionary functions evolving in time through nested Gaussian processes and linearly related to the observed data with a sparse mapping. Using a differential equation representation, we bypass usual computational bottlenecks in obtaining MCMC and online algorithms for approximate Bayesian inference. The performance is assessed in simulations and illustrated in a financial application.

Article
Hierarchical clustering based on pairwise similarities is a common tool used in a broad range of scientific applications. However, in many problems it may be expensive to obtain or compute similarities between the items to be clustered. This paper investigates the hierarchical clustering of N items based on a small subset of pairwise similarities, significantly less than the complete set of N(N-1)/2 similarities. First, we show that if the intracluster similarities exceed intercluster similarities, then it is possible to correctly determine the hierarchical clustering from as few as 3N log N similarities. We demonstrate this order of magnitude savings in the number of pairwise similarities necessitates sequentially selecting which similarities to obtain in an adaptive fashion, rather than picking them at random. We then propose an active clustering method that is robust to a limited fraction of anomalous similarities, and show how even in the presence of these noisy similarity values we can resolve the hierarchical clustering using only O(N log^2 N) pairwise similarities.

Article
Hamiltonian Monte Carlo (HMC) is a Markov chain Monte Carlo (MCMC) algorithm that avoids the random walk behavior and sensitivity to correlated parameters that plague many MCMC methods by taking a series of steps informed by first-order gradient information. These features allow it to converge to high-dimensional target distributions much more quickly than simpler methods such as random walk Metropolis or Gibbs sampling. However, HMC's performance is highly sensitive to two user-specified parameters: a step size {\epsilon} and a desired number of steps L. In particular, if L is too small then the algorithm exhibits undesirable random walk behavior, while if L is too large the algorithm wastes computation. We introduce the No-U-Turn Sampler (NUTS), an extension to HMC that eliminates the need to set a number of steps L. NUTS uses a recursive algorithm to build a set of likely candidate points that spans a wide swath of the target distribution, stopping automatically when it starts to double back and retrace its steps. Empirically, NUTS perform at least as efficiently as and sometimes more efficiently than a well tuned standard HMC method, without requiring user intervention or costly tuning runs. We also derive a method for adapting the step size parameter {\epsilon} on the fly based on primal-dual averaging. NUTS can thus be used with no hand-tuning at all. NUTS is also suitable for applications such as BUGS-style automatic inference engines that require efficient "turnkey" sampling algorithms.

Article
In this paper, we consider supervised learning problems such as logistic regression and study the stochastic gradient method with averaging, in the usual stochastic approximation setting where observations are used only once. We show that after $N$ iterations, with a constant step-size proportional to $1/R^2 \sqrt{N}$ where $N$ is the number of observations and $R$ is the maximum norm of the observations, the convergence rate is always of order $O(1/\sqrt{N})$, and improves to $O(R^2 / \mu N)$ where $\mu$ is the lowest eigenvalue of the Hessian at the global optimum (when this eigenvalue is greater than $R^2/\sqrt{N}$). Since $\mu$ does not need to be known in advance, this shows that averaged stochastic gradient is adaptive to \emph{unknown local} strong convexity of the objective function. Our proof relies on the generalized self-concordance properties of the logistic loss and thus extends to all generalized linear models with uniformly bounded features.

Article
Sparse additive models are families of $d$-variate functions that have the additive decomposition $f^* = \sum_{j \in S} f^*_j$, where $S$ is an unknown subset of cardinality $s \ll d$. In this paper, we consider the case where each univariate component function $f^*_j$ lies in a reproducing kernel Hilbert space (RKHS), and analyze a method for estimating the unknown function $f^*$ based on kernels combined with $\ell_1$-type convex regularization. Working within a high-dimensional framework that allows both the dimension $d$ and sparsity $s$ to increase with $n$, we derive convergence rates (upper bounds) in the $L^2(\mathbb{P})$ and $L^2(\mathbb{P}_n)$ norms over the class $\MyBigClass$ of sparse additive models with each univariate function $f^*_j$ in the unit ball of a univariate RKHS with bounded kernel function. We complement our upper bounds by deriving minimax lower bounds on the $L^2(\mathbb{P})$ error, thereby showing the optimality of our method. Thus, we obtain optimal minimax rates for many interesting classes of sparse additive models, including polynomials, splines, and Sobolev classes. We also show that if, in contrast to our univariate conditions, the multivariate function class is assumed to be globally bounded, then much faster estimation rates are possible for any sparsity $s = \Omega(\sqrt{n})$, showing that global boundedness is a significant restriction in the high-dimensional setting.

Article
We consider the problem of learning causal directed acyclic graphs from an observational joint distribution. One can use these graphs to predict the outcome of interventional experiments, from which data are often not available. We show that if the observational distribution follows a structural equation model with an additive noise structure, the directed acyclic graph becomes identifiable from the distribution under mild conditions. This constitutes an interesting alternative to traditional methods that assume faithfulness and identify the Markov equivalence class of the graph, i.e., leaving some edges undirected. We provide practical algorithms for finitely many samples and provide an empirical evaluation.

Article
We present a new bandit algorithm, SAO (Stochastic and Adversarial Optimal), whose regret is, essentially, optimal both for adversarial rewards and for stochastic rewards. Specifically, SAO combines the square-root worst-case regret of Exp3 (Auer et al., SIAM J. on Computing 2002) and the (poly)logarithmic regret of UCB1 (Auer et al., Machine Learning 2002) for stochastic rewards. Adversarial rewards and stochastic rewards are the two main settings in the literature on (non-Bayesian) multi-armed bandits. Prior work on multi-armed bandits treats them separately, and does not attempt to jointly optimize for both. Our result falls into a general theme of achieving good worst-case performance while also taking advantage of "nice" problem instances, an important issue in the design of algorithms with partially known inputs.

Article
We consider an original problem that arises from the issue of security analysis of a power system and that we name optimal discovery with probabilistic expert advice. We address it with an algorithm based on the optimistic paradigm and on the Good-Turing missing mass estimator. We prove two different regret bounds on the performance of this algorithm under weak assumptions on the probabilistic experts. Under more restrictive hypotheses, we also prove a macroscopic optimality result, comparing the algorithm both with an oracle strategy and with uniform sampling. Finally, we provide numerical experiments illustrating these theoretical findings.

Article
In this technical report I show that the Brier game of prediction is perfectly mixable and find the optimal learning rate and substitution function for it. These results are straightforward, but the computations are surprisingly messy. 1 Loss bound A game of prediction consists of three components: the observation space Ω, the decision space Γ, and the loss function λ: Ω × Γ → R. In this note we are interested in the following Brier game [1]: Ω is a finite and non-empty set, Γ: = P(Ω) is the set of all probability measures on Ω, and λ(ω, γ) = ∑ (γ{o} − δω{o}) 2, o∈Ω where δω ∈ P(Ω) is the probability measure concentrated at ω: δω{ω} = 1 and δω{o} = 0 for o ̸ = ω. The Brier game is being played repeatedly by a learner having access to decisions made by a pool of experts, which leads to the following prediction protocol: Prediction with expert advice for the Brier game

Article
We study the problem of rank aggregation: given a set of ranked lists, we want to form a consensus ranking. Furthermore, we consider the case of extreme lists: i.e., only the rank of the best or worst elements are known. We impute missing ranks by the average value and generalise Spearman's \rho to extreme ranks. Our main contribution is the derivation of a non-parametric estimator for rank aggregation based on multivariate extensions of Spearman's \rho, which measures correlation between a set of ranked lists. Multivariate Spearman's \rho is defined using copulas, and we show that the geometric mean of normalised ranks maximises multivariate correlation. Motivated by this, we propose a weighted geometric mean approach for learning to rank which has a closed form least squares solution. When only the best or worst elements of a ranked list are known, we impute the missing ranks by the average value, allowing us to apply Spearman's \rho. Finally, we demonstrate good performance on the rank aggregation benchmarks MQ2007 and MQ2008.

Article
In this paper, we consider the problem of "hyper-sparse aggregation". Namely, given a dictionary $F = \{f_1, ..., f_M \}$ of functions, we look for an optimal aggregation algorithm that writes $\tilde f = \sum_{j=1}^M \theta_j f_j$ with as many zero coefficients $\theta_j$ as possible. This problem is of particular interest when $F$ contains many irrelevant functions that should not appear in $\tilde{f}$. We provide an exact oracle inequality for $\tilde f$, where only two coefficients are non-zero, that entails $\tilde f$ to be an optimal aggregation algorithm. Since selectors are suboptimal aggregation procedures, this proves that 2 is the minimal number of elements of $F$ required for the construction of an optimal aggregation procedures in every situations. A simulated example of this algorithm is proposed on a dictionary obtained using LARS, for the problem of selection of the regularization parameter of the LASSO. We also give an example of use of aggregation to achieve minimax adaptation over anisotropic Besov spaces, which was not previously known in minimax theory (in regression on a random design). Comment: 33 pages

Article
In this paper we prove the optimality of an aggregation procedure. We prove lower bounds for aggregation of model selection type of $M$ density estimators for the Kullback-Leiber divergence (KL), the Hellinger's distance and the $L\_1$-distance. The lower bound, with respect to the KL distance, can be achieved by the on-line type estimate suggested, among others, by Yang (2000). Combining these results, we state that $\log M/n$ is an optimal rate of aggregation in the sense of Tsybakov (2003), where $n$ is the sample size.

Article
We study pool-based active learning of half-spaces. We revisit the aggressive approach for active learning in the realizable case, and show that it can be made efficient and practical, while also having theoretical guarantees under reasonable assumptions. We further show, both theoretically and experimentally, that it can be preferable to mellow approaches. Our efficient aggressive active learner of half-spaces has formal approximation guarantees that hold when the pool is separable with a margin. While our analysis is focused on the realizable setting, we show that a simple heuristic allows using the same algorithm successfully for pools with low error as well. We further compare the aggressive approach to the mellow approach, and prove that there are cases in which the aggressive approach results in significantly better label complexity compared to the mellow approach. We demonstrate experimentally that substantial improvements in label complexity can be achieved using the aggressive approach, for both realizable and low-error settings.

Article
Online Passive-Aggressive (PA) learning is an effective framework for performing max-margin online learning. But the deterministic formulation and estimated single large-margin model could limit its capability in discovering descriptive structures underlying complex data. This pa- per presents online Bayesian Passive-Aggressive (BayesPA) learning, which subsumes the online PA and extends naturally to incorporate latent variables and perform nonparametric Bayesian inference, thus providing great flexibility for explorative analysis. We apply BayesPA to topic modeling and derive efficient online learning algorithms for max-margin topic models. We further develop nonparametric methods to resolve the number of topics. Experimental results on real datasets show that our approaches significantly improve time efficiency while maintaining comparable results with the batch counterparts.

Top-cited authors
• Indian Institute of Technology Roorkee
• University of California, Berkeley
• Université de Montréal
• Université de Montréal