Zoubin Ghahramani

University of Cambridge, Cambridge, ENG, United Kingdom

Are you Zoubin Ghahramani?

Claim your profile

Publications (257)275.85 Total impact

  • [Show abstract] [Hide abstract]
    ABSTRACT: We define the beta diffusion tree, a random tree structure with a set of leaves that defines a collection of overlapping subsets of objects, known as a feature allocation. A generative process for the tree structure is defined in terms of particles (representing the objects) diffusing in some continuous space, analogously to the Dirichlet diffusion tree (Neal, 2003b), which defines a tree structure over partitions (i.e., non-overlapping subsets) of the objects. Unlike in the Dirichlet diffusion tree, multiple copies of a particle may exist and diffuse along multiple branches in the beta diffusion tree, and an object may therefore belong to multiple subsets of particles. We demonstrate how to build a hierarchically-clustered factor analysis model with the beta diffusion tree and how to perform inference over the random tree structures with a Markov chain Monte Carlo algorithm. We conclude with several numerical experiments on missing data problems with data sets of gene expression microarrays, international development statistics, and intranational socioeconomic measurements.
    08/2014;
  • Source
    Tomoharu Iwata, David Duvenaud, Zoubin Ghahramani
    [Show abstract] [Hide abstract]
    ABSTRACT: A mixture of Gaussians fit to a single curved or heavy-tailed cluster will report that the data contains many clusters. To produce more appropriate clusterings, we introduce a model which warps a latent mixture of Gaussians to produce nonparametric cluster shapes. The possibly low-dimensional latent mixture model allows us to summarize the properties of the high-dimensional clusters (or density manifolds) describing the data. The number of manifolds, as well as the shape and dimension of each manifold is automatically inferred. We derive a simple inference scheme for this model which analytically integrates out both the mixture parameters and the warping function. We show that our model is effective for density estimation, performs better than infinite Gaussian mixture models at recovering the true number of clusters, and produces interpretable summaries of high-dimensional datasets.
    08/2014;
  • [Show abstract] [Hide abstract]
    ABSTRACT: We propose a novel information-theoretic approach for Bayesian optimization called Predictive Entropy Search (PES). At each iteration, PES selects the next evaluation point that maximizes the expected information gained with respect to the global maximum. PES codifies this intractable acquisition function in terms of the expected reduction in the differential entropy of the predictive distribution. This reformulation allows PES to obtain approximations that are both more accurate and efficient than other alternatives such as Entropy Search (ES). Furthermore, PES can easily perform a fully Bayesian treatment of the model hyperparameters while ES cannot. We evaluate PES in both synthetic and real-world applications, including optimization problems in machine learning, finance, biotechnology, and robotics. We show that the increased accuracy of PES leads to significant gains in optimization performance.
    06/2014;
  • Source
    John P. Cunningham, Zoubin Ghahramani
    [Show abstract] [Hide abstract]
    ABSTRACT: Linear dimensionality reduction methods are a cornerstone of analyzing high dimensional data, due to their simple geometric interpretations and typically attractive computational properties. These methods capture many data features of interest, such as covariance, dynamical structure, correlation between data sets, input-output relationships, and margin between data classes. Methods have been developed with a variety of names and motivations in many fields, and perhaps as a result the deeper connections between all these methods have not been understood. Here we unify methods from this disparate literature as optimization programs over matrix manifolds. We discuss principal component analysis, factor analysis, linear multidimensional scaling, Fisher's linear discriminant analysis, canonical correlations analysis, maximum autocorrelation factors, slow feature analysis, undercomplete independent component analysis, linear regression, and more. This optimization framework helps elucidate some rarely discussed shortcomings of well-known methods, such as the suboptimality of certain eigenvector solutions. Modern techniques for optimization over matrix manifolds enable a generic linear dimensionality reduction solver, which accepts as input data and an objective to be optimized, and returns, as output, an optimal low-dimensional projection of the data. This optimization framework further allows rapid development of novel variants of classical methods, which we demonstrate here by creating an orthogonal-projection canonical correlations analysis. More broadly, we suggest that our generic linear dimensionality reduction solver can move linear dimensionality reduction toward becoming a blackbox, objective-agnostic numerical technology.
    06/2014;
  • Source
    Konstantina Palla, David A. Knowles, Zoubin Ghahramani
    [Show abstract] [Hide abstract]
    ABSTRACT: We present a nonparametric prior over reversible Markov chains. We use completely random measures, specifically gamma processes, to construct a countably infinite graph with weighted edges. By enforcing symmetry to make the edges undirected we define a prior over random walks on graphs that results in a reversible Markov chain. The resulting prior over infinite transition matrices is closely related to the hierarchical Dirichlet process but enforces reversibility. A reinforcement scheme has recently been proposed with similar properties, but the de Finetti measure is not well characterised. We take the alternative approach of explicitly constructing the mixing measure, which allows more straightforward and efficient inference at the cost of no longer having a closed form predictive distribution. We use our process to construct a reversible infinite HMM which we apply to two real datasets, one from epigenomics and one ion channel recording.
    03/2014;
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Choosing appropriate architectures and regularization strategies for deep networks is crucial to good predictive performance. To shed light on this problem, we analyze the analogous problem of constructing useful priors on compositions of functions. Specifically, we study the deep Gaussian process, a type of infinitely-wide, deep neural network. We show that in standard architectures, the representational capacity of the network tends to capture fewer degrees of freedom as the number of layers increases, retaining only a single degree of freedom in the limit. We propose an alternate network architecture which does not suffer from this pathology. We also examine deep covariance functions, obtained by composing infinitely many feature transforms. Lastly, we characterize the class of models obtained by performing dropout on Gaussian processes.
    02/2014;
  • Source
    Alex Davies, Zoubin Ghahramani
    [Show abstract] [Hide abstract]
    ABSTRACT: We present Random Partition Kernels, a new class of kernels derived by demonstrating a natural connection between random partitions of objects and kernels between those objects. We show how the construction can be used to create kernels from methods that would not normally be viewed as random partitions, such as Random Forest. To demonstrate the potential of this method, we propose two new kernels, the Random Forest Kernel and the Fast Cluster Kernel, and show that these kernels consistently outperform standard kernels on problems involving real-world datasets. Finally, we show how the form of these kernels lend themselves to a natural approximation that is appropriate for certain big data problems, allowing $O(N)$ inference in methods such as Gaussian Processes, Support Vector Machines and Kernel PCA.
    02/2014;
  • Source
    Amar Shah, Andrew Gordon Wilson, Zoubin Ghahramani
    [Show abstract] [Hide abstract]
    ABSTRACT: We investigate the Student-t process as an alternative to the Gaussian process as a nonparametric prior over functions. We derive closed form expressions for the marginal likelihood and predictive distribution of a Student-t process, by integrating away an inverse Wishart process prior over the covariance kernel of a Gaussian process model. We show surprising equivalences between different hierarchical Gaussian process models leading to Student-t processes, and derive a new sampling scheme for the inverse Wishart process, which helps elucidate these equivalences. Overall, we show that a Student-t process can retain the attractive properties of a Gaussian process -- a nonparametric representation, analytic marginal and predictive distributions, and easy model selection through covariance kernels -- but has enhanced flexibility, and predictive covariances that, unlike a Gaussian process, explicitly depend on the values of training observations. We verify empirically that a Student-t process is especially useful in situations where there are changes in covariance structure, or in applications like Bayesian optimization, where accurate predictive covariances are critical for good performance. These advantages come at no additional computational cost over Gaussian processes.
    02/2014;
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: This paper presents the beginnings of an automatic statistician, focusing on regression problems. Our system explores an open-ended space of possible statistical models to discover a good explanation of the data, and then produces a detailed report with figures and natural-language text. Our approach treats unknown functions nonparametrically using Gaussian processes, which has two important consequences. First, Gaussian processes model functions in terms of high-level properties (e.g. smoothness, trends, periodicity, changepoints). Taken together with the compositional structure of our language of models, this allows us to automatically describe functions through a decomposition into additive parts. Second, the use of flexible nonparametric models and a rich language for composing them in an open-ended manner also results in state-of-the-art extrapolation performance evaluated over 13 real time series data sets from various domains.
    02/2014;
  • Source
    Yue Wu, Jose Miguel Hernandez Lobato, Zoubin Ghahramani
    [Show abstract] [Hide abstract]
    ABSTRACT: The accurate prediction of time-changing variances is an important task in the modeling of financial data. Standard econometric models are often limited as they assume rigid functional relationships for the variances. Moreover, function parameters are usually learned using maximum likelihood, which can lead to overfitting. To address these problems we introduce a novel model for time-changing variances using Gaussian Processes. A Gaussian Process (GP) defines a distribution over functions, which allows us to capture highly flexible functional relationships for the variances. In addition, we develop an online algorithm to perform inference. The algorithm has two main advantages. First, it takes a Bayesian approach, thereby avoiding overfitting. Second, it is much quicker than current offline inference procedures. Finally, our new model was evaluated on financial data and showed significant improvement in predictive performance over current standard models.
    02/2014;
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Classical techniques such as Principal Component Analysis (PCA) and Canonical Correlation Analysis (CCA) are ubiquitous in statistics. However, these techniques only reveal linear relationships in data. Although nonlinear variants of PCA and CCA have been proposed, they are computationally prohibitive in the large scale. In a separate strand of recent research, randomized methods have been proposed to construct features that help reveal nonlinear patterns in data. For basic tasks such as regression or classification, random features exhibit little or no loss in performance, while achieving dramatic savings in computational requirements. In this paper we leverage randomness to design scalable new variants of nonlinear PCA and CCA; our ideas also extend to key multivariate analysis tools such as spectral clustering or LDA. We demonstrate our algorithms through experiments on real-world data, on which we compare against the state-of-the-art. Code in R implementing our methods is provided in the Appendix.
    02/2014;
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: We propose a probabilistic model to infer supervised latent variables in the Hamming space from observed data. Our model allows simultaneous inference of the number of binary latent variables, and their values. The latent variables preserve neighbourhood structure of the data in a sense that objects in the same semantic concept have similar latent values, and objects in different concepts have dissimilar latent values. We formulate the supervised infinite latent variable problem based on an intuitive principle of pulling objects together if they are of the same type, and pushing them apart if they are not. We then combine this principle with a flexible Indian Buffet Process prior on the latent variables. We show that the inferred supervised latent variables can be directly used to perform a nearest neighbour search for the purpose of retrieval. We introduce a new application of dynamically extending hash codes, and show how to effectively couple the structure of the hash codes with continuously growing structure of the neighbourhood preserving infinite latent feature space.
    09/2013;
  • Source
    Amar Shah, Zoubin Ghahramani
    [Show abstract] [Hide abstract]
    ABSTRACT: Semi-supervised clustering is the task of clustering data points into clusters where only a fraction of the points are labelled. The true number of clusters in the data is often unknown and most models require this parameter as an input. Dirichlet process mixture models are appealing as they can infer the number of clusters from the data. However, these models do not deal with high dimensional data well and can encounter difficulties in inference. We present a novel nonparameteric Bayesian kernel based method to cluster data points without the need to prespecify the number of clusters or to model complicated densities from which data points are assumed to be generated from. The key insight is to use determinants of submatrices of a kernel matrix as a measure of how close together a set of points are. We explore some theoretical properties of the model and derive a natural Gibbs based algorithm with MCMC hyperparameter learning. The model is implemented on a variety of synthetic and real world data sets.
    09/2013;
  • Tomoharu Iwata, Amar Shah, Zoubin Ghahramani
    [Show abstract] [Hide abstract]
    ABSTRACT: Many people share their activities with others through online communities. These shared activities have an impact on other users' activities. For example, users are likely to become interested in items that are adopted (e.g. liked, bought and shared) by their friends. In this paper, we propose a probabilistic model for discovering latent influence from sequences of item adoption events. An inhomogeneous Poisson process is used for modeling a sequence, in which adoption by a user triggers the subsequent adoption of the same item by other users. For modeling adoption of multiple items, we employ multiple inhomogeneous Poisson processes, which share parameters, such as influence for each user and relations between users. The proposed model can be used for finding influential users, discovering relations between users and predicting item popularity in the future. We present an efficient Bayesian inference procedure of the proposed model based on the stochastic EM algorithm. The effectiveness of the proposed model is demonstrated by using real data sets in a social bookmark sharing service.
    Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining; 08/2013
  • Source
    Sebastien Bratieres, Novi Quadrianto, Zoubin Ghahramani
    [Show abstract] [Hide abstract]
    ABSTRACT: We introduce a conceptually novel structured prediction model, GPstruct, which is kernelized, non-parametric and Bayesian, by design. We motivate the model with respect to existing approaches, among others, conditional random fields (CRFs), maximum margin Markov networks (M3N), and structured support vector machines (SVMstruct), which embody only a subset of its properties. We present an inference procedure based on Markov Chain Monte Carlo. The framework can be instantiated for a wide range of structured objects such as linear chains, trees, grids, and other general graphs. As a proof of concept, the model is benchmarked on several natural language processing tasks and a video gesture segmentation task involving a linear chain structure. We show prediction accuracies for GPstruct which are comparable to or exceeding those of CRFs and SVMstruct.
    07/2013;
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: The accurate prediction of time-changing covariances is an important problem in the modeling of multivariate financial data. However, some of the most popular models suffer from a) overfitting problems and multiple local optima, b) failure to capture shifts in market conditions and c) large computational costs. To address these problems we introduce a novel dynamic model for time-changing covariances. Over-fitting and local optima are avoided by following a Bayesian approach instead of computing point estimates. Changes in market conditions are captured by assuming a diffusion process in parameter values, and finally computationally efficient and scalable inference is performed using particle filters. Experiments with financial data show excellent performance of the proposed method with respect to current standard models.
    05/2013;
  • Frederik Eaton, Zoubin Ghahramani
    [Show abstract] [Hide abstract]
    ABSTRACT: We offer a solution to the problem of efficiently translating algorithms between different types of discrete statistical model. We investigate the expressive power of three classes of model-those with binary variables, with pairwise factors, and with planar topology-as well as their four intersections. We formalize a notion of "simple reduction" for the problem of inferring marginal probabilities and consider whether it is possible to "simply reduce" marginal inference from general discrete factor graphs to factor graphs in each of these seven subclasses. We characterize the reducibility of each class, showing in particular that the class of binary pairwise factor graphs is able to simply reduce only positive models. We also exhibit a continuous "spectral reduction" based on polynomial interpolation, which overcomes this limitation. Experiments assess the performance of standard approximate inference algorithms on the outputs of our reductions.
    Neural Computation 05/2013; 25(5):1213-60. · 1.76 Impact Factor
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: We present a nonparametric Bayesian method for disease subtype discovery in multi-dimensional cancer data. Our method can simultaneously analyse a wide range of data types, allowing for both agreement and disagreement between their underlying clustering structure. It includes feature selection and infers the most likely number of disease subtypes, given the data. We apply the method to 277 glioblastoma samples from The Cancer Genome Atlas, for which there are gene expression, copy number variation, methylation and microRNA data. We identify 8 distinct consensus subtypes and study their prognostic value for death, new tumour events, progression and recurrence. The consensus subtypes are prognostic of tumour recurrence (log-rank p-value of $3.6 \times 10^{-4}$ after correction for multiple hypothesis tests). This is driven principally by the methylation data (log-rank p-value of $2.0 \times 10^{-3}$) but the effect is strengthened by the other 3 data types, demonstrating the value of integrating multiple data types. Of particular note is a subtype of 47 patients characterised by very low levels of methylation. This subtype has very low rates of tumour recurrence and no new events in 10 years of follow up. We also identify a small gene expression subtype of 6 patients that shows particularly poor survival outcomes. Additionally, we note a consensus subtype that showly a highly distinctive data signature and suggest that it is therefore a biologically distinct subtype of glioblastoma. The code is available from https://sites.google.com/site/multipledatafusion/
    04/2013;
  • Source
    Colorado Reed, Zoubin Ghahramani
    [Show abstract] [Hide abstract]
    ABSTRACT: Inference for latent feature models is inherently difficult as the inference space grows exponentially with the size of the input data and number of latent features. In this work, we use Kurihara & Welling (2008)'s maximization-expectation framework to perform approximate MAP inference for linear-Gaussian latent feature models with an Indian Buffet Process (IBP) prior. This formulation yields a submodular function of the features that corresponds to a lower bound on the model evidence. By adding a constant to this function, we obtain a nonnegative submodular function that can be maximized via a greedy algorithm that obtains at least a one-third approximation to the optimal solution. Our inference method scales linearly with the size of the input data, and we show the efficacy of our method on the largest datasets currently analyzed using an IBP model.
    04/2013;
  • Source
    Konstantina Palla, David A. Knowles, Zoubin Ghahramani
    [Show abstract] [Hide abstract]
    ABSTRACT: The fundamental aim of clustering algorithms is to partition data points. We consider tasks where the discovered partition is allowed to vary with some covariate such as space or time. One approach would be to use fragmentation-coagulation processes, but these, being Markov processes, are restricted to linear or tree structured covariate spaces. We define a partition-valued process on an arbitrary covariate space using Gaussian processes. We use the process to construct a multitask clustering model which partitions datapoints in a similar way across multiple data sources, and a time series model of network data which allows cluster assignments to vary over time. We describe sampling algorithms for inference and apply our method to defining cancer subtypes based on different types of cellular characteristics, finding regulatory modules from gene expression data from multiple human populations, and discovering time varying community structure in a social network.
    03/2013;

Publication Stats

11k Citations
275.85 Total Impact Points

Institutions

  • 2006–2013
    • University of Cambridge
      • Department of Engineering
      Cambridge, ENG, United Kingdom
  • 2009–2012
    • The University of Warwick
      • Department of Chemistry
      Coventry, England, United Kingdom
    • Ludwig-Maximilian-University of Munich
      München, Bavaria, Germany
  • 1998–2012
    • University College London
      • Institute of Neurology
      Londinium, England, United Kingdom
  • 2010
    • Max-Planck-Institut für Intelligente Systeme, Tübingen
      • Group of Machine Learning and Computational Biology
      Tübingen, Baden-Württemberg, Germany
    • Northeastern University
      • Department of Electrical and Computer Engineering
      Boston, MA, United States
    • Max Planck Institute for Biological Cybernetics
      Tübingen, Baden-Württemberg, Germany
  • 2008–2009
    • Carnegie Mellon University
      • Computer Science Department
      Pittsburgh, Pennsylvania, United States
  • 2006–2009
    • Pohang University of Science and Technology
      • Department of Computer Science and Engineering
      Andong, North Gyeongsang, South Korea
  • 2007
    • Keck Graduate Institute
      Claremont, California, United States
  • 2005
    • University at Buffalo, The State University of New York
      • Department of Computer Science and Engineering
      Buffalo, NY, United States
  • 1999–2005
    • Oxford Centre for Computational Neuroscience
      Oxford, England, United Kingdom
    • Princeton University
      • Department of Molecular Biology
      Princeton, NJ, United States
  • 1998–2003
    • NTT Communication Science Laboratories
      Kioto, Kyōto, Japan
  • 2002
    • McMaster University
      Hamilton, Ontario, Canada
  • 1995–2000
    • Massachusetts Institute of Technology
      • Department of Brain and Cognitive Sciences
      Cambridge, Massachusetts, United States
  • 1995–1997
    • University of Toronto
      • Department of Computer Science
      Toronto, Ontario, Canada