Article

Bayesian Nonparametric Hidden Markov Models with application to the analysis of copy-number-variation in mammalian genomes.

Department of Statistics and the Oxford-Man Institute for Quantitative Finance, University of Oxford.
Journal of the Royal Statistical Society Series B (Statistical Methodology) (Impact Factor: 5.72). 01/2011; 73(1):37-57. DOI: 10.1111/j.1467-9868.2010.00756.x
Source: PubMed

ABSTRACT We consider the development of Bayesian nonparametric methods for product partition models such as hidden Markov models and change point models. Our approach uses a Mixture of Dirichlet Process (MDP) model for the unknown sampling distribution (likelihood) of the observations arising in each state, together with a computationally efficient data augmentation scheme to aid inference. The method uses novel MCMC methodology that combines recent retrospective sampling methods with slice sampler variables. The methodology is computationally efficient, both in terms of MCMC mixing properties and in its robustness to the length of the time series being investigated. Moreover, the method is easy to implement, requiring little or no user interaction. We apply our methodology to the analysis of genomic copy number variation.
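The abstract mentions combining retrospective sampling with slice sampler variables but gives no algorithmic detail. As a rough illustration of that general idea only, and not the authors' algorithm, the sketch below uses the standard stick-breaking representation of a Dirichlet process: weights are generated lazily ("retrospectively") until the leftover mass falls below the smallest slice variable, so each observation only ever considers a finite set of components. The function names, the concentration parameter `alpha`, and the example slice values are all assumptions made for this illustration.

```python
import random


def extend_sticks(weights, alpha, u_min, rng):
    """Lazily extend stick-breaking weights (illustrative helper, hypothetical name).

    New weights are appended until the unallocated mass is below u_min,
    the smallest slice variable (must be > 0). After that, no component
    that has not yet been generated can have a weight exceeding any slice
    variable.
    """
    weights = list(weights)
    remaining = 1.0 - sum(weights)
    while remaining >= u_min:
        v = rng.betavariate(1.0, alpha)  # stick proportion V_k ~ Beta(1, alpha)
        weights.append(v * remaining)    # w_k = V_k * prod_{j<k} (1 - V_j)
        remaining *= 1.0 - v
    return weights, remaining


def active_components(weights, u):
    """Indices of components whose weight exceeds the slice variable u."""
    return [k for k, w in enumerate(weights) if w > u]


if __name__ == "__main__":
    rng = random.Random(0)
    slices = [0.05, 0.20, 0.01]  # assumed slice variables, one per observation
    weights, remaining = extend_sticks([], alpha=1.0, u_min=min(slices), rng=rng)
    for u in slices:
        print(u, active_components(weights, u))
```

In a full sampler the slice variables would themselves be redrawn at each iteration given the current allocations; that step, and the hidden Markov structure over states, are omitted here.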

  • Source
    ABSTRACT: Latent class models (LCMs) are used increasingly for addressing a broad variety of problems, including sparse modeling of multivariate and longitudinal data, model-based clustering, and flexible inferences on predictor effects. Typical frequentist LCMs require estimation of a single finite number of classes, which does not increase with the sample size, and have a well-known sensitivity to parametric assumptions on the distributions within a class. Bayesian nonparametric methods have been developed to allow an infinite number of classes in the general population, with the number represented in a sample increasing with sample size. In this article, we propose a new nonparametric Bayes model that allows predictors to flexibly impact the allocation to latent classes, while limiting sensitivity to parametric assumptions by allowing class-specific distributions to be unknown subject to a stochastic ordering constraint. An efficient MCMC algorithm is developed for posterior computation. The methods are validated using simulation studies and applied to the problem of ranking medical procedures in terms of the distribution of patient morbidity.
    Journal of the American Statistical Association 09/2011; 106(495):807-817. DOI:10.1198/jasa.2011.ap10058 · 2.11 Impact Factor
  • Source
    ABSTRACT: It is routine in many fields to collect data having a variety of measurement scales and supports. For example, in biomedical studies for each patient one may collect functional data on a biomarker over time, gene expression values normalized to lie on a hypersphere to remove artifacts, clinical and demographic covariates and a health outcome. In such settings, one typically wants to build a predictive model for the health outcome, while also obtaining some information about the important predictors. A typical strategy defines a parametric model for the response conditionally on the predictors, with linear regression for continuous responses and logistic regression for categorical responses. However, parametric assumptions are typically based on convenience and do not represent real prior knowledge. From a Bayesian perspective, it is most appropriate to define a prior with large support allowing the conditional distribution of the response given predictors to be unknown and changing flexibly across the predictor space not just in the mean but also in the variance and shape. Building on earlier work on Dirichlet process mixtures, we describe a simple and general strategy for inducing models for conditional distributions through product kernel models for joint distributions of predictors and response variables. This approach is contrasted to approaches that avoid modeling of the distribution of the predictors.
  • Source
    ABSTRACT: Bayesian nonparametric methods are useful for modeling data without having to define the complexity of the entire model a priori, instead allowing this complexity to be determined by the data. In this dissertation we consider novel nonparametric Bayes models for high-dimensional and sparse data. The flexibility of Bayesian nonparametric priors arises from the prior's definition over an infinite-dimensional parameter space. In theory, therefore, there are infinitely many latent components and latent factors. Nevertheless, draws from each respective prior produce only a small number of components or factors that appear in a given data set. The number of these components and factors, and their corresponding parameter values, are left for the data to decide. This dissertation is divided into four parts, which motivate novel Bayesian nonparametric methods and illustrate their utility.
    Chapter 1: We review the Dirichlet process (DP) in detail. There are many other approaches to nonparametric modeling, but given the availability of efficient computation and a well-developed theory, the DP is the most popular and has been developed and studied extensively. We also review the most recent developments of the DP in this chapter.
    Chapter 2: We propose the multiple Bayesian elastic net (abbreviated as MBEN), a new regularization and variable selection method. High-dimensional and highly correlated data are commonplace. In such situations, maximum likelihood procedures typically fail: their estimates are unstable and have large variance. To address this problem, a number of shrinkage methods have been proposed, including ridge regression, the lasso and the elastic net; these methods encourage coefficients to be near zero (in fact, the lasso and the elastic net perform variable selection by forcing some regression coefficients to equal zero). In this paper we describe a semiparametric approach that allows shrinkage to multiple locations, where the location and scale parameters are assigned Dirichlet process hyperpriors. The MBEN prior encourages variables to cluster, so that strongly correlated predictors tend to be in or out of the model together. We apply the MBEN prior to a multi-task learning (MTL) problem, using text data from Wikipedia. An efficient MCMC algorithm and an automated Monte Carlo EM algorithm enable fast computation in high dimensions. The methods are applied to Wikipedia data using shared words to predict article links.
    Chapter 3: Latent class models (LCMs) are used increasingly for addressing a broad variety of problems, including sparse modeling of multivariate and longitudinal data, model-based clustering, and flexible inferences on predictor effects. Typical frequentist LCMs require estimation of a single finite number of classes, which does not increase with the sample size, and have a well-known sensitivity to parametric assumptions on the distributions within a class. Bayesian nonparametric methods have been developed to allow an infinite number of classes in the general population, with the number represented in a sample increasing with sample size. In this article, we propose a new nonparametric Bayes model that allows predictors to flexibly impact the allocation to latent classes, while limiting sensitivity to parametric assumptions by allowing class-specific distributions to be unknown subject to a stochastic ordering constraint. An efficient MCMC algorithm is developed for posterior computation. The methods are validated using simulation studies and applied to the problem of ranking medical procedures in terms of the distribution of patient morbidity.
    Chapter 4: In studies involving multi-level data structures, problems of data sparsity are often encountered and it becomes necessary to borrow information to improve inferences and predictions. This article is motivated by studies collecting data on different outcomes following congenital heart surgery. If there were sufficient numbers of patients receiving each type of procedure, one could potentially fit a procedure-specific multivariate random effects model to relate the outcomes of surgery to patient predictors while allowing variability among hospitals. However, as there are approximately 150 procedures, many of them conducted on few patients, it is important to borrow information. Allowing variability among hospitals, procedures and outcome types in the regression coefficients relating patient factors to outcomes, we obtain a three-way tensor of regression coefficient vectors. To borrow information in estimating these coefficients, we propose a Bayesian multiway tensor co-clustering model. In particular, the model works by reducing the dimension of the table through separately clustering hospitals, procedures and outcome types. This soft probabilistic clustering proceeds via nonparametric Bayesian latent class models, which favor clustering of dimensions that have similar values for feature vectors. Efficient MCMC and fast approximation approaches are proposed for posterior computation. The methods are illustrated using simulated data, and applied to heart surgery outcome data from a Duke study.
    Dissertation
