Bayesian Nonparametric Hidden Markov Models with application to the analysis of copy-number-variation in mammalian genomes.

Department of Statistics and the Oxford-Man Institute for Quantitative Finance, University of Oxford.
Journal of the Royal Statistical Society Series B (Statistical Methodology) (Impact Factor: 4.81). 01/2011; 73(1):37-57. DOI: 10.1111/j.1467-9868.2010.00756.x
Source: PubMed

ABSTRACT We consider the development of Bayesian Nonparametric methods for product partition models such as Hidden Markov Models and change point models. Our approach uses a Mixture of Dirichlet Process (MDP) model for the unknown sampling distribution (likelihood) of the observations arising in each state, together with a computationally efficient data augmentation scheme to aid inference. The method uses novel MCMC methodology that combines recent retrospective sampling methods with slice sampler variables. The methodology is computationally efficient, both in terms of MCMC mixing properties and in its robustness to the length of the time series being investigated. Moreover, the method is easy to implement, requiring little or no user interaction. We apply our methodology to the analysis of genomic copy number variation.
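The abstract does not spell out the sampler, so the following is only a minimal sketch of the general idea behind combining slice variables with retrospective sampling for Dirichlet process stick-breaking weights; it is not the authors' algorithm. The function name `active_components`, its parameters, and the concentration value `alpha` are assumptions made for illustration. Given a slice level `u`, new sticks are broken only until the leftover mass drops below `u`, so the infinite mixture reduces to a finite set of components whose weight exceeds `u`.

```python
import random


def active_components(u, alpha, rng, sticks=None):
    """Extend stick-breaking weights until the leftover mass is below u.

    Hypothetical helper for illustration only.
    Returns (weights, active), where active lists the indices k
    with weights[k] > u.
    """
    if not 0.0 < u < 1.0:
        raise ValueError("slice level u must lie strictly between 0 and 1")
    weights = list(sticks) if sticks else []
    remaining = 1.0 - sum(weights)
    # Retrospective step: break a new stick only while an as-yet-unseen
    # component could still carry weight above the slice level u.
    while remaining >= u:
        v = rng.betavariate(1.0, alpha)  # V_k ~ Beta(1, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    # Slice step: only components with weight above u stay in play.
    active = [k for k, w in enumerate(weights) if w > u]
    return weights, active


rng = random.Random(0)
weights, active = active_components(u=0.05, alpha=1.0, rng=rng)
print(len(weights), active)
```

In a full sampler each observation would carry its own slice variable, drawn uniformly below the weight of its currently allocated component, and the allocation update would then range over the finite `active` set only.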

  • Source
    ABSTRACT: It has become common for data sets to contain large numbers of variables in studies conducted in areas such as genetics, machine vision, image analysis and many others. When analyzing such data, parametric models are often too inflexible while nonparametric procedures tend to be non-robust because of insufficient data on these high dimensional spaces. This is particularly true when interest lies in building efficient classifiers in the presence of many predictor variables. When dealing with these types of data, it is often the case that most of the variability tends to lie along a few directions, or more generally along a much lower-dimensional submanifold of the data space. In this article, we propose a class of models that flexibly learn about this submanifold while simultaneously performing dimension reduction in classification. This methodology allows the cell probabilities to vary nonparametrically based on a few coordinates expressed as linear combinations of the predictors. Also, as opposed to many black-box methods for dimensionality reduction, the proposed model is appealing in having clearly interpretable and identifiable parameters which provide insight into which predictors are important in determining accurate classification boundaries. Gibbs sampling methods are developed for posterior computation, and the methods are illustrated using simulated and real data applications.
    Journal of the American Statistical Association 03/2013; 108(501):187-201. · 1.83 Impact Factor
  • Source
    ABSTRACT: Copy number variants (CNVs) may play an important part in the development of common birth defects such as oral clefts, and individual patients with multiple birth defects (including clefts) have been shown to carry small and large chromosomal deletions. In this paper we investigate de novo deletions defined as DNA segments missing in an oral cleft proband but present in both unaffected parents. We compare de novo deletion frequencies in children of European ancestry with an isolated, nonsyndromic oral cleft to frequencies in European ancestry children from randomly sampled trios. We identified a genome-wide significant 62 kilobase (kb) non-coding region on chromosome 7p14.1 where de novo deletions occur more frequently among oral cleft cases than controls. We also observed wider de novo deletions among cleft lip palate (CLP) cases than seen among cleft palate (CP) and cleft lip (CL) cases. This study presents a region where de novo deletions appear to be involved in the etiology of oral clefts, although the underlying biological mechanisms are still unknown. Larger de novo deletions are more likely to interfere with normal craniofacial development and may result in more severe clefts. Study protocol and sample DNA source can severely affect estimates of de novo deletion frequencies. Follow-up studies are needed to further validate these findings and to potentially identify additional structural variants underlying oral clefts.
    BMC Genetics 02/2014; 15(1):24. · 2.81 Impact Factor
  • Source
    ABSTRACT: BACKGROUND: In studies of case-parent trios, we define copy number variants (CNVs) in the offspring that differ from the parental copy numbers as de novo and of interest for their potential functional role in disease. Among the leading array-based methods for discovery of de novo CNVs in case-parent trios is the joint hidden Markov model (HMM) implemented in the PennCNV software. However, the computational demands of the joint HMM are substantial and the extent to which false positive identifications occur in case-parent trios has not been well described. We evaluate these issues in a study of oral cleft case-parent trios. RESULTS: Our analysis of the oral cleft trios reveals that genomic waves represent a substantial source of false positive identifications in the joint HMM, despite a wave-correction implementation in PennCNV. In addition, the noise of low-level summaries of relative copy number (log R ratios) is strongly associated with batch and correlated with the frequency of de novo CNV calls. Exploiting the trio design, we propose a univariate statistic for relative copy number referred to as the minimum distance that can reduce technical variation from probe effects and genomic waves. We use circular binary segmentation to segment the minimum distance and maximum a posteriori estimation to infer de novo CNVs from the segmented genome. Compared to PennCNV on simulated data, MinimumDistance identifies fewer false positives on average and is comparable to PennCNV with respect to false negatives. Genomic waves contribute to discordance of PennCNV and MinimumDistance for high coverage de novo calls, while highly concordant calls on chromosome 22 were validated by quantitative PCR. Computationally, MinimumDistance provides a nearly 8-fold increase in speed relative to the joint HMM in a study of oral cleft trios.
    CONCLUSIONS: Our results indicate that batch effects and genomic waves are important considerations for case-parent studies of de novo CNV, and that the minimum distance is an effective statistic for reducing technical variation contributing to false de novo discoveries. Coupled with segmentation and maximum a posteriori estimation, our algorithm compares favorably to the joint HMM, with MinimumDistance being much faster.
    BMC Bioinformatics 12/2012; 13(1):330. · 3.02 Impact Factor

