Bayesian Nonparametric Hidden Markov Models with application to the analysis of copy-number-variation in mammalian genomes

Department of Statistics and the Oxford-Man Institute for Quantitative Finance, University of Oxford.
Journal of the Royal Statistical Society Series B (Statistical Methodology) (Impact Factor: 3.52). 01/2011; 73(1):37-57. DOI: 10.1111/j.1467-9868.2010.00756.x
Source: PubMed


We consider the development of Bayesian Nonparametric methods for product partition models such as Hidden Markov Models and change point models. Our approach uses a Mixture of Dirichlet Process (MDP) model for the unknown sampling distribution (likelihood) of the observations arising in each state, together with a computationally efficient data augmentation scheme to aid inference. The method uses novel MCMC methodology that combines recent retrospective sampling methods with the use of slice sampler variables. The methodology is computationally efficient, both in terms of MCMC mixing properties and in its robustness to the length of the time series being investigated. Moreover, the method is easy to implement, requiring little or no user interaction. We apply our methodology to the analysis of genomic copy number variation.
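
The idea of pairing slice variables with retrospective sampling can be illustrated with a small sketch. This is not the authors' algorithm, only a generic illustration under stated assumptions: the mixture weights follow the standard stick-breaking construction of a Dirichlet process with concentration `alpha`, and each observation carries a slice variable in (0, 1). Sticks are drawn lazily until the unallocated mass falls below the smallest slice variable `u_min`, at which point no undrawn component can exceed any slice variable, so a finite set of components suffices. The function names `stick_breaking_until` and `active_components` are hypothetical.

```python
import random


def stick_breaking_until(alpha, u_min, rng):
    """Lazily draw stick-breaking weights until the leftover mass is below u_min.

    alpha : Dirichlet process concentration parameter (assumed > 0).
    u_min : smallest slice variable over all observations, in (0, 1).
    Returns the list of drawn weights and the unallocated mass.
    """
    if not 0.0 < u_min < 1.0:
        raise ValueError("u_min must lie strictly between 0 and 1")
    weights = []
    remaining = 1.0  # mass not yet assigned to any component
    while remaining >= u_min:
        v = rng.betavariate(1.0, alpha)  # stick proportion V_k ~ Beta(1, alpha)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    return weights, remaining


def active_components(weights, u):
    """Indices of components whose weight exceeds the slice variable u."""
    return [k for k, w in enumerate(weights) if w > u]


if __name__ == "__main__":
    rng = random.Random(0)
    weights, leftover = stick_breaking_until(alpha=1.0, u_min=0.01, rng=rng)
    print(len(weights), "components drawn; leftover mass", leftover)
    print("candidates for an observation with u = 0.05:",
          active_components(weights, 0.05))
```

Every undrawn weight is bounded by the leftover mass, so stopping at `remaining < u_min` loses nothing for the current sweep. The number of components grows only as the slice variables demand.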

Available from: Omiros Papaspiliopoulos, Aug 17, 2015
    • "Estimates of the relative copy number (log R ratios) and B allele frequencies measured at each marker on the array are mutually informative for the latent copy number [11]. Various hidden Markov model (HMM) implementations integrate the log R ratios and B allele frequencies to infer copy number [12-19]. Copy number estimation is challenging, in part, due to technical artifacts that contribute to false positives. "
    ABSTRACT: Background: Hyperuricemia is associated with multiple diseases, including gout, cardiovascular disease, and renal disease. Serum urate is highly heritable, yet association studies of single nucleotide polymorphisms (SNPs) and serum uric acid explain a small fraction of the heritability. Whether copy number polymorphisms (CNPs) contribute to uric acid levels is unknown. Results: We assessed copy number on a genome-wide scale among 8,411 individuals of European ancestry (EA) who participated in the Atherosclerosis Risk in Communities (ARIC) study. CNPs upstream of the urate transporter SLC2A9 on chromosome 4p16.1 are associated with uric acid (χ² (2 df) = 3545, p = 3.19×10⁻²³). Effect sizes, expressed as the percentage change in uric acid per deleted copy, are most pronounced among women (median 4.93, 2.5th and 97.5th percentiles 3.97 and 5.87, p = 4.57×10⁻²³) and independent of previously reported SNPs in SLC2A9 as assessed by SNP and CNP regression models and the phasing of SNP and CNP haplotypes (χ² (2 df) = 3190, p = 7.23×10⁻⁸). Our finding is replicated in the Framingham Heart Study (FHS), where the effect size estimated from 4,089 women is comparable to ARIC in direction and magnitude (median 4.70, 2.5th and 97.5th percentiles 1.41 and 7.88, p = 5.46×10⁻³). Conclusions: This is the first study to characterize CNPs in ARIC and the first genome-wide analysis of CNPs and uric acid. Our findings suggest a novel, non-coding regulatory mechanism for SLC2A9-mediated modulation of serum uric acid, and detail a bioinformatic approach for assessing the contribution of CNPs to heritable traits in large population-based studies where technical sources of variation are substantial.
    BMC Genetics 07/2014; 15(1):81. DOI:10.1186/1471-2156-15-81 · 2.40 Impact Factor
    • "case-control studies) to assess associations between copy number variants (CNVs) and a disorder of interest [28-32], since investigating parents and offspring simultaneously enables the researcher to infer structural variants that occur de novo in the offspring (typically through a germline deletion). However, while numerous methods for CNV delineation in individual samples [33-38] or multiple independent samples [39-42] are available, only relatively few statistical approaches for detecting de novo CNVs have been proposed, and these are limited to offspring-parent trios. PennCNV [43] is based on a hidden Markov model (HMM), jointly modeling the unknown copy number states in all three trio members. "
    ABSTRACT: Copy number variants (CNVs) may play an important part in the development of common birth defects such as oral clefts, and individual patients with multiple birth defects (including clefts) have been shown to carry small and large chromosomal deletions. In this paper we investigate de novo deletions defined as DNA segments missing in an oral cleft proband but present in both unaffected parents. We compare de novo deletion frequencies in children of European ancestry with an isolated, nonsyndromic oral cleft to frequencies in European ancestry children from randomly sampled trios. We identified a genome-wide significant 62 kilobase (kb) non-coding region on chromosome 7p14.1 where de novo deletions occur more frequently among oral cleft cases than controls. We also observed wider de novo deletions among cleft lip palate (CLP) cases than seen among cleft palate (CP) and cleft lip (CL) cases. This study presents a region where de novo deletions appear to be involved in the etiology of oral clefts, although the underlying biological mechanisms are still unknown. Larger de novo deletions are more likely to interfere with normal craniofacial development and may result in more severe clefts. Study protocol and sample DNA source can severely affect estimates of de novo deletion frequencies. Follow-up studies are needed to further validate these findings and to potentially identify additional structural variants underlying oral clefts.
    BMC Genetics 02/2014; 15(1):24. DOI:10.1186/1471-2156-15-24 · 2.40 Impact Factor
    • "For example, distance-based transition probabilities [6], fully Bayesian HMMs [23], reversible jump and approximate sampling Markov chain Monte Carlo (MCMC) [24,25], iterative approaches to parameter estimation [26], alternatives to the Viterbi algorithm [27], and higher order Markov chains [28]. As HMMs readily accommodate multiple data sequences, the observation that copy number can be estimated from genotyping arrays [29] led to the development of several HMMs that jointly model copy number and genotypes at SNPs [30-37]. "
    ABSTRACT: Background: In studies of case-parent trios, we define copy number variants (CNVs) in the offspring that differ from the parental copy numbers as de novo and of interest for their potential functional role in disease. Among the leading array-based methods for discovery of de novo CNVs in case-parent trios is the joint hidden Markov model (HMM) implemented in the PennCNV software. However, the computational demands of the joint HMM are substantial and the extent to which false positive identifications occur in case-parent trios has not been well described. We evaluate these issues in a study of oral cleft case-parent trios. Results: Our analysis of the oral cleft trios reveals that genomic waves represent a substantial source of false positive identifications in the joint HMM, despite a wave-correction implementation in PennCNV. In addition, the noise of low-level summaries of relative copy number (log R ratios) is strongly associated with batch and correlated with the frequency of de novo CNV calls. Exploiting the trio design, we propose a univariate statistic for relative copy number referred to as the minimum distance that can reduce technical variation from probe effects and genomic waves. We use circular binary segmentation to segment the minimum distance and maximum a posteriori estimation to infer de novo CNVs from the segmented genome. Compared to PennCNV on simulated data, MinimumDistance identifies fewer false positives on average and is comparable to PennCNV with respect to false negatives. Genomic waves contribute to discordance of PennCNV and MinimumDistance for high coverage de novo calls, while highly concordant calls on chromosome 22 were validated by quantitative PCR. Computationally, MinimumDistance provides a nearly 8-fold increase in speed relative to the joint HMM in a study of oral cleft trios. Conclusions: Our results indicate that batch effects and genomic waves are important considerations for case-parent studies of de novo CNV, and that the minimum distance is an effective statistic for reducing technical variation contributing to false de novo discoveries. Coupled with segmentation and maximum a posteriori estimation, our algorithm compares favorably to the joint HMM with MinimumDistance being much faster.
    BMC Bioinformatics 12/2012; 13(1):330. DOI:10.1186/1471-2105-13-330 · 2.58 Impact Factor