ABSTRACT: The detection of local genomic signals using high-throughput DNA sequencing
data can be cast as a problem of scanning a Poisson random field for local
changes in the rate of the process. We propose a likelihood-based framework for
such scans, and derive formulas for false positive rate control and power
calculations. The framework can also accommodate mixtures of Poisson processes
to deal with over-dispersion. As a specific, detailed example, we consider the
detection of insertions and deletions by paired-end DNA-sequencing. We propose
several statistics for this problem, compare their power under current
experimental designs, and illustrate their application on an Illumina Platinum
Genomes data set.
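
As a rough illustration of the scanning idea described in this abstract, the sketch below slides a fixed-width window over per-position read counts and scores each window with a one-sided Poisson log-likelihood ratio against a known background rate. The function names, the fixed window width, and the known constant `rate` are assumptions made for illustration; this is not the statistic proposed in the paper, which also handles over-dispersion and paired-end data.

```python
import math


def poisson_llr(count, expected):
    """One-sided log-likelihood ratio for an elevated Poisson rate in a window.

    Compares the maximum-likelihood rate (count) with the null mean (expected);
    windows at or below the null mean score zero.
    """
    if count <= expected:
        return 0.0
    return count * math.log(count / expected) - (count - expected)


def scan(counts, window, rate):
    """Return (best_start, best_llr) over all windows of the given width.

    `rate` is the assumed null rate per position (hypothetical input).
    """
    expected = window * rate
    best_start, best_llr = 0, 0.0
    running = sum(counts[:window])
    for start in range(len(counts) - window + 1):
        if start > 0:
            # Update the window total incrementally instead of re-summing.
            running += counts[start + window - 1] - counts[start - 1]
        llr = poisson_llr(running, expected)
        if llr > best_llr:
            best_start, best_llr = start, llr
    return best_start, best_llr
```

In practice the maximum score would be compared with a threshold chosen to control the false positive rate of the whole scan, which is what the formulas mentioned in the abstract address.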
ABSTRACT: Recombination events are not uniformly distributed and often cluster in narrow regions known as recombination hotspots. Several studies using different approaches have dramatically advanced our understanding of recombination hotspot regulation. Population genetic data have been used to map and quantify hotspots in the human genome. Genetic variation in recombination rates and hotspot usage has been explored in human pedigrees, mouse intercrosses, and by sperm typing. These studies pointed to the central role of the PRDM9 gene in hotspot modulation. In this study, we used single nucleotide polymorphisms (SNPs) from whole-genome resequencing and genotyping studies of mouse inbred strains to estimate recombination rates across the mouse genome and identified 47,068 historical hotspots, an average of over 2,477 per chromosome. We show by simulation that inbred mouse strains can be used to identify positions of historical hotspots. Recombination hotspots were found to be enriched for the predicted binding sequences for different alleles of the PRDM9 protein. Recombination rates were on average lower near transcription start sites (TSS). Comparing the inferred historical recombination hotspots with the recent genome-wide mapping of double-strand breaks (DSBs) in mouse sperm revealed a significant overlap, especially toward the telomeres. Our results suggest that inbred strains can be used to characterize and study the dynamics of historical recombination hotspots. They also strengthen previous findings on mouse recombination hotspots, and specifically the impact of sequence variants in Prdm9.
ABSTRACT: The false discovery rate is a criterion for controlling Type I error in simultaneous testing of multiple hypotheses. For scanning statistics, due to local dependence, clusters of neighbouring hypotheses are likely to be rejected together. In such situations, it is more intuitive and informative to group neighbouring rejections together and count them as a single discovery, with the false discovery rate defined as the proportion of clusters that are falsely declared among all declared clusters. Assuming that the number of false discoveries, under this broader definition of a discovery, is approximately Poisson and independent of the number of true discoveries, we examine approaches for estimating and controlling the false discovery rate, and provide examples from biological applications. Copyright 2011, Oxford University Press.
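
The cluster-level definition of a discovery in this abstract can be sketched as follows: neighbouring rejected positions are merged into clusters, and the false discovery rate is taken over clusters rather than individual hypotheses. The merging rule (`max_gap`), the function names, and the simple plug-in ratio are assumptions for illustration only; `expected_false_clusters` stands in for whatever estimate of the mean number of false clusters is available, for example a Poisson mean under the null, and is not the estimator studied in the paper.

```python
def cluster_rejections(positions, max_gap):
    """Merge rejected positions into clusters.

    Two consecutive rejections belong to the same cluster when they are at
    most `max_gap` apart. Returns a list of (first, last) position pairs.
    """
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][1] <= max_gap:
            clusters[-1][1] = pos  # extend the current cluster
        else:
            clusters.append([pos, pos])  # start a new cluster
    return [tuple(c) for c in clusters]


def estimated_cluster_fdr(expected_false_clusters, n_clusters):
    """Plug-in estimate: expected false clusters over declared clusters, capped at 1."""
    if n_clusters == 0:
        return 0.0
    return min(1.0, expected_false_clusters / n_clusters)
```

For example, six rejected positions that fall into three well-separated groups count as three discoveries, not six, when the rate is computed.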
ABSTRACT: Given a set of aligned sequences of independent noisy observations, we are
concerned with detecting intervals where the mean values of the observations
change simultaneously in a subset of the sequences. The intervals of changed
means are typically short relative to the length of the sequences, the subset
where the change occurs, the "carriers," can be relatively small, and the sizes
of the changes can vary from one sequence to another. This problem is motivated
by the scientific problem of detecting inherited copy number variants in
aligned DNA samples. We suggest a statistic based on the assumption that for
any given interval of changed means there is a given fraction of samples that
carry the change. We derive an analytic approximation for the false positive
error probability of a scan, which is shown by simulations to be reasonably
accurate. We show that the new method usually improves on methods that analyze
a single sample at a time and on our earlier multi-sample method, which is most
efficient when the carriers form a large fraction of the set of sequences. The
proposed procedure is also shown to be robust with respect to the assumed
fraction of carriers of the changes.
The Annals of Applied Statistics 08/2011; 5(2011). · 2.24 Impact Factor
ABSTRACT: Because of their somatic cell origin, human induced pluripotent stem cells (HiPSCs) are assumed to carry a normal diploid genome, and adaptive chromosomal aberrations have not been fully evaluated. Here, we analyzed the chromosomal integrity of 66 HiPSC and 38 human embryonic stem cell (HESC) samples from 18 different studies by global gene expression meta-analysis. We report identification of a substantial number of cell lines carrying full and partial chromosomal aberrations, half of which were validated at the DNA level. Several aberrations resulted from culture adaptation, and others are suspected to originate from the parent somatic cell. Our classification revealed a third type of aneuploidy already evident in early passage HiPSCs, suggesting considerable selective pressure during the reprogramming process. The analysis indicated high incidence of chromosome 12 duplications, resulting in significant enrichment for cell cycle-related genes. Such aneuploidy may limit the differentiation capacity and increase the tumorigenicity of HiPSCs.
ABSTRACT: Chronic neuropathic pain is affected by specifics of the precipitating neural pathology, psychosocial factors, and by genetic predisposition. Little is known about the identity of predisposing genes. Using an integrative approach, we discovered that CACNG2 significantly affects susceptibility to chronic pain following nerve injury. CACNG2 encodes stargazin, a protein intimately involved in the trafficking of glutamatergic AMPA receptors. The protein might also be a Ca(2+) channel subunit. CACNG2 has previously been implicated in epilepsy. Initially, using two fine-mapping strategies in a mouse model (recombinant progeny testing [RPT] and recombinant inbred segregation test [RIST]), we mapped a pain-related quantitative trait locus (QTL) (Pain1) into a 4.2-Mb interval on chromosome 15. This interval includes 155 genes. Subsequently, bioinformatics and whole-genome microarray expression analysis were used to narrow the list of candidates and ultimately to pinpoint Cacng2 as a likely candidate. Analysis of stargazer mice, a Cacng2 hypomorphic mutant, provided electrophysiological and behavioral evidence for the gene's functional role in pain processing. Finally, we showed that human CACNG2 polymorphisms are associated with chronic pain in a cohort of cancer patients who underwent breast surgery. Our findings provide novel information on the genetic basis of neuropathic pain and new insights into pain physiology that may ultimately enable better treatments.
Genome Research 09/2010; 20(9):1180-90. · 14.40 Impact Factor
ABSTRACT: The likelihood ratio method for dealing with change-point problems of B. Yakir and M. Pollak [Ann. Appl. Probab. 8, No. 3, 749–774 (1998; Zbl 0937.60082)], which has subsequently been extended to deal with a wide variety of problems involving maxima of random fields, has as a key ingredient a conditional local limit theorem for a log-likelihood ratio, given an almost independent “local” sigma-algebra. This article contains a general version of that theorem, illustrated by several examples.
ABSTRACT: Until last year, type 2 diabetes (T2D) susceptibility loci had hardly been identified, despite great effort. Recently, however, several whole-genome association (WGA) studies jointly uncovered 10 robustly replicated loci. Here, we examine these loci in the Ashkenazi Jewish (AJ) population in a sample of 1,131 cases versus 1,147 controls. Genetic predisposition to T2D in the AJ population was found to be similar to that established in the previous studies. One SNP, rs7754840 in the CDKAL1 gene, presented a significantly stronger effect in the AJ population as compared to the general Caucasian population. This may be due to the increased homogeneity of the AJ population. The use of the SNPs considered in this study to identify individuals at high (or low) risk of developing T2D was found to be of limited value. Our study, however, strongly supports the robustness of WGA studies for the identification of genes affecting complex traits in general and T2D in particular.
Human Genetics 09/2008; 124(1):101-4. · 4.63 Impact Factor
ABSTRACT: We give a unified treatment of the statistical foundations of population based association mapping and of family based linkage mapping of quantitative traits in humans. A central ingredient in the unification involves the efficient score statistic. The discussion focuses on generalized linear models with an additional illustration of the Cox (proportional hazards) model for age of onset data. We give analytic expressions for noncentrality parameters and show how they give qualitative insight into the loss of power that occurs if the scientist's assumed genetic model differs from nature's "true" genetic model. Issues to be studied in detail in the future development of this approach are discussed.
Proceedings of the National Academy of Sciences 01/2008; 104(51):20210-5. · 9.81 Impact Factor
ABSTRACT: We study sequential change-point detection when observations form a sequence of independent Gaussian random fields, and the change-point is the time at which a signal of known functional form involving a finite number of unknown parameters appears. Building on Siegmund and Yakir (2008), which identifies in a simpler problem a detection procedure of Shiryayev-Roberts type that is asymptotically minimax up to terms that vanish as the false detection rate converges to zero, we compare easily computed approximations to the Shiryayev-Roberts detection procedure with similar approximations to CUSUM type procedures. Although the CUSUM type procedures are suboptimal, our studies indicate that they compare favorably to the asymptotically optimal procedures. gradually. Generally speaking, the problem is to detect the signal as soon as possible after it appears, under the constraint that (false) detection occurs very rarely if no signal appears. To illustrate the principles involved we will assume that the observations consist of a sequence of uncorrelated Gaussian random fields and that the signal is a parameterized function of a known form superimposed on the noisy observations at an unknown change-point. The discussion will be formulated in the context of image analysis. Yet the results and the principles that we introduce are meaningful in other settings as well. Following the developments in the companion to this paper (12), we measure detection delay by the expected Kullback-Leibler information accumulated between the change-point and its detection. We begin in the next subsection with a precise description of the model. In Subsection 1.2 the criterion for asymptotic minimax optimality, which is stated and proved in the companion paper, is reformulated to fit the current context. The asymptotic minimax policy uses a randomized form of the Shiryayev-Roberts monitoring scheme.
The alternative Cumulative Sum (CUSUM) monitoring scheme is a better known approach. In Section 2 expressions describing the asymptotic performance of optimal CUSUM and optimal Shiryayev-Roberts rules are obtained. Suboptimal formulations of these procedures are also assessed. It turns out that the natural candidate for optimal Shiryayev-Roberts and CUSUM rules may require substantial computation. In Section 3 we propose alternative rules which are asymptotically equivalent but require less computational effort. A simulation study is conducted in order to investigate the finite-sample properties of these simplified rules. The paper concludes with a discussion of related open problems. The analysis of the different detection methods draws on a substantial literature for its justification. The calculations are only sketched here and emphasize new features of the present formulation. See (12) and the references cited there, and (3).
Statistics and its interface 01/2008; 1(1). · 0.40 Impact Factor
ABSTRACT: The result of Pollak [1985. Optimal detection of a change in distribution. Ann. Statist. 13, 206–227] proving the asymptotic optimality in sequential change-point detection of a suitable Shiryayev–Roberts stopping rule up to terms that vanish in the limit is generalized from the case of two completely specified distributions to that of a composite alternative hypothesis in a multidimensional exponential family. An explicit asymptotic lower bound on the expected Kullback–Leibler information required to detect a change-point is derived and is shown to be attained by a Shiryayev–Roberts stopping rule.
Journal of Statistical Planning and Inference 01/2008; 138(9):2815-2825. · 0.71 Impact Factor
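
For the simple case of two completely specified distributions mentioned in the abstract above, the standard Shiryayev-Roberts and CUSUM stopping rules can be written as one-line recursions on the likelihood ratio. The sketch below is a minimal illustration under that assumption; the function names, the unit-variance Gaussian shift example, and the thresholds are hypothetical choices, and it does not implement the randomized or composite-alternative procedures studied in these papers.

```python
import math


def gaussian_log_lr(x, shift=1.0):
    """Log density ratio of N(shift, 1) to N(0, 1) at observation x."""
    return shift * x - 0.5 * shift ** 2


def shiryayev_roberts_stop(xs, log_lr, threshold):
    """Stop at the first n with R_n >= threshold, where R_n = (1 + R_{n-1}) * LR_n."""
    r = 0.0
    for n, x in enumerate(xs, start=1):
        r = (1.0 + r) * math.exp(log_lr(x))
        if r >= threshold:
            return n
    return None  # no alarm raised


def cusum_stop(xs, log_lr, threshold):
    """Stop at the first n with W_n >= threshold, where W_n = max(0, W_{n-1} + log LR_n)."""
    w = 0.0
    for n, x in enumerate(xs, start=1):
        w = max(0.0, w + log_lr(x))
        if w >= threshold:
            return n
    return None  # no alarm raised
```

Note that the Shiryayev-Roberts threshold is on the likelihood-ratio scale while the CUSUM threshold here is on the log scale, so the two are not directly comparable without calibrating both to the same false detection rate.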
ABSTRACT: Motivated by the problem of testing for the existence of a signal of known parametric structure and unknown "location" (as explained below) against a noisy background, we obtain for the maximum of a centered, smooth random field an approximation for the tail of the distribution. For the motivating class of problems this gives approximately the significance level of the maximum score test. The method is based on an application of a likelihood ratio identity followed by approximations of local fields. Numerical examples illustrate the accuracy of the approximations.
The Annals of Statistics 11/2007; · 2.53 Impact Factor
ABSTRACT: Sex and environment may dramatically affect genetic studies, and thus should be carefully considered. Beginning with two inbred mouse strains with contrasting phenotype in the neuroma model of neuropathic pain (autotomy), we established a backcross population on which we conducted a genome-wide scan. The backcross population was partially maintained in small social groups and partially in isolation. The genome scan detected one previously reported quantitative trait locus (QTL) on chromosome 15 (pain1), but no additional QTLs were found. Interestingly, group caging introduced phenotypic noise large enough to completely mask the genetic effect of the chromosome 15 QTL. The reason appears to be that group-caging animals from the low-autotomy strain together with animals from the high-autotomy strain dramatically increases autotomy in the otherwise low-autotomy mice (males or females). The converse, suppression of pain behaviour in the high-autotomy strain when caged with the low-autotomy strain, was also observed, but only in females. Even in isolated mice, the genetic effect of the chromosome 15 QTL was significant only in females. To determine why, we evaluated autotomy levels of females in 12 different inbred strains of mice and compared them to previously reported levels for males. Strikingly larger environmental variation was observed in males than in females for this pain phenotype. The high baseline variance in males can explain the difficulty in detecting the genetic effect, which was readily seen in females. Our study emphasizes the importance of sex and environment in the genetic analysis of pain.
European Journal of Neuroscience 09/2007; 26(3):681-8. · 3.75 Impact Factor
ABSTRACT: Type 2 diabetes (T2D) is a common, polygenic chronic disease with high heritability. The purpose of this whole-genome association study was to discover novel T2D-associated genes. We genotyped 500 familial cases and 497 controls with >300,000 HapMap-derived tagging single-nucleotide-polymorphism (SNP) markers. When a stringent statistical correction for multiple testing was used, the only significant SNP was at TCF7L2, which has already been discovered and confirmed as a T2D-susceptibility gene. For a replication study, we selected 10 SNPs in six chromosomal regions with the strongest association (singly or as part of a haplotype) for retesting in an independent case-control set including 2,573 T2D cases and 2,776 controls. The most significant replicated result was found at the AHI1-LOC441171 gene region.
The American Journal of Human Genetics 08/2007; 81(2):338-45. · 11.20 Impact Factor
ABSTRACT: In a hidden Markov model, one "estimates" the state of the hidden Markov chain at t by computing via the forwards-backwards algorithm the conditional distribution of the state vector given the observed data. The covariance matrix of this conditional distribution measures the information lost by failure to observe directly the state of the hidden process. In the case where changes of state occur slowly relative to the speed at which information about the underlying state accumulates in the observed data, we compute approximately these covariances in terms of functionals of Brownian motion that arise in change-point analysis. Applications in gene mapping, where these covariances play a role in standardizing the score statistic and in evaluating the loss of noncentrality due to incomplete information, are discussed. Numerical examples illustrate the range of validity and limitations of our results.
Statistical Applications in Genetics and Molecular Biology 02/2007; 6:Article 18. · 1.52 Impact Factor