Matthew Stephens

French National Institute for Agricultural Research, Paris, Ile-de-France, France

Publications (66) · 779.93 Total impact

  • Article: Efficient Algorithms for Multivariate Linear Mixed Models in Genome-wide Association Studies
    Xiang Zhou, Matthew Stephens
    ABSTRACT: Multivariate linear mixed models (mvLMMs) have been widely used in many areas of genetics, and have attracted considerable recent interest in genome-wide association studies (GWASs). However, existing methods for calculating the likelihood ratio test statistics in mvLMMs are time consuming, and, without approximations, cannot be directly applied to analyze even two traits jointly in a typical-size GWAS. Here, we present a novel algorithm for computing parameter estimates and test statistics (Likelihood ratio and Wald) in mvLMMs that i) reduces per-iteration optimization complexity from cubic to linear in the number of samples; and ii) in GWAS analyses, reduces per-marker complexity from cubic to approximately quadratic (or linear if the relatedness matrix is of low rank) in the number of samples. The new method effectively generalizes both the EMMA (Efficient Mixed Model Association) algorithm and the GEMMA (Genome-wide EMMA) algorithm to the multivariate case, making the likelihood ratio tests in GWASs with mvLMM possible, for the first time, for tens of thousands of samples and a moderate number of phenotypes (<10). With real examples, we show that, as expected, the new method is orders of magnitude faster than competing methods in both variance component estimation in a single mvLMM, and in GWAS applications. The method is implemented in the GEMMA software package, freely available at http://stephenslab.uchicago.edu/software.html
    05/2013;
  • Article: A Statistical Framework for Joint eQTL Analysis in Multiple Tissues.
    ABSTRACT: Mapping expression Quantitative Trait Loci (eQTLs) represents a powerful and widely adopted approach to identifying putative regulatory variants and linking them to specific genes. Up to now eQTL studies have been conducted in a relatively narrow range of tissues or cell types. However, understanding the biology of organismal phenotypes will involve understanding regulation in multiple tissues, and ongoing studies are collecting eQTL data in dozens of cell types. Here we present a statistical framework for powerfully detecting eQTLs in multiple tissues or cell types (or, more generally, multiple subgroups). The framework explicitly models the potential for each eQTL to be active in some tissues and inactive in others. By modeling the sharing of active eQTLs among tissues, this framework increases power to detect eQTLs that are present in more than one tissue compared with "tissue-by-tissue" analyses that examine each tissue separately. Conversely, by modeling the inactivity of eQTLs in some tissues, the framework allows the proportion of eQTLs shared across different tissues to be formally estimated as parameters of a model, addressing the difficulties of accounting for incomplete power when comparing overlaps of eQTLs identified by tissue-by-tissue analyses. Applying our framework to re-analyze data from transformed B cells, T cells, and fibroblasts, we find that it substantially increases power compared with tissue-by-tissue analysis, identifying 63% more genes with eQTLs (at FDR = 0.05). Further, the results suggest that, in contrast to previous analyses of the same data, the majority of eQTLs detectable in these data are shared among all three tissues.
    PLoS Genetics 05/2013; 9(5):e1003486. · 8.69 Impact Factor
  • Article: Polygenic modeling with Bayesian sparse linear mixed models.
    Xiang Zhou, Peter Carbonetto, Matthew Stephens
    ABSTRACT: Both linear mixed models (LMMs) and sparse regression models are widely used in genetics applications, including, recently, polygenic modeling in genome-wide association studies. These two approaches make very different assumptions, so are expected to perform well in different situations. However, in practice, for a given dataset one typically does not know which assumptions will be more accurate. Motivated by this, we consider a hybrid of the two, which we refer to as a "Bayesian sparse linear mixed model" (BSLMM) that includes both these models as special cases. We address several key computational and statistical issues that arise when applying BSLMM, including appropriate prior specification for the hyper-parameters and a novel Markov chain Monte Carlo algorithm for posterior inference. We apply BSLMM and compare it with other methods for two polygenic modeling applications: estimating the proportion of variance in phenotypes explained (PVE) by available genotypes, and phenotype (or breeding value) prediction. For PVE estimation, we demonstrate that BSLMM combines the advantages of both standard LMMs and sparse regression modeling. For phenotype prediction it considerably outperforms either of the other two methods, as well as several other large-scale regression methods previously suggested for this problem. Software implementing our method is freely available from http://stephenslab.uchicago.edu/software.html.
    PLoS Genetics 02/2013; 9(2):e1003264. · 8.69 Impact Factor
  • Article: Genetic, functional and molecular features of glucocorticoid receptor binding.
    ABSTRACT: Glucocorticoids (GCs) are key mediators of stress response and are widely used as pharmacological agents to treat immune diseases, such as asthma and inflammatory bowel disease, and certain types of cancer. GCs act mainly by activating the GC receptor (GR), which interacts with other transcription factors to regulate gene expression. Here, we combined different functional genomics approaches to gain molecular insights into the mechanisms of action of GC. By profiling the transcriptional response to GC over time in 4 Yoruba (YRI) and 4 Tuscans (TSI) lymphoblastoid cell lines (LCLs), we suggest that the transcriptional response to GC is variable not only in time, but also in direction (positive or negative) depending on the presence of specific interacting transcription factors. Accordingly, when we performed ChIP-seq for GR and NF-κB in two YRI LCLs treated with GC or with vehicle control, we observed that features of GR binding sites differ for up- and down-regulated genes. Finally, we show that eQTLs that affect expression patterns only in the presence of GC are 1.9-fold more likely to occur in GR binding sites, compared to eQTLs that affect expression only in its absence. Our results indicate that genetic variation at GR and interacting transcription factors binding sites influences variability in gene expression, and attest to the power of combining different functional genomic approaches.
    PLoS ONE 01/2013; 8(4):e61654. · 4.09 Impact Factor
  • Article: Small World MCMC with Tempering: Ergodicity and Spectral Gap
    Yongtao Guan, Matthew Stephens
    ABSTRACT: When sampling a multi-modal distribution $\pi(x)$, $x\in \mathbb{R}^d$, a Markov chain with local proposals is often slowly mixing, while a Small-World sampler (Guan and Krone), a Markov chain that uses a mixture of local and long-range proposals, is fast mixing. However, a Small-World sampler suffers from the curse of dimensionality because its spectral gap depends on the volume of each mode. We present a new sampler that combines tempering, Small-World sampling, and long-range proposals produced from samples in companion chains (e.g. the Equi-Energy sampler). In its simplest form the sampler employs two Small-World chains: an exploring chain and a sampling chain. The exploring chain samples $\pi_t(x) \propto \pi(x)^{1/t}$, $t\in [1,\infty)$, and builds up an empirical distribution. Using this empirical distribution as its long-range proposal, the sampling chain is designed to have a stationary distribution $\pi(x)$. We prove ergodicity of the algorithm and study its convergence rate. We show that the spectral gap of the exploring chain is enlarged by a factor of $t^{d}$ and that of the sampling chain is shrunk by a factor of $t^{-d}$. Importantly, the spectral gap of the exploring chain depends on the "size" of $\pi_t(x)$ while that of the sampling chain does not. Overall, the sampler enlarges a severe bottleneck at the cost of shrinking a mild one, and hence achieves faster mixing. The penalty on the spectral gap of the sampling chain can be significantly alleviated when extending the algorithm to multiple chains whose temperatures $\{t_k\}$ follow a geometric progression. If we allow $t_k \rightarrow 0$, the sampler becomes a global optimizer.
    11/2012;
  • Article: The contribution of RNA decay quantitative trait loci to inter-individual variation in steady-state gene expression levels.
    ABSTRACT: Recent gene expression QTL (eQTL) mapping studies have provided considerable insight into the genetic basis for inter-individual regulatory variation. However, a limitation of all eQTL studies to date, which have used measurements of steady-state gene expression levels, is the inability to directly distinguish between variation in transcription and decay rates. To address this gap, we performed a genome-wide study of variation in gene-specific mRNA decay rates across individuals. Using a time-course study design, we estimated mRNA decay rates for over 16,000 genes in 70 Yoruban HapMap lymphoblastoid cell lines (LCLs), for which extensive genotyping data are available. Considering mRNA decay rates across genes, we found that: (i) as expected, highly expressed genes are generally associated with lower mRNA decay rates, (ii) genes with rapid mRNA decay rates are enriched with putative binding sites for miRNA and RNA binding proteins, and (iii) genes with similar functional roles tend to exhibit correlated rates of mRNA decay. Focusing on variation in mRNA decay across individuals, we estimate that steady-state expression levels are significantly correlated with variation in decay rates in 10% of genes. Somewhat counter-intuitively, for about half of these genes, higher expression is associated with faster decay rates, possibly due to a coupling of mRNA decay with transcriptional processes in genes involved in rapid cellular responses. Finally, we used these data to map genetic variation that is specifically associated with variation in mRNA decay rates across individuals. We found 195 such loci, which we named RNA decay quantitative trait loci ("rdQTLs"). All the observed rdQTLs are located near the regulated genes and therefore are assumed to act in cis. By analyzing our data within the context of known steady-state eQTLs, we estimate that a substantial fraction of eQTLs are associated with inter-individual variation in mRNA decay rates.
    PLoS Genetics 10/2012; 8(10):e1003000. · 8.69 Impact Factor
  • Article: Fast and accurate genotype imputation in genome-wide association studies through pre-phasing.
    ABSTRACT: The 1000 Genomes Project and disease-specific sequencing efforts are producing large collections of haplotypes that can be used as reference panels for genotype imputation in genome-wide association studies (GWAS). However, imputing from large reference panels with existing methods imposes a high computational burden. We introduce a strategy called 'pre-phasing' that maintains the accuracy of leading methods while reducing computational costs. We first statistically estimate the haplotypes for each individual within the GWAS sample (pre-phasing) and then impute missing genotypes into these estimated haplotypes. This reduces the computational cost because (i) the GWAS samples must be phased only once, whereas standard methods would implicitly repeat phasing with each reference panel update, and (ii) it is much faster to match a phased GWAS haplotype to one reference haplotype than to match two unphased GWAS genotypes to a pair of reference haplotypes. We implemented our approach in the MaCH and IMPUTE2 frameworks, and we tested it on data sets from the Wellcome Trust Case Control Consortium 2 (WTCCC2), the Genetic Association Information Network (GAIN), the Women's Health Initiative (WHI) and the 1000 Genomes Project. This strategy will be particularly valuable for repeated imputation as reference panels evolve.
    Nature Genetics 07/2012; 44(8):955-9. · 35.53 Impact Factor
  • Article: Genome-wide efficient mixed-model analysis for association studies.
    Xiang Zhou, Matthew Stephens
    ABSTRACT: Linear mixed models have attracted considerable attention recently as a powerful and effective tool for accounting for population stratification and relatedness in genetic association tests. However, existing methods for exact computation of standard test statistics are computationally impractical for even moderate-sized genome-wide association studies. To address this issue, several approximate methods have been proposed. Here, we present an efficient exact method, which we refer to as genome-wide efficient mixed-model association (GEMMA), that makes approximations unnecessary in many contexts. This method is approximately n times faster than the widely used exact method known as efficient mixed-model association (EMMA), where n is the sample size, making exact genome-wide association analysis computationally practical for large numbers of individuals.
    Nature Genetics 06/2012; 44(7):821-4. · 35.53 Impact Factor
  • Article: Mapping gene-environment interactions at regulatory polymorphisms: insights into mechanisms of phenotypic variation.
    ABSTRACT: Genetic effects on gene regulation make a substantial contribution to phenotypic diversity, yet their mechanisms remain elusive. Here, we discuss the potential insights to be gained from mapping gene-environment interactions at regulatory polymorphisms (i.e., genetic variation that affects gene expression under specific environmental conditions). We highlight a novel statistical method to identify specific patterns of gene-environment interaction at these regulatory polymorphisms. Reviewing its application to a study that mapped gene expression in the presence and absence of glucocorticoids, we discuss the mechanistic insights that this approach provides.
    Transcription. 03/2012; 3(2):56-62.
  • Article: DNase I sensitivity QTLs are a major determinant of human expression variation.
    ABSTRACT: The mapping of expression quantitative trait loci (eQTLs) has emerged as an important tool for linking genetic variation to changes in gene regulation. However, it remains difficult to identify the causal variants underlying eQTLs, and little is known about the regulatory mechanisms by which they act. Here we show that genetic variants that modify chromatin accessibility and transcription factor binding are a major mechanism through which genetic variation leads to gene expression differences among humans. We used DNase I sequencing to measure chromatin accessibility in 70 Yoruba lymphoblastoid cell lines, for which genome-wide genotypes and estimates of gene expression levels are also available. We obtained a total of 2.7 billion uniquely mapped DNase I-sequencing (DNase-seq) reads, which allowed us to produce genome-wide maps of chromatin accessibility for each individual. We identified 8,902 locations at which the DNase-seq read depth correlated significantly with genotype at a nearby single nucleotide polymorphism or insertion/deletion (false discovery rate = 10%). We call such variants 'DNase I sensitivity quantitative trait loci' (dsQTLs). We found that dsQTLs are strongly enriched within inferred transcription factor binding sites and are frequently associated with allele-specific changes in transcription factor binding. A substantial fraction (16%) of dsQTLs are also associated with variation in the expression levels of nearby genes (that is, these loci are also classified as eQTLs). Conversely, we estimate that as many as 55% of eQTL single nucleotide polymorphisms are also dsQTLs. Our observations indicate that dsQTLs are highly abundant in the human genome and are likely to be important contributors to phenotypic variation.
    Nature 02/2012; 482(7385):390-4. · 36.28 Impact Factor
  • Article: Dissecting the regulatory architecture of gene expression QTLs.
    ABSTRACT: Expression quantitative trait loci (eQTLs) are likely to play an important role in the genetics of complex traits; however, their functional basis remains poorly understood. Using the HapMap lymphoblastoid cell lines, we combine 1000 Genomes genotypes and an extensive catalogue of human functional elements to investigate the biological mechanisms that eQTLs perturb. We use a Bayesian hierarchical model to estimate the enrichment of eQTLs in a wide variety of regulatory annotations. We find that approximately 40% of eQTLs occur in open chromatin, and that they are particularly enriched in transcription factor binding sites, suggesting that many directly impact protein-DNA interactions. Analysis of core promoter regions shows that eQTLs also frequently disrupt some known core promoter motifs but, surprisingly, are not enriched in other well-known motifs such as the TATA box. We also show that information from regulatory annotations alone, when weighted by the hierarchical model, can provide a meaningful ranking of the SNPs that are most likely to drive gene expression variation. Our study demonstrates how regulatory annotation and the association signal derived from eQTL-mapping can be combined into a single framework. We used this approach to further our understanding of the biology that drives human gene expression variation, and of the putatively causal SNPs that underlie it.
    Genome biology 01/2012; 13(1):R7. · 6.63 Impact Factor
  • Article: Exon-specific QTLs skew the inferred distribution of expression QTLs detected using gene expression array data.
    ABSTRACT: Mapping of expression quantitative trait loci (eQTLs) is an important technique for studying how genetic variation affects gene regulation in natural populations. In a previous study using Illumina expression data from human lymphoblastoid cell lines, we reported that cis-eQTLs are especially enriched around transcription start sites (TSSs) and immediately upstream of transcription end sites (TESs). In this paper, we revisit the distribution of eQTLs using additional data from Affymetrix exon arrays and from RNA sequencing. We confirm that most eQTLs lie close to the target genes; that transcribed regions are generally enriched for eQTLs; that eQTLs are more abundant in exons than introns; and that the peak density of eQTLs occurs at the TSS. However, we find that the intriguing TES peak is greatly reduced or absent in the Affymetrix and RNA-seq data. Instead our data suggest that the TES peak observed in the Illumina data is mainly due to exon-specific QTLs that affect 3' untranslated regions, where most of the Illumina probes are positioned. Nonetheless, we do observe an overall enrichment of eQTLs in exons versus introns in all three data sets, consistent with an important role for exonic sequences in gene regulation.
    PLoS ONE 01/2012; 7(2):e30629. · 4.09 Impact Factor
  • Article: Comparative RNA sequencing reveals substantial genetic variation in endangered primates.
    ABSTRACT: Comparative genomic studies in primates have yielded important insights into the evolutionary forces that shape genetic diversity and revealed the likely genetic basis for certain species-specific adaptations. To date, however, these studies have focused on only a small number of species. For the majority of nonhuman primates, including some of the most critically endangered, genome-level data are not yet available. In this study, we have taken the first steps toward addressing this gap by sequencing RNA from the livers of multiple individuals from each of 16 mammalian species, including humans and 11 nonhuman primates. Of the nonhuman primate species, five are lemurs and two are lorisoids, for which little or no genomic data were previously available. To analyze these data, we developed a method for de novo assembly and alignment of orthologous gene sequences across species. We assembled an average of 5721 gene sequences per species and characterized diversity and divergence of both gene sequences and gene expression levels. We identified patterns of variation that are consistent with the action of positive or directional selection, including an 18-fold enrichment of peroxisomal genes among genes whose regulation likely evolved under directional selection in the ancestral primate lineage. Importantly, we found no relationship between genetic diversity and endangered status, with the two most endangered species in our study, the black and white ruffed lemur and the Coquerel's sifaka, having the highest genetic diversity among all primates. Our observations imply that many endangered lemur populations still harbor considerable genetic variation. Timely efforts to conserve these species alongside their habitats have, therefore, strong potential to achieve long-term success.
    Genome Research 12/2011; 22(4):602-10. · 13.61 Impact Factor
  • Article: Bayesian Methods for Genetic Association Analysis with Heterogeneous Subgroups: from Meta-Analyses to Gene-Environment Interactions
    Xiaoquan Wen, Matthew Stephens
    ABSTRACT: In genetic association analyses, it is often desired to analyze data from multiple potentially-heterogeneous subgroups. The amount of expected heterogeneity can vary from modest (as might typically be expected in a meta-analysis of multiple studies of the same phenotype, for example), to large (e.g. a strong gene-environment interaction, where the environmental exposure defines discrete subgroups). Here, we consider a flexible set of Bayesian models and priors that can capture these different levels of heterogeneity. We provide accurate numerical approaches to compute approximate Bayes Factors for these different models, and also some simple analytic forms which have natural interpretations and, in some cases, close connections with standard frequentist test statistics. These approximations also have the convenient feature that they require only summary-level data from each subgroup (in the simplest case, a point estimate for the genetic effect, and its standard error, from each subgroup). We illustrate the flexibility of these approaches on three examples: an analysis of a potential gene-environment interaction for a recombination phenotype, a large scale meta-analysis of genome-wide association data from the Global Lipids consortium, and a cross-population analysis for expression quantitative trait loci (eQTLs).
    11/2011;
  • Article: Genotype imputation with thousands of genomes.
    Bryan Howie, Jonathan Marchini, Matthew Stephens
    ABSTRACT: Genotype imputation is a statistical technique that is often used to increase the power and resolution of genetic association studies. Imputation methods work by using haplotype patterns in a reference panel to predict unobserved genotypes in a study dataset, and a number of approaches have been proposed for choosing subsets of reference haplotypes that will maximize accuracy in a given study population. These panel selection strategies become harder to apply and interpret as sequencing efforts like the 1000 Genomes Project produce larger and more diverse reference sets, which led us to develop an alternative framework. Our approach is built around a new approximation that uses local sequence similarity to choose a custom reference panel for each study haplotype in each region of the genome. This approximation makes it computationally efficient to use all available reference haplotypes, which allows us to bypass the panel selection step and to improve accuracy at low-frequency variants by capturing unexpected allele sharing among populations. Using data from HapMap 3, we show that our framework produces accurate results in a wide range of human populations. We also use data from the Malaria Genetic Epidemiology Network (MalariaGEN) to provide recommendations for imputation-based studies in Africa. We demonstrate that our approximation improves efficiency in large, sequence-based reference panels, and we discuss general computational strategies for modern reference datasets. Genome-wide association studies will soon be able to harness the power of thousands of reference genomes, and our work provides a practical way for investigators to use this rich information. New methodology from this study is implemented in the IMPUTE2 software package.
    G3 (Bethesda, Md.). 11/2011; 1(6):457-70.
  • Article: Bayesian variable selection regression for genome-wide association studies and other large-scale problems
    Yongtao Guan, Matthew Stephens
    ABSTRACT: We consider applying Bayesian Variable Selection Regression, or BVSR, to genome-wide association studies and similar large-scale regression problems. Currently, typical genome-wide association studies measure hundreds of thousands, or millions, of genetic variants (SNPs), in thousands or tens of thousands of individuals, and attempt to identify regions harboring SNPs that affect some phenotype or outcome of interest. This goal can naturally be cast as a variable selection regression problem, with the SNPs as the covariates in the regression. Characteristic features of genome-wide association studies include the following: (i) a focus primarily on identifying relevant variables, rather than on prediction; and (ii) many relevant covariates may have tiny effects, making it effectively impossible to confidently identify the complete "correct" subset of variables. Taken together, these factors put a premium on having interpretable measures of confidence for individual covariates being included in the model, which we argue is a strength of BVSR compared with alternatives such as penalized regression methods. Here we focus primarily on analysis of quantitative phenotypes, and on appropriate prior specification for BVSR in this setting, emphasizing the idea of considering what the priors imply about the total proportion of variance in outcome explained by relevant covariates. We also emphasize the potential for BVSR to estimate this proportion of variance explained, and hence shed light on the issue of "missing heritability" in genome-wide association studies.
    10/2011;
  • Article: Interactions between glucocorticoid treatment and cis-regulatory polymorphisms contribute to cellular response phenotypes.
    ABSTRACT: Glucocorticoids (GCs) mediate physiological responses to environmental stress and are commonly used as pharmaceuticals. GCs act primarily through the GC receptor (GR, a transcription factor). Despite their clear biomedical importance, little is known about the genetic architecture of variation in GC response. Here we provide an initial assessment of variability in the cellular response to GC treatment by profiling gene expression and protein secretion in 114 EBV-transformed B lymphocytes of African and European ancestry. We found that genetic variation affects the response of nearby genes and exhibits distinctive patterns of genotype-treatment interactions, with genotypic effects evident in either only GC-treated or only control-treated conditions. Using a novel statistical framework, we identified interactions that influence the expression of 26 genes known to play central roles in GC-related pathways (e.g. NQO1, AIRE, and SGK1) and that influence the secretion of IL6.
    PLoS Genetics 07/2011; 7(7):e1002162. · 8.69 Impact Factor
  • Article: Variation in human recombination rates and its genetic determinants.
    ABSTRACT: BACKGROUND: Despite the fundamental role of crossing-over in the pairing and segregation of chromosomes during human meiosis, the rates and placements of events vary markedly among individuals. Characterizing this variation and identifying its determinants are essential steps in our understanding of the human recombination process and its evolution. STUDY DESIGN/RESULTS: Using three large sets of European-American pedigrees, we examined variation in five recombination phenotypes that capture distinct aspects of crossing-over patterns. We found that the mean recombination rate in males and females and the historical hotspot usage are significantly heritable and are uncorrelated with one another. We then conducted a genome-wide association study in order to identify loci that influence them. We replicated associations of RNF212 with the mean rate in males and in females as well as the association of Inversion 17q21.31 with the female mean rate. We also replicated the association of PRDM9 with historical hotspot usage, finding that it explains most of the genetic variance in this phenotype. In addition, we identified a set of new candidate regions for further validation. SIGNIFICANCE: These findings suggest that variation at broad and fine scales is largely separable and that, beyond three known loci, there is no evidence for common variation with large effects on recombination phenotypes.
    PLoS ONE 01/2011; 6(6):e20321. · 4.09 Impact Factor
  • Article: Analysis of population structure: a unifying framework and novel methods based on sparse factor analysis.
    Barbara E Engelhardt, Matthew Stephens
    ABSTRACT: We consider the statistical analysis of population structure using genetic data. We show how the two most widely used approaches to modeling population structure, admixture-based models and principal components analysis (PCA), can be viewed within a single unifying framework of matrix factorization. Specifically, they can both be interpreted as approximating an observed genotype matrix by a product of two lower-rank matrices, but with different constraints or prior distributions on these lower-rank matrices. This opens the door to a large range of possible approaches to analyzing population structure, by considering other constraints or priors. In this paper, we introduce one such novel approach, based on sparse factor analysis (SFA). We investigate the effects of the different types of constraint in several real and simulated data sets. We find that SFA produces similar results to admixture-based models when the samples are descended from a few well-differentiated ancestral populations and can recapitulate the results of PCA when the population structure is more "continuous," as in isolation-by-distance models.
    PLoS Genetics 09/2010; 6(9). · 8.69 Impact Factor
  • Article: Using linear predictors to impute allele frequencies from summary or pooled genotype data.
    Xiaoquan Wen, Matthew Stephens
    ABSTRACT: Recently developed genotype imputation methods are a powerful tool for detecting untyped genetic variants that affect disease susceptibility in genetic association studies. However, existing imputation methods require individual-level genotype data, whereas in practice it is often the case that only summary data are available. For example, this may occur because, for reasons of privacy or politics, only summary data are made available to the research community at large; or because only summary data are collected, as in DNA pooling experiments. In this article, we introduce a new statistical method that can accurately infer the frequencies of untyped genetic variants in these settings, and indeed substantially improve frequency estimates at typed variants in pooling experiments where observations are noisy. Our approach, which predicts each allele frequency using a linear combination of observed frequencies, is statistically straightforward, and related to a long history of the use of linear methods for estimating missing values (e.g. Kriging). The main statistical novelty is our approach to regularizing the covariance matrix estimates, and the resulting linear predictors, which is based on methods from population genetics. We find that, besides being both fast and flexible (allowing new problems to be tackled that cannot be handled by existing imputation approaches purpose-built for the genetic context), these linear methods are also very accurate. Indeed, imputation accuracy using this approach is similar to that obtained by state-of-the-art imputation methods that use individual-level data, but at a fraction of the computational cost.
    The Annals of Applied Statistics 09/2010; 4(3):1158-1182. · 1.58 Impact Factor

Institutions

  • 2007–2013
    • French National Institute for Agricultural Research
      Paris, Ile-de-France, France
    • Broad Institute of MIT and Harvard
      Cambridge, MA, USA
    • The Scripps Research Institute
      La Jolla, CA, USA
  • 2006–2013
    • University of Chicago
      • Department of Human Genetics
      • Department of Statistics
      Chicago, IL, USA
  • 2009
    • Howard Hughes Medical Institute
      Chevy Chase, MD, USA
  • 2008
    • University of Michigan
      Ann Arbor, MI, USA
  • 2006–2007
    • University of Oxford
      • Department of Statistics
      Oxford, ENG, United Kingdom
  • 2003–2007
    • University of Washington Seattle
      • Department of Statistics
      • Department of Biology
      Seattle, WA, USA