FaST linear mixed models for genome-wide association studies

Microsoft Research, Los Angeles, California, USA.
Nature Methods (Impact Factor: 32.07). 09/2011; 8(10):833-5. DOI: 10.1038/nmeth.1681
Source: PubMed


We describe factored spectrally transformed linear mixed models (FaST-LMM), an algorithm for genome-wide association studies (GWAS) that scales linearly with cohort size in both run time and memory use. On Wellcome Trust data for 15,000 individuals, FaST-LMM ran an order of magnitude faster than current efficient algorithms. Our algorithm can analyze data for 120,000 individuals in just a few hours, whereas current algorithms fail on data for even 20,000 individuals (

Available from: David Heckerman, Dec 08, 2014
  • Source
    • "Linear kernels, population structure, and correlated noise. In statistical genetics it is widely acknowledged that inclusion of a linear kernel in a linear mixed model is a way to capture and account for confounding influences in the data matrix [6] [19]. Here we explain this mechanism in the context of our model. "
    ABSTRACT: A large class of problems in statistical genetics amounts to finding a sparse linear effect in a binary classification setup, such as finding a small set of genes that most strongly predict a disease. Very often, these signals are spurious and obfuscated by confounders such as age, ethnicity or population structure. In the probit regression model, such confounding can be modeled in terms of correlated label noise, but poses mathematical challenges for learning algorithms. In this paper we propose a learning algorithm to overcome these problems. We manage to learn sparse signals that are less influenced by the correlated noise. This problem setup generalizes to fields outside statistical genetics. Our method can be understood as a hybrid between an $\ell_1$ regularized probit classifier and a Gaussian Process (GP) classifier. In addition to a latent GP to capture correlated noise, the model captures sparse signals in a linear effect. Because the observed labels can be explained in part by the correlated noise, the linear effect will try to learn signals that capture information beyond just correlated noise. As we show on real-world data, signals found by our model are less correlated with the top confounders. Hence, we can find effects closer to the unconfounded sparse effects we are aiming to detect. Besides that, we show that our method outperforms Gaussian process classification and uncorrelated probit regression in terms of prediction accuracy.
    • "The above model was implemented using the FaST-LMM algorithm (Lippert et al., 2011). This program is designed to accommodate large datasets with reduced computational time (Lippert et al., 2011). FaST-LMM uses either maximum likelihood (ML) or restricted maximum likelihood (REML). "
    ABSTRACT: Population structure analyses and genome-wide-association studies (GWAS) conducted on crop germplasm collections provide valuable information on the frequency and distribution of alleles governing economically important traits. The value of these analyses is substantially enhanced when the accession numbers can be increased from ~1K to ~10K or more. In this research we conducted the first comprehensive analysis of population structure on the collection of 14K soybean accessions (Glycine max and G. soja) using a 50K SNP chip. Accessions originating from Japan were relatively homogenous and distinct from the Korean accessions. As a whole, both Japanese and Korean accessions diverged from the Chinese accessions. The ancestry of founders of the American accessions derived mostly from two Chinese subpopulations, which reflects the composition of the American accessions as a whole. A 12K-accession GWAS conducted on seed protein and oil is the largest reported to date in plants and identified SNPs with strong signals on chromosomes 20 and 15. A chromosome 20 region previously reported to be important for protein and oil content was further narrowed and now contains only three plausible candidate genes. The haplotype effects show a strong negative relationship between oil and protein at this locus, indicating negative pleiotropic effects or multiple closely linked loci in repulsion phase linkage. The vast majority of accessions carry the haplotype allele conferring lower protein and higher oil. Our results provide a fuller understanding of the distribution of genetic variation contained within the USDA soybean collection and how it relates to phenotypic variation for economically important traits.
    The Plant Genome 07/2015; DOI:10.3835/plantgenome2015.04.0024 · 3.93 Impact Factor
  • Source
    • "or marker-based estimates of relatedness showed broad agreement in the QTL mapping. While there has been a considerable amount of work demonstrating the benefits of marker-based estimates of "realized relatedness" to control for confounding due to population structure or due to familial relationships (Yu et al. 2006; Kang et al. 2008; Lippert et al. 2011; Listgarten et al. 2012; Zhou et al. 2013), there has been little work on demonstrating these benefits in AILs. In this article, we did not aim for a systematic comparison of approaches to correct for relatedness (for an empirical comparison in simulated data sets, see Cheng et al. 2013). Our results nonetheless sugge"
    ABSTRACT: Genetic influences on anxiety disorders are well documented; however, the specific genes underlying these disorders remain largely unknown. To identify quantitative trait loci (QTL) for conditioned fear and open field behavior, we used an F2 intercross (n = 490) and a 34th-generation advanced intercross line (AIL) (n = 687) from the LG/J and SM/J inbred mouse strains. The F2 provided strong support for several QTL, but within wide chromosomal regions. The AIL yielded much narrower QTL, but the results were less statistically significant, despite the larger number of mice. Simultaneous analysis of the F2 and AIL provided strong support for QTL and within much narrower regions. We used a linear mixed-model approach, implemented in the program QTLRel, to correct for possible confounding due to familial relatedness. Because we recorded the full pedigree, we were able to empirically compare two ways of accounting for relatedness: using the pedigree to estimate kinship coefficients and using genetic marker estimates of "realized relatedness." QTL mapping using the marker-based estimates yielded more support for QTL, but only when we excluded the chromosome being scanned from the marker-based relatedness estimates. We used a forward model selection procedure to assess evidence for multiple QTL on the same chromosome. Overall, we identified 12 significant loci for behaviors in the open field and 12 significant loci for conditioned fear behaviors. Our approach implements multiple advances to integrated analysis of F2 and AILs that provide both power and precision, while maintaining the advantages of using only two inbred strains to map QTL.
    Genetics 09/2014; 198(1):103-16. DOI:10.1534/genetics.114.167056 · 5.96 Impact Factor