FaST linear mixed models for genome-wide association studies

Microsoft Research, Los Angeles, California, USA.
Nature Methods (Impact Factor: 32.07). 09/2011; 8(10):833-5. DOI: 10.1038/nmeth.1681
Source: PubMed


We describe factored spectrally transformed linear mixed models (FaST-LMM), an algorithm for genome-wide association studies (GWAS) that scales linearly with cohort size in both run time and memory use. On Wellcome Trust data for 15,000 individuals, FaST-LMM ran an order of magnitude faster than current efficient algorithms. Our algorithm can analyze data for 120,000 individuals in just a few hours, whereas current algorithms fail on data for even 20,000 individuals (
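The "spectrally transformed" part of the name can be illustrated with a minimal sketch. It assumes the standard linear mixed model y ~ N(Xβ, σ²(K + δI)), where K is a genetic similarity matrix. This is not the authors' implementation: the function names `spectral_rotate` and `rotated_log_likelihood` are hypothetical, and the sketch uses a full eigendecomposition, so it does not reproduce the scaling behaviour reported in the abstract. It only shows that after rotating the data by the eigenvectors of K, the covariance becomes diagonal and each likelihood evaluation for a new δ needs no further matrix factorization.

```python
import numpy as np


def spectral_rotate(y, X, K):
    """Eigendecompose K once and rotate phenotype and covariates.

    Hypothetical helper: K = U diag(S) U^T, so U^T (K + delta*I) U
    is the diagonal matrix diag(S + delta) for any delta.
    """
    S, U = np.linalg.eigh(K)  # K is assumed symmetric
    return S, U.T @ y, U.T @ X


def rotated_log_likelihood(delta, S, y_rot, X_rot):
    """Maximum-likelihood log-likelihood of the LMM for a fixed delta.

    Works entirely on the rotated data, where the covariance is
    sigma2 * diag(S + delta); beta and sigma2 are profiled out.
    """
    n = y_rot.shape[0]
    d = S + delta              # diagonal of the rotated covariance
    w = 1.0 / d                # per-sample weights
    XtWX = X_rot.T @ (X_rot * w[:, None])
    XtWy = X_rot.T @ (y_rot * w)
    beta = np.linalg.solve(XtWX, XtWy)       # weighted least squares
    resid = y_rot - X_rot @ beta
    sigma2 = np.sum(resid ** 2 * w) / n      # ML variance estimate
    ll = -0.5 * (n * np.log(2 * np.pi) + np.sum(np.log(d))
                 + n + n * np.log(sigma2))
    return ll, beta, sigma2
```

In use, `spectral_rotate` would be called once per dataset, and `rotated_log_likelihood` would then be evaluated repeatedly inside a one-dimensional search over δ.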



Available from: David Heckerman, Dec 08, 2014
  • Source
    • "Each of the new loci had achieved a suggestive (P<0.05) level of significance using the Prior genotypes, which provides an indication of the consistency of these findings; however, as only 71,000 SNPs are shared between the two datasets, the specific SNPs making up the peaks in both genotype sets were not entirely identical. Like other mixed-model algorithms (eg Lippert et al. 2011 "
    ABSTRACT: Human genome-wide association studies (GWAS) have identified thousands of loci associated with disease phenotypes. GWAS have also become feasible using rodent models and these have some important advantages over human studies including controlled environment, access to tissues for molecular profiling, reproducible genotypes and a wide array of techniques for experimental validation. Association mapping with common mouse inbred strains generally requires one hundred or more strains to achieve sufficient power and mapping resolution; in contrast, sample sizes for human studies are typically one or more orders of magnitude greater than this. To enable well-powered studies in mice, we have generated high-density genotypes for ~175 inbred strains of mice using the Mouse Diversity Array. These new data increase marker density by 1.9-fold, have reduced missing data rates, and provide more accurate identification of heterozygous regions compared to previous genotype data. We report the discovery of new loci from previously reported association mapping studies using the new genotype data. The data are freely available for download and web-based tools provide easy access for association mapping and viewing of the underlying intensity data for individual loci.
    G3-Genes Genomes Genetics 07/2015; 5(10). DOI:10.1534/g3.115.020784 · 3.20 Impact Factor
  • Source
    • "Linear kernels, population structure, and correlated noise. In statistical genetics it is widely acknowledged that inclusion of a linear kernel in a linear mixed model is a way to capture and account for confounding influences in the data matrix [6] [19]. Here we explain this mechanism in the context of our model. "
    ABSTRACT: A large class of problems in statistical genetics amounts to finding a sparse linear effect in a binary classification setup, such as finding a small set of genes that most strongly predict a disease. Very often, these signals are spurious and obfuscated by confounders such as age, ethnicity or population structure. In the probit regression model, such confounding can be modeled in terms of correlated label noise, but poses mathematical challenges for learning algorithms. In this paper we propose a learning algorithm to overcome these problems. We manage to learn sparse signals that are less influenced by the correlated noise. This problem setup generalizes to fields outside statistical genetics. Our method can be understood as a hybrid between an $\ell_1$ regularized probit classifier and a Gaussian Process (GP) classifier. In addition to a latent GP to capture correlated noise, the model captures sparse signals in a linear effect. Because the observed labels can be explained in part by the correlated noise, the linear effect will try to learn signals that capture information beyond just correlated noise. As we show on real-world data, signals found by our model are less correlated with the top confounders. Hence, we can find effects closer to the unconfounded sparse effects we are aiming to detect. Besides that, we show that our method outperforms Gaussian process classification and uncorrelated probit regression in terms of prediction accuracy.
  • Source
    • "The above model was implemented using the FaST-LMM algorithm (Lippert et al., 2011). This program is designed to accommodate large datasets with reduced computational time (Lippert et al., 2011). FaST-LMM uses either maximum likelihood (ML) or restricted maximum likelihood (REML). "
    ABSTRACT: Population structure analyses and genome-wide-association studies (GWAS) conducted on crop germplasm collections provide valuable information on the frequency and distribution of alleles governing economically important traits. The value of these analyses is substantially enhanced when the accession numbers can be increased from ~1K to ~10K or more. In this research we conducted the first comprehensive analysis of population structure on the collection of 14K soybean accessions (Glycine max and G. soja) using a 50K SNP chip. Accessions originating from Japan were relatively homogenous and distinct from the Korean accessions. As a whole, both Japanese and Korean accessions diverged from the Chinese accessions. The ancestry of founders of the American accessions derived mostly from two Chinese subpopulations, which reflects the composition of the American accessions as a whole. A 12K-accession GWAS conducted on seed protein and oil is the largest reported to date in plants and identified SNPs with strong signals on chromosomes 20 and 15. A chromosome 20 region previously reported to be important for protein and oil content was further narrowed and now contains only three plausible candidate genes. The haplotype effects show a strong negative relationship between oil and protein at this locus, indicating negative pleiotropic effects or multiple closely linked loci in repulsion phase linkage. The vast majority of accessions carry the haplotype allele conferring lower protein and higher oil. Our results provide a fuller understanding of the distribution of genetic variation contained within the USDA soybean collection and how it relates to phenotypic variation for economically important traits.
    The Plant Genome 07/2015; doi: 10.3835/plantgenome2015.04.0024(3). DOI:10.3835/plantgenome2015.04.0024 · 3.93 Impact Factor