Beyond Missing Heritability: Prediction of Complex Traits

Department of Biostatistics, University of Alabama at Birmingham, Alabama, United States of America.
PLoS Genetics (Impact Factor: 8.17). 04/2011; 7(4):e1002051. DOI: 10.1371/journal.pgen.1002051
Source: PubMed

ABSTRACT Despite rapid advances in genomic technology, our ability to account for phenotypic variation using genetic information remains limited for many traits. This has unfortunately resulted in limited application of genetic data towards preventive and personalized medicine, one of the primary impetuses of genome-wide association studies. Recently, a large proportion of the "missing heritability" for human height was statistically explained by modeling thousands of single nucleotide polymorphisms concurrently. However, it is currently unclear how gains in explained genetic variance will translate to the prediction of yet-to-be observed phenotypes. Using data from the Framingham Heart Study, we explore the genomic prediction of human height in training and validation samples while varying the statistical approach used, the number of SNPs included in the model, the validation scheme, and the number of subjects used to train the model. In our training datasets, we are able to explain a large proportion of the variation in height (h(2) up to 0.83, R(2) up to 0.96). However, the proportion of variance accounted for in validation samples is much smaller (ranging from 0.15 to 0.36 depending on the degree of familial information used in the training dataset). While such R(2) values vastly exceed what has been previously reported using a reduced number of pre-selected markers (<0.10), given the heritability of the trait (∼ 0.80), substantial room for improvement remains.

  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Genome-wide association studies (GWAS) have detected large numbers of variants associated with complex human traits and diseases. However, the proportion of variance explained by GWAS-significant single nucleotide polymorphisms has been usually small. This brought interest in the use of whole-genome regression (WGR) methods. However, there has been limited research on the factors that affect prediction accuracy (PA) of WGRs when applied to human data of distantly related individuals. Here, we examine, using real human genotypes and simulated phenotypes, how trait complexity, marker-quantitative trait loci (QTL) linkage disequilibrium (LD), and the model used affect the performance of WGRs. Our results indicated that the estimated rate of missing heritability is dependent on the extent of marker-QTL LD. However, this parameter was not greatly affected by trait complexity. Regarding PA our results indicated that: (a) under perfect marker-QTL LD WGR can achieve moderately high prediction accuracy, and with simple genetic architectures variable selection methods outperform shrinkage procedures and (b) under imperfect marker-QTL LD, variable selection methods can achieved reasonably good PA with simple or moderately complex genetic architectures; however, the PA of these methods deteriorated as trait complexity increases and with highly complex traits variable selection and shrinkage methods both performed poorly. This was confirmed with an analysis of human height. © 2015 The Authors. Annals of Human Genetics published by University College London (UCL) and John Wiley & Sons Ltd.
    Annals of Human Genetics 01/2015; 79(2). DOI:10.1111/ahg.12099 · 1.93 Impact Factor
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: The prediction of the flowering time (FT) trait in Brassica napus based on genome-wide markers and the detection of underlying genetic factors is important not only for oilseed producers around the world but also for the other crop industry in the rotation system in China. In previous studies the low density and mixture of biomarkers used obstructed genomic selection in B. napus and comprehensive mapping of FT related loci. In this study, a high-density genome-wide SNP set was genotyped from a double-haploid population of B. napus. We first performed genomic prediction of FT traits in B. napus using SNPs across the genome under ten environments of three geographic regions via eight existing genomic predictive models. The results showed that all the models achieved comparably high accuracies, verifying the feasibility of genomic prediction in B. napus. Next, we performed a large-scale mapping of FT related loci among three regions, and found 437 associated SNPs, some of which represented known FT genes, such as AP1 and PHYE. The genes tagged by the associated SNPs were enriched in biological processes involved in the formation of flowers. Epistasis analysis showed that significant interactions were found between detected loci, even among some known FT related genes. All the results showed that our large scale and high-density genotype data are of great practical and scientific values for B. napus. To our best knowledge, this is the first evaluation of genomic selection models in B. napus based on a high-density SNP dataset and large-scale mapping of FT loci.
    PLoS ONE 01/2015; 10(3):e0119425. DOI:10.1371/journal.pone.0119425 · 3.53 Impact Factor
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Many genetic markers have been shown to be associated with common quantitative traits in genome-wide association studies. Typically these associated genetic markers have small to modest effect sizes and individually they explain only a small amount of the variability of the phenotype. In order to build a genetic prediction model without fitting a multiple linear regression model with possibly hundreds of genetic markers as predictors, researchers often summarize the joint effect of risk alleles into a genetic score that is used as a covariate in the genetic prediction model. However, the prediction accuracy can be highly variable and selecting the optimal number of markers to be included in the genetic score is challenging. In this manuscript we present a strategy to build an ensemble of genetic prediction models from data and we show that the ensemble-based method makes the challenge of choosing the number of genetic markers more amenable. Using simulated data with varying heritability and number of genetic markers, we compare the predictive accuracy and inclusion of true positive and false positive markers of a single genetic prediction model and our proposed ensemble method. The results show that the ensemble of genetic models tends to include a larger number of genetic variants than a single genetic model and it is more likely to include all of the true genetic markers. This increased sensitivity is obtained at the price of a lower specificity that appears to minimally affect the predictive accuracy of the ensemble.
    Frontiers in Genetics 01/2014; 5:474. DOI:10.3389/fgene.2014.00474

Preview (2 Sources)

1 Download
Available from