Beyond Missing Heritability: Prediction of Complex Traits

Department of Biostatistics, University of Alabama at Birmingham, Alabama, United States of America.
PLoS Genetics (Impact Factor: 7.53). 04/2011; 7(4):e1002051. DOI: 10.1371/journal.pgen.1002051
Source: PubMed


Despite rapid advances in genomic technology, our ability to account for phenotypic variation using genetic information remains limited for many traits. This has unfortunately resulted in limited application of genetic data towards preventive and personalized medicine, one of the primary impetuses of genome-wide association studies. Recently, a large proportion of the "missing heritability" for human height was statistically explained by modeling thousands of single nucleotide polymorphisms concurrently. However, it is currently unclear how gains in explained genetic variance will translate to the prediction of yet-to-be observed phenotypes. Using data from the Framingham Heart Study, we explore the genomic prediction of human height in training and validation samples while varying the statistical approach used, the number of SNPs included in the model, the validation scheme, and the number of subjects used to train the model. In our training datasets, we are able to explain a large proportion of the variation in height (h² up to 0.83, R² up to 0.96). However, the proportion of variance accounted for in validation samples is much smaller (ranging from 0.15 to 0.36 depending on the degree of familial information used in the training dataset). While such R² values vastly exceed what has been previously reported using a reduced number of pre-selected markers (<0.10), given the heritability of the trait (~0.80), substantial room for improvement remains.
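The gap between training and validation fit can be illustrated with a small simulation. The sketch below is not the authors' analysis. It simulates unrelated individuals under hypothetical settings (80 training subjects, 200 validation subjects, 400 SNPs, heritability 0.8), fits a ridge regression on all SNPs at once as a simple stand-in for whole-genome models, and reports R² as the squared correlation between observed and predicted phenotypes. All function names and parameter values are illustrative assumptions.

```python
import random


def mean(values):
    return sum(values) / len(values)


def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def squared_correlation(a, b):
    """Squared Pearson correlation, used here as R^2 (an assumption)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    ss_a = sum((x - ma) ** 2 for x in a)
    ss_b = sum((y - mb) ** 2 for y in b)
    return cov * cov / (ss_a * ss_b)


def simulate_genotypes(n, p, rng, freq=0.5):
    """Allele counts (0/1/2) per SNP, centered at their expectation 2*freq."""
    return [[(rng.random() < freq) + (rng.random() < freq) - 2 * freq
             for _ in range(p)] for _ in range(n)]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def ridge_fit(X, y, lam):
    """Ridge estimates in dual form: beta = X' (X X' + lam I)^-1 y."""
    n, p = len(X), len(X[0])
    K = [[sum(a * b for a, b in zip(X[i], X[j])) for j in range(n)]
         for i in range(n)]
    for i in range(n):
        K[i][i] += lam
    alpha = solve(K, y)
    return [sum(alpha[i] * X[i][j] for i in range(n)) for j in range(p)]


def predict(X, beta):
    return [sum(x * b for x, b in zip(row, beta)) for row in X]


def run_demo(n_train=80, n_valid=200, p=400, h2=0.8, seed=0):
    rng = random.Random(seed)
    X = simulate_genotypes(n_train + n_valid, p, rng)
    effects = [rng.gauss(0.0, 1.0) for _ in range(p)]  # true SNP effects
    genetic = predict(X, effects)
    # Choose the noise variance so that the simulated heritability is h2.
    noise_var = variance(genetic) * (1.0 - h2) / h2
    y = [g + rng.gauss(0.0, noise_var ** 0.5) for g in genetic]
    X_train, y_train = X[:n_train], y[:n_train]
    X_valid, y_valid = X[n_train:], y[n_train:]
    mu = mean(y_train)
    # Penalty = noise variance / effect variance (effect variance is 1 here).
    beta = ridge_fit(X_train, [v - mu for v in y_train], lam=noise_var)
    r2_train = squared_correlation(y_train, predict(X_train, beta))
    r2_valid = squared_correlation(y_valid, predict(X_valid, beta))
    return r2_train, r2_valid


if __name__ == "__main__":
    r2_train, r2_valid = run_demo()
    print(f"training R^2:   {r2_train:.2f}")
    print(f"validation R^2: {r2_valid:.2f}")
```

With many more SNPs than training subjects, the model fits the training phenotypes closely, while the fit in unseen individuals is much weaker. The exact values depend on the simulated settings and should not be compared with the figures reported in the abstract.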

    • "The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated. Recently, machine learning based risk prediction methods using genotyping data have gained momentum in relation with GWA studies in complex disease[38-46], making an important contribution towards the promise of personalized medicine[47-49]. Here, we have organized the largest AN cohort so far[37], constructed machine learning models using GWA microarray data[47,50], and applied the model to testing data set to evaluate the model's performance by the area under the receiver operating characteristic curve (AUC)[47,51,52]."
    ABSTRACT: Background: Anorexia nervosa (AN) is a complex psychiatric disease with a moderate to strong genetic contribution. In addition to conventional genome-wide association (GWA) studies, researchers have been using machine learning methods in conjunction with genomic data to predict risk of diseases in which genetics play an important role. Methods: In this study, we collected whole-genome genotyping data on 3940 AN cases and 9266 controls from the Genetic Consortium for Anorexia Nervosa (GCAN), the Wellcome Trust Case Control Consortium 3 (WTCCC3), the Price Foundation Collaborative Group and the Children's Hospital of Philadelphia (CHOP), and applied machine learning methods for predicting AN disease risk. Prediction performance is measured by the area under the receiver operating characteristic curve (AUC), which indicates how well the model distinguishes cases from unaffected control subjects. Results: A logistic regression model with the lasso penalty generated an AUC of 0.693, while Support Vector Machines and Gradient Boosted Trees reached AUCs of 0.691 and 0.623, respectively. Using different sample sizes, our results suggest that larger datasets are required to optimize the machine learning models and achieve higher AUC values. Conclusions: To our knowledge, this is the first attempt to assess AN risk based on genome-wide genotype-level data. Future integration of genomic, environmental and family-based information is likely to improve the AN risk evaluation process, eventually benefitting AN patients and families in the clinical setting.
    Article · Dec 2015 · BMC Medical Genomics
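The AUC used in the abstract above has a direct interpretation: it is the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen control. A minimal sketch of that calculation follows. The function name `auc` and the toy labels and scores are illustrative assumptions, not data or code from the cited study.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise case/control comparisons.

    labels: 1 for cases, 0 for controls.
    scores: predicted risk, higher meaning more likely to be a case.
    Tied case/control pairs count as one half.
    """
    cases = [s for l, s in zip(labels, scores) if l == 1]
    controls = [s for l, s in zip(labels, scores) if l == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


if __name__ == "__main__":
    toy_labels = [0, 0, 1, 1]
    toy_scores = [0.1, 0.4, 0.35, 0.8]
    # Three of the four case/control pairs are ranked correctly.
    print(auc(toy_labels, toy_scores))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which puts the reported values of roughly 0.62 to 0.69 in context. The pairwise loop is quadratic in sample size, so a rank-based formulation would be preferable for cohorts of the size described.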
    • "This result has been used in conjunction with clustering methods such as k-means or partitioning around medoids (PAM; Bishop, 2006) to produce subsets of minimally related individuals from a given sample by maximising the Euclidean distance (e.g. Daetwyler et al., 2013; Makowsky et al., 2011). At the population level, the divergence between two populations due to drift, environmental adaptation, or artificial selection can be measured with F_ST."
    ABSTRACT: The prediction of phenotypic traits using high-density genomic data has many applications, such as the selection of plants and animals of commercial interest, and it is expected to play an increasing role in medical diagnostics. Statistical models used for this task are usually tested using cross-validation, which implicitly assumes that new individuals (whose phenotypes we would like to predict) originate from the same population the genomic prediction model is trained on. In this paper we investigate the effect of increasing genetic distance between training and target populations when predicting quantitative traits. This is important for plant and animal genetics, where genomic selection programs rely on the precision of predictions in future rounds of breeding. Therefore, estimating how quickly predictive accuracy decays is important in deciding which training population to use and how often the model has to be recalibrated. We find that the correlation between true and predicted values decays approximately linearly with respect to either $F_{ST}$ or mean kinship between the training and the target populations. We illustrate this relationship using simulations and a collection of data sets from mice, wheat and human genetics.
    Article · Sep 2015
    • "This method relies on both the additive relationship matrix between the individuals in the population, which are traditionally obtained from pedigree records, and on phenotypic records of the candidates to selection. Such is the power of BLUP that it is actually not only used in breeding programmes, but also in evolutionary ecology to estimate the strength of selection and evolutionary change (see Hadfield et al., 2010 for a review) and more recently in human genetics for the prediction of complex traits (Makowsky et al., 2011). With the advent of high-throughput genotyping techniques and the development of chips containing thousands of single nucleotide polymorphisms (SNPs) at a reasonable cost, the implementation of genome-wide evaluations (Meuwissen et al., 2001; Goddard and Hayes, 2007) is routinely used in many breeding programs, and conventional BLUP selection based on pedigrees is now migrating to genomic selection. "
    ABSTRACT: Estimated breeding values (EBVs) are traditionally obtained from pedigree information. However, EBVs from high-density genotypes can have higher accuracy than EBVs from pedigree information. At the same time, it has been shown that EBVs from genomic data lead to lower increases in inbreeding compared with traditional selection based on genealogies. Here we compare the performance of BLUP selection based on genealogical coancestry with three different genome-based coancestry estimates: (1) an estimate based on shared segments of homozygosity, (2) an approach based on SNP-by-SNP count corrected by allelic frequencies, and (3) the identity by state methodology. We evaluate the effect of different population sizes, different numbers of genomic markers, and several heritability values for a quantitative trait. The performance of the different measures of coancestry in BLUP is evaluated in the true breeding values after truncation selection and also in terms of coancestry and diversity maintained. Accordingly, cross-performances were also carried out, that is, how prediction based on genealogical records impacts the three other measures of coancestry and inbreeding, and vice versa. Our results show that the genetic gains are very similar for all four coancestries, but the genomic-based methods are superior to using genealogical coancestries in terms of maintaining diversity measured as observed heterozygosity. Furthermore, the measure of coancestry based on shared segments of the genome seems to provide slightly better results in some scenarios, and the increase in inbreeding and loss in diversity is only slightly larger than the other genomic selection methods in those scenarios. Our results shed light on genomic selection vs. traditional genealogical-based BLUP and make the case to manage the population variability using genomic information to preserve the future success of selection programmes.
    Article · Apr 2015 · Frontiers in Genetics
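To make the idea of an identity-by-state coancestry concrete, the sketch below computes one common definition: the probability, averaged over SNPs, that an allele drawn at random from one individual matches an allele drawn at random from the other. This is an assumption about what an IBS-based measure can look like and may differ from the exact estimator used in the study above. The function name `ibs_coancestry` and the 0/1/2 genotype coding are illustrative.

```python
def ibs_coancestry(geno_a, geno_b):
    """Identity-by-state coancestry between two individuals.

    Genotypes are counts (0, 1 or 2) of a reference allele at each SNP.
    At one SNP, the chance that two randomly drawn alleles (one per
    individual) are the same is
        (a * b + (2 - a) * (2 - b)) / 4,
    and the coancestry is the average of that value over all SNPs.
    """
    if len(geno_a) != len(geno_b) or not geno_a:
        raise ValueError("genotype vectors must be non-empty and equal length")
    total = 0.0
    for a, b in zip(geno_a, geno_b):
        total += (a * b + (2 - a) * (2 - b)) / 4.0
    return total / len(geno_a)


if __name__ == "__main__":
    # Hypothetical genotypes at five SNPs for two individuals.
    individual_1 = [2, 1, 0, 2, 1]
    individual_2 = [2, 1, 1, 0, 1]
    print(ibs_coancestry(individual_1, individual_2))
```

Under this definition two identical homozygotes share a coancestry of 1, opposite homozygotes share 0, and two heterozygotes share 0.5, because the alleles drawn from them match only half of the time.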