-
ABSTRACT: The BLR (Bayesian linear regression) package of R implements several Bayesian regression models for continuous traits. The package was originally developed for implementing the Bayesian LASSO (BL) of Park and Casella (J Am Stat Assoc 103(482):681-686, 2008), extended to accommodate fixed effects and regressions on pedigree using methods described by de los Campos et al. (Genetics 182(1):375-385, 2009). In 2010 we further developed the code into an R-package, reprogrammed some internal aspects of the algorithm in the C language to increase computational speed, and further documented the package (Plant Genome J 3(2):106-116, 2010). The first version of BLR was launched in 2010 and since then the package has been used for multiple publications and is being routinely used for genomic evaluations in some animal and plant breeding programs. In this article we review the models implemented by BLR and illustrate the use of the package with examples.
Methods in molecular biology (Clifton, N.J.) 01/2013; 1019:299-320.
-
ABSTRACT: Prediction of genetic risk for disease is needed for preventive and personalized medicine. Genome wide association studies have found unprecedented numbers of variants associated with complex human traits and diseases. However, these variants explain only a small proportion of genetic risk. Mounting evidence suggests that many traits, relevant to public health, are affected by large numbers of small-effect genes and that prediction of genetic risk to those traits and diseases could be improved by incorporating large numbers of markers into whole-genome prediction (WGP) models. We developed a WGP model incorporating thousands of markers for prediction of skin cancer risk in humans. We also considered other ways of incorporating genetic information into prediction models, such as family history or ancestry (using principal components, PC, of informative markers). Prediction accuracy was evaluated using the area under the receiver operating characteristic curve (AUC) estimated in a cross-validation. Incorporation of genetic information (i.e., familial relationships, PC or WGP) yielded a significant increase in prediction accuracy: from an AUC of 0.53 for a baseline model that accounted for non-genetic covariates to AUCs of 0.58 (pedigree), 0.62 (PC), and 0.64 (WGP). In summary, prediction of skin cancer risk could be improved by considering genetic information and using a large number of SNPs in a WGP model, which allows for the detection of patterns of genetic risk that are above and beyond those that can be captured using family history. We discuss avenues for improving prediction accuracy and speculate on the possible use of WGP to prospectively identify individuals at high risk.
Genetics 10/2012; · 4.01 Impact Factor
-
ABSTRACT: The analysis of longevity as a function of risk factors such as body mass index (BMI; kg/m²), activity levels, and dietary factors is a mainstay of obesity research. Modeling survival through hazard functions, relative risks, or odds of dying with methods such as Cox proportional hazards or logistic regression are the most common approaches and have many advantages. However, they also have disadvantages in terms of the ease of interpretability, especially for non-statisticians; the need for additional data to convert parameter estimates to estimates of years of life lost (YLL); and debates about the appropriate time scale in the model. Parametric survival models are able to provide more direct answers, and in our analysis of an obesity-related data set, gave consistent YLL estimates regardless of the distribution used. Additionally, we offer alternative approaches to the analyses of censored survival data including a modified or 'compressed' Gaussian distribution. We therefore recommend increased consideration of parametric survival models in chronic disease and risk factor epidemiology.
Obesity 08/2012; · 4.28 Impact Factor
-
ABSTRACT: Genetic factors are believed to account for 25% of the interindividual differences in Years of Life (YL) among humans. However, the genetic loci that have thus far been found to be associated with YL explain a very small proportion of the expected genetic variation in this trait, perhaps reflecting the complexity of the trait and the limitations of traditional association studies when applied to traits affected by a large number of small-effect genes. Using data from the Framingham Heart Study and statistical methods borrowed largely from the field of animal genetics (whole-genome prediction, WGP), we developed a WGP model for the study of YL and evaluated the extent to which thousands of genetic variants across the genome examined simultaneously can be used to predict interindividual differences in YL. We find that a sizable proportion of differences in YL--which were unexplained by age at entry, sex, smoking and BMI--can be accounted for and predicted using WGP methods. The contribution of genomic information to prediction accuracy was even higher than that of smoking and body mass index (BMI) combined; two predictors that are considered among the most important life-shortening factors. We evaluated the impacts of familial relationships and population structure (as described by the first two marker-derived principal components) and concluded that in our dataset population structure explained partially, but not fully the gains in prediction accuracy obtained with WGP. Further inspection of prediction accuracies by age at death indicated that most of the gains in predictive ability achieved with WGP were due to the increased accuracy of prediction of early mortality, perhaps reflecting the ability of WGP to capture differences in genetic risk to deadly diseases such as cancer, which are most often responsible for early mortality in our sample.
PLoS ONE 01/2012; 7(7):e40964. · 4.09 Impact Factor
-
ABSTRACT: The availability of thousands of genome-wide molecular markers has made possible the use of genomic selection in plants and animals. However, the evaluation of models for genomic selection in plant breeding populations remains limited. In this study, we provide an overview of several models for genomic selection, whose predictive ability we investigate using two plant data sets. The first data set comprises historical phenotypic records of a series of wheat (Triticum aestivum L.) trials evaluated in 10 environments and recently generated genomic data. The second data set pertains to international maize (Zea mays L.) trials in which two disease traits (Exserohilum turcicum and Cercospora zeae-maydis) of maize lines evaluated in five environments were measured. Results showed that models including marker information yielded important gains in predictive ability relative to that of a pedigree-based model, this with a modest number of markers. Estimates of marker effects were different across environmental conditions, indicating that genotype × environment interaction was an important component of genetic variability. Overall, the study provided evidence from real populations indicating that genomic selection could be an effective tool for improving traits of economic importance in commercial crops.
Journal of Crop Improvement 05/2011; 25(3):239-261.
-
ABSTRACT: Despite rapid advances in genomic technology, our ability to account for phenotypic variation using genetic information remains limited for many traits. This has unfortunately resulted in limited application of genetic data towards preventive and personalized medicine, one of the primary impetuses of genome-wide association studies. Recently, a large proportion of the "missing heritability" for human height was statistically explained by modeling thousands of single nucleotide polymorphisms concurrently. However, it is currently unclear how gains in explained genetic variance will translate to the prediction of yet-to-be observed phenotypes. Using data from the Framingham Heart Study, we explore the genomic prediction of human height in training and validation samples while varying the statistical approach used, the number of SNPs included in the model, the validation scheme, and the number of subjects used to train the model. In our training datasets, we are able to explain a large proportion of the variation in height (h² up to 0.83, R² up to 0.96). However, the proportion of variance accounted for in validation samples is much smaller (ranging from 0.15 to 0.36 depending on the degree of familial information used in the training dataset). While such R² values vastly exceed what has been previously reported using a reduced number of pre-selected markers (<0.10), given the heritability of the trait (~0.80), substantial room for improvement remains.
PLoS Genetics 04/2011; 7(4):e1002051. · 8.69 Impact Factor
-
ABSTRACT: Phenotypic traits may exert causal effects between them. For example, on the one hand, high yield in dairy cows may increase the liability to certain diseases and, on the other hand, the incidence of a disease may affect yield negatively. Likewise, the transcriptome may be a function of the reproductive status in mammals and the latter may depend on other physiological variables. Knowledge of phenotype networks describing such interrelationships can be used to predict the behavior of complex systems, e.g. biological pathways underlying complex traits such as diseases, growth and reproduction. Structural Equation Models (SEM) can be used to study recursive and simultaneous relationships among phenotypes in multivariate systems such as genetical genomics, system biology, and multiple trait models in quantitative genetics. Hence, SEM can produce an interpretation of relationships among traits which differs from that obtained with traditional multiple trait models, in which all relationships are represented by symmetric linear associations among random variables, such as covariances and correlations. In this review, we discuss the application of SEM and related techniques for the study of multiple phenotypes. Two basic scenarios are considered, one pertaining to genetical genomics studies, in which QTL or molecular marker information is used to facilitate causal inference, and another related to quantitative genetic analysis in livestock, in which only phenotypic and pedigree information is available. Advantages and limitations of SEM compared to traditional approaches commonly used for the analysis of multiple traits, as well as some indication of future research in this area are presented in a concluding section.
Genetics Selection Evolution 02/2011; 43:6. · 2.88 Impact Factor
-
ABSTRACT: Although genome-wide association studies have identified markers that are associated with various human traits and diseases, our ability to predict such phenotypes remains limited. A perhaps overlooked explanation lies in the limitations of the genetic models and statistical techniques commonly used in association studies. We propose that alternative approaches, which are largely borrowed from animal breeding, provide potential for advances. We review selected methods and discuss the challenges and opportunities ahead.
Nature Reviews Genetics 11/2010; 11(12):880-6. · 38.08 Impact Factor
-
José Crossa, Gustavo de los Campos, Paulino Pérez, Daniel Gianola, Juan Burgueño, José Luis Araus, Dan Makumbi, Ravi P Singh, Susanne Dreisigacker, Jianbing Yan, Vivi Arief, Marianne Banziger, Hans-Joachim Braun
ABSTRACT: The availability of dense molecular markers has made possible the use of genomic selection (GS) for plant breeding. However, the evaluation of models for GS in real plant populations is very limited. This article evaluates the performance of parametric and semiparametric models for GS using wheat (Triticum aestivum L.) and maize (Zea mays) data in which different traits were measured in several environmental conditions. The findings, based on extensive cross-validations, indicate that models including marker information had higher predictive ability than pedigree-based models. In the wheat data set, and relative to a pedigree model, gains in predictive ability due to inclusion of markers ranged from 7.7 to 35.7%. Correlation between observed and predictive values in the maize data set achieved values up to 0.79. Estimates of marker effects were different across environmental conditions, indicating that genotype × environment interaction is an important component of genetic variability. These results indicate that GS in plant breeding can be an effective strategy for selecting among lines whose phenotypes have yet to be observed.
Genetics 10/2010; 186(2):713-24. · 4.01 Impact Factor
-
ABSTRACT: Prediction of genetic values is a central problem in quantitative genetics. Over many decades, such predictions have been successfully accomplished using information on phenotypic records and family structure usually represented with a pedigree. Dense molecular markers are now available in the genome of humans, plants and animals, and this information can be used to enhance the prediction of genetic values. However, the incorporation of dense molecular marker data into models poses many statistical and computational challenges, such as how models can cope with the genetic complexity of multi-factorial traits and with the curse of dimensionality that arises when the number of markers exceeds the number of data points. Reproducing kernel Hilbert spaces regressions can be used to address some of these challenges. The methodology allows regressions on almost any type of prediction sets (covariates, graphs, strings, images, etc.) and has important computational advantages relative to many parametric approaches. Moreover, some parametric models appear as special cases. This article provides an overview of the methodology, a discussion of the problem of kernel choice with a focus on genetic applications, algorithms for kernel selection and an assessment of the proposed methods using a collection of 599 wheat lines evaluated for grain yield in four mega environments.
Genetics Research 08/2010; 92(4):295-308. · 1.71 Impact Factor
-
ABSTRACT: The use of structural equation models for the analysis of recursive and simultaneous relationships between phenotypes has become more popular recently. The aim of this paper is to illustrate how these models can be applied in animal breeding to achieve parameterizations of different levels of complexity and, more specifically, to model phenotypic recursion between three calving traits: gestation length (GL), calving difficulty (CD) and stillbirth (SB). All recursive models considered here postulate heterogeneous recursive relationships between GL and liabilities to CD and SB, and between liability to CD and liability to SB, depending on categories of GL phenotype.
Four models were compared in terms of goodness of fit and predictive ability: 1) standard mixed model (SMM), a model with unstructured (co)variance matrices; 2) recursive mixed model 1 (RMM1), assuming that residual correlations are due to the recursive relationships between phenotypes; 3) RMM2, assuming that correlations between residuals and contemporary groups are due to recursive relationships between phenotypes; and 4) RMM3, postulating that the correlations between genetic effects, contemporary groups and residuals are due to recursive relationships between phenotypes.
For all the RMM considered, the estimates of the structural coefficients were similar. Results revealed a nonlinear relationship between GL and the liabilities both to CD and to SB, and a linear relationship between the liabilities to CD and SB. Differences in terms of goodness of fit and predictive ability of the models considered were negligible, suggesting that RMM3 is plausible.
The applications examined in this study suggest the plausibility of a nonlinear recursive effect from GL onto CD and SB. Also, the fact that the most restrictive model RMM3, which assumes that the only cause of correlation is phenotypic recursion, performs as well as the others indicates that the phenotypic recursion may be an important cause of the observed patterns of genetic and environmental correlations.
Genetics Selection Evolution 01/2010; 42:1. · 2.88 Impact Factor
-
ABSTRACT: The availability of dense molecular markers has made possible the use of genomic selection in plant and animal breeding. However, models for genomic selection pose several computational and statistical challenges and require specialized computer programs, not always available to the end user and not implemented in standard statistical software yet. The R-package BLR (Bayesian Linear Regression) implements several statistical procedures (e.g., Bayesian Ridge Regression, Bayesian LASSO) in a unified framework that allows including marker genotypes and pedigree data jointly. This article describes the classes of models implemented in the BLR package and illustrates their use through examples. Some challenges faced when applying genomic-enabled selection, such as model choice, evaluation of predictive ability through cross-validation, and choice of hyper-parameters, are also addressed.
The Plant Genome 01/2010; 3(2):106-116.
-
ABSTRACT: The availability of genomewide dense markers brings opportunities and challenges to breeding programs. An important question concerns the ways in which dense markers and pedigrees, together with phenotypic records, should be used to arrive at predictions of genetic values for complex traits. If a large number of markers are included in a regression model, marker-specific shrinkage of regression coefficients may be needed. For this reason, the Bayesian least absolute shrinkage and selection operator (LASSO) (BL) appears to be an interesting approach for fitting marker effects in a regression model. This article adapts the BL to arrive at a regression model where markers, pedigrees, and covariates other than markers are considered jointly. Connections between BL and other marker-based regression models are discussed, and the sensitivity of BL with respect to the choice of prior distributions assigned to key parameters is evaluated using simulation. The proposed model was fitted to two data sets from wheat and mouse populations, and evaluated using cross-validation methods. Results indicate that inclusion of markers in the regression further improved the predictive ability of models. An R program that implements the proposed model is freely available.
Genetics 04/2009; 182(1):375-85. · 4.01 Impact Factor
-
ABSTRACT: Reproducing kernel Hilbert spaces (RKHS) methods are widely used for statistical learning in many areas of endeavor. Recently, these methods have been suggested as a way of incorporating dense markers into genetic models. This note argues that RKHS regression provides a general framework for genetic evaluation that can be used either for pedigree- or marker-based regressions and under any genetic model, infinitesimal or not, and additive or not. Most of the standard models for genetic evaluation, such as infinitesimal animal or sire models, and marker-assisted selection models appear as special cases of RKHS methods.
Journal of Animal Science 03/2009; 87(6):1883-7. · 2.10 Impact Factor
-
ABSTRACT: Inferences about genetic values and prediction of phenotypes for a quantitative trait in the presence of complex forms of gene action, issues of importance in animal and plant breeding, and in evolutionary quantitative genetics, are discussed. Current methods for dealing with epistatic variability via variance component models are reviewed. Problems posed by cryptic, non-linear, forms of epistasis are identified and discussed. Alternative statistical procedures are suggested. Non-parametric definitions of additive effects (breeding values), with and without employing molecular information, are proposed, and it is shown how these can be inferred using reproducing kernel Hilbert spaces regression. Two stylized examples are presented to demonstrate the methods numerically. The first example falls in the domain of the infinitesimal model of quantitative genetics, with additive and dominance effects inferred both parametrically and non-parametrically. The second example tackles a non-linear genetic system with two loci, and the predictive ability of several models is evaluated.
Genetics Research 01/2009; 90(6):525-40. · 1.71 Impact Factor
-
ABSTRACT: Multivariate linear models are increasingly important in quantitative genetics. In high dimensional specifications, factor analysis (FA) may provide an avenue for structuring (co)variance matrices, thus reducing the number of parameters needed for describing (co)dispersion. We describe how FA can be used to model genetic effects in the context of a multivariate linear mixed model. An orthogonal common factor structure is used to model genetic effects under Gaussian assumption, so that the marginal likelihood is multivariate normal with a structured genetic (co)variance matrix. Under standard prior assumptions, all fully conditional distributions have closed form, and samples from the joint posterior distribution can be obtained via Gibbs sampling. The model and the algorithm developed for its Bayesian implementation were used to describe five repeated records of milk yield in dairy cattle, and a one common FA model was compared with a standard multiple trait model. The Bayesian Information Criterion favored the FA model.
Genetics Selection Evolution 39(5):481-94. · 2.88 Impact Factor