Article

A Comparative Study of Single-Trait and Multi-Trait Genomic Selection

Authors:
  • ICAR-Indian Agricultural Statistics Research Institute
  • Indian Council of Agricultural Research
  • ICAR-Indian Agricultural Statistics Research Institute

Abstract

In recent years, genomic selection (GS) has become a preferred approach in animal and plant breeding for selecting appropriate breeding candidates, as it contributes significantly to enhancing genetic gain. Various GS studies have been carried out in the recent past, mostly confined to a single trait. However, GS methods based on a single trait have not performed well in cases such as pleiotropy, missing data, and traits with low heritability. Gradually, studies began to explore GS methods based on multiple traits as a way to overcome these problems of single-trait GS (STGS). Multi-trait GS methods are now gaining importance because they exploit the correlation structure among the response variables. In this study, we compared various STGS methods, namely stepwise regression, ridge regression, least absolute shrinkage and selection operator (LASSO), Bayesian methods, best linear unbiased prediction, and support vector machine, with multi-trait GS methods, namely multivariate regression with covariance estimation, conditional Gaussian graphical models, mixed models, and LASSO. In almost all cases, the multi-trait methods were found to be more accurate. Based on these results, it may be concluded that multi-trait methods have great potential to increase genetic gain, as they use the correlation among the response variables as extra information, which helps to estimate breeding values more precisely. This study is a comprehensive review of GS methods from single trait to multiple traits, with comparisons between the two classes.
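
To make the contrast between the two classes of methods concrete, the mixed-model case can be written out as below. This is a common textbook formulation and not necessarily the exact specification used in the study; the symbols (genomic relationship matrix G, trait covariance matrices Σ_g and Σ_e, n individuals, t traits) are notation assumed here for illustration.

```latex
% Single-trait model: one trait measured on n individuals
\[
\mathbf{y} = \mathbf{1}\mu + \mathbf{g} + \mathbf{e}, \qquad
\mathbf{g} \sim N(\mathbf{0},\, \sigma_g^2 \mathbf{G}), \qquad
\mathbf{e} \sim N(\mathbf{0},\, \sigma_e^2 \mathbf{I}_n)
\]

% Multi-trait model: t traits stacked trait by trait
\[
\begin{pmatrix} \mathbf{y}_1 \\ \vdots \\ \mathbf{y}_t \end{pmatrix}
=
\begin{pmatrix} \mathbf{1}\mu_1 \\ \vdots \\ \mathbf{1}\mu_t \end{pmatrix}
+
\begin{pmatrix} \mathbf{g}_1 \\ \vdots \\ \mathbf{g}_t \end{pmatrix}
+
\begin{pmatrix} \mathbf{e}_1 \\ \vdots \\ \mathbf{e}_t \end{pmatrix},
\qquad
\mathbf{g} \sim N(\mathbf{0},\, \boldsymbol{\Sigma}_g \otimes \mathbf{G}), \qquad
\mathbf{e} \sim N(\mathbf{0},\, \boldsymbol{\Sigma}_e \otimes \mathbf{I}_n)
\]
```

The off-diagonal entries of Σ_g are the genetic covariances between traits. When they are all zero the multi-trait model reduces to t separate single-trait models, which is why the gain from multi-trait methods depends on how strongly the traits are correlated.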


... Various studies related to GS have been mostly confined to a single trait in the recent past, although they have not performed well in cases of pleiotropy, missing data and low heritability (Boison et al., 2017; Budhlakoti et al., 2019). Gradually, studies were carried out to explore methods for GS based on multiple traits, which provide accurate genomic prediction by exploiting the correlation structure among responses (Budhlakoti et al., 2019). In addition, animal breeders usually record one trait at multiple times throughout the lifetime of animals, and these records are often strongly genetically correlated. ...
... A large number of genomic selection studies have focused on single-trait analyses (Boison et al., 2017;Budhlakoti et al., 2019). However, many traits are genetically correlated, such as the Angora wool traits among different shearing times. ...
Article
The Angora rabbit, a well-known breed for fiber production, has been undergoing traditional breeding programs relying mainly on phenotypes. Genomic selection (GS) uses genomic information and promises to accelerate genetic gain. To implement GS in Angora rabbit breeding in practice, it is necessary to evaluate different marker densities and GS models in order to develop suitable strategies for an optimized breeding pipeline. Because no microarray is available, low-coverage sequencing combined with genotype imputation was used to boost the number of SNPs across the rabbit genome. Here, in a population of 629 Angora rabbits, a total of 18,577,154 high-quality SNPs were imputed (imputation accuracy above 98%) based on low-coverage sequencing of 3.84X genomic coverage, and wool traits and body weight were measured at 70, 140 and 210 days of age. From the original markers, 0.5K, 1K, 3K, 5K, 10K, 50K, 100K, 500K, 1M and 2M were randomly selected and evaluated, resulting in 50K markers as the baseline for heritability estimation and genomic prediction. Compared with the GS performance of single-trait models, the prediction accuracy of nearly all traits could be improved by multi-trait models, which might be because multi-trait models used information from genetically correlated traits. Furthermore, we observed a highly significant negative correlation between the increase in prediction accuracy from single-trait to multi-trait models and estimated heritability. The results indicated that low-heritability traits could borrow more information from correlated traits and hence achieve higher prediction accuracy. This research is the first to report heritability estimation in rabbits using genome-wide markers, and it identified 50K as an optimal marker density for further microarray design, genetic evaluation and genomic selection in Angora rabbits. We expect that the work could provide strategies for GS in early selection and optimize breeding programs in rabbits.
... However, in such situations, we may lose some additional information, e.g., high correlation among the traits and pleiotropic effects of genes, if available. In such cases, multi-trait genomic selection (MTGS)-based methods may provide more accurate GEBVs and subsequently higher prediction accuracy [99,100,101,102]. A number of MTGS-based methods have been studied in relation to GS, e.g., MRCE (Multivariate Regression with Covariance Estimation) [103], the multivariate mixed model approach [104,105], the Bayesian multi-trait model [104], and cGGM (conditional Gaussian Graphical Models) [104,106]. ...
... Crop residues may harbour primary inoculum in the form of perithecia and sporodochia. Resistance is classified into five types: (1) type I, resistance to initial infection by the pathogen/pathotypes; (2) type II, resistance to the spread of infection to nearby tissues; (3) type III, resistance to kernel infection; (4) type IV, resistance to toxin secretion and its accumulation on kernels; and (5) type V, tolerance [115]. Numerous quantitative trait loci (QTL) with effects of differing magnitude have been explored against multiple pathotypes of Fusarium in wheat and other cereal crops [74–117]. QTL whose effects are stable across multiple environments are thought to offer greater practical breeding efficiency than minor ones. ...
Article
Purpose: Wheat is an important cereal crop that is cultivated in different parts of the world. Biotic stresses are major concerns in wheat-growing nations and are responsible for production loss globally. Changing climate dynamics make the pathogen more virulent in foothills and tropical regions. There is growing concern about FHB in major wheat-growing nations, and until now, no potential source of resistance has been identified in wheat germplasm. The plant-pathogen interaction activates a cascade of pathways, genes, TFs, and resistance genes. The role of pathogenesis-related genes in disease resistance has been functionally validated in different plant systems. Similarly, genome-wide association studies (GWAS) and genomic selection (GS) are promising tools and have led to the discovery of resistance genes, genomic regions, and novel markers. Fusarium graminearum produces deoxynivalenol (DON) mycotoxins in wheat kernels, affecting wheat productivity globally. Modern technology now allows for detecting and managing DON toxin to reduce the risk to humans and animals. This review offers a comprehensive overview of the roles played by GWAS and GS in the identification of new genes, genetic variants, molecular markers and DON toxin management strategies. Methods: The review offers a comprehensive and in-depth analysis of the function of Fusarium graminearum virulence factors in durum wheat. The role of GWAS and GS for Fusarium head blight (FHB) resistance is described. This paper provides a comprehensive description of the various statistical models that are used in GWAS and GS. In this review, we look at how different detection methods have been used to analyze and manage DON toxin exposure. Results: This review highlights the role of virulence genes in Fusarium disease establishment. Genome-based selection enables the identification of novel QTLs in resistant wheat germplasm. The use of GWAS and GS has reduced the need for population development through breeding. Here, we also emphasize the function of recent technological developments in minimizing the impact of DON toxins and their implications for food safety.
... Multitrait genomic selection (MTGS) approaches, such as multivariate mixed models (Klápště et al., 2020), Bayesian multitrait models (Cheng et al., 2018), MRCE (Rothman et al., 2010), and cGGM (Chiquet et al., 2017), offer higher accuracy by considering multiple traits simultaneously. Studies comparing STGS and MTGS methods have confirmed improved prediction accuracy for traits with low heritability when leveraging related traits (Budhlakoti et al., 2019). ...
Article
In modern plant breeding, molecular markers have become indispensable tools, allowing the precise identification of genetic loci linked to key agronomic traits. These markers provide critical insight into the genetic architecture of crops, accelerating the selection of desirable traits for sustainable agriculture. This review focuses on the advancements in quantitative trait locus (QTL) mapping and genome-wide association studies (GWASs), highlighting their effective roles in identifying complex traits such as stress tolerance, yield, disease resistance, and nutrient efficiency. QTL mapping identifies the significant genetic regions linked to desired traits, while GWASs enhance precision using larger populations. The integration of high-throughput phenotyping has further improved the efficiency and accuracy of QTL research and GWASs, enabling precise trait analysis across diverse conditions. Additionally, next-generation sequencing, clustered regularly interspaced short palindromic repeats (CRISPR) technology, and transcriptomics have transformed these methods, offering profound insights into gene function and regulation. Single-cell RNA sequencing further enhances our understanding of plant responses at the cellular level, especially under environmental stress. Despite this progress, however, challenges persist in optimizing methods, refining training populations, and integrating these tools into breeding programs. Future studies must aim to enhance genetic prediction models, incorporate advanced molecular technologies, and refine functional markers to tackle the challenges of sustainable agriculture.
... A density of 50 K is commonly used in livestock genetics and breeding [51–54], as also demonstrated in Angora rabbits [55]. Although the baseline marker densities in other species were different, a similar phenomenon was observed in all [56]. Thus, 50 K, a relatively low marker density, was sufficient to produce high prediction accuracy (Figure 2; Table 2). ...
Article
The selection and breeding of good meat rabbit breeds are fundamental to their industrial development, and genomic selection (GS) can employ genomic information to make up for the shortcomings of traditional phenotype-based breeding methods. For the practical implementation of GS in meat rabbit breeding, it is necessary to assess different marker densities and GS models. Here, we obtained low-coverage whole-genome sequencing (lcWGS) data from 1515 meat rabbits (including parent herd and half-sibling offspring). The specific objectives were (1) to derive a baseline for heritability estimates and genomic predictions based on randomly selected marker densities and (2) to assess the accuracy of genomic predictions for single- and multiple-trait linear mixed models. We found that a marker density of 50 K can be used as a baseline for heritability estimation and genomic prediction. For GS, the multi-trait genomic best linear unbiased prediction (GBLUP) model results in more accurate predictions for virtually all traits compared to the single-trait model, with improvements greater than 15% for all of them, which may be attributed to the use of information on genetically related traits. In addition, we discovered a positive correlation between the performance of the multi-trait GBLUP and the genetic correlation between the traits. We anticipate that this approach will provide solutions for GS, as well as optimize breeding programs, in meat rabbits.
... Several MTGS-based methods have been studied in relation to GS, e.g., the multivariate mixed model approach (Jia and Jannink 2012; Klápště et al. 2020), Bayesian multi-trait model (Jia and Jannink 2012; Cheng et al. 2018), MRCE (multivariate regression with covariance estimation) (Rothman et al. 2010), and cGGM (conditional Gaussian graphical model) (Chiquet et al. 2017). Jia and Jannink (2012) presented three multivariate linear models (i.e., GBLUP, Bayes A, and Bayes Cπ) and compared them to univariate models, and a detailed comparison of various STGS- and MTGS-based methods has also been made by Budhlakoti et al. (2019). A brief structure of different STGS- and MTGS-based methods used in GS studies is given in Fig. 6.1. ...
Chapter
Pearl millet [Pennisetum glaucum (L.) R. Br] is a staple grain for about 90 million people in India, sub-Saharan Africa, and South Asia. Genomic selection is a new tool that helps to identify better lines among experimental cultivars in plant breeding programs. Genomic selection examines the phenotypes and high-density marker scores of lines in a population to predict breeding values. The integration of all marker information in the prediction model contributes to the effectiveness of genomic selection by eliminating biased marker effect estimations and collecting more of the variance associated with small-effect quantitative trait loci (QTL). The whole genome sequence of pearl millet has recently been sequenced, allowing genomic selection models to be used to improve the selection process in the pearl millet breeding program. Genomic selection, which employs genomic-estimated breeding values of individuals obtained from genome-wide markers to identify candidates for the next breeding cycle, is a powerful tool for enhancing quantitative traits. Models used for genomic selection frequently encounter problems when the number of markers exceeds the number of phenotypic data. To address this issue and enhance prediction accuracy, genomic selection models and algorithms such as Bayesian, Gaussian, and machine learning have been used. This chapter focuses extensively on the transition from conventional selection techniques used in plant breeding to the genomic selection, the underlying statistical models and methods used for this purpose, the current state of genomic selection research in pearl millet, and the prospects for its successful application in the development of climate resilient pearl millet varieties suitable for different end users.
... When low-heritability traits have at least moderate correlation with high-heritability traits, the predictability for the low-heritability traits could be strongly increased by using a multi-trait model (Jia and Jannink, 2012; Montesinos-López et al., 2016; Budhlakoti et al., 2019). The prediction accuracy for low-heritability key traits can be improved by using high-heritability secondary traits (Jia and Jannink, 2012; Muranty et al., 2015). ...
Article
The first paradigm of plant breeding involves direct selection based phenotypic observation, followed by predictive breeding using statistical models constructed for quantitative traits based on genetic experimental design and more recently by incorporating molecular marker genotypes. However, plant performance or phenotype (P) is determined by the combining effects of genotype (G), envirotype (E) and genotype by environment interaction (GEI). Phenotypes can be predicted more precisely by training a model using data collected from multiple sources, including spatiotemporal omics (genomics, phenomics and enviromics across time and space). Integration of 3D information profiles (G-P-E), each with multidimensionality, provides predictive breeding with both tremendous opportunities and great challenges. Here, we first review innovative technologies for predictive breeding. We then evaluate multidimensional information profiles that can be integrated with a predictive breeding strategy, particularly envirotypic data, which have largely been neglected in data collection and nearly untouched in model construction. We propose a smart breeding scheme, integrated genomic-enviromic prediction (iGEP), as an extension of genomic prediction, using integrated multiomics information, big data technology and artificial intelligence (mainly focus on machine and deep learning). How to implement iGEP was discussed, including spatiotemporal models, environmental indices, factorial and spatiotemporal structure of plant breeding data, and cross-species prediction. A strategy is then proposed for prediction-based crop redesign at both the macro (individual, population and species) and micro (gene, metabolism and network) scales. Finally, we provide perspectives on translating the smart breeding into genetic gain through integrative breeding platforms and open-source breeding initiatives. 
We call for coordinated efforts in smart breeding through iGEP, institutional partnerships, and innovative technological support.
... Unlike single-trait, multiple-trait (MT) genomic prediction models combine information from individual lines and analyze multiple traits simultaneously. The potential of MT methods to improve predictive ability in genomic prediction has been proven (Lado et al. 2018;Jia and Jannink 2012;Budhlakoti et al. 2019). For completeness, the following two MT prediction methods are also included in the comparative analysis. ...
Article
Genomic prediction (GP) has revolutionized crop breeding despite remaining issues of transferability of models to unseen environmental conditions and environments. Usage of endophenotypes rather than genomic markers leads to the possibility of building phenomic prediction (PP) models that can account, in part, for this challenge. Here, we compare and contrast GP and PP models for three growth-related traits, namely, leaf count, tree height, and trunk diameter, from two coffee three-way hybrid (H3W) populations exposed to a series of treatment-inducing environmental conditions. The models are based on seven different statistical methods built with genomic markers and chlorophyll a fluorescence (ChlF) data used as predictors. This comparative analysis demonstrates that the best performing PP models show higher predictability than the best GP models for the considered traits and environments in the vast majority of comparisons within H3W populations. In addition, we show that PP models are transferrable between conditions, but to a lower extent between populations and we conclude that ChlF data can serve as alternative predictors in statistical models of coffee hybrid performance. Future directions will explore their combination with other endophenotypes to further improve the prediction of growth-related traits for crops. Keywords: genomic prediction; phenomic prediction; three-way coffee hybrids; chlorophyll a fluorescence
... Unlike single trait, multiple-trait (MT) GP models combine information from individual lines and analyze MTs simultaneously. The potential of MT methods to improve predictive ability in GP has been proven (Jia and Jannink 2012;Lado et al. 2018;Budhlakoti et al. 2019). For completeness, the following 2 MT prediction methods are also included in the comparative analysis. ...
Article
Genomic prediction (GP) has revolutionized crop breeding despite remaining issues of transferability of models to unseen environmental conditions and environments. Usage of endophenotypes rather than genomic markers leads to the possibility of building phenomic prediction (PP) models that can account, in part, for this challenge. Here, we compare and contrast GP and PP models for three growth-related traits, namely, leaf count, tree height, and trunk diameter, from two coffee three-way hybrid (H3W) populations exposed to a series of treatment-inducing environmental conditions. The models are based on seven different statistical methods built with genomic markers and chlorophyll a fluorescence (ChlF) data used as predictors. This comparative analysis demonstrates that the best performing PP models show higher predictability than the best GP models for the considered traits and environments in the vast majority of comparisons within H3W populations. In addition, we show that PP models are transferrable between conditions, but to a lower extent between populations and we conclude that ChlF data can serve as alternative predictors in statistical models of coffee hybrid performance. Future directions will explore their combination with other endophenotypes to further improve the prediction of growth-related traits for crops.
... It uses genome-wide molecular markers associated with resistance genes for predicting and selecting high breeding value lines. In a recent review, different models used in GS were discussed in detail; particularly, the use of multivariate GS models (MTGS) over single trait GS (STGS) was presented (Budhlakoti et al., 2019). Multi-trait GS (MTGS) methods may provide more accurate genomic-estimated breeding values (GEBVs). ...
Article
Pea (Pisum sativum L.) is one of the most important and productive cool season pulse crops grown throughout the world. Biotic stresses are the crucial constraints in harnessing the potential productivity of pea and warrant dedicated research and developmental efforts to utilize omics resources and advanced breeding techniques to assist rapid and timely development of high-yielding multiple stress-tolerant–resistant varieties. Recently, the pea researcher’s community has made notable achievements in conventional and molecular breeding to accelerate its genetic gain. Several quantitative trait loci (QTLs) or markers associated with genes controlling resistance for fusarium wilt, fusarium root rot, powdery mildew, ascochyta blight, rust, common root rot, broomrape, pea enation, and pea seed borne mosaic virus are available for the marker-assisted breeding. The advanced genomic tools such as the availability of comprehensive genetic maps and linked reliable DNA markers hold great promise toward the introgression of resistance genes from different sources to speed up the genetic gain in pea. This review provides a brief account of the achievements made in the recent past regarding genetic and genomic resources’ development, inheritance of genes controlling various biotic stress responses and genes controlling pathogenesis in disease causing organisms, genes/QTLs mapping, and transcriptomic and proteomic advances. Moreover, the emerging new breeding approaches such as transgenics, genome editing, genomic selection, epigenetic breeding, and speed breeding hold great promise to transform pea breeding. Overall, the judicious amalgamation of conventional and modern omics-enabled breeding strategies will augment the genetic gain and could hasten the development of biotic stress-resistant cultivars to sustain pea production under changing climate. 
The present review encompasses at one platform the research accomplishment made so far in pea improvement with respect to major biotic stresses and the way forward to enhance pea productivity through advanced genomic tools and technologies.
... Many studies have applied the SVM algorithm in animals [2,18,56,77–80] and plants [4,20,46,48,69,73,74,81–83], and several of them concluded that SVM is one of the most accurate algorithms for GWP. ...
Chapter
Growth of artificial intelligence and machine learning (ML) methodology has been explosive in recent years. In this class of procedures, computers get knowledge from sets of experiences and provide forecasts or classification. In genome-wide based prediction (GWP), many ML studies have been carried out. This chapter provides a description of the main semiparametric and nonparametric algorithms used in GWP in animals and plants. Thirty-four ML comparative studies conducted in the last decade were used to develop a meta-analysis through a Thurstonian model, to evaluate algorithms with the best predictive qualities. It was found that some kernel, Bayesian, and ensemble methods displayed greater robustness and predictive ability. However, the type of study and data distribution must be considered in order to choose the most appropriate model for a given problem. Key words: Machine learning; GWP; Neural networks; Kernel methods; Ensemble methods; Bayesian methods; Meta-analysis; Complex traits
... Prediction using the MTM model gives more reliable results. Budhlakoti et al. [84] also reported that MTM has great potential to increase genetic gain and contributes to estimating the breeding value more precisely. ...
Article
Thailand is a tropical country affected by global climate change and has high temperatures and humidity that cause heat stress in livestock. A temperature–humidity index (THI) is required to assess and evaluate heat stress levels in livestock. One of the livestock types in Thailand experiencing heat stress due to extreme climate change is crossbred dairy cattle. Genetic evaluations of heat tolerance in dairy cattle have been carried out for reproductive traits. Heritability values for reproductive traits are generally low (<0.10) because environmental factors heavily influence them. Consequently, genetic improvement for these traits would be slow compared to production traits. Both positive and negative genetic correlations were found among reproductive traits and between reproductive traits and yield traits. Several selection methods for reproductive traits have been introduced, i.e., the traditional method, marker-assisted selection (MAS), and genomic selection (GS). GS is the most promising technique and provides accurate results with a high genetic gain. Single-step genomic BLUP (ssGBLUP) has higher accuracy than the multi-step equivalent for fertility traits or low-heritability traits.
... In comparison to single-trait models, the larger accuracies obtained via multi-trait animal models were consistent with the findings reported by Tsuruta et al. (2011), Guo et al. (2014), Ayalew et al. (2017), and Budhlakoti et al. (2019). This leads to the rejection of the null hypothesis and the acceptance of the alternative hypothesis that the accuracy of genomic predictions increases in the presence of G×E interactions. ...
Article
Aim of study: To assess genotype by environment (G×E) interaction via single- and multi-trait animal models for binary traits in dairy cattle. Area of study: University of Tabriz, Tabriz, Iran. Material and methods: Phenotypic and genomic data were simulated considering a binary trait in four environments as different correlated traits. Heritabilities of 0.05, 0.10, 0.15, and 0.20 were considered to mimic the genetic variation of the binary trait in different environments. Eight scenarios resulted from combining the number of QTLs (60 or 300), LD level (high or low), and incidence of the binary trait (10% or 30%) were simulated to compare the accuracy of predictions. For all scenarios, 1667 markers per chromosome (depicting a 50K SNP chip) were randomly spaced over 30 chromosomes. Multi-trait animal models were applied to take account of G×E interaction and to predict the genomic breeding value in different environments. Prediction accuracies obtained from the single- and multi-trait animal models were compared. Main results: In the models with G×E interaction, the largest accuracy of 0.401 was obtained in high LD scenario with 60 QTLs, and incidence of 30% for the fourth environment. The lowest accuracy of 0.190 was achieved in low LD scenario with 300 QTLs and incidence of 10% for the first environment. Research highlights: Genomic selection with high prediction accuracy can be possible by considering the G×E interaction during the genetic improvement programs in dairy cattle.
... So multi-trait genomic selection (MTGS) gave more accurate GEBVs than STGS for traits with low heritability and for traits with missing data. Jia et al. (2012) presented three multivariate linear models (i.e., GBLUP, Bayes A, and Bayes Cπ) and compared them to univariate models, and a detailed comparison of various STGS- and MTGS-based methods has also been presented by Budhlakoti et al. (2019). Moreover, the models generally used for GS are linear. ...
Conference Paper
Genomic selection is a modified form of marker-assisted selection in which markers from the whole genome are used to estimate the Genomic Estimated Breeding Value (GEBV). A population of individuals with both phenotypic and genotypic data is used to estimate model parameters, which are subsequently used to calculate GEBVs of selection candidates that have only genotypic data. These GEBVs are then used to select individuals for advancement in the breeding cycle. Several estimators are available to estimate GEBVs. However, issues such as the high dimensionality of the data, multicollinearity among the markers, and having fewer individuals than markers (the large p, small n problem) are major challenges in the estimation of GEBVs. This paper discusses the most commonly used methods for estimation of GEBVs, viz., Ridge Regression, Genomic Best Linear Unbiased Prediction (GBLUP), the Bayesian alphabet and Least Absolute Shrinkage and Selection Operator (LASSO), with the aim of meeting the challenges associated with estimation of GEBVs. Apart from this, some semi-parametric and non-parametric methods of genomic selection are discussed as well. The problem of outliers in genomic selection data is also addressed. Furthermore, a case study dealing with non-linearity in the data is presented and illustrated using multi-trait data. Finally, some future directions for research in this area are highlighted.
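
The train-then-predict workflow described in this abstract can be sketched with ridge regression, one of the estimators it names. This is only an illustrative sketch on simulated data, not the cited paper's implementation: the function names, the simulated population sizes, and the fixed shrinkage value `lam` are all assumptions made for the example.

```python
import numpy as np


def simulate_population(n=600, p=500, n_qtl=50, h2=0.5, seed=0):
    """Simulate SNP genotypes (0/1/2), true breeding values and phenotypes.

    All settings here are hypothetical and chosen only for illustration.
    """
    rng = np.random.default_rng(seed)
    X = rng.binomial(2, 0.5, size=(n, p)).astype(float)
    effects = np.zeros(p)
    qtl = rng.choice(p, size=n_qtl, replace=False)
    effects[qtl] = rng.normal(size=n_qtl)
    g = X @ effects                                 # true breeding values
    noise_sd = np.sqrt(g.var() * (1.0 - h2) / h2)   # residual SD giving heritability h2
    y = g + rng.normal(scale=noise_sd, size=n)
    return X, y, g


def fit_ridge(X, y, lam):
    """Estimate marker effects by solving (X'X + lam*I) b = X'y on centred data."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    lhs = Xc.T @ Xc + lam * np.eye(X.shape[1])
    beta = np.linalg.solve(lhs, Xc.T @ (y - y_mean))
    return beta, x_mean, y_mean


def predict_gebv(X_new, beta, x_mean, y_mean):
    """GEBVs for candidates that have genotypes only."""
    return y_mean + (X_new - x_mean) @ beta


X, y, g = simulate_population()
train, cand = np.arange(500), np.arange(500, 600)   # training set vs. selection candidates
lam = 250.0                                         # assumed shrinkage; tune by cross-validation in practice
beta, x_mean, y_mean = fit_ridge(X[train], y[train], lam)
gebv = predict_gebv(X[cand], beta, x_mean, y_mean)

# Accuracy here is the correlation of GEBVs with the simulated true breeding values
accuracy = np.corrcoef(gebv, g[cand])[0, 1]
print(f"prediction accuracy: {accuracy:.2f}")
```

The ridge penalty is what keeps the marker-effect estimates stable when markers are numerous and collinear, which is the large p, small n difficulty the abstract points to.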
... Several MTGS-based methods have been studied in relation to GS, e.g., multivariate mixed model approach (Klápště et al., 2020), Bayesian multi-trait model (Cheng et al., 2018), MRCE (multivariate regression with covariance estimation) (Rothman et al., 2010), and cGGM (conditional Gaussian graphical model) (Chiquet et al., 2017). Jia and Jannink (2012) presented three multivariate linear models (i.e., GBLUP, Bayes A, and Bayes Cπ) and compared them to univariate models, and a detailed comparison of various STGS- and MTGS-based methods has also been made by Budhlakoti et al. (2019c). A brief structure of different STGS- and MTGS-based methods used in GS studies is given in Figure 2. ...
Article
Since the inception of the theory and conceptual framework of genomic selection (GS), extensive research has been done on evaluating its efficiency for utilization in crop improvement. Although marker-assisted selection has proven its potential for improving qualitative traits controlled by one to a few genes with large effects, its role in improving quantitative traits controlled by several genes with small effects is limited. In this regard, GS, which uses genomic-estimated breeding values of individuals obtained from genome-wide markers to choose candidates for the next breeding cycle, is a powerful approach to improve quantitative traits. In the last two decades, GS has been widely adopted in animal breeding programs globally because of its potential to improve selection accuracy, minimize phenotyping, reduce cycle time, and increase genetic gains. In addition, given the promising initial evaluation outcomes of GS for the improvement of yield, biotic and abiotic stress tolerance, and quality in cereal crops like wheat, maize, and rice, prospects of integrating it into crop breeding are also being explored. Improved statistical models that leverage genomic information to increase prediction accuracy are critical for the effectiveness of GS-enabled breeding programs. Studies of genetic architecture under drought and heat stress help in developing production markers that can significantly accelerate the development of stress-resilient crop varieties through GS. This review focuses on the transition from traditional selection methods to GS, the underlying statistical methods and tools used for this purpose, the current status of GS studies in crop plants, and perspectives for its successful implementation in the development of climate-resilient crops.
... However, the effect of the rrBLUP models with peak GWAS markers on prediction accuracy remains controversial (Rice and Lipka 2019; Spindel et al. 2015). A classical cultivar is the combination of many excellent traits; therefore, multi-trait GS may perform better than single-trait GS (Budhlakoti et al. 2019; Habyarimana et al. 2020). Marker density is another important factor that affects prediction accuracy. ...
Article
Key message: The importance and potential of the multi-purpose crop sorghum in global food security have not yet been fully exploited, and the integration of state-of-the-art genomics and high-throughput technologies into breeding practice is required. Abstract: Sorghum, a historically vital staple food source and currently the fifth most important major cereal, is emerging as a crop with diverse end-uses as food, feed, fuel and forage and as a model for functional genetics and genomics of tropical grasses. Rapid development in high-throughput experimental and data processing technologies has significantly accelerated sorghum genomic research in the past few years. The genomes of three sorghum lines are available, thousands of genetic stocks are accessible, and various genetic populations, including NAM, MAGIC, and mutagenised populations, have been released. Functional and comparative genomics have elucidated key genetic loci and genes controlling agronomical and adaptive traits. However, the knowledge gained is far from being translated into real breeding practice. We argue that the way forward is to take a genome-based approach for tailored designing of sorghum as a multi-functional crop combining excellent agricultural traits for various end uses. In this review, we update the new concepts and innovation systems in crop breeding and summarise recent advances in sorghum genomic research, especially the genome-wide dissection of variations in genes and alleles for agronomically important traits. Future directions and opportunities for sorghum breeding are highlighted to stimulate discussion amongst sorghum academic and industrial communities.
Chapter
In view of the current focus on improving crop varieties for disease resistance, genomics-aided plant breeding is growing in importance. To date, much research has been done on crop improvement for disease resistance using both conventional and molecular breeding approaches. Most research on developing disease-resistant varieties has concentrated on major-gene resistance to biotic stress, but this type of resistance is prone to breakdown with frequent changes in pathogen strains. On the other hand, more stable and broad-spectrum resistance can be achieved by breeding varieties whose resistance is based on minor quantitative genes. Genomic selection (GS) is the most convenient approach in molecular plant breeding for developing varieties with quantitative disease resistance. This approach enhances genetic gain in selecting individuals by using whole-genome sequence data to calculate the breeding value of progeny. Selection relies on the GS methodology and whole-genome prediction models that are also used for yield and other economic traits. Although GS is a promising tool for genetic improvement of quantitative traits through shortening of the breeding cycle, its efficiency in crop breeding programmes could be increased by optimizing models that account for genotype × environment interaction to improve prediction accuracy. Combining GS with novel platforms for genotyping, phenotyping and speed breeding could speed up GS-aided breeding and raise genetic gain per unit time.
Chapter
A boost in the crop improvement rate is essential for accomplishing a sustainable food supply and meeting other demands of rapid population growth. Genomic selection (GS), a very promising breeding strategy used effectively in animal breeding, is now used in crop improvement. GS shortens breeding cycles by rapidly selecting better genotypes. Several empirical and simulation studies on GS and its implications for agricultural production enhancement have lately been published. We briefly discuss the GS methodology, its present position, the advantages of GS over alternative methods of breeding, commonly used prediction models of GS, and factors interfering with the prediction accuracy of GS to provide a comprehensive grasp of the technology. In addition, the integration of speed breeding and other modern techniques for increasing the effectiveness and speed of GS is discussed.
Chapter
Sorghum is a promising bioenergy crop, but its productivity is affected by abiotic stressors such as drought, heat, and salinity. To address this issue, a concerted effort is being made to understand the stress tolerance mechanisms and gene discovery in Sorghum, along with the interaction of genetic and environmental factors. In this regard, several omics approaches, tools, and resources have already been developed for Sorghum cultivation. The advent of modern sequencing technologies has significantly accelerated genomics and transcriptomic studies in Sorghum. This has facilitated the use of quantitative trait loci (QTL) mapping, genome-wide association studies (GWAS), and genomic selection (GS) to identify key genes and genetic markers that contribute to abiotic stress tolerance in Sorghum. However, there has been limited effort in other omics branches such as proteomics, metabolomics, and ionomics. Despite the extensive cataloging of omics resources for Sorghum, there is still a need for greater integration of omics approaches to efficiently utilize these resources and gain a better understanding of the molecular mechanisms involved in abiotic stress tolerance. This would enable researchers to understand the plant’s responses and the genetic regulatory networks involved in abiotic stress tolerance, which would be helpful in enhancing Sorghum productivity and contributing to sustainable bioenergy production.
Chapter
The linear mixed model framework is explained in detail in this chapter. We explore three methods of parameter estimation (maximum likelihood, EM algorithm, and REML) and illustrate how genomic-enabled predictions are performed under this framework. We illustrate the use of linear mixed models by including several components in the predictor, such as environments, genotypes, and genotype × environment interaction. The linear mixed model is also illustrated under a multi-trait framework, which is important for prediction performance when the degree of correlation between traits is moderate or large. We illustrate the use of single-trait and multi-trait linear mixed models and provide the R codes for performing the analyses.
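For orientation, one common way to write a multi-trait linear mixed model is sketched below. The notation is an assumption made for this illustration and is not taken from the chapter: the phenotype vectors of the t traits are stacked trait by trait into y, all traits are assumed to be recorded on the same n individuals, and G denotes a genomic relationship matrix among those individuals.

```latex
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{u} + \mathbf{e},
\qquad
\mathbf{u} \sim N\!\left(\mathbf{0},\; \boldsymbol{\Sigma}_g \otimes \mathbf{G}\right),
\qquad
\mathbf{e} \sim N\!\left(\mathbf{0},\; \boldsymbol{\Sigma}_e \otimes \mathbf{I}_n\right)
```

Here \Sigma_g and \Sigma_e are the t × t genetic and residual covariance matrices among traits. With t = 1 they reduce to scalar variances and the single-trait model is recovered; the off-diagonal elements of \Sigma_g are what allow a correlated trait to contribute information to the prediction of another.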
Article
Genomic selection (GS) is a promising tool for plant and animal breeding that uses genome-wide molecular marker data to capture small and large effect quantitative trait loci and predict the genetic value of selection candidates. Genomic selection has been shown previously to have higher prediction accuracies than conventional marker-assisted selection (MAS) for quantitative traits. In this study, we compared phenotypic and marker-based prediction accuracy of genetic value for nine different grain quality traits within two biparental soft winter wheat (Triticum aestivum L.) populations. We used a cross-validation approach that trained and validated prediction accuracy across years to evaluate effects of model training population size, training population replication, and marker density. Results showed that prediction accuracy was significantly greater using GS versus MAS for all traits studied and that accuracy for GS reached a plateau at low marker densities (128-256). The average ratio of GS accuracy to phenotypic selection accuracy was 0.66, 0.54, and 0.42 for training population sizes of 96, 48, and 24, respectively. These results provide further empirical evidence that GS could produce greater genetic gain per unit time and cost than both phenotypic selection and conventional MAS in plant breeding with use of year-round nurseries and inexpensive, high-throughput genotyping technology.
Article
Genomic Selection (GS) is a new breeding method in which genome-wide markers are used to predict the breeding value of individuals in a breeding population. GS has been shown to improve breeding efficiency in dairy cattle and several crop plant species, and here we evaluate for the first time its efficacy for breeding inbred lines of rice. We performed a genome-wide association study (GWAS) in conjunction with five-fold GS cross-validation on a population of 363 elite breeding lines from the International Rice Research Institute's (IRRI) irrigated rice breeding program and herein report the GS results. The population was genotyped with 73,147 markers using genotyping-by-sequencing. The training population, statistical method used to build the GS model, number of markers, and trait were varied to determine their effect on prediction accuracy. For all three traits, genomic prediction models outperformed prediction based on pedigree records alone. Prediction accuracies ranged from 0.31 and 0.34 for grain yield and plant height to 0.63 for flowering time. Analyses using subsets of the full marker set suggest that using one marker every 0.2 cM is sufficient for genomic selection in this collection of rice breeding materials. RR-BLUP was the best performing statistical method for grain yield where no large effect QTL were detected by GWAS, while for flowering time, where a single very large effect QTL was detected, the non-GS multiple linear regression method outperformed GS models. For plant height, in which four mid-sized QTL were identified by GWAS, random forest produced the most consistently accurate GS models. Our results suggest that GS, informed by GWAS interpretations of genetic architecture and population structure, could become an effective tool for increasing the efficiency of rice breeding as the costs of genotyping continue to decline.
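The five-fold cross-validation mentioned in this abstract can be sketched as follows, with prediction accuracy taken as the Pearson correlation between predicted and observed values in each held-out fold. This is an illustrative sketch on simulated data; the ridge predictor, the penalty `lam`, and the function name `kfold_accuracy` are assumptions and do not reproduce the study's models.

```python
import numpy as np

def kfold_accuracy(Z, y, lam, k=5, seed=0):
    """Per-fold correlation between ridge predictions and observed phenotypes."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    p = Z.shape[1]
    accs = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        # Centering statistics come from the training fold only, to avoid leakage.
        mu = y[train].mean()
        col_means = Z[train].mean(axis=0)
        Zt = Z[train] - col_means
        beta = np.linalg.solve(Zt.T @ Zt + lam * np.eye(p), Zt.T @ (y[train] - mu))
        pred = mu + (Z[fold] - col_means) @ beta
        accs.append(np.corrcoef(pred, y[fold])[0, 1])
    return np.array(accs)

# Simulated example (hypothetical sizes): 100 lines, 150 markers coded 0/1/2.
rng = np.random.default_rng(42)
n, p = 100, 150
Z = rng.integers(0, 3, size=(n, p)).astype(float)
y = Z @ rng.normal(0.0, 0.2, p) + rng.normal(0.0, 1.0, n)

acc = kfold_accuracy(Z, y, lam=20.0, k=5)
mean_acc = acc.mean()
```

Random fold assignment ignores family or population structure; when related lines fall on both sides of the split, the resulting accuracy can be optimistic.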
Article
Many important traits in plant and animal breeding are polygenic and therefore recalcitrant to traditional marker-assisted selection. Genomic selection addresses this complexity by including all markers in the prediction model. A key method for the genomic prediction of breeding values is ridge regression (RR), which is equivalent to BLUP when the genetic covariance between lines is proportional to their similarity in genotype space. This additive model can be broadened to include epistatic effects by using other kernels, such as the Gaussian, which represent inner products in a complex feature space. To facilitate the use of RR and non-additive kernels by breeders, a new software package for R called rrBLUP has been developed. At its core is a fast maximum-likelihood algorithm for mixed models with a single variance component besides the residual error, which allows for efficient prediction with unreplicated training data. Use of the rrBLUP software is demonstrated through several examples, including the identification of optimal crosses based on superior progeny value. In cross-validation tests, the prediction accuracy with non-additive kernels was significantly higher than RR for wheat grain yield but equivalent for several maize traits.
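The stated equivalence between ridge regression on markers and BLUP with a marker-based covariance can be checked numerically. The sketch below uses plain NumPy rather than the rrBLUP package; the simulated data, the penalty `lam`, and the unscaled kernel K = ZZ' are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 30, 200, 25.0

# Centered marker matrix (0/1/2 coding) and centered phenotypes.
Z = rng.integers(0, 3, size=(n, p)).astype(float)
Z -= Z.mean(axis=0)
y = rng.normal(size=n)
yc = y - y.mean()

# Ridge regression: estimate p marker effects, then sum them per line.
beta = np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ yc)
g_rr = Z @ beta

# Kernel (BLUP-type) form: predict n genetic values directly from K = ZZ'.
K = Z @ Z.T
g_blup = K @ np.linalg.solve(K + lam * np.eye(n), yc)

max_diff = np.max(np.abs(g_rr - g_blup))
```

The two vectors agree up to floating-point error because Z(Z'Z + λI)⁻¹Z' = ZZ'(ZZ' + λI)⁻¹. Rescaling K by a constant c leaves the predictions unchanged provided λ is rescaled by the same constant, which is why the covariance only needs to be proportional to the genotypic similarity. The kernel form solves an n × n system instead of a p × p one, which is cheaper when markers greatly outnumber lines.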
Article
Background Advances in genotyping technology, such as genotyping by sequencing (GBS), are making genomic prediction more attractive to reduce breeding cycle times and costs associated with phenotyping. Genomic prediction and selection has been studied in several crop species, but no reports exist in soybean. The objectives of this study were to (i) evaluate prospects for genomic selection using GBS in a typical soybean breeding program and (ii) evaluate the effect of GBS marker selection and imputation on genomic prediction accuracy. To achieve these objectives, a set of soybean lines sampled from the University of Nebraska Soybean Breeding Program were genotyped using GBS and evaluated for yield and other agronomic traits at multiple Nebraska locations. Results Genotyping by sequencing scored 16,502 single nucleotide polymorphisms (SNPs) with minor-allele frequency (MAF) > 0.05 and percentage of missing values ≤ 5% on 301 elite soybean breeding lines. When SNPs with up to 80% missing values were included, 52,349 SNPs were scored. Prediction accuracy for grain yield, assessed using cross validation, was estimated to be 0.64, indicating good potential for using genomic selection for grain yield in soybean. Filtering SNPs based on missing data percentage had little to no effect on prediction accuracy, especially when random forest imputation was used to impute missing values. The highest accuracies were observed when random forest imputation was used on all SNPs, but differences were not significant. A standard additive G-BLUP model was robust; modeling additive-by-additive epistasis did not provide any improvement in prediction accuracy. The effect of training population size on accuracy began to plateau around 100, but accuracy steadily climbed until the largest possible size was used in this analysis. Including only SNPs with MAF > 0.30 provided higher accuracies when training populations were smaller.
Conclusions Using GBS for genomic prediction in soybean holds good potential to expedite genetic gain. Our results suggest that standard additive G-BLUP models can be used on unfiltered, imputed GBS data without loss in accuracy.
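The marker filter described in the Results (MAF > 0.05 and at most 5% missing values) can be sketched as below. The genotype coding (0/1/2 copies of one allele, `NaN` for a missing call), the function name `filter_snps`, and the toy matrix are assumptions for illustration, not the study's pipeline.

```python
import numpy as np

def filter_snps(G, maf_min=0.05, max_missing=0.05):
    """Boolean mask of SNP columns passing the MAF and missing-rate thresholds."""
    observed = ~np.isnan(G)
    n_obs = observed.sum(axis=0)
    missing_rate = 1.0 - n_obs / G.shape[0]
    # Frequency of the counted allele among non-missing calls (two alleles per call).
    freq = np.full(G.shape[1], np.nan)
    np.divide(np.nansum(G, axis=0), 2.0 * n_obs, out=freq, where=n_obs > 0)
    maf = np.minimum(freq, 1.0 - freq)
    return (maf > maf_min) & (missing_rate <= max_missing)

# Toy genotype matrix: 4 lines x 4 SNPs.
G = np.array([
    [0.0, 2.0, 0.0, np.nan],
    [1.0, 2.0, 0.0, np.nan],
    [2.0, 2.0, 0.0, 1.0],
    [1.0, 2.0, 1.0, 0.0],
])
keep = filter_snps(G)
G_filtered = G[:, keep]
```

In this toy matrix the second SNP is monomorphic and the fourth has half of its calls missing, so only the first and third columns are retained.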
Article
Simulation and empirical studies of genomic selection (GS) show accuracies sufficient to generate rapid genetic gains. However, with the increased popularity of GS approaches, numerous models have been proposed and no comparative analysis is available to identify the most promising ones. Using eight wheat (Triticum aestivum L.), barley (Hordeum vulgare L.), Arabidopsis thaliana (L.) Heynh., and maize (Zea mays L.) datasets, the predictive ability of currently available GS models along with several machine learning methods was evaluated by comparing accuracies, the genomic estimated breeding values (GEBVs), and the marker effects for each model. While a similar level of accuracy was observed for many models, the level of overfitting varied widely as did the computation time and the distribution of marker effect estimates. Our comparisons suggested that GS in plant breeding programs could be based on a reduced set of models such as the Bayesian Lasso, weighted Bayesian shrinkage regression (wBSR, a fast version of BayesB), and random forest (RF) (a machine learning method that could capture nonadditive effects). Linear combinations of different models were tested as well as bagging and boosting methods, but they did not improve accuracy. This study also showed large differences in accuracy between subpopulations within a dataset that could not always be explained by differences in phenotypic variance and size. The broad diversity of empirical datasets tested here adds evidence that GS could increase genetic gain per unit of time and cost.
Article
Background Marker-assisted selection (MAS) and genomic selection (GS) based on genome-wide marker data provide powerful tools to predict the genotypic value of selection material in plant breeding. However, case-to-case optimization of these approaches is required to achieve maximum accuracy of prediction with reasonable input. Results Based on extended field evaluation data for grain yield, plant height, starch content and total pentosan content of elite hybrid rye derived from testcrosses involving two bi-parental populations that were genotyped with 1048 molecular markers, we compared the accuracy of prediction of MAS and GS in a cross-validation approach. MAS delivered generally lower and in addition potentially over-estimated accuracies of prediction than GS by ridge regression best linear unbiased prediction (RR-BLUP). The grade of relatedness of the plant material included in the estimation and test sets clearly affected the accuracy of prediction of GS. Within each of the two bi-parental populations, accuracies differed depending on the relatedness of the respective parental lines. Across populations, accuracy increased when both populations contributed to estimation and test set. In contrast, accuracy of prediction based on an estimation set from one population to a test set from the other population was low despite that the two bi-parental segregating populations under scrutiny shared one parental line. Limiting the number of locations or years in field testing reduced the accuracy of prediction of GS equally, supporting the view that to establish robust GS calibration models a sufficient number of test locations is of similar importance as extended testing for more than one year. Conclusions In hybrid rye, genomic selection is superior to marker-assisted selection. 
However, it achieves high accuracies of prediction only for selection candidates closely related to the plant material evaluated in field trials, resulting in a rather pessimistic prognosis for distantly related material. Both, the numbers of evaluation locations and testing years in trials contribute equally to prediction accuracy.
Article
We propose a regularized method for multivariate linear regression when the number of predictors may exceed the sample size. This method is designed to strengthen the estimation and the selection of the relevant input features with three ingredients: it takes advantage of the dependency pattern between the responses by estimating the residual covariance; it performs selection on direct links between predictors and responses; and selection is driven by prior structural information. To this end, we build on a recent reformulation of the multivariate linear regression model to a conditional Gaussian graphical model and propose a new regularization scheme accompanied with an efficient optimization procedure. On top of showing very competitive performance on artificial and real data sets, our method demonstrates capabilities for fine interpretation of its parameters, as illustrated in applications to genetics, genomics and spectroscopy.
Article
With the development of molecular marker technology, crop breeding has been accelerated by marker-assisted selection for the improvement of quantitative traits. However, due to the traits' polygenic nature, traditional marker-assisted selection methods are ill-suited for identification of quantitative trait loci. Genomic selection (GS) was introduced into crop breeding to achieve more accurate predictions by considering all genes or markers simultaneously. We used dozens of sequence-characterized amplified region (SCAR) markers for genotyping soybean varieties, and we identified markers associated with hundred-seed weight. The best linear unbiased predictor and Bayesian linear regression methods were used to construct GS models to predict the hundred-seed weight trait based upon genotype information for trait selection. Both GS models showed good prediction performance in soybean, as the correlation coefficient between genomic estimated breeding values and true breeding values was as high as 0.904. This indicated that GS was performed effectively based on dozens of SCAR markers in soybean; these markers were of low density but easily detectable. Therefore, the combination of GS modeling and highly effective molecular marker technology involving SCAR markers can facilitate genetic breeding in soybean. This approach may also be suitable for genetic selection in other crops, such as wheat, maize, and rice.
Article
Genomic prediction is expected to considerably increase genetic gains by increasing selection intensity and accelerating the breeding cycle. In this study, marker effects estimated in 255 diverse maize (Zea mays L.) hybrids were used to predict grain yield, anthesis date, and anthesis-silking interval within the diversity panel and testcross progenies of 30 F2-derived lines from each of five populations. Although up to 25% of the genetic variance could be explained by cross validation within the diversity panel, the prediction of testcross performance of F2-derived lines using marker effects estimated in the diversity panel was on average zero. Hybrids in the diversity panel could be grouped into eight breeding populations differing in mean performance. When performance was predicted separately for each breeding population on the basis of marker effects estimated in the other populations, predictive ability was low (i.e., 0.12 for grain yield). These results suggest that prediction resulted mostly from differences in mean performance of the breeding populations and less from the relationship between the training and validation sets or linkage disequilibrium with causal variants underlying the predicted traits. Potential uses for genomic prediction in maize hybrid breeding are discussed emphasizing the need of (1) a clear definition of the breeding scenario in which genomic prediction should be applied (i.e., prediction among or within populations), (2) a detailed analysis of the population structure before performing cross validation, and (3) larger training sets with strong genetic relationship to the validation set.
Article
Genome-based prediction of genetic values is expected to overcome shortcomings that limit the application of QTL mapping and marker-assisted selection in plant breeding. Our goal was to study the genome-based prediction of test cross performance with genetic effects that were estimated using genotypes from the preceding breeding cycle. In particular, our objectives were to employ a ridge regression approach that approximates best linear unbiased prediction of genetic effects, compare cross validation with validation using genetic material of the subsequent breeding cycle, and investigate the prospects of genome-based prediction in sugar beet breeding. We focused on the traits sugar content and standard molasses loss (ML) and used a set of 310 sugar beet lines to estimate genetic effects at 384 SNP markers. In cross validation, correlations >0.8 between observed and predicted test cross performance were observed for both traits. However, in validation with 56 lines from the next breeding cycle, a correlation of 0.8 could only be observed for sugar content; for standard ML the correlation dropped to 0.4. We found that ridge regression based on preliminary estimates of the heritability provided a very good approximation of best linear unbiased prediction and was not accompanied by a loss in prediction accuracy. We conclude that prediction accuracy assessed with cross validation within one cycle of a breeding program cannot be used as an indicator for the accuracy of predicting lines of the next cycle. Prediction of lines of the next cycle seems promising for traits with high heritabilities.
Article
Genomic-enabled prediction is becoming increasingly important in animal and plant breeding, and is also receiving attention in human genetics. Deriving accurate predictions of complex traits requires implementing whole-genome regression (WGR) models where phenotypes are regressed on thousands of markers concurrently. Following the groundbreaking contribution of Meuwissen et al. (2001), several methods have been proposed and evaluated, and genome-enabled selection (GS) is being implemented in several plant and animal breeding programs. The list of methods is long, and the relationships between the available methods have not been fully addressed. In this article we provide an overview of available methods for implementing parametric WGR models, discuss selected topics which emerge in the application of these methods and present a general discussion of lessons learnt from simulation and empirical data analysis in the last decade.
Article
kernlab is an extensible package for kernel-based machine learning methods in R. It takes advantage of R's new S4 object model and provides a framework for creating and using kernel-based algorithms. The package contains dot product primitives (kernels), implementations of support vector machines and the relevance vector machine, Gaussian processes, a ranking algorithm, kernel PCA, kernel CCA, and a spectral clustering algorithm. Moreover it provides a general purpose quadratic programming solver, and an incomplete Cholesky decomposition method.
Article
Despite rapid advances in genomic technology, our ability to account for phenotypic variation using genetic information remains limited for many traits. This has unfortunately resulted in limited application of genetic data towards preventive and personalized medicine, one of the primary impetuses of genome-wide association studies. Recently, a large proportion of the "missing heritability" for human height was statistically explained by modeling thousands of single nucleotide polymorphisms concurrently. However, it is currently unclear how gains in explained genetic variance will translate to the prediction of yet-to-be observed phenotypes. Using data from the Framingham Heart Study, we explore the genomic prediction of human height in training and validation samples while varying the statistical approach used, the number of SNPs included in the model, the validation scheme, and the number of subjects used to train the model. In our training datasets, we are able to explain a large proportion of the variation in height (h² up to 0.83, R² up to 0.96). However, the proportion of variance accounted for in validation samples is much smaller (ranging from 0.15 to 0.36 depending on the degree of familial information used in the training dataset). While such R² values vastly exceed what has been previously reported using a reduced number of pre-selected markers (<0.10), given the heritability of the trait (~0.80), substantial room for improvement remains.
Article
Genome-enabled prediction of breeding values using high-density panels (HDP) can be highly accurate, even for young sires. However, the cost of the assay may limit its use to elite animals only. Low-density panels (LDP) containing a subset of single nucleotide polymorphisms (SNP) may give reasonably accurate predictions and could be used cost-effectively with young males and females. This study evaluates strategies for selecting subsets of SNP for several traits, compares predictive ability of LDP with that of HDP, and assesses the benefits of including parent average (PA) as a predictor in models using LDP. Data consisting of progeny-test predicted transmitting ability (PTA) for net merit and 6 other traits of economic interest from 4,783 Holstein sires were evaluated using testing and training sets with regressions on their high-density genotypes and parent averages for net merit index. Additionally, SNP subsets of different sizes were selected using different strategies, including the "best" SNP based on the absolute values of their estimated effects from HDP models for either the trait itself or lifetime net merit, and evenly spaced (ES) SNP across the genome. Overall, HDP models had the best predictive ability, setting an upper bound for the predictive ability of LDP sets. Low-density panels targeting the SNP with strongest effects (for either a single trait or lifetime net merit) provided reasonably accurate predictions and generally outperformed predictions based on evenly spaced SNP. For example, evenly spaced sets would require at least 5,000 to 7,500 SNP to reach 95% of the predictive ability provided by HDP. On the other hand, this level of predictive ability can be achieved with sets of 2,000 SNP when SNP are selected based on magnitude of estimated effects for the trait. 
Accuracy of predictions based on LDP can be improved markedly by including parent average as a fixed effect in the model; for example, a set with the 1,000 best SNP using the parent average achieved 95% of the accuracy of an HDP model.
Article
We develop fast algorithms for estimation of generalized linear models with convex penalties. The models include linear regression, two-class logistic regression, and multinomial regression problems while the penalties include ℓ_1 (the lasso), ℓ_2 (ridge regression) and mixtures of the two (the elastic net). The algorithms use cyclical coordinate descent, computed along a regularization path. The methods can handle large problems and can also deal efficiently with sparse features. In comparative timings we find that the new algorithms are considerably faster than competing methods.
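The cyclical coordinate descent idea in this abstract can be illustrated for the plain lasso case. The following is a minimal sketch, not the authors' implementation: the function names, the objective scaling (1/(2n) times the residual sum of squares) and the stopping rule are assumptions made for illustration, and it omits the regularization path, warm starts and sparse-feature handling that the abstract mentions.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator: sign(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=200, tol=1e-8):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 one coordinate at a time.

    Hypothetical helper for illustration; no intercept, no standardisation.
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n      # curvature of the loss along each coordinate
    residual = y - X @ beta
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue                    # an all-zero column carries no information
            # Covariance of column j with the partial residual (beta_j's own fit added back)
            rho = X[:, j] @ residual / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                residual -= delta * X[:, j]  # keep the residual in sync with beta
                beta[j] = new_bj
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta
```

With `lam` at or above `max(|X'y|) / n` every coefficient stays at zero, and with `lam = 0` on a well-conditioned `X` the sweeps converge to the least-squares fit, which gives two easy sanity checks.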
Article
The availability of dense molecular markers has made possible the use of genomic selection (GS) for plant breeding. However, the evaluation of models for GS in real plant populations is very limited. This article evaluates the performance of parametric and semiparametric models for GS using wheat (Triticum aestivum L.) and maize (Zea mays) data in which different traits were measured in several environmental conditions. The findings, based on extensive cross-validations, indicate that models including marker information had higher predictive ability than pedigree-based models. In the wheat data set, and relative to a pedigree model, gains in predictive ability due to inclusion of markers ranged from 7.7 to 35.7%. Correlation between observed and predictive values in the maize data set achieved values up to 0.79. Estimates of marker effects were different across environmental conditions, indicating that genotype × environment interaction is an important component of genetic variability. These results indicate that GS in plant breeding can be an effective strategy for selecting among lines whose phenotypes have yet to be observed.
Article
The rapid increase in high-throughput single-nucleotide polymorphism data has led to a great interest in applying genome-wide evaluation methods to identify an individual's genetic merit. Genome-wide evaluation combines statistical methods with genomic data to predict genetic values for complex traits. Considerable uncertainty currently exists in determining which genome-wide evaluation method is the most appropriate. We hypothesize that genome-wide methods deal differently with the genetic architecture of quantitative traits and genomes. A genomic linear method (GBLUP) and a genomic nonlinear Bayesian variable selection method (BayesB) are compared using stochastic simulation across three effective population sizes and a wide range of numbers of quantitative trait loci (N(QTL)). GBLUP had a constant accuracy, for a given heritability and sample size, regardless of N(QTL). BayesB had a higher accuracy than GBLUP when N(QTL) was low, but this advantage diminished as N(QTL) increased and when N(QTL) became large, GBLUP slightly outperformed BayesB. In addition, deterministic equations are extended to predict the accuracy of both methods and to estimate the number of independent chromosome segments (M(e)) and N(QTL). The predictions of accuracy and estimates of M(e) and N(QTL) were generally in good agreement with results from simulated data. We conclude that the relative accuracy of GBLUP and BayesB for a given number of records and heritability is highly dependent on M(e), which is a property of the target genome, as well as the architecture of the trait (N(QTL)).
Article
This study investigated the reliability of genomic estimated breeding values (GEBV) in the Danish Holstein population. The data in the analysis included 3,330 bulls with both published conventional EBV and single nucleotide polymorphism (SNP) markers. After data editing, 38,134 SNP markers were available. In the analysis, all SNP were fitted simultaneously as random effects in a Bayesian variable selection model, which allows heterogeneous variances for different SNP markers. The response variables were the official EBV. Direct GEBV were calculated as the sum of individual SNP effects. Initial analyses of 4 index traits were carried out to compare models with different intensities of shrinkage for SNP effects; that is, mixture prior distributions of scaling factors (standard deviation of SNP effects) assuming 5, 10, 20, or 50% of SNP having large effects and the others having very small or no effects, and a single prior distribution common for all SNP. It was found that, in general, the model with a common prior distribution of scaling factors had better predictive ability than any mixture prior models. Therefore, a common prior model was used to estimate SNP effects and breeding values for all 18 index traits. Reliability of GEBV was assessed by squared correlation between GEBV and conventional EBV (r(2)(GEBV, EBV)), and expected reliability was obtained from prediction error variance using a 5-fold cross validation. Squared correlations between GEBV and published EBV (without any adjustment) ranged from 0.252 to 0.700, with an average of 0.418. Expected reliabilities ranged from 0.494 to 0.733, with an average of 0.546. Averaged over 18 traits, r(2)(GEBV, EBV) was 0.13 higher and expected reliability was 0.26 higher than reliability of conventional parent average. The results indicate that genomic selection can greatly improve the accuracy of preselection for young bulls compared with traditional selection based on parent average information.
Article
We used a least absolute shrinkage and selection operator (LASSO) approach to estimate marker effects for genomic selection. The least angle regression (LARS) algorithm and cross-validation were used to define the best subset of markers to include in the model. The LASSO-LARS approach was tested on two data sets: a simulated data set with 5865 individuals and 6000 Single Nucleotide Polymorphisms (SNPs); and a mouse data set with 1885 individuals genotyped for 10 656 SNPs and phenotyped for a number of quantitative traits. In the simulated data, three approaches were used to split the reference population into training and validation subsets for cross-validation: random splitting across the whole population; random sampling of validation set from the last generation only, either within or across families. The highest accuracy was obtained by random splitting across the whole population. The accuracy of genomic estimated breeding values (GEBVs) in the candidate population obtained by LASSO-LARS was 0.89 with 156 explanatory SNPs. This value was higher than those obtained by Best Linear Unbiased Prediction (BLUP) and a Bayesian method (BayesA), which were 0.75 and 0.84, respectively. In the mouse data, 1600 individuals were randomly allocated to the reference population. The GEBVs for the remaining 285 individuals estimated by LASSO-LARS were more accurate than those obtained by BLUP and BayesA for weight at six weeks and slightly lower for growth rate and body length. It was concluded that LASSO-LARS approach is a good alternative method to estimate marker effects for genomic selection, particularly when the cost of genotyping can be reduced by using a limited subset of markers.
Article
Recent developments in SNP discovery and high throughput genotyping technology have made the use of high-density SNP markers to predict breeding values feasible. This involves estimation of the SNP effects in a training data set, and use of these estimates to evaluate the breeding values of other 'evaluation' individuals. Simulation studies have shown that these predictions of breeding values can be accurate, when training and evaluation individuals are (closely) related. However, many general applications of genomic selection require the prediction of breeding values of 'unrelated' individuals, i.e. individuals from the same population, but not particularly closely related to the training individuals. Accuracy of selection was investigated by computer simulation of small populations. Using scaling arguments, the results were extended to different populations, training data sets and genome sizes, and different trait heritabilities. Prediction of breeding values of unrelated individuals required a substantially higher marker density and number of training records than when prediction individuals were offspring of training individuals. However, when the number of records was 2*Ne*L and the number of markers was 10*Ne*L, the breeding values of unrelated individuals could be predicted with accuracies of 0.88-0.93, where Ne is the effective population size and L the genome size in Morgan. Reducing this requirement to 1*Ne*L individuals reduced prediction accuracies to 0.73-0.83. For livestock populations, 1NeL requires approximately 30,000 training records, but this may be reduced if training and evaluation animals are related. A prediction equation is presented that predicts accuracy when training and evaluation individuals are related. For humans, 1NeL requires approximately 350,000 individuals, which means that human disease risk prediction is possible only for diseases that are determined by a limited number of genes. 
Otherwise, genotyping and phenotypic recording need to become very common in the future.
Article
Genetic progress will increase when breeders examine genotypes in addition to pedigrees and phenotypes. Genotypes for 38,416 markers and August 2003 genetic evaluations for 3,576 Holstein bulls born before 1999 were used to predict January 2008 daughter deviations for 1,759 bulls born from 1999 through 2002. Genotypes were generated using the Illumina BovineSNP50 BeadChip and DNA from semen contributed by US and Canadian artificial-insemination organizations to the Cooperative Dairy DNA Repository. Genomic predictions for 5 yield traits, 5 fitness traits, 16 conformation traits, and net merit were computed using a linear model with an assumed normal distribution for marker effects and also using a nonlinear model with a heavier tailed prior distribution to account for major genes. The official parent average from 2003 and a 2003 parent average computed from only the subset of genotyped ancestors were combined with genomic predictions using a selection index. Combined predictions were more accurate than official parent averages for all 27 traits. The coefficients of determination (R(2)) were 0.05 to 0.38 greater with nonlinear genomic predictions included compared with those from parent average alone. Linear genomic predictions had R(2) values similar to those from nonlinear predictions but averaged just 0.01 lower. The greatest benefits of genomic prediction were for fat percentage because of a known gene with a large effect. The R(2) values were converted to realized reliabilities by dividing by mean reliability of 2008 daughter deviations and then adding the difference between published and observed reliabilities of 2003 parent averages. When averaged across all traits, combined genomic predictions had realized reliabilities that were 23% greater than reliabilities of parent averages (50 vs. 27%), and gains in information were equivalent to 11 additional daughter records. 
Reliability increased more by doubling the number of bulls genotyped than the number of markers genotyped. Genomic prediction improves reliability by tracing the inheritance of genes even with small effects.
Article
We propose a new method for estimation in linear models. The ‘lasso’ minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant. Because of the nature of this constraint it tends to produce some coefficients that are exactly 0 and hence gives interpretable models. Our simulation studies suggest that the lasso enjoys some of the favourable properties of both subset selection and ridge regression. It produces interpretable models like subset selection and exhibits the stability of ridge regression. There is also an interesting relationship with recent work in adaptive function estimation by Donoho and Johnstone. The lasso idea is quite general and can be applied in a variety of statistical models: extensions to generalized regression models and tree‐based models are briefly described.
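The constrained problem described in words above can be written out as follows. The notation is assumed for illustration (the abstract itself gives no symbols): N observations, p predictors x_ij, intercept β_0, and a bound t ≥ 0 on the sum of absolute coefficients.

```latex
\hat{\beta}^{\text{lasso}}
  = \arg\min_{\beta_0,\,\beta}
    \sum_{i=1}^{N} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^{2}
  \quad \text{subject to} \quad \sum_{j=1}^{p} \lvert \beta_j \rvert \le t .
```

For each t there is a corresponding penalty weight λ ≥ 0 for which the same solution minimizes the residual sum of squares plus λ Σ_j |β_j|; the corner points of the absolute-value constraint are what allow some coefficients to be exactly 0.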
Article
This paper is an exposition of the use of ridge regression methods. Two examples from the literature are used as a base. Attention is focused on the RIDGE TRACE which is a two-dimensional graphical procedure for portraying the complex relationships in multifactor data. Recommendations are made for obtaining a better regression equation than that given by ordinary least squares estimation.
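A ridge trace can be approximated numerically by recomputing the ridge coefficients over a grid of biasing parameters k and plotting each coefficient against k. The sketch below is illustrative only; the function names are hypothetical, and it assumes the columns of `X` and the response `y` have already been centred and scaled.

```python
import numpy as np

def ridge_coefficients(X, y, k):
    """Ridge estimate (X'X + k I)^{-1} X'y for a biasing parameter k >= 0."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

def ridge_trace(X, y, ks):
    """One row per value of k, one column per coefficient.

    Plotting each column against ks gives the two-dimensional ridge trace.
    """
    return np.array([ridge_coefficients(X, y, k) for k in ks])
```

At k = 0 the first row is the ordinary least squares solution, and the length of the coefficient vector shrinks as k grows, so the trace shows how quickly individual coefficients stabilise.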
Article
We propose a procedure for constructing a sparse estimator of a multivariate regression coefficient matrix that accounts for correlation of the response variables. This method, which we call multivariate regression with covariance estimation (MRCE), involves penalized likelihood with simultaneous estimation of the regression coefficients and the covariance structure. An efficient optimization algorithm and a fast approximation are developed for computing MRCE. Using simulation studies, we show that the proposed method outperforms relevant competitors when the responses are highly correlated. We also apply the new method to a finance example on predicting asset returns. An R-package containing this dataset and code for computing MRCE and its approximation are available online.
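A penalized-likelihood objective of the kind this abstract describes is commonly written as below. This is a sketch from general knowledge of the method rather than from the abstract, and all symbols are assumptions: Y is the n × q response matrix, X the n × p predictor matrix, B the p × q coefficient matrix, Ω the q × q inverse error covariance, and λ_1, λ_2 ≥ 0 are tuning parameters. The original paper should be consulted for the exact form.

```latex
(\hat{B}, \hat{\Omega})
  = \arg\min_{B,\,\Omega}
    \Bigl\{
      \operatorname{tr}\Bigl[ \tfrac{1}{n} (Y - XB)^{\top} (Y - XB)\, \Omega \Bigr]
      - \log \lvert \Omega \rvert
      + \lambda_1 \sum_{j' \neq j} \lvert \omega_{j'j} \rvert
      + \lambda_2 \sum_{j=1}^{p} \sum_{k=1}^{q} \lvert b_{jk} \rvert
    \Bigr\}
```

The first two terms are the Gaussian negative log-likelihood up to constants; the λ_1 term encourages a sparse inverse covariance among responses, and the λ_2 term encourages a sparse coefficient matrix.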
Article
Fusarium head blight (FHB) resistance is quantitative and difficult to evaluate. Genomic selection (GS) could accelerate FHB resistance breeding. We used U.S. cooperative FHB wheat nursery data to evaluate GS models for several FHB resistance traits including deoxynivalenol (DON) levels. For all traits we compared the models: ridge regression (RR), Bayesian LASSO (BL), reproducing kernel Hilbert spaces (RKHS) regression, random forest (RF) regression, and multiple linear regression (MLR) (fixed effects). For DON, we evaluated additional prediction methods including bivariate RR models, phenotypes for correlated traits, and RF regression models combining markers and correlated phenotypes as predictors. Additionally, for all traits, we compared different marker sets including genomewide markers, FHB quantitative trait loci (QTL) targeted markers, and both sets combined. Genomic selection accuracies were always higher than MLR accuracies, RF and RKHS regression were often the most accurate methods, and for DON, marker plus trait RF regression was more accurate than all other methods. For all traits except DON, using QTL targeted markers alone led to lower accuracies than using genomewide markers. This study indicates that cooperative FHB nursery data can be useful for GS, and prior information about correlated traits and QTL could be used to improve accuracies in some cases.
Article
Methods for predicting hybrid yield would facilitate the identification of superior maize (Zea mays L.) single crosses. Best linear unbiased prediction of the performance of single crosses, based on (i) restriction fragment length polymorphism (RFLP) data on the parental inbreds and (ii) yield data on a related set of single crosses, was evaluated. Yields of m single crosses were predicted as y(M) = C V-1 y(P), where: y(M) = m x 1 vector of predicted yields of missing (i.e., no yield data available) single crosses; C = m x n matrix of genetic covariances between the missing and predictor hybrids; V = n x n matrix of phenotypic variances and covariances among predictor hybrids; and y(P) = n x 1 vector of predictor hybrid yields corrected for trial effects. From a set of 54 single crosses, made between six Iowa Stiff Stalk Synthetic (SSS) and nine non-SSS inbreds, 100 different sets of n = 10, 15, 20, 25, or 30 predictor hybrids were chosen at random. Pooled correlations between predicted and observed yields of the remaining (54 - n) hybrids ranged from 0.654 to 0.800. The correlations were slightly higher when dominance variance was included in the model or when coefficients of coancestry were determined from RFLP rather than pedigree data. The correlations remained relatively stable across different, arbitrary values of genetic variances. The results suggested that single-cross yield can be predicted effectively based on parental RFLP data and yields of a related set of hybrids.
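The prediction formula y(M) = C V-1 y(P) quoted above translates directly into a few lines of linear algebra. This is a minimal sketch with a hypothetical function name; how C and V are built from RFLP or pedigree data is outside its scope, and both matrices are assumed to be supplied by the caller.

```python
import numpy as np

def predict_missing_hybrids(C, V, y_p):
    """Best linear unbiased prediction y_M = C V^{-1} y_P.

    C   : (m, n) genetic covariances between missing and predictor hybrids
    V   : (n, n) phenotypic variances and covariances among predictor hybrids
    y_p : (n,)   predictor hybrid yields, already corrected for trial effects
    """
    # Solving V w = y_p avoids forming the explicit inverse of V
    return C @ np.linalg.solve(V, y_p)
```

One quick check of the algebra: if the "missing" hybrids are the predictors themselves and C is set equal to V, the function returns `y_p` unchanged.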
Article
In the mid-1980s, the development of abundant molecular markers, appropriate statistical procedures, and user-friendly computer software that implemented these statistical procedures permitted the detection of molecular markers associated with quantitative trait loci (QTL) for complex traits. Marker-assisted selection was then proposed as a means of exploiting markers linked to QTL to develop improved cultivars. But while thousands of marker-trait associations have been reported for many traits in different plant species, far fewer examples of successfully exploiting mapped QTL have been reported in the literature. Key lessons learned from applying markers in plant breeding include the following: (i) the purpose of detecting QTL should be clearly defined before embarking on QTL mapping; (ii) procedures for marker-based selection depend on the number of QTL; (iii) estimates of QTL effects for complex traits are often inconsistent; and (iv) gain per unit cost and time rather than gain per cycle should be considered. Future applications for complex traits will likely focus on predictive methodologies for marker-based selection before phenotyping and for marker-based selection without QTL mapping. These applications will take advantage of cheaper costs of genotyping than of phenotyping.
Article
An algorithm is given for selecting the biasing parameter, k, in RIDGE regression. By means of simulation it is shown that the algorithm has the following properties: (i) it produces an averaged squared error for the regression coefficients that is less than least squares, (ii) the distribution of squared errors for the regression coefficients has a smaller variance than does that for least squares, and (iii) regardless of the signal-to-noise ratio the probability that RIDGE produces a smaller squared error than least squares is greater than 0.50.
Article
Applications of genome mapping and marker-assisted selection (MAS) in crop improvement are reviewed. The following aspects are considered: a comparison of the choice of markers available for the generation of linkage maps (including amplified fragment length polymorphisms (AFLP); restriction fragment length polymorphisms (RFLP); randomly amplified polymorphic DNA (RAPD) and simple sequence repeats (SSR)); quantitative trait loci (QTL) analysis; use of molecular markers in the exploitation of hybrid vigour; physical genome mapping; map-based cloning and transposon tagging of agriculturally important genes; synteny in cereal genomes; and the use of MAS in breeding for disease and pest resistance.
Article
Genomic selection is a promising breeding strategy for rapid improvement of complex traits. The objective of our study was to investigate the prediction accuracy of genomic breeding values through cross validation. The study was based on experimental data of six segregating populations from a half-diallel mating design with 788 testcross progenies from an elite maize breeding program. The plants were intensively phenotyped in multi-location field trials and fingerprinted with 960 SNP markers. We used random regression best linear unbiased prediction in combination with fivefold cross validation. The prediction accuracy across populations was higher for grain moisture (0.90) than for grain yield (0.58). The accuracy of genomic selection realized for grain yield corresponds to the precision of phenotyping at unreplicated field trials in 3-4 locations. As for maize up to three generations are feasible per year, selection gain per unit time is high and, consequently, genomic selection holds great promise for maize breeding programs.
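The cross-validation scheme described here can be sketched as follows. This is an illustration under stated assumptions, not the study's pipeline: the function name is hypothetical, marker effects are estimated with a simple ridge-type shrinkage using a user-supplied `lam`, folds are drawn at random rather than by population, and accuracy is taken as the plain correlation between predicted and observed values in each held-out fold (some studies additionally divide by the square root of heritability).

```python
import numpy as np

def kfold_prediction_accuracy(Z, y, lam, k=5, seed=0):
    """Mean held-out correlation between predicted and observed phenotypes.

    Z   : (n, m) marker matrix, e.g. coded 0/1/2
    y   : (n,)   phenotypes
    lam : shrinkage parameter for the marker effects (assumed known)
    """
    rng = np.random.default_rng(seed)
    n, m = Z.shape
    folds = np.array_split(rng.permutation(n), k)
    accuracies = []
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        # Centre markers and phenotypes using the training fold only
        means = Z[train].mean(axis=0)
        mu = y[train].mean()
        Zc = Z[train] - means
        effects = np.linalg.solve(Zc.T @ Zc + lam * np.eye(m), Zc.T @ (y[train] - mu))
        predicted = mu + (Z[test_idx] - means) @ effects
        accuracies.append(np.corrcoef(predicted, y[test_idx])[0, 1])
    return float(np.mean(accuracies)), accuracies
```

Centring with training-fold statistics only keeps information from the validation individuals out of the fitted model, which is the point of the cross-validation design.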
Article
This is the first large-scale experimental study on genome-based prediction of testcross values in an advanced cycle breeding population of maize. The study comprised testcross progenies of 1,380 doubled haploid lines of maize derived from 36 crosses and phenotyped for grain yield and grain dry matter content in seven locations. The lines were genotyped with 1,152 single nucleotide polymorphism markers. Pedigree data were available for three generations. We used best linear unbiased prediction and stratified cross-validation to evaluate the performance of prediction models differing in the modeling of relatedness between inbred lines and in the calculation of genome-based coefficients of similarity. The choice of similarity coefficient did not affect prediction accuracies. Models including genomic information yielded significantly higher prediction accuracies than the model based on pedigree information alone. Average prediction accuracies based on genomic data were high even for a complex trait like grain yield (0.72-0.74) when the cross-validation scheme allowed for a high degree of relatedness between the estimation and the test set. When predictions were performed across distantly related families, prediction accuracies decreased significantly (0.47-0.48). Prediction accuracies decreased with decreasing sample size but were still high when the population size was halved (0.67-0.69). The results from this study are encouraging with respect to genome-based prediction of the genetic value of untested lines in advanced cycle breeding populations and the implementation of genomic selection in the breeding process.
Article
A method is described for the prediction of breeding values incorporating genomic information. The first stage involves the prediction of genomic breeding values for genotyped individuals. A novel component of this is the estimation of the genomic relationship matrix in the context of a multi-breed population. Because not all ancestors of genotyped animals are genotyped, a selection index procedure is used to blend genomic predictions with traditional ancestral information that is lost between the process of deregression of the national breeding values and subsequent re-estimation using the genomic relationship matrix. Finally, the genomically enhanced predictions are filtered through to nongenotyped descendants using a regression procedure.
Article
The objective of the present study was to assess the predictive ability of subsets of single nucleotide polymorphism (SNP) markers for development of low-cost, low-density genotyping assays in dairy cattle. Dense SNP genotypes of 4,703 Holstein bulls were provided by the USDA Agricultural Research Service. A subset of 3,305 bulls born from 1952 to 1998 was used to fit various models (training set), and a subset of 1,398 bulls born from 1999 to 2002 was used to evaluate their predictive ability (testing set). After editing, data included genotypes for 32,518 SNP and August 2003 and April 2008 predicted transmitting abilities (PTA) for lifetime net merit (LNM$), the latter resulting from progeny testing. The Bayesian least absolute shrinkage and selection operator method was used to regress August 2003 PTA on marker covariates in the training set to arrive at estimates of marker effects and direct genomic PTA. The coefficient of determination (R(2)) from regressing the April 2008 progeny test PTA of bulls in the testing set on their August 2003 direct genomic PTA was 0.375. Subsets of 300, 500, 750, 1,000, 1,250, 1,500, and 2,000 SNP were created by choosing equally spaced and highly ranked SNP, with the latter based on the absolute value of their estimated effects obtained from the training set. The SNP effects were re-estimated from the training set for each subset of SNP, and the 2008 progeny test PTA of bulls in the testing set were regressed on corresponding direct genomic PTA. The R(2) values for subsets of 300, 500, 750, 1,000, 1,250, 1,500, and 2,000 SNP with largest effects (evenly spaced SNP) were 0.184 (0.064), 0.236 (0.111), 0.269 (0.190), 0.289 (0.179), 0.307 (0.228), 0.313 (0.268), and 0.322 (0.291), respectively. 
These results indicate that a low-density assay comprising selected SNP could be a cost-effective alternative for selection decisions and that significant gains in predictive ability may be achieved by increasing the number of SNP allocated to such an assay from 300 or fewer to 1,000 or more.
Article
A new technology called genomic selection is revolutionizing dairy cattle breeding. Genomic selection refers to selection decisions based on genomic breeding values (GEBV). The GEBV are calculated as the sum of the effects of dense genetic markers, or haplotypes of these markers, across the entire genome, thereby potentially capturing all the quantitative trait loci (QTL) that contribute to variation in a trait. The QTL effects, inferred from either haplotypes or individual single nucleotide polymorphism markers, are first estimated in a large reference population with phenotypic information. In subsequent generations, only marker information is required to calculate GEBV. The reliability of GEBV predicted in this way has already been evaluated in experiments in the United States, New Zealand, Australia, and the Netherlands. These experiments used reference populations of between 650 and 4,500 progeny-tested Holstein-Friesian bulls, genotyped for approximately 50,000 genome-wide markers. Reliabilities of GEBV for young bulls without progeny test results in the reference population were between 20 and 67%. The reliability achieved depended on the heritability of the trait evaluated, the number of bulls in the reference population, the statistical method used to estimate the single nucleotide polymorphism effects in the reference population, and the method used to calculate the reliability. A common finding in 3 countries (United States, New Zealand, and Australia) was that a straightforward BLUP method for estimating the marker effects gave reliabilities of GEBV almost as high as more complex methods. The BLUP method is attractive because the only prior information required is the additive genetic variance of the trait. All countries included a polygenic effect (parent average breeding value) in their GEBV calculation. 
This inclusion is recommended to capture any genetic variance not associated with the markers, and to put some selection pressure on low-frequency QTL that may not be captured by the markers. The reliabilities of GEBV achieved were significantly greater than the reliability of parental average breeding values, the current criteria for selection of bull calves to enter progeny test teams. The increase in reliability is sufficiently high that at least 2 dairy breeding companies are already marketing bull teams for commercial use based on their GEBV only, at 2 yr of age. This strategy should at least double the rate of genetic gain in the dairy industry. Many challenges with genomic selection and its implementation remain, including increasing the accuracy of GEBV, integrating genomic information into national and international genetic evaluations, and managing long-term genetic gain.
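The two-step procedure outlined in this abstract (estimate marker effects in a reference population, then sum them for candidates that have only genotypes) can be sketched as below. The function names are hypothetical, the shrinkage parameter `lam` stands in for the ratio of residual to per-marker variance and is assumed to be supplied, and the polygenic (parent average) effect recommended in the abstract is left out for brevity.

```python
import numpy as np

def snp_blup_effects(Z, y, lam):
    """Estimate marker effects in a reference population by ridge-type BLUP.

    Z   : (n, m) genotypes of reference animals, e.g. coded 0/1/2
    y   : (n,)   their phenotypes (or deregressed breeding values)
    lam : shrinkage parameter, assumed known
    """
    mu = y.mean()
    col_means = Z.mean(axis=0)
    Zc = Z - col_means
    m = Z.shape[1]
    effects = np.linalg.solve(Zc.T @ Zc + lam * np.eye(m), Zc.T @ (y - mu))
    return mu, col_means, effects

def gebv(Z_new, col_means, effects):
    """GEBV of candidates: the sum of estimated marker effects for their genotypes."""
    return (Z_new - col_means) @ effects
```

Only `Z_new` is needed for the candidates, which mirrors the statement that in subsequent generations marker information alone is required to calculate GEBV.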
Article
Recent advances in molecular genetic techniques will make dense marker maps available and genotyping many individuals for these markers feasible. Here we attempted to estimate the effects of approximately 50,000 marker haplotypes simultaneously from a limited number of phenotypic records. A genome of 1000 cM was simulated with a marker spacing of 1 cM. The markers surrounding every 1-cM region were combined into marker haplotypes. Due to finite population size N(e) = 100, the marker haplotypes were in linkage disequilibrium with the QTL located between the markers. Using least squares, all haplotype effects could not be estimated simultaneously. When only the biggest effects were included, they were overestimated and the accuracy of predicting genetic values of the offspring of the recorded animals was only 0.32. Best linear unbiased prediction of haplotype effects assumed equal variances associated to each 1-cM chromosomal segment, which yielded an accuracy of 0.73, although this assumption was far from true. Bayesian methods that assumed a prior distribution of the variance associated with each chromosome segment increased this accuracy to 0.85, even when the prior was not correct. It was concluded that selection on genetic values predicted from markers could substantially increase the rate of genetic gain in animals and plants, especially if combined with reproductive techniques to shorten the generation interval.
Article
Animals can be genotyped for thousands of single nucleotide polymorphisms (SNPs) at one time, where the SNPs are located at roughly 1-cM intervals throughout the genome. For each contiguous pair of SNPs there are four possible haplotypes that could be inherited from the sire. The effects of each interval on a trait can be estimated for all intervals simultaneously in a model where interval effects are random factors. Given the estimated effects of each haplotype for every interval in the genome, and given an animal's genotype, a 'genomic' estimated breeding value is obtained by summing the estimated effects for that genotype. The accuracy of that estimator of breeding values is around 80%. Because the genomic estimated breeding values can be calculated at birth, and because it has a high accuracy, a strategy that utilizes these advantages was compared with a traditional progeny testing strategy under a typical Canadian-like dairy cattle situation. Costs of proving bulls were reduced by 92% and genetic change was increased by a factor of 2. Genome-wide selection may become a popular tool for genetic improvement in livestock.
Article
The purpose of model selection algorithms such as All Subsets, Forward Selection, and Backward Elimination is to choose a linear model on the basis of the same set of data to which the model will be applied. Typically we have available a large collection of possible covariates from which we hope to select a parsimonious set for the efficient prediction of a response variable. Least Angle Regression ("LARS"), a new model selection algorithm, is a useful and less greedy version of traditional forward selection methods.
de los Campos, G., and Rodriguez, P. 2012. BLR: Bayesian linear regression. R package version 1.3. Plant Genome 4250-4255.
Rencher, A., and Schaalje, G. Linear Models in Statistics.