Page 1

BioMed Central

Page 1 of 16

(page number not for citation purposes)

Genetics Selection Evolution

Open Access

Research

A comparison of five methods to predict genomic breeding values

of dairy bulls from genome-wide SNP markers

Gerhard Moser*1,2, Bruce Tier1,3, Ron E Crump1,3, Mehar S Khatkar1,4 and

Herman W Raadsma1,4

Address: 1The CRC for Innovative Dairy Products, Australia, 2Bellbowrie, QLD 4070, Australia, 3Animal Breeding and Genetics Unit, University of

New England, Armidale NSW 2351, Australia and 4ReproGen - Advanced Technologies in Animal Genetics and Reproduction, Faculty of Veterinary

Science, University of Sydney, 425 Werombi Road, Camden NSW 2570, Australia

Email: Gerhard Moser* - gerhard.moser@bigpond.com; Bruce Tier - btier@une.edu.au; Ron E Crump - ron.crump@une.edu.au;

Mehar S Khatkar - m.khatkar@usyd.edu.au; Herman W Raadsma - h.raadsma@usyd.edu.au

* Corresponding author

Abstract

Background: Genomic selection (GS) uses molecular breeding values (MBV) derived from dense

markers across the entire genome for selection of young animals. The accuracy of MBV prediction is

important for a successful application of GS. Recently, several methods have been proposed to estimate

MBV. Initial simulation studies have shown that these methods can accurately predict MBV. In this study

we compared the accuracies and possible bias of five different regression methods in an empirical

application in dairy cattle.

Methods: Genotypes of 7,372 SNP and highly accurate EBV of 1,945 dairy bulls were used to predict MBV

for protein percentage (PPT) and a profit index (Australian Selection Index, ASI). Marker effects were

estimated by least squares regression (FR-LS), Bayesian regression (Bayes-R), random regression best

linear unbiased prediction (RR-BLUP), partial least squares regression (PLSR) and nonparametric support

vector regression (SVR) in a training set of 1,239 bulls. Accuracy and bias of MBV prediction were

calculated from cross-validation of the training set and tested against a test team of 706 young bulls.

Results: For both traits, FR-LS using a subset of SNP was significantly less accurate than all other methods

which used all SNP. Accuracies obtained by Bayes-R, RR-BLUP, PLSR and SVR were very similar for ASI

(0.39-0.45) and for PPT (0.55-0.61). Overall, SVR gave the highest accuracy.

All methods resulted in biased MBV predictions for ASI, for PPT only RR-BLUP and SVR predictions were

unbiased. A significant decrease in accuracy of prediction of ASI was seen in young test cohorts of bulls

compared to the accuracy derived from cross-validation of the training set. This reduction was not

apparent for PPT. Combining MBV predictions with pedigree based predictions gave 1.05 - 1.34 times

higher accuracies compared to predictions based on pedigree alone. Some methods have largely different

computational requirements, with PLSR and RR-BLUP requiring the least computing time.

Conclusions: The four methods which use information from all SNP namely RR-BLUP, Bayes-R, PLSR

and SVR generate similar accuracies of MBV prediction for genomic selection, and their use in the selection

of immediate future generations in dairy cattle will be comparable. The use of FR-LS in genomic selection

is not recommended.

Published: 31 December 2009

Genetics Selection Evolution 2009, 41:56doi:10.1186/1297-9686-41-56

Received: 12 June 2009

Accepted: 31 December 2009

This article is available from: http://www.gsejournal.org/content/41/1/56

© 2009 Moser et al; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),

which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Page 2

Genetics Selection Evolution 2009, 41:56http://www.gsejournal.org/content/41/1/56

Page 2 of 16

(page number not for citation purposes)

Background

Until recently, the use of molecular genetics in commer-

cial applications of marker-assisted selection (MAS) have

focused on the use of individual genes or a few quantita-

tive trait loci (QTL) linked to markers [1,2]. With the

exception of a few genes with relatively large effects such

as DGAT [3] or FECB [4] most candidate genes or QTL

capture only a very small proportion of the total genetic

variance. Recent empirical genome-wide association

(GWAS) studies using a high-density SNP technology in

humans, (e.g. [5-7], mice [8] and cattle [9] suggest that

complex traits are most likely affected by many genes with

a small effect.

A dramatic change in terms of the use of genomic infor-

mation to estimate the total genetic value for breeding

animals, known as genomic selection (GS) or Genome

Wide Selection (GWS) was predicted by Meuwissen et al.

[10]. Using simulations, they showed that with a dense

marker map covering all the chromosomes, it is possible

to accurately estimate the breeding value of animals with-

out information about their phenotype or that of close rel-

atives. Genomic estimated breeding values (GEBV) can be

calculated for both sexes at an early stage in life, and there-

fore GS can increase the profitability and accelerate

genetic gain of dairy cattle breeding by reducing the gen-

eration interval and cost of proving bulls [11,12]. This is

projected to restructure dairy cattle breeding schemes,

many of which rely on progeny testing sires and the

recording of hundreds of thousands and often millions of

cows [12].

Whole-genome analyses require methods that are capable

of handling cases where the number of marker variables

greatly exceeds the number of individuals, and models are

at risk of being over parameterized. Furthermore, inclu-

sion of complex pedigrees in large animal breeding data

sets may lead to population stratification and confound-

ing of relatedness with gene or SNP effects [13]. A variety

of methods have been suggested for the estimation of

genomic breeding values. Meuwissen et al. [10] have com-

pared a joint least squares estimation of individually sig-

nificant haplotype effects with best linear unbiased

prediction (BLUP) including all haplotypes and two Baye-

sian approaches similar to BLUP, but allowing for varia-

tion in the genetic variance accounted for by individual

haplotype effects. Xu [14] has used a similar Bayesian

approach but has estimated additive and dominance

effects attributed to individual marker loci rather than

haplotype effects. As pointed out by Gianola et al. [15]

Bayesian regression methods, such as those by Meuwissen

et al. [10], require some strong a priori assumptions. These

authors have proposed non-parametric kernel regression

with a BLUP model accounting for the residual polygenes.

Initially, these methods for computation of genomic

breeding values have been investigated by simulation

studies which may not be realistic in empirical situations

and it is unlikely that the models underlying these simu-

lations reflect the complexity of biological systems.

Recently, other methods have been attempted in GWS

analyses, e.g. principal component regression [16], partial

least squares regression [17,18], LARS [19], LASSO

[20,21], and BLUP including a genomic relationship

matrix [22].

In this study we compared the accuracies of five different

regression methods for the computation of genomic pre-

diction of genetic merit in an empirical application. The

selection of methods was based on their inherent differ-

ences in the underlying assumptions and previous appli-

cation in GS. We analyzed a data set of 7,372 SNP markers

genotyped on 1,945 Australian Holstein Friesian dairy

bulls with highly reliable estimated breeding values (EBV)

derived from phenotypic records of large groups of prog-

eny.

Methods

Statistical models

Five regression methods were used to estimate SNP

effects: fixed regression using least squares (FR-LS), ran-

dom regression BLUP (RR-BLUP), Bayesian regression

(Bayes-R), partial least squares regression (PLSR); and

support vector regression (SVR). Other than the require-

ment that markers are located across the genome, no addi-

tional information, such as marker location or pedigree, is

required by the methods. The basic model can be denoted

as

where yi is the estimated breeding value (EBV) of sire i (i =

1, 2,..., n) and xi is a 1 × p vector of SNP genotypes on bull

i, and g(xi) is a function relating genotypes to EBVs and

can be considered as a molecular breeding value (MBV)

and ei is a residual term. The SNP genotypes are coded as

variates according to the number of copies of one SNP

allele, i.e. 0, 1 or 2. We denote with X the matrix contain-

ing the column vectors xk of SNP genotypes at locus k (k =

1, 2, ..., p).

Fixed regression-least squares (FR-LS)

In linear regression on SNP, g(xi) is modeled as

where βk is the regression of EBV on the additive effect of

SNP k, and q the number of SNP fitted in the model. The

multicollinearity between SNP, i.e. two or more SNP in

yge

iii

=+

(),

x

gx

i ik

k

q

∑

p

k

x

() =

<<

β ,

Page 3

Genetics Selection Evolution 2009, 41:56 http://www.gsejournal.org/content/41/1/56

Page 3 of 16

(page number not for citation purposes)

high but not complete LD, is addressed by selecting a lim-

ited number of 'important' SNP. A stepwise procedure in

which markers are considered for inclusion in the model

one at a time was used, as applied by Habier et al. [13].

Each marker that is not already in the model is tested for

inclusion in the model. In each step the most significant

SNP which had a P-value below a predefined threshold α

was added to the model. P-values of all markers in the cur-

rent model were then checked and the marker with the

highest P-value above α was dropped from the model. The

procedure stopped when no further addition or deletion

was possible. The optimal P-value was found by cross-val-

idation.

Random regression-BLUP (RR-BLUP)

In RR-BLUP, SNP effects are assumed random [10], with

g(xi) having the form

where βk is the effect associated with SNP k, xk is set up as

described above for additive effects. The regression coeffi-

cients are found by solving the normal equations,

where λ is constant for all SNPs. Differences in shrinkage

between SNP still arise as a result of variation in allele fre-

quency. Meuwissen et al. [10] and Habier et al. [13] have

calculated λ for their simulated data from known genetic

and residual variances. With no knowledge of these vari-

ance components and analyzing EBV data, an appropriate

value for the shrinkage parameter can be obtained by

cross-validation. When EBV have a variety of reliabilities

then the regression can be weighted accordingly so that

ˆ

( ’)’

β β=+

XXIXy

RR

λ

, where R is a diagonal

matrix of weights. In this case most reliabilities exceeded

0.85 so they were treated as homogeneous, i.e. R = I.

Bayesian regression (Bayes-R)

Bayesian regression on additive SNP effects was per-

formed as proposed by Meuwissen et al. [10] for their

method BayesA using a Gibbs Sampler. It differs from the

RR-BLUP model in that each SNP effect has its own poste-

rior distribution. This model allows each marker to have

its own variance, resulting in different shrinkage of SNP

effects. The prior of Meuwissen et al. [10] was used for SNP

effects.

Support vector regression (SVR)

Support vector machines are algorithms developed from

statistical learning theory. Support vector regression (SVR,

Vapnik [23]) uses linear models to implement non-linear

regression by mapping the input space to a higher dimen-

sional feature space using kernel functions. A feature of

SVR is that it simultaneously minimizes an objective func-

tion which includes both model complexity and the error

on the training data. SVR can be considered as a specific

learning algorithm for reproducing kernel Hilbert spaces

(RKHS) regression, first proposed for whole-genome anal-

ysis of quantitative traits by Gianola et al. [15]. A precise

account of the theory of RKHS and details of SVR is

beyond the scope of this article, so only a brief description

is given here. For essential theoretical details and term def-

initions we refer the reader to Gianola et al. [15], Gianola

and van Kaam [24]), Gianola and de Los Campos [25]

and de Los Campos et al. [26]. An application of the RKHS

regression approach to estimate genetic merit for early

mortality in broilers from SNP data is described in [27]. A

more detailed introduction to SVR is given in [28].

RKHS regression estimation is based on minimization of

the following functional (e.g. equation 2 in [26]):

where V(yi, g(xi)) is some loss (error) function, the second

term in the equation acts as an penalty, and λ is a fixed

positive real number that somehow controls the trade-off

between the two terms and ||||2 denotes the norm under a

Hilbert space. Several choices of the loss function V in (1)

are possible [28]. In their application of RKHS regression

[27], used V(yi, g(xi)) = (yi - g(xi))2 (equation (1) in [27])

as the loss function, which corresponds to the conven-

tional least squares error criterion. In SVR the quadratic

error function is replaced by a function called epsilon-

insensitive loss proposed by Vapnik [23]:

It can be shown that the minimizer of (1) using epsilon-

insensitive loss can be written as:

j = 1, 2,..., n, where K = (xi, xj) is the kernel involving the

genotypes of sires i and j. The coefficients αi and αi* are

the solution of a system of nonlinear equations. The loss

function assigns zero loss to errors less than ε, thus safe-

guarding against overfitting. The parameter ε also pro-

gx

iik

k

p

k

x

() =

=∑

1

β ,

ˆ

’ ’ ,

β β=+

()

−

X XIX y

λ

1

−−−

111

H g

[ ]=

n

V y

(

gg

ii

i

n

)+

=∑

1

2

1

2

, ( ),

x

λ

(1)

V y

(

gyg

if yg

ygotherwise

iiii

ii

ii

, ( )()

()

()

x

)=−=

−≤

−−

⎧

⎨⎪

⎩

⎪ ⎪

x

x

x

ε

ε

ε

0

⎫

⎬⎪

⎭ ⎪

.

gK

ii

j

n

iij

()( ) (, ),

*

x x x

=−

=∑α

1

α

Page 4

Genetics Selection Evolution 2009, 41:56http://www.gsejournal.org/content/41/1/56

Page 4 of 16

(page number not for citation purposes)

vides a sparse representation of the data as only a fraction

of the coefficients αi, αi* are nonzero. Data points associ-

ated with non-zero coefficients are called support vectors

and a detailed interpretation of support vectors is given in

[23]. Unlike the case of the quadratic loss function, where

the coefficients αi are found by solving a linear system,

using epsilon-insensitive loss the coefficients αi are the

solutions to a quadratic programming problem.

In our implementation of SVR we used a Gaussian kernel.

In order to solve the SVR regression problem three meta-

parameters must be specified; the insensitivity zone ε, a

penalty parameter C > 0 that determines the trade-off

between approximation error and the amount up to

which deviations larger than ε are tolerated and the band-

width of the kernel function. Cross validation employing

a grid search was used to tune the meta-parameters.

Partial least squares regression (PLSR)

A dimension reduction procedure, partial least squares

regression (PLSR, [29]), was used for modeling without

imposing strong assumptions. The main idea of PLSR is to

build orthogonal components (called 'latent compo-

nents') from the original predictor matrix X and use them

for prediction in place of the original variables. Thus, g(xi)

can be expressed as:

where ta is latent component a (a = 1, 2..., h) and generally

h <<p. PLSR is similar to the well-known principal compo-

nent regression (PCR), both methods construct a matrix

of latent components T as a linear transformation of X, T

= XW, where W is a matrix of weights. The difference is

that PCR extracts components that explain the variance of

X, whereas PLSR extracts components that have a large

covariance with y, i.e. the columns of weight matrix W are

defined such that the squared sample covariance matrix

between y and the latent components is maximized under

the constraint that the latent components are mutually

uncorrelated.

Different techniques to extract the latent components

exist, and each gives rise to a variant of PLSR. We imple-

mented PLSR using an algorithm by Dayal and MacGregor

[30], which does not require the calculation of the sample

covariance matrix of X and which we have used previously

[17]. A different PLS algorithm was used by Solberg et al.

[18] to predict genomic breeding values in their simula-

tion study. The optimal model complexity (i.e. number of

latent components) was estimated by cross-validation.

Animals and SNP data

A total of 1,945 progeny tested Holstein Friesian dairy

bulls born between 1955 and 2002 were used in the study.

The phenotypes used were Australian breeding values

(EBV) taken from the August 2007 Australian Dairy Herd

Improvement Scheme

www.adhis.com.au/) evaluation. The traits analyzed

included protein percentage (PPT) and Australian Selec-

tion Index (ASI). ASI is a production based index that

combines protein yield, fat yield and milk yield EBV and

is weighted in relation to the value of the milk compo-

nents: (ASI = 3.8 × protein EBV + 0.9 × fat EBV - 0.048 ×

milk EBV). The mean reliability of the EBV for both ASI

and PPT was 0.89, with corresponding distribution of var-

iation in range of reliability shown in Figure 1c. The dis-

tribution of EBV for both ASI and PPT for all 1,945 bulls

is shown in Figure 1a and 1b, respectively.

(ADHIS; http://

The genotypic data belonged to a panel of 1,546 bulls gen-

otyped for a 15K SNP chip and of 441 bulls for a Gene-

Chip® Bovine Mapping

www.affymetrix.com/. There were 9,217 SNP and 44 bulls

in common between these two datasets. A combined data

set on 7,372 common SNP were extracted for the present

study after removing SNP with low minor allele frequency

(< 0.01), with low call rates (<80%), that deviated from

Hardy-Weinberg equilibrium (P ≤ 0.0001) or which

showed inconsistent inheritance [31]. The proportion of

missing SNP genotypes was less than 1%. We performed

genotype imputation using the NIPALS (nonlinear itera-

tive partial least squares, [32]) algorithm, which performs

principal component analysis in the presence of missing

data.

25K SNP chip http://

Partitioning the data in training and test data sets

To assess the ability to predict breeding values of young

bulls based on SNP data before progeny data were availa-

ble, animals born before 1998 (N = 1,239) were included

in the training set. Bulls born between 1998 and 2002 rep-

resented five single year cohorts and were allocated to test

sets according to their year of birth.

Model optimization

Applying FR-LS, RR-BLUP, SVR and PLSR requires the

selection of appropriate meta-parameters. Model optimi-

zation was performed by 5-fold cross-validation. The

complete training set (N = 1,239) was partitioned in K = 5

folds. For a given value of the meta-parameter(s) θ the

prediction model is estimated using K-1 folds, and the

predictive capacity of the model is assessed by applying

the estimated model to the individuals in the left-out fold

and this process is repeated K times so that every fold is

left out once. The value of θ which minimized the average

mean squared error of prediction (MSEP) in the K test sets

gt

i ia

a

h

a

x

() =

=∑

1

β ,

Page 5

Genetics Selection Evolution 2009, 41:56 http://www.gsejournal.org/content/41/1/56

Page 5 of 16

(page number not for citation purposes)

was then used in the model to estimate the SNP effects

from the full training set.

Accuracy and bias of MBV prediction

The correlation coefficient between the realized EBV and

the predicted MBV (rEBV,MBV) was used as a measure of the

accuracy of MBV prediction. The realized EBV were line-

arly regressed on the predicted MBV, where the regression

coefficient bEBV,MBV reflects the degree of bias of the MBV

prediction. The interest here is the comparisons between

bulls and therefore the constant estimated in the regres-

sion of EBV on MBV is of less interest and is not reported.

The bias relates to the size of the absolute differences

between MBV among cohorts, i.e. the estimate of the dif-

ference between a pair of bulls is greater (bEBV,MBV < 1) or

less (bEBV,MBV > 1) than the difference between their EBV.

A regression coefficient of one indicates no bias.

The MBV predictions of young bulls were combined with

pedigree based predictions into an estimate of genomic

estimated breeding values (GEBV) as GEBV = (w1 MBV +

w2 SMGS)/(w1 + w2), where SMGS are predictions based

on the sire maternal-grandsire

2

i

=

/()1

−

pathway and

with i = 1 for MBV and i = 2 for SMGS.

For MBV, R2 was calculated as the squared correlation

between realized EBVs and MBV predictions (rEBV,MBV)

from cross-validation of the training data. For SMGS, R2

was calculated as the squared correlation between the

realized EBV and SMGS predictions calculated at the time

of the birth of the bull calves (rEBV,SMGS). As a measure of

the accuracy of GEBV prediction we calculated the correla-

tion between realized EBVs and GEBV predictions

(rEBV,GEBV).

An analysis of variance was performed to investigate the

effect of trait, method and test year on the accuracy and

bias of MBV prediction. The regression coefficient was

loge-transformed to account for non-normality and unsta-

wRR

ii

2

Distribution of EBVs for Australian Selection Index (ASI, a) and protein percentage (PPT, b), distribution of reliabilities of EBVs (c), and number of bulls within year of birth (d)

Figure 1

Distribution of EBVs for Australian Selection Index (ASI, a) and protein percentage (PPT, b), distribution of

reliabilities of EBVs (c), and number of bulls within year of birth (d).

19551955195519691969196919761976 197619821982 198219881988 198819941994199420002000 2000

000

50 5050

100100 100

150150150

200 200200

Number of bulls

Year of birth Year of birthYear of birth

dd

Number of bullsNumber of bulls

707075 7580

ReliabilityReliability

85 8590 909595 100100

00

100

300

500

700

Number of bulls

cc

80

100

300

500

700

Number of bulls

Number of bulls

ASI ASI

-140 -100-140 -100-60-60-20-2020206060100 100140140

00

100

200

300

aa

Number of bulls

100

200

300

-0.30-0.30 -0.30 -0.20-0.20 -0.20-0.10-0.10 -0.100.00

PPTPPT

0.100.100.100.200.200.200.300.300.30

000

100

200

300

Number of bulls

bb

0.000.00

100

200

300

100

200

300

Number of bulls

Page 6

Genetics Selection Evolution 2009, 41:56http://www.gsejournal.org/content/41/1/56

Page 6 of 16

(page number not for citation purposes)

ble variance. A single linear model was fitted to each of the

metrics,

where y is either rEBV,MBV or logebEBV,MBV, μ is a mean, trait

is the effect of trait (PPT, ASI), year is the effect of test

cohort (1998, 1999,...2002) including the 5-fold cross-

validation set as a level of year; method is the effect of

method (FR-LS, RR-BLUP, Bayes-R, PLSR, SVR),

trait.method, method.year, and trait.year are two-way inter-

actions between main effects; and ε is a random error.

Implementation

For Bayes-R, the MCMC chain was run for 200,000 cycles

with the first 50,000 samples discarded as burn in. Poste-

rior estimates of SNP effects are based on 15,000 samples,

drawing every 10th sample after burn-in. The Gauss-Seidel

algorithm with residual uptake suggested by Legarra and

Misztal [33] was used in Bayes-R. FR-LS, PLSR, RR-BLUP

and Bayes-R were implemented in Fortran, for SVR analy-

ses we used the C++ library LIBSVM [34].

Results

Summary statistic on phenotypes

Despite the fact that most sires were pre-selected as young

bulls on the basis of pedigree information, the distribu-

tion for both ASI and PPT is fairly symmetric (Figure 1a

and 1b). One would not expect a noticeable genetic trend

for PPT which was not part of the selection goal in the

past. ASI was introduced in 1997 such that 63% of ani-

mals were born before the first ASI EBVs became available,

ASI and was incorporated into a new profit-centered

breeding objective in 2003.

The majority of the sires were test mated at approximately

one year of age and their daughters subsequently brought

into milk and performance tested. The body responsible

for the genetic evaluation of dairy cattle in Australia pub-

lishes a single reliability value for all production traits and

ASI. The published reliabilities of the EBV for ASI and PPT

had an average of 0.89, with over 84% of animals having

reliabilities of 0.85 or higher (Figure 1c). The age distribu-

tion of the genotyped bulls is shown in Figure 1d. Around

50% of the bulls were born after 1995, with a greater

number of animals in the more recent cohorts.

Model optimization, accuracy and bias of MBV prediction

obtained by cross-validation

The use of PLSR for genomic prediction has been

described before [17,18] but we briefly illustrate the

method by showing the cross-validation results for the

analysis of ASI (Figure 2). A series of models with increas-

ing numbers of latent components from 1 to 20 was fitted.

The proportion of variance explained in the training sam-

ples shows that a small number of latent components pro-

vided an adequate fit of the data, with the first eight latent

components explaining more than 90% of the EBV vari-

ance and the first latent component accounting for 34%

alone. The prediction error in the corresponding test sets

(MSEPCV) identified the first five latent components as

having the lowest MSEP (Figure 2). The model with five

latent components used only 8% of the SNP variance in

the training set, suggesting a high degree of multicolline-

arity among SNP loci.

Table 1 provides a summary statistic of optimizing the

threshold parameter α used to select SNP for the FR-LS

prediction model by cross-validation. Shown are the

mean accuracies obtained by predicting the five cross-val-

idation samples, standard errors of accuracy and bias of

prediction were computed from the variance of the means

of the five cross-validation samples. As expected, the

number of markers selected in the prediction model

decreased with more stringent threshold values of α. The

optimal model for ASI with the lowest MSEP was

obtained with α = 0.001 including 33 markers, resulting

in an accuracy of MBV prediction of 0.53. Similarly the

best model for PPT was obtained at α = 0.001 (based on

30 SNP on average) with accuracy of prediction of 0.43,

whereas the worst model was obtained at α = 0.1 (with

inclusion of 215.6 SNP on average) with an accuracy of

prediction of 0.35. In general the differences in accuracy

of prediction between models including different number

of SNP are small, and are small compared to their stand-

ard error. The degree of bias is assessed by comparing the

regression coefficient of EBV on MBV with the value 1. The

results show that for all α values predictions had a large

bias as shown by a regression coefficient markedly less

than 1 and bias decreased with decreasing number of

SNPs.

Table 2 summarizes the results for the various methods

obtained from cross-validation of the training set. Accura-

cies of prediction ranged from 0.53 to 0.72 for ASI and

0.43 to 0.58 for PPT, respectively. All methods that used

the information of all 7,372 SNP outperformed FR-LS.

Accuracies of prediction and prediction errors for meth-

ods that estimate effects of all markers were essentially the

same, although for ASI and PPT, SVR had the lowest

MSEP, and the highest accuracy of prediction of 0.72 and

0.58, respectively. For both traits predictions obtained by

FR-LS and PLSR had the largest bias, whereas for RR-

BLUP, SVR and Bayes-R the regression of EBV on MBV was

close to one.

Accuracies in young bull cohorts

Table 3 shows the accuracy of MBV prediction of young

bulls according to their year of birth and of the total test

sample containing 706 animals. Marker effects were esti-

mated using all animals in the training set with the best

ytrait methodyeartrait method

.

method yeartrait year

.

=+++++++

μ

.

ε ε,

Page 7

Genetics Selection Evolution 2009, 41:56http://www.gsejournal.org/content/41/1/56

Page 7 of 16

(page number not for citation purposes)

model of each method obtained by cross-validation. The

FR-LS models included 48 SNP for ASI and 29 SNP for

PPT. For ASI, accuracies of prediction (rEBV,MBV) of all

young bulls were 0.45, 0.42, 0.41, 0.39, 0.27 for SVR,

PLSR, Bayes-R, RR-BLUP and FR-LS, respectively. As for

ASI, accuracies of MBV prediction of PPT were very similar

between methods that used all SNP information (rEBV,MBV

= 0.55-0.61), whereas FR-LS was the least accurate method

(rEBV,MBV = 0.47).

In general the MBV predictions of PPT showed lower bias

compared to those of ASI. MBV predictions of ASI

obtained by all methods resulted in inflated differences in

the relative rankings of bulls compared to relative rank-

Partial least squares regression model optimization for Australian Selection Index using cross-validation

Figure 2

Partial least squares regression model optimization for Australian Selection Index using cross-validation.

Shown is the mean prediction error (MSEP) in the training (MSEPtraining) data set, the average MSEP in the 5-fold cross-valida-

tion samples (MSEPCV), the proportion of EBV (VarEBV) and SNP variance (VarSNP) explained in the training data for models

with an increasing number of latent components; the optimal prediction model includes the first 5 latent components, identi-

fied by the smallest MSEPCV.

1155 10 1015 152020

Number of latent componentsNumber of latent components

00

2020

40 40

60 60

8080

100 100

Variance explained (%)

00

200200

400400

600 600

800800

1000 1000

12001200

MSEP

VarEBV

VarEBV

VarSNP

VarSNP

MSEPtraining

MSEPtraining

MSEPCV

MSEPCV

Variance explained (%)

MSEP

Table 1: Cross-validation results for method fixed regression-least squares at different threshold values

Trait

α†

nSNPMSEPrEBV,MBV

bEBV,MBV

ASI0.1

0.01

0.001

0.0001

197.2

98.2

33.0

15.0

(31.6)

(7.1)

(5.4)

(1.9)

1464

1235

1090

1108

(139.2)

(62.1)

(124.4)

(136.8)

0.52

0.54

0.53

0.50

(0.031)

(0.043)

(0.036)

(0.043)

0.49

0.58

0.71

0.76

(0.059)

(0.045)

(0.048)

(0.084)

PPT0.1

0.01

0.001

0.0001

215.6

81.6

30.0

15.4

(29.3)

(5.0)

(4.2)

(2.1)

0.0214

0.0156

0.0135

0.0136

(0.0023)

(0.0016)

(0.0023)

(0.0016)

0.35

0.42

0.43

0.39

(0.056)

(0.059)

(0.089)

(0.076)

0.32

0.48

0.62

0.67

(0.059)

(0.075)

(0.155)

(0.173)

Average number of SNP (nSNP) in the model, mean square error (MSEP), correlation (rEBV,MBV) between EBV and MBV, and regression coefficient

(bEBV,MBV) of EBV on MBV for different threshold levels (α) in five cross-validation samples of the training data set, standard error in parentheses;

ASI: Australian Selection Index; PPT: protein percentage; † P-value used to select SNPs in or out of model.

Page 8

Genetics Selection Evolution 2009, 41:56http://www.gsejournal.org/content/41/1/56

Page 8 of 16

(page number not for citation purposes)

ings based on EBV. For PPT, predictions obtained by RR-

BLUP and SVR were close to being unbiased compared to

predictions of MBV obtained by PLSR and Bayes-R.

Figure 3 shows fits relating MBV predictions and realized

EBVs of ASI and PPT in a single cross-validation sample,

and in a young bull cohort 1998 for each of the five meth-

ods. FR-LS showed a larger dispersion of MBV across the

range of EBV in cohort 1998 for both traits compared to

the other methods, which is consistent with the lower

accuracy seen with this method.

Accuracies of MBV prediction of ASI in young bull cohorts

were considerably lower compared to the accuracy

obtained by cross-validation of the training set, whereas

for PPT predictions of the test data set were as accurate as

the predictions obtained from cross-validation. As

depicted in Figure 3, the EBV variance for ASI in the test

sample is much lower relative to the cross-validation sam-

ple, which can partly explain the decrease in accuracy of

predictions of ASI in young bull cohorts.

Comparison of methods for MBV prediction

Correlations between MBV predictions obtained by cross-

validation of the training data set and the test data set of

young bulls are shown in Table 4. Predictions obtained by

FR-LS were considerably less similar (r = 0.72-0.83) to all

other methods. Thus using a smaller number of SNP as

fixed effects produces somewhat different predictions to

methods which use all SNP. The correlations between

methods that used all SNP information were very high

(r>0.9) for both ASI and PPT.

Figure 4 shows the distribution of SNP effects along the

genome estimated in the training data set for four meth-

ods for ASI and PPT. Relatively few SNP are used by the

FR-LS method for both traits. The other methods assign

relatively small effects to most of the SNP. However, the

distribution for PPT depicts a small number of SNP with

relatively large effects. All methods displayed very similar

clustering of SNP with large effects along the genome.

Increasing the number of bulls in the training data set

The accuracy of genomic predictions depends on the

number of animals that are used to estimate the SNP

effects [35]. The accuracy of MBV prediction estimated by

PLSR from training data sets of increasing size are shown

in Table 5. Larger training data sets did not result in signif-

icant gains in accuracy of MBV prediction in year cohorts

of young bulls. In all cases predictions of PPT were more

accurate than of ASI.

Combining MBV and SMGS predictions

GEBV predictions for bulls born between 1998 and 2002

were calculated by combining the MBV predictions with

the sire maternal-grandsire pathway predictions, which

were calculated at the time of birth of the young bull

calves (Table 6). The accuracy of GEBV prediction of ASI

was 1.06 - 1.34 times higher than the accuracy of the ped-

igree based prediction, and 1.16 - 1.27 times higher for

PPT. Among the methods FR-LS had the lowest accuracy

and the differences between the other methods were

small.

Variability in accuracy and bias of MBV prediction

Abridged analysis of variance tables for the accuracy and

bias of MBV prediction are shown in Table 7. The method

and the interaction between trait and year showed signif-

icant effects on the accuracy of MBV prediction. Accuracies

of prediction by FR-LS and SVR were significantly different

from other methods, with SVR being the most and FR-LS

the least accurate method (additional file 1). Accuracy of

prediction obtained by cross-validation of the training

data was significantly higher than the accuracy of predic-

tion in cohort 2002. The interactions between method

and trait and between trait and year showed significant

effects on the regression of EBV on MBV.

Computing time

Computing time is important, particularly for cross-vali-

dation and implementation in practice which requires fre-

quent re-estimation of

computational demand of the various methods is shown

in Table 8. The machine used for all calculations had a

dual core Intel D 3.2 GHz CPU. The PLSR and RR-BLUP

methods, took less than 1 min to calculate the marker

effects for a single replicate of the training data. The

requirements of Bayes-R are several orders of magnitude

breeding values. The

Table 2: Summary of MBV prediction in the training data for five

methods obtained by cross-validation

TraitMethod MSEPrEBV,MBV

bEBV,MBV

ASIFR-LS

RR-BLUP

Bayes-R

SVR

PLSR

1,090

712

714

700

735

(124.4)

(93.5)

(95.3)

(92.2)

(95.4)

0.53

0.71

0.71

0.72

0.70

(0.036)

(0.017)

(0.016)

(0.017)

(0.022)

0.71

1.07

1.09

1.06

0.93

(0.048)

(0.076)

(0.071)

(0.079)

(0.069)

PPTFR-LS

RR-BLUP

Bayes-R

SVR

PLSR

0.0135

0.0104

0.0104

0.0100

0.0109

(0.0023)

(0.0018)

(0.0010)

(0.0010)

(0.0012)

0.43

0.56

0.56

0.58

0.55

(0.089)

(0.067)

(0.067)

(0.064)

(0.061)

0.62

1.01

1.06

1.01

0.81

(0.155)

(0.104)

(0.117)

(0.100)

(0.078)

Mean square error (MSEP), correlation (rEBV,MBV) between EBV and

MBV, and regression coefficient (bEBV,MBV) of EBV on MBV derived by

5-fold cross-validation of the training data set, standard errors in

parentheses; ASI: Australian Selection Index; PPT: protein percentage;

FR-LS: fixed regression-least squares; RR-BLUP: random regression-

BLUP; Bayes-R: Bayesian regression; SVR: support vector regression;

PLSR: partial least squares regression.

Page 9

Genetics Selection Evolution 2009, 41:56http://www.gsejournal.org/content/41/1/56

Page 9 of 16

(page number not for citation purposes)

higher (~421 min). The computational burden for SVR

lies in the grid search for the meta-parameters and also

requires the computation of the kernel matrix in the pre-

diction step. PLSR, RR-BLUP and SVR all scale well to

larger number of SNPs.

Discussion

The concept of genomic selection was first raised over

eight years ago [10], but it was not until the advent of high

capacity genotyping platforms that empirical data sets

became available in dairy cattle [36-40]. Initial reports on

efficiencies and pros and cons of different statistical

approaches for genomic selection have been conducted

largely on simulated data sets (e.g. [10,13,15,18,41-44])

and to a lesser extent in real data (e.g. [19,35,45,46]. Sim-

ulation studies, although informative, are strongly

dependent on the underlying assumptions, some of

which may be biologically unrealistic or limited in their

complexity. This paper describes the performance of five

statistical approaches for the prediction of molecular and

genomic breeding values using empirical data.

Comparison of methods

The choice of methods evaluated here represent a range of

methods proposed previously for the potential use in

genomic selection including variable selection methods

(FR-LS, [10,13,47], shrinkage methods (Bayes-R and

BLUP, [10,13,14]); support vector learning methods

(SVR, [15,43]) and dimension reduction methods (PLSR,

[17,18]).

Methods for calculating genomic breeding values have to

deal with the problem of multicollinearity and over-

parameterization resulting from fitting many parameters

to relatively small data sets. The FR-LS regression method,

which exploits a reduced subset of selected SNP consist-

ently had lower accuracy and a larger bias of prediction

than the other methods. Bayes-R, RR-BLUP, SVR and PLSR

which use all available SNP information performed

Table 3: Correlation (rEBV,MBV) between EBV and MBV and regression coefficient (bEBV,MBV) of EBV on MBV in cohorts of young bulls

for five methods

Trait MethodYear of birth

199819992000200120021998-2002

rEBV,MBV

ASIFR-LS

RR-BLUP

Bayes-R

SVR

PLSR

0.22

0.35

0.38

0.42

0.39

0.23

0.39

0.38

0.40

0.38

0.33

0.40

0.42

0.46

0.40

0.26

0.32

0.33

0.40

0.35

0.12

0.28

0.29

0.35

0.34

0.27

0.39

0.41

0.45

0.42

PPT FR-LS

RR-BLUP

Bayes-R

SVR

PLSR

0.48

0.53

0.63

0.64

0.63

0.52

0.58

0.60

0.61

0.55

0.41

0.53

0.55

0.57

0.50

0.46

0.56

0.63

0.63

0.62

0.43

0.49

0.51

0.52

0.43

0.47

0.55

0.60

0.61

0.56

bEBV,MBV

ASIFR-LS

RR-BLUP

Bayes-R

SVR

PLSR

0.18

0.59

0.59

0.61

0.45

0.25

0.72

0.71

0.65

0.49

0.29

0.82

0.81

0.76

0.55

0.26

0.72

0.71

0.77

0.51

0.11

0.60

0.59

0.66

0.48

0.26

0.72

0.76

0.74

0.55

PPTFR-LS

RR-BLUP

Bayes-R

SVR

PLSR

0.58

1.09

1.24

1.13

0.88

0.62

0.93

1.10

0.99

0.73

0.45

0.98

1.15

1.07

0.74

0.59

1.10

1.29

1.17

0.93

0.40

0.72

0.81

0.72

0.50

0.55

1.08

1.16

1.05

0.80

Number of bulls

18914417313763706

The training data set included animals born before 1998; ASI: Australian Selection Index; PPT: protein percentage; FR-LS: fixed regression-least

squares; RR-BLUP: random regression-BLUP; Bayes-R: Bayesian regression; SVR: support vector regression; PLSR: partial least squares regression.

Page 10

Genetics Selection Evolution 2009, 41:56http://www.gsejournal.org/content/41/1/56

Page 10 of 16

(page number not for citation purposes)

Fit of models relating EBVs and predicted MBVs in the training data and in young bulls

Figure 3

Fit of models relating EBVs and predicted MBVs in the training data and in young bulls. To avoid cluttering predic-

tions are plotted for a single fold of the cross-validation (CV) of the training data set and young bull cohort 1998; ASI: Austral-

ian Selection Index; PPT: protein percentage; FR-LS: fixed regression-least squares; RR-BLUP: random regression-BLUP; Bayes-

R: Bayesian regression; SVR: support vector regression; PLSR: partial least squares regression.

PPTPPT ASIASI RR-BLUP RR-BLUP

-0.3-0.3-0.30.0

MBV MBVMBV

0.30.30.3

-0.3-0.3 -0.3

0.00.00.0

0.30.3 0.3

EBV

19981998

-100-100

-100

0.0

MBVMBV

MBV

100 100

100

-100-100

-100

0.00.0

0.0

100 100

100

EBV

19981998

-100-100-1000.0

MBVMBVMBV

100100100

-100-100-100

0.00.00.0

100100100

EBV

CVCVCVCV

-0.3-0.3

-0.3

0.0

MBVMBV

MBV

0.30.3

0.3

-0.3-0.3

-0.3

0.00.0

0.0

0.30.3

0.3

EBV

0.0 0.0

EBVEBV

0.0

0.0

EBVEBV

0.00.0

EBVEBV

0.0

0.0

EBVEBV

PPTPPTASIASIFR-LSFR-LS

-0.3-0.3-0.30.0

MBV MBVMBV

0.30.30.3

-0.3-0.3-0.3

0.00.00.0

0.30.30.3

EBV

19981998

-100-100

-100

0.0

MBVMBV

MBV

100100

100

-100-100

-100

0.00.0

0.0

100100

100

EBV

19981998

-100-100

-100

0.0

MBVMBV

MBV

100100

100

-100-100

-100

0.00.0

0.0

100100

100

EBV

CVCVCVCV

-0.3-0.3

-0.3

0.0

MBVMBV

MBV

0.30.3

0.3

-0.3-0.3

-0.3

0.00.0

0.0

0.30.3

0.3

EBV

0.00.0

EBVEBV

0.0

0.0

EBVEBV

0.0

0.0

EBVEBV

0.0

0.0

EBV EBV

PPTPPTASIASI Bayes-RBayes-R

-0.3-0.3-0.30.0

MBVMBVMBV

0.30.30.3

-0.3-0.3-0.3

0.00.0 0.0

0.30.30.3

EBV

19981998

-100-100

-100

0.0

MBVMBV

MBV

100100

100

-100-100

-100

0.00.0

0.0

100100

100

EBV

19981998

-100 -100

-100

0.0

MBVMBV

MBV

100100

100

-100-100

-100

0.00.0

0.0

100100

100

EBV

CVCVCV CV

-0.3-0.3

-0.3

0.0

MBVMBV

MBV

0.30.3

0.3

-0.3-0.3

-0.3

0.00.0

0.0

0.30.3

0.3

EBV

0.00.0

EBVEBV

0.0

0.0

EBVEBV

0.0

0.0

EBVEBV

0.0

0.0

EBVEBV

PPTPPTASIASIPLSRPLSR

-0.3-0.3 -0.30.0

MBVMBVMBV

0.30.30.3

-0.3 -0.3-0.3

0.00.00.0

0.30.30.3

EBV

19981998

-100 -100

-100

0.0

MBVMBVMBV

100100

100

-100-100-100

0.00.00.0

100100

100

EBV

1998 1998

-100-100

-100

0.0

MBVMBVMBV

100100

100

-100-100-100

0.00.00.0

100 100

100

EBV

CVCVCVCV

-0.3-0.3

-0.3

0.0

MBVMBV

MBV

0.30.3

0.3

-0.3-0.3

-0.3

0.00.0

0.0

0.30.3

0.3

EBV

0.00.0

EBVEBV

0.0

0.0

EBVEBV

0.0

0.0

EBVEBV

0.0

0.0

EBVEBV

PPTPPTASIASISVRSVR

-0.3-0.3-0.30.0

MBVMBVMBV

0.30.30.3

-0.3-0.3-0.3

0.00.00.0

0.30.30.3

EBV

19981998

-100-100

-100

0.0

MBVMBV

MBV

100100

100

-100-100

-100

0.00.0

0.0

100100

100

EBV

19981998

-100-100

-100

0.0

MBVMBV

MBV

100100

100

-100-100

-100

0.00.0

0.0

100100

100

EBV

CV CVCV CV

-0.3 -0.3

-0.3

0.0

MBVMBV

MBV

0.3 0.3

0.3

-0.3-0.3

-0.3

0.0 0.0

0.0

0.3 0.3

0.3

EBV

0.00.0

EBVEBV

0.0

0.0

EBVEBV

0.0

0.0

EBVEBV

0.0

0.0

EBV EBV

Page 11

Genetics Selection Evolution 2009, 41:56http://www.gsejournal.org/content/41/1/56

Page 11 of 16

(page number not for citation purposes)

remarkably similarly; even though these methods are very

different from each other. The performance of all methods

depends on one or more meta-parameters. For RR-BLUP,

SVR and PLSR optimal values of the meta-parameters are

found by minimizing the prediction error using cross-val-

idation, potentially leading to more robust predictions

than Bayes-R where the posterior estimates of SNP effects

are greatly affected by the choice of the parameters in the

prior distributions. Using the same priors as in [10] Bayes-

R performed similar to the optimized models of RR-BLUP,

SVR and PLSR. One reason why these priors performed

well might be that the frequency distribution of estimated

SNP effects for ASI and PPT somewhat resembles the fre-

quency distribution of SNP effects underlying the simula-

tion in [10] which was derived from published estimated

QTL effects [48].

It appears that the gain in accuracy of RR-BLUP, Bayes-R,

SVR and PLSR over FR-LS is related to the increased

number of SNP included in the model. FR-LS fitted only

48 SNP for ASI and 29 SNP for PPT in the prediction equa-

tions and including more SNP did not result in higher

accuracies in the cross-validation set (Table 1). It is well

known that FR-LS estimates of a subset of SNP effects or

QTL are biased upwards and that SNP selection methods

perform poorly on multicollinear markers [49]. An advan-

tage of the use of multicollinear SNP is that it can increase

the accuracy of estimates of effects. SNPs in high LD

define a larger segment of genome and the standard devi-

ation of the estimated effect of the segment is reduced by

a factor of when the average of n SNP is used instead

of a single SNP. This averaging takes place when many

SNP in high linkage disequilibrium are used to construct

a model. An analogy is encountered in QTL mapping,

when QTL inference is related to the peak area of a QTL,

rather than the peak height at a given position. In general,

prediction is not seriously affected by multicollinearity as

long as the correlational structure observed in the training

sample persists in the prediction population, e.g. [50].

With close genetic relationships between bulls in the

training and tests set, as was the case here, methods that

fit more SNPs capture more of the genetic relationships

which can in fact lead to an increase in accuracies as

shown by Habier et al. [13]. For example, sons and grand-

sons of the 10 most popular ancestors accounted for

37.6% of bulls in the training data set and 32.7% of young

bulls in the test data set.

As technology platforms advance it is possible to extend

the density to many thousands of SNP genotypes per indi-

vidual, possibly capturing all sources of genomic variation

with entire genomes being sequenced per individual.

Such technology platforms will only exacerbate the curse

of dimensionality and computational burden of a method

will become more important. In particular, PLSR and RR-

BLUP are very fast methods. The use of methods based on

Bayesian regression on the other hand, such as BayesA,

might be prohibitive when the number of SNP is large.

For SVR computing time does not depend on the number

of SNP but rather on the animals that are genotyped.

The relevance of methods which focus on identifying a

subset of the available SNP will remain high while the cost

of dense chips is high. Although subset selection by FR-LS

performed poorly in our study and therefore cannot be

recommended, some authors reported similar or

improved accuracy when using a pre-selected subset of

SNP. In the study of Moser et al. [17] the selection was per-

formed within the PLSR and in Gonzalez-Recio et al. [45]

within a kernel regression framework. The use of SNP sub-

set selection in genomic selection needs further testing. In

the end, it may be of limited use if multiple traits require

so many SNP that the cost of genotyping them is similar

to the cost of a high density chip.

Application of GS and variability in accuracy of prediction

A key issue in genomic selection is predicting genetic

merit in young animals across a wide range of traits. In the

case of dairy cattle breeding, this is particularly advanta-

n

Table 4: Pearson correlations of MBV predictions in the training data (above diagonal) and in cohorts of young bulls (below diagonal)

between five methods

ASIPPT

MethodFR-LSRR-BLUPBayes-R SVR PLSRFR-LSRR-BLUPBayes-RSVRPLSR

FR-LS

RR-BLUP

Bayes-R

SVR

PLSR

0.730.74

1

0.73

0.99

0.99

0.71

0.97

0.97

0.96

0.660.67

1

0.66

0.98

0.98

0.64

0.97

0.96

0.95

0.57

0.58

0.59

0.55

0.59

0.63

0.63

0.60

0.96

0.93

0.93

0.95

0.91

0.92

0.96

0.97

0.97

0.970.95 0.93

ASI: Australian Selection Index; PPT: protein percentage; FR-LS: fixed regression-least squares; RR-BLUP: random regression-BLUP; Bayes-R:

Bayesian regression; SVR: support vector regression; PLSR: partial least squares regression.

Page 12

Genetics Selection Evolution 2009, 41:56http://www.gsejournal.org/content/41/1/56

Page 12 of 16

(page number not for citation purposes)

Distribution of 7,372 SNP effects along the genome estimated by four methods

Figure 4

Distribution of 7,372 SNP effects along the genome estimated by four methods. The right most 772 SNPs are unas-

signed to chromosomes; ASI: Australian Selection Index; PPT: protein percentage; FR-LS: fixed regression-least squares; RR-

BLUP: random regression-BLUP; Bayes-R: Bayesian regression.

0.00.0

0.10.1

0.20.2

0.00.0

0.10.1

0.20.2

0.000.00

0.010.01

0.020.02

0.000.00

0.010.01

0.020.02

0.000.00

0.010.01

0.020.02

0.000.00

0.010.01

0.020.02

0.000.00

0.010.01

0.020.02

0.000.00

0.010.01

0.020.02

FR-LSFR-LS

RR-BLUPRR-BLUP

Bayes-RBayes-R

PLSRPLSR

SNP indexSNP indexSNP index

SNP indexSNP indexSNP index

SNP indexSNP indexSNP index

SNP indexSNP indexSNP index

Effect (SD)

Effect (SD)

Effect (SD)

Effect (SD)

Efect (SD)

ASIASIASI

ASIASIASI

ASIASIASI

ASIASIASI

SNP indexSNP indexSNP index

SNP indexSNP indexSNP index

SNP indexSNP indexSNP index

SNP indexSNP indexSNP index

PPTPPTPPT

PPTPPTPPT

PPTPPTPPT

PPTPPTPPT

Effect (SD)

Effect (SD)

Effect (SD)

Effect (SD)

Effect (SD)

Effect (SD)

Effect (SD)

Effect (SD)

Effect (SD)

Effect (SD)

Effect (SD)

Effect (SD)

Efect (SD)

Effect (SD)

Effect (SD)

Effect (SD)

Effect (SD)

Effect (SD)

Effect (SD)

Effect (SD)

Effect (SD)

Page 13

Genetics Selection Evolution 2009, 41:56http://www.gsejournal.org/content/41/1/56

Page 13 of 16

(page number not for citation purposes)

geous given the sex limited expression of most traits and

the long generation interval due to relying on progeny

testing to select superior replacement sires. The potential

advantages of using genomic selection in breeding pro-

grams of dairy cattle have been demonstrated by Schaeffer

[11] and König et al. [12]. Although the assumptions may

not be met by the currently achieved accuracies of GEBV,

the principles of reduced generation interval and

increased accuracy of selection of young bulls at time of

entry into progeny test all show substantial benefits from

increase in genetic gain and reduced costs.

Genomic breeding values that combined the marker-

based MBV with a pedigree based polygenic effect had

higher accuracies than MBV or polygenic component

alone, which is consistent with reports in dairy cattle

[35,46], wheat and mice [21]. We show here that accura-

cies of GEBV were approximately 1.3 times larger than

accuracies of sire maternal-grandsire pathway breeding

values, which are currently used to select bull calves to

enter progeny testing. Most of the bulls in our study

belong to the same breeding program as the population

used to estimate GEBV by Hayes et al. [35], but there was

no overlap between the training sets between studies as

their training set included animals born between 1998

and 2002, some of which are presumably part of our test

sets. Although direct comparison of both studies is diffi-

cult since there were differences in the number of bulls in

the training data and in the method used to calculate

accuracies of GEBV, Hayes et al. [35] reported higher accu-

racies of GEBV for ASI than for PPT. It remains uncertain

to what extent MBV predictions of the same trait derived

from different populations, or even different reference

populations within the same breed, are robust and war-

rants further examination.

Improvement of accuracies of genomic predictions are

likely to benefit from substantially increased training sets

as individual SNP effects are estimated with greater accu-

racy [35,46]. In contrast with observations by VanRaden et

al. [46], an increase in accuracy of MBV was not apparent

with an increased number of bulls in the training set

(Table 5). This discrepancy may rest with the smaller

range in the number of bulls in the training set in our data

(1,239-1,882) compared with 1,151 to 3,576 used in [46].

Table 5: Correlation (rEBV,MBV) between EBV and MBV in cohorts

of young bulls with increasing size of the training data

TrainingrEBV,MBV

Year19981999200020012002

TraitN 14418917313763

ASI

≤1997

≤1998

≤1999

≤2000

≤2001

1,239

1,383

1,572

1,745

1,882

0.390.38

0.37

0.40

0.38

0.45

0.35

0.29

0.35

0.39

0.34

0.26

0.30

0.34

0.32

PPT

≤1997

≤1998

≤1999

≤2000

≤2001

1,239

1,383

1,572

1,745

1,882

0.630.55

0.55

0.50

0.51

0.52

0.62

0.64

0.66

0.68

0.43

0.41

0.43

0.46

0.47

Results were obtained by cross-validation using partial least squares

regression; ASI: Australian Selection Index; PPT: protein percentage

Table 6: Correlation (rEBV,SMGS) between EBV and pre-progeny

test sire maternal-grandsire EBV prediction and correlation

(rEBV,GEBV) between EBV and GEBV in young bulls for five

methods

rEBV,GEBV

TraitrEBV,SMGS

FR-LSRR-BLUPBayes-RSVRPLSR

ASI

PPT

0.35

0.49

0.37

0.57

0.45

0.60

0.45

0.62

0.47

0.60

0.45

0.62

GEBV predictions for bulls born between 1998 and 2002 were

calculated by combining the MBV predictions with the sire maternal-

grandsire pathway predictions, which were calculated at the time of

birth of the young bull calves; ASI: Australian Selection Index; PPT:

protein percentage; FR-LS: fixed regression-least squares; RR-BLUP:

random regression-BLUP; Bayes-R: Bayesian regression; SVR: support

vector regression; PLSR: partial least squares regression.

Table 7: Summary of ANOVA of factors affecting correlation

(rEBV,MBV) between EBV and MBV and regression coefficient

(logebEBV,MBV) of EBV on MBV

rEBV,MBV

logebEBV,MBV

Model Term

P-value†

F-value

P-value†

F-value

Method

Trait

Year

Method.Trait

Method.Year

Trait.Year

< 0.00148.88

350.84

61.68

1.66

0.65

60.19

n.t.

n.t.

n.t.

88.09

131.35

20.44

6.18

1.20

10.73

n.t.

n.t.

0.198

0.827

< 0.001

0.002

0.346

<0.001

Shown are the significance level (P-value) and F-value of each model

term; the regression coefficient was loge-transformed to account for

non-normality and unstable variance; † n.t. non-testable.

Table 8: Computation times for estimation of SNP effects for

five methods

FR-LSRR-BLUPBayes-RSVRPLSR

~3 min~22 s~421 min~4 min ~8 s

Results were obtained by calculating SNP effects for a single replicate

of the training data; FR-LS: fixed regression-least squares; RR-BLUP:

random regression-BLUP; Bayes-R: Bayesian regression; SVR: support

vector regression; PLSR: partial least squares regression.

Page 14

Genetics Selection Evolution 2009, 41:56http://www.gsejournal.org/content/41/1/56

Page 14 of 16

(page number not for citation purposes)

Gains in accuracy are expected from increased SNP densi-

ties, as a larger proportion of the QTL variance is

explained by markers and effects of QTL can be predicted

across generations as shown in simulated data sets

[10,51]. VanRaden et al. [46] have shown a consistent but

small increase in the coefficient of determination for

genomic prediction for North American Holstein bulls

from 0.39 to 0.43 when the map density increased from

9,604 to 38,416 SNP. We would expect a similar relatively

small increase of the accuracy of MBV prediction with

greater SNP densities here. The reason is that many ani-

mals in our training and test data share DNA segments

from a small number of sires and relatively few markers

are required to trace the chromosome segments shared

between related animals separated by only a few genera-

tions. To what extent increasing SNP densities improves

the accuracy of genomic prediction in populations with

low effective population size remains to be seen. Greater

SNP densities may be required in more divergent popula-

tions or when individual animal phenotypes are analyzed

instead of EBVs derived from progeny test data.

As noted above, markers used in the statistical model not

only estimate QTL effects but also capture genetic rela-

tionships between individuals in the training data [13].

Habier et al. [13,41] and Zhong et al. [44] have demon-

strated differences in the contribution of LD between

marker and QTL and marker-based relatedness for differ-

ent statistical methods. They show a gradual decline of

accuracy of prediction for individuals which are removed

several generations from the training data set. Habier et al.

[13] have found that RR-BLUP is affected by genetic rela-

tionships to a larger degree than FR-LS, predicting a

steeper decline in the accuracy for RR-BLUP in generations

following training. Here, we did not observe a gradual

decline in accuracy of prediction, in fact for ASI correla-

tions for all methods were higher in animals born in year

2000 compared to animals born in year 1998. Small rank

changes in the performance of the methods between test

years did occur, but there was no strong evidence for dif-

ferent rates of decline of accuracies in later test years

between methods. This is supported by the non-signifi-

cant interaction between method and year from the

ANOVA. In practice, it will be difficult to differentiate

between improvements in the accuracy of prediction

resulting from modelling relationships via SNP or from

LD between SNP and QTL. It is still likely that a significant

component of the gains of GS will reside with predicting

relationships more accurately on the genome level either

within families [41] or even across families. For industry

applications it is feasible and most likely that prediction

equations can be updated as information on new animals

becomes available, and this will ensure a minimal lag

between animals in the training set and the test set.

In practice, the accuracy of predictions of future outcomes

needs to be assessed. The partitioning of the data depicts

this situation with older bulls in the training data and

younger bulls in the test data. Care must be taken when

the accuracy of future predictions is evaluated by cross-

validation of random subsets of the training data set. A

significant decrease in accuracy of prediction of young

bull cohorts was observed for ASI relative to the accuracy

obtained by cross-validation of the training data set,

whereas for PPT the reduction in accuracy was negligible

(Tables 2 and 3). In general a decrease in the accuracy of

MBV in young bull cohorts might be expected, since accu-

racies of realized EBV for early proofs are likely to be lower

than the accuracies of realized EBV of the older bulls in the

training data set. Initially, this would not explain the dif-

ferences in accuracy of MBV of young bulls seen between

ASI and PPT, as the body responsible for the genetic eval-

uation of dairy cattle in Australia publishes a single relia-

bility value for all production traits and ASI. However,

heritabilities for the ASI component traits (0.25) are lower

than for PPT (0.40) and more training animals may be

required to obtain accurate prediction equations for traits

with lower heritability.

Another reason for a decrease in accuracy of prediction of

younger bull cohorts may be a reduction in EBV variance

in pre-selected bull teams. As shown in Figure 3, the vari-

ance of PPT and ASI (column 1 and 2, respectively) of the

training animals is greater compared to the variance in

one of the young bull cohorts (column 3 and 4, respec-

tively). This is likely to have affected the accuracy of ASI,

since ASI is a strong component of the multi-trait profit

index on which young bulls are currently selected,

whereas no pre-selection is likely to have occurred on PPT.

Finally the genetic architecture underlying traits may also

affect the robustness of accuracy of prediction from SNP

data, since for PPT compared to ASI, fewer individual SNP

with relatively large effect contributed to the prediction

equations.

Conclusions

Five regression methods proposed to calculate genomic

breeding values have been empirically evaluated using

data of 1,945 dairy bulls typed for 7,373 genome-wide

SNP markers. From our evaluations a number of impor-

tant observations can be made. Firstly, FR-LS based mod-

els included a small number of markers and had poor

accuracy and large bias of prediction. Secondly, accuracies

of MBV prediction obtained by methods that estimate

effects of all SNP were remarkably similar, despite the dif-

ferent assumptions underlying the models. Thirdly, accu-

racies derived by cross-validation with random subsets of

the training data are likely to overestimate the realized

accuracies of future predictions for some traits in young

bull cohorts. Combining marker and pedigree informa-

Page 15

Genetics Selection Evolution 2009, 41:56http://www.gsejournal.org/content/41/1/56

Page 15 of 16

(page number not for citation purposes)

tion increased the accuracy of prediction but the gain was

different for the two traits investigated. Computational

demand of a method is potentially important in imple-

menting genomic selection in practice and was lowest for

PLSR and RR-BLUP, and highest for Bayes-R.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

GM was the principal investigator in the design of the

study and methods, participated in the statistical meth-

ods, carried out the statistical analysis and drafted the

manuscript. BT and RC participated in the statistical

methods, discussions and helped revise the manuscript.

MSK participated in the analysis, had a principal role in

data acquisition, assembly and data QC and contributed

to the manuscript preparation. HWR was the project

leader, contributed to project design, data acquisition,

result interpretation and had a leading role in manuscript

preparation. All authors read and approved the manu-

script.

Authors' information

AGBU is a joint venture of NSW Department of Primary

Industry and University of New England.

Additional material

Acknowledgements

The authors wish to thank Genetics Australia for semen samples and the

Australian Dairy Herd Improvement Scheme (ADHIS) for providing EBV

and pedigree data. The study was supported by the CRC for Innovative

Dairy Products. Drs Kyall Zenger, and Julie Cavanagh are greatly acknowl-

edged for coordinating the genotyping of the bull samples, Dr Peter Thom-

son suggested and helped with the ANOVA analysis, and Dr John Hickey

(University of New England, Australia) for help with the Bayes-R implemen-

tation.

References

1.Dekkers JC, Hospital F: The use of molecular genetics in the

improvement of agricultural populations. Nat Rev Genet 2002,

3:22-32.

2. Dekkers JC: Commercial application of marker- and gene-

assisted selection in livestock: strategies and lessons. J Anim

Sci 2004, 82(E-Suppl):E313-328.

3.Grisart B, Coppieters W, Farnir F, Karim L, Ford C, Berzi P, Cam-

bisano N, Mni M, Reid S, Simon P, Spelman R, Georges M, Snell R:

Positional candidate cloning of a QTL in dairy cattle: identi-

fication of a missense mutation in the bovine DGAT1 gene

with major effect on milk yield and composition. Genome Res

2002, 12:222-231.

Montgomery GW, Crawford AM, Penty JM, Dodds KG, Ede AJ,

Henry HM, Pierson CA, Lord EA, Galloway SM, Schmack AE: The

ovine Booroola fecundity gene (FecB) is linked to markers

from a region of human chromosome 4q. Nat Genet 1993,

4:410-414.

Burton PR, Clayton DG, Cardon LR, Craddock N, Deloukas P, Dun-

canson A, Kwiatkowski DP, McCarthy MI, Ouwehand WH, Samani

NJ, Todd JA, Donnelly P, Barrett JC, Davison D, Easton D, Evans DM,

Leung HT, Marchini JL, Morris AP, Spencer CC, Tobin MD, Attwood

AP, Boorman JP, Cant B, Everson U, Hussey JM, Jolley JD, Knight AS,

Koch K, Meech E, Nutland S, Prowse CV, Stevens HE, Taylor NC,

Walters GR, Walker NM, Watkins NA, Winzer T, Jones RW, McAr-

dle WL, Ring SM, Strachan DP, Pembrey M, Breen G, St Clair D, Cae-

sar S, Gordon-Smith K, Jones L, Fraser C, Green EK, Grozeva D,

Hamshere ML, Holmans PA, Jones IR, Kirov G, Moskivina V, Nikolov

I, O'Donovan MC, Owen MJ, Collier DA, Elkin A, Farmer A, William-

son R, McGuffin P, Young AH, Ferrier IN, Ball SG, Balmforth AJ, Bar-

rett JH, Bishop TD, Iles MM, Maqbool A, Yuldasheva N, Hall AS,

Braund PS, Dixon RJ, Mangino M, Stevens S, Thompson JR, Bredin F,

Tremelling M, Parkes M, Drummond H, Lees CW, Nimmo ER, Sat-

sangi J, Fisher SA, Forbes A, Lewis CM, Onnie CM, Prescott NJ, Sand-

erson J, Matthew CG, Barbour J, Mohiuddin MK, Todhunter CE,

Mansfield JC, Ahmad T, Cummings FR, Jewell DP, Webster J, Brown

MJ, Lathrop MG, Connell J, Dominiczak A, Marcano CA, Burke B,

Dobson R, Gungadoo J, Lee KL, Munroe PB, Newhouse SJ, Onipinla

A, Wallace C, Xue M, Caulfield M, Farrall M, Barton A, Bruce IN,

Donovan H, Eyre S, Gilbert PD, Hilder SL, Hinks AM, John SL, Potter

C, Silman AJ, Symmons DP, Thomson W, Worthington J, Dunger DB,

Widmer B, Frayling TM, Freathy RM, Lango H, Perry JR, Shields BM,

Weedon MN, Hattersley AT, Hitman GA, Walker M, Elliott KS,

Groves CJ, Lindgren CM, Rayner NW, Timpson NJ, Zeggini E, New-

port M, Sirugo G, Lyons E, Vannberg F, Hill AV, Bradbury LA, Farrar

C, Pointon JJ, Wordsworth P, Brown MA, Franklyn JA, Heward JM,

Simmonds MJ, Gough SC, Seal S, Stratton MR, Rahman N, Ban M,

Goris A, Sawcer SJ, Compston A, Conway D, Jallow M, Rockett KA,

Bumpstead SJ, Chaney A, Downes K, Ghori MJ, Gwilliam R, Hunt SE,

Inouye M, Keniry A, King E, McGinnis R, Potter S, Ravindrarajah R,

Whittaker P, Widden C, Withers D, Cardin NJ, Ferreira T, Pereira-

Gale J, Hallgrimsdo'ttir IB, Howie BN, Su Z, Teo YY, Vukcevic D,

Bentley D, Mitchell SL, Newby PR, Brand OJ, Carr-Smith J, Pearce SH,

Reveille JD, Zhou X, Sims AM, Dowling A, Taylor J, Doan T, Davis JC,

Savage L, Ward MM, Learch TL, Weisman MH, Brown M: Associa-

tion scan of 14,500 nonsynonymous SNPs in four diseases

identifies autoimmunity variants.

39:1329-1337.

Perola M, Sammalisto S, Hiekkalinna T, Martin NG, Visscher PM,

Montgomery GW, Benyamin B, Harris JR, Boomsma D, Willemsen G,

Hottenga JJ, Christensen K, Kyvik KO, Sorensen TI, Pedersen NL,

Magnusson PK, Spector TD, Widen E, Silventoinen K, Kaprio J, Palotie

A, Peltonen L: Combined genome scans for body stature in

6,602 European twins: evidence for common Caucasian loci.

PLoS Genet 2007, 3:e97.

Weedon MN, Lango H, Lindgren CM, Wallace C, Evans DM, Mangino

M, Freathy RM, Perry JR, Stevens S, Hall AS, Samani NJ, Shields B,

Prokopenko I, Farrall M, Dominiczak A, Johnson T, Bergmann S, Beck-

mann JS, Vollenweider P, Waterworth DM, Mooser V, Palmer CN,

Morris AD, Ouwehand WH, Zhao JH, Li S, Loos RJ, Barroso I, Delou-

kas P, Sandhu MS, Wheeler E, Soranzo N, Inouye M, Wareham NJ,

Caulfield M, Munroe PB, Hattersley AT, McCarthy MI, Frayling TM:

Genome-wide association analysis identifies 20 loci that

influence adult height. Nat Genet 2008, 40:575-583.

Valdar W, Solberg LC, Gauguier D, Burnett S, Klenerman P, Cookson

WO, Taylor MS, Rawlins JN, Mott R, Flint J: Genome-wide genetic

association of complex traits in heterogeneous stock mice.

Nat Genet 2006, 38:879-887.

Cole JB, VanRaden PM, O'Connell JR, Van Tassell CP, Sonstegard TS,

Schnabel RD, Taylor JF, Wiggans GR: Distribution and location of

genetic effects for dairy traits. J Dairy Sci 2009, 92:2931-2946.

4.

5.

Nat Genet 2007,

6.

7.

8.

9.

Additional file 1

Tables showing model-based means from ANOVA of factors affecting

correlation (rEBV,MBV) between EBV and MBV and regression coeffi-

cient (logebEBV,MBV) of EBV on MBV. The regression coefficient was loge-

transformed to account for non-normality and unstable variance. Esti-

mates with different superscript are significantly different at the 0.05 sig-

nificance level.

Click here for file

[http://www.biomedcentral.com/content/supplementary/1297-

9686-41-56-S1.DOC]