
Increased accuracy of artificial selection by using the realized relationship matrix

B. J. HAYES1*, P. M. VISSCHER2 AND M. E. GODDARD1,3

1Biosciences Research Division, Department of Primary Industries Victoria, 1 Park Drive, Bundoora 3083, Australia.

2Queensland Institute of Medical Research, Brisbane, Australia.

3Faculty of Land and Food Resources, University of Melbourne, Parkville 3010, Australia.

(Received 23 July 2008 and in revised form 4 December 2008)

Summary

Dense marker genotypes allow the construction of the realized relationship matrix between

individuals, with elements the realized proportion of the genome that is identical by descent (IBD)

between pairs of individuals. In this paper, we demonstrate that by replacing the average

relationship matrix derived from pedigree with the realized relationship matrix in best linear

unbiased prediction (BLUP) of breeding values, the accuracy of the breeding values can be

substantially increased, especially for individuals with no phenotype of their own. We further

demonstrate that this method of predicting breeding values is exactly equivalent to the genomic

selection methodology where the effects of quantitative trait loci (QTLs) contributing to variation in

the trait are assumed to be normally distributed. The accuracy of breeding values predicted using the

realized relationship matrix in the BLUP equations can be deterministically predicted for known

family relationships, for example half sibs. The deterministic method uses the effective number of

independently segregating loci controlling the phenotype that depends on the type of family

relationship and the length of the genome. The accuracy of predicted breeding values depends on

this number of effective loci, the family relationship and the number of phenotypic records. The

deterministic prediction demonstrates that the accuracy of breeding values can approach unity if

enough relatives are genotyped and phenotyped. For example, when 1000 full sibs per family were

genotyped and phenotyped, and the heritability of the trait was 0.5, the reliability of predicted

genomic breeding values (GEBVs) for individuals in the same full sib family without phenotypes was

0.82. These results were verified by simulation. A deterministic prediction was also derived for

random mating populations, where the effective population size is the key parameter determining the

effective number of independently segregating loci. If the effective population size is large, a very

large number of individuals must be genotyped and phenotyped in order to accurately predict

breeding values for unphenotyped individuals from the same population. If the heritability of the

trait is 0.3, and Ne=1000, approximately 5750 individuals with genotypes and phenotypes are

required in order to predict GEBVs of un-phenotyped individuals in the same population with an

accuracy of 0.7.

1. Introduction

In best linear unbiased prediction (BLUP) of breeding

values, information from performance of relatives is

incorporated through the use of a relationship matrix.

Elements of this matrix are derived as the predicted

proportion of the genome that is identical by descent

(IBD) among two individuals given their pedigree

relationship. However, Mendelian sampling during

gamete formation results in variation in the realized

proportion of the genome, which is IBD between

pairs of individuals with the same predicted relation-

ship coefficients (Franklin, 1977; Hill, 1993; Guo,

1996). For example, between full-sib individuals the

* Corresponding author. Tel: +61 (0)3 9479 5439. Fax: +61 (0)3

9479 3113. e-mail: ben.hayes@dpi.vic.gov.au

Genet. Res., Camb. (2009), 91, pp. 47–60.

doi:10.1017/S0016672308009981

© 2009 Cambridge University Press

Printed in the United Kingdom


predicted proportion of the genome that is IBD is

0.5, while its standard deviation is 0.04 for a species

with 30 chromosomes each of 1 M in length (Guo,

1996).

DNA marker information can be used to calculate

the realized relationship matrix with elements the ac-

tual proportion of the genome that is IBD between

two individuals, with a high degree of precision, pro-

vided that a sufficient number of markers are used.

Nejati-Javaremi et al. (1997) demonstrated with

simulation that if the loci contributing to trait vari-

ation were known, and the alleles at these loci were

used to derive the realized relationship matrix, the

accuracy of breeding values calculated using this ma-

trix could be higher than that calculated using the

predicted relationship matrix. In practice, all the loci

contributing to trait variation are unlikely to have

been identified. Villanueva et al. (2005) demonstrated

by simulation that using the realized relationship

matrix derived from markers rather than the pre-

dicted relationship matrix in the calculation of esti-

mated breeding values (EBVs) could lead to higher

accuracies of selection. They proposed that marker

information used in this way could offer benefits in

selection programmes when no quantitative trait locus

(QTL) has been mapped or when the underlying

genetic model can be considered the infinitesimal

model, where no individual QTL has a moderate to

large effect on the trait. For some traits such as height

in humans, this is indeed the case, with the largest

reported QTLs explaining only a small fraction of the

genetic variance (e.g. Sanna et al., 2008; Visscher,

2008).

While Villanueva et al. (2005) considered estimat-

ing realized relationships conditional on a known

pedigree (exploiting linkage information), realized relationship coefficients can also be estimated for ‘unrelated’ individuals within a population. This requires

sufficient marker density to identify chromosome

segments in two individuals that are descended from

the same common, but unknown ancestor.

An alternative method by which DNA marker data

can be used to estimate breeding values is genomic

selection (Meuwissen et al., 2001). In this method, the

markers are used to track QTLs whose effects are es-

timated and summed to predict the breeding value of

each individual. However, if there are many QTLs

whose effects are normally distributed with constant

variance, then genomic selection can be equivalent

to the use of the realized relationship matrix (e.g.

Fernando, 1998; Habier et al., 2007; Van Raden, 2007; Goddard, 2008).

Currently, there is no analytical method available

to predict the accuracy of EBVs calculated using the

genomic relationship matrix considering information

from relatives. Analytical expressions would be de-

sirable to guide the design of experiments aiming to

achieve a given accuracy of genomic breeding values

(GEBVs). Our objective was to derive such expressions

for the accuracy of GEBV considering information

from relatives. We also modify the expression of

Goddard (2008) for the accuracy of GEBV in random

mating populations to improve the predictions. Our

starting point for all derivations was the equivalent

genomic selection model. We then verified the ana-

lytical predictions using two simulation approaches.

First, we derive a prediction of the accuracy based on

the prediction error variance (PEV) where the realized

relationship matrix is determined by a large number

of informative markers. Secondly, we derive accuracy

from simulations with both markers and QTLs seg-

regating as the correlation between true and predicted

breeding values. We then investigate the sensitivity of

the results to the number of markers used, the number

of QTLs and effective population size.

2. Methods

(i) An equivalent model for genomic selection

This material is also contained in Goddard (2008) but

is included here for completeness. Consider a model

of the true breeding value of the ith individual ($g_i$)

based on a large number of QTLs of small effect. To

simplify our analytical derivation, we will define a

parameter q as the number of independent chromo-

some segments. This model can be pictured as divid-

ing the chromosomes into segments that effectively

segregate independently and defining the effect of the

segment as the sum of the effects of the QTL carried

on that segment. The assumption here is that there are

at least as many QTLs as there are effective chromo-

some segments. Alternatively if QTLs are unlinked,

then q is the number of unlinked QTLs. Then

$$g_i = \sum_{j=1}^{q} W_{ij} u_j,$$

where $u_j$ is the allele substitution effect at the jth QTL and is normally distributed, $u \sim N(0, \sigma_u^2)$, where $\sigma_u^2$ is the variance of the effect of QTL alleles sampled randomly from the population, and $W_{ij}$ is 0, 1 or 2 if individual i carries 0, 1 or 2 copies of the second allele at the jth QTL. In practice, it is convenient to subtract the mean value of W from each element so that $W_{ij} = 0 - 2p_j$ or $1 - 2p_j$ or $2 - 2p_j$, where $p_j$ is the allele frequency of the second allele at locus j. This corresponds to the genomic selection model that Meuwissen et al. (2001) called the BLUP model. A simple version of genomic selection is to define the $W_{ij}$ based on markers instead of the QTL. Then the best estimates of the $u_j$ and hence $g_i$ can be obtained by BLUP.

In matrix form $g = Wu$ and $V(g) = WW'\sigma_u^2$, where W is a design matrix allocating QTL allele effects to


individuals. g is also normally distributed since it is

the sum of many normally distributed effects.
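The construction of the centred design matrix and of g = Wu can be sketched numerically. This is a minimal illustration, not the authors' simulation: the numbers of individuals and loci, the allele frequency range and the value of the effect variance are all hypothetical, and loci are drawn independently.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed, illustration only
n_ind, q = 200, 50                       # hypothetical numbers of individuals and loci
sigma_u2 = 0.01                          # hypothetical variance of allele substitution effects

p = rng.uniform(0.1, 0.9, size=q)        # frequency of the second allele at each locus
X = rng.binomial(2, p, size=(n_ind, q))  # raw allele counts: 0, 1 or 2 copies
W = X - 2.0 * p                          # centred coding: 0-2p, 1-2p or 2-2p

u = rng.normal(0.0, np.sqrt(sigma_u2), size=q)  # u_j ~ N(0, sigma_u^2)
g = W @ u                                       # true breeding values, g = Wu
V_g = W @ W.T * sigma_u2                        # V(g) = WW' sigma_u^2 for this W
```

Because each column of W is centred on its expected value 2p, the breeding values in `g` are expressed relative to the mean of the current population.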

Now a vector of phenotypic records y can be mod-

elled as either

y=Xb+ZWu+e (1)

or

y=Xb+Zg+e, (2)

where X is a design matrix, b is a vector of fixed effects and Z is a matrix allocating records to individuals. The two models are equivalent provided $V(g) = G\sigma_g^2 = WW'\sigma_u^2$, where $G = WW'\sigma_u^2/\sigma_g^2$ is the relationship matrix calculated from the markers, and $\sigma_g^2$ is the genetic variance. Elements of G are $G_{ik}$, the proportion of the genome that is IBD between individuals i and k. Relationships, like inbreeding coefficients, are always relative to a base population. By subtracting the mean allele frequencies from the elements of W, the relationships $G_{ik}$ are relative to the current population. Consequently, they average approximately zero and some are negative. This means that the genetic variance $\sigma_g^2$ is also the genetic variance in the current population. The two models give the same estimates of $g = Wu$. That is, a genomic selection model (1) with normally distributed QTL effects is equivalent to a conventional individual model (2) with the relationship matrix among the individuals estimated from the markers. Note that it assumes that the genotypes are known without error.
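The equivalence of models (1) and (2) can be checked numerically. The sketch below is illustrative only and rests on several simplifying assumptions that are not in the text: every individual has exactly one record (Z is the identity), there are no fixed effects, the variances are treated as known, and all sizes and parameter values are hypothetical. Model (2) is solved in the form $\hat{g} = G\sigma_g^2 (G\sigma_g^2 + I\sigma_e^2)^{-1} y$, which avoids inverting G.

```python
import numpy as np

rng = np.random.default_rng(1)           # fixed seed, illustration only
n_ind, q = 60, 200                       # hypothetical sizes (more loci than records)
sigma_u2, sigma_e2 = 0.01, 1.0           # hypothetical variance components

p = rng.uniform(0.1, 0.9, size=q)
W = rng.binomial(2, p, size=(n_ind, q)) - 2.0 * p   # centred genotype codes
u = rng.normal(0.0, np.sqrt(sigma_u2), size=q)
y = W @ u + rng.normal(0.0, np.sqrt(sigma_e2), size=n_ind)

# Model (1): BLUP of the segment effects, then g_hat = W u_hat
lam = sigma_e2 / sigma_u2
u_hat = np.linalg.solve(W.T @ W + lam * np.eye(q), W.T @ y)
g_hat_1 = W @ u_hat

# Model (2): individual model with G = WW' sigma_u^2 / sigma_g^2
# sigma_g2 is set to sigma_u2 * sum(2p(1-p)); the equivalence does not depend on this choice
sigma_g2 = sigma_u2 * np.sum(2.0 * p * (1.0 - p))
G = W @ W.T * sigma_u2 / sigma_g2
g_hat_2 = sigma_g2 * G @ np.linalg.solve(sigma_g2 * G + sigma_e2 * np.eye(n_ind), y)

max_diff = np.max(np.abs(g_hat_1 - g_hat_2))   # differs only by floating-point error
```

Model (1) solves a q × q system and model (2) an n × n system, so in practice the cheaper form depends on whether there are more markers or more individuals.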

(ii) Derivation of accuracies for breeding values

predicted with the equivalent model

When the marker data have been collected on a

sample of individuals, we can use either (1) or

(2) to calculate GEBVs for those individuals and their

reliabilities. However, it would be useful to predict in

general, before collecting the marker data, the accu-

racy that this form of genomic selection would

achieve. It is difficult to derive a formula for reliability

based on (2) because G is a complicated matrix. It is

perhaps easier to work with (1) but this is still difficult because the design matrix, $Z'W'WZ$, the inverse of which occurs in the system of equations required to predict the QTL effects, is complex and likely to have singularities. This complexity comes about because the $w_{ij}$ for closely linked markers are correlated. Therefore,

we will approximate (1) by a model in which there are

q independent chromosome segments as described

above. In what follows, we first derive the number of

independent chromosome segments in different family

relationships or a random mating population, and

then use these numbers in the derivation of accuracy

of GEBV.

(a) Effective number of independent chromosome

segments within families

We will determine the effective number of chromo-

some segments by considering the variation in re-

lationship between pairs of individuals with the same

pedigree. For instance, based on pedigree all full sibs

have a relationship of 0.5 but in reality this relation-

ship varies from about 0.4 to 0.6 (Hill, 1993; Visscher

et al., 2006). This variation in relationship comes

about because sibs inherit large segments of chromosomes from their parents. The more independent chromosome segments make up the genome, the more closely all full sibs would come to sharing exactly 50% of their genome.

Formulae for the variation in realized relationship

between the different types of relatives have been

published by Hill (1993) and Guo (1996) and we will

use their formulae.

Consider a single locus and calculate the relation-

ship between relatives i and j, i.e. Gij. For full-sibs

25% of the time Gij=1, 50% of the time Gij=0.5 and

25% of the time Gij=0. So the variance is 1/8. If there

are q independent chromosome segments, then the

variance of Gij=1/(8q). However, the variance in Gij

can also be calculated for a genome consisting of

chromosomes of known length in Morgan. Hill (1993)

and Guo (1996) present formulae for this. For in-

stance, using their formula, if the genome consists of a

single chromosome 35 M long, then the variance in

relationship between pairs of full-sibs is 0.00177.

Equating this to the variance of Gij in our model

with q independent chromosome segments, 1/(8q)=

0.00177. So the effective number of loci is q=1/

(8*0.00177)=70.6, close to the reported empirical

value of 82 for human full sibs (Visscher et al., 2006).

So if two gametes produced by the same sire (corre-

sponding to two sibs) are considered, then a 35 M

chromosome will experience approximately 70 crossovers (35 for each gamete). Therefore, the two gametes

can be considered as composed of 70 segments and for

each segment the probability that the two gametes are

identical is 0.5. Although we have assumed a single

chromosome here, Hill (1993) showed that the vari-

ation in relationship is not particularly affected by

assumptions on the number of chromosomes, pro-

vided the total length of the genome was kept con-

stant.
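The relation V(Gij) = 1/(8q) for full sibs can be checked with a small Monte Carlo sketch. This is an illustration under the model's own simplification that the genome consists of q segments segregating independently; the number of simulated sib pairs and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)   # fixed seed, illustration only
q = 70                           # independent chromosome segments
n_pairs = 50_000                 # hypothetical number of simulated full-sib pairs

# Per-segment sharing of a full-sib pair: 0, 0.5 or 1 with probability 1/4, 1/2, 1/4
share = rng.choice([0.0, 0.5, 1.0], size=(n_pairs, q), p=[0.25, 0.5, 0.25])
G_pair = share.mean(axis=1)      # realized relationship, averaged over segments

empirical_mean = G_pair.mean()   # close to the pedigree expectation of 0.5
empirical_var = G_pair.var()     # close to 1/(8q)
expected_var = 1.0 / (8 * q)
```

With q = 70 the expected variance is 1/560, about 0.0018, which corresponds to a standard deviation of roughly 0.04 around the pedigree value of 0.5.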

With the same assumptions as above, for half sibs,

the V(Gij) for a single locus is 1/16 and the variance

of relationship for a genome with one chromosome

35 M long is 0.00088 and so again q ≈ 70. For double

cousins, V(Gij) for a single locus is 3/32 and the vari-

ance in relationship from the formula of Guo (1996) is

0.00107, so q=88.
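The arithmetic behind these values of q is the same in each case: the single-locus variance of the relationship divided by the variance for the whole genome. A short sketch follows, in which the function name is hypothetical and the variances are the ones quoted in the text for a single 35 M chromosome.

```python
def effective_segments(single_locus_var, genome_var):
    """Effective number of independent segments, q = V(G_ij at one locus) / V(G_ij genome-wide)."""
    return single_locus_var / genome_var

# (single-locus variance, genome-wide variance quoted in the text)
cases = {
    "full sibs": (1 / 8, 0.00177),
    "half sibs": (1 / 16, 0.00088),
    "double cousins": (3 / 32, 0.00107),
}

for name, (v_locus, v_genome) in cases.items():
    print(f"{name}: q = {effective_segments(v_locus, v_genome):.1f}")
# full sibs: q = 70.6
# half sibs: q = 71.0
# double cousins: q = 87.6
```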

The number of effective loci is similar to the recom-

bination index for humans, assumed by Rasmusson


(1993) to be the number of independently segregating

units in the genome.

(b) Effective number of chromosome segments in

a random mating population

To derive the effective number of loci in a random

mating population, consider two gametes taken at

random from the population. The position at one end

of a chromosome in both gametes can be traced back

until they coalesce. Positions close to this first point

will coalesce in the same ancestor but, as one moves

along the chromosome, a recombination will be

reached such that the next position coalesces in a dif-

ferent ancestor. Thus the two gametes can be seen to

be composed of a series of short chromosome segments that coalesce. The average length of these segments is 1/(4Ne), where Ne is the effective population

size (Stam, 1980). Therefore, the two gametes of

length L Morgan are divided into 4NeL segments.

However, some segments are larger by chance than

others so that if the effective number of segments is

calculated from the variation in relationship between

individuals in the population it is approximately

2NeL/log(4NeL) per chromosome (Goddard, 2008).

However, this approximation does not consider

the fact that the small segments may still contain as

many QTL mutations as the larger segments since

they have on average a longer time to trace back to

the same common ancestor and hence a longer time

for mutations to accumulate. Therefore, the most

appropriate value for the number of effective seg-

ments might be in between 4NeL and 2NeL/log(4NeL)

per chromosome. As an approximation, we will as-

sume that the effective number of loci is 2NeL and

then test the validity of this assumption with simu-

lated data.
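The three candidate values for the effective number of segments can be compared for a given Ne and L. In this sketch the function name is hypothetical, "log" is taken to be the natural logarithm (an assumption), and the example uses Ne = 1000 with a single hypothetical chromosome of 1 Morgan.

```python
import math

def segment_counts(Ne, L):
    """Candidate effective numbers of segments for one chromosome of length L Morgan."""
    n_segments = 4 * Ne * L                        # number of coalescing segments
    weighted = 2 * Ne * L / math.log(4 * Ne * L)   # from variation in relationship (natural log assumed)
    assumed = 2 * Ne * L                           # intermediate value assumed in the text
    return n_segments, weighted, assumed

n_segments, weighted, assumed = segment_counts(Ne=1000, L=1.0)
print(n_segments, round(weighted, 1), assumed)
```

For these inputs the assumed value 2NeL lies between the other two, as the argument above requires.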

The variation in relationship between two gametes

arises for two reasons. First, some pairs of gametes

are more closely related by pedigree than others. For

instance, some pairs of gametes may share a common

parent or grandparent, whereas other pairs do not.

Secondly, even considering pairs of gametes that have

the same pedigree relationship, they may share more

or fewer alleles than the average expected for that

relationship, due to Mendelian sampling. The first

source of variation in relationship is used by a con-

ventional individual model BLUP to estimate the

breeding values of individuals including those with no

phenotypic record. The second source of variation in

realized relationship is the source of the increase in

reliability of GEBVs. For a pair of gametes with

constant pedigree relationship, the ancestor in which

one chromosome segment coalesces is independent of

the ancestor in which another chromosome segment

coalesces. Consequently, the variation in relationship

due to this source would be zero if there were an

infinite number of unlinked loci. Even though

the number of positions in the genome may be very

large, linkage causes variation in relationship by gen-

erating chromosome segments that coalesce. Since

each segment coalesces independently conditional on

the pedigree, it again seems appropriate to estimate

the effective number of loci as in between the number

of segments (4NeL) and the number of segments

weighted by length (2NeL/log(4NeL) per chromo-

some), e.g. as 2NeL.

(c) Accuracy of genomic EBVs with information

from relatives

In what follows we will assume that fixed effects can

be adequately estimated and that the data have been

corrected for them, so y=Zg+e, where y is corrected

for fixed effects.

Even without any genetic markers, the breeding

value of an individual can be estimated from pedigree

and phenotypic records. We will focus on predicting

the increase in reliability due to markers of the GEBV

of an individual that has marker data but no pheno-

typic record and no offspring because that is the most

important use of genomic selection. That is, we will

calculate the increase in accuracy of GEBV above that

obtainable simply from the pedigree and records on

ancestors and collateral relatives. In this scenario, the

breeding value of the ith individual (gi) can be ex-

pressed as the mean of individuals with the same

pedigree as individual i (f) and a deviation from that

mean caused by the actual genes the individual in-

herited:

$$g_i = f + \sum_{j=1}^{q} u_{sij} + \sum_{j=1}^{q} u_{mij},$$

where f = family mean breeding value, $u_{sij}$ = paternal allele effect inherited by the ith individual at the jth independent chromosome segment as a deviation from the family mean, $u_{mij}$ = maternal allele effect inherited by the ith individual at the jth independent chromosome segment as a deviation from the family mean, and summation is over all independent chromosome segments.

The variance of the breeding values is then

$$V(g) = V(f) + \sum_{j=1}^{q} V(u_{sj}) + \sum_{j=1}^{q} V(u_{mj}).$$

If we analyse the data y with the model and estimate f, $u_{sj}$ and $u_{mj}$:

$$\hat{g} = \hat{f} + \sum_{j=1}^{q} \hat{u}_{sj} + \sum_{j=1}^{q} \hat{u}_{mj}.$$

The components of this equation are independent,

as the effects of the sire and dam alleles are expressed


as deviations from the family mean. Therefore

$$V(\hat{g}) = V(\hat{f}) + \sum_{j=1}^{q} V(\hat{u}_{sj}) + \sum_{j=1}^{q} V(\hat{u}_{mj}). \qquad (3)$$

The reliability of GEBVs is $V(\hat{g})/V(g)$ and the accuracy is the square root of this reliability. The reliability can be calculated using eqn (3) and compared

with that obtained using only the family mean to

quantify the increase in reliability due to the marker

data.

With N progeny per family the reliability from

phenotypic and pedigree information alone is

$$\frac{V(\hat{f})}{V(f)} = \frac{N}{N + \lambda_f},$$

where

$$\lambda_f = (V(y) - V(f))/V(f).$$

Now we calculate the increase in reliability with genomic information. Assume that there are n paternal alleles per family that are equally represented in the data, so that there are N/n individuals carrying each paternal allele. As explained in the appendix, $\lambda_s = 2q/h^2$. Then

$$V(\hat{u}_s) = \sigma_u^2 (1 - 1/n) N/(N + n\lambda_s)$$

(derived in the appendix) and

$$\sum_{j=1}^{q} V(\hat{u}_{sj}) = 0.5 V(g) (1 - 1/n) N/(N + n\lambda_s)$$

because $V(g) = 2q\sigma_u^2$. $\sum_{j=1}^{q} V(\hat{u}_{mj})$ is calculated in a similar manner.

As an example consider the case of selecting among a population of individuals consisting of full sib families. In this case $V(f) = 0.5 V(g)$, $\lambda_f = (V(y) - V(f))/V(f) = (1 - 0.5h^2)/(0.5h^2)$ and $V(\hat{f}) = 0.5 V(g) N/(N + \lambda_f)$. As above, for full-sibs, if the genome consists of a single chromosome 35 Morgan long, the variance in relationship among pairs of full sibs is 0.00177, corresponding to q=70 effective chromosome segments. Consequently, $\sigma_u^2 = V(g)/(2 \times 70)$ and $\lambda_s = 2q/h^2 = 140/h^2$. Within a family of full-sibs there are two paternal alleles and two maternal alleles, so n=2,

$$\sum_{j=1}^{q} V(\hat{u}_{sj}) = 0.5 V(g) (1/2) N/(N + 2\lambda_s)$$

and $\sum_{j=1}^{q} V(\hat{u}_{mj})$ is the same. If V(g)=1, N=99 and $h^2$=0.5, then

$$V(\hat{f}) = 0.5 \times 99/(99 + 3) = 0.486.$$

So from eqn (3)

$$V(\hat{g}) = 0.486 + 0.5 \times (1/2) \times 99/(99 + 2 \times 280) + 0.5 \times (1/2) \times 99/(99 + 2 \times 280) = 0.561.$$

As V(g)=1, the reliability is 0.561.
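The worked example can be reproduced with a few lines of code. This is a sketch of the deterministic formulae as stated above, under the same assumptions (full-sib families, paternal and maternal terms equal, V(g) = h²V(y)); the function name and argument names are hypothetical.

```python
def full_sib_reliability(N, h2, q=70, n=2, V_g=1.0):
    """Reliability of a GEBV for an unphenotyped full sib, following eqn (3).

    N   : number of genotyped and phenotyped sibs per family
    h2  : heritability
    q   : effective number of independent chromosome segments
    n   : number of paternal (or maternal) alleles per family
    Returns (family-mean part, total reliability), both relative to V(g).
    """
    V_f = 0.5 * V_g                      # variance of full-sib family means
    V_y = V_g / h2                       # phenotypic variance
    lam_f = (V_y - V_f) / V_f            # lambda_f = (V(y) - V(f)) / V(f)
    lam_s = 2 * q / h2                   # lambda_s = 2q / h^2
    V_f_hat = V_f * N / (N + lam_f)      # V(f_hat)
    # Sum over segments of V(u_hat_sj); the maternal sum is identical
    marker_sum = 0.5 * V_g * (1 - 1 / n) * N / (N + n * lam_s)
    V_g_hat = V_f_hat + 2 * marker_sum   # eqn (3)
    return V_f_hat / V_g, V_g_hat / V_g

family_part, reliability = full_sib_reliability(N=99, h2=0.5)
_, reliability_1000 = full_sib_reliability(N=1000, h2=0.5)
```

With N = 99 and a heritability of 0.5 the total comes to about 0.56, matching the worked example to within rounding, and with N = 1000 it comes to about 0.82, the value quoted in the Summary.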

(d) Accuracy of GEBVs in a random breeding

population

We also want to predict the increase in reliability of an

EBV using the realized relationship matrix compared

with the pedigree relationship matrix in a random

breeding population. As before, assume the breeding

value is the sum of many QTLs each of which is in

perfect linkage disequilibrium (LD) with a marker. That is, the breeding value of individual i is $g_i = \sum_{j=1}^{q} w_{ij} u_j$ as before, except here it is convenient to express $w_{ij}$ as a deviation from the mean, e.g. $w_{ij} = x_{ij} - 2p_j$, where $x_{ij}$ is 0, 1 or 2 representing homozygote, heterozygote and other homozygote and $p_j$ is the allele frequency at independent chromosome segment j and as before $u_j$ is the allele substitution effect at the jth independent chromosome segment assuming there are only two alleles per independent chromosome segment, and q is the number of independently segregating chromosome segments, which for a randomly mating population is derived above. The phenotypes are modelled as in (1).

This derivation of accuracy of breeding value is similar to those for full-sib families but differs in an important way. In the full-sib case, each parent is assumed to have two different alleles at each independent chromosome segment, so the number of alleles at one independent chromosome segment is four times the number of families. Consequently, the genetic variance at one independent chromosome segment in the parental generation is $2\sigma_u^2$. However, in the case of a random mating population, there are assumed to be only two alleles per independent chromosome segment and the variance contributed by the jth independent chromosome segment is $V(w_j)\sigma_u^2 = 2p_j(1 - p_j)\sigma_u^2$ and the total genetic variance is $\sigma_g^2 = qV(w)\sigma_u^2$, where V(w) is the average value of $V(w_j)$ over all independent chromosome segments. In the full-sib case, we were estimating the effect of markers within a family and so only the number of individuals within the family could be used to estimate u. On the other hand, the effective number of loci within a full-sib family is small because large segments of chromosome segregate within a family. By contrast in a random mating population we are estimating the effect of u across the population, so all individuals with phenotypes (N) can be used but the effective number of loci is large because there must be a marker close enough to any QTL to be in high LD with it.

There is assumed to be no LD between the QTLs so the BLUP equations for estimating u are approximately block diagonal with the jth independent chromosome segment having a block of equations

$$[W_j' W_j + \lambda]\, \hat{u}_j = W_j' y$$
