Copyright © 2007 by the Genetics Society of America

DOI: 10.1534/genetics.106.070011

Fractioned DNA Pooling: A New Cost-Effective Strategy for Fine

Mapping of Quantitative Trait Loci

A. Korol,*,1 Z. Frenkel,* L. Cohen,* E. Lipkin† and M. Soller†

*Institute of Evolution, University of Haifa, Mount Carmel, Haifa 31905, Israel and †Department of Genetics,

Hebrew University of Jerusalem, Jerusalem 91904, Israel

Manuscript received December 20, 2006

Accepted for publication June 11, 2007

ABSTRACT

Selective DNA pooling (SDP) is a cost-effective means for an initial scan for linkage between marker

and quantitative trait loci (QTL) in suitable populations. The method is based on scoring marker allele

frequencies in DNA pools from the tails of the population trait distribution. Various analytical approaches

have been proposed for QTL detection using data on multiple families with SDP analysis. This article

presents a new experimental procedure, fractioned-pool design (FPD), aimed to increase the reliability of

SDP mapping results, by ‘‘fractioning’’ the tails of the population distribution into independent subpools.

FPD is a conceptual and structural modification of SDP that allows for the first time the use of

permutation tests for QTL detection rather than relying on presumed asymptotic distributions of the test

statistics. For situations of family and cross mapping design we propose a spectrum of new tools for QTL

mapping in FPD that were previously possible only with individual genotyping. These include: joint

analysis of multiple families and multiple markers across a chromosome, even when the marker loci are

only partly shared among families; detection of families segregating (heterozygous) for the QTL;

estimation of confidence intervals for the QTL position; and analysis of multiple-linked QTL. These new

advantages are of special importance for pooling analysis with SNP chips. Combining SNP microarray

analysis with DNA pooling can dramatically reduce the cost of screening large numbers of SNPs on large

samples, making chip technology readily applicable for genomewide association mapping in humans and

farm animals. This extension, however, will require additional, nontrivial, development of FPD analytical

tools.

Achieving reasonable statistical power of designs
for detecting marker–quantitative trait loci (QTL)
linkage for QTL of small effect is difficult and requires
large mapping populations, with consequent high cost
of marker genotyping. Similar situations also arise in
association studies based on linkage disequilibrium
(LD). A cost-effective solution to reduce costs associated
with genotyping large mapping populations is to
replace individual genotyping by DNA analysis in pools
of individuals coming from the high and the low tails
of the mapping population distribution. This concept,
referred to as “tail analysis” (Hillel et al. 1990;
Dunnington et al. 1992; Plotsky et al. 1993), “bulked
segregant analysis” (Giovannoni et al. 1991; Michelmore
et al. 1991), or “selective DNA pooling (SDP)” (Darvasi
and Soller 1994), was proposed for QTL analysis and for
testing of linkage between markers and a major gene.
Darvasi and Soller (1994) provided a detailed quantitative
analysis of this procedure, based on comparing
marker allele frequency (which can be obtained by densitometry)
in the pooled DNA samples; a number of
authors have proposed useful corrections to obtain

reliable estimates of SNP allele frequencies in pools

(Visscher and Le Hellard 2003; Zou and Zhao 2004,

2005; Craig et al. 2005). The SDP procedure can readily

be extended to situations, such as half-sib or full-sib de-

signs, where the mapping population consists of several

families. It was applied for genome scanning for QTL

affecting milk production traits using microsatellite markers

(Lipkin et al. 1998; Mosig et al. 2001).

Various approaches have been proposed for obtain-

ing QTL position and its confidence interval with SDP

(Dekkers 2000; Carleos et al. 2003; Brohede et al.

2005; Johnson 2005). Among the problems with such

analyses are varying proportion of family founders

heterozygous at both the QTL and the markers; hetero-

geneity of the families with respect to QTL effects; dif-

ferent information content of different marker loci;

allele sharing between the founder sires and dams of the

families; varying proportion of shared marker loci among

families, laboratories, and populations; effects of popu-

lation admixture; variation of PCR efficiency for marker

alleles; and the use of asymptotic, difficult-to-justify ap-

proximations of test-statistic distributions. Wang et al.

(2007) provide least-squares and maximum-likelihood

generalizations of Dekkers (2000) and address a number

1Corresponding author: Institute of Evolution, University of Haifa, Mount

Carmel, Haifa 31905, Israel. E-mail: korol@research.haifa.ac.il

Genetics 176: 2611–2623 (August 2007)

of the shortcomings of existing methodology. Recently,

DNA pooling analyses using SNP markers have also been

employed in some human mapping studies based on

populationwide association tests or involving compari-

son of pools of healthy and affected individuals (Sham

et al. 2002; Butcher et al. 2004; Schnack et al. 2004;

Brohede et al. 2005; Tamiya et al. 2005). These SNP-

based association tests are also subject to many of the

statistical limitations listed above. When analyses are

based on individual selective genotyping, analytical solutions

are available for most of these problems (Lander

and Botstein 1989; Darvasi and Soller 1992; Ronin

et al. 1998). This is not the case when the analyses are

based on SDP. Thus, despite many publications sup-

porting pooling analysis, concerns remain about the

reliability of the marker–QTL associations obtained in

this way.

A ‘‘fractioned-pool’’ approach, in which the tails of

the population distribution are randomly allocated

among a number of independent subpools, has been

considered by a few authors, with the objective of ob-

taining an empirical standard error for estimates of

marker allele frequencies in pools (e.g., Sham et al. 2002),

or for optimization of pool number/pool size, from

the viewpoint of amplification fidelity (Brohede et al.

2005). In the present article, the fractioned-pool con-

cept is extended to provide a complete analytical sys-

tem for QTL linkage mapping analysis by selective

DNA pooling, termed fractioned-pool design (FPD)

(Figure 1). The FPD removes many of the above statis-

tical limitations. The FPD analysis is not limited by an

assumption of normal distribution of the trait. However,

the tails of trait distribution (corresponding to high

and low trait values) must contain a sufficient number

of individuals to achieve a reasonably high detection

power.

For the first time in selective DNA pooling, the FPD

allows QTL detection based on permutation tests rather

than on assumed asymptotic distributions of test statis-

tics and estimation of confidence intervals for QTL

position and effect based on jackknife or bootstrap re-

sampling techniques. It also allows estimating the test

statistic more accurately than in the case of a single pool

per tail. The proposed method is illustrated using Monte

Carlo simulations. Successful validation of the FPD for

genomewide studies of quantitative variation opens a

new perspective for highly reliable and cost-efficient

large-scale QTL analysis, unattainable by standard SDP

analytical procedures.

STANDARD SELECTIVE DNA POOLING APPROACH

TO QTL MAPPING

The experimental material for QTL mapping based

on SDP consists of individuals selected from the tails of

the mapping population trait distributions. The proce-

dures considered here are suitable for mapping popu-

lations composed of full- or half-sib families or multiple

families within F2 or BC populations. The simulated

examples employed to illustrate the proposed method-

ology correspond to multiple half-sib daughter families

(e.g., a population based on artificial insemination as

found in dairy cattle). Each family consists of the prog-

eny of a different sire and is represented by some given

number of daughters per tail selected out of all pheno-

typed daughters of that family.

Assume that a sire is heterozygous at a QTL affecting

trait value, and designate as a positive sire QTL allele the

sire QTL allele increasing trait value and as a negative

sire QTL allele the sire QTL allele decreasing trait value.

Then the frequency of the positive sire QTL allele will be

higher in the group of daughters having high trait value

and lower in the group of daughters having low trait

value; the opposite will be true for the negative sire QTL

allele. Through hitchhiking effects, this difference in

the frequency of the positive and negative sire QTL

alleles in groups with high and low trait values produces

a parallel difference in the frequency of sire marker

alleles at marker loci heterozygous in the sires that are in

coupling linkage to these heterozygous QTL. Analyzing

sire marker-allele frequency differences at several

marker loci enables the position of the QTL on the

chromosome to be estimated.

It is convenient to denote the two pools as high (H)

and low (L), respectively, and the two sire alleles at the

linked marker locus m (m = 1, ..., M) as alleles $A_m$ and
$B_m$, respectively. Using this notation, we define the
statistic $D_m$ as a characteristic of sire allele divergence
in the two tails,

$$D_m = \frac{(F_{HA_m} - F_{LA_m}) - (F_{HB_m} - F_{LB_m})}{2} \qquad (1)$$

(Lipkin et al. 1998), where $F_{HA_m}$ is the frequency of
allele $A_m$ in the high pool, and $F_{HB_m}$, $F_{LA_m}$, and $F_{LB_m}$
are defined accordingly. When there are only two alleles
at the marker locus as in the case of SNP markers, $F_{A_m}$
and $F_{B_m}$ are in perfect negative correlation, and hence
only one of the alleles need be included in estimating

Figure 1.—Constructing multiple subpools. Trait distribu-

tion in each family is divided into three parts: individuals with

high or low trait values that make up the high and low tails

and individuals with intermediate trait values. At each tail, individuals

are grouped randomly into subpools. $N_T$ characterizes

the number of individuals with corresponding trait values

in a family. L1, L2, L3, and L4 are low-tail subpools; H1, H2,

H3, and H4 are high-tail subpools.

2612 A. Korol et al.

$D_m$. However, when there are multiple alleles at the
marker locus as in the case of microsatellite markers,
$F_{HA_m}$ and $F_{HB_m}$ are not perfectly correlated, and hence
both contain independent information on $D_m$. In this
case, the accuracy of estimation of D is improved by
averaging estimates from both alleles as shown in (1).
The estimate from allele $B_m$ is given a minus sign in (1)
because changes due to a linked QTL in allele $B_m$ are in
opposite direction to those in allele $A_m$, as noted above

(see Lipkin et al. 1998 for details).
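As a minimal sketch of statistic (1), the function below computes D for one marker from pool allele frequencies. The function and argument names are hypothetical and are not taken from the article. When only the frequencies of one allele are supplied, the sketch assumes a diallelic marker, so that the frequency of the second allele is one minus the first.

```python
def allele_divergence(f_high_a, f_low_a, f_high_b=None, f_low_b=None):
    """Sire allele divergence D for one marker, following statistic (1).

    f_high_a, f_low_a: frequency of sire allele A in the high and low pools.
    f_high_b, f_low_b: frequency of sire allele B in the high and low pools.
    If the B frequencies are omitted, a diallelic marker is assumed (F_B = 1 - F_A).
    """
    if f_high_b is None or f_low_b is None:
        # Diallelic case: allele B carries no independent information.
        f_high_b, f_low_b = 1.0 - f_high_a, 1.0 - f_low_a
    # Average the two allele-specific estimates; B enters with a minus sign.
    return ((f_high_a - f_low_a) - (f_high_b - f_low_b)) / 2.0
```

For a multiallelic marker all four frequencies would be passed, so that both sire alleles contribute to the estimate.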

To illustrate how the QTL substitution effect influen-

ces the expected value of D-statistics, consider a single-

QTL case for the half-sib design. Let QTL q be diallelic

with sire QTL genotype $A^{(q)}B^{(q)}$ and equal frequencies
of alleles $A^{(q)}$ and $B^{(q)}$ in the dam population. In this
situation, the proportions of QTL genotypes in the
progeny are 25% $A^{(q)}A^{(q)}$, 50% $A^{(q)}B^{(q)}$, and 25%
$B^{(q)}B^{(q)}$. Let the targeted quantitative trait be normally
distributed with residual variance $\sigma^2$ and mean value
dependent on QTL genotype: $\mu - d$ for $B^{(q)}B^{(q)}$, $\mu$ for
$A^{(q)}B^{(q)}$, and $\mu + d$ for $A^{(q)}A^{(q)}$. For 10% cutoff tails of
trait distribution and allele substitution effect of QTL
$d/\sigma$ = 0.3, 0.2, and 0.15, the expected value of $D^{(q)}$
(defined analogously to $D_m$) will be 0.26, 0.17, and 0.14,
respectively. Assume further that marker locus m is
triallelic with alleles $A_m$, $B_m$, and $C_m$; the sire's haplotypes
are $A_m A^{(q)}$ and $B_m B^{(q)}$; allele frequencies in the dam
population are 0.25 for $A_m$, 0.25 for $B_m$, and 0.5 for $C_m$;
and marker and QTL alleles in the dam population are
in linkage equilibrium. Then, if marker m is coincident
with QTL q [i.e., marker allele $A_m$ is inherited from the
sire only with $A^{(q)}$ and $B_m$ only with $B^{(q)}$], the expectation
of $D_m$ should be half of $D^{(q)}$ (i.e., 0.13, 0.085, and 0.07 for
$d/\sigma$ = 0.3, 0.2, and 0.15, respectively).
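The expected QTL-level divergence in this example can be approximated numerically. The sketch below is an illustration under stated assumptions, not the authors' computation: it reads $D^{(q)}$ as the difference in frequency of allele $A^{(q)}$ between the upper and lower tails, treats the progeny trait as a 25:50:25 mixture of unit-variance normal distributions, and finds the tail threshold by bisection. All function names are hypothetical. Under this reading it gives values of about the same size as those quoted above.

```python
from statistics import NormalDist


def expected_qtl_divergence(d_over_sigma, tail=0.10):
    """Approximate expected D at the QTL for a heterozygous sire (half-sib design).

    Assumes progeny genotype proportions 25:50:25 and trait values in units of
    the residual standard deviation.
    """
    std = NormalDist()
    d = d_over_sigma
    # (progeny proportion, genotype mean, frequency of allele A in the genotype)
    genotypes = [(0.25, -d, 0.0), (0.50, 0.0, 0.5), (0.25, d, 1.0)]

    def upper_mass(threshold):
        # Fraction of all progeny whose trait value exceeds the threshold.
        return sum(p * (1.0 - std.cdf(threshold - mean)) for p, mean, _ in genotypes)

    # Bisection for the threshold that cuts off the upper tail.
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if upper_mass(mid) > tail:
            lo = mid
        else:
            hi = mid
    threshold = (lo + hi) / 2.0

    freq_a_high = sum(
        p * (1.0 - std.cdf(threshold - mean)) * freq_a for p, mean, freq_a in genotypes
    ) / tail
    # The mixture is symmetric, so allele A in the low tail mirrors allele B in the high tail.
    freq_a_low = 1.0 - freq_a_high
    return freq_a_high - freq_a_low


if __name__ == "__main__":
    for effect in (0.3, 0.2, 0.15):
        print(effect, round(expected_qtl_divergence(effect), 3))
```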

For detecting the chromosomes with QTL effects,

one can consider for every marker m the statistic $\chi^2_m$
taken over all F families heterozygous for the marker m,

$$\chi^2_m = \sum_f \frac{D_{f,m}^2}{\mathrm{Var}\,D_{f,m}}, \qquad (2)$$

where Var $D_{f,m}$ is the sampling variance of $D_{f,m}$ for
family f at marker m. When the selected trait is not
affected by the tested chromosome ($H_0$ hypothesis), $\chi^2_m$
is presumed to follow a $\chi^2$-distribution with d.f. = F
(number of families), enabling a $\chi^2$-test for the presence
of a QTL linked to the marker (Weller et al. 1990).
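Statistic (2) can be sketched as a sum over the families that are informative at the marker. In the illustration below, the function name and the convention of passing `None` for a family whose sire is homozygous at the marker are assumptions made for the example.

```python
def chi_square_marker(d_values, variances):
    """Statistic (2) for one marker.

    d_values: D for each family, or None if the sire is not heterozygous at the marker.
    variances: sampling variance of D for each family (same order).
    Returns the statistic and the number of contributing families (degrees of freedom).
    """
    total = 0.0
    informative = 0
    for d, var in zip(d_values, variances):
        if d is None:
            continue  # family carries no information at this marker
        total += d * d / var
        informative += 1
    return total, informative
```

Under the null hypothesis the returned statistic would be compared with a chi-square distribution on the returned degrees of freedom.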

THE ANALYTICAL SYSTEM OF FPD

By joint analysis of these sire marker-allele frequency

differences, $D_m$, at several marker loci, one can estimate
the chromosomal position of the detected QTL. For one
or several families heterozygous for the same QTL,
fitting a function of chromosomal positions for observed
$D_m$ values at the polymorphic marker loci can be used for
estimation of the QTL position (similar to the procedures
described by Kearsey 1998 and Ronin et al. 1999).

Single-QTL model: For a single-QTL situation, the

expectation of statistic $D_m$ is proportional to $(1 - 2r_m)$,
where $r_m$ is the recombination rate between the marker
m and the QTL q. In (1) the sign of statistic $D_m$ depends
on which of the two sire marker alleles was designated
$A_m$ and which was designated $B_m$. In what follows we
assume that marker haplotypes of sire are known and
marker alleles from one haplotype are designated by $A_m$
and from another by $B_m$, m = 1, ..., M, where M is the
number of marker loci included in the haplotype (note
that FPD methods also apply in the case of unknown
phases; see Unknown marker linkage phase in the sire
below). Value $r_m$ depends on location of marker m and
unknown location ($x^{(q)}$) of the putative QTL on the
chromosome. Hence the expectation of $D_m$ can be
represented as

$$E D_m = \lambda\,[1 - 2r_m(x^{(q)})], \qquad (3)$$

where $\lambda$ is the (expected) value (henceforth “$\lambda$-value”)
of D for a marker that coincides with the QTL, and
$r_m(x^{(q)})$ is the recombination rate between the marker
and the QTL and will be zero for a marker located at $x^{(q)}$.
Assuming absence of interference, $r_m$ can be calculated
using the Haldane model, $r_m(y) = 0.5(1 - \exp\{-0.02y\})$,
where y is the map distance in centimorgans between $x_m$
and the unknown coordinate $x^{(q)}$ of the QTL (Figure 2).
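Expression (3) with the Haldane model can be sketched in a few lines. The function names below are hypothetical; positions are in centimorgans, and `lam` stands for the $\lambda$-value of the family.

```python
import math


def haldane_recombination(distance_cm):
    """Recombination rate for a map distance in cM, assuming no interference."""
    return 0.5 * (1.0 - math.exp(-0.02 * abs(distance_cm)))


def expected_divergence(marker_cm, qtl_cm, lam):
    """Expected D at a marker, following expression (3)."""
    r = haldane_recombination(marker_cm - qtl_cm)
    return lam * (1.0 - 2.0 * r)
```

Because $1 - 2r = \exp\{-0.02y\}$ under this model, the expected divergence decays exponentially with the distance between the marker and the QTL, which is the peaked shape described for Figure 2.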

The information on all markers scored for the same

chromosome can be combined to derive the unknown

coefficients $\lambda$ and $x^{(q)}$. These parameters can be estimated
(analogously to Wang et al. 2007) using a standard
least-squares approach (by minimizing the following
criterion):

$$\sum_m \frac{\{D_m - \lambda[1 - 2r_m(x^{(q)})]\}^2}{\mathrm{Var}\,D_m} \;\xrightarrow[x^{(q)},\,\lambda]{}\; \min. \qquad (4)$$

The sampling variance of $D_m$ (Var $D_m$) can be calculated
by ways reviewed in Sham et al. (2002) and Brohede et al.
(2005). Employment of expression (3) by using criterion
(4) can be represented in terms of a standard linear
model,

$$D_m = \lambda[1 - 2r_m(x^{(q)})] + e_m$$

Figure 2.—One QTL on the chromosome. Expectation of

the statistic D for markers situated at various locations on the

chromosome. Value ED is calculated by formula (3) (using the

Haldane model of recombination). Height of the graph at

the QTL position $x_0 = x^{(q)}$ is a characteristic of the QTL effect

on markers in this family (family $\lambda$-value).

Fractioned DNA Pooling2613

(Wang et al. 2007), or in matrix notations, $D = X\lambda + e$.
Here values $e_m$ are residuals, including both sampling
and technical errors, with variance equal to Var $D_m$; D, X,
and e are vectors of $D_m$, $[1 - 2r_m(x^{(q)})]$, and $e_m$,
correspondingly, m = 1, ..., M, and M is the number of
markers. The test statistic, calculated at given putative
QTL position, can then be written as
$\chi^2 = \sum_m \{D_m - \lambda[1 - 2r_m(x^{(q)})]\}^2 / \mathrm{Var}\,D_m$.
However, because the correlations
between values of $D_m$ for linked markers are not taken
into account in (4), the statistical quality (sampling
variance) of the estimates obtained by this criterion is
not optimal. We therefore use a more general optimization
criterion that does take correlations into account.
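For a fixed putative QTL position, the weighted least-squares criterion (4) is linear in $\lambda$ and has a closed-form minimizer. The sketch below illustrates that step only. It ignores correlations among linked markers, and all names are hypothetical.

```python
import math


def design_column(markers_cm, qtl_cm):
    """Entries 1 - 2r for each marker under the Haldane model."""
    return [math.exp(-0.02 * abs(m - qtl_cm)) for m in markers_cm]


def weighted_ls_lambda(d, var_d, markers_cm, qtl_cm):
    """Lambda minimizing criterion (4) at a fixed QTL position, and the criterion value."""
    x = design_column(markers_cm, qtl_cm)
    numerator = sum(xi * di / vi for xi, di, vi in zip(x, d, var_d))
    denominator = sum(xi * xi / vi for xi, vi in zip(x, var_d))
    lam = numerator / denominator
    criterion = sum((di - lam * xi) ** 2 / vi for xi, di, vi in zip(x, d, var_d))
    return lam, criterion
```

Evaluating `weighted_ls_lambda` over a grid of positions and keeping the position with the smallest criterion would give joint estimates of $\lambda$ and $x^{(q)}$.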

Let $e_m$ in the linear model be correlated with correlations
defined by matrix G. Then, using a generalized
least-squares approach, parameters can be estimated by
minimizing the following criterion (for simplicity of
designation, we write it in matrix form):

$$(C(D - X\lambda))'\,G^{-1}\,C(D - X\lambda) \;\xrightarrow[x^{(q)},\,\lambda]{}\; \min. \qquad (5)$$

Here C is the diagonal matrix of $(\mathrm{Var}\,D_m)^{-0.5}$. For a given
putative position $x^{(q)}$ of the QTL q, the parameter $\lambda$ minimizing
criterion (5) is equal to $(X'C'G^{-1}CX)^{-1}X'C'G^{-1}CD$.
Coefficients of matrix G can be calculated
using correlation coefficients defined under the hypothesis
of no QTL in the chromosome.

For example, if sire alleles at markers $m_1$ and $m_2$ are
not present in the dam population and there are no
technical errors, then the correlation coefficient is
$\rho = \mathrm{Corr}(D_{m_1}, D_{m_2}) = 1 - 2r$, where r is the recombination
rate between markers $m_1$ and $m_2$. The
estimated $\lambda$-value can serve as a test statistic combining
the information from multiple markers along the
chromosome. In our simulations correlations were
obtained analytically using only recombination distance
between markers and frequencies of the two sire alleles
in dam population: $\rho = \mathrm{Corr}(D_{m_1}, D_{m_2}) = (1 - 2r)\,\mathrm{Var}\,D_0 / \mathrm{Var}\,D$,
where Var $D_0$ and Var D are analytical
estimations of variances of the D-value in the cases of
zero and nonzero frequencies of sire alleles in the dam
population. Alternatively, correlations among $D_m$ values
can be estimated using the maximum-likelihood method
(Wang et al. 2007).

In the same manner it is possible to combine the
information from several families with respect to a given
chromosome, assuming that all sires that are heterozygous
at a QTL on that chromosome are heterozygous at
one and the same QTL with respect to location ($x^{(q)}$),
although the size of the QTL effect may vary among
sires. Thus, for the one-QTL assumption and multiple
families and letting $\lambda_f$ represent the $\lambda$-value for sire f,
Equation 3 will be modified as

$$D_{f,m} = \lambda_f\,[1 - 2r_{f,m}(x^{(q)})]. \qquad (3a)$$

Correspondingly, the estimation criterion will be

$$\sum_f \sum_m \frac{\{D_{f,m} - \lambda_f[1 - 2r_{f,m}(x^{(q)})]\}^2}{\mathrm{Var}\,D_{f,m}} \;\xrightarrow[x^{(q)},\,\lambda_f,\,f=1,\ldots,F]{}\; \min \qquad (4a)$$

or, taking into account the correlation between values of
D for linked markers,

$$\sum_f (C_f(D_f - X_f\lambda_f))'\,G_f^{-1}\,C_f(D_f - X_f\lambda_f) \;\xrightarrow[x^{(q)},\,\lambda_f,\,f=1,\ldots,F]{}\; \min. \qquad (5a)$$

Using this expression, the unknown parameters can be

obtained in the following way. At each of the chromosomal
positions $x = x^{(i)}$ taken consecutively with some
step (e.g., 1 cM), values $\lambda_f$, f = 1, ..., F, can be found
analytically. For every family, the r value in (3a) is
calculated using recombination distance between location
of marker m and current location $x^{(i)}$. Then, the
position minimizing the criterion can be taken as the
best position $x^{(q)}$.
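The position scan described above can be sketched as follows for criterion (4a), that is, without the correlation matrices of (5a). The data layout (one dictionary per family holding its own marker positions, D values, and variances) and all names are assumptions made for this illustration. Families need not share markers, because each marker enters only through its map position.

```python
import math


def fit_family(d, var_d, markers_cm, qtl_cm):
    """Closed-form lambda_f and residual criterion for one family at a fixed position."""
    x = [math.exp(-0.02 * abs(m - qtl_cm)) for m in markers_cm]  # 1 - 2r (Haldane)
    denominator = sum(xi * xi / vi for xi, vi in zip(x, var_d))
    lam = sum(xi * di / vi for xi, di, vi in zip(x, d, var_d)) / denominator
    criterion = sum((di - lam * xi) ** 2 / vi for xi, di, vi in zip(x, d, var_d))
    return lam, criterion


def scan_families(families, length_cm, step_cm=1.0):
    """Grid search for one shared QTL position across families (criterion 4a)."""
    best = None
    for i in range(int(length_cm / step_cm) + 1):
        position = i * step_cm
        fits = [fit_family(f["d"], f["var"], f["markers"], position) for f in families]
        total = sum(criterion for _, criterion in fits)
        if best is None or total < best["criterion"]:
            lambdas = [lam for lam, _ in fits]
            best = {
                "position": position,
                "lambdas": lambdas,
                "criterion": total,
                "sum_lambda_sq": sum(lam * lam for lam in lambdas),
            }
    return best
```

The returned `sum_lambda_sq` corresponds to the sum of squared family $\lambda$-values, which could then be used as the statistic in a permutation test.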

After fitting the model (3a), by using criteria (4a) or
(5a), the statistic $\sum(\lambda_f)^2$ can serve to conduct an overall
permutation test (see below), instead of using the
asymptotic $\chi^2$-properties of statistic (2). If we assume
one QTL in the chromosome common to all QTL-heterozygous
sires, then $\lambda_f$ will represent the expected
value of the test statistic at the marker locus coinciding
with (or closest to) the QTL. All other segregating
markers for this sire f will display a decreasing function
of the distance between the marker and the QTL.
Hence, an immanent property of our approach (similar
to the model of Kearsey 1998 or Ronin et al. 1999) is
that for single QTL, $\lambda_f$ represents the approximation of
D at the presumed position $x_0$ coinciding with the QTL.
Thus, $\lambda_f$ “absorbs” the information of all markers of the
sire, and statistic $\sum(\lambda_f)^2$ does this cumulatively across
sires, by fitting one and only one QTL position, due to the
assumption of one shared QTL.

QTL detection based on FPD permutation tests:

Employment of the FPD allows new types of tests for

QTL detection, based on permutation of subpools, as an

analog of permutations of individual trait or genotype

scores in selective genotyping analysis. These tests do

not depend on assumptions as to asymptotic distribution

of the test statistics and provide a spectrum of useful

analytical options. In particular, these tests can be em-

ployed for detecting chromosomes with QTL effects,

discriminating between sires homozygous and hetero-

zygous for the detected QTL, and comparing and con-

trasting hypotheses about one-, two-, or more QTL per

chromosome. The simplest of the proposed permuta-

tion tests is based on random reshuffling of the in-

dividual subpools between tails of the trait distribution.

This process is repeated many times, and each time the

test statistics are recalculated. In general terms, the

proportion of permuted test statistics that are greater

than the observed test statistic is the type I error of the

test (Doerge and Churchill 1996). If $H_0$ {no QTL

effect} is correct for a particular marker, such a permu-

tation will not have an appreciable effect on the level of

the test statistics. Thus, in most cases the observed test

statistic will lie well within the range of permuted

statistics. If the $H_1$ alternative is correct, reshuffling will

destroy the marker–trait (i.e., marker–tail) connection.

This will be manifested as a strong reduction of the test

statistics in the majority of permutation runs. Thus, the

observed test statistic in this case will exceed all but a

small fraction of the permuted statistics. The test can be

applied to any of the possible test statistics: $\chi^2_m$ from
Equation 2, estimated $\lambda$ from Equations 4 or 5, or $\sum(\lambda_f)^2$
from (4a) or (5a).

The total number of different reshuffling configurations
per family, $R_f$, is a function of the number of
subpools per tail. In the case of the same number of
subpools for the high and low tails, S,

$$R_f = 0.5\binom{2S}{S} \approx 4^{S-1.5}.$$

In the case of an unequal number of pools per tail,

$$R_f = \binom{S_L + S_H}{S_L} = \binom{S_L + S_H}{S_H},$$

where $S_L$ is the number of low-trait subpools and $S_H$ is
the number of high-trait subpools. Thus, for S in the
range 4–8 pools per tail, $R_f$ varies from 35 to 6435.
Clearly, the total number of configurations with multiple
families is a product of corresponding numbers for
families, $R = \prod_f R_f$. Even for a minimal S = 4, a design
with five families will give $R = 35^5 \approx 5.2 \times 10^7$
combinations. The number of combinations is important,
because the lowest possible P-value in permutation
is equal to 1/R.

Detecting chromosomes with QTL effects: Tests based
on $\chi^2_m$: The significance of QTL effect for marker m in
several families can be estimated as the proportion of
random permutation runs of pool configurations,
having test statistic value $\chi^2_m$ (Equation 2) $\geq$ $\chi^2_m$ obtained
on initial nonreshuffled data. To set significance levels
when a number of markers are considered on the same
chromosome, it is necessary to correct for multiple
comparisons, e.g., by controlling the false discovery rate
(FDR) (Benjamini and Hochberg 1995) or the proportion
of false positives (PFP) (Fernando et al. 2004).

Alternatively, a chromosomewise test can be proposed
analogous to the approaches applied in standard interval
mapping under individual genotyping. In that
case, for each set of k marker intervals, interval analysis is
conducted and the maximum (across intervals) LOD
value (max $\mathrm{LOD}_k$) or the maximum F-test (max $F_k$) for
regression-based models is calculated. Then, the significance
of the putative QTL effect of the tested chromosome
is estimated as the proportion of permutation runs
(i.e., samples corresponding to $H_0$ obtained by random
reshuffling of the trait scores relative to the multilocus
marker genotypes), where max $\mathrm{LOD}_k$/$F_k$ was equal to or
higher than the max LOD/$F_k$ value calculated for the
nonreshuffled data (Doerge and Churchill 1996).
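The configuration counts given earlier in this section for subpool reshuffling can be checked with a short calculation. The sketch below uses hypothetical function names. It applies the factor 0.5 only when both tails hold the same number of subpools, as in the formulas for $R_f$.

```python
from math import comb, prod


def configurations_per_family(s_low, s_high):
    """Number of distinct reshuffling configurations R_f for one family."""
    total = comb(s_low + s_high, s_low)
    # With equal tails, swapping the high and low labels gives a mirrored configuration.
    return total // 2 if s_low == s_high else total


def total_configurations(subpools_per_family):
    """R for several families, given (S_L, S_H) for each family."""
    return prod(configurations_per_family(sl, sh) for sl, sh in subpools_per_family)


if __name__ == "__main__":
    r = total_configurations([(4, 4)] * 5)
    print(r, 1.0 / r)  # number of configurations and the smallest attainable P-value
```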

Applying this approach to the FPD analysis, instead of

max LOD we can employ max $\chi^2 = \max_m \chi^2_m$ calculated
for the nonreshuffled and reshuffled configurations of
subpools, where $\max_m \chi^2_m$ is the value for the marker for
which $\chi^2$ is at a maximum. Note that in the case of max
$\chi^2$-statistics, the fitted model does not include any parameters
characterizing QTL effect and position, since it
is based on single-marker analysis. In contrast, the max
LOD/$F_k$ test is preceded by building a genetic model
that depends on unknown parameters and obtaining
maximum-likelihood (least squares, in the case of the
regression model) estimates of the parameters.

Significance of the putative QTL effect of the tested
chromosome can also be estimated by the P-value of the
highest significant marker on the chromosome (taking
into account the problem of multiple comparisons).
Individual P-values for marker m can be calculated by a
permutation test (using test statistic $\chi^2_m$) or a $\chi^2$-test
(Weller et al. 1990). Using the FDR approach (Benjamini
and Hochberg 1995) to control for multiple comparisons,
we denote corresponding significance thresholds by
$T_{FDR(I)}$ for the permutation test and $T_{FDR(II)}$ for the $\chi^2$-test,
respectively.

Permutation test based on $\lambda_f$: The permutation test
based on $\chi^2_m$ takes into account all markers on a chromosome,
but information contained in the relative locations
of the markers is ignored. In standard individual
genotyping schemes, single-marker analysis and interval
analysis are close with respect to QTL detection power
at moderate to high marker density. However, at low
marker density, interval analysis is more powerful. This
is due to the fact that loss of power caused by QTL–marker
recombination can be estimated as $\sim r/2$ and
$\sim r^2/4$, for single-marker analysis and interval analysis,
respectively.

It was found that in FPD, as in standard QTL mapping
analysis based on individual genotyping, hypothesis
testing is more efficient and flexible, if conducted on
the basis of fitting a mapping model aimed at QTL
detection or at discriminating between more complex
situations (such as single or multiple QTL on a chromosome,
mode of QTL action and interaction, and
linkage vs. pleiotropy as sources of genetic correlation).
In this context, by including marker positions, models
(3a), (4a), and (5a) presented above allow extracting
the information about QTL presence and location on
the tested chromosome through joint analysis of linked
markers. As shown by simulation (Table 1), power of the
max $\chi^2$-test is less than that of the $\sum(\lambda_f)^2$ test. Presumably,
this is due to the fact that the max $\chi^2$-test does
not utilize all of the information potentially contributing
to QTL detection power. Thus, for a single-family
analysis, the estimated $\lambda$-value (from Equation 4 or 5)
would be the preferred statistic for the permutation test.
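The subpool-reshuffling test can be sketched for a single family by enumerating every assignment of subpools to the two tails. This is an illustration only: each subpool is represented here by one allele-frequency estimate at one marker, and the default statistic is the absolute difference between tail means, both of which are simplifying assumptions. Any other statistic, such as a fitted $\lambda$-value, could be supplied through the hypothetical `statistic` argument.

```python
from itertools import combinations


def tail_divergence(high, low):
    """Absolute difference between mean subpool allele frequencies of the two tails."""
    return abs(sum(high) / len(high) - sum(low) / len(low))


def exact_subpool_permutation_p(high, low, statistic=tail_divergence):
    """Proportion of subpool-to-tail assignments with a statistic >= the observed one."""
    pooled = list(high) + list(low)
    observed = statistic(list(high), list(low))
    hits = 0
    total = 0
    for idx in combinations(range(len(pooled)), len(high)):
        chosen = set(idx)
        new_high = [pooled[i] for i in idx]
        new_low = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance so that the observed configuration always counts itself.
        if statistic(new_high, new_low) >= observed - 1e-12:
            hits += 1
    return hits / total
```

With four subpools per tail there are 70 labeled assignments, which correspond to 35 distinct configurations once mirrored assignments are identified. The smallest P-value this sketch can return is therefore 2/70 = 1/35. For larger designs, random reshuffling would replace full enumeration.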

For multiple-family analysis statistics, $\sum(\lambda_f)^2$ and
maximum $\lambda_f$ across all families ($\max_f |\lambda_f|$), with family-specific
least-squares estimates of $\lambda$-values being derived
from (3a) and (4a) (or 5a), can serve to conduct the
overall experimentwise permutation test across families
and markers of the analyzed chromosome. In the FPD
methodology, each marker is represented in (4a) [or in
(5a)] by its position relative to the unknown location of
the putative QTL, rather than by its name. Consequently,
there is no need for full coincidence of polymorphic
marker loci among the families. In principle,
the system will work even with zero overlapping of
polymorphic marker loci among families. This is an
important advantage of the proposed methodology over
the standard SDP methodology (Darvasi and Soller
1997), in which the test statistic is calculated for each
marker locus across families polymorphic for the
marker, and it is not possible to compensate for markers
at which the sire is homozygous by including information
from neighboring heterozygous markers.

Detecting sires heterozygous at the QTL: For analysis of a

single family, f, within a multiple-family analysis, the

estimated value of $|\lambda_f|$ or $\max_m \chi^2_{f,m}$ can be used as a test
statistic for the permutation test. The significance of a
sire f is then determined as the proportion of permutations
of the runs made over all families, where the
statistic of QTL effect $|\lambda_f|$ was greater than that for
nonreshuffled data. Sires of families where the test
statistics ($|\lambda_f|$ or $\max_m \chi^2_{f,m}$) are not significant can be
taken to be homozygous at the QTL. On this basis, sires
can be subdivided into two groups, QTL homozygous
and QTL heterozygous.

Estimating the confidence interval of QTL position:
bootstrap/jackknife analysis: One of the major parameters
characterizing the detected QTL is the accuracy of
the estimated parameters, especially of QTL position, as
given by its standard error or confidence interval. The

most common way to evaluate confidence interval of

QTL position within the framework of individual or

selective genotyping is by using resampling procedures

such as bootstrap or jackknife (Ronin et al. 1998). The

95% confidence interval of QTL location can then be

taken as the narrowest interval that includes 95% of the

resampling-based estimates of QTL position. Alterna-

tively, the confidence interval of QTL location can be

characterized by mean value $\bar{x}^{(q)}$, standard error (SE),

and standard deviation (SD) of the resampling-based

estimates. The proposed FPD methodology, for the first

time, allows resampling procedures to be applied for

DNA pooling analysis. As in the individual genotyping

application of these procedures, multiple samples are

generated from the initial data set by sampling subpools

within tails with replacement (bootstrap analysis) or without

replacement (jackknife analysis). Each such sample is treated

using the same model that was applied to the total sam-

ple, and the variation of the derived parameters among

the samples is employed to get a SD for each estimated

parameter and (if needed) a SE for its mean value. The

only difference in application of these procedures in

FPD is that pools are resampled instead of individuals.
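The resampling of subpools can be sketched as below. The names are hypothetical, and `estimate_position` is a placeholder for whatever mapping model is applied to the full data set; it is assumed to accept the resampled high-tail and low-tail subpool records and return an estimated QTL position. The jackknife variant shown drops one randomly chosen subpool per tail, which is one possible reading of sampling without replacement.

```python
import math
import random


def resample_subpools(high, low, rng, with_replacement=True):
    """Resample subpool records within each tail (bootstrap or delete-one jackknife)."""
    def draw(tail):
        if with_replacement:
            return [rng.choice(tail) for _ in tail]
        kept = list(tail)
        kept.pop(rng.randrange(len(kept)))  # leave one subpool out
        return kept
    return draw(high), draw(low)


def resampled_positions(high, low, estimate_position, n_runs=100, seed=0,
                        with_replacement=True):
    """Apply the same estimator to each resampled data set."""
    rng = random.Random(seed)
    return [
        estimate_position(*resample_subpools(high, low, rng, with_replacement))
        for _ in range(n_runs)
    ]


def narrowest_interval(estimates, coverage=0.95):
    """Narrowest interval containing the requested share of the estimates."""
    ordered = sorted(estimates)
    n = len(ordered)
    k = max(1, math.ceil(coverage * n - 1e-9))
    start = min(range(n - k + 1), key=lambda i: ordered[i + k - 1] - ordered[i])
    return ordered[start], ordered[start + k - 1]
```

The standard deviation of the values returned by `resampled_positions` would give the SD of the position estimate, and `narrowest_interval` gives the corresponding 95% interval.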

With new chip-based technologies of SNP analysis, a

high number of densely spaced polymorphic markers

may become available for FPD or interval-mapping analysis.

In this case, the resampling procedure may be modified

to include simultaneous resampling of markers

within chromosomes and subpools within tails so that

different jackknife or bootstrap runs may include not

fully coinciding sets of markers for a given family.

Simulation data: To illustrate the proposed methodology

we simulated situations corresponding to multiple

half-sib daughter families (a population based on arti-

ficial insemination, e.g., dairy cattle). Each family con-

sists of the progeny of a different sire, with each sire

family being represented by a certain number (10% of

TABLE 1

Effect of number of markers (M) under the FPD on the confidence interval (C.I.) of QTL location, comparisonwise error
rate (P-value), and statistical power, according to the test for significance and standardized allele substitution effect at
the QTL (d/σ), using simulated data

              C.I.          P-value                                            Power (%)
d/σ    M      D     SD      Σ(λf)²   max|λ|   max-χ²   TFDR(I)   TFDR(II)     Σ(λf)²   max-χ²
0.2    25     4.1   3.1     0.003    0.053    0.007    0.015     0.074        99       56
       13     5.1   3.3     0.002    0.030    0.008    0.018     0.076        99       59
        7     6.4   3.6     0.004    0.049    0.006    0.016     0.061        98       64
0.15   25     4.9   5.2     0.008    0.104    0.056    0.067     0.250        92       27
       13     6.3   5.2     0.021    0.071    0.126    0.112     0.260        90       24
        7     7.8   5.9     0.021    0.098    0.130    0.101     0.299        89       28

Tests of significance: Σ(λf)², max|λ|, max-χ², TFDR(I), and TFDR(II). See text for details. Power was calculated at P-value = 5%.
Values D and SD characterize the center and size of the confidence interval obtained in jackknife iterations (see text). Parameters
of the simulations: chromosome length 120 cM. A single QTL was situated in position 40 cM. Number of families, F = 10 (5
families, sire heterozygous at the QTL; 5 families, sire homozygous at the QTL); number of daughters per family, N = 2000; proportion
of the population selected to each tail, 0.10; number of subpools per tail, S = 4. Values are the mean based on 10 simulation
data sets; for every data set, 500 permutations and 100 jackknife iterations were made.


the total) of daughters per tail selected out of all phenotyped daughters of that family. In our simulations we used a normally distributed trait with constant variance σ² and mean value depending on QTL genotype. Each QTL q was assumed additive and diallelic with alleles A(q) and B(q). Frequencies of alleles A(q) and

B(q) in dams were set to 0.50. Frequencies of marker alleles in the dams were 0.25 A_m, 0.25 B_m, and 0.50 C_m, where A_m and B_m are sire alleles and C_m represents all

other alleles. A_m and A(q) are alleles of one of the haplotypes of the sire for all m = 1, ..., M, q = 1, ..., Q; B_m and B(q) are alleles of the other haplotype of the sire;

all loci are from one chromosome. Positions of loci

(markers and simulated QTL) on the chromosome are

defined by recombination distance from the most prox-

imal locus. In the same way we define position(s) for

putative QTL. Recombination events in the sire gamete

were simulated as independent for different parts of

the chromosome (recombination rate between loci was

calculated using distance on the linkage map and the

Haldane model). Linkage equilibrium among all alleles

(markers and QTL) was assumed in the dams.

Each progeny genotype was simulated by independently generating a haplotype inherited from the sire and a haplotype inherited from a dam. The haplotype inherited from the dam was simulated by randomly choosing alleles for each locus proportionally to their frequencies in the dams. The haplotype inherited from the sire was simulated as follows: The allele in the most proximal locus was chosen randomly from one of the two sire alleles (with probability 0.5). This allele determined the starting sire haplotype. The allele in every subsequent locus on the chromosome was chosen with probability 1 − r from the same haplotype as in the previous locus and with probability r from the alternative haplotype, where r is the recombination rate between these two consecutive loci. The trait value for each simulated individual in the progeny was set equal to the mean trait value for the inherited QTL genotype plus a normally distributed random value with mean zero and variance σ². In the single-QTL case, mean trait value was defined as μ − d(q), μ, and μ + d(q) for genotypes B(q)B(q), A(q)B(q), and A(q)A(q), correspondingly. Value d(q) was not necessarily the same for all families. In the case of two QTL (q = 1, 2), trait mean value was μ − d(1) − d(2), μ − d(2), μ + d(1) − d(2), μ − d(1), μ, μ + d(1), μ − d(1) + d(2), μ + d(2), and μ + d(1) + d(2) for genotypes B(1)B(1)B(2)B(2), A(1)B(1)B(2)B(2), A(1)A(1)B(2)B(2), B(1)B(1)A(2)B(2), A(1)B(1)A(2)B(2), A(1)A(1)A(2)B(2), B(1)B(1)A(2)A(2), A(1)B(1)A(2)A(2), and A(1)A(1)A(2)A(2), respectively. In the simulations, QTL-genotype frequencies in the tails of trait distribution for a given tail cutoff depend on the proportion d/σ = d(q)/√σ², rather than on the μ-value and σ². In our simulations we used μ = 0 and σ² = 1. Subdivision of the individuals in the tails of the trait distribution into subpools was random. The number of individuals in each subpool was equal if the number of individuals in the tail was divisible by the number of subpools; otherwise it could differ by one individual. Simulated technical error standard deviation associated with estimation of marker allele frequencies in a pool was set at 0.02 (absolute value). For analysis of the simulated data, the marker haplotypes of the sires were assumed known.
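The simulation steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' program: the function names (`haldane_r`, `sire_gamete`, `simulate_family`, `sire_hap_fraction`) are hypothetical, the QTL is assumed to be one of the listed loci, the sire is assumed heterozygous with allele A on haplotype 0, and dam marker alleles and the technical pooling error are omitted for brevity.

```python
import math
import random

def haldane_r(d_cM):
    """Recombination rate between two loci d_cM centimorgans apart (Haldane model)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cM / 100.0))

def sire_gamete(positions_cM, rng):
    """For each locus, return which sire haplotype (0 or 1) is transmitted."""
    hap = rng.randrange(2)                  # most proximal locus: either haplotype, p = 0.5
    origin = [hap]
    for left, right in zip(positions_cM, positions_cM[1:]):
        if rng.random() < haldane_r(right - left):
            hap = 1 - hap                   # switch haplotype with probability r
        origin.append(hap)
    return origin

def simulate_family(positions_cM, qtl_index, d, n, tail=0.10, n_subpools=4, seed=0):
    """Simulate n daughters of a QTL-heterozygous sire; return (low, high) subpool lists."""
    rng = random.Random(seed)
    progeny = []
    for _ in range(n):
        origin = sire_gamete(positions_cM, rng)
        sire_gives_A = origin[qtl_index] == 0   # assumption: haplotype 0 carries allele A
        dam_gives_A = rng.random() < 0.5        # QTL allele frequency 0.50 in dams
        n_A = int(sire_gives_A) + int(dam_gives_A)
        trait = (n_A - 1) * d + rng.gauss(0.0, 1.0)   # mu = 0, sigma^2 = 1
        progeny.append((trait, origin))
    progeny.sort(key=lambda rec: rec[0])
    k = int(round(tail * n))
    low, high = progeny[:k], progeny[-k:]

    def split(members):
        rng.shuffle(members)                # random subdivision into subpools
        # slicing with a stride gives subpool sizes that differ by at most one
        return [members[i::n_subpools] for i in range(n_subpools)]

    return split(low), split(high)

def sire_hap_fraction(pool, locus_index):
    """Fraction of pool members that received sire haplotype 0 at the given locus."""
    return sum(1 for _, origin in pool if origin[locus_index] == 0) / len(pool)
```

With a QTL of positive effect on haplotype 0, `sire_hap_fraction` is expected to be above 0.5 in high-tail subpools and below 0.5 in low-tail subpools for loci near the QTL, which is the signal the pooled analysis relies on.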

Example of QTL analysis by FPD: The scheme of

QTL analysis by FPD for the case of a single QTL per

chromosome is illustrated using a simulated example

with six half-sib families, three segregating for sire

alleles at the simulated QTL (i.e., the sires of the families

are heterozygous at the simulated QTL) and three not

segregating for the sire alleles at the simulated QTL.

Results are shown in Figure 3.

Various numbers of markers were employed in the

different families (with some regions being represented

by neighboring but not coinciding marker loci), illus-

trating the ability of the FPD analytical system to deal

with cases when markers are not shared among families.

To simulate such a situation, we initially generated for

each family a high excess of markers with identical

chromosome positions. Then, the majority of markers

for each family were declared ‘‘homozygous,’’ and only

a small proportion of markers were randomly selected

to be ‘‘heterozygous.’’ A QTL with standardized allele

substitution effect d/σ = 0.3 was simulated at location

40 cM on the chromosome of 120 cM length. There

were 2000 daughters per family; a proportion 0.10 of

total daughters (i.e., 200 daughters) was selected for

each tail, and there were four subpools per tail. The

overall permutation test conducted after fitting the

estimation model (5a) gave significance P = 0.009 (in 1000 permutations). P-values per family were respectively 0.029, 0.029, 0.029, 0.94, 0.69, and 0.74 (based on permutation tests within families, where only 35 possible different permutations exist for the 4 + 4 subpool configurations). Corresponding P-values for the families obtained in an experimentwise permutation test

were 0.018, 0.012, 0.023, 0.483, 0.344, and 0.428 (1000

random permutations). QTL positions estimated using

all six families or only the three families with significant

effect (P-value < 0.05) were 43.9 cM, with standard deviation of estimated position among runs SD = 2.8, and 43.6 cM (SD = 2.6), respectively (based on 500

jackknifes). On the basis of the jackknife procedure,

QTL detection power for the entire set of families was

estimated as follows. Threshold values of the test

statistic Σ(λ_f)² were obtained from the permutation

test for significance levels 5 and 1%. QTL ‘‘detection

power’’ was then estimated as the proportion of jack-

knife runs where the test statistics exceeded the thresh-

old value at the chosen significance level. Calculated in

this way, estimated powers for P-values = 0.05 and 0.01

were 99 and 82%, respectively.
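The figure of 35 within-family permutations, and the resulting floor of about 0.029 on the within-family P-value, can be checked directly. The sketch below is an illustration under stated assumptions: it assumes that swapping the "high" and "low" labels of a split is not counted as a new permutation (which reproduces the 35 quoted in the text), and it estimates detection power as the share of jackknife runs whose statistic reaches permutation significance. All function names are hypothetical.

```python
from math import comb

def distinct_permutations(s):
    """Distinct ways to split 2*s subpools into two tails of s subpools each,
    counting a split and its label-swapped mirror image once (assumption)."""
    return comb(2 * s, s) // 2

def perm_p_value(stat, permutation_stats):
    """Share of permutation statistics at least as large as the observed one."""
    return sum(p >= stat for p in permutation_stats) / len(permutation_stats)

def detection_power(jackknife_stats, permutation_stats, alpha=0.05):
    """Proportion of jackknife runs whose statistic is significant at level alpha."""
    hits = sum(perm_p_value(t, permutation_stats) <= alpha for t in jackknife_stats)
    return hits / len(jackknife_stats)

print(distinct_permutations(4))                  # 35 for the 4 + 4 configuration
print(round(1 / distinct_permutations(4), 3))    # 0.029, the smallest attainable P-value
```

The three identical family P-values of 0.029 in the example are therefore the minimum attainable with four subpools per tail, not a coincidence.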

Comparing the quality of mapping for different num-

bers of markers: A few more examples with single-QTL


chromosomes were simulated with 10 sire families (5 with sire heterozygous and 5 with sire homozygous at the QTL), with two standardized allele substitution effects at the QTL (0.2 and 0.15) situated at position x(q) = 40 cM, and with three marker densities (9, 13, and 25 evenly spaced markers per 120-cM chromosome) (Table 1). Population size, proportion selected to the tails, and number of subpools per tail were as in Figure 1. Table 1

presents the results for the six parameter combinations,

with 10 independent Monte Carlo data sets simulated

for each combination; for every simulated data set 500

permutations of subpools and 100 jackknife iterations

were made. For each of the 10 simulated data sets we

calculated the standard deviation of the difference

between the estimated QTL position x̂(q) and the simulated one x(q) = 40 cM among the 100 jackknife iterations. The mean of these standard deviations across all 10 data

sets, denoted SD, characterizes the size of the confi-

dence interval of estimated QTL position. In addition,

for each data set we calculated the difference between

the mean of estimated QTL position based on the 100

iterations and the simulated position. The mean square

of these differences, denoted D, characterizes the shift

of the center of the confidence interval relative to the

true value. Table 1 shows that increasing the number of

markers reduces D more efficiently than SD. As one

would expect, SD (and hence the size of the confidence

interval) is higher in the case of d/σ = 0.15 compared to d/σ = 0.2 (5.4 vs. 3.3).
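The two summary characteristics can be written out explicitly. The following sketch follows the verbal definitions above; the function name `ci_summary` is hypothetical, the sample (n − 1) standard deviation is an assumption, and the square root of the mean square is returned alongside it only for convenience, since the text defines D as a mean square.

```python
from math import sqrt
from statistics import mean, stdev

def ci_summary(jackknife_positions, true_position):
    """jackknife_positions: one list of estimated QTL positions (cM) per data set."""
    # The SD of (estimate - true) equals the SD of the estimates,
    # because the true position is a constant.
    sd = mean([stdev(runs) for runs in jackknife_positions])
    # Shift of the jackknife mean from the simulated position, per data set.
    shifts = [mean(runs) - true_position for runs in jackknife_positions]
    mean_square = mean([s * s for s in shifts])
    return {"SD": sd,
            "D_mean_square": mean_square,
            "D_root_mean_square": sqrt(mean_square)}

summary = ci_summary([[39.0, 41.0], [42.0, 44.0]], true_position=40.0)
print(summary)
```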

Table 1 also allows a comparison of different methods of testing the significance of QTL effect. Among the model-free tests based on χ²_m, max-χ², T_FDR(I), and T_FDR(II), the best results seem to be provided by the permutation test for the max-χ² statistic (for d/σ = 0.2) and by the T_FDR(I) test, also based on permutations (for d/σ = 0.15). According to the presented results, the T_FDR(I) test based on permutations gave a much higher level of significance than the T_FDR(II) test based on the χ²-asymptotic approximation (P-values were lower by an order of magnitude). The model-based test using the Σ(λ_f)² statistic instead of max-χ² resulted in a further severalfold decrease in P-values (see Table 1). In accordance with the ranking of the test statistics for P-values, Σ(λ_f)² also proved to be superior with respect to detection power (i.e., resulting in the lowest proportion of false-negative declarations in the case of the given fixed P-value = 0.05). Estimated power of the test based on Σ(λ_f)² was very high (≈0.9 for d/σ = 0.15 and ≥0.98 for d/σ = 0.20). When d/σ = 0.15, estimated power of this test increased slightly with increasing number of markers M. Estimated power of the test based on max-χ² was also higher for d/σ = 0.20 than for d/σ = 0.15. Nevertheless, unlike Σ(λ_f)², power for this test did not increase with increasing M; indeed, what may even be an opposite tendency was observed for d/σ = 0.20. This observation can be explained as follows: With increasing M, the probability that in permutation runs the χ²_m value for one of the markers will be higher than max_m χ²_m in the initial pool configuration also increases. Conversely, increasing M also can increase the power of this test if the additional markers belong to the vicinity of the QTL (not shown).

Multiple linked QTL analysis—two or more QTL on

the chromosome: In the case of two or more QTL per

chromosome, expected D at the marker locus is defined

by the expected frequencies of sire alleles in the high

and low pools at the closest situated QTL and by

recombination rates between marker and QTL. Let K

be the number of QTL in the chromosome and denominate the QTL according to their locations [i.e., x(1) < x(2) < ... < x(K)]. The expectation of D for a marker at location x can then be written in the form

$$
E\,D_f(x) =
\begin{cases}
\lambda_{f,1}\,\bigl(1 - 2 r_x(x^{(1)})\bigr), & x \le x^{(1)} \\
\lambda_{f,K}\,\bigl(1 - 2 r_x(x^{(K)})\bigr), & x \ge x^{(K)} \\
D_{f,x^{(q)},x^{(q+1)}}(x), & x \in [x^{(q)}, x^{(q+1)}],\; q = 1, \ldots, K-1,
\end{cases}
\qquad (6)
$$

where

Figure 3.—QTL analysis of multiple families

with some nonshared markers. Six families with 2000 daughters each were simulated (three families with sire heterozygous for a single QTL situated at position 40 cM with allele substitution effect d/σ = 0.3 and three families with sire homozygous at the QTL). Chromosome length was 120 cM with 6–10 markers per family; a proportion 0.10 of all daughters was selected to each tail in each family. Individuals in both tails were randomly subdivided into four subpools. (a) D-value across the markers for each family (solid and open squares, triangles, and diamonds represent D in families with QTL-heterozygous and -homozygous sires, correspondingly); (b) the results of jackknife resampling analysis (90% confidence intervals of λ-values for each family are shown by vertical lines, estimated in 500 jackknifes). The experimentwise P-value in a permutation test based on Σ(λ_f)² was 0.012 (in 1000 permutations). The corresponding experimentwise permutation test P-values per family were 0.018, 0.012, 0.023, 0.483, 0.344, and 0.428. Estimated QTL position on all six families or on three families with a significant (P-value < 0.05) λ-value was 43.9 cM (SD = 2.8) and 43.6 cM (SD = 2.6), respectively. Estimated power for P-value = 0.05 was 99%.


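$$
D_{f,x^{(q)},x^{(q+1)}}(x) =
\frac{\lambda_{f,q} + \lambda_{f,q+1}}{2\bigl(1 - r_{x^{(q)}}(x^{(q+1)})\bigr)}\,\bigl(1 - r_x(x^{(q)}) - r_x(x^{(q+1)})\bigr)
+ \frac{\lambda_{f,q+1} - \lambda_{f,q}}{2\, r_{x^{(q)}}(x^{(q+1)})}\,\bigl(r_x(x^{(q)}) - r_x(x^{(q+1)})\bigr).
$$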

Here λ_{f,q} is the characteristic of the qth QTL in family f, and x(q) is the location of this QTL. Value r_x(x(q)) is the recombination rate between the marker loci situated in positions x and x(q). The origin of Equation 6 is similar to Equation 3 (for details see also Wang et al. 2007): Let λ_{f,1}, ..., λ_{f,K} be expectations for D-values of markers coinciding with corresponding QTL. Assuming absence of interference we can consider the expectation of D-values separately for each interval between QTL. For the two end intervals x < x(1) and x > x(K), Equation 6 has the same form as Equation 3. For other intervals the absolute value of the expectation of D is reduced by corresponding double recombination (double recombination is not a factor for the end intervals). The estimation criterion for the regression method takes the following form:

$$
\sum_f \sum_m \frac{\{D_{f,m} - E\,D_{f,m}\}^2}{\operatorname{Var} D_{f,m}}
\;\xrightarrow[\;x^{(q)},\,\lambda_{f,q};\; f = 1,\ldots,F;\; q = 1,\ldots,K\;]{}\; \min.
\qquad (7)
$$
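The piecewise expectation of Equation 6 is straightforward to evaluate numerically. The sketch below is an illustration only: `expected_D` and `haldane_r` are hypothetical names, recombination rates are taken from the Haldane map function (the model the simulations use), and QTL positions are assumed to be distinct and sorted in increasing order.

```python
import math

def haldane_r(d_cM):
    """Recombination rate for a map distance of d_cM centimorgans (Haldane model)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cM / 100.0))

def expected_D(x, qtl_pos, lam):
    """Expected D at marker position x (cM) for QTL at qtl_pos (sorted, distinct)
    with family-specific values lam[q], following the piecewise form of Equation 6."""
    def r(a, b):
        return haldane_r(abs(a - b))

    if x <= qtl_pos[0]:                       # left end interval
        return lam[0] * (1.0 - 2.0 * r(x, qtl_pos[0]))
    if x >= qtl_pos[-1]:                      # right end interval
        return lam[-1] * (1.0 - 2.0 * r(x, qtl_pos[-1]))
    for q in range(len(qtl_pos) - 1):         # interval between two flanking QTL
        left, right = qtl_pos[q], qtl_pos[q + 1]
        if left <= x <= right:
            R = r(left, right)                # recombination between the two QTL
            ra, rb = r(x, left), r(x, right)
            return ((lam[q] + lam[q + 1]) / (2.0 * (1.0 - R)) * (1.0 - ra - rb)
                    + (lam[q + 1] - lam[q]) / (2.0 * R) * (ra - rb))

# Two QTL at 30 and 80 cM in repulsion: the expected D changes sign between them.
for pos in (0, 30, 55, 80, 120):
    print(pos, round(expected_D(pos, [30.0, 80.0], [0.1, -0.05]), 4))
```

At a marker that coincides with a QTL the interval formula reduces to the corresponding λ-value, so the three pieces join continuously.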

Fitting the model by using criterion (7) can be expressed in terms of the linear model

$$
D_f = X_f \lambda_f + e_f,
$$

where λ_f is a vector of λ_{f,1}, ..., λ_{f,K} and coefficients of matrix X_f are equal to corresponding multipliers in Equation 6. Taking into account the correlation between values of D for linked markers and using the generalized least-squares approach, the estimation criterion takes the form

$$
\sum_f \bigl(C_f (D_f - X_f \lambda_f)\bigr)'\, G_f^{-1}\, C_f\, (D_f - X_f \lambda_f)
\;\xrightarrow[\;x^{(q)},\,\lambda_{f,q};\; f = 1,\ldots,F;\; q = 1,\ldots,K\;]{}\; \min.
\qquad (8)
$$

Here matrices G and C are like in Equation 5a. For given putative QTL positions, vector λ_f of parameters λ_{f,1}, ..., λ_{f,K} minimizing criterion (8) can be calculated as

$$
\hat{\lambda}_f = \bigl(X_f' C_f' G_f^{-1} C_f X_f\bigr)^{-1} X_f' C_f' G_f^{-1} C_f D_f.
$$
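The closed-form estimate follows from criterion (8) by the usual generalized least-squares argument. A brief derivation is sketched here under the assumption that G_f is symmetric and positive definite; the shorthand W_f is introduced only for this sketch and does not appear in the original notation.

```latex
% Shorthand (assumption of this sketch): W_f = C_f' G_f^{-1} C_f, symmetric because G_f is.
\begin{align*}
Q_f(\lambda_f) &= (D_f - X_f\lambda_f)'\, W_f\, (D_f - X_f\lambda_f), \\
\frac{\partial Q_f}{\partial \lambda_f} &= -2\, X_f' W_f\, (D_f - X_f\lambda_f) = 0
  \;\Longrightarrow\; X_f' W_f X_f\, \hat{\lambda}_f = X_f' W_f D_f, \\
\hat{\lambda}_f &= \bigl(X_f' W_f X_f\bigr)^{-1} X_f' W_f D_f
  = \bigl(X_f' C_f' G_f^{-1} C_f X_f\bigr)^{-1} X_f' C_f' G_f^{-1} C_f D_f .
\end{align*}
```

Because the sum in criterion (8) separates over families once the QTL positions are fixed, each family's vector can be solved for independently, leaving only the positions to be searched numerically.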

Even in the case of only two QTL on the chromosome,

various situations can exist. These include heterozygos-

ity of different sires for one, two, or none of the QTL

and the linkage phase between the QTL (coupling vs.

repulsion) in the sires that are heterozygous for both

QTL. Thus, in addition to the foregoing tests of significance, the situation with linked QTL calls for comparisons of H2 vs. H1 (two-QTL vs. single-QTL hypotheses)

for the entire data set as well as for each family. However,

in this article we demonstrate only the potential of the

FPD system to analyze linked QTL, leaving the detailed

analysis of various scenarios for a future publication.

The example, presented in Figure 4, is based on one

simulated data set of 10 families. Each sire was simulated

heterozygous for two linked QTL (half of the sires in

coupling phase and half in repulsion phase) with allele

substitution effects d/σ = 0.3 at locations 30 and 80 cM

on a chromosome of length 120 cM with 13 evenly

spaced markers (at positions 0, 10, 20,..., 120 cM).

Population size, proportion selected to the tails, and

number of subpools per tail were as in Figure 1. After

fitting a two-QTL model and using FPD analysis to

Figure 4.—Analysis with multiple-linked QTL. Simulated were 10 families heterozygous for two linked QTL, 5 in coupling and 5

in repulsion phase. Thirteen markers were evenly spaced on a chromosome of length 120 cM. QTL 1 and QTL 2 were simulated in positions 30 and 80 cM, respectively. The allele substitution effect at both QTL in all 10 families was d/σ = 0.3. Alleles at QTL 1 and QTL 2 that came from dams were simulated as independent cases. The number of daughters per family was 2000; the proportion of total population selected to each tail was 0.10. (a) D-values for all families and markers. Points corresponding to a given family are connected by a line. (b) λ-values and their standard errors in 500 jackknifes for every family. Clear separation is observed between the first five sires (QTL in coupling phase) and the last five sires (QTL in repulsion phase). (c) Simulated (solid circle) and estimated (open circle) positions of QTL. The curve encloses the area where the position of QTL was estimated in ≥90% of 500 jackknifes {included points with integer coordinates (x, y) such that in ≥5 jackknifes, estimated QTL positions belonged in the interval (x ± 0.5, y ± 0.5 cM)}.


detect the two QTL, the estimated QTL positions were

within 2 cM from the simulated positions. Standard

errors in 500 jackknifes were 1.7 and 0.8 for QTL 1 and

QTL 2, respectively. The high quality of the analysis is

due to the high allele-substitution effects in the two QTL

and the relatively large map distance between them.

More diverse sires with respect to their QTL structure

(heterozygous at one, two, or none of the QTL) are also

treatable with relative ease within the framework of the

two-QTL FPD model.

General scheme of FPD QTL analysis: To conclude the analytical section, we present here a general scheme of the proposed system of FPD QTL analysis (Figure 5). The suggested integrative algorithm includes: (A) fitting the mapping model, (B) an overall test of significance (using λ_f-value-based models for conducting permutation tests), (C) detecting nonsignificant (QTL-homozygous) sires, (D) removing the homozygous sires and repeating the tests, (E) estimating QTL detection power, and (F) conducting jackknife analysis to evaluate the confidence interval for the estimated position of detected QTL. This scheme can be further extended to take into account the possibilities of multiple-linked-QTL analysis, including: fitting multiple-linked-QTL models; comparing multiple-linked and single-QTL models (testing H0 vs. H1 and H2 and H1 vs. H2); detection of sires heterozygous for zero, one, or multiple-linked QTL; and estimating the confidence intervals of the chromosomal positions of the detected QTL.

Unknown marker linkage phase in the sire: In the case of unknown marker–QTL linkage phase (sire marker haplotypes), the algebraic sign of the statistic D_m is not uniquely defined. For markers with unknown phase these signs (plus or minus) can be found through optimization of criteria (4), (5), (4a), (5a), (7), or (8) (with the minimum now taken over all possible combinations of signs). To make optimization in this case more effective, some heuristics can be used. For a single-QTL model where marker phase in the sire is not known, it is reasonable to allocate the same sign (say, plus) to the D-values for all markers. For the model with two QTL on the chromosome, it is reasonable to consider D-values changing sign no more than once, e.g., positive for the first m markers and negative for the others (if the two QTL in the sire are in repulsion phase).

Optimization of the signs of D-values can result in an increase in the false positive declaration rate. Indeed, it can convert some families with noisy D-values fluctuating around zero to have D-values of one sign. This can greatly increase |λ| and, hence, falsely cause a QTL-homozygous family to be declared heterozygous. Therefore, external information about linkage phases of the marker loci reduces the proportion of false positive families.
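The effect of the sign heuristics on the size of the search can be illustrated with a short sketch. The function names are hypothetical, and fixing the first marker to a plus sign is an assumption made here for illustration, on the grounds that a global sign flip only reverses the sign of the fitted λ-values.

```python
def sign_patterns(n_markers, n_qtl):
    """Candidate sign vectors for the D-values of one family under the heuristics:
    one QTL  -> all markers share one sign;
    two QTL  -> the sign changes at most once along the chromosome."""
    patterns = [[1] * n_markers]
    if n_qtl >= 2:
        for m in range(1, n_markers):
            # positive for the first m markers, negative for the others
            patterns.append([1] * m + [-1] * (n_markers - m))
    return patterns

def apply_signs(abs_D, pattern):
    """Attach a candidate sign pattern to unsigned D-values."""
    return [s * d for s, d in zip(pattern, abs_D)]

# For 13 markers the heuristic leaves 13 candidates instead of 2**13 sign combinations.
print(len(sign_patterns(13, 2)), 2 ** 13)
```

Each candidate pattern would then be scored with the chosen criterion and the best one retained, which keeps the search linear in the number of markers.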

Choosing the number of subpools: The multiple-

pool approach was previously proposed as a means of

improving the quality of allele frequency estimates

(Sham et al. 2002; Brohede et al. 2005). Within this

framework, the problem of ‘‘optimal size’’ of pools was

primarily considered from the aspect of amplification

fidelity (Brohede et al. 2005) and as a way to obtain an

adequate estimate of variation of marker allele frequencies Var D_{f,m} (e.g., Sham et al. 2002). In the present study, the number of pools affects the number of possible different permutations and jackknifes and hence affects

P-values and power of the analysis.

To demonstrate the dependence of analysis quality on

the number of subpools per tail, a series of simulation

experiments were conducted. Situations with one,

three, and five families were simulated. The proportion

of individuals taken to the tails was 0.10 as in the

previous simulations. The individuals in the tails were

then randomly subdivided into four, six, or eight sub-

pools of equal size. The family sizes were 960 and 1920.

As above, a chromosome of 120 cM length with 13 evenly spaced markers was assumed, and the QTL was simulated in position 40 cM with allele substitution effects d/σ = 0.3, 0.2, and 0.15. For each parameter combination, 10 Monte Carlo data sets were simulated; for every

set 1000 permutations and 100 jackknife iterations

were made (with exactly one pool per tail per family

being excluded in each jackknife run). The results are

summarized in Table 2.
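The jackknife scheme used here, with exactly one pool per tail per family excluded in each run, can be sketched as follows. The names are hypothetical, and choosing the excluded subpool uniformly at random is an assumption of this sketch.

```python
import random

def jackknife_replicate(n_families, n_subpools, rng):
    """For every family, drop one randomly chosen subpool from each tail and
    return the indices of the subpools that remain in the analysis."""
    replicate = []
    for _ in range(n_families):
        drop_high = rng.randrange(n_subpools)
        drop_low = rng.randrange(n_subpools)
        replicate.append({
            "high": [i for i in range(n_subpools) if i != drop_high],
            "low": [i for i in range(n_subpools) if i != drop_low],
        })
    return replicate

def n_distinct_replicates(n_families, n_subpools):
    """Number of different leave-one-pool-per-tail configurations."""
    return (n_subpools ** 2) ** n_families

rng = random.Random(1)
print(jackknife_replicate(3, 4, rng))
print(n_distinct_replicates(1, 4))   # 16 configurations for one family, S = 4
```

With a single family and four subpools per tail only 16 distinct configurations exist, so 100 jackknife iterations necessarily repeat configurations.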

It was found that a higher number of subpools does

not reduce the standard error of estimated QTL loca-

tion, if the percentage of excluded pools is the same in

each jackknife iteration (not shown). However, if in

each jackknife iteration exactly one pool per tail is

excluded, SD and confidence intervals became smaller

with a higher number of subpools (Table 2) but less

robust (i.e., sampling variance of the confidence interval

center and its size are higher), because different runs

Figure 5.—The general scheme of QTL analysis by the FPD

method.


are more dependent. This can explain why value D does not always decrease with increasing number of subpools S. In contrast, P-values decreased asymptotically with the number of subpools until some limit determined by QTL allele substitution effect, number and proportion of QTL-polymorphic families, number of daughters per family, proportion of daughters taken to each tail, number and positions of markers on the chromosome, and technical error of densitometric estimation of pool frequencies. Results summarized in Table 2 demonstrate the variation of P-value and power of the analysis that can be achieved in different situations. As expected, better results were obtained in situations with a greater number of families, a greater number of progeny per family, and a greater allele substitution effect d/σ of QTL. The unexpected smaller D and SD for the one-family situation in the case of d/σ = 0.15 (compared to d/σ = 0.2) can be explained by a shortcoming of criterion (5a): In the case of absence of or very small QTL effect, the difference in the criterion values for different x(q) is very small; and the smallest value tends to be observed for x(q) close to the average marker position (60 cM in our situation). In other words, under H0, the estimated position is not uniformly distributed along the chromosome (not shown). Note that the lowest possible P-value in permutation is equal to 1/R, where R is the number of different permutations. If we are "satisfied" with P-values ≥ α, then no more than 5/α different permutations are needed. Hence, in the case of only one family we need approximately S = log₄R + 1.5 = log₄(5/α) + 1.5 subpools. For the experimentwise permutation test in F similar families we need S = log₄(R^(1/F)) + 1.5 = (1/F) log₄(5/α) + 1.5 subpools per tail, per family. Thus, from the point of view of maximizing the number of different permutations, it is more effective to analyze more families than to make more subpools per family. The relative cost of additional families, subpools, markers, and desired QTL detection power and mapping accuracy defines a cost-effective strategy for the initial genome scan for QTL by FPD. Clearly, the above aspects of amplification fidelity and estimation of variation of marker allele frequencies considered by Brohede et al. (2005), Sham et al. (2002), and other authors should also be an important part of designing FPD experiments.
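The rule of thumb for the number of subpools can be compared with an exact count. The sketch below is illustrative: the function names are hypothetical, the exact count assumes C(2S, S)/2 distinct permutations per family (the value that gives 35 for S = 4), and the experimentwise count is assumed to be the product over F similar families.

```python
from math import comb, log

def distinct_permutations(s):
    """Distinct splits of 2*s subpools into two tails of s (mirror images counted once)."""
    return comb(2 * s, s) // 2

def approx_subpools(alpha, families=1):
    """Rule of thumb from the text: S = (1/F) * log4(5/alpha) + 1.5."""
    return log(5.0 / alpha, 4) / families + 1.5

def min_subpools(alpha, families=1):
    """Smallest S whose exact permutation count reaches 5/alpha."""
    target = 5.0 / alpha
    s = 1
    while distinct_permutations(s) ** families < target:
        s += 1
    return s

for f in (1, 3, 5):
    print(f, round(approx_subpools(0.05, f), 2), min_subpools(0.05, f))
```

For α = 0.05 the target is 100 permutations: one family needs five subpools per tail (126 permutations), whereas three families already reach 1000 permutations with three subpools per tail, in line with the remark that adding families is the more effective route.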

Correlations between D-values and quality of the

analysis: Taking into account correlations between D-

values for linked markers, i.e., using a generalized least-

squares method (Equations 5, 5a, and 8), will probably

not increase the QTL detecting power and accuracy of

the QTL position estimates in the majority of practical

situations. When substitution effects, number of daugh-

ters per family, and number of families are small, the

sampling variance of D_m is high relative to its expected

value. Taking the correlations into account will increase

the sampling variance and reduce the expected value

for each marker (Montgomery and Peck 1992). This

makes the analysis less robust. The least-squares optimization criterion, when H0 is true, follows a χ²-distribution

TABLE 2

Effect of number of subpools per tail (S) under the FPD on characteristics D and SD of the confidence interval for QTL location, comparisonwise error rate (P-value), and statistical power, according to number of families (F), number of daughters per family (N), and standardized allele substitution effect at the QTL (d/σ), using simulated data

                    D                     SD                    P-value                  Power at P = 0.05 (%)
F   N     d/σ       S=4    S=6    S=8     S=4    S=6    S=8     S=4     S=6     S=8      S=4   S=6   S=8
1   1920  0.3       10.4   10.1   10.3    4.7    4.0    3.5     0.056   0.007   0.005    -     79    89
1   1920  0.2       14.1   14.0   13.7    14.6   12.0   10.7    0.083   0.043   0.028    -     32    56
1   1920  0.15      11.0   10.3   11.1    11.6   8.8    6.7     0.156   0.135   0.110    -     -     -
3   960   0.3       4.7    3.8    3.5     5.4    4.9    3.2     0.003   0.003   0.002    59    89    94
3   960   0.2       10.3   12.1   12.0    11.4   7.1    6.9     0.030   0.021   0.023    30    52    52
3   960   0.15      14.6   14.7   14.9    19.2   15.7   13.2    0.195   0.203   0.208    -     -     -
3   1920  0.3       2.9    3.1    3.2     1.9    1.5    1.2     0.001   0.001   0.001    94    99    99
3   1920  0.2       5.7    5.2    5.4     4.4    3.4    3.0     0.003   0.002   0.003    56    82    92
3   1920  0.15      10.7   10.1   10.1    6.9    5.6    5.1     0.028   0.024   0.011    46    72    76
5   960   0.3       2.7    2.9    2.9     2.6    1.8    1.6     0.001   0.001   0.001    87    99    99
5   960   0.2       5.7    5.8    5.4     5.3    4.3    3.3     0.013   0.009   0.006    44    63    71
5   960   0.15      14.3   15.1   14.9    12.0   9.6    8.4     0.081   0.070   0.067    -     -     -

P-values and power were calculated using the permutation test based on Σ(λ_f)² (see text). Power was calculated for the threshold of the statistic corresponding to P-value = 0.05 (shown only for situations where the observed experimentwise P-value did not exceed 0.05). Characteristics D and SD of the confidence interval for QTL location were obtained from the jackknife iterations. Parameters of the simulations: chromosome length 120 cM. A single QTL was situated at position 40 cM. Number of markers M = 13. Proportion of population selected to each tail, 0.10. One subpool per tail was excluded in each jackknife. Values represent the mean of 10 simulation data sets; for every data set 1000 permutations of subpools and 100 jackknife iterations were made to estimate P-value, power, D, and SD.


with degrees of freedom equal to the number of terms

in the sum. Parameters minimizing this criterion also

maximize the likelihood function, but the difference

between the criterion values for different putative QTL

positions is small (not shown). Nevertheless, by taking

the correlations into account, we reduce the confidence

interval and discrepancy between the estimated and

simulated QTL positions (data not shown).

DISCUSSION AND PROSPECTS

Genomewide scans for the detection of marker–QTL

linkage or linkage disequilibrium for QTL of small

effect require large mapping populations and hence

involve a high cost of marker genotyping. Even more

challenging are the requirements of population size

from the viewpoint of QTL mapping accuracy. In family-based analysis, the confidence intervals for the estimated

QTL chromosomal position are of tens of centimorgans

even for QTL of moderate effects (Darvasi and Soller

1997; Ronin et al. 2003). A cost-effective solution is to

replace individual genotyping by DNA analysis in pools

using individuals from the tails of the trait distribution

(Hillel et al. 1990; Darvasi and Soller 1994) or al-

ternative phenotypic groups in the case of discontinu-

ous variation (Giovannoni et al. 1991; Michelmore

et al. 1991). To increase the fidelity of pooling analysis,

Dekkers (2000) proposed a method of joint treatment

of multiple markers by scanning a chromosome with a

sliding window (see also Johnson 2005 for further

developments in LD QTL analysis).

Although the idea of using a multiple-pool design has

been discussed previously (Sham et al. 2002; Brohede

et al. 2005), the objectives of those studies were to im-

prove the quality of the allele-frequency estimates and

corresponding variances. In addition to these uses, the

proposed FPD system utilizes the multiple-pool design

to provide a wide spectrum of new analytical options that

were previously possible only with individual genotyp-

ing. These new options are of special importance in the

light of accumulating evidence on reliability of pooling

analysis with SNP chips. Combining SNP microarray

analysis with DNA pooling can reduce dramatically the

cost of screening large numbers of SNPs on large sam-

ples, making chip technology applicable for genome-

wide association mapping in humans and farm animals

(Butcher et al. 2004; Brohede et al. 2005; Craig et al.

2005). The FPD analysis relaxes some of the previous

limitations of the pooling analysis by utilizing the infor-

mation provided by multiple subpools within tails. This

allows a flexible analytical system in QTL detection

based on resampling procedures (permutations, boot-

straps, and jackknifes), rather than on asymptotic as-

sumptions (Sham et al. 2002; Carleos et al. 2003),

enabling evaluation of the confidence interval of QTL

position and discriminating between different hypoth-

eses of trait genetic architecture.

Allowing for resampling analysis via the FPD does

come at a cost of requiring multiple subpools per tail. In

the situations when multiple traits are analyzed, indi-

viduals need to be separated into subpools in the tails

of trait distribution for every trait. In these situations

the number of subpools may be close to the number of

individuals in the mapping population (if traits are not

strongly correlated), thereby reducing the advantage

of the pooling method. Another disadvantage is that

this method only partially utilizes haplotype information

compared to individual selective genotyping. However,

a partial solution to this problem could be provided by

using multivariate tails of the multidimensional trait

distribution rather than trait-specific tails (Ronin et al.

1998).

The proposed methodology allows joint analysis of

multiple families and multiple markers across a chro-

mosome, even if the markers are only partly shared (or

even not shared at all) among families. Resampling pro-

cedures permit confidence intervals to be constructed

for family-specific l-values. These intervals allow iden-

tification of families for which the founder sire was

homozygous at the QTL. The FPD analysis permits ex-

tension to cases of two or more QTL on the same

chromosome. All this provides cost-effective options for

sequential family- and region-specific increase of marker

density to improve the QTL mapping resolution and

accuracy and to reduce type I (false positive) and type II

(false negative) errors. Of special interest is the exten-

sion of pooling methodology to genome expression

analysis (Alba et al. 2004; Kendziorski et al. 2005). The

cautious optimism of pooling RNA expressed by these

authors can be considered as justifying the extension of

the FPD to RNA analysis.

The major advantage of population-based rather than

family-based mapping is in its potential for fine and

ultra-fine mapping due to accumulation of historical

recombination events. Recent findings on the existence

of linkage disequilibrium blocks and estimates of the sizes of these blocks establish a basis for LD (association) mapping. Still, for loci with small to moderate effects on

the target traits one of the major limiting factors is the

size of the effect and not the degree of recombination

(diversity of haplotypes). Consequently, very large sample sizes are required, making pooling analysis extremely attractive. Therefore, we plan to extend the fractionated pooling design to LD-based QTL analysis.

We thank J. Dekkers for very constructive criticism and helpful suggestions. This research was supported in part by grant QLK5-CT-2001-02379 (BovMAS project) under the European Union FP5 program and by a Ph.D. fellowship from the University of Haifa to Z. F.

LITERATURE CITED

Alba, R., Z. J. Fei, P. Payton, Y. Liu, S. L. Moore et al., 2004 ESTs, cDNA microarrays, and gene expression profiling: tools for dissecting plant physiology and development. Plant J. 39: 697–714.

Benjamini, Y., and Y. Hochberg, 1995 Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B 57: 289–300.

Brohede, J., R. Dunne, J. D. McKay and G. N. Hannan, 2005 PPC: an algorithm for accurate estimation of SNP allele frequencies in small equimolar pools of DNA using data from high density microarrays. Nucleic Acids Res. 33: e142.

Butcher, L. M., E. Meaburn, L. Liu, C. Fernandes, L. Hill et al., 2004 Genotyping pooled DNA on microarrays: a systematic genome screen of thousands of SNPs in large samples to detect QTLs for complex traits. Behav. Genet. 34: 549–555.

Carleos, C., J. A. Baro, J. Canon and N. Corral, 2003 Asymptotic variances of QTL estimators with selective DNA pooling. J. Hered. 94: 175–179.

Craig, D. W., M. J. Huentelman, D. Hu-Lince, V. L. Zismann, M. C. Kruer et al., 2005 Identification of disease causing loci using an array-based genotyping approach on pooled DNA. BMC Genomics 6: 138.

Darvasi, A., and M. Soller, 1992 Selective genotyping for determination of linkage between a marker locus and a quantitative trait locus. Theor. Appl. Genet. 85: 353–359.

Darvasi, A., and M. Soller, 1994 Selective DNA pooling for determination of linkage between a molecular marker and a quantitative trait locus. Genetics 138: 1365–1373.

Darvasi, A., and M. Soller, 1997 A simple method to calculate resolving power and confidence interval of QTL map location. Behav. Genet. 27: 125–132.

Dekkers, J. C. M., 2000 Quantitative trait locus mapping based on selective DNA pooling. J. Anim. Breed. Genet. 117: 1–16.

Doerge, R. W., and G. A. Churchill, 1996 Permutation tests for multiple loci affecting a quantitative character. Genetics 142: 285–294.

Dunnington, E. A., A. Haberfeld, L. C. Stallard, P. B. Siegel and J. Hillel, 1992 Deoxyribonucleic-acid fingerprint bands linked to loci coding for quantitative traits in chickens. Poult. Sci. 71: 1251–1258.

Fernando, R. L., D. Nettleton, B. R. Southey, J. C. Dekkers, M. F. Rothschild et al., 2004 Controlling the proportion of false positives in multiple dependent tests. Genetics 166: 611–619.

Giovannoni, J. J., R. A. Wing, M. W. Ganal and S. D. Tanksley, 1991 Isolation of molecular markers from specific chromosomal intervals using DNA pools from existing mapping populations. Nucleic Acids Res. 19: 6553–6558.

Hillel, J., R. Avner, C. Baxter-Jones, E. A. Dunnington, A. Cahaner et al., 1990 DNA fingerprints from blood mixes in chickens and turkeys. Anim. Biotechnol. 2: 201–204.

Johnson, T., 2005 Multipoint linkage disequilibrium mapping using multilocus allele frequency data. Ann. Hum. Genet. 69: 474–497.

Kearsey, M. J., 1998 The principles of QTL analysis (a minimal mathematics approach). J. Exp. Bot. 49: 1619–1623.

Kendziorski, C., R. A. Irizarry, K. S. Chen, J. D. Haag and M. N. Gould, 2005 On the utility of pooling biological samples in microarray experiments. Proc. Natl. Acad. Sci. USA 102: 4252–4257.

Lander, E. S., and D. Botstein, 1989 Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121: 185–194.

Lipkin, E., M. O. Mosig, A. Darvasi, E. Ezra, A. Shalom et al., 1998 Quantitative trait locus mapping in dairy cattle by means of selective milk DNA pooling using dinucleotide microsatellite markers: analysis of milk protein percentage. Genetics 149: 1557–1567.

Michelmore, R. W., I. Paran and R. V. Kesseli, 1991 Identification of markers linked to disease-resistance genes by bulked segregant analysis: a rapid method to detect markers in specific genomic regions by using segregating populations. Proc. Natl. Acad. Sci. USA 88: 9828–9832.

Montgomery, D. C., and E. A. Peck, 1992 Introduction to Linear Regression Analysis, Ed. 2. John Wiley & Sons, New York.

Mosig, M. O., E. Lipkin, G. Khutoreskaya, E. Tchourzyna, M. Soller et al., 2001 A whole genome scan for quantitative trait loci affecting milk protein percentage in Israeli-Holstein cattle, by means of selective milk DNA pooling in a daughter design, using an adjusted false discovery rate criterion. Genetics 157: 1683–1698.

Plotsky, Y., A. Cahaner, A. Haberfeld, U. Lavi, S. J. Lamont et al., 1993 DNA fingerprint bands applied to linkage analysis with quantitative trait loci in chickens. Anim. Genet. 24: 105–110.

Ronin, Y., A. Korol, M. Shtemberg, E. Nevo and M. Soller, 2003 High-resolution mapping of quantitative trait loci by selective recombinant genotyping. Genetics 164: 1657–1666.

Ronin, Y. I., A. B. Korol and J. I. Weller, 1998 Selective genotyping to detect quantitative trait loci affecting multiple traits: interval mapping analysis. Theor. Appl. Genet. 97: 1169–1178.

Ronin, Y. I., A. B. Korol and E. Nevo, 1999 Single- and multiple-trait mapping analysis of linked quantitative trait loci: some asymptotic analytical approximations. Genetics 151: 387–396.

Schnack, H. G., S. C. Bakker, R. Van’t Slot, B. M. Groot, R. J. Sinke et al., 2004 Accurate determination of microsatellite allele frequencies in pooled DNA samples. Eur. J. Hum. Genet. 12: 925–934.

Sham, P., J. S. Bader, I. Craig, M. O’Donovan and M. Owen, 2002 DNA pooling: a tool for large-scale association studies. Nat. Rev. Genet. 3: 862–871.

Tamiya, G., M. Shinya, T. Imanishi, T. Ikuta, S. Makino et al., 2005 Whole genome association study of rheumatoid arthritis using 27 039 microsatellites. Hum. Mol. Genet. 14: 2305–2321.

Visscher, P. M., and S. Le Hellard, 2003 Simple method to analyze SNP-based association studies using DNA pools. Genet. Epidemiol. 24: 291–296.

Wang, J., K. J. Koehler and J. C. M. Dekkers, 2007 Interval mapping of quantitative trait loci with selective DNA pooling data. Genet. Sel. Evol. (in press).

Weller, J. I., Y. Kashi and M. Soller, 1990 Power of daughter and granddaughter designs for determining linkage between marker loci and quantitative trait loci in dairy-cattle. J. Dairy Sci. 73: 2525–2537.

Zou, G. H., and H. Y. Zhao, 2004 The impacts of errors in individual genotyping and DNA pooling on association studies. Genet. Epidemiol. 26: 1–10.

Zou, G. H., and H. Y. Zhao, 2005 Family-based association tests for different family structures using pooled DNA. Ann. Hum. Genet. 69: 429–442.

Communicating editor: M. W. Feldman
