Copyright © 2008 by the Genetics Society of America

DOI: 10.1534/genetics.108.090308

Optimal Design of Genetic Studies of Gene Expression With Two-Color

Microarrays in Outbred Crosses

Alex C. Lam,*,†,1 Jingyuan Fu,‡ Ritsert C. Jansen,‡ Chris S. Haley* and Dirk-Jan de Koning*

*Roslin Institute (Edinburgh) and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Roslin Biocentre, Roslin,

Midlothian EH25 9PS, United Kingdom, †Institute of Evolutionary Biology, School of Biological Sciences, University of

Edinburgh, Edinburgh EH9 3JT, United Kingdom and ‡Groningen Bioinformatics Centre, Groningen Biomolecular

Sciences and Biotechnology Institute, University of Groningen, 9751 NN, Haren, The Netherlands

Manuscript received April 14, 2008

Accepted for publication September 2, 2008

ABSTRACT

Combining global gene-expression profiling and genetic analysis of natural allelic variation (genetical

genomics) has great potential in dissecting the genetic pathways underlying complex phenotypes. Efficient

use of microarrays is paramount in experimental design as the cost of conducting this type of study is high.

For those organisms where recombinant inbred lines are available for mapping, the ‘‘distant pair design’’

maximizes the number of informative contrasts over all marker loci. Here, we describe an extension of this

design, named the ‘‘optimal pair design,’’ for use with F2 crosses between outbred lines. The performance of

this design is investigated by simulation and compared to several other two-color microarray designs. We

show that, for a given number of microarrays, the optimal pair design outperforms all other designs

considered for detection of expression quantitative trait loci (eQTL) with additive effects by linkage analysis.

We also discuss the suitability of this design for outbred crosses in organisms with large genomes and for

detection of dominance.

GENETIC analysis of variation in gene expression,

also known as genetical genomics (Jansen and Nap

2001), has great potential for dissecting the mechanisms

underlying complex phenotypes (Mehrabian et al. 2005;

Schadt et al. 2005). Although variation in transcript

abundance is often in response to external environmental

factors, part of the between-individual variation in expression

of a substantial number of genes can be explained

by DNA polymorphisms (Jin et al. 2001). To date,

the vast majority of published studies in this research area

have been conducted in model organisms such as yeast

(Brem et al. 2002), flowering plant (Keurentjes et al.

2007), nematode worm (Li et al. 2006), mouse (Schadt

et al. 2003; Bystrykh et al. 2005), and rat (Hubner et al.

2005). There are also a number of studies that focused

on human populations (Monks et al. 2004; Morley et al.

2004; Stranger et al. 2005). Efforts in mapping expression

quantitative trait loci (eQTL) have provided strong

evidence for candidate gene selection in studies of complex

phenotypes such as hypertension (Hubner et al. 2005)

and childhood asthma (Dixon et al. 2007).

As in any QTL study, appropriate sample size is

essential for adequate power in eQTL detection. Although

many of the published studies have provided

very interesting insights into the properties of genetic

loci that regulate gene-expression phenotypes, the small

sample sizes of the early studies meant they have limited

power to detect eQTL of small to moderate effects (De

Koning and Haley 2005). In many cases, there is no

shortage of animals or cell lines for a genetical genomics

approach as the genetic materials have already been

collected for concurrent large-scale studies. Therefore,

the major factor that restricts sample sizes tends to be the

high cost of the associated technologies, particularly

the cost of microarrays. To address this issue, significant

improvement in the usage of microarray resources for

genetical genomics has been proposed in a number of

articles. Jin et al. (2004) presented an algorithm for

‘‘selective phenotyping’’ in which a subsample was chosen

from the entire sample set for maximum genotypic dis-

similarity as a way to reduce the amount of phenotyping

without sacrificing sensitivity in QTL detection. In a

different article, Piepho (2005) discussed the optimal

allocation of samples to cDNA microarrays for detecting

heterosis. Bueno Filho et al. (2006) covered a range

of optimal microarray designs, from studying the geno-

typic effect of a single locus to models that include

both fixed treatment and random polygenic effects.

Rosa et al. (2006) provided a comprehensive review on

microarray design for eQTL mapping. Fu and Jansen

(2006) proposed a more general approach called the

distant pair design, which combines optimal allocation

by hybridizing most dissimilar samples and selective

genotyping when the population resource is large. They

used recombinant inbred lines (RILs) to demonstrate

the power of this approach. In this article, the research

1Corresponding author: Roslin Institute (Edinburgh), Roslin, Midlothian

EH25 9PS, United Kingdom. E-mail: alex.lam@roslin.ed.ac.uk

Genetics 180: 1691–1698 (November 2008)

emphasis is on further developing this experimental

design to other populations, like the outbred F2 crosses.

In Fu and Jansen’s study, the authors chose to explore

two-color microarray technologies over single-color plat-

forms because for the same number of slides, two-color

microarrays can potentially generate twice as much

hybridization data as single-color arrays. In addition,

two-color microarray platforms remain the only choice

for many research projects because commercial single-

color microarrays such as Affymetrix GeneChip are

available for only a handful of species. Furthermore,

two-color microarrays offer greater flexibility in experimental

design in which pairs of samples can be selected

to cohybridize deliberately, which enhances performance.

Fu and Jansen proposed the ‘‘distant pair design,’’

which outperforms the conventional designs, namely

the common reference and the loop designs, when a

panel of RILs is used. Moreover, for the expression

profiling of a given number of biological samples, the

distant pair design would require only half as many

slides as the common reference or the loop design in a

two-color microarray.

The distant pair design presents an effective direct-

pairing strategy that increases the ratio of within- over

between-slides genotypic dissimilarity (Rosa et al. 2006).

However, it is not clear how to apply this rationale

to populations in which more than two genotypes are

possible for each marker locus. Although some insight

was provided on how best to detect overdominance, or

heterosis, which involved subjects with hybrid genotypes

(Piepho 2005), this strategy for sample allocation is not

designed for mapping gene-expression variation to any

specific loci. Bueno Filho et al. (2006) addressed the

problem with three possible genotypes at a single locus

and gave a generalization of the solution to multiple

loci. Their method may be more tractable when marker

genotypes can be treated as fixed effects for contrast,

which is true for inbred line crosses. For researchers

studying genetics of outbred species, mapping resour-

ces like inbred strains or RILs are often not feasible.

By contrast, F2 intercrosses between two genetically

divergent outbred populations are much more readily

available. A major complication arising in outbred crosses

is due to the fact that there are common sets of alleles

segregating in both of the founder populations. Hence,

it is often the case that marker genotypes in the F2 generation

would not be fully informative for the origins

of lineage at any given locus. This uncertainty obscures

how one can define genotypic dissimilarity for the pur-

pose of pair assignment in distant pair design. Further-

more, researchers face the issue regarding large genome

sizes. It is expected that when genome size increases,

finding distant pairs will become more and more dif-

ficult. Fu and Jansen (2006) have shown that in RILs

a small advantage is achievable with large genomes.

However, whether this advantage is also present in an

F2 design remains uncertain. This question is directly

relevant to researchers who are interested in studying

the genetics of gene expression in nonmodel organisms.

Therefore, the usefulness of the distant pair design

for genetical genomic studies in outbred F2 crosses

warrants investigation.

In this article, we propose an extension to the distant

pair design by adapting the least-squares QTL mapping

framework (Haley et al. 1994). Here we refer to this

extension as the ‘‘optimal pair design.’’ We also assess

the performance of this design in the presence of dom-

inance. Moreover, we consider the impact of genome

size on power and discuss the usefulness of our exten-

sion of the distant pair design for eQTL studies in

outbred experimental crosses.

MATERIALS AND METHODS

QTL analysis: The method for mapping QTL follows the

least-squares approach (Haley et al. 1994). Briefly, the line

origins at fixed intervals (e.g., 1 cM) along the genome for the

individuals in the F2 generation are expressed as lineage

probabilities, conditional on the marker genotype. This can be

done by considering all possible line-origin combinations

based on the parental and grandparental genotypes and has

been implemented in the online software ‘‘QTL Express’’

(Seaton et al. 2002). Assuming that founder lines are fixed

for alternative QTL alleles, the lineage probabilities can be

used to predict the putative QTL genotypes. Phenotypic values

are then regressed onto genetic coefficients calculated for a

putative QTL at a fixed position. The genetic coefficients for

additive and dominance effects are derived from the conditional

probabilities: the additive coefficient (denoted $x_a$) is

the difference of the probabilities for the homozygous line

origins, and the dominance coefficient (denoted $x_d$) is the

sum of the probabilities for the heterozygous line origins. An

F-ratio test statistic can be used to test the null model (without

QTL fitted) against the full model (with QTL fitted) and de-

termine the significance of the presence of QTL. For full

details on the derivation of line-origin probabilities and

regression-based QTL mapping, see Haley et al. (1994).
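
The coefficient definitions above can be illustrated with a minimal sketch. It assumes that the four line-origin probabilities at a position are already available; the container `LineOriginProbs`, its field names, and the sign convention (line 1 minus line 2) are hypothetical and are not taken from QTL Express or from Haley et al. (1994).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineOriginProbs:
    """Line-origin probabilities at one position, conditional on markers.

    p11: both alleles from founder line 1; p22: both from founder line 2;
    p12 and p21: the two heterozygous line origins.
    (Hypothetical container; the names are illustrative only.)
    """
    p11: float
    p12: float
    p21: float
    p22: float


def genetic_coefficients(p: LineOriginProbs) -> tuple[float, float]:
    """Return (x_a, x_d) for one individual at one position."""
    total = p.p11 + p.p12 + p.p21 + p.p22
    if abs(total - 1.0) > 1e-9:
        raise ValueError("line-origin probabilities must sum to 1")
    x_a = p.p11 - p.p22  # difference of the homozygous probabilities
    x_d = p.p12 + p.p21  # sum of the heterozygous probabilities
    return x_a, x_d


# An individual whose markers point mostly to a line-1 homozygote.
x_a, x_d = genetic_coefficients(LineOriginProbs(0.7, 0.1, 0.1, 0.1))
```

With fully informative markers the probabilities are 0 or 1, so the additive coefficient takes only the values -1, 0, and 1; with partially informative markers it falls in between.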

In the context of a pair design in two-color microarrays, the

gene-expression phenotypes can be expressed either in ratios

or in signals of the separate channels. In this article we chose

ratios over signals as the phenotypes because the use of ratios

can minimize the risk of bias as a result of spot or array effects

(Wit and McClure 2004). Fu and Jansen (2006) argued that

there is a negligible difference in the final results between

ratios and signals, provided that the distributional assump-

tions for the array and spot effects used in the signal-based

analysis are correct. The log ratio of the red channel intensity

to the green channel intensity of a probe is equivalent to the

difference of the two signal intensities in logarithmic scale. To

utilize such phenotypes in the Haley–Knott least-squares

framework, the linear regression model can be written as

$$\Delta y_i = \mu + \Delta x_{ai}\,a + \Delta x_{di}\,d + e_i, \tag{1}$$

where $\Delta y_i$ is the difference obtained by subtracting the log signal of

the green channel from that of the red channel for the ith

microarray ($i = 1, \ldots, n$); $\mu$ is the overall mean; $\Delta x_{ai}$ is the

difference of the additive coefficients obtained by subtracting $x_a$ of

the individual assigned to the green channel from $x_a$ of the

individual assigned to the red channel for the ith microarray;

$\Delta x_{di}$ is the corresponding difference of the dominance coefficients $x_d$; $a$ and $d$ are

the additive and dominance parameters, respectively; and $e_i$

is the residual error. In matrix form, the expression can be

simplified as $Y = Xb + e$, where $b = (\mu, a, d)^{t}$ collects the overall mean and the additive and dominance parameters.
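
The following sketch shows one way regression model (1) could be fitted to the per-array differences. It is a plain least-squares illustration, not the authors' R implementation: the function names are hypothetical, the null model is assumed to contain only the overall mean, and the F ratio is returned without converting it to a P-value.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / m[r][r]
    return x


def fit_pair_model(dy, dxa, dxd):
    """Fit dy = mu + a*dxa + d*dxd by least squares.

    Returns the estimates [mu, a, d] and the F ratio of the full model
    against a mean-only null model (2 and n - 3 degrees of freedom).
    """
    rows = [[1.0, xa, xd] for xa, xd in zip(dxa, dxd)]
    k, n = 3, len(dy)
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, dy)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * x for b, x in zip(beta, r)) for r in rows]
    rss_full = sum((y - f) ** 2 for y, f in zip(dy, fitted))
    mean_dy = sum(dy) / n
    rss_null = sum((y - mean_dy) ** 2 for y in dy)
    if rss_full == 0.0:
        f_ratio = float("inf")
    else:
        f_ratio = ((rss_null - rss_full) / 2) / (rss_full / (n - k))
    return beta, f_ratio


# Toy data for six arrays, generated without noise from mu=0.1, a=0.5, d=0.2.
dxa = [2.0, -2.0, 1.0, -1.0, 0.0, 1.0]
dxd = [0.0, 0.0, 1.0, -1.0, 1.0, -1.0]
dy = [0.1 + 0.5 * xa + 0.2 * xd for xa, xd in zip(dxa, dxd)]
(mu_hat, a_hat, d_hat), f_ratio = fit_pair_model(dy, dxa, dxd)
```

Because each response is a within-array log ratio, any effect shared by both channels of an array cancels before the fit, which is the motivation for working with ratios given above.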

Finding optimal pairs: We used the same definition for the

optimal design as in the original publication on the distant

pair design (Fu and Jansen 2006), which is the minimum for

the sum of the variances of $\hat{b}$ in the matrix form of our model.

Following the A-optimality criterion (Wit and McClure

2004), this is equivalent to minimizing the trace of $(X'X)^{-1}$.

For our regression model in (1), the matrix X consists of a

column of 1’s for the mean $\mu$, a column of $\Delta x_a$ coefficients for

the additive parameter, and, if dominance is included in the

model, a third column of $\Delta x_d$ coefficients. To reach the

optimal pairing design over all positions in the genome, we

search for the pairing that minimizes the sum, over all marker loci, of the

trace of $(X'X)^{-1}$. Genetic coefficients at marker loci only are

used for optimization to keep the computation tractable.

Because the genetic coefficients between marker intervals are

derived from the markers, we do not anticipate our optimization

method to be different from using the coefficients at every

centimorgan.
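
As a sketch of the criterion just described, the code below scores a candidate pairing by summing the trace of the inverse cross-product matrix over marker loci. It covers only the additive model (a column of 1's plus the additive differences), for which the 2 x 2 inverse has a closed form. The function names and the data layout (`xa[i][m]` as the additive coefficient of individual `i` at marker `m`) are assumptions made for illustration.

```python
def trace_inv_xtx(dx):
    """trace((X'X)^-1) for X = [1, dx], using the 2 x 2 closed form."""
    n = len(dx)
    s = sum(dx)
    ss = sum(v * v for v in dx)
    det = n * ss - s * s
    if det <= 1e-12:
        return float("inf")  # additive effect is not estimable
    return (n + ss) / det


def design_score(xa, pairs):
    """Sum the A-optimality criterion over all marker loci.

    pairs: list of (red, green) index tuples, one per microarray.
    """
    total = 0.0
    for m in range(len(xa[0])):
        dx = [xa[red][m] - xa[green][m] for red, green in pairs]
        total += trace_inv_xtx(dx)
    return total


# Eight individuals, one fully informative marker.
xa = [[1.0], [-1.0], [1.0], [-1.0], [1.0], [-1.0], [1.0], [-1.0]]
distant = [(0, 1), (3, 2), (4, 5), (7, 6)]  # all differences are +2 or -2
mixed = [(0, 1), (3, 2), (4, 6), (5, 7)]    # two arrays pair like with like
score_distant = design_score(xa, distant)   # 20 / 64
score_mixed = design_score(xa, mixed)       # 12 / 32
```

In this toy case the pairing of opposite homozygotes gives the lower (better) score. Note that the dye assignment must be balanced: if every array had the same difference, that column would be collinear with the mean and the criterion would be infinite.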

The simulated annealing technique (Kirkpatrick et al.

1983) was used to find a pairing configuration that is optimal

or close to optimal according to the definition above. Consult

Fu and Jansen (2006) for details of the procedures. The

implementation of finding optimal pairs was accomplished

using the R statistical computer program (R Development

Core Team 2007).
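
A generic simulated annealing search over pairings might look like the sketch below. It is not the procedure of Fu and Jansen (2006), whose details should be taken from that article: the move set (exchanging samples between arrays or swapping dyes), the geometric cooling schedule, and all parameter values are assumptions, and the score is the additive-only criterion summed over markers.

```python
import math
import random


def design_score(xa, pairs):
    """Sum over markers of trace((X'X)^-1) for X = [1, delta x_a]."""
    n = len(pairs)
    total = 0.0
    for m in range(len(xa[0])):
        dx = [xa[red][m] - xa[green][m] for red, green in pairs]
        s, ss = sum(dx), sum(v * v for v in dx)
        det = n * ss - s * s
        if det <= 1e-12:
            return float("inf")  # effect not estimable at this marker
        total += (n + ss) / det
    return total


def anneal_pairs(xa, n_iter=20000, t0=1.0, cooling=0.9995, seed=0):
    """Search for a low-score pairing of individuals 0..len(xa)-1."""
    rng = random.Random(seed)
    ids = list(range(len(xa)))
    rng.shuffle(ids)
    pairs = [(ids[i], ids[i + 1]) for i in range(0, len(ids) - 1, 2)]
    current = design_score(xa, pairs)
    best, best_pairs = current, list(pairs)
    temp = t0
    for _ in range(n_iter):
        cand = list(pairs)
        i, j = rng.sample(range(len(cand)), 2)
        if rng.random() < 0.5:
            # Exchange the green-channel samples of two arrays.
            cand[i], cand[j] = (cand[i][0], cand[j][1]), (cand[j][0], cand[i][1])
        else:
            # Swap the dyes within one array.
            cand[i] = (cand[i][1], cand[i][0])
        new = design_score(xa, cand)
        delta = new - current
        # Metropolis rule: always accept improvements, sometimes accept worse.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            pairs, current = cand, new
            if current < best:
                best, best_pairs = current, list(pairs)
        temp *= cooling
    return best_pairs, best
```

The function returns the best pairing visited rather than the final state, so the result is never worse than the random starting configuration. As with any annealing run, it is close to optimal at best, and the outcome depends on the seed and schedule.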

Power assessment via simulations: We studied three differ-

ent genome sizes: 100, 1000, and 2000 cM; and for each

genome size we simulated 100 replicates of F2 intercrosses.

First, F1 individuals were generated by randomly mating 20 F0

sires from founder line one to 80 F0 dams from founder line

two (4 dams per sire), each having 5 offspring. Then, another

400 offspring were generated in the F2 generation by randomly

mating 20 F1 sires to 80 F1 dams (5 progenies per mating).

Marker data were simulated for all samples, with 11 evenly

spaced markers per chromosome of 100 cM in length. Four

alleles were simulated for every marker segregating at equal

frequencies in both founder lines, with marker genotypes in

Hardy–Weinberg equilibrium. A single biallelic QTL that is

fixed for alternative alleles in the founder lines was simulated

on the first chromosome at 46 cM. For this QTL, we simulated

two alternative settings: (a) an additive QTL without dominance,

where the homozygous genotypic value a = 0.5 and the

heterozygous genotypic value d = 0, and (b) a QTL with

complete dominance, where a = 0.5 and d = 0.5. Polygenic

background effects were modeled as 10 unlinked biallelic loci,

each with an additive effect of 0.25 and segregating at a fre-

quency of 0.5 in both founder lines, as described in Alfonso

and Haley (1998). To mimic the nongenetic factors affecting

the gene-expression phenotype and technical errors of micro-

arrays, we added an environmental component sampled from

a normal distribution with a variance of 0.5 to the simulated

phenotype. The narrow-sense heritability ($h^2$) is 0.47 for the

trait and 0.20 for the main QTL on the first chromosome.
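
The reported heritabilities can be checked with a short calculation for the additive setting. The sketch assumes the standard additive variance 2pq a^2 for a biallelic locus, allele frequencies of 0.5 in the F2, and that each stated ‘‘additive effect’’ is the homozygous genotypic value a. Under these assumptions the trait value of 0.47 is reproduced when all variance components are included, whereas 0.20 for the main QTL is reproduced only when the polygenic variance is left out of the denominator. This is an observation about the arithmetic, not a statement made in the text.

```python
def additive_variance(a, p=0.5):
    """Additive variance 2pq a^2 of a biallelic locus without dominance."""
    return 2.0 * p * (1.0 - p) * a * a


v_qtl = additive_variance(0.5)          # main QTL, a = 0.5
v_poly = 10 * additive_variance(0.25)   # ten unlinked background loci
v_env = 0.5                             # environmental variance

h2_trait = (v_qtl + v_poly) / (v_qtl + v_poly + v_env)
h2_qtl_total = v_qtl / (v_qtl + v_poly + v_env)
h2_qtl_no_poly = v_qtl / (v_qtl + v_env)

print(round(h2_trait, 2), round(h2_qtl_total, 2), round(h2_qtl_no_poly, 2))
# prints: 0.47 0.13 0.2
```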

To assess the performance of the optimal pair design under

the least-squares framework, we scanned in 1-cM steps for the

most significant P-values obtained in the marker interval that

contains the QTL (between 40 and 50 cM on the first chro-

mosome) under four scenarios. These four scenarios are

summarized in Table 1 and are described as follows: first, all

400 F2 subjects and their individual phenotypic measurements

were analyzed. Conceptually this is equivalent to the common

reference design that includes all F2 individuals. Second, 200

F2 subjects were randomly selected, together with their individual

phenotypic measurements. This scenario also represents

sents the common reference design, but a smaller budget

limits the profiling of gene expression to fewer individuals

than in the first scenario. Due to the random sampling nature

of this scenario, for each simulated population replicate we

repeated the random sampling 100 times and scanned for the

most significant P-value in the QTL-containing interval as

above. Then the median P-value was selected to represent the

performance under this scenario for the given population

replicate. Third, we randomly paired up all 400 F2 subjects and

analyzed the data with regression model (1). Under this

scenario, we also repeated the process 100 times per simulated

population replicate and proceeded to obtain the P-value in

the same way as in the second scenario. Finally, we paired up all

400 F2 subjects, using the optimal pair design. We abbreviate

these four scenarios above as ‘‘all.data,’’ ‘‘half.data,’’ ‘‘ran.pair,’’

and ‘‘opt.pair,’’ respectively, for reference in the rest of this article.

For both ‘‘additive only’’ and ‘‘additive and dominance’’ QTL

settings, the data were analyzed under those four scenarios.
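
The summary used for the randomized scenarios (the median, over repeated random draws, of the most significant P-value in the scanned interval) can be sketched as follows. The `scan` argument is a hypothetical user-supplied function that returns the P-values at the scanned positions for a given subset of individuals; nothing here reproduces the authors' code.

```python
import random
import statistics


def representative_p(ids, scan, subset_size, n_repeats=100, seed=0):
    """Median over repeats of the smallest P-value from a random subset.

    ids: list of individual identifiers in one simulated population.
    scan: callable taking a list of ids and returning a list of P-values,
        one per scanned position (hypothetical interface).
    """
    rng = random.Random(seed)
    best_p = []
    for _ in range(n_repeats):
        subset = rng.sample(ids, subset_size)
        best_p.append(min(scan(subset)))
    return statistics.median(best_p)
```

For the half.data scenario this would be called with 400 identifiers and `subset_size=200`; for ran.pair the random draw would instead be a random pairing of all 400 subjects, with the scan applied to the pair differences.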

Alternative marker allele frequencies and population sizes:

In the simulations above the marker allele frequencies are

equal over all four alleles in both founder lines. This rep-

resents a suboptimal scenario in which the marker genotypes

in the F2 generation are expected to have limited information

for the line origins. For the genome size of 2000 cM, we also

simulated the ‘‘best-case scenario’’ in which each founder line

has two unique alleles; i.e., two of the four alleles are seg-

regating within each founder line, with no common alleles

shared by both lines. Such an intercross is equivalent to an F2

cross between two inbred lines. These two sets of marker allele

frequencies would enable us to determine a below-average

range and the upper bound for the performance of the

optimal pair design. In addition, we performed further sim-

ulations in which we fixed the number of microarrays being

used to 400 and evaluated an F2 population size of 1000. We

compared the performance of the optimal pair design and the

common reference design when expression profiling of every

individual in the sample population is not possible.

RESULTS

Additive effect: We studied the power for detecting

additive QTL under the four scenarios. For the results

of opt.pair presented in this section, we minimized the

variance of the additive effect in the regression model by

TABLE 1

Summary of the four scenarios investigated in the power study

Scenario abbreviation   Description                                                          No. of F2 subjects profiled   No. of slides required
all.data                Individual phenotypic values are available for all subjects         400                           400
half.data               Same as all.data except that 50% of the subjects are selected        200                           200
ran.pair                Pairs are assigned randomly                                          400                           200
opt.pair                Pairs are assigned according to the outcome of simulated annealing   400                           200

eQTL Microarray Design1693

simulated annealing. Figure 1 shows the minus log-

transformed P-values (sorted in ascending order) for

the four scenarios. The scenario with the highest pro-

portion of the largest minus log-transformed P-values

can be considered as the most powerful design. For a

single chromosome (Figure 1A), the most significant P-

values can be found under the all.data scenario. But for

the opt.pair scenario, under which only 200 microarrays

would be required, the power to detect the QTL is

remarkably close to that under the all.data scenario.

Under the half.data and ran.pair scenarios, likewise,

only 200 microarrays would be required, but the power

is much reduced compared to both all.data and

opt.pair. Incidentally, the performances of half.data

and ran.pair are almost identical; hence most of the data

points for these two designs are overlapping in Figure 1.

Table 2 summarizes the performance under the four

scenarios by the mean $-\log_{10} P$ and shows the effect of

genome size on the power for detecting QTL. The mean

$-\log_{10} P$ across different genome sizes under the all.data,

half.data, and ran.pair scenarios shows little deviation.

However, the mean $-\log_{10} P$ under the opt.pair scenario

follows a notable downward trend as the genome size

increases. At the genome size of 2000 cM (Figure 1B)

all.data performs best of the four scenarios. But more

importantly, the opt.pair scenario is the most powerful

of the designs that require 200 microarrays.

We analyzed the simulations of the F2 cross with fully

informative markers for the genome size of 2000 cM and

found that the power increased slightly under all four

scenarios (Table 2). The increase in power is expected

because line origins can be inferred with certainty. It is

important to note that the difference in the power

between the suboptimal and the best-case scenario for

the marker allele frequencies is small. This indicates

that our power assessment using equal marker allele

frequencies in the simulations is robust and represen-

tative of real outbred F2 intercrosses, of which the

marker allele frequencies in the founder lines are in

between those two extremes.

Additive and dominance effects: For the dominant

QTL, two levels of analysis were carried out: (a) QTL

detection by comparing the full model (additive and

dominance) to the null model and (b) detection of

dominance effect by comparing the full model to the

reduced model (additive only). In the simulated annealing

step of optimal pairing, the dominance coefficients

were included as the third column in the matrix X in the

linear model (see materials and methods).

With a single-chromosome (100 cM) genome, the

power to detect QTL under the opt.pair scenario is

clearly lower (Figure 2A, left) than that under all.data.

It can be seen in Table 3 that the mean $-\log_{10} P$ under

all.data is ~50% greater than that under opt.pair. But

opt.pair is still more powerful than both half.data and

ran.pair. By contrast, our results (Figure 2A, right) show

that opt.pair and all.data are similarly powerful for

detecting dominance effects and are superior to both

half.data and ran.pair in a small genome.

Tables 3 and 4 show that genome sizes again have little

effect on power under all.data, half.data, and ran.pair.

However, the increase in genome size affects optimal

pairing more severely here than when no dominance

effect has been simulated. At the genome size of 2000 cM,

opt.pair is only marginally more powerful in detecting

the QTL than half.data and ran.pair (Figure 2B, left).

The power for detecting the dominance effect is more

drastically affected and opt.pair performs similarly to

half.data and ran.pair (Figure 2B, right). We repeated

the optimization solely on the additive parameter with

the genome size of 2000 cM. The power of detecting

QTL was found to have improved slightly, while there

was little change in the power for detecting dominance

Figure 1.—Performance for detecting additive QTL effects

under various scenarios. (A) Genome size of 100 cM, a single

chromosome; (B) genome size of 2000 cM, 20 chromosomes.

The horizontal dotted line shows the significance level of

$P = 1 \times 10^{-5}$. The simulations are sorted in ascending order

of the $-\log_{10} P$ on the x-axis.

(results not shown). Therefore, in the presence of

dominance effects, the advantage in the performance

of the optimal pair design in detecting QTL is reduced.

Including dominance in the optimization has a negative

impact on the optimal pair design, especially for large

genome sizes, when QTL detection is the primary

objective.

Fixed number of microarrays with a large F2 sample

size: In previous simulations, we observed that all.data,

which required 400 microarrays, was more powerful

in detecting the additive QTL effect than using 200

microarrays under the opt.pair scenario. Here, we

studied the power of these two designs conditioned on

a total of 400 microarrays. With an F2 sample size of

1000, neither design can profile all the individuals with

400 microarrays. Under the optimal pair design, 400

pairs were deliberately selected to give the minimum

variance for the estimated additive genetic parameter.

On the other hand, only 400 individuals (randomly

selected from 1000 individuals) could be profiled using

the common reference design. Given the equal number

of microarrays being used, our results in Figure 3 show

that the optimal pair design outperforms the common

reference design.

DISCUSSION

The distant pair design enables the mapping of eQTL

in an efficient and effective manner using recombinant

inbred lines. For researchers studying genetics of many

outbred species, however, the creation of recombinant

inbred lines is impractical. Here we explore whether

eQTL studies of natural species would benefit from the

Figure 2.—Performance for detecting QTL (additive plus dominance model; left) and

dominance effects (right) under various scenarios. (A) Genome size of 100 cM, a

single chromosome; (B) genome size of 2000 cM, 20 chromosomes. The horizontal

dotted line shows the significance level of $P = 1 \times 10^{-5}$. The simulations are sorted

in ascending order of the $-\log_{10} P$ on the x-axis.

TABLE 2

Summary of P-values (mean and standard deviation on $-\log_{10}$ scale) at the main QTL position for additive QTL

detection under the four scenarios, where only an additive effect was simulated

Genome size (cM)   No. of chromosomes   all.data     half.data   ran.pair    opt.pair
100                1                    11.9 (2.9)   6.4 (1.5)   6.3 (1.5)   11.0 (2.7)
1000               10                   12.3 (2.6)   6.6 (1.3)   6.6 (1.4)   9.2 (2.4)
2000               20                   12.1 (2.9)   6.5 (1.5)   6.4 (1.5)   8.3 (2.4)
2000a              20                   12.9 (2.9)   6.9 (1.5)   6.8 (1.5)   8.9 (2.4)

Standard deviations are in parentheses.

a Fully informative markers.

same design principles used in distant pairing. We show

that the optimal pair design, an extension of the distant

pair design for outbred line crosses, can indeed improve

the efficiency of the use of microarrays and increase

the statistical power for detecting eQTL, even for

studying organisms with large genome sizes.

Under the linear regression framework, the greatest

power is achieved by having the regression coefficients

in equal proportions near the top and bottom extremes.

For the regression model proposed for the optimal pair

design in this article, this would be achieved by pairing

up individuals who have large genetic coefficients with

opposite signs. However, in a line cross such as the

F2, it is inevitable that not every pair would result in a

regression coefficient that is near one extreme or the

other. Furthermore, when the number of independent

loci increases (increase in chromosome length and

number of chromosomes), the optimal pair assignment

for one locus will usually not be optimal for the other

loci. The optimal pair assignment over the whole ge-

nome is therefore suboptimal in the perspective of a

single locus, i.e., fewer regression coefficients around

the extremes. We can therefore expect the performance

of distant pairing to degrade to the same level as that of

random pairing eventually as the genome size continues

to increase.
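
A back-of-envelope calculation, offered as a sketch rather than as part of the original analysis, illustrates this reasoning for a single fully informative locus. It assumes an additive-only model, F2 genotype frequencies of 1/4, 1/2, and 1/4 (so that the additive coefficient takes the values -1, 0, and 1), independent residuals with variance $\sigma^2$ per individual, no array or dye effects, and N individuals in total. The symbols N and $\sigma^2$ are introduced here for illustration only.

```latex
% Sampling variance of the additive estimate in a simple linear regression
\operatorname{Var}(\hat a) = \frac{\sigma_e^{2}}{\sum_i (x_i - \bar x)^{2}}

% all.data: N individuals, Var(x_a) = 1/2, residual variance \sigma^2
\operatorname{Var}(\hat a) \approx \frac{\sigma^{2}}{N/2} = \frac{2\sigma^{2}}{N}

% half.data: N/2 individuals
\operatorname{Var}(\hat a) \approx \frac{\sigma^{2}}{N/4} = \frac{4\sigma^{2}}{N}

% ran.pair: N/2 arrays, Var(\Delta x_a) = 1, residual variance 2\sigma^2
\operatorname{Var}(\hat a) \approx \frac{2\sigma^{2}}{N/2} = \frac{4\sigma^{2}}{N}

% opt.pair at one locus: N/4 arrays with \Delta x_a = \pm 2, N/4 arrays with 0
\operatorname{Var}(\hat a) \approx \frac{2\sigma^{2}}{(N/4)\cdot 4} = \frac{2\sigma^{2}}{N}
```

Under these simplifying assumptions, random pairing and profiling half of the individuals give the same precision, while ideal pairing at a single locus matches the precision of profiling every individual with half the arrays. This is qualitatively in line with the 100-cM rows of Table 2. With many independent loci, fewer arrays can carry an extreme difference at any one locus, and the advantage shrinks toward the random-pairing value.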

Clear benefits in detecting additive effects: We show

that when there are few loci to consider, such as in a

small genome, the power of detecting additive effects

with the optimal pair design is similar to using a com-

mon reference design that consumes twice the number

of microarrays. With near-optimal pairing for individual

loci (achievable when there are small numbers of effec-

tively independent loci), the efficiency of the optimal

pair design is very attractive. Moreover, the common

reference design with only half the sample size (i.e.,

the same number of microarrays) performs significantly

worse. This highlights the problem of small sample size

leading to reduction in power in complex trait analysis.

Extremely similar performances were observed for ran-

dom pairing and the common reference design with

50% of the samples. The difference in performance be-

tween these two designs could have been more marked

if our simulations had explicitly modeled the possible

difference in the biological sampling variance between a

pair design and a common reference design.

As expected, the performance of the optimal pair

design drops when the genome size increases. Nevertheless,

it is very promising that in a large genome the

optimal pair design still notably outperforms designs

that use the same number of microarray slides. Furthermore,

as shown by the excellent performance in smaller

genomes, it is evident that the optimal pair design would

be beneficial for a focused study of one or more candidate

regions within a large genome. The power can be

maximized for genomic regions for which the researchers

have the most interest, while the power in the rest of

the genome would be at least as good as the random pair

design. In addition, we show that with the number of

microarrays used being equal, the optimal pair design

always gives the highest statistical power of the ap-

proaches compared. Our comparison to the common

TABLE 3

Summary of P-values (mean and standard deviation on $-\log_{10}$ scale) at the main QTL position for QTL

detection (additive plus dominance model vs. null model) under the four scenarios, where

both additive and dominance effects were simulated

Genome size (cM)   No. of chromosomes   all.data     half.data   ran.pair    opt.pair
100                1                    15.2 (3.2)   7.8 (1.6)   7.8 (1.7)   10.5 (2.6)
1000               10                   15.5 (3.3)   8.0 (1.7)   7.9 (1.7)   9.6 (2.3)
2000               20                   15.4 (3.7)   7.9 (1.8)   7.9 (1.8)   8.9 (2.5)

Standard deviations are in parentheses.

TABLE 4

Summary of P-values (mean and standard deviation on $-\log_{10}$ scale) at the main QTL position for

dominance detection (additive plus dominance model vs. additive model) under the four

scenarios, where both additive and dominance effects were simulated

Genome size (cM)   No. of chromosomes   all.data    half.data   ran.pair    opt.pair
100                1                    5.8 (2.4)   3.3 (1.2)   3.2 (1.2)   5.4 (2.1)
1000               10                   5.7 (1.7)   3.1 (0.9)   3.1 (0.9)   3.8 (1.5)
2000               20                   5.8 (2.4)   3.2 (1.2)   3.2 (1.2)   3.7 (1.9)

Standard deviations are in parentheses.

reference design was made to random selection of

individuals. Although selective phenotyping (Jin et al.

2004) can improve the efficiency of the common ref-

erence design, the optimal pair design would allow more

subjects to be assayed as well as maximize the genotypic

dissimilarity. Therefore, for outbred species that possess

large genomes, the optimal pair design can provide both

efficient use of the microarray resource and good power

for the detection of eQTL with additive effects.

Complications due to dominance effects: How does

dominance affect the performance of this design? We

evaluate the optimal pair design that optimizes for both

additive and dominance effects simultaneously: the

conclusion is that by including the dominance param-

eter, the design becomes less optimized for detecting

the main (additive) effect. Although over a small ge-

nome, the optimal pair design can offer a moderate

power advantage for detecting QTL and dominance

effects over no optimization, the performance is af-

fected severely in that the power for detecting both the

main and the dominance effect degrade to almost the

same levels as random pairing with a large genome. Our

results agree with other studies (Piepho 2005; Bueno

Filho et al. 2006) that finding a design that is optimal

for detecting both additive and dominance effects

cannot be achieved. They have shown that optimizing

for detecting dominance effects would decrease power

for detecting additive effects. Therefore, when one has

to make a choice between additive and dominance

effects for optimization, the question relates directly to

the goal of the experiment. If the goal is to scan across

the whole genome for linked loci to gene-expression

phenotypes, we argue that one could consider focusing

on the additive parameter alone for the optimization.

After all, the ultimate interest is to detect QTL. In most

cases QTL are expected to have an additive component,

even in cases where dominance is present. Optimizing

for dominance effects should be considered only if

there is strong a priori evidence for overdominance in

the QTL of interest in a candidate gene study.

Final remarks: We conclude that our extension of the

distant pair design, the optimal pair design, can be

applied efficiently to outbred line crosses for genetical

genomic studies. Having stated that, we acknowledge

that in an experimental design for genetical genomics,

there is no ‘‘one-size-fits-all’’ solution. The most powerful

and efficient design will depend on the population

structure, the marker density, the chosen method of

analysis, the numbers of treatments, and the parameter

of interest. Bueno Filho et al. (2006) proposed different

designs for multiple genotypes, epistasis, and multiple

treatments. In human or other natural populations, the

Haseman–Elston method (Haseman and Elston 1972)

can be applied to sib-pair analysis, in which case, the

most effective use of microarray resources to conduct an

eQTL linkage analysis would be to profile the expres-

sion of a pair of sibs on the same array. This is because

the squared trait differences between two sibs are the

dependent variable used in this method; these quanti-

ties are obtained most accurately when sibs are paired

up on the same array.
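This reasoning can be sketched numerically. In the original Haseman–Elston regression, the squared trait difference of each sib pair is regressed on the proportion of alleles the pair shares identical by descent (IBD) at a locus; for an additive QTL with no recombination between marker and QTL, the expected slope is -2 times the QTL variance. The Python sketch below is a simulation under stated assumptions rather than an analysis pipeline: the within-array log ratio of the two sibs is taken to supply the trait difference directly, IBD proportions are assumed known, differences are drawn as normal deviates, and the names `simulate_pairs` and `ols_slope` and all parameter values are hypothetical.

```python
import random


def simulate_pairs(n_pairs, qtl_var=1.0, resid_var=1.0, seed=1):
    """Simulate IBD proportions and squared within-array sib differences.

    Assumes Var(y1 - y2 | pi) = 2*resid_var + 2*qtl_var*(1 - pi),
    i.e. an additive QTL and independent residuals for the two sibs.
    """
    rng = random.Random(seed)
    # Sibs share 0, 1, or 2 alleles IBD with probabilities 1/4, 1/2, 1/4.
    ibd = rng.choices([0.0, 0.5, 1.0], weights=[1, 2, 1], k=n_pairs)
    sq_diff = []
    for pi in ibd:
        var_diff = 2.0 * resid_var + 2.0 * qtl_var * (1.0 - pi)
        d = rng.gauss(0.0, var_diff ** 0.5)  # stands in for the log ratio
        sq_diff.append(d * d)
    return ibd, sq_diff


def ols_slope(x, y):
    """Ordinary least-squares slope of y on x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


if __name__ == "__main__":
    ibd, sq_diff = simulate_pairs(20000)
    print("estimated slope:", ols_slope(ibd, sq_diff))
    print("expected slope under the model:", -2.0 * 1.0)
```

If the two sibs were instead hybridized on different arrays, each difference would also carry array-to-array variation, which enters the dependent variable as extra noise.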

It is also worth considering the implication of the use

of high-density single-nucleotide polymorphism (SNP)

genotyping on the optimal pair design described in

this article. High-density SNP genotyping is most widely

used in association studies in natural human populations

rather than in line crosses of the animals discussed

above. As linkage disequilibrium spans relatively short

distances in human populations, the effective number

of independent loci is much higher than what we have

modeled in our line-cross simulations. This effect is

equivalent to increasing the genome size and is likely

to have a more negative impact on the performance of the

optimal pair design than what can be expected in

outbred line crosses. Eventually, the distant pairing

strategy might become almost equivalent to a pairing

strategy based on relationships, in which less-related in-

dividuals should be paired for each hybridization (Rosa

et al. 2006; Bueno Filho et al. 2006). Theoretically, the

optimal pair design should always be preferred; since the

variance of the estimate of the parameter is minimized,

its performance should be at least as good as the com-

mon reference design. However, other factors, such as

technical simplicity and flexibility in the choice of

statistical methods, might shift the balance in favor of

the common reference design when the performance

advantage in using the optimal pair design becomes

less marked. Therefore, it is imperative to consider each

experiment and the question of interest on a case-by-case

basis. Nevertheless, our results suggest that the

efficient design principles outlined by Fu and Jansen

(2006) can be applied to a wider context than RILs.

With larger eQTL experiments becoming more affordable,

we can expect to discover more loci with moderate

to small effects. Such attainment will ultimately lead to

greater advances in our understanding of the molecular

basis of complex traits.

Figure 3.—Comparison of the performance for QTL detection under the common reference design and the optimal pair design when the number of arrays is fixed at 400, the genome size is 2000 cM, and the F2 sample size is 1000. The horizontal dotted line shows the significance level of P = 1 × 10⁻⁵. The simulations are sorted in ascending order of −log10 P on the x-axis.

We thank two anonymous referees for their helpful suggestions.

The R code for optimizing pairing configuration can be obtained by

request from the corresponding author. This research was funded

by the Biotechnology and Biological Sciences Research Council

(BBSRC). A.C.L. is grateful for support from the BBSRC (grant no.

BBSSF200512735), the Genesis Faraday Partnership, and Genus plc.

LITERATURE CITED

Alfonso, L., and C. S. Haley, 1998 Power of different F-2 schemes for QTL detection in livestock. Anim. Sci. 66: 1–8.

Brem, R. B., G. Yvert, R. Clinton and L. Kruglyak, 2002 Genetic dissection of transcriptional regulation in budding yeast. Science 296: 752–755.

Bueno Filho, J. S., S. G. Gilmour and G. J. Rosa, 2006 Design of microarray experiments for genetical genomics studies. Genetics 174: 945–957.

Bystrykh, L., E. Weersing, B. Dontje, S. Sutton, M. T. Pletcher et al., 2005 Uncovering regulatory pathways that affect hematopoietic stem cell function using ‘genetical genomics’. Nat. Genet. 37: 225–232.

de Koning, D. J., and C. S. Haley, 2005 Genetical genomics in humans and model organisms. Trends Genet. 21: 377–381.

Dixon, A. L., L. Liang, M. F. Moffatt, W. Chen, S. Heath et al., 2007 A genome-wide association study of global gene expression. Nat. Genet. 39: 1202–1207.

Fu, J. Y., and R. C. Jansen, 2006 Optimal design and analysis of genetic studies on gene expression. Genetics 172: 1993–1999.

Haley, C. S., S. A. Knott and J. M. Elsen, 1994 Mapping quantitative trait loci in crosses between outbred lines using least squares. Genetics 136: 1195–1207.

Haseman, J. K., and R. C. Elston, 1972 The investigation of linkage between a quantitative trait and a marker locus. Behav. Genet. 2: 3–19.

Hubner, N., C. A. Wallace, H. Zimdahl, E. Petretto, H. Schulz et al., 2005 Integrated transcriptional profiling and linkage analysis for identification of genes underlying disease. Nat. Genet. 37: 243–253.

Jansen, R. C., and J. P. Nap, 2001 Genetical genomics: the added value from segregation. Trends Genet. 17: 388–391.

Jin, C., H. Lan, A. D. Attie, G. A. Churchill, D. Bulutuglo et al., 2004 Selective phenotyping for increased efficiency in genetic mapping studies. Genetics 168: 2285–2293.

Jin, W., R. M. Riley, R. D. Wolfinger, K. P. White, G. Passador-Gurgel et al., 2001 The contributions of sex, genotype and age to transcriptional variance in Drosophila melanogaster. Nat. Genet. 29: 389–395.

Keurentjes, J. J., J. Fu, I. R. Terpstra, J. M. Garcia, G. van den Ackerveken et al., 2007 Regulatory network construction in Arabidopsis by using genome-wide gene expression quantitative trait loci. Proc. Natl. Acad. Sci. USA 104: 1708–1713.

Kirkpatrick, S., C. D. Gelatt, Jr. and M. P. Vecchi, 1983 Optimization by simulated annealing. Science 220: 671–680.

Li, Y., O. A. Alvarez, E. W. Gutteling, M. Tijsterman, J. Fu et al., 2006 Mapping determinants of gene expression plasticity by genetical genomics in C. elegans. PLoS Genet. 2: e222.

Mehrabian, M., H. Allayee, J. Stockton, P. Y. Lum, T. A. Drake et al., 2005 Integrating genotypic and expression data in a segregating mouse population to identify 5-lipoxygenase as a susceptibility gene for obesity and bone traits. Nat. Genet. 37: 1224–1233.

Monks, S. A., A. Leonardson, H. Zhu, P. Cundiff, P. Pietrusiak et al., 2004 Genetic inheritance of gene expression in human cell lines. Am. J. Hum. Genet. 75: 1094–1105.

Morley, M., C. M. Molony, T. M. Weber, J. L. Devlin, K. G. Ewens et al., 2004 Genetic analysis of genome-wide variation in human gene expression. Nature 430: 743–747.

Piepho, H. P., 2005 Optimal allocation in designs for assessing heterosis from cDNA gene expression data. Genetics 171: 359–364.

R Development Core Team, 2007 R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna.

Rosa, G. J., N. de Leon and A. J. Rosa, 2006 Review of microarray experimental design strategies for genetical genomics studies. Physiol. Genomics 28: 15–23.

Schadt, E. E., S. A. Monks, T. A. Drake, A. J. Lusis, N. Che et al., 2003 Genetics of gene expression surveyed in maize, mouse and man. Nature 422: 297–302.

Schadt, E. E., J. Lamb, X. Yang, J. Zhu, S. Edwards et al., 2005 An integrative genomics approach to infer causal associations between gene expression and disease. Nat. Genet. 37: 710–717.

Seaton, G., C. S. Haley, S. A. Knott, M. Kearsey and P. M. Visscher, 2002 QTL Express: mapping quantitative trait loci in simple and complex pedigrees. Bioinformatics 18: 339–340.

Stranger, B. E., M. S. Forrest, A. G. Clark, M. J. Minichiello, S. Deutsch et al., 2005 Genome-wide associations of gene expression variation in humans. PLoS Genet. 1: e78.

Wit, E., and J. McClure, 2004 Statistics for Microarrays: Design, Analysis and Inference. John Wiley & Sons, Chichester, UK.

Communicating editor: R. W. Doerge
