Optimal design of genetic studies of gene expression with two-color microarrays in outbred crosses.
ABSTRACT Combining global gene-expression profiling and genetic analysis of natural allelic variation (genetical genomics) has great potential in dissecting the genetic pathways underlying complex phenotypes. Efficient use of microarrays is paramount in experimental design as the cost of conducting this type of study is high. For those organisms where recombinant inbred lines are available for mapping, the "distant pair design" maximizes the number of informative contrasts over all marker loci. Here, we describe an extension of this design, named the "optimal pair design," for use with F2 crosses between outbred lines. The performance of this design is investigated by simulation and compared to several other two-color microarray designs. We show that, for a given number of microarrays, the optimal pair design outperforms all other designs considered for detection of expression quantitative trait loci (eQTL) with additive effects by linkage analysis. We also discuss the suitability of this design for outbred crosses in organisms with large genomes and for detection of dominance.
ABSTRACT: Whole-genome profiling of gene expression in a segregating population has the potential to identify the regulatory consequences of natural allelic variation. Costs of such studies are high and require that resources--microarrays and population--are used as efficiently as possible. We show that current studies can be improved significantly by a new design for two-color microarrays. Our "distant pair design" profiles twice as many individuals as there are arrays, cohybridizes individuals with dissimilar genomes, gives more weight to known regulatory loci if wished, and therewith maximizes the power for decomposing expression variation into regulatory factors. It can also exploit a large population (larger than twice the number of available microarrays) as a useful resource to select the most dissimilar pairs of individuals from. Our approach identifies more regulatory factors than alternative strategies do in computer simulations for realistic genome sizes, and similar promising results are obtained in an application on Arabidopsis thaliana. Our results will aid the design and analysis of future studies on gene expression and will help to shed more light on gene regulatory networks.Genetics 04/2006; 172(3):1993-9. · 4.39 Impact Factor
ABSTRACT: Genetical genomics approaches provide a powerful tool for studying the genetic mechanisms governing variation in complex traits. By combining information on phenotypic traits, pedigree structure, molecular markers, and gene expression, such studies can be used for estimating heritability of mRNA transcript abundances, for mapping expression quantitative trait loci (eQTL), and for inferring regulatory gene networks. Microarray experiments, however, can be extremely costly and time consuming, which may limit sample sizes and statistical power. Thus it is crucial to optimize experimental designs by carefully choosing the subjects to be assayed, within a selective profiling approach, and by cautiously controlling systematic factors affecting the system. Also, a rigorous strategy should be used for allocating mRNA samples across assay batches, slides, and dye labeling, so that effects of interest are not confounded with nuisance factors. In this presentation, we review some selective profiling strategies for genetical genomics studies, including the selection of individuals for increased genetic dissimilarity and for a higher number of recombination events. Efficient designs for studying epistasis are also discussed, as well as experiments for inferring heritability of transcriptional levels. It is shown that solving an optimal design problem generally requires a numerical implementation and that the optimality criteria should be intimately related to the goals of the experiment, such as the estimation of additive, dominance, and interacting effects, localizing putative eQTL, or inferring genetic and environmental variance components associated with transcriptional abundances.Physiological Genomics 01/2007; 28(1):15-23. · 2.81 Impact Factor
ABSTRACT: The power of a genetic mapping study depends on the heritability of the trait, the number of individuals included in the analysis, and the genetic dissimilarity among them. In experiments that involve microarrays or other complex physiological assays, phenotyping can be expensive and time-consuming and may impose limits on the sample size. A random selection of individuals may not provide sufficient power to detect linkage until a large sample size is reached. We present an algorithm for selecting a subset of individuals solely on the basis of genotype data that can achieve substantial improvements in sensitivity compared to a random sample of the same size. The selective phenotyping method involves preferentially selecting individuals to maximize their genotypic dissimilarity. Selective phenotyping is most effective when prior knowledge of genetic architecture allows us to focus on specific genetic regions. However, it can also provide modest improvements in efficiency when applied on a whole-genome basis. Importantly, selective phenotyping does not reduce the efficiency of mapping as compared to a random sample in regions that are not considered in the selection process. In contrast to selective genotyping, inferences based solely on a selectively phenotyped population of individuals are representative of the whole population. The substantial improvement introduced by selective phenotyping is particularly useful when phenotyping is difficult or costly and thus limits the sample size in a genetic mapping study.Genetics 01/2005; 168(4):2285-93. · 4.39 Impact Factor
Copyright © 2008 by the Genetics Society of America
Optimal Design of Genetic Studies of Gene Expression With Two-Color
Microarrays in Outbred Crosses
Alex C. Lam,*,†,1 Jingyuan Fu,‡ Ritsert C. Jansen,‡ Chris S. Haley* and Dirk-Jan de Koning*
*Roslin Institute (Edinburgh) and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Roslin Biocentre, Roslin,
Midlothian EH25 9PS, United Kingdom, †Institute of Evolutionary Biology, School of Biological Sciences, University of
Edinburgh, Edinburgh EH9 3JT, United Kingdom and ‡Groningen Bioinformatics Centre, Groningen Biomolecular
Sciences and Biotechnology Institute, University of Groningen, 9751 NN, Haren, The Netherlands
Manuscript received April 14, 2008
Accepted for publication September 2, 2008
Combining global gene-expression profiling and genetic analysis of natural allelic variation (genetical
genomics) has great potential in dissecting the genetic pathways underlying complex phenotypes. Efficient
use of microarrays is paramount in experimental design as the cost of conducting this type of study is high.
For those organisms where recombinant inbred lines are available for mapping, the “distant pair design”
maximizes the number of informative contrasts over all marker loci. Here, we describe an extension of this
design, named the “optimal pair design,” for use with F2 crosses between outbred lines. The performance of
this design is investigated by simulation and compared to several other two-color microarray designs. We
show that, for a given number of microarrays, the optimal pair design outperforms all other designs
considered for detection of expression quantitative trait loci (eQTL) with additive effects by linkage analysis.
We also discuss the suitability of this design for outbred crosses in organisms with large genomes and for
detection of dominance.
GENETIC analysis of variation in gene expression,
2001), has great potential for dissecting the mechanisms
Schadt et al. 2005). Although variation in transcript
factors, part of the between-individual variation in expression
of a substantial number of genes can be ex-
have been conducted in model organisms such as yeast
(Brem et al. 2002), flowering plant (Keurentjes et al.
2007), nematode worm (Li et al. 2006), mouse (Schadt
et al. 2003; Bystrykh et al. 2005), and rat (Hubner et al.
2005). There are also a number of studies that focused
on human populations (Monks et al. 2004; Morley et al.
2004; Stranger et al. 2005). Efforts in mapping expression
quantitative trait loci (eQTL) have provided strong
evidence for candidate gene selection in studies of com-
and childhood asthma (Dixon et al. 2007).
As in any QTL study, appropriate sample size is
essential for adequate power in eQTL detection. Although
many of the published studies have provided
very interesting insights into the properties of genetic
sample sizes of the early studies meant they have limited
power to detect eQTL of small to moderate effects (De
Koning and Haley 2005). In many cases, there is no
approach as the genetic materials have already been
collected for concurrent large-scale studies. Therefore,
high cost of the associated technologies, particularly
the cost of microarrays. To address this issue, significant
improvement in the usage of microarray resources for
genetical genomics has been proposed in a number of
articles. Jin et al. (2004) presented an algorithm for
from the entire sample set for maximum genotypic dis-
similarity as a way to reduce the amount of phenotyping
without sacrificing sensitivity in QTL detection. In a
different article, Piepho (2005) discussed the optimal
heterosis. Bueno Filho et al. (2006) covered a range
of optimal microarray designs, from studying the geno-
typic effect of a single locus to models that include
both fixed treatment and random polygenic effects.
Rosa et al. (2006) provided a comprehensive review on
microarray design for eQTL mapping. Fu and Jansen
(2006) proposed a more general approach called the
distant pair design, which combines optimal allocation
by hybridizing most dissimilar samples and selective
genotyping when the population resource is large. They
used recombinant inbred lines (RILs) to demonstrate
the power of this approach. In this article, the research
emphasis is on extending this experimental
design to other populations, such as outbred F2 crosses.
Genetics 180: 1691–1698 (November 2008)
In Fu and Jansen’s study, the authors chose to explore
two-color microarray technologies over single-color plat-
forms because for the same number of slides, two-color
microarrays can potentially generate twice as much
hybridization data as single-color arrays. In addition,
two-color microarray platforms remain the only choice
for many research projects because commercial single-
color microarrays such as Affymetrix GeneChip are
available for only a handful of species. Furthermore,
two-color microarrays offer greater flexibility in experimental
design in which pairs of samples can be selected
Fu and Jansen proposed the “distant pair design,”
which outperforms the conventional designs, namely
the common reference and the loop designs, when a
panel of RILs is used. Moreover, for the expression
profiling of a given number of biological samples, the
distant pair design would require only half as many
slides as the common reference or the loop design in a
The distant pair design presents an effective direct-
pairing strategy that increases the ratio of within- over
However, it is not clear how to apply this rationale
to populations in which more than two genotypes are
possible for each marker locus. Although some insight
was provided on how best to detect overdominance, or
(Piepho 2005), this strategy for sample allocation is not
designed for mapping gene-expression variation to any
specific loci. Bueno Filho et al. (2006) addressed the
problem with three possible genotypes at a single locus
and gave a generalization of the solution to multiple
loci. Their method may be more tractable when marker
genotypes can be treated as fixed effects for contrast,
which is true for inbred line crosses. For researchers
studying genetics of outbred species, mapping resour-
ces like inbred strains or RILs are often not feasible.
By contrast, F2 intercrosses between two genetically
divergent outbred populations are much more readily
is due to the fact that there are common sets of alleles
segregating in both of the founder populations. Hence,
it is often the case that marker genotypes in the F2 generation
would not be fully informative for the origins
of lineage at any given locus. This uncertainty obscures
how one can define genotypic dissimilarity for the pur-
pose of pair assignment in distant pair design. Further-
more, researchers face the issue regarding large genome
sizes. It is expected that when genome size increases,
finding distant pairs will become more and more dif-
ficult. Fu and Jansen (2006) have shown that in RILs
a small advantage is achievable with large genomes.
However, whether this advantage is also present in an
F2 design remains uncertain. This question is directly
relevant to researchers who are interested in studying
the genetics of gene expression in nonmodel organ-
for genetical genomic studies in outbred F2 crosses
In this article, we propose an extension to the distant
pair design by adapting the least-squares QTL mapping
framework (Haley et al. 1994). Here we refer to this
extension as the “optimal pair design.” We also assess
the performance of this design in the presence of dom-
inance. Moreover, we consider the impact of genome
size on power and discuss the usefulness of our exten-
sion of the distant pair design for eQTL studies in
outbred experimental crosses.
MATERIALS AND METHODS
QTL analysis: The method for mapping QTL follows the
least-squares approach (Haley et al. 1994). Briefly, the line
origins at fixed intervals (e.g., 1 cM) along the genome for the
individuals in the F2 generation are expressed as lineage
done by considering all possible line-origin combinations
based on the parental and grandparental genotypes and has
been implemented in the online software “QTL Express”
(Seaton et al. 2002). Assuming that founder lines are fixed
for alternative QTL alleles, the lineage probabilities can be
are then regressed onto genetic coefficients calculated for a
putative QTL at a fixed position. The genetic coefficients for
additive and dominance effects are derived from the conditional
probabilities: the additive coefficient (denoted $x_a$) is
the difference of the probabilities for the homozygous line
origins, and the dominance coefficient (denoted $x_d$) is the
sum of the probabilities for the heterozygous line origins. An
F-ratio test statistic can be used to test the null model (without
QTL fitted) against the full model (with QTL fitted) and de-
termine the significance of the presence of QTL. For full
details on the derivation of line-origin probabilities and
regression-based QTL mapping, see Haley et al. (1994).
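As a minimal sketch of the coefficient definitions above, the following Python function computes the additive and dominance coefficients from the four line-origin probabilities at one genome position. The function name, the argument names, and the sign convention for the additive coefficient (line 1 minus line 2) are illustrative assumptions; they are not taken from QTL Express or from Haley et al. (1994).

```python
def genetic_coefficients(p11, p12, p21, p22):
    """Return (xa, xd) for one individual at one genome position.

    p11, p22: probabilities that both alleles descend from founder line 1,
              or both from founder line 2 (homozygous line origins).
    p12, p21: probabilities of the two heterozygous line origins.
    Argument names and the sign of xa are assumptions made for illustration.
    """
    total = p11 + p12 + p21 + p22
    if abs(total - 1.0) > 1e-8:
        raise ValueError("line-origin probabilities must sum to 1")
    xa = p11 - p22  # difference of the homozygous line-origin probabilities
    xd = p12 + p21  # sum of the heterozygous line-origin probabilities
    return xa, xd


# Example: a position known to be homozygous for line 1, and a position
# at which the markers carry no information about line origin.
print(genetic_coefficients(1.0, 0.0, 0.0, 0.0))
print(genetic_coefficients(0.25, 0.25, 0.25, 0.25))
```

With fully informative markers the additive coefficient takes the values 1, 0, or -1 at a marker; with shared alleles in the founder lines it falls between these extremes.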
In the context of a pair design in two-color microarrays, the
gene-expression phenotypes can be expressed either in ratios
or in signals of the separate channels. In this article we chose
ratios over signals as the phenotypes because the use of ratios
can minimize the risk of bias as a result of spot or array effects
(Wit and McClure 2004). Fu and Jansen (2006) argued that
there is a negligible difference in the final results between
ratios and signals, provided that the distributional assump-
tions for the array and spot effects used in the signal-based
analysis are correct. The log ratio of the red channel intensity
to the green channel intensity of a probe is equivalent to the
difference of the two signal intensities in logarithmic scale. To
utilize such phenotypes in the Haley–Knott least-squares
framework, the linear regression model can be written as
$$\Delta y_i = \mu + \Delta x_{ai}\,a + \Delta x_{di}\,d + e_i, \qquad (1)$$
where $\Delta y_i$ is the difference obtained by subtracting the log signal of
the green channel from that of the red channel for the $i$th
microarray ($i = 1, \ldots, n$); $\mu$ is the overall mean; $\Delta x_{ai}$ is the
difference of the additive coefficients, obtained by subtracting $x_a$ of
the individual assigned to the green channel from $x_a$ of the
individual assigned to the red channel for the $i$th microarray;
$\Delta x_{di}$ is the corresponding coefficient difference for dominance, $x_d$; $a$ and $d$ are
the additive and dominance parameters, respectively; and $e_i$
is the residual error. In matrix form, the expression can be
simplified as $Y = X\beta + e$, where $\beta = (\mu, a, d)^{t}$.
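The pair-difference regression can be illustrated with a short, self-contained Python sketch. It builds the differences for each array (red minus green), fits the full model by ordinary least squares, and forms an F-ratio against an intercept-only null model. All function names and the toy data are hypothetical, and treating the null model as intercept-only is an assumption about what "without QTL fitted" means for the pair model; this is not the authors' implementation.

```python
def solve(a, b):
    """Solve the small linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))  # partial pivoting
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def least_squares(X, y):
    """Return (coefficients, residual sum of squares) via the normal equations."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    beta = solve(xtx, xty)
    rss = sum((yi - sum(b * x for b, x in zip(beta, row))) ** 2
              for row, yi in zip(X, y))
    return beta, rss


def fit_pair_model(pairs, log_signal, xa, xd):
    """Fit model (1); pairs holds one (red, green) index pair per microarray."""
    dy = [log_signal[r] - log_signal[g] for r, g in pairs]
    x_full = [[1.0, xa[r] - xa[g], xd[r] - xd[g]] for r, g in pairs]
    x_null = [[1.0] for _ in pairs]  # assumed null model: overall mean only
    beta, rss_full = least_squares(x_full, dy)
    _, rss_null = least_squares(x_null, dy)
    n = len(pairs)
    f_ratio = ((rss_null - rss_full) / 2) / (rss_full / (n - 3))
    return beta, f_ratio


# Toy data (hypothetical): 12 individuals, 6 arrays, true a = d = 0.5.
demo_xa = [1, -1, 0, 1, -1, 0, 0, -1, 1, 0, 1, -1]
demo_xd = [0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0]
demo_noise = [0.1, -0.1, 0.05, 0.0, -0.05, 0.1, 0.0, 0.1, -0.1, 0.05, 0.02, -0.03]
demo_y = [0.5 * a + 0.5 * d + e for a, d, e in zip(demo_xa, demo_xd, demo_noise)]
demo_pairs = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9), (10, 11)]
beta, f_ratio = fit_pair_model(demo_pairs, demo_y, demo_xa, demo_xd)
print(beta, f_ratio)
```

Because only differences enter the model, any effect shared by both channels of an array cancels out of $\Delta y_i$, which is the motivation given above for analyzing ratios rather than separate signals.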
Finding optimal pairs: We used the same definition for the
optimal design as in the original publication on the distant
pair design (Fu and Jansen 2006), which is the minimum for
the sum of the variances of $\hat{\beta}$ in the matrix form of our model.
Following the A-optimality criterion (Wit and McClure
2004), this is equivalent to minimizing the trace of $(X'X)^{-1}$.
For our regression model in (1), the matrix $X$ consists of a
column of 1’s for the mean $\mu$, a column of $\Delta x_a$ coefficients for
the additive parameter, and, if dominance is included in the
model, a third column of $\Delta x_d$ coefficients. To reach the
optimal pairing design over all positions in the genome, we
search for the pairing that minimizes the sum, over all marker loci, of the
trace of $(X'X)^{-1}$. Genetic coefficients at marker loci only are
used for optimization to keep the computation tractable.
Because the genetic coefficients between marker intervals are
derived from the markers, we do not anticipate our optimiza-
The simulated annealing technique (Kirkpatrick et al.
1983) was used to find a pairing configuration that is optimal
or close to optimal according to the definition above. Consult
Fu and Jansen (2006) for details of the procedures. The
implementation of finding optimal pairs was accomplished
using the R statistical computer program (R Development
Core Team 2007).
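The search described above can be sketched in Python as follows. This is a simplified illustration and not the authors' R implementation: it optimizes the additive parameter only, so each $X$ has two columns and the trace of $(X'X)^{-1}$ has the closed form $(n + q)/(nq - s^2)$, where $s$ and $q$ are the sum and the sum of squares of the $\Delta x_a$ values at a locus. The function names, the swap move, the geometric cooling schedule, and the toy data are all assumptions made for illustration.

```python
import math
import random


def a_criterion(order, xa, n_arrays):
    """Sum over marker loci of trace((X'X)^-1) for X = [1, delta_xa].

    order: individual indices; positions 2k and 2k+1 are the red and green
           channels of array k. Individuals beyond 2 * n_arrays are unused.
    xa:    xa[i][m] is the additive coefficient of individual i at locus m.
    """
    total = 0.0
    for m in range(len(xa[0])):
        s = q = 0.0
        for k in range(n_arrays):
            d = xa[order[2 * k]][m] - xa[order[2 * k + 1]][m]  # red minus green
            s += d
            q += d * d
        det = n_arrays * q - s * s
        if det <= 1e-12:
            return math.inf  # singular design at this locus
        total += (n_arrays + q) / det
    return total


def anneal_pairs(xa, n_arrays, n_iter=5000, t0=1.0, cooling=0.999, seed=0):
    """Return (pairs, starting criterion, best criterion found)."""
    rng = random.Random(seed)
    order = list(range(len(xa)))
    rng.shuffle(order)  # random starting configuration
    current = start = a_criterion(order, xa, n_arrays)
    best, best_order = current, order[:]
    temp = t0
    for _ in range(n_iter):
        i, j = rng.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]  # propose: swap two positions
        new = a_criterion(order, xa, n_arrays)
        # Metropolis rule: always accept improvements, sometimes accept worse
        if new <= current or rng.random() < math.exp((current - new) / temp):
            current = new
            if current < best:
                best, best_order = current, order[:]
        else:
            order[i], order[j] = order[j], order[i]  # reject: undo the swap
        temp *= cooling
    pairs = [(best_order[2 * k], best_order[2 * k + 1]) for k in range(n_arrays)]
    return pairs, start, best


# Toy data (hypothetical): 8 individuals, 2 marker loci, 4 arrays.
demo_xa = [[1, 1], [1, -1], [-1, 1], [-1, -1], [0, 1], [0, -1], [1, 0], [-1, 0]]
demo_pairs, demo_start, demo_best = anneal_pairs(demo_xa, n_arrays=4, n_iter=2000)
print(demo_pairs, demo_start, demo_best)
```

Swapping two positions in `order` covers three kinds of move at once: exchanging individuals between arrays, flipping the dye assignment within an array, and, when the population is larger than twice the number of arrays, replacing a profiled individual with an unused one.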
Power assessment via simulations: We studied three differ-
ent genome sizes: 100, 1000, and 2000 cM; and for each
genome size we simulated 100 replicates of F2 intercrosses.
First, F1 individuals were generated by randomly mating 20 F0
sires from founder line one to 80 F0 dams from founder line
two (4 dams per sire), each having 5 offspring. Then, another
mating 20 F1 sires to 80 F1 dams (5 progeny per mating).
Marker data were simulated for all samples, with 11 evenly
spaced markers per chromosome of 100 cM in length. Four
alleles were simulated for every marker segregating at equal
frequencies in both founder lines, with marker genotypes in
Hardy–Weinberg equilibrium. A single biallelic QTL that is
fixed for alternative alleles in the founder lines was simulated
on the first chromosome at 46 cM. For this QTL, we simulated
two alternative settings: (a) an additive QTL without dominance,
where the homozygous genotypic value a = 0.5 and the
heterozygous genotypic value d = 0, and (b) a QTL with
complete dominance, where a = 0.5 and d = 0.5. Polygenic
background effects were modeled as 10 unlinked biallelic loci,
each with an additive effect of 0.25 and segregating at a frequency
of 0.5 in both founder lines, as described in Alfonso
and Haley (1998). To mimic the nongenetic factors affecting
the gene-expression phenotype and technical errors of microarrays,
we added an environmental component sampled from
a normal distribution with a variance of 0.5 to the simulated
phenotype. The narrow-sense heritability ($h^2$) is 0.47 for the
trait and 0.20 for the main QTL on the first chromosome.
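The reported heritabilities can be checked with a short calculation. The sketch below assumes Hardy-Weinberg proportions in the F2, an allele frequency of 0.5 at the main QTL and at each polygenic locus, and the standard additive variance $2pqa^2$ for a biallelic locus without dominance (setting a). Under these assumptions the trait value of 0.47 is reproduced. The value of 0.20 for the main QTL is reproduced only when the polygenic variance is left out of the denominator; that reading is an inference from the arithmetic, not something stated in the text. All variable names are illustrative.

```python
def additive_variance(a, p):
    """Additive variance 2pq * a^2 of a biallelic locus with no dominance."""
    return 2 * p * (1 - p) * a ** 2


qtl_var = additive_variance(a=0.5, p=0.5)              # main QTL
polygenic_var = 10 * additive_variance(a=0.25, p=0.5)  # 10 unlinked loci
env_var = 0.5                                          # environmental variance

total_var = qtl_var + polygenic_var + env_var
h2_trait = (qtl_var + polygenic_var) / total_var

# Proportion explained by the QTL with and without polygenic variance
# in the denominator (the second form matches the reported 0.20).
h2_qtl_of_total = qtl_var / total_var
h2_qtl_no_polygenic = qtl_var / (qtl_var + env_var)

print(round(h2_trait, 2), round(h2_qtl_of_total, 2), round(h2_qtl_no_polygenic, 2))
```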
To assess the performance of the optimal pair design under
the least-squares framework, we scanned in 1-cM steps for the
most significant P-values obtained in the marker interval that
contains the QTL (between 40 and 50 cM on the first chro-
mosome) under four scenarios. These four scenarios are
summarized in Table 1 and are described as follows: first, all
were analyzed. Conceptually this is equivalent to the common
reference design that includes all F2 individuals. Second, 200
F2 subjects were randomly selected, together with their individual
phenotypic measurements. This scenario also represents
the common reference design, but a smaller budget
limits the profiling of gene expression to fewer individuals
than in the first scenario. Due to the random sampling nature
of this scenario, for each simulated population replicate we
repeated the random sampling 100 times and scanned for the
most significant P-value in the QTL-containing interval as
above. Then the median P-value was selected to represent the
performance under this scenario for the given population
analyzed the data with regression model (1). Under this
scenario, we also repeated the process 100 times per simulated
population replicate and proceeded to obtain the P-value in
400 F2 subjects, using the optimal pair design. We abbreviate
these four scenarios above as “all.data,” “half.data,” “ran.pair,”
and “opt.pair.”
For both “additive only” and “additive and dominance” QTL
settings, the data were analyzed under those four scenarios.
Alternative marker allele frequencies and population sizes:
In the simulations above the marker allele frequencies are
equal over all four alleles in both founder lines. This rep-
resents a suboptimal scenario in which the marker genotypes
in the F2 generation are expected to have limited information
for the line origins. For the genome size of 2000 cM, we also
simulated the “best-case scenario” in which each founder line
has two unique alleles; i.e., two of the four alleles are seg-
regating within each founder line, with no common alleles
shared by both lines. Such an intercross is equivalent to an F2
cross between two inbred lines. These two sets of marker allele
frequencies would enable us to determine a below-average
range and the upper bound for the performance of the
optimal pair design. In addition, we performed further sim-
ulations in which we fixed the number of microarrays being
used to 400 and evaluated an F2 population size of 1000. We
common reference design when expression profiling of every
individual in the sample population is not possible.
RESULTS
Additive effect: We studied the power for detecting
additive QTL under the four scenarios. For the results
of opt.pair presented in this section, we minimized the
Summary of the four scenarios investigated
Scenario    Description
all.data    Values are available for all subjects
half.data   Same as all.data except that 50% of the subjects are selected
ran.pair    Pairs are assigned randomly
opt.pair    Pairs are assigned according to the outcome of simulated annealing
eQTL Microarray Design1693
simulated annealing. Figure 1 shows the minus log-
transformed P-values (sorted in ascending order) for
the four scenarios. The scenario with the highest pro-
portion of the largest minus log-transformed P-values
can be considered as the most powerful design. For a
single chromosome (Figure 1A), the most significant P-
values can be found under the all.data scenario. But for
the opt.pair scenario, under which only 200 microarrays
would be required, the power to detect the QTL is
remarkably close to that under the all.data scenario.
Under the half.data and ran.pair scenarios, likewise,
only 200 microarrays would be required, but the power
is much reduced compared to both all.data and
opt.pair. Incidentally, the performances of half.data
and ran.pair are almost identical; hence most of the data
points for these two designs are overlapping in Figure 1.
Table 2 summarizes the performance under the four
scenarios by the mean −log10 P and shows the effect of
genome size on the power for detecting QTL. The mean
−log10 P across different genome sizes under the all.data,
half.data, and ran.pair scenarios shows little deviation.
However, the mean −log10 P under the opt.pair scenario
increased. At the genome size of 2000 cM (Figure 1B)
all.data performs best of the four scenarios. But more
importantly, the opt.pair scenario is the most powerful
of the designs that require 200 microarrays.
We analyzed the simulations of the F2 cross with fully
found that the power increased slightly under all four
scenarios (Table 2). The increase in power is expected
because line origins can be inferred with certainty. It is
important to note that the difference in the power
between the suboptimal and the best-case scenario for
the marker allele frequencies is small. This indicates
that our power assessment using equal marker allele
frequencies in the simulations is robust and represen-
tative of real outbred F2 intercrosses, of which the
marker allele frequencies in the founder lines are in
between those two extremes.
Additive and dominance effects: For the dominant
QTL, two levels of analysis were carried out: (a) QTL
detection by comparing the full model (additive and
dominance) to the null model and (b) detection of
dominance effect by comparing the full model to the
ing step of optimal pairing, the dominance coefficients
linear model (see materials and methods).
With a single-chromosome (100 cM) genome, the
power to detect QTL under the opt.pair scenario is
clearly lower (Figure 2A, left) than that under all.data.
It can be seen in Table 3 that the mean −log10 P under
all.data is ~50% greater than that under opt.pair. But
opt.pair is still more powerful than both half.data and
ran.pair. By contrast, our results (Figure 2A, right) show
that opt.pair and all.data are similarly powerful for
detecting dominance effects and are superior to both
half.data and ran.pair in a small genome.
effect on power under all.data, half.data, and ran.pair.
However, the increase in genome size affects optimal
pairing more severely here than when no dominance
opt.pair is only marginally more powerful in detecting
the QTL than half.data and ran.pair (Figure 2B, left).
The power for detecting the dominance effect is more
drastically affected and opt.pair performs similarly to
half.data and ran.pair (Figure 2B, right). We repeated
the optimization solely on the additive parameter with
the genome size of 2000 cM. The power of detecting
QTL was found to have improved slightly, while there
was little change in the power for detecting dominance
Figure 1. Performance for detecting additive QTL effects
under various scenarios. (A) Genome size of 100 cM, a single
chromosome; (B) genome size of 2000 cM, 20 chromosomes.
The horizontal dotted line shows the significance level of
$P = 1 \times 10^{-5}$. The simulations are sorted in ascending order
of the −log10 P on the x-axis.
(results not shown). Therefore, in the presence of
dominance effects, the advantage in the performance
of the optimal pair design in detecting QTL is reduced.
impact on the optimal pair design, especially for large
genome sizes, when QTL detection is the primary
Fixed number of microarrays with a large F2 sample
size: In previous simulations, we observed that all.data,
which required 400 microarrays, was more powerful
in detecting the additive QTL effect than using 200
microarrays under the opt.pair scenario. Here, we
studied the power of these two designs conditioned on
a total of 400 microarrays. With an F2 sample size of
1000, neither design can profile all the individuals with
400 microarrays. Under the optimal pair design, 400
pairs were deliberately selected to give the minimum
variance for the estimated additive genetic parameter.
On the other hand, only 400 individuals (randomly
selected from 1000 individuals) could be profiled using
the common reference design. Given the equal number
of microarrays being used, our results in Figure 3 show
that the optimal pair design outperforms the common
in an efficient and effective manner using recombinant
inbred lines. For researchers studying genetics of many
outbred species, however, the creation of recombinant
inbred lines is impractical. Here we explore whether
eQTL studies of natural species would benefit from the
Figure 2. Performance for detecting
QTL additive and dominance (left) and
dominance effects (right) under various
scenarios. (A) Genome size of 100 cM, a
single chromosome; (B) genome size of
2000 cM, 20 chromosomes. The horizontal
dotted line shows the significance level of
$P = 1 \times 10^{-5}$. The simulations are sorted
in ascending order of the −log10 P on the x-axis.
Summary of P-values (mean and standard deviation on −log10 scale) at the main QTL position for additive QTL
detection under the four scenarios, where only an additive effect was simulated
1 11.9 (2.9)
Standard deviations are in parentheses.
a Fully informative markers.
same design principles used in distant pairing. We show
that the optimal pair design, an extension of the distant
pair design for outbred lines crosses, can indeed im-
prove the efficiency of the use of microarrays and in-
crease the statistical power for detecting eQTL, even for
studying organisms with large genome sizes.
Under the linear regression framework, the greatest
power is achieved by having the regression coefficients
For the regression model proposed for the optimal pair
design in this article, this would be achieved by pairing
up individuals who have large genetic coefficients with
opposite signs. However, in a line cross such as the
F2, it is inevitable that not every pair would result in a
regression coefficient that is near one extreme or the
other. Furthermore, when the number of independent
loci increases (increase in chromosome length and
number of chromosomes), the optimal pair assignment
for one locus will usually not be optimal for the other
loci. The optimal pair assignment over the whole ge-
nome is therefore suboptimal in the perspective of a
single locus, i.e., fewer regression coefficients around
of distant pairing to degrade to the same level as that of
Clear benefits in detecting additive effects: We show
that when there are few loci to consider, such as in a
small genome, the power of detecting additive effects
with the optimal pair design is similar to using a com-
mon reference design that consumes twice the number
of microarrays. With near-optimal pairing for individual
loci (achievable when there are small numbers of effec-
tively independent loci), the efficiency of the optimal
pair design is very attractive. Moreover, the common
reference design with only half the sample size (i.e.,
the same number of microarrays) performs significantly
worse. This highlights the problem of small sample size
leading to reduction in power in complex trait analysis.
Extremely similar performances were observed for ran-
dom pairing and the common reference design with
50% of the samples. The difference in performance be-
tween these two designs could have been more marked
if our simulations had explicitly modeled the possible
pair design and a common reference design.
As expected, the performance of the optimal pair
less, it is very promising that in a large genome the
optimal pair design still notably outperforms designs
that use the same number of microarray slides. Furthermore,
as shown by the excellent performance in smaller
be beneficial for a focused study of one or more can-
maximized for genomic regions for which the research-
ers have the most interest, while the power in the rest of
design. In addition, we show that with the number of
microarrays used being equal, the optimal pair design
always gives the highest statistical power of the ap-
proaches compared. Our comparison to the common
Summary of P-values (mean and standard deviation on −log10 scale) at the main QTL position for QTL
detection (additive plus dominance model vs. null model) under the four scenarios, where
both additive and dominance effects were simulated
chromosomes    all.data    half.data    ran.pair    opt.pair
Standard deviations are in parentheses.
Summary of P-values (mean and standard deviation on −log10 scale) at the main QTL position for
dominance detection (additive plus dominance model vs. additive model) under the four
scenarios, where both additive and dominance effects were simulated
Standard deviations are in parentheses.
reference design was made to random selection of
individuals. Although selective phenotyping (Jin et al.
2004) can improve the efficiency of the common ref-
erence design, the optimal pair design would allow more
subjects to be assayed as well as maximize the genotypic
dissimilarity. Therefore, for outbred species that possess
large genomes, the optimal pair design can provide both
for the detection of eQTL with additive effects.
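The idea of cohybridizing genotypically dissimilar individuals can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure used in the study: genotypes are assumed to be coded additively as -1, 0, 1 at each marker, dissimilarity is taken as the summed absolute difference in codes, and a simple hill-climbing swap search stands in for the optimization. The function names `pair_score` and `improve_pairing` are hypothetical.

```python
import random

def pair_score(geno, pairs):
    """Sum, over pairs and markers, of the absolute difference in additive codes."""
    return sum(
        abs(a - b)
        for i, j in pairs
        for a, b in zip(geno[i], geno[j])
    )

def improve_pairing(geno, n_iter=2000, seed=0):
    """Hill climbing: exchange partners between two pairs and keep the
    exchange whenever the total dissimilarity does not decrease.
    Needs at least four individuals; with an odd number, one is left unpaired."""
    rng = random.Random(seed)
    order = list(range(len(geno)))
    rng.shuffle(order)
    pairs = [(order[k], order[k + 1]) for k in range(0, len(order) - 1, 2)]
    best = pair_score(geno, pairs)
    for _ in range(n_iter):
        p, q = rng.sample(range(len(pairs)), 2)
        (a, b), (c, d) = pairs[p], pairs[q]
        trial = list(pairs)
        trial[p], trial[q] = (a, d), (c, b)
        score = pair_score(geno, trial)
        if score >= best:
            pairs, best = trial, score
    return pairs, best
```

Calling `improve_pairing(geno)` on a list of per-individual genotype vectors returns the pairing found and its score. Restricting the marker columns of `geno` to a region of interest would focus the pairing on that region.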
Complications due to dominance effects: How does
dominance affect the performance of this design? We
evaluate the optimal pair design that optimizes for both
additive and dominance effects simultaneously: the
conclusion is that by including the dominance param-
eter, the design becomes less optimized for detecting
the main (additive) effect. Although over a small ge-
nome, the optimal pair design can offer a moderate
power advantage for detecting QTL and dominance
effects over no optimization, the performance is affected
severely in that the power for detecting both the
main and the dominance effects degrades to almost the
same level as random pairing with a large genome. Our
results agree with other studies (Piepho 2005; Bueno
Filho et al. 2006) that finding a design that is optimal
for detecting both additive and dominance effects
cannot be achieved. They have shown that optimizing
for detecting dominance effects would decrease power
for detecting additive effects. Therefore, when one has
to make a choice between additive and dominance
effects for optimization, the question relates directly to
the goal of the experiment. If the goal is to scan across
the whole genome for linked loci to gene-expression
phenotypes, we argue that one could consider focusing
on the additive parameter alone for the optimization.
After all, the ultimate interest is to detect QTL. In most
cases QTL are expected to have an additive component,
even in cases where dominance is present. Optimizing
for dominance effects should be considered only if
there is strong a priori evidence for overdominance in
the QTL of interest in a candidate gene study.
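The comparison of an additive plus dominance model with an additive model at a single locus can be illustrated with a nested F test. The sketch below assumes a fully informative marker with known genotypes coded -1, 0, 1 (0 for the heterozygote), which is a simplification relative to interval mapping with genotype probabilities. The function name `dominance_f_stat` is hypothetical.

```python
def dominance_f_stat(genotypes, y):
    """F statistic (1 and n - 3 d.f.) for adding a dominance term to an
    additive model. Requires all three genotype classes to be present and
    some residual variation within classes."""
    n = len(y)
    # Additive model: simple linear regression of y on the additive code.
    mx = sum(genotypes) / n
    my = sum(y) / n
    sxx = sum((g - mx) ** 2 for g in genotypes)
    sxy = sum((g - mx) * (v - my) for g, v in zip(genotypes, y))
    slope = sxy / sxx
    rss_add = sum(
        (v - my - slope * (g - mx)) ** 2 for g, v in zip(genotypes, y)
    )
    # Additive plus dominance model: with three genotype classes this is
    # equivalent to fitting one mean per class.
    rss_full = 0.0
    for code in (-1, 0, 1):
        grp = [v for g, v in zip(genotypes, y) if g == code]
        m = sum(grp) / len(grp)
        rss_full += sum((v - m) ** 2 for v in grp)
    return (rss_add - rss_full) / (rss_full / (n - 3))
```

A large value indicates that the heterozygote mean departs from the midpoint of the two homozygote means. A value near zero is what a purely additive locus would give.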
Final remarks: We conclude that our extension of the
distant pair design, the optimal pair design, can be
applied efficiently to outbred line crosses for genetical
genomic studies. Having stated that, we acknowledge
that in an experimental design for genetical genomics,
there is no "one-size-fits-all" solution. The most powerful
and efficient design will depend on the population
structure, the marker density, the chosen method of
analysis, the numbers of treatments, and the parameter
of interest. Bueno Filho et al. (2006) proposed different
designs for multiple genotypes, epistasis, and multiple
treatments. In human or other natural populations, the
can be applied to sib-pair analysis, in which case, the
eQTL linkage analysis would be to profile the expres-
sion of a pair of sibs on the same array. This is because
the trait squared differences between two sibs are the
dependent variable used in this method; these quanti-
ties are obtained most accurately when sibs are paired
up on the same array.
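As a rough illustration of why the squared sib difference matters, the sketch below regresses squared sib-pair trait differences on the proportion of alleles shared identical by descent at a locus. It assumes a regression in the style of Haseman and Elston (1972), and the function name `haseman_elston_slope` and its inputs are hypothetical.

```python
def haseman_elston_slope(sib1, sib2, ibd):
    """Slope from regressing squared sib-pair trait differences on the
    proportion of alleles shared identical by descent (IBD)."""
    sq_diff = [(a - b) ** 2 for a, b in zip(sib1, sib2)]
    n = len(ibd)
    mx = sum(ibd) / n
    my = sum(sq_diff) / n
    sxx = sum((x - mx) ** 2 for x in ibd)
    sxy = sum((x - mx) * (d - my) for x, d in zip(ibd, sq_diff))
    return sxy / sxx
```

A clearly negative slope is the pattern expected under linkage, since sibs sharing more alleles at the locus should differ less in expression. Because a two-color array measures the within-pair contrast directly, array-to-array variation does not enter the squared difference.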
It is also worth considering the implication of the use
of high-density single-nucleotide polymorphism (SNP)
genotyping on the optimal pair design described in
this article. High-density SNP genotyping is most widely
used in association studies in natural human popula-
tions rather than in line crosses of the animals discussed
above. As linkage disequilibrium spans relatively short
distances in human populations, the effective number
of independent loci is much higher than what we have
modeled in our line-cross simulations. This effect is
equivalent to increasing the genome size and is likely
to have a more negative impact on the performance of the
optimal pair design than what can be expected in
outbred line crosses. Eventually, the distant pairing
strategy might become almost equivalent to a pairing
strategy based on relationships, in which less-related in-
dividuals should be paired for each hybridization (Rosa
et al. 2006; Bueno Filho et al. 2006). Theoretically, the
variance of the estimate of the parameter is minimized,
its performance should be at least as good as the com-
mon reference design. However, other factors, such as
technical simplicity and flexibility in the choice of
statistical methods, might shift the balance in favor of
the common reference design when the performance
advantage in using the optimal pair design becomes
less marked. Therefore, it is imperative to consider each
experiment and the question of interest on a case-
by-case basis. Nevertheless, our results suggest that the
Figure 3. Comparison of the performance for QTL detection
under the common reference design and the optimal
pair design when the number of arrays is fixed at 400, the genome
size is 2000 cM, and the F2 sample size is 1000. The horizontal
dotted line shows the significance level of P = 1 × 10^-5.
The simulations are sorted in ascending order of the -log10 P
on the x-axis.
efficient design principles outlined by Fu and Jansen
(2006) can be applied to a wider context than RILs.
With larger eQTL experiments becoming more afford-
able, we can expect to discover more loci with moderate
to small effects. Such attainment will ultimately lead to
greater advances in our understanding of the molecular
basis of complex traits.
We thank two anonymous referees for their helpful suggestions.
The R code for optimizing pairing configuration can be obtained by
request from the corresponding author. This research was funded
by the Biotechnology and Biological Sciences Research Council
(BBSRC). A.C.L. is grateful for support from the BBSRC (grant no.
BBSSF200512735), the Genesis Faraday Partnership, and Genus plc.
Alfonso, L., and C. S. Haley, 1998 Power of different F-2 schemes
for QTL detection in livestock. Anim. Sci. 66: 1–8.
Brem, R. B., G. Yvert, R. Clinton and L. Kruglyak, 2002
dissection of transcriptional regulation in budding yeast. Science
Bueno Filho, J. S., S. G. Gilmour and G. J. Rosa, 2006
microarray experiments for genetical genomics studies. Genetics
Bystrykh, L., E. Weersing, B. Dontje, S. Sutton, M. T. Pletcher
et al., 2005 Uncovering regulatory pathways that affect hematopoietic
stem cell function using 'genetical genomics'. Nat. Genet.
de Koning, D. J., and C. S. Haley, 2005 Genetical genomics in humans
and model organisms. Trends Genet. 21: 377–381.
Dixon, A. L., L. Liang, M. F. Moffatt, W. Chen, S. Heath et al.,
2007 A genome-wide association study of global gene expression.
Nat. Genet. 39: 1202–1207.
Fu, J. Y., and R. C. Jansen, 2006 Optimal design and analysis of genetic
studies on gene expression. Genetics 172: 1993–1999.
Haley, C. S., S. A. Knott and J. M. Elsen, 1994
tive trait loci in crosses between outbred lines using least squares.
Genetics 136: 1195–1207.
Haseman, J. K., and R. C. Elston, 1972 The investigation of linkage
between a quantitative trait and a marker locus. Behav. Genet. 2:
Hubner, N., C. A. Wallace, H. Zimdahl, E. Petretto, H. Schulz
et al., 2005 Integrated transcriptional profiling and linkage
analysis for identification of genes underlying disease. Nat. Genet.
Jansen, R. C., and J. P. Nap, 2001 Genetical genomics: the added
value from segregation. Trends Genet. 17: 388–391.
Jin, C., H. Lan, A. D. Attie, G. A. Churchill, D. Bulutuglo et al.,
2004 Selective phenotyping for increased efficiency in genetic
mapping studies. Genetics 168: 2285–2293.
Jin, W., R. M. Riley, R. D. Wolfinger, K. P. White, G. Passador-
Gurgel et al., 2001 The contributions of sex, genotype and
age to transcriptional variance in Drosophila melanogaster. Nat.
Genet. 29: 389–395.
Keurentjes, J. J., J. Fu, I. R. Terpstra, J. M. Garcia, G. van den
Ackerveken et al., 2007 Regulatory network construction in
Arabidopsis by using genome-wide gene expression quantitative
trait loci. Proc. Natl. Acad. Sci. USA 104: 1708–1713.
Kirkpatrick, S., C. D. Gelatt, Jr. and M. P. Vecchi, 1983
zation by simulated annealing. Science 220: 671–680.
Li, Y., O. A. Alvarez, E. W. Gutteling, M. Tijsterman, J. Fu et al.,
2006 Mapping determinants of gene expression plasticity by ge-
netical genomics in C. elegans. PLoS Genet. 2: e222.
Mehrabian, M., H. Allayee, J. Stockton, P. Y. Lum, T. A. Drake
et al., 2005 Integrating genotypic and expression data in a segregating
mouse population to identify 5-lipoxygenase as a susceptibility
gene for obesity and bone traits. Nat. Genet. 37: 1224–1233.
Monks, S. A., A. Leonardson, H. Zhu, P. Cundiff, P. Pietrusiak
et al., 2004 Genetic inheritance of gene expression in human
cell lines. Am. J. Hum. Genet. 75: 1094–1105.
Morley, M., C. M. Molony, T. M. Weber, J. L. Devlin, K. G. Ewens
et al., 2004 Genetic analysis of genome-wide variation in human
gene expression. Nature 430: 743–747.
Piepho, H. P., 2005 Optimal allocation in designs for assessing heterosis
from cDNA gene expression data. Genetics 171: 359–364.
R Development Core Team, 2007 R: A Language and Environment for
Statistical Computing. R Foundation for Statistical Computing,
Rosa, G. J., N. de Leon and A. J. Rosa, 2006 Review of microarray
experimental design strategies for genetical genomics studies.
Physiol. Genomics 28: 15–23.
Schadt, E. E., S. A. Monks, T. A. Drake, A. J. Lusis, N. Che et al.,
2003 Genetics of gene expression surveyed in maize, mouse
and man. Nature 422: 297–302.
Schadt, E. E., J. Lamb, X. Yang, J. Zhu, S. Edwards et al., 2005
integrative genomics approach to infer causal associations between
gene expression and disease. Nat. Genet. 37: 710–717.
Seaton, G., C. S. Haley, S. A. Knott, M. Kearsey and P. M.
Visscher, 2002 QTL Express: mapping quantitative trait loci
in simple and complex pedigrees. Bioinformatics 18: 339–340.
Stranger, B. E., M. S. Forrest, A. G. Clark, M. J. Minichiello, S.
Deutsch et al., 2005 Genome-wide associations of gene expression
variation in humans. PLoS Genet. 1: e78.
Wit, E., and J. McClure, 2004 Statistics for Microarrays: Design, Analysis
and Inference. John Wiley & Sons, Chichester, UK.
Communicating editor: R. W. Doerge