ARTICLE
Studying Gene and Gene-Environment Effects of Uncommon
and Common Variants on Continuous Traits: A Marker-Set
Approach Using Gene-Trait Similarity Regression
Jung-Ying Tzeng,1,2,* Daowen Zhang,1 Monnat Pongpanich,2 Chris Smith,2 Mark I. McCarthy,3
Michèle M. Sale,4 Bradford B. Worrall,5 Fang-Chi Hsu,6 Duncan C. Thomas,7 and Patrick F. Sullivan8
Genomic association analyses of complex traits demand statistical tools that are capable of detecting small effects of common and rare
variants and modeling complex interaction effects and yet are computationally feasible. In this work, we introduce a similarity-based
regression method for assessing the main genetic and interaction effects of a group of markers on quantitative traits. The method
uses genetic similarity to aggregate information from multiple polymorphic sites and integrates adaptive weights that depend on allele
frequencies to accommodate common and uncommon variants. Collapsing information at the similarity level instead of the genotype
level avoids canceling signals that have the opposite etiological effects and is applicable to any class of genetic variants without the
need for dichotomizing the allele types. To assess gene-trait associations, we regress trait similarities for pairs of unrelated individuals
on their genetic similarities and assess association by using a score test whose limiting distribution is derived in this work. The proposed
regression framework allows for covariates, has the capacity to model both main and interaction effects, can be applied to a mixture of
different polymorphism types, and is computationally efficient. These features make it an ideal tool for evaluating associations between
phenotype and marker sets defined by linkage disequilibrium (LD) blocks, genes, or pathways in whole-genome analysis.
Introduction
Marker-set analysis refers to the joint evaluation of a group of markers for genetic association. These markers might be of various polymorphism types (e.g., a mixture of SNP, insertion-deletion variants [INDEL], block substitutions, copy-number variants, or inversion variants) but share certain common genomic features, such as participating in the same pathway, being in high linkage disequilibrium (LD), or being located within the same gene or conserved functional region. Marker-set analysis has drawn great attention in recent genome-wide and sequence-based association studies. It assesses the joint association of potentially correlated and interacting loci. It amplifies the detectability of the causal signals by aggregating small effects from multiple individual loci. Furthermore, because sequences and functions of genes are highly consistent across populations and species, a marker-set analysis increases the interpretability and replicability of the association findings. For whole-genome scans, it also offers a natural way of reducing the total number of tests and hence improves power by reducing the multiple-testing burden. For sequence-based studies, marker-set analysis accumulates information across multiple rare mutations and has a greatly enhanced power to detect rare variants that are hard for researchers to identify by traditional analysis methods.
A variety of methods are available for detecting marker-set association, ranging from minimum p value or Fisher's combined methods1,2 for single-marker tests to multi-marker tests with a genotype- or haplotype-based scoring. Many recent methods fall in between the two extremes. These methods collapse information from all markers in the set and achieve a better balance between information and degrees of freedom. Depending on how the individual marker information is combined, we can roughly classify these approaches into four categories. Methods in the first category use the weighted sum of genotypes across markers, for example the LD-based weighting method,3 the weighted Fourier transform,4 and the PCA-based methods.5,6 Recently, special versions of the weighted sum methods based on allele frequencies were proposed to target rare variants.7–10 Methods of the second type model the genetic similarity of pairs of individuals and are also referred to as U-statistics approaches.11–19 Methods of the third type are variance-component (VC) methods, which treat individual genetic effects as random effects and test for the corresponding VC to detect the global effect of a gene. Methods of this type include the SNP random-effects model,20,21 haplotype random-effects model,22 and kernel-based methods.23–25 The fourth category includes other approaches that do not fit into the above categories, such as the c-alpha test,26 the group additive regression model,27 Tukey's model,28 and entropy-based methods.29

Although most marker-set methods have concentrated on detecting genetic main effects, here we focus on methods for studying gene-environment (G × E) interactions. Identifying genetic variants with heterogeneous effects under different environmental exposures is crucial for understanding individualized medicine, studying pharmacogenetics, characterizing underlying biological mechanisms, and uncovering unexplained heritability.30,31 Marker-set analysis provides an ideal framework for the study of G × E interactions. The marker set, either defined by genes, pathways, or functions, provides a biologically sensible unit for the G component, and the loci in a set can be assessed jointly for whether their effects are modified under different environmental exposures. In addition, the potential power gain brought by the marker-set analysis (either through aggregating genetic signals or by reducing multiple-testing penalty) can alleviate the data-hungry nature of detecting G × E interactions. Typically, a G × E test would require sample sizes at least four times larger than a main effect test for detecting an effect of comparable magnitude.30–33 Furthermore, many G × E studies are based on conceptual models for candidate pathways, in which a set of genes are selected and studied together.31,34 Marker-set analysis offers a suitable tool for the evaluation of the overall effect of the postulated pathways when assessing G × E interactions.

1Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA; 2Bioinformatics Research Center, North Carolina State University, Raleigh, NC 27695, USA; 3Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK; 4Center for Public Health Genomics, Departments of Medicine and Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA 22908, USA; 5Center for Public Health Genomics, Departments of Public Health Sciences and Neurology, University of Virginia, Charlottesville, VA 22908, USA; 6Department of Biostatistical Sciences, Wake Forest School of Medicine, Winston-Salem, NC 27157, USA; 7Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA; 8Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
*Correspondence: jungying_tzeng@ncsu.edu
DOI 10.1016/j.ajhg.2011.07.007. ©2011 by The American Society of Human Genetics. All rights reserved.
The American Journal of Human Genetics 89, 277–288, August 12, 2011
The marker-set G × E method we present focuses on quantitative traits and uses pairwise genetic similarity as a tool to aggregate marker information (i.e., the second category in the above method categorization). Our approach differs from those in the literature on gene/pathway level analysis in the following aspects. First, we introduce a framework for incorporating interaction effects in similarity-based methods. To be useful for G × E studies with either confirmatory or exploratory aims, we develop a series of tests to suit different purposes, including a test for detecting G × E interactions, a test for detecting marginal main effects, and a joint test for detecting the overall association induced either by genetic main effects or by G × E interactions. The joint test serves as a good tool when little is known a priori about the genetic heterogeneity across exposure strata and provides power across a wide range of the unknown underlying true structures. Second, the proposed method can collapse information from a mixture of different types of variants and is designed to detect common and uncommon variants. Both are desirable features when more classes of DNA variants are available. Finally, we illustrate how similarity-based collapsing methods can be equivalent to VC methods (i.e., category 3 in the method categorization), which are found to have better main-effect performance than several other marker-set approaches.24,35–37 Through simulation, we show the validity of the test and investigate the power of the proposed approach under a wide range of scenarios. We illustrate the utility of the proposed method by using the samples from the Vitamin Intervention for Stroke Prevention (VISP) trial. In this study, candidate genes across the genome were selected for the evaluation of the gene and gene-age interaction effects on the change in fasting homocysteine (Hcy) level following a 2 hr methionine load test.
Material and Methods
Gene-Trait Similarity Regression for G and G × E Effects
We use the following notations. For individual $i$ $(i = 1, 2, \ldots, n)$, let $Y_i$ be the continuous trait, $X_i$ be the $K \times 1$ covariate vector excluding the intercept term and standardized to mean $= 0$ and variance $= 1$, and $G_{m,i}$ be the allele-count vector of marker $m$ for person $i$, with the length equal to the number of distinct alleles at marker $m$ (denoted by $\ell_m$), $m = 1, 2, \ldots, M$. For example, $G_{m,i} = [2, 0]$ if person $i$ has genotype 11 at SNP $m$ and $= [1, 1]$ if person $i$ has genotype 10. To fix the idea, we consider $K = 1$, but the method described here also applies to $K > 1$.
For each pair of individuals $i$ and $j$, we measure the trait similarity $Z_{ij}$ and genetic similarity $S_{ij}$ of the targeted marker set. We then regress the trait similarity on the genetic similarity and detect gene-trait association by testing for the significance of relevant regression coefficients. The trait similarity $Z_{ij}$ is quantified through trait covariance by taking the product of the trait residuals of subjects $i$ and $j$. Let $\mu_i$ be the subject-specific mean of trait value adjusted for the covariate information; then we set $Z_{ij} = (Y_i - \mu_i)(Y_j - \mu_j)$, where $\mu_i = \gamma_0 + X_i \gamma$ and $(\gamma_0, \gamma)$ is the covariate effects including the intercept. The genetic similarity $S_{ij}$ is measured by the average of the weighted allele matching score (weighted matching score for short) between subjects $i$ and $j$ across the $M$ markers. It takes the form of $S_{ij} = \frac{1}{M} \sum_{m=1}^{M} G_{m,i}^{T} W_m G_{m,j}$, in which $W_m$ is an $\ell_m \times \ell_m$ matrix that specifies the weighting scheme. As an illustration, consider a SNP and the weight $W_m = I_{2 \times 2}$. Then $S_{AA,AA} = 4$, $S_{AA,Aa} = 2$, and $S_{AA,aa} = 0$. When quantifying genetic similarity, one can use weights based on allele frequencies, the degree of evolutionary conservation, or the functionality of the variations to better target genetic variants of certain features (e.g., rare, functional).15,25,38 For example, to upweight similarities contributed by rare variants, we define the frequency of allele $a$ at marker $m$ as $q_{a,m}$ and set $W_m = \mathrm{diag}\{1/q_{a,m}\}$ or $\mathrm{diag}\{1/\sqrt{q_{a,m}}\}$ to upweight the similarity in rare alleles.23,24

The proposed gene-trait similarity regression model has the following form:

$$E(Z_{ij} \mid X, H) = b \times S_{ij} + \delta \times S_{ij} \times X_i X_j, \quad i \neq j.$$
(Equation 1)

Because baseline and covariate effects have been adjusted for in $Z_{ij}$, the regression has a zero intercept and does not have the covariate term $X_i X_j$. This contention will become more obvious from the viewpoint of variance components in the following paragraph. Equation 1 incorporates information about genetic main effects and gene-environment interactions and hence allows the possibility of a genetic effect to be modified by an environmental exposure. Under Equation 1, one can evaluate the overall genetic association by performing a joint test of genetic main effects and gene-environment interactions for $H_0: b = \delta = 0$. To assess gene-environment interactions only, one can perform a G × E test by examining $H_0: \delta = 0$. Finally, one can evaluate the marginal main effects by examining the main effect term and testing for $H_0: b = 0$ under the constraint of $\delta = 0$. We refer to this test as the G test. The G test can be used as a subsequent test when a G × E test fails to reject $H_0$, or it can be used as an alternative way to detect the overall genetic association. Because interactive factors can often exhibit a marginal effect even when the interaction terms are not modeled,39,40 the G test is often used to perform genome screening in common practice. Compared to the joint test, the G test uses fewer degrees of freedom and hence is more powerful when there are no gene-environment interactions or when the interaction effects are big, but it might be less powerful when the genetic effect is restricted to the exposure group.41
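As a rough illustration of the weighted matching score $S_{ij}$ defined above, the following Python sketch computes it from allele-count vectors and reproduces the SNP example with the identity weight. The function names `weighted_matching_score` and `frequency_weight` and the allele frequencies used in the last lines are hypothetical choices made for this illustration; they are not taken from the authors' software.

```python
import numpy as np

def weighted_matching_score(G_i, G_j, weights):
    """Average weighted allele-matching score S_ij over M markers.

    G_i, G_j: lists of length M; entry m is the allele-count vector of
              marker m (length = number of distinct alleles at marker m).
    weights:  list of length M; entry m is the square weight matrix W_m.
    """
    M = len(G_i)
    total = 0.0
    for g_i, g_j, W in zip(G_i, G_j, weights):
        g_i = np.asarray(g_i, dtype=float)
        g_j = np.asarray(g_j, dtype=float)
        total += g_i @ np.asarray(W, dtype=float) @ g_j
    return total / M

def frequency_weight(q, power=1.0):
    """W_m = diag{1 / q_a^power}: power=1 and power=0.5 correspond to the two
    frequency-based weights in the text; power=0 gives the identity weight."""
    q = np.asarray(q, dtype=float)
    return np.diag(1.0 / q**power)

# One SNP with alleles (A, a) and the identity weight, as in the text.
AA, Aa, aa = [2, 0], [1, 1], [0, 2]
W_identity = [np.eye(2)]
s_AA_AA = weighted_matching_score([AA], [AA], W_identity)  # 2*2 + 0*0
s_AA_Aa = weighted_matching_score([AA], [Aa], W_identity)  # 2*1 + 0*1
s_AA_aa = weighted_matching_score([AA], [aa], W_identity)  # 2*0 + 0*2

# Illustrative (assumed) allele frequencies: A common, a rare.
W_rare = [frequency_weight([0.9, 0.1])]
s_rare_match = weighted_matching_score([aa], [aa], W_rare)
s_common_match = weighted_matching_score([AA], [AA], W_rare)
```

With the frequency-based weight, sharing the rare allele contributes more to the similarity than sharing the common allele, which is the intended upweighting.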
The test statistics for G × E, G, and joint tests can be derived through the equivalence between the similarity regression models and the haplotype random-effects model.17 Consider a working haplotype random-effects model:

$$Y_i = \gamma_0 + X_i \gamma + H_i^{T} \beta + X_i H_i^{T} \lambda + e_i,$$
(Equation 2)

where $e_i \sim N(0, \sigma)$, $H_i$ is the $L \times 1$ haplotype vector, $L$ is the number of distinct haplotypes observed in the population, $\beta_{L \times 1} \sim N(0, \tau R)$, $\lambda_{L \times 1} \sim N(0, \phi R)$, and $R$ is an $L \times L$ matrix in which the $(h, k)$th entry is equal to the similarity between haplotypes $h$ and $k$, quantified by the weighted matching score. Under the working mixed model (Equation 2), the trait covariance between individuals $i$ and $j$ $(i \neq j)$ is

$$\begin{aligned}
\mathrm{cov}(Y_i, Y_j \mid X, H) &= H_i^{T} \mathrm{cov}(\beta) H_j + X_i H_i^{T} \mathrm{cov}(\lambda) H_j X_j \\
&= \tau \times H_i^{T} R H_j + \phi \times X_i X_j \times H_i^{T} R H_j \\
&= \tau \times S_{ij} + \phi \times X_i X_j \times S_{ij}.
\end{aligned}$$
(Equation 3)

The last line follows from the fact that $H_i^{T} R H_j = \frac{1}{M} \sum_{m=1}^{M} G_{m,i}^{T} W_m G_{m,j}$.17 Comparing Equations 1 and 3, we have $b = \tau$ and $\delta = \phi$. That is, the regression coefficients in the similarity regression are the variance components in the mixed model (Equation 2). Therefore, following similar derivations in Tzeng and Zhang22 and Zhang and Lin,42 we obtain the score test statistics for G × E test, G test, and the joint test as follows:

$$T_{G \times E} = Y^{T} P_1 D S D P_1 Y \,\big|_{\phi = 0,\ \tau = \hat{\tau},\ \sigma = \hat{\sigma}},$$

$$T_{G} = Y^{T} P_0 S P_0 Y \,\big|_{\phi = 0,\ \tau = 0,\ \sigma = \tilde{\sigma}},$$

and

$$T_{joint} = Y^{T} P_0 (S + D S D) P_0 Y \,\big|_{\phi = 0,\ \tau = 0,\ \sigma = \tilde{\sigma}}.$$

In the above equations, $Y^{T}_{n \times 1} = (Y_1, \ldots, Y_n)$, $D_{n \times n} = \mathrm{diag}\{X_i\}$, and $S = \{S_{ij}\}$, where $S_{ij} = H_i^{T} R H_j$; the matrix $P_t = V_t^{-1} - V_t^{-1} X (X^{T} V_t^{-1} X)^{-1} X^{T} V_t^{-1}$, $t = 0, 1$, where $V_0 = \sigma I$ and $V_1 = \tau S + \sigma I$. The quantities $(\hat{\tau}, \hat{\sigma})$ are the REML estimates for $(\tau, \sigma)$ obtained under $H_0: \phi = 0$, and $\tilde{\sigma}$ is the REML estimate for $\sigma$ under $H_0: \phi = \tau = 0$. These estimates are given in Appendix A. As shown in Appendix B, these test statistics follow a weighted $\chi^2$ distribution, and the p values can be calculated with the three-moment approximation.43,44

There are a few remarks regarding the similarity-based marker-set methods. The similarity regression aggregates marker information through a sum of genotype similarity across markers instead of a sum of genotypes. Compared to genotype sums, aggregating information through similarity can prevent signals of opposite directions from being canceled. In addition, because $G_{m,i}$ takes integer or dosage counts and can be of any length, this approach can work with typed and imputed genotype calls and is applicable to a mixture of different types of variants without having to dichotomize the variants.
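The two statistics that are evaluated under $H_0: \phi = \tau = 0$ can be sketched numerically as follows. This is a minimal illustration, not the authors' implementation: it assumes that the matrix $X$ inside $P_0$ carries an intercept column, uses the standard linear-model REML estimate (residual sum of squares divided by $n$ minus the number of fixed effects) for $\tilde{\sigma}$, and treats $\sigma$ as a variance, as in $V_0 = \sigma I$. The function name `null_score_statistics` is hypothetical. $T_{G \times E}$ is omitted because it needs the REML estimates of Appendix A, and no p value is computed because the weighted $\chi^2$ approximation of Appendix B is not reproduced here.

```python
import numpy as np

def null_score_statistics(Y, X, S):
    """Sketch of T_G and T_joint evaluated under H0: phi = tau = 0.

    Y: (n,) trait vector; X: (n,) standardized covariate (K = 1);
    S: (n, n) genetic similarity matrix.
    """
    Y = np.asarray(Y, dtype=float)
    X = np.asarray(X, dtype=float)
    S = np.asarray(S, dtype=float)
    n = Y.shape[0]
    # Assumption: the fixed-effects design includes an intercept column.
    design = np.column_stack([np.ones(n), X])
    # Residual-forming matrix I - X (X'X)^{-1} X'.
    hat = design @ np.linalg.solve(design.T @ design, design.T)
    M0 = np.eye(n) - hat
    resid = M0 @ Y
    # REML estimate of the residual variance sigma under the null model.
    sigma_tilde = resid @ resid / (n - design.shape[1])
    # With V_0 = sigma * I, P_0 reduces to M0 / sigma.
    P0Y = (M0 / sigma_tilde) @ Y
    D = np.diag(X)
    T_G = P0Y @ S @ P0Y
    T_joint = P0Y @ (S + D @ S @ D) @ P0Y
    return T_G, T_joint, sigma_tilde

# Toy usage with simulated inputs (all values are illustrative only).
rng = np.random.default_rng(0)
n = 30
X_demo = rng.standard_normal(n)
Y_demo = 1.0 + X_demo + rng.standard_normal(n)
T_G_demo, T_joint_demo, sigma_demo = null_score_statistics(Y_demo, X_demo, np.eye(n))
```

With $S = I$ the G statistic reduces to $(n - 2)/\tilde{\sigma}$, which gives a quick sanity check of the projection step.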
Simulation Studies
We performed simulations based on HapMap 3 data to assess the performance of the proposed tests. We obtained a haplotype population consisting of 234 phased haplotypes from chromosome 21 of the CEU (Utah residents with ancestry from northern and western Europe) samples in HapMap 3. To obtain a variety of risk allele frequencies and LD patterns of a marker set, we defined a marker set as a 10 SNP region, and used a nonoverlapping sliding window on chromosome 21 to obtain 1734 regions. Given a marker-set region, we generated haplotypes for 500 individuals by randomly sampling 500 pairs of haplotypes with replacement from the 234 haplotypes under a Hardy-Weinberg equilibrium assumption. Because the rarest allele frequency we can obtain is $1/234 \approx 0.004$, we used a relatively small sample size ($n = 500$) to assure genetic heterogeneity attributable to rare mutations.

Given a 10 SNP region, the 5th and the 10th SNPs were set to be the risk loci, and their genotypes for individual $i$ are denoted by $G_{1i}$ and $G_{2i} \in \{0, 1, 2\}$, respectively. We generated $X_i \sim N(0, 1)$. Then on the basis of the genetic and covariate information of individual $i$, the trait value $Y_i$ was sampled from a normal distribution with mean $= \gamma_0 + \gamma_1 X_i + \gamma_{G1} G_{1i} + \gamma_{G2} G_{2i} + \gamma_{GE1} X_i G_{1i} + \gamma_{GE2} X_i G_{2i}$ and variance $= v^2$, where $\gamma_0$ and $\gamma_1$ were set to be 1, and $v^2$ was determined so that the heritability was around 0.1 to 0.2. For type I error rate analysis, we set $(\gamma_{G1}, \gamma_{G2}, \gamma_{GE1}, \gamma_{GE2}) = (0, 0, 0, 0)$ for all three tests and also $(0.2, 0.2, 0, 0)$ for G × E test. For power analysis, we set $(\gamma_{G1}, \gamma_{G2}, \gamma_{GE1}, \gamma_{GE2}) = (0.25, 0.25, 0.3, 0.3)$. These values were chosen so that the power of the joint tests is not too close to 1, whereas the power of G × E and G tests is not too close to the nominal level of 0.0005.
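The data-generating steps described above can be sketched in Python as follows. This is only an outline under stated assumptions: the haplotype matrix is a randomly generated stand-in for the 234 phased CEU haplotypes, the residual variance `v2=1.0` is a placeholder rather than a value calibrated to a heritability of 0.1 to 0.2, and the function name `simulate_region` is hypothetical.

```python
import numpy as np

def simulate_region(haplotypes, risk_idx=(4, 9), n=500,
                    gamma0=1.0, gamma1=1.0,
                    gamma_G=(0.25, 0.25), gamma_GE=(0.3, 0.3),
                    v2=1.0, rng=None):
    """Simulate traits for one 10-SNP region.

    haplotypes: (n_hap, n_snp) array of 0/1 alleles for phased haplotypes.
    risk_idx:   zero-based positions of the risk loci (5th and 10th SNPs).
    v2:         residual variance (placeholder value, not calibrated).
    """
    rng = np.random.default_rng() if rng is None else rng
    haplotypes = np.asarray(haplotypes)
    # Two haplotypes per individual, drawn with replacement (HWE assumption).
    idx = rng.integers(0, haplotypes.shape[0], size=(n, 2))
    genotypes = haplotypes[idx[:, 0]] + haplotypes[idx[:, 1]]  # counts in {0, 1, 2}
    G1 = genotypes[:, risk_idx[0]]
    G2 = genotypes[:, risk_idx[1]]
    X = rng.standard_normal(n)
    mean = (gamma0 + gamma1 * X
            + gamma_G[0] * G1 + gamma_G[1] * G2
            + gamma_GE[0] * X * G1 + gamma_GE[1] * X * G2)
    Y = rng.normal(mean, np.sqrt(v2))
    # The risk loci are excluded from the analyzed marker set.
    observed = np.delete(genotypes, list(risk_idx), axis=1)
    return Y, X, observed

# Stand-in haplotype pool (random 0/1 alleles, for illustration only).
demo_rng = np.random.default_rng(1)
demo_haps = demo_rng.integers(0, 2, size=(234, 10))
Y_sim, X_sim, G_obs = simulate_region(demo_haps, rng=demo_rng)
```

Setting `gamma_G=(0, 0)` and `gamma_GE=(0, 0)` gives the null configuration used for the type I error analysis.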
Each region was analyzed with the proposed similarity regression with three weighting schemes considered in the literature:23,24 (1) $W_m = \mathrm{diag}\{1/q_m\}$ (referred to as SIM1), (2) $W_m = \mathrm{diag}\{1/\sqrt{q_m}\}$ (referred to as SIM2), and (3) $W_m = \mathrm{diag}\{1\}$ (referred to as SIM0). The results were compared to two benchmark methods, the single-SNP minimum-p-value method (referred to as SNP) and the multi-SNP haplotype-based method (referred to as HAP). In all analyses, the two risk loci were excluded, and the phase information was removed. For the minimum p value method, we used the minimum of the p values from the G × E, G, and joint tests for the eight SNPs, and the significance threshold was determined with the multiple-testing correction method of Moskvina and Schmidt.45 This method estimates the effective number of independent tests for correlated SNPs at a given overall type I error rate and calculates the significance level for the individual tests accordingly. For the haplotype-based analysis, we used the widely used R package haplo.stats to carry out standard haplotype regression analysis. Specifically, we used haplo.glm46 for the G × E test and haplo.score47 for the G test. We did not perform the joint test at the haplotype level because it is not supported by this program. Haplotypes with frequencies less than the program default threshold (i.e., 0.01) were pooled into the baseline haplotype.
Results
Simulation Studies
To evaluate type I error rates, we randomly selected six of 1734 regions on chromosome 21 to represent six different scenarios: two levels of disease allele frequencies ($q = 0.1$ and 0.3) combined with three levels of LD pattern (high, medium, and low). The LD pattern was summarized with the average of the 16 $R^2$ values, where each value is the LD between an observed marker (eight in total) and a risk locus (two in total). A larger LD value reflects stronger correlation between the observed markers and the unobserved risk loci, hence the value reflects the informativeness of the observed markers for the risk loci. Each of the type I error rates was calculated on the basis of 50,000 replications for $(\gamma_{G1}, \gamma_{G2}, \gamma_{GE1}, \gamma_{GE2}) = (0, 0, 0, 0)$ for all tests and 20,000 replications for $(0.2, 0.2, 0, 0)$ for G × E test. The results (Figure 1) indicate that the type I error rates were around the nominal levels considered (i.e., $\alpha = 0.05$, 0.005, and 0.0005) for all methods in most scenarios. The exceptions tend to occur in the haplotype G × E tests, where the type I errors can be inflated because of the presence of rare haplotypes. Inflation at larger $\alpha$ levels can often be eliminated by using a slightly higher threshold (e.g., 0.02, as opposed to the default value of 0.01) that pools uncommon haplotypes into the baseline group. To avoid any potential impact that modifying the default threshold might induce, we still used the threshold value of 0.01 in our power analysis.
The power was evaluated for each of the 1734 regions on the basis of 100 replications at the nominal level of 0.0005. The results are shown in Figure 2 (G × E test), Figure 3 (G test), and Figure 4 (joint test). The 1734 regions were grouped into 12 categories, combinations of the four scenarios of allele frequencies and the three LD patterns. The risk allele frequencies from rare to common are categorized as follows: (A) both allele frequencies are less than 0.05, (B) sums of allele frequencies that are less than 0.3 but excluding those in (A), (C) sums of allele frequencies that are between 0.3 and 0.6, and (D) sums of allele frequencies that are greater than 0.6. The clustering of LD patterns is based on the following thresholds: an average $R^2 > 0.6$ for high, an average $R^2 \in (0.25, 0.6)$ for medium, and an average $R^2 < 0.25$ for low.
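The grouping rule above can be written as a small Python sketch. The function names `frequency_category` and `ld_category` are hypothetical, and the handling of values that fall exactly on a boundary (0.3, 0.6, or 0.25) is an assumption, since the text does not specify it.

```python
def frequency_category(q1, q2):
    """Classify a region by the frequencies of its two risk alleles."""
    total = q1 + q2
    if q1 < 0.05 and q2 < 0.05:
        return "A"  # both risk alleles rare
    if total < 0.3:
        return "B"  # sum < 0.3, excluding category A
    if total <= 0.6:
        return "C"  # sum between 0.3 and 0.6 (boundary handling assumed)
    return "D"      # sum > 0.6

def ld_category(mean_r2):
    """Classify a region by the average R^2 between markers and risk loci."""
    if mean_r2 > 0.6:
        return "LD-H"
    if mean_r2 > 0.25:
        return "LD-M"
    return "LD-L"   # boundary value 0.25 assigned to low (assumed)

# Example: a region with risk-allele frequencies 0.1 and 0.1 and mean R^2 of 0.4.
example_label = (frequency_category(0.1, 0.1), ld_category(0.4))
```

Here `mean_r2` stands for the average of the 16 $R^2$ values between the eight observed markers and the two risk loci.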
A similar pattern was observed across Figures 2–4, hence we concentrate on explaining Figure 2. In regions that exhibit low LD (LD-L), all three methods lacked power and had roughly equal performance. The exception is in (A), where the SIM1 method performed worse than the other two. The situation that all three methods had similarly low power is not surprising because LD-L represents regions that contained markers with little information about the two risk loci. The lone exception in LD-L (A) can be explained by the fact that the SIM1 method is best applied in scenarios where a large number of markers have at least medium-level LD with the risk loci, but in LD-L (A), such a scenario only occurred in 13% of the regions. On the other hand, in 60% of the regions, the majority of the markers had no LD with the risk loci, but either one single marker was in perfect LD with one of the risk loci, or two markers were in very high LD with each of the risk loci. The former cases tend to favor the SNP methods, whereas the latter tend to favor the HAP methods (and the remaining 27% were regions where all markers had extremely low LD with the risk loci). In the scenarios of LD-L with (B), (C), and (D), we did not observe such a large proportion of extreme cases, and this resulted in a more comparable performance of the three methods. Finally, compared to regions with LD-L, in the regions with medium LD (LD-M), we observed a uniform increase of power in all three methods, and SIM1 has a slightly greater power. The power gain was more pronounced for high-LD regions (LD-H), where SIM1 showed more power than the other two methods.
[Figure 1 panels: G × E (0,0,0,0), G × E (0.2,0.2,0,0), G (0,0,0,0), and Joint (0,0,0,0), each at α = 0.05, 0.005, and 0.0005; x axis: LD level (H, M, L) within q = 0.1 and q = 0.3; y axis: type I error rate; methods: SIM1, SIM2, SIM0, SNP, HAP.]
Figure 1. Type I Error Rates of the Proposed Methods
The type I error rates are shown on the scale of $10^{-2}$, $10^{-3}$, and $10^{-4}$ for nominal level $\alpha = 0.05$, 0.005, and 0.0005, respectively. The regions are randomly selected from chromosome 21 to represent six different scenarios listed on the x axis: two levels of disease allele frequencies ($q = 0.1$ and 0.3) combined with three levels of LD pattern (high, medium, and low). A high-LD value reflects stronger correlation between the observed markers and the two unobserved risk loci. The panel titles indicate the value of $(\gamma_{G1}, \gamma_{G2}, \gamma_{GE1}, \gamma_{GE2})$, that is the effect sizes of the main genetic effects and gene-environment interactions at the two risk loci used in generating simulated data. Each of the type I error rates is calculated on the basis of 50,000 replications for $(\gamma_{G1}, \gamma_{G2}, \gamma_{GE1}, \gamma_{GE2}) = (0, 0, 0, 0)$ and 20,000 replications for $(0.2, 0.2, 0, 0)$. The type I error rates for HAP-G at $\alpha = 0.0005$ are given below as some are beyond the plotting range: (0.00454, 0.00266, 0.0023, 0.00158, 0.00794, and 0.00072).

To understand the impact of different weighting schemes in the similarity regression, we repeated the same analysis with SIM1, SIM2, and SIM0 (Figure 5). Because the overall patterns were similar across different tests, we present the results from the G × E and G tests. Figure 5 presents the box plots of power for the same regions as shown previously, except that panels (C) and (D) in Figures 2–4 were grouped together to represent common-variant scenarios. We also marked the corresponding average power of SNP (solid line) and HAP (dotted line) for comparison. We observed the following features: (1) SIM0 and SIM2 had very similar power in almost all situations; (2) when risk alleles are common (i.e., [C] and [F]), SIM2 and SIM0 had similar or slightly better power than SIM1, although the difference was not very obvious; and (3) when the risk alleles are uncommon or rare, SIM1 started to gain some traction in improving power. The power improvement became more substantial for rarer alleles. For example, in situations with a moderate LD level, SIM1 had higher power than SNP and HAP, whereas SIM2 and SIM0 did not.
Application to Real Data
We applied the similarity regression on samples collected from the VISP trial. VISP was a multicenter, double-blind, randomized, controlled clinical trial that aimed to study the effect of vitamins on preventing recurrent stroke. The VISP trial was conducted under institutional review board approval at the Wake Forest University School of Medicine and at each of the clinic sites and adhered to the tenets of the Declaration of Helsinki. Written informed consent was obtained from all patients participating in the study. The trial enrolled patients who were 35 or older with a nondisabling cerebral infarction [MIM 601367] within 120 days of randomization and Hcy levels in the top quartile for the U.S. population. Subjects were randomly assigned to receive daily doses of either a high-dose formulation (containing 25 mg vitamin B6, 0.4 mg vitamin B12, and 2.5 mg folic acid) or a low-dose formulation (containing 200 μg vitamin B6, 6 μg vitamin B12, and 20 μg folic acid). The patients were followed up for a maximum of 2 years, and the average follow-up time was 1.7 years. About 2100 of the VISP participants provided DNA samples, and genotype information was collected from candidate genes selected across the genome that are involved in homocysteine metabolism, stroke risk, and atherosclerosis [MIM 209010]. After quality control, the dataset consists of 1944 subjects and genotypes of 1393 SNPs collected from 215 candidate genes. More details on the VISP trial and VISP genetic study can be found in Toole et al.48 and Hsu et al.,49 respectively.
[Figure 2 panels: rows LD-H, LD-M, LD-L; columns (A) both < 0.05, (B) sum < 0.3, (C) sum 0.3~0.6, (D) sum > 0.6; each panel shows box plots of power (y axis, 0.0 to 1.0) for SNP, SIM1, and HAP.]
Figure 2. Boxplot of Power of G × E Test from the 1734 Regions on Chromosome 21
The × sign indicates the average power. The power at a region is calculated on the basis of 100 replications at a nominal level of 0.0005. The results are grouped into 12 categories on the basis of frequencies of the risk alleles and LD patterns. The risk allele frequencies from rare to common are categorized as (A) both allele frequencies < 0.05; (B) sums of allele frequencies < 0.3 but excluding (A); (C) sums of allele frequencies between 0.3 and 0.6; and (D) sums of allele frequencies > 0.6. The clustering of LD patterns is done according to the following thresholds: average $R^2 > 0.6$ for high (LD-H), average $R^2 \in (0.25, 0.6)$ for medium (LD-M), and average $R^2 < 0.25$ for low (LD-L).

Our analysis here focused on the genetic influence on the Hcy level obtained from a 2 hr methionine load test measured at baseline. It has been suggested that Hcy level can be used to predict risk of recurrent stroke and symptomatic coronary heart disease, and mild to moderate hyperhomocystinemia [MIM 603174] might be attributed to genetic variations. Given that the Hcy level tends to increase with age, we also investigated the potential gene-age interaction effects on Hcy. We conducted gene-based analyses; we used the proposed SIM1 method to assess the significance level of each gene and compared it to the available benchmark methods (SNP and/or HAP). As in the original study,49 we adjusted for age, sex, and race in each analysis. The Bonferroni threshold for p value is $0.05/215 = 2.33 \times 10^{-4}$.
Wefirstusedthejointtesttoperformagenebasedscanto
evaluate the gene and geneage effects on the change in
postmethionine load Hcy level (i.e., postmethionine load
test Hcy ? baseline fasting Hcy). If a gene is rejected by
a joint test, the G 3 E and G tests can be used to further
refine the sources of identified signals. The joint test is a
suitable screening tool for scenarios in which the under
lying geneage interaction mechanism is little known32,41
because it assesses the genetic main effect and geneage
interactions simultaneously. The p values of the testing
results for each gene (sorted by gene names) are shown in
Figure 6. For joint tests, one gene was found to be
significant (CBS [MIM 613381]), and both SIM1 and SNP
testsyieldsignificantpvalues.ThepvalueoftheSIM1joint
is 2:46310?5, and the followup analysis reveals that the
signal is caused by the genetic main effect instead of gene
age interactions: the p value of SIM G 3 E is 0:614, and
the p value of SIM G is 1:99310?6. The SNP joint test has
the adjusted minimum p value (adjusted for the 10 typed
SNPsin CBS) of2:06310?5. Theadjusted minimum pvalue
is obtained by 1 ? ð1 ? raw p valueÞke f fwhere ke f f¼ 7:59
istheeffectivenumberofindependenttestsestimatedwith
the method of Moskvina and Schmidt45after accounting
for the LD in CBS. The adjusted minimum p value for the
SNP G 3 E test is 0.700, and for SNP G test it is
9:42310?6. Finally, the HAP G 3 E test yielded a p value
of 0:362, and HAP G test yielded a significant p value of
1:02310?5.VariantsinCBShavepreviouslybeenassociated
with postmethionine load Hcy levels and change in Hcy
levels.49–52A common 68 bp insertion at the intron
7exon 8 boundary of CBS and the 31 bp variable number
of tandem repeats (VNTR) might be genetic determinants
of postmethionine load Hcy levels. Because postmethio
nine load Hcy levels are found to have an increased risk
for cardiovascular disease, CBS could be also considered
a risk factor for cardiovascular disease.
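The two multiplicity corrections used above can be sketched in a few lines of Python. This is only an illustration of the arithmetic stated in the text; the function names and the raw p value in the usage example are hypothetical and are not taken from the authors' software.

```python
import math


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def adjusted_min_p(raw_min_p: float, k_eff: float) -> float:
    """Adjusted minimum p value: 1 - (1 - raw p value)^k_eff.

    k_eff is the effective number of independent tests. The expression is
    evaluated with log1p/expm1, which is algebraically identical but keeps
    precision when the raw p value is very small.
    """
    return -math.expm1(k_eff * math.log1p(-raw_min_p))


if __name__ == "__main__":
    # Threshold for the 215 genes scanned in the text.
    print(bonferroni_threshold(0.05, 215))
    # Hypothetical raw minimum p value, adjusted with k_eff = 7.59.
    print(adjusted_min_p(1e-6, 7.59))
```

For small raw p values the adjustment is close to multiplying the raw p value by $k_{eff}$, so it is less conservative than a Bonferroni correction over all 10 typed SNPs.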
Discussion
Association analyses at the gene, pathway, and exon
levels (hereafter marker-set analysis) hold great promise in
[Figure 3: boxplots of power (y axis, 0.0 to 1.0) for the SNP, SIM1, and HAP methods. Columns: (A) both < 0.05, (B) sum < 0.3, (C) sum 0.3 to 0.6, (D) sum > 0.6. Rows: LD-H, LD-M, LD-L.]
Figure 3. Boxplot of Power of G Test from the 1734 Regions on Chromosome 21
The × sign indicates the average power. The power at a region is calculated on the basis of 100 replications at a nominal level 0.0005. The results are grouped into 12 categories on the basis of frequencies of the risk alleles and LD patterns. The risk allele frequencies from rare to common are categorized as (A) both allele frequencies < 0.05; (B) sums of allele frequencies < 0.3 but excluding (A); (C) sums of allele frequencies between 0.3 and 0.6; and (D) sums of allele frequencies > 0.6. The clustering of LD patterns is done according to the following thresholds: average $R^2 > 0.6$ for high (LD-H), average $R^2 \in (0.25, 0.6)$ for medium (LD-M), and average $R^2 < 0.25$ for low (LD-L).
The American Journal of Human Genetics 89, 277–288, August 12, 2011
evaluating modest etiological effects of genes with data from genome-wide association studies (GWAS) or next-generation sequencing. However, currently available methods tend to target either rare or common variants but not both, assume same-direction effects for loci within a marker set, use a testing framework that cannot accommodate covariates, or do not have the capacity to assess interaction effects. In this article, we propose a flexible, powerful, and computationally efficient method to conduct marker-set analysis for assessing gene and gene-environment interactions on quantitative traits. The proposed method is constructed via a similarity regression framework under which we regress trait similarity on genetic similarity. The framework incorporates interaction effects, can adjust for covariates, and is applicable to both observed and imputed dosage genotypes. We develop a series of statistical tests that can be used for genetic marginal main effects, G × E interactions, or the joint effect of the two. We demonstrated that a similarity regression is equivalent to a haplotype random-effects model. The equivalence enabled us to analytically derive the asymptotic distributions of the test statistics and provide a permutation-free procedure to assess significance. The software implementing the proposed methods is available at the authors' website (see Web Resources).
The proposed method uses genetic similarity to aggregate information across markers and integrates adaptive weights dependent on allele frequencies to accommodate common and uncommon variants. Collapsing information at the similarity level instead of the genotype level avoids canceling signals with opposite etiological effects and is applicable to any class of genetic variants without having to dichotomize the allele types. As demonstrated in the simulation, incorporating frequency weights gives the method satisfactory power for detecting both common and uncommon variants. The simulation results also reveal that its performance is sensitive to the signal-to-noise ratio (e.g., LD) among all loci included in the marker-set analysis. The higher the ratio is, the greater the power gain for the proposed methods. As discussed in the next paragraph, it is possible to increase the signal-to-noise ratio to maximize the chance of power gain, such as by using functional, biological, or LD information to downweight the contribution from noise markers. In practice, the underlying LD levels are not known and will vary from region to region, so it is unlikely that one could choose
[Figure 4: boxplots of power (y axis, 0.0 to 1.0) for the SNP and SIM1 methods. Columns: (A) both < 0.05, (B) sum < 0.3, (C) sum 0.3 to 0.6, (D) sum > 0.6. Rows: LD-H, LD-M, LD-L.]
Figure 4. Boxplot of Power of Joint Test from the 1734 Regions on Chromosome 21
The × sign indicates the average power. The power at a region is calculated on the basis of 100 replications at a nominal level 0.0005. The results are grouped into 12 categories on the basis of frequencies of the risk alleles and LD patterns. The risk allele frequencies from rare to common are categorized as (A) both allele frequencies < 0.05; (B) sums of allele frequencies < 0.3 but excluding (A); (C) sums of allele frequencies between 0.3 and 0.6; and (D) sums of allele frequencies > 0.6. The clustering of LD patterns is done according to the following thresholds: average $R^2 > 0.6$ for high (LD-H), average $R^2 \in (0.25, 0.6)$ for medium (LD-M), and average $R^2 < 0.25$ for low (LD-L).
one best-performing method in advance. In addition, in GWAS, the low-LD scenario would occur less frequently by design, and in sequencing studies the number of risk loci in a set should be higher than what we considered in the simulation. Given these considerations, the proposed method can serve as a sensible and robust tool for evaluating association of complex traits in whole-genome marker-set analyses.
The inclusion of nonfunctional loci (i.e., nonrisk markers that are not in LD with the risk loci) is a major factor influencing the performance of all marker-set approaches. Intelligently incorporating LD information and biological knowledge into the collapsing process and downweighting the contribution of nonfunctional markers will be a useful solution. In our framework, biological and functional information, as pioneered and comprehensively reviewed in Price et al.10 and Schaid,38 can be naturally incorporated through the weight matrix, $W_m$. One unique feature of our weighting framework is that it allows functional weights at the allele-specific level (as opposed to the locus-specific level), such as the impact of a specific mutation sequence on protein functions, structures, or stability. We are exploring mechanisms to include genomic knowledge on the basis of functionality, biological pathways, and system biological networks.
One key requirement for the proposed method to have power for both common and uncommon variants is that the similarity level be weighted by allele frequency at order $k$ (i.e., $q^{-k}$). Although the principle is to upweight similarities that are contributed by rare variants, there are no clear rules for what the specific form of the weights should be as a function of the allele frequencies. Kwee et al.23 considered both $k = 1$ and $k = 1/2$ when calculating the IBS kernel and concluded that the former might be too strong and the latter is more suitable in their setting. Wu et al.24 therefore used $k = 1/2$ in their work. When aggregating information of multiple loci through weighted genotype sum, Madsen and Browning8 considered their weights in the order of $k = 1/2$ from the binomial standard deviation (SD) viewpoint. Here, we evaluated these different choices of $k$ under our framework
[Figure 5: boxplots of power (y axis, 0.0 to 1.0) for SIM1, SIM2, and SIM0. Columns A to C: G × E test with both < 0.05, sum < 0.3, and sum > 0.3; columns D to F: G test with the same frequency categories. Rows: LD-H, LD-M, LD-L.]
Figure 5. Boxplot of Power of G × E Test and G Test with Different Weights (SIM1, SIM2, and SIM0) from the 1734 Regions on Chromosome 21
The × sign indicates the average power of the method shown on the x axis. The solid and dotted lines indicate the average power of the SNP test and HAP test, respectively. The power at a region is calculated on the basis of 100 replications at a nominal level 0.0005. The results are grouped into nine categories on the basis of frequencies of the risk alleles and LD patterns. The risk allele frequencies from rare to common are categorized as (A and D) both allele frequencies < 0.05; (B and E) sums of allele frequencies < 0.3 but excluding (A) and (D); and (C and F) sums of allele frequencies > 0.3. The clustering of LD patterns is done according to the following thresholds: average $R^2 > 0.6$ for high (LD-H), average $R^2 \in (0.25, 0.6)$ for medium (LD-M), and average $R^2 < 0.25$ for low (LD-L).
(i.e., SIM1 [$k = 1$], SIM2 [$k = 1/2$], and SIM0 [$k = 0$]). We found that SIM2 might be too mild and tends to yield results similar to those of the unweighted SIM0. One main difference between our weighting framework and others is that we assign weights for every allele, whereas others only assign weights for minor alleles. To illustrate the impact of the difference, consider the similarity score between a heterozygous pair. Our weights yield a score of $q_{minor}^{-1/2} + q_{major}^{-1/2}$, whereas those weights placed only on minor alleles yield a bigger score of $q_{minor}^{-1/2} \times 2$ and give a stronger weighting effect.
Simulation results also suggest that larger values of $k$ can greatly boost power for detecting rare variants, but they also risk losing power when the risk variant is common. We focused on SIM1 on the basis of its superior power for rare variants and comparable power for common variants. It is possible that the optimal weights would lie somewhere between $k = 1$ and $k = 1/2$, and we are investigating further how to identify an optimal order. Alternatively, one can use centered genotype scoring to account for sharing of rarer alleles.53 To center the allele count vector $G_{m,i}$, we define $G^{*}_{m,i} = G_{m,i} - G_m$, where $G_m$ is the vector of population allele frequency for marker $m$. Then the similarity score $S^{*}_{ij}$ is obtained by $\frac{1}{M} \times \sum_{m=1}^{M} G^{*T}_{m,i} G^{*}_{m,j}$. The centering strategy bypasses the need for allele frequency-dependent weights and hence avoids the choice of an order $k$. Studies to understand the pros and cons of centering versus weighting strategies are underway.
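The heterozygous-pair comparison and the centered similarity score can be illustrated with a small Python sketch. It only evaluates the two expressions given in the text; it is not the authors' implementation of the full similarity metric. The function names are hypothetical, the marker in the first function is assumed to be biallelic (so that the major allele frequency is one minus the minor allele frequency), and the centering vector is passed in by the caller.

```python
from typing import Sequence, Tuple


def heterozygous_pair_scores(q_minor: float, k: float = 0.5) -> Tuple[float, float]:
    """Scores for a heterozygous pair under the two weighting schemes.

    Returns (every_allele, minor_only):
      every_allele = q_minor^(-k) + q_major^(-k)   (weights on every allele)
      minor_only   = 2 * q_minor^(-k)              (weights on minor alleles only)
    Assumes a biallelic marker, so q_major = 1 - q_minor.
    """
    q_major = 1.0 - q_minor
    every_allele = q_minor ** (-k) + q_major ** (-k)
    minor_only = 2.0 * q_minor ** (-k)
    return every_allele, minor_only


def centered_similarity(
    g_i: Sequence[Sequence[float]],
    g_j: Sequence[Sequence[float]],
    g_center: Sequence[Sequence[float]],
) -> float:
    """Centered score S*_ij = (1/M) * sum_m (G_mi - G_m)^T (G_mj - G_m).

    g_i[m] and g_j[m] are the allele count vectors of individuals i and j at
    marker m; g_center[m] is the centering vector G_m for that marker.
    """
    n_markers = len(g_center)
    total = 0.0
    for gi_m, gj_m, gc_m in zip(g_i, g_j, g_center):
        total += sum((a - c) * (b - c) for a, b, c in zip(gi_m, gj_m, gc_m))
    return total / n_markers


if __name__ == "__main__":
    # A rare minor allele (hypothetical frequency 0.04) with k = 1/2.
    print(heterozygous_pair_scores(0.04))
    # One biallelic marker: a heterozygote versus a homozygote.
    print(centered_similarity([[1, 1]], [[2, 0]], [[0.5, 0.5]]))
```

For a rare minor allele the minor-only score is dominated by the minor allele term twice over, which is why it exceeds the every-allele score and acts as the stronger weighting.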
Appendix A: Expectation-Maximization Algorithm for the REML Estimates of $\tau$ and $\sigma$ When Testing for G × E $H_0: \phi = 0$
Let $u = K^{T}Y$ be a set of $n - d$ linearly independent contrasts of $Y$ with $KK^{T} = I - X(X^{T}X)^{-1}X^{T}$ and $K^{T}K = I_{(n-d) \times (n-d)}$. Then the conditional distribution of $u$ given $\beta$, denoted by $f(u \mid \beta)$, is normal with mean $K^{T}H\beta$ and variance $\sigma I$ and does not depend on the fixed effect $\gamma$. Therefore, the REML estimations of $\tau$ and $\sigma$ can be based on its marginal distribution $f(u) = \int f(u \mid \beta) f(\beta)\, d\beta$. This motivated an expectation-maximization algorithm based on observed data $u$ and missing data $\beta$. The complete-data log likelihood is given by
[Figure 6: three panels (Joint Test, GxAge Test, G Test) plotting $-\log_{10}$(p value) (y axis, 0 to 6) against gene ID (x axis, 1 to 200, with gene ID 39 marked) for the SIM1, SNP, and HAP methods.]
Figure 6. p Values with Negative Log 10 Transformation for the VISP Trial Analysis
The x axis shows the gene IDs sorted by the alphabetic order of the gene names, and gene ID 39 is CBS. The red line indicates results for SIM1, + for the SNP method, and × for the HAP method. The results for the SNP methods are based on the adjusted minimum p values that adjust for the multiple SNPs in a gene. The adjusted minimum p value is obtained by $1 - (1 - \text{raw } p \text{ value})^{k_{eff}}$, where $k_{eff}$ is the effective number of independent tests estimated with the method of Moskvina and Schmidt45 after accounting for the LD among SNPs in a gene. A few genes are not plotted on the graph for the HAP methods because of convergence failure at these locations. This failure is mostly attributed to an excessive number of SNPs in the gene.
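$$\log f(u, \beta; \tau, \sigma) = \log f(u \mid \beta; \tau, \sigma) + \log f(\beta; \tau, \sigma)$$
$$= -\frac{n-d}{2}\log\sigma - \frac{1}{2\sigma}\left(u - K^{T}H\beta\right)^{T}\left(u - K^{T}H\beta\right) - \frac{L}{2}\log\tau - \frac{1}{2}\log|R| - \frac{1}{2\tau}\beta^{T}R^{-1}\beta.$$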
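In the expectation step (E-step), we compute $Q(\tau, \sigma; \hat\tau^{(t)}, \hat\sigma^{(t)})$, the conditional expected value of $\log f(u, \beta; \tau, \sigma)$ given the observed data $u$ assuming $(\tau, \sigma) = (\hat\tau^{(t)}, \hat\sigma^{(t)})$, where $\hat\tau^{(t)}$ and $\hat\sigma^{(t)}$ are the estimates at the $t$th iteration.
$$Q\left(\tau, \sigma; \hat\tau^{(t)}, \hat\sigma^{(t)}\right) \equiv E\left[\log f(u, \beta; \tau, \sigma) \mid u, \hat\tau^{(t)}, \hat\sigma^{(t)}\right]$$
$$= -\frac{n-d}{2}\log\sigma - \frac{1}{2\sigma}E\left[\left(u - K^{T}H\beta\right)^{T}\left(u - K^{T}H\beta\right) \mid u, \hat\tau^{(t)}, \hat\sigma^{(t)}\right] - \frac{L}{2}\log\tau - \frac{1}{2}\log|R| - \frac{1}{2\tau}E\left[\beta^{T}R^{-1}\beta \mid u, \hat\tau^{(t)}, \hat\sigma^{(t)}\right].$$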
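In the maximization step (M-step), we solve for $\partial Q/\partial\tau = 0$ and $\partial Q/\partial\sigma = 0$ and obtain
$$\hat\tau^{(t+1)} = \frac{1}{L}E\left[\beta^{T}R^{-1}\beta \mid u, \hat\tau^{(t)}, \hat\sigma^{(t)}\right] = \frac{1}{L}\left\{\tilde\beta^{T}R^{-1}\tilde\beta + \mathrm{tr}\left(R^{-1}\widetilde{W}\right)\right\},$$
and
$$\hat\sigma^{(t+1)} = \frac{1}{n-d}E\left[\left(u - K^{T}H\beta\right)^{T}\left(u - K^{T}H\beta\right) \mid u, \hat\tau^{(t)}, \hat\sigma^{(t)}\right] = \frac{1}{n-d}\left\{\left(Y - H\tilde\beta\right)^{T}A\left(Y - H\tilde\beta\right) + \mathrm{tr}\left(H^{T}AH\widetilde{W}\right)\right\}.$$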
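The second equality in each M-step update can be read as an application of the standard identity for the expectation of a quadratic form. The sketch below spells this out under the assumption that $\tilde\beta$ and $\widetilde{W}$ denote the conditional mean and variance of $\beta$ given $u$ (as defined in the text that follows) and that $M$ is any fixed symmetric matrix; $M$ is a symbol introduced here only for illustration.

```latex
% Expectation of a quadratic form, conditional on u:
E\left[\beta^{T} M \beta \mid u\right]
  = \tilde\beta^{T} M \tilde\beta + \mathrm{tr}\left(M \widetilde{W}\right).

% tau update: take M = R^{-1}.
E\left[\beta^{T} R^{-1} \beta \mid u\right]
  = \tilde\beta^{T} R^{-1} \tilde\beta + \mathrm{tr}\left(R^{-1} \widetilde{W}\right).

% sigma update: since u = K^{T} Y and K K^{T} = A,
\left(u - K^{T} H \beta\right)^{T} \left(u - K^{T} H \beta\right)
  = \left(Y - H\beta\right)^{T} A \left(Y - H\beta\right),
% and taking the conditional expectation over beta gives
E\left[\left(Y - H\beta\right)^{T} A \left(Y - H\beta\right) \mid u\right]
  = \left(Y - H\tilde\beta\right)^{T} A \left(Y - H\tilde\beta\right)
    + \mathrm{tr}\left(H^{T} A H \widetilde{W}\right).
```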
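In the above equations, $A = KK^{T} = I - X(X^{T}X)^{-1}X^{T}$, $\tilde\beta \equiv E(\beta \mid u; \hat\tau^{(t)}, \hat\sigma^{(t)}) = \tau R H^{T} P_1 Y$, and $\widetilde{W} \equiv \mathrm{var}(\beta \mid u; \hat\tau^{(t)}, \hat\sigma^{(t)}) = \tau R - \tau^{2} R H^{T} P_1 H R$. The conditional moments of $\beta$ given $u$ are obtained directly from the normality of the joint distribution of $(u, \beta)$. The calculation of the projection matrix $P_1$ requires inverting the $n \times n$ nonsparse matrix $V_1$, which can be computationally burdensome. To speed up the computation, we rewrite
$$V_1 = \tau S + \sigma I = \sigma\left\{I + \frac{\tau}{\sigma}S\right\} = \sigma\left\{I + \frac{\tau}{\sigma}E\Lambda E^{T}\right\},$$
where $S = E\Lambda E^{T}$ is the eigenvalue decomposition of matrix $S$. Then by the fact that $(I + B_1 B_2)^{-1} = I - B_1(I + B_2 B_1)^{-1}B_2$, we can rewrite
$$V_1^{-1} = \frac{1}{\sigma}\left\{I - \frac{\tau}{\sigma}E\Lambda\left[I + E^{T}\frac{\tau}{\sigma}E\Lambda\right]^{-1}E^{T}\right\} = \frac{1}{\sigma}\left\{I - \tau E\Lambda\left[\sigma I + \tau E^{T}E\Lambda\right]^{-1}E^{T}\right\},$$
in which the calculation involves only an inversion of an $L \times L$ matrix.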
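The inversion identity used above can be checked by direct multiplication. The short verification below is a sketch, assuming only that $B_1$ is $n \times L$, $B_2$ is $L \times n$, and $I + B_2 B_1$ is invertible; in the application to $V_1$, $B_1 = (\tau/\sigma)E\Lambda$ and $B_2 = E^{T}$, with $\Lambda$ denoting the diagonal matrix of eigenvalues.

```latex
\left(I + B_1 B_2\right)\left[I - B_1\left(I + B_2 B_1\right)^{-1} B_2\right]
  = I + B_1 B_2
    - B_1\left(I + B_2 B_1\right)^{-1} B_2
    - B_1 B_2 B_1\left(I + B_2 B_1\right)^{-1} B_2
\\
  = I + B_1 B_2
    - B_1\left(I + B_2 B_1\right)\left(I + B_2 B_1\right)^{-1} B_2
  = I + B_1 B_2 - B_1 B_2
  = I.
```

Because $I + B_2 B_1$ is $L \times L$, the only matrix that has to be inverted numerically has the dimension of the random effects rather than the sample size $n$.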
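Appendix B: Derivation of the Score Test Statistics and Their Asymptotic Distribution
For quantitative traits that follow a normal distribution directly or after appropriate transformations, model (Equation 2) reduces to the following linear mixed model (LMM) in matrix notation:
$$Y = 1\gamma_0 + X\gamma + H\beta + DH\lambda + \varepsilon, \quad \text{with } \beta \sim N(0, \tau R),\ \lambda \sim N(0, \phi R),\ \text{and } \varepsilon \sim N(0, \sigma I) \qquad \text{(Equation 4)}$$
where $Y^{T} = [Y_1, \ldots, Y_n]$, $1$ is an $n \times 1$ vector of 1s, $X^{T} = [X_1, \ldots, X_n]$, $D = \mathrm{diag}\{X_i\}$, and $\varepsilon^{T}_{n \times 1} = [e_1, \ldots, e_n]$. Because our primary interest is to test the variance components $\phi$ and $\tau$, we consider the restricted maximum likelihood (REML) log-likelihood function of the variance components $(\tau, \phi, \sigma)$: $\ell_{REML}(\tau, \phi, \sigma) = -\{\log|V| + \log|X^{T}V^{-1}X| + Y^{T}PY\}/2$, where $V$ is the marginal variance of $Y$ and $V = \phi S_{G \times E} + \tau S + \sigma I$, where $S = HRH^{T}$ and $S_{G \times E} = DSD$ (the subscript distinguishes the G × E similarity matrix from $S$); $P = V^{-1} - V^{-1}X(X^{T}V^{-1}X)^{-1}X^{T}V^{-1}$ is the projection matrix for the LMM (4).
Let $U_\phi(\phi, \tau, \sigma)$ and $U_\tau(\phi, \tau, \sigma)$ denote the score functions based on the REML function for $\phi$ and $\tau$, respectively. Simple algebra54 shows that under $H_0: \phi = 0$,
$$U_\phi\left(0, \hat\tau, \hat\sigma\right) = \left.\frac{\partial \ell_{REML}(\tau, \phi, \sigma)}{\partial \phi}\right|_{\phi = 0,\ \tau = \hat\tau,\ \sigma = \hat\sigma} = \frac{1}{2}\left\{Y^{T}P_1 S_{G \times E} P_1 Y - \mathrm{tr}\left(P_1 S_{G \times E}\right)\right\}, \qquad \text{(Equation 5)}$$
and under $H_0: \tau = 0$ (and with the constraint of $\phi = 0$),
$$U_\tau\left(0, 0, \tilde\sigma\right) = \left.\frac{\partial \ell_{REML}(\tau, \phi, \sigma)}{\partial \tau}\right|_{\phi = 0,\ \tau = 0,\ \sigma = \tilde\sigma} = \frac{1}{2}\left\{Y^{T}P_0 S P_0 Y - \mathrm{tr}\left(P_0 S\right)\right\}. \qquad \text{(Equation 6)}$$
In the above equations, $(\hat\tau, \hat\sigma)$ are the REML estimates of $(\tau, \sigma)$ under $H_0: \phi = 0$ as given in Appendix A, and $\tilde\sigma$ is the REML estimate of $\sigma$ when $\tau = \phi = 0$. Recall that $P_t = V_t^{-1} - V_t^{-1}X(X^{T}V_t^{-1}X)^{-1}X^{T}V_t^{-1}$, where $t \in \{0, 1\}$, $V_1 = \tau S + \sigma I$, and $V_0 = \sigma I$.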
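The algebra behind Equations 5 and 6 can be sketched with the standard derivative rules for a REML log likelihood. In the block below, $\theta$ stands for either variance component and $V_\theta = \partial V / \partial \theta$; these two symbols, and the label $S_{G \times E} = DSD$ for the G × E similarity matrix, are notation assumed here for illustration.

```latex
% Derivatives of the three pieces of the REML log likelihood:
\frac{\partial}{\partial\theta}\left\{\log|V| + \log\left|X^{T}V^{-1}X\right|\right\}
  = \mathrm{tr}\left(V^{-1}V_\theta\right)
    - \mathrm{tr}\left(V^{-1}X\left(X^{T}V^{-1}X\right)^{-1}X^{T}V^{-1}V_\theta\right)
  = \mathrm{tr}\left(P V_\theta\right),
\qquad
\frac{\partial P}{\partial\theta} = -P V_\theta P.

% Hence the score for theta is
\frac{\partial \ell_{REML}}{\partial\theta}
  = \frac{1}{2}\left\{Y^{T} P V_\theta P Y - \mathrm{tr}\left(P V_\theta\right)\right\}.

% With V = \phi S_{G \times E} + \tau S + \sigma I:
\frac{\partial V}{\partial\phi} = S_{G \times E},
\qquad
\frac{\partial V}{\partial\tau} = S.
```

Substituting $V_\phi = S_{G \times E}$ with $P = P_1$, and $V_\tau = S$ with $P = P_0$, gives the two score expressions.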
Null Distribution of the Score Statistics for G × E Test and G Test
As shown in Tzeng and Zhang,22 the score statistics under $H_0$ are not asymptotically normal because the design matrix $H$ for the random effects $\beta$ is not block diagonal and the dimension of $\beta$ is fixed. We thus use the first terms of the score statistics as the testing statistics and obtain $T_{G \times E} = Y^{T}P_1 S_{G \times E} P_1 Y/2$ and $T_G = Y^{T}P_0 S P_0 Y/2$, where $S_{G \times E} = DSD$ denotes the G × E similarity matrix. Below we derive the asymptotic null distribution of $T_{G \times E}$, and similar steps can be used to obtain the distribution for $T_G$. If $\mu = 1\gamma_0 + X\gamma$ and $Z = V_1^{-1/2}(Y - \mu)$, then $Z$ follows a standard multivariate normal distribution. We rewrite $T_{G \times E} = Z^{T}\left(\frac{1}{2}V_1^{1/2}P_1 S_{G \times E} P_1 V_1^{1/2}\right)Z \equiv Z^{T}C_{G \times E}Z$, which is true because $\mu^{T}P_1 = 0$ by the fact of $P_1$ being a projection matrix. Define $e_i$ and $\eta_i$, the eigenvector and eigenvalue of matrix $C_{G \times E}$, respectively. Then $T_{G \times E} = \sum_{i=1}^{c}\eta_i(e_i^{T}Z)^2 \equiv \sum_{i=1}^{c}\eta_i\tilde{Z}_i^2$, where each $\tilde{Z}_i^2$ follows a 1 degree-of-freedom chi-square distribution. In reality, $(\tau, \sigma)$ is evaluated at their restricted maximum likelihood estimates $(\hat\tau, \hat\sigma)$. Following Tzeng and Zhang,22 the distribution of $T_{G \times E}$ can be approximated by the distribution of $\sum_{i=1}^{c}\hat\eta_i\chi^2_{i1}$, where the $\hat\eta_i$'s are the nonzero
eigenvalues of matrix $C_{G \times E}|_{\tau = \hat\tau,\ \sigma = \hat\sigma}$. The distribution of $T_{G \times E}$ can be approximated by the three-moment approximation method of Pearson.43 The level $\alpha$ significance threshold is estimated by $\kappa_1 + (\chi_\alpha - h')\sqrt{\kappa_2/h'}$, where $\kappa_j = \sum_i \eta_i^{j}$, $h' = \kappa_2^3/\kappa_3^2$, and $\chi_\alpha$ is the $\alpha$th quantile of $\chi^2_{h'}$ (i.e., chi-square distribution with $h'$ degrees of freedom). Alternatively, one can report the p value of the observed statistic $T_{G \times E}$ by $P(\chi^2_{h'} > \chi^{*})$, where $\chi^{*} = (T_{G \times E} - \kappa_1)\sqrt{h'/\kappa_2} + h'$.
By the same manner, the distribution of $T_G$ can also be approximated by the three-moment approximation as above, except that the eigenvalues $\eta_i$'s are obtained from matrix $C_G = \frac{1}{2}V_0^{1/2}P_0 S P_0 V_0^{1/2}|_{\sigma = \tilde\sigma}$.
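The p value formula of the three-moment approximation can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' software: the function names are hypothetical, the eigenvalues are assumed to have been computed elsewhere, and the chi-square tail probability is obtained from a simple series for the regularized incomplete gamma function, which is adequate for illustration but loses relative accuracy for extremely small tail probabilities.

```python
import math
from typing import Sequence


def chi2_sf(x: float, df: float) -> float:
    """Upper tail P(chi-square with df degrees of freedom > x).

    Uses the series for the regularized lower incomplete gamma function
    P(a, z) with a = df / 2 and z = x / 2; df may be non-integer.
    """
    if x <= 0.0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a          # n = 0 term of sum_n z^n / (a (a+1) ... (a+n))
    total = term
    n = 0
    while abs(term) > 1e-16 * abs(total) and n < 100000:
        n += 1
        term *= z / (a + n)
        total += term
    lower = total * math.exp(a * math.log(z) - z - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def three_moment_p_value(t_obs: float, eigenvalues: Sequence[float]) -> float:
    """P value of t_obs for a weighted sum of 1-df chi-squares.

    kappa_j = sum_i eta_i^j, h' = kappa_2^3 / kappa_3^2, and
    chi* = (t_obs - kappa_1) * sqrt(h' / kappa_2) + h', as in the text.
    """
    nonzero = [e for e in eigenvalues if abs(e) > 1e-12]
    k1 = sum(nonzero)
    k2 = sum(e ** 2 for e in nonzero)
    k3 = sum(e ** 3 for e in nonzero)
    h = k2 ** 3 / k3 ** 2
    chi_star = (t_obs - k1) * math.sqrt(h / k2) + h
    return chi2_sf(chi_star, h)


if __name__ == "__main__":
    # Hypothetical eigenvalues and observed statistic.
    print(three_moment_p_value(3.0, [0.5, 0.5]))
```

When all nonzero eigenvalues are equal, the statistic is an exact scaled chi-square and the approximation reproduces its tail probability, which gives a convenient sanity check.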
Null Distribution of the Score Statistics for Joint Test
The test statistic for the joint hypothesis $H_0: \phi = \tau = 0$ is $T_{joint} = T_G + T^{(0)}_{G \times E}$, where $T_G$ is defined as before and $T^{(0)}_{G \times E} = \frac{1}{2}Y^{T}P_0 S_{G \times E} P_0 Y$, i.e., $T_{G \times E}$ evaluated at $\phi = \tau = 0$ and $\sigma = \tilde\sigma$, with $S_{G \times E} = DSD$ denoting the G × E similarity matrix. A direct (unweighted) sum is used here because $X$ has been prestandardized to mean = 0 and variance = 1, and hence $T_G$ and $T_{G \times E}$ are on the same scale. We found that the performance of the unweighted sum is very similar to that of the weighted sum, $T^{wt}_{joint} = w_G \times T_G + w_{G \times E} \times T^{(0)}_{G \times E}$, where the weights $w_i = E(T_i)/\mathrm{var}(T_i)$. By a similar derivation as in the G × E test, it can be shown that the null distribution of $T_{joint}$ also has a weighted chi-square distribution and can be approximated by the three-moment approximation. The procedure is the same as that described for the G × E test, except that the eigenvalues should be obtained from the matrix $C_{joint} = \frac{1}{2}V_0^{1/2}P_0(S + S_{G \times E})P_0 V_0^{1/2}|_{\sigma = \tilde\sigma}$.
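The form of $C_{joint}$ follows from writing the unweighted sum as a single quadratic form. The lines below are a sketch of that step, assuming $S_{G \times E} = DSD$ denotes the G × E similarity matrix, $\mu = 1\gamma_0 + X\gamma$, and $Z = V_0^{-1/2}(Y - \mu)$, by analogy with the G × E derivation.

```latex
T_{joint} = T_G + T^{(0)}_{G \times E}
  = \tfrac{1}{2} Y^{T} P_0 S P_0 Y + \tfrac{1}{2} Y^{T} P_0 S_{G \times E} P_0 Y
  = \tfrac{1}{2} Y^{T} P_0 \left(S + S_{G \times E}\right) P_0 Y
\\
  = Z^{T}\left\{\tfrac{1}{2} V_0^{1/2} P_0 \left(S + S_{G \times E}\right) P_0 V_0^{1/2}\right\} Z
  = Z^{T} C_{joint} Z,
  \qquad \text{because } \mu^{T} P_0 = 0.
```

The eigenvalues of $C_{joint}$ then play the same role in the three-moment approximation as the eigenvalues of $C_{G \times E}$ do for the G × E test.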
Acknowledgments
The authors thank all the study subjects who participated in the
VISP study. They also thank the two anonymous reviewers for
their constructive comments and Alison Motsinger-Reif, Dmitri
Zaykin, and Arnab Maity for their helpful discussions. This work
was supported by National Institutes of Health grants R01
MH074027 (J.Y.T., D.Z., M.P., C.S., D.C.T., and P.F.S.), P01
CA142538 (J.Y.T.), R37 AI03178920 (D.Z.), R01 CA85848 (D.Z.),
and U01 HG005160 (M.M.S. and B.B.W.) and the Wake Forest
University General Clinical Research Center M01 RR07122
(M.M.S. and F.C.H.).
Received: December 7, 2010
Revised: June 16, 2011
Accepted: July 13, 2011
Published online: August 11, 2011
Web Resources
The URLs for data presented herein are as follows:
Jung-Ying Tzeng, http://www4.stat.ncsu.edu/~tzeng/software.php
Online Mendelian Inheritance in Man (OMIM), http://www.omim.org
References
1. De la Cruz, O., Wen, X., Ke, B., Song, M., and Nicolae, D.L. (2010). Gene, region and pathway level analyses in whole-genome studies. Genet. Epidemiol. 34, 222–231.
2. Fisher, R.A. (1932). Statistical methods for research workers (London: Oliver and Boyd).
3. Li, M., Wang, K., Grant, S.F., Hakonarson, H., and Li, C. (2009). ATOM: a powerful gene-based association test by combining optimally weighted markers. Bioinformatics 25, 497–503.
4. Wang, T., and Elston, R.C. (2007). Improved power by use of a weighted score test for linkage disequilibrium mapping. Am. J. Hum. Genet. 80, 353–360.
5. Gauderman, W.J., Murcray, C., Gilliland, F., and Conti, D.V. (2007). Testing association between disease and multiple SNPs in a candidate gene. Genet. Epidemiol. 31, 383–395.
6. Wang, K., and Abbott, D. (2008). A principal components regression approach to multilocus genetic association studies. Genet. Epidemiol. 32, 108–118.
7. Li, B., and Leal, S.M. (2008). Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am. J. Hum. Genet. 83, 311–321.
8. Madsen, B.E., and Browning, S.R. (2009). A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 5, e1000384.
9. Morgenthaler, S., and Thilly, W.G. (2007). A strategy to discover genes that carry multiallelic or monoallelic risk for common diseases: a cohort allelic sums test (CAST). Mutat. Res. 615, 28–56.
10. Price, A.L., Kryukov, G.V., de Bakker, P.I., Purcell, S.M., Staples, J., Wei, L.J., and Sunyaev, S.R. (2010). Pooled association tests for rare variants in exon-resequencing studies. Am. J. Hum. Genet. 86, 832–838.
11. Tzeng, J.Y., Byerley, W., Devlin, B., Roeder, K., and Wasserman, L. (2003). Outlier detection and false discovery rates for whole-genome DNA matching. J. Am. Stat. Assoc. 98, 236–246.
12. Tzeng, J.Y., Devlin, B., Wasserman, L., and Roeder, K. (2003). On the identification of disease mutations by the analysis of haplotype similarity and goodness of fit. Am. J. Hum. Genet. 72, 891–902.
13. Schaid, D.J., McDonnell, S.K., Hebbring, S.J., Cunningham, J.M., and Thibodeau, S.N. (2005). Nonparametric tests of association of multiple genes with human disease. Am. J. Hum. Genet. 76, 780–793.
14. Beckmann, L., Thomas, D.C., Fischer, C., and Chang-Claude, J. (2005). Haplotype sharing analysis using mantel statistics. Hum. Hered. 59, 67–78.
15. Wessel, J., and Schork, N.J. (2006). Generalized genomic distance-based regression methodology for multilocus association analysis. Am. J. Hum. Genet. 79, 792–806.
16. Dempfle, A., Hein, R., Beckmann, L., Scherag, A., Nguyen, T.T., Schäfer, H., and Chang-Claude, J. (2007). Comparison of the power of haplotype-based versus single- and multilocus association methods for gene x environment (gene x sex) interactions and application to gene x smoking and gene x sex interactions in rheumatoid arthritis. BMC Proc. 1 (Suppl 1), S73.
17. Tzeng, J.Y., Zhang, D., Chang, S.M., Thomas, D.C., and Davidian, M. (2009). Gene-trait similarity regression for multimarker-based association analysis. Biometrics 65, 822–832.
18. Mukhopadhyay, I., Feingold, E., Weeks, D.E., and Thalamuthu, A. (2010). Association tests using kernel-based measures of multilocus genotype similarity between individuals. Genet. Epidemiol. 34, 213–221.
19. Wei, Z., Li, M., Rebbeck, T., and Li, H. (2008). U-statistics-based tests for multiple genes in genetic association studies. Ann. Hum. Genet. 72, 821–833.
20. Goeman, J.J., van de Geer, S.A., de Kort, F., and van Houwelingen, H.C. (2004). A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 20, 93–99.
21. Goeman, J.J., van de Geer, S.A., and van Houwelingen, H.C. (2005). Testing against a high dimensional alternative. J. R. Stat. Soc. Series B Stat. Methodol. 68, 477–493.
22. Tzeng, J.Y., and Zhang, D. (2007). Haplotype-based association analysis via variance-components score test. Am. J. Hum. Genet. 81, 927–938.
23. Kwee, L.C., Liu, D., Lin, X., Ghosh, D., and Epstein, M.P. (2008). A powerful and flexible multilocus association test for quantitative traits. Am. J. Hum. Genet. 82, 386–397.
24. Wu, M.C., Kraft, P., Epstein, M.P., Taylor, D.M., Chanock, S.J., Hunter, D.J., and Lin, X. (2010). Powerful SNP-set analysis for case-control genome-wide association studies. Am. J. Hum. Genet. 86, 929–942.
25. Schaid, D.J. (2010a). Genomic similarity and kernel methods I: advancements by building on mathematical and statistical foundations. Hum. Hered. 70, 109–131.
26. Neale, B.M., Rivas, M.A., Voight, B.F., Altshuler, D., Devlin, B., Orho-Melander, M., Kathiresan, S., Purcell, S.M., Roeder, K., and Daly, M.J. (2011). Testing for an unusual distribution of rare variants. PLoS Genet. 7, e1001322.
27. Luan, Y., and Li, H. (2008). Group additive regression models for genomic data analysis. Biostatistics 9, 100–113.
28. Chatterjee, N., Kalaylioglu, Z., Moslehi, R., Peters, U., and Wacholder, S. (2006). Powerful multilocus tests of genetic association in the presence of gene-gene and gene-environment interactions. Am. J. Hum. Genet. 79, 1002–1016.
29. Zhao, J., Boerwinkle, E., and Xiong, M. (2005). An entropy-based statistic for genomewide association studies. Am. J. Hum. Genet. 77, 27–40.
30. Dempfle, A., Scherag, A., Hein, R., Beckmann, L., Chang-Claude, J., and Schäfer, H. (2008). Gene-environment interactions for complex traits: definitions, methodological requirements and challenges. Eur. J. Hum. Genet. 16, 1164–1172.
31. Thomas, D. (2010). Gene-environment-wide association studies: emerging approaches. Nat. Rev. Genet. 11, 259–272.
32. Lindström, S., Yen, Y.C., Spiegelman, D., and Kraft, P. (2009). The impact of gene-environment dependence and misclassification in genetic association studies incorporating gene-environment interactions. Hum. Hered. 68, 171–181.
33. Smith, P.G., and Day, N.E. (1984). The design of case-control studies: the influence of confounding and interaction effects. Int. J. Epidemiol. 13, 356–365.
34. Thomas, D. (2010). Methods for investigating gene-environment interactions in candidate pathway and genome-wide association studies. Annu. Rev. Public Health 31, 21–36.
35. Ballard, D.H., Cho, J., and Zhao, H. (2010). Comparisons of multimarker association methods to detect association between a candidate region and disease. Genet. Epidemiol. 34, 201–212.
36. Chapman, J., and Whittaker, J. (2008). Analysis of multiple SNPs in a candidate gene or region. Genet. Epidemiol. 32, 560–566.
37. Fridley, B.L., Jenkins, G.D., and Biernacka, J.M. (2010). Self-contained gene-set analysis of expression data: an evaluation of existing and novel methods. PLoS ONE 5, e12693.
38. Schaid, D.J. (2010b). Genomic similarity and kernel methods II: methods for genomic information. Hum. Hered. 70, 132–140.
39. Cordell, H.J. (2002). Epistasis: what it means, what it doesn't mean, and statistical methods to detect it in humans. Hum. Mol. Genet. 11, 2463–2468.
40. Hirschhorn, J.N., and Daly, M.J. (2005). Genome-wide association studies for common diseases and complex traits. Nat. Rev. Genet. 6, 95–108.
41. Kraft, P., Yen, Y.C., Stram, D.O., Morrison, J., and Gauderman, W.J. (2007). Exploiting gene-environment interaction to detect genetic associations. Hum. Hered. 63, 111–119.
42. Zhang, D., and Lin, X. (2003). Hypothesis testing in semiparametric additive mixed models. Biostatistics 4, 57–74.
43. Pearson, E.S. (1959). Note on an approximation to the distribution of noncentral χ2. Biometrika 46, 364.
44. Imhof, J.P. (1961). Computing the Distribution of Quadratic Forms in Normal Variables. Biometrika 48, 419–426.
45. Moskvina, V., and Schmidt, K.M. (2008). On multiple-testing correction in genome-wide association studies. Genet. Epidemiol. 32, 567–573.
46. Lake, S.L., Lyon, H., Tantisira, K., Silverman, E.K., Weiss, S.T., Laird, N.M., and Schaid, D.J. (2003). Estimation and tests of haplotype-environment interaction when linkage phase is ambiguous. Hum. Hered. 55, 56–65.
47. Schaid, D.J., Rowland, C.M., Tines, D.E., Jacobson, R.M., and Poland, G.A. (2002). Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am. J. Hum. Genet. 70, 425–434.
48. Toole, J.F., Malinow, M.R., Chambless, L.E., Spence, J.D., Pettigrew, L.C., Howard, V.J., Sides, E.G., Wang, C.H., and Stampfer, M. (2004). Lowering homocysteine in patients with ischemic stroke to prevent recurrent stroke, myocardial infarction, and death: the Vitamin Intervention for Stroke Prevention (VISP) randomized controlled trial. JAMA 291, 565–575.
49. Hsu, F.C., Sides, E.G., Mychaleckyj, J.C., Worrall, B.B., Elias, G.A., Liu, Y., Chen, W.M., Coull, B.M., Toole, J.F., Rich, S.S., et al. (2011). A Transcobalamin 2 gene variant associated with poststroke homocysteine modifies recurrent stroke risk. Neurology, in press.
50. Tsai, M.Y., Yang, F., Bignell, M., Aras, O., and Hanson, N.Q. (1999). Relation between plasma homocysteine concentration, the 844ins68 variant of the cystathionine beta-synthase gene, and pyridoxal-5'-phosphate concentration. Mol. Genet. Metab. 67, 352–356.
51. Lievers, K.J., Kluijtmans, L.A., Heil, S.G., Boers, G.H., Verhoef, P., van Oppenraay-Emmerzaal, D., den Heijer, M., Trijbels, F.J., and Blom, H.J. (2001). A 31 bp VNTR in the cystathionine beta-synthase (CBS) gene is associated with reduced CBS activity and elevated post-load homocysteine levels. Eur. J. Hum. Genet. 9, 583–589.
52. Lievers, K.J., Kluijtmans, L.A., Blom, H.J., Wilson, P.W., Selhub, J., and Ordovas, J.M. (2006). Association of a 31 bp VNTR in the CBS gene with post-load homocysteine concentrations in the Framingham Offspring Study. Eur. J. Hum. Genet. 14, 1125–1129.
53. Qian, D., and Thomas, D.C. (2001). Genome scan of complex traits by haplotype sharing correlation. Genet. Epidemiol. 21 (Suppl 1), S582–S587.
54. Harville, D.A. (1977). Maximum likelihood approaches to variance-component estimation and related problems. J. Am. Stat. Assoc. 72, 320–338.