
Gene–environment interaction testing in family-based

association studies with phenotypically ascertained

samples: a causal inference approach

DAVID W. FARDO∗

Department of Biostatistics, Division of Biomedical Informatics,

Center for Clinical and Translational Science, University of Kentucky,

Lexington, KY 40536, USA

david.fardo@uky.edu

JINZE LIU

Department of Computer Science, University of Kentucky,

Lexington, KY 40536, USA

DAWN L. DEMEO

Channing Laboratory, Brigham and Women’s Hospital and Harvard Medical School,

Boston, MA 02115, USA

EDWIN K. SILVERMAN

Channing Laboratory, Brigham and Women’s Hospital and Harvard Medical School,

Boston, MA 02115, USA

STIJN VANSTEELANDT

Department of Applied Mathematics and Computer Science, Ghent University,

9000 Gent, Belgium and Department of Epidemiology and Population Health,

London School of Hygiene and Tropical Medicine, London, UK

SUMMARY

We propose a method for testing gene–environment (G × E) interactions on a complex trait in family-

based studies in which a phenotypic ascertainment criterion has been imposed. This novel approach

employs G-estimation, a semiparametric estimation technique from the causal inference literature, to avoid

modeling of the association between the environmental exposure and the phenotype, to gain robustness

against unmeasured confounding due to population substructure, and to acknowledge the ascertainment

conditions. The proposed test allows for incomplete parental genotypes. It is compared by simulation

studies to an analogous conditional likelihood–based approach and to the QBAT-I test, which also invokes

the G-estimation principle but ignores ascertainment. We apply our approach to a study of chronic ob-

structive pulmonary disorder.

Keywords: Causal inference; COPD; Family-based association; G-estimation; Gene–environment interaction.

∗To whom correspondence should be addressed.

© The Author 2011. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

doi:10.1093/biostatistics/kxr035

Advance Access publication on November 13, 2011

Biostatistics (2012), 13, 3, pp. 468–481

by guest on November 6, 2015

http://biostatistics.oxfordjournals.org/

Downloaded from


1. INTRODUCTION

It is well accepted that the interplay of genetic and environmental factors can form an important part of

the biological foundation of common diseases. Some examples of gene–environment (G × E) interactions

include the N-acetyltransferase 2 (NAT2) gene with smoking in bladder cancer (García-Closas and

others, 2005) and with red meat consumption in colorectal cancer (Chan and others, 2005), the alcohol

dehydrogenase type 3 (ADH3) gene with moderate alcohol consumption in myocardial infarction (Hines

and others, 2001) and the serine proteinase inhibitor clade E, member 2 gene (SERPINE2) with smoking

in chronic obstructive pulmonary disorder (COPD) (Silverman, 2006). Detecting, and more importantly,

understanding these G×E interactions thus play a crucial role in solving the puzzle of gene-influenced

phenotypic variation. Several study designs have been proposed for discovering these types of associa-

tions, and the designs can be broadly characterized as either population- or family-based depending on

whether or not study subjects are related.

When considering designs consisting of related subjects, the class of family-based association tests

(FBATs; Laird and others, 2000) provides strategies for testing genetic effects that are robust to un-

detected/unaccounted for population substructure. Conditioning on parental/founder genotypes (or the

corresponding sufficient statistics when parental genotypes are missing; Rabinowitz and Laird, 2000) in-

sulates these testing strategies from the bias due to ancestry-driven confounding. By relying on a priori

knowledge regarding the distribution of the offspring genotypes conditional on the parental genotypes,

these testing strategies further remain valid when ascertainment conditions are imposed for sample re-

cruitment. This property is lost once a main genetic effect must be estimated, as in the case of testing

for G × E or gene–gene (G × G) interactions. Current methodologies assume either (i) no ascertainment

condition (i.e. a population sample) (Umbach and Weinberg, 2000; Yang and others, 2000; Vansteelandt,

VanderWeele, and others, 2008), (ii) a dichotomous trait of interest (Yang and others, 1999; Liu and

others, 2002; Cordell and others, 2004; Lake and Laird, 2004; Whittemore, 2004; Chatterjee and others,

2005; Hoffmann and others, 2009; Moerkerke and others, 2010; Wang and others, 2011) or (iii) a fully

parametric framework (Whittemore, 2004; Dudbridge, 2008). It is likely that extensions of some of the

aforementioned parametric approaches could be powerful alternatives able to handle multiparameter set-

tings when the parametric assumptions are correct. For example, the parental-genotype-robust estimator

proposed by Whittemore (2004) could be modified to estimate G×E interaction effects under continuous

trait ascertainment schema and, thus, employ a more standard, non-causal-inference–based method.

Our focus here will be on continuous traits. Considering that controlling false-positive rates is a fore-

most concern in this age of large-scale genetic association studies, we will attempt to avoid parametric as-

sumptions where possible by making use of G-estimation (Robins and others, 1992; Joffe and Brensinger,

2003; Greenland and others, 2008). This is a semiparametric estimation method from the causal inference

literature, which underlies the class of FBATs (Vansteelandt, Demeo, and others, 2008). In particular,

adapting ideas on artificial censoring in G-estimation for structural nested failure time models (Robins

and Tsiatis, 1991; Robins, 2002), we propose a test that maintains robustness to population stratification

and correctly accounts for sample ascertainment, while avoiding assumptions on the phenotypic and allele

frequency distributions. We assess the performance of this test empirically through extensive simulation

studies. To illustrate the approach in practice, we also apply these new techniques to a study of COPD

(Demeo and others, 2006).

2. METHODS

2.1 Notation

We consider a family-based genetic association study and introduce the following notation: $X_{ij}$ is a
function of the genotype from the $j$th member of the $i$th family ($i = 1,\dots,n$, where $n$ is the number of
independent families; $j = 1,\dots,m_i$, where $m_i$ is the number of nonfounders in the $i$th pedigree) coded
to reflect some particular mode of inheritance (MOI) (e.g. for an assumed additive MOI, $X_{ij}$ is the count
of risk alleles from the $j$th member of the $i$th family); $S_i$ is the sufficient statistic for the founder
genotypes of the $i$th family (see Rabinowitz and Laird, 2000); $Y_{ij}$ is the disease phenotype of interest;
$Z_{ij}$ is an environmental covariate; and $A_i$ is a hypothetical ascertainment indicator which is 1 for
families that have been ascertained and 0 otherwise.

2.2 Model

Figure 1 graphically illustrates the causal assumptions for a typical family-based association study using

a directed acyclic graph (DAG) (Pearl, 1995, 2000; Robins, 2001). It encodes, with arrows, all possi-

ble direct causal paths between the variables considered in the study. Let us first assume that all parental

genotypes have been observed so that S encodes the founder genotypes. While we allow for measured and

unmeasured covariates Z and U, respectively, to affect (or be associated with) the founder genotypes, S,

and the disease phenotype, Y, directly, the DAG assumes that they can affect the nonfounder genotype, X,

only through S. This particular assumption underlies the validity of the class of FBATs, even in the
presence of population stratification (FBATs; Rabinowitz and Laird, 2000). Note furthermore that the DAG
embodies the common G×E independence assumption $X \perp\!\!\!\perp Z \mid S$ (Umbach and Weinberg, 1997),
which is often deemed plausible, but has nevertheless been found to be violated in specific studies
(Chanock and Hunter, 2008) and should be critically assessed whenever applied.

Under these assumptions, estimating the overall genetic effect, here expressed by the arrow from X to
Y, can be accomplished by stratifying on the founder genotypes S. This result follows by application of the
d-separation rule (Pearl, 2000; Robins, 2001) upon noting that the correlation between phenotype Y and
genotype X induced by spurious associations from both measured and unmeasured potential confounders
is removed by conditioning on S, so that only the causal effect remains. In what follows, we will formalize
this into the assumption that

$$Y(0) \perp\!\!\!\perp X \mid Z, S, \tag{2.1}$$

where Y(0) denotes the potential/counterfactual response of a given subject under a regime in which X is
set to some reference value 0. Because X cannot have an effect on Y(0), the validity of this assumption
follows from standard theory for family-based tests (Rabinowitz and Laird, 2000; Horvath and others, 2004)
that, under Mendelian transmission and in the absence of a genetic effect, potential traits and genotypes
are independent conditional on S (and Z).

Fig. 1. DAG displaying the causal assumptions common for family-based genetic association study. Each arrow
denotes a possible causal path. Nodes are defined as U, unmeasured covariates; Z, measured covariates; S, the
sufficient statistics for the founder genotypes; X, the offspring genotypes; and Y, the disease phenotype.

Suppose now that the founder genotypes are incompletely observed so that S refers to the sufficient

statistic for the founder genotypes (Rabinowitz and Laird, 2000). It then follows from Hoffmann and oth-

ers (2009) that the G×E independence assumption continues to hold. It further follows from the detailed

arguments in the supplementary materials of Vansteelandt, Demeo, and others (2008) that Y(0) contin-

ues to be conditionally independent of X, given S (and also given the environmental covariate, Z). This

motivates our focus on the structural distribution model

$$P\{Y(0) \le y \mid X = x, Z, S\} = P(Y - \beta x - \gamma x Z \le y \mid X = x, Z, S). \tag{2.2}$$

Under this model, we have in particular that

$$E\{Y - Y(0) \mid X = x, Z, S\} = \beta x + \gamma x Z,$$

and, by (2.1), that

$$E(Y \mid X = x, Z, S) = E\{Y(0) \mid Z, S\} + \beta x + \gamma x Z,$$

from which it is clear that β encodes the main genetic effect and γ the G×E interaction. Under model
(2.2), (2.1) further implies that

$$Y - \beta X - \gamma X Z \perp\!\!\!\perp X \mid Z, S.$$

G-estimation (Robins and others, 1992; Vansteelandt, Demeo, and others, 2008) exploits this independence
restriction by obtaining a consistent estimator for β and γ as the values that solve an estimating
equation of the form

$$\sum_{i,j} \begin{pmatrix} 1 \\ Z_{ij} \end{pmatrix} (Y_{ij} - \beta X_{ij} - \gamma X_{ij} Z_{ij}) \{X_{ij} - E(X_{ij} \mid S_i)\} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}. \tag{2.3}$$

It is noteworthy that this approach, which is equivalent to that in Vansteelandt, Demeo, and others (2008),

does not require specification of the form of effects not of direct interest, that is Z and S.
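Because (2.3) is linear in (β, γ), it reduces to a 2 × 2 linear system once E(X|S) is known. The following Python sketch illustrates this for unascertained trios. It is an illustration under stated assumptions, not the authors' software: it assumes complete parental genotypes, additive coding (so that E(X|S) is the average of the parental allele counts), Hardy-Weinberg proportions in the parents, and the hypothetical function names `simulate_trio` and `g_estimate`.

```python
import random

def simulate_trio(rng, maf, beta, gamma):
    """Simulate one unascertained trio; returns (y, x, z, E[x | parents])."""
    # Parental allele counts (0, 1 or 2), assuming Hardy-Weinberg proportions.
    father = sum(rng.random() < maf for _ in range(2))
    mother = sum(rng.random() < maf for _ in range(2))
    # Mendelian transmission: a parent passes a risk allele with prob. count / 2.
    x = (rng.random() < father / 2) + (rng.random() < mother / 2)
    z = rng.gauss(0.0, 1.0)
    y = beta * x + z + gamma * x * z + rng.gauss(0.0, 1.0)
    return y, x, z, (father + mother) / 2

def g_estimate(data):
    """Solve sum (1, z)' (y - b*x - g*x*z) (x - E[x|S]) = 0 for (b, g)."""
    a11 = a12 = a21 = a22 = c1 = c2 = 0.0
    for y, x, z, ex in data:
        r = x - ex                      # genotype residual X - E(X | S)
        a11 += x * r
        a12 += x * z * r
        a21 += z * x * r
        a22 += z * z * x * r
        c1 += y * r
        c2 += z * y * r
    det = a11 * a22 - a12 * a21
    # Cramer's rule for the 2 x 2 linear system.
    return (c1 * a22 - a12 * c2) / det, (a11 * c2 - a21 * c1) / det

if __name__ == "__main__":
    rng = random.Random(1)
    trios = [simulate_trio(rng, 0.3, 0.3, 0.2) for _ in range(20000)]
    print(g_estimate(trios))
```

Note that the sketch never models how Z or S affect the trait; only the known conditional mean of the offspring genotype enters, which is the point made in the text.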

2.3 Introducing ascertainment criteria

Previous results are no longer valid in the presence of ascertainment (except when testing the null
hypothesis of no overall genetic effect). This can be seen by adding the ascertainment node A to the
original causal diagram (see Figure 2 (a), which includes the situation where families are selected based
on a proband’s phenotype) and noting that the analysis is, by design, conditional on A. Indeed,
stratifying on the “child” A of a collider Y modifies the association between S and X. It follows that

$$E(X_{ij} \mid S_i, A_i) \neq E(X_{ij} \mid S_i),$$

and thus that we cannot immediately rely on Mendel’s law of segregation when the analysis is restricted
to ascertained individuals only. It is also seen from Figure 2 (a)

that the no confounding assumption 2.1 will generally fail to hold conditional on A (see also Section 2.4).
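A small simulation can make the inequality between E(X|S, A) and E(X|S) concrete. The Python sketch below is illustrative only: it assumes trios with two heterozygous parents (so that Mendel's law gives E(X|S) = 1), a trait Y = βX + ε with standard normal ε, and ascertainment when Y falls below a cutoff. The function name `mean_offspring_genotype` and all parameter values are hypothetical.

```python
import random

def mean_offspring_genotype(n, beta, cutoff, seed=0):
    """Mean offspring allele count for trios with two heterozygous parents,
    overall and among trios ascertained through Y < cutoff."""
    rng = random.Random(seed)
    all_x, ascertained_x = [], []
    for _ in range(n):
        # Each heterozygous parent transmits the risk allele with probability 1/2.
        x = (rng.random() < 0.5) + (rng.random() < 0.5)
        y = beta * x + rng.gauss(0.0, 1.0)
        all_x.append(x)
        if y < cutoff:
            ascertained_x.append(x)
    return sum(all_x) / len(all_x), sum(ascertained_x) / len(ascertained_x)

if __name__ == "__main__":
    # With a genetic effect, the ascertained mean departs from the Mendelian value 1.
    print(mean_offspring_genotype(50000, beta=1.0, cutoff=0.0))
    # Without a genetic effect, ascertainment on Y leaves the mean near 1.
    print(mean_offspring_genotype(50000, beta=0.0, cutoff=0.0))
```

When β = 0 the two means agree up to sampling error, which is consistent with the remark that tests of the overall genetic null remain valid under ascertainment.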

A strategy to address this problem, suggested to us by James Robins, is to evaluate the estimating

equation (2.3) only in subjects who would have been ascertained regardless of genotype. Related ideas

have been used in G-estimation for structural nested failure time models (Robins and Tsiatis, 1991;

Robins, 2002), where the observed event times are typically artificially “recensored.” To be precise, let

A(x) denote the counterfactual ascertainment status of a given subject under a regime in which X is set
to x. Then we will evaluate the estimating equation only in subjects with A(x) = 1 for all x. This is
informally depicted in Figure 2 (b), where $A^* = 1$ for subjects with A(x) = 1 for all x (0 otherwise); from
Figure 2 (b), we can see that $Y(0) \perp\!\!\!\perp X \mid Z, S, A^* = 1$, $X \perp\!\!\!\perp Z \mid S, A^* = 1$, and
$A^* \perp\!\!\!\perp X \mid S$, suggesting that the previous G-estimation principle remains applicable to this
subset. A more formal development, which shows that application of G-estimation to this subset continues
to infer the population genetic effect, is given in the next section.

Fig. 2. DAGs incorporating ascertainment criterion into the usual family-based genetic association study. We denote
the nodes as U, unmeasured covariates; Z, measured covariates; S, the sufficient statistics for the founder genotypes;
{Y(x)}, the set of counterfactual disease traits under regimes when X is set to x, for all x; A = f(Y) is the
outcome-based ascertainment criterion; A* is the counterfactual-based ascertainment criterion (A* = 1 denotes the
subjects who would have been ascertained regardless of genotype); X, the offspring genotypes; and Y, the observed
disease phenotype.

2.4 G×E interaction testing

Throughout, we will consider an outcome-dependent ascertainment criterion such that families are se-

lected for a study based on a single proband within the family having an extreme phenotype. Specifically,

a proband is randomly chosen and the corresponding family is selected if and only if that proband’s phe-

notype is below some known value y, so that individuals with phenotypes below y are identified and

family members of these probands are then also recruited. This strategy is often part of a study’s eli-

gibility criteria, for example using methacholine reactivity for asthma (Childhood Asthma Management

Program, 1999), albumin–creatinine ratio for diabetic nephropathy (Mueller and others, 2006), and body

mass index for obesity (Hinney and others, 1997). As an example, in the COPD study motivating the

methodology, single probands with severe COPD were identified and, subsequently, all available relatives

of this single proband were recruited such that the entire family’s recruitment is solely contingent on the

original proband.

For notational convenience, we will identify the selected proband with subject index j = 1 and define
the family ascertainment indicator to be $A_i = f(Y_{i1}) = I(Y_{i1} \le y)$. Under the null hypothesis of no G×E
interaction (i.e. γ = 0), we then propose to estimate the main genetic effect β by solving an equation of
the form

$$\sum_{ij} I\{A^*_i(\beta) = 1\}(Y_{ij} - \beta X_{ij})\{X_{ij} - E(X_{ij} \mid S_i)\} = 0 \tag{2.4}$$

for β, where we define $A^*_{ij}(\beta) = 1$ if $f\{Y_{ij} + \beta(x - X_{ij})\} = 1\ \forall x$ and 0 otherwise, and due to the
single proband ascertainment schema considered here, $A^*_i(\beta) = A^*_{i1}(\beta)$ (see supplementary material
available at http://www.biostatistics.oxfordjournals.org for alternative definitions of $A^*_i(\beta)$). Note that
$A^*_i(\beta) = 1$ for family i implies in particular that $A_i = 1$, so that this estimating equation is effectively
selecting a subset of the observed ascertained sample. The validity of this approach requires the following
generalization of assumption (2.1):

$$\{Y_{ij}(0);\ j = 1,\dots,m_i\} \perp\!\!\!\perp X_{ij} \mid Z_{ij}, S_i, \tag{2.5}$$

which is plausible when (2.1) holds and a subject’s genotype does not affect the phenotype of other
family members. Indeed, under this strengthened assumption and model (2.2) with γ = 0, (2.4) is a
mean-zero estimating equation because $I\{A^*_i(\beta) = 1\}(Y_{ij} - \beta X_{ij})$ is a function of
$\{Y_{ij} - \beta X_{ij};\ j = 1,\dots,m_i\}$, which is independent of $X_{ij}$, conditional on $Z_{ij}$ and $S_i$ by (2.5)
and because $E(X_{ij} \mid Z_{ij}, S_i) = E(X_{ij} \mid S_i)$. We can then evaluate the following score test statistic for
G×E interaction, $S_{G\times E}(\beta)$:

$$\sum_{ij} Z_{ij} I\{A^*_i(\beta) = 1\}(Y_{ij} - \beta X_{ij})\{X_{ij} - E(X_{ij} \mid S_i)\}, \tag{2.6}$$

which has mean zero under the null hypothesis of no G×E interaction. In practice, we evaluate it with β
replaced with $\hat\beta$, the solution to (2.4). Note that we choose to focus on score tests because it enables
setting γ = 0, which is computationally attractive and tends to limit the deletion of families from the
analysis.

Because the indicator functions $I\{A^*_i(\beta) = 1\}$ depend on the genetic effect size, the estimating function
for β, that is

$$U_{ij,\beta}(\beta) = I\{A^*_i(\beta) = 1\}(Y_{ij} - \beta X_{ij})\{X_{ij} - E(X_{ij} \mid S_i)\}$$

and the score function for the G×E interaction test, that is

$$U_{ij,\gamma}(\beta) = Z_{ij} I\{A^*_i(\beta) = 1\}(Y_{ij} - \beta X_{ij})\{X_{ij} - E(X_{ij} \mid S_i)\}$$

are nonsmooth in β. The usual Taylor series arguments for M-estimators, which are needed to adjust the
G×E interaction test for the uncertainty regarding $\hat\beta$, are no longer applicable. Assuming that

$$n^{-1/2}\sum_{ij} U_{ij,\gamma}(\hat\beta) = n^{-1/2}\sum_{ij} U_{ij,\gamma}(\beta) - \Gamma_{\gamma\beta}\, n^{1/2}(\hat\beta - \beta) + o_p(1)$$

$$n^{-1/2}\sum_{ij} U_{ij,\beta}(\hat\beta) = n^{-1/2}\sum_{ij} U_{ij,\beta}(\beta) - \Gamma_{\beta\beta}\, n^{1/2}(\hat\beta - \beta) + o_p(1),$$

for conformable matrices $\Gamma_{\gamma\beta}$ and $\Gamma_{\beta\beta}$, we have that

$$n^{-1/2}\sum_{ij} U_{ij,\gamma}(\hat\beta) = n^{-1/2}\sum_{ij} \{U_{ij,\gamma}(\beta) - \Gamma_{\gamma\beta}\Gamma_{\beta\beta}^{-1} U_{ij,\beta}(\beta)\} + o_p(1). \tag{2.7}$$

Here, for absolutely continuous phenotypes, $\Gamma_{\gamma\beta}$ and $\Gamma_{\beta\beta}$ can be estimated as the empirical means of
$-Z_{ij} I\{A^*_i(\beta) = 1\} X_{ij}\{X_{ij} - E(X_{ij} \mid S_i)\}$ and $-I\{A^*_i(\beta) = 1\} X_{ij}\{X_{ij} - E(X_{ij} \mid S_i)\}$,
respectively, evaluated at $\hat\beta$. More generally, one may use the least squares–based resampling approach
proposed by Zeng and Lin (2008), which we adopted in the simulation studies and data analysis. Briefly,
this involves assessing how fluctuations in $\hat\beta$ induce fluctuations in the scores $U_{ij,\beta}(\hat\beta)$ and
$U_{ij,\gamma}(\hat\beta)$, and then using least squares regression to summarize these associations. B realizations from
a random, mean zero vector are generated and denoted by $Z_1,\dots,Z_B$. $\hat\Gamma_{\beta\beta}$ is then calculated as the
least squares estimate from regressing $n^{-1/2}U_{ij,\beta}(\hat\beta + n^{-1/2}Z_b)$ $(b = 1,\dots,B)$ on $Z_b$
$(b = 1,\dots,B)$. Similarly, $\hat\Gamma_{\gamma\beta}$ is the least squares estimate from regressing
$n^{-1/2}U_{ij,\gamma}(\hat\beta + n^{-1/2}Z_b)$ $(b = 1,\dots,B)$ on $Z_b$ $(b = 1,\dots,B)$. From (2.7), the asymptotic
variance of the test statistic $n^{-1/2}\sum_{ij} U_{ij,\gamma}(\hat\beta)$ can now be evaluated as the empirical variance of
$U_{ij,\gamma}(\hat\beta) - \hat\Gamma_{\gamma\beta}\hat\Gamma_{\beta\beta}^{-1}U_{ij,\beta}(\hat\beta)$ with $\hat\beta$, $\hat\Gamma_{\gamma\beta}$ and $\hat\Gamma_{\beta\beta}$ held fixed. We now
have a test of G×E interaction by comparing the ratio of $n^{-1/2}\sum_{ij} U_{ij,\gamma}(\hat\beta)$ and its standard
deviation to a standard

normal distribution. This test avoids distributional assumptions while taking into account the imposed

ascertainment conditions.
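The steps of this section can be sketched in Python for the simplest case of proband-only trios. This is a minimal illustration under several assumptions that are not taken from the paper: additive coding with genotypes in {0, 1, 2}, complete parental genotypes, ascertainment through Y ≤ cutoff, a bisection search for the root of (2.4) on a fixed bracket, and the plug-in estimates of the Γ matrices rather than the resampling approach of Zeng and Lin (2008). All function names are hypothetical.

```python
import math
import random

GENOTYPES = (0, 1, 2)  # additive coding; an assumption of this sketch

def a_star(y, x, beta, cutoff):
    """True if the proband would be ascertained (trait <= cutoff) under every genotype."""
    return all(y + beta * (g - x) <= cutoff for g in GENOTYPES)

def scores(rec, beta, cutoff):
    """Return (U_beta, U_gamma) for one proband record (y, x, z, E[x|S])."""
    y, x, z, ex = rec
    if not a_star(y, x, beta, cutoff):
        return 0.0, 0.0
    u = (y - beta * x) * (x - ex)
    return u, z * u

def solve_beta(data, cutoff, lo=-1.0, hi=1.0, iters=40):
    """Bisection on the sign of the nonsmooth estimating function (2.4)."""
    def total(b):
        return sum(scores(r, b, cutoff)[0] for r in data)
    f_lo, f_hi = total(lo), total(hi)
    if f_lo * f_hi > 0:
        raise ValueError("no sign change on [lo, hi]; widen the bracket")
    for _ in range(iters):
        mid = (lo + hi) / 2
        f_mid = total(mid)
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return (lo + hi) / 2

def gxe_score_test(data, cutoff):
    """Return (beta_hat, z_statistic) following (2.4), (2.6) and (2.7)."""
    beta_hat = solve_beta(data, cutoff)
    pairs = [scores(r, beta_hat, cutoff) for r in data]
    kept = [r for r in data if a_star(r[0], r[1], beta_hat, cutoff)]
    # Plug-in ratio Gamma_{gamma beta} / Gamma_{beta beta}; signs and 1/n cancel.
    ratio = (sum(z * x * (x - ex) for _, x, z, ex in kept)
             / sum(x * (x - ex) for _, x, z, ex in kept))
    adjusted = [ug - ratio * ub for ub, ug in pairs]
    mean_adj = sum(adjusted) / len(adjusted)
    var_sum = sum((a - mean_adj) ** 2 for a in adjusted)  # n times empirical variance
    return beta_hat, sum(ug for _, ug in pairs) / math.sqrt(var_sum)

def simulate_ascertained(n, maf, beta, gamma, cutoff, seed=0):
    """Rejection-sample n trios whose proband satisfies Y <= cutoff."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        father = sum(rng.random() < maf for _ in range(2))
        mother = sum(rng.random() < maf for _ in range(2))
        x = (rng.random() < father / 2) + (rng.random() < mother / 2)
        z = rng.gauss(0.0, 1.0)
        y = beta * x + z + gamma * x * z + rng.gauss(0.0, 1.0)
        if y <= cutoff:
            out.append((y, x, z, (father + mother) / 2))
    return out
```

A call such as `gxe_score_test(simulate_ascertained(4000, 0.3, 0.3, 0.0, 0.0), 0.0)` returns the estimate of β and a statistic to compare with the standard normal distribution. Bisection is used here only because the estimating function is a step-like function of β; it finds one sign change inside the bracket and does not guard against multiple roots.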

3. SIMULATION STUDIES

In each replicate of the simulation study (10000 replicates per parameter combination), we simulate a
single nucleotide polymorphism (SNP) and an environmental covariate for 2000 trios by first simulating
parental genotypes and assuming Mendelian transmissions to the proband. The offspring trait, Y, is
generated based on the model: $Y_i = \beta_X X_i + Z_i + \beta_{X\times Z}(X_i Z_i) + \varepsilon_i$, where $X_i \in \{0,1,2\}$
(reflecting an additive MOI), $Z_i \sim N(0,1)$ and $\varepsilon_i \sim N(0,1)$. The effect sizes are then calculated based
on the specified heritability for the main effect of genotype ($h^2_X$) and the G×E effect ($h^2_{X\times Z}$).
Simulated traits are compared against an ascertainment rule (e.g. $Y > y_{0.90}$, where $y_p$ for $p \in [0,1]$
denotes the 100p-th percentile of the distribution of Y), and this is repeated until 2000 “ascertained”
trios have been generated. Figures 3(a)–4(b) display results from scenarios with various minor allele
frequencies (MAFs; p = 5–50%), ascertainment rules ($Y > y_a$ with a = 0.50 and 0.90) and heritabilities
($h^2_X$ and $h^2_{X\times Z}$, each either 0 or 2.5%). Note that the notion of “extreme phenotype” in the simulations
defines ascertainment through the upper tail of the trait distribution. The performance of the proposed
method is independent of the direction of the ascertainment condition. For each ascertainment rule
examined, the pair of supporting figures consists of (a) type I error rates (Figure 3) for the test of G×E
interaction (i.e. under the interaction null of $h^2_{X\times Z} = 0$), first without a main effect ($h^2_X = 0$) and
then in the presence of a main genetic effect ($h^2_X = 2.5\%$) and (b) powers (Figure 4) assuming
$h^2_{X\times Z} = 2.5\%$, again displaying panels both with and without a main genetic effect.

Fig. 3. Empirical type I error rates over 10000 simulations of a study recruiting 2000 trios and employing either a
50% or 90% ascertainment rule.

Fig. 4. Empirical power over 10000 simulations of a study recruiting 2000 trios and employing either a 50% or 90%
ascertainment rule.

Studies recruiting only probands with traits above the population median (i.e. using an ascertainment
rule of $Y > y_{0.50}$) incur a mild inflation of type I error when a genetic main effect is present when
employing QBAT-I (Vansteelandt, Demeo, and others, 2008), the test statistic that ignores the ascertainment
conditions (Figure 3(a)). This inflation is most pronounced at lower MAFs. For example, the most severe
inflation observed in the simulations was for a SNP with a MAF of 5% (type I error rate of 7.1% using
a nominal α = 5%). Conversely, $S_{G\times E}$ is mildly conservative when a main genetic effect is present.
Both test statistics maintain the nominal significance level under the complete null hypothesis of no
genetic effect. Under this ascertainment schema, however, QBAT-I is uniformly more powerful than $S_{G\times E}$
(although powers are difficult to compare when type I error rates are inflated) and, other than at low MAF,
power is a decreasing function of MAF (Figure 4(a)).

When trait ascertainment is more stringent, the consequences of failing to properly account for it are
more evident. Figures 3(b) and 4(b) display the type I error rates and powers under an ascertainment rule of
$Y > y_{0.90}$. Here, we see that both tests preserve the nominal significance under the complete null
hypothesis; however, when there is a main genetic effect, the bias observed using QBAT-I is quite pronounced,
here increasing as MAF does, while $S_{G\times E}$ remains mildly conservative (Figure 3(b)). Interestingly, other
than with rare SNPs and in the presence of a main genetic effect, $S_{G\times E}$ is consistently more powerful than
QBAT-I, and confers a 30% improvement in power for many MAFs. This is surprising in the absence of
a main genetic effect, in which case also QBAT-I is valid. As with median-based ascertainment, power
is a decreasing function of MAF; this suggests that ascertainment strategies can be more useful when
applied to rarer SNPs. Also of note is that, curiously, when examining a SNP with a main effect and a
MAF of 20% or above, the empirical rejection rates for QBAT-I are actually higher under the interaction
null hypothesis ($h^2_{X\times Z} = 0$) than the alternative ($h^2_{X\times Z} = 2.5\%$). This clearly shows the importance of
employing an analytic strategy that is appropriate for the given study design.
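The data-generating scheme described in this section can be sketched in Python as follows. Two pieces are assumptions of the sketch rather than details given in the text: the conversion from heritability to effect size (here the main effect is taken to explain a fraction h2 of the trait variance when there is no interaction, with Var(X) = 2p(1 − p) and variance 2 contributed by Z and ε), and the use of a pilot sample to approximate the trait percentile. The names `beta_from_heritability`, `draw_trio`, and `simulate_study` are hypothetical.

```python
import math
import random

def beta_from_heritability(h2, maf):
    """Assumed conversion: h2 = b^2 Var(X) / {b^2 Var(X) + 2}, Var(X) = 2p(1-p)."""
    var_x = 2 * maf * (1 - maf)
    return math.sqrt(2 * h2 / ((1 - h2) * var_x))

def draw_trio(rng, maf, beta_x, beta_xz):
    """One trio: parental allele counts, Mendelian transmission, then the trait."""
    father = sum(rng.random() < maf for _ in range(2))
    mother = sum(rng.random() < maf for _ in range(2))
    x = (rng.random() < father / 2) + (rng.random() < mother / 2)
    z = rng.gauss(0.0, 1.0)
    y = beta_x * x + z + beta_xz * x * z + rng.gauss(0.0, 1.0)
    return y, x, z, (father + mother) / 2

def simulate_study(n_trios, maf, beta_x, beta_xz, pct, seed=0, pilot=20000):
    """Keep drawing trios until n_trios satisfy Y > (empirical pct quantile)."""
    rng = random.Random(seed)
    pilot_y = sorted(draw_trio(rng, maf, beta_x, beta_xz)[0] for _ in range(pilot))
    cutoff = pilot_y[min(int(pct * pilot), pilot - 1)]
    trios = []
    while len(trios) < n_trios:
        trio = draw_trio(rng, maf, beta_x, beta_xz)
        if trio[0] > cutoff:
            trios.append(trio)
    return cutoff, trios

if __name__ == "__main__":
    b = beta_from_heritability(0.025, 0.3)
    cutoff, trios = simulate_study(2000, 0.3, b, 0.0, 0.90)
    print(round(cutoff, 2), len(trios))
```

Rejection sampling mirrors the description in the text, where traits are compared against the ascertainment rule until the target number of ascertained trios is reached.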

3.1 Comparison to a likelihood-based approach

We additionally examine a strategy that posits a truncated normal distribution for the offspring trait.
Maximum likelihood estimates are calculated from the model
$Y_i = \beta_0 + \beta_X X_i + \beta_{EX} E(X_i \mid S_i) + \beta_Z Z_i + \beta_{X\times Z}(X_i Z_i) + \varepsilon_i$ conditional on
$Y_i > y_{0.90}$. A standard test for $H_0\colon \beta_{X\times Z} = 0$ is then used. We find

that the likelihood approach confers more power than the proposed method but, similar to QBAT-I, suffers

from inflated type I error under a variety of scenarios. This bias is likely finite-sample bias due to the dif-

ficulties inherent in using only a tail of the normal distribution to estimate the rest of the distribution. See


Section S1.1 of the supplementary material (available at Biostatistics online) for more detailed simulation

results.
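For concreteness, the truncated normal log-likelihood underlying this comparison can be written down in a few lines of Python. The sketch below only evaluates the log-likelihood; maximization would be left to a numerical optimizer, which is not shown. The function names and the coefficient ordering are hypothetical, and the records are assumed to be tuples (y, x, z, E[x|S]).

```python
import math
from statistics import NormalDist

def truncated_normal_loglik(y, mean, sd, cutoff):
    """Log-density of N(mean, sd^2) at y, conditional on Y > cutoff."""
    if y <= cutoff:
        return float("-inf")
    dist = NormalDist(mean, sd)
    # Normal log-density minus the log-probability of exceeding the cutoff.
    return math.log(dist.pdf(y)) - math.log(1.0 - dist.cdf(cutoff))

def model_mean(coefs, x, ex, z):
    """Linear predictor b0 + bX*x + bEX*E(x|S) + bZ*z + bXZ*x*z."""
    b0, bx, bex, bz, bxz = coefs
    return b0 + bx * x + bex * ex + bz * z + bxz * x * z

def sample_loglik(coefs, sd, data, cutoff):
    """Sum of truncated log-likelihood contributions over ascertained records."""
    return sum(
        truncated_normal_loglik(y, model_mean(coefs, x, ex, z), sd, cutoff)
        for y, x, z, ex in data
    )
```

The correction term depends on how far the cutoff lies in the tail of the fitted distribution, so when only the upper 10% of traits are observed, small changes in the mean and variance parameters can be hard to distinguish. This is in line with the finite-sample difficulties mentioned above. In this naive form the logarithms can also fail numerically when the cutoff or the observation lies extremely far from the fitted mean.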

3.2 Introducing misspecification

In all above simulations, we assume the data-generating mechanism to be known; however, it is often

the case that components of the mechanism are not known. We consider the effects of (i) misspecifying

the MOI, for example testing with the assumption of an additive MOI when in fact the genetic effect

is dominant, (ii) misspecifying the offspring trait distribution, and (iii) misspecifying the form of the

interaction effect. To assess the effects of a misspecified MOI, we simulate as before and vary the true

MOI and that assumed for testing. Our proposed approach remains valid except when data are generated

with a recessive MOI and the minor allele is rare (e.g. 5%). Empirical power is not greatly affected when

testing with a dominant MOI when the true MOI is additive and vice versa; however, testing assuming a

recessive MOI results in much loss of power unless the true MOI is in fact recessive. Results are provided

in Section S1.2.1 of the supplementary material (available at Biostatistics online).

We also simulate scenarios with offspring trait errors generated from a Gamma(1,1) distribution (i.e.
$\varepsilon_i \sim \Gamma(1,1)$) and others where the environmental covariate enters the interaction as a squared term,
$\beta_{X\times Z}(X_i Z_i^2)$. Our approach remains valid under these types of specification, and bias is amplified
for both the QBAT-I and likelihood approaches. These results are displayed in Section S1.2.2 in the
supplementary material (available at Biostatistics online).

4. APPLICATION TO A CHRONIC OBSTRUCTIVE PULMONARY DISEASE STUDY

As in Vansteelandt, Demeo, and others (2008), we examine the Boston Early-Onset COPD Study of

Silverman and others (1998). We test the same 6 SNPs located in the SERPINE2 gene previously found to

be in a linkage peak for COPD (Demeo and others, 2006). The study consists of 128 extended pedigrees

with a maximum family size of 27 members, each ascertained from a single proband satisfying (i) forced

expiratory volume at 1 s (FEV1) less than 40% of predicted, (ii) less than 53 years of age, and (iii) no

evidence of severe alpha 1-antitrypsin deficiency. Because the study’s primary phenotype is FEV1, we
apply the new G×E interaction testing strategy with respect to the first ascertainment criterion; the
remaining ascertainment conditions relate to baseline covariates and can henceforth be ignored (when
the model is stated conditional on these covariates). That is, to estimate the main genetic effect on FEV1,
we require that the proband from each pedigree would have FEV1 < 40% of predicted regardless of
the proband’s true genotype, as detailed in Section 2.4. The remainder of each proband’s extended family
is included in the analysis only when this criterion is satisfied.

Smoking is the main risk factor for COPD. As such, it is the environmental covariate investigated in

the G×E interaction analysis. Two measures of smoking are used here: a binary version that differentiates

subjects who have ever smoked before (ever-smokers) from those who have not (never-smokers); and

a continuous version that serves as a proxy for lifetime exposure, pack years (one pack year is defined

as smoking one pack of cigarettes per day for 1 year). Consistent with previous analyses of these data,

we assume a dominant MOI for each SNP and a null hypothesis of linkage and no association as the

tested SNPs were previously found to be under a linkage peak. There is no evidence to support violation

of the G×E conditional independence assumption in these data, and an ongoing study corroborates this

by finding no nominally significant associations between the investigated SNPs and various smoking

covariates (data not shown).

Table 1 displays the results of this analysis employing both the new test, SG×E, and QBAT-I (Vansteelandt,
Demeo, and others, 2008). For each SNP, the p values are given for both tests. The new approach
yields less significant results from ser8, but a potentially more powerful test when considering ser51 and
the quantitative phenotype pack years (SG×E-p = 0.047 vs. QBAT-I-p = 0.082). These findings are
consistent with those of Vansteelandt, Demeo, and others (2008) that indicated a haplotype-smoking
interaction for haplotypes containing the ser51 SNP and also that the quantitative phenotype pack years is
more powerful than its binary counterpart.

Table 1. Summary of results for COPD study. Each SNP is assumed to act in a dominant fashion, and all
analyses assumed a null hypothesis of linkage and no association. G×E interaction p values from both the
proposed test statistic and QBAT-I are presented for the continuous covariate, pack years, and the binary
measure, ever-smoker

           Pack years             Ever-smoker
Markers    SG×E-p    QBAT-I-p     SG×E-p    QBAT-I-p
ser37      0.118     0.126        0.152     0.163
ser8       0.662     0.243        0.951     0.958
ser51      0.047     0.082        0.455     0.816
ser55      0.209     0.244        0.703     0.774
ser50      0.213     0.271        0.745     0.825
ser6       0.232     0.249        0.870     0.941

5. DISCUSSION

The current era of large-scale genetics has seen many successes from conducting genome-wide associa-

tion studies (GWAS) of complex diseases; to date, novel SNPs for well over 100 disease traits have been

discovered using this approach (Hindorff and others, 2010). The so-called problem of “missing heritabil-

ity,” however, has caused some to criticize GWAS (Goldstein, 2009). Others have been more optimistic

and have addressed the manners in which current and future GWAS can be used to uncover more of the

genetic predisposition for complex diseases (McCarthy and Hirschhorn, 2008; Manolio and others, 2009);

not the least among the suggested strategies are using family samples and exploring G×G and G×E in-

teractions. It is integral that appropriate, powerful, and unbiased statistical methodologies be designed for

detecting G×E interaction with quantitative disease traits in this context of large-scale analysis as well as

in smaller studies.

We have proposed a novel test for G×E interaction that does not require any modeling assumptions

for the phenotype distribution, appropriately accounts for outcome-dependent ascertainment, is robust to

confounding due to population substructure, and allows for incomplete parental genotypes. Our results

support the use of this method for any family study investigating G×E interactions that employs an as-

certainment criterion based on a quantitative trait provided that there is support for the G×E conditional

independence assumption (or, at the least, no strong evidence that it fails to hold). If this assumption

is implausible, one could in principle derive the distribution of X| Z,S utilizing the known law of X|S

and an application of Bayes’ rule in order to modify our approach. However, this would require modeling

assumptions that have been intentionally avoided in the proposed method. By focusing on estimation of

a scalar parameter, we additionally avoid problems associated with the potential for identifying multiple

solutions (Joffe and others, 2011).
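As a brief illustration of the Bayes' rule step mentioned above, written for a discrete genotype coding and with $p(\cdot \mid \cdot)$ denoting a conditional density of the exposure (this notation is an assumption made here for illustration and is not taken from the earlier sections), the required law could be written as

```latex
P(X = x \mid Z, S)
  = \frac{p(Z \mid X = x, S)\, P(X = x \mid S)}
         {\sum_{x'} p(Z \mid X = x', S)\, P(X = x' \mid S)} .
```

The term $p(Z \mid X = x, S)$ is exactly the exposure model that the proposed method avoids specifying. If the conditional independence assumption is read as $Z$ being independent of $X$ given $S$, this term no longer depends on $x$, cancels between numerator and denominator, and the expression reduces to the known law $P(X = x \mid S)$.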

A potential limitation of the proposed approach is that when the genotype coding can take on many

levels (e.g. when X refers to a haplotype coding), then the set of families who would have been ascer-

tained regardless of their genotype may be small, in which case the proposed test might have very limited

power. This can be partially remedied by restricting the focus to a smaller set $K_x$ with bounded support,

containing a subset of the observed genotypes (e.g. haplotypes satisfying some threshold of observed

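frequency). In particular, we can redefine $A^*_{ij}(\beta) = 1$ if $f\{Y_{ij} + \beta(x - X_{ij})\} = 1$ $\forall x \in K_x$ and 0 otherwise, and modify the estimating equation (2.4) for $\beta$ to

$$\sum_{ij} I\{A^*_i(\beta) = 1\}\, I(X_{ij} \in K_x)\,(Y_{ij} - \beta X_{ij}) \left[ X_{ij} - \frac{E(X_{ij} I(X_{ij} \in K_x) \mid S_i)}{E(I(X_{ij} \in K_x) \mid S_i)} \right] = 0$$

and the G×E test statistic to

$$\sum_{ij} Z_{ij}\, I\{A^*_i(\beta) = 1\}\, I(X_{ij} \in K_x)\,(Y_{ij} - \beta X_{ij}) \left[ X_{ij} - \frac{E(X_{ij} I(X_{ij} \in K_x) \mid S_i)}{E(I(X_{ij} \in K_x) \mid S_i)} \right].$$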

This modification is motivated by the more general results in Robins (2002, Theorem A4.1); its validity

is readily verified from the unbiasedness of these equations under the null hypothesis. In future work,

we hope to extend the proposed methods to accommodate more general ascertainment conditions that are

based not on the tested phenotype but rather a correlated endophenotype.
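To make the restricted estimating function concrete, the following minimal Python sketch evaluates it at a given value of β. It is an illustration under stated assumptions, not the implementation used for the analyses in this paper. The function name `restricted_score` and all argument names are hypothetical. The conditional expectations E(X I(X ∈ Kx)|S) and E(I(X ∈ Kx)|S) are assumed to have been computed beforehand from the known law of X|S, the ascertainment rule is passed in as a vectorized 0/1 function `f`, and the family-level indicator is assumed to equal 1 only when every offspring in the family satisfies the member-level condition.

```python
import numpy as np


def restricted_score(beta, y, x, fam, e_xk, p_k, k_set, f, z=None):
    """Evaluate the restricted estimating function at beta.

    y, x   : phenotype and genotype coding, one entry per offspring
    fam    : family label, one entry per offspring
    e_xk   : E{X I(X in K) | S} per offspring (assumed precomputed)
    p_k    : E{I(X in K) | S} per offspring (assumed precomputed)
    k_set  : genotype values in the restricted set K
    f      : vectorized ascertainment rule returning 0/1
    z      : optional exposure; if given, the G x E statistic is returned
    """
    y, x, fam = np.asarray(y, float), np.asarray(x, float), np.asarray(fam)
    e_xk, p_k = np.asarray(e_xk, float), np.asarray(p_k, float)
    k_vals = np.asarray(list(k_set), dtype=float)

    # Member-level indicator: f{Y + beta (k - X)} = 1 for every k in K.
    shifted = y[:, None] + beta * (k_vals[None, :] - x[:, None])
    a_member = np.all(np.asarray(f(shifted)) == 1, axis=1)

    # Family-level indicator (assumption: all members must pass).
    _, inv = np.unique(fam, return_inverse=True)
    n_fail = np.bincount(inv, weights=(~a_member).astype(float))
    a_family = (n_fail == 0)[inv]

    # Keep offspring in retained families whose genotype lies in K.
    keep = a_family & np.isin(x, k_vals)
    terms = (y[keep] - beta * x[keep]) * (x[keep] - e_xk[keep] / p_k[keep])
    if z is not None:
        terms = np.asarray(z, float)[keep] * terms
    return float(terms.sum())
```

An estimate of β could then be sought as a root of `restricted_score` in `beta`, for instance by a grid search, since the indicator terms make the function nonsmooth in β. Passing `z` evaluates the corresponding G×E statistic at that estimate.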

SUPPLEMENTARY MATERIAL

Supplementary material is available at http://biostatistics.oxfordjournals.org.

ACKNOWLEDGMENTS

We are very grateful to Jamie Robins for suggesting the potential use of artificial censoring and for many

useful discussions and suggestions. We also thank two anonymous reviewers for insightful comments that

improved the manuscript.

FUNDING

National Institutes of Health [NCRR] (P20RR020145 and 5P20RR016481-10 to D.W.F.); IAP research

network (P06/03 from the Belgian government [Belgian Science Policy] and Ghent University [Multi-

disciplinary Research Partnership “Bioinformatics: from nucleotides to networks”] to S.V.). Conflict of

Interest: None declared.

REFERENCES

CHAN, A. T., TRANAH, G. J., GIOVANNUCCI, E. L., WILLETT, W. C., HUNTER, D. J. AND FUCHS, C. S.

(2005). Prospective study of n-acetyltransferase-2 genotypes, meat intake, smoking and risk of colorectal cancer.

International Journal of Cancer 115, 648–652.

CHANOCK, S. J. AND HUNTER, D. J. (2008). Genomics: when the smoke clears ... . Nature 452, 537–538.

CHATTERJEE, N., KALAYLIOGLU, Z. AND CARROLL, R. J. (2005). Exploiting gene-environment independence

in family-based case-control studies: increased power for detecting associations, interactions and joint effects.

Genetic Epidemiology 28, 138–156.

CHILDHOOD ASTHMA MANAGEMENT PROGRAM. (1999). The childhood asthma management program (CAMP):

design, rationale, and methods. Childhood asthma management program research group. Controlled Clinical Trials

20, 91–120.

CORDELL, H. J., BARRATT, B. J. AND CLAYTON, D. G. (2004). Case/pseudocontrol analysis in genetic associ-

ation studies: A unified framework for detection of genotype and haplotype associations, gene-gene and gene-

environment interactions, and parent-of-origin effects. Genetic Epidemiology 26, 167–185.

DEMEO, D. L., MARIANI, T. J., LANGE, C., SRISUMA, S., LITONJUA, A. A., CELEDON, J. C., LAKE, S. L.,

REILLY, J. J., CHAPMAN, H. A., MECHAM, B. H. and others (2006). The SERPINE2 gene is associated with

chronic obstructive pulmonary disease. American Journal of Human Genetics 78, 253–264.

DUDBRIDGE, F. (2008). Likelihood-based association analysis for nuclear families and unrelated subjects with miss-

ing genotype data. Human Heredity 66, 87–98.

GARCÍA-CLOSAS, M., MALATS, N., SILVERMAN, D., DOSEMECI, M., KOGEVINAS, M., HEIN, D. W., TARDÓN,

A., SERRA, C., CARRATO, A., GARCÍA-CLOSAS, R. and others (2008). NAT2 slow acetylation, GSTM1 null

genotype, and risk of bladder cancer: results from the Spanish Bladder Cancer Study and meta-analyses. Lancet 366,

649–659.

GOLDSTEIN, D. B. (2009). Common genetic variation and human traits. New England Journal of Medicine 360,

1696–1698.

GREENLAND, S., LANES, S. AND JARA, M. (2008). Estimating effects from randomized trials with discontinuations:

the need for intent-to-treat design and g-estimation. Clinical Trials 5, 5–13.

HINDORFF, L. A., JUNKINS, H. A., HALL, P. N., MEHTA, J. P. AND MANOLIO, T. A. A catalog of published

genome-wide association studies. www.genome.gov/gwastudies (Accessed October 25, 2010).

HINES, L. M., STAMPFER, M. J., MA, J., GAZIANO, J. M., RIDKER, P. M., HANKINSON, S. E., SACKS, F.,

RIMM, E. B. AND HUNTER, D. J. (2001). Genetic variation in alcohol dehydrogenase and the beneficial effect of

moderate alcohol consumption on myocardial infarction. New England Journal of Medicine 344, 549–555.

HINNEY, A., LENTES, K. U., ROSENKRANZ, K., BARTH, N., ROTH, H., ZIEGLER, A., HENNIGHAUSEN, K.,

CONERS, H., WURMSER, H., JACOB, K. and others (1997). Beta 3-adrenergic-receptor allele distributions in

children, adolescents and young adults with obesity, underweight or anorexia nervosa. International Journal of

Obesity and Related Metabolic Disorders 21, 224–230.

HOFFMANN, T. J., LANGE, C., VANSTEELANDT, S. AND LAIRD, N. M. (2009). Gene-environment interaction tests

for dichotomous traits in trios and sibships. Genetic Epidemiology 33, 691–699.

HORVATH, S., XU, X., LAKE, S. L., SILVERMAN, E. K., WEISS, S. T. AND LAIRD, N. M. (2004). Family-based

tests for associating haplotypes with general phenotype data: application to asthma genetics. Genetic Epidemiol-

ogy 26, 61–69.

JOFFE, M. M. AND BRENSINGER, C. (2003). Weighting in instrumental variables and g-estimation. Statistics in

Medicine 22, 1285–1303.

JOFFE, M. M., YANG, W. P. AND FELDMAN, H. (2011). G-estimation and artificial censoring: prob-

lems, challenges, and applications. Biometrics. doi: 10.1111/j.1541-0420.2011.01656.x. Available from

http://onlinelibrary.wiley.com/doi/10.1111/j.1541-0420.2011.01656.x/full.

LAIRD, N. M., HORVATH, S. AND XU, X. (2000). Implementing a unified approach to family-based tests of associ-

ation. Genetic Epidemiology 19 (Suppl 1), S36–S42.

LAKE, S. L. AND LAIRD, N. M. (2004). Tests of gene-environment interaction for case-parent triads with general

environmental exposures. Annals of Human Genetics 68, 55–64.

LIU, Y., TRITCHLER, D. AND BULL, S. B. (2002). A unified framework for transmission-disequilibrium test analysis

of discrete and continuous traits. Genetic Epidemiology 22, 26–40.

MANOLIO, T. A., COLLINS, F. S., COX, N. J., GOLDSTEIN, D. B., HINDORFF, L. A., HUNTER, D. J.,

MCCARTHY, M. I., RAMOS, E. M., CARDON, L. R., CHAKRAVARTI, A. and others (2009). Finding the missing

heritability of complex diseases. Nature 461, 747–753.

MCCARTHY, M. I. AND HIRSCHHORN, J. N. (2008). Genome-wide association studies: potential next steps on a

genetic journey. Human Molecular Genetics 17, R156–R165.

MOERKERKE, B., VANSTEELANDT, S. AND LANGE, C. (2010). A doubly robust test for gene-environment interac-

tion in family-based studies of affected offspring. Biostatistics 11, 213–225.

MUELLER, P. W., ROGUS, J. J., CLEARY, P. A., ZHAO, Y., SMILES, A. M., STEFFES, M. W., BUCKSA, J.,

GIBSON, T. B., CORDOVADO, S. K., KROLEWSKI, A. S. and others (2006). Genetics of kidneys in diabetes

(gokind) study: a genetics collection available for identifying genetic susceptibility factors for diabetic nephropa-

thy in type 1 diabetes. Journal of the American Society of Nephrology 17, 1782–1790.

PEARL, J. (1995). Causal diagrams for empirical research. Biometrika 82, 669.

PEARL, J. (2000). Causality: Models, Reasoning and Inference. Cambridge, UK: Cambridge University Press.

RABINOWITZ, D. AND LAIRD, N. (2000). A unified approach to adjusting association tests for population admixture

with arbitrary pedigree structure and arbitrary missing marker information. Human Heredity 50, 211–223.

ROBINS, J. M. (2001). Data, design, and background knowledge in etiologic inference. Epidemiology 12, 313–320.

ROBINS, J. M. (2002). Analytic methods for estimating HIV treatment and cofactor effects. In: Ostrow D. G., and

Kessler R. (editors), Methodological Issues of AIDS Mental Health Research. New York: Plenum Publishing;

1993:213–290.

ROBINS, J. M. (2004). Optimal structural nested models for optimal sequential decisions. In: Lin, D. Y. and Heagerty,

P. (editors), Proceedings of the Second Seattle Symposium on Biostatistics. Analysis of Correlated Data. New York:

Springer, pp. 189–326.

ROBINS, J. M., MARK, S. D. AND NEWEY, W. K. (1992). Estimating exposure effects by modelling the expectation

of exposure conditional on confounders. Biometrics 48, 479–495.

ROBINS, J. M. AND TSIATIS, A. A. (1991). Correcting for non-compliance in randomized trials using rank preserv-

ing structural failure time models. Communications in Statistics-Theory and Methods 20, 2609–2631.

SILVERMAN, E. K. (2006). Progress in chronic obstructive pulmonary disease genetics. Proceedings of the American

Thoracic Society 3, 405–408.

SILVERMAN, E. K., CHAPMAN, H. A., DRAZEN, J. M., WEISS, S. T., ROSNER, B., CAMPBELL, E. J.,

O’DONNELL, W. J., REILLY, J. J., GINNS, L., MENTZER, S. and others (1998). Genetic epidemiology of

severe, early-onset chronic obstructive pulmonary disease. Risk to relatives for airflow obstruction and chronic

bronchitis. American Journal of Respiratory and Critical Care Medicine 157, 1770–1778.

UMBACH, D. M. AND WEINBERG, C. R. (1997). Designing and analysing case-control studies to exploit indepen-

dence of genotype and exposure. Statistics in Medicine 16, 1731–1743.

UMBACH, D. M. AND WEINBERG, C. R. (2000). The use of case-parent triads to study joint effects of genotype and

exposure. American Journal of Human Genetics 66, 251–261.

VANSTEELANDT, S., DEMEO, D. L., LASKY-SU, J., SMOLLER, J. W., MURPHY, A. J., MCQUEEN, M.,

SCHNEITER, K., CELEDON, J. C., WEISS, S. T., SILVERMAN, E. K. and others (2008). Testing and estimating

gene-environment interactions in family-based association studies. Biometrics 64, 458–467.

VANSTEELANDT, S., VANDERWEELE, T. J., TCHETGEN, E. J. AND ROBINS, J. M. (2008). Multiply robust

inference for statistical interactions. Journal of the American Statistical Association 103, 1693–1704.

WANG, Y., YANG, Q. AND RABINOWITZ, D. (2011). Unbiased and locally efficient estimation of genetic effect on

quantitative trait in the presence of population admixture. Biometrics 67, 331–343.

WHITTEMORE, A. S. (2004). Estimating genetic association parameters from family data. Biometrika 91, 219.

YANG, Q., KHOURY, M. J., SUN, F. AND FLANDERS, W. D. (1999). Case-only design to measure gene-gene

interaction. Epidemiology 10, 167–170.

YANG, Q., RABINOWITZ, D., ISASI, C. AND SHEA, S. (2000). Adjusting for confounding due to population admix-

ture when estimating the effect of candidate genes on quantitative traits. Human Heredity 50, 227–233.

ZENG, D. AND LIN, D. Y. (2008). Efficient resampling methods for nonsmooth estimating functions. Biostatistics 9,

355–363.

[Received December 13, 2010; revised September 14, 2011; accepted for publication September 19, 2011]
