Retrospective analysis of haplotype-based case–control studies
under a flexible model for gene–environment association
YI-HAU CHEN,
Institute of Statistical Science, Academia Sinica, Taipei 11529, Taiwan, People’s Republic of China
NILANJAN CHATTERJEE*, and
Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, 6120
Executive Boulevard, EPS 8038, Rockville, MD 20852, USA, chattern@mail.nih.gov
RAYMOND J. CARROLL
Department of Statistics, Texas A&M University, TAMU 3143, College Station, TX 77843-3143, USA
Summary
Genetic epidemiologic studies often involve investigation of the association of a disease with a
genomic region in terms of the underlying haplotypes, that is, the combination of alleles at multiple
loci along homologous chromosomes. In this article, we consider the problem of estimating
haplotype–environment interactions from case–control studies when some of the environmental
exposures themselves may be influenced by genetic susceptibility. We specify the distribution of the
diplotypes (haplotype pair) given environmental exposures for the underlying population based on
a novel semiparametric model that allows haplotypes to be potentially related with environmental
exposures, while allowing the marginal distribution of the diplotypes to maintain certain population
genetics constraints such as Hardy–Weinberg equilibrium. The marginal distribution of the
environmental exposures is allowed to remain completely nonparametric. We develop a
semiparametric estimating equation methodology and related asymptotic theory for estimation of the
disease odds ratios associated with the haplotypes, environmental exposures, and their interactions,
parameters that characterize haplotype–environment associations and the marginal haplotype
frequencies. The problem of phase ambiguity of genotype data is handled using a suitable
expectation–maximization algorithm. We study the finite-sample performance of the proposed
methodology using simulated data. An application of the methodology is illustrated using a case–control
study of colorectal adenoma, designed to investigate how the smoking-related risk of
colorectal adenoma can be modified by “NAT2,” a smoking-metabolism gene that may potentially
influence susceptibility to smoking itself.
Keywords
Case–control studies; EM algorithm; Gene–environment interactions; Haplotype; Semiparametric
methods
1. Introduction
Genetic epidemiologic studies often involve investigation of the association between a disease
and a candidate genomic region of biologic interest. Typically, in such studies, genotype
*To whom correspondence should be addressed.
Conflict of Interest: None declared.
NIH Public Access
Author Manuscript
Biostatistics. Author manuscript; available in PMC 2009 May 17.
Published in final edited form as:
Biostatistics. 2008 January ; 9(1): 81–99. doi:10.1093/biostatistics/kxm011.
information is obtained on multiple loci that are known to harbor genetic variations within the
region of interest. An increasingly popular approach to the analysis of such multilocus genetic
data is haplotype-based regression, where the effect of a genomic region on disease
risk is modeled through “haplotypes,” the combinations of alleles (gene variants) at multiple
loci along individual homologous chromosomes. It is believed that association analysis based
on haplotypes, which can efficiently capture interloci interactions as well as “indirect
association” due to “linkage disequilibrium” of the haplotypes with unobserved causal variant(s),
can be more powerful than more traditional locus-by-locus methods (Schaid, 2004).
A technical problem for haplotype-based regression analysis is that in traditional epidemiologic
studies, the haplotype information for the study subjects is not directly observable. Instead,
locus-specific genotype data are observed, which contain information on the pair of alleles a
subject carries on his/her pair of homologous chromosomes at each of the individual loci but
do not provide the “phase information,” that is, which combinations of alleles appear across
multiple loci along the individual chromosomes. In general, the genotype data of a subject will
be phase ambiguous whenever the subject is heterozygous at 2 or more loci. Statistically, the
lack of phase information can be viewed as a special missing data problem.
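As a small illustration of phase ambiguity (not part of the original analysis), the following Python sketch enumerates the haplotype pairs compatible with an unphased genotype. It assumes biallelic loci coded as the count (0, 1, or 2) of the "1" allele; the function name and the coding are illustrative choices.

```python
from itertools import product

def consistent_diplotypes(genotype):
    """Return the unordered haplotype pairs compatible with an unphased genotype.

    `genotype` holds, for each locus, the count (0, 1 or 2) of the "1" allele.
    """
    # Per locus, list the possible (chromosome 1 allele, chromosome 2 allele) pairs.
    per_locus = []
    for g in genotype:
        if g == 0:
            per_locus.append([(0, 0)])
        elif g == 2:
            per_locus.append([(1, 1)])
        else:  # heterozygous: either chromosome may carry the "1" allele
            per_locus.append([(0, 1), (1, 0)])
    pairs = set()
    for combo in product(*per_locus):
        h1 = "".join(str(a) for a, _ in combo)
        h2 = "".join(str(b) for _, b in combo)
        pairs.add(tuple(sorted((h1, h2))))  # the chromosome order is arbitrary
    return sorted(pairs)

print(consistent_diplotypes([1, 0, 2]))  # one heterozygous locus: phase is determined
print(consistent_diplotypes([1, 1, 0]))  # two heterozygous loci: two candidate pairs
```

With k ≥ 1 heterozygous loci, the genotype is compatible with 2^(k−1) unordered haplotype pairs, so the ambiguity appears only when k ≥ 2.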
Recently, a variety of methods have been developed for haplotype-based analysis of case–control
data using the logistic regression model (Zhao and others, 2003; Lake and others,
2003; Epstein and Satten, 2003; Satten and Epstein, 2004; Spinka and others, 2005; Lin and
Zeng, 2006; Chatterjee and others, 2006). Two classes of methods, namely, “prospective” and
“retrospective” have evolved. Prospective methods ignore the retrospective nature of the case–
control design. In the classical setting, without any missing data, justification of prospective
analysis of case–control data relies on the well-known result about the equivalence of
prospective and retrospective likelihoods under a semiparametric model that allows the
distribution of the underlying covariates to remain completely nonparametric (Andersen,
1970; Prentice and Pyke, 1979). Even with missing data, the equivalence of the prospective
and retrospective likelihood may hold, provided the covariate distribution is allowed to remain
unrestricted (Roeder and others, 1996). For haplotype-based genetic analysis, however,
complete nonparametric treatment of the covariates, including haplotypes, may not be possible
due to intrinsic identifiability issues for the phase-ambiguous genotype data (Epstein and
Satten, 2003). Thus, in this setting, the proper retrospective analysis of case–control data
requires special attention.
An attractive feature of the retrospective likelihood is that it can enhance efficiency of case–control
analysis by directly incorporating certain types of covariate distributional constraints
that are natural for genetic epidemiologic studies. The assumptions of Hardy–Weinberg
equilibrium (HWE) and gene–environment independence are 2 prime examples of such
constraints. The HWE model, which specifies simple relationships between “allele” and
“genotype” frequencies at a given chromosomal locus or between haplotype and diplotype
(pair of haplotypes on homologous chromosomes) frequencies across multiple loci, is a natural
law for a random mating large stable population. Often, it is also natural to assume that a
subject’s genetic susceptibility, a factor which is determined at birth, is independent of his/her
subsequent environmental exposures. However, if these assumptions are violated in some
situations, then retrospective methods can produce serious bias in odds ratio estimates (see,
e.g. Satten and Epstein, 2004; Chatterjee and Carroll, 2005; Spinka and others, 2005). Thus,
there is a need for alternative flexible models for specifying the joint distribution of genetic
and environmental covariates that could be used to assess the sensitivity of the retrospective
methods to underlying assumptions as well as to develop alternative robust methods.
Both Satten and Epstein (2004) and Lin and Zeng (2006) have described retrospective
maximum likelihood analysis of case–control data under flexible population genetics models
that can relax the HWE assumption. Moreover, Lin and Zeng considered a model that allows
the conditional distribution of environmental exposure given unphased genotypes to remain
completely nonparametric, but they assumed conditional independence between haplotypes
and the environmental factors given the unphased genotypes. If, however, haplotypes are the
underlying biologic units through which the mechanism of action of a gene is determined, then it is more
natural to allow for direct association between haplotypes and environmental exposures.
Moreover, if such association could exist, then quantifying the association between haplotypes
and certain types of environmental exposures, such as lifestyle and behavioral factors, would
be of scientific interest.
In this article, we propose methods for retrospective analysis of case–control data using a novel
model for the gene–environment distribution that can account for direct association between
haplotypes and environmental exposures. The model is developed in Section 2. We assume a
standard logistic regression model to specify the disease risk conditional on diplotypes and
environmental exposures. In addition, we assume a polytomous logistic regression model for
specifying the population distribution of the diplotypes conditional on the environmental
exposures, with the intercept parameters of the model specified in such a way that the
“marginal” distribution of the diplotypes can follow certain population genetic constraints such
as HWE. Moreover, by exploiting the equivalence of prospective and retrospective odds ratios
under the polytomous regression model, we further incorporate certain constraints on the
diplotype–exposure odds ratio parameters that could reflect specific “mode of effects” for the
haplotypes. We allow the marginal distribution of the environmental exposure to remain
completely nonparametric.
Under the proposed modeling framework, we then describe in Section 3 a “semiparametric”
estimating equation method for inference about the finite-dimensional parameters of interest,
namely the disease odds ratios, haplotype frequencies, and haplotype–exposure odds ratios.
We develop a suitable expectation–maximization (EM) algorithm to account for the phase
ambiguity problem. We study asymptotic theory of the proposed estimator under the
underlying semiparametric setting.
In Section 4, we assess the finite-sample performance of the proposed estimator based on case–control
data that were simulated utilizing haplotype patterns and frequencies obtained from a
real study. In Section 5, we apply the proposed methodology to a case–control study of
colorectal adenoma to investigate whether certain haplotypes in the smoking metabolism gene,
NAT2, could modify smoking-related risk of colorectal adenoma and whether the same
haplotypes could influence an individual’s susceptibility to smoking as well. Section 6 contains
concluding remarks. All technical details are in an appendix. A SAS macro is available from
the Web site http://www.stat.sinica.edu.tw/yhchen/download.htm.
2. Notations and proposed model
For haplotype-based studies, the underlying genetic covariate for a subject is defined by
“diplotypes,” that is, the 2 haplotypes the individual carries in his/her pair of homologous
chromosomes, where each haplotype is the combination of alleles at the loci of interest along
an individual chromosome. Following the notation developed in Spinka and others (2005), let
the diplotype status for a subject be $H^{di} = (H_1, H_2)$, where $H_1$ and $H_2$ denote the constituent
haplotypes. We assume that there are $J$ possible haplotypes indexed by $h_j$ for $j = 1, \ldots, J$. The
diplotypes are then indexed by $h^{di}_{j_1 j_2} = (h_{j_1}, h_{j_2})$, $j_1 = 1, \ldots, J$, $j_2 = 1, \ldots, J$. The diplotype data,
however, is not directly observable. Instead, for each subject, the multilocus genotype data
G is observed, which contains information on the pair of alleles the individual carries at each
individual locus but does not provide the phase information, that is which combination of alleles
appears along each of the individual chromosomes. Thus, the same genotype data G could be
consistent with multiple diplotypes. We will denote by $\mathcal{S}(G)$ the set of all possible diplotypes
that are consistent with the genotype data G.
Given the diplotype data $H^{di}$ and a set of environmental covariates X, we assume that the risk
of the disease is given by the logistic regression model
$$\operatorname{logit}\{\mathrm{pr}(D = 1 \mid H^{di}, X)\} = \beta_0 + m(H^{di}, X; \beta_1) \tag{2.1}$$
for some known function m(·, β1). Often one further imposes structural assumptions on the
odds ratio parameters β1 by modeling the effect of the diplotypes through constituent
haplotypes according to a “dominant,” “additive,” or “recessive” mode of effect (Wallenstein
and others, 1998). For example, a logistic regression model which assumes an additive effect
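for each copy of a haplotype corresponds to
$$m\{H^{di} = (h_{j_1}, h_{j_2}), X; \beta_1\} = \beta_X X + \sum_{k=1}^{2} \left(\beta_{h_{j_k}} + \beta_{h_{j_k} X}\, X\right) \tag{2.2}$$
where $\beta_X$ is the main effect of X, $\beta_{h_{j_k}}$ is the main effect of haplotype $h_{j_k}$, k = 1, 2, and $\beta_{h_{j_k} X}$ is
the interaction effect of X with haplotype $h_{j_k}$, k = 1, 2. Such modeling may be necessary due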
to identifiability considerations (Epstein and Satten, 2003) and is desirable when the effects of
the haplotypes themselves are of direct scientific interest.
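To make the additive mode of effect concrete, the sketch below evaluates the linear predictor and the implied disease risk for a diplotype, adding one main-effect term and one interaction term per haplotype copy. It is only an illustration: the function names and the dictionary-based parameter layout are assumptions, and the numerical values are borrowed from the simulation setting of Section 4.1, where a single haplotype ("01100") carries nonzero effects.

```python
import math

def additive_logit(h1, h2, x, beta0, beta_x, beta_h, beta_hx):
    """Linear predictor of an additive haplotype model for the diplotype (h1, h2)."""
    eta = beta0 + beta_x * x
    for h in (h1, h2):  # one contribution per haplotype copy
        eta += beta_h.get(h, 0.0) + beta_hx.get(h, 0.0) * x
    return eta

def disease_risk(h1, h2, x, **params):
    """Logistic disease probability implied by the additive linear predictor."""
    return 1.0 / (1.0 + math.exp(-additive_logit(h1, h2, x, **params)))

# Haplotypes absent from the dictionaries have zero effect.
params = dict(beta0=-3.0, beta_x=0.1, beta_h={"01100": 0.2}, beta_hx={"01100": 0.3})
print(disease_risk("01100", "10011", x=1.0, **params))  # one copy of "01100"
print(disease_risk("01100", "01100", x=1.0, **params))  # two copies
```

A dominant or recessive model would replace the per-copy sum by an indicator of carrying at least one copy, or exactly two copies, of the haplotype.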
Unlike Spinka and others (2005), who assumed independence of $H^{di}$ and X, we assume a
general polytomous logistic regression for the conditional distribution of $H^{di}$ given X:
$$\log \frac{\mathrm{pr}(H^{di} = h^{di}_{j_1 j_2} \mid X = x)}{\mathrm{pr}(H^{di} = h^{di}_{j_1^0 j_2^0} \mid X = x)} = \gamma_{0 j_1 j_2} + \gamma_{1 j_1 j_2}^{T}\, x \tag{2.3}$$
where $h^{di}_{j_1^0 j_2^0}$ is a chosen reference diplotype. Observe that model (2.3) allows association
between $H^{di}$ and X through the regression parameters $\gamma_{1 j_1 j_2}$. Let γ0 and γ1 denote the vectorized
forms for the parameters $\gamma_{0 j_1 j_2}$ and $\gamma_{1 j_1 j_2}$. Let $q_{hap}(h^{di} \mid x, \gamma_0, \gamma_1)$ denote $\mathrm{pr}(H^{di} = h^{di} \mid X = x)$ as
defined by model (2.3). We allow the marginal distribution of X, denoted by F(x), to remain
completely unspecified. If $H^{di}$ were directly observable, then, in principle, no further
assumptions would be necessary, and one could estimate γ0 and γ1 together with the odds ratio
parameters of the disease risk using the profile likelihood approach developed by Chatterjee
and Carroll (2005). In the presence of phase ambiguity, however, the diplotypes are not
directly observable, and further constraints on the parameters γ0 and γ1 are needed for the purpose
of identifiability. In the following, we show how certain natural genetic models can be used to
impose these constraints.
Given that genetic susceptibility may influence environmental exposures and not vice versa,
for causal interpretation of parameters it is more natural to consider a model for the
environmental exposures given the diplotypes. However, because the odds ratios associated with the
distributions [X | H] and [H | X] are the same, the parameters in γ1 can be interpreted as
measures of “diplotype effects” on the distribution of exposure. Thus, it is natural to specify
the γ1 parameters according to a certain mode of effect of the underlying haplotypes. For
example, assuming an additive effect for the haplotypes, one can write $\gamma_{1 j_1 j_2} = \gamma_{1, j_1} + \gamma_{1, j_2}$,
which allows the diplotype effects to be determined by a reduced set of “haplotype effect”
parameters $\gamma_{1, j}$; in this case, γ1 would denote the vectorized form for the parameters $\gamma_{1, j}$.
Similarly, other commonly used models, such as dominant or recessive models, could be used
to impose natural constraints on the γ1 parameters in model (2.3). We also observe that the
parametric model (2.3), combined with the nonparametric distribution F(x), imposes a
semiparametric model on the distribution of [X | H] with a density
$$dF(x \mid H^{di} = h^{di}) = \frac{q_{hap}(h^{di} \mid x, \gamma_0, \gamma_1)\, dF(x)}{\int q_{hap}(h^{di} \mid u, \gamma_0, \gamma_1)\, dF(u)}.$$
This class of semiparametric models includes the parametric submodel where $X \mid H^{di} = h^{di}$
follows a multivariate normal distribution with mean $\mu_{h^{di}}$ and common variance–covariance
matrix Σ. In this case, it is easy to see that
$$\gamma_{1 j_1 j_2} = \Sigma^{-1}\left(\mu_{h^{di}_{j_1 j_2}} - \mu_{\mathrm{ref}}\right),$$
with $\mu_{\mathrm{ref}}$ the mean of X for the reference diplotype, which is a measure of the
shift in the mean of the distribution of X due to differences in the diplotypes.
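The multivariate normal case can be verified directly from Bayes' rule. In the short derivation below, $h$ is a generic diplotype, $h_0$ the reference diplotype, and $\phi(x; \mu, \Sigma)$ the multivariate normal density; the subscripted symbols $\gamma_{0h}$ and $\gamma_{1h}$ are shorthand introduced here for the intercept and slope of model (2.3).

```latex
\begin{align*}
\log\frac{\mathrm{pr}(H^{di}=h \mid X=x)}{\mathrm{pr}(H^{di}=h_0 \mid X=x)}
 &= \log\frac{\mathrm{pr}(H^{di}=h)}{\mathrm{pr}(H^{di}=h_0)}
    + \log\frac{\phi(x;\mu_h,\Sigma)}{\phi(x;\mu_{h_0},\Sigma)} \\
 &= \underbrace{\log\frac{\mathrm{pr}(H^{di}=h)}{\mathrm{pr}(H^{di}=h_0)}
    - \tfrac12\left(\mu_h^{T}\Sigma^{-1}\mu_h - \mu_{h_0}^{T}\Sigma^{-1}\mu_{h_0}\right)}_{\gamma_{0h}}
    + \underbrace{(\mu_h-\mu_{h_0})^{T}\Sigma^{-1}}_{\gamma_{1h}^{T}}\,x .
\end{align*}
```

The quadratic term in $x$ cancels because the covariance matrix is common to all diplotypes, which is why the log odds remain linear in $x$ and the slope depends only on the mean shift.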
The parameter γ0 in model (2.3) defines the population diplotype frequencies for a baseline
value of the exposure X. It is common to use population genetics models, such as HWE, to
specify a relationship between diplotype and haplotype frequencies. However, observe that if
the diplotypes can influence certain environmental exposures, then the frequencies of the
diplotypes within exposure categories may not follow the HWE constraints although the
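underlying population, as a whole, may be in HWE. Thus, the population-level marginal
haplotype-pair distribution is assumed to follow HWE and is characterized by the parameters
θ = (θ2, …, θJ) so that
$$\log \frac{\mathrm{pr}\{H^{di} = (h_{j_1}, h_{j_2})\}}{\mathrm{pr}\{H^{di} = (h_1, h_1)\}} = \theta_{j_1} + \theta_{j_2} \tag{2.4}$$
where $h_1$ denotes the chosen reference haplotype and $\theta_1 = 0$. Let $q_{HWE}(h^{di}; \theta) = \mathrm{pr}(H^{di} = h^{di})$
be the marginal frequency for the diplotype $h^{di}$. In the proposed model, γ0 is then defined
as an implicit function of γ1, θ, and F(x) through the relationship
$$q_{HWE}(h^{di}; \theta) = \int q_{hap}(h^{di} \mid x, \gamma_0, \gamma_1)\, dF(x). \tag{2.5}$$
Note that F is left unspecified, and hence the proposed model is semiparametric.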
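The implicit definition of γ0 through (2.5) can be illustrated numerically. The sketch below is not the authors' implementation: it replaces F by an empirical sample of a scalar exposure, treats diplotypes as ordered haplotype pairs, and finds γ0 by a simple fixed-point iteration (a form of iterative proportional fitting) chosen here for brevity. The function names and the two-haplotype example are hypothetical.

```python
import numpy as np

def cond_probs(x, gamma0, gamma1):
    """q_hap(h | x): softmax of gamma0[h] + gamma1[h] * x, one row per x."""
    eta = gamma0[None, :] + np.outer(x, gamma1)
    eta -= eta.max(axis=1, keepdims=True)        # guard against overflow
    p = np.exp(eta)
    return p / p.sum(axis=1, keepdims=True)

def solve_gamma0(x, gamma1, q_target, n_iter=500):
    """Intercepts making the x-averaged q_hap equal to q_target, cf. (2.5)."""
    gamma0 = np.log(q_target)                    # exact solution when gamma1 == 0
    for _ in range(n_iter):
        marginal = cond_probs(x, gamma0, gamma1).mean(axis=0)
        gamma0 = gamma0 + np.log(q_target) - np.log(marginal)
    return gamma0 - gamma0[0]                    # fix the reference intercept at 0

# Hypothetical example: 2 haplotypes, ordered diplotypes (1,1), (1,2), (2,1), (2,2).
theta = np.array([0.0, -1.0])                    # theta_1 = 0 for the reference haplotype
hap_freq = np.exp(theta) / np.exp(theta).sum()
q_hwe = np.outer(hap_freq, hap_freq).ravel()     # HWE diplotype frequencies
g1_hap = np.array([0.0, 0.4])                    # additive haplotype-exposure effects
gamma1 = (g1_hap[:, None] + g1_hap[None, :]).ravel()
x = np.random.default_rng(0).standard_normal(2000)
gamma0 = solve_gamma0(x, gamma1, q_hwe)
print(np.round(cond_probs(x, gamma0, gamma1).mean(axis=0) - q_hwe, 10))
```

Within each exposure level the diplotype frequencies produced by `cond_probs` generally depart from HWE, while their average over the exposure distribution reproduces `q_hwe`, which is the point made in the paragraph above.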
3. Semiparametric estimating equation inference
3.1 Estimation with known haplotypes
In what follows, where there can be no confusion, we will write h for hdi.
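Let ℋ(x) = exp(x)/{1 + exp(x)} be the logistic distribution function. Write the risk model
probability as
$$\mathrm{pr}(D = d \mid H^{di} = h, X = x) = \left[\mathcal{H}\{\beta_0 + m(h, x; \beta_1)\}\right]^{d} \left[1 - \mathcal{H}\{\beta_0 + m(h, x; \beta_1)\}\right]^{1-d}, \quad d = 0, 1.$$
Recall that $q_{hap}(h \mid X, \gamma_0, \gamma_1) = \mathrm{pr}(H^{di} = h \mid X; \gamma_0, \gamma_1)$ is the conditional model of $H^{di}$ given X that
is specified as in (2.3).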
To start with, consider the ideal case that the phase information is known so that Hdi is observed.
Since F is treated nonparametrically, assume that F is discrete and has mass δk at xk, k = 1, …,
K, where {x1, …, xK} are the distinct values of X that are observed in the case–control sample.
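Let $n_{dkh}$ be the number of subjects in the sample with ($D = d$, $X = x_k$, $H^{di} = h$). Ignoring the
dependence of γ0 on F tentatively, the log-likelihood of the case–control data can then be
written as
$$l = \sum_{d}\sum_{k}\sum_{h} n_{dkh} \log \frac{\mathrm{pr}(D = d \mid h, x_k)\, q_{hap}(h \mid x_k, \gamma_0, \gamma_1)\, \delta_k}{\sum_{k'}\sum_{h'} \mathrm{pr}(D = d \mid h', x_{k'})\, q_{hap}(h' \mid x_{k'}, \gamma_0, \gamma_1)\, \delta_{k'}}.$$
Maximizing l with respect to δ for fixed values of ω = (β, γ0, γ1) then leads to
$$\delta_k \propto \frac{n_{+k+}}{\sum_{d}\sum_{h} S(d, h, x_k, \mathcal{B}, \gamma_0, \gamma_1)}, \qquad \sum_{k} \delta_k = 1, \tag{3.1}$$
and the profile log-likelihood
$$\mathcal{L}(\mathcal{B}, \gamma_0, \gamma_1) = \sum_{d}\sum_{k}\sum_{h} n_{dkh} \log \frac{S(d, h, x_k, \mathcal{B}, \gamma_0, \gamma_1)}{\sum_{d'}\sum_{h'} S(d', h', x_k, \mathcal{B}, \gamma_0, \gamma_1)}, \tag{3.2}$$
where
$$S(d, h, x, \mathcal{B}, \gamma_0, \gamma_1) = \frac{\exp[d\{\kappa + m(h, x; \beta_1)\}]}{1 + \exp\{\beta_0 + m(h, x; \beta_1)\}}\, q_{hap}(h \mid x, \gamma_0, \gamma_1)$$
and
$$n_{+k+} = \sum_{d}\sum_{h} n_{dkh},$$
with ℬ = (β0, β1, κ)^T and κ = β0 + log(n1/n0) − log{pr(D = 1)/pr(D = 0)}. The calculation is
similar to that in Chatterjee and Carroll (2005).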
As noted by Chatterjee and Carroll (2005), the parameter β0 is separable from κ and hence is
theoretically identifiable. In practice, however, there is usually little information about β0
available in the observed data, and hence the information matrix is nearly singular. One way
to bypass this problem is to use external information on the disease prevalence pr(D = 1), while
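another way is to use the rare-disease approximation when the disease is rare. The estimation
method described below can be applied in both cases, pr(D = 1) known or the
rare-disease approximation, with suitable definitions of ℬ and S(d, h, x, ℬ, γ0,
γ1). When pr(D = 1) is known, κ depends on β0 only, hence here we define ℬ = (β0, β1)^T. When
the disease is rare so that
$$\mathrm{pr}(D = 0 \mid H^{di} = h, X = x) = \left[1 + \exp\{\beta_0 + m(h, x; \beta_1)\}\right]^{-1} \approx 1 \tag{3.3}$$
for all h and x, we have
$$S(d, h, x, \mathcal{B}, \gamma_0, \gamma_1) \approx \exp[d\{\kappa + m(h, x; \beta_1)\}]\, q_{hap}(h \mid x, \gamma_0, \gamma_1).$$
Note that β0 does not appear in this expression, and hence we define ℬ = (κ, β1)^T.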
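Our goal is to estimate the parameters (ℬ, θ, γ1) based on the profile log-likelihood (3.2), where
γ0 is defined as an implicit function of (θ, γ1, F) through (2.5), and we write $\gamma_0 = \mathcal{G}(\theta, \gamma_1, F)$.
Let Ω = (ℬ, γ1, γ0), $\Omega^* = \{\mathcal{B}, \gamma_1, \mathcal{G}(\theta, \gamma_1, F)\}$, and Φ = (ℬ, γ1, θ). Let $\mathcal{L}_\Omega(\cdot)$ and $\mathcal{L}_\Phi(\cdot)$ be,
respectively, the derivatives of ℒ(·) with respect to Ω and Φ, and $\mathcal{G}_\theta$ and $\mathcal{G}_{\gamma_1}$ the derivatives
of $\mathcal{G}(\cdot)$ with respect to θ and γ1. By the chain rule, we then have
$$\mathcal{L}_\Phi = \left(\frac{\partial \Omega^*}{\partial \Phi^{T}}\right)^{T} \mathcal{L}_\Omega(\Omega^*), \quad\text{that is,}\quad \mathcal{L}_{\mathcal{B}}(\Phi) = \mathcal{L}_{\mathcal{B}}(\Omega^*),\;\; \mathcal{L}_{\gamma_1}(\Phi) = \mathcal{L}_{\gamma_1}(\Omega^*) + \mathcal{G}_{\gamma_1}^{T} \mathcal{L}_{\gamma_0}(\Omega^*),\;\; \mathcal{L}_{\theta}(\Phi) = \mathcal{G}_{\theta}^{T} \mathcal{L}_{\gamma_0}(\Omega^*).$$
Explicit expressions for $\mathcal{G}_{\gamma_1}$ and $\mathcal{G}_\theta$ are given in Appendix C. Also, the information matrix is
given by
$$\mathcal{I}_{\Phi\Phi} = \left(\frac{\partial \Omega^*}{\partial \Phi^{T}}\right)^{T} \mathcal{I}_{\Omega\Omega} \left(\frac{\partial \Omega^*}{\partial \Phi^{T}}\right),$$
where ℐΩΩ = E(−ℒΩΩ), with ℒΩΩ the second derivative of ℒ with respect to Ω; note that the
terms involving second derivatives of Ω* do not appear in the information matrix because
E(ℒΩ) = 0, which is a direct consequence of Lemmas A.1 and A.2 in Appendix A. We
propose to obtain the estimate of Φ by solving the estimating equation
$$\mathcal{L}_\Phi\{\mathcal{B}, \gamma_1, \mathcal{G}(\theta, \gamma_1, \hat{F})\} = 0, \tag{3.4}$$
where we have substituted an estimate F̂ for F in $\mathcal{G}(\cdot)$; that is, for each fixed value of (θ, γ1),
we solve γ0 from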
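$$q_{HWE}(h; \theta) = \int q_{hap}(h \mid x, \gamma_0, \gamma_1)\, d\hat{F}(x). \tag{3.5}$$
One convenient choice of F̂ is the empirical estimate F̂emp, which is given by
$$\hat{F}_{emp}(x) = \mathrm{pr}(D = 1)\, \hat{F}_{emp,1}(x) + \mathrm{pr}(D = 0)\, \hat{F}_{emp,0}(x)$$
for the case where pr(D = 1) is known, where F̂emp,1(x) and F̂emp,0(x) are the empirical
distributions of X in the case and control samples, and is given by F̂emp(x) = F̂emp,0(x) for the
case where the rare-disease assumption can be made. An alternative choice of F̂(x) would be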
the profile likelihood estimate (3.1). Numerical calculations not given here show that the latter
choice requires more computational efforts while yielding results very similar to those given
by the empirical estimate F̂emp.
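In computational terms, the empirical estimate amounts to attaching a probability mass to each sampled subject's exposure value. The following sketch computes these masses for both settings; the function name and argument layout are illustrative assumptions rather than part of the authors' software.

```python
import numpy as np

def empirical_f_weights(d, prevalence=None):
    """Mass that the empirical estimate of F places on each subject's exposure.

    d          : array of case (1) / control (0) indicators
    prevalence : pr(D = 1) if known; None selects the rare-disease version
    """
    d = np.asarray(d)
    n1, n0 = int((d == 1).sum()), int((d == 0).sum())
    if prevalence is None:
        # Rare disease: the population looks like the controls.
        return np.where(d == 0, 1.0 / n0, 0.0)
    # Known prevalence: reweight cases and controls to population proportions.
    return np.where(d == 1, prevalence / n1, (1.0 - prevalence) / n0)

d = np.array([1, 1, 0, 0, 0])
print(empirical_f_weights(d, prevalence=0.05))  # cases share 0.05, controls share 0.95
print(empirical_f_weights(d))                   # controls only
```

Integrals with respect to the estimated exposure distribution, such as the right-hand side of (3.5), then reduce to weighted sums over the sampled exposure values.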
3.2 Estimation with ambiguous haplotype data
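Now, we turn to the more practical case where the haplotype data cannot be observed directly
and must be inferred from the unphased genotype data, that is, the haplotype information may
be subject to ambiguity. In this case, we apply an EM-like algorithm to the “complete data”
estimating equation (3.4). Let $G_i$ denote the observed unphased genotype of subject i and
$\mathcal{S}(G_i)$ the set of diplotypes that are consistent with $G_i$. When only $G_i$ instead of $H^{di}_i$ is observed
for each subject, we propose to obtain the estimate Φ̂ for Φ = (ℬ, γ1, θ) as the solution of the
weighted version of (3.4):
$$\sum_{i=1}^{n} \sum_{h \in \mathcal{S}(G_i)} \hat{\omega}_i(h)\, \mathcal{L}_\Phi(D_i, h, X_i; \mathcal{B}, \gamma_1, \hat{\gamma}_0) = 0, \tag{3.6}$$
where $\mathcal{L}_\Phi(D_i, h, X_i; \cdot)$ denotes the contribution of subject i to $\mathcal{L}_\Phi$ when $H^{di}_i = h$ and,
using the shorthand notation $\hat{\gamma}_0 = \mathcal{G}(\theta, \gamma_1, \hat{F}_{emp})$ for the solution of (3.5) with F̂ = F̂emp, the weights are given by
$$\hat{\omega}_i(h) = \frac{S(D_i, h, X_i, \mathcal{B}, \hat{\gamma}_0, \gamma_1)}{\sum_{h' \in \mathcal{S}(G_i)} S(D_i, h', X_i, \mathcal{B}, \hat{\gamma}_0, \gamma_1)}, \quad h \in \mathcal{S}(G_i). \tag{3.7}$$
The limiting version of the weights is given as
$$\omega_i(h) = \frac{S(D_i, h, X_i, \mathcal{B}, \gamma_0, \gamma_1)}{\sum_{h' \in \mathcal{S}(G_i)} S(D_i, h', X_i, \mathcal{B}, \gamma_0, \gamma_1)}, \quad \gamma_0 = \mathcal{G}(\theta, \gamma_1, F). \tag{3.8}$$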
Solving the estimating equation (3.6) can be implemented simply by an EM-like algorithm as
follows: starting with an initial value for Φ and hence an initial value for γ0, we
i. calculate the weights {ω̂i} from (3.7) and
ii. solve (3.6) to obtain an updated estimate of Φ using the weights {ω̂i} given in (i);
note that within this step we also need to solve (3.5) to obtain an updated value of γ0.
The algorithm is iterated between the 2 steps until convergence. Note that the weights {ω̂i} are
only used in solving Φ from (3.6) and are not required in solving γ0 from (3.5).
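The alternation between computing weights and re-solving a weighted equation can be illustrated with a deliberately simplified relative of the algorithm: the classical EM for haplotype frequencies under HWE, with no disease model and no exposure. In this sketch, which is not the estimator proposed here, the weights depend only on the current haplotype frequencies (whereas (3.7) also involves the disease model and the exposure), and step (ii) has a closed form instead of requiring the solution of (3.6). Genotypes are coded as per-locus counts of the "1" allele; all names are illustrative.

```python
from collections import Counter
from itertools import product

def compatible_pairs(genotype):
    """Ordered haplotype pairs compatible with per-locus counts of the "1" allele."""
    options = [[(0, 0)] if g == 0 else [(1, 1)] if g == 2 else [(0, 1), (1, 0)]
               for g in genotype]
    pairs = []
    for combo in product(*options):
        h1 = "".join(str(a) for a, _ in combo)
        h2 = "".join(str(b) for _, b in combo)
        pairs.append((h1, h2))
    return pairs

def em_haplotype_freqs(genotypes, n_iter=200):
    """EM estimate of haplotype frequencies from unphased genotypes under HWE."""
    haps = sorted({h for g in genotypes for pair in compatible_pairs(g) for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}       # uniform starting value
    for _ in range(n_iter):
        counts = Counter()
        for g in genotypes:
            pairs = compatible_pairs(g)
            # Step (i): weight of each compatible pair given current frequencies.
            w = [freq[a] * freq[b] for a, b in pairs]
            total = sum(w)
            for (a, b), wi in zip(pairs, w):
                counts[a] += wi / total
                counts[b] += wi / total
        # Step (ii): weighted haplotype counts divided by the 2n chromosomes.
        freq = {h: counts[h] / (2 * len(genotypes)) for h in haps}
    return freq

genotypes = [(0, 0)] * 4 + [(2, 2)] * 2 + [(1, 1)] * 2 + [(1, 0)]
print(em_haplotype_freqs(genotypes))
```

Only the doubly heterozygous genotype (1, 1) is phase ambiguous in this example, so it is the only one whose weights change across iterations.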
3.3 Asymptotic theory
Make the following series of definitions. Expectations denoted as Ecc(·) are taken under the
case–control sampling design, that is, for any random vector Y,
$$E_{cc}(Y) = \sum_{d=0}^{1} \mu_d\, E(Y \mid D = d),$$
where $\mu_d = \operatorname{plim} n_d/n$, d = 0, 1. Then, define
(3.9)
Note that the second derivative of Ω* does not appear in ℐ̄ since E(ℒ̄Φ) = E(ℒΦ) = 0, and the
last identity in (3.9) is given by Lemma A.3 in Appendix A.
Define p̂emp(Di) to be the mass of F̂emp(Xi), which is equal to $\pi_{D_i}/n_{D_i}$ if $\pi_d = \mathrm{pr}(D = d)$
is known and is equal to I(Di = 0)/n0 when the rare-disease approximation is used. Let
$q_{hap}(X, \gamma_1, \gamma_0) = \{q_{hap}(h \mid X, \gamma_1, \gamma_0)\}$ be the vector collection over h of $q_{hap}(h \mid X, \gamma_1, \gamma_0)$ for all
diplotypes except the reference diplotype, and let $q_{HWE}(\theta)$ be defined similarly. Define
where ℐΩγ0 is the obvious submatrix of ℐΩΩ.
Theorem 3.1—Let
Suppose that Ecc{ℰ̄(·) ℰ̄^T(·)} exists and the matrix ℐ̄ is invertible. Then, n^{1/2}(Φ̂ − Φ) is
asymptotically normal with mean zero and covariance matrix
(3.10)
Remark 3.2—The asymptotic variance Γ̄ can be readily estimated by replacing each
component matrix with its empirical counterpart. Lemma A.3 gives useful expressions to
facilitate this computation.
Remark 3.3—In our numerical experiments, the estimated covariance based on formula
(3.10) is very close to that based on the “naive” covariance estimate obtained by naively treating
the estimating equation (3.6) as a genuine score equation; namely, treating the F̂emp plugged
into the implicit function for γ0 as the true covariate distribution F. In this case, by applying Proposition 1(ii) in
Chatterjee and Carroll (2005), the naive covariance estimate can be obtained simply as the
empirical counterpart of the matrix ℐ̄^{−1} − ℐ̄^{−1}Ψℐ̄^{−1}, where
. Whether this naive estimate performs well in
general is unknown, and we suggest using the estimate based on (3.10).
4. Simulations
4.1 Finite-sample performance under correct model
In this section, we study the finite-sample performance of the proposed estimator using
simulated data generated under the proposed modeling framework. We simulated haplotypes
following published data (Epstein and Satten, 2003) on haplotype patterns and frequencies for
5 single-nucleotide polymorphisms (SNPs) in a putative susceptibility gene for diabetes (see
Table 1). The simulations involved a single environmental covariate X, assumed to follow a
standard normal distribution in the population. Given X, the diplotypes (haplotype pair) for an
individual were generated from a polytomous logistic regression of the form (2.3), where the
diplotype-specific odds ratios were further specified according to an additive model of the form
$\gamma_{1 j_1 j_2} = \gamma_{1, j_1} + \gamma_{1, j_2}$, where $j_1$ and $j_2$ denote the indices of the 14 haplotypes shown in Table 1. We
assume $\gamma_{1,4} = \gamma_{1,5} = -0.4$ and $\gamma_{1,12} = 0.4$, and all the other $\gamma_{1, j} = 0$. The parameters $\gamma_{0 j_1 j_2}$ in
model (2.3) are then specified in such a way that the marginal diplotype distribution follows
HWE with haplotype frequencies given in Table 1.
For generating disease outcome, we chose the haplotype “01100” (j = 5) to be causal and used
the logistic model
$$\operatorname{logit}\{\mathrm{pr}(D = 1 \mid H^{di}, X)\} = \beta_0 + \beta_H Z_{(5)} + \beta_X X + \beta_{HX} Z_{(5)} X,$$
where $Z_{(5)}$ denotes the number of copies of the causal haplotype contained in $H^{di}$. The true
value of the parameter vector (β0, βH, βX, βH X) was set to (−3.0, 0.2, 0.1, 0.3). A case–control
sample with 600 controls and 600 cases was then sampled. The results were based upon 1000
simulated data sets.
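The data-generating steps described above can be sketched in a few lines. The following is a minimal illustration, not the authors' simulation code: it uses 3 hypothetical haplotypes in place of the 14 in Table 1, hypothetical frequencies `theta` and slopes `gamma1`, and a simple fixed-point update (an assumption of this sketch) to choose the intercepts so that the marginal diplotype distribution satisfies HWE when X is standard normal. All function and variable names are illustrative.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Hypothetical inputs, NOT the Table 1 values: 3 haplotypes, index 0 is the reference.
theta = np.array([0.6, 0.3, 0.1])                  # haplotype frequencies
gamma1 = np.array([0.0, -0.4, 0.4])                # per-haplotype log odds ratios with X
beta0, betaH, betaX, betaHX = -3.0, 0.2, 0.1, 0.3  # disease-model values quoted in the text
causal = 1                                         # index of the assumed causal haplotype

# Unordered diplotypes (j1 <= j2), their HWE frequencies, and additive slopes.
n_hap = len(theta)
pairs = [(a, b) for a in range(n_hap) for b in range(a, n_hap)]
q_hwe = np.array([theta[a] * theta[b] * (1.0 if a == b else 2.0) for a, b in pairs])
slope = np.array([gamma1[a] + gamma1[b] for a, b in pairs])
z_causal = np.array([(a == causal) + (b == causal) for a, b in pairs])


def diplotype_probs(x, gamma0):
    """Polytomous logistic pr(diplotype | x); rows index x, columns index diplotypes."""
    eta = gamma0 + np.outer(np.atleast_1d(x), slope)
    eta -= eta.max(axis=1, keepdims=True)          # stabilize the softmax
    p = np.exp(eta)
    return p / p.sum(axis=1, keepdims=True)


def calibrate_intercepts(n_iter=200):
    """Choose intercepts so that E_X[pr(diplotype | X)] equals the HWE frequencies."""
    nodes, weights = hermegauss(40)                # Gauss-Hermite rule for X ~ N(0, 1)
    weights = weights / weights.sum()
    gamma0 = np.log(q_hwe)
    for _ in range(n_iter):
        marginal = weights @ diplotype_probs(nodes, gamma0)
        gamma0 = gamma0 + np.log(q_hwe) - np.log(marginal)
    return gamma0 - gamma0[0]                      # reference diplotype has intercept 0


def simulate_case_control(rng, n_cases=600, n_controls=600, pop_size=200_000):
    """Simulate a population, then sample cases and controls from it."""
    gamma0 = calibrate_intercepts()
    x = rng.standard_normal(pop_size)
    cum = diplotype_probs(x, gamma0).cumsum(axis=1)
    dip = (cum < rng.random(pop_size)[:, None]).sum(axis=1)   # inverse-CDF draw
    dip = np.minimum(dip, len(pairs) - 1)                     # guard against rounding
    z = z_causal[dip]
    eta = beta0 + betaH * z + betaX * x + betaHX * z * x
    d = rng.random(pop_size) < 1.0 / (1.0 + np.exp(-eta))
    cases = rng.choice(np.flatnonzero(d), n_cases, replace=False)
    controls = rng.choice(np.flatnonzero(~d), n_controls, replace=False)
    idx = np.concatenate([cases, controls])
    return d[idx].astype(int), dip[idx], x[idx]
```

Calling `simulate_case_control(np.random.default_rng(0))` returns disease status, diplotype index, and covariate for 1200 subjects. In an analysis that mimics the paper, the diplotype index would then be reduced to unphased genotypes before fitting.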
When analyzing the data, we used only the unphased genotype information. We did not assume
the causal haplotype to be known. Thus, in both the disease-risk model (2.1) and the diplotype
frequency model (2.3), we chose the most common haplotype “10011” as the reference and
estimated a separate regression parameter for each of the nonreferent haplotypes. Since rare
haplotypes may lead to unreliable estimates of the associated regression parameters, when
estimating β and γ1, rare haplotypes with frequency <1% were grouped into the reference
haplotype. The resulting 8 grouped haplotypes are labeled as $h_{j'}$, $j' = 2, \ldots, 8$; see Table 1 for
details about how the haplotypes are grouped.
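Using only unphased genotypes means that a subject's diplotype is generally known only up to a set of compatible haplotype pairs. The sketch below illustrates this phase ambiguity under stated assumptions: genotypes are coded as the number of "1" alleles at each SNP, and the haplotype list is hypothetical (only "10011" and "01100" are quoted in the text). The function name is illustrative.

```python
from itertools import combinations_with_replacement


def compatible_diplotypes(genotype, haplotypes):
    """Return the unordered haplotype pairs whose allele counts match the genotype."""
    return [
        (h1, h2)
        for h1, h2 in combinations_with_replacement(haplotypes, 2)
        if all(int(a) + int(b) == g for a, b, g in zip(h1, h2, genotype))
    ]


# Hypothetical haplotype set; a subject heterozygous at all 5 SNPs.
haplotypes = ["10011", "01100", "11111", "00000"]
genotype = (1, 1, 1, 1, 1)
print(compatible_diplotypes(genotype, haplotypes))
# [('10011', '01100'), ('11111', '00000')]
```

Here two diplotypes are consistent with the same genotype, which is the ambiguity an EM-type algorithm has to average over.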
In each simulation, we obtained 2 sets of estimates from the proposed method, one using the
rare-disease approximation (3.3) and the other using the known value of the population disease
prevalence. Table 2 shows that both sets of estimates are essentially unbiased.
Also, the standard error estimates are in close agreement with the true values, and the coverage
probabilities are close to the nominal value (95%). As expected, the estimates for θ and γ1 are
generally more efficient using external information on the disease prevalence than when using
CHEN et al.Page 10
Biostatistics. Author manuscript; available in PMC 2009 May 17.
the rare-disease approximation, but no such efficiency gain is observed for the parameters β
in the disease-risk model. Similar conclusions can be drawn from the simulations with a
Bernoulli covariate (success probability = 0.5), showing the applicability of the proposed
method to categorical covariates. Detailed results for this latter set of simulations are
included in the supplementary material available at Biostatistics online.
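Summaries of the kind reported for the simulations (bias, agreement between estimated and empirical standard errors, and coverage of nominal 95% intervals) are commonly computed as sketched below. This is an illustrative helper, not the authors' code; the Wald interval with multiplier 1.96 is an assumption, and the replicate estimates are synthetic placeholders rather than output of the proposed method.

```python
import numpy as np


def summarize_replicates(estimates, std_errors, true_value, z=1.96):
    """Bias, empirical SD, mean estimated SE, and Wald-interval coverage over replicates."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    lower = estimates - z * std_errors
    upper = estimates + z * std_errors
    return {
        "bias": estimates.mean() - true_value,
        "empirical_sd": estimates.std(ddof=1),
        "mean_se": std_errors.mean(),
        "coverage": np.mean((lower <= true_value) & (true_value <= upper)),
    }


# Synthetic stand-in for 1000 replicate estimates of a parameter with true value 0.3.
rng = np.random.default_rng(1)
true_beta = 0.3
est = true_beta + 0.1 * rng.standard_normal(1000)
se = np.full(1000, 0.1)
summary = summarize_replicates(est, se, true_beta)
print(summary)
```

Comparing `empirical_sd` with `mean_se` corresponds to checking that the standard error estimates agree with the true sampling variability.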
4.2 Model robustness
Here, we consider a simulation study in which the data are generated in such a way that the
polytomous model for diplotype frequencies does not hold exactly. The main goal is to give
an indication of the robustness of the estimates of the association parameters (β) from the
proposed method when the model for [$H^{di}$ | X] is misspecified.
Following the argument of causality in Section 2, if [X | $H^{di}$] follows a normal distribution with
constant variance, then the polytomous model is exact. So, to create a modest violation of the
polytomous model, for a given diplotype $H^{di} = (h_{j_1}, h_{j_2})$ we generate the data on X from
$$X = \lambda_{j_1} + \lambda_{j_2} + \varepsilon,$$
where the diplotype data are again generated from the distribution in Table 1, ε follows a t-distribution
(d.f. = 3) truncated at ±5, $\lambda_4 = \lambda_5 = -1.2$, $\lambda_{12} = 1.2$, and all the other $\lambda_j$ are 0. The disease status
data are generated from the same logistic model as in the previous simulation. The simulated
data on 600 cases and 600 controls are then analyzed with the proposed method, where the
analysis models for the disease risk, [$H^{di}$ | X], and the marginal diplotype distribution are
specified as in the previous simulation. As a comparison, we also fit to the
simulated data a model with the haplotype–environment (H-X) independence assumption, i.e.
$\operatorname{pr}(H^{di} \mid X) = \operatorname{pr}(H^{di}) = q_{\mathrm{HWE}}(H^{di})$, using the method proposed by Spinka and others (2005). The
rare-disease approximation is made when applying both methods.
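The statement that a normal, constant-variance distribution for X given the diplotype makes the polytomous model exact can be verified with Bayes' rule. In the short derivation below, $\mu_h$ (the mean of X for diplotype h), $\sigma^2$ (the common variance), $q(h)$ (the marginal diplotype frequency), and $h_0$ (a reference diplotype) are notation introduced here for illustration.

```latex
\begin{align*}
\operatorname{pr}(H^{di} = h \mid X = x)
  &= \frac{q(h)\,\sigma^{-1}\phi\{(x - \mu_h)/\sigma\}}
          {\sum_{h'} q(h')\,\sigma^{-1}\phi\{(x - \mu_{h'})/\sigma\}}, \\[4pt]
\log\frac{\operatorname{pr}(H^{di} = h \mid X = x)}{\operatorname{pr}(H^{di} = h_0 \mid X = x)}
  &= \log\frac{q(h)}{q(h_0)}
     - \frac{\mu_h^2 - \mu_{h_0}^2}{2\sigma^2}
     + \frac{\mu_h - \mu_{h_0}}{\sigma^2}\,x .
\end{align*}
```

The quadratic term $-x^2/(2\sigma^2)$ cancels because the variance is common to all diplotypes, so the log odds are linear in x with slope $(\mu_h - \mu_{h_0})/\sigma^2$. With a heavy-tailed error such as the truncated t-distribution, the log density is no longer quadratic in x and this linearity holds only approximately.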
The results shown in Table 3 suggest that, for the estimation of the association parameters β, the
proposed method appears quite robust to modest misspecification of the model for [$H^{di}$ | X]. On
the other hand, the estimates from the H-X independence method show substantial
bias, especially for parameters corresponding to haplotypes for which [X | $H^{di}$] has nonzero
mean; for example, the estimate for the interaction parameter between h5 and X is severely
biased with the H-X independence method. The estimates for the marginal haplotype
frequency parameters θ seem to be robust to misspecification of [$H^{di}$ | X] for both methods.
5. Case–control study of colorectal adenoma, NAT2 haplotype, and smoking
We illustrate the proposed modeling and estimating methodologies with an application to a
case–control study of colorectal adenoma, a precursor of colorectal cancer. The study involved
628 prevalent advanced adenoma cases and 635 gender-matched controls, selected from the
screening arm of the Prostate, Lung, Colorectal and Ovarian Cancer Screening Trial at the
National Cancer Institute, USA (Gohagan and others, 2000; Moslehi and others, 2006). One
of the main objectives of this study is to assess whether smoking-related risk of colorectal
adenoma may be modified by certain haplotypes in NAT2, a gene known to be important in
metabolism of smoking-related carcinogens. In addition, since NAT2 is involved in the
smoking metabolism pathway, it can potentially influence an individual’s addiction to
smoking. Thus, it was also of interest to identify potential haplotypes that could influence an
individual’s susceptibility to smoking.
Genotype data were available on 6 SNPs. We initially applied the EM algorithm proposed by
Li and others (2003) for haplotype-frequency estimation to derive 7 common haplotypes with
estimated frequency greater than 0.5%, which were then included in our association analysis
with the most frequent haplotype serving as the reference haplotype. Subjects were categorized
as “never,” “former,” or “current” smokers. We fit a logistic regression model (2.1) assuming
an additive effect for each haplotype other than the reference one; see (2.2). The haplotype–
environment interaction terms include only those for the haplotype “101010” with “Smk1” and
“Smk2,” 2 dummy variables for former and current smokers, because they are the only promising
interactions according to preliminary analysis. The disease-risk model was further adjusted for
“age,” recorded in years, and “gender.” A polytomous logistic regression (2.3) is specified for
the conditional distribution of diplotypes given the environmental covariates Smk1 and Smk2,
with the marginal diplotype distribution being specified by the HWE constraints. The main
parameters of interest include the disease-haplotype odds ratio parameters β1, the haplotype–
environment odds ratio parameters γ1, and the marginal haplotype frequencies in the whole
population. The marginal distribution for the environmental covariates is left unspecified. For
estimation of regression parameters β and γ1, we grouped haplotypes with frequency less than
2% into the reference haplotype. The rare-disease approximation was made in deriving the
estimating equation, and the EM algorithm proposed in Section 3 was used to accommodate
the unphased genotype data.
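The covariate coding implied by this disease-risk model (additive haplotype counts, two smoking dummies, and interactions of the haplotype "101010" with each dummy) can be sketched as follows. This is an illustration under assumptions: never smokers are taken as the baseline smoking category, gender is coded as a 0/1 indicator, the haplotype "111000" is a hypothetical placeholder, and the function name and column order are not from the paper.

```python
import numpy as np


def design_row(diplotype, smoking, age, male, nonref_haplotypes):
    """One covariate row: intercept, additive haplotype counts, Smk1, Smk2,
    101010 x Smk1, 101010 x Smk2, age, and a 0/1 gender indicator."""
    # Additive coding: number of copies (0, 1, or 2) of each non-reference haplotype.
    counts = [float(sum(h == target for h in diplotype)) for target in nonref_haplotypes]
    smk1 = float(smoking == "former")    # assumed coding: baseline is "never"
    smk2 = float(smoking == "current")
    z = float(sum(h == "101010" for h in diplotype))
    return np.array([1.0, *counts, smk1, smk2, z * smk1, z * smk2, float(age), float(male)])


# A former smoker, aged 62, carrying two copies of 101010 ("111000" is hypothetical).
row = design_row(("101010", "101010"), "former", 62, 0, ["101010", "111000"])
print(row)
```

With phase-ambiguous genotypes, a row like this would be built for every diplotype compatible with a subject's genotype rather than for a single known diplotype.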
Results from this application are displayed in Table 4. Current smokers clearly have a
significantly elevated risk of colorectal adenoma relative to nonsmokers, adjusting for gender
and age. Relative to the reference haplotype “001100,” all the other haplotypes are associated
with reduced risk of colorectal adenoma, but none of these associations is statistically significant.
However, the significance of the interaction 101010 × Smk2 suggests that the smoking-related
risk of adenoma was much lower for carriers of the haplotype 101010 than for noncarriers. The
finding is consistent with previous laboratory and epidemiologic studies that have identified
the haplotype 101010, known as “NAT2*4,” as a rapid metabolizer of smoking-related
carcinogens. The estimates of the parameter γ1 for the conditional diplotype distribution suggest
that susceptibility to smoking is not influenced by any of the haplotypes we considered.
Finally, the estimates for the marginal haplotype frequencies derived from the estimates of θ
are quite close to those obtained by the EM algorithm of Li and others (2003) applied to the
genotype data of the controls.
To check if the analysis is sensitive to model specification for the conditional distribution of
diplotypes given the environmental covariates, we further fit the model (2.3) with various
choices of the environmental covariates. The results (not shown) for the association parameters
β and the marginal haplotype frequencies are fairly consistent across the analyses.
6. Concluding remarks
The model we have proposed for gene–environment association is suitable when the underlying
haplotypes of a genomic region may causally influence the environmental exposure(s) under
study. The model, however, requires special treatment for environmental factors, such as
ethnicity or geographic region(s), which may be associated with the genomic region under
study, not because of any causal relationship but merely due to population stratification.
Suppose, in addition to the main environmental exposure X, there is a set of environmental
factors S which could be used to divide the underlying population into K strata that are likely
to be genetically heterogeneous. In such a situation, a natural model for describing the
association between diplotypes $H^{di}$ and environmental factors W = (X, S) is given by
(6.1)
where the stratum-specific intercept parameters $\gamma_{0 j_1 j_2}(S)$ should be specified in such a way
that the diplotype frequencies, after being marginalized over X, follow population genetics
constraints, such as HWE, within each stratum defined by S. The disease-risk model could
also be extended to include S as a risk factor. The proposed estimating equation methodology can
be easily modified to estimate the gene–environment interaction and association parameters
of interest under these extended models.
Acknowledgements
Chen’s research was supported by the National Science Council of the Republic of China (NSC 95-2118-
M-001-022-MY3). Chatterjee’s research was supported by the Intramural Research Program of the National Cancer
Institute. Carroll’s research was supported by a grant from the National Cancer Institute (CA57030) and by the Texas
A&M Center for Environmental and Rural Health via a grant from the National Institute of Environmental Health
Sciences (P30ES09106). A SAS macro is available from the Web site
http://www.stat.sinica.edu.tw/yhchen/download.html.
References
Andersen JB. Asymptotic properties of conditional maximum likelihood estimators. Journal of the Royal
Statistical Society, Series B 1970;32:283–301.
Chatterjee N, Carroll RJ. Semiparametric maximum likelihood estimation in case-control studies of
gene-environmental interactions. Biometrika 2005;92:399–418.
Chatterjee N, Chen J, Spinka C, Carroll RJ. Comment on the paper likelihood based inference on
haplotype effects in genetic association studies by D. J. Lin and D. Zeng. Journal of the American
Statistical Association 2006;101:108–110.
Epstein M, Satten G. Inference of haplotype effects in case-control studies using unphased genotype data.
American Journal of Human Genetics 2003;73:1316–1329. [PubMed: 14631556]
Gohagan JK, Prorok PC, Hayes RB, Kramer BS. The Prostate, Lung, Colorectal and Ovarian (PLCO)
Cancer Screening Trial of the National Cancer Institute: History, organization, and status. Controlled
Clinical Trials 2000;21:251S–272S. [PubMed: 11189683]
Lake SL, Lyon H, Tantisira K, Silverman EK, Weiss ST, Laird NM, Schaid DJ. Estimation and tests of
haplotype-environment interaction when linkage phase is ambiguous. Human Heredity 2003;55:56–65.
[PubMed: 12890927]
Li SS, Khalid N, Carlson C, Zhao LP. Estimating haplotype frequencies and standard errors for multiple
single nucleotide polymorphisms. Biostatistics 2003;4:513–522. [PubMed: 14557108]
Lin DY, Zeng D. Likelihood-based inference on haplotype effects in genetic association studies (with
discussion). Journal of the American Statistical Association 2006;101:89–118.
Moslehi R, Chatterjee N, Church TR, Chen J, Yeager M, Weissfield J, Hein DW, Hayes RB. Cigarette
smoking, N-acetyltransferase genes and the risk of advanced colorectal adenoma. Pharmacogenomics
2006;7:819–829. [PubMed: 16981843]
Prentice RL, Pyke R. Logistic disease incidence models and case-control studies. Biometrika
1979;66:403–411.
Roeder K, Carroll RJ, Lindsay BG. A nonparametric mixture approach to case-control studies with errors
in covariables. Journal of the American Statistical Association 1996;91:722–732.
Satten GA, Epstein MP. Comparison of prospective and retrospective methods for haplotype inference
in case-control studies. Genetic Epidemiology 2004;27:192–201. [PubMed: 15372619]
Schaid DJ. Evaluating associations of haplotypes with traits. Genetic Epidemiology 2004;27:348–364.
[PubMed: 15543638]
Spinka C, Carroll RJ, Chatterjee N. Analysis of case-control studies of genetic and environmental factors
with missing genetic information and haplotype-phase ambiguity. Genetic Epidemiology
2005;29:108–127. [PubMed: 16080203]
Wallenstein S, Hodge SE, Weston A. Logistic regression model for analyzing extended haplotype data.
Genetic Epidemiology 1998;15:173–181. [PubMed: 9554554]
Zhao LP, Li SS, Khalid NA. Method for the assessment of disease associations with single-nucleotide
polymorphism haplotypes and environmental variables in case-control studies. American Journal of
Human Genetics 2003;72:1231–1250. [PubMed: 12704570]
APPENDIX A: BASIC LEMMAS
The following lemmas are required to derive the asymptotic distribution of the proposed
estimator. Lemma A.1 below is in fact Lemma 3 of Chatterjee and Carroll (2005).
Lemma A.1
Under the case–control sampling design where the total sample size n = n1 + n0 tends to infinity
but the sampling proportions for the cases and controls, that is, n1/n and n0/n, remain fixed at
μ1 and μ0, we have for any measurable function M(D, $H^{di}$, X) of data (D, $H^{di}$, X),
where E*(· | X) denotes the expectation with respect to the joint distribution of (D, $H^{di}$) given
X defined by
(A.1)
and $\lambda(x) = \sum_d \sum_h S(d, h, x, \mathcal{B}, \gamma_1, \gamma_0)$.
Lemma A.2 below provides an explicit expression for the estimating function ℒ̄Φ(·).
Lemma A.2
Write
Then
(A.2)
Proof
By definition,
and direct calculation yields
which proves the result.
Lemma A.3 provides explicit forms for the information matrices.
Lemma A.3
Let ℐΩΩ = Ecc{−ℒΩΩ(D, $H^{di}$, X, Ω)}, where ℒΩΩ is the second derivative of ℒ(·) with
respect to Ω and ℐΩΩ = Ecc{−∂ℒΩ(·)/∂ΩT}. Then
Proof
The first identity has been given in Lemma 4 of Chatterjee and Carroll (2005). To show the
second identity, applying the chain rule we have
(A.3)
The first term of (A.3) equals
By the definition of ω(h, Ω) given in (3.8), it is easy to see that