Retrospective analysis of haplotype-based case-control studies under a flexible model for gene-environment association

Institute of Statistical Science, Academia Sinica, Taipei 11529, Taiwan, People's Republic of China.
Biostatistics (Impact Factor: 2.65). 02/2008; 9(1):81-99. DOI: 10.1093/biostatistics/kxm011
Source: PubMed


Genetic epidemiologic studies often involve investigation of the association of a disease with a genomic region in terms of the underlying haplotypes, that is, the combination of alleles at multiple loci along homologous chromosomes. In this article, we consider the problem of estimating haplotype-environment interactions from case-control studies when some of the environmental exposures themselves may be influenced by genetic susceptibility. We specify the distribution of the diplotypes (haplotype pairs) given environmental exposures for the underlying population based on a novel semiparametric model that allows haplotypes to be potentially related to environmental exposures, while allowing the marginal distribution of the diplotypes to maintain certain population genetics constraints such as Hardy-Weinberg equilibrium. The marginal distribution of the environmental exposures is allowed to remain completely nonparametric. We develop a semiparametric estimating equation methodology and related asymptotic theory for estimation of the disease odds ratios associated with the haplotypes, environmental exposures, and their interactions, as well as the parameters that characterize haplotype-environment associations and the marginal haplotype frequencies. The problem of phase ambiguity of genotype data is handled using a suitable expectation-maximization algorithm. We study the finite-sample performance of the proposed methodology using simulated data. An application of the methodology is illustrated using a case-control study of colorectal adenoma, designed to investigate how the smoking-related risk of colorectal adenoma can be modified by "NAT2," a smoking-metabolism gene that may potentially influence susceptibility to smoking itself.
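
To make the phase-ambiguity step concrete, the following is a minimal sketch of a textbook expectation-maximization (EM) algorithm for haplotype frequencies under Hardy-Weinberg equilibrium. It is not the authors' estimating-equation method: it ignores disease status, environmental exposures, and case-control sampling, and it assumes only two biallelic loci. All function and variable names are hypothetical and chosen for illustration.

```python
# Illustrative sketch only: generic EM for haplotype frequencies under HWE.
# Assumes two biallelic loci; a genotype is a tuple of minor-allele counts
# (0, 1, or 2) at each locus. Names here are hypothetical, not from the paper.

HAPLOTYPES = [(0, 0), (0, 1), (1, 0), (1, 1)]


def compatible_diplotypes(genotype):
    """Return all unordered haplotype pairs whose allele counts sum to the genotype."""
    pairs = []
    for i, h1 in enumerate(HAPLOTYPES):
        for h2 in HAPLOTYPES[i:]:
            if tuple(a + b for a, b in zip(h1, h2)) == tuple(genotype):
                pairs.append((h1, h2))
    return pairs


def hwe_prob(h1, h2, freq):
    """Diplotype probability under HWE: p^2 if homozygous, 2pq otherwise."""
    p = freq[h1] * freq[h2]
    return p if h1 == h2 else 2.0 * p


def em_haplotype_freqs(genotypes, n_iter=200, tol=1e-10):
    """Estimate haplotype frequencies from unphased genotypes by EM."""
    freq = {h: 1.0 / len(HAPLOTYPES) for h in HAPLOTYPES}
    for _ in range(n_iter):
        # E-step: split each genotype across its compatible diplotypes,
        # weighting by the current HWE diplotype probabilities.
        counts = {h: 0.0 for h in HAPLOTYPES}
        for g in genotypes:
            pairs = compatible_diplotypes(g)
            weights = [hwe_prob(h1, h2, freq) for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected haplotype counts divided by the number of chromosomes.
        new_freq = {h: c / (2 * len(genotypes)) for h, c in counts.items()}
        change = max(abs(new_freq[h] - freq[h]) for h in HAPLOTYPES)
        freq = new_freq
        if change < tol:
            break
    return freq


if __name__ == "__main__":
    # The double heterozygote (1, 1) is phase ambiguous: 00/11 or 01/10.
    sample = [(0, 0), (2, 2), (1, 1)]
    print(em_haplotype_freqs(sample))
```

In this toy sample only the double heterozygote is ambiguous, and the surrounding homozygotes pull its phase toward the 00/11 resolution. The method in the abstract additionally lets the diplotype distribution depend on environmental exposures and accounts for the retrospective sampling design, neither of which this sketch attempts.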

Available from: Yi-Hau Chen, Jul 29, 2014
  • Source
    • "In addition to rare variants, gene-environment interaction (GXE) is believed to be another important contributor to missing heritability [Thomas, 2010]. Many statistical methods have been proposed to detect GXE where the "gene" is either taken to be common SNPs [Chatterjee and Carroll, 2005; Kraft, 2007; Mukherjee and Chatterjee, 2008; Wakefield et al., 2010] or haplotypes [Chatterjee et al., 2009; Chen et al., 2008, 2009; Hein et al., 2009; Kwee et al., 2007; Lobach et al., 2008]. However, many GXEs remain elusive due to the "curse of dimensionality" as the number of variables representing such interactions can grow prohibitively large [Wakefield et al., 2010]."
    ABSTRACT: Two important contributors to missing heritability are believed to be rare variants and gene-environment interaction (GXE). Thus, detecting GXE where G is a rare haplotype variant (rHTV) is a pressing problem. Haplotype analysis is usually the natural second step to follow up on a genomic region that is implicated to be associated through single nucleotide variant (SNV) analysis. Further, rHTV can tag associated rare SNVs and provide greater power to detect them than popular collapsing methods. Recently we proposed Logistic Bayesian LASSO (LBL) for detecting rHTV association with case-control data. LBL shrinks the unassociated (especially common) haplotypes toward zero so that an associated rHTV can be identified with greater power. Here, we incorporate environmental factors and their interactions with haplotypes in LBL. As LBL is based on retrospective likelihood, this extension is not trivial. We model the joint distribution of haplotypes and covariates given the case-control status. We apply the approach (LBL-GXE) to the Michigan, Mayo, AREDS, Pennsylvania Cohort Study on Age-related Macular Degeneration (AMD). LBL-GXE detects interaction of a specific rHTV in the CFH gene with smoking. To the best of our knowledge, this is the first time in the AMD literature that an interaction of smoking with a specific (rather than pooled) rHTV has been implicated. We also carry out simulations and find that LBL-GXE has reasonably good power for detecting interactions with rHTV while keeping the type I error rates well controlled. Thus, we conclude that LBL-GXE is a useful tool for uncovering missing heritability.
    Genetic Epidemiology 01/2014; 38(1). DOI:10.1002/gepi.21773 · 2.60 Impact Factor
  • Source
    • "For the case–control studies that were described above, Jiang et al. (2006), Chen et al. (2008) and Lin and Zeng (2009) derived the efficient profile likelihood (in the sense that its score for β is an efficient score function), Lin and Zeng (2009) noting importantly that it can be used in our context. See also Monsees et al. (2009)."
    ABSTRACT: Primary analysis of case-control studies focuses on the relationship between disease D and a set of covariates of interest (Y, X). A secondary application of the case-control study, which is often invoked in modern genetic epidemiologic association studies, is to investigate the interrelationship between the covariates themselves. The task is complicated owing to the case-control sampling, where the regression of Y on X is different from what it is in the population. Previous work has assumed a parametric distribution for Y given X and derived semiparametric efficient estimation and inference without any distributional assumptions about X. We take up the issue of estimation of a regression function when Y given X follows a homoscedastic regression model, but otherwise the distribution of Y is unspecified. The semiparametric efficient approaches can be used to construct semiparametric efficient estimates, but they suffer from a lack of robustness to the assumed model for Y given X. We take an entirely different approach. We show how to estimate the regression parameters consistently even if the assumed model for Y given X is incorrect, and thus the estimates are model robust. For this we make the assumption that the disease rate is known or well estimated. The assumption can be dropped when the disease is rare, which is typically so for most case-control studies, and the estimation algorithm simplifies. Simulations and empirical examples are used to illustrate the approach.
    Journal of the Royal Statistical Society Series B (Statistical Methodology) 05/2013; 75(1):185-206. DOI:10.1111/j.1467-9868.2012.01052.x · 3.52 Impact Factor
  • Source
    ABSTRACT: In the past decade, many statistical methods have been proposed for the analysis of case–control genetic data with an emphasis on haplotype-based disease association studies. Most of the methodology has concentrated on the estimation of genetic (haplotype) main effects. Most methods accounted for environmental and gene-environment interaction effects by utilizing prospective-type analyses that may lead to biased estimates when used with case–control data. Several recent publications addressed the issue of retrospective sampling in the analysis of case–control genetic data in the presence of environmental factors by developing new efficient semiparametric statistical methods. I present the new Stata command, haplologit, that implements efficient profile-likelihood semiparametric methods for fitting gene–environment models in the very important special cases of a) a rare disease, b) a single candidate gene in Hardy-Weinberg equilibrium, and c) independence of genetic and environmental factors.