Detecting gene-gene interactions that underlie human diseases

Heather J. Cordell

Institute of Human Genetics, Newcastle University, UK

Abstract

Following the identification of several disease-associated polymorphisms by whole genome

association analysis, interest is now focussing on the detection of effects that, due to their

interaction with other genetic (or environmental) factors, may not be identified by using standard

single-locus tests. In addition to increasing power to detect association, there is also a hope

that detecting interactions between loci will allow us to elucidate the biological and biochemical

pathways underpinning disease. Here I provide a critical survey of the current methodological

approaches (and related software packages) used to detect interactions between genetic loci that

contribute to human genetic disease. I also discuss the difficulties in determining the biological

relevance of statistical interactions.

The search for genetic factors that influence common, complex traits, and the

characterisation of the effects of those factors, is both a goal and a challenge for modern

geneticists. In the last couple of years, the field has been revolutionised by the success of

genome-wide association (GWA) studies 1 2 3 4 5. Most such studies have used a single-

locus analysis strategy, whereby each variant is tested individually for association with some

phenotype. However, an oft-cited reason for the lack of success in genetic studies of

complex disease 6 7 is the existence of interactions between loci. If a genetic factor operates

primarily through a complex mechanism involving multiple other genes, and possibly

environmental factors, the fear is that the effect will be missed if one examines it in

isolation, without allowing for its potential interactions with these other (unknown) factors.

For this reason, several methods and software packages 8 9 10 11 12 13 14 15 have been

developed to consider statistical interactions between loci, when analysing data from genetic

association studies. Although, in some cases, the motivation for such analyses is to increase

the power to detect effects 16, in other cases the motivation has been to detect statistical

interactions between loci that are informative about the biological and biochemical pathways

underpinning the disease 7. I return to this complex issue of biological interpretation of

statistical interaction later in the article.

The purpose of this Review article is to provide a survey of the methodological

approaches and related software packages that are currently used to detect interactions

between genetic loci that contribute to human genetic disease. Although the focus is on

human genetics, many of the concepts and approaches are strongly related to methods used

in animal and plant genetics. I begin by describing what is meant by (statistical) interaction,

and setting up definitions and notation for following sections. I then explain how one may

test for interaction between two (or more) known genetic factors, and how one may address

the slightly different question of testing for association with a single factor, while at the

same time allowing for interaction with other factors. In practice, one rarely wishes to test

for interaction purely between known factors, unless perhaps to replicate a previous finding

Corresponding author: Heather Cordell, Institute of Human Genetics, Newcastle University, International Centre for Life, Central

Parkway Newcastle upon Tyne, NE1 3BZ, UK heather.cordell@ncl.ac.uk Tel: +44 (0)191 241 8669 Fax: +44 (0)191 241 8666.

Europe PMC Funders Group

Author Manuscript

Nat Rev Genet. Author manuscript; available in PMC 2010 May 19.

Published in final edited form as:

Nat Rev Genet. 2009 June ; 10(6): 392–404. doi:10.1038/nrg2579.

Europe PMC Funders Author Manuscripts

or to test a specific biological hypothesis. More common is the desire to search for

interactions, or for loci that may interact, given genotype data at a potentially very large

number of sites (e.g. from a GWA analysis or from a more focussed candidate gene study). I

therefore continue the article by outlining different methods (and software packages) that

search for such interactions, ranging from simple exhaustive search to various DATA-

MINING/MACHINE LEARNING approaches to BAYESIAN MODEL SELECTION

approaches. Throughout these sections I take as a recurring example the analysis of a

publicly available genome-wide data set on Crohn's disease from the Wellcome Trust Case

Control Consortium (WTCCC) 1. I conclude the article with a section discussing the

biological interpretation of results found from such statistical interaction analysis.

The investigation of interactions has had a long history in genetics, ranging from classical

quantitative genetic studies of inbred plant and animal populations 17 18 19 to evolutionary

genetic studies 20 and, finally, to linkage and association studies in outbred human

populations. In this article I focus primarily on human genetic association studies: readers

are referred to references 20 21 22 23 24 25 for a discussion of interactions in the context of

evolutionary genetics or in human genetic linkage analysis.

Definition of statistical interaction

Interaction as departure from a linear model

The most common statistical definition of interaction relies on the concept of a linear model

describing the relationship between some outcome variable and some predictor variable(s).

We propose a particular model for how we believe the predictors might relate to the

outcome, and we use data (i.e. measurements of the relevant variables on a number of

individuals) to determine how well the model fits our observed data, and to compare the fit

of different models. Arguably the best-known form of this type of analysis is simple

linear or least squares regression 26, where we relate an observed quantitative outcome y

(e.g. weight) to a predictor variable x (e.g. height) via a ‘best fit’ line or regression equation

y = mx+c. More generally, we may use multiple regression 26 to include several different

predictor variables (e.g. x1, x2, x3, representing height, age and gender).

From a statistical point of view, interaction signifies departure from a linear model

describing how two (or more) predictors, B and C say, predict a phenotype outcome A (Box

1). For a disease outcome and case-control data, rather than modelling a quantitative trait y,

the usual approach is to model the (expected) log-odds of disease as a linear function of the

relevant predictor variables 26 27. Given genotype data, we may evaluate the likelihood of

the data under this model and use MAXIMUM LIKELIHOOD (or other) methods to

estimate the regression coefficients and test hypotheses, such as the hypothesis that the

interaction term (i in the mathematical formulation of Box 1) equals 0.
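
As a sketch of the kind of model meant here (Box 1 is not reproduced in this section, so the exact notation is an assumption on my part), write p for the probability of disease, x_B and x_C for numeric codings of the genotypes at the two loci, μ for a baseline term, α and β for the two main effects and i for the interaction term. The log-odds model might then read:

```latex
\log\!\left(\frac{p}{1-p}\right) = \mu + \alpha\, x_B + \beta\, x_C + i\, x_B x_C
```

Under this form, the absence of statistical interaction on the log-odds scale corresponds to the null hypothesis $i = 0$, in which case the effects of the two loci are additive on that scale.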

Supplementary Text S1 describes some specific models that follow this general formulation,

including the SATURATED ‘genotype’ model. Although this model provides the best

possible fit to the data, it includes many parameters. We may make parameter restrictions to

generate fewer degrees of freedom (df) and thus increase power. Although written in terms

of nine or fewer regression parameters, the models of Supplementary Text S1 actually

represent an infinity of different models, depending on the values taken by those parameters.

There has been some interest in categorizing these models 28 29 30 in such a way as to aid

either mathematical or biological interpretation. As discussed later, biological interpretation

is usually easiest when the PENETRANCE values all equal either 0 or 1, leading to a clear

relationship between genotype and phenotype. This situation, however, is unlikely for

complex genetic diseases.

Marginal effects

An important issue in genetic studies is whether there are factors that display interaction

effects, without displaying so-called MARGINAL EFFECTS 6 31. The problem with factors

that display interaction effects, without displaying marginal effects, is that these factors will

be missed in a single-locus analysis, as they do not lead to any marginal correlation between

genotype and phenotype when each locus is considered individually. It is not clear in

practice how often this might occur, as many models that include an interaction term even in

the absence of main effects (α and β in the mathematical formulation of Box 1) do, in fact,

lead to significant marginal effects, i.e. they show correlations between genotype and

phenotype that are detectable in a single-locus analysis. Thus, although one may derive

mathematical models (sets of specific values for the regression coefficients) that lead to

single-locus models displaying no marginal effects 6, it remains to be seen whether such

models represent common underlying scenarios – and thus a potentially serious problem – in

complex genetic diseases.
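
To make the idea of interaction without marginal effects concrete, the following minimal Python sketch averages a two-locus penetrance table over the genotype distribution at the other locus. The penetrance values, the allele frequency of 0.5, and the assumptions of Hardy-Weinberg equilibrium and independence between the loci are all hypothetical choices made for illustration; they are not taken from any data set discussed in this article.

```python
# Hypothetical two-locus penetrance table (illustrative values only).
# Rows: genotype at locus B (0, 1 or 2 copies of an allele);
# columns: genotype at locus C.
PENETRANCE = [
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [0.0, 0.1, 0.0],
]


def hwe_frequencies(q):
    """Genotype frequencies (0, 1, 2 copies) under Hardy-Weinberg equilibrium."""
    return [(1 - q) ** 2, 2 * q * (1 - q), q ** 2]


def marginal_penetrances(penetrance, freq_other, by_row=True):
    """Average penetrance for each genotype at one locus, taken over the
    genotype distribution at the other locus (loci assumed independent)."""
    n = len(penetrance)
    if by_row:
        return [sum(penetrance[g][h] * freq_other[h] for h in range(n))
                for g in range(n)]
    return [sum(penetrance[h][g] * freq_other[h] for h in range(n))
            for g in range(n)]


freqs = hwe_frequencies(0.5)  # assumed allele frequency at both loci
marginal_B = marginal_penetrances(PENETRANCE, freqs, by_row=True)
marginal_C = marginal_penetrances(PENETRANCE, freqs, by_row=False)
print("Marginal penetrances at locus B:", marginal_B)
print("Marginal penetrances at locus C:", marginal_C)
```

In this constructed example every single-locus genotype has the same average penetrance, so neither locus would show association in a single-locus analysis even though the joint genotype strongly affects risk. Changing the assumed allele frequency away from 0.5 generally reintroduces marginal effects, which illustrates why such models may be rather special cases.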

For simplicity, I have concentrated here on defining interaction in relation to two genetic

factors (two-locus interactions). In practice, however, for complex diseases we might expect

three-locus, four-locus and even higher-level interactions to operate as well. Mathematically,

such higher-level interactions are simple extensions to the two-locus models described

earlier. The problem with these models is that they contain a large number of parameters,

which would require extremely large data sets to estimate accurately. Interpreting the

resulting parameter estimates is also complicated, except perhaps in some simple cases – for

example, when risk alleles at all loci are required to alter disease risk (i.e. when only the full

multilocus interaction term differs from zero).

Testing for interaction between known factors

Regression models

Given two or more known (or hypothesised) genetic factors influencing disease risk,

arguably the most natural way to test for statistical interaction (on the log-odds scale) is

simply to fit a LOGISTIC REGRESSION MODEL that includes main effects and relevant

interaction term(s) and then to test whether the interaction term(s) equal zero or not. A

similar approach can be used for quantitative phenotypes, in which case linear rather than

logistic regression is used. These analyses can be performed in virtually any statistical

analysis package after construction of the required genotype variables. Alternatively, the

--epistasis option in the whole-genome analysis package PLINK 12 provides a logistic

regression test for interaction that assumes an allelic model both for main effects and

interactions.
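
For the simplest possible case, two binary factors (for example, carrier versus non-carrier status at each locus), the interaction term of the saturated logistic model has a closed-form estimate, so the test can be sketched without any model-fitting software. The Python sketch below is an illustration under that simplifying assumption, with invented counts; it is not the allelic model implemented in PLINK, and the function names are my own.

```python
import math


def cross_product_ratio(table):
    """Cross-product (odds) ratio of a 2x2 table indexed as table[b][c]."""
    return (table[0][0] * table[1][1]) / (table[0][1] * table[1][0])


def interaction_wald_test(cases, controls):
    """Wald test of the interaction term for two binary factors.

    cases[b][c] and controls[b][c] hold counts of individuals, where b and c
    (0 or 1) indicate carrier status at loci B and C. All counts must be
    positive for the estimate and its variance to be defined.
    """
    # Interaction log odds ratio: inter-locus association in cases
    # minus inter-locus association in controls.
    estimate = (math.log(cross_product_ratio(cases))
                - math.log(cross_product_ratio(controls)))
    # Large-sample variance: sum of reciprocal counts over all eight cells.
    variance = sum(1.0 / n
                   for table in (cases, controls)
                   for row in table
                   for n in row)
    z = estimate / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return estimate, z, p_value


# Hypothetical counts, invented purely for illustration.
cases = [[200, 150], [150, 250]]
controls = [[250, 150], [150, 90]]
estimate, z, p_value = interaction_wald_test(cases, controls)
print(f"interaction log-OR = {estimate:.3f}, z = {z:.2f}, p = {p_value:.2g}")
```

With three-level genotype codings or additional covariates no such closed form is available, and one would fit the logistic regression model in a statistical package as described above.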

A more powerful approach in case-control studies is to use a ‘case-only’ analysis 32 33 34.

Case-only analysis exploits the fact that, under certain conditions, an interaction term in the

logistic regression equation corresponds to dependency or correlation between the relevant

predictor variables within the population of cases. A case-only test of interaction can

therefore be performed by testing the null hypothesis that there is no correlation between

alleles or genotypes at the two loci, in a sample restricted to cases alone. This test can easily

be performed via a simple χ² test of independence between genotypes (a 4 degree of

freedom (df) test) or alleles (a 1df test), or via logistic or MULTINOMIAL REGRESSION,

in any statistical analysis package.
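
The genotype-based version of the case-only test can be sketched in a few lines of Python. The example below computes the Pearson χ² statistic for a 3×3 table of two-locus genotype counts in cases and converts it to a p-value on 4 df using the closed-form tail probability of that distribution. The counts are invented for illustration, and the function name is hypothetical.

```python
import math


def case_only_chi2(table):
    """Pearson chi-square test of independence between genotypes at two loci,
    applied to a 3x3 table of counts observed in cases only."""
    if len(table) != 3 or any(len(row) != 3 for row in table):
        raise ValueError("expected a 3x3 table of genotype counts")
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for r, row in enumerate(table):
        for c, observed in enumerate(row):
            expected = row_totals[r] * col_totals[c] / total
            stat += (observed - expected) ** 2 / expected
    # For 4 df the chi-square upper tail probability is exp(-x/2) * (1 + x/2).
    p_value = math.exp(-stat / 2.0) * (1.0 + stat / 2.0)
    return stat, p_value


# Hypothetical case genotype counts (rows: locus B, columns: locus C).
case_counts = [
    [30, 10, 5],
    [10, 40, 10],
    [5, 10, 30],
]
stat, p_value = case_only_chi2(case_counts)
print(f"chi-square = {stat:.2f} on 4 df, p = {p_value:.2g}")
```

A small p-value here is evidence of interaction only if the two loci are uncorrelated in the general population, which is the key assumption of the case-only design.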

The main problem with the case-only test is its requirement that the genotype variables be

uncorrelated in the general population – indeed it is this assumption, rather than the design

per se, that provides the increased power compared to case-control analysis. The case-only

test is therefore unsuitable for loci that are either closely linked or show correlation for some

other reason (e.g. if certain genotype combinations are related to viability). Unlike

epidemiological studies of environmental factors, where correlation and CONFOUNDING

between variables is common, in genetic studies the assumption of independence between

unlinked genetic factors would seem fairly reasonable. One could use a two-stage procedure

to test first for correlation between the loci in the general population, and then use the

outcome to determine whether to perform a case-only or case-control interaction test.

However, this procedure has potential bias 35.

A preferable approach is to incorporate the case-only and case-control estimators into a

single test. In this vein, Zhao et al. 36 proposed a test based on the difference in inter-locus

allelic association between cases and controls, an idea originally suggested by Hoh and Ott

37. The --fast-epistasis option in PLINK 12 performs a similar test. Zhao et al. 36

found their test had greater power than a 4df logistic regression test of gene-gene

interaction; however, this power increase may be largely due to the lower df in their allelic

(rather than genotypic) test. Mukherjee and Chatterjee 38 35 proposed an EMPIRICAL

BAYES PROCEDURE that uses essentially a weighted average of the case-control and

case-only estimators of the interaction. This approach exploits the gene-gene independence

assumption (and thus the power) of case-only analysis, while additionally incorporating

controls, allowing the estimation of main effects. Routines that implement this procedure are

available in Excel and/or Matlab.
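
The idea of a weighted average of the case-control and case-only estimators can be illustrated with a simple shrinkage rule for two binary factors. In the Python sketch below, the weight given to the case-control estimator grows with the estimated gene-gene association in controls relative to the sampling variance. This particular weighting is an assumption made for illustration and should not be read as the published empirical Bayes estimator or as a substitute for the authors' routines; the counts are invented.

```python
import math


def log_cross_product(table):
    """Log cross-product ratio of a 2x2 count table and its large-sample variance."""
    estimate = math.log(table[0][0] * table[1][1] / (table[0][1] * table[1][0]))
    variance = sum(1.0 / n for row in table for n in row)
    return estimate, variance


def shrinkage_interaction(cases, controls):
    """Weighted average of case-only and case-control interaction estimates.

    The weight is an illustrative choice (an assumption, not the published
    procedure): it is 0 when controls show no gene-gene association and
    approaches 1 as that association becomes large relative to its noise.
    """
    case_only, var_cases = log_cross_product(cases)
    theta, var_controls = log_cross_product(controls)  # association in controls
    case_control = case_only - theta
    var_case_control = var_cases + var_controls
    weight = theta ** 2 / (theta ** 2 + var_case_control)
    combined = weight * case_control + (1.0 - weight) * case_only
    return case_only, case_control, weight, combined


# Hypothetical counts indexed as table[b][c] for carrier status at loci B and C.
cases = [[200, 150], [150, 250]]
controls = [[240, 150], [150, 100]]
case_only, case_control, weight, combined = shrinkage_interaction(cases, controls)
print(f"case-only = {case_only:.3f}, case-control = {case_control:.3f}, "
      f"weight = {weight:.3f}, combined = {combined:.3f}")
```

When the controls give little evidence against independence, the combined estimate stays close to the more precise case-only estimate; when they give strong evidence against it, the estimate moves toward the case-control value.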

Other approaches

Although regression-based tests of interaction would seem most natural (given the definition

of interaction as departure from some linear regression model), alternative approaches have

been proposed. Yang et al. 39 proposed a method based on partitioning of χ² values that,

similar to that of Zhao et al. 36, contrasts inter-locus association between cases and controls. Their method

showed higher power than logistic regression when the loci had no marginal effects.

Recently there has been interest in INFORMATION-THEORETIC or ENTROPY-BASED

approaches for modelling genetic interactions 40 41 42 43. It is unclear whether this

framework offers any advantage over more standard statistical modelling of the same

predictor variables, as in most cases the conditional probability statements implied by the

two approaches are entirely equivalent 44.

Family-based studies

Here I have focussed on testing for interaction in the context of case-control or population-

based studies. Several related methods have been proposed to test for interaction in the

context of family-based association studies 45 46 47 48 49. The case-pseudocontrol 46

approach offers a regression-based framework that allows interaction tests very similar to

those described here. Given the large sample sizes that are required when testing for

interaction as opposed to main effects 50 51, it is unclear whether investigators will have

family-based cohorts of sufficient size to provide high power for detection of interactions.

However, such cohorts may provide a useful resource for replication and characterisation of

interaction effects that have been found using alternative means.

Testing for association while allowing for interaction

Rather than testing for interaction per se, many researchers are interested in allowing for

interaction (with other genetic or environmental factors) when testing for association at a

given genetic locus. The rationale is that if the test locus influences disease or phenotype

outcome via interaction with another factor, then allowing for this interaction should

increase the power to detect the effect at the test locus. From a mathematical point of view, a

test for association at a given locus C, while allowing for interaction with another locus B (a

‘joint’ 16 test), corresponds to comparing the fit (to the observed data) of a linear model in

which main effects of B, C and their interactions are included, to a model in which all terms

(main or interaction) involving locus C are removed (Box 1).
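
The model comparison just described can be written out as follows. The notation is an assumption on my part, since Box 1 is not reproduced here: p is the probability of disease, x_B and x_C are single numeric codings of genotype at loci B and C, μ is a baseline term, α and β are main effects, i is the interaction term, and ℓ denotes a maximised log-likelihood.

```latex
\begin{aligned}
\text{full model:}\quad & \log\frac{p}{1-p} = \mu + \alpha\, x_B + \beta\, x_C + i\, x_B x_C \\
\text{reduced model:}\quad & \log\frac{p}{1-p} = \mu + \alpha\, x_B \\
\text{test statistic:}\quad & \Lambda = 2\left[\ell_{\text{full}} - \ell_{\text{reduced}}\right]
\end{aligned}
```

With this coding the reduced model drops two parameters ($\beta$ and $i$), so under the null hypothesis $\beta = i = 0$ the statistic $\Lambda$ is compared to a $\chi^2$ distribution with 2 df. Genotype codings with more parameters per locus lead to correspondingly more degrees of freedom.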

Theoretically, if no interaction effects exist, these joint tests will be less powerful than

marginal single-locus association tests. However, if interaction effects do exist, then the

power of joint tests can be higher than that of single-locus approaches 52. Kraft et al. 16

showed that the joint test of a genetic effect, while allowing for interaction with a known

environmental factor, performed nearly optimally over a wide range of plausible underlying

models. This test uses case-control data to test the combination of a main effect at locus C

and an interaction effect; since case-only analysis provides a more powerful test for the

interaction effect 32 33 34, Chapman and Clayton 53 proposed using a version of the joint test

that combines a case-control main effect component with a case-only interaction component.

The joint test of association, while allowing for interaction, assumes that one has some

known (or hypothesised) measured factor with which the test locus may interact. In the

absence of a specific factor of this type, a natural approach is to average over all other

(potentially interacting) genetic factors when performing a test at a given locus. A Bayesian

approach for doing this, in the context of GWA studies, is in development 14 and a beta

version of the associated BIA software is available in limited release from its authors on

request. Rather than averaging over all possible interacting loci, Chapman and Clayton 53

proposed using the maximum value of the joint test, evaluated over a pre-defined set of

potentially modifying (interacting) loci, with significance assessed using a PERMUTATION

argument.
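
The permutation argument for a maximum statistic can be sketched generically: compute the maximum over the candidate modifying loci in the observed data, then recompute that maximum after repeatedly shuffling case-control labels. In the Python sketch below the per-locus statistic is a deliberately simple stand-in (an absolute difference in mean genotype), not the joint test itself, and all data and function names are hypothetical.

```python
import random


def mean_difference(genotypes):
    """Build a toy statistic: absolute case/control difference in mean genotype.
    This is a placeholder for whatever per-locus test statistic is of interest."""
    def statistic(phenotypes):
        cases = [g for g, y in zip(genotypes, phenotypes) if y == 1]
        controls = [g for g, y in zip(genotypes, phenotypes) if y == 0]
        return abs(sum(cases) / len(cases) - sum(controls) / len(controls))
    return statistic


def max_statistic_permutation_p(phenotypes, statistics, n_permutations=1000, seed=0):
    """Permutation p-value for the maximum of several statistics.

    Each element of `statistics` maps a list of 0/1 phenotype labels to a
    number, with larger values indicating stronger evidence.
    """
    rng = random.Random(seed)
    observed = max(stat(phenotypes) for stat in statistics)
    shuffled = list(phenotypes)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any genotype-phenotype relationship
        if max(stat(shuffled) for stat in statistics) >= observed:
            exceed += 1
    # Add one to numerator and denominator so the p-value is never exactly zero.
    return observed, (exceed + 1) / (n_permutations + 1)


# Hypothetical data: 20 controls followed by 20 cases, two candidate loci.
phenotypes = [0] * 20 + [1] * 20
modifier_statistics = [
    mean_difference([0] * 20 + [2] * 20),   # genotype tracks phenotype
    mean_difference([0, 1, 2, 1] * 10),     # genotype unrelated to phenotype
]
observed, p_value = max_statistic_permutation_p(phenotypes, modifier_statistics)
print(f"observed maximum = {observed:.2f}, permutation p = {p_value:.4f}")
```

Because the maximum is recomputed in every permuted data set, the resulting p-value accounts for having searched over the whole set of candidate loci, including any correlation between the individual statistics.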

Here I have concentrated on the issue of testing (either for interaction, or for association

while allowing for interaction) at one or two specific genetic variants of interest. Rather than

testing a single variant, it is now quite common to have genotype data at a large number of

variants that may or may not have any prior evidence for involvement with disease. Given

such data, various model selection approaches have been proposed that allow one to

essentially step through a sequence of regression models searching for significant effects,

both main effects and interactions 37 8 9 10 13 54 55 56. These approaches will be described in

more detail in subsequent sections. First, I describe an approach that is feasible provided the

number of main and interaction effects to be examined is not too large, namely, simple

exhaustive search.

Exhaustive search

Two-locus interactions

Given genotype data at a number of different loci, arguably the simplest way to search for

interactions between these loci is by exhaustive search. For example, to test all two-locus

interactions, one could consider all possible pairs of loci and perform the desired interaction

test for each pair. Similarly if testing for association while allowing for interaction, one

could perform the relevant 3df or 8df 52 test (Box 1, Supplementary Text S1). Clearly an

exhaustive search of this type raises a MULTIPLE TESTING issue somewhat analogous to

the multiple testing issue encountered in single-locus analysis of GWA studies 1. If all tests

are independent, a BONFERRONI CORRECTION is appropriate 52; however, LD between

loci will induce correlation between many of the tests. If testing for association while

allowing for interaction, additional correlation occurs due to the fact that the main effect of a

locus will be a component of all tests involving that locus. Theoretically, one can use

permutation 53 to assess significance while allowing for the multiplicity of (and correlation

between) the tests performed, but, for large numbers of loci, this may be computationally

prohibitive.
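
The scale of the multiple-testing problem is easy to quantify. The Python sketch below counts the number of two-locus tests for a given number of loci, derives the corresponding Bonferroni threshold, and shows the shape of an exhaustive pairwise scan. The panel size of 500,000 loci, the locus names and the `pair_test` function are hypothetical placeholders; any of the interaction tests described earlier could be supplied in its place.

```python
from itertools import combinations
from math import comb


def bonferroni_threshold(n_loci, alpha=0.05):
    """Number of distinct locus pairs and the Bonferroni-corrected threshold."""
    n_pairs = comb(n_loci, 2)
    return n_pairs, alpha / n_pairs


def scan_pairs(loci, pair_test, alpha=0.05):
    """Apply pair_test (which returns a p-value) to every pair of loci and
    keep the pairs that pass the Bonferroni-corrected threshold."""
    _, threshold = bonferroni_threshold(len(loci), alpha)
    hits = []
    for locus_a, locus_b in combinations(loci, 2):
        p = pair_test(locus_a, locus_b)
        if p < threshold:
            hits.append((locus_a, locus_b, p))
    return hits


# Scale of an exhaustive two-locus search for a hypothetical panel size.
n_pairs, threshold = bonferroni_threshold(500_000)
print(f"{n_pairs:,} pairwise tests; Bonferroni threshold = {threshold:.1e}")


# Toy scan with a made-up test that flags a single pair.
def toy_pair_test(locus_a, locus_b):
    return 0.001 if (locus_a, locus_b) == ("locusA", "locusB") else 0.5


hits = scan_pairs(["locusA", "locusB", "locusC", "locusD"], toy_pair_test)
print(hits)
```

The Bonferroni threshold treats all pairwise tests as independent, so in the presence of LD it will tend to be conservative, which is the motivation for the permutation-based alternative mentioned above.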
