
© 2006 Nature Publishing Group

The goal of population association studies is to identify

patterns of polymorphisms that vary systematically

between individuals with different disease states and

could therefore represent the effects of risk-enhancing or

protective alleles (BOXES 1,2). That sounds easy enough:

what could be difficult about spotting allele patterns that

are overrepresented in cases relative to controls?

One fundamental problem is that the genome is so

large that patterns that are suggestive of a causal poly-

morphism could well arise by chance. To help distinguish

causal from spurious signals, tight standards for statisti-

cal significance need to be established; another tactic is

to consider only patterns of polymorphisms that could

plausibly have been generated by causal genetic variants,

given our current understanding of human genetic his-

tory1 and evolutionary processes such as mutation and

recombination. Checking for systematic errors and

dealing with missing values present further challenges.

Upstream of the study itself, at the study design phase,

several questions need to be considered, such as: How

many individuals should be genotyped? At how many

markers? And how should markers and individuals

be chosen?

In this article I survey current approaches to such

challenges. My goal is to give a broad-brush view of dif-

ferent statistical problems and how they relate to each

other, and to suggest some solutions and sources of fur-

ther information. I look first at statistical analyses that

precede association testing and then move on to the tests

of association, based on single SNPs, multiple SNPs and

haplotypes. I also briefly introduce adjustments to allow

for possible population stratification (or population struc-

ture) and approaches to the problem of multiple testing.

My hope is that those handling genetic-association data

will obtain a clearer picture of the statistical issues and

gain some ideas for new or modified approaches.

In this review I cover only population association

studies in which unrelated individuals of different dis-

ease states are typed at a number of SNP markers. I do

not address family-based association studies, admixture

mapping or linkage studies (BOX 3), which also have an

important role in efforts to understand the effects of

genes on disease2.

Preliminary analyses

Data quality is of paramount importance, and data

should be checked thoroughly, for example, for batch or

study-centre effects, or for unusual patterns of missing

data. Testing for Hardy–Weinberg equilibrium (HWE) can

also be helpful, as can analyses to select a good subset

of the available SNPs (‘tag’ SNPs) or to infer haplotypes

from genotypes.

Hardy–Weinberg equilibrium. Deviations from HWE can

be due to inbreeding, population stratification or selec-

tion. They can also be a symptom of disease association3,

the implications of which are often under-exploited4.

Apparent deviations from HWE can arise in the pres-

ence of a common deletion polymorphism, because of

a mutant PCR-primer site or because of a tendency to

miscall heterozygotes as homozygotes. So far, researchers

have tested for HWE primarily as a data quality check

and have discarded loci that, for example, deviate from

HWE among controls at significance level α = 10⁻³ or 10⁻⁴.

However, the possibility that a deviation from HWE is

due to a deletion polymorphism5 or a segmental duplica-

tion6 that could be important in disease causation should

now be considered before discarding loci.

Department of Epidemiology

and Public Health, Imperial

College, St Mary’s Campus,

Norfolk Place, London

W2 1PG, UK.

e-mail:

d.balding@imperial.ac.uk

doi:10.1038/nrg1916

Haplotype

A combination of alleles at

different loci on the same

chromosome.

Population stratification

Refers to a situation in which

the population of interest

includes subgroups of

individuals that are on average

more related to each other

than to other members of the

wider population.

Multiple-testing problem

Refers to the problem that

arises when many null

hypotheses are tested; some

significant results are likely

even if all the hypotheses

are true.

Hardy–Weinberg

equilibrium

Holds at a locus in a population

when the two alleles within an

individual are not statistically

associated.

A tutorial on statistical methods for

population association studies

David J. Balding

Abstract | Although genetic association studies have been with us for many years, even for

the simplest analyses there is little consensus on the most appropriate statistical procedures.

Here I give an overview of statistical approaches to population association studies, including

preliminary analyses (Hardy–Weinberg equilibrium testing, inference of phase and missing

data, and SNP tagging), and single-SNP and multipoint tests for association. My goal is to

outline the key methods with a brief discussion of problems (population structure and

multiple testing), avenues for solutions and some ongoing developments.

REVIEWS

NATURE REVIEWS | GENETICS

VOLUME 7 | OCTOBER 2006 | 781

FOCUS ON STATISTICAL ANALYSIS


[Figure in Box 1. Labels: most recent common ancestor; time; case; control; ancestral mutation.]

Testing for deviations from HWE can be carried

out using a Pearson goodness-of-fit test, often known

simply as ‘the χ2 test’ because the test statistic has

approximately a χ2 null distribution. Be aware, how-

ever, that there are many different χ2 tests. The Pearson

test is easy to compute, but the χ2 approximation can

be poor when there are low genotype counts, and it is

better to use a Fisher exact test, which does not rely on

the χ2 approximation7–9. The open-source data-analysis

software R (see online links box) has an R genetics

package that implements both Pearson and Fisher tests

of HWE, and PEDSTATS also implements exact tests9.

(All statistical genetics software cited in the article can

be found at the Genetic Analysis Software website,

which is listed in the online links box.)
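As a minimal sketch of the Pearson goodness-of-fit test for HWE described above, the following Python function compares observed genotype counts at a diallelic SNP with the counts expected from the estimated allele frequency. The function name and the genotype counts are hypothetical and chosen only for illustration; the sketch assumes that both alleles are observed, and it uses the χ2 approximation on 1 df, so it inherits the poor behaviour at low genotype counts for which an exact test is preferred.

```python
import math

def hwe_pearson(n_aa, n_ab, n_bb):
    """Pearson goodness-of-fit test for HWE at a diallelic SNP.

    Returns (chi-square statistic, approximate P value on 1 df).
    Assumes both alleles are present in the sample.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # estimated frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 df:
    # P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical genotype counts (AA, AB, BB), for illustration only.
chi2, p_value = hwe_pearson(298, 489, 213)
print(f"chi2 = {chi2:.3f}, P = {p_value:.3f}")
```

In a quality-control setting the function would typically be applied to controls only, SNP by SNP, with a stringent threshold of the kind mentioned in the text.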

A useful tool for interpreting the results of HWE and

other tests on many SNPs is the log quantile–quantile

(QQ) P-value plot (FIG. 1): the negative logarithm of the

ith smallest P value is plotted against −log (i / (L + 1)),

where L is the number of SNPs. Deviations from the

y = x line correspond to loci that deviate from the null

hypothesis10.
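The coordinates of a log QQ plot follow directly from this definition. The sketch below computes them in Python, assuming base-10 logarithms (as on the axes of Figure 1) and strictly positive P values; the function name and the example P values are hypothetical.

```python
import math

def log_qq_points(p_values):
    """Return (expected, observed) pairs of -log10 P values for a log QQ plot."""
    ordered = sorted(p_values)        # i-th smallest P value, i = 1..L
    n_tests = len(ordered)            # L, the number of SNPs tested
    points = []
    for i, p in enumerate(ordered, start=1):
        expected = -math.log10(i / (n_tests + 1))
        observed = -math.log10(p)
        points.append((expected, observed))
    return points

# Hypothetical P values from five SNPs, for illustration only.
for expected, observed in log_qq_points([0.5, 0.04, 0.9, 0.2, 0.7]):
    print(f"{expected:.3f}  {observed:.3f}")
```

Plotting the observed values against the expected values, together with the line y = x, gives the display described above.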

Missing genotype data. For single-SNP analyses, if a

few genotypes are missing there is not much problem.

For multipoint SNP analyses, missing data can be more

problematic because many individuals might have one

or more missing genotypes. One convenient solution is

data imputation: replacing missing genotypes with pre-

dicted values that are based on the observed genotypes

at neighbouring SNPs. This sounds like cheating, but

for tightly linked markers data imputation can be reli-

able, can simplify analyses and allows better use of the

observed data. Imputation methods either seek a ‘best’

prediction of a missing genotype, such as a maximum-

likelihood estimate (single imputation), or randomly select

it from a probability distribution (multiple imputations).

The advantage of the latter approach is that repetitions

of the random selection can allow averaging of results or

investigation of the effects of the imputation on resulting

analyses11.

Most software for phase assignment (see below) also

imputes missing alleles. There are also more general impu-

tation methods: for example, ‘hot-deck’ approaches11,

in which the missing genotype is copied from another

individual whose genotype matches at neighbouring loci,

and regression models that are based on the genotypes

of all individuals at several neighbouring loci12.

These analyses typically rely on missingness being

independent of both the true genotype and the pheno-

type. This assumption is widely made, even though its

validity is often doubtful. For example, as noted above,

heterozygotes might be missing more often than homo-

zygotes. What is worse, case samples are often collected

differently from controls, which can lead to differential

rates of missingness even if genotyping is carried out

blind to case–control status. The combination of these

two effects can lead to serious biases13. One simple way

to investigate differential missingness between cases

and controls is to code all observed genotypes as 1 and

missing genotypes as 0, and test for association of this

variable with case–control status.
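The missingness check just described can be sketched as a Pearson 1-df test on the 2 × 2 table of observed versus missing calls in cases and controls. The function name and the use of None to mark a missing call are assumptions made for this illustration; with low counts a Fisher exact test would be preferable to the χ2 approximation used here.

```python
import math

def missingness_test(case_calls, control_calls):
    """Pearson 1-df test of genotype missingness against case-control status.

    Each argument is a list of genotype calls at one SNP, with None marking
    a missing call. Returns (chi-square statistic, approximate P value).
    """
    # Code observed genotypes as 1 and missing genotypes as 0, then count.
    case_obs = sum(1 for g in case_calls if g is not None)
    ctrl_obs = sum(1 for g in control_calls if g is not None)
    table = [[case_obs, len(case_calls) - case_obs],
             [ctrl_obs, len(control_calls) - ctrl_obs]]
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row)
    if 0 in row or 0 in col:
        # No variation in missingness or in status: nothing to test.
        return 0.0, 1.0
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))

# Hypothetical calls: genotypes coded 0, 1 or 2, with None for a missing call.
cases = [0, 1, None, 2, 1, None, 0, 1]
controls = [0, 1, 1, 2, 1, 0, 0, None]
print(missingness_test(cases, controls))
```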

Haplotype and genotype data. Underlying an individ-

ual’s genotypes at multiple tightly linked SNPs are the

two haplotypes, each containing alleles from one parent.

I discuss below the merits of analyses that are based on

phased haplotype data rather than unphased genotypes,

and consider here only ways to obtain haplotype data.

Box 1 | Rationale for association studies

Population association studies compare unrelated individuals, but ‘unrelated’ actually

means that relationships are unknown and presumed to be distant. Therefore, we

cannot trace transmissions of phenotype over generations and must rely on

correlations of current phenotype with current marker alleles. Such a correlation might

be generated by one or more groups of cases that share a relatively recent common

ancestor at a causal locus. Recombinations that have occurred since the most recent

common ancestor of the group at the locus can break down associations of phenotype

with all but the most tightly linked marker alleles, permitting fine mapping if marker

density is sufficiently high (say, ≥1 marker per 10 kb, but this depends on local levels of

linkage disequilibrium).

This principle is illustrated in the figure, in which for simplicity I assume haploidy,

such as for X-linked loci in males. The coloured circles indicate observed alleles (or

haplotypes), and the colours denote case or control status; marker information is not

shown. The alleles within the shaded oval all descend from a risk-enhancing mutant

allele that perhaps arose some hundreds of generations in the past (red star), and so

there is an excess of cases within this group. Consequently, there is an excess of the

mutant allele among cases relative to controls, as well as of alleles that are tightly linked

with it. The figure also shows a second, minor mutant allele at the same locus that might

not be detectable because it contributes to few cases.

Although the SNP markers that are used in association studies can have up to four

nucleotide alleles, because of their low mutation rate most are diallelic, and many

studies only include diallelic SNPs. With increasing interest in deletion polymorphisms5,

triallelic analyses of SNP genotypes might become more common (treating deletion as a

third allele), but in this article I assume all SNPs to be diallelic.

Broadly speaking, association studies are sufficiently powerful only for common causal

variants. The threshold for ‘common’ depends on sample and effect sizes as well as

marker frequencies90, but as a rough guide the minor-allele frequency might need to be

above 5%. Arguments for the common-disease common-variant (CDCV) hypothesis

essentially rest on the fact that human effective population sizes are small1. A related

argument is that many alleles that are now disease-predisposing might have been

advantageous in the past (for example, those that favour fat storage). In addition,

selection pressure is expected to be weak on late-onset diseases and on variants that

contribute only a small risk. Although some common variants that underlie complex

diseases have been identified91, we still do not have a clear idea of the extent to which

the CDCV hypothesis holds.


www.nature.com/reviews/genetics


[Figure in Box 2. Labels: haplotype; typed marker locus; unobserved causal locus; disease phenotype; indirect association; direct association (shown twice).]

Significance level

Usually denoted α, and chosen

by the researcher to be the

greatest probability of type-1

error that is tolerated for a

statistical test. It is

conventional to choose

α = 5% for the overall analysis,

which might consist of many

tests each with a much lower

significance level.

Test statistic

A numerical summary of the

data that is used to measure

support for the null hypothesis.

Either the test statistic has a

known probability distribution

(such as χ2) under the null

hypothesis, or its null

distribution is approximated

computationally.

Common-disease common-

variant hypothesis

The hypothesis that many

genetic variants that underlie

complex diseases are common,

and therefore susceptible to

detection using current

population association study

designs. An alternative

possibility is that genetic

contributions to complex

diseases arise from many

variants, all of which are rare.

Effective population size

The size of a theoretical

population that best

approximates a given natural

population under an assumed

model. Human effective

population size is often taken

to mean the size of a constant-

size, panmictic population of

breeding adults that generates

the same level of

polymorphism under neutrality

as observed in an actual

human population.

Maximum-likelihood

estimate

The value of an unknown

parameter that maximizes the

probability of the observed

data under the assumed

statistical model.

Phase

The information that is

needed to determine the two

haplotypes that underlie a

multi-locus genotype within

a chromosomal segment.

Direct, laboratory-based haplotyping or typing

further family members to infer the unknown phase

are expensive ways to obtain haplotypes. Fortunately,

there are statistical methods for inferring haplotypes

and population haplotype frequencies from the geno-

types of unrelated individuals. These methods, and the

software that implements them, rely on the fact that in

regions of low recombination relatively few of the possible

haplotypes will actually be observed in any population.

These programs generally perform well14, given high

SNP density and not too much missing data. SNPHAP is

simple and fast, whereas PHASE15 tends to be more accu-

rate but comes at greater computational cost. Recently

FASTPHASE has emerged16, which is nearly as accurate

as PHASE and much faster.

True haplotypes are more informative than genotypes,

but inferred haplotypes are typically less informative

because of uncertain phasing. However, the informa-

tion loss that arises from phasing is small when linkage

disequilibrium (LD) is strong.

Note that phasing cases and controls together allows

better estimates of haplotype frequencies under the

null hypothesis of no association, but can lead to a bias

towards this hypothesis and therefore a loss of power.

Conversely, phasing cases and controls separately can

inflate type-1 error rates. A similar issue arises in imputing

missing genotypes.

Measures of LD and estimates of recombination rates.

LD will remain crucial to the design of association stud-

ies until whole-genome resequencing becomes routinely

available. Currently, few of the more than 10 million

common human polymorphisms are typed in any

given study. If a causal polymorphism is not genotyped,

we can still hope to detect its effects through LD with

polymorphisms that are typed. To assess the power of

a study design to achieve this, we need to measure LD.

However, LD is a non-quantitative phenomenon: there

is no natural scale for measuring it. Among the measures

that have been proposed for two-locus haplotype data17,

the two most important are D′ and r2.

D′ is sensitive to even a few recombinations between

the loci since the most recent mutation at one of them.

Textbooks emphasize the exponential decay over time

of D′ between linked loci under simple population-

genetic models, but stochastic effects mean that this

theoretical relationship is of limited usefulness. A disad-

vantage of D′ is that it can be large (indicating high LD)

even when one allele is very rare, which is usually of little

practical interest.

r2 reflects statistical power to detect LD: nr2 is the

Pearson test statistic for independence in a 2 × 2 table

of haplotype counts. Therefore, a low r2 corresponds to

a large sample size, n, that is required to detect the LD

between the markers. If disease risk is multiplicative

Box 2 | Types of population association study

Population association studies can be classified into

several types; for example, as follows:

Candidate polymorphism

These studies focus on an individual polymorphism that

is suspected of being implicated in disease causation.

Candidate gene

These studies might involve typing 5–50 SNPs within a

gene (defined to include coding sequence and flanking regions, and perhaps including splice or regulatory sites). The

gene can be either a positional candidate that results from a prior linkage study, or a functional candidate that is based,

for example, on homology with a gene of known function in a model species.

Fine mapping

Often refers to studies that are conducted in a candidate region of perhaps 1–10 Mb and might involve several hundred

SNPs. The candidate region might have been identified by a linkage study and contain perhaps 5–50 genes.

Genome-wide

These seek to identify common causal variants throughout the genome, and require ≥300,000 well-chosen SNPs (more are

typically needed in African populations because of greater genetic diversity). The typing of this many markers has recently

become possible because of the International HapMap Project32 and advances in high-throughput genotyping technology

(see also BOX 5).

These classifications are not precise: some candidate-gene studies involve many hundreds of genes and are similar to

genome-wide scans. Typically, a causal variant will not be typed in the study, possibly because it is not a SNP (it might be

an insertion or deletion, inversion, or copy-number polymorphism). Nevertheless, a well-designed study will have a good

chance of including one or more SNPs that are in strong linkage disequilibrium with a common causal variant, as illustrated

in the figure: the two direct associations that are indicated cannot be observed, but if r2 (see main text) between the two

loci is high then we might be able to detect the indirect association between marker locus and disease phenotype.

Statistical methods that are used in pharmacogenetics are similar to those for disease studies, but the phenotype of

interest is drug response (efficacy and/or adverse side effects). In addition, pharmacogenetic studies might be prospective

whereas disease studies are typically retrospective. Prospective studies are generally preferred by epidemiologists, and

despite their high cost and long duration some large, prospective cohort studies are currently underway for rare

diseases92,93. Often a case–control analysis of genotype data is embedded within these studies2, so many of the statistical

analyses that are discussed in this review can apply both to retrospective and prospective studies. However, specialized

statistical methods for time-to-event data might be required to analyse prospective studies94.


Regression models

A class of statistical models

that relate an outcome variable

to one or more explanatory

variables. The goal might be

to predict further values of

the outcome variable given the

explanatory variables, or to

identify a minimal set of

explanatory variables with

good predictive power.

Prospective study design

Studies in which individuals

are followed forward in time

and disease events are

recorded as they arise. DNA

and biomarker samples, and

data on environmental

exposures and lifestyle factors,

are usually obtained at the

start of the study.

Retrospective study design

Studies in which individuals are

identified for inclusion in the

study on the basis of their

disease state. Data on previous

environmental exposures and

lifestyle factors are then

recorded, and samples for

DNA and biomarker studies

might be obtained.

across alleles, and HWE holds, r2 between a marker and

a causal SNP gives the sample size that would have been

required to detect the disease association by directly typ-

ing the causal SNP, relative to the sample size required

to achieve the same power when typing the marker.
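To make the two measures concrete, the following sketch computes D, D′, r2 and the statistic nr2 from the four haplotype counts at a pair of diallelic loci. It assumes phased haplotype counts are available and uses the common convention D′ = |D| / Dmax; the function name and the example counts are hypothetical. The example is chosen so that one allele is rare, which illustrates how D′ can be at its maximum while r2 stays small.

```python
def ld_measures(n_AB, n_Ab, n_aB, n_ab):
    """Return D, D', r^2 and n*r^2 from two-locus haplotype counts."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_A = (n_AB + n_Ab) / n           # frequency of allele A at locus 1
    p_B = (n_AB + n_aB) / n           # frequency of allele B at locus 2
    if min(p_A, 1 - p_A, p_B, 1 - p_B) == 0:
        raise ValueError("both loci must be polymorphic")
    d = n_AB / n - p_A * p_B          # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max
    r2 = d * d / (p_A * (1 - p_A) * p_B * (1 - p_B))
    # n * r2 is the Pearson statistic for independence in the 2 x 2 table.
    return {"D": d, "D_prime": d_prime, "r2": r2, "n_r2": n * r2}

# Hypothetical counts of haplotypes AB, Ab, aB, ab; allele A is rare.
print(ld_measures(10, 0, 490, 500))
```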

Both D′ and r2 are two-locus measures; however, with

dense markers it is of interest to summarize LD over a

region. One approach is to compute local averages of

pairwise values of D′ and r2. Alternatively, values over a

region can be illustrated diagrammatically with colours

encoding different values18,19. LD maps20,21 provide

another solution: these fit an exponential decay func-

tion to D′ values, and the decay parameter provides a

measure of local LD. The resulting LD unit is usually

strongly correlated with underlying recombination

rate, but also reflects the history of the mutations that

generated the SNPs.

Fine-scale estimates of recombination rate might

provide the most satisfactory solution to the problem of

summarizing LD in a region because recombination is

the most important biological phenomenon underlying

LD. PHASE provides estimates22, and other available soft-

ware includes LDHAT23 and HOTSPOTTER24. Analyses

that are based on such software, and empirical studies25,26,

have shown that recombination rates are highly variable

on fine scales. This is consistent with the observation that

much of the human genome is ‘block like’27,28, with little

or no recombination within blocks but block boundaries

that are often hotspots of intense recombination.

SNP tagging. ‘Tagging’ refers to methods to select a

minimal number of SNPs that retain as much as possible

of the genetic variation of the full SNP set29–31. Simple

pairwise methods discard one (preferably that with most

missing values) of every pair of SNPs with, say, r2 > 0.9.

More sophisticated methods can be more efficient32, but

the most efficient tagging strategy will depend on the

statistical analysis to be used. In practice, tagging is only

effective in capturing common variants.
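A simple pairwise scheme of the kind described above might be sketched as follows. This is an illustration under stated assumptions rather than the algorithm of any particular tagging program: pairwise r2 values and per-SNP missing-value counts are taken as given, SNP names are hypothetical, pairs absent from the r2 table are treated as having r2 = 0, and the result can depend on the order in which pairs are visited.

```python
def pairwise_tag(snps, r2, n_missing, threshold=0.9):
    """Greedy pairwise tagging.

    For every pair of retained SNPs with r2 above the threshold, discard
    the one with more missing values (the second of the pair on a tie).
    `r2` maps frozenset({snp_a, snp_b}) to the pairwise r2 value.
    """
    kept = list(snps)
    for i, a in enumerate(snps):
        for b in snps[i + 1:]:
            if a not in kept or b not in kept:
                continue              # one of the pair was already discarded
            if r2.get(frozenset((a, b)), 0.0) > threshold:
                drop = a if n_missing[a] > n_missing[b] else b
                kept.remove(drop)
    return kept

# Hypothetical SNP names, r2 values and missing-genotype counts.
snps = ["snp1", "snp2", "snp3", "snp4"]
r2 = {
    frozenset(("snp1", "snp2")): 0.95,
    frozenset(("snp2", "snp3")): 0.92,
    frozenset(("snp1", "snp3")): 0.85,
}
n_missing = {"snp1": 3, "snp2": 10, "snp3": 1, "snp4": 0}
print(pairwise_tag(snps, r2, n_missing))
```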

There are two principal uses for tagging. The first

is to select a ‘good’ subset of SNPs to be typed in all

the study individuals from an extensive SNP set that

has been typed in just a few individuals. Until recently,

this was frequently a laborious step in study design,

but the International HapMap Project33 and related

projects now allow selection of tag SNPs on the basis of

publicly available data. The population that underlies a

particular study will typically differ from the popula-

tions for which public data are available, and a set of

tag SNPs that have been selected in one population

might perform poorly in another. However, recent

studies indicate that tag SNPs often transfer well across

populations34,35.

A second use for tagging is to select for analysis

a subset of SNPs that have already been typed in all the

study individuals. Although it is undesirable to discard

available information, the amount of information lost

might be small, and reducing the SNP set in this way

can simplify analyses and lead to more statistical power

by reducing the degrees of freedom (df) of a test29.

Tests of association: single SNP

I now come to testing for association, first dealing with

single-SNP analyses. I will discuss case–control, quan-

titative (continuous) and categorical disease outcomes,

starting with the simplest tests and moving on to more

advanced regression-based tests36, and also the score

procedure.

Case–control phenotype. Perhaps the most natural

analysis of SNP genotypes and case–control status at

a single SNP is to test the null hypothesis of no asso-

ciation between rows and columns of the 2 × 3 matrix

that contains the counts of the three genotypes (the two

homozygotes and the heterozygote) among cases and

controls. Users have a choice between, among others, a

Pearson test (2 df) or a Fisher exact test. Again, the latter

is preferred: it is computationally more demanding but

is implemented in R and other software.
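For illustration, the Pearson 2-df test on the 2 × 3 table of genotype counts can be sketched in a few lines of Python. The function name and the counts are hypothetical, and the sketch assumes that every genotype is observed at least once in the combined sample; it uses the χ2 approximation, so the Fisher exact test remains preferable when counts are low.

```python
import math

def genotype_pearson_2df(cases, controls):
    """Pearson 2-df test on a 2 x 3 table of genotype counts.

    `cases` and `controls` are counts of the three genotypes in the same
    order, for example (AA, AB, BB). Returns (chi-square, approximate P).
    """
    table = [list(cases), list(controls)]
    row = [sum(r) for r in table]
    col = [cases[j] + controls[j] for j in range(3)]
    n = sum(row)
    chi2 = 0.0
    for i in range(2):
        for j in range(3):
            expected = row[i] * col[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of a chi-square variable with 2 df: exp(-x / 2).
    return chi2, math.exp(-chi2 / 2.0)

# Hypothetical genotype counts, for illustration only.
chi2, p_value = genotype_pearson_2df((45, 110, 95), (70, 120, 60))
print(f"chi2 = {chi2:.3f}, P = {p_value:.4f}")
```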

For complex traits, it is widely thought that contribu-

tions to disease risk from individual SNPs will often be

roughly additive — that is, the heterozygote risk will be

intermediate between the two homozygote risks. The

general tests (Pearson 2 df and Fisher) have reason-

able power regardless of the underlying risks, but if the

genotype risks are additive they will not be as powerful

as tests that are tailored to this scenario. One way to

improve power to detect additive risks is to count alleles

rather than genotypes so that each individual contrib-

utes twice to a 2 × 2 table and a Pearson 1-df test can be

applied. However, this procedure is not recommended37

because it requires an assumption of HWE in cases and

controls combined and does not lead to interpretable

risk estimates.

Box 3 | Linkage and other approaches

In all approaches to gene mapping, the key idea is that a disease-predisposing allele will

pass from generation to generation together with variants at tightly linked loci. Linkage

studies directly examine the transmission across generations of both disease phenotype

and marker alleles within a known pedigree, seeking correlations that suggest that the

marker is linked with a causal locus. In parametric linkage analysis62,95, disease and

marker transmission are evaluated under a specified disease model using likelihood-

based statistical analyses of extended pedigrees. In nonparametric linkage analysis96,

excess allele sharing is sought in affected relatives, which avoids the need to posit a

disease model.

An important advantage of linkage methods is that information is combined across

families such that evidence for a causal role of a locus can accumulate even if different

variants segregate at that locus in different families. Therefore, linkage analysis is

appropriate when many rare variants at a locus each contribute to disease risk. However,

linkage approaches can require many and/or large families to achieve satisfactory power

and resolution.

There are various strategies for combining linkage with association analyses for family-

based data sets. The best-known of the family-based association methods is the

transmission disequilibrium test (TDT)97, which implements a matched-pair study design

by comparing alleles that are transmitted to an affected child with the untransmitted

parental alleles. More general and more powerful family-based association tests are

available98,99.

Admixture mapping100,101 has some similarities with nonparametric linkage. It can use

case-only samples from a population formed by recent admixture of two or more

populations with very different disease prevalences. An excess sharing among cases of

an allele that is more common in the high-risk ancestral population could be a signal that

the allele contributes to disease risk.


[Figure 1 axes. Horizontal: −log10 (expected P value), range 0–4. Vertical: −log10 (observed P value), range 0–5.]

Time to event

Refers to data in which the time

to an event of interest is

recorded, such as the time from

the start of the study to disease

onset, if any. This is potentially

more informative than simply

recording case or control status

at the end of the study.

Linkage disequilibrium

The statistical association,

within gametes in a population,

of the alleles at two loci.

Although linkage disequilibrium

can be due to linkage, it can

also arise at unlinked loci; for

example, because of selection

or non-random mating.

Type-1 error

The rejection of a true null

hypothesis; for example,

concluding that HWE does not

hold when in fact it does. By

contrast, the power of a test is

the probability of correctly

rejecting a false null hypothesis.

Degrees of freedom

This term is used in different

senses both within statistics

and in other fields. It can often

be interpreted as the number

of values that can be defined

arbitrarily in the specification

of a system; for example, the

number of coefficients in a

regression model. It is often

sufficient to regard degrees of

freedom as a parameter that is

used to define particular

probability distributions.

Bayesian

A statistical school of thought

that, in contrast to the

frequentist school, holds that

inferences about any unknown

parameter or hypothesis

should be encapsulated in a

probability distribution, given

the observed data. Bayes

theorem is a celebrated result

in probability theory that allows

one to compute the posterior

distribution for an unknown

from the observed data and its

assumed prior distribution.

Likelihood-ratio test

A statistical test that is based

on the ratio of likelihoods

under alternative and null

hypotheses. If the null

hypothesis is a special case of

the alternative hypothesis,

then the likelihood-ratio

statistic typically has a χ2

distribution with degrees of

freedom equal to the number

of additional parameters under

the alternative hypothesis.

The Cochran–Armitage test38 (also known as just the

Armitage test and called within R the proportion trend

test) is similar to the allele-count test. It is more conser-

vative and does not rely on an assumption of HWE. The

idea is to test the hypothesis of zero slope for a line that

fits the three genotypic risk estimates best (FIG. 2).
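The trend statistic can be written in closed form: it equals the total sample size multiplied by the squared correlation between case–control status and a genotype score. The sketch below assumes the additive scores 0, 1 and 2 for the three genotypes, which is a choice made for this illustration; the function name and the counts are also hypothetical, and at least two genotypes and both disease states must be present.

```python
import math

def armitage_trend(cases, controls, scores=(0, 1, 2)):
    """Cochran-Armitage trend test on a 2 x 3 table of genotype counts.

    Returns (chi-square statistic, approximate P value on 1 df).
    """
    n_g = [r + s for r, s in zip(cases, controls)]   # genotype totals
    n = sum(n_g)
    r_tot = sum(cases)
    s_tot = n - r_tot
    sx_r = sum(x * r for x, r in zip(scores, cases))
    sx_n = sum(x * m for x, m in zip(scores, n_g))
    sxx_n = sum(x * x * m for x, m in zip(scores, n_g))
    # n times the squared correlation between score and disease status.
    numerator = n * (n * sx_r - r_tot * sx_n) ** 2
    denominator = r_tot * s_tot * (n * sxx_n - sx_n ** 2)
    chi2 = numerator / denominator
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))

# Hypothetical genotype counts (0, 1 or 2 copies of the minor allele).
chi2, p_value = armitage_trend((45, 110, 95), (70, 120, 60))
print(f"chi2 = {chi2:.3f}, P = {p_value:.5f}")
```

Other score vectors, such as (0, 1, 1) or (0, 0, 1), would target dominant or recessive effects of the scored allele instead of additive ones.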

There is no generally accepted answer to the question

of which single-SNP test to use. We could design optimal

analyses if we knew what proportion of undiscovered

disease-predisposing variants function additively and

what proportions are dominant, recessive or even over-

dominant. Lacking this knowledge, researchers have

to use their judgment to choose which ‘horse’ to back.

Adopting the Armitage test implies sacrificing power

if the genotypic risks are far from additive, in order to

obtain better power for near-additive risks. Using the

Fisher test spreads the research investment over the full

range of risk models, but this inevitably means investing

less in the detection of additive risks.

An intermediate choice is to take the maximum test

statistic from those designed for additive, dominant or

recessive effects39. This approach weights those three

models equally but excludes possible overdominant

effects. A possible modification is to give more weight

to the additive-test statistics, reflecting the greater

plausibility of the additive model, but to allow strong

non-additive effects to be detected. A different approach

is to adopt the Armitage test when the minor-allele fre-

quency is low and the Fisher test when the counts for

all three genotypes are high enough for it to have good

power for non-additive models.

My emphasis on the role of the researcher’s judge-

ment hints at Bayesian approaches, in which researchers

make explicit their a priori predictions about the nature

of disease risks. Bayesian approaches do not yet have a

big role in genetic association analyses, possibly because

of the additional computation that they can require40.

I expect this approach to have a more prominent role in

future developments. (See Supplementary information S1

(box) for suggestions of single-SNP tests that are based

on Bayes factors.)

Continuous outcomes: linear regression. The natural

statistical tools for continuous (or quantitative) traits

are linear regression and analysis of variance (ANOVA).

ANOVA is analogous to the Pearson 2-df test in that it

compares the null hypothesis of no association with a

general alternative, whereas linear regression achieves a

reduction in degrees of freedom from 2 to 1 by assuming

a linear relationship between mean value of the trait and

genotype (FIG. 3). In either case, tests require the trait to

be approximately normally distributed for each geno-

type, with a common variance. If normality does not

hold, a transformation (for example, log) of the original

trait values might lead to approximate normality.

Standard statistical procedures offer a hierarchy of

χ2 (1 df) tests in which the ANOVA model is compared with

the linear regression model, which in turn is compared

with the null model of no association. The convention

is to accept the simplest model that is not significantly

inferior to a more general model.

Logistic regression. Returning now to case–control

outcomes, I consider a more advanced approach. The

linear models that are outlined above for continuous

traits cannot be applied directly to case–control studies,

because case–control status is not normally distributed

and there is nothing to stop predicted probabilities lying

outside the range 0–1.

These problems are overcome in logistic regression,

in which the transformation logit (π) = log (π / (1 − π))

is applied to πi, the disease risk of the ith individual. The

value of logit (πi) is equated to either β0, β1 or β2, according

to the genotype of individual i (β1 for heterozygotes). The

likelihood-ratio test of this general model, against the null

hypothesis β0 = β1 = β2, has 2 df, and for large sample sizes

is equivalent to the Pearson 2-df test. Users can improve

the power to detect specific disease risks, at the cost of

lower power against some other risk models, by restricting

the values of β0, β1 and β2. For example, by requiring that

the coefficients are linear, so that β1 is half-way between β0

and β2, a 1-df test is obtained that is effectively equivalent

to the Armitage test. Tests for recessive or dominant effects

can be obtained by requiring that β0 = β1 or β1 = β2.
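For the general (three-coefficient) model no iterative fitting is needed, because the maximum-likelihood estimate of each genotype risk is simply the proportion of cases with that genotype, so that each coefficient is the logit of that proportion. The following sketch uses this shortcut to compute the 2-df likelihood-ratio statistic; it is an illustration under that assumption, not a general logistic regression routine, and constrained models such as the additive one would require numerical fitting that is not shown. The function names and the counts are hypothetical, and every genotype is assumed to be observed.

```python
import math

def _xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def _loglik(cases, controls, risks):
    """Binomial log-likelihood of the counts given per-genotype risks."""
    return sum(_xlogy(r, p) + _xlogy(s, 1.0 - p)
               for r, s, p in zip(cases, controls, risks))

def logistic_lrt_2df(cases, controls):
    """Likelihood-ratio test of the general genotype model against the
    null hypothesis beta0 = beta1 = beta2 (2 df).

    Returns (LRT statistic, approximate P value).
    """
    n_g = [r + s for r, s in zip(cases, controls)]
    risks = [r / m for r, m in zip(cases, n_g)]   # logit(risk_g) = beta_g
    pooled = sum(cases) / sum(n_g)                # common risk under the null
    ll_general = _loglik(cases, controls, risks)
    ll_null = _loglik(cases, controls, [pooled] * len(n_g))
    lrt = 2.0 * (ll_general - ll_null)
    # Survival function of a chi-square variable with 2 df: exp(-x / 2).
    return lrt, math.exp(-lrt / 2.0)

# Hypothetical genotype counts, for illustration only.
lrt, p_value = logistic_lrt_2df((45, 110, 95), (70, 120, 60))
print(f"LRT = {lrt:.3f}, P = {p_value:.4f}")
```

Because the sampling is case–control, the fitted intercept reflects the sampling scheme rather than the population risk, but the comparison between genotypes is unaffected.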

So far, logistic regression has not brought much that

is new for single-SNP analyses. There is often a score

procedure (see below) that is effectively equivalent to

a logistic regression counterpart and is usually simpler

and computationally faster. However, logistic regres-

sion offers a flexible tool that can readily accommodate

multiple SNPs (see later section), possibly with complex

epistatic and environmental interactions or covariates

such as sex or age of onset.

Figure 1 | Log quantile–quantile (QQ) P-value plot

for 3,478 single-SNP tests of association. The close

adherence of P values to the black line (which

corresponds to the null hypothesis) over most of the

range is encouraging as it implies that there are few

systematic sources of spurious association. The use of the

log scale helps to emphasize the smallest P values (in

the top right corner of the plot): the plot is suggestive of

multiple weak associations, but the deviation of

observed small P values from the null line is unlikely to be

sufficient to reach a reasonable criterion of significance.
