Genome-wide association studies for complex traits: consensus, uncertainty and challenges

Article in Nature Reviews Genetics 9(5):356-69 · June 2008
DOI: 10.1038/nrg2344 · Source: PubMed
Abstract
The past year has witnessed substantial advances in understanding the genetic basis of many common phenotypes of biomedical importance. These advances have been the result of systematic, well-powered, genome-wide surveys exploring the relationships between common sequence variation and disease predisposition. This approach has revealed over 50 disease-susceptibility loci and has provided insights into the allelic architecture of multifactorial traits. At the same time, much has been learned about the successful prosecution of association studies on such a scale. This Review highlights the knowledge gained, defines areas of emerging consensus, and describes the challenges that remain as researchers seek to obtain more complete descriptions of the susceptibility architecture of biomedical traits of interest and to translate the information gathered into improvements in clinical management.
The first wave of large-scale, high-density genome-wide association (GWA) studies has improved our understanding of the genetic basis of many complex traits (REF. 1). For several diseases, including type 1 (REFS 2,3) and type 2 diabetes (REFS 4–9), inflammatory bowel disease (REFS 10–14), prostate cancer (REFS 15–20) and breast cancer (REFS 21–23), there has been rapid expansion in the numbers of loci implicated in predisposition. For others, such as asthma (REF. 24), coronary heart disease (REFS 25–27) and atrial fibrillation (REF. 28), fewer novel loci have been found, although opportunities for mechanistic insights are equally promising. Several common variants influencing important continuous traits, such as lipids (REFS 7,29–31), height (REFS 32–35) and fat mass (REFS 36–38), have also been found. An updated list of published GWA studies can be found at the National Cancer Institute (NCI)-National Human Genome Research Institute (NHGRI)'s catalog of published genome-wide association studies.
These findings are providing valuable clues to the allelic architecture of complex traits in general. At the same time, many methodological and technical issues that are relevant to the successful prosecution of large-scale association studies have been addressed. However, despite understandable celebration of these achievements, sober reflection reveals many challenges ahead. Compelling signals have been found, often highlighting previously unsuspected biology, but, for most of the traits studied, known variants explain only a fraction of observed familial aggregation (REF. 39), limiting the potential for early application to determine individual disease risk. Because current technology surveys only a limited subset of potentially relevant sequence variation, this should come as no surprise. Much work remains to obtain a complete inventory of the variants at each locus that contribute to disease risk and to define the molecular mechanisms through which these variants operate. The ultimate objectives (full descriptions of the susceptibility architecture of major biomedical traits and translation of the findings into clinical practice) remain distant.
With completion of the initial wave of GWA scans, it is timely to consider the status of the field. This Review considers each major step in the implementation of a GWA scan, highlighting areas where there is an emerging consensus over the ingredients for success, and those aspects for which considerable challenges remain.
Subject ascertainment and design
Although there is a growing focus on the application of GWA methodologies to population-based cohorts, most published GWA studies have featured case–control designs, which raise issues related to the optimal selection of both case and control samples.
*Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK.
Correspondence to M.I.M. e-mail: mark.mccarthy@drl.ox.ac.uk
doi:10.1038/nrg2344
Published online 9 April 2008
Genome-wide association (GWA) studies
Studies in which a dense array of genetic markers, which captures a substantial proportion of common variation in genome sequence, is typed in a set of DNA samples that are informative for a trait of interest. The aim is to map susceptibility effects through the detection of associations between genotype frequency and trait status.
Mark I. McCarthy*, Gonçalo R. Abecasis§, Lon R. Cardon*||, David B. Goldstein, Julian Little#, John P. A. Ioannidis**‡‡ and Joel N. Hirschhorn§§||||¶¶
REVIEWS | 356 | MAY 2008 | VOLUME 9 | www.nature.com/reviews/genetics | © 2008 Nature Publishing Group
Case–control design
An association study design in which the primary comparison is between a group of individuals (cases), ascertained for the phenotype of interest and that are presumed to have a high prevalence of susceptibility alleles for that trait, and a second group (controls), not ascertained for the phenotype and considered likely to have a lower prevalence of such alleles.
Selection bias
Bias arising from the fact that the samples ascertained for the study (particularly controls) might not be representative of the wider population that they are purported to represent.
Misclassification bias
Bias resulting from the failure to correctly assign individuals to the relevant group in a case–control study; for example, the presence of some individuals who meet the criteria for being cases in a population-based control sample.
Population stratification
The presence in study samples of individuals with different ancestral and demographic histories: if cases and controls differ with respect to these features, markers that are informative for them might be confounded with disease status and lead to spurious associations.
Case selection. The principal issues with regard to case ascertainment revolve around the extent to which selection should be driven by manoeuvres that are designed to improve study power through enrichment for specific disease-predisposing alleles. These include efforts to minimize phenotypic heterogeneity or to focus on extreme and/or familial cases (defined, for example, by early age of onset or ascertainment from multiplex pedigrees). Because the genetic architecture of most complex traits remains poorly understood, the value of such efforts is hard to predict. In most circumstances, and particularly when the total GWA sample size has financial or operational constraints, efforts to enrich case selection are likely to improve power. However, there are situations in which selection of familial cases or extreme individuals might have the opposite effect (REFS 40,41).
Control selection. Optimal selection of control samples remains more controversial, although the accumulating empirical data indicate that many commonly expressed concerns have been overstated. The Wellcome Trust Case Control Consortium (WTCCC) study was able to demonstrate the effectiveness of a 'common control' design in which 3,000 UK controls were compared with 2,000 cases from each of 7 different diseases (REF. 1). The WTCCC also assuaged concerns about the potential for selection bias when using non-population-based controls (REF. 1). Comparison of the genome-wide genotypic distributions from the two constituents of the WTCCC common-control resource (one derived from a population-based birth cohort, the other from opportunistic sampling of blood donors) revealed no excess of significant associations, indicating that ascertainment, selection and survival biases were, in this situation at least, having minimal impact on genotype distributions. Although each prospective control sample must be critically evaluated, these findings suggest that a broad range of ascertainment schemes are compatible with GWA analysis.
One consequence of the common-control design is the potential loss of power that is associated with the inability to exclude latent diagnoses of the phenotype of interest through intensive screening of controls. Fortunately, the consequences of misclassification bias are modest unless the trait is common, and any loss of power is recoverable by increasing the sample size (BOX 1). For common traits, such as obesity and hypertension, in which the effect of misclassification on power is greatest (REF. 1), one remedy involves adopting a more stringent case definition, for example, based on early age of onset or ascertainment of a more extreme phenotype, while still excluding monogenic cases. Although the most powerful strategy for a given fixed sample size involves a 'hypernormal' control group, it might be difficult to identify such individuals without introducing inadvertent selection effects. For instance, selecting extremely low-weight individuals as controls for a case–control study of obesity could result in overrepresentation of alleles primarily associated with chronic medical disease or nicotine addiction rather than weight regulation per se.
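The dilution caused by latent cases among unscreened controls can be illustrated numerically. The sketch below is not taken from the article: it assumes a simple binary-trait mixture model in which a fraction of population-based controls equal to the trait prevalence are unrecognised cases. The function names and the allele frequencies are hypothetical and chosen only for illustration.

```python
def control_allele_freq(p_case, p_noncase, prevalence):
    """Risk-allele frequency in unscreened, population-based controls.

    Assumption (not from the article): a fraction `prevalence` of such
    controls are latent cases, so the control frequency is a mixture of
    the case and non-case frequencies.
    """
    return prevalence * p_case + (1.0 - prevalence) * p_noncase


def retained_fraction(prevalence):
    """Share of the case versus non-case frequency difference that remains
    observable when controls are drawn from the whole population."""
    return 1.0 - prevalence


# Hypothetical frequencies, for illustration only.
p_case, p_noncase = 0.33, 0.30
for k in (0.05, 0.20):
    observed = p_case - control_allele_freq(p_case, p_noncase, k)
    print(f"prevalence {k:.0%}: observed frequency difference {observed:.4f}, "
          f"retained fraction {retained_fraction(k):.2f}")
```

Under this simplified model the observable frequency difference shrinks by the factor (1 - prevalence), which is small for a 5% trait and more noticeable for a 20% trait. Box 1 uses a quantitative-trait threshold model, so its numbers will differ from this binary sketch.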
Other case–control design issues. Four other issues loom large in the design of case–control studies. The first is sample size, and with this issue the consensus view is clear: the more samples the better (REFS 1,34,35,38). The initial wave of GWA studies has shown that, with rare exceptions, the effect sizes resulting from common SNP associations are modest, and that sample sizes in the thousands are essential (REF. 1).
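To see why modest effect sizes call for thousands of samples, the following sketch approximates the power of a simple allelic test. It is an illustration under stated assumptions rather than the method of any study cited here: a two-sided two-proportion z-test on allele counts, Hardy-Weinberg equilibrium, two independent alleles per person, and a hypothetical variant with control allele frequency 0.30 and allelic odds ratio 1.2. The function names are invented for this example.

```python
from math import sqrt
from statistics import NormalDist


def case_freq_from_odds_ratio(p_control, odds_ratio):
    """Case allele frequency implied by a control frequency and an allelic odds ratio."""
    odds = odds_ratio * p_control / (1.0 - p_control)
    return odds / (1.0 + odds)


def allelic_power(p_case, p_control, n_cases, n_controls, alpha=1e-6):
    """Approximate power of a two-sided two-proportion z-test on allele counts.

    Assumes two independent alleles per individual. The probability of
    rejecting in the direction opposite to the true effect is ignored,
    which is negligible at stringent alpha.
    """
    norm = NormalDist()
    a_case, a_ctrl = 2 * n_cases, 2 * n_controls  # allele counts
    p_bar = (a_case * p_case + a_ctrl * p_control) / (a_case + a_ctrl)
    se_null = sqrt(p_bar * (1 - p_bar) * (1 / a_case + 1 / a_ctrl))
    se_alt = sqrt(p_case * (1 - p_case) / a_case
                  + p_control * (1 - p_control) / a_ctrl)
    z_crit = norm.inv_cdf(1 - alpha / 2)
    delta = abs(p_case - p_control)
    return norm.cdf((delta - z_crit * se_null) / se_alt)


# Hypothetical variant: control frequency 0.30, allelic odds ratio 1.2.
p_ctrl = 0.30
p_case = case_freq_from_odds_ratio(p_ctrl, 1.2)
for n in (500, 2000, 5000):
    print(n, round(allelic_power(p_case, p_ctrl, n, n), 3))
```

With these assumed inputs, power at a threshold of 10^-6 is very low with a few hundred cases and controls and becomes substantial only with several thousand of each.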
The second issue relates to the propensity for latent population substructure (population stratification and cryptic relatedness) to inflate the type 1 error rate and generate spurious claims of association around variants that are informative for that substructure (REFS 42,43). The evidence emerging from GWA studies is reassuring: as long as cases and controls are well matched for broad ethnic background, and measures are taken to identify and exclude individuals whose GWA data reveal substantial differences in genetic background, the impact of residual substructure on type 1 error seems modest (REF. 1). Several statistical tools exist to detect and adjust for residual stratification (REFS 42,44), and inventories of markers that are informative for the detection of ethnic substructure are a useful by-product of current scans (REFS 1,45–47). These approaches can be used to adjust for substructure even in populations with quite diverse antecedents (such as European-descent populations in North America) (REFS 46,47) and with negligible impact on power (REF. 48). Analysis in African-descent populations is complicated by their greater haplotypic diversity and fine-scale geographical structure (REF. 49), and by the extensive admixture demonstrated by African-descent populations that are resident in Europe and North America. Furthermore, it is important to note that the tools mentioned above (particularly genomic-control approaches (REF. 44)) correct for 'average' genome-wide measures of ethnic admixture, and will not always eliminate spurious associations immediately adjacent to markers that are strongly informative about ancestry.
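The 'average' nature of a genomic-control correction can be made concrete with a short sketch. This is a minimal illustration of the idea as it is commonly described (an inflation factor estimated from the median 1-df chi-square statistic across markers, then applied uniformly), not a reproduction of the cited method. The function names and test statistics are hypothetical.

```python
from statistics import median

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(chi2_stats):
    """Inflation factor: observed median statistic over the null median."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN


def genomic_control_adjust(chi2_stats):
    """Divide every statistic by the same genome-wide factor.

    The factor is floored at 1 so that statistics are never inflated.
    """
    lam = max(genomic_inflation(chi2_stats), 1.0)
    return [x / lam for x in chi2_stats]


# Hypothetical statistics: mild genome-wide inflation plus one marker that
# is strongly informative about ancestry.
stats = [0.1, 0.3, 0.5, 0.6, 0.9, 1.4, 2.0, 3.1, 30.0]
lam = genomic_inflation(stats)
adjusted = genomic_control_adjust(stats)
print(f"lambda = {lam:.2f}; largest adjusted statistic = {max(adjusted):.1f}")
```

Because every marker is scaled by the same factor, a marker whose statistic is inflated far more than the genome-wide average can remain apparently significant after adjustment, which is the limitation noted above.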
Author addresses
Oxford Centre for Diabetes, Endocrinology and Metabolism, University of Oxford, Oxford, UK.
§Center for Statistical Genetics, Department of Biostatistics, University of Michigan, Ann Arbor, Michigan 48109, USA.
||Fred Hutchinson Cancer Research Center, Seattle, Washington 98109, USA.
Center for Population Genomics and Pharmacogenetics, Duke Institute for Genomic Sciences and Policy, Duke University Medical Center, Duke University, Durham, North Carolina 27708, USA.
#Department of Epidemiology and Community Medicine, University of Ottawa, Ottawa, Ontario, Canada.
**Clinical and Molecular Epidemiology Unit, Department of Hygiene and Epidemiology, University of Ioannina School of Medicine, and Biomedical Research Institute, Foundation for Research and Technology-Hellas, Ioannina 45110 Greece.
‡‡Department of Medicine, Tufts University School of Medicine, Boston, Massachusetts 02111, USA.
§§Division of Genetics and Endocrinology and Program in Genomics, Children’s Hospital, Boston, Massachusetts 02115, USA.
||||Department of Genetics, Harvard Medical School, Boston, Massachusetts 02115, USA.
¶¶Broad Institute at MIT and Harvard, Cambridge, Massachusetts 02142, USA.
Cryptic relatedness
Evidence (typically gained from analysis of GWA data) that, despite allowance for known family relationships, individuals in the study sample have residual, non-trivial degrees of relatedness, which can violate the independence assumptions of standard statistical techniques.
Family-based association methods
A suite of analytical approaches in which association testing is performed within families: such approaches offer protection from population substructure effects but at the price of reduced power.
The third issue concerns the relative merits of family-based and case–control association methods. Although family-based association methods provide a robust strategy for dealing with stratification, this typically comes at the cost of reduced power (REF. 50). Given the ease with which GWA data enable the detection of, and correction for, population substructure (REFS 42,44), this particular justification has become less persuasive. Nevertheless, there are many valuable clinical resources (for example, isolates) for which pedigree information can be usefully exploited. One option for the efficient use of family data in such a setting is to restrict high-density scanning to a subset of pedigree members and then use information on patterns of chromosomal segregation derived from low-density
Box 1 | The impact of selection by phenotype among controls on power and sample size
In a case–control study, the manner in which the controls are ascertained (with respect to the phenotype of interest)
has implications for the power of the study and for sample size. The panels on the left show estimates of power for a
sample size of 2,000 cases and 2,000 controls and α (p value) = 10^-6. Those on the right show the sample sizes (that is,
the number of case–control pairs) that are required for 80% power at the same threshold. In the upper panels, the
disease of interest has a population prevalence of 5% (so that cases are ascertained purely from the top 5% of the
population distribution); in the lower panels, the population prevalence is 20%. In each panel, power or sample size
estimates are shown for a range of control selection thresholds, that is, the trait-distribution threshold that is used to
define the controls. Under scenario A, controls are ascertained from the full distribution (that is, population-based
controls): a proportion (5% or 20%) will meet the criteria for being cases. Under scenario B, controls are ascertained
only if they cannot be cases: they come from the residual part (bottom 95% or 80%) of the distribution. Under scenario
C, hypernormal controls have been selected exclusively from the lowest 5% of the distribution. Each panel considers
four potential susceptibility loci. Tracks in blue denote loci that account for 0.25% of overall trait variance, tracks in red
denote loci that account for 1%. Light red and light blue symbols denote that the variant responsible is common
(overall allele frequency 30%); dark red and dark blue symbols denote that the variant is rare (1%).
As expected, in all settings, scenario C is the most powerful strategy for given overall case–control sample size, and
scenario A is the least powerful strategy. When the disease prevalence is modest (5%; upper panels), the distinctions
between scenarios A and B are not large, and it will often be easier to increase sample size than to undertake detailed
phenotypic examination of the controls to exclude latent cases. When the disease prevalence is higher (20%; lower
panels), misclassification is more prevalent under scenario A, the adverse consequences of using population-based
controls are more marked, and the advantages of using hypernormal controls (scenario C), if available, are most obvious.
[Box 1 figure: four panels plotted against control selection threshold (0 to 1.0). Left panels: power (0 to 1.0) for α = 10^-6 with 2,000 cases and 2,000 controls. Right panels: sample size (0 to 15,000) required for 80% power at α = 10^-6. Upper panels: trait prevalence 5%. Lower panels: trait prevalence 20%. Tracks: QTL variance 0.25% with QTL allele frequency 30%; QTL variance 0.25% with QTL allele frequency 1%; QTL variance 1% with QTL allele frequency 30%; QTL variance 1% with QTL allele frequency 1%. Points A, B and C mark the control-selection scenarios described in the box text.]
Pleiotropy
The phenomenon whereby a single allele can affect several distinct aspects of the phenotype of an organism, often traits not previously thought to be mechanistically related.
Linkage disequilibrium (LD)
The nonrandom allocation of alleles at nearby variants to individual chromosomes as a result of recent mutation, genetic drift or selection, manifest as correlations between genotypes at closely linked markers.
Copy number variant (CNV)
A class of DNA sequence variant (including deletions and duplications) in which the result is a departure from the expected diploid representation of DNA sequence.
DNA pooling approaches
Association studies that are conducted using estimates of allele frequencies derived from pools of DNA compiled from multiple subjects rather than individual DNA samples.
genotyping in the remaining members to propagate genotypes through the family (REF. 51).
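As an illustration of how a within-family test sidesteps population substructure, the sketch below implements the transmission disequilibrium test (TDT) in its basic McNemar form. The article does not name a specific family-based method, so the choice of the TDT here is an assumption, as are the function name and the example counts. The test uses only heterozygous parents and compares how often they transmit versus do not transmit the allele of interest to an affected child.

```python
from math import erfc, sqrt


def tdt(transmitted, untransmitted):
    """Basic TDT statistic and p value.

    `transmitted` and `untransmitted` count heterozygous parents who did or
    did not pass the allele of interest to an affected offspring. Under the
    null the two counts are expected to be equal, and the statistic is
    approximately chi-square distributed with 1 degree of freedom.
    """
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) parents")
    chi2 = (transmitted - untransmitted) ** 2 / total
    # Upper-tail probability of a 1-df chi-square via the complementary
    # error function.
    p_value = erfc(sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts, for illustration only.
chi2, p = tdt(transmitted=60, untransmitted=40)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
```

Because each comparison is made within a family, differences in allele frequency between subpopulations do not bias the test. The cost is that only heterozygous parents contribute information, which is one reason such designs tend to have lower power per genotyped individual.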
The fourth question relates to the potential to use historical control genotypes to substitute for, or supplement, newly typed controls in future GWA studies. The risks associated with this (particularly inflation of the type 1 error) will clearly depend on the extent to which there are disparities between the new cases and historical controls with respect to population origins, DNA format (whole-genome amplified DNA versus native DNA as well as storage conditions) (REFS 52,53) and genotyping implementation (platform, genotyping centre, generation of chip or allele-calling software). The limits of acceptable divergence are not yet known, but it seems safest that studies intending to use historical control data also type a sample of ethnically matched controls (including a subset of the historical control samples, if available)
using the same assay as for the <