Haplotype diversity and SNP frequency dependence in the description of genetic variation

Haplotype diversity is controlled by a variety of processes, including mutation, recombination, marker ascertainment and demography. Understanding the extent to which genetic variation at physically linked loci is co-inherited is crucial for the design of the HapMap project and the correct interpretation of the resulting data. In the absence of an analytical theory, extensive coalescent simulations are used to disentangle the influence of all of these factors on haplotype diversity. In addition to these qualitative insights, this study also demonstrates (i) that marker spacing and frequency profoundly influence observed levels of haplotype diversity; (ii) that the spectrum of haplotypes contains information about how exhaustively genetic variation in a region is described by a given marker set; and (iii) that so-called haplotype blocks can be generated by the stochasticity inherent in the recombination process without having to assume variation in the recombination rate.
Michael PH Stumpf*
Department of Biological Sciences, Biochemistry Building, Imperial College London, London SW7 2AY, UK
European Journal of Human Genetics (2004) 12, 469–477. doi:10.1038/sj.ejhg.5201179
Published online 17 March 2004
Keywords: population genetics; HapMap project; haplotype tagging; haplotype blocks
Introduction
An increasing number of empirical studies [1–3] investigate whether the inheritance of genetic variants occurs in a block-like manner and the potential implications of this for association studies [4]. Reported haplotype diversities along extended stretches of DNA appear surprisingly simple with most chromosomes belonging to one of roughly a handful of different haplotypes [5]. The levels of linkage disequilibrium (LD) are also reported to be consistently high between markers that are in the same block although LD can also extend beyond block boundaries [6,7]. If such a picture were to prevail it would have obvious consequences for the design of association studies [8].
In an important paper Jeffreys et al [1] showed that at least sometimes blocks may be delimited by recombination hotspots; the recombination rate in fairly localized regions can exceed the background or block recombination rate by up to four orders of magnitude and LD does not extend beyond the block boundaries. In many cases, however, there is as yet no conclusive evidence that block boundaries coincide with recombination hotspots [9,10]. If they generally did, we could hope that block boundaries, and possibly knowledge of haplotypes in one population, would allow us to make predictions or inferences for other populations. Unfortunately, however, many reports of blocks fail to show evidence for such a connection with hotspots [11] and the methods by which blocks are ascertained [3,8] may at least be partly to blame for this.
There are three main objectives of this study of haplotype diversity. First, on a fundamental level we gain insight (in the absence of an analytical theory) into how physical proximity between markers, the marker frequencies and the intensity of recombination interact to determine the complexity of the haplotype spectrum. Second, recent theoretical work by Wiuf et al [12] is followed up. These authors have shown that the number T of haplotype tagging SNPs (htSNPs) [2] necessary to describe a given set of M haplotypes defined by N SNPs is bounded by log₂ M < T < min(N, M−1). Here we determine, for different scenarios, the relationship between M and N. Third, a simple block definition is used to evaluate properties of blocks and how inferred blocks (which do not correspond to recombination hotspots) depend on marker characteristics and recombination rate.
Revised 12 December 2003; accepted 21 January 2004
*Correspondence: Dr MPH Stumpf, Department of Biological Sciences, Biochemistry Building, Imperial College London, London SW7 2AZ, UK. Tel: +44 20 7594 5114; Fax: +44 20 7594 5789; E-mail: m.stumpf@imperial.ac.uk
© 2004 Nature Publishing Group All rights reserved 1018-4813/04 $30.00
www.nature.com/ejhg
All three aspects of this work have implications in the current run-up to the HapMap project [8]. The study provides guidance into when and how the resulting SNP data are best summarized in terms of haplotypes. Moreover, as will become clear, haplotype diversity and the combinatorial structure of haplotypes also hold information about how exhaustively genetic variation in a region has been sampled.
Methods
Simulation procedures
In the following discussion we simulate the ancestral recombination graph [13] assuming uniform recombination and mutation rates ρ and μ, respectively. Assuming constant ρ allows the study of the behaviour in blocks of low recombination rates, or the expected behaviour of haplotype diversity under a null model; both aspects will be considered here. Throughout we assume a single panmictic population with an effective population size of N_e = 10 000 diploid individuals. Throughout we use a sample size of n = 500 chromosomes and consider a stretch of 50 kb length; the sample size is large compared to most studies of LD performed today [3,9], but smaller than the population samples predicted for future case-control studies [14]. The mutation rates are assumed to be 10⁻⁸/(nucleotide generation) and 10⁻⁹/(nucleotide generation), whence the population mutation rates along the whole stretch are μ = 50 and 5; the mutation model used here is the infinite sites model. We consider two recombination rates which correspond to 1 and 0.1 cM/Mb in addition to the case of no recombination; the corresponding population recombination rates are thus ρ = 50, 5 and 0, respectively.
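To make the simulation setting concrete, the following is a minimal sketch of the simplest case described above: a neutral coalescent with infinite-sites mutation and no recombination (ρ = 0). It is not the ancestral recombination graph used in the study, and the function names `poisson` and `simulate_haplotypes` are hypothetical. Time is assumed to be measured in units of 2N_e generations, so that pairs of lineages coalesce at rate 1 and each lineage accumulates mutations at rate θ/2, where θ is the population mutation rate for the whole stretch.

```python
import random


def poisson(rng, mean):
    """Draw a Poisson variate by counting unit-rate arrivals in [0, mean]."""
    count = 0
    elapsed = rng.expovariate(1.0)
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def simulate_haplotypes(n, theta, rng):
    """Sample n haplotypes from a coalescent without recombination.

    Returns a list of n tuples of 0/1 alleles (1 = derived), one entry
    per segregating site, under the infinite sites model.
    """
    # Each ancestral lineage is the set of sampled chromosomes below it.
    lineages = [frozenset([i]) for i in range(n)]
    carriers = []  # carriers[j] = chromosomes carrying mutation j
    while len(lineages) > 1:
        k = len(lineages)
        # Waiting time until the next coalescence among k lineages.
        wait = rng.expovariate(k * (k - 1) / 2.0)
        # Drop mutations on every branch alive during this interval.
        for lineage in lineages:
            for _ in range(poisson(rng, theta / 2.0 * wait)):
                carriers.append(lineage)
        # Merge two lineages chosen uniformly at random.
        a, b = rng.sample(range(k), 2)
        merged = lineages[a] | lineages[b]
        lineages = [l for i, l in enumerate(lineages) if i not in (a, b)]
        lineages.append(merged)
    return [tuple(int(i in c) for c in carriers) for i in range(n)]
```

Calling `simulate_haplotypes(500, 50.0, random.Random(1))` would give one replicate of the ρ = 0 scenario; counting `len(set(...))` on the result gives the number of distinct haplotypes, which without recombination can never exceed the number of segregating sites plus one.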
Human genetic diversity for a stretch of 50 kb corresponds approximately to μ = 50 and ρ = 50 [15]. The other values therefore correspond loosely to cases where the recombination rate and/or average marker density (via the mutation rate) is decreased. The case of μ = 5 and ρ = 5, however, can also be interpreted as the correct description of a 5 kb stretch. We also use the μ = 5 case, which gives rise to a sparser marker set, as a qualitative example for SNP ascertainment [16]. Results for the lower mutation rate are therefore taken to represent the case of a sparser set of markers.
In addition to the constant population size we also investigate the effects of population growth on the resulting haplotype diversity but refrain from a more detailed study of the effects of demography. For each scenario, 2000 independent runs of the ancestral recombination graph were performed. Frequency cutoffs for the minor marker allele (and not always the derived allele) are enforced by counting the copies of each allele in the sample. Cutoff frequencies considered are 1, 5, 10 and 20%.
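The cutoff step can be sketched as follows, assuming haplotypes are stored as equal-length tuples of 0/1 alleles and that a site is kept when its minor allele frequency is at least the cutoff. The function names are hypothetical and the inclusive comparison is an assumption.

```python
def minor_allele_frequency(column):
    """Frequency of the rarer allele at a bi-allelic site coded 0/1."""
    p = sum(column) / len(column)
    # The minor allele need not be the derived allele (coded 1).
    return min(p, 1.0 - p)


def apply_cutoff(haplotypes, cutoff):
    """Keep only sites whose minor allele frequency is >= cutoff."""
    if not haplotypes:
        return []
    n_sites = len(haplotypes[0])
    keep = [
        j for j in range(n_sites)
        if minor_allele_frequency([h[j] for h in haplotypes]) >= cutoff
    ]
    # Project every haplotype onto the retained sites.
    return [tuple(h[j] for j in keep) for h in haplotypes]
```

Haplotypes that differ only at discarded sites become identical after projection, which is how a higher cutoff reduces the observed number of haplotypes.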
Haplotype analysis and tagging approach
The minimum number of tagging SNPs necessary to tag a given set of haplotypes is evaluated using a brute-force implementation of the algorithm described in Wiuf et al [12]. Starting from k = M, where M is the number of haplotypes, we evaluate each possible combination of k SNPs to see if it could be used as a basis for the set of haplotypes. If one of the N!/[(N−k)! k!] possible SNP combinations forms a valid basis then k is decreased by 1 until the first time a basis cannot be found. For large N and M the number of SNP combinations can become enormous but in smaller simulations it was observed that the distribution of the minimum number of tags required to tag a given number of haplotypes is relatively flat: many different combinations can be used to tag haplotypes. Thus, for large values of N and M it is possible to proceed heuristically [12] and investigate, for example, a maximum of 100 million combinations of candidate tags and an inferred minimal basis will be close to optimal. Similarly, we have also implemented a strategy where we start from k = min(N, M−1) and increment k until a basis has been found. Using either approach (which of course yield identical results) when a set of k SNPs is found the procedure stops and the number of necessary and sufficient tagging SNPs is set to T = k. At most min(N, M−1) tagging SNPs are required to describe all observed haplotypes in the sample. As the underlying problem is in the NP-complete class we only evaluate the number of tagging SNPs for the case of SNP ascertainment outlined above. Our heuristic approach can also be implemented more formally in a Markov chain Monte Carlo setting.
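A small exhaustive search of this kind can be sketched as below. The sketch assumes that a set of SNPs forms a basis when all distinct haplotypes remain pairwise distinguishable after restriction to those SNPs, and it searches upward from ⌈log₂ M⌉, since fewer than that many bi-allelic SNPs cannot separate M haplotypes. The function names are hypothetical and the search order is one possible choice, not necessarily the one used in the study.

```python
from itertools import combinations
from math import ceil, log2


def distinguishes(haplotypes, snps):
    """True if the haplotypes stay pairwise distinct on the chosen SNPs."""
    projected = {tuple(h[j] for j in snps) for h in haplotypes}
    return len(projected) == len(haplotypes)


def min_tagging_snps(haplotypes):
    """Return (T, snps): the smallest tag set size and one such set."""
    distinct = sorted(set(haplotypes))
    m = len(distinct)
    if m <= 1:
        return 0, ()
    n = len(distinct[0])
    # Try every combination of k SNPs, smallest k first.
    for k in range(ceil(log2(m)), n + 1):
        for snps in combinations(range(n), k):
            if distinguishes(distinct, snps):
                return k, snps
    # Unreachable for distinct haplotypes: all n SNPs always separate them.
    raise ValueError("haplotypes are not distinguishable")
```

Because the number of combinations grows combinatorially with N, this is only practical for small marker sets; a cap on the number of combinations examined, as described above, would turn it into a heuristic.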
Results
Here we discuss how the number of haplotypes depends on the number of SNPs, the recombination rate and the cutoff frequency for the fraction of chromosomes that should be included. Analytic results are only available for the cases of no recombination and free recombination, and we therefore use coalescent simulations as outlined above.
Determinants of haplotype diversity
In Figure 1 we show how the number of haplotypes depends on the number of SNPs, their frequency and the recombination rate. The relationship between SNP number (for each frequency cutoff) and the total number of haplotypes in a sample already carries information about the recombination rate and how exhaustively a given set of SNPs represents or resembles underlying genetic variation. Large SNP sets (with a low cutoff frequency) will contain correlations among SNPs but if marker sets are sparse, recombination will be more effective at breaking up associations between markers; we therefore expect a lower value for the ratio for θ = 5 than for θ = 50, irrespective of cutoff frequency and recombination rate. The number of tags required to adequately describe variation in a region will therefore be a function of both marker frequency and marker density.
For θ = 50 we find that a 10-fold decrease in the recombination rate from ρ = 50 to 5 already brings the observed number of haplotypes very close to the ρ = 0 results. For lower values of ρ the average number of haplotypes is virtually indistinguishable from the ρ = 0 case. Note that for the decay of LD the same decrease in ρ from ρ = 50 to 5 does not yield a behaviour anywhere near the ρ = 0 case (not shown). Haplotype diversity and LD, although related, show somewhat different dependence on the population recombination rate ρ. This is also observed for growing and bottleneck populations (data not shown).
The dependence of haplotype diversity on the minor SNP allele frequency is further exemplified in Figure 2. Here we show the number of haplotypes needed to describe 90, 95 and 99%, and all of the 500 chromosomes in the sample. These numbers are displayed for five different marker frequency cutoffs, three recombination and two mutation rates. Such a table can be used either to assess the genotyping cost necessary to capture a given amount of variation or, in case all the available genetic variation has been characterized, to obtain an indication of the average recombination/mutation rate ratio. While for high marker density or mutation rates rare (f < 1%) alleles give rise to a large number of haplotypes, we find that for f ≥ 5% there is no big reduction in genotyping effort as the cutoff frequency is further increased. Also for f ≥ 5% the frequency distribution of haplotypes holds some information about the recombination rate: a higher recombination rate will lead to more rare haplotypes even at moderate to high frequency cutoffs, as is also intuitively obvious.
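The quantity summarized in Figure 2 can be computed from a sample as in the sketch below, assuming that the haplotypes are taken in order of decreasing frequency until the requested fraction of chromosomes is covered. The function name `haplotypes_needed` is hypothetical.

```python
from collections import Counter


def haplotypes_needed(haplotypes, fraction):
    """Smallest number of distinct haplotypes covering `fraction` of the sample."""
    total = len(haplotypes)
    # Haplotype counts, most common first.
    counts = sorted(Counter(haplotypes).values(), reverse=True)
    covered = 0
    for rank, count in enumerate(counts, start=1):
        covered += count
        # Small tolerance guards against floating-point rounding.
        if covered >= fraction * total - 1e-9:
            return rank
    return len(counts)
```

For a sample of 500 chromosomes, `haplotypes_needed(sample, 0.95)` returns the number of most frequent haplotypes that together account for at least 475 chromosomes.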
The frequency distribution of haplotypes is displayed in Figure 3. At the reported genome wide average of the recombination rate a stretch of 50 kb is not expected to have any haplotypes at a frequency greater than 10%, irrespective of the cutoff frequency. If the marker density is decreased, however, some haplotypes will gain in frequency and at θ = 5 and ρ = 50 we therefore observe some haplotypes at moderate frequencies, especially at high cutoffs. Low values of θ result in a shift of weight to higher haplotype frequencies. For ρ < 1 (results not shown) the resulting haplotype distributions are very similar to the special ρ = 0 case apart from the origin. At low recombination rates and for cutoffs f ≥ 5% the haplotype distribution obtains a mode at the cutoff frequency f.
The shift of the mode to the cutoff frequency is simply a result of the fact that in the absence of excessive recombination, an SNP with a minor allele frequency of x will define a haplogroup of frequency x if x is very close to the cutoff frequency; thus an excess of haplotypes with frequency x will be observed. If the recombination rate is high then haplotypes defined by the youngest SNP can be broken up by recombination and here ρ = 50 appears to yield results that are very close to the case of free recombination. As a result the mode of the haplotype frequency distribution shifts back to the origin.
Figure 1 Average SNP (grey) and haplotype numbers (black) versus minor allele frequency cutoff for θ = 5 and 50 and ρ = 50, 5 and 0, respectively.
Haplotype tagging
Only one tagging strategy is investigated here [12] and at the moment it is by no means clear what tagging strategy is best suited for association studies [8]. Rather than focusing on tagging haplotypes, it may for example be better to define tags that capture the patterns of LD and/or association between SNPs. Simulation-based power analysis along the lines taken here will help to assess such questions in further detail. The tagging approach used here is quite likely not optimal for association studies, but its easy interpretation in terms of a geometric basis for the space spanned by the SNP defined haplotypes nicely highlights the combinatorial nature of haplotypes and the complexity introduced by recombination. Other haplotype tagging frameworks, however, are likely to behave qualitatively similarly to the approach taken here.
We only consider an allele frequency cutoff of 5%. In Table 1 we show mean values of the ratios T/N₅ (where N₅ is the number of SNPs with a minor allele frequency of at least 5%) and the corresponding 5 and 95 percentiles. We find an obvious dependence on ρ and for ρ = 5 the results are already quite similar to ρ = 0. The results for ρ = 50 are discouraging: on average over 90% of SNPs need to be typed in order to reliably distinguish between haplotypes. This suggests that reports of low haplotype diversity indicate regions of low recombination rate. We note, however, that the majority of currently published studies have a marker density that is at least a factor of 5 lower than the one obtained here [9]. Moreover our results concern true, not inferred haplotypes. Haplotype inference may systematically bias tagging approaches.
Figure 2 Average number of haplotypes needed to explain 90, 95, 99 and 100% of observed chromosomes in a sample for frequency cutoffs of 1, 5, 10 and 20%, respectively, for θ = 5 and 50 and ρ = 50 and 5.
Dynamics of haplotype blocks
Notions and possible uses of extended haplotype blocks that are characterized by high levels of pairwise LD between SNPs within the same block (and accordingly low haplotype diversity compared to the extreme case of free recombination) have attracted considerable interest [4,6,17–19]. Here we follow Wang et al [17] and use probably the simplest definition of a block: all SNP pairs that are within the same block must fail the four-gamete test, that is, at most three out of the possible four two-locus haplotypes are observed for each pair of bi-allelic markers. This definition has some shortcomings but is (i) easily implemented, and (ii) we expect it to give at least some insight into how SNP frequencies and ascertainment affect the behaviour of blocks. Insights gained for this simple model will be transferable to other, more involved, block-ascertainment methods.
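The block criterion can be illustrated with the sketch below. The pairwise check follows the definition above; the left-to-right greedy partition is an assumption made here for illustration, since the text does not specify how block boundaries are placed. The function names are hypothetical.

```python
def at_most_three_gametes(haplotypes, i, j):
    """True if sites i and j show at most three of the four two-locus haplotypes."""
    return len({(h[i], h[j]) for h in haplotypes}) <= 3


def greedy_blocks(haplotypes):
    """Partition sites left to right into runs in which every pair meets the criterion.

    Returns a list of (first_site, last_site) index pairs, inclusive.
    Runs of a single site are included and can be filtered by the caller.
    """
    n_sites = len(haplotypes[0]) if haplotypes else 0
    blocks, start = [], 0
    for j in range(1, n_sites + 1):
        # Close the current run at the end of the region, or when site j
        # shows all four gametes with any site already in the run.
        if j == n_sites or any(
            not at_most_three_gametes(haplotypes, i, j) for i in range(start, j)
        ):
            blocks.append((start, j - 1))
            start = j
    return blocks
```

Blocks of at least four SNPs (the 'long' blocks considered below) could then be selected with `[b for b in blocks if b[1] - b[0] + 1 >= 4]`.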
In Figure 4 we show how the average number and average size of blocks, as well as the proportion of DNA and SNPs that are found within blocks depend on minor allele frequency cutoff and recombination rate ρ. We only consider θ = 50 but in each case we show both the results for all blocks that adhere to our definition and of ‘long’ blocks. ‘Long’ blocks are blocks that contain at least four SNPs while other blocks may also contain pairs of SNPs that fulfil our four-gamete test criterion. Full symbols denote results for ρ = 50, empty symbols ρ = 5; circles (full lines) are for all blocks while boxes (dashed lines) refer only to the long blocks.
Figure 3 Frequency distribution of haplotypes and their dependence on θ, ρ and minor allele frequency cutoff. Panels are grouped as 'Haplotype distributions for ρ = 50' and 'Haplotype distributions for ρ = 5'; each panel plots relative haplotype abundance against haplotype frequency.
Table 1 Average fraction of SNPs (in percent) needed to capture x% of chromosomes in the sample (x = 99, 95 and 90%, respectively) for three different values of the recombination rate together with their 5 and 95 percentiles (in parentheses)
Population recombination    Average number     Percentage of SNPs needed to explain fraction x of haplotypes
rate ρ                      of haplotypes      x = 99%             x = 95%             x = 90%
0                           7.3                49.7 (22.0–92.5)    45.0 (17.9–83.3)    37.5 (14.3–66.7)
5                           14.2               70.6 (41.2–100)     58.9 (32.0–90.0)    49.4 (25.0–78.6)
50                          52.3               94.7 (83.3–100)     93.1 (80.0–100)     91.7 (76.2–100)
The average number of segregating sites is 33.9 of which 14.67 had a frequency ≥5%; the corresponding 5 and 95 percentiles are 21 and 50 and 5 and 28, respectively.
We observe that for low frequency cutoffs there are many more but shorter blocks for ρ = 50 than for 5 where the two curves are in very close agreement. At ρ = 50 the average block-size is determined largely by the long blocks but for all other measures displayed in Figure 4 we observe significant differences between long and short blocks. The number of long blocks, the proportion of DNA in long blocks, and perhaps most severely, the proportion of SNPs that are found in long blocks decrease more dramatically with minor allele frequency cutoff than the same measures do for all blocks. At a minor allele frequency of 20% only approximately 20% of DNA and 50% of SNPs are found in long blocks. For all blocks these values increase to 40 and 90%, respectively. It is obvious that small blocks, containing only two or three SNPs, will offer little or no reduction in genotyping effort. Long blocks, on the other hand, account for only a small part of the total sequence.
The average block-size remains approximately constant for all allele frequency cutoffs. This result can be explained by considering those pairs of SNPs that are the most likely to give rise to four observed two-locus haplotypes. These SNPs have to be old enough to have undergone at least one recombination event and therefore will have reasonably large minor allele frequencies. Pairs of younger markers, which by and large will have a smaller minor allele frequency, are less likely to give rise to four haplotypes and therefore we expect SNPs with moderate to high minor allele frequencies to determine block-size. Undersampling of diversity (eg restricting the analysis to already known SNPs such as those in dbSNP) could therefore systematically overestimate average block-lengths. This result is in agreement with the study of Phillips et al [3] who find that block-length increases with marker spacing; it is likely to hold for other definitions as suggested by recent studies of the effects of SNP ascertainment [16]. Thus, interpretation of haplotype diversity (like LD and block boundaries) is problematic if not supported by extensive simulations [3,7].
Figure 4 Average no. of blocks, average block-size, average of the total proportion of DNA in blocks and average of the total number of SNPs in blocks calculated for a sample of 500 chromosomes drawn from a constant size population with θ = 50 versus frequency cutoff. Solid symbols represent the case ρ = 50, empty symbols ρ = 5. Circles (solid lines) represent results obtained for all blocks, boxes (dashed lines) represent results for blocks containing at least four SNPs.
Demography and haplotype diversity
Demography and population structure are known to have profound effects on the frequency spectrum of segregating sites, LD and thus also on haplotype diversity [3,4,9]. Simulations of population-growth scenarios suggest that the effect of minor-allele frequency still persists. We only show results for one particular demographic scenario where the population has grown from 1% of its present size to its present size over a time t = 1 (in coalescent units); before the onset of growth the population size is assumed to be constant at 1% of the present size. Other cases are easily assessed using coalescent simulations. Owing to the problems associated with diversity discussed by Pritchard and Przeworski [15] the mutation rate was adjusted such that the number of segregating sites in the sample is the same in the population growth scenario as in the constant population scenario discussed above.
Comparing Figure 1 with the top row of Figure 5 shows only quantitative differences that are easily explained by the different SNP allele frequency distribution resulting from a population growth scenario. We find at the higher recombination rate that haplotype numbers exceed SNP numbers already for lower frequency cutoffs (ie f > 5% instead of f > 20%). At the same cutoff frequency the ratio of [haplotype number]/[SNP number] is less for the growth demography considered here than for the constant size population. Comparison of Figure 2 with the bottom row of Figure 5 shows only a minor vertical shift: the average number of haplotypes needed to describe x% (x = 90, 95, 99, 100) of the chromosomes in the sample is higher for population growth than for constant population size. Again this is easily understood because population growth results in a relative excess of rare alleles compared to the case of constant population size. These results suggest that the basic patterns of haplotype dependence (on allele frequency cutoff, marker spacing and recombination rate) elucidated above may remain valid for a range of demographic scenarios.
Conclusions
In the search for the genetic components of complex diseases or drug response phenotypes haplotype-based approaches have recently been heralded as particularly promising. A host of early studies suggested that relatively few (eg 2–6) haplotypes may suffice to describe the genetic variation along extended stretches of DNA [3,5,9,10]. The aim of this study was to (i) gain some understanding of the factors influencing observed haplotype diversities, (ii) evaluate the behaviour of haplotypes expected for simple population genetic models, and (iii) see to what extent haplotype blocks can appear without underlying local variation in the recombination rate.
Figure 5 Top row: average numbers of SNPs (grey) and haplotypes (black) resulting for θ ≈ 65 and ρ = 50 and 5, respectively. Bottom row: number of haplotypes that need to be considered in order to cover 90, 95 and 99%, and all of the chromosomes in the sample. In each case the demographic model outlined in the text was used in the coalescent simulations.
Before discussing the application of the results presented here to real world data, it is important to acknowledge the limitations of the approach taken here. The population model is of course incorrect and at best over-simplified. While a quantitative interpretation of the results is thus impossible they seem to reflect qualitative trends. For example, for many if not all population models (including the unknown true model), haplotype diversity will increase with increased recombination rate and decrease dramatically with increased SNP frequency cutoff. This is a general result confirmed by simulations of a wide range of demographic models (data not shown) and intuitively obvious in the light of what is known about the ancestral recombination graph.
The reported haplotype frequencies and diversities are not easily reconciled with the standard neutral constant size model of evolution although the generally small sample sizes will result in overestimation of LD and of haplotype frequencies. For the sample size considered here, n = 500, which is by no means large compared to what will be required for genetic association studies [14], the number of segregating sites is very large for a region of 50 kb, S ≈ 330. Even a moderate reduction of the recombination rate brings haplotype diversities and the number of required tagging SNPs into the range observed for ρ = 0. This suggests that at least some of the reported blocks may occur in regions where the recombination rate r is less than the reported genome wide average r = 1 cM/Mb. The simulations also show that haplotype diversity and block behaviour depend on both allele frequency and marker spacing. A number of reports of long-range disequilibrium and/or low haplotype diversity, based on incomplete sampling of the genetic SNP diversity, need to be reassessed in the light of this. A detailed assessment of local recombination rate variation becomes important and should provide crucial information about the usefulness of blocks. Similarly, predictions about the efficiency gains to be expected from the HapMap project that are based on present studies may systematically underestimate the number of tagging SNPs required to describe human genetic diversity.
Generally, we find that for complete ascertainment of segregating sites/SNPs haplotype diversity along a 50-kb stretch is almost unmanageably large if all markers or those with a minor allele frequency of f ≤ 1% are to be typed. From a cutoff of ‘5%’ and above no big efficiency gains are obtained and if the common variant/common disease hypothesis should turn out to be correct then 5% may be a reasonable cutoff frequency. The genotyping effort, even if tagging approaches are used, may be considerably more than had been hoped [2,9,10].
There are considerable problems in interpreting current experimental data sets and the simulation study presented here gives some clues as to what factors may compromise inferences drawn from summaries of the data such as LD and/or haplotype diversity. Many of these problems could be directly addressed if the underlying recombination rate variation were known. In addition to approaches using sperm-typing [1,20], a number of inferential procedures have recently been developed that allow direct estimation of the recombination rate [21–25]. These use mainly information from informative sites with high minor allele frequency and their inferences should be robust against the problems associated with low marker density and bias in allele frequencies. Knowledge of local recombination rate variation along the human genome will provide crucial guidance in the setup of genetic epidemiology studies.
Acknowledgements
I thank Carsten Wiuf and Gil McVean for many discussions on this topic and Monty Slatkin for his helpful comments on an earlier version of this manuscript. This work was funded through a Wellcome Trust Career Development Fellowship and a Royal Society Project Grant.
References
1 Jeffreys AJ, Kauppi L, Neumann R: Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex. Nat Genet 2001; 29: 217–222.
2 Johnson GC, Esposito L, Barratt BJ et al: Haplotype tagging for the identification of common disease genes. Nat Genet 2001; 29: 233–237.
3 Phillips MS, Lawrence R, Sachidanandam R et al: Chromosome-wide distribution of haplotype blocks and the role of recombination hot spots. Nat Genet 2003; 33: 382–387.
4 Stumpf MP, Goldstein DB: Demography, recombination hotspot intensity, and the block structure of linkage disequilibrium. Curr Biol 2003; 13: 1–8.
5 Daly MJ, Rioux JD, Schaffner SF, Hudson TJ, Lander ES: High-resolution haplotype structure in the human genome. Nat Genet 2001; 29: 229–232.
6 Wall JD, Pritchard JK: Assessing the performance of haplotype block models of linkage disequilibrium. Am J Hum Genet 2003; 73: 2003.
7 Wall JD, Pritchard JK: Haplotype blocks and linkage disequilibrium in the human genome. Nat Rev Genet 2003; 4: 587–597.
8 Cardon LR, Abecasis GR: Using haplotype blocks to map human complex trait loci. Trends Genet 2003; 19: 135–140.
9 Gabriel SB, Schaffner S, Nguyen H et al: The structure of haplotype blocks in the human genome. Science 2002; 1069424.
10 Patil N, Berno AJ, Hinds DA et al: Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 2001; 294: 1719–1723.
11 Anderson EC, Slatkin M: Population-genetic basis of haplotype blocks in the 5q31 region. Am J Hum Genet 2004; 74: 40–49.
12 Wiuf C, Laidlaw Z, Stumpf MPH: Some notes on the combinatorial properties of haplotype tagging. Math Biosci 2003; 185: 205–216.
13 Griffiths RC, Marjoram P: Ancestral inference from samples
of DNA sequences with recombination. J Comput Biol 1996; 3:
479–502.
14 Weiss KM, Clark AG: Linkage disequilibrium and the mapping of
complex human traits. Trends Genet 2002; 18: 19–24.
15 Pritchard JK, Przeworski M: Linkage disequilibrium in humans:
models and data. Am J Hum Genet 2001; 69: 1–14.
16 Akey JM, Zhang K, Xiong MM, Jin L: The effect of
single nucleotide polymorphism identification strategies on
estimates of linkage disequilibrium. Mol Biol Evol 2003; 20:
232–242.
17 Wang N, Akey JM, Zhang K, Chakraborty R, Jin L: Distribution of
recombination crossovers and the origin of haplotype blocks: the
interplay of population history, recombination, and mutation.
Am J Hum Genet 2002; 71: 1227–1234.
18 Zhang K, Deng M, Chen T, Waterman MS, Sun F: A dynamic
programming algorithm for haplotype block partitioning. PNAS
2002; 99: 7335–7339.
19 Anderson EC, Novembre J: Finding haplotype block boundaries
by using the minimum-description length principle. Am J Hum
Genet 2003; 73: 336–354.
20 Arnheim N, Calabrese P, Nordborg M: Hot and cold spots of
recombination in the human genome: the reason we should find
them and how this can be achieved. Am J Hum Genet 2003; 73: 5–16.
21 Fearnhead P, Donnelly P: Estimating recombination rates from
population genetic data. Genetics 2001; 159: 1299–1318.
22 McVean G, Awadalla P, Fearnhead P: A coalescent-based method
for detecting and estimating recombination from gene
sequences. Genetics 2002; 160: 1231–1241.
23 Hudson RR: Two-locus sampling distributions and their
application. Genetics 2001; 159: 1805–1817.
24 Li N, Stephens M: Modeling linkage disequilibrium and
identifying recombination hotspots using single-nucleotide
polymorphism data. Genetics 2003; 165: 2213–2233.
25 Stumpf MPH, McVean GAT: Estimating recombination rates from
population-genetic data. Nat Rev Genet 2003; 4: 959–968.