ARTICLE
Haplotype diversity and SNP frequency dependence
in the description of genetic variation
Michael PH Stumpf*,1
1 Department of Biological Sciences, Biochemistry Building, Imperial College London, London SW7 2AY, UK
Haplotype diversity is controlled by a variety of processes, including mutation, recombination, marker ascertainment and demography. Understanding the extent to which genetic variation at physically linked loci is co-inherited is crucial for the design of the HapMap project and the correct interpretation of the resulting data. In the absence of an analytical theory, extensive coalescent simulations are used to disentangle the influence of all of these factors on haplotype diversity. In addition to these qualitative insights, this study also demonstrates (i) that marker spacing and frequency profoundly influence observed levels of haplotype diversity; (ii) that the spectrum of haplotypes contains information about how exhaustively genetic variation in a region is described by a given marker set; and (iii) that so-called haplotype blocks can be generated by the stochasticity inherent in the recombination process without having to assume variation in the recombination rate.
European Journal of Human Genetics (2004) 12, 469–477. doi:10.1038/sj.ejhg.5201179
Published online 17 March 2004
Keywords: population genetics; HapMap project; haplotype tagging; haplotype blocks
Introduction
An increasing number of empirical studies [1–3] investigate whether the inheritance of genetic variants occurs in a block-like manner and the potential implications of this for association studies [4]. Reported haplotype diversities along extended stretches of DNA appear surprisingly simple, with most chromosomes belonging to one of roughly a handful of different haplotypes [5]. The levels of linkage disequilibrium (LD) are also reported to be consistently high between markers that are in the same block, although LD can also extend beyond block boundaries [6,7]. If such a picture were to prevail it would have obvious consequences for the design of association studies [8].
In an important paper Jeffreys et al [1] showed that at least sometimes blocks may be delimited by recombination hotspots; the recombination rate in fairly localized regions can exceed the background or block recombination rate by up to four orders of magnitude and LD does not extend beyond the block boundaries. In many cases, however, there is as yet no conclusive evidence that block boundaries coincide with recombination hotspots [9,10]. If this were generally the case then we could hope that block boundaries and possibly knowledge of haplotypes in one population would allow us to make predictions or inferences for other populations. Unfortunately, however, many reports of blocks fail to show evidence for such a connection with hotspots [11], and the methods by which blocks are ascertained [3,8] may at least be partly to blame for this.
There are three main objectives of this study of haplotype diversity. First, on a fundamental level we gain insight (in the absence of an analytical theory) into how physical proximity between markers, the marker frequencies and the intensity of recombination interact to determine the complexity of the haplotype spectrum. Second, recent theoretical work by Wiuf et al [12] is followed up. These authors have shown that the number T of haplotype tagging SNPs (htSNPs) [2] necessary to describe a given set of M haplotypes defined by N SNPs is bounded by log₂ M ≤ T ≤ min(N, M−1). Here we determine, for different scenarios, the relationship between M and N. Third, a simple block definition is used to evaluate properties of blocks and how inferred blocks (which do not correspond to recombination hotspots) depend on marker characteristics and recombination rate.
Revised 12 December 2003; accepted 21 January 2004
*Correspondence: Dr MPH Stumpf, Department of Biological Sciences, Biochemistry Building, Imperial College London, London SW7 2AZ, UK. Tel: +44 20 7594 5114; Fax: +44 20 7594 5789; E-mail: m.stumpf@imperial.ac.uk
© 2004 Nature Publishing Group. All rights reserved 1018-4813/04 $30.00 www.nature.com/ejhg
All three aspects of this work have implications in the current run-up to the HapMap project [8]. The study provides guidance into when and how the resulting SNP data are best summarized in terms of haplotypes. Moreover, as will become clear, haplotype diversity and the combinatorial structure of haplotypes also hold information about how exhaustively genetic variation in a region has been sampled.
Methods
Simulation procedures
In the following discussion we simulate the ancestral recombination graph [13] assuming uniform recombination and mutation rates ρ and μ, respectively. Assuming constant ρ allows the study of the behaviour in blocks of low recombination rate, or of the expected behaviour of haplotype diversity under a null model; both aspects will be considered here. Throughout we assume a single panmictic population with an effective population size of N_e = 10 000 diploid individuals. We use a sample size of n = 500 chromosomes and consider a stretch of 50 kb length; the sample size is large compared to most studies of LD performed today [3,9], but smaller than the population samples predicted for future case–control studies [14]. The mutation rates are assumed to be 10⁻⁸/(nucleotide generation) and 10⁻⁹/(nucleotide generation), whence the population mutation rates along the whole stretch are μ = 50 and 5; the mutation model used here is the infinite sites model. We consider two recombination rates, which correspond to 1 and 0.1 cM/Mb, in addition to the case of no recombination; the corresponding population recombination rates are thus ρ = 50, 5 and 0, respectively.
Human genetic diversity for a stretch of 50 kb corresponds approximately to μ = 50 and ρ = 50 [15]. The other values therefore correspond loosely to cases where the recombination rate and/or the average marker density (via the mutation rate) is decreased. The case of μ = 5 and ρ = 5, however, can also be interpreted as the correct description of a 5 kb stretch. We also use the μ = 5 case, which gives rise to a sparser marker set, as a qualitative example for SNP ascertainment [16]; results for the lower mutation rate are thus taken to represent the case of a sparser set of markers.
In addition to the constant population size we also investigate the effects of population growth on the resulting haplotype diversity but refrain from a more detailed study of the effects of demography. For each scenario, 2000 independent runs of the ancestral recombination graph were performed. Frequency cutoffs for the minor marker allele (and not always the derived allele) are enforced by counting the copies of each allele in the sample. Cutoff frequencies considered are 1, 5, 10 and 20%.
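The frequency cutoff can be illustrated with a short sketch. It assumes that the sample is stored as a list of 0/1 rows (one row per chromosome, one column per SNP) and that a SNP passes when its minor allele frequency is at least the cutoff; the function names are hypothetical and are not taken from the original simulation code.

```python
def minor_allele_counts(haplotypes):
    """Count the copies of the rarer allele at each SNP column of a 0/1 matrix."""
    n = len(haplotypes)
    n_snps = len(haplotypes[0]) if n else 0
    counts = []
    for j in range(n_snps):
        ones = sum(row[j] for row in haplotypes)
        # The minor allele is whichever state is rarer, not necessarily the derived one.
        counts.append(min(ones, n - ones))
    return counts


def filter_by_maf(haplotypes, cutoff):
    """Keep the SNP columns whose minor allele frequency is >= cutoff.

    Returns the reduced matrix and the indices of the retained SNPs.
    """
    n = len(haplotypes)
    keep = [j for j, c in enumerate(minor_allele_counts(haplotypes))
            if c / n >= cutoff]
    reduced = [[row[j] for j in keep] for row in haplotypes]
    return reduced, keep
```

Applying `filter_by_maf` with cutoffs 0.01, 0.05, 0.10 and 0.20 to one simulated sample would give the four marker sets compared in the Results.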
Haplotype analysis and tagging approach
The minimum number of tagging SNPs necessary to tag a given set of haplotypes is evaluated using a brute-force implementation of the algorithm described in Wiuf et al [12]. Starting from k = M, where M is the number of haplotypes, we evaluate each possible combination of k SNPs to see if it could be used as a basis for the set of haplotypes. If one of the N!/[(N−k)! k!] possible SNP combinations forms a valid basis then k is decreased by 1, until the first time a basis cannot be found. For large N and M the number of SNP combinations can become enormous, but in smaller simulations it was observed that the distribution of the minimum number of tags required to tag a given number of haplotypes is relatively flat: many different combinations can be used to tag haplotypes. Thus, for large values of N and M it is possible to proceed heuristically [12] and investigate, for example, a maximum of 100 million combinations of candidate tags; an inferred minimal basis will be close to optimal. Similarly, we have also implemented a strategy where we start from k = min(N, M−1) and increment k until a basis has been found. Using either approach (which of course yield identical results), when a set of k SNPs is found the procedure stops and the number of necessary and sufficient tagging SNPs is set to T = k. At most min(N, M−1) tagging SNPs are required to describe all observed haplotypes in the sample. As the problem is in the NP-complete class we only evaluate the number of tagging SNPs for the case of SNP ascertainment outlined above. Our heuristic approach can also be implemented more formally in a Markov Chain Monte Carlo setting.
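A minimal sketch of such an exhaustive search is given below. It assumes that a set of SNPs forms a valid basis when every distinct haplotype shows a different allele pattern at those SNPs. The function names are hypothetical, and the search here runs upward from the lower bound ⌈log₂ M⌉ to min(N, M−1) rather than in the order described above. It is practical only for small N and M.

```python
from itertools import combinations


def is_tag_set(distinct_haps, snps):
    """True if the chosen SNP columns give each distinct haplotype a unique pattern."""
    patterns = {tuple(h[j] for j in snps) for h in distinct_haps}
    return len(patterns) == len(distinct_haps)


def min_tagging_snps(haplotypes):
    """Return (T, snps): the smallest number of tagging SNPs and one such set.

    haplotypes: list of equal-length 0/1 sequences (one per chromosome).
    """
    haps = sorted({tuple(h) for h in haplotypes})  # the M distinct haplotypes
    m = len(haps)
    if m <= 1:
        return 0, ()          # nothing to distinguish
    n = len(haps[0])
    lower = (m - 1).bit_length()   # exact integer ceil(log2(M))
    upper = min(n, m - 1)          # bound quoted in the text
    for k in range(lower, upper + 1):
        for snps in combinations(range(n), k):  # N!/[(N-k)! k!] candidate sets
            if is_tag_set(haps, snps):
                return k, snps
    # Not reached for binary SNPs: min(N, M-1) SNPs always suffice.
    raise ValueError("no tagging set found")
```

For a sample without recombination the search tends to end near the upper bound M−1, whereas a sample containing all four two-locus haplotypes of two SNPs needs only log₂ 4 = 2 tags.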
Results
Here we discuss how the number of haplotypes depends on
the number of SNPs, the recombination rate and the cutoff
frequency for the fraction of chromosomes that should be
included. Analytic results are only available for the case of
no recombination and free recombination, respectively,
and we therefore use coalescent simulations as outlined
above.
Determinants of haplotype diversity
In Figure 1 we show how the number of haplotypes depends on the number of SNPs, their frequency and the recombination rate. The relationship between SNP number (for each frequency cutoff) and the total number of haplotypes in a sample already carries information about the recombination rate and how exhaustively a given set of SNPs represents or resembles underlying genetic variation. Large SNP sets (with a low cutoff frequency) will contain correlations among SNPs, but if marker sets are sparse, recombination will be more effective at breaking up associations between markers; we therefore expect a lower value for the ratio for θ = 5 than for θ = 50, irrespective of cutoff frequency and recombination rate. The number of tags required to adequately describe variation in a region will therefore be a function of both marker frequency and marker density.
For θ = 50 we find that a 10-fold decrease in the recombination rate from ρ = 50 to 5 already brings the observed number of haplotypes very close to the ρ = 0 results. For lower values of ρ the average number of haplotypes is virtually indistinguishable from the ρ = 0 case. Note that for the decay of LD the same decrease in ρ from 50 to 5 does not yield a behaviour anywhere near the ρ = 0 case (not shown). Haplotype diversity and LD, although related, show somewhat different dependence on the population recombination rate ρ. This is also observed for growing and bottleneck populations (data not shown).
The dependence of haplotype diversity on the minor SNP allele frequency is further exemplified in Figure 2. Here we show the number of haplotypes needed to describe 90, 95 and 99%, and all of the 500 chromosomes in the sample. These numbers are displayed for five different marker frequency cutoffs, three recombination and two mutation rates. Such a table can be used either to assess the genotyping cost necessary to capture a given amount of variation or, in case all the available genetic variation has been characterized, to obtain an indication of the average recombination/mutation rate ratio. While for high marker density or mutation rates rare (f < 1%) alleles give rise to a large number of haplotypes, we find that for f ≥ 5% there is no big reduction in genotyping effort as the cutoff frequency is further increased. Also for f ≥ 5% the frequency distribution of haplotypes holds some information about the recombination rate: a higher recombination rate will lead to more rare haplotypes even at moderate to high frequency cutoffs, as is also intuitively obvious.
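The quantity plotted in Figure 2 can be sketched as follows. The sketch assumes that haplotypes are ranked from most to least common and counted until the requested share of chromosomes is reached; the function name and the integer-percentage interface are illustrative choices, not part of the original analysis.

```python
from collections import Counter


def haplotypes_needed(haplotypes, percents=(90, 95, 99, 100)):
    """Number of most-common haplotypes needed to cover each percentage of the sample."""
    n = len(haplotypes)
    # Haplotype counts, most frequent first.
    counts = sorted(Counter(tuple(h) for h in haplotypes).values(), reverse=True)
    needed = {}
    for pct in percents:
        covered = 0
        for rank, c in enumerate(counts, start=1):
            covered += c
            # Integer comparison avoids floating-point rounding at the threshold.
            if covered * 100 >= pct * n:
                needed[pct] = rank
                break
    return needed
```

Running this on the same sample after each frequency cutoff would show how quickly the required number of haplotypes falls as rare SNPs are excluded.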
The frequency distribution of haplotypes is displayed in Figure 3. At the reported genome wide average of the recombination rate a stretch of 50 kb is not expected to have any haplotypes at a frequency greater than 10%, irrespective of the cutoff frequency. If the marker density is decreased, however, some haplotypes will gain in frequency, and at θ = 5 and ρ = 50 we therefore observe some haplotypes at moderate frequencies, especially at high cutoffs. Low values of θ result in a shift of weight to higher haplotype frequencies. For ρ < 1 (results not shown) the resulting haplotype distributions are very similar to the special ρ = 0 case apart from the origin. At low recombination rates and for cutoffs f ≥ 5% the haplotype distribution obtains a mode at the cutoff frequency f.
The shift of the mode to the cutoff frequency is simply a result of the fact that, in the absence of excessive recombination, an SNP with a minor allele frequency of x will define a haplogroup of frequency x if x is very close to the cutoff frequency; thus an excess of haplotypes with frequency x will be observed. If the recombination rate is high then haplotypes defined by the youngest SNP can be broken up by recombination, and here ρ = 50 appears to yield results that are very close to the case of free recombination. As a result the mode of the haplotype frequency distribution shifts back to the origin.
Figure 1 Average SNP (grey) and haplotype numbers (black) versus minor allele frequency cutoff for θ = 5 and 50 and ρ = 50, 5 and 0, respectively.
Haplotype tagging
Only one tagging strategy is investigated here [12] and at the moment it is by no means clear what tagging strategy is best suited for association studies [8]. Rather than focusing on tagging haplotypes, it may for example be better to define tags that capture the patterns of LD and/or association between SNPs. Simulation-based power analysis along the lines taken here will help to assess such questions in further detail. The tagging approach used here is quite likely not optimal for association studies, but its easy interpretation in terms of a geometric basis for the space spanned by the SNP-defined haplotypes nicely highlights the combinatorial nature of haplotypes and the complexity introduced by recombination. Other haplotype tagging frameworks, however, are likely to behave qualitatively similarly to the approach taken here.
We only consider an allele frequency cutoff of 5%. In Table 1 we show mean values of the ratios T/N₅ (where N₅ is the number of SNPs with a minor allele frequency of at least 5%) and the corresponding 5 and 95 percentiles. We find an obvious dependence on ρ, and for ρ = 5 the results are already quite similar to ρ = 0. The results for ρ = 50 are discouraging: on average over 90% of SNPs need to be typed in order to reliably distinguish between haplotypes. This suggests that reports of low haplotype diversity indicate regions of low recombination rate. We note, however, that the majority of currently published studies have a marker density that is at least a factor of 5 lower than the one obtained here [9]. Moreover our results concern true, not inferred haplotypes. Haplotype inference may systematically bias tagging approaches.
Figure 2 Average number of haplotypes needed to explain 90, 95, 99 and 100% of observed chromosomes in a sample for frequency cutoffs of 1, 5, 10 and 20%, respectively, for θ = 5 and 50 and ρ = 50 and 5.
Dynamics of haplotype blocks
Notions and possible uses of extended haplotype blocks that are characterized by high levels of pairwise LD between SNPs within the same block (and accordingly low haplotype diversity compared to the extreme case of free recombination) have attracted considerable interest [4,6,17–19]. Here we follow Wang et al [17] and use probably the simplest definition of a block: all SNP pairs that are within the same block must fail the four-gamete test, that is, at most three out of the possible four two-locus haplotypes are observed for each pair of bi-allelic markers. This definition has some shortcomings but (i) it is easily implemented, and (ii) we expect it to give at least some insight into how SNP frequencies and ascertainment affect the behaviour of blocks. Insights gained for this simple model will be transferable to other, more involved, block-ascertainment methods.
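The block definition can be sketched in a few lines. The greedy left-to-right partition used here, and the `min_snps` parameter, are assumptions made for illustration: the text only requires that no pair of SNPs inside a block shows all four two-locus haplotypes, and does not specify how block boundaries are placed.

```python
def shows_four_gametes(haplotypes, i, j):
    """True if all four two-locus haplotypes (00, 01, 10, 11) occur at SNPs i and j."""
    return len({(h[i], h[j]) for h in haplotypes}) == 4


def four_gamete_blocks(haplotypes, min_snps=2):
    """Greedily partition SNPs into runs in which no pair shows four gametes.

    Returns (first, last) SNP index pairs for runs with at least min_snps SNPs.
    """
    if not haplotypes or not haplotypes[0]:
        return []
    n_snps = len(haplotypes[0])
    runs = []
    start = 0
    for j in range(1, n_snps):
        # Close the current run as soon as SNP j conflicts with any SNP in it.
        if any(shows_four_gametes(haplotypes, i, j) for i in range(start, j)):
            runs.append((start, j - 1))
            start = j
    runs.append((start, n_snps - 1))
    return [(a, b) for a, b in runs if b - a + 1 >= min_snps]
```

Calling the function with `min_snps=4` would return only runs of four or more SNPs.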
In Figure 4 we show how the average number and average size of blocks, as well as the proportion of DNA and SNPs that are found within blocks, depend on minor allele frequency cutoff and recombination rate ρ. We only consider θ = 50 but in each case we show both the results for all blocks that adhere to our definition and for ‘long’ blocks. ‘Long’ blocks are blocks that contain at least four SNPs, while other blocks may also contain pairs of SNPs that fulfil our four-gamete test criterion. Full symbols denote results for ρ = 50, empty symbols ρ = 5; circles (full lines) are for all blocks while boxes (dashed lines) refer only to the long blocks.
[Figure 3 panels: ‘Haplotype distributions for ρ = 50’ and ‘Haplotype distributions for ρ = 5’; x-axis: Haplotype Frequency; y-axis: Relative HT Abundance]
Figure 3 Frequency distribution of haplotypes and their dependence on θ, ρ and minor allele frequency cutoff.
Table 1 Average fraction of SNPs (in percent) needed to capture x% of chromosomes in the sample (x = 99, 95 and 90%, respectively) for three different values of the recombination rate together with their 5 and 95 percentiles (in parentheses)
Population recombination    Average number     Percentage of SNPs needed to explain fraction x of haplotypes
rate ρ                      of haplotypes      x = 99%             x = 95%             x = 90%
0                           7.3                49.7 (22.0–92.5)    45.0 (17.9–83.3)    37.5 (14.3–66.7)
5                           14.2               70.6 (41.2–100)     58.9 (32.0–90.0)    49.4 (25.0–78.6)
50                          52.3               94.7 (83.3–100)     93.1 (80.0–100)     91.7 (76.2–100)
The average number of segregating sites is 33.9, of which 14.67 had a frequency ≥5%; the corresponding 5 and 95 percentiles are 21 and 50, and 5 and 28, respectively.
We observe that for low frequency cutoffs there are many more but shorter blocks for ρ = 50 than for ρ = 5, where the two curves are in very close agreement. At ρ = 50 the average block-size is determined largely by the long blocks, but for all other measures displayed in Figure 4 we observe significant differences between long and short blocks. The number of long blocks, the proportion of DNA in long blocks and, perhaps most severely, the proportion of SNPs that are found in long blocks decrease more dramatically with minor allele frequency cutoff than the same measures do for all blocks. At a minor allele frequency cutoff of 20% only approximately 20% of DNA and 50% of SNPs are found in long blocks. For all blocks these values increase to 40 and 90%, respectively. It is obvious that small blocks, containing only two or three SNPs, will offer little or no reduction in genotyping effort. Long blocks, on the other hand, account for only a small part of the total sequence.
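The four block measures discussed above can be computed from a list of blocks as in the following sketch. It assumes that a block is given by its first and last SNP index, that block size is the physical distance between those two SNPs, and that SNP positions are in base pairs; all names are hypothetical.

```python
def block_summary(blocks, positions, region_length, min_snps=2):
    """Summarise blocks given as (first, last) SNP index pairs.

    positions: sorted SNP coordinates in bp; region_length: length of the stretch in bp.
    min_snps=4 restricts the summary to 'long' blocks.
    """
    kept = [(a, b) for a, b in blocks if b - a + 1 >= min_snps]
    sizes = [positions[b] - positions[a] for a, b in kept]  # physical span per block
    snps_in_blocks = sum(b - a + 1 for a, b in kept)
    return {
        "n_blocks": len(kept),
        "mean_size": sum(sizes) / len(sizes) if sizes else 0.0,
        "frac_dna": sum(sizes) / region_length,
        "frac_snps": snps_in_blocks / len(positions) if positions else 0.0,
    }
```

Comparing the output for `min_snps=2` and `min_snps=4` on the same sample separates the contribution of long blocks from that of all blocks.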
The average block-size remains approximately constant for all allele frequency cutoffs. This result can be explained by considering those pairs of SNPs that are the most likely to give rise to four observed two-locus haplotypes. These SNPs have to be old enough to have undergone at least one recombination event and therefore will have reasonably large minor allele frequencies. Pairs of younger markers, which by and large will have a smaller minor allele frequency, are less likely to give rise to four haplotypes, and therefore we expect SNPs with moderate to high minor allele frequencies to determine block-size. Undersampling of diversity (eg restricting the analysis to already known SNPs such as those in dbSNP) could therefore systematically overestimate average block-lengths. This result is in agreement with the study of Phillips et al [3], who find that block-length increases with marker spacing; it is likely to hold for other definitions, as suggested by recent studies of the effects of SNP ascertainment [16]. Thus, interpretation of haplotype diversity (like LD and block boundaries) is problematic if not supported by extensive simulations [3,7].
Figure 4 Average no. of blocks, average block-size, average of the total proportion of DNA in blocks and average of the total number of SNPs in blocks calculated for a sample of 500 chromosomes drawn from a constant size population with θ = 50 versus frequency cutoff. Solid symbols represent the case ρ = 50, empty symbols ρ = 5. Circles (solid lines) represent results obtained for all blocks, boxes (dashed lines) represent results for blocks containing at least four SNPs.
Demography and haplotype diversity
Demography and population structure are known to have profound effects on the frequency spectrum of segregating sites, LD and thus also on haplotype diversity [3,4,9]. Simulations of population-growth scenarios suggest that the effect of minor-allele frequency still persists. We only show results for one particular demographic scenario where the population has grown from 1% of its present size to its present size over a time t = 1 (in coalescent units); before the onset of growth the population size is assumed to be constant at 1% of the present size. Other cases are easily assessed using coalescent simulations. Owing to the problems associated with diversity discussed by Pritchard and Przeworski [15] the mutation rate was adjusted such that the number of segregating sites in the sample is the same in the population growth scenario as in the constant population scenario discussed above.
Comparing Figure 1 with the top row of Figure 5 shows only quantitative differences that are easily explained by the different SNP allele frequency distribution resulting from a population growth scenario. We find at the higher recombination rate that haplotype numbers exceed SNP numbers already for lower frequency cutoffs (ie f > 5% instead of f > 20%). At the same cutoff frequency the ratio of [haplotype number]/[SNP number] is less for the growth demography considered here than for the constant size population. Comparison of Figure 2 with the bottom row of Figure 5 shows only a minor vertical shift: the average number of haplotypes needed to describe x% (x = 90, 95, 99, 100) of the chromosomes in the sample is higher for population growth than for constant population size. Again this is easily understood because population growth results in a relative excess of rare alleles compared to the case of constant population size. These results suggest that the basic patterns of haplotype dependence (on allele frequency cutoff, marker spacing and recombination rate) elucidated above may remain valid for a range of demographic scenarios.
Conclusions
In the search for the genetic components of complex diseases or drug response phenotypes, haplotype-based approaches have recently been heralded as particularly promising. A host of early studies suggested that relatively few (eg 2–6) haplotypes may suffice to describe the genetic variation along extended stretches of DNA [3,5,9,10]. The aim of this study was to (i) gain some understanding of the factors influencing observed haplotype diversities, (ii) evaluate the behaviour of haplotypes expected for simple population genetic models, and (iii) see to what extent haplotype blocks can appear without underlying local variation in the recombination rate.
Figure 5 Top row: average numbers of SNPs (grey) and haplotypes (black) resulting for θ ≈ 65 and ρ = 50 and 5, respectively. Bottom row: number of haplotypes that need to be considered in order to cover 90, 95 and 99%, and all of the chromosomes in the sample. In each case the demographic model outlined in the text was used in the coalescent simulations.
Before discussing the application of the results presented here to real world data, it is important to acknowledge the limitations of the approach taken here. The population model is of course incorrect and at best over-simplified. While a quantitative interpretation of the results is thus impossible they seem to reflect qualitative trends. For example, for many if not all population models (including the unknown true model), haplotype diversity will increase with increased recombination rate and decrease dramatically with increased SNP frequency cutoff. This is a general result confirmed by simulations of a wide range of demographic models (data not shown) and intuitively obvious in the light of what is known about the ancestral recombination graph.
The reported haplotype frequencies and diversities are not easily reconciled with the standard neutral constant size model of evolution, although the generally small sample sizes will result in overestimation of LD and of haplotype frequencies. For the sample size considered here, n = 500, which is by no means large compared to what will be required for genetic association studies [14], the number of segregating sites is very large for a region of 50 kb, S ≈ 330. Even a moderate reduction of the recombination rate brings haplotype diversities and the number of required tagging SNPs into the range observed for ρ = 0. This suggests that at least some of the reported blocks may occur in regions where the recombination rate ρ is less than the reported genome wide average ρ = 1 cM/Mb. The simulations also show that haplotype diversity and block behaviour depend on both allele frequency and marker spacing. A number of reports of long-range disequilibrium and/or low haplotype diversity, based on incomplete sampling of the genetic SNP diversity, need to be reassessed in the light of this. A detailed assessment of local recombination rate variation becomes important and should provide crucial information about the usefulness of blocks. Similarly, predictions about the success/efficiency gains to be expected from the HapMap project that are based on present studies may systematically underestimate the number of tagging SNPs required to describe human genetic diversity.
Generally, we find that for complete ascertainment of segregating sites/SNPs haplotype diversity along a 50-kb stretch is almost unmanageably large if all markers or those with a minor allele frequency of f ≤ 1% are to be typed. From a cutoff of ‘5%’ and above no big efficiency gains are obtained, and if the common variant/common disease hypothesis should turn out to be correct then 5% may be a reasonable cutoff frequency. The genotyping effort, even if tagging approaches are used, may be considerably more than had been hoped [2,9,10].
There are considerable problems in interpreting current experimental data sets and the simulation study presented here gives some clues as to what factors may compromise inferences drawn from summaries of the data such as LD and/or haplotype diversity. Many of these problems could be directly addressed if the underlying recombination rate variation were known. In addition to approaches using sperm-typing [1,20], a number of inferential procedures have recently been developed that allow direct estimation of the recombination rate [21–25]. These use mainly information from informative sites with high minor allele frequency and their inferences should be robust against the problems associated with low marker density and bias in allele frequencies. Knowledge of local recombination rate variation along the human genome will provide crucial guidance in the setup of genetic epidemiology studies.
Acknowledgements
I thank Carsten Wiuf and Gil McVean for many discussions on this topic and Monty Slatkin for his helpful comments on an earlier version of this manuscript. This work was funded through a Wellcome Trust Career Development Fellowship and a Royal Society Project Grant.
References
1 Jeffreys AJ, Kauppi L, Neumann R: Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex. Nat Genet 2001; 29: 217–222.
2 Johnson GC, Esposito L, Barratt BJ et al: Haplotype tagging for the identification of common disease genes. Nat Genet 2001; 29: 233–237.
3 Phillips MS, Lawrence R, Sachidanandam R et al: Chromosome-wide distribution of haplotype blocks and the role of recombination hot spots. Nat Genet 2003; 33: 382–387.
4 Stumpf MP, Goldstein DB: Demography, recombination hotspot intensity, and the block structure of linkage disequilibrium. Curr Biol 2003; 13: 1–8.
5 Daly MJ, Rioux JD, Schaffner SF, Hudson TJ, Lander ES: High-resolution haplotype structure in the human genome. Nat Genet 2001; 29: 229–232.
6 Wall JD, Pritchard JK: Assessing the performance of haplotype block models of linkage disequilibrium. Am J Hum Genet 2003; 73: 2003.
7 Wall JD, Pritchard JK: Haplotype blocks and linkage disequilibrium in the human genome. Nat Rev Genet 2003; 4: 587–597.
8 Cardon LR, Abecasis GR: Using haplotype blocks to map human complex trait loci. Trends Genet 2003; 19: 135–140.
9 Gabriel SB, Schaffner S, Nguyen H et al: The structure of haplotype blocks in the human genome. Science 2002; 1069424.
10 Patil N, Berno AJ, Hinds DA et al: Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 2001; 294: 1719–1723.
11 Anderson EC, Slatkin M: Population-genetic basis of haplotype blocks in the 5q31 region. Am J Hum Genet 2004; 74: 40–49.
12 Wiuf C, Laidlaw Z, Stumpf MPH: Some notes on the combinatorial properties of haplotype tagging. Math Biosci 2003; 185: 205–216.
13 Griffiths RC, Marjoram P: Ancestral inference from samples of DNA sequences with recombination. J Comput Biol 1996; 3: 479–502.
14 Weiss KM, Clark AG: Linkage disequilibrium and the mapping of complex human traits. Trends Genet 2002; 18: 19–24.
15 Pritchard JK, Przeworski M: Linkage disequilibrium in humans: models and data. Am J Hum Genet 2001; 69: 1–14.
16 Akey JM, Zhang K, Xiong MM, Jin L: The effect of single nucleotide polymorphism identification strategies on estimates of linkage disequilibrium. Mol Biol Evol 2003; 20: 232–242.
17 Wang N, Akey JM, Zhang K, Chakraborty R, Jin L: Distribution of recombination crossovers and the origin of haplotype blocks: the interplay of population history, recombination, and mutation. Am J Hum Genet 2002; 71: 1227–1234.
18 Zhang K, Deng M, Chen T, Waterman MS, Sun F: A dynamic programming algorithm for haplotype block partitioning. PNAS 2002; 99: 7335–7339.
19 Anderson EC, Novembre J: Finding haplotype block boundaries by using the minimum-description length principle. Am J Hum Genet 2003; 73: 336–354.
20 Arnheim N, Calabrese P, Nordborg M: Hot and cold spots of recombination in the human genome: the reason we should find them and how this can be achieved. Am J Hum Genet 2003; 73: 5–16.
21 Fearnhead P, Donnelly P: Estimating recombination rates from population genetic data. Genetics 2001; 159: 1299–1318.
22 McVean G, Awadalla P, Fearnhead P: A coalescent-based method for detecting and estimating recombination from gene sequences. Genetics 2002; 160: 1231–1241.
23 Hudson RR: Two-locus sampling distributions and their application. Genetics 2001; 159: 1805–1817.
24 Li N, Stephens M: Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics 2003; 165: 2213–2233.
25 Stumpf MPH, McVean GAT: Estimating recombination rates from population-genetic data. Nat Rev Genet 2003; 4: 959–968.