Copyright 2001 by the Genetics Society of America

Recombination, Balancing Selection and Phylogenies in MHC and Self-Incompatibility Genes

Mikkel H. Schierup, Anders M. Mikkelsen and Jotun Hein

Bioinformatics Research Center (BiRC), Department of Ecology and Genetics, University of Aarhus, 8000 Aarhus C., Denmark

Manuscript received May 23, 2001

Accepted for publication October 1, 2001

ABSTRACT

Using a coalescent model of multiallelic balancing selection with recombination, the genealogical process as a function of recombinational distance from a site under selection is investigated. We find that the shape of the phylogenetic tree is independent of the distance to the site under selection. Only the timescale changes from the value predicted by Takahata’s allelic genealogy at the site under selection, converging with increasing recombination to the timescale of the neutral coalescent. However, if nucleotide sequences are simulated over a recombining region containing a site under balancing selection, a phylogenetic tree constructed while ignoring such recombination is strongly affected. This is true even for small rates of recombination. Published studies of multiallelic balancing selection, i.e., the major histocompatibility complex (MHC) of vertebrates, gametophytic and sporophytic self-incompatibility of plants, and incompatibility of fungi, all observe allelic genealogies with unexpected shapes. We conclude that small absolute levels of recombination are compatible with these observed distortions of the shape of the allelic genealogy, suggesting a possible cause of these observations. Furthermore, we illustrate that the variance in the coalescent with recombination process makes it difficult to locate sites under selection and to estimate the selection coefficient from levels of variability.

Loci under multiallelic balancing selection are the most polymorphic genes known in eukaryotes. These systems include gametophytic (Emerson 1939) and sporophytic (Kusaba et al. 1997) self-incompatibility systems in plants, incompatibility systems in fungi (May and Matzke 1995), and some of the MHC genes in vertebrates (Anderson et al. 1986; Hughes and Nei 1988). In each system a large number of alleles (20–150) are maintained at intermediate frequencies and nucleotide sequence variation among alleles often exceeds 30%.

The MHC data in particular have stimulated the analysis of models that are consistent with these striking levels of polymorphism. Overdominant selection with (close to) equal selection coefficient is sufficient (and appears necessary) to explain the data (Takahata and Nei 1990). With incompatibility systems, the polymorphism can be explained by the inherent selection (Vekemans and Slatkin 1994; Schierup et al. 1998).

Population genetics theory has successfully explained some aspects of these polymorphisms. Nevertheless, one important aspect of the pattern of polymorphism in these systems is not yet well understood. This is the shape of the phylogenetic tree of the alleles. Takahata (1990) showed for symmetrical overdominance that the allelic genealogy (i.e., the phylogeny of functionally different alleles) can be approximated well by a Moran process with a constant number of allelic classes with equal death rates. Such a Moran process satisfies the assumptions of the neutral coalescent (Kingman 1982) with time scaled appropriately through a scaling factor $f_S$. This implies that the phylogenetic tree of alleles has the same expected shape as the neutral coalescent, differing only in the timescale. Extension to gametophytic self-incompatibility has shown that $f_S$ is very large (>1000) for realistic population sizes and mutation rates to new specificities (Vekemans and Slatkin 1994). Uyenoyama (1997) characterized the shape of the phylogenetic tree through four ratios calculated from the branch lengths of the trees and scaled to have approximate means of one under the neutral coalescent with no recombination. She found by simulation that the values of these ratios for allelic genealogies of gametophytic self-incompatibility systems are (almost) independent of the overall sequence variability (i.e., the mutation rate). Allelic genealogies in sporophytic self-incompatibility (Schierup et al. 1998) and fungal incompatibility (May et al. 1999) are also expected to have a shape close to the neutral coalescent, when measured through these ratios.

However, when these ratios are applied to real sequence data of functionally different alleles, they show significant deviations from coalescent expectations (Uyenoyama 1997; May et al. 1999; Richman and Kohn 1999; Table 1; for definition of ratios, see STATISTICS). The main deviation is that the terminal branches are much longer than expected ($R_{SD} > 1$). This pattern of deviation is remarkably consistent over the four different kinds of systems, even though these are based on completely different molecular mechanisms. Two hypotheses have been put forward to explain this observation. Uyenoyama (1997) suggested that the enforced heterozygosity in gametophytic self-incompatibility (SI) leads to accumulation of recessive deleterious variants through sheltering. The probability of invasion and the retention time of a new specificity would then decrease over time because it would be selected against when forming heterozygotes with the specificity it arose from. Richman and Kohn (1999) suggested, on the basis of a statistical analysis of phylogenetic trees of alleles, that divergent alleles are preferentially maintained in gametophytic SI. However, these two hypotheses have not yet been quantitatively investigated theoretically. Neither of them is likely to play an equally strong role in each of the four distinct systems. For example, homozygotes can be formed in the MHC and sporophytic SI but not in gametophytic SI, and Uyenoyama’s hypothesis quantitatively depends on the frequency of homozygotes.

Corresponding author: Mikkel H. Schierup, Bioinformatics Research Center (BiRC), Department of Ecology and Genetics, University of Aarhus, Ny Munkegade, Bldg. 540, 8000 Aarhus C., Denmark.

E-mail: mikkel.schierup@biology.au.dk

Genetics 159: 1833–1844 (December 2001)

Takahata’s (1990) allelic genealogy is an infinite al-

leles model that treats alleles as entities that cannot be

broken up by recombination. It is thus an important

assumption for application of this theory to sequence

data that recombination does not occur. However, in-

tragenic recombination/gene conversion has been re-

ported within genes of the MHC (Bergstrom et al.

1998), gametophytic self-incompatibility (Wang et al.

2001), sporophytic self-incompatibility (Awadalla and

Charlesworth 1999; Schierup et al. 2001), and fungal

incompatibility (May and Matzke 1995). In light of

this it is important to investigate how violation of the

no-recombination assumption affects genealogical in-

ferences.

Here, the expected effects of recombination on the

shape of the genealogy are investigated. We simulate a

simple model of multiallelic selection with recombina-

tion using an extension of Hudson’s (1983) algorithm

for the coalescent with recombination. The main as-

sumption is that variation in a single nonrecombining

spot on the sequence is subject to selection. This spot

could either be a single nucleotide site or a collection of

adjacent nucleotides forming a specificity-determining

region.

First, we investigate the genealogy as a function of the

recombination distance from the spot under selection.

At the spot under selection, we expect a neutral-shaped

genealogy of alleles with an extended timescale, which

depends on the number of allelic classes and the muta-

tion rate to new specificities (Takahata 1990). Suffi-

ciently far from the spot under selection we expect that

genealogical trees of sequences are determined by the

neutral coalescent. Our first question is how, as we move away from the spot under selection, the allelic genealogy transforms into the neutral coalescent. Does the shape of the genealogical tree remain unaffected? How far, measured in recombination distance, is the effect of selection measurable? The last question follows Hudson and Kaplan (1988) and Takahata and Satta (1998).

TABLE 1

Analysis of population data sets of self-recognition systems

| System | Gene | Species | No. alleles | Pairwise divergence | $R_{PT}$ | $R_{ST}$ | $R_{SD}$ | $R_{BD}$ | PIST test of recombination^c | $R^2$ test of recombination^d | Reference |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MHC^a | DRB1 | Homo sapiens | 11 | 0.08 | 0.69 | 2.15 | 3.46 | 0.91 | P < 0.001 | NS | Bergstrom et al. (1998) |
| MHC^a | RT1.Ba | Rattus fuscipes greyii | 36 | 0.06 | 0.42 | 2.62 | 9.41 | 0.58 | P < 0.001 | P < 0.05 | Seddon and Baverstock (1999) |
| Fungal SI^b | beta 1 | Coprinus cinereus | 6 | 0.48 | 1.06 | 2.26 | 5.15 | 0.11 | P < 0.05 | NS | May et al. (1999) |
| Gametophytic SI^a | S-RNAse | Lycium andersonii | 22 | 0.70 | 0.54 | 2.40 | 7.20 | 0.32 | NS | NS | Richman and Kohn (1999) |
| Gametophytic SI^b | S-RNAse | Physalis crassifolia | 28 | 0.54 | 0.63 | 1.98 | 3.00 | 0.99 | NS | P < 0.01 | Richman et al. (1996) |
| Gametophytic SI^b | S-RNAse | Solanum carolinense | 13 | 0.69 | 0.68 | 2.32 | 5.47 | 0.64 | NS | P < 0.05 | Richman et al. (1995) |
| Gametophytic SI^a | S-RNAse | Petunia inflata | 14 | 0.56 | 0.67 | 2.21 | 4.94 | 0.27 | NS | NS | Wang et al. (2001) |
| Sporophytic SI^a | AlSRK | Arabidopsis lyrata | 11 | 0.37 | 0.60 | 2.68 | 7.59 | 0.14 | P < 0.05 | NS | Schierup et al. (2001) |
| Sporophytic SI^b | SLG | Brassica oleracea | 23 | 0.14 | 0.45 | 2.74 | 5.15 | 0.58 | P < 0.01 | P < 0.001 | Kusaba et al. (1997) |
| Sporophytic SI^b | SLG | Brassica campestris | 19 | 0.13 | 0.46 | 3.02 | 8.87 | 0.42 | P < 0.01 | P < 0.001 | Kusaba et al. (1997) |

^a Sequences were downloaded from GenBank and aligned with ClustalX (Thompson et al. 1997). Trees were reconstructed using DNAdist with F81 model and Kitsch (Felsenstein 1995), and the four statistics were calculated from the branch lengths. Other reconstruction methods gave very similar results.

^b Values of the four statistics were taken from the literature.

^c This test is the informative sites test of Worobey (2001). The test was performed using the PIST 1.0 software, following closely the recommendations of Worobey (2001), including using PAUP* (Swofford 2000). NS, nonsignificant.

^d This test followed Awadalla et al. (1999) closely, except that only sites with >30% frequency were included. Significance was assessed by 1000 permutations of the variable sites. NS, nonsignificant.

Second, we quantify the shape of the “average” genea-

logical tree of a sample of whole sequences subject to

recombination. With recombination a single genealogi-

cal tree does not normally describe the sequence varia-

tion since different parts of the sequence have different

histories. Previous investigations of allelic genealogies

are, however, based on phylogenetic trees, and it is

therefore of interest to investigate the expected shape

of a phylogenetic tree of sequences even when recom-

bination occurs. Biases introduced by ignoring recom-

bination can then be quantified. To investigate this

question we simulate samples of nucleotide sequences

assuming a given amount of recombination and a speci-

fied substitution model for the linked neutral nucleo-

tides. We describe how much recombination is needed

before the shape of the phylogenetic tree is distorted,

compare these results with the published studies, and

conclude that relatively small amounts of recombination

are compatible with the deviations from the expected

shape of genealogical trees observed in the data sets.

Figure 1.—The coalescent process with recombination and

balancing selection (see text). Ancestral material, solid line;

nonancestral material, dotted line. Bottom line shows a sample

of four genes associated at their left end points with two differ-

ent specificities (types), three copies of specificity 1, and one

copy of specificity 2. The first event (counting from the bot-

tom) is an allelic turnover event from type 1 to type 2, where

type 1 changes to type 2. This leads to instant coalescence of

all genes with type 1 and assignment of type 2 to the resultant

gene. A new type (in this case type 3) is invented to keep the

number of specificities constant. Initially this type does not

carry any ancestral material. The second event is a recombina-

tion that splits a gene in two. The left ancestor keeps the same

type (in this case type 2), whereas the right ancestor is assigned

a random type among the other types present (here type 3).

Type 3 now carries ancestral material and “trapped material”

(see text). The third event is coalescence of two genes of the

same type (in this case type 2). The left part of the sequence

has thus found a most recent common ancestor. At least one

further allelic turnover event and a subsequent coalescence

event are necessary before the right part of the sequence finds

a common ancestor.

MODEL

The model is an extension of Hudson’s (1983) coalescent with recombination, here allowing for a simple

form of symmetrical balancing selection. It is reminis-

cent of the process formulated by Griffiths and Mar-

joram (1996) except that mutation to a different speci-

ficity can happen in a single position of the sequence

only. For simplicity, we define the site of selection to

be at the left endpoint of the sequence (Figure 1).

There are n sequences sampled and the diploid population size is N. The continuous time approximation scales time in 2N generations. Recombination is equally likely at every point along the sequence, at a rate determined by the overall recombination parameter $\rho = 4Nr$,

which is the number of recombination events in a se-

quence in 4N generations, with r thus being the proba-

bility of a recombination event in a single sequence in

a single generation.

To model strong balancing selection we assume that

M distinct allelic classes are kept in equal frequencies

in the population. An allelic class is also termed a speci-

ficity as in studies of self-incompatibility or the MHC.

The turnover process of specificities follows Takahata

(1990), which describes it as a symmetric Moran process

viewed back in time with an allelic turnover rate, Q. In

other words, Q is the rate at which specificity lineages

bifurcate in the population. Q depends on the mutation

rate to new specificities and the selection coefficient

(see Takahata 1990). At a turnover event, each allelic

class is equally likely to be lost and a new allelic class is

then created to keep the number of specificities con-

stant.

Each of the n sampled sequences is associated at its

left endpoint with one of the allelic classes. A given

point at a given sequence can change its associated

specificity if either the specificity is changed by an allelic

turnover event or if recombination occurs between the

selected site and the focal point. Two sequences can

coalesce only when they have the same specificity (Fig-

ure 1, top).
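The way a recombination event reassigns specificities can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the `Lineage` class, the `recombine` function, and the choice to let each lineage carry a single contiguous segment are hypothetical and are not taken from the authors' program. The rule shown is the one in Figure 1 for the self-incompatibility case: the left ancestor keeps its specificity, and the right ancestor receives one of the other specificities at random.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Lineage:
    """One ancestral lineage: its specificity and the stretch of sequence it carries."""
    specificity: int   # index of the allelic class, 0 .. M-1
    start: float       # left border of the carried segment (0.0 = selected site)
    end: float         # right border of the carried segment


def recombine(lineage, breakpoint, n_classes, rng):
    """Split a lineage at `breakpoint` into a left and a right ancestor."""
    if not lineage.start < breakpoint < lineage.end:
        raise ValueError("breakpoint must fall strictly inside the segment")
    # The right ancestor is no longer linked to the selected site, so it is
    # assigned a random specificity among the other M - 1 classes.
    other_classes = [c for c in range(n_classes) if c != lineage.specificity]
    left = Lineage(lineage.specificity, lineage.start, breakpoint)
    right = Lineage(rng.choice(other_classes), breakpoint, lineage.end)
    return left, right


if __name__ == "__main__":
    rng = random.Random(1)
    left, right = recombine(Lineage(2, 0.0, 1.0), 0.4, n_classes=5, rng=rng)
    print(left)
    print(right)
```

A fuller implementation would also have to track several blocks of ancestral material per lineage, which this sketch leaves out.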

Assume that there are M specificities in the popula-

tion and that we sample n sequences of different specificities, $n \le M$. This corresponds to the situation where

an investigator sequences only one copy of each speci-

ficity, but not all existing specificities are necessarily

sampled. The coalescent process with recombination

and selection can then be approximated by three inde-

pendent exponentially distributed waiting times, namely


coalescence, recombination, and allelic turnover (see

Figure 1). A sample of sequences is followed backward

in time until all parts of each sampled sequence (the

ancestral material) have found a most recent common

ancestor.

Coalescence: The intensity of coalescence is given by

$$C_t = M \sum_{i=1}^{M} \frac{n_i(n_i - 1)}{2},$$

where $n_i$ is the number of copies of specificity $i$, since coalescences can only happen within a given specificity, each of which has an effective size of $2N/M$. If a coalescent event happens, an allelic class $i$ is chosen with probability proportional to $n_i(n_i - 1)/2$, two random sequences from class $i$ are merged into one ancestral sequence with the same specificity $i$, and $n_i$ is decreased by 1. Note that initially, since we sample at most one sequence from each specificity, $n_i = 1$ for all specificities sampled and $C_0 = 0$. Thus, coalescence can only happen once recombination has shifted ancestral material to other specificities, making $n_i > 1$ for at least one $i$.

Recombination: The intensity of recombination $R_t$ at a given point in time is determined by the amount of ancestral material to the sample plus any material “trapped” by blocks of ancestral material (Wiuf and Hein 1997). The amount of ancestral material is the total part of the sampled sequences that have not yet found a most recent common ancestor. The reason why nonancestral “trapped material” has to be counted in the intensity of recombination is that a recombination event there would distribute ancestral material onto two sequences rather than one, thus affecting the coalescence process. At time zero, $R_0 = n\rho/2$, i.e., the number of sequences times their lengths. If recombination happens, the recombination point is picked uniformly over this length of sequence. A recombination event breaks up the sequence in a left and a right segment (Figure 1). The left segment retains its allelic class. The right segment is assigned an allelic class among the other existing classes. In the case of self-incompatibility, this class is chosen randomly among the $M - 1$ allelic classes distinct from the class of the recombining sequence. For overdominance with selection coefficient $s$, the present class is chosen with relative weight $1 - s$, corresponding to selection against homozygotes of strength $s$.

Allelic turnover: The intensity of allelic turnover is determined by Q, which is independent of the time t by definition. If an allelic turnover event happens, an allelic class, say $i$, is chosen randomly with equal probability among the M allelic classes. The $n_i$ sequences from this class are then made to coalesce instantly and the resultant sequence has its specificity changed to one of the other $M - 1$ allelic classes at random (Figure 1). Viewed forward in time this corresponds to a new specificity arising by mutation followed by its (almost) immediate increase in frequency due to the strong selection favoring rare specificities. Then a new allelic class is introduced to keep the number constant. Initially, this new allelic class does not carry ancestral material, but recombination events can transfer ancestral material to the class (see Figure 1). Note that if this happens, the material between the point of selection and the left border of ancestral material is also added to the trapped material part of the recombination intensity because a recombination event here would change the specificity associated with the ancestral material and thus the coalescent history. If the allelic turnover rate is small, the coalescent process of the left endpoint of the sequence (the point under selection) is dominated by the allelic turnover process alone, and, according to Takahata (1990), the expected time to the most recent common ancestor is $D = M(M - 1)(1 - 1/n)/Q$, which is likely to be much longer than the neutral value of $D = 2(1 - 1/n)$ when $Q < 1$.

Since the three events are independent and exponentially distributed, the waiting time until any event happens is exponentially distributed with parameter $C_t + R_t + Q$, and given that an event happens, the probability that it is a coalescent event, say, is $C_t/(C_t + R_t + Q)$. The process is simulated from starting conditions by determining the time of the first event by drawing a random number from an exponential distribution with mean $1/(C_t + R_t + Q)$, then determining the type of the event, and finally updating the intensities of the three events according to the rules above. The process is continued until all parts of the sequences have found a common ancestor. For the left endpoint of the sequences the time until a common ancestor is primarily determined by the allelic turnover process, whereas recombination plays an increasingly important role the farther a point is away from the left endpoint.

The process results in a set of correlated trees relating the samples along the sequence. In contrast to the neutral coalescent with recombination these trees are not taken from the same distribution since their expected branch lengths depend on the distance to the left endpoint where specificities are determined. During a single stochastic realization of the process we stored all information on topology and branch length for each of these trees. From these we (a) investigate the coalescent process as a function of distance from the point of selection and (b) simulate and subsequently analyze nucleotide sequences under this process.
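The event loop of the model can be illustrated with a minimal Python sketch. It is deliberately restricted to the selected site itself, so recombination is left out and $R_t = 0$ throughout; this restriction, the function names `coalescence_intensity` and `tmrca_at_selected_site`, and the parameter values are assumptions made for illustration and do not reproduce the authors' C program. The sketch draws exponential waiting times with rate $C_t + Q$, chooses the event type in proportion to its intensity, and applies the coalescence and allelic turnover rules of the MODEL section.

```python
import random


def coalescence_intensity(counts, n_classes):
    """C_t = M * sum_i n_i (n_i - 1) / 2, where counts[i] holds n_i."""
    return n_classes * sum(k * (k - 1) / 2 for k in counts)


def tmrca_at_selected_site(n, n_classes, q, rng):
    """Simulate the time until n lineages share an ancestor at the selected site."""
    # One sampled copy in each of n specificities; the remaining classes are empty.
    counts = [1] * n + [0] * (n_classes - n)
    t = 0.0
    while sum(counts) > 1:
        c_t = coalescence_intensity(counts, n_classes)
        total = c_t + q                      # R_t = 0 at a single point (assumption)
        t += rng.expovariate(total)          # waiting time with mean 1 / (C_t + Q)
        if rng.random() < c_t / total:
            # Coalescence within class i, chosen with weight n_i (n_i - 1) / 2.
            weights = [k * (k - 1) / 2 for k in counts]
            i = rng.choices(range(n_classes), weights=weights)[0]
            counts[i] -= 1
        else:
            # Allelic turnover: class i is lost; its lineages merge instantly and
            # the resulting lineage moves to one of the other M - 1 classes.
            i = rng.randrange(n_classes)
            if counts[i] > 0:
                j = rng.choice([k for k in range(n_classes) if k != i])
                counts[i] = 0
                counts[j] += 1
    return t


if __name__ == "__main__":
    rng = random.Random(2001)
    n, n_classes, q = 5, 5, 0.05
    reps = 2000
    mean_t = sum(tmrca_at_selected_site(n, n_classes, q, rng) for _ in range(reps)) / reps
    takahata = n_classes * (n_classes - 1) * (1 - 1 / n) / q
    print(f"simulated mean TMRCA: {mean_t:.1f}; Takahata (1990): {takahata:.1f}")
```

For small Q the simulated mean should lie close to $D = M(M - 1)(1 - 1/n)/Q$. As a worked value, n = M = 30 and Q = 0.01 give D = 84,100 time units, against a neutral value of about 1.93.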

STATISTICS

To characterize the shape of the phylogenetic trees

we used five quantities calculated from branch lengths.

These are

S, sum of the length of terminal branches;

T, total length of all branches;

D, time to the most recent common ancestor;

P, average pairwise distance between two specificities;

B, average length of basal branches emanating from

the root.
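As a rough illustration of how the five quantities, and the four ratios that this section builds from them following Uyenoyama (1997), might be computed, the Python sketch below simulates neutral coalescent trees and accumulates S, T, D, P, and B directly from the branch lengths. The function names, the tree bookkeeping, and the choice to measure P as the mean pairwise coalescence time (in units where a pair of lineages coalesces at rate 1) are assumptions made for this example and are not taken from the authors' program.

```python
import random


def simulate_quantities(n, rng):
    """One neutral coalescent tree for n sequences; returns (S, T, D, P, B)."""
    # Each lineage is (height of its node, number of sampled leaves below it).
    lineages = [(0.0, 1)] * n
    t = s = total = pair_sum = basal = 0.0
    while len(lineages) > 1:
        k = len(lineages)
        t += rng.expovariate(k * (k - 1) / 2)          # each pair coalesces at rate 1
        height_a, leaves_a = lineages.pop(rng.randrange(len(lineages)))
        height_b, leaves_b = lineages.pop(rng.randrange(len(lineages)))
        len_a, len_b = t - height_a, t - height_b
        total += len_a + len_b
        # Terminal branches are those leading to a single sampled leaf.
        s += (len_a if leaves_a == 1 else 0.0) + (len_b if leaves_b == 1 else 0.0)
        pair_sum += t * leaves_a * leaves_b            # these leaf pairs coalesce at t
        basal = (len_a + len_b) / 2                    # final value: branches at the root
        lineages.append((t, leaves_a + leaves_b))
    n_pairs = n * (n - 1) / 2
    return s, total, t, pair_sum / n_pairs, basal


def shape_ratios(s, t, d, p, b, n):
    """The four ratios, each scaled to one under the neutral coalescent."""
    a_n = sum(1 / i for i in range(1, n))
    b_n = 1 / n + sum(1 / i**2 for i in range(2, n))
    return {
        "R_PT": 2 * p * a_n / t,
        "R_ST": s * a_n / t,
        "R_SD": s * (1 - 1 / n) / d,
        "R_BD": b * (1 - 1 / n) / (d * b_n),
    }


if __name__ == "__main__":
    rng = random.Random(42)
    n, reps = 10, 5000
    sums = [0.0] * 5
    for _ in range(reps):
        for idx, value in enumerate(simulate_quantities(n, rng)):
            sums[idx] += value
    means = [x / reps for x in sums]
    # Ratios of means; under neutrality each should be close to one.
    print(shape_ratios(*means, n))
```

The ratios are taken here over the means of the quantities across replicates. Applying `shape_ratios` to a single tree gives the per-tree values that would be compared with data.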

From these, four ratios,

$$R_{PT} = \frac{2 P a_n}{T}, \quad R_{ST} = \frac{S a_n}{T}, \quad R_{SD} = \frac{S(1 - 1/n)}{D}, \quad \text{and} \quad R_{BD} = \frac{B(1 - 1/n)}{D b_n},$$

can be defined, where $a_n = \sum_{i=1}^{n-1} (1/i)$ and $b_n = 1/n + \sum_{i=2}^{n-1} (1/i^2)$ (Uyenoyama 1997).

Viewed as ratios of means, all four ratios are scaled to have an expected mean of one under the neutral coalescent, and simulations have shown that their means as ratios are also close to one (Uyenoyama 1997). The ratios have the advantage when applied to data that they are (almost) independent of the mutation rate and in many cases they have power to reject the hypothesis of a neutral coalescent process (Uyenoyama 1997; Schierup and Hein 2000). The most powerful statistic has generally been found to be $R_{SD}$, which measures the ratio of the length of the terminal branches to the height of the tree.

We also calculated the time between subsequent coalescence events. Under the neutral coalescent with $i$ sequences, the waiting times $F_i$ to the next coalescence are independent and exponentially distributed with mean $2/(i(i - 1))$. Thus, $G_i = F_i \, i(i - 1)/2$ are exponentially distributed with mean 1, and plotting $G_i$ as a function of $i$ can visualize systematic deviations from neutral expectations.

These measures can all be calculated from the branch length of the true trees over the sampled sequences in a single realization of the coalescent with recombination and balancing selection process. They can also be calculated from phylogenetic trees reconstructed from nucleotide sequences simulated under the model (see below).

Genealogical structure over a gene: The model was used to simulate genealogical histories. Each of n sampled genes was assigned a unique specificity among the M possibilities. Models were simulated and analyzed where either all specificities were sampled ($n = M$) or just a subset was sampled ($n < M$). One run of the program generates a set of trees with branch lengths over the set of genes. Such a set is a single outcome of the stochastic process and is termed a “history.” We sampled a given history at 1000 points spaced as a logarithmic function of $\rho$ and calculated the various statistics at each point. Mean and standard deviations for a given set of parameters were then found over many (>15,000) recorded histories. The statistics were then plotted as a function of the recombination distance from the site under selection. A total length of $\rho = 100$ was investigated, which means that 100 recombinations are expected between the endpoints of a gene in 4N generations.

Nucleotide sequences: Simulation of nucleotide sequences followed Schierup and Hein (2000). Neutral mutations can be added after the genealogies have been constructed under the coalescent model because the coalescent process and the neutral mutation process are independent. Mutations were added at rate m to the simulated genealogy by dividing the sequence length into L equally sized fragments corresponding to nucleotides. We used the simple Jukes-Cantor substitution model (Jukes and Cantor 1969) and assumed that nucleotides mutate independently. For a given position in the sequence, first a nucleotide is assigned to the most recent common ancestor (MRCA) with probabilities according to the equilibrium frequencies of nucleotides, which for the Jukes-Cantor model is 25% of each. The evolution of the nucleotide is then followed over the specific genealogical tree at this position. For a given branch of length l, the number of mutations is Poisson distributed with mean $ml$. Repeating this process for each nucleotide results in n aligned sequences of length L. We restricted analysis to the Jukes-Cantor model because we found previously that more complex substitution models have little effect on the expected values of the above quantities (Schierup and Hein 2000).

Sequences were simulated with a single allelic turnover rate at the site under selection but with different levels of recombination over the sequence. Again, sequences were initially assigned unique specificities, equivalently to sampling sequences with different specificities only, as is done in published studies of these systems. Each set of sequences was subsequently run through DNAdist and Kitsch programs of PHYLIP (Felsenstein 1995), which results in an inferred phylogenetic tree reconstructed on the basis of a distance matrix and restricted by a molecular clock. This is clearly not an appropriate method when recombination occurs, but the purpose here is to investigate the bias created in doing so. From the branch length of the reconstructed trees the various statistics were recorded. Several combinations of parameters n, M, s, and Q were investigated. The program for simulations was written in C and can be accessed through http://www.birc.dk/~mheide.
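The step of adding neutral mutations to a finished genealogy, as described under "Nucleotide sequences", can be sketched as follows. The sketch makes several assumptions that are not in the original: the tree is a nested Python list of (child, branch length) pairs with leaf names as strings, the same tree is used for every site (whereas under recombination each position has its own tree), and every mutation changes the nucleotide to one of the three other bases with equal probability. The function names are hypothetical.

```python
import random

BASES = "ACGT"


def mutations_on_branch(length, m, rng):
    """Number of mutations on a branch: Poisson with mean m * length."""
    if m <= 0 or length <= 0:
        return 0
    # Count the events of a rate-m Poisson process that fall inside the branch.
    count, position = 0, rng.expovariate(m)
    while position < length:
        count += 1
        position += rng.expovariate(m)
    return count


def evolve_site(node, base, m, rng, out):
    """Follow one nucleotide from `node` down to the leaves, filling `out`."""
    if isinstance(node, str):              # a leaf: record its final nucleotide
        out[node] = base
        return
    for child, length in node:             # internal node: (child, branch length) pairs
        child_base = base
        for _ in range(mutations_on_branch(length, m, rng)):
            child_base = rng.choice([x for x in BASES if x != child_base])
        evolve_site(child, child_base, m, rng, out)


def simulate_alignment(tree, n_sites, m, rng):
    """Return {leaf name: sequence} for n_sites independently evolving sites."""
    columns = []
    for _ in range(n_sites):
        out = {}
        evolve_site(tree, rng.choice(BASES), m, rng, out)   # MRCA base: 25% each
        columns.append(out)
    return {name: "".join(col[name] for col in columns) for name in columns[0]}


if __name__ == "__main__":
    tree = [("a", 1.0), ([("b", 0.4), ("c", 0.4)], 0.6)]
    for name, seq in simulate_alignment(tree, 40, 0.5, random.Random(3)).items():
        print(name, seq)
```

To mimic recombination, one would call `evolve_site` with a different tree for each position, taken from the stored history.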

RESULTS

Genealogical structure over gene: Figure 2 shows results for four of the basic quantities for two different allelic turnover rates to new specificities [Q = 0.01 (solid line) and Q = 0.1 (dotted-dashed line)] for the sample size n = 30 genes and M = 30 specificities. The values of each quantity for $\rho = 0$ are as expected from Takahata’s (1990) theory (marked on y-axis), which predicts coalescence times proportional to the square of the number of specificities and to 1/Q (e.g., $D = M(M - 1)(1 - 1/n)/Q$). Selection can be seen to greatly increase expected coalescence times close to the site under selection, but as $\rho$ increases, each quantity approaches the value expected under Kingman’s (1982) coalescent