
An Exact Nonparametric Method for Inferring Mosaic Structure in Sequence Triplets

July 2007
Genetics 176(2):1035-47

DOI:10.1534/genetics.106.068874


Authors:

David Posada

University of Vigo

Marcus W Feldman

Stanford University


An Exact Nonparametric Method for Inferring Mosaic Structure in Sequence Triplets

Maciej F. Boni,*,†,1 David Posada‡ and Marcus W. Feldman†

*Stanford Genome Technology Center, Palo Alto, California 94304, †Department of Biological Sciences, Stanford University, Stanford, California 94305 and ‡Department of Biochemistry, Genetics and Immunology, University of Vigo, Vigo 36310, Spain

Manuscript received November 27, 2006

Accepted for publication March 18, 2007

ABSTRACT

Statistical tests for detecting mosaic structure or recombination among nucleotide sequences usually rely on identifying a pattern or a signal that would be unlikely to appear under clonal reproduction. Dozens of such tests have been described, but many are hampered by long running times, confounding of selection and recombination, and/or inability to isolate the mosaic-producing event. We introduce a test that is exact, nonparametric, rapidly computable, free of the infinite-sites assumption, able to distinguish between recombination and variation in mutation/fixation rates, and able to identify the breakpoints and sequences involved in the mosaic-producing event. Our test considers three sequences at a time: two parent sequences that may have recombined, with one or two breakpoints, to form the third sequence (the child sequence). Excess similarity of the child sequence to a candidate recombinant of the parents is a sign of recombination; we take the maximum value of this excess similarity as our test statistic $\Delta_{m,n,b}$. We present a method for rapidly calculating the distribution of $\Delta_{m,n,b}$ and demonstrate that it has comparable power to and a much improved running time over previous methods, especially in detecting recombination in large data sets.

Mosaic structure exists in a nucleotide sequence if different segments of the sequence descend from different ancestors. A nucleotide sequence can be a mosaic of other sequences as a result of recombination or gene conversion; mosaic structure in bacterial DNA can also result from transduction, transformation, or conjugation, which are collectively referred to as horizontal gene transfer. The detection of mosaic structure has received much attention over the past two decades as a result of both a proliferation of sequence data and leaps in computing power, which together have allowed for the inference of multiple ancestral contributions to a nucleotide sequence. The biological questions at the source of this recent attention range from interest in the evolution of pathogens (Awadalla 2003; Moya et al. 2004; Wilson et al. 2005) and the characterization of linkage disequilibrium in large genomes (Pritchard and Przeworski 2001; Ardlie et al. 2002; Gabriel et al. 2002) to theoretical questions about clonality and the definitions of clonal and nearly clonal organisms (Maynard Smith et al. 1993; Halkett et al. 2005). For reviews on the methods and results in this field, see Posada et al. (2002) and Stumpf and McVean (2003).

Maynard Smith (1999) recognized that the continuum between completely clonal and freely recombining organisms naturally gives rise to two distinct problems: determining whether recombination occurs and measuring its frequency. In this investigation, we focus on the former. Detecting recombination usually involves searching groups of sequences for candidate recombinants or recombination signals and testing whether these represent statistically significant departures from expectation under a null hypothesis of no recombination. Dozens of statistical tests have been developed (Stephens 1985; Sawyer 1989; Balding et al. 1992; Karlin and Brendel 1992; Maynard Smith 1992; Takahata 1994; Sneath 1995; Goss and Lewontin 1996; Jakobsen and Easteal 1996; Grassly and Holmes 1997; Maynard Smith and Smith 1998; Sneath 1998; Awadalla et al. 1999; Crandall and Templeton 1999; Holmes et al. 1999; Maynard Smith 1999; Wall 1999; Gibbs et al. 2000; Martin and Rybicki 2000; Worobey 2001; Bruen et al. 2006) and evaluated (Wall 2000; Brown et al. 2001; Posada and Crandall 2001; Wiuf et al. 2001; Posada 2002) in this endeavor, none of which has yet emerged as the single standard test to be used for identifying recombination. In addition to testing for the existence of recombination, certain methods are also able to locate recombination breakpoints and, sometimes, the parent sequences involved in the recombination event, although the latter can be quite difficult. Methods that do not focus on parent sequences and breakpoints usually rely on detecting a recombination signal (for example, a phylogenetic incongruence or an excess of homoplasies) but may have trouble isolating the actual recombination event, which entails identifying particular parent sequences that recombined at particular breakpoints to form a recombinant offspring sequence.

Corresponding author: Stanford Genome Technology Center, 855 S. California Ave., Palo Alto, CA 94304. E-mail: maciek@charles.stanford.edu

Genetics 176: 1035–1047 (June 2007)

Some methods (Takahata 1994; Robertson et al. 1995; Crandall and Templeton 1999; Holmes et al. 1999; Gibbs et al. 2000; Martin and Rybicki 2000; Martin et al. 2005) perform tests on three sequences at a time, which allows them to posit candidate parent sequences and candidate breakpoints. The proposed arrangement is then tested with a likelihood analysis, by visual detection of similarity in different sequence regions, or against a null distribution that would be expected under clonal evolution. The most common among these triplet tests, the Chimaera method (Posada and Crandall 2001; Posada 2002), which is based on a $\chi^2$-statistic (Maynard Smith 1992), and the Martin–Rybicki (MR) binomial distribution test (Martin and Rybicki 2000), identify unusually high levels of sequence similarity inside a predefined window or on either side of a candidate breakpoint. We also take this approach by introducing a simple and intuitive statistic describing how identity varies along a sequence within a sequence triple. Our test statistic $\Delta_{m,n,b}$ is discrete and nonparametric. Describing its distribution, in principle, would require a computing time that grows exponentially with the number of informative sites (a subset of the polymorphisms) in the given sequence triple; to avoid this costly brute-force computation, we introduce a method for computing probabilities and P-values in polynomial time. Our method is memory intensive but very fast: computation of exact P-values takes seconds on a personal computer when there are <250 informative sites in the proposed sequence triple.

Our triplet test represents an advance over Chimaera and the MR method in that we eliminate the need for a sliding window, use a nonparametric statistic, and introduce a computation scheme that is exact and orders of magnitude faster. In evaluating our method's power to detect recombination in sequence triplets, we find that we always have higher power than the MR method and comparable power to Chimaera. In repeated applications of our triplet test to data sets with more than three sequences, we show that our method is among the most powerful of 14 previously tested methods.

STATISTICAL TESTS

We begin with three homologous sequences of the same length. The relationship among these three sequences is similar in practice to the relationship formulated by Crandall and Templeton (1999, pp. 166–167) among networks of sequences. From our three sequences, we designate one as the child sequence and investigate whether it could be a recombinant of the other two sequences, which we call parent sequences. We first present the simple case of a single-breakpoint recombinant but later focus on the more interesting and realistic case of a double-breakpoint recombinant. Considering our sequence triple, we ask whether one can reject the null hypothesis that the evolutionary history among the three sequences was completely clonal.

We call our parent sequences $p$ and $q$ and our child sequence $c$. For sequence length $L$, we can represent our three sequences as vectors of nucleotides: $p = (p_1, p_2, \ldots, p_L)$, $q = (q_1, q_2, \ldots, q_L)$, and $c = (c_1, c_2, \ldots, c_L)$. A single-breakpoint recombinant between the parent sequences at position $l$ can be denoted

$$(pq)_l = (p_1, \ldots, p_l, q_{l+1}, \ldots, q_L),$$

with $0 \le l \le L$.

Writing $|p - q|$ as the number of nucleotide differences between sequences $p$ and $q$, we say that the most likely recombination breakpoint $l$ minimizes $|(pq)_l - c|$, the number of differences between the observed child sequence and a possible recombinant of the parent sequences. If this candidate recombinant is much closer (than either parent) to the child sequence, then we may have reason to believe that the evolutionary history of sequence $c$ is better explained by a recombination or a gene conversion than by strictly clonal reproduction. If the candidate recombinant $(pq)_l$ is only slightly closer than the parents to the child sequence, then the candidate recombinant's additional sequence similarity may simply be an accident of how mutations accumulated on either side of the breakpoint $l$. Assessing whether the locations of the mutations (relative to the breakpoint) are significantly nonrandom is the foundation for the maximum $\chi^2$-test (Maynard Smith 1992), the Chimaera method (Posada and Crandall 2001; Posada 2002), the exact test based on the binomial distribution suggested by Martin and Rybicki (2000), and the heuristic test suggested by Crandall and Templeton (1999); it is also the focus of our analysis.

We introduce a nonparametric statistic slightly different from the ones above, but one that is more direct at detecting potential mosaics. Let

$$d_{\mathrm{NoRec}} = \min\{|p - c|,\ |q - c|\} \tag{1}$$

be the minimum distance from the child to either of the parents, and let

$$d_{\mathrm{Rec},1} = \min_{0 \le l \le L} \{|(pq)_l - c|\} \tag{2}$$

be the minimum distance from the child to a candidate recombinant of the parents (including the boundary case recombinants, which are just the parents themselves); the subscript "1" indicates that there is just one breakpoint in the recombinant. Then, we define

$$\Delta_1 = d_{\mathrm{NoRec}} - d_{\mathrm{Rec},1}. \tag{3}$$

The quantity $\Delta_1$ describes the difference, between clonal evolution and nonclonal evolution, in the number of mutations needed to describe the evolutionary history between the child and the closer parent; by nonclonal evolution we mean, here, an evolutionary history that allows for a single recombination event with a single breakpoint. Clearly $\Delta_1 \ge 0$, and even if there had truly been no recombination or gene conversion among the sequences, a particular sequence triple could give the appearance of recombination with a high value of $\Delta_1$ if, by chance, the pattern of mutations was such that the left side of the child sequence appeared to be more closely related to parent $p$ and the right side appeared to be closer to parent $q$. The distribution of this recombination signal $\Delta_1$ under the null hypothesis of clonal reproduction can be easily computed (see next section).

The difference in (3) is affected only by informative sites of the sequence triple $(p, q, c)$. For our purposes, we define informative sites as those where the child's nucleotide matches exactly one of the parents' nucleotides. Uninformative sites are sites where (i) all three sequences agree, (ii) all three sequences differ, or (iii) the parents have matching (i.e., identical) nucleotides that differ from the child's. Our definition of informative sites is identical to that used in the Chimaera method and to the sister groups defined by Takahata (1994).
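The definition of informative sites translates directly into a site-by-site comparison. The following is a minimal sketch, assuming the three sequences are already aligned strings of equal length; the function name `informative_pattern` and the toy sequences are illustrative and do not come from the authors' software.

```python
def informative_pattern(p: str, q: str, c: str) -> str:
    """Return one letter per informative site, in sequence order.

    'P' marks a site where the child matches parent p only,
    'Q' marks a site where the child matches parent q only.
    All other sites (three-way agreement, three-way disagreement,
    or parents identical but different from the child) are skipped.
    """
    if not (len(p) == len(q) == len(c)):
        raise ValueError("sequences must be aligned to the same length")
    letters = []
    for a, b, child in zip(p, q, c):
        if child == a and child != b:
            letters.append("P")
        elif child == b and child != a:
            letters.append("Q")
    return "".join(letters)


# Toy alignment (hypothetical data): sites 3, 5 and 8 are informative.
pattern = informative_pattern("ACGTACGT", "ACCTTCGA", "ACGTTCGA")
print(pattern)                                # PQQ
print(pattern.count("P"), pattern.count("Q"))  # m = 1, n = 2
```

The counts of `P` and `Q` letters are the quantities $m$ and $n$ used throughout the rest of the analysis.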

Suppose that there are $m$ informative sites where $p$ and $c$ match and $n$ informative sites where $q$ and $c$ match. The quantity $\Delta_1$ in (3) is then more precisely defined as $\Delta_{m,n,1}$. Under the null hypothesis of clonal evolution among sequences $p$, $q$, and $c$, $\Delta_{m,n,1}$ is a random variable that describes the maximum number of mutation events one could "explain away" by recombining $p$ with $q$ at a single breakpoint.

A two-breakpoint recombinant of sequences $p$ and $q$ can be described by

$$(pqp)_{i,j} = (p_1, \ldots, p_i, q_{i+1}, \ldots, q_j, p_{j+1}, \ldots, p_L),$$

where $i \le j$. Letting

$$d_{\mathrm{Rec},2} = \min_{0 \le i \le j \le L} \{|(pqp)_{i,j} - c|\}, \tag{4}$$

we define

$$\Delta_{m,n,2} = d_{\mathrm{NoRec}} - d_{\mathrm{Rec},2}, \tag{5}$$

where $m$ and $n$ are again the numbers of the two types of informative sites.

$\Delta_{m,n,1}$ and $\Delta_{m,n,2}$ are random variables that describe single-breakpoint and double-breakpoint recombination signals, respectively, under the null hypothesis of no recombination. They are discrete random variables with range $0 \le \Delta_{m,n,b} \le \min\{m, n\}$, where $b$ is the number of breakpoints. Observed $\Delta$-quantities can be quickly calculated [in $O(L)$-time, for any $b$] from sequence data, and the null hypothesis of clonal evolution can be rejected if they are too high. In the next two sections, we review what is already known about the distribution of $\Delta_{m,n,1}$ and present a method for calculating the distribution of $\Delta_{m,n,2}$.
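To make definitions (1)–(5) concrete, the sketch below evaluates them literally by trying every breakpoint (or breakpoint pair) and counting mismatches. This is a definitional brute force, deliberately not the $O(L)$ calculation mentioned above, and the function names and toy sequences are assumptions made for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def delta_bruteforce(p: str, q: str, c: str) -> tuple[int, int]:
    """Return (Delta_1, Delta_2) straight from definitions (1)-(5)."""
    L = len(c)
    d_norec = min(hamming(p, c), hamming(q, c))                    # Eq. (1)
    d_rec1 = min(hamming(p[:l] + q[l:], c) for l in range(L + 1))  # Eq. (2)
    d_rec2 = min(                                                  # Eq. (4)
        hamming(p[:i] + q[i:j] + p[j:], c)
        for i in range(L + 1)
        for j in range(i, L + 1)
    )
    return d_norec - d_rec1, d_norec - d_rec2                      # Eqs. (3), (5)


# Hypothetical child that copies p, then q, then p again.
print(delta_bruteforce("AAAAAAAA", "CCCCCCCC", "AAACCCAA"))  # (1, 3)
```

In this toy triple $m = 5$ and $n = 3$, so both values respect the bound $0 \le \Delta_{m,n,b} \le \min\{m, n\}$, and the two-breakpoint statistic is at least as large as the one-breakpoint statistic.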

Single-breakpoint recombinant: Consider a sequence triple $(p, q, c)$ with $m$ informative sites where $p$ and $c$ match and $n$ informative sites where $q$ and $c$ match. Moving left to right across the informative sites on the child sequence, we can assign each informative site a letter based on probable ancestry (determined by the parent to which it is identical) and obtain a sequence such as PPPQPPPQQQQ, where a P denotes an informative site at which the child sequence and parent $p$ share a nucleotide, and Q denotes an informative site at which the child sequence and parent $q$ share a nucleotide. Under the null hypothesis of clonal reproduction, the placement of P's and Q's in the sequence should be completely random; i.e., each of the $(m + n)!/(m!\,n!)$ possibilities has equal probability. In the example sequence above, it appears that the P's cluster toward the left side of the sequence and the Q's to the right side; therefore, this sequence may be a true (statistically significant) recombinant.

This sequence of P's and Q's is most easily visualized as a random walk on a set of axes where P is a step up and Q is a step down. This is not a traditional random walk since the number of up steps is known to be $m$, the number of down steps is known to be $n$, and the only randomness is the order in which they appear. After $s$ steps, the height $X_s$ of the random walk is distributed quasi-hypergeometrically [the quantity $(X_s + s)/2$ is distributed hypergeometrically]. The probability of being at height $h$ after $s$ steps, when $|h| \le s$ and $0 \le s \le m + n$, is

$$P(X_s = h) = \binom{m}{\frac{s+h}{2}} \binom{n}{\frac{s-h}{2}} \binom{m+n}{s}^{-1}$$

if $h + s$ is even; $P(X_s = h) = 0$ if $h + s$ is odd. This type of finite stochastic process can be called a hypergeometric random walk (HGRW). HGRWs have been previously analyzed in the probability literature in the form of ballot problems (Feller 1957), wherein one candidate in an election receives $m$ votes, the second candidate receives $n$ votes, and the order in which the votes are counted is of interest. We denote a hypergeometric random walk with $m$ up steps and $n$ down steps by the random variable $H_{m,n}$. Given data, we refer to an observed walk diagrammed from the informative sites of a sequence triple; examples of observed walks diagrammed from real data are in Figure 1.
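The height distribution of the walk can be evaluated with exact integer binomial coefficients. The sketch below is an illustrative implementation of the hypergeometric formula for $P(X_s = h)$; the function name `prob_height` is an assumption, and the check at the end simply confirms that the probabilities over all reachable heights sum to one.

```python
from math import comb


def prob_height(m: int, n: int, s: int, h: int) -> float:
    """P(X_s = h) for a walk with m up steps and n down steps in random order."""
    if not 0 <= s <= m + n or abs(h) > s or (s + h) % 2:
        return 0.0
    ups = (s + h) // 2    # up steps among the first s steps
    downs = (s - h) // 2  # down steps among the first s steps
    # math.comb returns 0 when ups > m or downs > n, as required.
    return comb(m, ups) * comb(n, downs) / comb(m + n, s)


# For the example PPPQPPPQQQQ we have m = 6 and n = 5.
m, n, s = 6, 5, 4
total = sum(prob_height(m, n, s, h) for h in range(-s, s + 1))
print(round(total, 12))  # 1.0
```

Because $(X_s + s)/2$ counts the up steps among the first $s$ steps, the same numbers could also be obtained from any standard hypergeometric probability mass function.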

Given our sequence triple with $m + n$ informative sites and allowing only one breakpoint in a putative recombinant, the observed value $\Delta_{m,n,1}$ is related to the maximum height of the walk diagrammed from the informative sites of sequences $p$, $q$, and $c$, by the relation

$$\Delta_{m,n,1} = \max H_{m,n} + \min\{0,\ n - m\}.$$

Using results from ballot theory (Barton and Mallows 1965) and gambling problems (Whitworth 1901, prop. 39, pp. 116–117), it can be shown that

$$P(\Delta_{m,n,1} \ge k) = \begin{cases} \dbinom{m+n}{n+k} \Big/ \dbinom{m+n}{n} & \text{when } m \le n \\[2ex] \dbinom{m+n}{m+k} \Big/ \dbinom{m+n}{n} & \text{when } m > n, \end{cases} \tag{6}$$

or equivalently that

$$P(\max H_{m,n} \ge k) = \binom{m+n}{n+k} \Big/ \binom{m+n}{n}. \tag{7}$$

From the observed maximum height of the diagrammed walk of the informative sites of a sequence triple, the null hypothesis of clonal reproduction can be rejected at the level $P$ as calculated in (6) or (7). This is implicitly a one-tailed test with rejection of the null hypothesis of clonal evolution when the observed $\Delta_{m,n,1}$ (or the maximum height of the observed walk) is large relative to $m$ and $n$. An HGRW with a statistically improbable maximum height will have its up steps clustered toward the beginning (left side) of the walk and its down steps clustered toward the end (right side) of the walk. This is precisely a mosaic pattern in a nucleotide sequence: a child sequence having ancestry in $p$ in the left-hand side of its sequence and ancestry in $q$ in the right-hand side of its sequence.
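The closed form for the maximum height can be checked against a full enumeration of orderings when $m$ and $n$ are small. The sketch below does this and then applies the formula to the example PPPQPPPQQQQ from the text; the function names are illustrative, and the closed form is used here only for $k \ge \max\{0, m - n\}$, the range in which the maximum is not already guaranteed to reach $k$.

```python
from itertools import combinations
from math import comb


def p_max_at_least(m: int, n: int, k: int) -> float:
    """Closed form for P(max H_{m,n} >= k), used for k >= max(0, m - n)."""
    return comb(m + n, n + k) / comb(m + n, n)


def p_max_at_least_enum(m: int, n: int, k: int) -> float:
    """Same probability by enumerating every ordering of m ups and n downs."""
    total = hits = 0
    for up_positions in combinations(range(m + n), m):
        ups = set(up_positions)
        height = peak = 0
        for step in range(m + n):
            height += 1 if step in ups else -1
            peak = max(peak, height)
        total += 1
        hits += peak >= k
    return hits / total


# Agreement on a small case.
for k in range(0, 5):
    assert abs(p_max_at_least(3, 4, k) - p_max_at_least_enum(3, 4, k)) < 1e-12

# PPPQPPPQQQQ: m = 6, n = 5, maximum height 5, so Delta_{6,5,1} = 5 + (5 - 6) = 4.
print(p_max_at_least(6, 5, 5))  # 11/462, roughly 0.024
```

Under these assumptions the example pattern would be significant at the 5% level for a single, pre-specified parent–parent–child arrangement, before any correction for multiple comparisons.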

Double-breakpoint recombinant: Identifying mosaics with two breakpoints is the more relevant and interesting problem since in long sequence regions, converted tracts of DNA or horizontally transferred segments will usually have both breakpoints present. Identification of two breakpoints also allows for the removal of the horizontally acquired segment; the remaining segment(s) can then be tested again for clonal evolution, and multibreakpoint mosaics could be inferred by repeating such a process. Note that the two-breakpoint case subsumes the one-breakpoint case since a one-breakpoint recombinant can be viewed as having two breakpoints where one breakpoint is on the end of the sequence.

Again, considering only the informative sites of the sequence triple $(p, q, c)$ and viewing their ordering in the context of a hypergeometric random walk, the quantity $\Delta_{m,n,2}$ can be calculated by identifying the maximum descent (md) of the walk constructed from the arrangement of informative sites. Letting $X_s$ be the height of $H_{m,n}$ at step $s$, the maximum descent is defined as

$$\operatorname{md} H_{m,n} = \max_{0 \le s \le t \le m+n} (X_s - X_t),$$

and it can be shown that

$$P(\Delta_{m,n,2} = k) = \begin{cases} P(\operatorname{md} H_{m,n} = k) & \text{when } m \ge n \\ P(\operatorname{md} H_{m,n} = k + n - m) & \text{when } m < n. \end{cases}$$

Statistical theory underlying a general class of statistics based on partial sum processes (Siegmund 1988; Karlin et al. 1990), change-point problems (Siegmund 1986), and maximal segmental sums (Karlin and Dembo 1992) provides asymptotic approximations that could be applied to calculate the probability that md $H_{m,n}$ is large relative to $m$ and $n$. Notably, Lemmas 3 and 4 in Siegmund (1988) and Theorems 2 and 3 in Hogan and Siegmund (1986) contain the appropriate constructions to approximate probabilities of maximum descents in HGRWs. In the theory on ballot problems, the maximum descent of an HGRW represents the maximum lead change (in one direction only) when counting ballots in a two-candidate election; as far as we are aware, this distribution has not been calculated with the combinatorial methods and reflection techniques usually applied in ballot problems. Below, we provide a method for calculating this distribution exactly.

Figure 1. Observed walks diagrammed from the informative sites of sequence triples. (A) The walk is diagrammed from Neisseria data (from the fourth row of Table 1). (B) The walk is diagrammed from influenza data (from the first row of Table 2). The circles indicate the beginning and end of the maximum descent in each walk, and in both cases the beginning of the maximum descent is also the maximum height of the walk. The dotted line in each diagram denotes the expected location of the hypergeometric random walk. The shaded areas in each diagram show the range of 100 simulated HGRWs.
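Before turning to the null distribution, note that the observed statistics themselves require only one pass over the string of P's and Q's. The sketch below tracks the running height, the running peak, and the largest drop from that peak, then converts them to $\Delta_{m,n,1}$ and $\Delta_{m,n,2}$ using the relations between these statistics and the maximum height and maximum descent. The function name is an assumption; the conversion for two breakpoints uses $\Delta_{m,n,2} = \operatorname{md} H_{m,n} + \min\{0, m - n\}$, which is a restatement of the case distinction given for $P(\Delta_{m,n,2} = k)$.

```python
def delta_from_pattern(pattern: str) -> tuple[int, int]:
    """Return (Delta_{m,n,1}, Delta_{m,n,2}) for a string of 'P'/'Q' letters."""
    if set(pattern) - {"P", "Q"}:
        raise ValueError("pattern may contain only 'P' and 'Q'")
    m = pattern.count("P")
    n = pattern.count("Q")
    height = peak = max_descent = 0
    for letter in pattern:
        height += 1 if letter == "P" else -1          # P is up, Q is down
        peak = max(peak, height)                      # maximum height so far
        max_descent = max(max_descent, peak - height) # largest drop so far
    delta1 = peak + min(0, n - m)
    delta2 = max_descent + min(0, m - n)
    return delta1, delta2


print(delta_from_pattern("PPPQPPPQQQQ"))  # (4, 4)
print(delta_from_pattern("PPPQQQPP"))     # (1, 3)
```

The second pattern corresponds to a child that follows $p$, then $q$, then $p$ again: one breakpoint explains away a single difference, whereas two breakpoints explain away all three.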

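We use the shorthand $x_{m,n,k} = P(\operatorname{md} H_{m,n} = k)$, and for $j, k \ge 0$, we define

$$y_{m,n,k,j} = P(\operatorname{md} H_{m,n} = k \,\cap\, \min H_{m,n} = -j).$$

Then,

$$x_{m,n,k} = \sum_{j=0}^{k} y_{m,n,k,j}, \tag{8}$$

and the y-probabilities can be obtained by solving the recursions

$$j = 0: \quad y_{m,n,k,0} = \frac{m}{m+n} \left[ y_{m-1,n,k,1} + y_{m-1,n,k,0} \right] \tag{9}$$

$$j > k \ge 0: \quad y_{m,n,k,j} = 0 \tag{10}$$

$$j = k > 0: \quad y_{m,n,j,j} = \frac{n}{m+n} \left[ y_{m,n-1,j-1,j-1} + y_{m,n-1,j,j-1} \right] \tag{11}$$

$$k > j > 0: \quad y_{m,n,k,j} = \frac{m}{m+n}\, y_{m-1,n,k,j+1} + \frac{n}{m+n}\, y_{m,n-1,k,j-1}, \tag{12}$$

with boundary conditions

$$y_{m,0,k,j} = \begin{cases} 1 & \text{for } k = j = 0 \\ 0 & \text{otherwise} \end{cases} \tag{13}$$

$$y_{0,n,k,j} = \begin{cases} 1 & \text{for } k = j = n \\ 0 & \text{otherwise} \end{cases} \tag{14}$$

$$y_{m,n,0,0} = \begin{cases} 1 & \text{for } n = 0 \\ 0 & \text{otherwise} \end{cases} \tag{15}$$

$$y_{m,n,k,j} = 0 \quad \text{when } k > n \text{ or } k < n - m \tag{16}$$

$$y_{m,n,k,j} = 0 \quad \text{when } j > n \text{ or } j < n - m. \tag{17}$$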

All of the above recursions can be proved with a simple but careful first-step analysis of the random walk $H_{m,n}$. Below, the random variables $H_{m-1,n}$ and $H_{m,n-1}$ refer to the subwalk of $H_{m,n}$ that starts after the first step of $H_{m,n}$.

As an example, recursion (11) can be proved by noting that the event $\{\operatorname{md} H_{m,n} = j \,\cap\, \min H_{m,n} = -j\}$ implies that the first step of $H_{m,n}$ must be down ($X_1 = -1$) and that md $H_{m,n-1}$ must be either $j$ or $j - 1$. Thus,

$$\begin{aligned} P(\operatorname{md} H_{m,n} = j \,\cap\, \min H_{m,n} = -j) ={}& P(\operatorname{md} H_{m,n} = j \,\cap\, \min H_{m,n} = -j \,\cap\, X_1 = -1 \,\cap\, \operatorname{md} H_{m,n-1} = j) \\ &+ P(\operatorname{md} H_{m,n} = j \,\cap\, \min H_{m,n} = -j \,\cap\, X_1 = -1 \,\cap\, \operatorname{md} H_{m,n-1} = j - 1). \end{aligned} \tag{18}$$

In both summands of the right-hand side of (18), the last three events imply the first. We can rewrite the right-hand side of (18) as

$$P(\min H_{m,n} = -j \,\cap\, X_1 = -1 \,\cap\, \operatorname{md} H_{m,n-1} = j) + P(\min H_{m,n} = -j \,\cap\, X_1 = -1 \,\cap\, \operatorname{md} H_{m,n-1} = j - 1). \tag{19}$$

The events

$$\{\min H_{m,n} = -j \,\cap\, X_1 = -1\} \quad \text{and} \quad \{\min H_{m,n-1} = -(j - 1) \,\cap\, X_1 = -1\} \tag{20}$$

are identical; one occurs if and only if the other occurs. Using this identity, we substitute into (19) and obtain

$$P(\min H_{m,n-1} = -(j - 1) \,\cap\, X_1 = -1 \,\cap\, \operatorname{md} H_{m,n-1} = j) + P(\min H_{m,n-1} = -(j - 1) \,\cap\, X_1 = -1 \,\cap\, \operatorname{md} H_{m,n-1} = j - 1). \tag{21}$$

By independence of the first step $X_1 = -1$ from the subwalk $H_{m,n-1}$, this becomes

$$P(X_1 = -1)\, P(\min H_{m,n-1} = -(j - 1) \,\cap\, \operatorname{md} H_{m,n-1} = j) + P(X_1 = -1)\, P(\min H_{m,n-1} = -(j - 1) \,\cap\, \operatorname{md} H_{m,n-1} = j - 1), \tag{22}$$

which is

$$\frac{n}{m+n} \left[ y_{m,n-1,j,j-1} + y_{m,n-1,j-1,j-1} \right].$$

The other recursions can be proven similarly, and the boundary cases (13)–(17) are easily verifiable.

The computation time for any $y_{m,n,k,j}$ is bounded above by $mn^3$, which is the maximum table size required in memory to solve recursions (9)–(12); $k + 1$ y-values must be computed to calculate $x_{m,n,k}$ via Equation 8. On a single 3-GHz processor with access to 2 GB RAM, the worst-case x-calculations for 250 informative sites take <3 sec; most x-probabilities can be calculated in <1 min for up to 400 informative sites. All calculations presented in this article (except where noted) were done on a 3.2-GHz Linux laptop with 1 GB of RAM and 750 MB of virtual memory. C++ source code for calculating the x- and y-variables is available from the authors.
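For readers who want to experiment with small cases, the following is a minimal Python sketch of recursions (9)–(12) with boundary conditions (13)–(17). It is not the authors' C++ code: it uses memoized recursion and exact fractions rather than a preallocated table, so it is suitable only for modest $m$ and $n$. The helper names `y`, `x`, `x_enum`, and `p_value_md` are illustrative, and `x_enum` is a brute-force enumeration included purely as a check.

```python
from fractions import Fraction
from functools import lru_cache
from itertools import combinations


@lru_cache(maxsize=None)
def y(m: int, n: int, k: int, j: int) -> Fraction:
    """P(md H_{m,n} = k and min H_{m,n} = -j)."""
    if n == 0:                                   # boundary (13)
        return Fraction(int(k == 0 and j == 0))
    if m == 0:                                   # boundary (14)
        return Fraction(int(k == n and j == n))
    # Zero cases: (10), (15), (16), (17), and negative indices.
    if k <= 0 or j < 0 or j > k or k > n or k < n - m or j < n - m:
        return Fraction(0)
    up, down = Fraction(m, m + n), Fraction(n, m + n)
    if j == 0:                                   # recursion (9)
        return up * (y(m - 1, n, k, 1) + y(m - 1, n, k, 0))
    if j == k:                                   # recursion (11)
        return down * (y(m, n - 1, j - 1, j - 1) + y(m, n - 1, j, j - 1))
    # recursion (12), k > j > 0
    return up * y(m - 1, n, k, j + 1) + down * y(m, n - 1, k, j - 1)


def x(m: int, n: int, k: int) -> Fraction:
    """P(md H_{m,n} = k), Equation (8)."""
    return sum((y(m, n, k, j) for j in range(k + 1)), Fraction(0))


def p_value_md(m: int, n: int, k: int) -> Fraction:
    """Upper tail P(md H_{m,n} >= k)."""
    return sum((x(m, n, i) for i in range(k, n + 1)), Fraction(0))


def x_enum(m: int, n: int, k: int) -> Fraction:
    """Brute-force check: enumerate all orderings of m ups and n downs."""
    total = hits = 0
    for up_positions in combinations(range(m + n), m):
        ups = set(up_positions)
        height = peak = md = 0
        for step in range(m + n):
            height += 1 if step in ups else -1
            peak = max(peak, height)
            md = max(md, peak - height)
        total += 1
        hits += md == k
    return Fraction(hits, total)


# The recursion and the enumeration agree on every small case tried here.
for m in range(5):
    for n in range(5):
        assert all(x(m, n, k) == x_enum(m, n, k) for k in range(n + 1))

print(float(p_value_md(6, 5, 4)))  # tail probability for PPPQPPPQQQQ (md = 4)
```

A production implementation would fill the table iteratively in floating point, as the memory figures quoted above suggest; the exact-fraction version trades speed for the ability to verify the recursions without rounding error.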

For a given sequence triple in which we observe a $\Delta_{m,n,2} = k$, with a P-value of $\sum_{j \ge k} x_{m,n,j}$ we can reject the null hypothesis of completely clonal reproduction in favor of an evolutionary history that includes a two-breakpoint recombination event.

APPLICATIONS

The following are two simple examples that use the distributions $\Delta_{m,n,1}$ and $\Delta_{m,n,2}$ to test for mosaic structure among three sequences.

Neisseria: We considered a classic example from the genus Neisseria and applied our tests to its argF gene, which is widely believed to have mosaic structure as a result of horizontal gene transfer among different species (Zhou and Spratt 1992; Grassly and Holmes 1997; Husmeier and McGuire 2003). Zhou and Spratt (1992) found regions of clustered polymorphism in a comparison between the argF genes of a Neisseria meningitidis isolate and a N. gonorrhoeae isolate and deduced that this region of clustered polymorphisms had likely ancestry in the species N. cinerea (since N. meningitidis and N. cinerea were nearly identical in this region). The authors noted that there were two regions in N. meningitidis that could have arisen by horizontal gene transfer, one of which might have been the result of variation in mutation rates or fixation rates (usually called "rate variation"). Further studies (Grassly and Holmes 1997; Husmeier and McGuire 2003) suggested that additional regions in the argF gene may have arisen by recombination.

We used three of the Neisseria sequences, one of each species, from the studies mentioned above (GenBank accession nos. X64860, X64866, and X64869; 787 nt in length) and tested whether there is any parent–parent–child relationship among them that lends support to one sequence being a mosaic of the other two. Table 1 shows that of the six possible arrangements, one has a highly significant ($P = 10^{-12}$) single-breakpoint recombination signal, while the other five have none. This occurs because the first 202 nucleotides of N. meningitidis cluster significantly with N. cinerea (3.5% divergent, while N. meningitidis and N. gonorrhoeae are 13% divergent in this region) and the final 585 nucleotides of N. meningitidis cluster significantly with N. gonorrhoeae (2.9% divergent, while N. meningitidis and N. cinerea are 15% divergent in this region). This indicates that the first 202 nucleotides of N. meningitidis have probable ancestry in N. cinerea while the final 585 nucleotides of N. meningitidis have probable ancestry in N. gonorrhoeae, a view that is supported by the last two columns of Table 1, which allow for two breakpoints in the child sequence's composition but support a mosaic structure almost identical to the one-breakpoint case.

Influenza A: Gibbs et al. (2001) found evidence for recombination in the hemagglutinin gene of the 1918 "Spanish" influenza strain, but their results were later refuted by Worobey et al. (2002) and Strimmer et al. (2003). We reanalyzed the five sequences presented by Gibbs that were the candidate recombiners and recombinants: two swine sequences (A/swine/Iowa/15/30 and A/swine/Wisconsin/1/61) and three human sequences (A/South Carolina/1/18, A/Kiev/59/79, and A/Alma Ata/1417/84), where the last two numbers in the sequence names indicate the year the sequence was isolated. In Table 2 we show the results obtained using our $\Delta$-method on the significant relationships presented in Figure 1 of Gibbs et al. (2001).

With any type of analysis, detecting recombination in ancient influenza sequences is a challenge because of the high mutation rates in RNA viruses. A recombination that occurred 90 years ago would have its recombination signal obscured by mutations that accumulated after the recombination event. The relationship specified by the first two rows in Table 2, for example, requires a minimum of 104 years of evolution after the posited recombination event (61 years between the South Carolina and the Kiev strains and 43 years between the South Carolina and Wisconsin strains). Our five influenza sequences are on average 10% divergent (range: 2.4–18.3%), which means that detecting recombination events should be easy if the events were recent but difficult if they were ancient. On the timescale of influenza evolution, the hypothesized recombination events in Table 2 would be quite ancient.

Nevertheless, our method does detect weak recombi-

nation signals in the 1918 and 1984 human inﬂuenza

strains. It is important to note that we are performing

TABLE 1

Mosaic structure in Neisseria argF gene

pqcNull

Observed

maximum P-value

Observed maximum

descent P-value

N. men. N. cin. N. gon. H

84,6

78 1 2 0.30

N. cin. N. men. N. gon. H

6,84

0 1 78 1

N. gon. N. cin. N. men. H

84,32

52 1 19 8.93 3 10

11

N. cin. N. gon. N. men. H

32,84

19 1.42 3 10

12

71 2.50 3 10

11

N. gon. N. men. N. cin. H

6,32

0 1 26 1

N. men. N. gon. N. cin. H

32,6

26 1 2 0.60

The ﬁrst three columns show a candidate parent–parent–child conﬁguration that is tested for recombination; the fourth col-

umn shows the null distribution for the ordering of informative sites in the given sequence triple. The 1-breakpoint recombinant

in the fourth row can be achieved with three different breakpoints, at positions 201, 202, and 203 (a breakpoint at position 201

indicates a breakpoint after the 201st nucleotide). The 2-breakpoint recombinant in the fourth row can be achieved with 66 dif-

ferent pairs of breakpoints: the ﬁrst is always one of 202–204 while the second is one of 742–759/784–787. The 2-breakpoint

recombinant in the third row can be achieved with 21 different pairs of breakpoints: the ﬁrst is always one of 0–6 while the second

is one of 202–204. N. men., N. meningitidis; N. cin., N. cinerea; N. gon., N. gonorrhoeae.

1040 M. F. Boni, D. Posada and M. W. Feldman

post hoc tests on previously analyzed sequences for which Gibbs et al. (2001) obtained statistically significant recombination signals. Given these same five sequences without any a priori knowledge about their relationships, we might compute P-values for all 60 possible parent–parent–child relationships among these sequences (5 × 4 × 3 ordered choices of two parents and a child). The last two columns of Table 2 show which of these comparisons would still be significant after a Dunn–Šidák correction for 60 comparisons. The Dunn–Šidák correction is, of course, extremely conservative, especially since the Δ-values from our comparisons are positively correlated. A more accurate correction for multiple comparisons would take into account that we have multiple significant results. Using an exact binomial test, the probability under $H_0$ that $\geq 3$ of 60 comparisons would be significant at the $10^{-3}$ level is $P = 3.3 \times 10^{-5}$. To be slightly more conservative, we could say that the two P-values in rows 1 and 2 of Table 2 that are $< 10^{-3}$ are in fact manifestations of the same arrangement of strains (Kiev, Wisconsin, and South Carolina); then, the probability that $\geq 2$ of 60 comparisons would be significant at the $10^{-3}$ level is $P = 1.7 \times 10^{-3}$.

Although it has long been believed that intragenic (homologous) recombination does not occur in influenza (Kilbourne 1978), the occurrence of nonhomologous recombination (Khatchikian et al. 1989; Orlich et al. 1994; Suarez et al. 2004) together with the data presented by Gibbs et al. (2001) suggests that homologous recombination in influenza may be possible. However, as pointed out by Worobey et al. (2002), the observed substitution pattern in the influenza hemagglutinin can also be explained by within-sequence rate variation that differs across the branches of the phylogeny (lineage-specific rate variation). Using pairwise comparisons among human sequences of the influenza A hemagglutinin, Worobey et al. described the HA1 region (nucleotide sites 151–920) as evolving more quickly than the HA2 region (sites 1–150 and 921–1695) in humans. If the opposite can be shown to be true for swine hemagglutinin sequences (that the HA2 evolves more quickly than the HA1), then the detected mosaicism in the 1918 human influenza hemagglutinin would be best explained by lineage-specific rate variation. This type of rate variation has also been called heterotachy (Lopez et al. 2002), and it was first introduced in the context of a changing set of concomitantly variable codons by Fitch and Markowitz (1970). It has been suggested that, for influenza A viruses, heterotachous or lineage-specific rate variation is a more likely evolutionary history than an intragenic recombination event (E. C. Holmes, personal communication).

SIMULATIONS

In addition to our Δ-method's theoretical appeal of being exact and nonparametric, we show that it has the practical advantages of speed, power, and a low false-positive rate.

TABLE 2

Mosaic structure in influenza A hemagglutinin gene

| p | q | c | Null | Observed maximum | P-value | Observed maximum descent | P-value | Dunn–Šidák: maximum | Dunn–Šidák: md |
| 1979h | 1961s | 1918h | $H_{148,159}$ | 7 | 0.45 | 46 (a) | $4.00 \times 10^{-4}$ | | 0.03 |
| 1961s | 1979h | 1918h | $H_{159,148}$ | 39 (b) | $7.85 \times 10^{-4}$ | 30 (c) | $5.22 \times 10^{-3}$ | 0.05 | NS |
| 1979h | 1961s | 1930s | $H_{94,218}$ | 0 | 1 | 124 | 1 | | |
| 1961s | 1979h | 1930s | $H_{218,94}$ | 124 | 1 | 10 (d) | $9.06 \times 10^{-3}$ | | |
| 1979h | 1961s | 1984h | $H_{92,220}$ | 0 | 1 | 128 | 1 | | |
| 1961s | 1979h | 1984h | $H_{220,92}$ | 128 | 1 | 12 (e) | $8.88 \times 10^{-4}$ | | 0.06 |

The first three columns refer to the five influenza sequences mentioned in the Influenza A section. Here, the sequences are referred to by year and whether the sequence is human (h) or swine (s). The last two columns show the Dunn–Šidák corrected P-values given that without any knowledge about which sequences are recombinant, 60 comparisons would have to be made to test all parent–parent–child combinations. The breakpoint descriptions listed in footnotes a–e refer to a gapped alignment of length 1778 nt; 80 positions are gapped.

(a) There are 90 pairs of breakpoints that result in a maximum descent of 46 units in the diagrammed walk from these three sequences. The first breakpoint is in position 242–247, while the second one is in 953–955/971–982.

(b) The maximum height of 39 for this triple can be attained by 15 different breakpoints, at positions 952–954/970–981.

(c) There are 30 pairs of breakpoints that result in a maximum descent of 30 for this sequence triple. The first breakpoint is in position 953–955/971–982; the second breakpoint is either in position 1653 or in position 1654.

(d) There are 45 pairs of breakpoints that result in a maximum descent of 10 for this sequence triple. The first breakpoint is in position 953–955/971–982; the second breakpoint is in position 1049–1051.

(e) There are 9 pairs of breakpoints that result in a maximum descent of 12 for this sequence triple. The first breakpoint is in position 953–955; the second breakpoint is in position 1049–1051.

Power and false positives: We compared the power and false-positive rates of our Δ-method to the 14 methods evaluated in Posada and Crandall (2001). Figure 2 duplicates the conditions of Figure 1 in Posada and Crandall (2001); in addition, two of the methods described by Carvajal-Rodríguez et al. (2006) are included in the top two rows of comparisons in Figure 2. Power and false-positive rates are tested for different values of the population-genetic parameter $\theta = 4N_e\mu L$, where $N_e$ is the effective population size, $\mu$ is the per site per generation mutation rate, and $L$ is the sequence length. Power is tested across different values of the recombination parameter $\rho = 4N_e rL$, where $r$ is the per site per generation recombination rate. False-positive rates are tested for different levels $\alpha$ of rate variation ($\alpha$ is the shape parameter of a fixed-mean $\Gamma$-distribution of evolutionary rates as in Yang 1996) since, as noted in the Neisseria and influenza examples, statistical tests for recombination can confound recombination and variation in mutation/fixation rates.
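To make the role of the shape parameter concrete, the following minimal Python sketch draws per-site relative rates from a gamma distribution with mean fixed at 1. It is an illustration only: the function name `site_rates`, the number of sites, and the particular shape values are assumptions of this sketch, not the settings used in the simulations.

```python
import random

def site_rates(alpha, n_sites, rng):
    """Draw per-site relative rates from a gamma distribution with mean 1."""
    if alpha == float("inf"):
        return [1.0] * n_sites  # infinite shape: every site evolves at the same rate
    # Shape alpha with scale 1/alpha gives mean 1 and variance 1/alpha.
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]

rng = random.Random(0)
for alpha in (0.05, 0.5, 2.0, float("inf")):  # illustrative shape values
    rates = site_rates(alpha, 20000, rng)
    mean = sum(rates) / len(rates)
    var = sum((r - mean) ** 2 for r in rates) / len(rates)
    print(f"alpha={alpha}: mean={mean:.2f}, variance={var:.2f}")
```

Because the variance is 1/α, small values of α concentrate most substitutions in a few fast sites, which is the kind of clustering that a recombination test could mistake for mosaic structure.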

The left column of Figure 2 shows the power of 14 (or 16) other methods as well as the power of our Δ-method, which was determined as follows. Each data point corresponds to 100 simulated sequence sets with 10 sequences in each set (details in Posada and Crandall 2001). In a set of 10 sequences, there are 720 unique parent–parent–child arrangements (10 × 9 × 8); the quantity $\Delta_{m,n,2}$ was calculated for each of these 720 triplets and the P-value associated with that quantity was computed with recursions (9)–(12). The minimum of these 720 P-values was corrected with a Dunn–Šidák correction and then reported as the P-value for rejecting clonal evolution in that 10-sequence set. This procedure was implemented in C++ as a command-line Linux program called 3SEQ; source code is available from the authors. The number of sets in which clonal evolution could be rejected at the 0.05 level was reported as the power of our Δ-method. The false-positive rates in the right-hand column of Figure 2 were computed in the same way.
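The set-level procedure described above can be sketched in a few lines of Python. This is an illustrative outline, not the 3SEQ source: the function names are hypothetical, and the per-triplet P-values are assumed to be supplied by some other routine (in the paper, the recursions).

```python
from itertools import permutations

def triplets(n_sequences):
    """All ordered (parent p, parent q, child c) index triples."""
    return list(permutations(range(n_sequences), 3))

def set_p_value(triplet_p_values):
    """Dunn-Sidak corrected minimum P-value over all triplets in one sequence set."""
    k = len(triplet_p_values)
    return 1.0 - (1.0 - min(triplet_p_values)) ** k

def power(sets_of_p_values, level=0.05):
    """Fraction of simulated sets in which clonal evolution is rejected."""
    rejected = sum(set_p_value(ps) < level for ps in sets_of_p_values)
    return rejected / len(sets_of_p_values)

print(len(triplets(10)))  # 720
print(len(triplets(5)))   # 60
```

Triples are counted as ordered because swapping the two parents changes the direction of the walk, as the paired rows of Table 2 show.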

Figure 2 shows that for a high enough mutation rate, our method is among the most powerful available for detecting recombination. For the sequence sets where $\theta = 10$, the mean pairwise distance within each set of 10 sequences ranges from 1 to 30 nt. Using $\Delta_{m,n,2}$ to test for recombination requires a minimum of nine informative sites to reject clonality at the 0.05 level; when correcting with a Dunn–Šidák correction for 720 comparisons, a minimum of 20 informative sites is needed. For this reason, our method has low power for data sets with little polymorphism. For the tested parameter combinations, our false-positive rate is at most 2% and among the lowest of all methods tested. It is important to note that some of the more powerful methods in the left-hand column had high false-positive rates in the right column. The plots in supplemental Figure S1 (http://www.genetics.org/supplemental/) show the ratios of power to false-positive rate for the 16 methods from Figure 2.
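The thresholds of 9 and 20 informative sites can be reproduced with a short calculation. The Python sketch below rests on an assumption about the walk formulation rather than on the paper's recursions: sites where the child matches one parent are up-steps, sites where it matches the other are down-steps, and under the null all orderings of m up-steps and n down-steps are equally likely. The most extreme outcome is then a maximum descent of n, which requires the n down-steps to form one contiguous block in one of m + 1 positions. Function names are illustrative.

```python
from math import comb

def smallest_p_value(m, n):
    """P(max descent = n): the n down-steps form one block, in one of m + 1 positions."""
    return (m + 1) / comb(m + n, n)

def sidak_threshold(alpha, k):
    """Per-comparison level that keeps the family-wise level at alpha over k tests."""
    return 1.0 - (1.0 - alpha) ** (1.0 / k)

def min_informative_sites(alpha, k=1, max_sites=200):
    """Smallest m + n for which some (m, n) split can reach significance."""
    threshold = sidak_threshold(alpha, k)
    for total in range(2, max_sites + 1):
        if any(smallest_p_value(m, total - m) < threshold for m in range(1, total)):
            return total
    return None

print(min_informative_sites(0.05))         # 9
print(min_informative_sites(0.05, k=720))  # 20
```

With nine sites the best case is m = 4, n = 5, giving 5/126 ≈ 0.040; with eight sites no split falls below 0.05.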

Figure 2. Power and false-positive comparisons to the 14 methods tested in Posada and Crandall (2001). The top four graphs include two additional LPT methods described in Carvajal-Rodríguez et al. (2006). The graphs in the left column plot power under different recombination rates, while the right-hand column shows false-positive rates when there is variation in mutation rates but recombination is not present; $\alpha = \infty$ means that there is no rate variation, while lower values of $\alpha$ indicate higher rate variation. The red line shows the power and false-positive rate of $\Delta_{m,n,2}$ in detecting recombination. The gray lines show the power and false-positive rates of 14 (or 16) other methods. $\alpha = \infty$ in the left column; $\rho = 0$ in the right column.

Supplemental Figure S2 at http://www.genetics.org/supplemental/ shows an additional false-positive analysis in data sets generated with autocorrelated mutation rates (from Figure 5c of Bruen et al. 2006); our false-positive rate never exceeded 3.2% for these data sets. Supplemental Figure S3 at http://www.genetics.org/supplemental/ shows a power analysis under conditions with population growth, using the simulated data from Figure 4 of Bruen et al. (2006). $\Delta_{m,n,2}$ is quite powerful under a scenario of population growth (as long as sequence diversity is high enough), and it retains very high power even when the recombination parameter $\rho$ is small.

Since our statistical test is designed for sequence triplets, we perform an additional power analysis that focuses exclusively on detecting recombination in sets of three sequences. We compare $\Delta_{m,n,2}$ to three other common statistical tests designed to identify recombination in sequence triplets (a total of eight methods were tested, of which the three most powerful are shown in Figure 3; details of and results for all eight methods are in the supplemental materials at http://www.genetics.org/supplemental/). For each data point in Figure 3, the program TREEVOLVE (Grassly et al. 1999) was used to generate 100 replicates of three sequences with the given population-genetic parameters, using the F84 model of nucleotide substitution (Felsenstein and Churchill 1996) with the four nucleotide frequencies set to 0.4, 0.2, 0.1, and 0.3 and a transition/transversion ratio of two. The black line in Figure 3 denotes the power and false-positive rate of a single-breakpoint version of Chimaera with exact P-value computations (Posada and Crandall 2001; Spencer 2003), the gray line corresponds to the most recent version of Chimaera (Chim-2006), and the blue line corresponds to the Martin–Rybicki method with window size 30 nt and step size 1 nt.

Figure 3. Power and false-positive comparisons with MR and Chimaera on sequence triplets. The red line shows power and false-positive rates for $\Delta_{m,n,2}$. The black line shows the power and false-positive rates for Chim-Sp, a single-breakpoint no-window Chimaera implementation (described on p. 14 of the supplemental materials of Posada and Crandall 2001) whose P-values were calculated using the method of Spencer (2003). The gray line shows the power and false-positive rates of Chim-2006, a new Chimaera implementation with a sliding-window and sliding-breakpoint scheme; P-values were computed by permuting alignment columns 1000 times. The blue line shows the power and false-positive rates for MR-30,1 (Martin–Rybicki method with window size 30 nt and step size 1 nt). The third column shows the ratio of power to false-positive rate at $\alpha = \infty$. False-positive rates at $\alpha = \infty$ were calculated with 1000 simulated triplets; all other data points were calculated with 100 simulated triplets. $\alpha = \infty$ in the left column; $\rho = 0$ in the middle column.

For statistical identification of mosaic structure in sequence triplets, our Δ-method is as powerful as the most powerful methods available. All four methods in Figure 3 have similar power and false-positive rates, with the distinguishing feature that Chimaera is the least conservative method, MR is the most conservative, and $\Delta_{m,n,2}$ is somewhere in between. For $\theta \geq 50$, $\Delta_{m,n,2}$ has the best combination of power and false-positive rate.

Speed: Table 3 shows the computation times of our method compared to MR and Chimaera. Our method has a clear advantage, especially in large data sets, since P-values are simply read from memory once a table of values indexed by m, n, k, and j is built. For example, analysis of the influenza data (Boni 2007) requires reading 29 million P-values from memory, which is not a time-consuming task for a 3.2-GHz processor. Likewise, computing exact P-values using the method described by Spencer (2003) is quite fast; this is slightly slower than our Δ-method since a new table needs to be built for each P-value computation. On the other hand, performing 14.5 million sliding-window $\chi^2$-computations on each of 1000 randomized data sets (Chim-2006) or computing 9.6 million P-values from a binomial distribution for each of 287 possible windows (MR-30,1) can be quite computationally expensive.
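The workload figures quoted in this paragraph follow from simple counting on the influenza data set in Table 3 (308 sequences, 316 segregating sites). The Python sketch below shows one decomposition that reproduces them; which symmetry each program actually exploits is an assumption of this sketch, not something stated in the text.

```python
from math import comb

n_seq = 308       # influenza HA sequences (Table 3)
seg_sites = 316   # segregating sites (Table 3)
window = 30       # MR-30,1 window size

ordered = n_seq * (n_seq - 1) * (n_seq - 2)  # parent-parent-child triples
print(ordered)                   # 28934136, about 29 million
print(ordered // 2)              # 14467068, about 14.5 million (parents unordered)
print(2 * comb(n_seq, 3))        # 9644712, about 9.6 million
print(seg_sites - window + 1)    # 287 window positions
```

Multiplying 14.5 million by 1000 permutations, or 9.6 million by 287 windows, gives several billion elementary computations, against 29 million table lookups for the Δ-method.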

Note that nontriplet methods can be much faster than triplet methods. For example, analyzing the data in Table 3 with $\Phi_w$ (Bruen et al. 2006) takes seconds, but the recombinant sequences cannot be isolated.

DISCUSSION

Comparison: Many statistical methods have already been developed for detecting recombination from sequence data. The usual recombination signals that these methods attempt to identify are (i) varying patterns of sequence identity, (ii) phylogenetic incongruencies, (iii) excess homoplasies, (iv) clustered polymorphism, and (v) low linkage disequilibrium; our method is of the first type. Here, we summarize the main similarities and differences between our method and previous ones, along with their respective advantages and disadvantages.

Most importantly, our method considers three sequences at a time using the appropriate mechanistic framework in which to view mosaic structure: the existence of one sequence that is a mosaic of a second and a third. Maynard Smith (1992) also acknowledged this as the appropriate framework, although the test he developed is designed for two sequences. Maynard Smith's maximum $\chi^2$-method was later reformulated as a proper three-sequence problem and is now called maximum-match $\chi^2$ or Chimaera (Posada and Crandall 2001; Posada 2002). Takahata (1994) recognized that one needed to look at a minimum of three sequences by focusing on sites that support a particular sister-group status where exactly two of three nucleotides agree. The BOOTSCAN search method (Salminen et al. 1995) examines candidate recombinants to see how different regions cluster with either of two parental sequences; bootstrap support, rather than a significance test, provides a measure of reliability of the proposed clustering. Recently, Martin et al. (2005) modified the BOOTSCAN method to search only sequence triples and to find recombinants statistically using the binomial test in Martin and Rybicki (2000). Finally, Holmes et al. (1999) describe a phylogenetic method called LARD that considers three sequences at a time and tests the hypothesis of completely clonal evolution vs. the hypothesis of clonal evolution for segments on either side of a breakpoint; their problem is formulated similarly to ours, the main difference being that their method focuses on phylogeny. It should be noted that some methods (Robertson et al. 1995; Gibbs et al. 2000) require four sequences: three involved in a recombination event and a fourth used as an outgroup.

The mechanistic three-sequence approach contrasts with approaches that attempt to identify indirect signals from sequence data, such as an excess of homoplasies (Hudson and Kaplan 1985; Jakobsen and Easteal 1996; Maynard Smith and Smith 1998; Maynard Smith 1999; Bruen et al. 2006) or a clustering of polymorphisms (Stephens 1985; Maynard Smith 1992; Martin and Rybicki 2000) that would be indicative of a recent recombination or gene conversion. While these methods can be quite effective, one must keep in mind that polymorphism clustering can be caused by selection or mutational hotspots and that an excess of homoplasies can be quite difficult to detect in rapidly mutating organisms such as RNA viruses.

TABLE 3

Computation times (last four columns) for computing recombination statistics and P-values in large data sets

| Data set | Segregating sites | No. sequences | P-value | $\Delta_{m,n,2}$ | MR-30,1 | Chim-Sp | Chim-2006 |
| Dengue E | 618 | 69 | $3.3 \times 10^{-5}$ | 2 min | 86 min | 24 min | 100 hr |
| Human mtDNA | 1079 | 262 | $4.6 \times 10^{-3}$ | 6 min | 180 hr | 48 hr | 550 days |
| Influenza HA | 316 | 308 | 1 | 4 min | 43 hr | 9 hr | 105 days |

All times and estimates are for a single 3.2-GHz processor. Dengue data are serotype 2 from Holmes et al. (1999); human mitochondrial DNA sequences are a subset of distinct strains from Kivisild et al. (2006); influenza sequences are New Zealand H3N2 isolates from 2000–2005 analyzed in Boni (2007). The P-value reported in this table is the minimum P-value (testing with $\Delta_{m,n,2}$) from all comparisons in a data set, corrected with a Dunn–Šidák correction.

Our method has several technical advantages. First, we do not use Monte Carlo methods to generate P-values, which makes our P-value computations very fast. Moreover, once a table is built in memory to calculate a particular $x_{m,n,k}$, successive P-values can simply be extracted from the table; this means that repeated application of our Δ-tests is limited only by how quickly the computer's memory can be accessed. Monte Carlo methods have the additional disadvantage that the precision of computed P-values is limited by the number of permutations that can be done; this could be problematic in large data sets where precise P-values may be needed to survive multiple-comparisons corrections. Second, we avoid the widely used sliding-window approaches (Salminen et al. 1995; Siepel et al. 1995; Grassly and Holmes 1997; Lole et al. 1999; Martin and Rybicki 2000; Strimmer et al. 2003; Martin et al. 2005) that require the user to define a window size at the scale at which recombination is believed to have occurred. By considering all possible breakpoints in expression (4), we find the optimal "window size" that should be used for inferring recombination in a particular sequence triplet. This allows for the detection of recombinant segments at any scale.
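The precision limit of Monte Carlo P-values mentioned above can be quantified with a rough calculation. The Python sketch below assumes the common convention that the smallest P-value attainable from B permutations is 1/(B + 1), and it uses the influenza data set's count of about 29 million comparisons as an example; the function names are illustrative.

```python
import math

def sidak_threshold(alpha, k):
    """Per-comparison P-value needed to stay below alpha after k comparisons."""
    return -math.expm1(math.log1p(-alpha) / k)

def sufficient_permutations(alpha, k):
    """A permutation count B for which 1 / (B + 1) falls below the threshold."""
    return math.ceil(1.0 / sidak_threshold(alpha, k))

k = 308 * 307 * 306                                    # ordered triplets, influenza data set
print(f"{sidak_threshold(0.05, k):.2e}")               # 1.77e-09
print(sufficient_permutations(0.05, k) > 1000)         # True
```

A scheme with 1000 permutations cannot report anything below roughly 10^-3, so no single triplet could survive a correction at this scale, whereas an exact P-value has no such floor.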

By removing uninformative sites, our Δ-method should not confound variation in mutation/fixation rates with recombination; indeed, the middle column of Figure 3 and supplemental Figure S2 at http://www.genetics.org/supplemental/ show that even under high rate variation our false-positive rate is at most 5% (and usually <3%). However, lineage-specific or heterotachous rate variation can, in the absence of recombination, produce the pattern that leads our Δ-tests to reject clonality. Consider the tree in Figure 4. Branch 1 connects the root to sequence p while branch 2 connects the q–c common ancestor to sequence q. Differential environmental pressures on branches 1 and 2 can create the impression of mosaic structure. Suppose that the organism, during its evolution along branch 1, experiences an environment where the right-hand side of the sequence evolves rapidly and accumulates many substitutions while the left-hand side is either conserved or mutates neutrally. Suppose further that the organism, during evolution along branch 2, experiences an environment where the left-hand side of the sequence evolves rapidly and accumulates substitutions while the right-hand side is conserved or mutates neutrally. Under this scenario of clonal evolution, where environmental pressure increases substitution rates in the right part of the sequence on branch 1 and in the left part of the sequence on branch 2, the resulting sequence triple (p, q, c) will give the appearance that a recombination event occurred. In this case, the right part of sequence c will be very similar to sequence q while the left part will be very similar to sequence p. This type of sequence identity in different sequence regions is exactly what our Δ-statistics are designed to reveal. While this combination of events may seem unlikely, the influenza sequences described here may have undergone just such evolutionary pressures. A key component in this scenario where mosaic structure is generated without recombination is that the organism experiences different selective environments on different branches of its phylogeny.

Figure 4. Phylogenetic tree that shows a possible clonal evolutionary history for the sequences p, q, and c. Mutations occurring in branch 1 will result in an informative site of type Q, while mutations occurring in branch 2 will result in an informative site of type P. The distributions describing the probability that the mutations in branch 1 or 2 cluster on either side of a breakpoint or between some pair of breakpoints are those of $\Delta_{m,n,1}$ and $\Delta_{m,n,2}$.

General conclusions: We have introduced exact, nonparametric statistical tests for identifying nucleotide sequence mosaic structure with one or two breakpoints. Our test statistic is a function of a given sequence triple where one sequence is hypothesized to be a recombinant of the other two. Given a sequence triple, we calculate the difference in proximity (to the child sequence) between the closer parent sequence and the closest candidate recombinant sequence. This difference is denoted $\Delta_{m,n,b}$, where m and n describe the numbers of informative sites at which the child sequence clusters with one or the other parent and b denotes the number of breakpoints allowed in a candidate recombinant, and it is studied as a random variable under the null hypothesis of clonal evolution. The distribution of $\Delta_{m,n,1}$ has been described in the probability literature on ballot problems, while the distribution of $\Delta_{m,n,2}$ has been approximated but not described exactly. With brute-force methods, exact probabilities of the distribution of $\Delta_{m,n,2}$ would require exponentially growing computation times that would become unmanageable once $m + n > 35$. To remedy this problem, we derive a set of recursive equations to calculate the probability mass function of $\Delta_{m,n,2}$ in time polynomial in m and n. These calculations can be performed in seconds on a single-processor personal computer (3 GHz, 2 GB RAM) as long as $m + n < 250$. When $250 < m + n < 400$, most computations are equally quick although some may require additional memory or the use of virtual memory.
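The brute-force alternative mentioned in this paragraph is easy to write down for small m and n, which also shows why it does not scale. The Python sketch below is an illustration under stated assumptions, not the recursive method of the paper: it treats one type of informative site as an up-step and the other as a down-step, takes all orderings as equally likely under the null, and enumerates every ordering. Function names are hypothetical.

```python
from itertools import combinations
from math import comb

def max_descent(steps):
    """Largest drop from a running maximum in a walk of +1/-1 steps starting at 0."""
    height = peak = drop = 0
    for s in steps:
        height += s
        peak = max(peak, height)
        drop = max(drop, peak - height)
    return drop

def max_descent_pmf(m, n):
    """Exact null pmf of the maximum descent, enumerating all C(m+n, n) orderings."""
    total = m + n
    counts = {}
    for downs in combinations(range(total), n):
        down_set = set(downs)
        steps = [-1 if i in down_set else 1 for i in range(total)]
        d = max_descent(steps)
        counts[d] = counts.get(d, 0) + 1
    n_walks = comb(total, n)
    return {d: c / n_walks for d, c in sorted(counts.items())}

pmf = max_descent_pmf(4, 5)
print(pmf[5])         # 5/126, about 0.0397
print(comb(36, 18))   # 9075135300 orderings already at m = n = 18
```

The number of orderings grows roughly as 2^(m+n), so enumeration stalls in the mid-thirties, while a recursion over table entries does not.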

Our method relies on deducing parent–child sequence identity for different parents in different sequence regions. If a recombination occurred between sequences p and q to create the sequence c, then one segment of sequence c should be more similar to parent p while the remaining segment(s) of sequence c should be more similar to parent q. If this pattern is statistically significant (i.e., if it appears in the far right-hand tail of the distribution of $\Delta_{m,n,b}$), we deduce that a recombination occurred.
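A toy example may help fix ideas. The Python sketch below reduces an aligned triple to its informative sites and computes the maximum descent of the resulting walk. The sequences are made up for illustration, the choice of +1 for type P sites and -1 for type Q sites is an assumed sign convention, and skipping gapped columns is a simplification of this sketch.

```python
def informative_steps(p, q, c):
    """Map aligned sequences to +1 (c matches p only) and -1 (c matches q only)."""
    steps = []
    for a, b, x in zip(p, q, c):
        if "-" in (a, b, x):
            continue             # skip gapped columns (an assumption of this sketch)
        if x == a and x != b:
            steps.append(+1)     # informative site of type P
        elif x == b and x != a:
            steps.append(-1)     # informative site of type Q
    return steps

def max_descent(steps):
    """Largest drop from a running maximum in the walk."""
    height = peak = drop = 0
    for s in steps:
        height += s
        peak = max(peak, height)
        drop = max(drop, peak - height)
    return drop

# Hypothetical triple: c looks like p at both ends and like q in the middle.
p = "TAAAAAAAAAAA"
q = "TCCCCCCCCCCC"
c = "TAAACCCCGAAA"
steps = informative_steps(p, q, c)
m, n = steps.count(+1), steps.count(-1)
print(m, n, max_descent(steps))  # 6 4 4
```

The first column (identical in all three sequences) and the ninth (where c matches neither parent) are uninformative and drop out, leaving six type P sites, four type Q sites, and a descent that spans the whole q-like segment.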

Our Δ-method is among the most powerful available for detecting recombination in sequence data, even in highly recombinant data sets (generating data sets as in Figure 2 with $\rho = 128$, our method had 100% power for $\theta \geq 50$) or in data sets generated under conditions of population growth (see supplemental Figure S3 at http://www.genetics.org/supplemental/). For many of the simulated data sets in this article, $\Delta_{m,n,2}$ appears to have the best combination of power and low false-positive rate. With comparable power to the best available methods, the most immediate practical advantage of using $\Delta_{m,n,2}$ over other methods is its speed in large data sets. As can be seen in Table 3, computing P-values from $\Delta_{m,n,2}$ can be several orders of magnitude faster than other triplet methods, depending on the number of sequences and the amount of polymorphism in the data set. For N sequences, triplet methods will make on the order of $N^3$ comparisons, which for $N > 1000$ can be quite a large number for a personal computer. For example, 1000 influenza sequences with a similar level of polymorphism as in Table 3 would take 137 min to analyze with $\Delta_{m,n,2}$ while 2000 sequences would take 18 hr. Fortunately, our method (along with most triplet methods) is completely parallelizable, which means that as sequence databases grow we can take advantage of parallel computing to search for recombinants in very large data sets. Note that if we have a particular query sequence that we would like to test for recombination, the number of comparisons is of order $N^2$.
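The running-time estimates in this paragraph are consistent with cubic scaling from the influenza entry in Table 3 (308 sequences, 4 min). The Python sketch below makes that extrapolation explicit; the function names are illustrative, and the reference time is the rounded value from the table, so the results are approximate.

```python
def extrapolated_minutes(n_new, n_ref=308, minutes_ref=4.0):
    """Scale a reference running time by (n_new / n_ref) cubed."""
    return minutes_ref * (n_new / n_ref) ** 3

def query_comparisons(n):
    """Ordered parent pairs when one sequence is fixed as the candidate child."""
    return (n - 1) * (n - 2)

print(round(extrapolated_minutes(1000)))        # 137 (minutes)
print(round(extrapolated_minutes(2000) / 60))   # 18 (hours)
print(query_comparisons(1000))                  # 997002
```

Doubling the number of sequences multiplies the work by eight, whereas fixing a single query sequence as the child reduces the search to about a million comparisons for 1000 sequences.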

Our choice of applications here represents only a small sample of the clonal or nearly clonal sequences we could analyze with our Δ-statistics. They would also be quite useful in finding recombinants in human immunodeficiency virus databases and in larger dengue virus data sets and in analyzing the recently suggested recombinants in measles (Schierup et al. 2005). Human mitochondrial DNA is generally believed to evolve clonally, although the data set in Table 3 has quite strong mosaic signals; a reanalysis of other mtDNA data sets (Piganeau and Eyre-Walker 2004; Piganeau et al. 2004) would help determine whether recombination occurred during the evolution of the mitochondrion. For the influenza virus, our test could be used on whole (concatenated) influenza genomes, as in Holmes et al. (2005), to detect possible reassortment; hundreds of sequenced whole influenza genomes have already been analyzed (Nelson et al. 2006) and thousands more have been deposited in GenBank. As sequence databases expand in the genomic era, the Δ-method presented here could become one of the most efficient methods for detecting recombination and finding recombinants in large data sets.

We thank E. C. Holmes for many discussions, especially on the rate variation scenario for influenza; we thank T. C. Bruen for providing data sets for power analysis and false-positive analysis; and we thank N. A. Rosenberg, J. M. Macpherson, and J. Van Cleve for helpful comments and suggestions. An anonymous editor pointed us to the known result in Equation 7. This work was funded in part by National Institutes of Health grants GM28016 (M.F.B., M.W.F.) and HG000205 (M.F.B.). D.P. is funded by grant BFU2004-02700 of the Spanish Ministry of Education and Science and by the Ramón y Cajal program of the Spanish government.

LITERATURE CITED

Ardlie, K. G., L. Kruglyak and M. Seielstad, 2002 Patterns of linkage disequilibrium in the human genome. Nat. Rev. Genet. 3: 299–309.

Awadalla, P., 2003 The evolutionary genomics of pathogen recombination. Nat. Rev. Genet. 4: 50–60.

Awadalla, P., A. Eyre-Walker and J. Maynard Smith, 1999 Linkage disequilibrium and recombination in hominid mitochondrial DNA. Science 286: 2524–2525.

Balding, D. J., R. A. Nichols and D. M. Hunt, 1992 Detecting gene conversion: primate visual pigment genes. Proc. R. Soc. Lond. Ser. B 249: 275–280.

Barton, D. E., and C. L. Mallows, 1965 Some aspects of the random sequence. Ann. Math. Stat. 36: 236–260.

Boni, M. F., 2007 Vaccination and antigenic drift in influenza. Vaccine (in press).

Brown, C., E. C. Garner, A. K. Dunker and P. Joyce, 2001 The power to detect recombination using the coalescent. Mol. Biol. Evol. 18: 1421–1424.

Bruen, T. C., H. Philippe and D. Bryant, 2006 A simple and robust statistical test detecting the presence of recombination. Genetics 172: 2665–2681.

Carvajal-Rodríguez, A., K. A. Crandall and D. Posada, 2006 Recombination estimation under complex evolutionary models with the coalescent composite-likelihood method. Mol. Biol. Evol. 23: 817–826.

Crandall, K. A., and A. R. Templeton, 1999 Statistical methods for detecting recombination, pp. 153–176 in The Evolution of HIV, edited by K. A. Crandall. Johns Hopkins University Press, Baltimore.

Feller, W., 1957 An Introduction to Probability Theory and Its Applications, Vol. I. John Wiley & Sons, New York.

Felsenstein, J., and G. A. Churchill, 1996 A hidden Markov model approach to variation among sites in rate of evolution. Mol. Biol. Evol. 13: 93–104.

Fitch, W. M., and E. Markowitz, 1970 An improved method for determining codon variability in a gene and its application to the rate of fixation of mutations in evolution. Biochem. Genet. 4: 579–593.

Gabriel, S. B., S. F. Schaffner, H. Nguyen, J. M. Moore, J. Roy et al., 2002 The structure of haplotype blocks in the human genome. Science 296: 2225–2229.

Gibbs, M. J., J. S. Armstrong and A. J. Gibbs, 2000 Sister-scanning: a Monte Carlo procedure for assessing signals in recombinant sequences. Bioinformatics 16: 573–582.

Gibbs, M. J., J. S. Armstrong and A. J. Gibbs, 2001 Recombination in the hemagglutinin gene of the 1918 "Spanish flu." Science 293: 1842–1845.

Goss, P. J. E., and R. C. Lewontin, 1996 Detecting heterogeneity of substitution along DNA and protein sequences. Genetics 143: 589–602.

Grassly, N. C., and E. C. Holmes, 1997 A likelihood method for the detection of selection and recombination using nucleotide sequences. Mol. Biol. Evol. 14: 239–247.

Grassly, N. C., P. H. Harvey and E. C. Holmes, 1999 Population dynamics of HIV-1 inferred from gene sequences. Genetics 151: 427–438.

Halkett, F., J.-C. Simon and F. Balloux, 2005 Tackling the population genetics of clonal and partially clonal organisms. Trends Ecol. Evol. 20: 194–201.

Hogan, M. L., and D. Siegmund, 1986 Large deviations for the maxima of some random fields. Adv. Appl. Math. 7: 2–22.

Holmes, E. C., M. Worobey and A. Rambaut, 1999 Phylogenetic evidence for recombination in dengue virus. Mol. Biol. Evol. 16: 405–409.

Holmes, E. C., E. Ghedin, N. Miller, J. Taylor, Y. Bao et al., 2005 Whole-genome analysis of human influenza A virus reveals multiple persistent lineages and reassortment among recent H3N2 viruses. PLoS Biol. 3: e300.

Hudson, R. R., and N. L. Kaplan, 1985 Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics 111: 147–164.

Husmeier, D., and G. McGuire, 2003 Detecting recombination in 4-taxa DNA sequence alignments with Bayesian hidden Markov models and Markov chain Monte Carlo. Mol. Biol. Evol. 20: 315–337.

Jakobsen, I. B., and S. Easteal, 1996 A program for calculating and displaying compatibility matrices as an aid in determining reticulate evolution in molecular sequences. Comput. Appl. Biosci. 12: 291–295.

Karlin, S., and V. Brendel, 1992 Chance and statistical significance in protein and DNA sequence analysis. Science 257: 39–49.

Karlin, S., and A. Dembo, 1992 Limit distributions of maximal segmental score among Markov-dependent partial sums. Adv. Appl. Probab. 24: 113–140.

Karlin, S., A. Dembo and T. Kawabata, 1990 Statistical composition of high-scoring segments from molecular sequences. Ann. Stat. 18: 571–581.

Khatchikian, D., M. Orlich and R. Rott, 1989 Increased viral pathogenicity after insertion of a 28S ribosomal RNA sequence into the hemagglutinin gene of an influenza virus. Nature 340: 156–157.

Kilbourne, E. D., 1978 Molecular epidemiology—inﬂuenza as ar-

chetype. Harvey Lect. 73: 225–258.

Kivisild, T., P. Shen,D.P.Wall,B.Do,R.Sung et al., 2006 The role

of selection in the evolution of human mitochondrial genomes.

Genetics 172: 373–387.

Lole, K. S., R. C. Bollinger,R.S.Paranjape,D.Gadkari,S.S.

Kulkarni et al., 1999 Full-length immunodeﬁciency virus type

1 genomes from subtype c–infected seroconverters in india, with

evidence of intersubtype recombination. J. Virol. 73: 152–160.

Lopez, P., D. Casane and H. Philippe, 2002 Heterotachy, an impor-

tant process of protein evolution. Mol. Biol. Evol. 19: 1–7.

Martin, D., and E. Rybicki, 2000 RDP: detection of recombination

amongst aligned sequences. Bioinformatics 16: 562–563.

Martin, D. P., D. Posada,K.A.Crandall and C. Williamson,

2005 A modiﬁed bootscan algorithm for automated identiﬁca-

tion of recombinant sequences and recombination breakpoints.

AIDS Res. Hum. Retroviruses 21: 98–102.

Maynard Smith, J., 1992 Analyzing the mosaic structure of genes.

J. Mol. Evol. 34: 126–129.

Maynard Smith, J., 1999 The detection and measurement of re-

combination from sequence data. Genetics 153: 1021–1027.

Maynard Smith, J., and N. H. Smith, 1998 Detecting recombina-

tion from gene trees. Mol. Biol. Evol. 15: 590–599.

Maynard Smith, J., N. H. Smith, M. O’Rourke and B. G. Spratt,

1993 How clonal are bacteria? Proc. Natl. Acad. Sci. USA 90:

4384–4388.

Moya, A., E. C. Holmes and F. González-Candelas, 2004 The population genetics and evolutionary epidemiology of RNA viruses. Nat. Rev. Microbiol. 2: 279–288.

Nelson, M. I., L. Simonsen, C. Viboud, M. A. Miller, J. Taylor et al., 2006 Stochastic processes are key determinants of the short-term evolution of influenza A virus. PLoS Pathog. 2: e125.

Orlich, M., H. Gottwald and R. Rott, 1994 Nonhomologous recombination between the hemagglutinin gene and the nucleoprotein gene of an influenza virus. Virology 204: 462–465.

Piganeau, G., and A. Eyre-Walker, 2004 A reanalysis of the indirect evidence for recombination in human mitochondrial DNA. Heredity 92: 282–288.

Piganeau, G., M. Gardner and A. Eyre-Walker, 2004 A broad survey of recombination in animal mitochondria. Mol. Biol. Evol. 21: 2319–2325.

Posada, D., 2002 Evaluation of methods for detecting recombination

from DNA sequences: empirical data. Mol. Biol. Evol. 19: 708–717.

Posada, D., and K. A. Crandall, 2001 Evaluation of methods for detecting recombination from DNA: computer simulations. Proc. Natl. Acad. Sci. USA 98: 13757–13762.

Posada, D., K. A. Crandall and E. C. Holmes, 2002 Recombination in evolutionary genomics. Annu. Rev. Genet. 36: 75–97.

Pritchard, J. K., and M. Przeworski, 2001 Linkage disequilibrium

in humans: models and data. Am. J. Hum. Genet. 69: 1–14.

Robertson, D. L., B. H. Hahn and P. M. Sharp, 1995 Recombination in AIDS viruses. J. Mol. Evol. 40: 249–259.

Salminen, M. O., J. K. Carr, D. S. Burke and F. E. McCutchan, 1995 Identification of breakpoints in intergenotypic recombinants of HIV type 1 by bootscanning. AIDS Res. Hum. Retroviruses 11: 1423–1425.

Sawyer, S., 1989 Statistical tests for detecting gene conversion. Mol.

Biol. Evol. 6: 526–538.

Schierup, M. H., C. H. Mordhorst, C. P. Muller and L. S. Christensen, 2005 Evidence of recombination among early-vaccination era measles virus strains. BMC Evol. Biol. 5: 52.

Siegmund, D., 1986 Boundary crossing probabilities and statistical

applications. Ann. Stat. 14: 361–404.

Siegmund, D., 1988 Approximate tail probabilities for the maxima of some random fields. Ann. Probab. 16: 487–501.

Siepel, A. C., A. L. Halpern, C. Macken and B. T. M. Korber, 1995 A computer program designed to screen rapidly for HIV type 1 intersubtype recombinant sequences. AIDS Res. Hum. Retroviruses 11: 1413–1416.

Sneath, P. H. A., 1995 The distribution of the random division of a

molecular sequence. Binary Comput. Microbiol. 7: 148–152.

Sneath, P. H. A., 1998 The effect of evenly spaced constant sites on the distribution of the random division of a molecular sequence. Bioinformatics 14: 608–616.

Spencer, M., 2003 Exact significance levels for the maximum χ² method of detecting recombination. Bioinformatics 19: 1368–1370.

Stephens, J. C., 1985 Statistical methods of DNA sequence analysis:

detection of intragenic recombination or gene conversion. Mol.

Biol. Evol. 2: 539–556.

Strimmer, K., K. Forslund, B. Holland and V. Moulton, 2003 A novel exploratory method for visual recombination detection. Genome Biol. 4: R33.

Stumpf, M. P. H., and G. A. T. McVean, 2003 Estimating recombination rates from population-genetic data. Nat. Rev. Genet. 4: 959–968.

Suarez, D. L., D. A. Senne, J. Banks, I. H. Brown, S. C. Essen et al., 2004 Recombination resulting in virulence shift in avian influenza outbreak, Chile. Emerg. Infect. Dis. 10: 693–699.

Takahata, N., 1994 Comments on the detection of reciprocal recombination or gene conversion. Immunogenetics 39: 146–149.

Wall, J. D., 1999 Recombination and the power of statistical tests of

neutrality. Genet. Res. 74: 65–79.

Wall, J. D., 2000 A comparison of estimators of the population recombination rate. Mol. Biol. Evol. 17: 156–163.

Whitworth, W. A., 1901 Choice and Chance, Ed. 5. Hafner Publishing, New York.

Wilson, D. J., D. Falush and G. McVean, 2005 Germs, genomes,

and genealogies. Trends Ecol. Evol. 20: 39–45.

Wiuf, C., T. Christensen and J. Hein, 2001 A simulation study of

the reliability of recombination detection methods. Mol. Biol.

Evol. 18: 1929–1939.

Worobey, M., 2001 A novel approach to detecting and measuring

recombination: new insights into evolution of viruses, bacteria,

and mitochondria. Mol. Biol. Evol. 18: 1425–1434.

Worobey, M., A. Rambaut, O. G. Pybus and D. L. Robertson, 2002 Questioning the evidence for genetic recombination in the 1918 “Spanish flu” virus. Science 296: 211a.

Yang, Z., 1996 Among-site variation and its impact on phylogenetic

analyses. Trends Ecol. Evol. 11: 367–371.

Zhou, J., and B. G. Spratt, 1992 Sequence diversity within the argF, fbp and recA genes of natural isolates of Neisseria meningitidis: interspecies recombination within the argF gene. Mol. Microbiol. 6: 2135–2146.

Communicating editor: M. K. Uyenoyama



