Estimating Gene Gain and Loss Rates in the Presence of Error in Genome Assembly and Annotation Using CAFE 3

May 2013
Molecular Biology and Evolution 30(8)

DOI:10.1093/molbev/mst100

Authors:

Jose Lugo-Martinez

Indiana University Bloomington

Estimating the error distribution. Each panel shows the −ln likelihood of individual CAFE runs on a simulated data set with error. The points represent the score of the run with the corresponding error model (ε value) on the x axis; the dashed vertical line indicates the true amount of error added to the simulated data. Panel (A) shows a simulation with ε = 0.1 and panel (B) shows one with ε = 0.4. In both cases, the maximum likelihood score occurs when the correct error model is used.

Mira V. Han,*,1,2 Gregg W.C. Thomas, Jose Lugo-Martinez, and Matthew W. Hahn*,2,3

1 National Evolutionary Synthesis Center, Durham, North Carolina

2 School of Informatics and Computing, Indiana University

3 Department of Biology, Indiana University

*Corresponding authors: E-mail: mira.han@nescent.org; mwh@indiana.edu.

Associate editor: Sudhir Kumar

Abstract

Current sequencing methods produce large amounts of data, but genome assemblies constructed from these data are often fragmented and incomplete. Incomplete and error-filled assemblies result in many annotation errors, especially in the number of genes present in a genome. This means that methods attempting to estimate rates of gene duplication and loss often will be misled by such errors and that rates of gene family evolution will be consistently overestimated. Here, we present a method that takes these errors into account, allowing one to accurately infer rates of gene gain and loss among genomes even with low assembly and annotation quality. The method is implemented in the newest version of the software package CAFE, along with several other novel features. We demonstrate the accuracy of the method with extensive simulations and reanalyze several previously published data sets. Our results show that errors in genome annotation do lead to higher inferred rates of gene gain and loss but that CAFE 3 sufficiently accounts for these errors to provide accurate estimates of important evolutionary parameters.

Key words: duplication, gene family, adaptive evolution.

Introduction

Genome sequencing projects have revealed large and frequent changes between species in the size of gene families (e.g., Gibbs et al. 2004, 2007; Drosophila 12 Genomes Consortium 2007; Li et al. 2009; Floudas et al. 2012). These changes may underlie many important morphological, physiological, and behavioral differences between species and contribute much of the genetic and genomic diversity observed in nature (reviewed in Demuth and Hahn 2009). Recent work on diversity within species has also revealed surprising numbers of polymorphic gene duplications and losses (e.g., Sebat et al. 2004; Emerson et al. 2008; Kidd et al. 2008; Schrider et al. 2011), variation that contributes to long-term differences in the size of gene families between species. To further understand the importance of these changes, researchers must be able to accurately estimate the rate at which gene families evolve over time.

Our previous approach to estimating this rate modeled the gain and loss of genes within a gene family using a birth-and-death stochastic process (Hahn et al. 2005). (This probability distribution should not be confused with the birth-and-death conceptual model of gene family evolution of Nei and Rooney [2005].) Given input data on the size of gene families across multiple species and an ultrametric phylogenetic tree describing relationships among these species, the original CAFE software package (De Bie et al. 2006) can estimate the maximum likelihood value of the rate parameter, λ, and the maximum likelihood estimates (MLEs) of the size of each gene family at ancestral nodes of the tree. These MLEs can then be used to infer expansions and contractions of individual gene families on any lineage (e.g., Demuth et al. 2006). Updated versions of this software (CAFE 2; Hahn, Demuth, et al. 2007; Hahn, Han, et al. 2007) allowed for separate λ values on different branches of the tree, as well as several other novel features. A number of other programs using the birth-and-death model, or related models, have also appeared and offer similar as well as additional features (e.g., Liu et al. 2011; Ames et al. 2012; Librado et al. 2012).

A major concern when studying changes in gene family size is the quality of the underlying genome assembly and genome annotation. Low sequencing coverage in genome assemblies can lead to both the erroneous addition and subtraction of genes. Genes can be missing because there is incomplete coverage of the entire genome, with whole or parts of genes falling in unassembled portions of the genome; genes can also be missing because base-calling errors mistakenly indicate frameshifts or nonsense mutations (e.g., Hubisz et al. 2011). Extra gene copies can be inserted into the assembly if allelic diversity is incorrectly assembled as duplicated loci (e.g., Holt et al. 2002; Colbourne et al. 2011) or if a single multiexon gene is split among multiple scaffolds or contigs, in which case multiple gene models may be predicted from a single gene (e.g., Colbourne et al. 2011). Similar problems can arise even in “finished” genomes (such as Drosophila melanogaster), as gene annotation software can often miss short open-reading frames or can cleave a single gene into multiple predicted genes (e.g., Hahn, Han, et al. 2007; Stark et al. 2007; Costello et al. 2008).

Mol. Biol. Evol. 30(8):1987–1997 doi:10.1093/molbev/mst100 Advance Access publication May 24, 2013

For studies focusing on gene family size change, errors in genome assembly and annotation will result in biased estimates of the rate of change. Because a higher rate of evolution must be proposed in the presence of errors, whether additional or missing genes, estimates that do not account for errors are likely to have been upwardly biased. Indeed, we have previously found that Drosophila species represented by the lowest quality assemblies in comparative analyses using CAFE also appear to evolve at the highest rates (Hahn, Han, et al. 2007). Therefore, to estimate accurate rates of gene family evolution, we must be able to account for the error present in all current genome annotations. In this article, we present one such method and implement it in a new version of CAFE. Our method accounts for errors in gene family sizes by explicitly modeling the uncertainty associated with observed family sizes at the tips of a tree. We show that, given a known error distribution for each genome, we can recover accurate estimates of the true rate of gene family evolution. In addition, we present multiple methods for estimating error rates from the data when they are not known in advance and show how these can be used to provide more accurate values of evolutionary parameters.

New Approaches

We assume a random variable X that is a true count of homologous members of a gene family within a single lineage. In theory, X can be from 0 to infinite size, but for ease of computation, we limit it to be at most M. A whole genome can then be thought of as a random sample of size N, where each gene family within a genome corresponds to each observation, and N is the total number of gene families found in the genome. Each gene family size in a genome is assumed to be independent and identically distributed.

We also consider the error-prone measure of gene family size W, W = w (w = 0, 1, 2, 3, ..., M). W represents our observation for each gene family size that is affected by errors in the genome assembly and errors in the gene annotations. The measurement-error model, which describes the behavior of W given X = x, is specified by the error probabilities:

$$\theta_{w|x} = P(W = w \mid X = x),$$

that is, the probability of observing w when the true gene family size is x. The error probabilities can be represented as a matrix, Θ:

$$\Theta = \begin{pmatrix} \theta_{1|1} & \cdots & \theta_{1|M} \\ \vdots & \ddots & \vdots \\ \theta_{M|1} & \cdots & \theta_{M|M} \end{pmatrix},$$

where the item in the ith row and jth column represents the probability of observing i when the true gene family size is j. Note that the rows of the matrix do not have to sum to 1 but the columns do. We also define the probability $\pi_x$ as:

$$\pi_x = P(X = x),$$

that is, the probability of a true gene family size of x found in the genome. The lowercase π denotes the discrete probability distribution $\pi_x$ for x = 0, ..., M.
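The column constraint on Θ can be illustrated with a small Python sketch. The function name `banded_error_matrix`, the `frac_loss` parameter, the choice to index sizes from 0 to M, and the rule that error mass falling outside that range stays on the diagonal are all assumptions made for this example; none of them is taken from CAFE.

```python
import numpy as np

def banded_error_matrix(max_size, eps, frac_loss=0.5):
    """Build theta with theta[w, x] = P(W = w | X = x) for sizes 0..max_size.

    Hypothetical +/-1 model: a fraction eps of observations is wrong,
    with frac_loss of that mass on -1 and the rest on +1. Error mass that
    would leave the range 0..max_size stays on the diagonal (an assumption).
    """
    n = max_size + 1
    theta = np.zeros((n, n))
    for x in range(n):
        for w, p in ((x - 1, eps * frac_loss), (x + 1, eps * (1 - frac_loss))):
            if 0 <= w < n:
                theta[w, x] = p
        # Whatever mass is not assigned to an error goes to "observed == true".
        theta[x, x] = 1.0 - theta[:, x].sum()
    return theta

theta = banded_error_matrix(max_size=5, eps=0.1, frac_loss=0.75)
print(theta.sum(axis=0))  # column sums: one full distribution over w per true size x
print(theta.sum(axis=1))  # row sums: not constrained to equal 1
```

With an asymmetric split such as `frac_loss=0.75`, the columns still sum to 1 while the rows generally do not, which is the property noted in the text.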

In cases where there is no measurement error, it is known that we can estimate the rate of change in gene family size across the phylogeny by specifying a transition matrix based on the rate parameters λ and μ and fitting the model to the observed (= true) counts (X) at the tips of the tree and the time between the nodes (described by branch lengths, T). Under a birth-and-death process, the probability of going from s genes to c genes in time t is given by (Bailey 1964):

$$P(X(t) = c \mid X(0) = s) = \sum_{j=0}^{\min(s,c)} \binom{s}{j} \binom{s+c-j-1}{s-1} \alpha^{s-j} \beta^{c-j} (1 - \alpha - \beta)^{j},$$

$$\alpha = \frac{\mu\left(e^{(\lambda-\mu)t} - 1\right)}{\lambda e^{(\lambda-\mu)t} - \mu}, \qquad \beta = \frac{\lambda\left(e^{(\lambda-\mu)t} - 1\right)}{\lambda e^{(\lambda-\mu)t} - \mu},$$

where λ is the rate of gene gain and μ is the rate of gene loss. If the rates of gain and loss are equal, that is, λ = μ, the above probability is as given in equation 1 of Hahn et al. (2005). Here, we focus on cases with λ = μ, but the updated version of CAFE can also estimate separate rates of gain and loss (as can BadiRate; Librado et al. 2012).
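The transition probability of a linear birth-and-death process can be evaluated directly, as in the following minimal Python sketch. The function name `bd_transition_prob` is hypothetical, and this is not CAFE's implementation. When λ = μ the general expressions for α and β become 0/0, so the sketch uses their limit, α = β = λt/(1 + λt), and it treats a family of size 0 as permanently extinct.

```python
from math import comb, exp

def bd_transition_prob(s, c, lam, mu, t):
    """P(X(t) = c | X(0) = s) under a linear birth-and-death process."""
    if s == 0:
        return 1.0 if c == 0 else 0.0      # an extinct family stays extinct
    if abs(lam - mu) < 1e-12:
        alpha = beta = lam * t / (1.0 + lam * t)   # limit as mu -> lam
    else:
        e = exp((lam - mu) * t)
        alpha = mu * (e - 1.0) / (lam * e - mu)
        beta = lam * (e - 1.0) / (lam * e - mu)
    total = 0.0
    for j in range(min(s, c) + 1):
        total += (comb(s, j) * comb(s + c - j - 1, s - 1)
                  * alpha ** (s - j) * beta ** (c - j)
                  * (1.0 - alpha - beta) ** j)
    return total

# Probability that a family of size 2 is still of size 2 after 10 time units.
p_stay = bd_transition_prob(2, 2, lam=0.0012, mu=0.0012, t=10.0)
```

A quick sanity check on such a function is that the probabilities over all possible end sizes c sum to 1, and that the expected size stays at s when λ = μ.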

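For multiple species, S = (1, ..., s), we define a tree with ultrametric branch lengths, T, that has the set of species S as the leaves and a set of ancestral nodes, U = (1, ..., u). We define $X_n$ as the vector $X_n = (X_{n1}, \ldots, X_{ns})$, in which each item $X_{ni}$ describes the size of the nth gene family in the genome of species i (i ∈ S). Similarly, $Z_n = (Z_{n1}, \ldots, Z_{nu})$ is the vector in which each item $Z_{nj}$ is the gene family size of the ancestral genome at the inner node j (j ∈ U). The actual calculation of the likelihood over the whole tree utilizes the “pruning algorithm” (Felsenstein 1973, 1981) to sum over the inner node values that we cannot observe:

$$
\begin{aligned}
(\hat{\lambda}, \hat{\mu}) &= \arg\max_{\lambda,\mu} \left( \prod_{n=1}^{N} P(X_n \mid \lambda, \mu, T) \right) \\
&= \arg\max_{\lambda,\mu} \left( \prod_{n=1}^{N} \sum_{Z_n} P(X_n \mid Z_n, \lambda, \mu, T)\, P(Z_n \mid \lambda, \mu, T) \right) \\
&= \arg\max_{\lambda,\mu} \left( \prod_{n=1}^{N} \sum_{z_{n1}=0}^{M} \sum_{z_{n2}=0}^{M} \cdots \sum_{z_{nu}=0}^{M} P(X_n \mid Z_n = (z_{n1}, z_{n2}, \ldots, z_{nu}), \lambda, \mu, T)\, P(Z_n = (z_{n1}, z_{n2}, \ldots, z_{nu}) \mid \lambda, \mu, T) \right).
\end{aligned}
$$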

With error in the measurements, a naïve inference based on use of the W’s instead of the X’s leads to:

$$(\hat{\lambda}, \hat{\mu})_W = \arg\max_{\lambda,\mu} \left( \prod_{n=1}^{N} P(W_n \mid \lambda, \mu, T) \right),$$

where we define the vector $W_n = (W_{n1}, \ldots, W_{ns})$. Similar to $X_{ni}$, $W_{ni}$ is the error-prone measurement of the gene count for the nth gene family in species i.

To account for error, we add an additional layer of uncertainty on the true value X to the values at the leaf nodes of the phylogeny. This necessitates an additional summation of likelihoods at each leaf over X, in addition to all internal nodes, Z. The only difference between the summation at the leaf nodes and the summation at the inner nodes is that the probability at the leaf nodes is defined by the error matrix, rather than following the transition probabilities derived from the rate matrix and the branch lengths:

$$
\begin{aligned}
(\hat{\lambda}, \hat{\mu}) &= \arg\max_{\lambda,\mu} \left( \prod_{n=1}^{N} \sum_{x_{n1}=0}^{M} \sum_{x_{n2}=0}^{M} \cdots \sum_{x_{ns}=0}^{M} P(W_n \mid X_n = (x_{n1}, x_{n2}, \ldots, x_{ns}))\, P(X_n \mid \lambda, \mu, T) \right) \\
&= \arg\max_{\lambda,\mu} \left( \prod_{n=1}^{N} \sum_{x_{n1}=0}^{M} \sum_{x_{n2}=0}^{M} \cdots \sum_{x_{ns}=0}^{M} P(W_{n1} \mid X_{n1} = x_{n1})\, P(W_{n2} \mid X_{n2} = x_{n2}) \cdots P(W_{ns} \mid X_{ns} = x_{ns})\, P(X_n \mid \lambda, \mu, T) \right).
\end{aligned}
$$

The probability $P(W_{ni} \mid X_{ni} = x_{ni})$ follows from the error matrix entry $\theta_{w_{ni}|x_{ni}}$.

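Because we do not know the error matrix Θ, it becomes an additional set of parameters we need to estimate:

$$(\hat{\lambda}, \hat{\mu}, \hat{\Theta}) = \arg\max_{\lambda,\mu,\Theta} \left( \prod_{n=1}^{N} \sum_{X_n} \theta_{W_{n1}|X_{n1}}\, \theta_{W_{n2}|X_{n2}} \cdots \theta_{W_{ns}|X_{ns}}\, P(X_n \mid \lambda, \mu, T) \right).$$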

When the error probabilities are unknown, it is theoretically possible to estimate the whole set of parameters including the error matrix using maximum likelihood, but in practice the number of parameters to be estimated is too large unless the number of samples is extremely large. For example, even if we assume that the error matrix is the same across all families and all species, the number of parameters to be estimated is $2 + M^2$; that is, the two rate parameters plus the entries of a full error matrix when M is the maximum possible size of a family. Instead, here we focus on cases where we have some information about the distribution of errors affecting measurement. In practice, this means that rather than estimating the joint distribution of λ and Θ, we estimate λ using external information about the distribution of error. If we assume a highly simplified error model, we can also estimate the error matrix using a pseudomaximum likelihood approach (Buonaccorsi 2010). Later, we present extensive simulation results that suggest that our method provides accurate estimates of all parameters.

Results and Discussion

The Effect of Error on Inferred Rates of Gene Family Evolution

To examine the effect of error in the gene family size taken from suboptimal genome annotations, we simulated gene families under a model with known error. These data were simulated using the phylogeny of 12 Drosophila species (supplementary fig. S1, Supplementary Material online) and the distribution of sizes among 11,434 gene families previously analyzed in these species (Hahn, Han, et al. 2007), with a true rate parameter of λ = μ = 0.0012. A simulation consists of generating a data set using CAFE’s genfamily command and adding error to the data set as specified. In the simplest simulations, a known amount of error (ε) was added to each data set by randomly adding or subtracting genes from a proportion ε of the gene families, with all gene family sizes having the same error distribution (i.e., the same error matrix). Error can be added to all species or to a subset of species, effectively modeling heterogeneous assembly and annotation quality among genomes. The error distributions added to our simulated data were either ε = 0.1 or ε = 0.4 and consisted of an addition or subtraction of one to three genes in a family per species independently (fig. 1). An error value of ε = 0.1 means that in 90% of gene families, the observed size is equal to the true size, whereas in 10% of gene families, the observed size is either too large or too small (fig. 1A, C, and E). These error distributions approximate the range and distributions of error that we observe in several published genomes (see later).
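The error-adding step of such a simulation can be sketched in a few lines of Python. This is an illustration only, not the procedure used to produce the results reported here: the function name `add_error`, its parameters, and the decision to clip perturbed sizes at zero are assumptions.

```python
import numpy as np

def add_error(counts, eps, steps=(-1, 1), step_probs=(0.5, 0.5), rng=None):
    """Perturb, on average, a proportion eps of family sizes by a random step.

    counts may be 1-D (one genome) or 2-D (families x species); every entry
    is perturbed independently. steps and step_probs define the shape of the
    error distribution, e.g. (-1, 1) with (0.75, 0.25) for a loss-skewed model.
    """
    rng = np.random.default_rng() if rng is None else rng
    counts = np.asarray(counts)
    hit = rng.random(counts.shape) < eps              # which entries get an error
    delta = rng.choice(steps, size=counts.shape, p=step_probs)
    noisy = counts + np.where(hit, delta, 0)
    return np.maximum(noisy, 0)   # assumption: a family size cannot drop below 0

true_sizes = np.full(20000, 3)
observed = add_error(true_sizes, eps=0.1, rng=np.random.default_rng(0))
```

Passing `steps=(-3, -2, -1, 1, 2, 3)` with matching probabilities would give a distribution with more error classes, in the spirit of the wider error distributions described above.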

To first assess the effect of error on inferred rates of gene family evolution, CAFE 3 was run on each simulated data set with standard settings, that is, with no error model incorporated. Estimating λ from these error-prone data sets gave values of 0.0027 and 0.0085 when adding ε = 0.1 and ε = 0.4, respectively (table 1). As expected, the more error contained in each data set, the higher the value of λ we inferred, because higher rates of gene family evolution must be proposed to account for greater disparities in gene family size. Even when only 10% of families have an incorrect size (in each of the 12 genomes), the rate of gene family evolution is more than twice its true value (λ = 0.0012). Although adding symmetric error does not change the mean family size across species, it does change the variance in size: from a variance equal to 0.519 in the true data (mean = 1.097), adding ε = 0.1 changed the variance to 0.609 and adding ε = 0.4 gave 0.894. Adding asymmetric error did change the mean family size but only very slightly (data not shown); the variances were the same as in the symmetric case.

There did not appear to be a clear effect of asymmetry in the error model on the absolute values of λ, as putting more of the mass of the error distribution in either gains or losses did not significantly affect the estimated parameter value (compare results from error models 1A to 1C, and 1B to 1D in table 1). However, we did observe a small but substantial increase in λ when errors involving larger changes in family size (e.g., ±3) were included (compare results from error models 1A to 1E, and 1B to 1F in table 1).

Accounting for Errors in Gene Family Size Using CAFE 3

We have observed how error in the observed gene family sizes can lead to an overestimation of the rates of gene gain and loss. We were therefore interested in whether the error model described earlier could be used to account for this error to correctly infer the true value of λ. We apply the error model in two cases: first, when we assume that we know the correct error distribution and second, when we purposefully use an incorrect error model. Because the correct level of error and the exact distribution of error will often not be known, these two cases allow us to assess the consistency of our method. In both cases, the new “errormodel” function is used in CAFE 3, along with a specified error distribution.

In our ideal test case, we again simulated data with λ = μ = 0.0012 across the phylogeny, and added a proportion of error, ε, equal to either 0.1 or 0.4 to all species. In both cases, the size of gene families with error was either incremented or decremented by a count of 1, with equal probability. In the case of using the same error distribution to correct for error as was added to the data set, CAFE correctly recovers the true λ with high precision. The data sets with either ε = 0.1 or ε = 0.4 had the corresponding 0.1 and 0.4 error models applied to all species, and the λ value inferred was again very close to 0.0012 (table 1). To ensure that these results are not unique to a single rate parameter, we repeated the above simulations with λ = μ = 0.01 and 0.0001 and the same error values. For ε = 0.1, application of the error model gave λ = 0.00996 and 0.00009, respectively. For ε = 0.4, application of the error model gave λ = 0.01085 and 0.000083, respectively. We can see that for λ = 0.01, we were able to infer the correct value very accurately, whereas there was some inaccuracy for λ = 0.0001 (similar results held for both asymmetric and symmetric models). This latter result may be due to the small number of changes occurring across the tree with very low rates of change, but further simulations are needed to explore this effect.

FIG. 1. Error distributions used in simulations. These distributions include total errors of ε = 0.1 (A, C, E) and ε = 0.4 (B, D, F) but vary in the manner in which errors are spread across the error spectrum. In panels (A–D), only errors of +1 or −1 gene per family are considered, with (A) and (B) showing a symmetric spread of the total error between +1 and −1, whereas (C) and (D) show an asymmetric spread, skewed with 75% of the total error in the −1 category. The opposite skew, with 75% in the +1 category, was also simulated but is not shown here. Panels (E) and (F) show a symmetric distribution that extends to include the addition and subtraction of two or three genes.

We have also implemented the error model to allow for different error distributions among species. This corresponds to subsets of the input data coming from genomes of differential quality, as we have observed previously among the published Drosophila genomes (Drosophila 12 Genomes Consortium 2007; Hahn, Han, et al. 2007). To assess the accuracy of the λ estimate when different amounts of error are applied to individual genomes, simulations were carried out as above with one difference: An error distribution of ε = 0.1 was applied to all species except D. simulans, D. sechellia, and D. persimilis, which were all given ε = 0.4 (these species were observed to have lower quality genome assemblies; Begun et al. 2007; Drosophila 12 Genomes Consortium 2007). When we specify these same error distributions in our error model, correctly assigning each distribution to each species, we again recover the true single value of λ across the tree (third row in table 1). In addition to searching for a single λ value across the tree, CAFE has the ability to search for separate λ values on individual branches or clades. Without accounting for error, models with separate λ parameters for terminal branches leading to each of the three species with higher error rates fit significantly better (data not shown); this is because the error-prone assemblies appear to have faster rates of gene family evolution (cf., Hahn, Han, et al. 2007). Conversely, if the data are simulated with two λ parameters corresponding to these two parts of the tree (supplementary fig. S1, Supplementary Material online), we are able to correctly infer these parameter values even in the presence of error (table 2). We note that there is a potential identifiability problem when we have separate parameters for gain and loss on a specific branch along with a separate error model for the same branch. We may not always be able to distinguish higher gain and loss rates on the branch with the higher error rates.

We also tested the effect of using an incorrect error model when inferring rates of change. We anticipated that using an error distribution larger than the error distribution that is present in the data (that is, a larger value of ε) would lead to an overcorrection and that a lower λ would be observed with respect to the value of λ with which the data were simulated. Similarly, using an error distribution that corrects for less error than is present in the data might lead to an undercorrection and a higher λ than the true value. Our simulations confirmed our expectations: When an error model with ε = 0.4 was applied to a data set simulated with ε = 0.1, the λ value observed was 0.00045 (approximately one-third of the true value). Similarly, when using an error model with ε = 0.1 to correct for a data set simulated under ε = 0.4, the λ value inferred was 0.0042. Although we have undercorrected in this case (the inferred value is more than three times higher than the simulated value), it is important to remember that the value that would have been obtained without the error model was twice as high again (λ = 0.0086). Overall, we conclude that our error model can recover the true value if the error distribution is known exactly, and if it is not known, the model performs according to expectations. The consistency of the error models suggests that we may be able to predict the error distribution when it is unknown; we demonstrate this feature of CAFE in the next section.

Estimating the Error Distribution from External Data

Thus far, we have only considered the case where the error distribution for gene family size (i.e., the error matrix) is specified ahead of time. In this section and the next, we take two approaches to estimating the error distribution. In the first approach, we use external data from multiple genome assemblies, either two error-prone assemblies of the same genome or one high-quality assembly and one low-quality assembly. Comparisons of these assemblies can be used to find MLEs of probabilities in the error matrix, as described in the Materials and Methods section. We demonstrate this approach using both simulated and real data sets. In the next section, we ask whether the likelihood reported by CAFE can be used to estimate the error distribution without additional external data.

Table 1. Performance of the Error Model.

| Error distribution | Error Added (ε) | λ (No Error Model) | λ (Correct Error Model) | λ (Incorrect Error Model) |
|---|---|---|---|---|
| Symmetric error distributions | 0.1 (1A) | 0.00280 | 0.00122 | 0.00043 |
| Symmetric error distributions | 0.4 (1B) | 0.00897 | 0.00124 | 0.00447 |
| Asymmetric error distributions | 0.1 (1C) | 0.00283 | 0.00124 | 0.00047 |
| Asymmetric error distributions | 0.1 (1C)† | 0.00276 | 0.00120 | 0.00050 |
| Asymmetric error distributions | 0.4 (1D) | 0.00960 | 0.00139 | 0.00502 |
| Asymmetric error distributions | 0.4 (1D)† | 0.00765 | 0.00111 | 0.00366 |
| Varying error across species | 0.4 for low-quality species (1B); 0.1 for all other branches (1A) | 0.00417 | 0.00121 | 0.00090 |
| Symmetric error with more error classes | 0.1 (1E) | 0.00324 | 0.00130 | 0.00050 |
| Symmetric error with more error classes | 0.4 (1F) | 0.00903 | 0.00154 | 0.00406 |

The incorrect error model for simulations with ε = 0.1 is ε = 0.4 and vice versa.
Error distributions correspond to the given panel in figure 1.
The simulated value without error in all cases is λ = 0.0012.
† These distributions follow the same pattern as in figure 1C and D but with the asymmetry in the opposite direction.
Low-quality species are highlighted in supplementary figure S1, Supplementary Material online.

To estimate the error distribution from external data, we first simulated gene family data as above. For each simulated data set without error, we generated two error-prone data sets by separately adding error according to the prespecified error distribution. By data set we mean a set of gene families that comprise a whole genome. We can either compare these two error-prone data sets with each other to estimate the error matrix or we can compare one to the data set without error. The former comparison is equivalent to having two equally error-prone annotations of the same genome, whereas the latter comparison is equivalent to having one accurate annotation and one error-prone assembly. In both cases, we can find the parameters in the error matrix, Θ, that maximize the log likelihood of the data (see Materials and Methods). We simulated data sets with varying numbers of error parameters, from two parameters (error probabilities for differences of 0 and ±1) to seven parameters (error probabilities for differences of +3, +2, +1, 0, −1, −2, and −3) and varying amounts of error (with ε ranging from 0.1 to 0.6). The results of these simulations suggest that we are able to accurately estimate up to four error parameters, across all values of ε (supplementary table S1, Supplementary Material online). With more than four parameters, we did not get convergence of the log likelihood scores. In addition, when the simulated error model is simpler (i.e., has fewer parameters), the more complex error models are not better fitting, suggesting that we are able to find both estimates of the error parameters and the model complexity (supplementary fig. S2 and table S1, Supplementary Material online). Finally, we found that estimating the error distribution by comparing two error-prone data sets was only very slightly less accurate than estimation via comparison between one error-prone data set and one high-quality data set (supplementary table S1, Supplementary Material online). This result is encouraging for systems in which no reference-quality genome is available, although we are making the assumption that the two available error-prone data sets share the same error distribution.
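For the simpler of the two comparisons, one accurate data set against one error-prone data set, a rough sketch of the idea is given below. It assumes that the error depends only on the difference d = observed − true, in which case the maximum likelihood estimate of each difference-class probability is just its observed relative frequency (the standard multinomial result). The actual estimation procedure is the one described in Materials and Methods and may differ; the function name `estimate_error_classes` and the pooling of large differences into the boundary classes are assumptions of this example.

```python
from collections import Counter

def estimate_error_classes(true_sizes, observed_sizes, max_diff=1):
    """Relative frequency of each difference d = observed - true.

    Differences larger in magnitude than max_diff are pooled into the
    nearest boundary class (an assumption made to keep the model small).
    """
    diffs = [max(-max_diff, min(max_diff, w - x))
             for x, w in zip(true_sizes, observed_sizes)]
    tally = Counter(diffs)
    n = len(diffs)
    return {d: tally.get(d, 0) / n for d in range(-max_diff, max_diff + 1)}

true_sizes = [1, 2, 3, 4, 5, 1, 1, 2, 2, 3]
observed_sizes = [1, 2, 3, 4, 5, 1, 1, 2, 3, 2]
classes = estimate_error_classes(true_sizes, observed_sizes)
eps_hat = 1.0 - classes[0]   # estimated proportion of families with error
```

Comparing two equally error-prone data sets is harder, because each observed difference then mixes two independent errors; that case is not covered by this sketch.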

To evaluate the error distributions found in real genomes, we compared low- and high-quality assemblies and annotations for eight species (supplementary table S2, Supplementary Material online). These comparisons range from genomes of highly disparate quality (1.92X coverage vs. 6.79X coverage for the two versions of the guinea pig genome) to genomes in the final stages of finishing (e.g., D. melanogaster assembly version 4 vs. version 5.4). As a result, we find a wide range of error distributions, with ε going from 0.687 (in Apis) to 0.104 (in Drosophila). The distribution of differences in gene family sizes in these pairs of genomes is shown graphically in supplementary figure S3, Supplementary Material online. As can be seen, these distributions match our simulated distributions quite well, with most errors involving +1 or −1 differences and very few families having errors greater than ±3. When comparing estimated models from these data, we found better fits for an asymmetric model for the genomes of honey bee, guinea pig, and fugu, whereas the genomes of cow, zebrafish, fruit fly, human, and mouse were estimated to have a symmetric error model. Models with more parameters fit the data better in general (supplementary table S3, Supplementary Material online), but with more than nine parameters (up to differences of ±4 in the asymmetric model, ±7 in the symmetric model) the model often failed to find the global maximum likelihood. Even when the estimated error matrix for the best model had nonzero error rates for differences in family size greater than ±4, the estimated error rates for differences greater than ±4 were smaller than 0.01 in all the genomes analyzed (supplementary table S4, Supplementary Material online). On the basis of the errors observed from real genome data, we conclude that our error models are reasonable tradeoffs between realistic descriptions of the errors and ease of optimization.

Estimating the Error Distribution without External Data

In our second approach to estimating error distributions, we did not use external data; instead, we compared the likelihood of CAFE runs with varying error parameters. Because the likelihood of any particular run using the error model is calculated with fixed error parameters, we anticipated that comparison of likelihood scores among runs with varying error distributions would lead to accurate estimates of the error distributions themselves. Although the resulting estimates are not MLEs across the whole parameter space (and are therefore not guaranteed to be global maxima), as discussed in the Materials and Methods section, we can find the pseudo-MLE (PMLE) within the resolution of the grid we are searching.

Table 2. Performance of the Error Model on a 2-λ Search.

Error Added (ε)   Rate   λ (No Error Model)   λ (Correct Error Model)
0.1 (1A)          λ₁     0.00206              0.00097
                  λ₂     0.05784              0.02513
0.4 (1B)          λ₁     0.00616              0.00091
                  λ₂     0.18173              0.02690

Error distributions correspond to the given panel in figure 1. The simulated values without error in all cases are λ₁ = 0.0009 and λ₂ = 0.025. Species with λ₂ are highlighted in supplementary figure S1, Supplementary Material online.

To assess our ability to estimate the error distribution, we focused on estimating the value of ε, the proportion of gene families with error. We limit our search to estimate only a one-parameter error model (equal probability of ±1). To do this, we again simulated data sets with a λ value of 0.0012, adding either ε = 0.1 or 0.4. We then ran CAFE on both of these data sets with error models having a value of ε ranging from 0.0 to 0.9, in fixed increments. Figure 2 shows that the likelihood of the data is maximized (–ln L is minimized) when the model having the correct value of ε is used: Models with ε = 0.1 and ε = 0.4 represent the best fit to the data when simulations with ε = 0.1 and ε = 0.4 are used, respectively. We

have provided a program that automates the process of run-

ning CAFE with varying error models while maximizing the

likelihood score by performing a simple optimization

(caferror.py); this program will report the error model used

to obtain the best score. These analyses show that we are able

to use CAFE to estimate simple error models from the

input data set by ﬁnding the value that maximizes the

likelihood.
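The one-dimensional search over ε can be sketched as follows. This is a minimal illustration under stated assumptions, not the caferror.py implementation: `toy_neg_log_likelihood` is a hypothetical stand-in for a single CAFE run with the error model fixed at a given ε (here a toy curve with its minimum at 0.4), and the grid spacing of 0.1 is assumed.

```python
from typing import Callable, Sequence, Tuple


def grid_search_epsilon(
    neg_log_likelihood: Callable[[float], float],
    grid: Sequence[float],
) -> Tuple[float, float]:
    """Return the grid value of epsilon with the smallest -ln L, and that score."""
    scores = [(neg_log_likelihood(eps), eps) for eps in grid]
    best_score, best_eps = min(scores)
    return best_eps, best_score


def toy_neg_log_likelihood(eps: float) -> float:
    # Hypothetical stand-in for one CAFE run with the error model fixed at eps.
    return 1000.0 + 250.0 * (eps - 0.4) ** 2


# Error values from 0.0 to 0.9 in fixed increments of 0.1 (assumed spacing).
grid = [round(0.1 * i, 1) for i in range(10)]
best_eps, best_score = grid_search_epsilon(toy_neg_log_likelihood, grid)
print(best_eps, best_score)
```

Because the estimate can only fall on a grid point, its resolution is limited by the chosen increment, which is why the result is a PMLE rather than an MLE over the full parameter space.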

When the error distribution varies across the genomes

considered, the simple one-dimensional search implemented

above will only tell us about the average error across the tree.

We therefore simulated a data set with varying levels of error

among genomes and followed the one-dimensional search for

an average error parameter with a species-by-species search.

This species-by-species search is achieved by repeatedly

adding or subtracting 10% of the average error predicted in

the ﬁrst step to each species separately, until the likelihood

score has been maximized. Again, more complex search strat-

egies that simultaneously try to maximize the average and

species-speciﬁc error distributions can be considered in the

future, but here we show that a simple approach works well

on the simulated data set, as illustrated in table 3. The error in

the data set resulted in an average error across all species

estimated as ε = 0.26. We then continued to predict error on each individual species: D. melanogaster was simulated with ε = 0.1 and was predicted to have ε = 0.101. Drosophila simulans, D. sechellia, and D. persimilis had ε = 0.4 applied to their genomes and were all predicted to have error values of 0.406 (table 3). All other species had ε = 0.2 and were predicted to have error values ranging from 0.203 to 0.228 (table 3). It is notable that the largest deviation we saw was in D. pseudoobscura (simulated with ε = 0.2), as it is sister to D. persimilis (simulated with ε = 0.4); this suggests that

genomes with high error rates can make it seem as though

genomes of closely related species also have higher than

expected error rates. Again, this pattern has been seen previ-

ously in the Drosophila data (Hahn, Han, et al. 2007). The

species-by-species search has been implemented as an

option that can be run after an average error parameter

has been estimated using the caferror.py script.
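The species-by-species step can be sketched as a simple coordinate search. This is an illustration under stated assumptions rather than the code in caferror.py: `toy_score` is a hypothetical stand-in for the –ln likelihood that CAFE would return for a given set of per-species error values, and the species names and target values are made up.

```python
from typing import Callable, Dict, Sequence


def species_search(
    neg_log_likelihood: Callable[[Dict[str, float]], float],
    species: Sequence[str],
    average_eps: float,
    max_rounds: int = 100,
) -> Dict[str, float]:
    """Nudge each species' error by +/-10% of the average until -ln L stops improving."""
    step = 0.1 * average_eps
    eps = {name: average_eps for name in species}
    best = neg_log_likelihood(eps)
    for _ in range(max_rounds):
        improved = False
        for name in species:
            for delta in (step, -step):
                trial = dict(eps)
                trial[name] = min(1.0, max(0.0, eps[name] + delta))
                score = neg_log_likelihood(trial)
                if score < best - 1e-12:  # accept only strict improvements
                    eps, best, improved = trial, score, True
                    break
        if not improved:
            break
    return eps


# Hypothetical per-species error levels used only to build a toy objective.
target = {"sp_A": 0.1, "sp_B": 0.4, "sp_C": 0.2}


def toy_score(eps: Dict[str, float]) -> float:
    return sum((eps[name] - target[name]) ** 2 for name in target)


estimates = species_search(toy_score, list(target), average_eps=0.26)
print(estimates)
```

Each estimate ends on a multiple of the step size, so the attainable precision per species is tied to the average error found in the first step.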

Applying the Error Model to Real Data Sets

Comparative data on the size of gene families has been ana-

lyzed in multiple different groups of organisms using earlier

versions of CAFE (e.g., Sackton et al. 2007; Martin et al. 2008; Sharpton et al. 2009; Brown et al. 2010; Ohm et al. 2012; Qiu

et al. 2012). Here, we use the new error model feature in CAFE

3 to revisit data from three clades that we have previously

analyzed: fungi, mammals, and Drosophila. We analyzed new gene family data from 16 fungi (Butler et al. 2009; Rasmussen

and Kellis 2011) and 10 mammals (Worley K et al., in prep-

aration), as well as previously analyzed data from Drosophila

(Hahn, Han, et al. 2007). For the Drosophila data set, we used

the gene families and tree as described in Hahn, Han, et al.

(2007); the gene families and tree from the expanded fungal

data set are described in Rasmussen and Kellis (2011), and those for the mammals are described in Worley K et al. (in preparation).

Each of these groups of species is certain to have heteroge-

neous levels of genome assembly and annotation quality

among them. Each group contains one or two focal spe-

cies—Saccharomyces cerevisiae among the fungi, Homo sapi-

ens and Mus musculus among the mammals, and D.

melanogaster among the ﬂies—that has a very high-quality

assembly and annotation and likely several species with

below-average genomes (e.g., the Drosophila species men-

tioned in the previous section). Because of the presence of

genomes with lower quality, it seems likely that the error

model introduced here will provide a more accurate estimate

of the rate of gene family evolution in each group.

FIG. 2. Estimating the error distribution. Each panel shows the –ln likelihood of individual CAFE runs on a simulated data set with error. The points represent the score of the run with the corresponding error model (ε value) on the x axis; the dashed vertical line indicates the true amount of error added to the simulated data. Panel (A) shows a simulation with ε = 0.1 and panel (B) shows one with ε = 0.4. In both cases, the maximum likelihood score occurs when the correct error model is used.

Analyzing the data without applying an error model, we found λ = 0.0008, 0.0023, and 0.0012, for the fungi, mammals, and Drosophila, respectively (table 4). For each data set, we then estimated a one-parameter error model without using external data, limiting ε to symmetric errors of ±1. Although

this is a highly simpliﬁed error distribution, as shown earlier it

captures most of the effect of error on rates of gene family

evolution. Assuming an average error distribution shared

across all genomes in each analysis, we searched for the

PMLE of the error parameter, ε, finding ε = 0.0277, 0.0732, and 0.041 for the fungi, mammals, and Drosophila, respectively (table 4). These values imply that on average 2.8%, 7.3%,

and 4.1% of gene families in each data set have observed sizes

that are in error; that is, the observed sizes of the families are not equal to the true sizes. It is interesting to note that

the average error in each clade roughly coincides with

the number of repetitive elements, number of total

gene duplicates, and total size of the analyzed genomes

(mammals > Drosophila > fungi), all of which are expected

to coincide with errors in assembly and annotation. These

conclusions of course rest on the assumption that our error

model is an accurate representation of the true error distri-

butions. As we have shown earlier, an analysis of error-prone

assemblies supports our modeling assumptions. Nevertheless,

this is an association among three clades, arising from genome

assemblies constructed in very different ways, and should be

considered to be a preliminary indication of factors that can

affect assembly and annotation quality.

Analyzing the data with the best-ﬁt error models, we found

new rate parameters of λ = 0.0006, 0.0019, and 0.0006, for the fungi, mammals, and Drosophila, respectively (table 4). In all three data sets, models with a single error parameter fit the data significantly better than models without error parameters (all P < 0.0001; likelihood ratio test), indicating that the corrected λ values reported here are more accurate reflections of the rate of gene family evolution in these three clades.

In the fungi and mammals, the new estimates suggest rates of

evolution approximately 75% of the uncorrected estimates,

whereas in Drosophila, the new estimate is 50% of the original

one.
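The proportions quoted above follow directly from the rate estimates in table 4. The short calculation below reproduces them; the dictionary layout and variable names are only illustrative.

```python
# Rate estimates from table 4: (lambda without error model, lambda with error model).
table4 = {
    "Fungi": (0.00080, 0.00061),
    "Mammals": (0.00238, 0.00186),
    "Drosophila": (0.00121, 0.00059),
}

# Ratio of the corrected rate to the uncorrected rate for each clade.
ratios = {
    clade: corrected / uncorrected
    for clade, (uncorrected, corrected) in table4.items()
}

for clade, ratio in ratios.items():
    print(f"{clade}: corrected rate is {ratio:.0%} of the uncorrected rate")
```

The ratios come out at roughly 0.76 for the fungi, 0.78 for the mammals, and 0.49 for Drosophila, in line with the approximate 75% and 50% figures in the text.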

In addition to ﬁtting a global rate parameter, we examined

the fit of a model having three λ parameters on the mammalian tree. Previous research found higher rates of gene gain

and loss in the great apes, with an intermediate rate in other

primates, and the lowest rate on the rest of the tree (Hahn,

Demuth, et al. 2007). We wished to know whether this pat-

tern would still be observed after errors were accounted for,

so we estimated the likelihood of the three-parameter model

while setting ε = 0.0732. As expected if this pattern is due to a true rate acceleration and not error in assemblies or annotation, the three-parameter model with error fit the data significantly better than a one-parameter model with error (P < 0.0001; likelihood ratio test). Indeed, as previously observed, the value of the λ parameter on the human–chimp shared lineage was more than three times higher than the average value (0.0062 vs. 0.0019), with the orangutan–macaque shared lineage having an intermediate rate (λ = 0.0044). These results are not wholly unexpected, as

other previous studies have found the same rate acceleration

in a set of analyses that is free from biases due to heteroge-

neous quality in genome assemblies (Marques-Bonet et al.

2009). Our results also conﬁrm earlier conclusions about

the amount of genic copy number divergence between

humans and chimpanzees (Demuth et al. 2006) and demon-

strate that these results were not due to genome assembly or

annotation error.
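For readers who want to reproduce this kind of model comparison, a likelihood ratio test between nested models can be sketched with the standard library alone. The log-likelihood values below are hypothetical and are not taken from the analyses above; the sketch also assumes the usual chi-square approximation, with degrees of freedom equal to the number of extra parameters, and covers only the one- and two-parameter cases.

```python
import math


def lrt_p_value(lnl_null: float, lnl_alt: float, extra_params: int) -> float:
    """Likelihood ratio test p-value for nested models (1 or 2 extra parameters)."""
    stat = 2.0 * (lnl_alt - lnl_null)
    if stat <= 0.0:
        return 1.0
    if extra_params == 1:
        # Chi-square survival function with 1 degree of freedom.
        return math.erfc(math.sqrt(stat / 2.0))
    if extra_params == 2:
        # Chi-square survival function with 2 degrees of freedom.
        return math.exp(-stat / 2.0)
    raise ValueError("this sketch only covers 1 or 2 extra parameters")


# Hypothetical log-likelihoods, for illustration only (not values from the paper).
# A three-rate model has two more rate parameters than a one-rate model.
p_three_vs_one = lrt_p_value(lnl_null=-5230.0, lnl_alt=-5210.0, extra_params=2)
print(p_three_vs_one < 0.0001)
```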

Conclusions

Here, we have provided a new software package that enables

the accurate estimation of rates of gene family evolution

when there are errors in the observed gene family sizes. By

allowing users to marginalize over the uncertainty in the ob-

served gene family sizes, CAFE 3 provides a platform for ex-

panding comparative genomic analyses into clades consisting

solely of draft genome sequences. Our software is freely available (http://sourceforge.net/projects/cafehahnlab/) and can

be compiled on multiple operating systems. Although it is

likely that there are typical error distributions associated with

different sequencing technologies used to assemble genomes

(e.g., Illumina vs. 454), our program does not require that such

distributions are known ahead of time. If prior information

about either the sequencing technology or the depth of cov-

erage is known, more accurate results may be obtained. We

have demonstrated three alternative approaches to estimat-

ing error distributions, each of which requires slightly different

types of data; regardless of how error distributions are esti-

mated, CAFE 3 allows any arbitrary distribution to be speci-

ﬁed. Finally, although we have applied this approach to

correcting for error in gene family sizes, similar methods

may be applicable to errors in nucleotide data (e.g., Heid

et al. 2008; Hubisz et al. 2011) or any trait for which a realistic

error model can be constructed (e.g., RNA-seq; Brawand et al.

2011).

Table 3. Estimating the Error Distribution (ε) with Heterogeneous Error across Species.

Species                 Error Added   Estimated Error
Drosophila willistoni   0.2           0.20283
D. virilis              0.2           0.20283
D. persimilis           0.4           0.40566
D. mojavensis           0.2           0.20283
D. sechellia            0.4           0.40566
D. pseudoobscura        0.2           0.22819
D. yakuba               0.2           0.20283
D. grimshawi            0.2           0.20283
D. erecta               0.2           0.20283
D. melanogaster         0.1           0.10141
D. ananassae            0.2           0.20283
D. simulans             0.4           0.40566

Table 4. Analysis of Previously Published Data Sets.

Clade        λ (No Error Model)   ε (Estimated)   λ (Error Model = ε)
Fungi        0.00080              0.02771         0.00061
Mammals      0.00238              0.07324         0.00186
Drosophila   0.00121              0.04102         0.00059

The estimated error distribution was symmetric with only ±1 allowed.


Materials and Methods

Application of the Error Model to Infer the Gain and

Loss Parameters

The general approach we take uses PMLEs because we do not

include the error parameters in the full likelihood formula.

Instead, we estimate the measurement error parameters from

external data as described later, and then the likelihood is

calculated using the observed data, with the error parameters

ﬁxed at their estimates (Buonaccorsi 2010):

l,¼argmaxl, Y

n¼1

PðWnjXn,l,,TÞ!

PðWnjXn,l,,TÞ

¼X

xn1¼0X

xn2¼0

... X

xns¼0

njXn¼ðxn1,xn2,...xnsÞðÞPðXn,l,,TÞ

¼X

xn1¼0X

xn2¼0

... X

xns¼0X

zn1¼0X

zn2¼0



X

znu¼0(PðWnjXn¼ðxn1,xn2,...xns ÞÞPðXnjZn

¼ðzn1,zn2...znuÞ,l,,TÞPðZnjl,,TÞ),

where P(W=wjX=x)= ^

wjxand is estimated through

external data.
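The marginalization over unobserved true counts can be illustrated for a single tip of the tree. The sketch below is a simplification under stated assumptions: it treats one species in isolation, takes the distribution of the true count X as given (hypothetical numbers), assumes the error rate depends only on the difference w − x, and ignores the boundary at zero.

```python
from typing import Dict, Sequence


def observed_count_probability(
    w: int,
    true_count_probs: Sequence[float],
    theta: Dict[int, float],
) -> float:
    """P(W = w) = sum over x of theta[w - x] * P(X = x)."""
    return sum(
        theta.get(w - x, 0.0) * p_x for x, p_x in enumerate(true_count_probs)
    )


# Hypothetical distribution of the true count X at one tip (sizes 0..3).
true_count_probs = [0.1, 0.5, 0.3, 0.1]
# Assumed symmetric +/-1 error model with epsilon = 0.1, keyed by w - x.
theta = {-1: 0.05, 0: 0.90, 1: 0.05}

p_w2 = observed_count_probability(2, true_count_probs, theta)
print(p_w2)
```

In the full likelihood the same sum is taken jointly over the true counts at all tips and over the states at internal nodes, with the birth-and-death model supplying the probabilities of the true counts.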

Simplifying the Error Matrix

Because we are dealing with count data that can go up to M, if

we allow for error from any true count to any observed count

the number of parameters specifying the error matrix for this

full model is M², which is unnecessarily complex. To simplify

the parameters and to make the model useful, we can con-

strain three aspects of the model.

First, we can assume that the error rate depends on the

difference between the observed count and the true count

but does not depend on the true count itself. This is equiv-

alent to having a single homogeneous parameter along the

diagonals of the error matrix. Although this assumption may

not be biologically realistic, because most of our gene families

are sizes of three or smaller and large families are rare, only

three rates would normally have to be used. Our modeling

framework also allows this assumption to be relaxed when

error is estimated from external data, and any error structure

can be entered into CAFE 3 if speciﬁed by the user. Second, we

can restrict the errors that are allowed to be at most D differences from the true count. This forces the corner parameters that are D + 1 or more steps away from the diagonal to a probability ω that is constrained to be smaller than all other parameters. Again, D is a user-specified parameter that can be

quite large in practice. Third, we can assume symmetry on the

rates of errors that increase the counts and the rates of errors

that decrease the counts, reducing the number of parameters

to half. This last assumption is again optional, and we have

explored models with and without it. In our simulated results

presented above, we explore a range of models that are combinations of the value of D (3) and the state of symmetry. The numbers of parameters for the error matrix are then 2D + 1 for asymmetric models and D + 1 for symmetric

models.
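The three constraints can be made concrete by building a banded error matrix from a short list of rates. The sketch below covers the symmetric case only and makes two assumptions that are not part of the description above: entries more than D steps from the diagonal are set to zero rather than to a small probability ω, and any error mass that would fall outside the range 0..M is returned to the diagonal so that each column sums to one.

```python
from typing import List


def symmetric_error_matrix(max_count: int, rates: List[float]) -> List[List[float]]:
    """Build theta[w][x] = P(W = w | X = x) from rates for |w - x| = 1..D.

    rates[d - 1] is the probability of an error of exactly +d (and of -d),
    so D = len(rates) and the diagonal holds the remaining probability.
    """
    size = max_count + 1
    theta = [[0.0] * size for _ in range(size)]
    for x in range(size):
        off_diagonal = 0.0
        for d, rate in enumerate(rates, start=1):
            for w in (x - d, x + d):
                if 0 <= w < size:
                    theta[w][x] = rate
                    off_diagonal += rate
        theta[x][x] = 1.0 - off_diagonal
    return theta


# Symmetric +/-1 model (D = 1) with epsilon = 0.1 split equally between -1 and +1.
theta = symmetric_error_matrix(max_count=4, rates=[0.05])
for row in theta:
    print(row)
```

Because the same rates are reused along each diagonal, the matrix is specified by a handful of values no matter how large M is.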

Estimation of the Error Matrix from External Data

If we know that one measurement (i.e., one set of gene fam-

ilies from a well-annotated genome) is more accurate than

another, we can estimate the error matrix by assuming the

better measure as the true value and comparing the lesser

measure to the truth. Although having a well-annotated

genome might seem to obviate the need for using the error

model, the estimated error matrix from such a comparison

could be usefully applied to poorly annotated genomes. If we

have two sets of measurements with unknown relative

accuracy, we can estimate the error matrix based on the ob-

served agreement between the two measures. We describe

these two cases in reverse order.

First, when the two measures are similarly error prone, we find the triangular matrix R = r_ij (where i = 0...M, j ≤ i) of pairwise observations, with r_ij being the number of observations with W_1 = i and W_2 = j. The probability of each pairwise observation is defined based on the true count probabilities π and the error rates θ_{w|x}. Assuming that the probability of observing pairwise observations is p_ij, with

\[
p_{ij} =
\begin{cases}
\sum_{k=0} 2\,\theta_{i|k}\,\theta_{j|k}\,\pi_k & \text{if } i \neq j \\
\sum_{k=0} \theta_{i|k}\,\theta_{j|k}\,\pi_k & \text{if } i = j
\end{cases}
\]

then the log likelihood of the data matrix R given the probability distribution of π and the error matrix model Θ can be calculated using the multinomial likelihood. Ignoring the coefficient, the log likelihood ln L is:

\[
\ln L(R \mid \pi, \Theta) = \sum_{i=0,\,j=0\ (i \ge j)}^{M,\,M} r_{ij} \ln(p_{ij}).
\]

However, for our data set, we have a limitation that we can

never observe gene families that are size zero in both mea-

sures. To account for this missing data, we ﬁnd the likelihood

that is conditional on the event (E) that we observed at least

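one gene in either one of the measurements. Because

\[
P(R \mid \pi, \Theta, E) = \frac{P(R, E \mid \pi, \Theta)}{P(E)} = \frac{P(R, E \mid \pi, \Theta)}{1 - p_{00}},
\]

the conditional log likelihood ln L_c is:

\[
\ln L_c(R \mid \pi, \Theta, E) = \left( \sum_{i=0,\,j=0\ (i \ge j)}^{M,\,M} r_{ij} \ln(p_{ij}) \right) - \ln(1 - p_{00}).
\]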

We do not have data for the true π, but assuming the distribution is similar to the counts found in W_1 and W_2, we can substitute the count distributions observed in the two measures as the approximation of π. For each model of error matrix Θ, we find the parameters that maximize the conditional log likelihood of the data R using the Nelder–Mead method.
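The pairwise probabilities and the conditional log likelihood for two similarly error-prone measures can be written out directly. The sketch below is illustrative only: the error matrix, true count distribution, and observation counts are hypothetical, the sums over true counts run over the size of the toy distribution, the lower-triangular indexing (j ≤ i) is an assumption, and the maximization step (Nelder–Mead in the text) is not included.

```python
import math
from typing import List, Sequence


def pairwise_probability(
    i: int, j: int, theta: List[List[float]], pi: Sequence[float]
) -> float:
    """p_ij for two equally error-prone measures; theta[w][x] = P(W = w | X = x)."""
    total = sum(theta[i][k] * theta[j][k] * pi[k] for k in range(len(pi)))
    return total if i == j else 2.0 * total


def conditional_log_likelihood(
    r: List[List[int]], theta: List[List[float]], pi: Sequence[float]
) -> float:
    """Sum over j <= i of r_ij * ln(p_ij), minus ln(1 - p_00)."""
    log_l = 0.0
    for i in range(len(r)):
        for j in range(i + 1):
            if r[i][j] > 0:
                log_l += r[i][j] * math.log(pairwise_probability(i, j, theta, pi))
    return log_l - math.log(1.0 - pairwise_probability(0, 0, theta, pi))


# Hypothetical inputs for family sizes 0..2; each column of theta sums to one.
theta = [
    [0.90, 0.05, 0.00],
    [0.10, 0.90, 0.10],
    [0.00, 0.05, 0.90],
]
pi = [0.2, 0.5, 0.3]
# Lower-triangular counts r_ij; r_00 is zero because such families are never observed.
r = [
    [0, 0, 0],
    [3, 40, 0],
    [0, 4, 25],
]
log_l_c = conditional_log_likelihood(r, theta, pi)
print(log_l_c)
```

An optimizer would call `conditional_log_likelihood` repeatedly while varying the entries of the error matrix, keeping π fixed at its approximation.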

For the case when one set of annotations is treated as the

true measure, the observation matrix R is a full matrix with columns that correspond to the true measure and rows to the incorrect measure. r_ij is now the number of observations with W = i and X = j. The probability of each pairwise observation is again defined based on the true count probabilities π and the error rates θ_{w|x}, but the probability of observing pairwise observations p_ij is simpler:

\[
p_{ij} = \theta_{i|j}\,\pi_j.
\]

The conditional log likelihood follows the same approach as above but is summed across the whole discordance matrix, R:

\[
\ln L_c(R \mid \pi, \Theta, E) = \left( \sum_{i=0,\,j=0}^{M,\,M} r_{ij} \ln(p_{ij}) \right) - \ln(1 - p_{00}).
\]

The distribution π can be estimated using the assumed true

count data X. MLEs of parameters that determine the error

matrix are then found by comparison of the true and error-

prone data.

Estimation of the Error Matrix from Eight Genomes

To collect data on realistic error matrices, we compared

annotations for two versions of each of eight published

genomes. We used the gene models for honey bee

(Apis mellifera), cow (Bos taurus), guinea pig (Cavia porcellus),

zebraﬁsh (Danio rerio), fruit ﬂy (Drosophila melanogaster),

human (H. sapiens), mouse (M. musculus), and fugu

(Takifugu rubripes) from Ensembl (Flicek et al. 2012). For

each species, we compared an earlier, lower quality assembly

and annotation to a later, higher quality assembly and annotation (supplementary table S2, Supplementary Material

online). For each pair of genome annotations (i.e., for each

species), we assigned genes to gene families using an all-

against-all BLASTP sequence similarity search, followed by

clustering using the MCL algorithm (Enright et al. 2002).

Because genes from both annotation versions were clustered

simultaneously, we can simply compare the size of each

resulting family to estimate the error matrix. We applied

our method of estimating the error matrix from external

data to the pairs of genomes, assuming that the updated

annotation is the true measure. We compared symmetric

and asymmetric error models with a range of parameters

(supplementary table S3, Supplementary Material online).

The number of parameters in the asymmetric models

ranged from three (difference of −1, 0, 1) up to nine (difference of −4, −3, ... 3, 4). The number of parameters on the symmetric models ranged from two (difference of 0, ±1) to eight (difference of 0, ±1, ... ±7).
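Because both annotation versions are clustered together, each family yields a pair of sizes that can be compared directly. The sketch below tallies the raw frequency of each size difference, treating the newer annotation as the truth. It is a simplified illustration with made-up family sizes, and it reports empirical frequencies rather than the likelihood-based estimates described in the text.

```python
from collections import Counter
from typing import Dict, Sequence


def empirical_error_distribution(
    old_sizes: Sequence[int], new_sizes: Sequence[int]
) -> Dict[int, float]:
    """Fraction of families at each difference (old - new), treating new as truth."""
    if len(old_sizes) != len(new_sizes):
        raise ValueError("both annotations must cover the same families")
    diffs = Counter(old - new for old, new in zip(old_sizes, new_sizes))
    total = len(old_sizes)
    return {d: diffs[d] / total for d in sorted(diffs)}


# Hypothetical sizes of the same six families in two annotation versions.
old_annotation = [1, 2, 2, 5, 0, 3]
new_annotation = [1, 2, 3, 4, 1, 3]
error_distribution = empirical_error_distribution(old_annotation, new_annotation)
print(error_distribution)
```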

Estimation of the Error Matrix via Search without

External Data

We also demonstrate how to estimate the error matrix when

there is no external validation data available. In this case, we

ﬁnd the birth-and-death parameters that maximize the

pseudolikelihood using a ﬁxed error model but repeat the

procedure across a grid of error-parameter values. The grid

consists of error parameters in ﬁxed increments across a ﬁxed

region, and the error parameter that yields the maximum

pseudolikelihood across the grid space is determined as the

estimate. This process has the limitation that it cannot search

the whole continuous parameter space, and an n-dimensional

grid search becomes impractical for complicated error matri-

ces as the number of parameters in the model increases.

Nevertheless, it performs fairly well in practice, as is shown

in the simulations presented in the Results and Discussion

sections.

Supplementary Material

Supplementary ﬁgures S1–S3 and tables S1–S4 are available at

Molecular Biology and Evolution online (http://www.mbe.

oxfordjournals.org/).

Acknowledgments

The authors thank Matt Rasmussen for providing the fungal

data and David Swofford for helpful discussions. This work

was supported by a National Evolutionary Synthesis Center

postdoctoral fellowship to M.V.H., by a Ford Foundation pre-

doctoral fellowship to J.L.-M., and by National Science

Foundation grant DBI-0845494 to M.W.H.

References

Ames RM, Money D, Ghatge VP, Whelan S, Lovell SC. 2012. Determining

the evolutionary history of gene families. Bioinformatics 28:48–55.

Bailey NTJ. 1964. The elements of stochastic processes. New York: John

Wiley & Sons, Inc.

Begun DJ, Holloway AK, Stephens K, et al. (13 co-authors). 2007.

Population genomics: whole-genome analysis of polymorphism

and divergence in Drosophila simulans. PLoS Biol. 5:e310.

Brawand D, Soumillon M, Necsulea A, et al. (18 co-authors). 2011. The

evolution of gene expression levels in mammalian organs. Nature

478:343–348.

Brown CA, Murray AW, Verstrepen KJ. 2010. Rapid expansion and

functional divergence of subtelomeric gene families in yeasts. Curr

Biol. 20:895–903.

Buonaccorsi JP. 2010. Measurement error: models, methods and appli-

cations. Boca Raton (FL): Chapman and Hall/CRC Press.

Butler G, Rasmussen MD, Lin MF, et al. (51 co-authors). 2009. Evolution

of pathogenicity and sexual reproduction in eight Candida genomes.

Nature 459:657–662.

Colbourne JK, Pfrender ME, Gilbert D, et al. (68 co-authors). 2011. The

ecoresponsive genome of Daphnia pulex. Science 331:555–561.

Costello JC, Han MV, Hahn MW. 2008. Limitations of pseudogenes in

identifying gene losses. In: Nelson C, Vialette S, editors. Proceedings

of the Sixth Annual RECOMB Satellite Workshop on Comparative

Genomics; 2008 Oct 13–15; Paris, France. Heidelberg (Germany):

Springer Berlin. p. 14–25.

De Bie T, Demuth JP, Cristianini N, Hahn MW. 2006. CAFE: a compu-

tational tool for the study of gene family evolution. Bioinformatics

22:1269–1271.

Demuth JP, De Bie T, Stajich JE, Cristianini N, Hahn MW. 2006. The

evolution of mammalian gene families. PLoS One 1:e85.

Demuth JP, Hahn MW. 2009. The life and death of gene families.

BioEssays 31:29–39.

Drosophila 12 Genomes Consortium. 2007. Evolution of genes and ge-

nomes on the Drosophila phylogeny. Nature 450:203–218.

Emerson JJ, Cardoso-Moreira M, Borevitz JO, Long M. 2008. Natural

selection shapes genome-wide patterns of copy-number polymorphism in Drosophila melanogaster. Science 320:1629–1631.


Enright AJ, Van Dongen S, Ouzounis CA. 2002. An efﬁcient algorithm for

large-scale detection of protein families. Nucleic Acids Res. 30:

1575–1584.

Felsenstein J. 1973. Maximum likelihood and minimum-steps methods

for estimating evolutionary trees from data on discrete characters.

Syst Biol. 22:240–249.

Felsenstein J. 1981. Evolutionary trees from DNA sequences: a maximum

likelihood approach. J Mol Evol. 17:368–376.

Flicek P, Amode MR, Barrell D, et al. (57 co-authors). 2012. Ensembl

2012. Nucleic Acids Res. 40:D84–D90.

Floudas D, Binder M, Riley R, et al. (71 co-authors). 2012. The paleozoic

origin of enzymatic lignin decomposition reconstructed from 31

fungal genomes. Science 336:1715–1719.

Gibbs RA, Rogers J, Katze M, et al. (176 co-authors). 2007. Evolutionary

and biomedical insights from the rhesus macaque genome. Science

316:222–234.

Gibbs RA, Weinstock GM, Metzker ML, et al. (241 co-authors). 2004.

Genome sequence of the Brown Norway rat yields insights into

mammalian evolution. Nature 428:493–521.

Hahn MW, De Bie T, Stajich JE, Nguyen C, Cristianini N. 2005. Estimating

the tempo and mode of gene family evolution from comparative

genomic data. Genome Res. 15:1153–1160.

Hahn MW, Demuth JP, Han S-G. 2007. Accelerated rate of gene gain and

loss in primates. Genetics 177:1941–1949.

Hahn MW, Han MV, Han S-G. 2007. Gene family evolution across 12

Drosophila genomes. PLoS Genet. 3:e197.

Heid IM, Lamina C, Küchenhoff H, et al. (18 co-authors). 2008.

Estimating the single nucleotide polymorphism genotype misclassi-

ﬁcation from routine double measurements in a large epidemiologic

sample. Am J Epidemiol. 168:878–889.

Holt RA, Subramanian GM, Halpern A, et al. (123 co-authors). 2002. The

genome sequence of the malaria mosquito Anopheles gambiae.

Science 298:129–149.

Hubisz MJ, Lin MF, Kellis M, Siepel A. 2011. Error and error mitigation in

low-coverage genome assemblies. PLoS One 6:e17034.

Kidd JM, Cooper GM, Donahue WF, et al. (46 co-authors). 2008.

Mapping and sequencing of structural variation from eight

human genomes. Nature 453:56–64.

Librado P, Vieira FG, Rozas J. 2012. BadiRate: estimating family turnover

rates by likelihood-based methods. Bioinformatics 28:279–281.

Li R, Fan W, Tian G, et al. (123 co-authors). 2009. The sequence and de

novo assembly of the giant panda genome. Nature 463:311–317.

Liu L, Yu L, Kalavacharla V, Liu Z. 2011. A Bayesian model for gene family

evolution. BMC Bioinformatics 12:426.

Marques-Bonet T, Kidd JM, Ventura M, et al. (20 co-authors). 2009.

A burst of segmental duplications in the genome of the African

great ape ancestor. Nature 457:877–881.

Martin F, Aerts A, Ahren D, et al. (68 co-authors). 2008. The genome of

Laccaria bicolor provides insights into mycorrhizal symbiosis. Nature

452:88–92.

Nei M, Rooney AP. 2005. Concerted and birth-and-death evolution of

multigene families. Annu Rev Genet. 39:121–152.

Ohm RA, Feau N, Henrissat B, et al. (28 co-authors). 2012. Diverse

lifestyles and strategies of plant pathogenesis encoded in the

genomes of eighteen Dothideomycetes fungi. PLoS Pathog. 8:

e1003037.

Qiu Q, Zhang GJ, Ma T, et al. (47 co-authors). 2012. The yak genome and

adaptation to life at high altitude. Nat Genet. 44:946–949.

Rasmussen MD, Kellis M. 2011. A Bayesian approach for fast and accurate gene tree reconstruction. Mol Biol Evol. 28:273–290.

Sackton TB, Lazzaro BP, Schlenke TA, Evans JD, Hultmark D, Clark AG.

2007. Dynamic evolution of the innate immune system in

Drosophila. Nat Genet. 39:1461–1468.

Schrider DR, Stevens KA, Cardeño CM, Langley CH, Hahn MW. 2011.

Genome-wide analysis of retrogene polymorphisms in Drosophila

melanogaster. Genome Res. 21:2087–2095.

Sebat J, Lakshmi B, Troge J, et al. (21 co-authors). 2004. Large-scale

copy number polymorphism in the human genome. Science 305:

525–528.

Sharpton TJ, Stajich JE, Rounsley SD, et al. (24 co-authors). 2009.

Comparative genomic analyses of the human fungal pathogens

Coccidioides and their relatives. Genome Res. 19:1722–1731.

Stark A, Lin MF, Kheradpour P, et al. (46 co-authors). 2007. Discovery of

functional elements in 12 Drosophila genomes using evolutionary

signatures. Nature 450:219–232.

1997

CAFE 3 .doi:10.1093/molbev/mst100 MBE

by guest on December 31, 2015http://mbe.oxfordjournals.org/Downloaded from

A chromosome-level Pinellia ternata genome assembly provides insight into the evolutionary origin of ephedrine and acrid raphide formation

Article

Jan 2024

NestedBD: Bayesian inference of phylogenetic trees from single-cell copy number profiles under a birth-death model

Article

Full-text available

Apr 2024
ALGORITHM MOL BIOL

Copy number aberrations (CNAs) are ubiquitous in many types of cancer. Inferring CNAs from cancer genomic data could help shed light on the initiation, progression, and potential treatment of cancer. While such data have traditionally been available via “bulk sequencing,” the more recently introduced techniques for single-cell DNA sequencing (scDNAseq) provide the type of data that makes CNA inference possible at the single-cell resolution. We introduce a new birth-death evolutionary model of CNAs and a Bayesian method, NestedBD, for the inference of evolutionary trees (topologies and branch lengths with relative mutation rates) from single-cell data. We evaluated NestedBD’s performance using simulated data sets, benchmarking its accuracy against traditional phylogenetic tools as well as state-of-the-art methods. The results show that NestedBD infers more accurate topologies and branch lengths, and that the birth-death model can improve the accuracy of copy number estimation. And when applied to biological data sets, NestedBD infers plausible evolutionary histories of two colorectal cancer samples. NestedBD is available at https://github.com/Androstane/NestedBD.

The near‐complete genome assembly of Reynoutria multiflora reveals the genetic basis of stilbenes and anthraquinones biosynthesis

Article

Full-text available

Mar 2024

Reynoutria multiflora is a widely used medicinal plant in China. Its medicinal compounds are mainly stilbenes and anthraquinones which possess important pharmacological activities in anti‐aging, anti‐inflammatory and anti‐oxidation, but their biosynthetic pathways are still largely unresolved. Here, we reported a near‐complete genome assembly of R. multiflora consisting of 1.39 Gb with a contig N50 of 122.91 Mb and only one gap left. Genome evolution analysis revealed that two recent bursts of long terminal repeats (LTRs) contributed significantly to the increased genome size of R. multiflora , and numerous large chromosome rearrangements were observed between R. multiflora and Fagopyrum tataricum genomes. Comparative genomics analysis revealed that a recent whole‐genome duplication specific to Polygonaceae led to a significant expansion of gene families associated with disease tolerance and the biosynthesis of stilbenes and anthraquinones in R. multiflora . Combining transcriptomic and metabolomic analyses, we elucidated the molecular mechanisms underlying the dynamic changes in content of medicinal ingredients in R. multiflora roots across different growth years. Additionally, we identified several putative key genes responsible for anthraquinone and stilbene biosynthesis. We identified a stilbene synthase gene PM0G05131 highly expressed in roost, which may exhibit an important role in the accumulation of stilbenes in R. multiflora . These genomic data will expedite the discovery of anthraquinone and stilbenes biosynthesis pathways in medicinal plants.

Genome assembly of the southern pine beetle (Dendroctonus frontalis Zimmerman) reveal the origins of gene content reduction in Dendroctonus

Preprint

Full-text available

May 2024

Dendroctonus frontalis , also known as southern pine beetle (SPB), represents the most damaging forest pest in the southeastern United States. Strategies to predict, monitor and suppress SPB outbreaks have had limited success. Genomic data are critical to inform on pest biology and to identify molecular targets to develop improved management approaches. Here, we produced a chromosome-level genome assembly of SPB using long-read sequencing data. Synteny analyses confirmed the conservation of the core coleopteran Stevens elements and validated the bona fide SPB X chromosome. Transcriptomic data were used to obtain 39,588 transcripts corresponding to 13,354 putative protein-coding loci. Comparative analyses of gene content across 14 beetle and 3 other insects revealed several losses of conserved genes in the Dendroctonus clade and gene gains in SPB and Dendroctonus that were enriched for loci encoding membrane proteins and extracellular matrix proteins. While lineage-specific gene losses contributed to the gene content reduction observed in Dendroctonus , we also showed that widespread misannotation of transposable elements represents a major cause of the apparent gene expansion in several non- Dendroctonus species. Our findings uncovered distinctive features of the SPB gene complement and disentangled the role of biological and annotation-related factors contributing to gene content variation across beetles.

A high-quality chromosome-scale genome assembly of blood orange, an important pigmented sweet orange variety

Article

Full-text available

May 2024

Blood orange (BO) is a rare red-fleshed sweet orange (SWO) with a high anthocyanin content and is associated with numerous health-related benefits. Here, we reported a high-quality chromosome-scale genome assembly for Neixiu (NX) BO, reaching 336.63 Mb in length with contig and scaffold N50 values of 30.6 Mb. Furthermore, 96% of the assembled sequences were successfully anchored to 9 pseudo-chromosomes. The genome assembly also revealed the presence of 37.87% transposon elements and 7.64% tandem repeats, and the annotation of 30,395 protein-coding genes. A high level of genome synteny was observed between BO and SWO, further supporting their genetic similarity. The speciation event that gave rise to the Citrus species predated the duplication event found within them. The genome-wide variation between NX and SWO was also compared. This first high-quality BO genome will serve as a fundamental basis for future studies on functional genomics and genome evolution.

Chromosome-level genome assembly of Pedicularis kansuensis illuminates genome evolution of facultative parasitic plant

May 2024
MOL ECOL RESOUR

Parasitic plants have a heterotrophic lifestyle, in which they withdraw all or part of their nutrients from their host through the haustorium. Despite the release of many draft genomes of parasitic plants, the genome evolution related to the parasitism feature of facultative parasites remains largely unknown. In this study, we present a high‐quality chromosomal‐level genome assembly for the facultative parasite Pedicularis kansuensis (Orobanchaceae), which invades both legume and grass host species in degraded grasslands on the Qinghai‐Tibet Plateau. This species has the largest genome size compared with other parasitic species, and expansions of long terminal repeat retrotransposons accounting for 62.37% of the assembly greatly contributed to the genome size expansion of this species. A total of 42,782 genes were annotated, and the patterns of gene loss in P. kansuensis differed from other parasitic species. We also found many mobile mRNAs between P. kansuensis and one of its host species, but these mobile mRNAs could not compensate for the functional losses of missing genes in P. kansuensis. In addition, we identified nine horizontal gene transfer (HGT) events from rosids and monocots, as well as one single‐gene duplication event from HGT genes, which differ distinctly from those of other parasitic species. Furthermore, we found evidence for HGT through transferring genomic fragments from phylogenetically remote host species. Taken together, these findings provide genomic insights into the evolution of facultative parasites and broaden our understanding of the diversified genome evolution in parasitic plants and the molecular mechanisms of plant parasitism.

Multi-omics analyses provide insights into the evolutionary history and the synthesis of medicinal components of the Chinese wingnut

Apr 2024

Comparative secretome analysis of Striga and Cuscuta species identifies candidate virulence factors for two evolutionarily independent parasitic plant lineages

Apr 2024
BMC PLANT BIOL

Background: Many parasitic plants of the genera Striga and Cuscuta inflict huge agricultural damage worldwide. To form and maintain a connection with a host plant, parasitic plants deploy virulence factors (VFs) that interact with host biology. They possess a secretome that represents the complement of proteins secreted from cells and like other plant parasites such as fungi, bacteria or nematodes, some secreted proteins represent VFs crucial to successful host colonisation. Understanding the genome-wide complement of putative secreted proteins from parasitic plants, and their expression during host invasion, will advance understanding of virulence mechanisms used by parasitic plants to suppress/evade host immune responses and to establish and maintain a parasite-host interaction. Results: We conducted a comparative analysis of the secretomes of root (Striga spp.) and shoot (Cuscuta spp.) parasitic plants, to enable prediction of candidate VFs. Using orthogroup clustering and protein domain analyses we identified gene families/functional annotations common to both Striga and Cuscuta species that were not present in their closest non-parasitic relatives (e.g. strictosidine synthase like enzymes), or specific to either the Striga or Cuscuta secretomes. For example, Striga secretomes were strongly associated with ‘PAR1’ protein domains. These were rare in the Cuscuta secretomes but an abundance of ‘GMC oxidoreductase’ domains were found, that were not present in the Striga secretomes. We then conducted transcriptional profiling of genes encoding putatively secreted proteins for the most agriculturally damaging root parasitic weed of cereals, S. hermonthica. A significant portion of the Striga-specific secretome set was differentially expressed during parasitism, which we probed further to identify genes following a ‘wave-like’ expression pattern peaking in the early penetration stage of infection. We identified 39 genes encoding putative VFs with functions such as cell wall modification, immune suppression, protease, kinase, or peroxidase activities, that are excellent candidates for future functional studies. Conclusions: Our study represents a comprehensive secretome analysis among parasitic plants and revealed both similarities and differences in candidate VFs between Striga and Cuscuta species. This knowledge is crucial for the development of new management strategies and delaying the evolution of virulence in parasitic weeds.

Haplotype-resolved genome of Mimosa bimucronata revealed insights into leaf movement and nitrogen fixation

Apr 2024
BMC GENOMICS

Background: Mimosa bimucronata originates from tropical America and exhibits distinctive leaf movement characterized by a relatively slow speed. Additionally, this species possesses the ability to fix nitrogen. Despite these intriguing traits, comprehensive studies have been hindered by the lack of genomic resources for M. bimucronata. Results: To unravel the intricacies of leaf movement and nitrogen fixation, we successfully assembled a high-quality, haplotype-resolved, reference genome at the chromosome level, spanning 648 Mb and anchored in 13 pseudochromosomes. A total of 32,146 protein-coding genes were annotated. In particular, haplotype A was annotated with 31,035 protein-coding genes, and haplotype B with 31,440 protein-coding genes. Structural variations (SVs) and allele specific expression (ASE) analyses uncovered the potential role of structural variants in leaf movement and nitrogen fixation in M. bimucronata. Two whole-genome duplication (WGD) events were detected, that occurred ~ 2.9 and ~ 73.5 million years ago. Transcriptome and co-expression network analyses revealed the involvement of aquaporins (AQPs) and Ca²⁺-related ion channel genes in leaf movement. Moreover, we also identified nodulation-related genes and analyzed the structure and evolution of the key gene NIN in the process of symbiotic nitrogen fixation (SNF). Conclusion: The detailed comparative genomic and transcriptomic analyses provided insights into the mechanisms governing leaf movement and nitrogen fixation in M. bimucronata. This research yielded genomic resources and provided an important reference for functional genomic studies of M. bimucronata and other legume species.

Genomic signatures of exceptional longevity and negligible aging in the long-lived red sea urchin

Apr 2024

Rat Genome Sequencing Consortium. Genome sequence of the Brown Norway rat yields insights into mammalian evolution

Evolution of pathogenicity and sexual reproduction in eight Candida genomes

Jan 2009

Candida species are the most common cause of opportunistic fungal infection worldwide. Here we report the genome sequences of six Candida species and compare these and related pathogens and non-pathogens. There are significant expansions of cell wall, secreted and transporter gene families in pathogenic species, suggesting adaptations associated with virulence. Large genomic tracts are homozygous in three diploid species, possibly resulting from recent recombination events. Surprisingly, key components of the mating and meiosis pathways are missing from several species. These include major differences at the mating-type loci (MTL); Lodderomyces elongisporus lacks MTL, and components of the a1/a2 cell identity determinant were lost in other species, raising questions about how mating and cell types are controlled. Analysis of the CUG leucine-to-serine genetic-code change reveals that 99% of ancestral CUG codons were erased and new ones arose elsewhere. Lastly, we revise the Candida albicans gene catalogue, identifying many new genes.

Measurement Error: Models, Methods and Applications

Mar 2010

John Buonaccorsi

Over the last 20 years, comprehensive strategies for treating measurement error in complex models and accounting for the use of extra data to estimate measurement error parameters have emerged. Focusing on both established and novel approaches, Measurement Error: Models, Methods, and Applications provides an overview of the main techniques and illustrates their application in various models. It describes the impacts of measurement errors on naive analyses that ignore them and presents ways to correct for them across a variety of statistical models, from simple one-sample problems to regression models to more complex mixed and time series models. The book covers correction methods based on known measurement error parameters, replication, internal or external validation data, and, for some models, instrumental variables. It emphasizes the use of several relatively simple methods, moment corrections, regression calibration, simulation extrapolation (SIMEX), modified estimating equation methods, and likelihood techniques. The author uses SAS-IML and Stata to implement many of the techniques in the examples. Accessible to a broad audience, this book explains how to model measurement error, the effects of ignoring it, and how to correct for it. More applied than most books on measurement error, it describes basic models and methods, their uses in a range of application areas, and the associated terminology.

Diverse Lifestyles and Strategies of Plant Pathogenesis Encoded in the Genomes of Eighteen Dothideomycetes Fungi

Dec 2012
PLOS PATHOG

The class Dothideomycetes is one of the largest groups of fungi with a high level of ecological diversity including many plant pathogens infecting a broad range of hosts. Here, we compare genome features of 18 members of this class, including 6 necrotrophs, 9 (hemi)biotrophs and 3 saprotrophs, to analyze genome structure, evolution, and the diverse strategies of pathogenesis. The Dothideomycetes most likely evolved from a common ancestor more than 280 million years ago. The 18 genome sequences differ dramatically in size due to variation in repetitive content, but show much less variation in number of (core) genes. Gene order appears to have been rearranged mostly within chromosomal boundaries by multiple inversions, in extant genomes frequently demarcated by adjacent simple repeats. Several Dothideomycetes contain one or more gene-poor, transposable element (TE)-rich putatively dispensable chromosomes of unknown function. The 18 Dothideomycetes offer an extensive catalogue of genes involved in cellulose degradation, proteolysis, secondary metabolism, and cysteine-rich small secreted proteins. Ancestors of the two major orders of plant pathogens in the Dothideomycetes, the Capnodiales and Pleosporales, may have had different modes of pathogenesis, with the former having fewer of these genes than the latter. Many of these genes are enriched in proximity to transposable elements, suggesting faster evolution because of the effects of repeat induced point (RIP) mutations. A syntenic block of genes, including oxidoreductases, is conserved in most Dothideomycetes and upregulated during infection in L. maculans, suggesting a possible function in response to oxidative stress.

Fine-Scale Mapping and Sequencing of Structural Variation from Eight Human Genomes

May 2008

Genetic variation among individual humans occurs on many different scales, ranging from gross alterations in the human karyotype to single nucleotide changes. Here we explore variation on an intermediate scale—particularly insertions, deletions and inversions affecting from a few thousand to a few million base pairs. We employed a clone-based method to interrogate this intermediate structural variation in eight individuals of diverse geographic ancestry. Our analysis provides a comprehensive overview of the normal pattern of structural variation present in these genomes, refining the location of 1,695 structural variants. We find that 50% were seen in more than one individual and that nearly half lay outside regions of the genome previously described as structurally variant. We discover 525 new insertion sequences that are not present in the human reference genome and show that many of these are variable in copy number between individuals. Complete sequencing of 261 structural variants reveals considerable locus complexity and provides insights into the different mutational processes that have shaped the human genome. These data provide the first high-resolution sequence map of human structural variation—a standard for genotyping platforms and a prelude to future individual genome sequencing projects.

The sequence and de novo assembly of the giant panda genome

Dec 2009

Using next-generation sequencing technology alone, we have successfully generated and assembled a draft sequence of the giant panda genome. The assembled contigs (2.25 gigabases (Gb)) cover approximately 94% of the whole genome, and the remaining gaps (0.05 Gb) seem to contain carnivore-specific repeats and tandem repeats. Comparisons with the dog and human showed that the panda genome has a lower divergence rate. The assessment of panda genes potentially underlying some of its unique traits indicated that its bamboo diet might be more dependent on its gut microbiome than its own genetic composition. We also identified more than 2.7 million heterozygous single nucleotide polymorphisms in the diploid genome. Our data and analyses provide a foundation for promoting mammalian genetic research, and demonstrate the feasibility for using next-generation sequencing technologies for accurate, cost-effective and rapid de novo assembly of large eukaryotic genomes.

The yak genome and adaptation to life at high altitude

Jul 2012
Nat Genet

Domestic yaks (Bos grunniens) provide meat and other necessities for Tibetans living at high altitude on the Qinghai-Tibetan Plateau and in adjacent regions. Comparison between yak and the closely related low-altitude cattle (Bos taurus) is informative in studying animal adaptation to high altitude. Here, we present the draft genome sequence of a female domestic yak generated using Illumina-based technology at 65-fold coverage. Genomic comparisons between yak and cattle identify an expansion in yak of gene families related to sensory perception and energy metabolism, as well as an enrichment of protein domains involved in sensing the extracellular environment and hypoxic stress. Positively selected and rapidly evolving genes in the yak lineage are also found to be significantly enriched in functional categories and pathways related to hypoxia and nutrition metabolism. These findings may have important implications for understanding adaptation to high altitude in other animal species and for hypoxia-related diseases in humans.

The Paleozoic Origin of Enzymatic Lignin Decomposition Reconstructed from 31 Fungal Genomes

Jun 2012
SCIENCE

Wood is a major pool of organic carbon that is highly resistant to decay, owing largely to the presence of lignin. The only organisms capable of substantial lignin decay are white rot fungi in the Agaricomycetes, which also contains non–lignin-degrading brown rot and ectomycorrhizal species. Comparative analyses of 31 fungal genomes (12 generated for this study) suggest that lignin-degrading peroxidases expanded in the lineage leading to the ancestor of the Agaricomycetes, which is reconstructed as a white rot species, and then contracted in parallel lineages leading to brown rot and mycorrhizal species. Molecular clock analyses suggest that the origin of lignin degradation might have coincided with the sharp decrease in the rate of organic carbon burial around the end of the Carboniferous period.

Comparative genomic analyses of the human fungal pathogens Coccidioides and their relatives

Jan 2009

Maximum Likelihood and Minimum-Steps Methods for Estimating Evolutionary Trees from Data on Discrete Characters

Sep 1973
Syst Zool

Joseph Felsenstein

The general maximum likelihood approach to the statistical estimation of phylogenies is outlined, for data in which there are a number of discrete states for each character. The details of the maximum likelihood method will depend on the details of the probabilistic model of evolution assumed. There are a very large number of possible models of evolution. For a few of the simpler models, the calculation of the likelihood of an evolutionary tree is outlined. For these models, the maximum likelihood tree will be the same as the “most parsimonious” (or minimum-steps) tree if the probability of change during the evolution of the group is assumed a priori to be very small. However, most sets of data require too many assumed state changes per character to be compatible with this assumption. Farris (1973) has argued that maximum likelihood and parsimony methods are identical under a much less restrictive set of assumptions. It is argued that the present methods are preferable to his, and a counterexample to his argument is presented. An algorithm which enables rapid calculation of the likelihood of a phylogeny is described.
