Estimating Gene Gain and Loss Rates in the Presence of Error in Genome Assembly and Annotation Using CAFE 3

Abstract

Current sequencing methods produce large amounts of data, but genome assemblies constructed from these data are often fragmented and incomplete. Incomplete and error-filled assemblies result in many annotation errors, especially in the number of genes present in a genome. This means that methods attempting to estimate rates of gene duplication and loss will often be misled by such errors, and that rates of gene family evolution will be consistently overestimated. Here we present a method that takes these errors into account, allowing one to accurately infer rates of gene gain and loss among genomes even with low assembly and annotation quality. The method is implemented in the newest version of the software package CAFE, along with several other novel features. We demonstrate the accuracy of the method with extensive simulations, and re-analyze several previously published datasets. Our results show that errors in genome annotation do lead to higher inferred rates of gene gain and loss, but that CAFE 3 sufficiently accounts for these errors in order to provide accurate estimates of important evolutionary parameters.
Mira V. Han,*,1,2 Gregg W.C. Thomas,2 Jose Lugo-Martinez,2 and Matthew W. Hahn*,2,3
1 National Evolutionary Synthesis Center, Durham, North Carolina
2 School of Informatics and Computing, Indiana University
3 Department of Biology, Indiana University
*Corresponding authors: E-mail: mira.han@nescent.org; mwh@indiana.edu.
Associate editor: Sudhir Kumar
Key words: duplication, gene family, adaptive evolution.
Introduction
Genome sequencing projects have revealed large and frequent changes between species in the size of gene families (e.g., Gibbs et al. 2004, 2007; Drosophila 12 Genomes Consortium 2007; Li et al. 2009; Floudas et al. 2012). These changes may underlie many important morphological, physiological, and behavioral differences between species and contribute much of the genetic and genomic diversity observed in nature (reviewed in Demuth and Hahn 2009). Recent work on diversity within species has also revealed surprising numbers of polymorphic gene duplications and losses (e.g., Sebat et al. 2004; Emerson et al. 2008; Kidd et al. 2008; Schrider et al. 2011), variation that contributes to long-term differences in the size of gene families between species. To further understand the importance of these changes, researchers must be able to accurately estimate the rate at which gene families evolve over time.
Our previous approach to estimating this rate modeled the gain and loss of genes within a gene family using a birth-and-death stochastic process (Hahn et al. 2005). (This probability distribution should not be confused with the birth-and-death conceptual model of gene family evolution of Nei and Rooney [2005].) Given input data on the size of gene families across multiple species and an ultrametric phylogenetic tree describing relationships among these species, the original CAFE software package (De Bie et al. 2006) can estimate the maximum likelihood value of the rate parameter, λ, and the maximum likelihood estimates (MLEs) of the size of each gene family at ancestral nodes of the tree. These MLEs can then be used to infer expansions and contractions of individual gene families on any lineage (e.g., Demuth et al. 2006). Updated versions of this software (CAFE 2; Hahn, Demuth, et al. 2007; Hahn, Han, et al. 2007) allowed for separate λ values on different branches of the tree, as well as several other novel features. A number of other programs using the birth-and-death model (or related models) have also appeared and offer similar as well as additional features (e.g., Liu et al. 2011; Ames et al. 2012; Librado et al. 2012).
A major concern when studying changes in gene family size is the quality of the underlying genome assembly and genome annotation. Low sequencing coverage in genome assemblies can lead to both the erroneous addition and subtraction of genes. Genes can be missing because there is incomplete coverage of the entire genome, with whole or parts of genes falling in unassembled portions of the genome; genes can also be missing because base-calling errors mistakenly indicate frameshifts or nonsense mutations (e.g., Hubisz et al. 2011). Extra gene copies can be inserted into the assembly if allelic diversity is incorrectly assembled as duplicated loci (e.g., Holt et al. 2002; Colbourne et al. 2011) or if a single multiexon gene is split among multiple scaffolds or contigs, in which case multiple gene models may be predicted from a single gene (e.g., Colbourne et al. 2011). Similar problems can arise even in “finished” genomes (such as Drosophila melanogaster), as gene annotation software can often miss short open-reading frames or can cleave a single gene into multiple predicted genes (e.g., Hahn, Han, et al. 2007; Stark et al. 2007; Costello et al. 2008).
For studies focusing on gene family size change, errors in genome assembly and annotation will result in biased estimates of the rate of change. Because a higher rate of evolution must be proposed in the presence of errors, whether additional or missing genes, estimates that do not account for errors are likely to have been upwardly biased. Indeed, we have previously found that Drosophila species represented by the lowest quality assemblies in comparative analyses using CAFE also appear to evolve at the highest rates (Hahn, Han, et al. 2007). Therefore, to estimate accurate rates of gene family evolution, we must be able to account for the error present in all current genome annotations. In this article, we present one such method and implement it in a new version of CAFE. Our method accounts for errors in gene family sizes by explicitly modeling the uncertainty associated with observed family sizes at the tips of a tree. We show that, given a known error distribution for each genome, we can recover accurate estimates of the true rate of gene family evolution. In addition, we present multiple methods for estimating error rates from the data when they are not known in advance and show how these can be used to provide more accurate values of evolutionary parameters.
© The Author 2013. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
Mol. Biol. Evol. 30(8):1987–1997 doi:10.1093/molbev/mst100 Advance Access publication May 24, 2013
Downloaded from http://mbe.oxfordjournals.org/ by guest on December 31, 2015
New Approaches
We assume a random variable X that is a true count of homologous members of a gene family within a single lineage. In theory, X can be from 0 to infinite size, but for ease of computation, we limit it to be at most M. A whole genome can then be thought of as a random sample of size N, where each gene family within a genome corresponds to each observation, and N is the total number of gene families found in the genome. Each gene family size in a genome is assumed to be independent and identically distributed.
We also consider the error-prone measure of gene family size W, W = w (w = 0, 1, 2, 3, ..., M). W represents our observation for each gene family size that is affected by errors in the genome assembly and errors in the gene annotations. The measurement-error model, which describes the behavior of W given X = x, is specified by the error probabilities:
\[ \theta_{w|x} = P(W = w \mid X = x), \]
that is, the probability of observing w when the true gene family size is x. The error probabilities can be represented as a matrix, Θ:
\[ \Theta = \begin{bmatrix} \theta_{1|1} & \cdots & \theta_{1|M} \\ \vdots & \ddots & \vdots \\ \theta_{M|1} & \cdots & \theta_{M|M} \end{bmatrix}, \]
where the item in the ith row and jth column represents the probability of observing i when the true gene family size is j. Note that the rows of the matrix do not have to sum to 1 but the columns do. We also define the probability π_x as:
\[ \pi_x = P(X = x), \]
that is, the probability of a true gene family size of x found in the genome. The lowercase π denotes the discrete probability distribution π_x for x = 0 ... M.
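As an illustration of the error matrix just defined, the following minimal Python sketch builds a matrix whose entry [w][x] holds P(W = w | X = x) from a distribution of size differences. The function names, the indexing over sizes 0..M, and the boundary rule (probability mass that would leave the range 0..M stays on the diagonal) are assumptions made for this example and are not taken from CAFE.

```python
def banded_error_matrix(max_size, diff_probs):
    """Return theta with theta[w][x] = P(W = w | X = x) for sizes 0..max_size.

    diff_probs maps an (observed - true) difference, e.g. -1, 0, +1,
    to its probability; the values are expected to sum to 1.
    """
    size = max_size + 1
    theta = [[0.0] * size for _ in range(size)]
    for x in range(size):
        for diff, prob in diff_probs.items():
            w = x + diff
            if 0 <= w <= max_size:
                theta[w][x] += prob
        # Assumption: mass that would fall outside 0..max_size is
        # returned to the diagonal, so every column still sums to 1.
        column_sum = sum(theta[w][x] for w in range(size))
        theta[x][x] += 1.0 - column_sum
    return theta


def observed_distribution(theta, pi):
    """P(W = w) = sum over x of theta[w][x] * pi[x]."""
    return [sum(row[x] * pi[x] for x in range(len(pi))) for row in theta]


# Asymmetric example: 7.5% of families lose one gene, 2.5% gain one.
theta = banded_error_matrix(5, {-1: 0.075, 0: 0.9, 1: 0.025})
column_sums = [sum(theta[w][x] for w in range(6)) for x in range(6)]
row_sums = [sum(row) for row in theta]
```

In this example every entry of `column_sums` is 1 up to rounding, whereas `row_sums` is not constant, which matches the remark that columns must sum to 1 but rows need not.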
In cases where there is no measurement error, it is known that we can estimate the rate of change in gene family size across the phylogeny by specifying a transition matrix based on the rate parameters λ and μ and fitting the model to the observed (= true) counts (X) at the tips of the tree and the time between the nodes (described by branch lengths, T). Under a birth-and-death process, the probability of going from s number of genes to c number of genes in time t is given by (Bailey 1964):
\[ P(X(t) = c \mid X(0) = s) = \sum_{j=0}^{\min(s,c)} \binom{s}{j} \binom{s+c-j-1}{s-1} \alpha^{s-j} \beta^{c-j} (1 - \alpha - \beta)^{j}, \]
\[ \alpha = \frac{\mu \left( e^{(\lambda-\mu)t} - 1 \right)}{\lambda e^{(\lambda-\mu)t} - \mu}, \qquad \beta = \frac{\lambda \left( e^{(\lambda-\mu)t} - 1 \right)}{\lambda e^{(\lambda-\mu)t} - \mu}, \]
where λ is the rate of gene gain and μ is the rate of gene loss. If the rates of gain and loss are equal, that is, λ = μ, the above probability is as given in equation 1 of Hahn et al. (2005). Here, we focus on cases with λ = μ, but the updated version of CAFE can also estimate separate rates of gain and loss (as can BadiRate; Librado et al. 2012).
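The transition probability above can be evaluated directly. The sketch below is one possible Python transcription, not CAFE's implementation; the function names are invented for the example. Two points are assumptions added here: a family of size 0 is treated as absorbing (the binomial coefficient is undefined for s = 0), and for λ = μ the expressions for α and β are replaced by their limit λt / (1 + λt), since the formulas as written give 0/0 in that case.

```python
import math


def alpha_beta(lam, mu, t):
    """Return (alpha, beta) of the birth-and-death transition probability."""
    if math.isclose(lam, mu):
        # Limit of both expressions as mu approaches lam.
        a = lam * t / (1.0 + lam * t)
        return a, a
    growth = math.exp((lam - mu) * t)
    denom = lam * growth - mu
    return mu * (growth - 1.0) / denom, lam * (growth - 1.0) / denom


def transition_prob(s, c, lam, mu, t):
    """P(X(t) = c | X(0) = s) under a linear birth-and-death process."""
    if s == 0:
        # Assumption: an extinct family stays extinct.
        return 1.0 if c == 0 else 0.0
    alpha, beta = alpha_beta(lam, mu, t)
    total = 0.0
    for j in range(min(s, c) + 1):
        total += (math.comb(s, j) * math.comb(s + c - j - 1, s - 1)
                  * alpha ** (s - j) * beta ** (c - j)
                  * (1.0 - alpha - beta) ** j)
    return total


# Distribution of family size after t = 100 time units, starting from 3 genes.
probs = [transition_prob(3, c, 0.0012, 0.0012, 100.0) for c in range(200)]
```

With equal gain and loss rates the expected family size stays at its starting value, so the mean of `probs` should be close to 3 and its entries should sum to about 1; both make convenient sanity checks.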
For multiple species, S = (1 ... s), we define a tree with ultrametric branch lengths, T, that has the set of species S as the leaves and a set of ancestral nodes, U = (1 ... u). We define X_n as the vector X_n = (X_{n1}, X_{n2}, ... X_{ns}), in which each item X_{ni} describes the size of the nth gene family in each genome of species i (i ∈ S). Similarly, Z_n = (Z_{n1}, Z_{n2}, ... Z_{nu}) is the vector in which each item Z_{nj} is the gene family size of the ancestral genome at the inner node j (j ∈ U). The actual calculation of the likelihood over the whole tree utilizes the “pruning algorithm” (Felsenstein 1973, 1981) to sum over the inner node values that we cannot observe:
\[
\begin{aligned}
\hat{\lambda}, \hat{\mu} &= \operatorname*{argmax}_{\lambda,\mu} \left( \prod_{n=1}^{N} P(X_n \mid \lambda, \mu, T) \right) \\
&= \operatorname*{argmax}_{\lambda,\mu} \left( \prod_{n=1}^{N} P(X_n \mid Z_n, \lambda, \mu, T)\, P(Z_n \mid \lambda, \mu, T) \right) \\
&= \operatorname*{argmax}_{\lambda,\mu} \left( \prod_{n=1}^{N} \sum_{z_{n1}=0}^{M} \sum_{z_{n2}=0}^{M} \cdots \sum_{z_{nu}=0}^{M} \left\{ P(X_n \mid Z_n = (z_{n1}, z_{n2} \ldots z_{nu}), \lambda, \mu, T)\, P(Z_n = (z_{n1}, z_{n2} \ldots z_{nu}) \mid \lambda, \mu, T) \right\} \right).
\end{aligned}
\]
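To make the role of the pruning algorithm concrete, here is a small Python sketch that computes the likelihood of one gene family on a tree by passing partial likelihoods from the tips to the root. It is a toy version under stated assumptions: λ = μ, a tree encoded as nested tuples of (child, branch length) pairs with leaves given as species names, and a user-supplied prior on the root size. The encoding, the function names, and the root prior are all choices made for this example, not CAFE's.

```python
import math


def transition_prob(s, c, lam, t):
    """P(X(t) = c | X(0) = s) for equal gain and loss rates (lam = mu)."""
    if s == 0:
        return 1.0 if c == 0 else 0.0
    a = lam * t / (1.0 + lam * t)  # alpha = beta when lam = mu
    return sum(math.comb(s, j) * math.comb(s + c - j - 1, s - 1)
               * a ** (s + c - 2 * j) * (1.0 - 2.0 * a) ** j
               for j in range(min(s, c) + 1))


def partial_likelihoods(node, counts, lam, max_size):
    """Return L where L[z] = P(tip counts below node | size z at node)."""
    if isinstance(node, str):
        # Leaf without measurement error: the observed count is the true one.
        return [1.0 if z == counts[node] else 0.0 for z in range(max_size + 1)]
    result = [1.0] * (max_size + 1)
    for child, branch_length in node:
        child_l = partial_likelihoods(child, counts, lam, max_size)
        for z in range(max_size + 1):
            # Sum over the unobserved size c at the child end of the branch.
            result[z] *= sum(transition_prob(z, c, lam, branch_length) * child_l[c]
                             for c in range(max_size + 1))
    return result


def family_likelihood(tree, counts, lam, max_size, root_prior):
    """Likelihood of one family: root partial likelihoods weighted by a prior."""
    root_l = partial_likelihoods(tree, counts, lam, max_size)
    return sum(p * l for p, l in zip(root_prior, root_l))


# Ultrametric three-species tree ((A:5, B:5):5, C:10), sizes capped at M = 20.
tree = (((("A", 5.0), ("B", 5.0)), 5.0), ("C", 10.0))
M = 20
uniform_prior = [0.0] + [1.0 / M] * M  # assumption: root size uniform on 1..M
lik = family_likelihood(tree, {"A": 2, "B": 1, "C": 2}, 0.01, M, uniform_prior)
```

Multiplying such per-family likelihoods over all N families (or summing their logarithms) gives the quantity that is maximized over the rate parameters.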
With error in the measurements, a naïve inference based on use of the W’s instead of the X’s leads to:
\[ (\hat{\lambda}, \hat{\mu})_W = \operatorname*{argmax}_{\lambda,\mu} \left( \prod_{n=1}^{N} P(W_n \mid \lambda, \mu, T) \right), \]
where we define the vector W_n = (W_{n1}, W_{n2}, ... W_{ns}). Similar to X_n, W_{ni} is the error-prone measurement of the gene count for the nth gene family in species i.
To account for error, we add an additional layer of uncertainty on the true value X to the values at the leaf nodes of the phylogeny. This necessitates an additional summation of likelihoods at each leaf over X, in addition to all internal nodes, Z. The only difference between the summation at the leaf nodes and the summation at the inner nodes is that the probability at the leaf nodes is defined by the error matrix, rather than following the transition probabilities derived from the rate matrix and the branch lengths:
\[
\begin{aligned}
\hat{\lambda}, \hat{\mu} &= \operatorname*{argmax}_{\lambda,\mu} \prod_{n=1}^{N} \left( \sum_{x_{n1}=0}^{M} \sum_{x_{n2}=0}^{M} \cdots \sum_{x_{ns}=0}^{M} \left\{ P(W_n \mid X_n = (x_{n1}, x_{n2}, \ldots x_{ns}))\, P(X_n, \lambda, \mu, T) \right\} \right) \\
&= \operatorname*{argmax}_{\lambda,\mu} \prod_{n=1}^{N} \left( \sum_{x_{n1}=0}^{M} \sum_{x_{n2}=0}^{M} \cdots \sum_{x_{ns}=0}^{M} \left\{ P(W_{n1} \mid X_{n1} = x_{n1})\, P(W_{n2} \mid X_{n2} = x_{n2}) \cdots P(W_{ns} \mid X_{ns} = x_{ns})\, P(X_n, \lambda, \mu, T) \right\} \right).
\end{aligned}
\]
The probability P(W_{ni} = w_{ni} | X_{ni} = x_{ni}) follows from the error matrix, θ_{w_{ni}|x_{ni}}.
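The extra summation over true leaf values can be written out by brute force for a small example. The Python sketch below sums, over every possible vector of true counts, the product of the per-species error probabilities and the probability of that vector. Here `joint_true_prob` is a hypothetical placeholder for P(X_n, λ, μ, T), which in practice would come from the phylogenetic model; the ±1 error matrix, its boundary rule, and all function names are likewise assumptions for illustration.

```python
import itertools


def plus_minus_one_matrix(max_size, eps):
    """theta[w][x] = P(W = w | X = x) with errors of +1 or -1 only."""
    size = max_size + 1
    theta = [[0.0] * size for _ in range(size)]
    for x in range(size):
        for w in (x - 1, x + 1):
            if 0 <= w <= max_size:
                theta[w][x] = eps / 2.0
        # Assumption: mass that would leave 0..max_size stays on the diagonal.
        theta[x][x] = 1.0 - sum(theta[w][x] for w in range(size))
    return theta


def likelihood_with_error(observed, thetas, joint_true_prob, max_size):
    """Sum over true counts x of prod_i theta_i[w_i][x_i] * P(x).

    observed        : tuple of observed sizes W_n1..W_ns
    thetas          : one error matrix per species
    joint_true_prob : hypothetical callable returning P(X_n, lambda, mu, T)
    """
    total = 0.0
    for true_counts in itertools.product(range(max_size + 1), repeat=len(observed)):
        error_term = 1.0
        for w, x, theta in zip(observed, true_counts, thetas):
            error_term *= theta[w][x]  # P(W_ni = w | X_ni = x)
        if error_term > 0.0:
            total += error_term * joint_true_prob(true_counts)
    return total
```

This enumeration grows as (M + 1)^s and is only meant to mirror the equation. Within a pruning scheme the same result would presumably be obtained more cheaply by initializing each leaf with the corresponding row of its error matrix rather than with an indicator vector.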
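Because we do not know the error matrix Θ, it becomes an additional set of parameters we need to estimate:
\[ (\hat{\lambda}, \hat{\mu}, \hat{\Theta}) = \operatorname*{argmax}_{\lambda,\mu,\Theta} \prod_{n=1}^{N} \left( \sum_{X}^{M} \left\{ \theta_{W_{n1}|X_{n1}}\, \theta_{W_{n2}|X_{n2}} \cdots \theta_{W_{ns}|X_{ns}}\, P(X_n, \lambda, \mu, T) \right\} \right) \]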
When the error probabilities are unknown, it is theoretically possible to estimate the whole set of parameters including the error matrix using maximum likelihood, but in practice the number of parameters to be estimated is too large unless the number of samples is extremely large. For example, even if we assume that the error matrix is the same across all families and all species, the number of parameters to be estimated is 2 + M^2; that is, the two rate parameters plus the entries of a full error matrix, where M is the maximum possible size of a family. Instead, here we focus on cases where we have some information about the distribution of errors affecting measurement. In practice, this means that rather than estimating the joint distribution of λ and Θ, we estimate λ using external information about the distribution of error. If we assume a highly simplified error model, we can also estimate the error matrix using a pseudomaximum likelihood approach (Buonaccorsi 2010). Later, we present extensive simulation results that suggest that our method provides accurate estimates of all parameters.
Results and Discussion
The Effect of Error on Inferred Rates of Gene Family Evolution
To examine the effect of error in the gene family size taken from suboptimal genome annotations, we simulated gene families under a model with known error. These data were simulated using the phylogeny of 12 Drosophila species (supplementary fig. S1, Supplementary Material online) and the distribution of sizes among 11,434 gene families previously analyzed in these species (Hahn, Han, et al. 2007), with a true rate parameter of λ = μ = 0.0012. A simulation consists of generating a data set using CAFE’s genfamily command and adding error to the data set as specified. In the simplest simulations, a known amount of error (ε) was added to each data set by randomly adding or subtracting genes from a proportion ε of the gene families, with all gene family sizes having the same error distribution (i.e., the same error matrix). Error can be added to all species or to a subset of species, effectively modeling heterogeneous assembly and annotation quality among genomes. The error distributions added to our simulated data were either ε = 0.1 or ε = 0.4 and consisted of an addition or subtraction of one to three genes in a family per species independently (fig. 1). An error value of ε = 0.1 means that in 90% of gene families, the observed size is equal to the true size, whereas in 10% of gene families, the observed size is either too large or too small (fig. 1A, C, and E). These error distributions approximate the range and distributions of error that we observe in several published genomes (see later).
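The error-adding step of these simulations can be sketched in a few lines of Python. This is an illustrative stand-in rather than the script used in the study: the function name, the dictionary of difference weights, and the rule that sizes are floored at zero are assumptions made for the example.

```python
import random
import statistics


def add_error(true_sizes, eps, diff_weights, rng):
    """Return observed sizes in which a proportion eps of families is altered.

    diff_weights maps each possible change (e.g. -1, +1) to its relative
    weight among the erroneous families.
    """
    diffs = list(diff_weights)
    weights = [diff_weights[d] for d in diffs]
    observed = []
    for size in true_sizes:
        if rng.random() < eps:
            change = rng.choices(diffs, weights=weights)[0]
            # Assumption: a family cannot have a negative size.
            size = max(0, size + change)
        observed.append(size)
    return observed


rng = random.Random(1)  # fixed seed so the example is reproducible
true_sizes = [5] * 20000
observed = add_error(true_sizes, 0.1, {-1: 0.5, 1: 0.5}, rng)
fraction_changed = sum(t != o for t, o in zip(true_sizes, observed)) / len(true_sizes)
variance_increase = statistics.pvariance(observed) - statistics.pvariance(true_sizes)
```

A symmetric set of weights such as `{-1: 0.5, 1: 0.5}` leaves the mean size roughly unchanged while inflating the variance, which is the effect the inference has to absorb; skewed weights such as `{-1: 0.75, 1: 0.25}` give an asymmetric error distribution.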
To first assess the effect of error on inferred rates of gene family evolution, CAFE 3 was run on each simulated data set with standard settings, that is, with no error model incorporated. Estimating λ from these error-prone data sets gave values of 0.0027 and 0.0085 when adding ε = 0.1 and ε = 0.4, respectively (table 1). As expected, the more error contained in each data set, the higher the value of λ we inferred, because higher rates of gene family evolution must be proposed to account for greater disparities in gene family size. Even when only 10% of families have an incorrect size (in each of the 12 genomes), the rate of gene family evolution is more than twice its true value (λ = 0.0012). Although adding symmetric error does not change the mean family size across species, it does change the variance in size: from a variance equal to 0.519 in the true data (mean = 1.097), adding ε = 0.1 changed the variance to 0.609 and adding ε = 0.4 gave 0.894. Adding asymmetric error did change the mean family size but only very slightly (data not shown); the variances were the same as in the symmetric case.
There did not appear to be a clear effect of asymmetry in the error model on the absolute values of λ, as putting more of the mass of the error distribution in either gains or losses did not significantly affect the estimated parameter value (compare results from error models 1A to 1C, and 1B to 1D in table 1). However, we did observe a small but substantial increase in λ when errors involving larger changes in family size (e.g., ±3) were included (compare results from error models 1A to 1E, and 1B to 1F in table 1).
Accounting for Errors in Gene Family Size Using CAFE 3
We have observed how error in the observed gene family sizes can lead to an overestimation of the rates of gene gain and loss. We were therefore interested in whether the error model described earlier could be used to account for this error to correctly infer the true value of λ. We apply the error model in two cases: first, when we assume that we know the correct error distribution and second, when we purposefully use an incorrect error model. Because the correct level of error and the exact distribution of error will often not be known, these two cases allow us to assess the consistency of our method. In both cases, the new “errormodel” function is used in CAFE 3, along with a specified error distribution.
In our ideal test case, we again simulated data with λ = μ = 0.0012 across the phylogeny, and added a proportion of error, ε, equal to either 0.1 or 0.4 to all species. In both cases, the size of gene families with error was either incremented or decremented by a count of 1, with equal probability. In the case of using the same error distribution to correct for error as was added to the data set, CAFE correctly recovers the true λ with high precision. The data sets with either ε = 0.1 or ε = 0.4 had the corresponding 0.1 and 0.4 error models applied to all species, and the λ value inferred was again very close to 0.0012 (table 1). To ensure that these results are not unique to a single rate parameter, we repeated the above simulations with λ = μ = 0.01 and 0.0001 and the same error values. For ε = 0.1, application of the error model gave λ = 0.00996 and 0.00009, respectively. For ε = 0.4, application of the error model gave λ = 0.01085 and 0.000083, respectively. We can see that for λ = 0.01, we were able to infer the correct value very accurately, whereas there was some inaccuracy for λ = 0.0001 (similar results held for both asymmetric and symmetric models). This latter result may be due to the small number of changes occurring across the tree with very low rates of change, but further simulations are needed to explore this effect.
FIG. 1. Error distributions used in simulations. These distributions include total errors of ε = 0.1 (A, C, E) and ε = 0.4 (B, D, F) but vary in the manner in which errors are spread across the error spectrum. In panels (A–D), only errors of +1 or −1 gene per family are considered, with (A) and (B) showing a symmetric spread of the total error between +1 and −1, whereas (C) and (D) show an asymmetric spread, skewed with 75% of the total error in the −1 category. The opposite skew, with 75% in the +1 category, was also simulated but is not shown here. Panels (E) and (F) show a symmetric distribution that extends to include the addition and subtraction of two or three genes.
We have also implemented the error model to allow for different error distributions among species. This corresponds to subsets of the input data coming from genomes of differential quality, as we have observed previously among the published Drosophila genomes (Drosophila 12 Genomes Consortium 2007; Hahn, Han, et al. 2007). To assess the accuracy of the λ estimate when different amounts of error are applied to individual genomes, simulations were carried out as above with one difference: An error distribution of ε = 0.1 was applied to all species except D. simulans, D. sechellia, and D. persimilis, which were all given ε = 0.4 (these species were observed to have lower quality genome assemblies; Begun et al. 2007; Drosophila 12 Genomes Consortium 2007). When we specify these same error distributions in our error model, correctly assigning each distribution to each species, we again recover the true single value of λ across the tree (third row in table 1). In addition to searching for a single λ value across the tree, CAFE has the ability to search for separate λ values on individual branches or clades. Without accounting for error, models with separate λ parameters for terminal branches leading to each of the three species with higher error rates fit significantly better (data not shown); this is because the error-prone assemblies appear to have faster rates of gene family evolution (cf., Hahn, Han, et al. 2007). Conversely, if the data are simulated with two λ parameters corresponding to these two parts of the tree (supplementary fig. S1, Supplementary Material online), we are able to correctly infer these parameter values even in the presence of error (table 2). We note that there is a potential identifiability problem when we have separate parameters for gain and loss on a specific branch along with a separate error model for the same branch. We may not always be able to distinguish higher gain and loss rates on the branch with the higher error rates.
We also tested the effect of using an incorrect error model when inferring rates of change. We anticipated that using an error distribution larger than the error distribution that is present in the data (that is, a larger value of ε) would lead to an overcorrection and that a lower λ would be observed with respect to the value of λ with which the data were simulated. Similarly, using an error distribution that corrects for less error than is present in the data might lead to an undercorrection and a higher λ than the true value. Our simulations confirmed our expectations: When an error model with ε = 0.4 was applied to a data set simulated with ε = 0.1, the λ value observed was 0.00045 (approximately one-third of the true value). Similarly, when using an error model with ε = 0.1 to correct for a data set simulated under ε = 0.4, the λ value inferred was 0.0042. Although we have undercorrected in this case (the inferred value is more than three times higher than the simulated value), it is important to remember that the value that would have been obtained without the error model was twice as high again (λ = 0.0086). Overall, we conclude that our error model can recover the true value if the error distribution is known exactly, and if it is not known, the model performs according to expectations. The consistency of the error models suggests that we may be able to predict the error distribution when it is unknown; we demonstrate this feature of CAFE in the next section.
Estimating the Error Distribution from External Data
Thus far, we have only considered the case where the error distribution for gene family size (i.e., the error matrix) is specified ahead of time. In this section and the next, we take two approaches to estimating the error distribution. In the first approach, we use external data from multiple genome assemblies, either two error-prone assemblies of the same genome or one high-quality assembly and one low-quality assembly. Comparisons of these assemblies can be used to find MLEs of probabilities in the error matrix, as described in the Materials and Methods section. We demonstrate this approach using both simulated and real data sets. In the next section, we ask whether the likelihood reported by CAFE can be used to estimate the error distribution without additional external data.
Table 1. Performance of the Error Model.
Error distribution | Error added (ε) | λ (no error model) | λ (correct error model) | λ (incorrect error model)^a
Symmetric | 0.1 (1A)^b | 0.00280 | 0.00122^c | 0.00043
Symmetric | 0.4 (1B) | 0.00897 | 0.00124 | 0.00447
Asymmetric | 0.1 (1C) | 0.00283 | 0.00124 | 0.00047
Asymmetric | 0.1 (1C)^d | 0.00276 | 0.00120 | 0.00050
Asymmetric | 0.4 (1D) | 0.00960 | 0.00139 | 0.00502
Asymmetric | 0.4 (1D)^d | 0.00765 | 0.00111 | 0.00366
Varying error across species | 0.4 for low-quality species^e (1B); 0.1 for all other branches (1A) | 0.00417 | 0.00121 | 0.00090
Symmetric with more error classes | 0.1 (1E) | 0.00324 | 0.00130 | 0.00050
Symmetric with more error classes | 0.4 (1F) | 0.00903 | 0.00154 | 0.00406
a The incorrect error model for simulations with ε = 0.1 is ε = 0.4 and vice versa.
b Error distributions correspond to the given panel in figure 1.
c The simulated value without error in all cases is λ = 0.0012.
d These distributions follow the same pattern as in figure 1C and D but with the asymmetry in the opposite direction.
e Low-quality species are highlighted in supplementary figure S1, Supplementary Material online.
To estimate the error distribution from external data, we first simulated gene family data as above. For each simulated data set without error, we generated two error-prone data sets by separately adding error according to the prespecified error distribution. By data set we mean a set of gene families that comprise a whole genome. We can either compare these two error-prone data sets with each other to estimate the error matrix or we can compare one to the data set without error. The former comparison is equivalent to having two equally error-prone annotations of the same genome, whereas the latter comparison is equivalent to having one accurate annotation and one error-prone assembly. In both cases, we can find the parameters in the error matrix, Θ, that maximize the log likelihood of the data (see Materials and Methods). We simulated data sets with varying numbers of error parameters, from two parameters (error probabilities for differences of 0 and ±1) to seven parameters (error probabilities for differences of +3, +2, +1, 0, −1, −2, and −3) and varying amounts of error (with ε ranging from 0.1 to 0.6). The results of these simulations suggest that we are able to accurately estimate up to four error parameters, across all values of ε (supplementary table S1, Supplementary Material online). With more than four parameters, we did not get convergence of the log likelihood scores. In addition, when the simulated error model is simpler (i.e., has fewer parameters), the more complex error models are not better fitting, suggesting that we are able to find both estimates of the error parameters and the model complexity (supplementary fig. S2 and table S1, Supplementary Material online). Finally, we found that estimating the error distribution by comparing two error-prone data sets was only very slightly less accurate than estimation via comparison between one error-prone data set and one high-quality data set (supplementary table S1, Supplementary Material online). This result is encouraging for systems in which no reference-quality genome is available, although we are making the assumption that the two available error-prone data sets share the same error distribution.
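For the simpler of the two comparisons (one accurate annotation versus one error-prone annotation), a rough Python sketch is given below. It assumes that the error depends only on the difference between observed and true size and ignores boundary effects at size 0; under those assumptions the maximum likelihood estimate of each difference probability is its empirical frequency. This is an illustration of the idea only and is not claimed to be the procedure of the Materials and Methods section; the function name and the example counts are invented. Comparing two error-prone data sets is more involved, because each observed difference then combines two independent errors.

```python
from collections import Counter


def estimate_difference_distribution(reference_sizes, error_prone_sizes):
    """Estimate P(observed - true = d) from paired family sizes.

    Assumes the reference sizes are error free and that the error
    distribution is the same for every family size.
    """
    diffs = [w - x for x, w in zip(reference_sizes, error_prone_sizes)]
    counts = Counter(diffs)
    total = len(diffs)
    return {d: counts[d] / total for d in sorted(counts)}


# Hypothetical sizes of ten gene families in two annotations of one genome.
reference = [1, 1, 2, 3, 5, 2, 1, 1, 4, 2]
error_prone = [1, 2, 2, 3, 4, 2, 1, 1, 4, 2]
estimated = estimate_difference_distribution(reference, error_prone)
eps_hat = 1.0 - estimated.get(0, 0.0)  # proportion of families with any error
```

In this made-up example eight of the ten families agree, so the estimated proportion of families with error is 0.2, split evenly between the +1 and −1 classes.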
To evaluate the error distributions found in real genomes, we compared low- and high-quality assemblies and annotations for eight species (supplementary table S2, Supplementary Material online). These comparisons range from genomes of highly disparate quality (1.92X coverage vs. 6.79X coverage for the two versions of the guinea pig genome) to genomes in the final stages of finishing (e.g., D. melanogaster assembly version 4 vs. version 5.4). As a result, we find a wide range of error distributions, with ε going from 0.687 (in Apis) to 0.104 (in Drosophila). The distribution of differences in gene family sizes in these pairs of genomes is shown graphically in supplementary figure S3, Supplementary Material online. As can be seen, these distributions match our simulated distributions quite well, with most errors involving +1 or −1 differences and very few families having errors greater than ±3. When comparing estimated models from these data, we found better fits for an asymmetric model for the genomes of honey bee, guinea pig, and fugu, whereas the genomes of cow, zebrafish, fruit fly, human, and mouse were estimated to have a symmetric error model. Models with more parameters fit the data better in general (supplementary table S3, Supplementary Material online), but with more than nine parameters (up to differences of ±4 in the asymmetric model, ±7 in the symmetric model) the model often failed to find the global maximum likelihood. Even when the estimated error matrix for the best model had nonzero error rates for differences in family size greater than ±4, the estimated error rates for differences greater than ±4 were smaller than 0.01 in all the genomes analyzed (supplementary table S4, Supplementary Material online). On the basis of the errors observed from real genome data, we conclude that our error models are reasonable tradeoffs between realistic descriptions of the errors and ease of optimization.
Estimating the Error Distribution without External Data
In our second approach to estimating error distributions, we did not use external data; instead we compared the likelihood of CAFE runs with varying error parameters. Because the likelihood of any particular run using the error model is calculated with fixed error parameters, we anticipated that comparison of likelihood scores among runs with varying error distributions would lead to accurate estimates of the error distributions themselves. Although the resulting estimates are not MLEs across the whole parameter space (and are therefore not guaranteed to be global maxima), as discussed in the Materials and Methods section, we can find the pseudo-MLE (PMLE) within the resolution of the grid we are searching.
Table 2. Performance of the Error Model on a 2-λ Search.

| Error Added (ε) | λ (No Error Model) | λ (Correct Error Model) |
| --- | --- | --- |
| 0.1 (1A)^a | λ1 = 0.00206^b | λ1 = 0.00097 |
| | λ2 = 0.05784^c | λ2 = 0.02513 |
| 0.4 (1B) | λ1 = 0.00616 | λ1 = 0.00091 |
| | λ2 = 0.18173 | λ2 = 0.02690 |

a Error distributions correspond to the given panel in figure 1.
b The simulated values without error in all cases are λ1 = 0.0009 and λ2 = 0.025.
c Species with λ2 are highlighted in supplementary figure S1, Supplementary Material online.

To assess our ability to estimate the error distribution, we
focused on estimating the value of ε, the proportion of gene
families with error. We limit our search to estimate only a
one-parameter error model (equal probability of ±1). To do
this, we again simulated data sets with a λ value of 0.0012,
adding either ε = 0.1 or 0.4. We then ran CAFE on both of
these data sets with error models having a value of ε ranging
from 0.0 to 0.9, in fixed increments. Figure 2 shows that the
likelihood of the data is maximized (−ln L is minimized) when
the model having the correct value of ε is used: Models with
ε = 0.1 and ε = 0.4 represent the best fit to the data when
simulations with ε = 0.1 and ε = 0.4 are used, respectively. We
have provided a program that automates the process of running
CAFE with varying error models while maximizing the
likelihood score by performing a simple optimization
(caferror.py); this program will report the error model used
to obtain the best score. These analyses show that we are able
to use CAFE to estimate simple error models from the
input data set by finding the value that maximizes the
likelihood.
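The one-dimensional search over ε can be sketched as follows. This is a minimal illustration and not the logic of caferror.py: `score_fn` is a hypothetical stand-in for one CAFE run that returns the −ln L obtained with the error model fixed at a given ε, and `toy_score` is invented purely so that the example runs.

```python
from typing import Callable, List, Tuple


def make_grid(start: float, stop: float, step: float) -> List[float]:
    """Return evenly spaced candidate values of epsilon, start and stop inclusive."""
    n = int(round((stop - start) / step))
    return [round(start + i * step, 10) for i in range(n + 1)]


def grid_search_pmle(score_fn: Callable[[float], float],
                     grid: List[float]) -> Tuple[float, float]:
    """Return (epsilon, score) for the grid point with the smallest -ln L.

    score_fn is a hypothetical wrapper: in practice it would run CAFE with
    the error model fixed at the given epsilon and return the reported -ln L.
    """
    scores = [(score_fn(eps), eps) for eps in grid]
    best_score, best_eps = min(scores)
    return best_eps, best_score


def toy_score(eps: float) -> float:
    """Invented stand-in for -ln L, with its minimum placed at epsilon = 0.4."""
    return 1000.0 + 50.0 * (eps - 0.4) ** 2


grid = make_grid(0.0, 0.9, 0.1)
best_eps, best_score = grid_search_pmle(toy_score, grid)
print(best_eps, best_score)
```

The estimate can only be as fine as the grid spacing, which is why the result is described as a PMLE within the resolution of the grid rather than a global maximum.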
When the error distribution varies across the genomes
considered, the simple one-dimensional search implemented
above will only tell us about the average error across the tree.
We therefore simulated a data set with varying levels of error
among genomes and followed the one-dimensional search for
an average error parameter with a species-by-species search.
This species-by-species search is achieved by repeatedly
adding or subtracting 10% of the average error predicted in
the first step to each species separately, until the likelihood
score has been maximized. Again, more complex search strat-
egies that simultaneously try to maximize the average and
species-specific error distributions can be considered in the
future, but here we show that a simple approach works well
on the simulated data set, as illustrated in table 3. The error in
the data set resulted in an average error across all species
estimated as ε = 0.26. We then continued to predict error
on each individual species: D. melanogaster was simulated
with ε = 0.1 and was predicted to have ε = 0.101. Drosophila
simulans, D. sechellia, and D. persimilis had ε = 0.4 applied to
their genomes and were all predicted to have error values of
0.406 (table 3). All other species had ε = 0.2 and were predicted
to have error values ranging from 0.203 to 0.228
(table 3). It is notable that the largest deviation we saw was in
D. pseudoobscura (simulated with ε = 0.2), as it is sister to
D. persimilis (simulated with ε = 0.4); this suggests that
genomes with high error rates can make it seem as though
genomes of closely related species also have higher than
expected error rates. Again, this pattern has been seen previ-
ously in the Drosophila data (Hahn, Han, et al. 2007). The
species-by-species search has been implemented as an
option that can be run after an average error parameter
has been estimated using the caferror.py script.
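The species-by-species step can be illustrated with a small coordinate search. This is a sketch under stated assumptions rather than the code in caferror.py: `score_fn` is a hypothetical function that maps a per-species error assignment to −ln L (in practice, one CAFE run), the step is fixed at 10% of the average error as described in the text, and the toy score and its target values are invented for the example.

```python
from typing import Callable, Dict, List, Tuple

ErrorMap = Dict[str, float]


def species_by_species_search(score_fn: Callable[[ErrorMap], float],
                              species: List[str],
                              avg_eps: float,
                              max_rounds: int = 1000) -> Tuple[ErrorMap, float]:
    """Adjust each species' error by +/- 10% of avg_eps until -ln L stops improving."""
    step = 0.1 * avg_eps
    eps = {sp: avg_eps for sp in species}   # every species starts at the average
    best = score_fn(eps)
    for _ in range(max_rounds):
        improved = False
        for sp in species:
            for delta in (step, -step):
                trial = dict(eps)
                trial[sp] = min(1.0, max(0.0, eps[sp] + delta))
                score = score_fn(trial)
                if score < best - 1e-12:    # accept only a genuine improvement
                    eps, best, improved = trial, score, True
                    break
        if not improved:
            break
    return eps, best


# Invented target values, used only to give the toy score a known minimum.
toy_truth = {"D. melanogaster": 0.1, "D. simulans": 0.4, "D. yakuba": 0.2}


def toy_score(eps: ErrorMap) -> float:
    """Stand-in for -ln L: squared distance from the invented targets."""
    return sum((eps[sp] - toy_truth[sp]) ** 2 for sp in toy_truth)


estimates, final_score = species_by_species_search(toy_score, list(toy_truth), 0.25)
print(estimates)
```

Because each species moves in fixed steps from the shared starting value, the estimates are restricted to a lattice around the average error.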
Applying the Error Model to Real Data Sets
Comparative data on the size of gene families has been analyzed
in multiple different groups of organisms using earlier
versions of CAFE (e.g., Sackton et al. 2007; Martin et al. 2008;
Sharpton et al. 2009; Brown et al. 2010; Ohm et al. 2012; Qiu
et al. 2012). Here, we use the new error model feature in CAFE
3 to revisit data from three clades that we have previously
analyzed: fungi, mammals, and Drosophila. We analyzed new
gene family data from 16 fungi (Butler et al. 2009; Rasmussen
and Kellis 2011) and 10 mammals (Worley K et al., in preparation),
as well as previously analyzed data from Drosophila
(Hahn, Han, et al. 2007). For the Drosophila data set, we used
the gene families and tree as described in Hahn, Han, et al.
(2007); the gene families and tree from the expanded fungal
data set are described in Rasmussen and Kellis (2011), and those for
the mammals are described in Worley K et al. (in preparation).
Each of these groups of species is certain to have heteroge-
neous levels of genome assembly and annotation quality
among them. Each group contains one or two focal species
(Saccharomyces cerevisiae among the fungi, Homo sapiens
and Mus musculus among the mammals, and D.
melanogaster among the flies) that have very high-quality
assemblies and annotations, and likely several species with
below-average genomes (e.g., the Drosophila species mentioned
in the previous section). Because of the presence of
genomes with lower quality, it seems likely that the error
model introduced here will provide a more accurate estimate
of the rate of gene family evolution in each group.
FIG. 2. Estimating the error distribution. Each panel shows the −ln likelihood of individual CAFE runs on a simulated data set with error. The points
represent the score of the run with the corresponding error model (ε value) on the x axis; the dashed vertical line indicates the true amount of error
added to the simulated data. Panel (A) shows a simulation with ε = 0.1 and panel (B) shows one with ε = 0.4. In both cases, the maximum likelihood
score occurs when the correct error model is used.

Analyzing the data without applying an error model, we
found λ = 0.0008, 0.0023, and 0.0012, for the fungi, mammals,
and Drosophila, respectively (table 4). For each data set, we
then estimated a one-parameter error model without using
external data, limiting ε to symmetric errors of ±1. Although
this is a highly simplified error distribution, as shown earlier it
captures most of the effect of error on rates of gene family
evolution. Assuming an average error distribution shared
across all genomes in each analysis, we searched for the
PMLE of the error parameter, ε, finding ε = 0.0277, 0.0732,
and 0.041 for the fungi, mammals, and Drosophila, respectively
(table 4). These values imply that on average 2.8%, 7.3%,
and 4.1% of gene families in each data set have observed sizes
that are in error; that is, the observed sizes of the families
are not equal to their true sizes. It is interesting to note that
the average error in each clade roughly coincides with
the number of repetitive elements, number of total
gene duplicates, and total size of the analyzed genomes
(mammals >Drosophila >fungi), all of which are expected
to coincide with errors in assembly and annotation. These
conclusions of course rest on the assumption that our error
model is an accurate representation of the true error distri-
butions. As we have shown earlier, an analysis of error-prone
assemblies supports our modeling assumptions. Nevertheless,
this is an association among three clades, arising from genome
assemblies constructed in very different ways, and should be
considered to be a preliminary indication of factors that can
affect assembly and annotation quality.
Analyzing the data with the best-fit error models, we found
new rate parameters of λ = 0.0006, 0.0019, and 0.0006, for the
fungi, mammals, and Drosophila, respectively (table 4). In all
three data sets, models with a single error parameter fit the
data significantly better than models without error parameters
(all P < 0.0001; likelihood ratio test), indicating that the
corrected λ values reported here are more accurate reflections
of the rate of gene family evolution in these three clades.
In the fungi and mammals, the new estimates suggest rates of
evolution approximately 75% of the uncorrected estimates,
whereas in Drosophila, the new estimate is 50% of the original
one.
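The percentages quoted in this paragraph follow directly from the rate estimates in table 4. The short calculation below reproduces that arithmetic; the numbers are copied from the table, and the variable names are chosen only for this illustration.

```python
# Values copied from table 4: (lambda without error model, lambda with error model).
table4 = {
    "Fungi": (0.00080, 0.00061),
    "Mammals": (0.00238, 0.00186),
    "Drosophila": (0.00121, 0.00059),
}

# Ratio of the error-corrected rate to the uncorrected rate for each clade.
ratios = {clade: corrected / uncorrected
          for clade, (uncorrected, corrected) in table4.items()}

for clade, ratio in ratios.items():
    print(f"{clade}: corrected rate is {ratio:.0%} of the uncorrected rate")
```

The ratios come out at roughly 76% for the fungi, 78% for the mammals, and 49% for Drosophila, in line with the approximate figures of 75% and 50% given in the text.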
In addition to fitting a global rate parameter, we examined
the fit of a model having three λ parameters on the mammalian
tree. Previous research found higher rates of gene gain
and loss in the great apes, with an intermediate rate in other
primates, and the lowest rate on the rest of the tree (Hahn,
Demuth, et al. 2007). We wished to know whether this pattern
would still be observed after errors were accounted for,
so we estimated the likelihood of the three-parameter model
while setting ε = 0.0732. As expected if this pattern is due to a
true rate acceleration and not error in assemblies or annotation,
the three-parameter model with error fit the data significantly
better than a one-parameter model with error
(P < 0.0001; likelihood ratio test). Indeed, as previously observed,
the value of the λ parameter on the human–chimp
shared lineage was more than three times higher than the
average value (0.0062 vs. 0.0019), with the orangutan–
macaque shared lineage having an intermediate rate
(λ = 0.0044). These results are not wholly unexpected, as
other previous studies have found the same rate acceleration
in a set of analyses that is free from biases due to heteroge-
neous quality in genome assemblies (Marques-Bonet et al.
2009). Our results also confirm earlier conclusions about
the amount of genic copy number divergence between
humans and chimpanzees (Demuth et al. 2006) and demon-
strate that these results were not due to genome assembly or
annotation error.
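The likelihood ratio tests mentioned in this section compare nested models through twice the difference in log likelihood. The sketch below assumes the usual asymptotic chi-square reference distribution, which the text does not spell out, and covers only one or two extra parameters so that it needs nothing beyond the standard library. The −ln L values in the example are hypothetical and are not results from the paper.

```python
import math
from typing import Tuple


def lrt_p_value(neg_lnl_null: float, neg_lnl_alt: float, df: int) -> Tuple[float, float]:
    """Return (test statistic, P value) for a likelihood ratio test.

    neg_lnl_null and neg_lnl_alt are -ln L for the simpler and the richer
    model; df is the number of extra free parameters in the richer model.
    """
    stat = 2.0 * (neg_lnl_null - neg_lnl_alt)
    if stat <= 0.0:
        return 0.0, 1.0                        # richer model fits no better
    if df == 1:
        p = math.erfc(math.sqrt(stat / 2.0))   # chi-square survival function, 1 df
    elif df == 2:
        p = math.exp(-stat / 2.0)              # chi-square survival function, 2 df
    else:
        raise ValueError("this sketch only handles df = 1 or df = 2")
    return stat, p


# Hypothetical scores: adding one error parameter lowers -ln L by 10 units.
stat1, p1 = lrt_p_value(1010.0, 1000.0, df=1)
# Hypothetical scores: three rate parameters versus one (two extra parameters).
stat2, p2 = lrt_p_value(1000.0, 990.0, df=2)
print(stat1, p1, stat2, p2)
```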
Conclusions
Here, we have provided a new software package that enables
the accurate estimation of rates of gene family evolution
when there are errors in the observed gene family sizes. By
allowing users to marginalize over the uncertainty in the ob-
served gene family sizes, CAFE 3 provides a platform for ex-
panding comparative genomic analyses into clades consisting
solely of draft genome sequences. Our software is freely available
(http://sourceforge.net/projects/cafehahnlab/) and can
be compiled on multiple operating systems. Although it is
likely that there are typical error distributions associated with
different sequencing technologies used to assemble genomes
(e.g., Illumina vs. 454), our program does not require that such
distributions are known ahead of time. If prior information
about either the sequencing technology or the depth of cov-
erage is known, more accurate results may be obtained. We
have demonstrated three alternative approaches to estimat-
ing error distributions, each of which requires slightly different
types of data; regardless of how error distributions are esti-
mated, CAFE 3 allows any arbitrary distribution to be speci-
fied. Finally, although we have applied this approach to
correcting for error in gene family sizes, similar methods
may be applicable to errors in nucleotide data (e.g., Heid
et al. 2008; Hubisz et al. 2011) or any trait for which a realistic
error model can be constructed (e.g., RNA-seq; Brawand et al.
2011).
Table 3. Estimating the Error Distribution (ε) with Heterogeneous Error across Species.

| Species | Error Added | Estimated Error |
| --- | --- | --- |
| Drosophila willistoni | 0.2 | 0.20283 |
| D. virilis | 0.2 | 0.20283 |
| D. persimilis | 0.4 | 0.40566 |
| D. mojavensis | 0.2 | 0.20283 |
| D. sechellia | 0.4 | 0.40566 |
| D. pseudoobscura | 0.2 | 0.22819 |
| D. yakuba | 0.2 | 0.20283 |
| D. grimshawi | 0.2 | 0.20283 |
| D. erecta | 0.2 | 0.20283 |
| D. melanogaster | 0.1 | 0.10141 |
| D. ananassae | 0.2 | 0.20283 |
| D. simulans | 0.4 | 0.40566 |
Table 4. Analysis of Previously Published Data Sets.

| Clade | λ (No Error Model) | ε (Estimated)^a | λ (Error Model = ε) |
| --- | --- | --- | --- |
| Fungi | 0.00080 | 0.02771 | 0.00061 |
| Mammals | 0.00238 | 0.07324 | 0.00186 |
| Drosophila | 0.00121 | 0.04102 | 0.00059 |

a The estimated error distribution was symmetric with only ±1 allowed.
Materials and Methods
Application of the Error Model to Infer the Gain and
Loss Parameters
The general approach we take uses PMLEs because we do not
include the error parameters in the full likelihood formula.
Instead, we estimate the measurement error parameters from
external data as described later, and then the likelihood is
calculated using the observed data, with the error parameters
fixed at their estimates (Buonaccorsi 2010):
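\[
\lambda, \mu = \arg\max_{\lambda,\mu} \left( \prod_{n=1}^{N} P(W_n \mid X_n, \lambda, \mu, T) \right)
\]

\[
\begin{aligned}
P(W_n \mid X_n, \lambda, \mu, T)
&= \sum_{x_{n1}=0}^{M} \sum_{x_{n2}=0}^{M} \cdots \sum_{x_{ns}=0}^{M}
   P\bigl(W_n \mid X_n = (x_{n1}, x_{n2}, \ldots, x_{ns})\bigr)\, P(X_n \mid \lambda, \mu, T) \\
&= \sum_{x_{n1}=0}^{M} \sum_{x_{n2}=0}^{M} \cdots \sum_{x_{ns}=0}^{M}
   \sum_{z_{n1}=0}^{M} \sum_{z_{n2}=0}^{M} \cdots \sum_{z_{nu}=0}^{M}
   \Bigl\{ P\bigl(W_n \mid X_n = (x_{n1}, x_{n2}, \ldots, x_{ns})\bigr) \\
&\qquad \times P\bigl(X_n \mid Z_n = (z_{n1}, z_{n2}, \ldots, z_{nu}), \lambda, \mu, T\bigr)\,
   P(Z_n \mid \lambda, \mu, T) \Bigr\},
\end{aligned}
\]

where $P(W = w \mid X = x) = \hat{\theta}_{w \mid x}$ and is estimated through
external data.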
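The role of the term P(W = w | X = x) can be illustrated at a single tip of the tree. Without an error model, the likelihood vector at a leaf is an indicator on the observed count; with one, it becomes the corresponding row of the error matrix, and summing against a distribution over true counts marginalizes over the uncertainty. This is a minimal sketch: whether CAFE 3 organizes its computation this way is an assumption, and the 3 × 3 matrix and the distribution over true counts are invented toy values.

```python
from typing import List

Matrix = List[List[float]]  # matrix[w][x] = P(W = w | X = x)


def leaf_vector_exact(w: int, M: int) -> List[float]:
    """Leaf likelihood over true counts 0..M when the observation is error-free."""
    return [1.0 if x == w else 0.0 for x in range(M + 1)]


def leaf_vector_with_error(w: int, error_matrix: Matrix) -> List[float]:
    """Leaf likelihood over true counts: P(W = w | X = x) for each x."""
    return list(error_matrix[w])


def marginal_observed(w: int, error_matrix: Matrix, prior: List[float]) -> float:
    """P(W = w): sum over x of P(W = w | X = x) * P(X = x)."""
    vec = leaf_vector_with_error(w, error_matrix)
    return sum(p_w_given_x * p_x for p_w_given_x, p_x in zip(vec, prior))


# Invented toy error matrix (each column sums to 1) and true-count distribution.
toy_error = [
    [0.95, 0.05, 0.00],
    [0.05, 0.90, 0.05],
    [0.00, 0.05, 0.95],
]
toy_prior = [0.2, 0.5, 0.3]

print(leaf_vector_exact(1, 2))
print(marginal_observed(1, toy_error, toy_prior))
```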
Simplifying the Error Matrix
Because we are dealing with count data that can go up to M, if
we allow for error from any true count to any observed count
the number of parameters specifying the error matrix for this
full model is M², which is unnecessarily complex. To simplify
the parameters and to make the model useful, we can con-
strain three aspects of the model.
First, we can assume that the error rate depends on the
difference between the observed count and the true count
but does not depend on the true count itself. This is equiv-
alent to having a single homogeneous parameter along the
diagonals of the error matrix. Although this assumption may
not be biologically realistic, because most of our gene families
are sizes of three or smaller and large families are rare, only
three rates would normally have to be used. Our modeling
framework also allows this assumption to be relaxed when
error is estimated from external data, and any error structure
can be entered into CAFE 3 if specified by the user. Second, we
can restrict the errors that are allowed to be at most D differences
from the true count. This forces the corner parameters
that are D + 1 or more steps away from the diagonal to
a probability ω that is constrained to be smaller than all other
parameters. Again, D is a user-specified parameter that can be
quite large in practice. Third, we can assume symmetry on the
rates of errors that increase the counts and the rates of errors
that decrease the counts, reducing the number of parameters
to half. This last assumption is again optional, and we have
explored models with and without it. In our simulated results
presented above, we explore a range of models that are combinations
of the value of D (≤ 3) and the state of symmetry.
The numbers of parameters for the error matrix are then
2D + 1 for asymmetric models and D + 1 for symmetric
models.
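The three constraints can be made concrete by building a banded error matrix from a small set of offset probabilities. The sketch below makes several assumptions that go beyond the text: the one-parameter model splits ε equally between +1 and −1, entries outside the band are set to zero rather than to a small residual probability, and columns are renormalized where the band would run below zero or above M. The function names are hypothetical.

```python
from typing import Dict, List


def n_error_params(D: int, symmetric: bool) -> int:
    """Parameter count given in the text: D + 1 if symmetric, 2D + 1 otherwise."""
    return D + 1 if symmetric else 2 * D + 1


def one_parameter_offsets(eps: float) -> Dict[int, float]:
    """Symmetric +/-1 model: offset (observed - true) -> probability (assumed split)."""
    return {-1: eps / 2.0, 0: 1.0 - eps, 1: eps / 2.0}


def banded_error_matrix(M: int, offsets: Dict[int, float]) -> List[List[float]]:
    """Return matrix[w][x] = P(W = w | X = x) for counts 0..M.

    The same offset probabilities are used for every true count x (the
    homogeneous-diagonal assumption); each column is renormalized so that
    it sums to 1 at the boundaries (an assumption of this sketch).
    """
    matrix = [[0.0] * (M + 1) for _ in range(M + 1)]
    for x in range(M + 1):
        allowed = {x + d: p for d, p in offsets.items() if 0 <= x + d <= M}
        total = sum(allowed.values())
        for w, p in allowed.items():
            matrix[w][x] = p / total
    return matrix


error_matrix = banded_error_matrix(5, one_parameter_offsets(0.1))
column_sums = [sum(error_matrix[w][x] for w in range(6)) for x in range(6)]
print(n_error_params(1, True), n_error_params(4, False), column_sums)
```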
Estimation of the Error Matrix from External Data
If we know that one measurement (i.e., one set of gene fam-
ilies from a well-annotated genome) is more accurate than
another, we can estimate the error matrix by assuming the
better measure as the true value and comparing the lesser
measure to the truth. Although having a well-annotated
genome might seem to obviate the need for using the error
model, the estimated error matrix from such a comparison
could be usefully applied to poorly annotated genomes. If we
have two sets of measurements with unknown relative
accuracy, we can estimate the error matrix based on the ob-
served agreement between the two measures. We describe
these two cases in reverse order.
First, when the two measures are similarly error prone, we
find the triangular matrix $R = r_{ij}$ (where $i = 0 \ldots M$, $j \le i$) of
pairwise observations, with $r_{ij}$ being the number of observations
with $W_1 = i$ and $W_2 = j$. The probability of each pairwise
observation is defined based on the true count probabilities
$\pi_x$ and the error rates $\theta_{w \mid x}$. Assuming that the probability of
observing pairwise observations is $p_{ij}$, with

\[
p_{ij} =
\begin{cases}
\sum_{k=0}^{M} 2\,\theta_{i \mid k}\,\theta_{j \mid k}\,\pi_k & \text{if } i \neq j \\
\sum_{k=0}^{M} \theta_{i \mid k}\,\theta_{j \mid k}\,\pi_k & \text{if } i = j
\end{cases},
\]

then the log likelihood of the data matrix R given the probability
distribution of $\pi_x$ and the error matrix model can be
calculated using the multinomial likelihood. Ignoring the coefficient,
the log likelihood ln L is:

\[
\ln L(R \mid \theta, \pi) = \sum_{i=0,\, j=0\ (i \ge j)}^{M,\,M} r_{ij} \ln (p_{ij}).
\]
However, for our data set, we have a limitation that we can
never observe gene families that are size zero in both measures.
To account for this missing data, we find the likelihood
that is conditional on the event (E) that we observed at least
one gene in either one of the measurements. Because

\[
P(R \mid \theta, \pi, E) = \frac{P(R, E \mid \theta, \pi)}{P(E)} = \frac{P(R, E \mid \theta, \pi)}{1 - p(0,0)},
\]

the conditional log likelihood $\ln L_c$ is:

\[
\ln L_c(R \mid \theta, \pi, E) = \left\{ \sum_{i=0,\, j=0\ (i \ge j)}^{M,\,M} r_{ij} \ln (p_{ij}) \right\} - \ln (1 - p_{00}).
\]
We do not have data for the true $\pi$, but assuming the
distribution is similar to the counts found in $W_1$ and $W_2$,
we can substitute the count distributions observed in
$W_1 + W_2$ ($\hat{\pi}$) as the approximation of $\pi$. For each model of
error matrix $\theta$, we find the parameters that maximize the
conditional log likelihood of the data R using the Nelder–
Mead method.
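For two similarly error-prone measures, the pairwise probabilities and the conditional log likelihood can be computed directly once an error matrix and a distribution over true counts are given. In the sketch below, `theta[w][x]` stands for P(W = w | X = x) and `pi[x]` for the probability of true count x; these names, the toy matrix, the toy distribution, and the toy counts in `R` are all assumptions made for illustration. The optimization step itself is omitted: the function could be handed to any general-purpose optimizer, Nelder–Mead being the one named in the text.

```python
import math
from typing import List

Matrix = List[List[float]]


def pair_probabilities(theta: Matrix, pi: List[float]) -> Matrix:
    """p[i][j] for i >= j: probability of the unordered pair of observations (i, j)."""
    M = len(pi) - 1
    p = [[0.0] * (M + 1) for _ in range(M + 1)]
    for i in range(M + 1):
        for j in range(i + 1):
            s = sum(theta[i][k] * theta[j][k] * pi[k] for k in range(M + 1))
            p[i][j] = s if i == j else 2.0 * s   # factor 2: either measure may be i
    return p


def conditional_log_likelihood(R: Matrix, theta: Matrix, pi: List[float]) -> float:
    """Sum of r_ij * ln(p_ij) over i >= j, minus ln(1 - p_00), as in the text."""
    p = pair_probabilities(theta, pi)
    total = 0.0
    for i in range(len(pi)):
        for j in range(i + 1):
            if R[i][j] > 0:                      # cells with no observations add nothing
                total += R[i][j] * math.log(p[i][j])
    return total - math.log(1.0 - p[0][0])


# Invented toy inputs: error matrix (columns sum to 1), true-count distribution,
# and a lower-triangular table of pairwise counts with no (0, 0) observations.
theta = [
    [0.95, 0.05, 0.00],
    [0.05, 0.90, 0.05],
    [0.00, 0.05, 0.95],
]
pi = [0.2, 0.5, 0.3]
R = [
    [0, 0, 0],
    [3, 40, 0],
    [0, 5, 30],
]

p = pair_probabilities(theta, pi)
total_probability = sum(p[i][j] for i in range(3) for j in range(i + 1))
log_lik = conditional_log_likelihood(R, theta, pi)
print(total_probability, log_lik)
```

A cell with observations but zero probability under the model would make the logarithm undefined, so a practical implementation would need to guard against that case.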
For the case when one set of annotations is treated as the
true measure, the observation matrix R is a full matrix with
columns that correspond to the true measure and rows to
the incorrect measure. $r_{ij}$ is now the number of observations
with W = i and X = j. The probability of each pairwise observation
is again defined based on the true count probabilities
$\pi_x$ and the error rates $\theta_{w \mid x}$, but the probability of observing
pairwise observations $p_{ij}$ is simpler:

\[
p_{ij} = \theta_{i \mid j}\, \pi_j.
\]

The conditional log likelihood follows the same approach as
above but is summed across the whole discordance matrix, R:

\[
\ln L_c(R \mid \theta, \pi, E) = \left\{ \sum_{i=0,\, j=0}^{M,\,M} r_{ij} \ln (p_{ij}) \right\} - \ln (1 - p_{00}).
\]

The distribution $\pi$ can be estimated using the assumed true
count data X. MLEs of parameters that determine the error
matrix are then found by comparison of the true and error-prone
data.
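When one measure is treated as the truth, a rough first look at the error distribution can be obtained by tabulating how often the error-prone count differs from the true count by each offset. The sketch below is a simple frequency tabulation under the homogeneous-diagonal assumption; it is not the conditional maximum likelihood fit described above, and it ignores both the boundary at zero and the unobserved (0, 0) cell. The discordance matrix shown is invented.

```python
from collections import Counter
from typing import Dict, List


def offset_frequencies(R: List[List[int]], D: int) -> Dict[int, float]:
    """Relative frequency of each offset d = W - X with |d| <= D.

    R[i][j] is the number of families with error-prone count W = i (rows)
    and assumed true count X = j (columns).
    """
    counts: Counter = Counter()
    for i, row in enumerate(R):
        for j, r in enumerate(row):
            d = i - j
            if abs(d) <= D:
                counts[d] += r
    total = sum(counts.values())
    return {d: counts[d] / total for d in range(-D, D + 1)}


# Invented toy discordance matrix for counts 0..2.
R = [
    [0, 2, 0],
    [1, 90, 3],
    [0, 4, 50],
]

freqs = offset_frequencies(R, D=1)
print(freqs)
```

Frequencies of this kind could serve as starting values for the likelihood-based fit.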
Estimation of the Error Matrix from Eight Genomes
To collect data on realistic error matrices, we compared
annotations for two versions of each of eight published
genomes. We used the gene models for honey bee
(Apis mellifera), cow (Bos taurus), guinea pig (Cavia porcellus),
zebrafish (Danio rerio), fruit fly (Drosophila melanogaster),
human (H. sapiens), mouse (M. musculus), and fugu
(Takifugu rubripes) from Ensembl (Flicek et al. 2012). For
each species, we compared an earlier, lower quality assembly
and annotation to a later, higher quality assembly and annotation
(supplementary table S2, Supplementary Material
online). For each pair of genome annotations (i.e., for each
species), we assigned genes to gene families using an all-
against-all BLASTP sequence similarity search, followed by
clustering using the MCL algorithm (Enright et al. 2002).
Because genes from both annotation versions were clustered
simultaneously, we can simply compare the size of each
resulting family to estimate the error matrix. We applied
our method of estimating the error matrix from external
data to the pairs of genomes, assuming that the updated
annotation is the true measure. We compared symmetric
and asymmetric error models with a range of parameters
(supplementary table S3, Supplementary Material online).
The number of parameters in the asymmetric models
ranged from three (difference of −1, 0, 1) to nine
(difference of −4, −3, ... 3, 4). The number of parameters
in the symmetric models ranged from two (difference of 0,
±1) to eight (difference of 0, ±1, ... ±7).
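Because genes from both annotation versions are clustered together, each family yields a pair of sizes, one per version, and these pairs fill the discordance matrix. The sketch below assumes a cluster file with one family per line and whitespace-separated gene labels, and it assumes that labels carry a version prefix (`old|` or `new|`); both the prefixes and the example labels are hypothetical conventions introduced here, not something described in the text.

```python
from typing import Iterable, List, Tuple


def family_sizes(cluster_lines: Iterable[str],
                 old_prefix: str = "old|",
                 new_prefix: str = "new|") -> List[Tuple[int, int]]:
    """Return (size in older annotation, size in newer annotation) per family."""
    sizes = []
    for line in cluster_lines:
        genes = line.split()
        if not genes:
            continue                              # skip blank lines
        n_old = sum(g.startswith(old_prefix) for g in genes)
        n_new = sum(g.startswith(new_prefix) for g in genes)
        sizes.append((n_old, n_new))
    return sizes


def discordance_matrix(sizes: Iterable[Tuple[int, int]], M: int) -> List[List[int]]:
    """R[i][j] = number of families with older count i (rows) and newer count j (columns)."""
    R = [[0] * (M + 1) for _ in range(M + 1)]
    for n_old, n_new in sizes:
        if n_old <= M and n_new <= M:             # families larger than M are dropped
            R[n_old][n_new] += 1
    return R


# Invented example: three families clustered from two annotation versions.
example_clusters = [
    "old|g1\told|g2\tnew|g1\tnew|g2\tnew|g3",
    "old|g4\tnew|g4",
    "new|g5",
]

sizes = family_sizes(example_clusters)
R = discordance_matrix(sizes, M=3)
print(sizes, R)
```

Rows index the older annotation and columns the newer one, matching the convention that the updated annotation is treated as the true measure.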
Estimation of the Error Matrix via Search without
External Data
We also demonstrate how to estimate the error matrix when
there is no external validation data available. In this case, we
find the birth-and-death parameters that maximize the
pseudolikelihood using a fixed error model but repeat the
procedure across a grid of error-parameter values. The grid
consists of error parameters in fixed increments across a fixed
region, and the error parameter that yields the maximum
pseudolikelihood across the grid space is determined as the
estimate. This process has the limitation that it cannot search
the whole continuous parameter space, and an n-dimensional
grid search becomes impractical for complicated error matri-
ces as the number of parameters in the model increases.
Nevertheless, it performs fairly well in practice, as is shown
in the simulations presented in the Results and Discussion
sections.
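The cost of extending the grid to several error parameters can be seen in a small n-dimensional version of the search. This is an illustrative sketch only: `score_fn` is a hypothetical stand-in for a full CAFE optimization at fixed error parameters, the feasibility check (error probabilities summing to at most 1) is an assumption of the sketch, and the toy score is invented.

```python
import itertools
from typing import Callable, List, Sequence, Tuple


def n_evaluations(values_per_param: int, n_params: int) -> int:
    """Number of grid points: grows as values_per_param ** n_params."""
    return values_per_param ** n_params


def grid_search_nd(score_fn: Callable[[Tuple[float, ...]], float],
                   values: Sequence[float],
                   n_params: int) -> Tuple[Tuple[float, ...], float]:
    """Evaluate score_fn on every feasible grid point and return the best one."""
    best_point, best_score = None, float("inf")
    for point in itertools.product(values, repeat=n_params):
        if sum(point) > 1.0:                      # total error probability cannot exceed 1
            continue
        score = score_fn(point)
        if score < best_score:
            best_point, best_score = point, score
    return best_point, best_score


def toy_score(point: Tuple[float, ...]) -> float:
    """Invented stand-in for -ln L with its minimum at (0.1, 0.3)."""
    return (point[0] - 0.1) ** 2 + (point[1] - 0.3) ** 2


values: List[float] = [i / 10 for i in range(6)]   # 0.0, 0.1, ..., 0.5
best_point, best_score = grid_search_nd(toy_score, values, n_params=2)
print(best_point, n_evaluations(len(values), 2), n_evaluations(len(values), 8))
```

With six candidate values per parameter, two parameters need 36 evaluations while eight parameters need 1,679,616, each of which would be a separate likelihood optimization.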
Supplementary Material
Supplementary figures S1–S3 and tables S1–S4 are available at
Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
Acknowledgments
The authors thank Matt Rasmussen for providing the fungal
data and David Swofford for helpful discussions. This work
was supported by a National Evolutionary Synthesis Center
postdoctoral fellowship to M.V.H., by a Ford Foundation pre-
doctoral fellowship to J.L.-M., and by National Science
Foundation grant DBI-0845494 to M.W.H.
References
Ames RM, Money D, Ghatge VP, Whelan S, Lovell SC. 2012. Determining
the evolutionary history of gene families. Bioinformatics 28:48–55.
Bailey NTJ. 1964. The elements of stochastic processes. New York: John
Wiley & Sons, Inc.
Begun DJ, Holloway AK, Stephens K, et al. (13 co-authors). 2007.
Population genomics: whole-genome analysis of polymorphism
and divergence in Drosophila simulans. PLoS Biol. 5:e310.
Brawand D, Soumillon M, Necsulea A, et al. (18 co-authors). 2011. The
evolution of gene expression levels in mammalian organs. Nature
478:343–348.
Brown CA, Murray AW, Verstrepen KJ. 2010. Rapid expansion and
functional divergence of subtelomeric gene families in yeasts. Curr
Biol. 20:895–903.
Buonaccorsi JP. 2010. Measurement error: models, methods and appli-
cations. Boca Raton (FL): Chapman and Hall/CRC Press.
Butler G, Rasmussen MD, Lin MF, et al. (51 co-authors). 2009. Evolution
of pathogenicity and sexual reproduction in eight Candida genomes.
Nature 459:657–662.
Colbourne JK, Pfrender ME, Gilbert D, et al. (68 co-authors). 2011. The
ecoresponsive genome of Daphnia pulex. Science 331:555–561.
Costello JC, Han MV, Hahn MW. 2008. Limitations of pseudogenes in
identifying gene losses. In: Nelson C, Vialette S, editors. Proceedings
of the Sixth Annual RECOMB Satellite Workshop on Comparative
Genomics; 2008 Oct 13–15; Paris, France. Heidelberg (Germany):
Springer Berlin. p. 14–25.
De Bie T, Demuth JP, Cristianini N, Hahn MW. 2006. CAFE: a compu-
tational tool for the study of gene family evolution. Bioinformatics
22:1269–1271.
Demuth JP, De Bie T, Stajich JE, Cristianini N, Hahn MW. 2006. The
evolution of mammalian gene families. PLoS One 1:e85.
Demuth JP, Hahn MW. 2009. The life and death of gene families.
BioEssays 31:29–39.
Drosophila 12 Genomes Consortium. 2007. Evolution of genes and ge-
nomes on the Drosophila phylogeny. Nature 450:203–218.
Emerson JJ, Cardoso-Moreira M, Borevitz JO, Long M. 2008. Natural
selection shapes genome-wide patterns of copy-number polymorphism
in Drosophila melanogaster. Science 320:1629–1631.
Enright AJ, Van Dongen S, Ouzounis CA. 2002. An efficient algorithm for
large-scale detection of protein families. Nucleic Acids Res. 30:
1575–1584.
Felsenstein J. 1973. Maximum likelihood and minimum-steps methods
for estimating evolutionary trees from data on discrete characters.
Syst Biol. 22:240–249.
Felsenstein J. 1981. Evolutionary trees from DNA sequences: a maximum
likelihood approach. J Mol Evol. 17:368–376.
Flicek P, Amode MR, Barrell D, et al. (57 co-authors). 2012. Ensembl
2012. Nucleic Acids Res. 40:D84–D90.
Floudas D, Binder M, Riley R, et al. (71 co-authors). 2012. The paleozoic
origin of enzymatic lignin decomposition reconstructed from 31
fungal genomes. Science 336:1715–1719.
Gibbs RA, Rogers J, Katze M, et al. (176 co-authors). 2007. Evolutionary
and biomedical insights from the rhesus macaque genome. Science
316:222–234.
Gibbs RA, Weinstock GM, Metzker ML, et al. (241 co-authors). 2004.
Genome sequence of the Brown Norway rat yields insights into
mammalian evolution. Nature 428:493–521.
Hahn MW, De Bie T, Stajich JE, Nguyen C, Cristianini N. 2005. Estimating
the tempo and mode of gene family evolution from comparative
genomic data. Genome Res. 15:1153–1160.
Hahn MW, Demuth JP, Han S-G. 2007. Accelerated rate of gene gain and
loss in primates. Genetics 177:1941–1949.
Hahn MW, Han MV, Han S-G. 2007. Gene family evolution across 12
Drosophila genomes. PLoS Genet. 3:e197.
Heid IM, Lamina C, Küchenhoff H, et al. (18 co-authors). 2008.
Estimating the single nucleotide polymorphism genotype misclassi-
fication from routine double measurements in a large epidemiologic
sample. Am J Epidemiol. 168:878–889.
Holt RA, Subramanian GM, Halpern A, et al. (123 co-authors). 2002. The
genome sequence of the malaria mosquito Anopheles gambiae.
Science 298:129–149.
Hubisz MJ, Lin MF, Kellis M, Siepel A. 2011. Error and error mitigation in
low-coverage genome assemblies. PLoS One 6:e17034.
Kidd JM, Cooper GM, Donahue WF, et al. (46 co-authors). 2008.
Mapping and sequencing of structural variation from eight
human genomes. Nature 453:56–64.
Librado P, Vieira FG, Rozas J. 2012. BadiRate: estimating family turnover
rates by likelihood-based methods. Bioinformatics 28:279–281.
Li R, Fan W, Tian G, et al. (123 co-authors). 2009. The sequence and de
novo assembly of the giant panda genome. Nature 463:311–317.
Liu L, Yu L, Kalavacharla V, Liu Z. 2011. A Bayesian model for gene family
evolution. BMC Bioinformatics 12:426.
Marques-Bonet T, Kidd JM, Ventura M, et al. (20 co-authors). 2009.
A burst of segmental duplications in the genome of the African
great ape ancestor. Nature 457:877–881.
Martin F, Aerts A, Ahren D, et al. (68 co-authors). 2008. The genome of
Laccaria bicolor provides insights into mycorrhizal symbiosis. Nature
452:88–92.
Nei M, Rooney AP. 2005. Concerted and birth-and-death evolution of
multigene families. Annu Rev Genet. 39:121–152.
Ohm RA, Feau N, Henrissat B, et al. (28 co-authors). 2012. Diverse
lifestyles and strategies of plant pathogenesis encoded in the
genomes of eighteen Dothideomycetes fungi. PLoS Pathog. 8:
e1003037.
Qiu Q, Zhang GJ, Ma T, et al. (47 co-authors). 2012. The yak genome and
adaptation to life at high altitude. Nat Genet. 44:946–949.
Rasmussen MD, Kellis M. 2011. A Bayesian approach for fast and accurate
gene tree reconstruction. Mol Biol Evol. 28:273–290.
Sackton TB, Lazzaro BP, Schlenke TA, Evans JD, Hultmark D, Clark AG.
2007. Dynamic evolution of the innate immune system in
Drosophila. Nat Genet. 39:1461–1468.
Schrider DR, Stevens KA, Cardeño CM, Langley CH, Hahn MW. 2011.
Genome-wide analysis of retrogene polymorphisms in Drosophila
melanogaster. Genome Res. 21:2087–2095.
Sebat J, Lakshmi B, Troge J, et al. (21 co-authors). 2004. Large-scale
copy number polymorphism in the human genome. Science 305:
525–528.
Sharpton TJ, Stajich JE, Rounsley SD, et al. (24 co-authors). 2009.
Comparative genomic analyses of the human fungal pathogens
Coccidioides and their relatives. Genome Res. 19:1722–1731.
Stark A, Lin MF, Kheradpour P, et al. (46 co-authors). 2007. Discovery of
functional elements in 12 Drosophila genomes using evolutionary
signatures. Nature 450:219–232.
1997
CAFE 3 .doi:10.1093/molbev/mst100 MBE
by guest on December 31, 2015http://mbe.oxfordjournals.org/Downloaded from
... To conduct an evolutionary analysis, we collected the protein sequences of P. ternata and 13 other species: Arabidopsis thaliana, Carica papaya, Oryza sativa, Brachypodium distachyon, Areca catechu, Wolffia australiana, Elaeis guineensis, Cocos nucifera, Colocasia esculenta, Amorphophallus konjac, Pinellia pedatisecta, Pistia stratiotes and Zostera marina. We also identified the gene clusters of these genomes using Orthofinder [44] with default settings. CAFE (v4.2.1) [45] was used to identify the gene families that underwent expansion or contraction in the 11 sequenced species. ...
... Z(t) gives the copy number state of a bin at time t. The linear birth-death process first introduced by [35] is also used in [36] to model gene content evolution. We assume each copy is independently amplified with birth rate M > 0 and deleted with death rate µ M > 0 . ...
Article
Full-text available
Copy number aberrations (CNAs) are ubiquitous in many types of cancer. Inferring CNAs from cancer genomic data could help shed light on the initiation, progression, and potential treatment of cancer. While such data have traditionally been available via “bulk sequencing,” the more recently introduced techniques for single-cell DNA sequencing (scDNAseq) provide the type of data that makes CNA inference possible at the single-cell resolution. We introduce a new birth-death evolutionary model of CNAs and a Bayesian method, NestedBD, for the inference of evolutionary trees (topologies and branch lengths with relative mutation rates) from single-cell data. We evaluated NestedBD’s performance using simulated data sets, benchmarking its accuracy against traditional phylogenetic tools as well as state-of-the-art methods. The results show that NestedBD infers more accurate topologies and branch lengths, and that the birth-death model can improve the accuracy of copy number estimation. And when applied to biological data sets, NestedBD infers plausible evolutionary histories of two colorectal cancer samples. NestedBD is available at https://github.com/Androstane/NestedBD.
... Ma between O. sativa and Z. mays) from the TimeTree database. The expansion and contraction of orthologous gene families were determined by comparing the cluster size differences between the ancestor and each species using the CAFÉ (v4.2.1) program (Han et al., 2013). GO and Kyoto encyclopedia of genes and genomes (KEGG) functional enrichment was performed by the R package clusterProfiler . ...
Article
Full-text available
Reynoutria multiflora is a widely used medicinal plant in China. Its medicinal compounds are mainly stilbenes and anthraquinones which possess important pharmacological activities in anti‐aging, anti‐inflammatory and anti‐oxidation, but their biosynthetic pathways are still largely unresolved. Here, we reported a near‐complete genome assembly of R. multiflora consisting of 1.39 Gb with a contig N50 of 122.91 Mb and only one gap left. Genome evolution analysis revealed that two recent bursts of long terminal repeats (LTRs) contributed significantly to the increased genome size of R. multiflora , and numerous large chromosome rearrangements were observed between R. multiflora and Fagopyrum tataricum genomes. Comparative genomics analysis revealed that a recent whole‐genome duplication specific to Polygonaceae led to a significant expansion of gene families associated with disease tolerance and the biosynthesis of stilbenes and anthraquinones in R. multiflora . Combining transcriptomic and metabolomic analyses, we elucidated the molecular mechanisms underlying the dynamic changes in content of medicinal ingredients in R. multiflora roots across different growth years. Additionally, we identified several putative key genes responsible for anthraquinone and stilbene biosynthesis. We identified a stilbene synthase gene PM0G05131 highly expressed in roost, which may exhibit an important role in the accumulation of stilbenes in R. multiflora . These genomic data will expedite the discovery of anthraquinone and stilbenes biosynthesis pathways in medicinal plants.
Preprint
Full-text available
Dendroctonus frontalis , also known as southern pine beetle (SPB), represents the most damaging forest pest in the southeastern United States. Strategies to predict, monitor and suppress SPB outbreaks have had limited success. Genomic data are critical to inform on pest biology and to identify molecular targets to develop improved management approaches. Here, we produced a chromosome-level genome assembly of SPB using long-read sequencing data. Synteny analyses confirmed the conservation of the core coleopteran Stevens elements and validated the bona fide SPB X chromosome. Transcriptomic data were used to obtain 39,588 transcripts corresponding to 13,354 putative protein-coding loci. Comparative analyses of gene content across 14 beetle and 3 other insects revealed several losses of conserved genes in the Dendroctonus clade and gene gains in SPB and Dendroctonus that were enriched for loci encoding membrane proteins and extracellular matrix proteins. While lineage-specific gene losses contributed to the gene content reduction observed in Dendroctonus , we also showed that widespread misannotation of transposable elements represents a major cause of the apparent gene expansion in several non- Dendroctonus species. Our findings uncovered distinctive features of the SPB gene complement and disentangled the role of biological and annotation-related factors contributing to gene content variation across beetles.
Article
Full-text available
Blood orange (BO) is a rare red-fleshed sweet orange (SWO) with a high anthocyanin content and is associated with numerous health-related benefits. Here, we reported a high-quality chromosome-scale genome assembly for Neixiu (NX) BO, reaching 336.63 Mb in length with contig and scaffold N50 values of 30.6 Mb. Furthermore, 96% of the assembled sequences were successfully anchored to 9 pseudo-chromosomes. The genome assembly also revealed the presence of 37.87% transposon elements and 7.64% tandem repeats, and the annotation of 30,395 protein-coding genes. A high level of genome synteny was observed between BO and SWO, further supporting their genetic similarity. The speciation event that gave rise to the Citrus species predated the duplication event found within them. The genome-wide variation between NX and SWO was also compared. This first high-quality BO genome will serve as a fundamental basis for future studies on functional genomics and genome evolution.
Article
Parasitic plants have a heterotrophic lifestyle, in which they withdraw all or part of their nutrients from their host through the haustorium. Despite the release of many draft genomes of parasitic plants, the genome evolution related to the parasitism feature of facultative parasites remains largely unknown. In this study, we present a high‐quality chromosomal‐level genome assembly for the facultative parasite Pedicularis kansuensis (Orobanchaceae), which invades both legume and grass host species in degraded grasslands on the Qinghai‐Tibet Plateau. This species has the largest genome size compared with other parasitic species, and expansions of long terminal repeat retrotransposons accounting for 62.37% of the assembly greatly contributed to the genome size expansion of this species. A total of 42,782 genes were annotated, and the patterns of gene loss in P. kansuensis differed from other parasitic species. We also found many mobile mRNAs between P. kansuensis and one of its host species, but these mobile mRNAs could not compensate for the functional losses of missing genes in P. kansuensis. In addition, we identified nine horizontal gene transfer (HGT) events from rosids and monocots, as well as one single‐gene duplication event from HGT genes, which differ distinctly from those of other parasitic species. Furthermore, we found evidence for HGT through transferring genomic fragments from phylogenetically remote host species. Taken together, these findings provide genomic insights into the evolution of facultative parasites and broaden our understanding of the diversified genome evolution in parasitic plants and the molecular mechanisms of plant parasitism.
Article
Background: Many parasitic plants of the genera Striga and Cuscuta inflict huge agricultural damage worldwide. To form and maintain a connection with a host plant, parasitic plants deploy virulence factors (VFs) that interact with host biology. They possess a secretome that represents the complement of proteins secreted from cells and like other plant parasites such as fungi, bacteria or nematodes, some secreted proteins represent VFs crucial to successful host colonisation. Understanding the genome-wide complement of putative secreted proteins from parasitic plants, and their expression during host invasion, will advance understanding of virulence mechanisms used by parasitic plants to suppress/evade host immune responses and to establish and maintain a parasite-host interaction. Results: We conducted a comparative analysis of the secretomes of root (Striga spp.) and shoot (Cuscuta spp.) parasitic plants, to enable prediction of candidate VFs. Using orthogroup clustering and protein domain analyses we identified gene families/functional annotations common to both Striga and Cuscuta species that were not present in their closest non-parasitic relatives (e.g. strictosidine synthase like enzymes), or specific to either the Striga or Cuscuta secretomes. For example, Striga secretomes were strongly associated with ‘PAR1’ protein domains. These were rare in the Cuscuta secretomes but an abundance of ‘GMC oxidoreductase’ domains were found, that were not present in the Striga secretomes. We then conducted transcriptional profiling of genes encoding putatively secreted proteins for the most agriculturally damaging root parasitic weed of cereals, S. hermonthica. A significant portion of the Striga-specific secretome set was differentially expressed during parasitism, which we probed further to identify genes following a ‘wave-like’ expression pattern peaking in the early penetration stage of infection. We identified 39 genes encoding putative VFs with functions such as cell wall modification, immune suppression, protease, kinase, or peroxidase activities, that are excellent candidates for future functional studies. Conclusions: Our study represents a comprehensive secretome analysis among parasitic plants and revealed both similarities and differences in candidate VFs between Striga and Cuscuta species. This knowledge is crucial for the development of new management strategies and delaying the evolution of virulence in parasitic weeds.
Article
Background: Mimosa bimucronata originates from tropical America and exhibits distinctive leaf movement characterized by a relatively slow speed. Additionally, this species possesses the ability to fix nitrogen. Despite these intriguing traits, comprehensive studies have been hindered by the lack of genomic resources for M. bimucronata. Results: To unravel the intricacies of leaf movement and nitrogen fixation, we successfully assembled a high-quality, haplotype-resolved, reference genome at the chromosome level, spanning 648 Mb and anchored in 13 pseudochromosomes. A total of 32,146 protein-coding genes were annotated. In particular, haplotype A was annotated with 31,035 protein-coding genes, and haplotype B with 31,440 protein-coding genes. Structural variations (SVs) and allele specific expression (ASE) analyses uncovered the potential role of structural variants in leaf movement and nitrogen fixation in M. bimucronata. Two whole-genome duplication (WGD) events were detected, that occurred ~2.9 and ~73.5 million years ago. Transcriptome and co-expression network analyses revealed the involvement of aquaporins (AQPs) and Ca²⁺-related ion channel genes in leaf movement. Moreover, we also identified nodulation-related genes and analyzed the structure and evolution of the key gene NIN in the process of symbiotic nitrogen fixation (SNF). Conclusion: The detailed comparative genomic and transcriptomic analyses provided insights into the mechanisms governing leaf movement and nitrogen fixation in M. bimucronata. This research yielded genomic resources and provided an important reference for functional genomic studies of M. bimucronata and other legume species.
Article
Candida species are the most common cause of opportunistic fungal infection worldwide. Here we report the genome sequences of six Candida species and compare these and related pathogens and non-pathogens. There are significant expansions of cell wall, secreted and transporter gene families in pathogenic species, suggesting adaptations associated with virulence. Large genomic tracts are homozygous in three diploid species, possibly resulting from recent recombination events. Surprisingly, key components of the mating and meiosis pathways are missing from several species. These include major differences at the mating-type loci (MTL); Lodderomyces elongisporus lacks MTL, and components of the a1/a2 cell identity determinant were lost in other species, raising questions about how mating and cell types are controlled. Analysis of the CUG leucine-to-serine genetic-code change reveals that 99% of ancestral CUG codons were erased and new ones arose elsewhere. Lastly, we revise the Candida albicans gene catalogue, identifying many new genes.
Article
Over the last 20 years, comprehensive strategies for treating measurement error in complex models and accounting for the use of extra data to estimate measurement error parameters have emerged. Focusing on both established and novel approaches, Measurement Error: Models, Methods, and Applications provides an overview of the main techniques and illustrates their application in various models. It describes the impacts of measurement errors on naive analyses that ignore them and presents ways to correct for them across a variety of statistical models, from simple one-sample problems to regression models to more complex mixed and time series models. The book covers correction methods based on known measurement error parameters, replication, internal or external validation data, and, for some models, instrumental variables. It emphasizes the use of several relatively simple methods, moment corrections, regression calibration, simulation extrapolation (SIMEX), modified estimating equation methods, and likelihood techniques. The author uses SAS-IML and Stata to implement many of the techniques in the examples. Accessible to a broad audience, this book explains how to model measurement error, the effects of ignoring it, and how to correct for it. More applied than most books on measurement error, it describes basic models and methods, their uses in a range of application areas, and the associated terminology.
Article
The class Dothideomycetes is one of the largest groups of fungi with a high level of ecological diversity including many plant pathogens infecting a broad range of hosts. Here, we compare genome features of 18 members of this class, including 6 necrotrophs, 9 (hemi)biotrophs and 3 saprotrophs, to analyze genome structure, evolution, and the diverse strategies of pathogenesis. The Dothideomycetes most likely evolved from a common ancestor more than 280 million years ago. The 18 genome sequences differ dramatically in size due to variation in repetitive content, but show much less variation in number of (core) genes. Gene order appears to have been rearranged mostly within chromosomal boundaries by multiple inversions, in extant genomes frequently demarcated by adjacent simple repeats. Several Dothideomycetes contain one or more gene-poor, transposable element (TE)-rich putatively dispensable chromosomes of unknown function. The 18 Dothideomycetes offer an extensive catalogue of genes involved in cellulose degradation, proteolysis, secondary metabolism, and cysteine-rich small secreted proteins. Ancestors of the two major orders of plant pathogens in the Dothideomycetes, the Capnodiales and Pleosporales, may have had different modes of pathogenesis, with the former having fewer of these genes than the latter. Many of these genes are enriched in proximity to transposable elements, suggesting faster evolution because of the effects of repeat induced point (RIP) mutations. A syntenic block of genes, including oxidoreductases, is conserved in most Dothideomycetes and upregulated during infection in L. maculans, suggesting a possible function in response to oxidative stress.
Article
Genetic variation among individual humans occurs on many different scales, ranging from gross alterations in the human karyotype to single nucleotide changes. Here we explore variation on an intermediate scale—particularly insertions, deletions and inversions affecting from a few thousand to a few million base pairs. We employed a clone-based method to interrogate this intermediate structural variation in eight individuals of diverse geographic ancestry. Our analysis provides a comprehensive overview of the normal pattern of structural variation present in these genomes, refining the location of 1,695 structural variants. We find that 50% were seen in more than one individual and that nearly half lay outside regions of the genome previously described as structurally variant. We discover 525 new insertion sequences that are not present in the human reference genome and show that many of these are variable in copy number between individuals. Complete sequencing of 261 structural variants reveals considerable locus complexity and provides insights into the different mutational processes that have shaped the human genome. These data provide the first high-resolution sequence map of human structural variation—a standard for genotyping platforms and a prelude to future individual genome sequencing projects.
Article
Using next-generation sequencing technology alone, we have successfully generated and assembled a draft sequence of the giant panda genome. The assembled contigs (2.25 gigabases (Gb)) cover approximately 94% of the whole genome, and the remaining gaps (0.05 Gb) seem to contain carnivore-specific repeats and tandem repeats. Comparisons with the dog and human showed that the panda genome has a lower divergence rate. The assessment of panda genes potentially underlying some of its unique traits indicated that its bamboo diet might be more dependent on its gut microbiome than its own genetic composition. We also identified more than 2.7 million heterozygous single nucleotide polymorphisms in the diploid genome. Our data and analyses provide a foundation for promoting mammalian genetic research, and demonstrate the feasibility for using next-generation sequencing technologies for accurate, cost-effective and rapid de novo assembly of large eukaryotic genomes.
Article
Domestic yaks (Bos grunniens) provide meat and other necessities for Tibetans living at high altitude on the Qinghai-Tibetan Plateau and in adjacent regions. Comparison between yak and the closely related low-altitude cattle (Bos taurus) is informative in studying animal adaptation to high altitude. Here, we present the draft genome sequence of a female domestic yak generated using Illumina-based technology at 65-fold coverage. Genomic comparisons between yak and cattle identify an expansion in yak of gene families related to sensory perception and energy metabolism, as well as an enrichment of protein domains involved in sensing the extracellular environment and hypoxic stress. Positively selected and rapidly evolving genes in the yak lineage are also found to be significantly enriched in functional categories and pathways related to hypoxia and nutrition metabolism. These findings may have important implications for understanding adaptation to high altitude in other animal species and for hypoxia-related diseases in humans.
Article
Wood is a major pool of organic carbon that is highly resistant to decay, owing largely to the presence of lignin. The only organisms capable of substantial lignin decay are white rot fungi in the Agaricomycetes, which also contains non–lignin-degrading brown rot and ectomycorrhizal species. Comparative analyses of 31 fungal genomes (12 generated for this study) suggest that lignin-degrading peroxidases expanded in the lineage leading to the ancestor of the Agaricomycetes, which is reconstructed as a white rot species, and then contracted in parallel lineages leading to brown rot and mycorrhizal species. Molecular clock analyses suggest that the origin of lignin degradation might have coincided with the sharp decrease in the rate of organic carbon burial around the end of the Carboniferous period.
Article
The general maximum likelihood approach to the statistical estimation of phylogenies is outlined, for data in which there are a number of discrete states for each character. The details of the maximum likelihood method will depend on the details of the probabilistic model of evolution assumed. There are a very large number of possible models of evolution. For a few of the simpler models, the calculation of the likelihood of an evolutionary tree is outlined. For these models, the maximum likelihood tree will be the same as the “most parsimonious” (or minimum-steps) tree if the probability of change during the evolution of the group is assumed a priori to be very small. However, most sets of data require too many assumed state changes per character to be compatible with this assumption. Farris (1973) has argued that maximum likelihood and parsimony methods are identical under a much less restrictive set of assumptions. It is argued that the present methods are preferable to his, and a counterexample to his argument is presented. An algorithm which enables rapid calculation of the likelihood of a phylogeny is described.